openWakeWord is a Python library for wake word and wake phrase detection — the "hey Alexa" layer you need before your voice pipeline gets involved. Drop it in front of any audio stream and get a per-frame score between 0 and 1 for each configured keyword.
Why I starred it
The commercial options here amount to Picovoice Porcupine and not much else worth mentioning, and they come with data-collection requirements, licensing headaches, or both. openWakeWord sidesteps all of that with a single bet: use Google's pre-trained speech_embedding model as a frozen backbone, train tiny classification heads on top using 100% synthetic speech, and skip the manual recording sessions entirely.
That bet mostly pays off. The included "alexa" model outperforms Porcupine on the Dinner Party Corpus dataset (~5.5 hours of far-field speech), and a single Raspberry Pi 3 core can run 15–20 models simultaneously in real time. For an open-source project that requires zero labeled audio to train new models, those numbers are genuinely respectable.
How it works
The architecture is a three-stage pipeline, and the split is intentional.
Stage 1: A melspectrogram model (melspectrogram.onnx / .tflite) converts raw 16-bit 16kHz PCM audio into frequency features. It runs on fixed 1280-sample (80ms) frames.
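Those numbers pin down the framing arithmetic, which is worth internalizing before wiring up an audio source:

```python
SAMPLE_RATE = 16000      # Hz -- the only input rate the pipeline accepts
FRAME_SAMPLES = 1280     # samples per fixed processing frame
BYTES_PER_SAMPLE = 2     # 16-bit PCM

frame_ms = FRAME_SAMPLES * 1000 / SAMPLE_RATE      # duration of one frame
frame_bytes = FRAME_SAMPLES * BYTES_PER_SAMPLE     # raw size of one frame
print(frame_ms, frame_bytes)  # 80.0 2560
```

So a capture loop delivering 2560-byte chunks produces exactly one frame per read, at 12.5 frames per second.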
Stage 2: Google's speech_embedding model (embedding_model.onnx) converts those melspectrograms into 96-dimensional audio embeddings. This backbone is frozen and shared — every wake word model you load runs through the same embedding pass once per frame. Adding another wake word model costs almost nothing in compute.
Stage 3: Lightweight classification heads (fully-connected networks or 2-layer RNNs) run on top of those embeddings. These are the actual wake word models — small, fast, and trained only on synthetic speech.
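The whole pipeline can be sketched as three composed functions. Everything below is a stand-in — the function bodies and names are illustrative, not the library's API — but the shapes match the description above: an 80 ms frame in, a 96-dimensional embedding shared across heads, one score per head out.

```python
import numpy as np

EMBEDDING_DIM = 96  # dimensionality of the frozen backbone's output

def melspectrogram(frame: np.ndarray) -> np.ndarray:
    # Stand-in for melspectrogram.onnx: 1280 int16 samples -> frequency features.
    # The real model emits mel-filterbank features; an FFT magnitude slice
    # is used here purely for shape.
    return np.abs(np.fft.rfft(frame.astype(np.float32)))[:32][None, :]

def embed(mels: np.ndarray) -> np.ndarray:
    # Stand-in for the frozen speech_embedding backbone (fixed weights).
    rng = np.random.default_rng(0)
    W = rng.standard_normal((mels.shape[-1], EMBEDDING_DIM))
    return mels @ W  # one 96-dim embedding per frame

def head_score(embedding: np.ndarray) -> float:
    # Stand-in for one tiny classification head; emits a 0..1 score.
    return float(1 / (1 + np.exp(-embedding.mean())))

frame = np.zeros(1280, dtype=np.int16)           # one 80 ms frame of silence
shared = embed(melspectrogram(frame))            # computed once per frame...
scores = [head_score(shared) for _ in range(5)]  # ...reused by every head
```

The key point the sketch makes: `embed()` runs once per frame no matter how many heads are loaded, which is why stacking models is nearly free.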
Open openwakeword/model.py and look at the predict() method at line 232, which handles all three stages in sequence. The frame-accumulation logic is worth reading:
```python
if n_prepared_samples > 1280:
    group_predictions = []
    for i in np.arange(n_prepared_samples//1280-1, -1, -1):
        group_predictions.extend(
            self.model_prediction_function[mdl](
                self.preprocessor.get_features(
                    self.model_inputs[mdl],
                    start_ndx=-self.model_inputs[mdl] - i
                )
            )
        )
    prediction = np.array(group_predictions).max(axis=0)[None, ]
```
When you feed in more than 80ms at once it processes each sub-frame and returns the max — a deliberate tradeoff of latency for efficiency. Longer frames reduce CPU load; shorter frames reduce detection delay. The library handles both transparently.
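The effect of that loop can be reproduced in isolation: split the input into 1280-sample sub-frames, score each, and keep the max. A sketch with a toy scoring function, not the library's code:

```python
import numpy as np

FRAME = 1280  # samples per 80 ms sub-frame

def score_subframes(audio: np.ndarray, score_fn) -> float:
    """Score each 80 ms sub-frame and return the max, mirroring
    the group_predictions logic in predict()."""
    n_sub = len(audio) // FRAME
    scores = [score_fn(audio[i * FRAME:(i + 1) * FRAME]) for i in range(n_sub)]
    return max(scores)

# Toy scoring function: fraction of non-zero samples in the sub-frame
toy_score = lambda frame: float(np.count_nonzero(frame)) / FRAME

audio = np.zeros(4 * FRAME, dtype=np.int16)  # 320 ms of silence...
audio[2 * FRAME:3 * FRAME] = 1000            # ...with activity in one sub-frame

print(score_subframes(audio, toy_score))  # 1.0 -- the active sub-frame wins
```

Taking the max means a wake word landing anywhere inside a long chunk still registers, at the cost of only learning about it at the end of the chunk.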
The AudioFeatures class in openwakeword/utils.py wraps the melspectrogram and embedding models behind a unified interface, with the inference framework (tflite vs. onnx) abstracted away at construction time. The inference_framework flag in the Model initializer cascades down correctly if tflite isn't available, falling back to onnx automatically.
One useful detail: the prediction_buffer is a defaultdict of deque(maxlen=30), keeping the last 30 frame scores per model. The predict() method zeros out the first 5 frames during warmup — a small guard against the embedding model returning garbage before it settles. It's the kind of defensive code that only shows up after someone filed a bug.
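The buffer structure itself is plain Python: a defaultdict of fixed-length deques, so old scores fall off the front automatically as new frames arrive.

```python
from collections import defaultdict, deque

# Mirrors the prediction_buffer shape: last 30 per-frame scores per model
prediction_buffer = defaultdict(lambda: deque(maxlen=30))

# Simulate 100 frames of scores for one model (8 seconds of audio)
for i in range(100):
    prediction_buffer["hey_jarvis"].append(i / 100)

print(len(prediction_buffer["hey_jarvis"]))  # 30 -- older scores dropped
print(prediction_buffer["hey_jarvis"][0])    # 0.7 -- frame 70 is oldest kept
```

At 80 ms per frame, 30 entries is a 2.4-second rolling window — enough history to smooth over single-frame spikes without holding onto stale context.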
The vad_threshold parameter wires in Silero VAD as a gate: the wake word prediction only passes through if the VAD score is above the threshold for the same frame. Useful for cutting false-accepts in continuous noisy environments.
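The gating idea reduces to a one-liner. This is a sketch of the concept, not the code in predict() — the real implementation runs Silero VAD on the same frame and compares its output to vad_threshold:

```python
def gated_score(wakeword_score: float, vad_score: float,
                vad_threshold: float = 0.5) -> float:
    # Suppress the wake word score unless the VAD agrees
    # that speech is actually present in the frame.
    return wakeword_score if vad_score >= vad_threshold else 0.0

print(gated_score(0.92, vad_score=0.1))  # 0.0 -- noise, no speech detected
print(gated_score(0.92, vad_score=0.8))  # 0.92 -- speech present, score passes
```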
Using it
Basic streaming detection:
```python
import openwakeword
from openwakeword.model import Model
import pyaudio
import numpy as np

# One-time download of the pre-trained models
openwakeword.utils.download_models()

model = Model(wakeword_models=["hey_jarvis"], vad_threshold=0.5)

p = pyaudio.PyAudio()
stream = p.open(rate=16000, channels=1, format=pyaudio.paInt16,
                input=True, frames_per_buffer=1280)

while True:
    audio = np.frombuffer(stream.read(1280), dtype=np.int16)
    prediction = model.predict(audio)
    if prediction.get("hey_jarvis", 0) > 0.5:
        print("Detected!")
```
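One wrinkle the loop glosses over: predict() emits a score every 80 ms, so a single utterance stays above threshold for many consecutive frames and prints "Detected!" repeatedly. A minimal debounce sketch — the cooldown value is arbitrary, and none of this is library API:

```python
import time

COOLDOWN_S = 2.0  # ignore re-triggers for this long (hypothetical value)
last_fire = 0.0

def should_fire(score: float, threshold: float = 0.5) -> bool:
    """Fire at most once per cooldown window, so consecutive
    above-threshold frames trigger a single detection."""
    global last_fire
    now = time.monotonic()
    if score > threshold and now - last_fire > COOLDOWN_S:
        last_fire = now
        return True
    return False

# Three back-to-back frames: high, high, low
fired = [should_fire(s) for s in (0.9, 0.95, 0.2)]
print(fired)  # [True, False, False]
```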
For offline batch testing on WAV files:
```python
from openwakeword.utils import bulk_predict

results = bulk_predict(
    file_paths=["test1.wav", "test2.wav"],
    wakeword_models=["hey_jarvis"],
    ncpu=4
)
```
Training a custom model requires generating synthetic speech clips with the companion synthetic_speech_dataset_generation repo, then running the automated training notebook. The Google Colab notebook gets you to a working model in under an hour without touching a microphone.
Rough edges
English only. The README acknowledges it and it's a real ceiling — non-English wake words aren't supported and aren't on the roadmap in a concrete way.
The model license is more restrictive than the code license. Code is Apache 2.0, but the pre-trained models are CC BY-NC-SA 4.0 because of training data with unknown upstream licensing. If you're building a commercial product, you'll need to train your own models from scratch.
The tflite dependency situation changed with v0.6 — it now pulls ai-edge-litert instead of the old tflite-runtime, and the fallback logic in model.py handles the transition but not cleanly on all platforms. Windows still runs ONNX-only. The docs haven't fully caught up with the dependency changes.
No browser support. The FAQ mentions websocket streaming from a browser as a workaround, but there's no first-class JS path.
Bottom line
If you're building a local voice assistant, a Home Assistant integration, or anything that needs an always-on keyword trigger without sending audio to a cloud service, openWakeWord is the right choice. Skip it if you need sub-100ms latency on microcontrollers or non-English languages.
