openWakeWord: Open-Source Wake Word Detection Built on Frozen Speech Embeddings

August 31, 2024

|repo-review

by Florian Narr

openWakeWord is a Python library for wake word and wake phrase detection — the "hey Alexa" layer you need before your voice pipeline gets involved. Drop it in front of any audio stream and get a per-frame score between 0 and 1 for each configured keyword.

Why I starred it

The commercial options here are Picovoice Porcupine and not much else worth mentioning, and the commercial route means data collection, licensing headaches, or both. openWakeWord sidesteps all of that with a single bet: use Google's pre-trained speech_embedding model as a frozen backbone, train tiny classification heads on top using 100% synthetic speech, and skip the manual recording sessions entirely.

That bet mostly pays off. The included "alexa" model outperforms Porcupine on the Dinner Party Corpus dataset (~5.5 hours of far-field speech), and a single Raspberry Pi 3 core can run 15–20 models simultaneously in real time. For an open-source project that requires zero labeled audio to train new models, those numbers are genuinely respectable.

How it works

The architecture is a three-stage pipeline, and the split is intentional.

Stage 1: A melspectrogram model (melspectrogram.onnx / .tflite) converts raw 16-bit 16kHz PCM audio into frequency features. It runs on fixed 1280-sample (80ms) frames.

Stage 2: Google's speech_embedding model (embedding_model.onnx) converts those melspectrograms into 96-dimensional audio embeddings. This backbone is frozen and shared — every wake word model you load runs through the same embedding pass once per frame. Adding a fifth model costs almost nothing in compute.

Stage 3: Lightweight classification heads (fully-connected networks or 2-layer RNNs) run on top of those embeddings. These are the actual wake word models — small, fast, and trained only on synthetic speech.
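The data flow through the three stages can be sketched with stand-in functions; this is an illustrative mock of the shapes involved (1280-sample frames, 96-dim embeddings), not the real ONNX models, and the function names are mine:

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_SAMPLES = 1280  # 1280 / 16000 = 0.08 s, i.e. one 80 ms frame

def melspectrogram(frame):
    # Stand-in for melspectrogram.onnx: frequency features per frame.
    return np.abs(np.fft.rfft(frame))[:32]  # fake 32-bin features

def speech_embedding(features):
    # Stand-in for the frozen Google backbone: a 96-dim embedding.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((features.shape[0], 96))
    return features @ W

def wake_word_head(embedding):
    # Stand-in classification head: a single logistic unit.
    rng = np.random.default_rng(1)
    w = rng.standard_normal(96)
    return 1.0 / (1.0 + np.exp(-embedding @ w / 96.0))

frame = np.zeros(FRAME_SAMPLES, dtype=np.int16)  # one silent 80 ms frame
emb = speech_embedding(melspectrogram(frame.astype(np.float32)))
score = wake_word_head(emb)
print(emb.shape, float(score))
```

The point of the split is visible here: the expensive `speech_embedding` pass runs once per frame, and any number of cheap heads can reuse `emb`.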

Opening openwakeword/model.py, the predict() method at line 232 handles all three stages in sequence. The frame accumulation logic is worth reading:

if n_prepared_samples > 1280:
    group_predictions = []
    # Walk the buffered 1280-sample sub-frames, oldest first
    for i in np.arange(n_prepared_samples//1280-1, -1, -1):
        group_predictions.extend(
            self.model_prediction_function[mdl](
                self.preprocessor.get_features(
                    self.model_inputs[mdl],
                    start_ndx=-self.model_inputs[mdl] - i
                )
            )
        )
    # Report the highest score seen across the sub-frames
    prediction = np.array(group_predictions).max(axis=0)[None, ]

When you feed in more than 80ms at once it processes each sub-frame and returns the max — a deliberate tradeoff of latency for efficiency. Longer frames reduce CPU load; shorter frames reduce detection delay. The library handles both transparently.
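The max-pooling step in isolation looks like this; a toy illustration with made-up scores, not the library's API:

```python
import numpy as np

# Four 80 ms sub-frames arrived in one call; each got its own score.
sub_frame_scores = np.array([[0.02], [0.11], [0.97], [0.35]])

# Same idiom as model.py: max across sub-frames, keep a leading axis.
prediction = sub_frame_scores.max(axis=0)[None, ]
print(prediction)  # [[0.97]]
```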

The AudioFeatures class in openwakeword/utils.py wraps the melspectrogram and embedding models behind a unified interface, with the inference framework (tflite vs. onnx) abstracted away at construction time. The inference_framework flag in the Model initializer cascades down correctly if tflite isn't available, falling back to onnx automatically.
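The fallback pattern amounts to a try-import at construction time; this is a sketch of the idea, not the library's exact code, and the function name is mine:

```python
def pick_framework(requested="tflite"):
    # Prefer tflite when asked for; silently fall back to onnx if the
    # runtime isn't installed (e.g. on Windows, which is ONNX-only).
    if requested == "tflite":
        try:
            import tflite_runtime  # noqa: F401
        except ImportError:
            return "onnx"
        return "tflite"
    return requested

print(pick_framework("onnx"))
print(pick_framework("tflite"))  # depends on what's installed
```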

One useful detail: the prediction_buffer is a defaultdict of deque(maxlen=30), keeping the last 30 frame scores per model. The predict() method zeros out the first 5 frames during warmup — a small guard against the embedding model returning garbage before it settles. It's the kind of defensive code that only shows up after someone filed a bug.
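The buffer structure described above is easy to picture as a sketch; the names here are illustrative, not the library's internals verbatim:

```python
from collections import defaultdict, deque

# Per-model ring buffer of the last 30 frame scores.
prediction_buffer = defaultdict(lambda: deque(maxlen=30))

def record_score(model_name, score, warmup_frames=5):
    buf = prediction_buffer[model_name]
    # Warmup guard: zero out early scores while the embedding settles.
    if len(buf) < warmup_frames:
        score = 0.0
    buf.append(score)

for s in [0.9, 0.8, 0.7, 0.1, 0.2, 0.95]:
    record_score("hey_jarvis", s)

print(list(prediction_buffer["hey_jarvis"]))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.95]
```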

The vad_threshold parameter wires in Silero VAD as a gate: the wake word prediction only passes through if the VAD score is above the threshold for the same frame. Useful for cutting false-accepts in continuous noisy environments.
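The gating logic reduces to a one-liner; a toy sketch of the behavior described above, not Silero's or openWakeWord's actual code:

```python
def gated_score(wakeword_score, vad_score, vad_threshold=0.5):
    # Only pass the wake word score through if the same frame
    # carried enough voice activity; otherwise suppress it.
    return wakeword_score if vad_score > vad_threshold else 0.0

print(gated_score(0.92, vad_score=0.8))  # 0.92: speech present
print(gated_score(0.92, vad_score=0.1))  # 0.0: suppressed by VAD
```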

Using it

Basic streaming detection:

import openwakeword
from openwakeword.model import Model
import pyaudio
import numpy as np

openwakeword.utils.download_models()
model = Model(wakeword_models=["hey_jarvis"], vad_threshold=0.5)

p = pyaudio.PyAudio()
stream = p.open(rate=16000, channels=1, format=pyaudio.paInt16,
                input=True, frames_per_buffer=1280)

while True:
    audio = np.frombuffer(stream.read(1280), dtype=np.int16)
    prediction = model.predict(audio)
    if prediction.get("hey_jarvis", 0) > 0.5:
        print("Detected!")

For offline batch testing on WAV files:

from openwakeword.utils import bulk_predict

results = bulk_predict(
    file_paths=["test1.wav", "test2.wav"],
    wakeword_models=["hey_jarvis"],
    ncpu=4
)
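Summarizing bulk results is straightforward if, as I'd expect from per-frame clip prediction, each file maps to a list of per-frame score dicts; the helper and the fake data below are illustrative, not part of the library:

```python
def max_scores(results, model_name):
    # Reduce per-frame scores to one peak score per file.
    return {
        path: max(frame.get(model_name, 0.0) for frame in frames)
        for path, frames in results.items()
    }

fake_results = {
    "test1.wav": [{"hey_jarvis": 0.1}, {"hey_jarvis": 0.93}],
    "test2.wav": [{"hey_jarvis": 0.05}],
}
print(max_scores(fake_results, "hey_jarvis"))
# {'test1.wav': 0.93, 'test2.wav': 0.05}
```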

Training a custom model requires generating synthetic speech clips with the companion synthetic_speech_dataset_generation repo, then running the automated training notebook. The Google Colab notebook gets you to a working model in under an hour without touching a microphone.

Rough edges

English only. The README acknowledges it and it's a real ceiling — non-English wake words aren't supported and aren't on the roadmap in a concrete way.

The model license is more restrictive than the code license. Code is Apache 2.0, but the pre-trained models are CC BY-NC-SA 4.0 because of training data with unknown upstream licensing. If you're building a commercial product, you'll need to train your own models from scratch.

The tflite dependency situation changed with v0.6 — it now pulls ai-edge-litert instead of the old tflite-runtime, and the fallback logic in model.py handles the transition but not cleanly on all platforms. Windows still runs ONNX-only. The docs haven't fully caught up with the dependency changes.

No browser support. The FAQ mentions websocket streaming from a browser as a workaround, but there's no first-class JS path.

Bottom line

If you're building a local voice assistant, a Home Assistant integration, or anything that needs an always-on keyword trigger without sending audio to a cloud service, openWakeWord is the right choice. Skip it if you need sub-100ms latency on microcontrollers or non-English languages.

dscripka/openWakeWord on GitHub