Keyword Spotting

Real-time speech command detection — ONNX model running in pure TypeScript, zero dependencies

Understanding the use case

Wake words and voice commands — Keyword spotting (KWS) is the task of detecting a small set of predefined words in a continuous audio stream. It powers "Hey Siri", "OK Google", and embedded voice controls. The key constraint: the model must run on-device with ultra-low latency and minimal power.

Why on-device? — Sending audio to a server adds 100-300ms of latency, requires an internet connection, and raises privacy concerns. A model running locally on a microcontroller responds in under 10ms, works offline, and never transmits audio data.

SpikyPanda approach — Train a small CNN model in PyTorch, export to ONNX, validate against onnxruntime with our 152-test conformance suite, then deploy via the SpikyPanda TypeScript runtime (browser) or CyanMycelium C++ runtime (MCU). Same model, same results, every platform.

Model size — The tiny model used here has 6,156 parameters and fits in 25 KB. For comparison, a typical voice assistant model is 50-500 MB. Our model is 2,000x smaller and runs on a $3 chip.

Understanding the pipeline

Audio capture — The Web Audio API captures microphone input at 16kHz mono. A ring buffer holds the last 1 second of audio (16,000 samples).
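The capture stage can be sketched as a fixed-size ring buffer. The class below is an illustrative implementation, not the actual SpikyPanda source; the Web Audio wiring (getUserMedia, AudioWorklet) that feeds it is omitted.

```typescript
// Ring buffer holding the most recent `size` samples (16,000 = 1 s @ 16 kHz).
// An AudioWorklet would call write() with each incoming block of samples;
// the inference loop calls latest() to read the current 1-second window.
class RingBuffer {
  private buf: Float32Array;
  private pos = 0;

  constructor(private size: number) {
    this.buf = new Float32Array(size);
  }

  write(samples: Float32Array): void {
    for (const s of samples) {
      this.buf[this.pos] = s;
      this.pos = (this.pos + 1) % this.size;
    }
  }

  // Returns the last `size` samples in chronological order.
  latest(): Float32Array {
    const out = new Float32Array(this.size);
    out.set(this.buf.subarray(this.pos));
    out.set(this.buf.subarray(0, this.pos), this.size - this.pos);
    return out;
  }
}
```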

MFCC features — Each 1-second window is transformed into a spectrogram-like representation: a 512-point FFT with 10 ms hops, passed through a 40-band Mel filterbank, log-compressed, then projected via DCT into 40 cepstral coefficients. The result is a [1, 40, 101] tensor — 40 cepstral coefficients across 101 time frames.
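The frame count follows from the hop size: at 16 kHz a 10 ms hop is 160 samples, and with center-padding a 16,000-sample window yields 1 + floor(16000/160) = 101 frames. A sketch of that arithmetic (the center-padding convention is an assumption; it matches the stated shape):

```typescript
// Number of STFT frames for a center-padded signal:
// frames = 1 + floor(numSamples / hopLength)
function numFrames(numSamples: number, hopLength: number): number {
  return 1 + Math.floor(numSamples / hopLength);
}

const sampleRate = 16_000;                      // 16 kHz mono
const hop = 0.010 * sampleRate;                 // 10 ms hop → 160 samples
const frames = numFrames(1 * sampleRate, hop);  // 101 time frames
// MFCC tensor shape: [1, 40, frames] → [1, 40, 101]
```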

ONNX inference — The MFCC tensor feeds into the SpikyPanda ONNX pipeline: OnnxParser.parse() deserializes the protobuf, OnnxGraphBuilder.build() constructs the compute DAG, and graph.run() executes all operators in topological order. Output is 12 logits, one per keyword class.
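Turning the 12 logits into a prediction is a plain argmax, with an optional softmax for a confidence score. A minimal decode sketch — the label list is illustrative, drawn from the Speech Commands task, and not necessarily this model's exact class order:

```typescript
// Hypothetical 12-class label order, for illustration only.
const LABELS = ["yes", "no", "up", "down", "left", "right",
                "on", "off", "stop", "go", "_silence_", "_unknown_"];

function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - m)); // subtract max for stability
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

function decode(logits: number[]): { label: string; score: number } {
  const probs = softmax(logits);
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return { label: LABELS[best], score: probs[best] };
}
```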

Operators used — Conv (1D convolution), Relu (activation), MaxPool (downsampling), GlobalAveragePool (spatial reduction), Gemm (classification head). All validated against onnxruntime to 1e-4 tolerance.
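The conformance check amounts to an element-wise comparison against a reference output. A sketch of such a tolerance check (the helper names are ours, not part of the test suite):

```typescript
// Max absolute difference between two flattened tensors.
function maxAbsDiff(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("shape mismatch");
  let max = 0;
  for (let i = 0; i < a.length; i++) {
    max = Math.max(max, Math.abs(a[i] - b[i]));
  }
  return max;
}

// A run passes if every element matches onnxruntime to within 1e-4.
const TOLERANCE = 1e-4;
const passes = (ours: Float32Array, ref: Float32Array) =>
  maxAbsDiff(ours, ref) <= TOLERANCE;
```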

Live Demo

Microphone → MFCC (40×101) → SpikyPanda ONNX → Keyword
25 KB model • 6,156 parameters • ~2ms inference

Model Details


Architecture

Conv1D(40→24, k=3) → BN → ReLU → Pool
Conv1D(24→24, k=3) → BN → ReLU → Pool
Conv1D(24→16, k=3) → BN → ReLU → GAP
Linear(16→12)
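The 6,156-parameter figure can be checked directly from the layer shapes above (assuming bias terms on each conv and the linear layer, and two learned parameters per channel for batch norm):

```typescript
// Per-layer parameter counts for the architecture above.
const conv1d = (inCh: number, outCh: number, k: number) => inCh * outCh * k + outCh; // weights + bias
const bn = (ch: number) => 2 * ch;                         // gamma + beta
const linear = (inF: number, outF: number) => inF * outF + outF;

const total =
  conv1d(40, 24, 3) + bn(24) +   // 2904 + 48
  conv1d(24, 24, 3) + bn(24) +   // 1752 + 48
  conv1d(24, 16, 3) + bn(16) +   // 1168 + 32
  linear(16, 12);                // 204
// total === 6156, matching the stated model size
```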

Training

Dataset: Google Speech Commands v2
30 epochs, Adam, cosine LR
12 classes, 1s audio windows

Deployment

Browser: SpikyPanda TS runtime
MCU: CyanMycelium C++ runtime
25 KB flash, <4 KB RAM

Inference

MFCC: ~5ms, Model: ~2ms
