livekit/agents: A Serious Framework for Realtime Voice AI

November 14, 2025

|repo-review

by Florian Narr

livekit/agents is a Python framework for building realtime voice agents — programs that listen to audio over WebRTC, process it through a configurable STT/LLM/TTS pipeline, and respond with audio in the same room. It sits on top of LiveKit's WebRTC infrastructure and handles everything from VAD to multi-agent handoffs.

Why I starred it

Most voice AI demos are held together with duct tape: sounddevice → Whisper → GPT → pyttsx3, all blocking in a thread. That works for demos. It breaks in production the moment a user interrupts mid-sentence or two people talk at once.

What caught my eye here is that LiveKit Agents was built from the start around the hard problems: interruption handling, end-of-turn detection, agent handoffs, and job scheduling across a fleet of workers. It treats voice as a realtime streaming problem, not a sequential request/response one.

How it works

The core abstraction is AgentSession in livekit-agents/livekit/agents/voice/agent_session.py. It's not just a container — it manages the full state machine: which agent is active, what turn-handling strategy is in use, and how incoming audio maps to LLM calls and outgoing speech.

Agent objects are composable. You define one with instructions and tools, pass it to a session, and swap to another via handoff (returning a new Agent instance from a tool function). The session tracks AgentActivity per active agent and reuses resources like the STT pipeline and realtime session across handoffs (_ReusableResources in agent_activity.py) — so switching agents mid-conversation doesn't introduce a cold-start delay.
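The resource-reuse idea is simple to sketch. Assuming an illustrative cache (class and method names here are hypothetical, not the framework's API), the session hands the same expensive objects to each successive agent:

```python
from typing import Callable

class ResourceCache:
    """Illustrative sketch of cross-handoff resource reuse; the real
    mechanism is _ReusableResources in agent_activity.py."""

    def __init__(self) -> None:
        self._cache: dict[str, object] = {}

    def get_or_create(self, key: str, factory: Callable[[], object]) -> object:
        # expensive resources (STT pipeline, realtime session) are built
        # once, then handed to every agent that becomes active afterwards
        if key not in self._cache:
            self._cache[key] = factory()
        return self._cache[key]
```

A handoff then amounts to swapping the active agent while `get_or_create` keeps returning the already-warm pipeline objects.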

The turn detection piece is where the engineering gets interesting. There are four modes, controlled by TurnDetectionMode in voice/turn.py:

  • "vad" — voice activity detection triggers end-of-turn
  • "stt" — STT finalization triggers it
  • "realtime_llm" — the model's server-side turn detection (OpenAI Realtime API style)
  • a _TurnDetector protocol — a local transformer model running on your server

That last mode ships as livekit-plugins-turn-detector. The plugin downloads a quantized ONNX model from Hugging Face, runs it via onnxruntime in a separate inference process, and feeds it a formatted chat context window (last 6 turns, 128 tokens max). It predicts the probability that the user has finished speaking — not just that they stopped making noise.

# from livekit-plugins/livekit-plugins-turn-detector/livekit/plugins/turn_detector/base.py
convo_text = self._tokenizer.apply_chat_template(
    new_chat_ctx, add_generation_prompt=False, add_special_tokens=False, tokenize=False
)
# remove the EOU token from current utterance
ix = convo_text.rfind("<|im_end|>")
text = convo_text[:ix]

It strips the EOU token before inference, then scores the probability of turn completion. The threshold is language-dependent (unlikely_threshold per language in the multilingual model). This matters because "I was thinking..." in English is almost never end-of-turn, but the same pause in another language might be.
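The per-language decision reduces to a probability comparison. A minimal sketch of that logic — the threshold values and function name here are illustrative, not the plugin's actual API:

```python
# illustrative per-language thresholds; the real values ship with the model
UNLIKELY_THRESHOLDS = {"en": 0.15, "ja": 0.10}
DEFAULT_THRESHOLD = 0.5

def is_end_of_turn(eou_probability: float, language: str) -> bool:
    """Treat the turn as finished only if the model's end-of-utterance
    probability clears the language-specific bar."""
    threshold = UNLIKELY_THRESHOLDS.get(language, DEFAULT_THRESHOLD)
    return eou_probability >= threshold
```

A language without a tuned threshold falls back to a conservative default, so the agent waits longer rather than cutting the speaker off.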

Endpointing — how long to wait after silence before committing — is handled by DynamicEndpointing in voice/endpointing.py. It maintains two exponential moving averages: one for intra-utterance pauses (short words, thinking) and one for turn-ending pauses. Rather than a fixed 500ms window, it adapts to the speaker's rhythm over the session.
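A rough sketch of that idea, with two EMAs tracking the speaker's pause lengths — all class names, smoothing factors, and initial values here are my assumptions, not the framework's:

```python
class _Ema:
    def __init__(self, alpha: float, initial: float) -> None:
        self.alpha = alpha
        self.value = initial

    def add(self, sample: float) -> None:
        self.value = self.alpha * sample + (1 - self.alpha) * self.value

class AdaptiveEndpointing:
    """Sketch: the commit threshold sits between the speaker's typical
    intra-utterance pause and their typical turn-ending pause."""

    def __init__(self) -> None:
        self.intra = _Ema(0.3, 0.25)  # pauses within an utterance (s)
        self.final = _Ema(0.3, 0.80)  # pauses that ended a turn (s)

    def observe_pause(self, duration: float, ended_turn: bool) -> None:
        (self.final if ended_turn else self.intra).add(duration)

    def wait_time(self) -> float:
        # commit once silence clearly exceeds the speaker's normal gaps
        return (self.intra.value + self.final.value) / 2
```

For a fast talker with short gaps, both averages shrink and the agent commits sooner; for a deliberate speaker, the window stretches automatically.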

The LLM pipeline itself (voice/generation.py) uses asyncio channels throughout. perform_llm_inference returns immediately with a task and two channels — one for text chunks, one for function calls — and downstream consumers (TTS, tool execution) read from those channels concurrently. This is why TTS can start speaking before the LLM has finished generating.
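The pattern is easy to demonstrate with a plain asyncio.Queue — a sketch of the channel idea, not the framework's actual channel type:

```python
import asyncio

async def llm_stream(text_ch: asyncio.Queue, chunks: list[str]) -> None:
    # producer: emit chunks as they "arrive" from the model
    for c in chunks:
        await text_ch.put(c)
        await asyncio.sleep(0.01)  # simulate token latency
    await text_ch.put(None)  # sentinel: stream finished

async def tts_consume(text_ch: asyncio.Queue, spoken: list[str]) -> None:
    # consumer: starts "synthesizing" as soon as the first chunk lands,
    # long before the producer has finished
    while (chunk := await text_ch.get()) is not None:
        spoken.append(chunk)

async def main() -> list[str]:
    ch: asyncio.Queue = asyncio.Queue()
    spoken: list[str] = []
    await asyncio.gather(
        llm_stream(ch, ["Hel", "lo ", "there."]),
        tts_consume(ch, spoken),
    )
    return spoken
```

Because producer and consumer share the event loop, first-token latency — not total generation time — is what the caller hears.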

Job scheduling uses a separate worker process model (worker.py). An AgentServer registers with LiveKit's server over a WebSocket, announces its current CPU load every 500ms via a moving average, and receives job assignments. Each job spins up in a subprocess by default (JobExecutorType.PROCESS). The load calculation is a singleton running in a daemon thread:

class _DefaultLoadCalc:
    def _calc_load(self) -> None:
        while True:
            cpu_p = self._cpu_monitor.cpu_percent(interval=0.5)
            with self._lock:
                self._m_avg.add_sample(cpu_p)

This means LiveKit's dispatch can route new calls away from a saturated worker without any manual configuration.
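The `_m_avg` above is just a windowed average over recent CPU samples; a self-contained sketch (the window size is my assumption, not the repo's value):

```python
from collections import deque

class MovingAverage:
    """Sketch of a fixed-window moving average like the worker's _m_avg."""

    def __init__(self, window: int) -> None:
        self._samples: deque[float] = deque(maxlen=window)

    def add_sample(self, v: float) -> None:
        # deque with maxlen drops the oldest sample automatically
        self._samples.append(v)

    def value(self) -> float:
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)
```

Smoothing over a window keeps a one-off CPU spike from making the worker look saturated to the dispatcher.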

Using it

pip install "livekit-agents[openai,silero,deepgram,cartesia,turn-detector]~=1.4"

A minimal voice agent:

from livekit.agents import Agent, AgentServer, AgentSession, JobContext, cli
from livekit.plugins import silero, deepgram, openai, cartesia

server = AgentServer()

@server.rtc_session()
async def entrypoint(ctx: JobContext):
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
    )
    agent = Agent(instructions="You are a helpful voice assistant.")
    await session.start(agent=agent, room=ctx.room)
    await session.generate_reply(instructions="greet the user")

if __name__ == "__main__":
    cli.run_app(server)

Multi-agent handoff is done by returning a new agent from a @function_tool:

@function_tool
async def transfer_to_billing(self, context: RunContext):
    """Transfer to the billing agent."""
    return BillingAgent(), "Transferring you now."

The test framework deserves mention. You can drive an agent session programmatically, feed it text or audio, and assert on the event sequence with an LLM judge:

result = await sess.run(user_input="I need help with my order.")
await (
    result.expect.next_event()
    .is_message(role="assistant")
    .judge(llm, intent="agent should ask for order number")
)

That's a meaningful testing primitive for non-deterministic systems.

Rough edges

The plugin surface is enormous — 60+ plugins in livekit-plugins/ — but quality is uneven. Some are clearly first-party and well-maintained (openai, deepgram, silero, cartesia). Others look like community contributions with minimal documentation and no tests. There's no signal in the repo about which plugins are considered stable vs experimental.

The framework requires a running LiveKit server. You can use LiveKit Cloud or self-host, but there's no local-only mode for development without a WebRTC room. If you just want to test your agent logic without spinning up WebRTC infrastructure, the AgentSession.run() testing API helps, but it's not immediately obvious from the docs that it exists.

Documentation is functional but incomplete in places. The RecordingOptions TypedDict in agent_session.py has better inline docs than anything in the official docs site.

Bottom line

If you're building a voice agent that needs to handle real conversations — interruptions, multiple speakers, agent handoffs, production-scale job routing — this framework has done the hard thinking. The dynamic endpointing and semantic turn detection alone make it worth examining, even if you end up only borrowing the ideas.

livekit/agents on GitHub