Call Center AI: A Real Phone Bot You Can Deploy in an Afternoon

November 11, 2025

|repo-review

by Florian Narr

A Python service that answers and places phone calls using an AI agent. You POST a JSON payload with a phone number, a task description, and a claim schema — the bot calls the number, conducts the conversation, and stores the structured result.

Why I starred it

Most "AI phone agent" demos stop at a WebRTC widget in a browser. This one goes further: real PSTN phone numbers via Azure Communication Services, bidirectional audio streaming over WebSocket, acoustic echo cancellation implemented in the app layer, live transcription, and TTS — all wired together in one deployable service. It's a reference implementation of what the full stack actually looks like, not just the GPT part.

The claim extraction model is also genuinely useful. You define a typed schema (fields can be text, datetime, email, or phone_number) per call in the API request. The bot fills those fields through natural conversation, validates the values, and writes them to Cosmos DB. That's the core loop: call → converse → extract → store.
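That loop is easy to sketch. Below is a hypothetical, simplified model of schema-driven validation, written for illustration only (the validators, class names, and helper are mine, not the repo's code):

```python
import re
from dataclasses import dataclass
from datetime import datetime

def _is_iso_datetime(value: str) -> bool:
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False

# Hypothetical validators, one per claim field type listed in the API schema.
VALIDATORS = {
    "text": lambda v: bool(v.strip()),
    "datetime": _is_iso_datetime,
    "email": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v)),
    "phone_number": lambda v: bool(re.fullmatch(r"\+\d{7,15}", v)),
}

@dataclass
class ClaimField:
    name: str
    type: str  # one of the VALIDATORS keys

def fill_claim(schema: list[ClaimField], answers: dict[str, str]) -> dict[str, str]:
    """Keep only the answers that validate against their declared type."""
    return {
        f.name: answers[f.name]
        for f in schema
        if f.name in answers and VALIDATORS[f.type](answers[f.name])
    }
```

Invalid values are simply dropped, which is roughly what lets the bot keep asking until every field validates.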

How it works

The entry point is app/main.py, a FastAPI app. Incoming calls from Azure Communication Services arrive as Event Grid events, routed to queue workers. The interesting path starts at on_new_call in app/helpers/call_events.py:

streaming_options = MediaStreamingOptions(
    audio_channel_type=MediaStreamingAudioChannelType.UNMIXED,
    content_type=MediaStreamingContentType.AUDIO,
    enable_bidirectional=True,
    start_media_streaming=False,
    transport_type=MediaStreamingTransportType.WEBSOCKET,
    transport_url=wss_url,
)
answer_call_result = await client.answer_call(
    callback_url=callback_url,
    cognitive_services_endpoint=CONFIG.cognitive_service.endpoint,
    incoming_call_context=incoming_context,
    media_streaming=streaming_options,
)

Once the call is connected, audio flows as raw PCM 16-bit 16 kHz over WebSocket. The app runs its own acoustic echo cancellation layer — AECStream in app/helpers/call_utils.py — which sits between the inbound microphone queue and the output queue. Without this, the bot would hear its own TTS responses back through the mic and confuse the speech recognizer.
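To see where that layer sits, here is a structural sketch, deliberately not real DSP: a gate task between the mic queue and the recognizer queue that suppresses frames while the bot is speaking. The actual AECStream does proper echo cancellation rather than gating; every name below is illustrative.

```python
import asyncio

# Structural sketch only: suppress inbound mic frames while the bot's TTS is
# playing, so the recognizer never hears the bot's own voice. The repo's
# AECStream cancels the echo instead of dropping frames; this just shows
# where such a component sits in the pipeline.
async def echo_gate(
    mic_in: asyncio.Queue,
    to_recognizer: asyncio.Queue,
    bot_speaking: asyncio.Event,
) -> None:
    while True:
        frame = await mic_in.get()
        if frame is None:  # end-of-stream sentinel
            await to_recognizer.put(None)
            return
        if not bot_speaking.is_set():
            await to_recognizer.put(frame)
```

The point is the topology: the gate owns the path between the WebSocket's inbound audio and the speech recognizer, which is exactly where the real echo canceller lives.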

Speech-to-text runs through Azure Cognitive Services via SttClient, which wraps PushAudioInputStream and buffers partial transcripts until a silence gate closes. Text-to-speech is handled through SpeechSynthesizer and fed back through the same bidirectional stream.

The LLM layer in app/helpers/llm_worker.py is where the fallback logic lives. Rather than using a framework, they wrote it directly with tenacity:

async def completion_stream(
    max_tokens: int,
    messages: list[MessageModel],
    system: list[SystemMessage],
    tools: list[ChatCompletionsToolDefinition] = [],
) -> AsyncGenerator[StreamingChatResponseMessageUpdate]:
    # Try first with primary LLM (fast by default)
    try:
        async for attempt in retryed:
            with attempt:
                async for chunck in _completion_stream_worker(is_fast=True, ...):
                    yield chunck
                return
    except Exception as e:
        ...
    # Then try more times with backup LLM
    async for attempt in retryed:
        with attempt:
            async for chunck in _completion_stream_worker(is_fast=False, ...):
                yield chunck

Primary LLM is gpt-4.1-nano (fast, cheap), fallback is gpt-4.1 (slow, accurate). Context window is capped at 20 messages to control latency and hallucination risk.
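Capping the window amounts to keeping only the tail of the conversation; the system messages travel separately, as the signature above shows. A minimal sketch, with a helper name of my own choosing:

```python
# Minimal sketch: keep only the most recent messages before each LLM call.
# The 20-message cap matches the figure quoted above; the helper is
# hypothetical, not the repo's.
def trim_context(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Return the most recent max_messages messages, oldest first."""
    return messages[-max_messages:]
```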

The tool/plugin system in app/helpers/llm_utils.py credits Microsoft's AutoGen project for the introspection approach. It inspects the DefaultPlugin class using Python's inspect module, extracts type annotations and docstrings, and auto-generates OpenAI tool definitions from them. Tools include end_call, new_claim, update_claim, transfer_to_human, send_sms, and RAG search. Each tool's docstring becomes the tool description the LLM sees — which is a reasonable way to keep tool descriptions close to the implementation.
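The introspection pattern is simple to reproduce. Here is a hedged sketch of the idea rather than the repo's implementation: derive an OpenAI-style tool definition from a plain function's signature and docstring. The send_sms stub and the type mapping are mine; the real code inspects DefaultPlugin and handles richer annotations.

```python
import inspect

# Map Python annotations to JSON Schema types (illustrative subset).
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def to_tool_definition(func) -> dict:
    """Build an OpenAI-style tool definition from a function's introspected
    signature and docstring, so the description lives next to the code."""
    sig = inspect.signature(func)
    properties = {
        name: {"type": PY_TO_JSON.get(param.annotation, "string")}
        for name, param in sig.parameters.items()
    }
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": [
                    n for n, p in sig.parameters.items()
                    if p.default is inspect.Parameter.empty
                ],
            },
        },
    }

def send_sms(phone_number: str, message: str) -> None:
    """Send an SMS to the caller with a summary of the conversation."""

tool = to_tool_definition(send_sms)
```

One consequence of this design is that a stale docstring silently becomes a stale tool description, so the docstrings effectively become part of the prompt.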

Feature flags (silence timeouts, VAD thresholds, recording toggle) are read from Azure App Configuration and refreshed every 60 seconds without a restart.
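The pattern behind that is a cached snapshot plus a background refresh task. A sketch under stated assumptions: the fetcher callable stands in for the Azure App Configuration client, and the class and names are illustrative.

```python
import asyncio

# Illustrative sketch: handlers read flags from an in-memory snapshot that a
# background task replaces every `ttl` seconds, so flag changes take effect
# without a restart. The fetcher is a stand-in for the real config client.
class FlagCache:
    def __init__(self, fetcher, ttl: float = 60.0):
        self._fetcher = fetcher
        self._ttl = ttl
        self._flags: dict = fetcher()  # initial snapshot

    def get(self, name: str, default=None):
        return self._flags.get(name, default)

    async def refresh_forever(self) -> None:
        while True:
            await asyncio.sleep(self._ttl)
            self._flags = self._fetcher()
```

Swapping the whole dict at once keeps reads cheap and avoids handlers ever seeing a half-updated set of flags.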

Using it

Outbound call via the API:

curl --request POST \
  --url https://your-deployment/call \
  --header 'Content-Type: application/json' \
  --data '{
    "bot_company": "Acme IT",
    "bot_name": "Alex",
    "phone_number": "+11234567890",
    "task": "Gather hardware info from the employee about their issue.",
    "claim": [
      {"name": "hardware_info", "type": "text"},
      {"name": "incident_datetime", "type": "datetime"}
    ]
  }'

Local development skips the phone system:

python3 -m tests.local

Deployment on Azure:

az login
make deploy name=my-resource-group
make logs name=my-resource-group

The make deploy command runs Bicep templates that provision Container Apps, Cosmos DB, Event Grid, AI Search, Redis, and Azure Communication Services. The cost breakdown in the README is useful: ~$720/month for 1,000 ten-minute calls, dominated by Speech Services ($152) and Cosmos DB ($234).

Rough edges

The README explicitly flags this as a proof of concept, not production-ready. The checklist is honest: no multi-region deployment, no private networking, incomplete integration test coverage, no runbooks. The load_llm_chat function in call_llm.py and _completion_stream_worker both carry # TODO: Refacto, too long comments — they're doing too much in one place.

The Azure dependency is total. Every meaningful component (phone numbers, speech, LLM, storage, caching, search) routes through Azure services. If you want to use Twilio for calls or a non-Azure LLM endpoint, some of that is configurable, but the infrastructure assumptions are firmly Azure-shaped.

Test coverage is thin outside the persistence layer. The tests/ directory has conversation fixture files and some LLM unit tests, but nothing covering the audio pipeline or the call event handlers.

The app server recently switched from uvicorn to Granian (a Rust-based ASGI server), a deliberate performance choice: the commit message says "much more performant and secure". It also means one more non-standard dependency to reason about.

Bottom line

If you're building a voice AI agent on Azure and want a working reference that goes beyond "hello world", this is the clearest complete implementation available. It won't run on a Tuesday afternoon without an Azure subscription and some configuration time, but the architecture is legible and the code shows exactly how to wire audio streaming, STT/TTS, LLM tool calls, and structured data extraction into a single coherent service.

microsoft/call-center-ai on GitHub