Bagofwords: Agentic Analytics That Learns Your Data

January 5, 2026

repo-review

by Florian Narr

What it does

Bagofwords is a self-hosted analytics platform that sits between your data warehouse and an LLM. You ask questions in plain English; it writes SQL, runs it, and returns charts or tables. The twist: it has a memory layer, an instruction registry, and an LLM judge that scores every response — so it actually gets better at answering your specific questions over time.

Why I starred it

Most text-to-SQL tools treat every query as a fresh start. Bagofwords doesn't. It maintains context about your schema, your terminology, and your past corrections. The instruction registry caught my eye — it's a versioned, review-gated system where you define rules like "revenue always means net revenue excluding refunds" and the agent picks them up automatically. That's the kind of feature that makes the difference between a demo and something you'd trust in production.

How it works

The architecture is a multi-agent loop orchestrated by AgentV2 in backend/app/ai/agent_v2.py. When a question comes in, here's the chain:

  1. ContextHub (backend/app/ai/context/context_hub.py) assembles the prompt. It pulls from eight different builders — schema, messages, widgets, instructions, code, resources, observations, and mentions — each contributing a slice of context. There's a hard token budget of 200k with an 8k output reserve, and it trims sections from the tail when the budget overflows.

  2. PlannerV2 (backend/app/ai/agents/planner/planner_v2.py) streams a single-action decision. It doesn't call tools directly — it only decides what to do next. The planner uses partialjson to parse incomplete JSON from the LLM stream, emitting partial decision snapshots as tokens arrive. This means the UI can show reasoning in real time before the action is finalized (see the first sketch after this list).

  3. ToolRunner (backend/app/ai/runner/tool_runner.py) executes the chosen tool with retry and timeout policies. It validates input against Pydantic schemas and tracks validation failures — after two consecutive failures on the same tool, it gives up and returns a readable error instead of looping forever (see the second sketch after this list).

  4. Judge (backend/app/ai/agents/judge/judge.py) scores the response on two axes: instruction effectiveness (1-5) and context effectiveness (1-5). This runs as a separate LLM call after the main response, feeding scores back into the observability layer.
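
Two of those steps are worth sketching. For step 2, here is a minimal stand-in for the partial-JSON trick (my own best-effort closer, written for illustration; not partialjson's actual API) that turns an incomplete stream buffer into a usable snapshot:

import json

def parse_partial(buf: str):
    # Best-effort stand-in for partialjson: close any open string and any
    # open brackets, then attempt a normal parse. Returns None until enough
    # of the object has streamed in to be readable.
    stack, in_string, escaped = [], False, False
    for ch in buf:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    repaired = buf + ('"' if in_string else "") + "".join(reversed(stack))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None

buf = ""
for token in ('{"action": "run', '_sql", "reasoning": "check the sche', 'ma first"}'):
    buf += token
    print(parse_partial(buf))  # each snapshot is already a usable partial decision

For step 3, the two-strike rule is a small loop. Everything named below (SqlArgs, get_args, execute) is hypothetical; the real retry and timeout policies live in ToolRunner:

import asyncio
from pydantic import BaseModel, ValidationError

class SqlArgs(BaseModel):  # hypothetical input schema for a SQL tool
    query: str

async def run_validated(get_args, execute, timeout_s: float = 30.0):
    # Two-strike rule: tolerate one bad payload, re-ask the model for
    # arguments, and bail out with a readable error on the second
    # consecutive validation failure.
    failures = 0
    while True:
        try:
            args = SqlArgs.model_validate(await get_args())
        except ValidationError as exc:
            failures += 1
            if failures >= 2:
                return f"tool input failed validation twice: {exc.errors()[0]['msg']}"
            continue
        return await asyncio.wait_for(execute(args), timeout_s)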

The tool system is clean. Every tool extends Tool from backend/app/ai/tools/base.py, declares a ToolMetadata with category, tags, and version, and implements run_stream as an async iterator of typed events (ToolStartEvent, ToolProgressEvent, ToolEndEvent). The ToolRegistry in backend/app/ai/registry.py auto-discovers tools by walking app.ai.tools.implementations with pkgutil.iter_modules — no manual registration needed.

import importlib
import inspect
import pkgutil
import sys

from app.ai.tools.base import Tool

class ToolRegistry:
    def _auto_register_all(self) -> None:
        # Import every module in the implementations package so its Tool
        # subclasses get defined and land in sys.modules.
        import app.ai.tools.implementations as impl_pkg
        for module_info in pkgutil.iter_modules(impl_pkg.__path__, impl_pkg.__name__ + "."):
            importlib.import_module(module_info.name)
        # Then scan those modules and register every concrete Tool subclass.
        for module_name, module in list(sys.modules.items()):
            if not module_name.startswith("app.ai.tools.implementations"):
                continue
            for _, obj in inspect.getmembers(module, inspect.isclass):
                if issubclass(obj, Tool) and obj is not Tool:
                    self.register(obj)
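
For a feel of that contract, here is a skeletal tool. This is a sketch only: the event module path and every constructor field below are my assumptions, since the real signatures aren't shown here.

from typing import AsyncIterator

from app.ai.tools.base import Tool, ToolMetadata
# Assumed location and constructors for the event types; the real ones may differ.
from app.ai.tools.events import ToolStartEvent, ToolProgressEvent, ToolEndEvent

class RowCountTool(Tool):
    # ToolMetadata field names are guesses based on the description above.
    metadata = ToolMetadata(category="sql", tags=["introspection"], version="1.0.0")

    async def run_stream(self, table: str) -> AsyncIterator[object]:
        yield ToolStartEvent(tool="row_count")
        rows = 42  # placeholder for the actual query execution
        yield ToolProgressEvent(tool="row_count", message=f"counted {table}")
        yield ToolEndEvent(tool="row_count", result={"rows": rows})

Because discovery walks the implementations package, dropping a file like this into app/ai/tools/implementations/ registers it with no further wiring.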

The InstructionTriggerEvaluator in backend/app/ai/agents/suggest_instructions/trigger.py is worth a look. It watches for correction patterns — keywords like "wrong", "actually", "should be", "exclude" — and code patterns via regex (SQL fragments, Python imports, pandas calls). When it detects a user is correcting the AI, it triggers instruction suggestions that get queued for review. The agent literally learns from being told it's wrong.
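
The detection itself is easy to approximate. This is my own illustrative reconstruction of the idea, not the project's code:

import re

CORRECTION_KEYWORDS = ("wrong", "actually", "should be", "exclude")
CODE_PATTERNS = [
    re.compile(r"\bselect\b.+\bfrom\b", re.IGNORECASE | re.DOTALL),  # SQL fragment
    re.compile(r"^\s*(import|from)\s+\w+", re.MULTILINE),            # Python import
    re.compile(r"\bpd\.\w+\(|\.groupby\("),                          # pandas call
]

def looks_like_correction(message: str) -> bool:
    # Fire when the user pushes back with correction language or pastes
    # code at the agent; either one is a cue to suggest an instruction.
    lowered = message.lower()
    if any(kw in lowered for kw in CORRECTION_KEYWORDS):
        return True
    return any(p.search(message) for p in CODE_PATTERNS)

print(looks_like_correction("Actually, revenue should exclude refunds"))  # True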

Using it

Getting started is a single Docker command:

docker run -p 3000:3000 bagofwords/bagofwords

That boots with SQLite by default. Point it at Postgres for anything real:

docker run -p 3000:3000 \
  -e BOW_DATABASE_URL=postgresql://user:pass@host:5432/db \
  bagofwords/bagofwords

Connect a data source (Snowflake, BigQuery, Postgres, and 20+ others), configure an LLM provider (OpenAI, Anthropic, Gemini, or any OpenAI-compatible endpoint), and start asking questions. It also exposes an MCP server, so you can query through Cursor or Claude Desktop while it tracks requests through the same observability pipeline.

Rough edges

The codebase is big — 1,098 files — and the backend is a monolith. The agent_v2.py constructor alone takes 15 parameters. There's no dependency injection; everything is wired through constructor arguments and global imports.

The judge system scores responses but doesn't do much with low scores yet. The score_instructions_and_context method defaults to 3/5 on any parse failure, which means bad scores get silently swallowed. The self-learning loop is more of a foundation than a complete feedback cycle.
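
In pseudocode terms, the swallowing looks roughly like this (my paraphrase of the pattern, with an assumed key name, not the actual method body):

import json

def score_from_judge_output(raw: str) -> int:
    try:
        # "instruction_effectiveness" is an assumed key, per the 1-5 axis above
        return int(json.loads(raw)["instruction_effectiveness"])
    except Exception:
        return 3  # a parse failure and a genuinely mediocre response look identical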

Documentation is better than most open-source analytics tools — there's a proper docs site — but the internal code has minimal docstrings. The ContextHub has 100+ lines of utility functions before you reach the class definition, and tracing the full agent loop requires jumping through about eight files.

Alembic migrations number over 100, which suggests rapid iteration but also a schema that's been heavily patched. Something to watch if you're planning to self-host and upgrade frequently.

Bottom line

If you're running a data team that needs a self-hosted, LLM-powered analytics layer with actual governance (instructions, RBAC, audit logs), Bagofwords is one of the more complete open-source options. The agentic architecture is solid — the planner/tool/judge separation is well thought out, even if the codebase could use some trimming.

bagofwords1/bagofwords on GitHub