What it does
Auto-Analyst is an open-source platform where you upload a CSV, ask a question in plain English, and get back Python code, statistical analysis, ML models, or Plotly charts. The backend is a multi-agent system built on DSPy — each agent is a declarative signature that takes a goal and a dataset description and produces code plus a summary.
Why I starred it
The DSPy angle is what caught me. Most AI data tools use raw prompt chains or function-calling. Auto-Analyst defines each agent as a dspy.Signature — a typed input/output contract — and runs them through DSPy's compilation and prediction pipeline. That means the agent behavior is driven by the signature docstring (which serves as the prompt), and the framework handles the LLM call, parsing, and retry logic. It's a real-world example of DSPy used beyond research notebooks.
The planner is the other draw. You can target a specific agent with @preprocessing_agent or just ask a question and let the planner decompose it into a multi-agent pipeline: preprocessing, then regression, then visualization — each passing variables to the next.
How it works
The entry point is auto-analyst-backend/app.py — a FastAPI app that wires up session management, agent routing, and a streaming response system. The core pattern: every chat request hits /chat/{agent_name}, which resolves the agent, injects the session's dataset context, and runs a DSPy prediction.
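The routing step can be sketched in isolation. This is a hypothetical, simplified stand-in for what happens behind /chat/{agent_name} (the registry contents, function name, and session shape are assumptions for illustration, not the repo's actual code): resolve the agent by name, then bundle the session's dataset description into the inputs for a DSPy prediction.

```python
# Hypothetical sketch of agent resolution plus dataset-context injection.
AGENT_REGISTRY = {
    "preprocessing_agent": "cleans and reshapes the dataframe",
    "data_viz_agent": "produces Plotly charts",
}

def resolve_chat_request(agent_name, session, question):
    """Resolve the agent named in the URL and attach session context."""
    if agent_name not in AGENT_REGISTRY:
        raise KeyError(f"unknown agent: {agent_name}")
    # Inject the session's dataset description so the agent sees the schema.
    return {
        "goal": question,
        "dataset": session["dataset_description"],
        "agent": agent_name,
    }

session = {"dataset_description": "housing.csv: price (float), area (float)"}
inputs = resolve_chat_request("data_viz_agent", session, "plot price vs. area")
```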
Agents live in auto-analyst-backend/src/agents/agents.py. The four default agents — preprocessing_agent, statistical_analytics_agent, sk_learn_agent, and data_viz_agent — are each a dspy.Signature class. Here's the shape:
```python
class preprocessing_agent(dspy.Signature):
    """You are a data preprocessing agent specializing in..."""
    goal = dspy.InputField(desc="User-defined goal...")
    dataset = dspy.InputField(desc="Information about the dataframe...")
    plan_instructions = dspy.InputField(desc="Agent-level instructions...")
    code = dspy.OutputField(desc="Generated Python code")
    summary = dspy.OutputField(desc="Concise bullet-point summary")
```
The prompt is the class docstring. The input/output fields are typed declarations. DSPy takes this signature, constructs the LLM call, and parses the response back into code and summary fields. No manual prompt templating, no output parsing boilerplate.
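To make that concrete, here is a simplified illustration of the two halves DSPy handles for you: rendering the docstring and input fields into a prompt, and parsing the model's reply back into the declared output fields. This is not DSPy's real internals, just the shape of the work the framework absorbs.

```python
def render_prompt(docstring, inputs):
    """Turn a signature docstring plus input fields into a flat prompt."""
    lines = [docstring.strip(), ""]
    for name, value in inputs.items():
        lines.append(f"{name}: {value}")
    lines.append("Respond with 'code:' and 'summary:' sections.")
    return "\n".join(lines)

def parse_outputs(reply, output_fields=("code", "summary")):
    """Split a labeled reply back into the declared output fields."""
    parsed, current = {}, None
    for line in reply.splitlines():
        head = line.split(":", 1)[0].strip().lower()
        if head in output_fields:
            current = head
            parsed[current] = line.split(":", 1)[1].strip()
        elif current is not None:
            parsed[current] += "\n" + line
    return parsed

prompt = render_prompt("You are a preprocessing agent.", {"goal": "clean", "dataset": "df"})
reply = "code: df = df.dropna()\nsummary: Dropped rows with missing values."
out = parse_outputs(reply)
```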
What makes it extensible: create_custom_agent_signature() in agents.py dynamically generates new dspy.Signature subclasses at runtime using Python's type(). You provide a name, description, and prompt template, and it builds a class with the right fields. Visualization agents get an extra styling_index input automatically based on name or category detection:
```python
CustomAgentSignature = type(agent_name, (dspy.Signature,), class_attributes)
```
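The dynamic-class pattern is worth seeing end to end. Below is a sketch with plain strings standing in for dspy.InputField/OutputField and object standing in for dspy.Signature, so it runs without the library; the function name follows the text above, but the field set and the is_viz flag are illustrative assumptions.

```python
def create_custom_agent_signature(agent_name, prompt, is_viz=False):
    """Build a signature-like class at runtime via type()."""
    class_attributes = {
        "__doc__": prompt,  # the docstring becomes the prompt
        "goal": "InputField: user-defined goal",
        "dataset": "InputField: dataframe description",
        "code": "OutputField: generated Python code",
        "summary": "OutputField: bullet-point summary",
    }
    if is_viz:
        # Visualization agents get an extra styling input.
        class_attributes["styling_index"] = "InputField: chart styling rules"
    return type(agent_name, (object,), class_attributes)

VizAgent = create_custom_agent_signature(
    "custom_viz_agent", "You are a charting agent...", is_viz=True
)
```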
The planner is advanced_query_planner — another DSPy signature with a massive docstring that acts as a few-shot prompt. It receives the dataset context and the available agent descriptions, then outputs a plan (a chain like preprocessing_agent -> statistical_analytics_agent -> data_viz_agent) and plan_instructions — a JSON object specifying what each agent should create, use, and do. The planner considers at most 10 agents, sorted by how often the user has invoked each.
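A hypothetical planner output might look like the following. The create/use/instruction keys follow the description above, but the exact JSON shape and variable names are invented for illustration, not copied from the repo.

```python
import json

plan = "preprocessing_agent -> statistical_analytics_agent -> data_viz_agent"
plan_instructions = json.loads("""
{
  "preprocessing_agent":        {"create": ["df_clean"], "use": ["df"],                "instruction": "drop nulls, cast dtypes"},
  "statistical_analytics_agent": {"create": ["model"],    "use": ["df_clean"],          "instruction": "OLS of price on area"},
  "data_viz_agent":              {"create": ["fig"],      "use": ["df_clean", "model"], "instruction": "scatter with fitted line"}
}
""")

# Each step in the chain consumes variables created by the previous one.
agents = [step.strip() for step in plan.split("->")]
```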
The model registry in src/utils/model_registry.py is thorough. It supports OpenAI, Anthropic, Groq, and Gemini models, all wrapped as dspy.LM instances. A five-tier credit system controls access — from gpt-5-nano at 1 credit to gpt-5.4-pro at 50 credits per query. Each model gets a dspy.LM with configured temperature clamping (min(1.0, max(0.0, ...))) and max token limits.
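The clamping described above is simple to show directly. This is a sketch, not the repo's exact code; the registry dict is illustrative, with model names and credit costs taken from the text.

```python
def clamp_temperature(requested, lo=0.0, hi=1.0):
    """Clamp a requested temperature into [lo, hi], as min(1.0, max(0.0, ...))."""
    return min(hi, max(lo, requested))

# Illustrative registry entries pairing a model with credits and a
# clamped temperature (the real registry wraps each in a dspy.LM).
MODEL_REGISTRY = {
    "gpt-5-nano":  {"credits": 1,  "temperature": clamp_temperature(0.7)},
    "gpt-5.4-pro": {"credits": 50, "temperature": clamp_temperature(1.3)},
}
```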
The deep analysis mode (src/agents/deep_agents.py) goes further. A deep_questions signature generates follow-up questions from your goal and dataset, decomposing a vague question like "why is churn rising?" into five targeted analytical sub-questions. Those feed back into the agent pipeline for multi-step analysis.
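The fan-out shape of deep analysis can be illustrated without an LLM. In the repo this step is a DSPy signature backed by a model call; here the sub-questions are canned templates, invented purely to show how one vague goal becomes five targeted ones that re-enter the pipeline.

```python
def deep_questions(goal, dataset_desc, n=5):
    """Decompose a vague goal into n targeted sub-questions (stubbed)."""
    templates = [
        "Which segments drive {g}?",
        "How does {g} trend over time?",
        "Which features correlate with {g}?",
        "Are there outliers affecting {g}?",
        "What changed recently relevant to {g}?",
    ]
    return [t.format(g=goal) for t in templates[:n]]

subqs = deep_questions("churn rising", "telecom.csv: churn flag, tenure, plan")
```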
Using it
The hosted version is at autoanalyst.ai. For self-hosting:
```shell
git clone https://github.com/FireBird-Technologies/Auto-Analyst.git
cd Auto-Analyst/auto-analyst-backend
pip install -r requirements.txt
# Set your API keys in .env
python app.py
```
Upload a CSV, then either target an agent directly:
@data_viz_agent Show me a scatter plot of price vs. area
Or let the planner decide:
Clean the data, run a linear regression on price vs. area, and plot the results
The planner outputs a chain — preprocessing, then statistical analysis, then visualization — each agent receiving variables from the previous step.
Rough edges
The codebase is big but has some structural issues. app.py is a 600+ line monolith that mixes routing, middleware, styling configuration (seven Plotly chart templates hardcoded as dictionaries), and business logic. The agent templates are stored in a database but fall back to agents_config.json with a three-path search strategy that checks the backend dir, project root, and /app/ — a sign of deployment-environment coupling.
There are no tests. The test_datasets.ipynb notebook and src/utils/test.ipynb exist but they're exploratory notebooks, not automated test suites. For a codebase that generates and executes arbitrary Python code, that's a gap.
Git activity has slowed. The last meaningful feature commit was in October 2025, with only model updates and LFS cleanup since. The requirements.txt pulls in 50+ packages including pymc, xgboost, lightgbm, optuna, bokeh, and duckdb — a heavy footprint for a platform where most users will use pandas and plotly.
Session state is in-memory via AppState — there's no persistence for uploaded datasets across server restarts, which limits production viability without the enterprise tier.
Bottom line
Auto-Analyst is a solid reference implementation of DSPy-based multi-agent orchestration for data science. If you're building an AI analytics tool and want to see how declarative agent signatures, a planner, and a model registry fit together, this is worth reading. For production use, expect to trim the dependency tree and add your own guardrails around code execution.
