ml-intern: Hugging Face's Autonomous ML Engineer

ml-intern is an autonomous ML agent that reads papers, browses HF docs and datasets, writes training scripts, and launches jobs on HF compute — all orchestrated from a local CLI.

Why I starred it

Most "AI agent" projects glue a few tool calls to an LLM and call it autonomy. This one is different in scope: the agent has direct access to HF's job scheduler, can spin up GPU sandboxes on A100s, search arXiv papers and datasets, write code, and persist everything as a session trace back to your private HF Hub dataset. The surface area is large enough that the engineering required to keep it from looping itself to death — and burning your Bedrock budget — turns out to be the most interesting part.

How it works

The core loop lives in agent/core/agent_loop.py. On each iteration it calls litellm.acompletion, parses any tool_calls[] out of the response, runs an approval check, executes via the ToolRouter, and appends results to the ContextManager. Up to 300 iterations per session. The LLM decides when to stop by returning a response with no tool calls.
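In outline, the loop amounts to something like the minimal sketch below; run_session, openai_schemas, approved, execute, and as_messages are invented names for illustration, while litellm.acompletion and the 300-iteration cap come straight from the description above.

import litellm

MAX_ITERATIONS = 300  # per-session cap from the core loop

async def run_session(messages, router, context, approved, model):
    for _ in range(MAX_ITERATIONS):
        response = await litellm.acompletion(
            model=model,
            messages=messages,
            tools=router.openai_schemas(),   # ToolSpec list rendered as OpenAI-style tool schemas
        )
        msg = response.choices[0].message
        if not msg.tool_calls:               # no tool calls -> the LLM has decided to stop
            return msg.content
        for call in msg.tool_calls:
            if not approved(call):           # approval policy runs before execution
                continue
            result = await router.execute(call)
            context.append(call, result)     # tool results feed the next turn
        messages = context.as_messages()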

The ToolRouter in agent/core/tools.py wires together a mix of built-in tools (research, HF docs, papers, datasets, repos, GitHub code search, sandbox, jobs, planning) and any MCP servers you configure. Built-ins are ToolSpec dataclasses:

from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: dict                          # argument schema passed to the model
    handler: Callable[..., Awaitable[Any]]    # async callable that actually runs the tool

Registering a new tool is one entry in create_builtin_tools(). MCP servers get added to configs/cli_agent_config.json and discovered at startup via fastmcp.
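Concretely, adding a built-in might look like this; the echo tool and its handler are made up for illustration:

async def echo_handler(text: str) -> str:
    return text

def create_builtin_tools() -> list[ToolSpec]:
    return [
        ToolSpec(
            name="echo",
            description="Echo the input back; a smoke-test tool.",
            parameters={
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
            handler=echo_handler,
        ),
        # ...research, docs, papers, datasets, sandbox, jobs, planning tools...
    ]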

The doom loop detector

The most immediately practical piece of engineering is agent/core/doom_loop.py. LLMs get stuck: they call the same tool with the same args, get the same result, and do it again. The naive fix is a hard iteration cap. The actual fix here is a pattern detector that canonicalizes tool-call JSON before hashing it — sorting keys so {"a":1,"b":2} and {"b":2,"a":1} produce the same hash — then scans the last 30 messages for two patterns:

def detect_identical_consecutive(signatures, threshold=3) -> str | None:
    """3+ identical consecutive calls → return the offending tool's name."""

def detect_repeating_sequence(signatures) -> list | None:
    """Sequences of length 2–5 repeated 2+ times (A→B→A→B) → return the repeated block."""

When either fires, the agent injects a [SYSTEM: REPETITION GUARD] message telling the LLM to stop and try something different. The key insight in _normalize_args is that it also factors in the result hash — so legitimate polling (same args, changing output) doesn't falsely trip the detector.
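A hedged sketch of that canonicalization; the function name and exact shape are my guess, not the module's internals:

import hashlib
import json

def call_signature(tool_name: str, args: dict, result: str | None = None) -> str:
    # sort_keys means {"a":1,"b":2} and {"b":2,"a":1} canonicalize to the same string
    canonical = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    if result is not None:
        # folding the result in keeps legitimate polling (same args, changing output) from tripping the detector
        canonical += hashlib.sha256(result.encode()).hexdigest()
    return hashlib.sha256(canonical.encode()).hexdigest()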

Context compaction

Long ML training sessions generate large tool outputs. agent/context_manager/manager.py tracks a running_context_usage counter and triggers compaction when it crosses a configured threshold. There's a _MAX_TOKENS_PER_MESSAGE = 50_000 ceiling — messages over that get replaced with a placeholder before compaction runs. The comment explains why this exists:

# producing the infinite compaction loop seen 2026-05-03 in pod logs
# (200k context shrinks to 200k+ because one tool output is 80k tokens).

When compaction fires, the agent summarizes the middle of its conversation and discards it. If compaction fails — because a single preserved message is still too large — it raises CompactionFailedError and terminates the session cleanly instead of retrying (which would burn ~$3 per attempt on Opus).
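A rough sketch of how the ceiling and the trigger might compose; only the _MAX_TOKENS_PER_MESSAGE value and the placeholder behavior come from the source, the rest (helper names, the head/tail split) is assumed:

_MAX_TOKENS_PER_MESSAGE = 50_000

def maybe_compact(messages, running_context_usage, threshold, count_tokens, summarize):
    if running_context_usage < threshold:
        return messages
    # Cap any single oversized message first; otherwise compaction can't shrink the total
    # (the "200k shrinks to 200k+" failure quoted above).
    capped = [
        m if count_tokens(m["content"]) <= _MAX_TOKENS_PER_MESSAGE
        else {**m, "content": "[tool output elided: exceeded per-message ceiling]"}
        for m in messages
    ]
    if len(capped) <= 6:
        return capped
    # Keep the start and the recent tail; summarize the middle into one message.
    head, middle, tail = capped[:2], capped[2:-4], capped[-4:]
    return head + [{"role": "user", "content": summarize(middle)}] + tail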

The compaction prompt is deliberately decision-focused: "key decisions, the 'why' behind the decisions, problems solved." That framing matters for ML work where a session might span multiple training runs.

Prompt caching

agent/core/prompt_caching.py applies Anthropic's cache_control breakpoints on the tool block and system prompt for every Anthropic model. Non-Anthropic models pass through unchanged. This covers the ~4-5K static tokens that would otherwise be re-billed at full input cost on every turn.
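The general pattern looks roughly like this, a hedged sketch rather than the project's exact code; litellm forwards Anthropic-style cache_control blocks like these:

def apply_cache_control(model: str, messages: list[dict], tools: list[dict]):
    if not model.startswith("anthropic/"):
        return messages, tools                           # non-Anthropic models pass through untouched
    breakpoint_marker = {"cache_control": {"type": "ephemeral"}}
    if tools:
        tools[-1] = {**tools[-1], **breakpoint_marker}   # caches everything up through the tool block
    for m in messages:
        if m.get("role") == "system":
            # system prompt becomes a content-block list so the breakpoint can attach to it
            m["content"] = [{"type": "text", "text": m["content"], **breakpoint_marker}]
            break
    return messages, tools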

HF jobs tool

agent/tools/jobs_tool.py talks to HF's job scheduler via the official huggingface_hub Python library. The full hardware menu goes from cpu-basic (2 vCPU / 16 GB) through a100x8 (96 vCPU / 1136 GB / 640 GB GPU). Scheduled jobs are distinguished from immediate runs by checking if the operation string starts with "scheduled " — which is the entirety of agent/core/approval_policy.py. Immediate GPU jobs and sandbox creates require user approval; scheduled jobs don't. The approval check lives before tool execution in the main loop.
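If that description is accurate, the whole policy reduces to roughly:

def requires_approval(operation: str) -> bool:
    """Immediate GPU jobs and sandbox creates prompt the user; scheduled jobs run without one."""
    return not operation.startswith("scheduled ")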

Session upload

Every session gets uploaded to a private HF dataset ({username}/ml-intern-sessions) in Claude Code's JSONL trace format, which HF's Agent Trace Viewer understands natively. You can flip it public, private, or opt out entirely via config.
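Conceptually that's a JSONL serialization plus one huggingface_hub upload; the filename and event shape below are illustrative, the API calls are real:

import json
from huggingface_hub import HfApi

def upload_session(username: str, session_id: str, events: list[dict]):
    api = HfApi()
    repo_id = f"{username}/ml-intern-sessions"
    api.create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True)
    trace = "\n".join(json.dumps(e) for e in events)      # one trace event per line (JSONL)
    api.upload_file(
        path_or_fileobj=trace.encode(),
        path_in_repo=f"{session_id}.jsonl",
        repo_id=repo_id,
        repo_type="dataset",
    )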

Using it

git clone git@github.com:huggingface/ml-intern.git
cd ml-intern
uv sync
uv tool install -e .

Then:

# Interactive
ml-intern

# Headless with a specific model
ml-intern --model anthropic/claude-opus-4-7 "fine-tune a DistilBERT on my sentiment dataset"

# Point at a local vLLM endpoint
ml-intern --model vllm/meta-llama/Llama-3.1-8B-Instruct "your prompt"

You can switch models mid-session with /model and share session traces with /share-traces public.

Rough edges

The test suite is solid — 35+ unit tests covering the doom loop, compaction, approval policy, cost estimation, malformed args recovery, and session persistence. Integration tests require live credentials. The web frontend (frontend/) exists but the README focuses entirely on the CLI; the frontend config is a separate JSON with its own set of defaults.

The HF whoami call in context_manager/manager.py uses subprocess.run(["curl", "-4", ...]) instead of any HTTP library. The comment explains it: Python's httpx and urllib try IPv6 first, causing 40+ second hangs before falling back to IPv4. That's a pragmatic fix for a real deployment problem, but it's also the kind of thing that breaks on Windows.
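For reference, the call amounts to something like this simplified sketch against HF's whoami-v2 endpoint:

import json
import subprocess

def hf_whoami(token: str) -> dict:
    # -4 forces IPv4, skipping the IPv6-first stalls; -s silences curl's progress output
    out = subprocess.run(
        ["curl", "-4", "-s", "https://huggingface.co/api/whoami-v2",
         "-H", f"Authorization: Bearer {token}"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)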

The SFT tooling in agent/sft/ is referenced but not documented in the README at all.

Dependency surface is heavy: litellm, fastmcp, huggingface_hub, httpx, jinja2, rich, prompt_toolkit. For a CLI agent that's expected, but worth knowing before you pull it into an existing environment.

Bottom line

If you're doing active ML work in the HF ecosystem and want an agent that can actually launch training jobs rather than just draft Python files for you, this is worth running. The doom loop detector and compaction logic show the team has run this in production long enough to hit the failure modes.

huggingface/ml-intern on GitHub