What it does
agent-browser is a native Rust CLI for browser automation, designed specifically for AI agents. Instead of scripting through Playwright or Puppeteer, agents issue shell commands like agent-browser click @e3 or agent-browser snapshot and get structured output back. No Node.js runtime required at the browser layer.
Why I starred it
Browser automation for AI agents is a problem with too many moving parts. Most solutions layer Python or Node.js bindings on top of Chrome DevTools Protocol, adding latency and memory overhead that matters when an agent is running hundreds of browser actions per session. agent-browser strips all of that out. It compiles to a single Rust binary that talks CDP directly over WebSockets, manages a background daemon for session persistence, and exposes everything through CLI commands that any agent framework can call.
The ref-based interaction model is what really caught my attention. Instead of forcing agents to reason about CSS selectors or XPaths, agent-browser snapshot returns an accessibility tree where every interactive element gets a stable ref like @e2. The agent reads the tree, picks a ref, and runs agent-browser click @e2. That is a much better interface for an LLM than trying to construct document.querySelector('#app > div:nth-child(3) > button').
How it works
The architecture is a daemon model. When you run your first command, connection.rs spawns a background daemon process. The CLI communicates with this daemon over Unix domain sockets (TCP on Windows). Every subsequent command in the same session connects to the existing daemon, avoiding Chrome startup overhead on each invocation.
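The pattern is the classic long-lived local daemon behind a Unix socket. Here is a minimal, self-contained sketch of that round trip; the socket path and the "ok: ..." reply format are invented for the demo, and the real daemon speaks a JSON protocol on its own session socket:

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::thread;

// One round trip over a Unix domain socket: a stand-in for the
// CLI-to-daemon hop. Path and reply format are invented for this demo.
fn roundtrip(cmd: &str) -> String {
    let sock = std::env::temp_dir().join("ab-demo.sock");
    let _ = std::fs::remove_file(&sock);
    let listener = UnixListener::bind(&sock).expect("bind");

    // Daemon side: accept one connection, read a command line, reply.
    let daemon = thread::spawn(move || {
        let (stream, _) = listener.accept().expect("accept");
        let mut reader = BufReader::new(stream.try_clone().expect("clone"));
        let mut line = String::new();
        reader.read_line(&mut line).expect("read");
        let mut w = stream;
        writeln!(w, "ok: {}", line.trim()).expect("write");
    });

    // CLI side: connect to the already-listening socket, send, wait.
    // Because the daemon (and Chrome behind it) stays up, repeat
    // commands skip browser startup entirely.
    let mut client = UnixStream::connect(&sock).expect("connect");
    writeln!(client, "{cmd}").expect("send");
    let mut reply = String::new();
    BufReader::new(client).read_line(&mut reply).expect("recv");
    daemon.join().unwrap();
    reply.trim().to_string()
}

fn main() {
    println!("{}", roundtrip("snapshot -i"));
}
```

The cheap part is everything after the first command: connecting to an existing Unix socket costs microseconds, while cold-starting Chrome costs seconds.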
The daemon itself lives in cli/src/native/daemon.rs. It listens on a Unix socket, receives JSON-encoded commands, and dispatches them through execute_command in cli/src/native/actions.rs. The CDP connection is maintained by CdpClient in cli/src/native/cdp/client.rs — a clean WebSocket client built on tokio-tungstenite with its own message routing:
pub struct CdpClient {
    ws_tx: Arc<Mutex<SplitSink<WebSocketStream<...>, Message>>>,
    next_id: AtomicU64,
    pending: PendingMap,
    event_tx: broadcast::Sender<CdpEvent>,
    raw_tx: broadcast::Sender<RawCdpMessage>,
    _reader_handle: tokio::task::JoinHandle<()>,
    _keepalive_handle: tokio::task::JoinHandle<()>,
}
Each CDP command gets an atomic ID, a oneshot channel for the response, and the reader task routes incoming messages to either pending responses or the event broadcast. The keepalive task sends WebSocket pings every 30 seconds to survive load balancer timeouts. Straightforward, but the details matter: the reader handles both Text and Binary CDP frames, because remote providers like Browserless send responses as binary.
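The ID-plus-pending-map routing can be sketched with the standard library alone, using an mpsc channel in place of tokio's oneshot; the names below are illustrative, not the actual CdpClient internals:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{mpsc, Arc, Mutex};

// Pending responses keyed by command id; a std mpsc channel stands in
// for tokio's oneshot in this sketch.
type Pending = Arc<Mutex<HashMap<u64, mpsc::Sender<String>>>>;

struct Client {
    next_id: AtomicU64,
    pending: Pending,
}

impl Client {
    fn new() -> Self {
        Client { next_id: AtomicU64::new(1), pending: Arc::default() }
    }

    // Register a command and hand back the receiver the caller waits on.
    fn send(&self, _method: &str) -> (u64, mpsc::Receiver<String>) {
        let id = self.next_id.fetch_add(1, Ordering::SeqCst);
        let (tx, rx) = mpsc::channel();
        self.pending.lock().unwrap().insert(id, tx);
        (id, rx)
    }

    // What the reader task does with each incoming frame: match an id
    // to its waiter, or (not shown) broadcast id-less frames as events.
    fn route(&self, id: u64, payload: &str) {
        if let Some(tx) = self.pending.lock().unwrap().remove(&id) {
            let _ = tx.send(payload.to_string());
        }
    }
}

fn main() {
    let client = Client::new();
    let (id, rx) = client.send("Page.navigate");
    client.route(id, "{\"frameId\":\"F1\"}");
    assert_eq!(rx.recv().unwrap(), "{\"frameId\":\"F1\"}");
    println!("response for command {id} routed");
}
```

Removing the entry from the map on delivery is what keeps the pending set from growing without bound across a long session.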
The snapshot system in cli/src/native/snapshot.rs (1,586 lines) is the most interesting piece. It fetches the full accessibility tree via CDP's Accessibility.getFullAXTree, then builds a custom TreeNode representation that categorizes every node by role. The role classification is explicit — INTERACTIVE_ROLES (buttons, links, textboxes), CONTENT_ROLES (headings, cells), and STRUCTURAL_ROLES (generic, group, list) are defined as const arrays at the top of the file. Interactive elements get refs assigned; structural elements get pruned in compact mode.
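The classification itself amounts to membership checks against those const arrays. A toy version, with heavily abbreviated role lists and invented type names:

```rust
// Abbreviated role tables in the spirit of the const arrays in
// snapshot.rs; the real lists are much longer.
const INTERACTIVE_ROLES: &[&str] = &["button", "link", "textbox", "checkbox"];
const CONTENT_ROLES: &[&str] = &["heading", "cell", "paragraph"];
const STRUCTURAL_ROLES: &[&str] = &["generic", "group", "list"];

#[derive(Debug, PartialEq)]
enum RoleClass { Interactive, Content, Structural, Other }

fn classify(role: &str) -> RoleClass {
    if INTERACTIVE_ROLES.contains(&role) {
        RoleClass::Interactive
    } else if CONTENT_ROLES.contains(&role) {
        RoleClass::Content
    } else if STRUCTURAL_ROLES.contains(&role) {
        RoleClass::Structural
    } else {
        RoleClass::Other
    }
}

fn main() {
    // Interactive nodes get refs; structural nodes get pruned in
    // compact mode.
    assert_eq!(classify("button"), RoleClass::Interactive);
    assert_eq!(classify("group"), RoleClass::Structural);
    println!("button -> {:?}", classify("button"));
}
```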
The ref assignment in element.rs generates short IDs like @e1, @e2 from backend DOM node IDs. This means refs are stable across snapshots as long as the DOM does not change — an agent can snapshot, reason, and act without refs going stale between commands.
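That stability falls out of keying refs by backend node ID rather than by tree position. A minimal sketch of the idea (not the actual element.rs code):

```rust
use std::collections::HashMap;

// Refs are allocated per backend DOM node id, so the same node keeps
// the same @eN across repeated snapshots. Names here are invented.
#[derive(Default)]
struct RefMap {
    by_backend_id: HashMap<u64, String>,
    next: u32,
}

impl RefMap {
    fn get_or_assign(&mut self, backend_node_id: u64) -> String {
        if let Some(r) = self.by_backend_id.get(&backend_node_id) {
            return r.clone(); // already seen: same ref as last snapshot
        }
        self.next += 1;
        let r = format!("@e{}", self.next);
        self.by_backend_id.insert(backend_node_id, r.clone());
        r
    }
}

fn main() {
    let mut refs = RefMap::default();
    let first = refs.get_or_assign(4242);
    refs.get_or_assign(9001);
    // Snapshotting again yields the same ref for the same node.
    assert_eq!(refs.get_or_assign(4242), first);
    println!("node 4242 -> {first}");
}
```

Keying by position instead would renumber everything whenever an element is inserted, which is exactly the staleness problem this design avoids.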
There is also a built-in diff system. cli/src/native/diff.rs provides both snapshot diffs (text-based, using the similar crate for unified diffs) and screenshot diffs (pixel-level comparison with configurable color distance thresholds). An agent can run agent-browser diff snapshot to see what changed after an action — useful for verifying that a click actually did something.
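The screenshot side of the diff reduces to a per-pixel color-distance test. A toy version over raw RGB triples (the real code works on decoded image buffers via the image crate):

```rust
// Toy pixel-level diff with a color-distance threshold, in the spirit
// of the screenshot diff in diff.rs.
type Rgb = (u8, u8, u8);

// Euclidean distance in RGB space.
fn distance(a: Rgb, b: Rgb) -> f64 {
    let dr = a.0 as f64 - b.0 as f64;
    let dg = a.1 as f64 - b.1 as f64;
    let db = a.2 as f64 - b.2 as f64;
    (dr * dr + dg * dg + db * db).sqrt()
}

// Count pixels whose color distance exceeds the threshold.
fn changed_pixels(before: &[Rgb], after: &[Rgb], threshold: f64) -> usize {
    before
        .iter()
        .zip(after)
        .filter(|(a, b)| distance(**a, **b) > threshold)
        .count()
}

fn main() {
    let before = [(255, 255, 255), (0, 0, 0), (10, 10, 10)];
    let after = [(255, 255, 255), (0, 0, 200), (12, 12, 12)];
    // A small threshold ignores compression noise but flags real changes.
    let n = changed_pixels(&before, &after, 16.0);
    assert_eq!(n, 1);
    println!("{n} pixel(s) changed");
}
```

The configurable threshold is what makes this usable in practice: exact equality would flag every anti-aliasing and JPEG artifact as a change.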
Using it
# Install and set up Chrome
npm install -g agent-browser
agent-browser install
# Basic workflow
agent-browser open https://news.ycombinator.com
agent-browser snapshot -i # Interactive elements only
# Output (abbreviated):
# - link "Hacker News" [@e1]
# - link "new" [@e2]
# - link "past" [@e3]
# - link "comments" [@e4]
# ...
agent-browser click @e2 # Click "new"
agent-browser diff snapshot # See what changed
agent-browser screenshot page.png
Batch mode avoids per-command process startup:
agent-browser batch "open https://example.com" "snapshot -i" "click @e1" "screenshot"
The chat command is worth noting — it takes natural language instructions and translates them to browser actions:
agent-browser chat "go to github.com and search for rust projects"
Rough edges
The dependency footprint is lean for what it does — tokio, serde_json, reqwest, image, and a handful of crypto crates for the auth vault. The release profile is aggressive: opt-level = 3, lto = true, codegen-units = 1, strip = true. That produces a fast, small binary, but compilation from source is slow.
Chrome discovery is robust (detects Chrome, Brave, Playwright, and Puppeteer installations), but agent-browser install downloads Chrome for Testing separately. If you already have Chrome, you are now managing two Chrome installations.
The test coverage is decent — snapshot.rs has unit tests for tree compaction, deduplication, and cross-origin iframe handling. But many of the native modules lack test files entirely. parity_tests.rs and e2e_tests.rs exist, but they depend on a running browser, so they are integration tests rather than unit tests.
At 28k stars and commits landing daily as of April 2026 (version 0.25.3), the project is actively maintained. The 0.x version signals it is still finding its API surface, but the pace of development is fast.
Bottom line
If you are building AI agents that need to interact with real web pages, agent-browser is the best CLI-first approach available. The daemon architecture keeps it fast, the ref-based snapshot model is a smart abstraction for LLMs, and the Rust core means you are not dragging a Node.js runtime into your agent loop.
