Katana: Next-Gen Web Crawler Built for Pipelines

November 17, 2024

repo-review

by Florian Narr

Katana is a web crawler from ProjectDiscovery — the same team behind nuclei, subfinder, and httpx. It handles both standard HTTP crawling and headless browser crawling, parses JavaScript for endpoints, and is explicitly designed to slot into automation pipelines rather than replace a browser.

Why I starred it

Most web crawlers are either too simple (wget's recursive mode) or too heavy (a full Scrapy stack with middleware). Katana sits in the middle: a CLI tool with the scope control and output options you'd want for security reconnaissance or content mapping, yet fast enough to run in a shell pipeline without babysitting.

What actually got my attention was the traversal strategy configuration. You can switch between depth-first and breadth-first at the CLI level (-s depth-first or -s breadth-first), and the queue implementation backs that with a real heap. That's not something most crawlers expose.

How it works

The architecture is cleaner than I expected. The core interface in pkg/engine/engine.go is just two methods:

type Engine interface {
    Crawl(string) error
    Close() error
}

Three implementations satisfy it: standard, headless, and hybrid. The standard crawler wraps retryablehttp-go with a cookie jar and a sizedwaitgroup for concurrency. The headless crawler uses go-rod to drive a real Chromium instance.
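Keeping the interface that small is what makes the engines interchangeable at the call site. A toy sketch (my own code, not katana's) with an in-memory engine standing in for standard, headless, and hybrid:

```go
package main

import "fmt"

// Engine mirrors the two-method interface in pkg/engine/engine.go.
type Engine interface {
	Crawl(string) error
	Close() error
}

// memoryEngine is a toy implementation that just records seeds;
// katana's real engines satisfy the same two methods.
type memoryEngine struct {
	visited []string
	closed  bool
}

func (m *memoryEngine) Crawl(url string) error {
	m.visited = append(m.visited, url)
	return nil
}

func (m *memoryEngine) Close() error {
	m.closed = true
	return nil
}

// run is written against the interface, so swapping engines
// never touches the call site.
func run(e Engine, seeds []string) error {
	defer e.Close()
	for _, seed := range seeds {
		if err := e.Crawl(seed); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	eng := &memoryEngine{}
	_ = run(eng, []string{"https://example.com"})
	fmt.Println("visited:", eng.visited)
}
```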

The queue in pkg/utils/queue/ is where the traversal strategy lives. It's a priority queue backed by container/heap. When you push a URL, you push it with its depth as the priority:

func (ih *itemHeap) Less(i, j int) bool {
    return (*ih)[i].priority < (*ih)[j].priority
}

Lower depth values pop first — that's breadth-first. Flip the comparison and you get depth-first. The Strategy type in strategy.go maps CLI strings to the right queue behavior. Clean separation.
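The same mechanism in miniature (illustrative code, not katana's actual queue): with container/heap and depth as the priority, a min-heap pops shallow URLs first, which is breadth-first; inverting the comparison (or pushing negated depths) flips it to depth-first.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item pairs a URL with its crawl depth, used as the heap priority.
type item struct {
	url      string
	priority int
}

type itemHeap []item

func (ih itemHeap) Len() int           { return len(ih) }
func (ih itemHeap) Less(i, j int) bool { return ih[i].priority < ih[j].priority }
func (ih itemHeap) Swap(i, j int)      { ih[i], ih[j] = ih[j], ih[i] }
func (ih *itemHeap) Push(x any)        { *ih = append(*ih, x.(item)) }
func (ih *itemHeap) Pop() any {
	old := *ih
	n := len(old)
	it := old[n-1]
	*ih = old[:n-1]
	return it
}

// drain pops URLs in priority order. With depth as the priority this
// is breadth-first; pushing -depth instead would make it depth-first.
func drain(items []item) []string {
	h := &itemHeap{}
	heap.Init(h)
	for _, it := range items {
		heap.Push(h, it)
	}
	var out []string
	for h.Len() > 0 {
		out = append(out, heap.Pop(h).(item).url)
	}
	return out
}

func main() {
	order := drain([]item{
		{"/deep/page", 2},
		{"/", 0},
		{"/about", 1},
	})
	fmt.Println(order) // shallowest first: [/ /about /deep/page]
}
```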

The parser in pkg/engine/parser/parser.go is a chain of ResponseParserFunc functions, each receiving a *navigation.Response and returning new *navigation.Request items. The list in NewResponseParser() covers everything — <a> tags, <link href>, <img src>, <iframe>, <embed>, HTTP Link headers, meta refresh, HTMX attributes, and custom field regexes. All modular. Adding a new source means adding one function to the chain.
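The shape of that chain is easy to sketch. The types below (response, request, parserFunc) are simplified stand-ins for katana's navigation.Response, navigation.Request, and ResponseParserFunc, and the regex extractors are my own illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for katana's navigation types.
type response struct{ body string }
type request struct{ url string }

// parserFunc mirrors the shape of ResponseParserFunc:
// one response in, zero or more new requests out.
type parserFunc func(resp *response) []request

var hrefRe = regexp.MustCompile(`href="([^"]+)"`)
var srcRe = regexp.MustCompile(`src="([^"]+)"`)

func hrefParser(resp *response) []request {
	var out []request
	for _, m := range hrefRe.FindAllStringSubmatch(resp.body, -1) {
		out = append(out, request{url: m[1]})
	}
	return out
}

func srcParser(resp *response) []request {
	var out []request
	for _, m := range srcRe.FindAllStringSubmatch(resp.body, -1) {
		out = append(out, request{url: m[1]})
	}
	return out
}

// parseResponse runs every parser in the chain; adding a new
// source is just appending one function to the slice.
func parseResponse(chain []parserFunc, resp *response) []request {
	var out []request
	for _, p := range chain {
		out = append(out, p(resp)...)
	}
	return out
}

func main() {
	resp := &response{body: `<a href="/about"></a><img src="/logo.png">`}
	fmt.Println(parseResponse([]parserFunc{hrefParser, srcParser}, resp))
}
```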

Content deduplication happens in the request loop in pkg/engine/standard/crawl.go:

if !c.Options.UniqueFilter.UniqueContent(data) {
    return &navigation.Response{}, nil
}

There's a separate UniqueURL check in the queue's Enqueue path — URL-level dedup happens before the request, content-level dedup after. The mfonda/simhash dependency suggests content similarity hashing is somewhere in there, though I didn't trace it all the way.
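Where the two checks sit can be sketched like this. This is illustrative only: katana's content filter uses simhash for near-duplicate detection, while the sketch uses an exact sha256 hash just to show the URL-before-request, content-after-response split:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// uniqueFilter tracks URLs (checked before a request is sent) and
// content hashes (checked after the body arrives). An exact hash
// stands in for katana's simhash-based similarity check.
type uniqueFilter struct {
	urls    map[string]bool
	content map[[32]byte]bool
}

func newUniqueFilter() *uniqueFilter {
	return &uniqueFilter{urls: map[string]bool{}, content: map[[32]byte]bool{}}
}

// uniqueURL runs in the Enqueue path, before any request is made.
func (u *uniqueFilter) uniqueURL(url string) bool {
	if u.urls[url] {
		return false
	}
	u.urls[url] = true
	return true
}

// uniqueContent runs in the request loop, after the response body arrives.
func (u *uniqueFilter) uniqueContent(body []byte) bool {
	sum := sha256.Sum256(body)
	if u.content[sum] {
		return false
	}
	u.content[sum] = true
	return true
}

func main() {
	f := newUniqueFilter()
	fmt.Println(f.uniqueURL("/a"), f.uniqueURL("/a"))                       // true false
	fmt.Println(f.uniqueContent([]byte("x")), f.uniqueContent([]byte("x"))) // true false
}
```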

The Enqueue method in pkg/engine/common/base.go is worth reading. It does the full validation chain: URL format, query param normalization, depth check, uniqueness, cycle detection, scope validation — all before a URL hits the queue. Depth filtering intentionally skips consuming the uniqueness cache, so a URL discovered over the limit at one depth can still be visited if found at a valid depth via another path.
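A sketch of that ordering (my own simplification, not katana's code): the depth check returns before the seen-cache is written, so a too-deep discovery doesn't poison the cache against a later valid one.

```go
package main

import (
	"fmt"
	"net/url"
)

// enqueueCheck sketches the validation ordering: format and depth
// checks run *before* the uniqueness cache is consumed, so a URL
// rejected for depth is not recorded as seen and can still be
// crawled if rediscovered at a valid depth via another path.
func enqueueCheck(seen map[string]bool, raw string, depth, maxDepth int) bool {
	// 1. URL format validation.
	if _, err := url.ParseRequestURI(raw); err != nil {
		return false
	}
	// 2. Depth check: note it returns without touching the seen cache.
	if depth > maxDepth {
		return false
	}
	// 3. Uniqueness / cycle detection.
	if seen[raw] {
		return false
	}
	seen[raw] = true
	return true
}

func main() {
	seen := map[string]bool{}
	// Found too deep first: rejected, but not cached as seen.
	fmt.Println(enqueueCheck(seen, "https://example.com/x", 5, 3)) // false
	// Rediscovered at a valid depth: still enqueued.
	fmt.Println(enqueueCheck(seen, "https://example.com/x", 2, 3)) // true
}
```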

The resume functionality is notable too: on Ctrl+C, it writes the current crawl state to ~/.config/katana/resume-<xid>.cfg. Old resume files (10+ days) are cleaned up automatically on startup.

Using it

Install:

go install github.com/projectdiscovery/katana/cmd/katana@latest

Basic crawl with JSON output and JavaScript parsing:

katana -u https://example.com -jc -j -d 3

Scope to the registered domain and run headless to catch client-rendered endpoints:

katana -u https://example.com -hl -fs rdn -d 4 -c 20

Filter to just interesting paths using DSL conditions:

katana -u https://example.com -mdc "status_code == 200 && contains(path, 'api')"

Store per-host fields for later analysis:

katana -u https://example.com -sf url,path -j -o results.jsonl

The --list-output-fields flag is useful — it prints every field available in JSON output, drawn from output.Result, navigation.Request, and navigation.Response structs.

The -filter-similar flag fingerprints URL paths with a trie and collapses /users/123 and /users/456 into a single crawl target once enough similar paths have been seen. The threshold is configurable via -fst.
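A simplified version of that collapsing, using a map over normalized path patterns rather than katana's trie (the threshold field plays the role of -fst; all names here are my own):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// similarFilter collapses URL paths that differ only in numeric
// segments, e.g. /users/123 and /users/456. Counting normalized
// patterns in a map captures the effect of katana's trie-based
// path fingerprinting.
type similarFilter struct {
	counts    map[string]int
	threshold int
}

// normalize replaces all-digit path segments with a {num} placeholder.
func normalize(path string) string {
	parts := strings.Split(path, "/")
	for i, p := range parts {
		allDigits := p != "" && strings.IndexFunc(p, func(r rune) bool {
			return !unicode.IsDigit(r)
		}) == -1
		if allDigits {
			parts[i] = "{num}"
		}
	}
	return strings.Join(parts, "/")
}

// shouldCrawl returns false once enough similar paths have been seen.
func (s *similarFilter) shouldCrawl(path string) bool {
	key := normalize(path)
	s.counts[key]++
	return s.counts[key] <= s.threshold
}

func main() {
	f := &similarFilter{counts: map[string]int{}, threshold: 2}
	fmt.Println(f.shouldCrawl("/users/123")) // true
	fmt.Println(f.shouldCrawl("/users/456")) // true
	fmt.Println(f.shouldCrawl("/users/789")) // false: pattern collapsed
}
```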

Rough edges

The headless engine is explicitly marked experimental in the CLI flags (enable headless crawling (experimental)), and the pkg/engine/headless/TODOS.md file confirms there's unfinished work there. The hybrid mode — standard + headless together — is also experimental.

JavaScript crawling via -jsluice is called out as memory intensive in the flag description, with no further detail given. Worth testing against your target's JS bundle size before running it at scale.

The --automatic-form-fill flag exists but also carries an (experimental) label. I wouldn't rely on it for any thorough form interaction.

Documentation is reasonable but the CLI flag surface is enormous — 50+ flags across 8 groups. The README is a wall of flags. There's no getting-started guide beyond install + basic example.

Bottom line

If you're doing recon, content auditing, or building any pipeline that needs systematic URL discovery, katana is the right tool. The clean engine interface means you can also use it as a Go library rather than shelling out. Headless mode works but don't expect it to be stable for production automation yet.

projectdiscovery/katana on GitHub