Crawlee: Web Scraping and Browser Automation for Node.js

December 23, 2024

repo-review

by Florian Narr


Crawlee is a Node.js library for web scraping and browser automation that abstracts over Playwright, Puppeteer, Cheerio, and raw HTTP — letting you swap backends while keeping the same request handler API. It's the backbone of the Apify platform but runs standalone just fine.

Why I starred it

Most scraping libraries punt on the operational problems. They give you a fetch wrapper or a browser handle, and you figure out concurrency, retries, proxy rotation, session persistence, and rate limiting yourself. Crawlee treats those as first-class concerns. The interesting part isn't the crawling API — it's the infrastructure it runs on top of.

How it works

The library is a monorepo under packages/. The key packages: @crawlee/core (base crawlers, queue, autoscaling), @crawlee/http (HTTP + Cheerio/JSDOM), @crawlee/playwright and @crawlee/puppeteer (browser crawlers), and the top-level crawlee convenience package that re-exports everything.

The autoscaling pool is the most interesting piece. It lives in packages/core/src/autoscaling/autoscaled_pool.ts and manages concurrency dynamically rather than running a fixed pool size. Every crawl runs through AutoscaledPool, which queries three signal sources every 10 seconds: CPU load, memory usage, and event loop lag (via snapshotter.ts). If any signal exceeds its threshold, the pool shrinks concurrency. If everything is healthy, it scales up by scaleUpStepRatio * desiredConcurrency, capped at maxConcurrency (default 200).

// packages/core/src/autoscaling/autoscaled_pool.ts
private _desiredConcurrency: number;
private readonly scaleUpStepRatio: number;   // default 0.05
private readonly scaleDownStepRatio: number; // default 0.05
private readonly maxTasksPerMinute: number;

The event loop signal is clever — it measures actual tick latency and marks the pool as overloaded if the Node.js event loop blocks for more than 50ms (maxBlockedMillis). That catches situations where CPU metrics look fine but the process is effectively stalled.
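The scaling decision each interval boils down to a small amount of arithmetic. Here's a minimal sketch of that logic — the step ratios and the 200-connection cap follow the defaults described above, but `nextConcurrency`, `SystemStatus`, and the exact clamping are illustrative, not the library's actual code:

```typescript
// Sketch of the per-interval scaling decision in AutoscaledPool.
// `SystemStatus` and `nextConcurrency` are hypothetical names.
interface SystemStatus {
  cpuOverloaded: boolean;
  memOverloaded: boolean;
  eventLoopOverloaded: boolean; // tick latency exceeded maxBlockedMillis (50ms)
}

function nextConcurrency(
  desired: number,
  status: SystemStatus,
  opts = {
    scaleUpStepRatio: 0.05,
    scaleDownStepRatio: 0.05,
    minConcurrency: 1,
    maxConcurrency: 200,
  },
): number {
  const overloaded =
    status.cpuOverloaded || status.memOverloaded || status.eventLoopOverloaded;
  if (overloaded) {
    // Any single overloaded signal shrinks the pool.
    const step = Math.max(1, Math.ceil(opts.scaleDownStepRatio * desired));
    return Math.max(opts.minConcurrency, desired - step);
  }
  // Healthy: grow proportionally to current desired concurrency,
  // so large pools adjust in larger absolute increments.
  const step = Math.max(1, Math.ceil(opts.scaleUpStepRatio * desired));
  return Math.min(opts.maxConcurrency, desired + step);
}
```

The proportional step means a pool running at concurrency 100 grows by 5 per interval, while one at 20 grows by 1 — gentle near the bottom, faster near capacity.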

The request queue (v2, in packages/core/src/storages/request_queue_v2.ts) uses a local list-and-lock pattern. It fetches a batch of 25 requests at a time, locks them, and keeps a cache of up to 2,000,000 request IDs in memory. There's a RECENTLY_HANDLED_CACHE_SIZE of 1,000 to avoid immediate re-processing of just-completed requests. The v2 queue exists alongside the original — the change was adding a proper list-and-lock primitive to avoid double-processing under concurrent workers.
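The list-and-lock pattern is easy to sketch in isolation. This toy queue uses the batch size and recently-handled cache size mentioned above, but the class itself is a simplified illustration, not the v2 implementation:

```typescript
// Illustrative list-and-lock queue: workers fetch a batch, the batch is
// locked so concurrent workers skip it, and a bounded cache of recently
// handled IDs blocks immediate re-enqueueing.
class LockingQueue {
  private pending: string[] = [];
  private locked = new Set<string>();
  private recentlyHandled: string[] = []; // bounded FIFO cache

  constructor(
    private batchSize = 25,
    private recentlyHandledCacheSize = 1000,
  ) {}

  add(id: string): void {
    // Skip IDs that are in flight or were just completed.
    if (this.locked.has(id) || this.recentlyHandled.includes(id)) return;
    if (!this.pending.includes(id)) this.pending.push(id);
  }

  // A worker takes — and locks — up to `batchSize` requests at once.
  fetchBatch(): string[] {
    const batch = this.pending.splice(0, this.batchSize);
    for (const id of batch) this.locked.add(id);
    return batch;
  }

  markHandled(id: string): void {
    this.locked.delete(id);
    this.recentlyHandled.push(id);
    if (this.recentlyHandled.length > this.recentlyHandledCacheSize) {
      this.recentlyHandled.shift(); // evict oldest entry
    }
  }
}
```

The lock is what makes concurrent workers safe: a request is either pending, locked by exactly one worker, or recently handled — never in two states at once.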

Session management in packages/core/src/session_pool/session.ts models each session as a user identity with cookies (tough-cookie jar), an error score, and a usage counter. Sessions self-heal: each successful request decrements errorScore by errorScoreDecrement (default 0.5). Hit maxErrorScore (default 3) and the session is retired and replaced. This error score design means a session isn't killed on first failure — useful for transient network errors that shouldn't burn a good session.
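The self-healing behavior falls out of two counters. A minimal sketch, using the defaults quoted above (`maxErrorScore` 3, `errorScoreDecrement` 0.5) — the class is illustrative, not the library's `Session`:

```typescript
// Sketch of the error-score lifecycle: failures raise the score by 1,
// successes decay it, and the session retires once maxErrorScore is hit.
class Session {
  errorScore = 0;
  usageCount = 0;

  constructor(
    private maxErrorScore = 3,
    private errorScoreDecrement = 0.5,
  ) {}

  markGood(): void {
    this.usageCount += 1;
    // Successful requests gradually repair a damaged session.
    this.errorScore = Math.max(0, this.errorScore - this.errorScoreDecrement);
  }

  markBad(): void {
    this.usageCount += 1;
    this.errorScore += 1;
  }

  get usable(): boolean {
    return this.errorScore < this.maxErrorScore;
  }
}
```

With these numbers, one transient failure costs two subsequent successes to fully repay — a session only dies if failures outpace successes 1:2.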

The router (packages/core/src/router.ts) is a simple label-based dispatcher. You tag requests with labels and register handlers per label. It's backed by a Map<string | symbol, handler> with a defaultRoute symbol for unmatched requests. Middleware runs before the matched handler via an ordered array of functions.

// packages/core/src/router.ts — the routing dispatch
private readonly routes: Map<string | symbol, (ctx: any) => Awaitable<void>> = new Map();
private readonly middlewares: ((ctx: Context) => Awaitable<void>)[] = [];
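Stripped of Crawlee's context types, the dispatch logic is a few lines. A sketch of the pattern — `MiniRouter`, `Ctx`, and `dispatch` are hypothetical names, not the library's API:

```typescript
// Label-based dispatch: run middlewares in order, then the handler
// registered for the request's label, falling back to a default route.
type Ctx = { label?: string; log: string[] };
type Handler = (ctx: Ctx) => void | Promise<void>;

const DEFAULT_ROUTE = Symbol("default");

class MiniRouter {
  private routes = new Map<string | symbol, Handler>();
  private middlewares: Handler[] = [];

  addHandler(label: string, handler: Handler): void {
    this.routes.set(label, handler);
  }

  addDefaultHandler(handler: Handler): void {
    this.routes.set(DEFAULT_ROUTE, handler);
  }

  use(mw: Handler): void {
    this.middlewares.push(mw);
  }

  async dispatch(ctx: Ctx): Promise<void> {
    for (const mw of this.middlewares) await mw(ctx);
    // Unlabeled or unknown labels fall through to the default route.
    const handler =
      this.routes.get(ctx.label ?? DEFAULT_ROUTE) ?? this.routes.get(DEFAULT_ROUTE);
    if (!handler) throw new Error(`No route for label ${String(ctx.label)}`);
    await handler(ctx);
  }
}
```

Using a `symbol` for the default route guarantees it can never collide with a user-supplied string label.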

Using it

Bootstrap a new project with the CLI:

npx crawlee create my-crawler
cd my-crawler && npm start

Or wire up a Playwright crawler manually:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks(); // auto-extracts and queues all links on the page
    },
    maxConcurrency: 10,
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://example.com']);

For pure HTML scraping with zero browser overhead:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, enqueueLinks, pushData }) {
        const title = $('title').text();
        await pushData({ title });
        await enqueueLinks({ globs: ['https://example.com/**'] });
    },
});

The labeled router pattern becomes useful when you're crawling sites with structurally different pages (listing vs. detail):

crawler.router.addHandler('LISTING', async ({ $, enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
});

crawler.router.addHandler('DETAIL', async ({ $, pushData }) => {
    await pushData({ name: $('.product-name').text() });
});

Rough edges

The package split across @crawlee/core, @crawlee/http, @crawlee/playwright, etc. is correct for tree-shaking but makes version pinning annoying — you end up managing 4+ @crawlee/* packages in package.json. The top-level crawlee meta-package avoids this but pulls in all backends including Playwright and Puppeteer, which together are hundreds of megabytes.

The storage layer defaults to the local filesystem under ./storage. That's fine locally but the docs around plugging in a custom storage backend (needed for multi-instance deploys) are thin. The Apify platform has a hosted implementation, but the path for rolling your own is not well documented.

There's also a RequestQueue v1 and v2 coexisting in packages/core/src/storages/. The v2 is the default now, but the coexistence adds confusion — the docs don't make it clear which you're using or when v2 became the default.

Bottom line

If you're building production scrapers in Node.js and need everything beyond a simple fetch — concurrency, proxy rotation, session management, retry logic — Crawlee has it and it's well-engineered. If you're just scraping one endpoint, the abstraction cost isn't worth it.

apify/crawlee on GitHub