What it does
AutoRedact takes images (or PDFs), runs OCR on them entirely in your browser, matches the recognized text against a bank of regex patterns for sensitive data — emails, IPs, credit card numbers, API keys — and draws black rectangles over anything it finds. No upload, no server, no data leaving your machine.
Why I starred it
Redacting screenshots before sharing them in Slack or docs is one of those tasks that sounds trivial until you realize most tools either require uploading to a server or involve manually dragging boxes over text you can barely read at 2x zoom. AutoRedact automates the detection part. Drop an image, get back a redacted version. The fact that it runs Tesseract.js locally means the sensitive data you're trying to protect never touches a network.
The CLI mode sealed it for me. Being able to pipe a directory of screenshots through npm run cli before sharing them is exactly the kind of workflow I want.
How it works
The architecture splits cleanly into platform-agnostic core logic and platform-specific adapters. The ICanvasFactory interface in src/core/interfaces.ts defines three methods — createCanvas, loadImage, and the canvas context operations. Two adapters implement it: BrowserCanvasAdapter wraps the DOM canvas API, NodeCanvasAdapter in src/adapters/NodeCanvasAdapter.ts wraps node-canvas. This means the entire detection and redaction pipeline in src/core/processor.ts doesn't care whether it's running in Chrome or a Node CLI.
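To make the seam concrete, here is a minimal sketch of what that adapter pattern looks like. ICanvasFactory, createCanvas, and loadImage are the names from the repo; the exact signatures, the CanvasLike shape, and the InMemoryCanvasAdapter test double are my own illustrative assumptions, not the project's code.

```typescript
// Hypothetical minimal surface the core pipeline would need from a canvas.
interface CanvasLike {
  width: number;
  height: number;
  fillRect(x: number, y: number, w: number, h: number): void;
}

// Assumed shape of the ICanvasFactory seam described above.
interface ICanvasFactory {
  createCanvas(width: number, height: number): CanvasLike;
  loadImage(source: string | Uint8Array): Promise<CanvasLike>;
}

// A test double: the core logic only ever talks to the interface, which is
// what lets the same pipeline run in Chrome, a Node CLI, or a unit test.
class InMemoryCanvasAdapter implements ICanvasFactory {
  createCanvas(width: number, height: number): CanvasLike {
    const rects: number[][] = [];
    return {
      width,
      height,
      fillRect: (x, y, w, h) => void rects.push([x, y, w, h]),
    };
  }
  async loadImage(_source: string | Uint8Array): Promise<CanvasLike> {
    return this.createCanvas(64, 64);
  }
}

// Stand-in for core logic: it receives a factory and never imports the DOM
// or node-canvas directly.
function redactRegion(factory: ICanvasFactory, w: number, h: number): CanvasLike {
  const canvas = factory.createCanvas(w, h);
  canvas.fillRect(0, 0, 10, 10);
  return canvas;
}
```

Swapping InMemoryCanvasAdapter for a BrowserCanvasAdapter or NodeCanvasAdapter is then purely a wiring decision at the entry point.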
The processing pipeline in processImage() follows a deliberate sequence:
- Load the image via the adapter
- Upscale 2x onto a canvas (OCR accuracy improves with higher resolution)
- Clone to a second canvas and run preprocessing: grayscale conversion using the luminance formula 0.299R + 0.587G + 0.114B, then a 1.5x contrast boost (src/core/image-processing.ts)
- Feed the preprocessed canvas to Tesseract.js
- Run regex matching against the full OCR text
- Map matches back to word-level bounding boxes
- Draw black fillRect rectangles over matched words on the original (color) canvas
- Downscale back to original dimensions
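The preprocessing step above can be sketched as a pure function over an RGBA byte buffer (the shape of ImageData.data). The luminance coefficients and the 1.5x contrast factor are from the article; the function name and clamping details are assumptions, not the repo's code.

```typescript
// Sketch: luminance grayscale followed by a contrast boost around the
// midpoint, on an RGBA buffer. Uint8ClampedArray clamps writes to 0..255.
function preprocess(rgba: Uint8ClampedArray, contrast = 1.5): Uint8ClampedArray {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    // Luminance formula: 0.299R + 0.587G + 0.114B
    const gray = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    // Stretch values away from mid-gray (128) to sharpen text edges for OCR
    const boosted = (gray - 128) * contrast + 128;
    out[i] = out[i + 1] = out[i + 2] = boosted;
    out[i + 3] = rgba[i + 3]; // preserve alpha
  }
  return out;
}
```

Running this before OCR pushes near-white backgrounds to pure white and near-black glyphs to pure black, which is usually all Tesseract needs for clean screenshots.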
The 2x upscale before OCR is a smart move. Tesseract's accuracy drops noticeably on low-res screenshots, and this simple scaling trick costs almost nothing compared to the OCR pass itself.
The pattern matching lives in src/constants/patterns.ts and covers more ground than you'd expect. Beyond the obvious email and IPv4 patterns, there's IPv6, MAC addresses, IBANs, Bitcoin addresses, Indian PAN numbers, SSNs, and a solid set of API key patterns for Stripe, AWS, GitHub, OpenAI, Google Cloud, and Slack tokens. There's even a JWT detector and database connection string matcher:
// From src/constants/patterns.ts
secrets: [
/\b((?:sk|pk)_(?:live|test)_[0-9a-zA-Z]{16,}|gh[pous]_[0-9a-zA-Z]{30,}|AKIA[0-9A-Z]{16,20})\b/g,
/\bsk-[a-zA-Z0-9]{30,}\b/g, // OpenAI
/\bAIza[0-9A-Za-z\\-_]{35}\b/g, // Google Cloud
/\bxox[baprs]-[a-zA-Z0-9-]{10,}\b/g, // Slack
/\beyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\b/g, // JWT
/(?:postgres|mysql|mongodb|redis|sqlserver):\/\/[^\s]+/g,
/-----BEGIN [A-Z ]+ PRIVATE KEY-----/g
],
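To see what matching against that bank looks like, here is a small sketch applying two of the patterns quoted above to a sample OCR transcript. The findSecrets helper is mine, and the key in the sample text is Stripe's well-known public test placeholder, not a real credential.

```typescript
// Two patterns copied from the block above.
const keyPattern =
  /\b((?:sk|pk)_(?:live|test)_[0-9a-zA-Z]{16,}|gh[pous]_[0-9a-zA-Z]{30,}|AKIA[0-9A-Z]{16,20})\b/g;
const connectionString = /(?:postgres|mysql|mongodb|redis|sqlserver):\/\/[^\s]+/g;

// Hypothetical OCR output from a redaction run.
const ocrText =
  "billing key sk_test_4eC39HqLyjWDarjtT1zdp7dc and db postgres://admin:hunter2@db.internal:5432/app";

// Collect every match from each pattern; matchAll clones the regex, so the
// shared /g state (lastIndex) is not mutated between calls.
function findSecrets(text: string): string[] {
  return [...text.matchAll(keyPattern), ...text.matchAll(connectionString)].map(
    (m) => m[0]
  );
}
```

Each returned string would then be handed to the bounding-box mapper to find which OCR words to black out.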
The bounding box mapping in processor.ts is where it gets interesting. Tesseract returns word-level bounding boxes, but the regex matches operate on the full text string. The hasValidOverlap function in src/core/matcher.ts bridges the two — it checks both positional overlap in the text stream and string containment, so a multi-word match like user@example.com correctly maps back to each individual OCR word that contributes to it.
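The idea can be sketched with explicit character offsets. hasValidOverlap is the repo's name; the OcrWord shape, the space-joined offset bookkeeping, and the function below are my assumptions about how such a mapping works, not the project's implementation.

```typescript
// Word as Tesseract would report it: recognized text plus a bounding box.
interface OcrWord {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

// Map a regex match on the joined full text (words separated by single
// spaces) back to the OCR words it spans, checking both positional overlap
// and string containment, as described above.
function wordsForMatch(
  words: OcrWord[],
  matchStart: number,
  matchEnd: number,
  matchText: string
): OcrWord[] {
  const hits: OcrWord[] = [];
  let offset = 0;
  for (const word of words) {
    const start = offset;
    const end = offset + word.text.length;
    const overlaps = start < matchEnd && end > matchStart; // position check
    const contained = matchText.includes(word.text); // containment check
    if (overlaps && contained) hits.push(word);
    offset = end + 1; // +1 for the joining space
  }
  return hits;
}
```

With this shape, a match OCR'd as two separate words ("user@" and "example.com") still resolves to both boxes, because each word passes both the overlap and containment checks.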
The allowlist and block-word system adds flexibility. You can whitelist known internal IPs or domain names to avoid false positives, or add custom block words for domain-specific terms like "Confidential" or project code names.
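A post-filter combining the two lists might look like the following sketch. The function name and exact semantics (case-insensitive, substring matching for block words) are illustrative guesses, not the repo's behavior.

```typescript
// Keep regex matches not on the allowlist, and additionally redact any OCR
// word containing a block word, case-insensitively.
function applyLists(
  matches: string[],
  ocrWords: string[],
  allowlist: string[],
  blockWords: string[]
): string[] {
  const allowed = new Set(allowlist.map((s) => s.toLowerCase()));
  const kept = matches.filter((m) => !allowed.has(m.toLowerCase()));
  const blocked = ocrWords.filter((w) =>
    blockWords.some((b) => w.toLowerCase().includes(b.toLowerCase()))
  );
  return [...kept, ...blocked];
}
```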
Using it
Browser mode is drag-and-drop. The CLI is more useful for batch work:
# Single image
npm run cli -- screenshot.jpg
# Disable IP detection, add custom block words
npm run cli -- invoice.jpg --no-ips --block-words "Confidential,SSN,Account"
# Allowlist known safe terms
npm run cli -- internal-doc.jpg --allowlist "192.168.1.1,ProjectX"
# Batch a directory
for f in input/*.jpg; do
npm run cli -- "$f" -o "output/$(basename "$f")"
done
PDF support landed in v2.1.2 — it shells out to pdftoppm (poppler) to convert pages to images, processes each one, then stitches them back into a multi-page PDF using node-canvas's PDF surface.
Docker deployment is also an option. There's a Fastify-based API server in src/server.ts that accepts image uploads on /redact and returns the processed result.
Rough edges
No test files for the core logic. The test/ directory contains only sample images — no unit tests for the regex patterns or the bounding box mapping. Given how regex-heavy the detection is, a test suite with known-good and known-bad inputs would catch regressions fast.
The OCR preprocessing is basic — grayscale plus contrast. Images with colored backgrounds, watermarks, or low contrast text will likely produce poor OCR results. There's no adaptive thresholding or noise reduction.
The credit card regex \d{13,19} without Luhn validation means any long number sequence gets flagged. Phone numbers, order IDs, tracking numbers — all potential false positives. The comment in the code acknowledges this with a lookbehind to avoid UUIDs, but the broader false positive problem remains.
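A Luhn checksum would reject most of those false positives cheaply, since nearly all real card numbers pass it and random digit runs fail nine times out of ten. This is the standard algorithm, not code from the repo:

```typescript
// Luhn checksum: from the rightmost digit, double every second digit,
// subtract 9 from doubles over 9, and require the sum to be divisible by 10.
function luhnValid(digits: string): boolean {
  if (!/^\d{13,19}$/.test(digits)) return false;
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}
```

The well-known test PAN 4242424242424242 passes, while a 16-digit order ID like 1234567890123456 fails the checksum and would be left unredacted.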
The CLI requires node-canvas native dependencies, which can be painful to install on some systems. The PDF path adds another system dependency (poppler-utils). Neither is bundled.
Bottom line
If you need automated PII redaction for screenshots or documents and can't send them to a cloud service, AutoRedact does the job. The adapter pattern makes the core logic portable, the pattern library is broader than most alternatives, and the CLI mode makes it scriptable. Best suited for teams that handle sensitive data and want a quick pre-sharing sanitization step.
