What it does
Magika is a file content type detector that replaces magic-byte heuristics and extension guessing with a small deep learning model. Feed it a file — or raw bytes, or a stream — and it returns the content type, MIME type, and a confidence score. ~5ms per file on a CPU, 200+ content types, ~99% accuracy on their test set.
Why I starred it
The file type detection problem is older than most developers. The file command and its libmagic library have been the default answer for decades. They use hand-crafted rules: look for %PDF- at byte 0, check the ELF magic bytes, match MIME type patterns. This works well for binary formats with clear magic bytes, but it falls apart on text-based formats — Python versus JavaScript versus TypeScript versus a config file full of JavaScript-like syntax. Those distinctions require understanding content, not just headers.
What caught my attention is the engineering tradeoff they made: instead of building a massive general model, they built the smallest model that can solve this specific problem well. The current model (standard_v3_3) weighs a few MB. It only reads 1024 bytes from the start of the file and 1024 from the end. That is it. No middle, no full scan. The feature extraction is constant-time regardless of file size — a 4KB Python file and a 4GB log file go through identical feature extraction.
Magika is already running at Google scale — Gmail, Drive, and Safe Browsing all use it to route files to the right security scanners, processing hundreds of billions of samples weekly. The numbers are real.
How it works
The core lives in python/src/magika/magika.py. The Magika class initializes an ONNX session against the bundled model.onnx and loads config.min.json. That config reveals the architecture decisions:
{
  "beg_size": 1024,
  "mid_size": 0,
  "end_size": 1024,
  "block_size": 4096,
  "min_file_size_for_dl": 8,
  "padding_token": 256
}
mid_size: 0 is deliberate. Earlier model versions tried reading bytes from the middle of the file; the current version dropped it. Turns out the beginning and end carry almost all the signal. Files under 8 bytes skip the model entirely — they fall through to a simpler UTF-8 decode check.
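The sub-8-byte fast path is easy to picture. Here is a minimal sketch of that fallback, assuming it really is just a UTF-8 decode check as described — the function name and the exact fallback labels are mine, not Magika's:

```python
def classify_tiny_file(data: bytes) -> str:
    """Hypothetical sketch: files below min_file_size_for_dl never reach
    the model, so classification degrades to a text-vs-binary check."""
    if not data:
        return "empty"
    try:
        data.decode("utf-8")  # decodes cleanly -> treat as generic text
        return "txt"
    except UnicodeDecodeError:
        return "unknown"
```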
The feature extraction in _extract_features_from_seekable is worth reading. It reads a block_size chunk from the start, strips leading whitespace with lstrip(), takes the first 1024 bytes, and pads to 1024 with token 256 (out of the 0–255 byte range) if short. Same from the end with rstrip(). No tokenization, no embedding lookup — raw byte values as integers go straight into the model. The model learns to recognize patterns directly in byte space.
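As a sketch, that scheme looks roughly like this — the padding side for the end window is my assumption, and this is not the library's actual code:

```python
def extract_features(data: bytes, beg_size: int = 1024, end_size: int = 1024,
                     block_size: int = 4096, padding_token: int = 256):
    # Start window: read one block, drop leading whitespace, keep beg_size bytes.
    beg = data[:block_size].lstrip()[:beg_size]
    # End window: read one block from the tail, drop trailing whitespace.
    end = data[-block_size:].rstrip()[-end_size:]
    # Bytes become integer tokens directly; short windows are padded with 256,
    # a value outside the 0-255 byte range so the model can tell it apart.
    beg_tokens = list(beg) + [padding_token] * (beg_size - len(beg))
    end_tokens = [padding_token] * (end_size - len(end)) + list(end)
    return beg_tokens + end_tokens  # constant length: 2048 tokens per file
```

Note the cost is identical for any input: two block-sized reads and some slicing, which is why a 4GB file is no slower than a 4KB one.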
The inference pipeline batches aggressively. All files in an identify_paths() call go through feature extraction first, then a single batched ONNX inference call:
batch_raw_predictions_np = self._onnx_session.run(
    ["target_label"], {"bytes": batch_features}
)[0]
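Batching many files into one run() call amortizes per-call overhead, and the skeleton is just a chunker plus one inference per chunk. A sketch with hypothetical names, not Magika's code:

```python
def run_in_batches(features, infer, max_batch: int = 1000):
    """Sketch of capped batching: split pre-extracted feature rows into
    chunks and run one inference call per chunk, preserving input order."""
    predictions = []
    for i in range(0, len(features), max_batch):
        predictions.extend(infer(features[i:i + max_batch]))
    return predictions
```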
Internal batch size caps at 1000 files. After inference, _get_output_label_from_dl_label_and_score applies the confidence threshold logic. Per-content-type thresholds live in config.min.json — markdown has a lower threshold (0.75) than ignorefile (0.95), reflecting how easy each is to mistake. If the score doesn't clear the threshold under HIGH_CONFIDENCE mode (the default), Magika falls back to txt or unknown depending on whether the model at least got the text/binary classification right. That fallback design means the tool almost never returns a confident wrong answer.
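The threshold-and-fallback logic reduces to a few lines. A sketch based on the behavior described above — the function name and the default threshold are hypothetical:

```python
def apply_threshold(dl_label: str, score: float, is_text: bool,
                    thresholds: dict, default: float = 0.5) -> str:
    """If the score clears the per-label threshold, trust the model's label;
    otherwise degrade to a generic answer instead of a confident wrong one."""
    if score >= thresholds.get(dl_label, default):
        return dl_label
    return "txt" if is_text else "unknown"
```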
The CLI is written in Rust (rust/cli/), not Python. It links against a separate Rust library (rust/lib/) that runs the same ONNX model via the ort crate. The Python API and the Rust CLI share the same model weights and config, which is exactly how it should be — one source of truth.
Using it
# CLI: scan a directory recursively
magika -r ./src | head -10
# src/index.ts: TypeScript source (code)
# src/config.json: JSON document (code)
# src/assets/logo.png: PNG image data (image)
# Get structured output for piping
magika --jsonl ./src/*.ts | jq '.result.value.output.label'
# "typescript"
# "typescript"
# Read from stdin
cat unknown_file | magika -
# -: Python source (code)
# Show confidence score
magika --output-score model.onnx
# model.onnx: ONNX ML model (model) [score: 0.9985]
Python API for programmatic use:
from magika import Magika
m = Magika()
# Bytes — useful for content from network streams
with open("mystery.bin", "rb") as f:
    res = m.identify_bytes(f.read())
print(res.output.label)  # e.g., "pdf"
print(res.score)         # e.g., 0.9991
# Batch paths — one ONNX inference call for all of them
results = m.identify_paths(["a.py", "b.js", "c.bin"])
for r in results:
print(r.path, r.prediction.output.mime_type)
The identify_bytes vs identify_path vs identify_stream split is clean. All three normalize to a Seekable wrapper internally, so the extraction and inference code is shared.
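That normalization is a nice pattern on its own. A minimal sketch of what such a wrapper might look like — the Seekable name comes from the source, but the implementation here is assumed:

```python
import io

class Seekable:
    """Uniform read-at-offset interface over any seekable binary stream,
    so bytes, paths, and streams can share one extraction code path."""
    def __init__(self, stream):
        self._stream = stream
        stream.seek(0, io.SEEK_END)  # size is needed to slice from the tail
        self.size = stream.tell()

    def read_at(self, offset: int, size: int) -> bytes:
        self._stream.seek(offset)
        return self._stream.read(size)

# bytes -> io.BytesIO, path -> open(path, "rb"); both end up here:
s = Seekable(io.BytesIO(b"hello world"))
```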
Rough edges
The JavaScript/TypeScript bindings (js/) are marked experimental and power the browser demo but aren't a first-class supported interface. Go support is listed as WIP. If your stack is Python or Rust, you're on solid ground; anything else is a gamble on maturity.
There's no way to fine-tune the model on your own content types without retraining from scratch. The training code isn't public — the model weights and config are, but not the dataset or the training pipeline. That's a reasonable call for a security tool but limits extensibility.
The per-content-type threshold system is powerful but not transparent. You can see the threshold values in the config, but the documentation doesn't explain how they were chosen or what the false positive rate looks like at each threshold. For security-critical routing, that matters.
At 16k+ stars with commits landing weekly as of April 2026, the project is actively maintained. The research paper was published at ICSE 2025, which suggests the core approach is stable.
Bottom line
If you're building pipelines that need to classify file content accurately — security scanners, upload handlers, document processing, data lakes with heterogeneous inputs — Magika is the right starting point. It's faster than heuristic tools on ambiguous text files, already proven at production scale, and the Python API is three lines to wire up.