Typesense: Search engine internals worth reading

August 31, 2023

repo-review

by Florian Narr


Typesense is a typo-tolerant, in-memory search engine built in C++. Single binary, no JVM, no Lucene stack — you spin it up and query it over HTTP.

Why I starred it

Algolia is expensive at scale. Elasticsearch works but you're signing up for cluster tuning, JVM heap management, and a configuration surface area that rivals small cities. Typesense targets the gap: a self-hosted search engine that doesn't require a search infrastructure team to operate.

What caught my eye wasn't the pitch — it was that it's written in C++ from scratch. No Lucene under the hood. That means the data structures are their own, and reading the source tells you exactly how it performs.

How it works

The central data structure is an Adaptive Radix Tree (ART), implemented in src/art.cpp. The ART header in include/art.h defines four node types — art_node4, art_node16, art_node48, art_node256 — each sized to its child count. The node type is stored in a single uint8_t, and the tree upgrades a node when it fills up. This keeps memory tight: a node with 3 children takes 4 slots, not 256. Each leaf carries a sorted_array of document IDs, an offset index, and the raw offsets — that's the art_values struct in art.h.

typedef struct {
    art_node n;              // common header; the type tag lives here
    unsigned char keys[4];   // sorted key bytes, one per child
    art_node *children[4];   // child pointers, parallel to keys
} art_node4;

typedef struct {
    sorted_array ids;          // doc IDs containing this token
    sorted_array offset_index; // where each doc's offsets begin
    array offsets;             // raw token positions within docs
} art_values;

For token-to-document mapping, the inverted index uses posting_list_t (include/posting_list.h). The posting list is a linked chain of block_t structs, each holding a sorted_array of IDs, an offset index, and raw offsets. Blocks are navigated via an id_block_map — a std::map<last_id_t, block_t*> — so range lookups skip to the right block without scanning the full chain. The iterator decompresses block data into raw uint32 arrays for performance at query time.

The match scoring logic is in include/match_score.h. The get_match_score function packs six signals into a single 64-bit integer using bit shifts:

inline uint64_t get_match_score(const uint32_t total_cost,
                                 const uint32_t unique_words,
                                 const uint8_t synonym_score) const {
    uint64_t match_score = (
        (int64_t(words_present) << 40) |    // query words matched: dominant signal
        (int64_t(unique_words)  << 32) |    // distinct words matched
        (int64_t(255 - total_cost) << 24) | // inverted: fewer typos rank higher
        (int64_t(100 - distance)   << 16) | // inverted: closer tokens rank higher
        (int64_t(exact_match)      << 12) | // whether the match was exact
        (int64_t(255 - max_offset) <<  4) | // inverted: earlier positions rank higher
        (int64_t(synonym_score)    <<  0)   // synonym quality, lowest priority
    );
    return match_score;
}

This lets them sort results with a single integer compare. Word presence is shifted to the highest bits, so it always dominates; unique word count, typo cost, proximity, exact match, position, and synonym quality cascade down from there. Clean.

Top-K tracking uses a KV struct in include/topster.h that holds three int64_t scores (text match plus two configurable sort fields) along with a vector distance for hybrid search. The struct also carries a reference_filter_results map for JOIN queries, a relatively recent addition to the source.

The index itself (include/index.h) glues all of this together. Each field gets its own ART (art_t), with separate indices for facets (facet_map_t), numerics (num_tree.h), geo (geopolygon_index.h), and vectors (hnswlib). Facets use sparse hash maps keyed by document ID. HNSW is bundled directly in the repo for approximate nearest-neighbor vector search.

Raft-based clustering (src/raft_server.cpp) handles HA. Recent commits show active work on JOIN correctness — the join.cpp file has had fixes for nested joins and race conditions on related-collection writes as recently as March 2026.

Using it

# Start with Docker
docker run -p 8108:8108 -v /tmp/tsdata:/data \
  typesense/typesense:29.0 \
  --data-dir /data --api-key localkey

# Create a collection
curl -X POST http://localhost:8108/collections \
  -H "X-TYPESENSE-API-KEY: localkey" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "products",
    "fields": [
      {"name": "title", "type": "string"},
      {"name": "price", "type": "float"},
      {"name": "category", "type": "string", "facet": true}
    ],
    "default_sorting_field": "price"
  }'

# Search with typo tolerance built in
curl "http://localhost:8108/collections/products/documents/search?\
q=wirelss+earbds&query_by=title&filter_by=price:<100&facet_by=category" \
  -H "X-TYPESENSE-API-KEY: localkey"

"wirelss earbds" finds "wireless earbuds" without any configuration — the typo budget is automatic.

For semantic/hybrid search you can point it at OpenAI or use a bundled S-BERT model:

{
  "name": "products",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "embedding", "type": "float[]", "embed": {
      "from": ["title"],
      "model_config": {"model_name": "ts/e5-small"}
    }}
  ]
}

The embedding is generated at index time. Hybrid search then blends keyword and vector scores automatically.

Rough edges

The entire index lives in RAM. That's the source of its speed and its hard constraint: their own benchmark puts an index of 28M books at roughly 14GB of RAM. If your dataset doesn't fit in memory, this isn't your tool.

The JOIN feature is still maturing. Looking at recent commits — March and April 2026 patches for nested join crashes, race conditions on concurrent writes to related collections, and union search deduplication fixes — it's clearly not battle-hardened yet. I wouldn't ship complex multi-collection JOINs in production without stress testing.

Documentation is solid for the happy path (REST API, client libraries, basic schema) but thinner on operational topics: tuning the Raft cluster, memory sizing heuristics, backup strategies. src/housekeeper.cpp runs background maintenance tasks, but what it does and when isn't documented anywhere obvious.

The conversational search / RAG mode is basically a wrapper around your configured LLM provider with Typesense results injected as context. If you're expecting a local inference engine, it's not that.

Bottom line

If you're running a product search, docs search, or catalog search and want Algolia-class typo tolerance without the per-query bill — run this. The C++ core and clean data structures mean it's genuinely fast on modest hardware. Don't use it if your index won't fit in RAM or if you need production-grade JOINs today.

typesense/typesense on GitHub