Memori is a memory layer for LLM agents. You hand it your OpenAI or Anthropic client, it wraps the client with middleware hooks, and from that point on every conversation is persisted and recalled automatically — no changes to your call sites.
Why I starred it
The memory problem for agents is genuinely unsolved at the infrastructure level. Most solutions fall into two camps: stuff the entire history into context (expensive, hits token limits fast), or bolt on a vector store and write retrieval logic yourself (a lot of plumbing). Memori's benchmark result caught my eye: 81.95% accuracy on LoCoMo with an average of 1,294 tokens per query, which is 4.97% of the full-context token footprint. That's the kind of number that changes the cost calculus for production agents.
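As a sanity check on those numbers (my own back-of-envelope arithmetic, not from the paper): if 1,294 tokens is 4.97% of the full-context footprint, the implied full-context baseline is roughly 26,000 tokens per query.

```python
avg_tokens = 1294   # Memori's average tokens per LoCoMo query (from the README)
fraction = 0.0497   # stated share of the full-context footprint

# Implied full-context baseline per query
full_context = avg_tokens / fraction
print(round(full_context))  # → 26036
```

At typical per-token pricing, a ~20x reduction in prompt size per query is where the cost argument comes from.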
The other thing that caught my attention was the architectural decision to operate at the client level rather than the prompt level. You don't write recall(query) calls anywhere. The SDK intercepts your existing client.chat.completions.create() calls and handles recall injection and persistence in the background.
How it works
The TypeScript SDK makes the architecture explicit. Opening memori-ts/src/memori.ts, the constructor registers three hooks onto an Axon instance:
// memori-ts/src/memori.ts
this.axon.hook.before(this.recallEngine.handleRecall.bind(this.recallEngine));
this.axon.hook.after(this.persistenceEngine.handlePersistence.bind(this.persistenceEngine));
this.axon.hook.after(this.augmentationEngine.handleAugmentation.bind(this.augmentationEngine));
Axon (their own middleware library, published separately as @memorilabs/axon) intercepts calls to registered LLM clients. Before each invocation, RecallEngine fires and injects relevant facts into the system prompt. After each invocation, PersistenceEngine stores the exchange and AugmentationEngine fires a background request to extract structured memory from the conversation.
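Axon's internals aren't shown here, but the before/after hook shape is a standard middleware pattern. A minimal Python sketch of the idea (class and method names are mine, not Memori's or Axon's):

```python
from typing import Any, Callable

class Hooks:
    """Minimal before/after middleware, sketching the Axon hook pattern."""

    def __init__(self) -> None:
        self._before: list[Callable[[dict], None]] = []
        self._after: list[Callable[[dict, Any], None]] = []

    def before(self, fn: Callable[[dict], None]) -> None:
        self._before.append(fn)

    def after(self, fn: Callable[[dict, Any], None]) -> None:
        self._after.append(fn)

    def wrap(self, create: Callable[..., Any]) -> Callable[..., Any]:
        """Return a wrapped version of an LLM call that fires the hooks."""
        def wrapped(**kwargs: Any) -> Any:
            for fn in self._before:      # e.g. recall: mutate kwargs in place
                fn(kwargs)
            response = create(**kwargs)
            for fn in self._after:       # e.g. persistence and augmentation
                fn(kwargs, response)
            return response
        return wrapped
```

The key property is that the wrapped call has the same signature as the original, which is what lets Memori leave your call sites untouched.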
The Python side does the same thing, and there the recall injection is easier to trace. In memori/llm/pipelines/recall_injection.py, the inject_recalled_facts function extracts the user query from kwargs, runs a vector search against stored facts, filters by a configurable relevance threshold, then injects the results differently depending on the provider:
# memori/llm/pipelines/recall_injection.py
if llm_is_anthropic(...) or llm_is_bedrock(...):
    existing_system = kwargs.get("system", "")
    kwargs["system"] = existing_system + recall_context
elif llm_is_google(...):
    inject_google_system_instruction(kwargs, recall_context)
else:
    # OpenAI-style: prepend a system message
    messages.insert(0, {"role": "system", "content": recall_context.lstrip("\n")})
The injected block is wrapped in <memori_context> tags with an instruction to only use the context if relevant. Clean.
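The exact wrapper text and threshold are Memori's; a rough approximation of the shape (function name, formatting, and the 0.7 cutoff are illustrative):

```python
def build_recall_context(facts: list[tuple[str, float]], threshold: float = 0.7) -> str:
    """Illustrative sketch: keep facts above a relevance threshold and wrap
    them in <memori_context> tags, as the recall pipeline does before injection."""
    kept = [text for text, score in facts if score >= threshold]
    if not kept:
        return ""
    body = "\n".join(f"- {t}" for t in kept)
    return (
        "\n<memori_context>\n"
        "Use the following context only if it is relevant to the query.\n"
        f"{body}\n"
        "</memori_context>"
    )
```

Returning an empty string when nothing clears the threshold matters: it means low-relevance turns pay no extra prompt tokens.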
The memory model itself has three levels: entity (a user or persistent object), process (the agent or program), and session (the current interaction window). Augmentation extracts structured attributes from conversations — facts, preferences, relationships, skills, events — and stores them keyed to entity+process. Recall then queries against those facts using embeddings to find what's relevant to the current query.
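The category names mirror the kinds of attributes Memori extracts; the storage shape below is my own sketch of how entity+process keying works, not the SDK's actual classes:

```python
from collections import defaultdict

# Attribute categories Memori's augmentation step extracts (per the docs)
CATEGORIES = {"fact", "preference", "relationship", "skill", "event"}

class MemoryStore:
    """Illustrative: structured attributes keyed by (entity, process)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], list[tuple[str, str]]] = defaultdict(list)

    def add(self, entity_id: str, process_id: str, category: str, content: str) -> None:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self._store[(entity_id, process_id)].append((category, content))

    def recall(self, entity_id: str, process_id: str) -> list[tuple[str, str]]:
        return self._store[(entity_id, process_id)]
```

The real recall path ranks these by embedding similarity to the current query rather than returning everything; the keying is the part this sketch is meant to show.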
For self-hosted deployments, the storage layer is properly pluggable. memori/storage/drivers/ has implementations for PostgreSQL, SQLite, MySQL, MongoDB, OceanBase, and Oracle. The BYODB path lets you point Memori at your own database instead of the cloud API.
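The one-module-per-backend layout suggests a common driver interface. A plausible minimal version (hypothetical protocol and class names, with a naive LIKE match standing in for the real embedding search):

```python
import sqlite3
from typing import Protocol

class StorageDriver(Protocol):
    """Hypothetical shape of a pluggable driver; the real ones live in
    memori/storage/drivers/, one module per backend."""
    def save_fact(self, entity_id: str, process_id: str, fact: str) -> None: ...
    def search_facts(self, entity_id: str, process_id: str, query: str) -> list[str]: ...

class SQLiteDriver:
    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS facts (entity TEXT, process TEXT, fact TEXT)"
        )

    def save_fact(self, entity_id: str, process_id: str, fact: str) -> None:
        self.conn.execute(
            "INSERT INTO facts VALUES (?, ?, ?)", (entity_id, process_id, fact)
        )

    def search_facts(self, entity_id: str, process_id: str, query: str) -> list[str]:
        # Substring match as a stand-in for vector similarity search
        rows = self.conn.execute(
            "SELECT fact FROM facts WHERE entity=? AND process=? AND fact LIKE ?",
            (entity_id, process_id, f"%{query}%"),
        ).fetchall()
        return [r[0] for r in rows]
```

Swapping backends then means swapping the driver, while the recall and persistence engines stay the same.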
Using it
Cloud mode is the quickest path:
pip install memori
export MEMORI_API_KEY=your_key
from memori import Memori
from openai import OpenAI
client = OpenAI()
mem = Memori().llm.register(client)
mem.attribution(entity_id="user_123", process_id="support_agent")
# From here on, all client.chat.completions.create() calls
# are intercepted — facts are recalled before, persisted after
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "I prefer TypeScript over Python."}]
)
For MCP-compatible agents (Claude Code, Cursor, Warp), there's a one-liner:
claude mcp add --transport http memori https://api.memorilabs.ai/mcp/ \
--header "X-Memori-API-Key: ${MEMORI_API_KEY}" \
--header "X-Memori-Entity-Id: your_username" \
--header "X-Memori-Process-Id: claude-code"
That's it. No SDK integration needed for the MCP path — the agent just gets memory server-side.
Session management is explicit when you need it:
mem.new_session() # start a fresh session
mem.set_session(sid) # resume a specific session
Rough edges
The cloud path is the first-class experience, and the BYODB docs are noticeably thinner. The benchmark paper (cited as arxiv.org/abs/2603.19935 in the README) compares against Zep, LangMem, and Mem0, but the methodology details are in a separate document that's not in the repo — you have to trust the numbers without being able to reproduce them easily from the source.
Even in BYODB mode, the structured-memory extraction step of augmentation goes through their cloud API, which means your conversation content still leaves your infrastructure. Read the docs on this carefully before deploying in a regulated environment.
Test coverage exists (tests/ has directories for llm, memory, storage, database, integration) but the integration tests are clearly designed to run against live infrastructure rather than being self-contained. Running pytest locally without credentials gets you partial coverage at best.
The recent commits are mostly version bumps and README updates. Active but not architecturally adventurous at the moment — the core machinery looks stable.
Bottom line
If you're building production agents and burning tokens on full-context replay or writing your own RAG retrieval layer, Memori is worth a look. The hook-based approach keeps your call sites clean, and the BYODB path gives you storage ownership if you need it.
