
LongMemEval-S Leaderboard

Engram v0.1.0 vs the published frontier, judged on the same methodology.

LongMemEval-S is the standard 500-question evaluation for long-form agent memory. The numbers below come from each system's own published run, all judged by the LongMemEval official gpt-4o-mini judge. Engram v0.1.0 — the open-source memory layer in jamjet-labs/engram — lands fifth of five on raw accuracy. The matrix below shows why that isn't the whole story.

Raw accuracy is one axis. Engram is also Apache 2.0, runs against any OpenAI-compatible model (including local ones), is MCP-native, and self-hosts on SQLite. Each axis is a real engineering decision criterion — not all of them are visible in a single accuracy number.
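
To make the "any OpenAI-compatible model" point concrete, the sketch below points the standard OpenAI Python client at a local Ollama endpoint. This is the general pattern, not Engram's own configuration surface; the model name and port are assumptions about a typical local setup.

from openai import OpenAI

# Any OpenAI-compatible endpoint works: a hosted API, a vLLM server, or,
# as here, a local Ollama instance serving its OpenAI-compatible API.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # placeholder; a local server ignores it
)
response = client.chat.completions.create(
    model="llama3.1:8b",  # assumption: whichever local model you have pulled
    messages=[{"role": "user", "content": "What did the user say about travel plans?"}],
)
print(response.choices[0].message.content)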

Last updated 2026-05-09

Leaderboard

All scores are reported on the LongMemEval-S 500-question evaluation and judged by the official gpt-4o-mini judge. Each row links to the system's source.

| # | System | Score | n | Judge | Reader / generator | Source |
|---|--------|-------|---|-------|--------------------|--------|
| 1 | Chronos | 95.6% | 500 | gpt-4o-mini official | Claude Opus 4.6 | arXiv 2603.16862 |
| 2 | Mastra | 94.9% | 500 | gpt-4o-mini official | GPT-5-mini (no retrieval — full observation log) | mastra.ai/research |
| 3 | OMEGA | 93.2% raw* | 500 | gpt-4o-mini official | GPT-4.1 | omegamax.co/benchmarks |
| 4 | Zep / Graphiti | 71.2% | 500 | gpt-4o-mini official | GPT-4o | arXiv 2501.13956 |
| 5 | Engram v0.1.0 | 64.6% | 500 | gpt-4o-mini official | gpt-4o-mini (synthesis mode for preference questions; recall mode otherwise) | github.com/jamjet-labs/engram |

* OMEGA reports 95.4% task-averaged (mean of per-category scores) as its headline figure; 93.2% is the raw count (466/500). We cite the raw count for direct comparability with other rows.

AgentMemory excluded: AgentMemory publishes 96.2% on LongMemEval-S and is a strong system worth evaluating, but its run uses GPT-4o as judge rather than the LongMemEval official gpt-4o-mini judge this leaderboard enforces. Same-judge methodology is the comparability requirement here. See their result at github.com/JordanMcCann/agentmemory.

Beyond accuracy: the engineering matrix

Same systems, and the axes that matter beyond raw accuracy. Every cell is sourced; where the published docs are silent, the cell reads “(not documented)” rather than a guess.

| System | Accuracy | License | Reader model | MCP-native | Runs locally | Multi-tenant by default |
|---|---|---|---|---|---|---|
| Chronos | 95.6% | (not documented — research prototype; paper is CC-BY-4.0; no code repo) | Claude Opus 4.6 | (not documented — no MCP mention in paper; commercial-API-only architecture) | (not documented — no public code; runs on commercial APIs only per paper) | (not documented) |
| Mastra | 94.9% | (not documented — Mastra states "completely open source" but specific license not stated on research page) | GPT-5-mini | (not documented) | Partial — observer/reflector agents use Gemini-2.5-flash via API; no documented fully local path | (not documented) |
| OMEGA | 93.2% | Apache 2.0 | GPT-4.1 | Yes — first-party MCP server | Yes — CPU-only, no external services, data in ~/.omega/omega.db | (not documented — single-user local design; no per-tenant API surface found) |
| Zep / Graphiti | 71.2% | Graphiti: Apache 2.0 · Zep: commercial | GPT-4o | Yes — first-party MCP server (Graphiti repo mcp_server/) | Graphiti: Yes (self-hosted, Docker Compose) · Zep: No (managed SaaS) | Partial — Zep cloud (not documented; SaaS account-level isolation assumed) · Graphiti: not documented (build-your-own per README) |
| Engram v0.1.0 | 64.6% | Apache 2.0 | gpt-4o-mini (or any OpenAI-compatible — including local Ollama) | Yes — FastAPI HTTP + first-party MCP server | Yes — SQLite, no cloud dependency | Yes (default): Scope(user_id, org_id) enforced at SQL + HNSW level; no cross-tenant leakage by construction |

How to read this

The frontier systems above use Claude Opus 4.6 (Chronos) or GPT-5-mini (Mastra) as their reader, which sets a hard floor on cost and a hard ceiling on which models you can run. Engram's 64.6% comes from the gpt-4o-mini stack — open-source, MCP-native, runs against any OpenAI-compatible model (including local ones via Ollama), self-hosts on SQLite, and ships per-tenant isolation by default. If your constraint is “highest raw accuracy regardless of cost or deployment shape,” pick Chronos, Mastra, or OMEGA. If your constraint is “the open-source memory system that runs in our VPC, doesn't lock us into a frontier model, ships per-tenant isolation by default, and speaks MCP today,” Engram is the system on that axis. We publish our number with the official judge, the methodology, and the reproduction script — submit yours.
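
As a concrete illustration of "per-tenant isolation by default", the sketch below shows what scope enforcement at the SQL level can look like. The table and column names are hypothetical, invented for this sketch rather than taken from Engram's actual schema; the same (user_id, org_id) filter would also have to be applied on the HNSW vector side.

import sqlite3

# Hypothetical sketch of per-tenant scoping at the SQL level. Table and
# column names are invented for illustration, not Engram's actual schema.
def fetch_memories(conn: sqlite3.Connection, user_id: str, org_id: str, pattern: str):
    # Every read carries the (user_id, org_id) scope, so a query issued for
    # one tenant can never return rows written under another tenant's scope.
    return conn.execute(
        "SELECT id, content, created_at FROM memories "
        "WHERE user_id = ? AND org_id = ? AND content LIKE ? "
        "ORDER BY created_at DESC",
        (user_id, org_id, f"%{pattern}%"),
    ).fetchall()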

Cost footnote

Rough cost-per-question estimates (back-of-envelope; verify against your own usage). Frontier-model systems running Claude Opus 4.6 or GPT-5-mini typically cost $0.05–$0.20 per LongMemEval-S question end-to-end. Engram's gpt-4o-mini stack runs ~$0.005 per question. Multi-thousand-question evaluation runs with frontier-model stacks cost real money; budget accordingly.
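
For reference, the back-of-envelope arithmetic behind those figures looks like the sketch below. The token volumes and per-million-token prices are assumptions chosen only for illustration; substitute current pricing and your own measured token counts.

# Cost per question = input tokens * input price + output tokens * output price.
def cost_per_question(input_tokens, output_tokens, usd_per_m_in, usd_per_m_out):
    return (input_tokens * usd_per_m_in + output_tokens * usd_per_m_out) / 1_000_000

# Assumed: ~20k tokens of retrieved context in, ~500 tokens of answer out.
frontier = cost_per_question(20_000, 500, usd_per_m_in=5.00, usd_per_m_out=15.00)
small    = cost_per_question(20_000, 500, usd_per_m_in=0.15, usd_per_m_out=0.60)
print(f"assumed frontier-tier stack: ~${frontier:.2f} per question")   # ~$0.11
print(f"assumed small-model stack:   ~${small:.4f} per question")      # ~$0.0033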

Methodology

Benchmark. LongMemEval-S — the standard 500-question evaluation for long-form agent memory.

Judge. LongMemEval official gpt-4o-mini judge. Same judge as every entry above.

Engram configuration. v0.1.0 release stack: gpt-4o-mini reader, synthesis-mode reading for preference questions, recall-mode reading with a verifier for everything else, query decomposition, a registry of six built-in tools, two-stage retrieval, classifier-driven category budgets, today-anchored temporal grounding, and hybrid retrieval (vector + BM25 + recency + graph + importance + temporal). Both reading modes are documented in src/engram/read/.
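
For intuition on the hybrid retrieval step, here is a minimal sketch of combining the listed signals into one ranking score. The weights and normalization are assumptions made for this sketch, not values taken from Engram's code.

# Illustrative only: weights are assumptions, and each signal is assumed
# to be pre-normalized to [0, 1] before combination.
WEIGHTS = {"vector": 0.35, "bm25": 0.25, "recency": 0.10,
           "graph": 0.10, "importance": 0.10, "temporal": 0.10}

def hybrid_score(signals: dict[str, float]) -> float:
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())

candidates = [
    {"id": "m1", "signals": {"vector": 0.82, "bm25": 0.40, "recency": 0.90}},
    {"id": "m2", "signals": {"vector": 0.55, "bm25": 0.75, "temporal": 0.60}},
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c["signals"]), reverse=True)
print([c["id"] for c in ranked])  # m1 first: 0.477 vs 0.440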

Reproduction:

git clone https://github.com/jamjet-labs/engram && cd engram
uv sync
export OPENAI_API_KEY=sk-...   # for the official judge + reader
export LONGMEMEVAL_ORACLE=path/to/longmemeval_oracle.json
uv run python -m benchmarks.smoke_runner --n 500 --decompose --tools

Per-question result JSON lands in benchmarks/reports/. Synthesis mode is auto-routed by the smoke_runner's question classifier — no separate flag is needed.
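
To produce a per-category breakdown (like the table below) from your own run, something along these lines works. The field names "category" and "correct" are assumptions about the report schema; adjust them to whatever the runner actually writes.

import json
from collections import defaultdict
from pathlib import Path

# Assumed schema: one JSON object per question with "category" and "correct"
# fields. Adjust field names to match the files in benchmarks/reports/.
totals, correct = defaultdict(int), defaultdict(int)
for path in Path("benchmarks/reports").glob("*.json"):
    record = json.loads(path.read_text())
    totals[record["category"]] += 1
    correct[record["category"]] += int(record["correct"])

for category in sorted(totals):
    pct = 100 * correct[category] / totals[category]
    print(f"{category:28s} {correct[category]:3d} / {totals[category]:3d}  ({pct:.0f}%)")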

Per-category breakdown

Where Engram v0.1.0 is competitive today, and where the roadmap work is. Honest disclosure — single-session and preference categories are strong; multi-session and temporal-reasoning are the open roadmap items.

| Category | Score | Raw | Notes |
|---|---|---|---|
| single-session-assistant | 95% | 53 / 56 | Strongest category — synthesis-mode and recall-mode both excel here |
| single-session-user | 83% | 58 / 70 | Strong |
| single-session-preference | 70% | 21 / 30 | Strong — synthesis-mode reading is purpose-built for this category |
| knowledge-update | 65% | 51 / 78 | Roadmap — belief revision and update detection planned |
| temporal-reasoning | 54% | 72 / 133 | Roadmap — today-anchored grounding helps; long-span temporal chains need work |
| multi-session | 51% | 68 / 133 | Roadmap — largest single uplift opportunity; cross-session entity linking planned |

Single-session and preference categories are where Engram's synthesis-mode reader and per-category routing pull their weight. Multi-session and temporal-reasoning are the open roadmap items — see Engram's roadmap for the planned work.

Engram's own progression

How v0.1.0's number evolved from the initial 100-question stratified result to the comparable 500-question full set.

| Run | Date | Score | n | Judge | Notes |
|---|---|---|---|---|---|
| v0.1.0 GA (initial) | 2026-05-06 | 71.0% | 100 (stratified) | gpt-4o-mini official | First judged result; stratified sampling over-represented single-session categories |
| v0.1.0 GA (full set) | 2026-05-09 | 64.6% | 500 | gpt-4o-mini official | Leaderboard entry above. The 6.4pp drop reflects long-tail categories (multi-session 51%, temporal-reasoning 54%) where stratification hid weakness — honest methodology disclosure; see the arithmetic sketch below. |
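
The 6.4pp gap between the two runs is roughly what re-weighting predicts: a stratified 100-question sample that gives every category similar weight lands near the unweighted mean of the per-category scores, while the full set weights every question equally. Using the per-category numbers from the breakdown table above:

# (correct, total) per category, from the per-category breakdown above.
categories = {
    "single-session-assistant":  (53, 56),
    "single-session-user":       (58, 70),
    "single-session-preference": (21, 30),
    "knowledge-update":          (51, 78),
    "temporal-reasoning":        (72, 133),
    "multi-session":             (68, 133),
}
question_weighted = sum(c for c, _ in categories.values()) / sum(n for _, n in categories.values())
category_mean = sum(c / n for c, n in categories.values()) / len(categories)
print(f"question-weighted (full set): {question_weighted:.1%}")  # 64.6%
print(f"unweighted category mean:     {category_mean:.1%}")      # ~69.7%

The stratified run's exact sampling proportions aren't reproduced here, so the unweighted mean is only an approximation of that 71.0% figure, but it shows the direction and rough size of the shift.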

Have a system on LongMemEval-S?

Submit your number with methodology — we add verified entries.

Submit your system →

Required: system name, score, n, judge, reader/generator model, license, MCP-native (yes/no/partial), runs-locally (yes/no/partial), reproduction link. Maintainer-triaged.