LongMemEval-S Leaderboard
Engram v0.1.0 vs the published frontier, judged on the same methodology.
LongMemEval-S is the standard 500-question evaluation for long-form agent memory. The numbers below come from each system's published results, all judged by the LongMemEval official gpt-4o-mini judge. Engram v0.1.0 — the open-source memory layer in jamjet-labs/engram — lands fifth of five on raw accuracy. The matrix below shows why that isn't the whole story.
Raw accuracy is one axis. Engram is also Apache 2.0, runs against any OpenAI-compatible model (including local ones), is MCP-native, and self-hosts on SQLite. Each axis is a real engineering decision criterion — not all of them are visible in a single accuracy number.
Last updated 2026-05-09
Leaderboard
All scores are reported on the LongMemEval-S 500-question evaluation and judged by the official gpt-4o-mini judge. Each row links to the system's source.
| # | System | Score | n | Judge | Reader / generator | Source |
|---|---|---|---|---|---|---|
| 1 | Chronos | 95.6% | 500 | gpt-4o-mini official | Claude Opus 4.6 | arXiv 2603.16862 |
| 2 | Mastra | 94.9% | 500 | gpt-4o-mini official | GPT-5-mini (no retrieval — full observation log) | mastra.ai/research |
| 3 | OMEGA | 93.2% raw* | 500 | gpt-4o-mini official | GPT-4.1 | omegamax.co/benchmarks |
| 4 | Zep / Graphiti | 71.2% | 500 | gpt-4o-mini official | GPT-4o | arXiv 2501.13956 |
| 5 | Engram v0.1.0 | 64.6% | 500 | gpt-4o-mini official | gpt-4o-mini (synthesis-mode for preference questions; recall mode otherwise) | github.com/jamjet-labs/engram |
* OMEGA reports 95.4% task-averaged (mean of per-category scores) as its headline figure; 93.2% is the raw count (466/500). We cite the raw count for direct comparability with other rows.
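To make the aggregation difference concrete, here is a minimal sketch with hypothetical per-category counts (not OMEGA's actual breakdown). A task-averaged mean weights every category equally; the raw count weights every question equally, so the two diverge whenever category sizes differ.

```python
# Two ways to aggregate a benchmark run.
# Hypothetical per-category (correct, total) counts -- illustrative only,
# not OMEGA's actual breakdown.
categories = {
    "small-category": (50, 50),    # 100.0% on 50 questions
    "large-category": (140, 200),  #  70.0% on 200 questions
}

task_averaged = sum(c / t for c, t in categories.values()) / len(categories)
raw = sum(c for c, _ in categories.values()) / sum(t for _, t in categories.values())

print(f"task-averaged: {task_averaged:.1%}")  # 85.0% -- each category weighted equally
print(f"raw:           {raw:.1%}")            # 76.0% -- each question weighted equally
```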
AgentMemory excluded: AgentMemory publishes 96.2% on LongMemEval-S and is a strong system worth evaluating, but its run uses GPT-4o as judge — not the LongMemEval official gpt-4o-mini judge this leaderboard enforces. Holding the judge constant is the comparability discipline here. See their result at github.com/JordanMcCann/agentmemory.
Beyond accuracy: the engineering matrix
The same systems, scored on six axes that matter beyond raw accuracy. Each cell is sourced; where the published docs are silent, the cell reads “(not documented)” rather than guessing.
| System | Accuracy | License | Reader model | MCP-native | Runs locally | Multi-tenant by default |
|---|---|---|---|---|---|---|
| Chronos | 95.6% | (not documented — research prototype; paper is CC-BY-4.0; no code repo) | Claude Opus 4.6 | (not documented — no MCP mention in paper; commercial-API-only architecture) | (not documented — no public code; runs on commercial APIs only per paper) | (not documented) |
| Mastra | 94.9% | (not documented — Mastra states "completely open source" but specific license not stated on research page) | GPT-5-mini | (not documented) | Partial — observer/reflector agents use Gemini-2.5-flash via API; no documented fully local path | (not documented) |
| OMEGA | 93.2% | Apache 2.0 | GPT-4.1 | Yes — first-party MCP server | Yes — CPU-only, no external services, data in ~/.omega/omega.db | (not documented — single-user local design; no per-tenant API surface found) |
| Zep / Graphiti | 71.2% | Graphiti: Apache 2.0 · Zep: commercial | GPT-4o | Yes — first-party MCP server (Graphiti repo mcp_server/) | Graphiti: Yes (self-hosted, Docker Compose) · Zep: No (managed SaaS) | Partial — Zep cloud (not documented; SaaS account-level isolation assumed) · Graphiti: not documented (build-your-own per README) |
| Engram v0.1.0 | 64.6% | Apache 2.0 | gpt-4o-mini (or any OpenAI-compatible — including local Ollama) | Yes — FastAPI HTTP + first-party MCP server | Yes — SQLite, no cloud dependency | Yes (default) — Scope(user_id, org_id) enforced at SQL + HNSW level; no cross-tenant leakage by construction |
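To make the multi-tenancy cell concrete: below is a minimal sketch of scope-filtered retrieval, assuming a Scope(user_id, org_id) value threaded through every query. The table and column names are hypothetical stand-ins, not Engram's actual schema; the design point is that the tenant predicate lives in the SQL itself, so cross-tenant rows are never candidates.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    user_id: str
    org_id: str

# Hypothetical table and column names -- a stand-in for Engram's real schema.
def fetch_memories(conn: sqlite3.Connection, scope: Scope, limit: int = 20):
    # The scope predicate is part of every query, so rows outside the
    # tenant are excluded at the SQL level rather than in application code.
    return conn.execute(
        """
        SELECT id, content, created_at
        FROM memories
        WHERE user_id = ? AND org_id = ?
        ORDER BY created_at DESC
        LIMIT ?
        """,
        (scope.user_id, scope.org_id, limit),
    ).fetchall()
```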
How to read this
The frontier systems above use Claude Opus 4.6 (Chronos) or GPT-5-mini (Mastra) as their reader, which sets a hard floor on cost and a hard ceiling on which models you can run. Engram's 64.6% comes from a gpt-4o-mini stack that is open-source, MCP-native, runs against any OpenAI-compatible model (including local ones via Ollama), self-hosts on SQLite, and ships per-tenant isolation by default. If your constraint is “highest raw accuracy regardless of cost or deployment shape,” pick Chronos, Mastra, or OMEGA. If your constraint is “the open-source memory system that runs in our VPC, doesn't lock us into a frontier model, ships per-tenant isolation by default, and speaks MCP today,” Engram is the system on that axis. We publish our number with the official judge, the methodology, and the reproduction script — submit yours.
Cost footnote
Rough cost-per-question estimates (back-of-envelope; verify against your own usage). Frontier-model systems running Claude Opus 4.6 or GPT-5-mini typically cost $0.05–$0.20 per LongMemEval-S question end-to-end. Engram's gpt-4o-mini stack runs ~$0.005 per question. Multi-thousand-question evaluation runs with frontier-model stacks cost real money; budget accordingly.
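Under those per-question assumptions, a full 500-question run pencils out as:

```python
N = 500  # LongMemEval-S full set

frontier_low, frontier_high = 0.05, 0.20  # $/question, frontier reader stacks
engram = 0.005                            # $/question, gpt-4o-mini stack

print(f"frontier: ${frontier_low * N:.2f} - ${frontier_high * N:.2f}")  # $25.00 - $100.00
print(f"engram:   ${engram * N:.2f}")                                   # $2.50
```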
Methodology
Benchmark. LongMemEval-S — the standard 500-question evaluation for long-form agent memory.
Judge. LongMemEval official gpt-4o-mini judge. Same judge as every entry above.
Engram configuration. v0.1.0 release stack: gpt-4o-mini reader; synthesis-mode reading for preference questions; recall-mode reading with a verifier for everything else; query decomposition; a six-tool built-in registry; two-stage retrieval; classifier-driven category budgets; today-anchored temporal grounding; hybrid retrieval (vector + BM25 + recency + graph + importance + temporal). Both reading modes are documented in src/engram/read/.
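As an illustration of the retrieval shape named above, here is a minimal sketch of hybrid score fusion. The signal names mirror the list above, but the weights, normalization, and function names are hypothetical, not Engram's tuned implementation; treat it as the shape of the idea, not the code in src/engram/.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    # Per-candidate scores, each assumed pre-normalized to [0, 1].
    vector: float      # embedding cosine similarity
    bm25: float        # lexical match
    recency: float     # decays with memory age
    graph: float       # entity-graph proximity to the query
    importance: float  # stored salience score
    temporal: float    # overlap with the query's resolved time window

# Hypothetical weights -- not Engram's tuned values.
WEIGHTS = {"vector": 0.35, "bm25": 0.20, "recency": 0.10,
           "graph": 0.15, "importance": 0.10, "temporal": 0.10}

def hybrid_score(s: Signals) -> float:
    # Weighted linear fusion of the six signals into one ranking score.
    return sum(w * getattr(s, name) for name, w in WEIGHTS.items())

def rank(candidates: list[Signals]) -> list[Signals]:
    # Highest fused score first.
    return sorted(candidates, key=hybrid_score, reverse=True)
```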
Reproduction:

```bash
git clone https://github.com/jamjet-labs/engram && cd engram
uv sync
export OPENAI_API_KEY=sk-...  # for the official judge + reader
export LONGMEMEVAL_ORACLE=path/to/longmemeval_oracle.json
uv run python -m benchmarks.smoke_runner --n 500 --decompose --tools
```
Per-question result JSON lands in benchmarks/reports/. Synthesis-mode is
auto-routed by the smoke_runner's question classifier — no separate flag needed.
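To slice a finished run yourself, a sketch like the following works, assuming one JSON file per question with a question_type and a boolean correct field. That schema is an assumption; check an actual file in benchmarks/reports/ before relying on it.

```python
import json
from collections import Counter
from pathlib import Path

correct, total = Counter(), Counter()

# Assumed layout: one JSON file per question in benchmarks/reports/,
# each with "question_type" and boolean "correct" fields. Verify against
# a real report file -- the schema here is an assumption.
for path in Path("benchmarks/reports").glob("*.json"):
    row = json.loads(path.read_text())
    cat = row["question_type"]
    total[cat] += 1
    correct[cat] += bool(row["correct"])

for cat in sorted(total):
    print(f"{cat:28s} {correct[cat] / total[cat]:6.1%} ({correct[cat]}/{total[cat]})")
```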
Per-category breakdown
Where Engram v0.1.0 is competitive today, and where the roadmap work is. Honest disclosure — single-session and preference categories are strong; multi-session and temporal-reasoning are the open roadmap items.
| Category | Score | Raw | Notes |
|---|---|---|---|
| single-session-assistant | 95% | 53 / 56 | Strongest category — synthesis-mode and recall-mode both excel here |
| single-session-user | 83% | 58 / 70 | Strong |
| single-session-preference | 70% | 21 / 30 | Strong — synthesis-mode reading is purpose-built for this category |
| knowledge-update | 65% | 51 / 78 | Roadmap — belief revision and update detection planned |
| temporal-reasoning | 54% | 72 / 133 | Roadmap — today-anchored grounding helps; long-span temporal chains need work |
| multi-session | 51% | 68 / 133 | Roadmap — largest single uplift opportunity; cross-session entity linking planned |
Single-session and preference categories are where Engram's synthesis-mode reader and per-category routing pull their weight. Multi-session and temporal-reasoning are the open roadmap items — see Engram's roadmap for the planned work.
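As a consistency check, the raw counts in the table reconcile exactly with the headline number:

```python
# Raw counts from the per-category table above.
per_category = {
    "single-session-assistant": (53, 56),
    "single-session-user": (58, 70),
    "single-session-preference": (21, 30),
    "knowledge-update": (51, 78),
    "temporal-reasoning": (72, 133),
    "multi-session": (68, 133),
}

correct = sum(c for c, _ in per_category.values())  # 323
total = sum(t for _, t in per_category.values())    # 500
print(f"{correct}/{total} = {correct / total:.1%}")  # 323/500 = 64.6%
```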
Engram's own progression
How v0.1.0's number evolved from the initial 100-question stratified result to the comparable 500-question full set.
| Run | Date | Score | n | Judge | Notes |
|---|---|---|---|---|---|
| v0.1.0 GA (initial) | 2026-05-06 | 71.0% | 100 stratified | gpt-4o-mini official | First judged result; stratified sampling over-represented single-session categories |
| v0.1.0 GA (full set) | 2026-05-09 | 64.6% | 500 | gpt-4o-mini official | Leaderboard entry above. 6.4pp drop reflects long-tail categories (multi-session 51%, temporal-reasoning 54%) where stratification hid weakness — honest methodology disclosure. |
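The 6.4pp drop is a sampling-mix effect, and its direction is reproducible from the per-category scores above. The stratified proportions below are hypothetical (the actual 100-question stratification isn't published here); the point is that any mix over-weighting the strong single-session categories relative to the full set's 133/500 multi-session and temporal shares inflates the aggregate.

```python
# Per-category accuracies from the breakdown table above.
acc = {"ss-assistant": 0.95, "ss-user": 0.83, "ss-preference": 0.70,
       "knowledge-update": 0.65, "temporal": 0.54, "multi-session": 0.51}

# Full-set question counts (sums to 500).
full = {"ss-assistant": 56, "ss-user": 70, "ss-preference": 30,
        "knowledge-update": 78, "temporal": 133, "multi-session": 133}

# Hypothetical stratified 100-question mix that over-weights single-session.
strat = {"ss-assistant": 25, "ss-user": 25, "ss-preference": 10,
         "knowledge-update": 15, "temporal": 15, "multi-session": 10}

def expected(mix):
    # Accuracy expected under a given question mix.
    return sum(acc[k] * mix[k] for k in mix) / sum(mix.values())

print(f"full-set mix:   {expected(full):.1%}")   # ~64.5% (rounding from table percentages)
print(f"stratified mix: {expected(strat):.1%}")  # higher -- single-session over-weighted
```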
Have a system on LongMemEval-S?
Submit your number with methodology — we add verified entries.
Submit your system →
Required: system name, score, n, judge, reader/generator model, license, MCP-native (yes/no/partial), runs-locally (yes/no/partial), reproduction link. Maintainer-triaged.
More: JamJet benchmarks index · Engram product page · Engram on GitHub