
LongMemEval-S Leaderboard

Engram v0.1.0 vs the published frontier, judged on the same methodology.

LongMemEval-S is the standard 500-question evaluation for long-form agent memory. The numbers below come from each system's own published run, all judged by the LongMemEval official gpt-4o-mini judge. Engram v0.1.0 — the open-source memory layer in jamjet-labs/engram — lands fifth of five on raw accuracy. The matrix below shows why that isn't the whole story.

Raw accuracy is one axis. Engram is also Apache 2.0, runs against any OpenAI-compatible model (including local ones), is MCP-native, and self-hosts on SQLite. Each axis is a real engineering decision criterion — not all of them are visible in a single accuracy number.
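
To make the "any OpenAI-compatible model" point concrete, the sketch below points the standard OpenAI Python client at a local Ollama endpoint. This is the general pattern, not Engram's own configuration surface; the model name and port are assumptions about a typical local setup.

from openai import OpenAI

# Any OpenAI-compatible endpoint works: a hosted API, a vLLM server, or,
# as here, a local Ollama instance serving its OpenAI-compatible API.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # placeholder; a local server ignores it
)
response = client.chat.completions.create(
    model="llama3.1:8b",  # assumption: whichever local model you have pulled
    messages=[{"role": "user", "content": "What did the user say about travel plans?"}],
)
print(response.choices[0].message.content)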

Last updated 2026-05-09

Leaderboard

All scores are reported on the LongMemEval-S 500-question evaluation and judged by the official gpt-4o-mini judge. Each row links to the system's source.

| # | System | Score | n | Judge | Reader / generator | Source |
|---|--------|-------|---|-------|--------------------|--------|
| 1 | Chronos | 95.6% | 500 | gpt-4o-mini official | Claude Opus 4.6 | arXiv 2603.16862 |
| 2 | Mastra | 94.9% | 500 | gpt-4o-mini official | GPT-5-mini (no retrieval — full observation log) | mastra.ai/research |
| 3 | OMEGA | 93.2% raw* | 500 | gpt-4o-mini official | GPT-4.1 | omegamax.co/benchmarks |
| 4 | Zep / Graphiti | 71.2% | 500 | gpt-4o-mini official | GPT-4o | arXiv 2501.13956 |
| 5 | Engram v0.1.0 | 64.6% | 500 | gpt-4o-mini official | gpt-4o-mini (synthesis mode for preference questions; recall mode otherwise) | github.com/jamjet-labs/engram |

* OMEGA reports 95.4% task-averaged (mean of per-category scores) as its headline figure; 93.2% is the raw count (466/500). We cite the raw count for direct comparability with other rows.

AgentMemory excluded: AgentMemory publishes 96.2% on LongMemEval-S and is a strong system worth evaluating, but its run uses GPT-4o as judge rather than the LongMemEval official gpt-4o-mini judge this leaderboard enforces. Same-judge methodology is the comparability requirement here. See their result at github.com/JordanMcCann/agentmemory.

Beyond accuracy: the engineering matrix

Same systems, and the axes that matter beyond raw accuracy. Every cell is sourced; where the published docs are silent, the cell reads “(not documented)” rather than a guess.

| System | Accuracy | License | Reader model | MCP-native | Runs locally | Multi-tenant by default |
|---|---|---|---|---|---|---|
| Chronos | 95.6% | (not documented — research prototype; paper is CC-BY-4.0; no code repo) | Claude Opus 4.6 | (not documented — no MCP mention in paper; commercial-API-only architecture) | (not documented — no public code; runs on commercial APIs only per paper) | (not documented) |
| Mastra | 94.9% | (not documented — Mastra states "completely open source" but specific license not stated on research page) | GPT-5-mini | (not documented) | Partial — observer/reflector agents use Gemini-2.5-flash via API; no documented fully local path | (not documented) |
| OMEGA | 93.2% | Apache 2.0 | GPT-4.1 | Yes — first-party MCP server | Yes — CPU-only, no external services, data in ~/.omega/omega.db | (not documented — single-user local design; no per-tenant API surface found) |
| Zep / Graphiti | 71.2% | Graphiti: Apache 2.0 · Zep: commercial | GPT-4o | Yes — first-party MCP server (Graphiti repo mcp_server/) | Graphiti: Yes (self-hosted, Docker Compose) · Zep: No (managed SaaS) | Partial — Zep cloud (not documented; SaaS account-level isolation assumed) · Graphiti: not documented (build-your-own per README) |
| Engram v0.1.0 | 64.6% | Apache 2.0 | gpt-4o-mini (or any OpenAI-compatible — including local Ollama) | Yes — FastAPI HTTP + first-party MCP server | Yes — SQLite, no cloud dependency | Yes (default): Scope(user_id, org_id) enforced at SQL + HNSW level; no cross-tenant leakage by construction |

How to read this

The frontier systems above use Claude Opus 4.6 (Chronos) or GPT-5-mini (Mastra) as their reader, which sets a hard floor on cost and a hard ceiling on which models you can run. Engram's 64.6% comes from the gpt-4o-mini stack — open-source, MCP-native, runs against any OpenAI-compatible model (including local ones via Ollama), self-hosts on SQLite, and ships per-tenant isolation by default. If your constraint is “highest raw accuracy regardless of cost or deployment shape,” pick Chronos, Mastra, or OMEGA. If your constraint is “the open-source memory system that runs in our VPC, doesn't lock us into a frontier model, ships per-tenant isolation by default, and speaks MCP today,” Engram is the system on that axis. We publish our number with the official judge, the methodology, and the reproduction script — submit yours.
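
As a concrete illustration of "per-tenant isolation by default", the sketch below shows what scope enforcement at the SQL level can look like. The table and column names are hypothetical, invented for this sketch rather than taken from Engram's actual schema; the same (user_id, org_id) filter would also have to be applied on the HNSW vector side.

import sqlite3

# Hypothetical sketch of per-tenant scoping at the SQL level. Table and
# column names are invented for illustration, not Engram's actual schema.
def fetch_memories(conn: sqlite3.Connection, user_id: str, org_id: str, pattern: str):
    # Every read carries the (user_id, org_id) scope, so a query issued for
    # one tenant can never return rows written under another tenant's scope.
    return conn.execute(
        "SELECT id, content, created_at FROM memories "
        "WHERE user_id = ? AND org_id = ? AND content LIKE ? "
        "ORDER BY created_at DESC",
        (user_id, org_id, f"%{pattern}%"),
    ).fetchall()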

Cost footnote

Rough cost-per-question estimates (back-of-envelope; verify against your own usage). Frontier-model systems running Claude Opus 4.6 or GPT-5-mini typically cost $0.05–$0.20 per LongMemEval-S question end-to-end. Engram's gpt-4o-mini stack runs ~$0.005 per question. Multi-thousand-question evaluation runs with frontier-model stacks cost real money; budget accordingly.
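
For reference, the back-of-envelope arithmetic behind those figures looks like the sketch below. The token volumes and per-million-token prices are assumptions chosen only for illustration; substitute current pricing and your own measured token counts.

# Cost per question = input tokens * input price + output tokens * output price.
def cost_per_question(input_tokens, output_tokens, usd_per_m_in, usd_per_m_out):
    return (input_tokens * usd_per_m_in + output_tokens * usd_per_m_out) / 1_000_000

# Assumed: ~20k tokens of retrieved context in, ~500 tokens of answer out.
frontier = cost_per_question(20_000, 500, usd_per_m_in=5.00, usd_per_m_out=15.00)
small    = cost_per_question(20_000, 500, usd_per_m_in=0.15, usd_per_m_out=0.60)
print(f"assumed frontier-tier stack: ~${frontier:.2f} per question")   # ~$0.11
print(f"assumed small-model stack:   ~${small:.4f} per question")      # ~$0.0033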

Methodology

Benchmark. LongMemEval-S — the standard 500-question evaluation for long-form agent memory.

Judge. LongMemEval official gpt-4o-mini judge. Same judge as every entry above.

Engram configuration. v0.1.0 release stack: gpt-4o-mini reader, synthesis-mode reading for preference questions, recall-mode reading with a verifier for everything else, query decomposition, a registry of six built-in tools, two-stage retrieval, classifier-driven category budgets, today-anchored temporal grounding, and hybrid retrieval (vector + BM25 + recency + graph + importance + temporal). Both reading modes are documented in src/engram/read/.
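
For intuition on the hybrid retrieval step, here is a minimal sketch of combining the listed signals into one ranking score. The weights and normalization are assumptions made for this sketch, not values taken from Engram's code.

# Illustrative only: weights are assumptions, and each signal is assumed
# to be pre-normalized to [0, 1] before combination.
WEIGHTS = {"vector": 0.35, "bm25": 0.25, "recency": 0.10,
           "graph": 0.10, "importance": 0.10, "temporal": 0.10}

def hybrid_score(signals: dict[str, float]) -> float:
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())

candidates = [
    {"id": "m1", "signals": {"vector": 0.82, "bm25": 0.40, "recency": 0.90}},
    {"id": "m2", "signals": {"vector": 0.55, "bm25": 0.75, "temporal": 0.60}},
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c["signals"]), reverse=True)
print([c["id"] for c in ranked])  # m1 first: 0.477 vs 0.440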

Reproduction:

git clone https://github.com/jamjet-labs/engram && cd engram
uv sync
export OPENAI_API_KEY=sk-...   # for the official judge + reader
export LONGMEMEVAL_ORACLE=path/to/longmemeval_oracle.json
uv run python -m benchmarks.smoke_runner --n 500 --decompose --tools

Per-question result JSON lands in benchmarks/reports/. Synthesis mode is auto-routed by the smoke_runner's question classifier — no separate flag is needed.
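
To produce a per-category breakdown (like the table below) from your own run, something along these lines works. The field names "category" and "correct" are assumptions about the report schema; adjust them to whatever the runner actually writes.

import json
from collections import defaultdict
from pathlib import Path

# Assumed schema: one JSON object per question with "category" and "correct"
# fields. Adjust field names to match the files in benchmarks/reports/.
totals, correct = defaultdict(int), defaultdict(int)
for path in Path("benchmarks/reports").glob("*.json"):
    record = json.loads(path.read_text())
    totals[record["category"]] += 1
    correct[record["category"]] += int(record["correct"])

for category in sorted(totals):
    pct = 100 * correct[category] / totals[category]
    print(f"{category:28s} {correct[category]:3d} / {totals[category]:3d}  ({pct:.0f}%)")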

Per-category breakdown

Where Engram v0.1.0 is competitive today, and where the roadmap work is. Honest disclosure — single-session and preference categories are strong; multi-session and temporal-reasoning are the open roadmap items.

| Category | Score | Raw | Notes |
|---|---|---|---|
| single-session-assistant | 95% | 53 / 56 | Strongest category — synthesis-mode and recall-mode both excel here |
| single-session-user | 83% | 58 / 70 | Strong |
| single-session-preference | 70% | 21 / 30 | Strong — synthesis-mode reading is purpose-built for this category |
| knowledge-update | 65% | 51 / 78 | Roadmap — belief revision and update detection planned |
| temporal-reasoning | 54% | 72 / 133 | Roadmap — today-anchored grounding helps; long-span temporal chains need work |
| multi-session | 51% | 68 / 133 | Roadmap — largest single uplift opportunity; cross-session entity linking planned |

Single-session and preference categories are where Engram's synthesis-mode reader and per-category routing pull their weight. Multi-session and temporal-reasoning are the open roadmap items — see Engram's roadmap for the planned work.

Engram's own progression

How v0.1.0's number evolved from the initial 100-question stratified result to the comparable 500-question full set.

| Run | Date | Score | n | Judge | Notes |
|---|---|---|---|---|---|
| v0.1.0 GA (initial) | 2026-05-06 | 71.0% | 100 (stratified) | gpt-4o-mini official | First judged result; stratified sampling over-represented single-session categories |
| v0.1.0 GA (full set) | 2026-05-09 | 64.6% | 500 | gpt-4o-mini official | Leaderboard entry above. The 6.4pp drop reflects long-tail categories (multi-session 51%, temporal-reasoning 54%) where stratification hid weakness — honest methodology disclosure; see the arithmetic sketch below. |
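
The 6.4pp gap between the two runs is roughly what re-weighting predicts: a stratified 100-question sample that gives every category similar weight lands near the unweighted mean of the per-category scores, while the full set weights every question equally. Using the per-category numbers from the breakdown table above:

# (correct, total) per category, from the per-category breakdown above.
categories = {
    "single-session-assistant":  (53, 56),
    "single-session-user":       (58, 70),
    "single-session-preference": (21, 30),
    "knowledge-update":          (51, 78),
    "temporal-reasoning":        (72, 133),
    "multi-session":             (68, 133),
}
question_weighted = sum(c for c, _ in categories.values()) / sum(n for _, n in categories.values())
category_mean = sum(c / n for c, n in categories.values()) / len(categories)
print(f"question-weighted (full set): {question_weighted:.1%}")  # 64.6%
print(f"unweighted category mean:     {category_mean:.1%}")      # ~69.7%

The stratified run's exact sampling proportions aren't reproduced here, so the unweighted mean is only an approximation of that 71.0% figure, but it shows the direction and rough size of the shift.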

Have a system on LongMemEval-S?

Submit your number with methodology — we add verified entries.

Submit your system →

Required: system name, score, n, judge, reader/generator model, license, MCP-native (yes/no/partial), runs-locally (yes/no/partial), reproduction link. Maintainer-triaged.