Why Our 83.2% LongMemEval Score Matters More Than 99%

How six iterations of honest benchmarking taught us more than any leaderboard number ever could.

Last week, a company called Supermemory published a blog post claiming roughly 99% accuracy on LongMemEval_S, the most rigorous public benchmark for long-term AI memory. The post went viral. Then they updated it and said the whole thing was a "social experiment," a parody meant to expose how meaningless memory benchmarking culture has become.

They were right to call that out. And we have the scars to prove it.

At Nunchi AI, we build Nexus, a memory infrastructure engine for AI agents powered by our Synapsis atomization engine. Over the past week, we ran LongMemEval_S against our production engine. Not a research prototype. Not a 12-agent leaderboard ensemble. The same engine that serves real API traffic.

We finished at 83.2%. What matters more than the number itself is how we got there.

What LongMemEval actually tests

LongMemEval_S is a 500-question benchmark published at ICLR 2025. Each question comes with 30 to 50 conversation sessions, roughly 115K tokens, that a memory system must ingest, index, and later retrieve in order to answer. It tests five core abilities.

Information Extraction: recalling specific facts from a single session
Preference Recall: understanding implicit user preferences
Multi-Session Reasoning: synthesizing facts scattered across multiple sessions
Temporal Reasoning: calculating dates and understanding event order
Knowledge Update: recognizing when old information has been superseded

The benchmark is hard because the answer may be hidden in only 1 to 3 sessions out of 50, the questions are deliberately indirect, and some require counting, comparing, or date arithmetic across sessions.

v1 the baseline: 76.8%

We started by feeding LongMemEval sessions through our production Synapsis pipeline: conversation, atomization, vector storage, cosine-similarity retrieval, and LLM answer generation.

The v1 results looked like this.

Info Extraction (User): 95.7%
Info Extraction (Assistant): 98.2%
Preference Recall: 66.7%
Multi-Session: 66.9%
Temporal Reasoning: 66.2%
Knowledge Update: 83.3%
Overall: 76.8%

The single-session extraction scores told us our retrieval engine worked. The weaker categories told us where to look next.

v2 the overengineering disaster: 60.8%

Confident in our diagnosis, we changed too much at once. We added multi-query decomposition with RRF, temporal boosts, knowledge-supersede logic that deleted older atoms, query expansion, and date-based resorting.

Overall dropped to 60.8%. Temporal Reasoning collapsed from 66.2% to 12.8%. Knowledge Update fell from 83.3% to 57.7%.

Three things went wrong.

Temporal boosts pushed merely recent atoms above actually relevant atoms.
Knowledge supersede logic deleted older atoms even though the benchmark often requires comparing old and new evidence together.
Date-based resorting rearranged atom order in the context window and pushed relevant evidence past the token limit.

The lesson was brutal and clear: do not touch the retrieval order.

v4 the sophisticated failure: 72.4%

We then tried a more principled architecture: a three-stage pipeline of Recall, Evidence Packing, and Answer, with category-specific packing strategies. We added round-robin session diversity, GPT-4o-mini preference summarization, and top-20 recency reranking for knowledge updates.

It was architecturally elegant. It scored 72.4%, still worse than v1.

The pattern stayed the same.

Every packing strategy that rearranged atom order hurt performance.
The preference summarization layer lost information that raw atoms preserved.
Round-robin diversity displaced highly relevant atoms with lower-relevance ones.

The three laws of memory retrieval

By v4, we had learned three rules the hard way.

Never delete atoms. The answer may require comparing old and new information.
Never reorder atoms. Cosine-similarity ranking is the best context arrangement for the LLM reader.
Never inject non-similarity signals into retrieval scores. Time, recency, and freshness belong outside the ranking function.

These are not just benchmark tips. They have become design constraints in our production engine.

v5 prompts only: 80.6%

Armed with those three rules, we restored v1 retrieval unchanged and modified only the prompts.

Preference Recall improved from 66.7% to 90.0%.
Knowledge Update improved from 83.3% to 91.0%.
Temporal Reasoning improved from 66.2% to 72.9% after explicitly injecting question_date into the prompt as the reference point.

Overall reached 80.6%. No retrieval changes. No architectural changes. Just better instructions for the LLM reader.

v6 the reading fix: 83.2%

We then analyzed the remaining Multi-Session failures. In 20 out of 46 wrong answers, the pattern was obvious: the LLM had all the atoms it needed, but stopped counting too early. It said "at least 3" when the answer was 4, missed the last item in a list, or hedged instead of committing.

The fix was one prompt addition.

Before answering, the model had to scan all memories, extract every relevant item into a numbered list, count the full list, and answer with the exact number. It was also told never to say "at least."

That pushed Multi-Session from 65.4% to 76.7%, and overall performance reached 83.2%.

A v5-to-v6 comparison makes the improvement pattern clearer.

Category	v5	v6	Change
Info Extraction (User)	94.3%	95.7%	+1.4
Info Extraction (Asst)	98.2%	100.0%	+1.8
Preference Recall	90.0%	83.3%	-6.7
Multi-Session	65.4%	76.7%	+11.3
Temporal Reasoning	72.9%	72.2%	-0.7
Knowledge Update	91.0%	89.7%	-1.3
Task-Averaged	85.3%	86.3%	+1.0
Overall	80.6%	83.2%	+2.6

The final v6 results themselves were:

Info Extraction (User): 95.7%
Info Extraction (Assistant): 100.0%
Preference Recall: 83.3%
Multi-Session: 76.7%
Temporal Reasoning: 72.2%
Knowledge Update: 89.7%
Overall: 83.2%

What 83.2% actually means

It is worth being precise about what this number is and is not.

It is the score of a production memory engine, the same Synapsis pipeline that handles real API traffic, evaluated on the toughest public long-term memory benchmark with GPT-4o as both answer generator and judge.

It is not the score of a research prototype optimized only for this benchmark. We did not build a 12-agent ensemble. We did not run multiple prompt variants in parallel and keep only the best answer. We did not use tricks we would never deploy in production.

For context, other LongMemEval_S results look roughly like this.

Mastra (Observational Memory): 95% with GPT-4o, but still a research architecture rather than a production API
Emergence AI: 86% with GPT-4o, RAG-based, with 5.65s latency
Supermemory (ASMR): about 99%, but self-described as a parody and dependent on 12 parallel agents per retrieval
Supermemory (production): Multi-Session 71.4% and Temporal 76.7%, the published category scores of their actual engine
TiMem: 76.88% with GPT-4o-mini
Zep/Graphiti: 71.2% with GPT-4o
Long-context GPT-4o without memory: about 60%

Our 83.2% sits near the top of production-viable systems. We also know where the remaining 16.8% lives: Multi-Session retrieval and Temporal Reasoning. Those are engineering problems with concrete paths forward, especially multi-query retrieval and anchor-based keyword recall.

Why the answer model matters more than you think

After locking in our v6 results, we ran the exact same retrieval pipeline and prompts with three different LLMs. Same atoms, same context order, same instructions. Only the reader model changed.

Category	n	GPT-4o	GPT-4o-mini	GPT-OSS-120B (Groq)
Info Extraction (User)	70	95.7%	95.7%	95.7%
Info Extraction (Asst)	56	100.0%	100.0%	98.2%
Preference Recall	30	83.3%	80.0%	43.3%
Multi-Session	133	76.7%	62.4%	25.6%
Temporal Reasoning	133	72.2%	57.1%	64.7%
Knowledge Update	78	89.7%	80.8%	82.1%
Overall	500	83.2%	73.8%	63.8%

The pattern is striking. Simple fact extraction is almost model-agnostic. All three models stay above 95% on single-session retrieval. The divergence happens in complex instruction following.

GPT-4o-mini drops 9.4 points overall, with the largest falloffs in Multi-Session and Temporal Reasoning. Those are exactly the categories where the v6 prompt says "list every item before counting" and "calculate from this reference date." Mini follows those instructions less reliably.

GPT-OSS-120B collapses to 25.6% on Multi-Session, a category where the atoms are present in context but the model fails to enumerate them. With 117B total parameters but only 5.1B active per forward pass, it handles extraction well but cannot sustain complex multi-step counting over a long context window. Preference Recall at 43.3% points to the same limit: synthesizing implicit patterns from scattered evidence exceeds what the active parameter budget can support.

That leads to two conclusions.

For benchmarking, a LongMemEval score without disclosing the answer model has limited meaning. The same retrieval engine can produce 83.2% or 63.8% depending on which LLM reads the context. Any system reporting results should specify the final answer model precisely.

For production, memory retrieval is necessary but not sufficient. The reader model is the ceiling. Invest in retrieval until it stops being the bottleneck, and then the answer model becomes the constraint. For Nexus, single-session extraction at 95 to 100% suggests retrieval has already reached that ceiling for simple queries. The remaining gains live in the reader's ability to reason over what has already been retrieved.

The honest benchmarking manifesto

Supermemory's social experiment exposed a real problem. Memory benchmarks are being gamed. Ensemble methods, prompt-variant selection, and research-only pipelines are producing numbers that say little about production quality.

We think LongMemEval reporting should follow a simple standard.

State clearly whether the system is production or research-only.
Report every category score, not just the overall score.
Disclose the answer-generation model.
Show the iteration history, including failures.

We published our failures, v2 at 60.8% and v4 at 72.4%, for exactly that reason. They contain more insight than the final number alone.

The memory layer is becoming critical infrastructure for AI agents, as important as tools and agent collaboration. That layer deserves honest measurement.

You can see the benchmark itself at LongMemEval on GitHub. Nexus is the memory engine behind AMCP, the Agent Memory Continuity Protocol. Learn more at Nexus.

Nexus v6 LongMemEval_S results: 83.2% overall, tested in March 2026. Engine version: Synapsis production. Answer model: GPT-4o. Judge model: GPT-4o, using the default LongMemEval setup.