GRAFOMEM — The conformance standard for agent memory

Abstract

A deterministic, capability-typed benchmark.

Each workload isolates one capability and is scored against an oracle that derives ground truth — including bi-temporal validity and tenant partitioning. The safety-critical capabilities, deletion and tenant isolation, are measured two-sided: for both leakage and over-restriction. The corpus is produced by deterministic generators, gated by independent validators, content-addressed, and regenerated byte-identically across architectures.
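To make the two-sided measurement concrete, here is a minimal sketch of how leakage and over-restriction could be scored for a single query. The function name, the set-based inputs, and the normalization are assumptions made for illustration; the source does not specify the suite's actual scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoSidedScore:
    leakage: float           # share of forbidden items that were still returned
    over_restriction: float  # share of permitted items that were wrongly withheld


def two_sided_score(returned: set, forbidden: set, permitted: set) -> TwoSidedScore:
    """Score one query on both sides of a deletion or tenant boundary.

    Hypothetical helper: `forbidden` holds items the oracle says must not
    appear (deleted, or owned by another tenant); `permitted` holds items
    the oracle says must still be retrievable.
    """
    leaked = returned & forbidden
    withheld = permitted - returned
    leakage = len(leaked) / len(forbidden) if forbidden else 0.0
    over = len(withheld) / len(permitted) if permitted else 0.0
    return TwoSidedScore(leakage=leakage, over_restriction=over)


# One forbidden item of two leaks, and one permitted item of two is missing.
score = two_sided_score(returned={"a", "x"}, forbidden={"x", "y"}, permitted={"a", "b"})
```

Scoring only leakage would reward a backend that returns nothing at all. Tracking over-restriction alongside it is what separates a correct boundary from an empty result set.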

Two results

What anchors the standard.

Result 1

Independent levers, not one knob

Memory behavior resolves into orthogonal axes — representational capability, embedding quality, retention, and a two-boundary privacy primitive.

Supersession lifts recall by +0.585 under drift but by +0.000 under distractor noise, where embedding quality is the only lever
Retention imposes a structural recall cliff no retriever can cross
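One way to read the retention cliff is as an upper bound that is fixed before retrieval even runs. The sketch below assumes a policy that keeps only the most recent K turns; that policy, the function name, and the toy numbers are illustrative assumptions, not the benchmark's W4 definition.

```python
def recall_ceiling(fact_positions: list, total_turns: int, k: int) -> float:
    """Upper bound on recall when only the most recent k turns are retained.

    Assumption: turns are indexed 0..total_turns-1 and the store evicts
    everything older than the last k turns. A fact that was evicted cannot
    be returned by any retriever, however good its ranking.
    """
    if not fact_positions:
        raise ValueError("need at least one fact position")
    oldest_kept = max(0, total_turns - k)
    retained = [p for p in fact_positions if p >= oldest_kept]
    return len(retained) / len(fact_positions)


# Toy trace: five queried facts spread across 100 turns.
positions = [0, 10, 50, 90, 99]
ceiling_small_k = recall_ceiling(positions, total_turns=100, k=20)   # 0.4
ceiling_full = recall_ceiling(positions, total_turns=100, k=100)     # 1.0
```

In this toy setup the ceiling depends only on K and on where the facts sit in the trace. This is why retention can be treated as a lever separate from embedding quality.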

Result 2

A claim does not certify behavior

Backends that advertise hard-delete and multi-tenancy, accept every call, and report success nonetheless leak completely — with no recall footprint to betray them (F10, F12).

The reason a type flag is insufficient — and a conformance suite is necessary
The single result the entire protocol is built to catch
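The gap between a claim and a behavior can be shown with a behavioral probe. The sketch below uses two toy in-memory stores with a hypothetical `supports_hard_delete` flag; none of these class or method names come from the suite or from any real backend.

```python
class HonestStore:
    """Toy backend whose delete actually removes the item."""
    supports_hard_delete = True  # the advertised capability flag (hypothetical)

    def __init__(self):
        self._items = {}

    def write(self, key, text):
        self._items[key] = text

    def delete(self, key):
        self._items.pop(key, None)
        return True

    def search(self, term):
        return [k for k, t in self._items.items() if term in t]


class LeakyStore(HonestStore):
    """Advertises the same flag and reports success, but removes nothing."""

    def delete(self, key):
        return True


def probe_hard_delete(store):
    """Write a canary, delete it, then check whether it can still be retrieved."""
    canary = "canary-7f3a"
    store.write("m1", f"{canary} private note")
    acknowledged = store.delete("m1")
    leaked = "m1" in store.search(canary)
    return {
        "claimed": store.supports_hard_delete,
        "acknowledged": acknowledged,
        "leaked": leaked,
    }


honest = probe_hard_delete(HonestStore())
leaky = probe_hard_delete(LeakyStore())
```

Both stores expose the same flag and return the same acknowledgement. Only the post-deletion query tells them apart, which is the kind of check a type flag cannot replace.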

Workloads → findings. Representational capability (W1–W2; F3–F5) · embedding quality (W3; F6–F7) · retention policy (W4; F8–F9, the bounded-K cliff) · the privacy primitive (W5–W6; F10–F13). The v0.2.0 release extends the suite to ten workloads and twenty findings.

Corpus provenance. 135 traces · 61,754 turns · 17,612 queries · corpus hash f049820b…b077ca6 — regenerated byte-identically from deterministic generators.
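Byte-identical regeneration and content addressing can be illustrated with a seeded generator and a canonical serialization. The record shape, the use of JSON, and the choice of SHA-256 below are assumptions for the sketch; the source does not say which serialization or hash function produces the corpus hash.

```python
import hashlib
import json
import random


def generate_trace(seed: int, turns: int = 5) -> list:
    """Hypothetical deterministic generator: all randomness comes from the seed."""
    rng = random.Random(seed)
    return [{"turn": i, "value": rng.randrange(1000)} for i in range(turns)]


def content_address(trace: list) -> str:
    """Hash a canonical byte encoding so equal content yields an equal address."""
    canonical = json.dumps(
        trace,
        sort_keys=True,          # fixed key order
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=True,       # stable byte encoding
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = content_address(generate_trace(seed=7))
second = content_address(generate_trace(seed=7))
regenerated_identically = first == second
```

Pinning the seed, the key order, and the separators removes the usual sources of byte-level drift, so a mismatch in the address signals a real change in content.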

→ Full paper, findings table (F1–F20), and appendices

The benchmark behind the standard.
