GMP isn't asserted — it's derived. GRAFOMEM: A Reproducible Benchmark for Agent Memory establishes, empirically, the dimensions a memory standard cannot leave unspecified, and shows why a conformance suite is necessary.
Each workload isolates one capability and is scored against an oracle that derives ground truth, including bi-temporal validity and tenant partitioning. The safety-critical capabilities, deletion and tenant isolation, are measured two-sided: a backend is scored both for leakage (returning what it must not) and for over-restriction (withholding what it may return). The corpus is produced by deterministic generators, gated by independent validators, content-addressed, and regenerated byte-identically across architectures.
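To make the two-sided measurement concrete, here is a minimal sketch in Python. It is illustrative only: the names `TwoSidedScore` and `score_two_sided`, and the choice of simple set-based rates, are assumptions for this example and not GRAFOMEM's actual scoring code. The oracle is represented by two sets it would supply, the items a query may see and the items it must not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoSidedScore:
    leakage: float           # fraction of forbidden items that were returned
    over_restriction: float  # fraction of permitted items that were withheld


def score_two_sided(returned: set, permitted: set, forbidden: set) -> TwoSidedScore:
    """Score one query result against oracle-supplied permitted/forbidden sets.

    Hypothetical helper: the rates are plain set ratios chosen for illustration.
    """
    leaked = returned & forbidden    # items that crossed the boundary
    withheld = permitted - returned  # items wrongly blocked
    leakage = len(leaked) / len(forbidden) if forbidden else 0.0
    over = len(withheld) / len(permitted) if permitted else 0.0
    return TwoSidedScore(leakage=leakage, over_restriction=over)


# Example: one forbidden item leaks and one permitted item is withheld.
score = score_two_sided(
    returned={"a", "b", "x"},
    permitted={"a", "b", "c"},
    forbidden={"x", "y"},
)
print(score)
```

Reporting both rates matters because either one alone can be gamed: a backend that returns nothing has zero leakage, and one that returns everything has zero over-restriction.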
Memory behavior resolves into orthogonal axes — representational capability, embedding quality, retention, and a two-boundary privacy primitive.
Backends that advertise hard-delete and multi-tenancy, accept every call, and report success nonetheless leak completely — with no recall footprint to betray them (F10, F12).
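The failure mode described here can be sketched with a toy probe. Everything in this example is hypothetical: `LeakyBackend`, `HonestBackend`, and `delete_probe` are stand-ins invented for illustration, not any real backend's API or the benchmark's harness. The point is only that the leaky backend looks healthy on every signal except a direct check of the deleted item.

```python
class LeakyBackend:
    """Hypothetical backend: delete() reports success but keeps the data."""

    def __init__(self):
        self._items = {}

    def put(self, key, text):
        self._items[key] = text

    def delete(self, key):
        return True  # claims success without removing anything

    def search(self, query):
        return [k for k, text in self._items.items() if query in text]


class HonestBackend(LeakyBackend):
    """Same interface, but delete() really removes the item."""

    def delete(self, key):
        return self._items.pop(key, None) is not None


def delete_probe(backend):
    """Write two memories, delete one, then check both sides of the boundary."""
    backend.put("m1", "alice lives in Lisbon")
    backend.put("m2", "bob lives in Oslo")
    reported_ok = backend.delete("m1")
    return {
        "reported_ok": reported_ok,                      # what the API claims
        "leaked": "m1" in backend.search("Lisbon"),      # deleted item still retrievable
        "still_recalls": "m2" in backend.search("Oslo"), # retained item unaffected
    }


print(delete_probe(LeakyBackend()))
print(delete_probe(HonestBackend()))
```

Both backends report success and both still recall the retained memory, so only the explicit leakage check separates them.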
Workloads → findings. Representational capability (W1–W2; F3–F5) · embedding quality (W3; F6–F7) · retention policy (W4; F8–F9, the bounded-K cliff) · the privacy primitive (W5–W6; F10–F13). The v0.2.0 release extends the suite to ten workloads and twenty findings.
Corpus provenance. 135 traces · 61,754 turns · 17,612 queries · corpus hash f049820b…b077ca6 — regenerated byte-identically from deterministic generators.
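As a rough illustration of what content addressing buys, the sketch below hashes a set of named files in a fixed order so that the same bytes always yield the same digest. This is an assumption-laden example: the source does not state which hash function or file layout GRAFOMEM uses, so SHA-256, the length-prefixed framing, and the name `corpus_hash` are all hypothetical choices.

```python
import hashlib


def corpus_hash(files: dict) -> str:
    """Digest a mapping of file name -> bytes, independent of insertion order.

    Hypothetical scheme: SHA-256 over length-prefixed (name, content) pairs,
    visited in sorted name order. Length prefixes keep adjacent fields from
    running together ambiguously.
    """
    h = hashlib.sha256()
    for name in sorted(files):
        encoded_name = name.encode("utf-8")
        data = files[name]
        h.update(len(encoded_name).to_bytes(8, "big"))
        h.update(encoded_name)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


# Same content in a different insertion order gives the same digest;
# changing a single byte gives a different one.
a = corpus_hash({"trace_001.jsonl": b"turn 1\n", "trace_002.jsonl": b"turn 2\n"})
b = corpus_hash({"trace_002.jsonl": b"turn 2\n", "trace_001.jsonl": b"turn 1\n"})
c = corpus_hash({"trace_001.jsonl": b"turn 1\n", "trace_002.jsonl": b"turn 3\n"})
print(a == b, a == c)
```

Under a scheme like this, a regenerated corpus can be checked against the published hash with a single comparison, which is what makes a byte-identical regeneration claim testable.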
→ Full paper, findings table (F1–F20), and appendices