The Problem: Declared ≠ Observed
When you evaluate an agent-memory backend, you typically check its API surface: does it expose a delete() method? Does it accept a tenant_id parameter? If yes, you mark the capability as "supported" and move on.
But having a method signature is not the same as honoring it.
We started the GRAFOMEM project asking a simple question: if a backend declares HARD_DELETE, does deleted data actually disappear from all retrieval paths? The answer, in most cases we tested, is no.
Methodology
We built an open conformance suite called GMP (GRAFOMEM Memory Protocol) that defines 10 orthogonal capabilities for agent memory:
| Capability | What It Tests |
|---|---|
| AUDIT | Can you enumerate all stored memories? |
| SUPERSESSION_CHAIN | Can you version facts and retire predecessors? |
| BI_TEMPORAL | Can you query historical state at a point in time? |
| HARD_DELETE | Is deletion irrecoverable from all paths — retrieve and audit? |
| MULTI_TENANT | Is read/write isolation per-tenant actually enforced? |
| PROVENANCE | Are writes tagged with source metadata? |
| CRYPTOGRAPHIC_PROVENANCE | Are writes Ed25519-signed? |
| CONFLICT_DETECTION | Are overlapping-validity conflicts detected? |
| CROSS_SESSION_PROPAGATION | Do deletes propagate across sessions? |
| CONCURRENCY_CONTROL | What isolation level is actually provided? |
Each capability has a normative specification (RFC 2119 language) and a runnable test suite. A backend "supports" a capability if and only if it passes the suite for that capability — not if it declares it. We call this the M8 conformance rate: the fraction of declared capabilities that actually pass.
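As a rough illustration of the metric, the sketch below computes the rate from a set of declared capabilities and a per-capability pass/fail map. The function name, the input shapes, and the choice to reject a backend that declares nothing are assumptions made for this example, not the suite's actual API.

```python
from typing import Dict, Set


def m8_conformance_rate(declared: Set[str], passed: Dict[str, bool]) -> float:
    """Fraction of declared capabilities whose test suite passed.

    Hypothetical helper: a capability with no recorded result counts as failed.
    """
    if not declared:
        # Assumption: the rate is undefined when nothing is declared.
        raise ValueError("backend declares no capabilities")
    honored = sum(1 for cap in declared if passed.get(cap, False))
    return honored / len(declared)


# Example: four capabilities declared, one of them fails its suite.
declared = {"AUDIT", "HARD_DELETE", "MULTI_TENANT", "PROVENANCE"}
results = {"AUDIT": True, "HARD_DELETE": False, "MULTI_TENANT": True, "PROVENANCE": True}
rate = m8_conformance_rate(declared, results)
print(f"M8 = {rate:.3f}")  # prints "M8 = 0.750"
```

A backend that declares only what it honors scores 1.000 regardless of how many capabilities it declares, which is why the metric measures honesty of the declaration rather than breadth of features.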
The entire corpus is content-addressed (BLAKE2b-256, hash f049820b…) and locked — anyone can re-run our tests and get identical results.
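To make "content-addressed" concrete, here is a minimal sketch of deriving a single BLAKE2b-256 digest over a set of trace files using Python's standard `hashlib`. How GRAFOMEM actually serializes and orders the corpus before hashing is not stated here, so the sorting and length-prefixing scheme below is an assumption chosen for illustration.

```python
import hashlib
from typing import Iterable, Tuple


def corpus_digest(files: Iterable[Tuple[str, bytes]]) -> str:
    """Return a BLAKE2b-256 hex digest over (name, content) pairs.

    Hypothetical scheme: entries are sorted by name, and each field is
    length-prefixed so that different splits cannot collide.
    """
    h = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    for name, data in sorted(files):
        for part in (name.encode("utf-8"), data):
            h.update(len(part).to_bytes(8, "big"))
            h.update(part)
    return h.hexdigest()


# Example with made-up trace names and contents.
corpus = [("trace_002.json", b'{"turns": []}'), ("trace_001.json", b'{"turns": [1]}')]
print(corpus_digest(corpus))
```

Because the digest depends on every byte of every file, changing a single trace changes the corpus hash, which is what lets a reader confirm they are running the same tests.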
Key Findings
Finding F10: Deletion Leakage
We built a "soft delete" backend that marks records as deleted but doesn't actually remove them from the vector index. It declares HARD_DELETE. It has a delete() method that returns success.
The conformance suite caught it immediately. The test writes a fact, deletes it, then queries with the same predicate. The deleted fact comes back in the results. Every single time.
This isn't a contrived scenario. We found this pattern in backends that implement deletion by toggling an is_deleted flag without filtering the vector similarity search results. The API says "deleted", but the retrieval path disagrees.
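The failure mode can be reproduced in a few lines. The sketch below is a toy model, not the GMP suite: the class names, the `write`/`delete`/`retrieve`/`audit` methods, and the predicate lookup (standing in for vector similarity search) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    is_deleted: bool = False


class SoftDeleteStore:
    """Toy backend: flags records as deleted but never filters on the flag."""

    def __init__(self) -> None:
        self._facts: Dict[int, Fact] = {}
        self._next_id = 0

    def write(self, subject: str, predicate: str, obj: str) -> int:
        fact_id = self._next_id
        self._next_id += 1
        self._facts[fact_id] = Fact(subject, predicate, obj)
        return fact_id

    def delete(self, fact_id: int) -> bool:
        self._facts[fact_id].is_deleted = True
        return True  # reports success even though the data is still served

    def retrieve(self, predicate: str) -> List[Fact]:
        # The bug: is_deleted is never consulted on the retrieval path.
        return [f for f in self._facts.values() if f.predicate == predicate]

    def audit(self) -> List[Fact]:
        return list(self._facts.values())


class HardDeleteStore(SoftDeleteStore):
    """Toy backend that removes the record itself."""

    def delete(self, fact_id: int) -> bool:
        return self._facts.pop(fact_id, None) is not None


def passes_hard_delete(store) -> bool:
    """Write a fact, delete it, then check both retrieve and audit paths."""
    fact_id = store.write("user_42", "prefers", "dark_roast")
    deleted = store.delete(fact_id)
    leaked = any(f.obj == "dark_roast" for f in store.retrieve("prefers"))
    leaked = leaked or any(f.obj == "dark_roast" for f in store.audit())
    return deleted and not leaked


print(passes_hard_delete(SoftDeleteStore()))  # False
print(passes_hard_delete(HardDeleteStore()))  # True
```

Both stores expose the same `delete()` signature and both return success, so nothing short of checking the retrieval and audit paths after the delete distinguishes them.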
Finding F12: Tenant Isolation Failure
We built three tenant backends:
- Scoped (correct): reads and writes are strictly per-tenant
- Leaky: accepts tenant_id but doesn't filter retrieval results by it
- Over-isolating: filters so aggressively that it blocks the tenant's own legitimate data
The conformance suite runs two-sided tests:
- Leakage direction: Write as tenant A, retrieve as tenant B — should return nothing
- Over-restriction direction: Write as tenant A, retrieve as tenant A — should return everything
The leaky backend passes neither. The over-isolating backend passes the leakage test but fails the over-restriction test. Only the correctly scoped backend achieves M8 = 1.000.
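The two-sided idea can be sketched with three toy stores. Everything here is hypothetical (class names, the `write`/`retrieve` methods, the tenant labels), and the model is deliberately minimal: in this toy version the leaky store fails only the leakage direction, so per-direction verdicts from the real suite may differ from what this sketch prints.

```python
from typing import Dict, List, Tuple


class ScopedStore:
    """Toy backend that filters retrieval by tenant."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str]] = []  # (tenant_id, value)

    def write(self, tenant_id: str, value: str) -> None:
        self._rows.append((tenant_id, value))

    def retrieve(self, tenant_id: str) -> List[str]:
        return [v for t, v in self._rows if t == tenant_id]


class LeakyStore(ScopedStore):
    """Accepts tenant_id but ignores it on retrieval."""

    def retrieve(self, tenant_id: str) -> List[str]:
        return [v for _, v in self._rows]


class OverIsolatingStore(ScopedStore):
    """Filters so aggressively that even the owner gets nothing back."""

    def retrieve(self, tenant_id: str) -> List[str]:
        return []


def two_sided_check(store) -> Dict[str, bool]:
    """Write as tenant A, then probe from tenant B and from tenant A."""
    store.write("tenant_a", "a_secret")
    return {
        # Leakage direction: tenant B must not see tenant A's data.
        "no_leakage": "a_secret" not in store.retrieve("tenant_b"),
        # Over-restriction direction: tenant A must see its own data.
        "no_over_restriction": "a_secret" in store.retrieve("tenant_a"),
    }


for backend in (ScopedStore, LeakyStore, OverIsolatingStore):
    print(backend.__name__, two_sided_check(backend()))
```

A one-sided test would miss one of the broken stores: checking only for leakage lets the over-isolating store through, and checking only that owners can read their data lets the leaky one through.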
Findings F19–F20: Isolation Level Spectrum
We tested 5 backends along the concurrency isolation spectrum (serializable → snapshot → read_committed → no_isolation → "resurrecting"). Backends that declare read_committed but actually provide no isolation fail. Backends that claim serializable but allow dirty reads fail. The conformance suite captures the actual isolation behavior, not the declared level.
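One way to observe isolation behavior rather than trust the declared level is a dirty-read probe: write inside an open transaction, read from outside it, and see whether the uncommitted value is visible. The sketch below is a single-threaded simplification with hypothetical class and method names (`begin`, `write`, `commit`, `rollback`, `read`); it does not reproduce the suite's actual traces.

```python
from typing import Dict, Optional


class ReadCommittedStore:
    """Toy backend: writes stay in a per-transaction buffer until commit."""

    def __init__(self) -> None:
        self._committed: Dict[str, str] = {}
        self._pending: Dict[int, Dict[str, str]] = {}
        self._next_txn = 0

    def begin(self) -> int:
        txn = self._next_txn
        self._next_txn += 1
        self._pending[txn] = {}
        return txn

    def write(self, txn: int, key: str, value: str) -> None:
        self._pending[txn][key] = value

    def commit(self, txn: int) -> None:
        self._committed.update(self._pending.pop(txn))

    def rollback(self, txn: int) -> None:
        self._pending.pop(txn)

    def read(self, key: str) -> Optional[str]:
        # Readers outside the transaction only see committed state.
        return self._committed.get(key)


class NoIsolationStore(ReadCommittedStore):
    """Toy backend that exposes writes before they are committed."""

    def write(self, txn: int, key: str, value: str) -> None:
        super().write(txn, key, value)
        self._committed[key] = value  # visible immediately: a dirty write path


def allows_dirty_read(store) -> bool:
    """Return True if a reader can observe an uncommitted write."""
    txn = store.begin()
    store.write(txn, "k", "uncommitted")
    observed = store.read("k")  # read from outside the open transaction
    store.rollback(txn)
    return observed == "uncommitted"


print(allows_dirty_read(ReadCommittedStore()))  # False
print(allows_dirty_read(NoIsolationStore()))    # True
```

A backend that declares read_committed but returns True from a probe like this has declared a level it does not provide, whatever its documentation says.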
The Takeaway
A declared capability is not an observed one. Type signatures, API contracts, and recall metrics cannot catch these failures. You need conformance tests that probe behavior from both sides — does it leak forbidden data, and does it withhold permitted data?
What We Built
GRAFOMEM is the result:
- GMP v0.2.0: An open spec defining 10 frozen capability flags
- Conformance suite: 135 traces, 61,754 turns, 17,612 queries — corpus-locked with BLAKE2b
- Ed25519-signed reports: Tamper-evident conformance certificates
- Reference implementations: In-memory, SQLite, and PostgreSQL — all M8 = 1.000
- MIT-licensed: Spec, suite, and corpus are all open source
The whole thing runs with:
    pip install grafomem
    grafomem conformance -b your_backend:YourStore -o report.json
And if you want the conformant backend without the ops:
    # GRAFOMEM Cloud: managed, GMP-verified, zero infrastructure
    curl -X POST https://cloud.grafomem.com/v1/write \
      -H "Authorization: Bearer gfm_..." \
      -d '{"predicate": "prefers", "subject": "user_42", "object": "dark_roast"}'