The Problem: Declared ≠ Observed

When you evaluate an agent-memory backend, you typically check its API surface: does it expose a delete() method? Does it accept a tenant_id parameter? If yes, you mark the capability as "supported" and move on.

But having a method signature is not the same as honoring it.

We started the GRAFOMEM project asking a simple question: if a backend declares HARD_DELETE, does deleted data actually disappear from all retrieval paths? The answer, in most cases we tested, is no.

Methodology

We built an open conformance suite called GMP (GRAFOMEM Memory Protocol) that defines 10 orthogonal capabilities for agent memory:

| Capability | What It Tests |
| --- | --- |
| AUDIT | Can you enumerate all stored memories? |
| SUPERSESSION_CHAIN | Can you version facts and retire predecessors? |
| BI_TEMPORAL | Can you query historical state at a point in time? |
| HARD_DELETE | Is deletion irrecoverable from all paths (retrieve and audit)? |
| MULTI_TENANT | Is read/write isolation per-tenant actually enforced? |
| PROVENANCE | Are writes tagged with source metadata? |
| CRYPTOGRAPHIC_PROVENANCE | Are writes Ed25519-signed? |
| CONFLICT_DETECTION | Are overlapping-validity conflicts detected? |
| CROSS_SESSION_PROPAGATION | Do deletes propagate across sessions? |
| CONCURRENCY_CONTROL | What isolation level is actually provided? |

Each capability has a normative specification (RFC 2119 language) and a runnable test suite. A backend "supports" a capability if and only if it passes the suite for that capability — not if it declares it. We call this the M8 conformance rate: the fraction of declared capabilities that actually pass.
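
To make the M8 definition concrete, here is a minimal sketch of the arithmetic. The function name and the choice to reject a backend that declares nothing are our assumptions for illustration; they are not taken from the GMP spec or the grafomem package.

```python
def m8_conformance_rate(declared: set, passed: set) -> float:
    """Fraction of declared capabilities whose test suite actually passed.

    Capabilities that pass without being declared do not count:
    M8 only measures whether declarations are honored.
    """
    if not declared:
        # Assumption of this sketch: the rate is undefined with no declarations.
        raise ValueError("backend declares no capabilities")
    return len(declared & passed) / len(declared)


# Hypothetical backend: declares four capabilities, honors two of them.
declared = {"AUDIT", "HARD_DELETE", "MULTI_TENANT", "PROVENANCE"}
passed = {"AUDIT", "PROVENANCE", "BI_TEMPORAL"}

rate = m8_conformance_rate(declared, passed)
print(f"M8 = {rate:.3f}")  # M8 = 0.500
```

BI_TEMPORAL passes here but was never declared, so it does not raise the score.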

The entire corpus is content-addressed (BLAKE2b-256, hash f049820b…) and locked — anyone can re-run our tests and get identical results.
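
As an illustration of what content-addressing buys you, the sketch below computes a BLAKE2b-256 digest over a set of trace files using Python's standard hashlib. The canonicalization (sorted file names, length-prefixed fields) and the file names are assumptions made for this example; the actual GRAFOMEM corpus layout and hashing scheme may differ.

```python
import hashlib


def corpus_digest(files: dict) -> str:
    """Return a BLAKE2b-256 hex digest over a mapping of name -> bytes.

    Names are processed in sorted order so the digest does not depend on
    dict ordering, and every field is length-prefixed so that file
    boundaries cannot be shifted without changing the digest.
    """
    h = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    for name in sorted(files):
        name_bytes = name.encode("utf-8")
        data = files[name]
        h.update(len(name_bytes).to_bytes(8, "big"))
        h.update(name_bytes)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


# Hypothetical two-file corpus.
corpus = {"trace_001.json": b'{"turns": []}', "trace_002.json": b'{"turns": [1]}'}
locked = corpus_digest(corpus)

# Any change to any byte produces a different digest.
tampered = dict(corpus, **{"trace_002.json": b'{"turns": [2]}'})
print(corpus_digest(tampered) == locked)  # False
```

Comparing a locally computed digest against the published one is what lets two people confirm they ran the same tests on the same data.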

Key Findings

Finding F10: Deletion Leakage

Backends that accept HARD_DELETE can still return deleted data through the retrieve() path with probability 1.0.

We built a "soft delete" backend that marks records as deleted but doesn't actually remove them from the vector index. It declares HARD_DELETE. It has a delete() method that returns success.

The conformance suite caught it immediately. The test writes a fact, deletes it, then queries with the same predicate. The deleted fact comes back in the results. Every single time.

This isn't a contrived scenario. We found this pattern in backends that implement deletion by toggling an is_deleted flag without filtering the vector similarity search results. The API says "deleted" but the retrieval path disagrees.
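
The failure mode is easy to reproduce in miniature. The sketch below is not the GRAFOMEM test backend or its API: the class names, method signatures, and the dictionary standing in for a vector index are all hypothetical, chosen only to show the write, delete, retrieve sequence described above.

```python
class SoftDeleteStore:
    """Declares HARD_DELETE but only flips a flag."""

    def __init__(self):
        self.records = {}  # record id -> record dict (stand-in for an index)

    def write(self, rid, subject, predicate, obj):
        self.records[rid] = {
            "subject": subject, "predicate": predicate,
            "object": obj, "is_deleted": False,
        }

    def delete(self, rid):
        self.records[rid]["is_deleted"] = True
        return True  # reports success

    def retrieve(self, predicate):
        # Bug: the search never consults is_deleted.
        return [r for r in self.records.values() if r["predicate"] == predicate]

    def audit(self):
        return list(self.records.values())


class HardDeleteStore(SoftDeleteStore):
    """Removes the record itself, so no path can return it."""

    def delete(self, rid):
        return self.records.pop(rid, None) is not None


def check_hard_delete(store):
    """Write a fact, delete it, then look for it on both read paths."""
    store.write("m1", "user_42", "prefers", "dark_roast")
    store.delete("m1")
    leaked_retrieve = any(r["object"] == "dark_roast" for r in store.retrieve("prefers"))
    leaked_audit = any(r["object"] == "dark_roast" for r in store.audit())
    return not (leaked_retrieve or leaked_audit)


print(check_hard_delete(SoftDeleteStore()))  # False
print(check_hard_delete(HardDeleteStore()))  # True
```

Both stores have the same method signatures and both report a successful delete; only the behavioral check tells them apart.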

Finding F12: Tenant Isolation Failure

Backends that accept tenant_id can leak memories across tenants — and the two-sided test also catches over-isolation (blocking legitimate access).

We built three tenant backends:

  • Scoped (correct): reads and writes are strictly per-tenant
  • Leaky: accepts tenant_id but doesn't filter retrieval results by it
  • Over-isolating: filters so aggressively that it blocks the tenant's own legitimate data

The conformance suite runs two-sided tests:

  1. Leakage direction: Write as tenant A, retrieve as tenant B — should return nothing
  2. Over-restriction direction: Write as tenant A, retrieve as tenant A — should return everything

The leaky backend passes neither check. The over-isolating backend passes the leakage check but fails the over-restriction check. Only the correctly scoped backend achieves M8 = 1.000.
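
A toy version of the two-sided test is sketched below. The three store classes and their methods are hypothetical stand-ins, not the suite's real backends. One assumption is worth calling out: this sketch reads "should return everything" as "exactly tenant A's data", which is one reading under which a leaky backend fails both directions.

```python
class ScopedStore:
    """Correct: retrieval is filtered by tenant."""

    def __init__(self):
        self.rows = []  # list of (tenant_id, fact)

    def write(self, tenant_id, fact):
        self.rows.append((tenant_id, fact))

    def retrieve(self, tenant_id):
        return [fact for tenant, fact in self.rows if tenant == tenant_id]


class LeakyStore(ScopedStore):
    """Accepts tenant_id but ignores it on retrieval."""

    def retrieve(self, tenant_id):
        return [fact for _, fact in self.rows]


class OverIsolatingStore(ScopedStore):
    """Filters so aggressively that nothing comes back."""

    def retrieve(self, tenant_id):
        return []


def two_sided_check(store_cls):
    store = store_cls()
    store.write("tenant_a", "a_fact")
    store.write("tenant_b", "b_fact")
    return {
        # Leakage direction: B must not see A's data.
        "leakage": "a_fact" not in store.retrieve("tenant_b"),
        # Over-restriction direction: A must see exactly its own data.
        "over_restriction": store.retrieve("tenant_a") == ["a_fact"],
    }


for cls in (ScopedStore, LeakyStore, OverIsolatingStore):
    print(cls.__name__, two_sided_check(cls))
# ScopedStore {'leakage': True, 'over_restriction': True}
# LeakyStore {'leakage': False, 'over_restriction': False}
# OverIsolatingStore {'leakage': True, 'over_restriction': False}
```

A one-sided leakage test would give the over-isolating store a clean bill of health, which is why the second direction matters.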

Finding F19–F20: Isolation Level Spectrum

We tested 5 backends along the concurrency isolation spectrum (serializable → snapshot → read_committed → no_isolation → "resurrecting"). Backends that declare read_committed but actually provide no isolation fail. Backends that claim serializable but allow dirty reads fail. The conformance suite captures the actual isolation behavior, not the declared level.
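
To give a flavor of how observed isolation can be probed, here is a small dirty-read check. It is a single-threaded sketch with a hypothetical transaction API (begin, write, rollback, read), and it covers only one anomaly; it is not how the GMP suite classifies the full spectrum.

```python
class ReadCommittedStore:
    """Buffers writes per transaction; readers see committed data only."""

    def __init__(self):
        self.committed = {}
        self.pending = {}  # txn id -> {key: value}

    def begin(self, txn):
        self.pending[txn] = {}

    def write(self, txn, key, value):
        self.pending[txn][key] = value

    def commit(self, txn):
        self.committed.update(self.pending.pop(txn))

    def rollback(self, txn):
        self.pending.pop(txn, None)

    def read(self, key):
        return self.committed.get(key)


class NoIsolationStore(ReadCommittedStore):
    """Same interface, but writes become visible immediately."""

    def write(self, txn, key, value):
        self.committed[key] = value


def allows_dirty_read(store):
    """True if a reader can observe a write that is never committed."""
    store.begin("t1")
    store.write("t1", "prefers", "dark_roast")
    observed = store.read("prefers")  # a second session reads mid-transaction
    store.rollback("t1")
    return observed is not None


print(allows_dirty_read(ReadCommittedStore()))  # False
print(allows_dirty_read(NoIsolationStore()))    # True
```

A backend that declares read_committed and returns True here has contradicted its own declaration, whatever its API says.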


The Takeaway

A declared capability is not an observed one. Type signatures, API contracts, and recall metrics cannot catch these failures. You need conformance tests that probe behavior from both sides — does it leak forbidden data, and does it withhold permitted data?

What We Built

GRAFOMEM is the result:

  • GMP v0.2.0: An open spec defining 10 frozen capability flags
  • Conformance suite: 135 traces, 61,754 turns, 17,612 queries — corpus-locked with BLAKE2b
  • Ed25519-signed reports: Tamper-evident conformance certificates
  • Reference implementations: In-memory, SQLite, and PostgreSQL — all M8 = 1.000
  • MIT-licensed: Spec, suite, and corpus are all open source

The whole thing runs with:

pip install grafomem
grafomem conformance -b your_backend:YourStore -o report.json

And if you want the conformant backend without the ops:

# GRAFOMEM Cloud — managed, GMP-verified, zero infrastructure
curl -X POST https://cloud.grafomem.com/v1/write \
  -H "Authorization: Bearer gfm_..." \
  -d '{"predicate": "prefers", "subject": "user_42", "object": "dark_roast"}'