Data Governance in Versioned Systems
May 2026
If a database remembers everything, how do you delete things?
This is the first question people ask when they encounter immutable data systems. It’s a fair question. GDPR right-to-erasure requests, accidental PII ingestion, retention policies — these aren’t edge cases. Any system that stores data for real workloads needs to answer them.
The second question is access control: who gets to see what? A database that lets you fork, branch, and time-travel is powerful, but it also means the attack surface is wider than a traditional system where yesterday’s state is already overwritten.
This note describes what works today across Datahike and its ecosystem, what’s planned, and what’s not solved yet.
Purge in Datahike
Datahike has a purge operation that removes datoms from a database’s indices — both the current and the history index of the resulting commit.
(require '[datahike.api :as d])
;; Before purge
(d/q '[:find ?n :where [?e :name ?n]] @conn)
;; => #{["Alice"] ["Bob"]}
;; Purge Alice (requires :keep-history? true)
(d/transact conn [[:db.purge/entity [:name "Alice"]]])
;; After purge — gone from current state AND from the new commit's history
(d/q '[:find ?n :where [?e :name ?n]] @conn)
;; => #{["Bob"]}
(d/q '[:find ?n :where [?e :name ?n]] (d/history @conn))
;; => #{["Bob"]}
Two things purge is not. It’s not a soft delete: it rewrites the affected path through the persistent sorted set, and the new commit’s index roots no longer reach a node containing Alice. It’s also not retroactive across the commit graph: each commit owns its own index roots, so the pre-purge commit still points at the old nodes containing the datom. Until d/gc-storage sweeps that intermediate commit, the bytes remain on disk and (d/commit-as-db conn <pre-purge-uuid>) can still see Alice. The recipe in the next section closes that loop.
How DELETE disposes of data — and how purge differs
Compare to a traditional DELETE. In PostgreSQL, DELETE marks the row’s heap tuple dead; the bytes live on in several places:
- In the heap page as a dead tuple. Plain VACUUM marks the space reusable and may defragment within the page, but never returns space to the OS or zeroes bytes. VACUUM FULL rewrites the table into a new file; the old file is unlinked but its disk blocks are not zeroed.
- In WAL segments until they’re recycled, and indefinitely in archived WAL if archive_mode=on.
- In replicas, which received the original WAL and run their own VACUUM cycle.
- In backups: base backups, pg_dump archives, PITR base + WAL.
Proving the row is gone requires forensic checks across each of those layers, and there’s no manifest of where it lived.
Purge changes one part of that picture: in the live store, deletion is explicit and structurally locatable. The persistent sorted set is a tree of content-addressed nodes; the purge transaction rewrites the affected path; the new commit’s roots no longer reach a node containing the datom; a subsequent gc-storage with an appropriate cutoff sweeps the unreachable intermediate commits. The commit graph is itself a manifest — you can walk every branch head’s ancestors and list precisely which snapshots ever held the datom.
Where Datahike does not differ from PostgreSQL: backups, storage backends with their own versioning, and replication targets. We come back to that in Backups and storage-layer history.
Garbage collection in Datahike
d/gc-storage is the sweep that physically reclaims storage. The full mechanics — cutoff dates, branch-heads-always-kept, online vs. offline GC — are in Branches as Values, Merges as Queries. The piece that matters for governance is how it composes with purge:
(require '[superv.async :refer [<?? S]])
;; 1. Purge — produces a new commit whose roots no longer reach Alice.
(d/transact conn [[:db.purge/entity [:name "Alice"]]])
;; 2. gc-storage WITHOUT a cutoff only reclaims storage on deleted branches.
;; The pre-purge commit on :db is a live intermediate commit, so its
;; tree nodes (containing Alice) survive this sweep.
(<?? S (d/gc-storage conn))
;; 3. gc-storage WITH a cutoff is what physically evicts those nodes.
;; Pick a cutoff that exceeds your longest-running reader's lifetime.
(let [seven-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                         (* 7 24 60 60 1000)))]
  (<?? S (d/gc-storage conn seven-days-ago)))
This is the recipe most people get wrong on first read: purge + cutoff-GC, not purge alone. Plain gc-storage is a safe maintenance op — it leaves intermediate commits alone — which is exactly why it doesn’t finish the job for erasure. The pre-purge commit gets swept on the first cutoff-GC pass after it ages out of the grace window. In practice that means erasure has a tail measured in your GC cadence, not in milliseconds, which most compliance regimes accept.
The cutoff has to comfortably exceed your longest-running reader’s lifetime. Datahike’s distributed readers walk storage directly without coordinating with a writer, so a snapshot vanishing mid-query is a real failure mode.
Branch heads are always kept regardless of cutoff. So the recipe assumes the post-purge state is the head you want to keep; if the datom also exists on another branch, you purge there too before sweeping.
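To make the difference between the two sweeps concrete, here is a toy model in plain Clojure. It is not Datahike's implementation: the commit map, the :at timestamps, and every function name are assumptions made for illustration, and merge commits with several parents are ignored.

```clojure
;; Toy commit graph: id -> {:parent id-or-nil, :at epoch-millis, :names set}.
;; :names stands in for the datoms reachable from that commit's index roots.
(def commits
  {:c1 {:parent nil :at 1000 :names #{"Alice" "Bob"}}  ; pre-purge commit
   :c2 {:parent :c1 :at 2000 :names #{"Bob"}}})        ; post-purge branch head

(def branch-heads #{:c2})

(defn live-commits
  "Walks parents from every branch head. Heads are always kept; an
  ancestor is kept only if it is at least as new as cutoff. A nil cutoff
  keeps every ancestor, like the no-cutoff sweep described above."
  [commits heads cutoff]
  (loop [todo (vec heads), kept #{}]
    (if-let [id (peek todo)]
      (let [{:keys [parent]} (commits id)
            keep-parent? (and parent
                              (not (kept parent))
                              (or (nil? cutoff)
                                  (>= (:at (commits parent)) cutoff)))]
        (recur (cond-> (pop todo) keep-parent? (conj parent))
               (conj kept id)))
      kept)))

(defn sweep
  "Returns the commit graph with every non-live commit removed."
  [commits heads cutoff]
  (select-keys commits (live-commits commits heads cutoff)))

(defn still-stored?
  "True if any surviving commit still reaches the given name."
  [commits n]
  (boolean (some #(contains? (:names %) n) (vals commits))))

(still-stored? (sweep commits branch-heads nil) "Alice")   ;; => true
(still-stored? (sweep commits branch-heads 1500) "Alice")  ;; => false
```

In this model the purge alone changes nothing about :c1; only the sweep with a cutoff newer than :c1 drops the last commit that reaches Alice, while the branch head survives any cutoff.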
Secondary indices
Datahike’s secondary indices — Scriptum (Lucene full-text), Proximum (HNSW vector), Stratum (columnar) — are first-class versioned state. Indices are CoW-forked on branch, persisted with each commit, and restored on connect.
For governance, the question is whether purge propagates. It does: a purge transaction routes a retraction event (-transact with :added? false) to every secondary index covering an affected attribute, the same way :db/retract does. After purging Alice on a database with a Scriptum index over :person/name and :person/bio, a full-text search for “Alice” returns nothing, a vector KNN over her embedding skips her, and any columnar aggregate excludes her row.
On storage reclamation, Stratum and Proximum are konserve-backed: d/gc-storage sweeps their unreachable blobs alongside the primary indices, following the same pattern (Stratum: columnar rewrite; Proximum: HNSW mark-delete).
Scriptum is the exception. Its Lucene segments live on the writer node’s local filesystem, outside konserve. Scriptum’s -sec-mark returns the empty set, so d/gc-storage can’t reach them, and Lucene’s own delete model is tombstones-until-segment-merge — the bytes linger inside a segment file until Lucene merges that segment away. For full erasure on Scriptum you may need to force a segment merge and make sure the writer’s filesystem snapshot policy doesn’t pin old segments.
Backups and storage-layer history
Datahike does not escape the backup and archive problem. A backup of the konserve store taken before purge+GC contains the pre-purge nodes. Storage backends with their own versioning hold them too:
- S3 object versioning, if enabled, retains every overwritten key as a prior version.
- ZFS / btrfs snapshots of the konserve directory retain the pre-purge tree.
- A git-backed konserve backend retains the pre-purge commit graph in .git even after a purge rewrite on the working tree.
- Logical replicas of the konserve store hold whatever they received.
Purge + cutoff-GC reach the live store; erasure across backups and storage-layer history is a separate procedure — identify the destinations that contain the datom, then replay the purge against them or rewrite them. PostgreSQL doesn’t ship a tool for this either; in any storage-versioned environment it’s an operational policy question, not a database feature.
The common operational workaround is crypto-shredding: encrypt sensitive values per user and “delete” by destroying the key. The bytes remain in WAL and backups but become unreadable. The technique is available to Datahike users on the same terms, with the same regulatory gray zone about whether key destruction counts as deletion.
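A minimal sketch of crypto-shredding, using only the JDK's javax.crypto classes from Clojure. The in-memory user-keys atom and the function names are hypothetical; a real deployment would hold keys in a KMS or HSM, outside the database and its backups, and would have to destroy key backups too.

```clojure
(import '(javax.crypto Cipher KeyGenerator SecretKey)
        '(javax.crypto.spec GCMParameterSpec)
        '(java.security SecureRandom))

;; Hypothetical key store: user id -> AES key.
(def user-keys (atom {}))

(defn key-for!
  "Returns the user's key, generating one on first use."
  [user-id]
  (-> (swap! user-keys update user-id
             #(or % (.generateKey (doto (KeyGenerator/getInstance "AES")
                                    (.init (int 256))))))
      (get user-id)))

(defn gcm-cipher
  "Builds an AES-GCM cipher with a 128-bit tag for the given mode."
  ^Cipher [mode ^SecretKey k ^bytes iv]
  (doto (Cipher/getInstance "AES/GCM/NoPadding")
    (.init (int mode) k (GCMParameterSpec. 128 iv))))

(defn encrypt
  "Encrypts a string under the user's key. Returns {:iv bytes, :ct bytes};
  this map is what would be stored as the attribute value."
  [user-id ^String plaintext]
  (let [iv (byte-array 12)]
    (.nextBytes (SecureRandom.) iv)   ; fresh random IV per value
    {:iv iv
     :ct (.doFinal (gcm-cipher Cipher/ENCRYPT_MODE (key-for! user-id) iv)
                   (.getBytes plaintext "UTF-8"))}))

(defn decrypt
  "Returns the plaintext, or nil once the user's key has been destroyed."
  [user-id {:keys [iv ct]}]
  (when-let [k (get @user-keys user-id)]
    (String. (.doFinal (gcm-cipher Cipher/DECRYPT_MODE k iv) ^bytes ct)
             "UTF-8")))

(defn shred!
  "Destroys the user's key; every ciphertext under it becomes unreadable."
  [user-id]
  (swap! user-keys dissoc user-id)
  nil)

(def stored (encrypt "alice" "alice@example.com"))
(decrypt "alice" stored)   ; readable while the key exists
(shred! "alice")
(decrypt "alice" stored)   ; nil: the ciphertext remains, the key is gone
```

The stored map is untouched by shred!, which is the point: old commits, backups, and storage-layer versions can keep the ciphertext without keeping the value.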
What Datahike does give you here is a manifest of where the datom lived. Because history and branches are first-class, you can walk the commit graph from every branch head and list every snapshot that ever referenced the datom. That’s not what you get in PostgreSQL, where “which copies of this row exist?” is answered by inspecting WAL archives, backups, and replicas as independent unstructured investigations.
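The manifest idea can be sketched on a toy commit graph in plain Clojure. The graph shape, the branch names, and the functions below are assumptions for illustration, not Datahike's API; a real walk would read commits and their index roots from the store.

```clojure
;; Toy commit graph: id -> {:parents [ids], :datoms set of [e a v]}.
(def graph
  {:c1 {:parents []    :datoms #{[1 :name "Alice"] [2 :name "Bob"]}}
   :c2 {:parents [:c1] :datoms #{[2 :name "Bob"]}}          ; purged on :db
   :c3 {:parents [:c1] :datoms #{[1 :name "Alice"]
                                 [2 :name "Bob"]
                                 [3 :name "Carol"]}}})      ; unpurged branch

(def heads {:db :c2, :experiment :c3})

(defn ancestors-of
  "All commits reachable from head, including head itself."
  [graph head]
  (loop [todo [head], seen #{}]
    (if-let [id (peek todo)]
      (if (seen id)
        (recur (pop todo) seen)
        (recur (into (pop todo) (:parents (graph id))) (conj seen id)))
      seen)))

(defn manifest
  "Maps each branch to the sorted commit ids whose snapshot holds datom."
  [graph heads datom]
  (into {}
        (for [[branch head] heads]
          [branch (->> (ancestors-of graph head)
                       (filter #(contains? (:datoms (graph %)) datom))
                       sort
                       vec)])))

(manifest graph heads [1 :name "Alice"])
;; => {:db [:c1], :experiment [:c1 :c3]}
```

The result reads as a checklist: :c1 still has to age out through cutoff-GC on both branches, and :experiment needs its own purge because its head :c3 still holds the datom.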
Multi-branch purge
Purge is a transaction on a single branch. If the same datom is reachable from another branch’s head, or from any of its commits inside the GC window, it lives there too. Structural sharing means the bytes are physically one copy in konserve — but the paths to reach them are independent.
The practical procedure: purge on every branch you control, delete branches you no longer need, then run cutoff-GC. Data reachable only from deleted branches gets reclaimed.
For databases with hundreds of agent-created branches, this gets expensive. Each branch’s purge is its own transaction and writes its own tree path. Optimization for the agent-fanout case is on the roadmap.
Access control
GDPR also bears on access: Article 15 gives data subjects the right to access their own data, and Article 32 requires appropriate security of processing, which in practice includes access controls. Today this works at the connection and storage level: storage credentials gate database access; the :branch config field (or SET datahike.branch over SQL via pg-datahike) pins a connection to a specific branch; commit-as-db (or SET datahike.commit_id) pins a specific snapshot for audit-only roles; read vs. write is per-connection.
Row-level access — “user X can query this branch but shouldn’t see rows where :department = "HR"” — applied consistently across current state, history, branches, and secondary-index queries is the next layer. EACL-style ReBAC is one direction we’re exploring.
Why immutable is still useful for compliance
The strong-form claim — “you can prove the data is gone” — doesn’t survive the backup and storage-history caveat. The narrower claim does, and it’s the one worth making.
In a mutable database, “which copies of this row ever existed?” has no manifest. You inspect WAL archives, replicas, backup catalogs, and page slack independently; each is a separate forensic exercise.
In a Datahike database, the copies are enumerable. History is explicit and branches are explicit. You can list — by walking the commit graph from every branch head — every snapshot that ever held the datom. Once enumerated, each can be addressed individually:
- Live store: purge + cutoff-gc-storage rewrites the index path and sweeps the intermediate commits; commit-as-db lookups for swept UUIDs fail cleanly.
- Secondary indices: covered by purge propagation on the konserve-backed ones; Scriptum needs the segment-merge / filesystem step.
- Backups, S3 versioning, ZFS snapshots, git-backed stores: each is a separate destination addressed by policy.
The advantage isn’t “fewer copies than PostgreSQL” — usually it’s more copies, deliberately. The advantage is knowing what those copies are. “Show me the data is gone” becomes a checklist instead of an investigation.
What’s not solved yet
- Multi-branch purge at scale is expensive. Hundreds of branches means hundreds of transactions, each rewriting its own tree path. Needs optimization work for agent-fanout workloads.
- Scriptum filesystem GC. Lucene segments live outside konserve; d/gc-storage can’t reach them, and Lucene’s delete model is tombstones-until-merge. Filesystem cleanup is a separate operational step.
- Row-level access control is the next layer. Connection / branch / commit-level access works today; fine-grained ReBAC across current state, history, branches, and secondary indices is the work in flight.
- No retention-policy automation. There’s no built-in “delete everything older than 7 years.” :db.history.purge/before is the primitive; the policy loop is application code.
These are real limitations and we’re working on them.
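For the retention item in the list above, the application-side policy loop could start from something like the following sketch. The helper names are hypothetical, and it assumes :db.history.purge/before takes a java.util.Date cutoff; the d/transact call is left as a comment because scheduling and connection handling are deployment-specific.

```clojure
(import '(java.time ZonedDateTime ZoneOffset)
        '(java.util Date))

(defn retention-cutoff
  "Returns the java.util.Date that lies `years` years before `now`."
  ^Date [^ZonedDateTime now years]
  (Date/from (.toInstant (.minusYears now (long years)))))

(defn retention-tx
  "Builds transaction data that purges history older than the cutoff."
  [now years]
  [[:db.history.purge/before (retention-cutoff now years)]])

;; A scheduled job would then run something like:
;; (d/transact conn (retention-tx (ZonedDateTime/now ZoneOffset/UTC) 7))
```

Passing the clock in as an argument keeps the cutoff computation testable. As with any purge, the bytes leave the live store only after a later cutoff-GC pass.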
Summary
Versioned, immutable data systems need to answer the deletion and access control questions directly. Datahike’s answer for the live store is explicit, structurally locatable deletion: purge rewrites the index path, cutoff-gc-storage sweeps the intermediate commits, secondary indices receive the retraction event, and the commit graph gives you a manifest of where the datom lived.
Where Datahike doesn’t differ from PostgreSQL is on backups, storage-layer versioning, and replicas — those still need a policy in either system. Where it does differ is in giving you a structured catalog of every copy on the live store: an enumerable checklist instead of an investigation. That’s the case for immutable data systems on compliance — not magic, but knowable.