For ML & data teams

Prove what was in a training set — and when.

Audits, customer due-diligence, the EU AI Act's data-governance expectations, and copyright or consent disputes all ask the same question of a dataset: what exactly was in it, and by when? A spreadsheet someone can edit afterward is not an answer. Orphograph binds a dataset, its license and consent documents, and its acquisition log into a single Bitcoin-anchored Merkle receipt — so the answer is cryptographic, dated, and independently checkable. The data never leaves your environment; only fingerprints are anchored.

What a dataset receipt proves

The set existed in this exact form by the anchored date. Every file — the data, the license and consent documents, the acquisition log — is hashed and folded into one Merkle root. That root is committed to a Bitcoin block whose date the chain fixes. The whole bundle therefore existed, in its present form, by the time that block was mined.

Nothing has changed since. Altering any byte of any file — relabelling an image, swapping a license, editing the log — changes the Merkle root. If the root still matches the receipt, the bundle is bit-for-bit what was anchored.

Each file is independently verifiable. A Merkle inclusion proof lets anyone confirm that one specific file belonged to the certified set without seeing — or trusting you about — any other file in it.

What it does NOT prove

It does not prove the data was lawfully sourced, licensed, or owned. A receipt records that bytes existed at a moment; it does not establish the legal right to those bytes. That is a separate question for your counsel and the relevant forum.

It does not prove the acquisition log is truthful. The log is anchored as written. The receipt fixes when the log existed and that it has not changed since — not that its contents are accurate or complete.

It does not prove authorship. A fingerprint binds a file to a date, not to an author. Provenance tooling that claims otherwise is a liability; this does not.

How it works

1 — Assemble a bundle. A folder with your dataset under data/, your license and consent documents under licenses/, and an acquisition_log describing where each source came from, when, and under what terms.

2 — Hash it locally. Every file is SHA-256'd on your machine and combined into an RFC 6962 Merkle tree. Only the manifest — relative paths and digests and the root — is sent to be anchored. The dataset itself never leaves your environment; an air-gapped mode anchors nothing at all.

3 — Receive one receipt and one certificate. The Merkle root is anchored to Bitcoin via OpenTimestamps. You get a hosted, shareable provenance certificate — summary, the honest scope above, your license and acquisition-log documents, the full file manifest, and a verifier where an auditor can drop any file and confirm in their own browser that it belongs to the set. See a live sample certificate →

Run it in your pipeline

The reference CLI anchors a bundle as a build step and emits a human-readable certificate plus a machine-verifiable manifest. Exit codes make it a CI gate.

# Anchor a dataset bundle (only fingerprints leave the machine)
python3 provenance.py anchor --bundle ./my-dataset --name "My Dataset v1"

# Air-gapped: build the receipt + certificate, anchor nothing
python3 provenance.py anchor --bundle ./my-dataset --name "My Dataset v1" --offline

# Re-verify any time — rebuild the root from disk, compare to the certificate
python3 provenance.py verify --cert out/certificate.json --bundle ./my-dataset

# Prove one file belongs to the certified set
python3 provenance.py verify --cert out/certificate.json --bundle ./my-dataset \
        --file data/images/cat_001.jpg

It reuses the same open Merkle format the public verifier checks — no proprietary format, no lock-in. If Orphograph disappeared tomorrow, the Bitcoin anchor and the manifest are all anyone needs.

Disclaimer. Orphograph provides proof-of-existence only. It is not a law firm, not a qualified electronic trust service, and not a determination of ownership, licensing, or lawful acquisition. A dataset receipt is technical evidence of integrity and time; whether it is admissible or sufficient in any particular audit or proceeding is a question for that forum's rules and your counsel. Nothing on this page is legal advice.