Regulated Genomics Archive

A VCF that is compressed, region-queryable, tamper-evident, and per-sample erasable.

This is the genomics vertical of the Regulated Archive: the same product, applied to a VCF. One sealed archive holds the variant sites and every sample's genotypes, so you can query a locus without rehydrating the cohort, prove the archive was never altered, and honour a per-participant erasure, all while the variant sites stay byte-stable and queryable.

1 sealed VCF archive: compressed + queryable + provable + erasable
CHROM:POS region query: reads only the blocks it touches
Per-sample GDPR Art.17 erasure: one individual's genotypes, gone
Byte-stable variant sites: unchanged after an erasure; audit chain intact

Four properties, one archive

Compressed

A population VCF becomes one sealed archive, with variant sites and per-sample genotype matrices packed together, far smaller than the raw file. It also carries the next three properties, which neither .vcf.gz nor BCF provides.

Region-queryable

Ask for variants in a CHROM:POS range and the archive returns them by reading only the blocks that range touches — no full decompress, no rehydrating the whole cohort to look at one locus.
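To make the "reads only the blocks that range touches" idea concrete, here is a minimal sketch of a position-sorted block index. The `Block` record, the `blocks_for_region` helper, and the block boundaries are all hypothetical illustrations, not the archive's actual on-disk format; the sketch assumes each chromosome's blocks are sorted by position and do not overlap.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Block:
    """Hypothetical index entry: one compressed block of variant sites."""
    chrom: str
    start: int   # first POS stored in the block
    end: int     # last POS stored in the block (inclusive)
    offset: int  # assumed byte offset of the block inside the archive


def blocks_for_region(index: Dict[str, List[Block]], chrom: str,
                      start: int, end: int) -> List[Block]:
    """Return only the blocks that overlap [start, end] on chrom.

    Assumes index[chrom] is sorted by start and blocks do not overlap,
    so both the start and end columns are sorted and can be bisected.
    """
    blocks = index.get(chrom, [])
    starts = [b.start for b in blocks]
    ends = [b.end for b in blocks]
    lo = bisect_left(ends, start)    # first block ending at or after the range start
    hi = bisect_right(starts, end)   # one past the last block starting at or before the range end
    return blocks[lo:hi]


# Illustrative index: five 80 kb blocks on chr7 (made-up boundaries).
index = {
    "chr7": [
        Block("chr7", 117_400_000 + i * 80_000,
              117_400_000 + (i + 1) * 80_000 - 1, offset=i * 4096)
        for i in range(5)
    ]
}

touched = blocks_for_region(index, "chr7", 117_480_000, 117_670_000)
print(len(touched), "of", len(index["chr7"]), "blocks read")
```

Only the blocks returned by the lookup would need to be decompressed; blocks entirely before or after the range are never opened.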

Tamper-evident

A SHA-256 manifest binds the variant sites and per-sample genotype keys. Flip a single byte anywhere and verification fails — the archive is provably the original cohort, or provably not.
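As a rough illustration of how a SHA-256 manifest makes a single flipped byte detectable, the sketch below hashes each named section and then hashes the list of digests into one root. The section names, the JSON layout, and the `build_manifest` / `verify` helpers are assumptions made for this example; the archive's real manifest format is not described here.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(sections: Dict[str, bytes]) -> dict:
    """Hash every section, then hash the sorted digest table into a root."""
    entries = {name: sha256_hex(blob) for name, blob in sections.items()}
    root = sha256_hex(json.dumps(entries, sort_keys=True).encode("utf-8"))
    return {"entries": entries, "root": root}


def verify(sections: Dict[str, bytes], manifest: dict) -> bool:
    """Recompute the manifest and compare it with the sealed one."""
    return build_manifest(sections) == manifest


# Hypothetical archive contents: one variant-site section, two sample sections.
sections = {
    "sites": b"chr7\t117480025\tG\tA\n",
    "sample/NA12878": b"\x01\x00\x02",
    "sample/NA12891": b"\x00\x00\x01",
}
sealed = build_manifest(sections)
print("integrity:", "PASS" if verify(sections, sealed) else "FAIL")

# Flip one byte in one section: verification now fails.
tampered = dict(sections)
tampered["sites"] = b"chr7\t117480025\tG\tT\n"
print("after tamper:", "PASS" if verify(tampered, sealed) else "FAIL")
```

A production design would typically also sign the root so the manifest itself cannot be silently regenerated; that step is left out of this sketch.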

Per-sample erasable

Each sample's genotypes are encrypted under that individual's own key. A right-to-erasure request destroys that one key; the variant sites stay byte-stable and queryable, and every other sample is untouched.

Forget the participant, keep the cohort useful

The part that makes this legally and scientifically real: erasing a sample destroys that individual's genotype contribution, but the variant sites stay byte-stable and queryable and every other sample is intact. In our validation on a real 1000G-style archive, a region query returned the variants in a CHROM:POS range while reading only the blocks that range touched; erasing a sample destroyed that individual's genotype key while the variant sites stayed byte-for-byte identical and the other samples queried unchanged.

GDPR Art.17 cryptographic erasure (key destruction): rendering a subject's data permanently unrecoverable. Erasure is performed by destroying the subject's unique encryption key; the ciphertext remains byte-stable so tamper-evidence/audit chains stay intact, and the data is cryptographically irrecoverable. This is the recognised NIST SP 800-88r1 “Cryptographic Erase” method, an established, regulator-recognised approach, not a new cryptographic claim.
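The key-destruction idea can be sketched in a few lines. Everything in this example is hypothetical: the `ToyArchive` class, its in-memory key table, and especially the cipher, which is a SHA-256 counter keystream used only so the sketch runs on the standard library. It is a stand-in for illustration, not a recommendation; a real system would use an authenticated cipher such as AES-GCM and destroy keys inside a key-management service.

```python
import hashlib
import secrets
from typing import Dict


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (illustration only): XOR data with SHA-256(key || counter)."""
    stream = bytearray()
    for counter in range((len(data) + 31) // 32):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyArchive:
    """Hypothetical model: plaintext variant sites, per-sample encrypted genotypes."""

    def __init__(self, sites: bytes, genotypes: Dict[str, bytes]):
        self.sites = sites
        # One independent random key per sample.
        self.keys = {s: secrets.token_bytes(32) for s in genotypes}
        self.ciphertext = {
            s: keystream_xor(self.keys[s], g) for s, g in genotypes.items()
        }

    def read(self, sample: str) -> bytes:
        key = self.keys.get(sample)
        if key is None:
            raise KeyError(f"genotype key for {sample} has been destroyed")
        return keystream_xor(key, self.ciphertext[sample])

    def erase(self, sample: str) -> None:
        """Cryptographic erase: drop the key, leave every stored byte in place."""
        del self.keys[sample]


archive = ToyArchive(
    sites=b"chr7\t117480025\tG\tA\n",
    genotypes={"NA12878": b"0/1", "NA12891": b"0/0"},
)
sites_before = archive.sites
cipher_before = dict(archive.ciphertext)

archive.erase("NA12878")

print(archive.sites == sites_before)        # variant sites untouched
print(archive.ciphertext == cipher_before)  # ciphertext still byte-stable
print(archive.read("NA12891"))              # other samples still readable
```

Because the erased sample's ciphertext is left in place, any hash computed over the stored bytes is unchanged by the erasure, which is the property the paragraph above relies on for keeping audit chains intact.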

One command surface

at1 regulated build cohort.vcf --subject-field sample_id --out arc/
                                        # variant sites + per-sample genotypes, one sealed archive
at1 regulated query  arc/ --region chr7:117480000:117670000
                                        # variants in the range, reading only touched blocks
at1 regulated verify arc/               # -> integrity: PASS
at1 regulated erase  arc/ NA12878 --signing-key issuer.key --out-cert cert.json
                                        # sample's genotype key destroyed; variant sites unchanged
at1 regulated verify arc/               # -> still PASS (manifest re-sealed)

Honest scope

This is the Regulated Archive bundle applied to VCF. The region-query advantage applies to selective queries over position-clustered blocks (it reads only what a range touches); a whole-genome scan reads everything, same as anyone. Per-sample encryption adds a fixed per-participant overhead, so the archive is at its best when the variant-site payload outweighs the per-sample keys: cohorts with many variants across many samples. For the raw VCF compression numbers see Genomics; for how key-destruction erasure works see Right to erasure.