Genomics (VCF): method & results
AT-1 has a dedicated, lossless, genotype-aware codec for VCF (Variant Call Format). This page documents the measured results, exactly how they were produced, the codec mechanism, and, importantly, the boundary where the advantage narrows. The numbers below are reproducible; the comparison baselines are the tools a genomics team actually uses (bcftools BCF and bgzipped VCF).
Verified results
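The population row was measured on a real 1000 Genomes chr22 slice (48.26 MB of VCF) and reconfirmed with bcftools 1.17 / htslib 1.17; the clinical row uses a synthetic complex-FORMAT VCF (50 samples × 8,000 variants), described under How to reproduce: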
| Dataset | AT-1 vs BCF | vs .vcf.gz | vs raw .vcf | Lossless |
|---|---|---|---|---|
| Population (1000G chr22, sparse GT) | 2.46× | 2.96× | 209× | byte-exact ✓ |
| Clinical (complex FORMAT GT:AD:DP:GQ:PL) | 1.30× | — | 3.6× | byte-exact ✓ |
Raw bytes (population): AT-1 230,766 · BCF 566,878 · .vcf.gz 683,236 · original 48,261,410. The 209× is original → .at1; the number that matters against the genomics-standard tools is 2.46× vs BCF.
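The table ratios follow directly from these byte counts. As a quick sanity check, the short Python sketch below recomputes them; the dictionary keys and the helper name `ratio_vs_at1` are illustrative labels, not part of any AT-1 tooling.

```python
# Byte counts for the population dataset, copied from the line above.
sizes = {
    "at1": 230_766,
    "bcf": 566_878,
    "vcf_gz": 683_236,
    "raw_vcf": 48_261_410,
}


def ratio_vs_at1(name: str) -> float:
    """How many times larger the named file is than the .at1 output."""
    return sizes[name] / sizes["at1"]


for name in ("bcf", "vcf_gz", "raw_vcf"):
    print(f"{name}: {ratio_vs_at1(name):.2f}x")
```

Rounded, these reproduce the 2.46×, 2.96× and 209× figures in the population row.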
What the codec does
A VCF’s bytes are dominated by the genotype matrix — one genotype per sample, per variant. In a population/cohort VCF that matrix is overwhelmingly the reference genotype (0|0): at most sites only a handful of the thousands of samples carry a variant. AT-1’s VCF codec exploits exactly that sparsity — per variant it stores only the (sample-index delta, genotype-token) pairs that differ from the global reference genotype, with rarer genotypes kept in a small per-file vocabulary. The fixed VCF columns and headers are preserved verbatim, so reconstruction is byte-for-byte exact. This is the standard sparse-matrix insight applied to the part of the file that actually holds the bytes — which is why a general columnar transform (or LZMA on the text) can’t match it.
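To make the sparse idea concrete, here is a minimal Python sketch of storing one variant row as (sample-index delta, genotype-token) pairs and reconstructing it. It is an illustration of the technique only, not AT-1's on-disk format: the function names, the `REF` constant, the list-based vocabulary and the 2,000-sample row are all assumptions, and it works on already-split genotype strings rather than raw file bytes.

```python
from typing import List, Tuple

REF = "0|0"  # assumed reference genotype for this sketch


def encode_row(genotypes: List[str], vocab: List[str]) -> List[Tuple[int, int]]:
    """Keep only non-reference genotypes as (index delta, vocab id) pairs."""
    pairs = []
    prev = -1  # index of the previous non-reference sample
    for i, gt in enumerate(genotypes):
        if gt == REF:
            continue
        if gt not in vocab:
            vocab.append(gt)  # grow the per-file vocabulary on first sight
        pairs.append((i - prev, vocab.index(gt)))
        prev = i
    return pairs


def decode_row(pairs: List[Tuple[int, int]], vocab: List[str], n_samples: int) -> List[str]:
    """Rebuild the full genotype row from the sparse pairs."""
    row = [REF] * n_samples
    pos = -1
    for delta, token in pairs:
        pos += delta
        row[pos] = vocab[token]
    return row


# Hypothetical row: 2,000 samples, three of which carry a variant.
row = [REF] * 2000
row[17], row[903], row[904] = "0|1", "1|0", "1|1"

vocab: List[str] = []
pairs = encode_row(row, vocab)
print(len(pairs), "pairs instead of", len(row), "genotypes")
```

Deltas are used instead of absolute indices because they stay small when carriers cluster, which tends to help a downstream entropy coder. When most samples differ from the reference, the pair list approaches the length of the row and the advantage disappears.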
The honest boundary
The win comes from genotype sparsity. When the per-sample FORMAT is rich and dense (e.g. clinical single- or small-cohort VCFs carrying GT:AD:DP:GQ:PL with distinct depths/likelihoods per sample), the sparsity assumption weakens. AT-1 detects this and falls back to its general lossless backend rather than degrading; it still beats BCF on the complex-FORMAT case above, but the margin narrows from 2.46× to about 1.30×. We state this openly: AT-1’s largest VCF wins are on population/biobank genotype data; on dense clinical VCFs it remains lossless and competitive, not category-leading.
How to reproduce
    # AT-1 (verified byte-for-byte on encode)
    at1 compress vcf chr22_slice.vcf chr22.at1

    # Baselines (bcftools 1.17 / htslib 1.17)
    bcftools view -O b chr22_slice.vcf -o chr22.bcf     # native BCF
    bcftools view -O z chr22_slice.vcf -o chr22.vcf.gz  # bgzipped VCF

    # Compare sizes; AT-1 vs BCF = size(chr22.bcf) / size(chr22.at1)
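The size comparison and the byte-exact claim can both be checked independently of any one tool. The Python sketch below is one way to do that with the standard library only; the helper names are illustrative, and it assumes you have already produced a restored VCF with whatever decode step AT-1 provides (the decode command is not shown on this page).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large VCFs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def size_ratio(baseline, candidate) -> float:
    """size(baseline) / size(candidate), e.g. BCF size over .at1 size."""
    return Path(baseline).stat().st_size / Path(candidate).stat().st_size


def is_byte_exact(original, restored) -> bool:
    """True when both files have the same length and the same SHA-256."""
    return (
        Path(original).stat().st_size == Path(restored).stat().st_size
        and sha256_of(original) == sha256_of(restored)
    )
```

For example, `size_ratio("chr22.bcf", "chr22.at1")` gives the AT-1 vs BCF figure, and `is_byte_exact("chr22_slice.vcf", "chr22_restored.vcf")` checks losslessness, where `chr22_restored.vcf` is a hypothetical name for the decoded output.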
Dataset: a real 1000 Genomes chr22 slice. The clinical-boundary figure uses a synthetic complex-FORMAT VCF (50 samples × 8,000 variants, GT:AD:DP:GQ:PL) to characterise the FORMAT-dense regime; validation on production clinical VCFs (GATK/DRAGEN output) is the natural next step for a customer evaluation.
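The generator behind the synthetic file is not described here, so the following Python sketch only shows what a FORMAT-dense test VCF of that shape could look like. Everything in it is an assumption made for illustration: the genotype mix, depth range, sample names, starting position and the function names.

```python
import random
from typing import Iterator

BASES = "ACGT"
GENOTYPES = ("0/0", "0/1", "1/1")


def sample_field(rng: random.Random) -> str:
    """One GT:AD:DP:GQ:PL entry with per-sample depths and likelihoods."""
    gt_idx = rng.choice((0, 0, 1, 2))  # hypothetical genotype mix
    depth = rng.randint(10, 60)
    expected_alt = round(depth * (0.0, 0.5, 1.0)[gt_idx])
    alt = min(depth, max(0, expected_alt + rng.randint(-2, 2)))
    ref = depth - alt  # AD values always sum to DP
    pl = [rng.randint(20, 255) for _ in range(3)]
    pl[gt_idx] = 0  # the called genotype gets PL 0
    gq = min(99, min(p for i, p in enumerate(pl) if i != gt_idx))
    return f"{GENOTYPES[gt_idx]}:{ref},{alt}:{depth}:{gq}:{pl[0]},{pl[1]},{pl[2]}"


def synthetic_vcf(n_samples: int = 50, n_variants: int = 8000, seed: int = 0) -> Iterator[str]:
    """Yield the lines of a small biallelic, FORMAT-dense VCF."""
    rng = random.Random(seed)
    yield "##fileformat=VCFv4.2"
    yield "##contig=<ID=22>"
    yield '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
    yield '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">'
    yield '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">'
    yield '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">'
    yield '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled likelihoods">'
    names = "\t".join(f"S{i:03d}" for i in range(n_samples))
    yield f"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t{names}"
    pos = 16_000_000  # arbitrary starting coordinate
    for _ in range(n_variants):
        pos += rng.randint(1, 500)
        ref = rng.choice(BASES)
        alt = rng.choice([b for b in BASES if b != ref])
        fields = "\t".join(sample_field(rng) for _ in range(n_samples))
        yield f"22\t{pos}\t.\t{ref}\t{alt}\t.\tPASS\t.\tGT:AD:DP:GQ:PL\t{fields}"
```

Joining the yielded lines with newlines and writing them to a file gives an input for the commands above. Because nearly every sample field is distinct, such a file sits in the dense regime where sparse genotype encoding has little to exploit.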
Region queries (by chromosome/position) read only the touched blocks via zone-map pushdown — so a single compressed .at1 serves both long-term archival and selective lookups. See also the genomics overview.
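As a rough illustration of zone-map pushdown, the Python sketch below keeps a (chromosome, min position, max position) summary per block and selects only the blocks that overlap a query region. The `BlockStats` structure, the function name and the block boundaries are hypothetical; AT-1's actual block metadata is not specified on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockStats:
    """Per-block summary: which chromosome and position range it covers."""
    block_id: int
    chrom: str
    min_pos: int
    max_pos: int


def blocks_to_read(zone_map: List[BlockStats], chrom: str, start: int, end: int) -> List[int]:
    """Ids of blocks whose [min_pos, max_pos] overlaps the query [start, end]."""
    return [
        b.block_id
        for b in zone_map
        if b.chrom == chrom and b.min_pos <= end and b.max_pos >= start
    ]


# Hypothetical zone map: three chr22 blocks and one chr21 block.
zone_map = [
    BlockStats(0, "22", 16_050_000, 16_999_999),
    BlockStats(1, "22", 17_000_000, 17_999_999),
    BlockStats(2, "22", 18_000_000, 18_999_999),
    BlockStats(3, "21", 17_000_000, 17_999_999),
]

touched = blocks_to_read(zone_map, "22", 17_500_000, 18_200_000)
print(touched)
```

Only the blocks returned here would need to be decompressed; the others are skipped on the strength of their summaries alone.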