Genomics (VCF): method & results

AT-1 has a dedicated, lossless, genotype-aware codec for VCF (the variant-call format). This page documents the measured results, exactly how they were produced, the codec mechanism, and — importantly — the boundary where the advantage narrows. The numbers below are reproducible; the comparison baselines are the tools a genomics team actually uses (bcftools BCF and bgzipped VCF).

Verified results

On a real 1000 Genomes chr22 slice (48.26 MB of VCF), reconfirmed with bcftools 1.17 / htslib 1.17:

Dataset | AT-1 vs BCF | vs .vcf.gz | vs raw .vcf | Lossless
Population (1000G chr22, sparse GT) | 2.46× | 2.96× | 209× | byte-exact ✓
Clinical (complex FORMAT GT:AD:DP:GQ:PL) | 1.30× | 3.6× | byte-exact ✓

Raw bytes (population): AT-1 230,766 · BCF 566,878 · .vcf.gz 683,236 · original 48,261,410. The 209× is original → .at1; the number that matters against the genomics-standard tools is 2.46× vs BCF.
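
As a quick arithmetic check, the population ratios can be recomputed from the raw byte counts quoted above. This is only a sketch: the dictionary keys and the function name are illustrative, and the byte counts are copied from this page rather than measured here.

```python
# Byte counts copied from the population figures quoted on this page.
SIZES = {
    "at1": 230_766,
    "bcf": 566_878,
    "vcf.gz": 683_236,
    "raw": 48_261_410,
}


def ratio_vs_at1(name: str) -> float:
    """How many times larger the named file is than the .at1 output."""
    return SIZES[name] / SIZES["at1"]


if __name__ == "__main__":
    for name in ("bcf", "vcf.gz", "raw"):
        print(f"{name}: {ratio_vs_at1(name):.2f}x")
```

Rounded to the precision used in the results, these come out to 2.46, 2.96 and 209.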

What the codec does

A VCF’s bytes are dominated by the genotype matrix — one genotype per sample, per variant. In a population/cohort VCF that matrix is overwhelmingly the reference genotype (0|0): at most sites only a handful of the thousands of samples carry a variant. AT-1’s VCF codec exploits exactly that sparsity — per variant it stores only the (sample-index delta, genotype-token) pairs that differ from the global reference genotype, with rarer genotypes kept in a small per-file vocabulary. The fixed VCF columns and headers are preserved verbatim, so reconstruction is byte-for-byte exact. This is the standard sparse-matrix insight applied to the part of the file that actually holds the bytes — which is why a general columnar transform (or LZMA on the text) can’t match it.
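
The sparse idea described above can be illustrated with a minimal sketch. This is not AT-1's actual on-disk format, which this page does not specify. The function names, the in-memory list vocabulary, and the choice of 0|0 as the reference genotype are assumptions made for illustration.

```python
from typing import List, Tuple

REF = "0|0"  # assumed file-wide reference genotype (illustrative)


def encode_row(genotypes: List[str], vocab: List[str]) -> List[Tuple[int, int]]:
    """Encode one variant row as (sample-index delta, token id) pairs.

    Only samples whose genotype differs from REF are stored. The delta is
    measured from the previous stored sample index (starting from 0).
    New genotype strings are appended to the shared vocabulary.
    """
    pairs = []
    prev = 0
    for idx, gt in enumerate(genotypes):
        if gt == REF:
            continue
        if gt not in vocab:
            vocab.append(gt)
        pairs.append((idx - prev, vocab.index(gt)))
        prev = idx
    return pairs


def decode_row(pairs: List[Tuple[int, int]], vocab: List[str], n_samples: int) -> List[str]:
    """Rebuild the full genotype row from the sparse pairs."""
    row = [REF] * n_samples
    idx = 0
    for delta, token in pairs:
        idx += delta
        row[idx] = vocab[token]
    return row


if __name__ == "__main__":
    row = [REF] * 8
    row[2], row[5], row[6] = "0|1", "1|1", "0|1"
    vocab: List[str] = []
    pairs = encode_row(row, vocab)
    print(pairs, vocab)
    assert decode_row(pairs, vocab, len(row)) == row
```

In this toy row, eight genotypes reduce to three pairs plus a two-entry vocabulary. The saving grows with the number of samples as long as most of them carry the reference genotype.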

The honest boundary

The win comes from genotype sparsity. When the per-sample FORMAT is rich and dense (e.g. clinical single- or small-cohort VCFs carrying GT:AD:DP:GQ:PL with distinct depths/likelihoods per sample), the sparsity assumption weakens. AT-1 detects this and falls back to its general lossless backend rather than degrading; it still beats BCF on the complex-FORMAT case above, but the margin narrows from 2.46× to about 1.30×. We state this openly: AT-1’s largest VCF wins are on population/biobank genotype data; on dense clinical VCFs it remains lossless and competitive, not category-leading.

How to reproduce

# AT-1 (verified byte-for-byte on encode)
at1 compress vcf chr22_slice.vcf chr22.at1

# Baselines (bcftools 1.17 / htslib 1.17)
bcftools view -O b chr22_slice.vcf -o chr22.bcf     # native BCF
bcftools view -O z chr22_slice.vcf -o chr22.vcf.gz  # bgzipped VCF

# Compare sizes; AT-1 vs BCF = size(chr22.bcf) / size(chr22.at1)
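
The size comparison in the last comment can be scripted. The following is a small sketch that assumes the output filenames used in the commands above exist in the current directory. The helper name `size_ratio` is illustrative and not part of any AT-1 tooling.

```python
import os


def size_ratio(baseline_path: str, at1_path: str) -> float:
    """Return size(baseline) / size(at1); above 1 means the .at1 file is smaller."""
    return os.path.getsize(baseline_path) / os.path.getsize(at1_path)


if __name__ == "__main__":
    at1 = "chr22.at1"
    for baseline in ("chr22.bcf", "chr22.vcf.gz", "chr22_slice.vcf"):
        # Skip quietly if the files have not been produced yet.
        if os.path.exists(baseline) and os.path.exists(at1):
            print(f"{baseline}: {size_ratio(baseline, at1):.2f}x")
```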

Dataset: a real 1000 Genomes chr22 slice. The clinical-boundary figure uses a synthetic complex-FORMAT VCF (50 samples × 8,000 variants, GT:AD:DP:GQ:PL) to characterise the FORMAT-dense regime; validation on production clinical VCFs (GATK/DRAGEN output) is the natural next step for a customer evaluation.

Region queries (by chromosome/position) read only the touched blocks via zone-map pushdown — so a single compressed .at1 serves both long-term archival and selective lookups. See also the genomics overview.
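
To make the zone-map idea concrete, here is a generic sketch of block pruning by position range. The .at1 block layout is not described on this page, so the `BlockZone` fields, the function name, and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockZone:
    """Per-block summary: chromosome plus min/max position stored in the block."""
    chrom: str
    min_pos: int
    max_pos: int


def blocks_to_read(zones: List[BlockZone], chrom: str, start: int, end: int) -> List[int]:
    """Return indices of blocks whose position range overlaps [start, end]."""
    return [
        i
        for i, z in enumerate(zones)
        if z.chrom == chrom and z.min_pos <= end and z.max_pos >= start
    ]


if __name__ == "__main__":
    zones = [
        BlockZone("22", 16_050_000, 16_999_999),
        BlockZone("22", 17_000_000, 17_999_999),
        BlockZone("22", 18_000_000, 18_999_999),
    ]
    # Only the second and third blocks overlap this query region.
    print(blocks_to_read(zones, "22", 17_500_000, 18_100_000))
```

Blocks whose summaries do not overlap the query are never decompressed, which is what lets a selective lookup avoid reading the whole archive.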