AT-1 and databases — what works, what doesn't, vs existing compression

The honest framing first. AT-1 is a file / byte-stream codec, not a live in-database storage engine. It does notreplace a database's internal compression (InnoDB page compression, SQL Server columnstore, Postgres TOAST) and it does not make a live SELECT insidethe database faster — the DB still runs queries on its own storage. AT-1's fit is the export / backup / archive / data-lake tier and DB-adjacent files (logs, WAL/oplog/AOF archives, CDC streams, analytics exports).

Within that scope, two questions matter: (1) does it shrink the file (vs gzip/zstd/xz), and (2) is the result queryable in place (only the qcolumnar codec is — and only for tabular/columnar data).

Measured: real DB artifacts vs gzip / zstd / xz

Generated from a real 12k-row table (db_compat_exp.py). Bytes; "AT-1×" = original ÷ AT-1.

Artifact (how a DB produces it)	original	gzip-9	zstd-19	xz-9	AT-1	AT-1×	queryable in AT-1
CSV export (COPY / BCP / --tab / mongoexport --csv)	1,809,573	195,111	135,567	142,748	112,943	16.0×	yes ¹
NDJSON, homogeneous (mongoexport of one collection, ES _source)	1,194,302	119,946	72,347	62,264	43,019	27.8×	yes ²
NDJSON, heterogeneous (mixed-schema event firehose)	23,897,105	3,417,674	1,460,392	1,414,684	1,414,690	16.9×	partial ²
SQL dump, INSERTs (pg_dump --inserts, mysqldump)	4,246,279	238,392	151,493	173,488	173,494	24.5×	no
SQLite .db (binary pages)	1,949,696	262,444	205,593	190,164	190,170	10.3×	no
Already-compressed backup (pg_dump -Fc, .bak, *.gz)	238,401	238,487	238,416	238,472	238,478	1.0×	no

¹ CSV stored with the qcolumnar codec is queryable in place (predicate/projection pushdown, exact rows back). Auto-detect picks qcolumnar for numeric/time-series tables; force it for any table you want to query.
² NDJSON is now queryable via the qjson codec (id 8): it flattens each document into columns and answers predicate/projection queries reading only the blocks it touches, while reconstructing the exact original bytes. Homogeneous exports (one collection / consistent schema — the typical case) compress best (27.8×, beating gzip/zstd/xz) and are fully queryable. Heterogeneous mixed-schema NDJSON only columnarizes its modal shape (the rest is stored verbatim), so query coverage is partial and ratio falls toward xz — for ratio-only archival of mixed NDJSON, the non-queryable json codec is a bit smaller (~19×).

How to read it:

Big win on text/structured exports (CSV 16×, NDJSON 19.3×) — AT-1 beats gzip, zstd, and xz, and CSV stays queryable.
≈ xz on SQL-INSERT dumps and binary DB files (no structure it can exploit, so it falls back to its xz baseline; zstd-19 can occasionally edge it on pure SQL text).
≈ nothing on already-compressed backups — AT-1 is non-inferior(never makes it worse beyond ~0.03% container overhead), but don't bother compressing an already-gzipped .gz/-Fc/.bak. Point AT-1 at the uncompressed export instead.

Per-engine cheat-sheet

System	Best artifact to feed AT-1	Result
PostgreSQL	`COPY … TO CSV`	16×+, queryable. (`pg_dump` plain SQL ≈ xz; `-Fc` already compressed → ~1×.)
MySQL / MariaDB	`mysqldump --tab`	queryable. (default SQL-INSERT dump ≈ xz.)
SQL Server	`bcp / SELECT … INTO CSV`	queryable. (`.bak` is usually already compressed → ~1×.)
Oracle	SQL*Loader / CSV unload	queryable. (Data Pump `.dmp` binary → modest.)
SQLite	the `.db` file, or `.dump` SQL	10× (xz-level); the `.db` is already self-queryable, so AT-1 is for archival.
MongoDB	`mongoexport NDJSON (one collection)`	27.8× and queryable via `qjson`. (`mongodump` BSON binary → modest.)
Elasticsearch / OpenSearch	`_source NDJSON / scroll export`	queryable via `qjson` (homogeneous index → best ratio).
Redis	AOF (text command log) → log codec; RDB snapshot → modest	AOF compresses well; RDB is already compact. (In-memory store, so these are the persistence files.)
Memcached	—	Not applicable: purely in-memory, nothing persisted at rest.
Neo4j	CSV export of nodes/relationships	queryable. (native store / dumps are binary → modest.)
Prisma	n/a (it's an ORM, not a DB)	Feed the underlying DB's exports (Postgres/MySQL/SQLite). Migration files are SQL text ≈ xz.

When AT-1 is the right tool for a database

✅ Yes: archiving/cold-tiering exports (CSV/TSV/JSON/NDJSON), keeping queryable archives of both tabular (CSV → qcolumnar) and document (NDJSON → qjson) exports — the unique bit: query the cold copy without restoring the DB — compressing DB-adjacent logs (WAL/oplog/AOF/audit/query logs), and data-lake files.

⚠️ Marginal: plain SQL-INSERT dumps and binary DB files (≈ xz; use zstd if speed matters more).

❌ No: as a live storage/compression engine inside the database, or on artifacts that are already compressed (compress the raw export instead).

The differentiator restated for DB buyers

For the formats databases export for analytics and archival — CSV (qcolumnar) and JSON/NDJSON (qjson) — AT-1 not only beats general compressors, it lets you keep a single queryable, byte-exact archive: query the cold copy in place (reading a fraction of the file) without rehydrating it into a database, and still reconstruct the exact original export when you need it. This now spans relational exports and document-store (MongoDB/Elasticsearch) exports. Parquet can query but can't give back the exact bytes; gzip/zstd/xz keep the bytes but can't query.

Reproduce: python db_compat_exp.py.