AT-1 and databases — what works, what doesn't, vs existing compression

The honest framing first. AT-1 is a file / byte-stream codec, not a live in-database storage engine. It does notreplace a database's internal compression (InnoDB page compression, SQL Server columnstore, Postgres TOAST) and it does not make a live SELECT insidethe database faster — the DB still runs queries on its own storage. AT-1's fit is the export / backup / archive / data-lake tier and DB-adjacent files (logs, WAL/oplog/AOF archives, CDC streams, analytics exports).

Within that scope, two questions matter: (1) does it shrink the file (vs gzip/zstd/xz), and (2) is the result queryable in place (only the qcolumnar codec is — and only for tabular/columnar data).


Measured: real DB artifacts vs gzip / zstd / xz

Generated from a real 12k-row table (db_compat_exp.py). Bytes; "AT-1×" = original ÷ AT-1.

Artifact (how a DB produces it)originalgzip-9zstd-19xz-9AT-1AT-1×queryable in AT-1
CSV export (COPY / BCP / --tab / mongoexport --csv)1,809,573195,111135,567142,748112,94316.0×yes ¹
NDJSON, homogeneous (mongoexport of one collection, ES _source)1,194,302119,94672,34762,26443,01927.8×yes ²
NDJSON, heterogeneous (mixed-schema event firehose)23,897,1053,417,6741,460,3921,414,6841,414,69016.9×partial ²
SQL dump, INSERTs (pg_dump --inserts, mysqldump)4,246,279238,392151,493173,488173,49424.5×no
SQLite .db (binary pages)1,949,696262,444205,593190,164190,17010.3×no
Already-compressed backup (pg_dump -Fc, .bak, *.gz)238,401238,487238,416238,472238,4781.0×no

¹ CSV stored with the qcolumnar codec is queryable in place (predicate/projection pushdown, exact rows back). Auto-detect picks qcolumnar for numeric/time-series tables; force it for any table you want to query.
² NDJSON is now queryable via the qjson codec (id 8): it flattens each document into columns and answers predicate/projection queries reading only the blocks it touches, while reconstructing the exact original bytes. Homogeneous exports (one collection / consistent schema — the typical case) compress best (27.8×, beating gzip/zstd/xz) and are fully queryable. Heterogeneous mixed-schema NDJSON only columnarizes its modal shape (the rest is stored verbatim), so query coverage is partial and ratio falls toward xz — for ratio-only archival of mixed NDJSON, the non-queryable json codec is a bit smaller (~19×).

How to read it:

  • Big win on text/structured exports (CSV 16×, NDJSON 19.3×) — AT-1 beats gzip, zstd, and xz, and CSV stays queryable.
  • ≈ xz on SQL-INSERT dumps and binary DB files (no structure it can exploit, so it falls back to its xz baseline; zstd-19 can occasionally edge it on pure SQL text).
  • ≈ nothing on already-compressed backups — AT-1 is non-inferior(never makes it worse beyond ~0.03% container overhead), but don't bother compressing an already-gzipped .gz/-Fc/.bak. Point AT-1 at the uncompressed export instead.

Per-engine cheat-sheet

SystemBest artifact to feed AT-1Result
PostgreSQLCOPY … TO CSV16×+, queryable. (pg_dump plain SQL ≈ xz; -Fc already compressed → ~1×.)
MySQL / MariaDBmysqldump --tabqueryable. (default SQL-INSERT dump ≈ xz.)
SQL Serverbcp / SELECT … INTO CSVqueryable. (.bak is usually already compressed → ~1×.)
OracleSQL*Loader / CSV unloadqueryable. (Data Pump .dmp binary → modest.)
SQLitethe .db file, or .dump SQL10× (xz-level); the .db is already self-queryable, so AT-1 is for archival.
MongoDBmongoexport NDJSON (one collection)27.8× and queryable via qjson. (mongodump BSON binary → modest.)
Elasticsearch / OpenSearch_source NDJSON / scroll exportqueryable via qjson (homogeneous index → best ratio).
RedisAOF (text command log) → log codec; RDB snapshot → modestAOF compresses well; RDB is already compact. (In-memory store, so these are the persistence files.)
MemcachedNot applicable: purely in-memory, nothing persisted at rest.
Neo4jCSV export of nodes/relationshipsqueryable. (native store / dumps are binary → modest.)
Prisman/a (it's an ORM, not a DB)Feed the underlying DB's exports (Postgres/MySQL/SQLite). Migration files are SQL text ≈ xz.

When AT-1 is the right tool for a database

Yes: archiving/cold-tiering exports (CSV/TSV/JSON/NDJSON), keeping queryable archives of both tabular (CSV → qcolumnar) and document (NDJSON → qjson) exports — the unique bit: query the cold copy without restoring the DB — compressing DB-adjacent logs (WAL/oplog/AOF/audit/query logs), and data-lake files.

⚠️ Marginal: plain SQL-INSERT dumps and binary DB files (≈ xz; use zstd if speed matters more).

No: as a live storage/compression engine inside the database, or on artifacts that are already compressed (compress the raw export instead).


The differentiator restated for DB buyers

For the formats databases export for analytics and archival — CSV (qcolumnar) and JSON/NDJSON (qjson) — AT-1 not only beats general compressors, it lets you keep a single queryable, byte-exact archive: query the cold copy in place (reading a fraction of the file) without rehydrating it into a database, and still reconstruct the exact original export when you need it. This now spans relational exports and document-store (MongoDB/Elasticsearch) exports. Parquet can query but can't give back the exact bytes; gzip/zstd/xz keep the bytes but can't query.

Reproduce: python db_compat_exp.py.