LM-Codec

The smallest still-verified archive of your text and code.

LM-Codec range-codes each token against a language model's predicted distribution to produce the smallest byte-exact archive of high-value text and code. Cloud-validated at scale: 3.34× smaller on prose and 4.64× smaller on code than xz-9e with pythia-1.4b, byte-exact and SHA-256 verified, and the edge grows with model size.

3.34× smaller than xz-9e on prose (pythia-1.4b, byte-exact)
4.64× smaller than xz-9e on code (pythia-1.4b, byte-exact)
SHA-256: every archive re-decodes to the exact original bytes
Grows: the edge widens as the model gets bigger

The model is the dictionary

Codes the model's own predictions

For each token the language model emits a probability distribution over what comes next. LM-Codec range-codes the actual token against that distribution — the better the model predicts your text, the fewer bits it costs. A shared model becomes a shared dictionary.
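
To make the "fewer bits" claim concrete, here is a minimal sketch of the ideal cost of coding a token: roughly -log2(p) bits, where p is the probability the model gave to the token that actually occurred. The probabilities below are invented for illustration and the function name is hypothetical; this is not LM-Codec's code, and a real range coder adds a small overhead on top of this bound.

```python
import math

def ideal_code_length_bits(token_probs):
    """Sum of -log2(p) over the probabilities assigned to the actual tokens.

    This is the information-theoretic cost an entropy coder approaches.
    """
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical probabilities a model assigned to four tokens that occurred.
confident = [0.9, 0.8, 0.95, 0.5]    # model predicts this text well
uncertain = [0.1, 0.05, 0.2, 0.01]   # model predicts this text poorly

print("well-predicted text  :", round(ideal_code_length_bits(confident), 2), "bits")
print("poorly-predicted text:", round(ideal_code_length_bits(uncertain), 2), "bits")
```

The same four tokens cost far fewer bits when the model assigned them high probability, which is the sense in which a shared model acts as a shared dictionary.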

Byte-exact, SHA-256 verified

This is lossless archival, not summarization. The decoder replays the same model deterministically and reconstructs the original bytes exactly; a SHA-256 check confirms it. If the model isn't reproduced bit-for-bit, decode fails rather than returning wrong text.
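
The replay-and-verify idea can be sketched with a toy coder. This is an assumption-heavy illustration, not LM-Codec's implementation: it uses exact rational arithmetic coding in place of a fixed-precision range coder, and a hypothetical adaptive byte-count model (`ToyModel`) in place of a language model. What it shows is that the decoder rebuilds the same model state step by step, and that a SHA-256 mismatch raises an error instead of returning wrong bytes.

```python
import hashlib
from fractions import Fraction

class ToyModel:
    """Adaptive byte-frequency model; a stand-in for the language model."""
    def __init__(self):
        self.counts = [1] * 256
        self.total = 256

    def interval(self, symbol):
        """Cumulative range (low, high, total) of `symbol` under current counts."""
        low = sum(self.counts[:symbol])
        return low, low + self.counts[symbol], self.total

    def find(self, target):
        """Symbol whose cumulative range contains `target` (0 <= target < total)."""
        cum = 0
        for symbol, count in enumerate(self.counts):
            if cum + count > target:
                return symbol
            cum += count
        raise ValueError("target outside the model's range")

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

def encode(data: bytes):
    """Narrow [low, low + width) once per byte; return a point inside it."""
    model = ToyModel()
    low, width = Fraction(0), Fraction(1)
    for b in data:
        c_lo, c_hi, total = model.interval(b)
        low += width * Fraction(c_lo, total)
        width *= Fraction(c_hi - c_lo, total)
        model.update(b)
    return low + width / 2, len(data), hashlib.sha256(data).hexdigest()

def decode(code: Fraction, length: int, digest: str) -> bytes:
    """Replay the same model deterministically, then verify SHA-256."""
    model = ToyModel()
    low, width = Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(length):
        symbol = model.find((code - low) / width * model.total)
        c_lo, c_hi, total = model.interval(symbol)
        low += width * Fraction(c_lo, total)
        width *= Fraction(c_hi - c_lo, total)
        model.update(symbol)
        out.append(symbol)
    result = bytes(out)
    if hashlib.sha256(result).hexdigest() != digest:
        raise ValueError("SHA-256 mismatch: refusing to return wrong bytes")
    return result

message = b"the model is the dictionary"
code, n, digest = encode(message)
assert decode(code, n, digest) == message
```

Because the decoder's choices depend on the model state at every step, any divergence between the two models would change the decoded bytes, and the digest check is what turns that into a hard failure.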

Bigger model, bigger win

A model that predicts your corpus better spends fewer bits per token. In cloud validation with pythia-1.4b, LM-Codec produced archives 3.34× smaller than xz-9e on prose and 4.64× smaller on code, and the edge grows with model size, so the ceiling rises as models improve.

A cold-storage codec

Running a language model per token is slow — this is for high-value corpora you archive and rarely read, not a hot path. It trades throughput for the smallest verified footprint on data that's worth it.

One command surface

# archive high-value text/code to the smallest byte-exact form
at1 lm-codec compress corpus.txt out.at1lm --model pythia-1.4b
at1 lm-codec verify   out.at1lm                 # SHA-256 — must decode to the exact bytes
at1 lm-codec decompress out.at1lm restored.txt  # byte-for-byte identical

# same model on both ends: the decoder replays it deterministically

Honest scope

LM-Codec is a cold-storage codec: running a language model per token makes it slow, so it's for high-value corpora you archive and rarely read — not a general-purpose fast compressor and not a hot path. It is a licensed engine that needs a model: the same model must be available to compress and to decompress, and decode is deterministic. The wins above are on high-value text and code where a language model predicts well; on random or already-compressed data there is no distribution to exploit and it will not beat a general codec. For fast, never-worse general compression see Adaptive; enable this engine at /engines.
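
The point about random data can be illustrated with a rough sketch that measures empirical byte entropy. It is only an order-0 estimate on made-up inputs, far simpler than what a language model captures, and the function name is hypothetical; it shows that uniformly random bytes sit near 8 bits per byte, leaving nothing for any predictor to exploit, while repetitive English text sits well below that.

```python
import math
import random
from collections import Counter

def byte_entropy_bits(data: bytes) -> float:
    """Empirical order-0 entropy in bits per byte."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(65536))
text_bytes = b"the quick brown fox jumps over the lazy dog " * 100

print("random:", round(byte_entropy_bits(random_bytes), 3), "bits/byte")
print("text  :", round(byte_entropy_bits(text_bytes), 3), "bits/byte")
```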