LM-Codec

The smallest still-verified archive of your text and code.

LM-Codec range-codes each token against a language model's predicted distribution to reach the smallest byte-exact archive of high-value text and code. Cloud-validated at scale: archives 3.34× smaller on prose and 4.64× smaller on code than xz-9e at pythia-1.4b, byte-exact and SHA-256 verified, and the edge grows with model size.

3.34×: smaller than xz-9e on prose (pythia-1.4b, byte-exact)
4.64×: smaller than xz-9e on code (pythia-1.4b, byte-exact)
SHA-256: every archive re-decodes to the exact original bytes
grows: the edge widens as the model gets bigger

The model is the dictionary

Codes the model's own predictions

For each token the language model emits a probability distribution over what comes next. LM-Codec range-codes the actual token against that distribution — the better the model predicts your text, the fewer bits it costs. A shared model becomes a shared dictionary.
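
As a rough illustration of why better prediction means fewer bits: an ideal entropy coder spends about -log2(p) bits on a token the model gave probability p. The sketch below uses made-up probabilities, not output from pythia-1.4b, and `ideal_bits` is a hypothetical helper rather than part of LM-Codec.

```python
import math


def ideal_bits(probabilities):
    """Sum of -log2(p) over the probability the model gave each actual token."""
    return sum(-math.log2(p) for p in probabilities)


# Hypothetical probabilities a model assigned to the tokens that actually occurred.
well_predicted = [0.9, 0.8, 0.95, 0.5]
poorly_predicted = [0.01, 0.02, 0.005, 0.01]

print(f"well predicted:   {ideal_bits(well_predicted):.2f} bits")
print(f"poorly predicted: {ideal_bits(poorly_predicted):.2f} bits")
```

A real range coder adds a small overhead on top of this ideal figure, but the ordering holds: the same four tokens cost far fewer bits when the model saw them coming.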

Byte-exact, SHA-256 verified

This is lossless archival, not summarization. The decoder replays the same model deterministically and reconstructs the original bytes exactly; a SHA-256 check confirms it. If the model isn't reproduced bit-for-bit, decode fails rather than returning wrong text.
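
The replay-and-verify loop can be sketched with a toy coder. This is a minimal illustration under loud assumptions: `ToyModel` is a hypothetical adaptive byte counter standing in for the language model, the coder uses exact fractions instead of a production range coder, and the tuple it returns is not the real archive format.

```python
import hashlib
import math
from fractions import Fraction


class ToyModel:
    """Adaptive byte-frequency model: a hypothetical stand-in for the language model."""

    def __init__(self):
        self.counts = [1] * 256  # every byte starts equally likely
        self.total = 256

    def find(self, target):
        """Return (symbol, cum_low, count) for the symbol whose slot holds target."""
        cum = 0
        for symbol, count in enumerate(self.counts):
            if target < cum + count:
                return symbol, cum, count
            cum += count
        raise ValueError("target outside the model's range")

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def narrow(model, low, width, cum, count):
    """Shrink [low, low + width) to the sub-interval the model assigns a symbol."""
    return low + width * Fraction(cum, model.total), width * Fraction(count, model.total)


def encode(data: bytes):
    model, low, width = ToyModel(), Fraction(0), Fraction(1)
    for symbol in data:
        cum = sum(model.counts[:symbol])
        low, width = narrow(model, low, width, cum, model.counts[symbol])
        model.update(symbol)
    # Pick a binary fraction code / 2**bits that lands inside the final interval.
    bits = 0
    while Fraction(1, 2 ** bits) > width:
        bits += 1
    code = math.ceil(low * 2 ** bits)
    return code, bits, len(data), hashlib.sha256(data).hexdigest()


def decode(code, bits, length, digest):
    value = Fraction(code, 2 ** bits)
    model, low, width = ToyModel(), Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(length):
        # Replay the same model state the encoder had at this position.
        target = math.floor((value - low) / width * model.total)
        symbol, cum, count = model.find(target)
        low, width = narrow(model, low, width, cum, count)
        model.update(symbol)
        out.append(symbol)
    restored = bytes(out)
    if hashlib.sha256(restored).hexdigest() != digest:
        raise ValueError("SHA-256 mismatch: refusing to return wrong bytes")
    return restored


original = b"the model is the dictionary"
archive = encode(original)
assert decode(*archive) == original
```

Because the decoder rebuilds the model state step by step, any divergence from the encoder's model would send it down a different interval; the digest check is what turns that into a hard failure instead of silently wrong output.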

Bigger model, bigger win

A model that predicts your corpus better spends fewer bits per token. In cloud validation at pythia-1.4b, its archives were 3.34× smaller than xz-9e on prose and 4.64× smaller on code, and the edge grows with model size, so the ceiling rises as models improve.

A cold-storage codec

Running a language model per token is slow — this is for high-value corpora you archive and rarely read, not a hot path. It trades throughput for the smallest verified footprint on data that's worth it.

One command surface

# archive high-value text/code to the smallest byte-exact form
at1 lm-codec compress corpus.txt out.at1lm --model pythia-1.4b
at1 lm-codec verify   out.at1lm                 # SHA-256 — must decode to the exact bytes
at1 lm-codec decompress out.at1lm restored.txt  # byte-for-byte identical

# same model on both ends — the decoder replays it deterministically

Honest scope

LM-Codec is a cold-storage codec: running a language model per token makes it slow, so it's for high-value corpora you archive and rarely read — not a general-purpose fast compressor and not a hot path. It is a licensed engine that needs a model: the same model must be available to compress and to decompress, and decode is deterministic. The wins above are on high-value text and code where a language model predicts well; on random or already-compressed data there is no distribution to exploit and it will not beat a general codec. For fast, never-worse general compression see Adaptive; enable this engine at /engines.
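
The point about random data applies to every lossless compressor, and it can be seen with a general codec alone. The sketch below uses xz through Python's standard `lzma` module as a stand-in; it does not run LM-Codec, and the sample inputs are invented for illustration.

```python
import lzma
import os

text = b"the quick brown fox jumps over the lazy dog\n" * 200
noise = os.urandom(len(text))  # no structure for any model to predict

for label, payload in (("repetitive text", text), ("random bytes", noise)):
    # preset 9 with the extreme flag corresponds to the xz -9e setting
    packed = lzma.compress(payload, preset=9 | lzma.PRESET_EXTREME)
    print(f"{label}: {len(payload)} -> {len(packed)} bytes")
```

The predictable input shrinks sharply, while the random input comes out slightly larger than it went in because only container overhead is added.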