The smallest still-verified archive of your text and code.
LM-Codec range-codes each token against a language model's per-token distribution to reach the smallest byte-exact archive of high-value text and code. Cloud-validated at scale: 3.34× smaller on prose and 4.64× smaller on code than xz-9e at pythia-1.4b, byte-exact and SHA-256 verified, and the edge grows with model size.
- 3.34× smaller than xz-9e on prose (pythia-1.4b, byte-exact)
- 4.64× smaller than xz-9e on code (pythia-1.4b, byte-exact)
- SHA-256: every archive re-decodes to the exact original bytes
- Grows: the edge widens as the model gets bigger
The model is the dictionary
Codes the model's own predictions
For each token the language model emits a probability distribution over what comes next. LM-Codec range-codes the actual token against that distribution — the better the model predicts your text, the fewer bits it costs. A shared model becomes a shared dictionary.
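To make the idea concrete, here is a minimal sketch of coding symbols against a model's predictions. It is not LM-Codec's implementation: `toy_model` is a hypothetical stand-in for a language model (smoothed character counts over a three-symbol alphabet), and the coder uses exact fractions instead of the fixed-precision integer arithmetic a production range coder would use. The point it illustrates is that encoder and decoder call the same model on the same context, so the model acts as the shared dictionary.

```python
from fractions import Fraction
import math

ALPHABET = "ab "  # assumption: tiny fixed alphabet, for illustration only


def toy_model(context):
    """Hypothetical stand-in for a language model.

    Returns {symbol: probability} for the next symbol, using
    add-one smoothed counts of the symbols seen so far.
    """
    counts = {s: 1 for s in ALPHABET}
    for s in context:
        counts[s] += 1
    total = sum(counts.values())
    return {s: Fraction(c, total) for s, c in counts.items()}


def encode(symbols):
    """Narrow [low, low + width) once per symbol, then emit a short binary fraction inside it."""
    low, width = Fraction(0), Fraction(1)
    for i, sym in enumerate(symbols):
        cum = Fraction(0)
        for s, p in toy_model(symbols[:i]).items():
            if s == sym:
                low += width * cum   # move to the symbol's sub-interval
                width *= p           # likelier symbols shrink the interval less
                break
            cum += p
    nbits = 0
    while True:
        scale = 2 ** nbits
        k = math.ceil(low * scale)   # smallest k with k / 2**nbits >= low
        if Fraction(k, scale) < low + width:
            return k, nbits
        nbits += 1


def decode(k, nbits, length):
    """Replay the same model and pick the sub-interval that contains the encoded value."""
    value = Fraction(k, 2 ** nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        cum = Fraction(0)
        for s, p in toy_model(out).items():
            lo = low + width * cum
            if lo <= value < lo + width * p:
                out.append(s)
                low, width = lo, width * p
                break
            cum += p
    return "".join(out)


if __name__ == "__main__":
    text = "abba abba ab"
    k, nbits = encode(text)
    print(f"{len(text)} symbols coded in {nbits} bits")
    print("round trip ok:", decode(k, nbits, len(text)) == text)
```

In this sketch the symbol count is passed to `decode` separately; a real archive format would have to record the length or an end-of-stream marker itself.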
Byte-exact, SHA-256 verified
This is lossless archival, not summarization. The decoder replays the same model deterministically and reconstructs the original bytes exactly; a SHA-256 check confirms it. If the model isn't reproduced bit-for-bit, decode fails rather than returning wrong text.
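The verify step can be pictured as a digest comparison around the decoder. The sketch below is an assumption-laden illustration, not the `.at1lm` format: the archive is a plain dictionary with hypothetical `sha256` and `payload` keys, and `zlib` stands in for the model-driven codec. It shows the fail-closed behavior described above, where a mismatch raises instead of returning wrong bytes.

```python
import hashlib
import zlib


def seal(original: bytes) -> dict:
    """Build a toy archive: digest of the original plus a compressed payload.

    zlib is only a stand-in for the real codec; the dict layout is hypothetical.
    """
    return {
        "sha256": hashlib.sha256(original).hexdigest(),
        "payload": zlib.compress(original),
    }


def open_verified(archive: dict) -> bytes:
    """Decode the payload and refuse to return it unless the digest matches."""
    restored = zlib.decompress(archive["payload"])
    if hashlib.sha256(restored).hexdigest() != archive["sha256"]:
        raise ValueError("SHA-256 mismatch: decoded bytes differ from the original")
    return restored


if __name__ == "__main__":
    data = b"high-value text worth archiving"
    archive = seal(data)
    print("verified:", open_verified(archive) == data)
```

Storing the digest of the original bytes, rather than of the compressed payload, is what lets the check catch a decoder that ran without error but reconstructed the wrong text.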
Bigger model, bigger win
A model that predicts your corpus better spends fewer bits per token. Cloud-validated at pythia-1.4b, it produced archives 3.34× smaller than xz-9e on prose and 4.64× smaller on code. The edge grows with model size, so the ceiling rises as models improve.
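The link between prediction quality and archive size follows from the standard ideal code length of an entropy coder. As a sketch, let $p_i$ be the probability the model assigned to the token that actually occurred at position $i$ of an $N$-token corpus (this notation is introduced here for illustration and is not taken from LM-Codec's documentation):

```latex
L_{\text{bits}} \approx \sum_{i=1}^{N} -\log_2 p_i ,
\qquad
\text{bits per token} \approx -\frac{1}{N} \sum_{i=1}^{N} \log_2 p_i
```

A token predicted with probability 1/2 costs about 1 bit, while one predicted with probability 1/64 costs about 6 bits. A range coder typically adds only a small overhead on top of this sum, so any model that assigns higher probability to the true tokens yields a smaller archive.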
A cold-storage codec
Running a language model per token is slow — this is for high-value corpora you archive and rarely read, not a hot path. It trades throughput for the smallest verified footprint on data that's worth it.
One command surface
    # archive high-value text/code to the smallest byte-exact form
    at1 lm-codec compress corpus.txt out.at1lm --model pythia-1.4b
    at1 lm-codec verify out.at1lm                    # SHA-256: must decode to the exact bytes
    at1 lm-codec decompress out.at1lm restored.txt   # byte-for-byte identical
    # same model on both ends; the decoder replays it deterministically
LM-Codec is a cold-storage codec: running a language model per token makes it slow, so it's for high-value corpora you archive and rarely read — not a general-purpose fast compressor and not a hot path. It is a licensed engine that needs a model: the same model must be available to compress and to decompress, and decode is deterministic. The wins above are on high-value text and code where a language model predicts well; on random or already-compressed data there is no distribution to exploit and it will not beat a general codec. For fast, never-worse general compression see Adaptive; enable this engine at /engines.