AT-1 Container & Columnar Format Specification (decoder side)
This documents exactly what a decoder must parse. It is sufficient to implement a standalone reconstructor in any language. The reference C implementation is at1_decode.c. All multi-byte integers are LEB128 varints unless stated.
Primitives
- varint (uint): LEB128, little-endian groups of 7 bits, high bit = "more".
- svarint (int): zigzag then varint. zz = (n<<1) ^ (n>>63); decode: n = (zz>>1) ^ -(zz&1).
- xz block: a standard .xz stream (liblzma lzma_stream_buffer_decode). Produced by Python lzma.compress(data) (default FORMAT_XZ).
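The three primitives above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation; the helper names (read_varint, read_svarint, zigzag64, read_xz_block) are hypothetical, and only the standard-library lzma module is used.

```python
import lzma

def read_varint(buf, pos=0):
    """Decode one unsigned LEB128 varint from buf at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits carry data
        if (byte & 0x80) == 0:            # high bit clear: this was the last group
            return value, pos
        shift += 7

def read_svarint(buf, pos=0):
    """Decode a zigzag-encoded signed varint; return (value, next_pos)."""
    zz, pos = read_varint(buf, pos)
    return (zz >> 1) ^ -(zz & 1), pos

def zigzag64(n):
    """Encoder-side mapping, masked to 64 bits because Python ints are unbounded."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF

def read_xz_block(block):
    """Decompress one standard .xz stream."""
    return lzma.decompress(block, format=lzma.FORMAT_XZ)
```

Both readers return the next offset alongside the value, so successive fields can be parsed by threading pos through the calls.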
Top-level container
bytes 0..3 : magic
    "AT1\x02" -> whole-file container
    "AT1\x03" -> streaming (framed) container
byte 4     : codec_id (0 ssh, 1 json, 2 osm, 3 log, 4 columnar, 5 vcf, 6 jsondoc,
             7 qcolumnar, 8 qjson, 9 dicom, 10 embed, 11 qjson2, 12 bundle, 255 RAW)
codec_id 255: RAW fallback (any domain)
payload = bytes[5:] is a single xz block of the ORIGINAL file. Decode and emit. This is the safety path: the encoder ships RAW whenever its transform would be larger than plain xz, so a decoder must support it. The worst case therefore ties plain xz, plus the 5-byte container header.
Whole-file, codec_id != 255
bytes[5:] is a packed stream set (below). Decode the streams, then run the codec's reconstructor.
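As a rough sketch of the top-level dispatch described above, the following Python checks the magic, reads codec_id, and handles the RAW path. The function names open_container and decode_raw are illustrative assumptions and do not come from at1_decode.c.

```python
import lzma

CODEC_RAW = 255

def open_container(data):
    """Split an AT-1 file into (kind, codec_id, payload)."""
    magic = data[:4]
    if magic == b"AT1\x02":
        kind = "whole"
    elif magic == b"AT1\x03":
        kind = "streaming"
    else:
        raise ValueError("not an AT-1 container")
    return kind, data[4], data[5:]

def decode_raw(data):
    """Decode a whole-file RAW container: the payload is one xz block."""
    kind, codec_id, payload = open_container(data)
    if kind != "whole" or codec_id != CODEC_RAW:
        raise ValueError("not a whole-file RAW container")
    return lzma.decompress(payload)
```

For any other codec_id in a whole-file container, the same payload would instead be handed to the packed-stream-set parser and then to the codec's reconstructor.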
Streaming container ("AT1\x03")
byte 4  : codec_id
varint  : chunk_lines (informational)
repeat until EOF:
    varint  : payload_len (counts the method byte plus the payload)
    byte    : method (0 = structured, 1 = raw)
    payload : payload_len-1 bytes
        method 0 -> a packed stream set; decode via codec, append output
        method 1 -> an xz block of the chunk; decode, append output
Concatenating every frame's output yields the original file byte-for-byte.
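The frame loop might look like the sketch below. It fully handles method 1 (raw) frames; method 0 frames are delegated to a caller-supplied decode_structured(codec_id, payload) callback, which is a hypothetical hook standing in for the per-codec reconstructor. The function name decode_streaming is likewise an assumption.

```python
import lzma

def read_varint(buf, pos):
    """Unsigned LEB128 reader: returns (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def decode_streaming(data, decode_structured=None):
    """Walk every frame of an "AT1\\x03" container and concatenate the outputs."""
    if data[:4] != b"AT1\x03":
        raise ValueError("not a streaming AT-1 container")
    codec_id = data[4]
    _chunk_lines, pos = read_varint(data, 5)      # informational only
    out = bytearray()
    while pos < len(data):                        # repeat until EOF
        payload_len, pos = read_varint(data, pos)
        method = data[pos]
        payload = data[pos + 1 : pos + payload_len]   # payload_len-1 bytes
        pos += payload_len
        if method == 1:                           # raw: xz block of the chunk
            out += lzma.decompress(payload)
        elif method == 0:                         # structured: packed stream set
            if decode_structured is None:
                raise NotImplementedError("no structured decoder supplied")
            out += decode_structured(codec_id, payload)
        else:
            raise ValueError("unknown frame method %d" % method)
    return bytes(out)
```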
Packed stream set
varint : n_streams
repeat n_streams:
    varint name_len
    bytes  name (UTF-8)
    varint comp_len
    bytes  xz_block (comp_len bytes; decode to the stream's data)
A decoder builds a name→bytes map. Stream order is not significant.
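A minimal Python sketch of the name-to-bytes map construction follows. The function name parse_stream_set is an assumption made for illustration; the layout it reads is the one listed above.

```python
import lzma

def read_varint(buf, pos):
    """Unsigned LEB128 reader: returns (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def parse_stream_set(buf):
    """Parse a packed stream set into {name: decompressed bytes}."""
    n_streams, pos = read_varint(buf, 0)
    streams = {}
    for _ in range(n_streams):
        name_len, pos = read_varint(buf, pos)
        name = buf[pos : pos + name_len].decode("utf-8")
        pos += name_len
        comp_len, pos = read_varint(buf, pos)
        streams[name] = lzma.decompress(buf[pos : pos + comp_len])
        pos += comp_len
    return streams
```

Because the result is a dictionary keyed by stream name, the reconstructor can look streams up without caring about the order in which they were written.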
Columnar codec (id 4)
Streams: meta, row_modes, fieldcounts, coltypes, col_index, values, quotemodes, quotebits, verbatim, header.
meta
varint nrows        (number of body rows, i.e. excluding any header)
byte   trailing_nl  (1 if the original file ends with '\n')
byte   has_header   (1 if a header line was split out)
varint delim_len
bytes  delim        (the field delimiter, e.g. "," ";" "\t" "|")
varint ncols
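Reading the meta stream in field order could be sketched as below. The function name parse_columnar_meta and the returned dictionary shape are illustrative choices, and decoding the delimiter as latin-1 is an assumption (the specification only gives its byte length).

```python
def read_varint(buf, pos):
    """Unsigned LEB128 reader: returns (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def parse_columnar_meta(meta):
    """Parse the columnar codec's meta stream into a dict."""
    nrows, pos = read_varint(meta, 0)
    trailing_nl = meta[pos] == 1          # plain byte, not a varint
    has_header = meta[pos + 1] == 1       # plain byte, not a varint
    pos += 2
    delim_len, pos = read_varint(meta, pos)
    delim = meta[pos : pos + delim_len].decode("latin-1")  # assumed encoding
    pos += delim_len
    ncols, pos = read_varint(meta, pos)
    return {
        "nrows": nrows,
        "trailing_nl": trailing_nl,
        "has_header": has_header,
        "delim": delim,
        "ncols": ncols,
    }
```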
Per-column decode (k = 0..ncols-1)
col_index holds ncols varints = byte length of each column's chunk inside values (chunks are concatenated in column order). coltypes[k] selects the decoder; produce a list of string values cols[k]:
- 0 TEXT: if chunk length 0 → [""]; else latin-1 decode, split on '\n'.
- 1 INT: read svarints to end; running sum acc += d; value = decimal string.
- 2 DEC: byte D; then svarints; acc += d; value = fixed-point string with D fractional digits (see fmt_fixed below).
- 3 NUMEXC: byte subtype (0 int / 1 dec); if dec, byte D; varint count; bitmap = ceil(count/8) bytes; varint nexc; nexc × (varint len + bytes) exception strings (latin-1); then for i in 0..count-1: if bit i set, read one svarint (acc += d, value = int/fixed string), else take the next exception.
- 4 DERIVED: varint a (source column index, always < k); varint nmap; nmap × (varint len + bytes) map values. Reconstruct by scanning cols[a]: maintain an insertion-ordered map; the j-th distinct source value maps to mapvals[j]; cols[k][i] = map[cols[a][i]].
fmt_fixed(n, D): let neg = n<0, s = decimal(|n|); if len(s) <= D left-pad with zeros to length D+1; result = s[:-D] + "." + s[-D:], prefixed "-" if neg.
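The TEXT, INT, DEC and DERIVED decoders, together with fmt_fixed, can be sketched as follows. This is an illustration under stated assumptions: the function names other than fmt_fixed are hypothetical, DERIVED map values are assumed to be latin-1 like the other strings, fmt_fixed is written for D >= 1, and NUMEXC is left out because the bit order of its bitmap is not stated here.

```python
def read_varint(buf, pos):
    """Unsigned LEB128 reader: returns (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def read_svarint(buf, pos):
    """Zigzag-decoded signed varint: returns (value, next_pos)."""
    zz, pos = read_varint(buf, pos)
    return (zz >> 1) ^ -(zz & 1), pos

def fmt_fixed(n, D):
    """Fixed-point string with D fractional digits (assumes D >= 1)."""
    s = str(abs(n))
    if len(s) <= D:
        s = s.rjust(D + 1, "0")           # left-pad with zeros to length D+1
    out = s[:-D] + "." + s[-D:]
    return "-" + out if n < 0 else out

def decode_text(chunk):
    return [""] if len(chunk) == 0 else chunk.decode("latin-1").split("\n")

def decode_deltas(chunk, pos, fmt):
    """Read svarints to the end of the chunk, keeping a running sum."""
    acc, values = 0, []
    while pos < len(chunk):
        d, pos = read_svarint(chunk, pos)
        acc += d
        values.append(fmt(acc))
    return values

def decode_int(chunk):
    return decode_deltas(chunk, 0, str)

def decode_dec(chunk):
    digits = chunk[0]                     # byte D precedes the deltas
    return decode_deltas(chunk, 1, lambda n: fmt_fixed(n, digits))

def decode_derived(chunk, cols):
    """cols holds the already-decoded columns 0..k-1."""
    a, pos = read_varint(chunk, 0)
    nmap, pos = read_varint(chunk, pos)
    mapvals = []
    for _ in range(nmap):
        n, pos = read_varint(chunk, pos)
        mapvals.append(chunk[pos : pos + n].decode("latin-1"))  # assumed encoding
        pos += n
    mapping = {}                          # dicts preserve insertion order
    out = []
    for v in cols[a]:
        if v not in mapping:              # j-th distinct source value -> mapvals[j]
            mapping[v] = mapvals[len(mapping)]
        out.append(mapping[v])
    return out
```

Because a DERIVED column always points at a lower-numbered source column, decoding columns in index order guarantees cols[a] is available when it is needed.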
Quote flags (RFC-4180 reconstruction)
quotemodes[k]: 0 = no field quoted, 1 = all quoted, 2 = mixed. For mixed columns, read ceil(m/8) bytes from quotebits (in column order, m = value count of that column); bit i = field i was quoted.
Row assembly
fpos = 0; per-column cursor[] = 0; vbi = 0
for r in 0..nrows-1:
    if row_modes[r] == 0:    # verbatim row
        emit verbatim_line[vbi++]    (verbatim stream is '\n'-joined latin-1)
    else:
        varint nf (from fieldcounts at fpos)
        for i in 0..nf-1:
            c = cols[i][cursor[i]]; q = quoteflag[i][cursor[i]]; cursor[i]++
            fields[i] = q ? '"' + c.replace('"','""') + '"' : c
        emit delim.join(fields)
if has_header: prepend header line
output = '\n'.join(rows); if trailing_nl: output += '\n'
That output equals the original file byte-for-byte. The reference decoder and the Python encoder are cross-checked by testvectors/ (make test).
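The row-assembly loop above translates fairly directly into Python. The sketch below assumes the columns and quote flags have already been decoded into lists (quoted[i][j] is a boolean per value) and that the verbatim stream has been split into lines; the function name assemble_rows and its parameter names are hypothetical.

```python
def read_varint(buf, pos):
    """Unsigned LEB128 reader: returns (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def assemble_rows(nrows, row_modes, fieldcounts, cols, quoted,
                  verbatim, delim, header=None, trailing_nl=True):
    """Rebuild the original text from decoded columns; returns latin-1 bytes."""
    cursor = [0] * len(cols)              # next unread value per column
    fpos = 0                              # read offset into fieldcounts
    vbi = 0                               # next verbatim line
    rows = [] if header is None else [header]
    for r in range(nrows):
        if row_modes[r] == 0:             # verbatim row
            rows.append(verbatim[vbi])
            vbi += 1
            continue
        nf, fpos = read_varint(fieldcounts, fpos)
        fields = []
        for i in range(nf):
            c = cols[i][cursor[i]]
            q = quoted[i][cursor[i]]
            cursor[i] += 1
            # RFC-4180 style: wrap in quotes and double any embedded quote
            fields.append('"' + c.replace('"', '""') + '"' if q else c)
        rows.append(delim.join(fields))
    out = "\n".join(rows)
    if trailing_nl:
        out += "\n"
    return out.encode("latin-1")
```

Each column keeps its own cursor because a row may have fewer than ncols fields, so columns can hold different numbers of values.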
Other codecs (id 0–3)
ssh/json/osm/log share the container + packed-stream framing above; their column/stream layouts differ and are documented in the Python sources (lossless_*.py). The C reference here implements columnar + RAW + streaming (the tabular/telemetry production path). The same container scaffolding decodes the others once their per-codec reconstructors are added.
Queryable codecs: qcolumnar (id 7) and qjson (id 8)
These add query-while-compressed while remaining byte-exact. Both reuse the container + packed-stream framing; their blocks stream uses the stored backend (2) so each compressed block sits verbatim in the file (computable byte offsets → range-GET). A non-querying decoder ignores the footer index and rebuilds the whole file; a query consults the index zone maps to skip row-groups and decompress only the projected columns of survivors.
Streams
qcolumnar: meta, row_modes, verbatim, index, blocks.
qjson: meta, template, paths, row_modes, verbatim, index, blocks.
meta (varints unless noted):
qcolumnar: nlines, trailing (1 byte), dlen, delim (dlen bytes), ncols, rg, n_struct, nrg, coltypes[ncols]
qjson: nlines, trailing (1 byte), ncols, rg, n_struct, nrg, coltypes[ncols]
coltypes[c]: 0=text, 1=integer, 2=decimal/float.
template (qjson only): nseg (== ncols+1) length-prefixed literal segments. A structured row is rebuilt as seg[0] + val[0] + seg[1] + ... + seg[ncols] (the values are the row's ncols columns). qcolumnar uses the fixed delim instead of a template.
paths (qjson only): ncount length-prefixed column/field names (for query-by-name).
row_modes: one byte per output line: 1 = structured (take next columnar row), 0 = verbatim (take next verbatim line). verbatim is the '\n'-joined non-conforming lines (latin-1).
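To make the qjson template rule concrete, the interleaving of literal segments and column values can be sketched as below. The function name rebuild_row is hypothetical, and the segments and values are assumed to be already-decoded strings.

```python
def rebuild_row(segs, vals):
    """Interleave nseg = ncols+1 literal segments with the row's ncols values."""
    if len(segs) != len(vals) + 1:
        raise ValueError("template needs exactly ncols+1 segments")
    parts = [segs[0]]
    for val, seg in zip(vals, segs[1:]):
        parts.append(val)                 # val[i]
        parts.append(seg)                 # seg[i+1]
    return "".join(parts)

# A two-column template for rows shaped like {"a":<int>,"b":"<text>"}
row = rebuild_row(['{"a":', ',"b":"', '"}'], ["1", "x"])
```

Storing the punctuation and key names once in the template means only the varying values end up in the per-column blocks.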
Footer index (the queryable part)
index is, for each row-group g (0..nrg-1) and column c (0..ncols-1), in order:
varint clen                                              # length of this (g,c) block in the `blocks` stream
if coltypes[c] == 1: signed-varint zmin, signed-varint zmax               # integer zone map
if coltypes[c] == 2: float64 zmin, float64 zmax (little-endian, 16 bytes) # decimal zone map
# text columns (type 0) carry no zone map
Block byte offset within blocks = running sum of preceding clens. A range predicate on column c skips group g iff zmax < lo || zmin > hi (the per-row predicate is then re-checked exactly on survivors, so the zone map only needs to be conservative).
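The index walk, the running-sum offsets and the skip test can be sketched together as follows. The names parse_index and may_match, and the (offset, clen, zmin, zmax) tuple layout, are assumptions for illustration; nrg and coltypes are taken from the meta stream.

```python
import struct

def read_varint(buf, pos):
    """Unsigned LEB128 reader: returns (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def read_svarint(buf, pos):
    zz, pos = read_varint(buf, pos)
    return (zz >> 1) ^ -(zz & 1), pos

def parse_index(index, nrg, coltypes):
    """Return entries[g][c] = (offset, clen, zmin, zmax); zmin/zmax are None for text."""
    pos = 0
    offset = 0                            # running sum of preceding clens
    entries = []
    for _g in range(nrg):
        group = []
        for ctype in coltypes:
            clen, pos = read_varint(index, pos)
            zmin = zmax = None
            if ctype == 1:                # integer zone map
                zmin, pos = read_svarint(index, pos)
                zmax, pos = read_svarint(index, pos)
            elif ctype == 2:              # two little-endian float64, 16 bytes
                zmin, zmax = struct.unpack_from("<dd", index, pos)
                pos += 16
            group.append((offset, clen, zmin, zmax))
            offset += clen
        entries.append(group)
    return entries

def may_match(entry, lo, hi):
    """False only when the zone map proves no row in the group is in [lo, hi]."""
    _offset, _clen, zmin, zmax = entry
    if zmin is None:
        return True                       # no zone map: cannot skip
    return not (zmax < lo or zmin > hi)
```

A group for which may_match returns True still has to be decoded and filtered row by row, since the zone map is only a conservative bound.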
Block coding (per (g,c) block)
Each block is self-describing: a 1-byte codec tag followed by the payload, where 0x00 = xz/LZMA .xz stream and 0x01 = zstd frame. The encoder emits whichever is smaller per block (block-level non-inferiority). Decode the payload, then split on '\n' to get that block's column values for the row-group.
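A tag-dispatch decoder for one block might be sketched as below. The xz path uses the standard-library lzma module; the zstd path is left to a caller-supplied zstd_decompress callable (a hypothetical parameter), since no particular zstd binding is named in this document. Decoding the values as latin-1 is also an assumption, chosen to match the verbatim stream.

```python
import lzma

TAG_XZ = 0x00
TAG_ZSTD = 0x01

def decode_block(block, zstd_decompress=None):
    """Decode one (g,c) block into that row-group's list of column values."""
    tag, payload = block[0], block[1:]
    if tag == TAG_XZ:
        raw = lzma.decompress(payload)
    elif tag == TAG_ZSTD:
        if zstd_decompress is None:
            raise NotImplementedError("zstd block but no zstd decoder supplied")
        raw = zstd_decompress(payload)
    else:
        raise ValueError("unknown block codec tag 0x%02x" % tag)
    return raw.decode("latin-1").split("\n")   # assumed text encoding
```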
Reconstruction (full decode)
si = 0; vbi = 0; per-group column value arrays cv[]
for each output line (row_modes order):
    if mode == 0: emit verbatim[vbi++]
    else:
        g = si / rg
        if g changed: for each column c, read its block (clen from index; tag-dispatch
                      decode; split '\n') into cv[c]
        r = si % rg
        qcolumnar: emit delim.join(cv[c][r] for c in 0..ncols-1)
        qjson:     emit seg[0]+cv[0][r]+seg[1]+...+seg[ncols]
        si++
append trailing '\n' if trailing flag set
This reproduces the original input byte-for-byte. Vectors qcol.at1/qjson.at1 cross-check the C decoder, the WASM/Python/Go/Rust/Node bindings, and the Python encoder via the conformance manifest (conformance/manifest.json).
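The full-decode loop above can be sketched in Python with the block reading and the row formatting factored out as callables. Both load_group(g), which returns the decoded column value lists for row-group g, and join_row(values), which formats one structured row, are hypothetical hooks; for qcolumnar join_row would be delim.join, and for qjson it would interleave the template segments.

```python
def reconstruct(row_modes, verbatim, rg, load_group, join_row, trailing):
    """Rebuild the full text, decoding each row-group's blocks once, on first use."""
    lines = []
    si = 0                                # index of the next structured row
    vbi = 0                               # index of the next verbatim line
    current_g = -1
    cv = None                             # column values of the current group
    for mode in row_modes:
        if mode == 0:
            lines.append(verbatim[vbi])
            vbi += 1
            continue
        g, r = divmod(si, rg)             # g = si / rg (integer), r = si % rg
        if g != current_g:                # group changed: load its blocks
            cv = load_group(g)
            current_g = g
        lines.append(join_row([col[r] for col in cv]))
        si += 1
    out = "\n".join(lines)
    if trailing:
        out += "\n"
    return out

# Toy data: two row-groups of size rg = 2, two columns each.
groups = {0: [["1", "2"], ["a", "b"]], 1: [["3"], ["c"]]}
text = reconstruct(bytes([1, 0, 1, 1]), ["#x"], 2, groups.__getitem__, ",".join, True)
```

A querying reader would replace load_group with one that decodes only the projected columns of groups that survive the zone-map test.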