AT-1 Container & Columnar Format Specification (decoder side)

This document specifies exactly what a decoder must parse; it is sufficient to implement a standalone reconstructor in any language. The reference C implementation is at1_decode.c. All multi-byte integers are LEB128 varints unless stated otherwise.

Primitives

  • varint (uint): LEB128, little-endian groups of 7 bits, high bit = "more".
  • svarint (int): zigzag then varint. Encode: zz = (n<<1) ^ (n>>63), where >> is an arithmetic shift on a 64-bit signed n; decode: n = (zz>>1) ^ -(zz&1).
  • xz block: a standard .xz stream (liblzma lzma_stream_buffer_decode). Produced by Python lzma.compress(data) (default FORMAT_XZ).
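
The two integer primitives above can be sketched in Python as follows. This is a minimal illustration, not the reference decoder; the function names (read_varint, read_svarint, zigzag, unzigzag) are chosen here for the example and do not come from at1_decode.c.

```python
def read_varint(buf, pos=0):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits carry data
        if not byte & 0x80:               # high bit clear: last group
            return value, pos
        shift += 7


def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one (encoder side)."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def unzigzag(zz):
    """Inverse of zigzag: n = (zz>>1) ^ -(zz&1)."""
    return (zz >> 1) ^ -(zz & 1)


def read_svarint(buf, pos=0):
    """Decode one zigzag-encoded signed varint; return (value, next_pos)."""
    zz, pos = read_varint(buf, pos)
    return unzigzag(zz), pos
```

Python integers are unbounded, so the encoder-side mask to 64 bits is only there to mirror what a fixed-width implementation would produce.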

Top-level container

bytes 0..3 : magic
             "AT1\x02"  -> whole-file container
             "AT1\x03"  -> streaming (framed) container
byte  4    : codec_id   (0 ssh, 1 json, 2 osm, 3 log, 4 columnar, 5 vcf, 6 jsondoc,
                         7 qcolumnar, 8 qjson, 9 dicom, 10 embed, 11 qjson2, 12 bundle, 255 RAW)

codec_id 255 — RAW fallback (any domain)

payload = bytes[5:] is a single xz block of the ORIGINAL file. Decode and emit. This is the safety path: the encoder ships RAW whenever its transform would be larger than plain xz, so every decoder must support it. The worst case therefore ties plain xz, plus the 5-byte container header.

Whole-file, codec_id != 255

bytes[5:] is a packed stream set (see "Packed stream set" below). Decode the streams, then run the codec's reconstructor.
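
A minimal Python sketch of the header check and the whole-file dispatch described above. The function names and the reconstruct callback are hypothetical and exist only for this example; only the RAW path is actually decoded here.

```python
import lzma

MAGIC_WHOLE = b"AT1\x02"
MAGIC_STREAMING = b"AT1\x03"
CODEC_RAW = 255


def parse_header(data):
    """Return (kind, codec_id) from the first five bytes."""
    magic, codec_id = data[:4], data[4]
    if magic == MAGIC_WHOLE:
        return "whole", codec_id
    if magic == MAGIC_STREAMING:
        return "streaming", codec_id
    raise ValueError("not an AT-1 container")


def decode_whole_file(data, reconstruct=None):
    """Decode a whole-file container.

    `reconstruct(codec_id, packed_bytes)` is a hypothetical callback that
    parses the packed stream set and runs the codec's reconstructor.
    """
    kind, codec_id = parse_header(data)
    if kind != "whole":
        raise ValueError("expected a whole-file container")
    if codec_id == CODEC_RAW:
        # RAW fallback: bytes[5:] is one .xz stream of the original file.
        return lzma.decompress(data[5:])
    if reconstruct is None:
        raise NotImplementedError(f"no reconstructor for codec {codec_id}")
    return reconstruct(codec_id, data[5:])
```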

Streaming container ("AT1\x03")

byte 4         : codec_id
varint         : chunk_lines (informational)
repeat until EOF:
    varint     : payload_len
    byte       : method (0 = structured, 1 = raw)
    payload    : payload_len-1 bytes (payload_len counts the method byte)
               method 0 -> a packed stream set; decode via codec, append output
               method 1 -> an xz block of the chunk; decode, append output

Concatenating every frame's output yields the original file byte-for-byte.
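
The frame loop can be sketched as below. This assumes, as the layout implies, that payload_len covers the method byte plus the payload. decode_structured is a hypothetical callback standing in for the per-codec reconstructor.

```python
import lzma


def read_varint(buf, pos):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_streaming(data, decode_structured):
    """Decode an "AT1\\x03" container by concatenating every frame's output."""
    if data[:4] != b"AT1\x03":
        raise ValueError("not a streaming AT-1 container")
    codec_id = data[4]
    _chunk_lines, pos = read_varint(data, 5)    # informational only
    out = bytearray()
    while pos < len(data):                      # repeat until EOF
        payload_len, pos = read_varint(data, pos)
        method = data[pos]
        payload = data[pos + 1 : pos + payload_len]
        pos += payload_len                      # skip method byte + payload
        if method == 1:
            out += lzma.decompress(payload)     # raw chunk: plain .xz stream
        elif method == 0:
            out += decode_structured(codec_id, payload)
        else:
            raise ValueError(f"unknown frame method {method}")
    return bytes(out)
```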

Packed stream set

varint                  : n_streams
repeat n_streams:
    varint  name_len
    bytes   name        (UTF-8)
    varint  comp_len
    bytes   xz_block    (comp_len bytes; decode to the stream's data)

A decoder builds a name→bytes map. Stream order is not significant.
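
A short Python sketch of the name→bytes map construction, following only the layout listed above. read_packed_streams is a name invented for this example.

```python
import lzma


def read_varint(buf, pos):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_packed_streams(buf, pos=0):
    """Parse a packed stream set; return ({name: data}, next_pos)."""
    n_streams, pos = read_varint(buf, pos)
    streams = {}
    for _ in range(n_streams):
        name_len, pos = read_varint(buf, pos)
        name = buf[pos : pos + name_len].decode("utf-8")
        pos += name_len
        comp_len, pos = read_varint(buf, pos)
        # Each stream body is a complete .xz stream of comp_len bytes.
        streams[name] = lzma.decompress(buf[pos : pos + comp_len])
        pos += comp_len
    return streams, pos
```

Because the result is keyed by name, a reconstructor can look up streams in whatever order it needs.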

Columnar codec (id 4)

Streams: meta, row_modes, fieldcounts, coltypes, col_index, values, quotemodes, quotebits, verbatim, header.

meta

varint  nrows           (number of body rows, i.e. excluding any header)
byte    trailing_nl     (1 if the original file ends with '\n')
byte    has_header      (1 if a header line was split out)
varint  delim_len
bytes   delim           (the field delimiter, e.g. "," ";" "\t" "|")
varint  ncols
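
As an illustration, the meta stream could be parsed like this in Python. The dictionary keys mirror the field names above; parse_columnar_meta itself is a hypothetical helper.

```python
def read_varint(buf, pos):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_columnar_meta(meta):
    """Parse the columnar `meta` stream into a dict of its six fields."""
    nrows, pos = read_varint(meta, 0)
    trailing_nl = meta[pos] == 1
    has_header = meta[pos + 1] == 1
    pos += 2
    delim_len, pos = read_varint(meta, pos)
    delim = meta[pos : pos + delim_len]     # kept as bytes, e.g. b","
    pos += delim_len
    ncols, pos = read_varint(meta, pos)
    return {
        "nrows": nrows,
        "trailing_nl": trailing_nl,
        "has_header": has_header,
        "delim": delim,
        "ncols": ncols,
    }
```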

Per-column decode (k = 0..ncols-1)

col_index holds ncols varints, each the byte length of one column's chunk inside values (chunks are concatenated in column order). coltypes[k] selects the decoder; each decoder produces a list of string values cols[k]:

  • 0 TEXT: if chunk length 0 → [""]; else latin-1 decode, split on '\n'.
  • 1 INT: read svarints to end; running sum acc += d; value = decimal string.
  • 2 DEC: byte D; then svarints; acc += d; value = fixed-point string with D fractional digits (see fmt_fixed below).
  • 3 NUMEXC: byte subtype (0 int / 1 dec); if dec, byte D; varint count; bitmap = ceil(count/8) bytes; varint nexc; nexc × (varint len + bytes) exception strings (latin-1); then for i in 0..count-1: if bit i set, read one svarint (acc += d, value = int/fixed string), else take the next exception.
  • 4 DERIVED: varint a (source column index, always < k); varint nmap; nmap × (varint len + bytes) map values. Reconstruct by scanning cols[a]: maintain an insertion-ordered map; the j-th distinct source value maps to mapvals[j]; cols[k][i] = map[cols[a][i]].
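
The chunk split and three of the column types (TEXT, INT, DERIVED) can be sketched in Python as follows. Function names are illustrative only. The text above does not state the character encoding of DERIVED map values, so latin-1 is an assumption here, mirroring the NUMEXC exception strings.

```python
def read_varint(buf, pos):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_svarint(buf, pos):
    zz, pos = read_varint(buf, pos)
    return (zz >> 1) ^ -(zz & 1), pos


def split_chunks(col_index, values):
    """Cut `values` into per-column chunks using the lengths in col_index."""
    chunks, pos, off = [], 0, 0
    while pos < len(col_index):
        n, pos = read_varint(col_index, pos)
        chunks.append(values[off : off + n])
        off += n
    return chunks


def decode_text(chunk):
    return [""] if not chunk else chunk.decode("latin-1").split("\n")


def decode_int(chunk):
    """Delta-coded integers: each svarint is added to a running sum."""
    out, acc, pos = [], 0, 0
    while pos < len(chunk):
        d, pos = read_svarint(chunk, pos)
        acc += d
        out.append(str(acc))
    return out


def decode_derived(chunk, cols):
    """Rebuild a column as a function of an earlier column cols[a]."""
    a, pos = read_varint(chunk, 0)
    nmap, pos = read_varint(chunk, pos)
    mapvals = []
    for _ in range(nmap):
        n, pos = read_varint(chunk, pos)
        mapvals.append(chunk[pos : pos + n].decode("latin-1"))  # assumed encoding
        pos += n
    mapping, out = {}, []
    for src in cols[a]:
        if src not in mapping:                 # j-th distinct source value
            mapping[src] = mapvals[len(mapping)]
        out.append(mapping[src])
    return out
```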

fmt_fixed(n, D): let neg = n<0, s = decimal(|n|); if len(s) <= D left-pad with zeros to length D+1; result = s[:-D] + "." + s[-D:], prefixed "-" if neg.
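
A direct Python transcription of fmt_fixed, given as a sketch. The rule above does not say what happens when D = 0 (the slice s[:-D] would be empty), so this version assumes D >= 1 and rejects anything else.

```python
def fmt_fixed(n, D):
    """Render integer n as a fixed-point string with D fractional digits."""
    if D < 1:
        # Assumption of this sketch: D = 0 is not covered by the stated rule.
        raise ValueError("fmt_fixed sketch assumes D >= 1")
    s = str(abs(n))
    if len(s) <= D:
        s = s.rjust(D + 1, "0")       # guarantee one digit before the point
    text = s[:-D] + "." + s[-D:]
    return "-" + text if n < 0 else text
```

For example, fmt_fixed(12345, 2) gives "123.45" and fmt_fixed(-5, 2) gives "-0.05".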

Quote flags (RFC-4180 reconstruction)

quotemodes[k]: 0 = no field quoted, 1 = all quoted, 2 = mixed. For mixed columns, read ceil(m/8) bytes from quotebits (in column order, m = value count of that column); bit i = field i was quoted.

Row assembly

fpos = 0; per-column cursor[] = 0; vbi = 0
for r in 0..nrows-1:
    if row_modes[r] == 0:           # verbatim row
        emit verbatim_line[vbi++]   (verbatim stream is '\n'-joined latin-1)
    else:
        varint nf  (read from fieldcounts at fpos; advance fpos)
        fields = []
        for i in 0..nf-1:
            c = cols[i][cursor[i]]; q = quoteflag[i][cursor[i]]; cursor[i]++
            fields.append(q ? '"' + c.replace('"','""') + '"' : c)
        emit delim.join(fields)
if has_header: prepend header line
output = '\n'.join(rows); if trailing_nl: output += '\n'

That output equals the original file byte-for-byte. The reference decoder and the Python encoder are cross-checked by testvectors/ (make test).
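
The row assembly loop might look like this in Python. It is a sketch under stated assumptions: row_modes is taken to be one byte per row, the quote flags are passed in already expanded to one boolean per value (so the bit order inside quotebits is not addressed here), and the parameter names are invented for the example.

```python
def read_varint(buf, pos):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def assemble_rows(nrows, row_modes, fieldcounts, cols, quoted,
                  verbatim_lines, delim, header=None, trailing_nl=False):
    """Interleave verbatim and structured rows; return the output bytes."""
    cursor = [0] * len(cols)      # next unread value in each column
    fpos = 0                      # read position inside fieldcounts
    vbi = 0                       # next verbatim line
    rows = []
    for r in range(nrows):
        if row_modes[r] == 0:
            rows.append(verbatim_lines[vbi])
            vbi += 1
            continue
        nf, fpos = read_varint(fieldcounts, fpos)
        fields = []
        for i in range(nf):
            c = cols[i][cursor[i]]
            q = quoted[i][cursor[i]]
            cursor[i] += 1
            # RFC 4180 quoting: wrap in quotes and double any embedded quote.
            fields.append('"' + c.replace('"', '""') + '"' if q else c)
        rows.append(delim.join(fields))
    if header is not None:
        rows.insert(0, header)
    out = "\n".join(rows)
    if trailing_nl:
        out += "\n"
    return out.encode("latin-1")
```

Each column keeps its own cursor because a row with fewer fields consumes values only from the first nf columns.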

Other codecs (id 0–3)

ssh/json/osm/log share the container + packed-stream framing above; their column/stream layouts differ and are documented in the Python sources (lossless_*.py). The C reference here implements columnar + RAW + streaming (the tabular/telemetry production path). The same container scaffolding decodes the others once their per-codec reconstructors are added.

Queryable codecs: qcolumnar (id 7) and qjson (id 8)

These add query-while-compressed while remaining byte-exact. Both reuse the container + packed-stream framing; their blocks stream uses the stored backend (2) so each compressed block sits verbatim in the file (computable byte offsets → range-GET). A non-querying decoder ignores the footer index and rebuilds the whole file; a query consults the index zone maps to skip row-groups and decompress only the projected columns of survivors.

Streams

qcolumnar: meta, row_modes, verbatim, index, blocks. qjson: meta, template, paths, row_modes, verbatim, index, blocks.

meta (varints unless noted):

qcolumnar: nlines, trailing(1 byte), dlen, delim(dlen bytes), ncols, rg, n_struct, nrg, coltypes[ncols]
qjson:     nlines, trailing(1 byte),                          ncols, rg, n_struct, nrg, coltypes[ncols]

coltypes[c]: 0=text, 1=integer, 2=decimal/float.

template (qjson only): nseg (== ncols+1) length-prefixed literal segments. A structured row is rebuilt as seg[0] + val[0] + seg[1] + ... + seg[ncols] (the values are the row's ncols columns). qcolumnar uses the fixed delim instead of a template.

paths (qjson only): ncount length-prefixed column/field names (for query-by-name).

row_modes: one byte per output line, where 1 = structured (take next columnar row) and 0 = verbatim (take next verbatim line).

verbatim is the '\n'-joined non-conforming lines (latin-1).

Footer index (the queryable part)

index is, for each row-group g (0..nrg-1) and column c (0..ncols-1), in order:

varint  clen            # length of this (g,c) block in the `blocks` stream
if coltypes[c] == 1:    signed-varint zmin, signed-varint zmax     # integer zone map
if coltypes[c] == 2:    float64 zmin, float64 zmax (little-endian, 16 bytes)   # decimal zone map
# text columns (type 0) carry no zone map

Block byte offset within blocks = running sum of preceding clens. A range predicate on column c skips group g iff zmax < lo || zmin > hi (the per-row predicate is then re-checked exactly on survivors, so the zone map only needs to be conservative).
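
A Python sketch of reading the footer index, deriving block offsets, and applying the skip test. parse_index and can_skip are names made up for this example. Text columns return no zone map and are therefore never skipped.

```python
import struct


def read_varint(buf, pos):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_svarint(buf, pos):
    zz, pos = read_varint(buf, pos)
    return (zz >> 1) ^ -(zz & 1), pos


def parse_index(index, nrg, coltypes):
    """Return entries[g][c] = (offset, clen, zmin, zmax) for every block."""
    entries, pos, offset = [], 0, 0
    for _g in range(nrg):
        row = []
        for ctype in coltypes:
            clen, pos = read_varint(index, pos)
            zmin = zmax = None                      # text: no zone map
            if ctype == 1:                          # integer zone map
                zmin, pos = read_svarint(index, pos)
                zmax, pos = read_svarint(index, pos)
            elif ctype == 2:                        # two little-endian float64
                zmin, zmax = struct.unpack_from("<dd", index, pos)
                pos += 16
            row.append((offset, clen, zmin, zmax))
            offset += clen                          # running sum of clens
        entries.append(row)
    return entries


def can_skip(zmin, zmax, lo, hi):
    """True if no value in [zmin, zmax] can satisfy lo <= v <= hi."""
    if zmin is None:
        return False
    return zmax < lo or zmin > hi
```

With computable offsets, a client can fetch only bytes offset..offset+clen of the blocks stream for the columns and row-groups that survive the test.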

Block coding (per (g,c) block)

Each block is self-describing: a 1-byte codec tag followed by the payload. Tag 0x00 = a standard .xz stream, tag 0x01 = a zstd frame. The encoder emits whichever is smaller per block (block-level non-inferiority). Decode the payload, then split on '\n' to get that block's column values for the row-group.
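
The tag dispatch can be sketched as below. Only the xz branch is implemented, using the standard-library lzma module; the zstd branch is left as a stub because it needs a zstd binding, which this example does not assume. Values are returned as bytes so that no character encoding has to be assumed.

```python
import lzma

TAG_XZ = 0x00
TAG_ZSTD = 0x01


def decode_block(block):
    """Decode one (g,c) block into a list of column values (as bytes)."""
    tag, payload = block[0], block[1:]
    if tag == TAG_XZ:
        raw = lzma.decompress(payload)
    elif tag == TAG_ZSTD:
        # Plug in a zstd decompressor here; none is assumed by this sketch.
        raise NotImplementedError("zstd blocks need a zstd binding")
    else:
        raise ValueError(f"unknown block codec tag {tag:#04x}")
    return raw.split(b"\n")
```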

Reconstruction (full decode)

si = 0; vbi = 0; per-group column value arrays cv[]
for each output line (row_modes order):
    if mode == 0: emit verbatim[vbi++]
    else:
        g = si / rg                  (integer division)
        if g changed: for each column c, read its block (clen from index; tag-dispatch
                      decode; split '\n') into cv[c]
        r = si % rg
        qcolumnar: emit delim.join(cv[c][r] for c in 0..ncols-1)
        qjson:     emit seg[0]+cv[0][r]+seg[1]+...+seg[ncols]
        si++
append trailing '\n' if trailing flag set

This reproduces the original input byte-for-byte. Vectors qcol.at1/qjson.at1 cross-check the C decoder, the WASM/Python/Go/Rust/Node bindings, and the Python encoder via the conformance manifest (conformance/manifest.json).
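
For illustration, the qcolumnar branch of the reconstruction loop could be written as follows. load_group is a hypothetical callback that returns the decoded column value lists for one row-group (reading clen from the index and decoding each block); everything is kept as bytes.

```python
def reconstruct_qcolumnar(row_modes, verbatim_lines, rg, ncols, delim,
                          load_group, trailing):
    """Rebuild the original file from row modes and per-group column values."""
    out = []
    si = 0            # index of the next structured row
    vbi = 0           # index of the next verbatim line
    current_g = -1    # row-group currently held in cv
    cv = None
    for mode in row_modes:
        if mode == 0:
            out.append(verbatim_lines[vbi])
            vbi += 1
            continue
        g, r = divmod(si, rg)          # row-group and row within the group
        if g != current_g:
            cv = load_group(g)         # decode this group's blocks once
            current_g = g
        out.append(delim.join(cv[c][r] for c in range(ncols)))
        si += 1
    data = b"\n".join(out)
    if trailing:
        data += b"\n"
    return data
```

The qjson variant would differ only in the emit step, interleaving the template segments with the row's values instead of joining on delim.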