Streaming & live ingest

Two ways to handle data that's too big for memory, or still arriving: bounded-memory streaming compression for multi-GB files, and at1-live to tail a streaminto an append-only table that's queryable while it lands.

Pulling from a URL? at1 fetch streams a download straight into a verified .at1 without ever landing the raw plaintext on disk — see Ingest from a URL.

Streaming compression — --stream

--stream reads the input in chunks of whole lines (--chunk-lines, default 500,000) and writes one self-describing frame per chunk, so peak memory is O(one chunk) regardless of file size — the way to compress multi-GB logs and columnar data on a small box.

at1 compress log app.log app.at1 --stream
at1 compress columnar events.csv events.at1 --stream --chunk-lines 500000
at1 compress columnar events.csv events.at1 --stream --backend zstd
  • Backends: xz (max ratio, default) or zstd (fast / high-throughput) via --backend.
  • Per-frame safety fallback — each chunk is also raw-compressed with the same backend; whichever is smaller is kept, so a structured codec never loses to plain compression on a bad chunk.
  • Verified lossless — each frame is decoded back and checked against the original chunk during compression, and a SHA-256 integrity trailer over the original bytes is written; a mismatch refuses to trust the output.

Streaming covers the line-oriented domains (log, columnar, vcf, ssh, json); the queryable codecs (qcolumnar, qjson) stream through their own row-group path.

Live ingest — at1-live

at1_live tails a stream from stdin into an appendable table. Lines are buffered and sealed into a batch by line count or age — whichever comes first — and each sealed batch is appended as an immutable, queryable segment through the full gated pipeline.

tail -F app.log    | python at1_live.py /data/logs --batch-lines 50000
kubectl logs -f pod | python at1_live.py /data/logs --batch-secs 30

The defining property is queryable-while-it-lands: a batch becomes visible to queries only after the verify gate passes. Mid-write batches are never visible, so at any moment the table is byte-exact, hash-chain-audited, and answering scan / SQL / time-travel queries over everything ingested so far. A status line reports live rows/s, ratio, and segments.

from at1_table import AppendableTable
t = AppendableTable("/data/logs")
rows, stats = t.scan(where={"c0": (lo, hi)})   # everything sealed so far
  • --batch-lines (default 50,000) — seal once this many lines buffer.
  • --batch-secs (default 30) — seal once the oldest buffered line is this old; seal more often for tighter loss bounds.

Honest scope: durability is "since the last sealed batch" — lines in the current unsealed batch live in memory (capped by --batch-lines) and are lost on crash, like any buffer-then-seal shipper. On stream end the tail batch is flushed and the chain is verified.