Streaming & live ingest
Two ways to handle data that's too big for memory, or still arriving: bounded-memory streaming compression for multi-GB files, and at1-live to tail a stream into an append-only table that's queryable while it lands.
Pulling from a URL? at1 fetch streams a download straight into a verified .at1 without ever landing the raw plaintext on disk — see Ingest from a URL.
Streaming compression — --stream
--stream reads the input in chunks of whole lines (--chunk-lines, default 500,000) and writes one self-describing frame per chunk, so peak memory is O(one chunk) regardless of file size — the way to compress multi-GB logs and columnar data on a small box.
at1 compress log app.log app.at1 --stream
at1 compress columnar events.csv events.at1 --stream --chunk-lines 500000
at1 compress columnar events.csv events.at1 --stream --backend zstd
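The chunk-per-frame idea can be illustrated with a minimal Python sketch. This is not at1's implementation or its frame format: the function names, the 4-byte length prefix, and the use of the standard-library lzma module as a stand-in for the xz backend are all assumptions made for illustration.

```python
import io
import lzma
import struct
from itertools import islice


def iter_line_chunks(src, chunk_lines=500_000):
    """Yield chunks of at most chunk_lines whole lines from a binary file object."""
    while True:
        lines = list(islice(src, chunk_lines))
        if not lines:
            return
        yield b"".join(lines)


def compress_stream(src, dst, chunk_lines=500_000):
    """Write one independently decodable frame per chunk; return the frame count."""
    frames = 0
    for chunk in iter_line_chunks(src, chunk_lines):
        payload = lzma.compress(chunk)
        # Hypothetical framing: 4-byte big-endian payload length, then the payload.
        dst.write(struct.pack(">I", len(payload)))
        dst.write(payload)
        frames += 1
    return frames


def decompress_stream(src, dst):
    """Decode frames one at a time, so the read side is bounded by one frame too."""
    while True:
        header = src.read(4)
        if not header:
            return
        (size,) = struct.unpack(">I", header)
        dst.write(lzma.decompress(src.read(size)))


# Usage: ten lines in chunks of three lines each.
raw = b"".join(b"line %d\n" % i for i in range(10))
packed = io.BytesIO()
frame_count = compress_stream(io.BytesIO(raw), packed, chunk_lines=3)
```

Only one chunk and its compressed payload are held at a time, which is where the O(one chunk) memory bound comes from in this sketch.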
- Backends: xz (max ratio, default) or zstd (fast / high-throughput) via --backend.
- Per-frame safety fallback: each chunk is also raw-compressed with the same backend; whichever is smaller is kept, so a structured codec never loses to plain compression on a bad chunk.
- Verified lossless: each frame is decoded back and checked against the original chunk during compression, and a SHA-256 integrity trailer over the original bytes is written; on a mismatch the output is not trusted.
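The fallback and verification steps listed above might look roughly like the following sketch. It assumes a hypothetical structured codec passed in as a pair of callables, uses lzma as a stand-in backend, and invents the one-byte frame tags; none of these names come from at1 itself.

```python
import hashlib
import lzma


def encode_frame(chunk, structured_encode, structured_decode):
    """Return (tag, payload) for one chunk, keeping the smaller of two encodings.

    structured_encode / structured_decode are hypothetical stand-ins for a
    domain codec; both candidates go through the same backend (xz here).
    """
    structured = lzma.compress(structured_encode(chunk))
    raw = lzma.compress(chunk)
    if len(structured) < len(raw):
        tag, payload = b"S", structured
        decoded = structured_decode(lzma.decompress(payload))
    else:
        tag, payload = b"R", raw
        decoded = lzma.decompress(payload)
    # Verify gate: decode the frame about to be written and compare byte for byte.
    if decoded != chunk:
        raise ValueError("round-trip mismatch; refusing to emit frame")
    return tag, payload


def encode_all(chunks, structured_encode, structured_decode):
    """Encode every chunk; return (frames, SHA-256 digest of the original bytes)."""
    digest = hashlib.sha256()
    frames = []
    for chunk in chunks:
        frames.append(encode_frame(chunk, structured_encode, structured_decode))
        digest.update(chunk)
    return frames, digest.digest()
```

Because the raw candidate is always available, the frame written is never larger than plain backend compression of the same chunk, and a codec that fails to round-trip is caught before anything is emitted.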
Streaming covers the line-oriented domains (log, columnar, vcf, ssh, json); the queryable codecs (qcolumnar, qjson) stream through their own row-group path.
Live ingest — at1-live
at1_live tails a stream from stdin into an appendable table. Lines are buffered and sealed into a batch by line count or age — whichever comes first — and each sealed batch is appended as an immutable, queryable segment through the full gated pipeline.
tail -F app.log | python at1_live.py /data/logs --batch-lines 50000
kubectl logs -f pod | python at1_live.py /data/logs --batch-secs 30
The defining property is queryable-while-it-lands: a batch becomes visible to queries only after the verify gate passes. Mid-write batches are never visible, so at any moment the table is byte-exact, hash-chain-audited, and answering scan / SQL / time-travel queries over everything ingested so far. A status line reports live rows/s, ratio, and segments.
from at1_table import AppendableTable
t = AppendableTable("/data/logs")
rows, stats = t.scan(where={"c0": (lo, hi)})  # everything sealed so far
- --batch-lines (default 50,000): seal once this many lines are buffered.
- --batch-secs (default 30): seal once the oldest buffered line is this many seconds old; seal more often for tighter loss bounds.
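The seal-by-count-or-age rule can be sketched as a small buffer class. The Batcher name and its methods are hypothetical and only illustrate the rule; they are not at1_live's internals.

```python
import time


class Batcher:
    """Buffer lines and seal a batch by line count or age, whichever comes first."""

    def __init__(self, batch_lines=50_000, batch_secs=30.0, clock=time.monotonic):
        self.batch_lines = batch_lines
        self.batch_secs = batch_secs
        self.clock = clock
        self.lines = []
        self.oldest = None  # arrival time of the oldest buffered line

    def add(self, line):
        """Buffer one line; return a sealed batch if a limit was reached, else None."""
        if not self.lines:
            self.oldest = self.clock()
        self.lines.append(line)
        return self.poll()

    def poll(self):
        """Seal if the buffer is full or its oldest line is too old."""
        if not self.lines:
            return None
        full = len(self.lines) >= self.batch_lines
        stale = self.clock() - self.oldest >= self.batch_secs
        return self.flush() if full or stale else None

    def flush(self):
        """Seal whatever is buffered, for example at end of stream."""
        batch, self.lines, self.oldest = self.lines, [], None
        return batch
```

In this sketch a caller that blocks on stdin would only reach poll() when a new line arrives, so a real tailer would likely need a timer or a read with a timeout for the age limit to fire on a quiet stream.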
Honest scope: durability is "since the last sealed batch" — lines in the current unsealed batch live in memory (capped by --batch-lines) and are lost on crash, like any buffer-then-seal shipper. On stream end the tail batch is flushed and the chain is verified.
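How a verify gate and a hash chain fit together can be shown with an in-memory sketch. The ChainedLog class, the all-zero genesis link, and the exact link formula are assumptions for illustration; the actual segment and chain layout used by at1 is not described here.

```python
import hashlib

GENESIS = b"\x00" * 32  # assumed starting link for an empty table


def link(prev, segment):
    """Chain link for a segment: SHA-256 over the previous link and the segment hash."""
    return hashlib.sha256(prev + hashlib.sha256(segment).digest()).digest()


class ChainedLog:
    """In-memory stand-in for an append-only table of sealed segments."""

    def __init__(self):
        self.segments = []  # only segments that passed the gate ever appear here
        self.links = []

    def append(self, segment, verify):
        """Publish a segment only if verify(segment) passes."""
        if not verify(segment):
            raise ValueError("verify gate failed; segment not published")
        prev = self.links[-1] if self.links else GENESIS
        self.segments.append(segment)
        self.links.append(link(prev, segment))

    def verify_chain(self):
        """Recompute every link from the stored segments and compare."""
        prev = GENESIS
        for segment, stored in zip(self.segments, self.links):
            prev = link(prev, segment)
            if prev != stored:
                return False
        return True
```

Since each link depends on the one before it, changing any earlier segment invalidates every later link, which is what makes an end-of-stream chain check meaningful.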