Query & SQL

The queryable codecs (qcolumnar for tables, qjson for NDJSON) store a footer of per-block min/max zone maps. A query skips whole row-groups it can rule out, then decompresses only the projected columns of the survivors — returning the exact original records.

Predicates & projection

at1 query t.at1 --where ts:1704067200000:1704067210000 --eq maker:True --select aggId,price,ts
  • Range on integer and decimal columns (prices) — skips groups via the zone map.
  • Equality on integer/decimal/text columns.
  • By name: use a field/column name (qjson fields, or a qcolumnar header row) instead of an index.
  • Projection: only the selected columns are decompressed.

SQL-ish front-end

at1 sql t.at1 "SELECT aggId, price WHERE ts BETWEEN 1704067200000 AND 1704067210000 AND price >= 42000"

Supports SELECT *|cols and WHERE col BETWEEN/>=/<=/>/</= joined by AND, compiled to pushdown.

Then what? — --extract

A search that just returns rows leaves you with a list. Add --extract and the matches become a new, standalone .at1: integrity-stamped (its own SHA-256 trailer) and itself queryable — built only from the rows the pushdown matched, so the source is never rehydrated. Find it → pull it out → prove it, the same primitive as video clip extraction.

# don't just read the matches — pull them OUT as a new verified, queryable .at1
at1 query t.at1 --where price:44000:44100 --extract hot.at1
at1 sql   t.at1 "SELECT ts,price WHERE price >= 44900" --extract spike.at1

# same primitive across the line:
at1 query g.at1 --region chr22:16050000-16100000 --extract region.at1   # genomic sub-cohort
#   tables:  AppendableTable("/data/t").extract("hot.at1", where={"c1": (lo, hi)})
#   bundles: at1_bundle.subset("in.at1", "sub.at1", ["report.pdf", 3])

at1 integrity hot.at1            # the extract is byte-exact (its own SHA-256 trailer)
at1 query     hot.at1 --where price:44050:44060   # ...and itself queryable

Range-GET (object storage)

The RangeReader reads only the container header + footer, then fetches just the touched block byte ranges — the basis for serving selective queries from S3/GCS with HTTP range-GET. A selective query reads 1–19% of the file depending on selectivity.

from at1_rangereader import RangeReader
rows, stats = RangeReader("t.at1").scan(where={"ts": (lo, hi)}, select=["aggId","ts"])
# stats['file_bytes_read'] << file size

Partitioned data lakes

A catalog records each partition file's global min/max per column, so a query skips whole files before opening them. A time query over many daily files can read ~1% of the dataset.

from at1_catalog import AT1Catalog
rows, stats = AT1Catalog("data_dir").scan(where={"ts": (lo, hi)}, select=["aggId","ts"])

Logs by time

python at1_logquery.py build access.log access.at1
python at1_logquery.py query access.at1 "2024-01-01T00:00:00" "2024-01-01T01:00:00"

Use it from the engine you already run

AT-1's queryable container decodes through a single engine-agnostic core (at1_block.c) and into Apache Arrow, so standard SQL engines query an .at1 archive in place — no rip-and-replace, same byte-exact original underneath. The same file, queried from whatever a customer already uses:

  • DuckDB — native extension + Python adapter (core verified byte-exact; fuzzed, 0 crashes).
  • SQLite — native virtual-table module over the same core (built & verified live).
  • PostgreSQL — foreign data wrapper, .at1 as foreign tables (code-complete).
  • ClickHouse and Apache Spark — via the Arrow path (verified on real engines).
  • Polars / pandas / Dask import at1_arrow, zero new code (verified).
  • Trino / Presto / Flink — via the Postgres/JDBC bridge (verified live), or Arrow-batch connectors.

Predicate pushdown (zone-map row-group skipping) rides through each adapter, so a selective query still reads only the touched blocks. Details: docs/ADAPTERS.md in the repo.

Point lookups — --bloom

Zone maps accelerate ranges on ordered columns. For equality on high-cardinality text (IPs, IDs, hashes) — the "find this exact record" forensic/audit query — add per-block bloom filters at compress time:

at1 compress qcolumnar flows.csv flows.at1 --bloom
at1 sql flows.at1 "SELECT * WHERE SrcAddr = '10.0.0.7'"   # skips the blocks that can't contain it

Measured 20–50× less I/O on a point lookup; ~29% opt-in footer overhead, byte-exact reconstruction unchanged. The SQLite and Postgres adapters consult it natively too.

Growing tables (append-only)

Keep addingdata without rewriting what's there. at1_table.AppendableTable writes each batch as an immutable queryable segment; queries span all segments with file-level + block-level skipping, and every segment stays byte-exact. A growing, searchable, exact record — the table-format play for logs, events, ticks, and audit trails.

Lakehouse cold tier (Iceberg)

Store cold Iceberg/Parquet partitions as AT-1 (2–7× smaller) and serve Parquet on read — transparent to the engine and the catalog. A pluggable pyiceberg storage layer covers the Python ecosystem; a cold-tier gateway (HTTP/S3 + cache) covers Spark/Trino/Snowflake. See docs/ICEBERG.md.