data-origin classifier · calibrated on exemplars

How was this file generated?

at1 origin reads a file’s compression signature and estimates its origin: human, source, LLM, PRNG, CSPRNG, encrypted, sensor, or tabular. It’s a fast triage signal, calibrated on synthetic and exemplar data, that sees surface structure, not meaning.

The origin classes

Human-written prose: natural-language redundancy
Source code: token + indentation structure
LLM output: smoother, lower-surprise text
PRNG stream: compressible pseudo-randomness
CSPRNG / encrypted: incompressible, high entropy
Sensor / time-series: predictable low-order structure
Tabular / columnar: repeated field layout
Unknown: no confident class, reported honestly
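
To make "high entropy" concrete, here is a minimal Python sketch that measures the Shannon entropy of a byte histogram. It is an illustration only, not at1's internals; the helper name byte_entropy and the sample data are assumptions made for this example.

```python
import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# Illustrative samples (assumptions, not at1 exemplars).
prose = b"the quick brown fox jumps over the lazy dog " * 200
random_like = os.urandom(len(prose))  # stand-in for CSPRNG / encrypted bytes

print(f"prose:       {byte_entropy(prose):.2f} bits/byte")
print(f"random-like: {byte_entropy(random_like):.2f} bits/byte")
```

Random-like bytes sit close to the 8 bits/byte ceiling, while English text lands well below it. Byte entropy alone cannot separate every class in the list, which is why a single number like this is only one ingredient of a signature.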

Reads the compression signature

Different kinds of data leave different fingerprints under a battery of transforms and codecs — how compressible they are, and by which method. at1 origin turns that signature into a best-guess origin class with a confidence.
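
As a rough illustration of what a compression signature could look like, the Python sketch below runs two transforms (raw and byte-wise delta) through three standard-library codecs and records the size ratio for each pair. The choice of transforms, codecs, and the names delta and signature are assumptions for this example; the actual battery at1 origin uses is not described here.

```python
import bz2
import lzma
import os
import zlib

def delta(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous one (mod 256)."""
    return bytes((b - a) & 0xFF for a, b in zip(b"\x00" + data, data))

# Hypothetical battery: two transforms crossed with three stdlib codecs.
TRANSFORMS = {"raw": lambda d: d, "delta": delta}
CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def signature(data: bytes) -> dict[str, float]:
    """Compressed size / original size for every transform-codec pair."""
    if not data:
        raise ValueError("empty input has no compression signature")
    return {
        f"{t}+{c}": len(compress(transform(data))) / len(data)
        for t, transform in TRANSFORMS.items()
        for c, compress in CODECS.items()
    }

# Illustrative inputs (assumptions): repetitive prose, a steady ramp, random bytes.
samples = {
    "prose": b"the quick brown fox jumps over the lazy dog " * 200,
    "ramp": bytes((i * 3) & 0xFF for i in range(8192)),
    "random": os.urandom(8192),
}
for name, blob in samples.items():
    print(name, {k: round(v, 3) for k, v in signature(blob).items()})
```

The ratios form a small vector per file: random bytes stay near or above 1.0 under every pair, repetitive text shrinks sharply, and the ramp shrinks further once the delta transform exposes its constant step. A classifier would then map such vectors to an origin class.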

One pass, no per-type setup

The same probe runs on any file with no feature engineering and no trained model per format. It's a fast triage pass — "what am I even looking at?" — for a folder of unknown blobs.

Surface structure, not meaning

It classifies from statistical structure, not semantics. It can tell encrypted from prose from a sensor stream; it cannot read what the prose says or verify a claim. Treat it as a signal, not a verdict.

One command surface

# classify how a file was generated, from its compression signature
at1 origin  classify  unknown.bin        # -> csprng/encrypted   (confidence 0.94)
at1 origin  classify  notes.txt          # -> human prose        (confidence 0.81)
at1 origin  classify  export.parquet     # -> tabular            (confidence 0.88)
at1 origin  scan      ./inbox/           # triage a whole folder of unknown blobs

Honest scope

at1 origin is an instrument, not an oracle. It is calibrated on synthetic and exemplar data; broad real-corpus validation is the explicit graduation gate before any hard accuracy claim. It reads surface structure, not meaning, so an LLM asked to imitate a human, or a file that mixes origins, can read ambiguously. In those cases it returns a confidence and, where no class is confident, the “unknown” class rather than a guess. To ask instead how much of a file is rule-governed, see the determinism score; to check whether a specific integer generator is recoverable, see at1 recover.
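
One way to picture the "unknown rather than guessing" behavior is a decision rule that refuses to name a class when the top score is weak or too close to the runner-up. The Python sketch below is a hypothetical rule: the function decide, the thresholds min_confidence and min_margin, and the example scores are all assumptions, not at1 origin's actual logic.

```python
from typing import Mapping

UNKNOWN = "unknown"

def decide(scores: Mapping[str, float], min_confidence: float = 0.6,
           min_margin: float = 0.15) -> tuple[str, float]:
    """Return (label, confidence), falling back to 'unknown' when unsure.

    Both thresholds are illustrative values, not calibrated ones.
    """
    if not scores:
        return UNKNOWN, 0.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Abstain when the winner is weak or barely ahead of the next class.
    if best < min_confidence or best - runner_up < min_margin:
        return UNKNOWN, best
    return best_label, best

# A clear case and an ambiguous one (made-up scores).
print(decide({"csprng/encrypted": 0.94, "prng": 0.04}))
print(decide({"human prose": 0.48, "llm output": 0.45}))
```

The second call shows the ambiguous case from the text: two nearby scores for human and LLM text produce "unknown" along with the top confidence, so a downstream consumer can still see how close the call was.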