compression-distance detectors · surface similarity

Find near-duplicates and novelty — by compression distance.

at1 ncd uses normalized compression distance (NCD) to measure how similar two files are without parsing them. It powers two detectors: a code-clone and near-duplicate scan, and a clause and document novelty triage. It sees surface structure, not meaning.


Two detectors, one distance

Code-clone & near-duplicate scan

Scan a tree and surface files (or functions) that are near-copies of each other by compression distance — copy-paste clones, forked-and-diverged files, redundant assets — without a language parser or an AST.
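
A rough sketch of how such a scan could work, not at1's actual implementation: compare every pair of files and keep the pairs whose distance falls at or below a threshold. The helper names (ncd, load_tree, clone_pairs) and the use of zlib as the compressor are assumptions made for illustration; grouping the resulting pairs into clusters is left out.

```python
import itertools
import zlib
from pathlib import Path


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as an assumed stand-in compressor."""
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def load_tree(root: str) -> dict[str, bytes]:
    """Read every regular file under root as raw bytes (no parsing, no decoding)."""
    return {str(p): p.read_bytes() for p in Path(root).rglob("*") if p.is_file()}


def clone_pairs(files: dict[str, bytes], threshold: float = 0.2):
    """Return sorted (distance, name_a, name_b) for each pair at or below threshold."""
    hits = []
    for (a, xa), (b, xb) in itertools.combinations(sorted(files.items()), 2):
        d = ncd(xa, xb)
        if d <= threshold:
            hits.append((d, a, b))
    return sorted(hits)  # closest pairs first
```

Calling clone_pairs(load_tree("./src"), threshold=0.2) would list candidate pairs for a reviewer. The all-pairs loop is quadratic in the number of files, so a real scanner over a large tree would likely need some pre-filtering.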

Clause & document novelty triage

Given a corpus of prior documents, score how novel a new one is: which clauses or paragraphs closely echo something already on file, and which are genuinely new — a fast first pass before a human reads closely.
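
One way to picture the scoring, as a sketch under stated assumptions rather than a description of at1 itself: split the new document into paragraphs and take, for each one, the smallest distance to any paragraph already in the corpus. Splitting on blank lines, the function name novelty, and zlib as the compressor are all illustrative choices.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as an assumed stand-in compressor."""
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def paragraphs(text: str) -> list[str]:
    """Split on blank lines; a real clause splitter would be format-specific."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def novelty(new_doc: str, corpus: list[str]) -> list[tuple[float, str]]:
    """Pair each paragraph of new_doc with its distance to the nearest prior paragraph.

    Low scores echo something on file; high scores look new at the byte level.
    """
    prior = [p.encode() for doc in corpus for p in paragraphs(doc)]
    scores = []
    for para in paragraphs(new_doc):
        nearest = min((ncd(para.encode(), q) for q in prior), default=1.0)
        scores.append((nearest, para))
    return scores
```

Very short paragraphs give noisy scores, because the compressor's fixed overhead dominates the compressed length, so a practical triage would probably set a minimum clause length.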

Parser-free, format-agnostic

NCD works on raw bytes, so the same detector runs on source, contracts, logs, or binaries with no per-format setup. It's a triage instrument that points a reviewer at the pairs worth looking at.
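
A minimal sketch of the distance itself, assuming the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length in bytes. The source does not say which compressor at1 ncd uses; zlib and lzma from Python's standard library stand in here, and the names ncd and ncd_files are hypothetical.

```python
import lzma
import zlib
from typing import Callable


def ncd(x: bytes, y: bytes,
        compress: Callable[[bytes], bytes] = zlib.compress) -> float:
    """Normalized compression distance between two byte strings.

    `compress` is any bytes -> bytes compressor; zlib is an assumed default.
    """
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))  # shared structure makes the concatenation cheap
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_files(path_a: str, path_b: str,
              compress: Callable[[bytes], bytes] = zlib.compress) -> float:
    """Same distance on two files, read as raw bytes with no format-specific parsing."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return ncd(fa.read(), fb.read(), compress)
```

For example, ncd_files("a.py", "b.py") uses zlib, while passing lzma.compress as the third argument swaps the compressor. Real compressors add fixed overhead, so identical inputs land slightly above 0 and unrelated ones can land slightly above 1. zlib only looks back 32 KiB, so for larger files a compressor with a bigger window, such as lzma, tends to be the safer choice.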

One command surface

# normalized compression distance: 0 = near-identical, 1 = unrelated
at1 ncd distance a.py b.py                  # -> 0.07 (near-duplicate)
at1 ncd clones ./src --threshold 0.2        # find code-clone / near-dup clusters
at1 ncd novelty new.txt --corpus ./prior/   # which clauses echo prior docs, which are new

Honest scope

NCD is a surface-similarity instrument: it finds byte-level closeness, not semantic equivalence. Two files that say the same thing in different words read as far apart, and two unrelated files that share boilerplate read as close. So it’s a triage pass that points a human at the pairs worth reviewing, not a plagiarism or equivalence verdict. It is calibrated on synthetic and exemplar data; broad real-corpus validation is the explicit graduation gate. For “what kind of file is this?” see at1 origin; for “how rule-governed is it?” see at1 determinism.