✨ 1. Introduction: Beyond Hashes and File Types
What if a 32-byte value could tell you not just what something is, but what kind of thing it is?
Traditional hashes answer:
“Have I seen this exact thing before?”
TSV answers:
“What category does this belong to?”
This article introduces the Tomasello Signature Vector (TSV) — a deterministic, fixed-size fingerprint that encodes semantic identity, not just byte-for-byte uniqueness.
🔍 2. Motivation: When Hashing Uniqueness Isn’t Enough
Most systems that operate on raw data use hashes to recognize inputs. But hashes only answer one question:
“Have I seen this exact byte sequence before?”
That’s great for deduplication or integrity — but useless when you’re dealing with categories of data rather than exact matches.
In my case, I needed to efficiently group inputs by type, not identity. Imagine processing a hundred different executable files. A hash treats them as strangers. But operationally, they’re the same kind of thing — and I didn’t want a hundred cache entries. I wanted one fingerprint that said:
“This is executable-type data.”
That’s where the Tomasello Signature Vector (TSV) began — driven by the question:
Can we infer type, structure, or entropy traits — deterministically — in a single glanceable vector?
Existing tools like fuzzy hashes and ML embeddings can approximate similarity, but each falls short here: fuzzy hashes only yield a score when two digests are compared, and embeddings are heavyweight and depend on a trained model. I needed something:
- ✅ Lightweight
- ✅ Deterministic
- ✅ Content-aware
- ✅ Compact (32 bytes)
A semantic fingerprint, not just a cryptographic one.
📦 3. What Traditional Tools Do — and What They Miss
✅ SHA-family hashes (e.g., SHA-256):
- Excellent for integrity checks, deduplication, and digital signatures
- Extremely sensitive — even a 1-bit change yields a completely different hash
- But they say nothing about the type or structure of the data
🧠 Two JPEGs of the same image — one at 90% quality, one at 80%? Wildly different hashes. No shared fingerprint.
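The sensitivity described above is easy to check with Python's standard hashlib module. This minimal sketch flips a single input bit and counts how many digest bits change; the helper names flip_lowest_bit and differing_bits are illustrative only.

```python
import hashlib

def flip_lowest_bit(data: bytes) -> bytes:
    """Return a copy of data with the lowest bit of its first byte flipped."""
    return bytes([data[0] ^ 0x01]) + data[1:]

def differing_bits(a: bytes, b: bytes) -> int:
    """Count bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = b"The quick brown fox jumps over the lazy dog"
modified = flip_lowest_bit(original)

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(modified).digest()

print(differing_bits(original, modified), "input bit differs")
print(digest_a.hex())
print(digest_b.hex())
print(differing_bits(digest_a, digest_b), "of 256 digest bits differ")
```

Expect roughly half of the 256 digest bits to change. That is exactly what makes SHA-256 good at spotting modifications, and why it carries no information about how similar two inputs are.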
✅ Fuzzy hashes (e.g., ssdeep, sdhash):
- Built for detecting similarity (e.g., malware variants)
- The digest only means something relative to another digest; comparing two yields a similarity score, not a stable identity
- Require pairwise comparisons, so no single fingerprint names a category on its own
✅ MIME types and file headers:
- Rule-based classification (magic bytes, format signatures)
- Easily fooled by obfuscation, truncation, or crafted payloads
- Inspect only a few signature bytes rather than the content as a whole, so they are closer to guesswork than analysis
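To see how little a header check actually inspects, here is a small sketch of magic-byte sniffing. The signature table holds the well-known leading bytes of four formats; the function name sniff_by_magic and the sample payload are made up for illustration.

```python
# Well-known leading-byte signatures for a few common formats.
MAGIC_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"PK\x03\x04": "ZIP container (also DOCX, XLSX, JAR)",
    b"\x7fELF": "ELF",
}

def sniff_by_magic(data: bytes) -> str:
    """Classify data by its first few bytes only."""
    for magic, label in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"

payload = bytes(range(256)) * 16        # arbitrary binary payload, no header
disguised = b"%PDF-1.7\n" + payload     # the same payload behind a PDF header

print(sniff_by_magic(disguised))  # PDF
print(sniff_by_magic(payload))    # unknown
```

Nine prepended bytes are enough to flip the verdict, and stripping the header leaves the classifier with nothing to go on. The 4096 bytes that follow are never examined.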
🧪 4. What Makes TSV Different
A Tomasello Signature Vector is:
- 🧠 Content-derived: computed from the data itself, not from filenames, extensions, or declared metadata
- 🔁 Deterministic: the same input always yields the same fingerprint, and inputs of the same type share one
- 🔐 Fixed-length: 32 bytes, compact and portable
- 🧭 Type-reflective: captures structure and category, not identity
- ⚖️ Entropy-aware: reacts meaningfully to randomness, repetition, and format
📌 Examples:
- Two random blobs → Same TSV (classified as Random)
- Two different PDF files → Same TSV (classified as StructuredText)
- A PNG and a DOCX → Distinct TSVs (different structural patterns)
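One way to build intuition for "entropy-aware" is Shannon entropy measured in bits per byte. The sketch below is not TSV's internal metric, which this article does not describe; the band thresholds (4.0 and 7.5) and the names shannon_entropy and entropy_band are assumptions chosen for illustration. It shows how two different random blobs can land in the same coarse band while repetitive text lands in another.

```python
import math
import random
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

def entropy_band(bits_per_byte: float) -> str:
    """Map an entropy value to a coarse band (thresholds are hypothetical)."""
    if bits_per_byte < 4.0:
        return "low"
    if bits_per_byte < 7.5:
        return "medium"
    return "high"

rng = random.Random(0)  # seeded so the example is repeatable
random_a = bytes(rng.getrandbits(8) for _ in range(65536))
random_b = bytes(rng.getrandbits(8) for _ in range(65536))
text = b"hello world " * 5000
zeros = b"\x00" * 65536

for name, blob in [("random_a", random_a), ("random_b", random_b),
                   ("text", text), ("zeros", zeros)]:
    h = shannon_entropy(blob)
    print(f"{name:9s} {h:5.2f} bits/byte -> {entropy_band(h)}")
```

The two random blobs differ byte for byte, yet both fall in the "high" band, while the repeated text and the run of zeros fall in "low". Collapsing exact values into bands is what lets distinct inputs share a label.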
🧬 5. How TSV Works (At a Glance)
Without diving into internal mechanics, TSV follows a simple, deterministic pipeline:
- 🔍 Analyzes the full byte structure of the input
- 🧱 Applies a consistent transform stack to extract type-relevant features
- 🧠 Encodes semantic identity into a compact 32-byte fingerprint
- 🧰 Requires no machine learning or training data
- 🌐 Runs identically across platforms — fast, portable, and repeatable
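The article keeps TSV's internals out of scope, so the following is a toy stand-in rather than the actual algorithm. It only mirrors the general shape of the pipeline above: read the whole input, reduce it to a few coarse features, and emit 32 bytes. The chosen features (entropy and printable-byte ratio), the band edges, and the use of SHA-256 to stretch the features to 32 bytes are all assumptions made for this sketch.

```python
import hashlib
import math
import random
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution, 0.0 to 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, LF, or CR."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(data)

def quantize(value: float, edges: tuple) -> int:
    """Return the index of the first edge that value falls below."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def toy_type_fingerprint(data: bytes) -> bytes:
    """Toy 32-byte type-level fingerprint (hypothetical, NOT the TSV algorithm)."""
    features = (
        quantize(entropy_bits_per_byte(data), (2.0, 5.0, 7.5)),  # assumed edges
        quantize(printable_ratio(data), (0.3, 0.9)),             # assumed edges
    )
    # Hashing the quantized features gives a fixed-size, deterministic output.
    return hashlib.sha256(bytes(features)).digest()

rng = random.Random(1)  # seeded so the example is repeatable
blob_a = bytes(rng.getrandbits(8) for _ in range(16384))
blob_b = bytes(rng.getrandbits(8) for _ in range(16384))
text_a = b"The quick brown fox jumps over the lazy dog. " * 200
text_b = b"Pack my box with five dozen liquor jugs! " * 200

print(toy_type_fingerprint(text_a) == toy_type_fingerprint(text_b))  # True
print(toy_type_fingerprint(blob_a) == toy_type_fingerprint(blob_b))  # True
print(toy_type_fingerprint(text_a) == toy_type_fingerprint(blob_a))  # False
```

Quantizing before hashing is the key design choice in this toy: small differences between inputs vanish inside a band, so different inputs of the same broad kind collapse to one value. A real classifier would need far richer features than these two.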
🌍 6. Why TSV Matters
TSV enables instant classification, with no pairwise comparison needed. It can drive:
- Malware triage: Identify suspicious data characteristics (e.g., entropy, structure) to guide deeper inspection — like spotting a binary blob hiding in a supposed Word doc
- Forensic analysis: Validate type without relying on file extensions or headers
- Dataset segmentation: Auto-group files by type in large unlabeled corpora
- Entropy zone labeling: Detect high/low entropy regions for compression or crypto
- Stream routing: Dynamically route packets or chunks based on semantic content
TSV is already used in Mango to detect file type from raw input data, even when filenames, headers, and metadata are stripped.
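As an illustration of the entropy zone labeling use case, here is a minimal sketch that splits a buffer into fixed windows, measures Shannon entropy per window, and merges neighbours that share a label. It is a generic technique, not TSV's implementation; the window size and the high/low thresholds are assumed values, and label_entropy_zones is a hypothetical name.

```python
import math
import random
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution, 0.0 to 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

def label_entropy_zones(data: bytes, window: int = 1024,
                        high: float = 7.0, low: float = 3.0) -> list:
    """Return merged (start, end, label) zones; thresholds are assumptions."""
    zones = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        h = entropy_bits_per_byte(chunk)
        label = "high" if h >= high else "low" if h <= low else "medium"
        end = start + len(chunk)
        if zones and zones[-1][2] == label:
            zones[-1] = (zones[-1][0], end, label)  # extend the previous zone
        else:
            zones.append((start, end, label))
    return zones

rng = random.Random(7)  # seeded so the example is repeatable
padding = b"A" * 2048
noise = bytes(rng.getrandbits(8) for _ in range(2048))

print(label_entropy_zones(padding + noise))
# [(0, 2048, 'low'), (2048, 4096, 'high')]
```

A compressor could skip the "high" zone, since already-random data will not shrink, and an inspection tool could flag it as possibly encrypted or packed. Note that a short final window caps the achievable entropy at log2 of its length, so trailing fragments may need separate handling.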
🧪 7. Real-World Implications
TSVs enable:
- 🔍 Quick similarity lookups across large corpora
- 🔐 Type-aware encryption policies
- 🧠 Binary-to-class mapping without requiring machine learning
SHA-256 tells you if two things are identical. TSV tells you if two things are the same kind. They’re complementary, not interchangeable.
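The difference shows up as soon as you use each value as a grouping key. In the sketch below, coarse_kind is a deliberately crude, hypothetical stand-in for a type-level fingerprint (it only separates printable text from everything else), and the sample names and contents are invented.

```python
import hashlib
from collections import defaultdict

def group_by(items, key_fn):
    """Group (name, data) pairs by key_fn(data)."""
    groups = defaultdict(list)
    for name, data in items:
        groups[key_fn(data)].append(name)
    return dict(groups)

def exact_key(data: bytes) -> str:
    """Identity key: equal only for byte-identical inputs."""
    return hashlib.sha256(data).hexdigest()

def coarse_kind(data: bytes) -> str:
    """Hypothetical stand-in for a type-level key: text vs. binary."""
    is_text = all(32 <= b < 127 or b in (9, 10, 13) for b in data)
    return "text" if is_text else "binary"

samples = [
    ("notes.txt", b"meeting at noon\n"),
    ("todo.txt", b"buy milk\n"),
    ("readme.md", b"# Title\nSome text\n"),
    ("blob1.bin", bytes([0, 255, 17, 3, 200])),
    ("blob2.bin", bytes([9, 0, 128, 64, 1])),
]

by_hash = group_by(samples, exact_key)
by_kind = group_by(samples, coarse_kind)

print(len(by_hash))  # 5
print(by_kind)
# {'text': ['notes.txt', 'todo.txt', 'readme.md'], 'binary': ['blob1.bin', 'blob2.bin']}
```

Keyed by hash, five inputs produce five entries. Keyed by kind, they produce two. A pipeline can keep both keys side by side: one for deduplication, one for routing.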
🏁 8. Conclusion
- TSV is a novel addition to the fingerprinting space
- It fills a real-world gap left by hashes, fuzzy matchers, and embeddings
- It offers a portable, fast, deterministic, and semantically meaningful signature for any data
Identity is useful — but knowing what something is? That’s powerful.
📣 9. Call to Action
The Tomasello Signature Vector (TSV) is open for exploration, integration, and improvement.
- 🔍 Explore the code: Full implementation available from my GitHub repository
- 🧩 Extensible by design: TSV supports new detectors and domain-specific classifiers — easily plug in new metrics as your needs evolve
- 🖨️ Human-readable: Use the TSV Pretty Printer, also available from my GitHub repository, to interpret vector contents quickly
- 🤝 Let’s collaborate: Cryptographers, DFIR specialists, data scientists — if you see a use case or want to push this further, reach out
A deterministic, type-reflective fingerprint might just be the missing piece in your data pipeline.
