Figure: TSV output for a “Natural” input type, showing hex values, flags, and their interpreted meanings (printable ASCII text, moderate entropy, and low token variety). It demonstrates TSV’s ability to classify data structure and content characteristics.

TSV: A Fixed-Length Semantic Fingerprint for Data Classification

✨ 1. Introduction: Beyond Hashes and File Types

What if a 32-byte value could tell you not just what something is, but what kind of thing it is?

Traditional hashes answer:

“Have I seen this exact thing before?”

TSV answers:

“What category does this belong to?”

This article introduces the Tomasello Signature Vector (TSV) — a deterministic, fixed-size fingerprint that encodes semantic identity, not just byte-for-byte uniqueness.


🔍 2. Motivation: When Hashing Uniqueness Isn’t Enough

Most systems that operate on raw data use hashes to recognize inputs. But hashes only answer one question:

“Have I seen this exact byte sequence before?”

That’s great for deduplication or integrity — but useless when you’re dealing with categories of data rather than exact matches.

In my case, I needed to efficiently group inputs by type, not identity. Imagine processing a hundred different executable files. A hash treats them as strangers. But operationally, they’re the same kind of thing — and I didn’t want a hundred cache entries. I wanted one fingerprint that said:

“This is executable-type data.”

That’s where the Tomasello Signature Vector (TSV) began — driven by the question:

Can we infer type, structure, or entropy traits — deterministically — in a single glanceable vector?

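Existing tools like fuzzy hashes and ML embeddings can approximate similarity, but each falls short for this job: embeddings are heavyweight and depend on a trained model, and fuzzy hashes only mean something when one digest is compared against another. I needed something: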

  • ✅ Lightweight
  • ✅ Deterministic
  • ✅ Content-aware
  • ✅ Compact (32 bytes)

A semantic fingerprint, not just a cryptographic one.


📦 3. What Traditional Tools Do — and What They Miss

✅ SHA-family hashes (e.g., SHA-256):

  • Excellent for integrity checks, deduplication, and digital signatures
  • Extremely sensitive — even a 1-bit change yields a completely different hash
  • But: Say nothing about the type or structure of the data

🧠 Two JPEGs of the same image — one at 90% quality, one at 80%? Wildly different hashes. No shared fingerprint.
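To make the sensitivity point concrete, here is a minimal sketch using Python's standard `hashlib`. The helper names (`flip_lowest_bit`, `differing_bits`) are made up for this illustration; only `hashlib.sha256` is a real library call.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def flip_lowest_bit(data: bytes) -> bytes:
    """Return a copy of data with the lowest bit of its first byte flipped."""
    return bytes([data[0] ^ 0x01]) + data[1:]

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

original = b"The quick brown fox jumps over the lazy dog"
tweaked = flip_lowest_bit(original)  # differs from original by exactly one bit

digest_a = sha256_hex(original)
digest_b = sha256_hex(tweaked)

print(digest_a)
print(digest_b)
print("bits that differ:", differing_bits(digest_a, digest_b), "of 256")
```

The two digests share nothing recognizable, which is exactly what you want for integrity checks and exactly why they cannot serve as a category label.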

✅ Fuzzy hashes (e.g., ssdeep, sdhash):

  • Built for detecting similarity (e.g., malware variants)
  • The digest is only meaningful when scored against another digest; on its own it does not name a category
  • Require pairwise comparisons, so no single value stands for a whole class of inputs

✅ MIME types and file headers:

  • Rule-based classification (magic bytes, format signatures)
  • Easily fooled by obfuscation, truncation, or crafted payloads
  • Inspect only a few signature bytes rather than the whole content, so the result is closer to a guess than an analysis
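As a rough illustration of how header-based detection can be fooled, the sketch below implements a tiny magic-byte sniffer and feeds it random bytes wearing a PDF header. The signature table holds three well-known signatures; `sniff_by_header` is a hypothetical helper written for this example, not a real library function.

```python
import os

# A few well-known format signatures (magic bytes) and their MIME types.
MAGIC_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

def sniff_by_header(data: bytes) -> str:
    """Guess a MIME type by looking only at the leading bytes."""
    for magic, mime in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"

random_payload = os.urandom(4096)
disguised = b"%PDF-1.7\n" + random_payload  # random bytes behind a PDF header

print(sniff_by_header(random_payload))
print(sniff_by_header(disguised))
```

The sniffer reports the disguised payload as a PDF because it never looks past the first few bytes. A fingerprint derived from the full content would have a chance to notice that the body is statistically random.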

🧪 4. What Makes TSV Different

A Tomasello Signature Vector is:

  • 🧠 Content-derived: computed from the bytes themselves, not from filenames, extensions, or declared headers
  • 🔁 Deterministic: the same input always yields the same fingerprint, and inputs of the same type share one
  • 🔐 Fixed-length: 32 bytes, compact and portable
  • 🧭 Type-reflective: captures structure and category, not identity
  • ⚖️ Entropy-aware: reacts meaningfully to randomness, repetition, and format

📌 Examples:

  • Two random blobs → Same TSV (classified as Random)
  • Two different PDF files → Same TSV (classified as StructuredText)
  • A PNG and a DOCX → Distinct TSVs (different structural patterns)
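To show how two different inputs can land on the same label, here is a toy classifier built from two simple statistics: byte-level Shannon entropy and the share of printable ASCII. This is not the TSV algorithm. The thresholds (7.5 bits per byte, 95% printable) and the `Text` and `Other` labels are assumptions chosen for the example; only `Random` echoes a class name used above.

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, LF, or CR."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 32 <= b <= 126 or b in (9, 10, 13))
    return printable / len(data)

def coarse_label(data: bytes) -> str:
    """Toy labels with assumed thresholds; needs a few KiB of input to be meaningful."""
    if shannon_entropy(data) > 7.5:
        return "Random"
    if printable_ratio(data) > 0.95:
        return "Text"
    return "Other"

blob_a = os.urandom(65536)
blob_b = os.urandom(65536)
prose = b"The quick brown fox jumps over the lazy dog.\n" * 200

print(coarse_label(blob_a), coarse_label(blob_b))  # two different blobs, one label
print(coarse_label(prose))
```

The two random blobs have different bytes and different SHA-256 digests, yet both collapse to the same label, which is the behavior the examples above describe.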

🧰 4. Prior Art: What It’s Not

Table: a comparison of fingerprinting and classification methods, including SHA-256, ssdeep, sdhash, SimHash, embeddings (BERT, GPT), perceptual hashes, and MIME detection. Each row outlines the method’s approach to similarity and highlights key differences compared to TSV, such as determinism, need for training, or sensitivity to input changes.

TSV fills the gap left by hashes, embeddings, and fuzzy matchers.

🧬 5. How TSV Works (At a Glance)

Without diving into internal mechanics, TSV follows a simple, deterministic pipeline:

  • 🔍 Analyzes the full byte structure of the input
  • 🧱 Applies a consistent transform stack to extract type-relevant features
  • 🧠 Encodes semantic identity into a compact 32-byte fingerprint
  • 🧰 Requires no machine learning or training data
  • 🌐 Runs identically across platforms — fast, portable, and repeatable
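The internal mechanics are intentionally left out here, so the following is not TSV. It is a minimal stand-in that only shows the shape of such a pipeline: scan every byte, reduce the scan to a few coarse buckets, and pack the result into a fixed 32 bytes with no training step. The three features, their bucket sizes, and the name `toy_fingerprint` are all assumptions made for illustration.

```python
import math
import os
from collections import Counter

FINGERPRINT_SIZE = 32  # bytes, matching the fixed length described in this article

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

def toy_fingerprint(data: bytes) -> bytes:
    """Hypothetical stand-in: pack three bucketed features into 32 bytes."""
    total = max(len(data), 1)
    printable = sum(1 for b in data if 32 <= b <= 126 or b in (9, 10, 13))
    features = [
        min(int(entropy_bits_per_byte(data)), 7),  # entropy bucket, 0..7
        (printable * 4) // total,                  # printable-ratio bucket, 0..4
        len(set(data)) // 64,                      # distinct-byte bucket, 0..4
    ]
    # Unused positions are zero-padded so the output length never varies.
    return bytes(features).ljust(FINGERPRINT_SIZE, b"\x00")

random_a = toy_fingerprint(os.urandom(65536))
random_b = toy_fingerprint(os.urandom(65536))
text_fp = toy_fingerprint(b"plain readable text, repeated many times\n" * 500)

print(random_a.hex())
print(random_b.hex())
print(text_fp.hex())
```

Bucketing is what makes different inputs of the same kind collide on purpose: fine-grained values are rounded away until only category-level traits remain. Because the code uses only integer bucketing over deterministic statistics, the same input produces the same 32 bytes on every run.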

🌍 6. Why TSV Matters

TSV enables instant classification, with no pairwise comparison needed.

It can drive:

  • Malware triage: Identify suspicious data characteristics (e.g., entropy, structure) to guide deeper inspection — like spotting a binary blob hiding in a supposed Word doc
  • Forensic analysis: Validate type without relying on file extensions or headers
  • Dataset segmentation: Auto-group files by type in large unlabeled corpora
  • Entropy zone labeling: Detect high/low entropy regions for compression or crypto
  • Stream routing: Dynamically route packets or chunks based on semantic content

TSV is already used in Mango to detect file type from raw input data, even when filenames, headers, and metadata are stripped.
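Of the use cases above, entropy zone labeling is the easiest to sketch in isolation. The example below slides a fixed window over a buffer and labels each window as high or low entropy. The window size (1024 bytes) and the threshold (6.0 bits per byte) are assumptions picked for the demo, and the function is a generic technique rather than TSV's own implementation.

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(chunk: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())

def label_entropy_zones(data: bytes, window: int = 1024, threshold: float = 6.0):
    """Yield (offset, entropy, label) for consecutive fixed-size windows."""
    for offset in range(0, len(data), window):
        chunk = data[offset:offset + window]
        h = entropy_bits_per_byte(chunk)
        yield offset, h, ("high" if h >= threshold else "low")

# 2 KiB of a repeated byte followed by 2 KiB of random bytes.
sample = b"A" * 2048 + os.urandom(2048)

for offset, h, label in label_entropy_zones(sample):
    print(f"offset {offset:5d}  entropy {h:4.2f}  {label}")
```

A compressor could skip the high-entropy windows, and a triage tool could flag them for a closer look.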


🧪 7. Real-World Implications

TSVs enable:

  • 🔍 Quick similarity lookups across large corpora
  • 🔐 Type-aware encryption policies
  • 🧠 Binary-to-class mapping without requiring machine learning

SHA-256 tells you if two things are identical. TSV tells you if two things are the same kind. They’re complementary, not interchangeable.
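One way to picture the two working together is to index the same inputs twice: once by SHA-256 for exact matches, once by a type key for grouping. In this sketch `kind_key` is a deliberately crude, hypothetical stand-in for a type fingerprint (it only separates mostly-printable data from everything else), and the file names are invented sample data.

```python
import hashlib
from collections import defaultdict

def exact_key(data: bytes) -> str:
    """Identity key: equal only for byte-identical inputs."""
    return hashlib.sha256(data).hexdigest()

def kind_key(data: bytes) -> str:
    """Hypothetical stand-in for a type fingerprint: 'text' or 'binary'."""
    if not data:
        return "binary"
    printable = sum(1 for b in data if 32 <= b <= 126 or b in (9, 10, 13))
    return "text" if printable / len(data) > 0.95 else "binary"

inputs = {
    "a.txt": b"hello world\n",
    "b.txt": b"a different sentence\n",
    "a_copy.txt": b"hello world\n",
    "blob.bin": bytes(range(256)),
}

by_identity = defaultdict(list)
by_kind = defaultdict(list)
for name, data in inputs.items():
    by_identity[exact_key(data)].append(name)
    by_kind[kind_key(data)].append(name)

# Three distinct digests: a_copy.txt shares one with a.txt.
print(len(by_identity))
# Two groups: every text file in one bucket, the blob in the other.
print(dict(by_kind))
```

Both indexes answer their question with a single dictionary lookup. The identity index finds the duplicate, and the kind index groups the three text files without ever comparing them to each other.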


🏁 8. Conclusion

  • TSV is a novel addition to the fingerprinting space
  • It fills a real-world gap left by hashes, fuzzy matchers, and embeddings
  • It offers a portable, fast, deterministic, and semantically meaningful signature for any data

Identity is useful — but knowing what something is? That’s powerful.


📣 9. Call to Action

The Tomasello Signature Vector (TSV) is open for exploration, integration, and improvement.

  • 🔍 Explore the code: Full implementation available from my GitHub repository
  • 🧩 Extensible by design: TSV supports new detectors and domain-specific classifiers — easily plug in new metrics as your needs evolve
  • 🖨️ Human-readable: Use the TSV Pretty Printer, also available from my GitHub repository, to interpret vector contents quickly
  • 🤝 Let’s collaborate: Cryptographers, DFIR specialists, data scientists — if you see a use case or want to push this further, reach out

A deterministic, type-reflective fingerprint might just be the missing piece in your data pipeline.