Concepts¶

Fundamentals of forensic hashing for people new to DFIR (Digital Forensics and Incident Response).

What is forensic hashing and why it matters¶

A cryptographic hash function takes any input (a file, a disk image, a stream of bytes) and produces a fixed-length fingerprint. If the input changes by even one bit, the fingerprint changes completely.

In forensic work, hashing proves that evidence has not been altered. You hash a file when you collect it. Later, you hash it again. If the hashes match, the file is identical. If they don't, something changed.
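The collect-then-recheck cycle described above can be sketched in a few lines of Python. This is a minimal illustration using the standard library's `hashlib`, not blazehash's implementation; the function names `sha256_file` and `is_unchanged` are hypothetical.

```python
import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Stream the file so large evidence files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_hash):
    """Re-hash the file and compare it with the hash recorded at collection."""
    return sha256_file(path) == recorded_hash.lower()
```

At collection time you would store the result of `sha256_file(path)` in your case notes. At any later point, `is_unchanged(path, recorded_hash)` returns `False` if even one byte of the file differs.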

Courts, regulators, and opposing counsel accept hash-based integrity verification because finding two different files with the same SHA-256 hash is computationally infeasible. Two files with the same SHA-256 hash are, for all practical purposes, identical.

Chain of custody¶

Chain of custody is the documented trail showing who handled evidence, when, and what they did with it. A break in the chain — evidence left unattended, transferred without verification — can get evidence excluded.

Hash manifests strengthen chain of custody by recording the exact state of every file at a specific point in time. Signing the manifest with Ed25519 adds a cryptographic seal: the signature proves that the holder of the private key created the manifest and that nobody altered it afterward.

A strong workflow:

Hash all files at collection time
Sign the manifest
Record the public key separately (case notes, case management system)
At every handoff, verify the signature and re-audit the files
Any tampering between handoffs is detected at the next verification

Without signing, a manifest is just a text file — anyone could edit it. With signing, altering the manifest invalidates the signature.
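The hash-and-re-audit part of this workflow might look like the sketch below. It is an illustration only: the manifest is a plain Python dictionary rather than blazehash's manifest format, the names `build_manifest` and `audit` are hypothetical, and the signing step is left out because Ed25519 is not part of Python's standard library.

```python
import hashlib
from pathlib import Path


def hash_file(path):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hash_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest):
    """Compare the current state of root against a recorded manifest."""
    current = build_manifest(root)
    return {
        "changed": sorted(p for p in manifest if p in current and current[p] != manifest[p]),
        "missing": sorted(p for p in manifest if p not in current),
        "added": sorted(p for p in current if p not in manifest),
    }
```

Running `build_manifest` at collection time and `audit` at each handoff reports files that were modified, removed, or introduced in between. The audit is only as trustworthy as the manifest itself, which is why the manifest needs a signature.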

BLAKE3 vs SHA-256 vs MD5¶

Property	BLAKE3	SHA-256	MD5
Output size	256 bits	256 bits	128 bits
Speed (single-threaded)	~3 GiB/s	~500 MiB/s	~700 MiB/s
Parallelizable	Yes (Merkle tree)	No	No
Collision resistance	Full	Full	Broken since 2004
Court acceptance	Growing	Universal	Legacy — still accepted but weakening

When to use BLAKE3: Default choice. One of the fastest secure hashes available. Use it for speed-critical workflows and as a second algorithm alongside SHA-256.

When to use SHA-256: Court submissions, regulatory compliance, interoperability with existing tools and procedures. SHA-256 is the universally accepted standard.

When to use MD5: Only for backward compatibility with existing manifests, NSRL lookups, or hashdeep interop. MD5 collisions are trivial to produce. Never rely on MD5 alone for evidence integrity.

Warning

SHA-1 is also broken. Google and CWI Amsterdam published a practical collision (SHAttered) in 2017. Use SHA-1 only for hashdeep compatibility, never as your sole integrity hash.
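Recording a legacy hash next to a strong one does not require reading the file twice. The sketch below computes several digests in a single pass with Python's `hashlib`. BLAKE3 is not part of `hashlib`, so the example uses SHA-256 and MD5. The function name `multi_hash` is hypothetical and this is not how blazehash is implemented.

```python
import hashlib


def multi_hash(path, algorithms=("sha256", "md5"), chunk_size=1024 * 1024):
    """Compute several digests of one file while reading it only once."""
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Feed the same chunk to every algorithm before reading the next one.
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

The MD5 value can then be used for lookups against older hash sets, while the SHA-256 value carries the integrity claim.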

Cryptographic vs fuzzy hashing¶

Cryptographic hashing (BLAKE3, SHA-256, MD5) produces an exact fingerprint. Change one bit, and the hash is completely different. This is what you want for integrity verification: "is this file identical to the original?"

Fuzzy hashing (ssdeep, tlsh) produces a digest that can be compared with another fuzzy digest to yield a similarity score. Two files that share large sections of content will have similar fuzzy hashes, even if they aren't identical. This answers a different question: "is this file similar to something I've seen before?"

Use cases for fuzzy hashing:

Malware variants. An attacker recompiles malware with minor changes. Cryptographic hashes miss the connection. Fuzzy hashing catches it.
Modified documents. A suspect edits a Word document. The SHA-256 is completely different, but the ssdeep hash shows 90%+ similarity.
File fragments. Partial file recovery from damaged media. The fragment won't match any cryptographic hash, but fuzzy hashing can identify what it came from.

Algorithm	Strength	Minimum file size
ssdeep	Near-duplicates, fragments, document variants	Any
tlsh	Larger files, better locality sensitivity	~50 bytes

Fuzzy hashes are not cryptographically secure. They don't prove integrity. Use them alongside (not instead of) a cryptographic hash.
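The difference between the two questions can be seen with a small experiment. The sketch below is not ssdeep or tlsh: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, purely to contrast "exactly identical?" with "how similar?". The sample text and the helper `differing_bits` are made up for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def differing_bits(a, b):
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"Quarterly report: revenue rose 4% and costs fell 2%.\n" * 20
edited = original.replace(b"4%", b"5%", 1)  # change a single character

# Cryptographic view: a one-character edit yields an unrelated digest.
bits = differing_bits(
    hashlib.sha256(original).digest(), hashlib.sha256(edited).digest()
)

# Similarity view (difflib stand-in, NOT ssdeep or tlsh): inputs are nearly identical.
# autojunk=False stops difflib from discarding frequently repeated bytes.
ratio = SequenceMatcher(None, original, edited, autojunk=False).ratio()

print(f"SHA-256 bits that differ: {bits} of 256")
print(f"Content similarity ratio: {ratio:.3f}")
```

Roughly half of the SHA-256 output bits should differ, while the similarity ratio stays close to 1. A real fuzzy hash gives a comparable signal without needing both files side by side, because only the compact digests are compared.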

What is NSRL¶

The National Software Reference Library (NSRL) is maintained by NIST. It contains hashes of known software — operating systems, applications, drivers, updates — cataloged from legitimate distribution media.

In forensic analysis, NSRL helps you separate known-good files from everything else. A typical Windows installation contains tens of thousands of system files. NSRL filtering removes them from your analysis queue so you can focus on files that actually matter.

blazehash supports two NSRL formats:

SQLite database — exact lookups, zero false positives, larger file size
Bloom filter — probabilistic lookups, ~0.1% false positive rate, much smaller file size

Use the SQLite database when excluding files from output (--nsrl-exclude). Use the bloom filter for annotation (--nsrl without --nsrl-exclude) where a rare false positive is acceptable.
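To see why a Bloom filter is small but occasionally wrong, consider the minimal sketch below. It is a generic textbook Bloom filter, not blazehash's on-disk format; the class name, the sizing approach, and the use of SHA-256 for bit positions are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """A minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing) rather than running k separate hash functions.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

At a 0.1% target rate the filter needs about 14.4 bits per entry, far less than storing each full hash. The cost is that a file not in the set will occasionally test as present, which is harmless for annotation but risky when the match causes a file to be dropped from the output.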

NTFS Alternate Data Streams¶

On NTFS (the Windows file system), every file can have multiple data streams. The default stream is what you see in Explorer — the file's normal content. But additional named streams can be attached to any file, invisible to most tools.

Alternate Data Streams (ADS) have been used to:

Hide malware alongside legitimate files
Store metadata without modifying the visible file
Exfiltrate data in streams attached to ordinary documents

blazehash's --ads flag hashes these hidden streams alongside the main file content. An ADS entry appears as filename:stream_name in the output.

ADS is a Windows/NTFS feature. The --ads flag is silently ignored on macOS and Linux.

Forensic disk images¶

A forensic disk image is a bit-for-bit copy of a storage device. Two common formats:

E01 / EWF (EnCase format)¶

The Expert Witness Format (EWF) is one of the most widely used forensic image formats. E01 files store:

Compressed disk data in 32 KiB chunks
Embedded MD5 and/or SHA-1 checksums
Case metadata (examiner, case number, notes)
Segment splitting for large images (.E01, .E02, .E03, ...)

blazehash verifies E01 images by decompressing each segment, recomputing the checksums over the data, and comparing them with the stored values. Supported variants: E01, Ex01, L01, Lx01.

Raw / DD images¶

A raw image is an uncompressed, byte-for-byte copy of the disk. Tools like dd, dc3dd, and FTK Imager write raw images. Because the format has no built-in integrity checking, examiners typically create sidecar hash files (.md5, .sha256, .sha512) alongside the image.

blazehash automatically detects and verifies sidecar hash files when you run --verify-image on a raw image.
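Conceptually, sidecar verification amounts to hashing the image and comparing the result with the value stored next to it. The sketch below shows the idea in Python and is not blazehash's implementation. It assumes the sidecar is named after the image with the extension appended (for example `image.dd.sha256`) and that the hex digest is the first whitespace-separated token in the file; both are assumptions, as is the function name `verify_sidecars`.

```python
import hashlib
from pathlib import Path

# Assumed mapping from sidecar extension to hashlib algorithm name.
SIDECAR_ALGORITHMS = {".md5": "md5", ".sha256": "sha256", ".sha512": "sha512"}


def verify_sidecars(image_path):
    """Return {sidecar filename: True/False} for each sidecar found next to the image."""
    image_path = Path(image_path)
    results = {}
    for extension, algorithm in SIDECAR_ALGORITHMS.items():
        sidecar = image_path.with_name(image_path.name + extension)
        if not sidecar.is_file():
            continue
        # Assumed layout: the hex digest is the first whitespace-separated token.
        tokens = sidecar.read_text().split()
        expected = tokens[0].lower() if tokens else ""
        digest = hashlib.new(algorithm)
        with open(image_path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        results[sidecar.name] = digest.hexdigest() == expected
    return results
```

An empty result means no sidecar was found, which is different from a failed verification and should be reported as such.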