Soundness: Recovering the Truth Without Being Deceived

Quick answer: Soundness measures how much of what a tool reports is actually true, not how much it finds. Mathematically it is true positives divided by the sum of true and false positives: TP / (TP + FP). If a tool reports a planted false trail as real, or claims to recover genuinely unrecoverable data, its soundness collapses even when its surface accuracy looks high. The DSET Forensics Benchmark (DFB) scores recall and soundness side by side but reports them as separate measures.

Why surface accuracy misleads

A 95 percent correct rate looks impressive, but if the remaining 5 percent includes a deliberately planted decoy reported as real, it can point an investigation at the wrong person. In a forensic context a false positive is far more dangerous than a missed finding, because it can become the basis of a claim in court. The real test is trustworthiness.

Two axes: recall and soundness

Recall tells you how much of the real evidence you caught; soundness tells you how much of what you reported is real. A submission can score higher recall by answering more, yet lose soundness if it mixes in fabrications. This trade-off between recall and soundness is the central, reproducible result of DFB.
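The two axes can be made concrete with a small sketch. The finding IDs and the function name below are invented for illustration and are not part of DFB; the only thing taken from the text is the definition of soundness as true positives over true plus false positives.

```python
def recall_and_soundness(reported, ground_truth):
    """Return (recall, soundness) for a set of reported finding IDs.

    recall    = true positives / all real findings
    soundness = true positives / (true positives + false positives)
    """
    true_pos = len(reported & ground_truth)
    false_pos = len(reported - ground_truth)
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    soundness = true_pos / (true_pos + false_pos) if reported else 0.0
    return recall, soundness


# Hypothetical case with four real findings.
truth = {"f1", "f2", "f3", "f4"}

# A cautious submission reports only what it can support.
cautious = {"f1", "f2"}

# An eager submission answers more, but mixes in three fabrications.
eager = {"f1", "f2", "f3", "x1", "x2", "x3"}

print(recall_and_soundness(cautious, truth))  # (0.5, 1.0)
print(recall_and_soundness(eager, truth))     # (0.75, 0.5)
```

The eager submission wins on recall (0.75 against 0.5) but half of what it reports is false, which is exactly the pattern a single accuracy figure hides.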

Planted evidence and the honesty test

In Operation Nightshade, planted decoys coexist with genuine findings and are never announced. A subset of items is genuinely unrecoverable; claiming to recover them is hallucination, while an honest declaration scores like a correct finding. See why this matters for AI agents.
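One way to picture the honesty test is as a mapping from item type and claim to an outcome. This is a minimal sketch under stated assumptions, not DFB's actual scoring code: the `Kind` and `Claim` names, the "missed" and "neutral" outcomes, and the choice to treat an ignored decoy as neutral are all hypothetical. Only three rules come from the text: a decoy reported as real is a false positive, a claimed recovery of unrecoverable data is a false positive, and an honest declaration scores like a correct finding.

```python
from collections import Counter
from enum import Enum


class Kind(Enum):
    GENUINE = "genuine"
    DECOY = "decoy"
    UNRECOVERABLE = "unrecoverable"


class Claim(Enum):
    REPORTED = "reported"  # presented as a real, recovered finding
    DECLARED_UNRECOVERABLE = "declared_unrecoverable"
    SILENT = "silent"


def classify(kind, claim):
    """Map one (item kind, claim) pair to an outcome label (assumed scheme)."""
    if claim is Claim.REPORTED:
        # Reporting a decoy or an unrecoverable item as real is a false positive.
        return "true_positive" if kind is Kind.GENUINE else "false_positive"
    if claim is Claim.DECLARED_UNRECOVERABLE and kind is Kind.UNRECOVERABLE:
        # An honest declaration scores like a correct finding.
        return "true_positive"
    if kind is Kind.GENUINE:
        # Assumption: a genuine item that is not reported counts as a miss.
        return "missed"
    # Assumption: not reporting a decoy or unrecoverable item is neutral.
    return "neutral"


answers = [
    (Kind.GENUINE, Claim.REPORTED),
    (Kind.DECOY, Claim.REPORTED),
    (Kind.UNRECOVERABLE, Claim.REPORTED),
    (Kind.UNRECOVERABLE, Claim.DECLARED_UNRECOVERABLE),
    (Kind.GENUINE, Claim.SILENT),
]

outcomes = [classify(kind, claim) for kind, claim in answers]
counts = Counter(outcomes)
soundness = counts["true_positive"] / (
    counts["true_positive"] + counts["false_positive"]
)
print(outcomes)
print(soundness)  # 0.5
```

In this toy run the decoy and the impossible recovery claim each cost as much as the two correct answers gain, so soundness falls to 0.5.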

Confidence calibration: court logic

A good expert does not present an uncertain finding as certain. DFB measures this: overconfident wrong answers incur an extra penalty, directly targeting hallucination.
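The exact penalty DFB applies is not given here. As a stand-in, the sketch below uses the Brier score, a standard calibration measure, to show how a penalty can grow with the confidence attached to a wrong answer. The function name and the sample confidences are illustrative assumptions.

```python
def brier_penalty(confidence, correct):
    """Squared gap between stated confidence and the actual outcome.

    This is the Brier score for a single answer, used here as a stand-in;
    it is not DFB's published penalty. Lower is better.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2


# A wrong answer stated with certainty is penalised four times as heavily
# as the same wrong answer stated with 50 percent confidence.
print(brier_penalty(1.0, correct=False))  # 1.0
print(brier_penalty(0.5, correct=False))  # 0.25
print(brier_penalty(1.0, correct=True))   # 0.0
```

Under a rule of this shape, hedging an uncertain finding is cheaper than asserting it, which mirrors how a careful expert witness is expected to testify.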

Why soundness matters now

As autonomous agents enter casework, confident fabrication becomes a real risk. A soundness-aware benchmark is a prerequisite for trust. See how DFB works and the methodology paper.

FAQ

Is soundness the same as precision? They are conceptually close. Soundness applies precision logic to the forensic context and counts deception and impossible recovery claims as false positives.

See whether your tool is truly trustworthy: enter Operation Nightshade.