
Epistemological Extraction

April 20, 2026

Existing extraction systems find information in text and give it a confidence score: 0.87. That number means nothing. Is it 87% because the OCR is bad? Because the name is ambiguous? Because the pattern matched weakly?

This system does something different. It extracts information and simultaneously explains how certain it is and why.

The problem

Historical archives are scanned with imperfect OCR. They're written in three languages. They contain 11.4 million words across 28,461 text chunks. No one has time to read them all. So you build a system to find names, dates, locations, and actions automatically.

But if you can't say how reliable each finding is, the system is unusable for scholarship. A historian needs to know: is this date certain? Is this name possibly an OCR error? Is this attribution based on the text itself or on an inference?

The method

Every extraction is assigned a certainty layer from the 12-layer Synaptic Architecture model:

| Layer | Meaning | Example | Certainty |
|---|---|---|---|
| L2 | It says so literally | "6 March 1942" | 100% |
| L3 | Source convention | "7.11.44" from MI5 (always day.month) | 99% |
| L6 | Pattern recognition | "Nov 41", "in March 1942" | 95-98% |
| L7 | Vague indication | "early March", "Spring 1942" | 85-90% |

The layer doesn't just say how certain a finding is. It says why. And it says how to improve: an L7 date can become L2 when another source confirms it with an exact date. Cross-source confirmation promotes the certainty layer.
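A minimal sketch of how layer assignment and promotion could work for dates. The layer numbers and example strings come from the table above; the regular expressions, the `Mention` record, and the rule that an exact date must fall inside the vague date's interval are assumptions made for illustration, not the system's actual patterns.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

# (layer, pattern) pairs, checked in order. Layer numbers follow the table;
# the regular expressions are assumptions for this sketch.
LAYER_PATTERNS = [
    (2, re.compile(rf"\d{{1,2}} {MONTH} \d{{4}}")),      # "6 March 1942"
    (3, re.compile(r"\d{1,2}\.\d{1,2}\.\d{2}")),         # "7.11.44"
    (6, re.compile(rf"(?:in )?{MONTH} '?\d{{2,4}}")),    # "Nov 41", "in March 1942"
    (7, re.compile(rf"(?:early|mid|late) {MONTH}"
                   r"|(?:Spring|Summer|Autumn|Winter) \d{4}")),
]


def classify_date(text: str) -> Optional[int]:
    """Return the certainty layer of a date string, or None if no pattern fits."""
    for layer, pattern in LAYER_PATTERNS:
        if pattern.fullmatch(text):
            return layer
    return None


@dataclass(frozen=True)
class Mention:
    """Hypothetical record: one dated mention, resolved to a date interval."""
    source: str
    layer: int
    start: date
    end: date


def promote(vague: Mention, others: list) -> Mention:
    """Promote a vague mention to L2 when a different source gives an exact
    L2 date that falls inside the vague mention's interval."""
    for other in others:
        is_exact = other.layer == 2 and other.start == other.end
        independent = other.source != vague.source
        if is_exact and independent and vague.start <= other.start <= vague.end:
            return Mention(vague.source, 2, other.start, other.end)
    return vague
```

With this sketch, `classify_date("early March")` returns 7, and an L7 mention covering the first days of March 1942 is promoted to L2 once a second source supplies an exact date inside that interval. A mention confirmed only by its own source stays where it is.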

Six risks, six mitigations

1. The table is a snapshot

Every pattern improvement invalidates the table. Solution: recompute everything, which takes 43 seconds. No versioning is needed at this scale.

2. Patterns miss what they don't know

56 pattern types cover six dimensions. A seventh dimension (irony, causal connections) is invisible. Solution: add patterns when a question requires something new.

3. Dates sit in different sentences than actions

86% of dates are not in the same sentence as the action they date. Solution: chunk-wide date coupling with distance measurement. This brought date coverage from 5% to 60%.
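Chunk-wide coupling can be sketched roughly as follows: every action in a chunk is paired with the nearest date anywhere in that chunk, and the distance is kept so that later steps can weigh the pairing. The two regular expressions, the verb list, and the choice to measure distance in whitespace-separated words are assumptions for this example.

```python
import re

# Assumed patterns: exact dates only, plus a hypothetical list of action verbs.
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)
ACTION_RE = re.compile(r"\b(?:arrested|appointed|executed)\b")


def couple_dates(chunk: str) -> list:
    """Pair each action with the nearest date in the same chunk.

    Returns (action, date_text, distance_in_words) tuples.
    """
    dates = [(m.start(), m.group()) for m in DATE_RE.finditer(chunk)]
    pairs = []
    if not dates:
        return pairs
    for action in ACTION_RE.finditer(chunk):
        # Nearest date by character offset, before or after the action.
        pos, text = min(dates, key=lambda d: abs(d[0] - action.start()))
        lo, hi = sorted((pos, action.start()))
        distance = len(chunk[lo:hi].split())  # words between the two starts
        pairs.append((action.group(), text, distance))
    return pairs


chunk = ("On 6 March 1942 the transmitter was located. "
         "Lauwers was arrested that evening.")
print(couple_dates(chunk))
# [('arrested', '6 March 1942', 9)]
```

The date and the action sit in different sentences here, so a sentence-level matcher would miss the pair. The distance value is what lets a later step treat a date 9 words away differently from one 200 words away.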

4. Sources share blind spots

If all sources omit the same person, we can't see them either. Solution: add sources from different perspectives. Russian archives fill Western blind spots.

5. Common words as person names

0.8% of canonical names are ambiguous ("Burgers" = person or citizens?). Solution: context check. Title or rank before the word = person. Article before it = common word.
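The context check can be sketched as a look at the token directly before the ambiguous word. The title and article lists below are hypothetical and far from complete, and Dutch surname prefixes such as the "De" in "De Jong" would need extra handling that this sketch leaves out.

```python
# Hypothetical word lists; a real system would need fuller lists per language.
TITLES = {"dr", "mr", "major", "captain", "lieutenant", "colonel"}
ARTICLES = {"the", "de", "het", "een"}


def classify_ambiguous(tokens: list, index: int) -> str:
    """Decide whether tokens[index] is a person name or a common word,
    using only the token directly before it."""
    if index == 0:
        return "unknown"  # no preceding context to check
    previous = tokens[index - 1].lower().rstrip(".")
    if previous in TITLES:
        return "person"
    if previous in ARTICLES:
        return "common word"
    return "unknown"


print(classify_ambiguous(["Major", "Burgers", "reported"], 1))   # person
print(classify_ambiguous(["de", "burgers", "protested"], 1))     # common word
```

Returning "unknown" instead of guessing keeps the undecided cases visible, so they can be passed on to a stronger check.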

6. Implicit relationships within a chunk

"Lauwers, arrested, March '42, The Hague" in one chunk doesn't prove they belong together. Solution, applied in layers: distance measurement (90% → 95%), verb direction (95% → 97%), enumeration detection (97% → 99%), and an LLM batch for the remaining 0.08% (99.92% resolves without an LLM).

The proof: the arrest of Lauwers

53 chunks across 4 sources (De Jong, MI5, Parliamentary Inquiry, Mitrokhin Archive) describe the arrest of radio operator Lauwers in 1942. The system reconstructs:

Persons involved (all sources combined): Giskes, Schreieder, Ridderhof, Taconis, Kup, Bodens

Locations: The Hague, Scheveningen, Driebergen

Date: 6 March 1942 — 7 independent mentions, 2 sources, certainty layer L2

MI5 adds two persons that De Jong never mentions: Kup and Bodens. That is not an isolated case. Scaled across the entire corpus, the same comparison reveals 1,174 persons in MI5 files that De Jong's 30-volume standard work omits.
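At its core, measuring a blind spot is a set difference over canonical person names per source. A minimal sketch follows. The function name is hypothetical, and the two example sets are an illustrative split: the text above only establishes that Kup and Bodens appear in MI5 and not in De Jong.

```python
def blind_spots(reference: set, other: set) -> set:
    """Persons that `other` mentions and `reference` omits."""
    return other - reference


# Illustrative sets only; the real per-source name lists are not given here.
de_jong = {"Giskes", "Schreieder"}
mi5 = {"Giskes", "Kup", "Bodens"}

print(sorted(blind_spots(de_jong, mi5)))
# ['Bodens', 'Kup']
```

The comparison only means something once names are canonical. Otherwise an OCR variant of a known name would be counted as a new person.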

Five contradictions found

The system automatically found five cases where De Jong's dates conflict with primary sources:

| Person | Action | De Jong | Other source | Difference |
|---|---|---|---|---|
| Christiansen | appointment | 19 July 1940 (L2) | Inquiry: 25 June 1940 (L2) | 24 days, both hard dates |
| Ferwerda | appointment | Sept '44 (L6) | Inquiry: 30 Aug 1944 (L2) | ~1 day |
| Sevenster | appointment | Summer '40 (L7) | Inquiry: Autumn '40 (L7) | 1-3 months, both vague |
| Sikorski | death | Nov '44 (L3) | Inquiry: March '45 (L6) | 4 months |
| Wehner | death | March '45 (L3) | MI5: Oct '44 (L6) | 5 months |

The Christiansen case is the strongest: two L2 dates (hard evidence) that clash by 24 days about the same appointment. That is a testable historical discrepancy.
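The Christiansen comparison can be reproduced with a small sketch: group claims by person and action, then compare dates across sources. The `DatedClaim` record and `find_contradictions` function are hypothetical names, and this version handles exact dates only. Vague dates such as "Summer '40" would have to be compared as intervals.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class DatedClaim:
    """Hypothetical record: one source's date for one person and action."""
    person: str
    action: str
    source: str
    layer: int
    when: date


def find_contradictions(claims: list, min_gap_days: int = 1) -> list:
    """Return (person, action, source_a, source_b, gap_in_days) for every
    pair of claims about the same event whose dates differ."""
    conflicts = []
    for a, b in combinations(claims, 2):
        same_event = (a.person, a.action) == (b.person, b.action)
        if same_event and a.source != b.source:
            gap = abs((a.when - b.when).days)
            if gap >= min_gap_days:
                conflicts.append((a.person, a.action, a.source, b.source, gap))
    return conflicts


claims = [
    DatedClaim("Christiansen", "appointment", "De Jong", 2, date(1940, 7, 19)),
    DatedClaim("Christiansen", "appointment", "Inquiry", 2, date(1940, 6, 25)),
]
print(find_contradictions(claims))
# [('Christiansen', 'appointment', 'De Jong', 'Inquiry', 24)]
```

Keeping the layer on each claim is what makes it possible to rank the results. A clash between two L2 dates is a hard discrepancy, while a clash between two L7 dates may only reflect vagueness.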

The numbers

| Metric | Value |
|---|---|
| Pattern types | 56 (6 domain plugins) |
| Total extractions | 132,122 |
| Dates found | 51,446 (with certainty layers) |
| Persons recognised | 96% (33,783 canonical) |
| Scan time | 43 seconds on 11.4M words |
| Resolvable without LLM | 99.92% |
| Contradictions found | 5 (after filtering 26 false positives) |
| Blind spots measured | 1,174 persons MI5 knows that De Jong omits |

What makes this different

Every finding is:

The system doesn't say "Lauwers was probably arrested in March." It says: "7 independent mentions from 2 sources confirm 6 March 1942, certainty layer 2, here are the 7 sentences."

Part of the Life Lens System. Built with Claude Code on 20 April 2026.