Ana, the Reader
I’m building Mouseion — named after the research institute in Alexandria — a self-reading library. You feed it sources. It gives you back knowledge: who appears in them, what is claimed, and where sources cross each other, with the epistemological proof layer attached. The processing layer that does this had no name until this week.
It started with the Groningen War Puzzles — 3,150 handwritten resistance cards I made machine-readable in a single night. Monique Brinks, a historian who spent ten years researching Het Scholtenhuis in Groningen, asked sharp questions about it. To answer them properly I needed far more sources than just those cards. So I decided to do it right: build a library where every relevant book, dossier, and archival document is not only stored but readable, crossable with the database information I already had. Together, unprecedentedly powerful.
The kind of library I needed didn’t exist. In academia, groups explore similar approaches — the DANIEL project at LITIS, Université de Rouen Normandie, integrates layout analysis, handwriting recognition, and entity extraction into one model. But I found no working system for a single researcher that can cross multiple source types and attach proof layers. So I built one. That meant starting with research into the landscape of text recognition: OCR for printed text, HTR for handwriting.
The dominant player in the archival field is Transkribus. I wanted to understand their business model — not because I wanted to be a customer, but because everyone in the archival world talks about it and I wanted to know where the money sits. So I dug into their pricing, ownership structure, annual reports, EU subsidies received, Dutch projects and organisations that use it — and projects that deliberately don’t, and why.
An Austrian cooperative, READ-COOP, develops it. The model runs on credits: one credit per handwritten page, roughly 24 cents on the public price list. Institutions get discounts through the Metagrapho API and custom contracts, but even at 50% off, large archives get expensive fast. Five million pages with institutional discount still comes to hundreds of thousands of euros. That’s serious public money.
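The rough arithmetic, taking the public list price and assuming a flat 50% institutional discount: 5,000,000 pages × €0.24 × 0.5 ≈ €600,000.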
The Huygens Institute and the Dutch National Archives did that math too, for their GLOBALISE project. They wanted to make five million pages of VOC archives from the seventeenth and eighteenth centuries searchable with automatic handwriting recognition. With Transkribus it would be far too expensive. Instead they built Loghi, an open-source alternative.
I dug into Loghi. The GitHub issues told more than the marketing copy. The generic handwriting model scores up to 96% on standard material, but for specific historical datasets fine-tuning is essential — and even experienced IT engineers need hours to get it running. I checked the code and discovered the entire pipeline is locked to one chip manufacturer: NVIDIA. Without their GPUs, nothing runs accelerated. On Apple Silicon — the chip in every modern Mac — it only works through emulation, too slow for serious archival work. Ana is different: Apple Vision runs natively on the Neural Engine, Tesseract on the CPU, Claude Vision in the cloud. None of the three needs NVIDIA. Loghi’s lead developer has hinted at a move to transformers, away from the current architecture. They know their engine is aging.
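A minimal sketch of what that hardware independence looks like in practice. The probe order and engine labels are my illustration, not Ana’s actual code:

```python
# Sketch: three recognition backends, none of which needs CUDA.
# The probe order and labels are illustrative, not Ana's actual code.
import platform
import shutil

def pick_engine() -> str:
    """Prefer the Neural Engine, fall back to CPU, then to the cloud."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "apple_vision"   # native on Apple Silicon's Neural Engine
    if shutil.which("tesseract") is not None:
        return "tesseract"      # plain CPU, runs anywhere
    return "claude_vision"      # cloud API, no local GPU involved
```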
But my real insights were elsewhere.
Both Transkribus and Loghi stop at the same point. They read a scan and deliver text. That’s it. They have no understanding of what’s there, they don’t isolate entities*, they don’t find connections between scanned sources. A baptismal record from 1696 and a shipping log from 1742 get the same treatment: pixels to letters, and there it ends.
My pipeline goes exactly where Transkribus and Loghi stop. Apple Vision reads the scan at 6% WER (Word Error Rate), a batch pipeline processes hundreds of scans at once, and on my server runs the extraction script that pulls out entities and connections. It runs robustly and at unprecedented speed. The orchestration between these components isn’t AI — it’s just a script. The difference with Transkribus and Loghi isn’t whether it works, but where it stops: at text, or one layer deeper.
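To make “just a script” concrete: a hedged sketch of the kind of batch loop this implies. The helper functions are hypothetical stand-ins for the real components, not Ana’s code:

```python
# Sketch of the orchestration layer: plain Python, no AI in the glue.
# The three helpers are hypothetical stand-ins for the real components.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_scan(scan: Path) -> str: ...             # engine call, e.g. Apple Vision
def extract_entities(text: str) -> list: ...      # the extraction script's job
def store(scan: Path, text: str, entities: list) -> None: ...  # with provenance

def process(scan: Path) -> None:
    text = read_scan(scan)
    entities = extract_entities(text)
    store(scan, text, entities)          # keep the trail back to the scan

def run_batch(folder: str, workers: int = 8) -> None:
    """Process hundreds of scans in one pass."""
    scans = sorted(Path(folder).glob("*.jpg"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process, scans))
```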
In March I read — through my own pipeline — a handwritten baptismal record from 1696, from the church of Gits in Flanders. Not just the text. I recognised names, placed Hubertus Carolus Vanelslander as the son of Hubertus senior and Joanna de Kemele, crossed the mention with other records, and traced a family line back to the end of the seventeenth century. The scan didn’t become text. The scan became a person in a network, connected to the Groningen War Puzzles I was solving elsewhere.
That wasn’t OCR. That was reading with comprehension. I did it with a combination of Apple Vision, Claude, and my own scripts — without having a name for it. The same pattern repeated elsewhere in my system: 52 books had gone through the pipeline by then, I’d extracted over 13,000 entities, and my WW2 archive had grown to 2 million records, 196,000 of which are confirmed by two or more independent sources. Extremely powerful research tooling.
Now the entire reading and extraction process has a name: Ana.
In the Mouseion of Alexandria, the Anagnostes (ἀναγνώστης) was the reader — the one who made texts accessible to scholars. The task wasn’t just reading aloud. It was interpreting, connecting, and making visible what lay hidden in the writings.
Ana does the same, at unprecedented scale and speed. To be clear: at its core this is software, not AI in the common sense. More like AI in a different sense: Archival Intelligence.
Ana 2.0 is running. It works in four layers: recognise the source, read it with the right engine, understand the content, and store everything including provenance. The work is distributed across machines I already own: the MacBook during the day, the Mac Mini at night, and a web server for what needs to run continuously — effortlessly working day and night.
The idea I’m most excited about is the source profile lexicon.
Archivists have worked with description standards for decades — EAD and Records in Contexts, for example — but those describe a source after it’s been read. An archivist reads a box of dossiers, places them in context, then fills in a form that becomes EAD-XML behind the scenes. The reading itself is still done by hand and head. My profiles guide the reading itself: which engine works best for which type of register or logbook, which fields to expect for that type, which vocabulary is used in it, and which pitfalls we’ve learned from before.
A baptismal record is always structured the same way: a date, which church, name of the child, father, mother, the witnesses, and sometimes a minister. When Ana knows this before it starts reading, everything changes. It expects a mother’s name after the father’s name. It recognises that “Hvbertvs” is probably Hubertus. It flags when a field is missing that’s normally there. This prior knowledge makes reading better, and faster.
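As an illustration, a source profile for baptismal records could look roughly like this. The structure, field names, and vocabulary are my sketch of the idea, not the lexicon’s actual format:

```python
# Hypothetical shape of a source profile; the real lexicon format may differ.
BAPTISMAL_RECORD = {
    "engine": "apple_vision",                       # best engine for this type
    "fields": ["date", "church", "child", "father",
               "mother", "witnesses", "minister?"],  # '?' marks optional
    "vocabulary": ["baptisatus", "filius", "filia", "testes"],
    "pitfalls": {"v_for_u": True},                  # 'Hvbertvs' -> 'Hubertus'
}

def missing_fields(record: dict, profile: dict) -> list[str]:
    """Flag expected fields that the extraction did not find."""
    required = [f for f in profile["fields"] if not f.endswith("?")]
    return [f for f in required if f not in record]
```

The payoff sits in that last function: a record missing a mother’s name gets flagged before it ever reaches the database.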
For a new source type — military service records, notarial deeds — I build the profile together with Ana first. We go through twenty-five scans, discover the pattern, and record which fields are present and which pitfalls to expect. The lexicon grows with every new source, and becomes self-reinforcing: the more profiles we have, the faster we recognise what resembles them.
For printed books you don’t need to build the profile. The author already did. A good non-fiction book reads itself. It has an index with all important persons and concepts, a bibliography that maps out the citation network, footnotes that mark where the sources lie, and a table of contents that reveals the structure. I call this Shelf-Keying: every book you add makes all others smarter, because the keys from book A get tried on book B.
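Mechanically, Shelf-Keying can start as something very simple: probe one book’s text with another book’s index entries. A minimal sketch with hypothetical data:

```python
# Illustrative Shelf-Keying: the index entries of book A are tried as
# search keys against the text of book B. All names here are hypothetical.
def shelf_key(index_a: set[str], text_b: str) -> set[str]:
    """Return index entries from book A that also occur in book B."""
    lowered = text_b.lower()
    return {entry for entry in index_a if entry.lower() in lowered}

hits = shelf_key({"Scholtenhuis", "Vanelslander"},
                 "By 1944 the Scholtenhuis in Groningen had become notorious ...")
print(hits)   # {'Scholtenhuis'}: a candidate crossing between two books
```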
Transkribus and Loghi deliver text. Ana delivers text plus the layer above it: entities, crossings, provenance. This isn’t a quality claim about their HTR — they’re better at that than I’ll ever be — but a difference in scope.
It’s worth pausing on what that means: an entity isn’t a word. It’s the thing the word refers to.
When I read in a baptismal record from 1696 that “Hubertus Carolus Vanelslander, son of Hubertus and Joanna de Kemele, was baptised in the church of Gits,” I extract six entities: three persons, one place, one building, one date. Plus the relations: father, mother, child. When that same Hubertus appears thirty years later in a marriage record, Ana recognises him — not because the letters are the same, but because the entity is the same. If you know what each word refers to, you can deduce when two different mentions refer to the same thing.
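Written out as data, those six entities and three relations could look like this (a sketch of the shape, not Ana’s storage format):

```python
# The six entities and three relations from the 1696 record, as a sketch
# of the shape; Ana's actual storage format may differ.
entities = [
    {"id": "p1", "type": "person", "name": "Hubertus Carolus Vanelslander"},
    {"id": "p2", "type": "person", "name": "Hubertus Vanelslander (senior)"},
    {"id": "p3", "type": "person", "name": "Joanna de Kemele"},
    {"id": "pl1", "type": "place", "name": "Gits"},
    {"id": "b1", "type": "building", "name": "church of Gits"},
    {"id": "d1", "type": "date", "value": "1696"},
]
relations = [
    ("p2", "father_of", "p1"),
    ("p3", "mother_of", "p1"),
    ("p1", "child_of", "p2"),   # mirrors the record's own wording
]
```

The marriage record thirty years later doesn’t add a new person; it adds a new mention pointing at p1.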
And that’s exactly where proof emerges: one source saying someone exists is a mention. Three independent sources naming the same person on the same date in the same place — that’s a fact. Without isolated entities you can’t cross. Without crossing, no proof.
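That rule is simple enough to state as code. A sketch, with an illustrative threshold and made-up source names:

```python
# Sketch of the mention-vs-fact rule; the threshold is illustrative.
def status(confirming_sources: set[str], threshold: int = 3) -> str:
    """Classify a claim by how many independent sources back it."""
    n = len(confirming_sources)
    if n >= threshold:
        return "fact"
    if n == 2:
        return "confirmed"
    return "mention"

print(status({"baptismal_register_gits"}))              # mention
print(status({"baptismal_register_gits",
              "marriage_record_1726",
              "parish_census"}))                        # fact
```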
This element is one of the most crucial in my approach, and relatively new in this form. It resembles what the archival world calls Linked Data, with the difference that here the epistemological proof layers are integrated — not as a later annotation, but as part of the extraction itself.
What this yields isn’t faster OCR or a smarter pipeline. It’s that sources sitting separately in archive basements — an MI5 dossier in London, a baptismal record in Gits, a deportation list in Bad Arolsen — for the first time fit in the same query. A name spelled differently across three collections but recognised by Ana as a possible match, with a certainty layer attached and the question: is this the same person? That isn’t automation. That’s a kind of research that simply didn’t exist before.
I had the design reviewed by a reasoning team I built to catch blind spots and biases in my own thinking. It found things I didn’t see. That the feedback loop can corrupt the system if I consistently correct wrongly. That ten components are really four layers. That the system initially had no degradation mode — a way to keep running with fewer components when something fails. The sharpest point it made: keep hammering on chain reliability, because every percent of error opens a hole that compounds downstream. One percent on extraction, one percent on linking, one percent on merging — at the end of the chain you no longer know what you know.
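The arithmetic behind that warning: at 99% reliability per step, three chained steps leave 0.99³ ≈ 97% of records untouched, so roughly three in every hundred carry at least one error, and an error made early feeds every link and merge built on top of it.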
So Ana never merges entities automatically. It presents the possibility and I decide. Name variants are hypotheses that strengthen as more sources confirm them, not facts the machine establishes.
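In code terms, the merge step ends in a review queue rather than an action. A sketch under my own naming:

```python
# Sketch: candidate matches end up in a review queue with their evidence;
# nothing merges until a human decides. Names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MergeHypothesis:
    entity_a: str
    entity_b: str
    confirming_sources: set[str] = field(default_factory=set)
    decision: Optional[str] = None   # stays None until a human rules on it

    def strength(self) -> int:
        """More confirming sources mean a stronger hypothesis, not a verdict."""
        return len(self.confirming_sources)

queue = [MergeHypothesis("Hubertus Vanelslander", "Hvbertvs van Elslander",
                         {"baptismal_register_gits", "marriage_record_1726"})]
```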
What I’ve built and have running robustly: a working system, a growing profile lexicon, and a sharp picture of the landscape. I know what Transkribus costs, where Loghi fails, and what my own tools could already do before I realised it.
Next week I’m tackling several enormous collections fully automatically: pulling in hundreds of thousands of files, having them read at high quality in a single night, and exposing the first crossings. The judgment on edge cases remains human work — but making everything visible in the first place is Ana’s job.
What was true for me is likely true for others working with fragmented sources: at some point you stop settling for text and want to go further. And that direction works.
* By “entities” I mean the specific things I isolate from texts and databases and link to each other. In my work these include: persons, organisations, roles, ranks, functions, places, dates, events, aliases, dossiers, identifiers, decorations, objects, works (books, documentaries), amounts or quantities, and the relations between all these elements.