
Building the Scholia Engine

How we're automating the journey from raw sources to interactive dossiers


When we set out to build Scholia, we knew the hardest part wouldn't be the writing. It would be the reading.

Every legend in our system — Ford, Rockefeller, Munger, Ferrari — is backed by thousands of pages of primary source material: biographies, memoirs, interviews, financial records, speeches. The challenge was never "can AI summarize a book?" It was: can we build a system that reads like a researcher and writes like an essayist?

That's what the Scholia Sprint is about. Here's what we're building.

The Ingestion Pipeline

Every legend starts as a stack of raw sources. A 700-page biography. A podcast transcript. A collection of shareholder letters. The ingestion pipeline turns that stack into structured knowledge.

The process moves through several stages. First, OCR and text extraction pull clean content from scans and PDFs. Then our ontology extraction layer identifies over 50 distinct criteria — mental models, key decisions, relationships, turning points, contradictions — and maps them into a structured knowledge base.
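
To make that concrete, here is a rough sketch of what one extracted record might look like. The type names, fields, and the handful of criteria shown are illustrative placeholders, not our actual schema.

```typescript
// Illustrative sketch only: names and fields are hypothetical,
// not the production Scholia schema.
type CriterionKind =
  | "mental_model"
  | "key_decision"
  | "relationship"
  | "turning_point"
  | "contradiction"; // a few of the 50+ criteria the ontology layer tracks

interface SourceCitation {
  sourceId: string; // a biography, transcript, or letter collection
  location: string; // page range, timestamp, or section reference
}

interface ExtractedCriterion {
  legend: string;             // "Henry Ford", "John D. Rockefeller", ...
  kind: CriterionKind;        // which criterion this finding falls under
  summary: string;            // short description of what was found
  evidence: SourceCitation[]; // where in the raw sources it came from
  confidence: number;         // 0 to 1: how strongly the sources support it
}
```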

The goal isn't to summarize. It's to decompose. We want to understand not just what a founder did, but how they thought, where their mental models came from, and where those models broke down.

The Research Enricher

Raw extraction gives us facts. The Research Enricher gives us connections.

This is where things get interesting. Using carefully tuned prompts, the system looks across all the extracted material for a given legend and generates what we call "connective tissue" — the cross-disciplinary links between a founder's early experiences, their decision-making patterns, and the outcomes that followed.

When you read a Scholia dossier and notice that Ford's manufacturing obsession maps onto the same mental model as Rockefeller's pipeline strategy, that connection didn't come from a single biography. It came from the enricher synthesizing across multiple sources and multiple frameworks.
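
The exact prompts are tuned per legend, but the general shape of the step looks something like the sketch below. The prompt wording, the EnrichedConnection shape, and the injected complete() function are illustrative assumptions rather than our real implementation; it reuses the ExtractedCriterion shape sketched earlier.

```typescript
// Hypothetical sketch of the enrichment step, not the tuned production prompts.
interface EnrichedConnection {
  legends: string[]; // e.g. ["Henry Ford", "John D. Rockefeller"]
  pattern: string;   // the shared mental model or decision pattern
  rationale: string; // why the system believes the link holds
  evidence: string[]; // pointers back to the extracted criteria
}

async function enrichConnections(
  criteria: ExtractedCriterion[],               // output of the ingestion pipeline
  complete: (prompt: string) => Promise<string> // any LLM completion backend
): Promise<EnrichedConnection[]> {
  const prompt = [
    "You are a research analyst. Below are structured findings extracted",
    "from primary sources about several founders.",
    "Identify cross-disciplinary connections: shared mental models,",
    "similar decision patterns, and places where one founder's framework",
    "illuminates another's. Return a JSON array of connections.",
    "",
    JSON.stringify(criteria, null, 2),
  ].join("\n");

  const raw = await complete(prompt);
  return JSON.parse(raw) as EnrichedConnection[]; // assumes the model returns valid JSON
}
```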

From Pipeline to Page

Getting the data right is only half the problem. The other half is presentation.

We're building our essay system on MDX — a format that combines the readability of Markdown with the power of interactive components. This means our writers can "hand-finish" essays with rich formatting, pull quotes, and interactive elements without touching raw HTML or CSS.
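
As a rough illustration of what that looks like in practice, a pull-quote component might be an ordinary React component that writers then use as plain markup inside an essay. The component name and markup here are made up for the example, not our actual component library.

```tsx
// Illustrative only: PullQuote is a hypothetical component.
import React from "react";

export function PullQuote({
  children,
  attribution,
}: {
  children: React.ReactNode;
  attribution?: string;
}) {
  return (
    <blockquote className="pull-quote">
      {children}
      {attribution && <cite>{attribution}</cite>}
    </blockquote>
  );
}

// Inside an MDX essay, the writer mixes it with ordinary Markdown prose:
//
//   Ford's insight was not the assembly line itself but the system around it.
//
//   <PullQuote attribution="From the dossier">
//     A line the editor wants to pull out of the running text.
//   </PullQuote>
```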

The workflow is designed for iteration: the pipeline produces a structured draft, a human editor refines the narrative, and the MDX system renders it with consistent, publication-quality typography and layout.

What's Next

We're currently refining the pipeline's accuracy on long-form sources — books over 100,000 words that need to be processed in intelligent chunks without losing narrative context. We're also standardizing our visual asset pipeline to ensure every legend gets the same level of illustrative detail.
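
The details are still in flux, but the basic idea is overlapping chunks that carry a slice of the previous chunk forward so names and running arguments survive the boundary. A minimal sketch follows; the sizes, the paragraph-based splitting, and the function name are placeholder assumptions, not our tuned settings.

```typescript
// Minimal sketch of overlap-based chunking; values are placeholders.
interface Chunk {
  index: number;
  text: string; // includes trailing context from the previous chunk
}

function chunkWithOverlap(
  fullText: string,
  maxChars = 20_000,   // rough proxy for a model-friendly chunk size
  overlapChars = 2_000 // context carried into the next chunk
): Chunk[] {
  const paragraphs = fullText.split(/\n\s*\n/); // keep paragraph boundaries intact
  const chunks: Chunk[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (current.length + para.length > maxChars && current.length > 0) {
      chunks.push({ index: chunks.length, text: current });
      // Seed the next chunk with the tail of this one so narrative
      // context isn't lost at the boundary.
      current = current.slice(-overlapChars) + "\n\n";
    }
    current += para + "\n\n";
  }
  if (current.trim().length > 0) {
    chunks.push({ index: chunks.length, text: current });
  }
  return chunks;
}
```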

The end goal hasn't changed: take the world's best primary sources about extraordinary founders and turn them into the most useful, most detailed, most interactive profiles available anywhere.

One legend at a time.
