FILE 0x9F·MINING 18,000 EMAILS FOR FACTS ABOUT MY OWN LIFE

Mining 18,000 emails for facts about my own life

April 21, 2026 · llm, agents, memory

I had three years of email sitting in a DynamoDB table and a memory store wired into my assistant. The assistant was making decisions with no recall of what was actually in that inbox. So I ran a multi-pass extraction over the corpus with a fan-out of subagents and got real results — including discovering that several of my own assumptions were wrong.

The corpus

  • 18,347 items in the DynamoDB mail table
  • ~500 MB on disk
  • 611 unique senders
  • Spans three years

Pass 1 — per-sender summary fan-out

Aggregated to 611 senders, batched 50 per subagent, fanned out 13 parallel subagents. Each subagent looked at a batch's worth of senders, classified them, and wrote durable facts directly to memory via an MCP tool.
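The aggregate-and-batch step can be sketched roughly as follows. This is an illustration rather than the code from the run: the `sender` attribute name and the synthetic items are assumptions, and only the numbers (611 senders, 50 per batch) come from the text above.

```python
from collections import defaultdict
from typing import Iterable

BATCH_SIZE = 50  # senders per subagent, as described above


def group_by_sender(items: Iterable[dict]) -> dict[str, list[dict]]:
    """Collapse mail items into one bucket per normalized sender address."""
    by_sender: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        # "sender" is a hypothetical attribute name, not the real schema.
        by_sender[item["sender"].strip().lower()].append(item)
    return dict(by_sender)


def make_batches(senders: list[str], size: int = BATCH_SIZE) -> list[list[str]]:
    """Split the sender list into consecutive batches of at most `size`."""
    return [senders[i:i + size] for i in range(0, len(senders), size)]


# 611 synthetic senders stand in for the real corpus.
items = [{"sender": f"user{i}@example.com", "subject": "hi"} for i in range(611)]
batches = make_batches(sorted(group_by_sender(items)))
print(len(batches), len(batches[-1]))  # 13 batches, the last one holding 11 senders
```

Twelve full batches of 50 cover 600 senders, and the remaining 11 form the thirteenth batch, which matches the 13 subagents mentioned above.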

About 422 memory entries got written across topics like people/*, vendor/*, subscription/*, financial/*, medical/*, organization/*, and automated/*. About 140 senders got skipped as pure marketing or test fixtures.

Total wall-clock: under an hour with the subagents running in parallel. Sequentially this would've been most of a day.

Pass 2 — deep dive on the high-signal threads

13 senders looked worth deeper extraction — the actual humans I'd been corresponding with. One subagent per sender, pulled full body text via a GSI on FROM#<sender>.
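A per-sender pull through a GSI might look something like the sketch below. The table name, index name, and key attribute name are all hypothetical; the post only states that the index is keyed on `FROM#<sender>`. The request shape and the `LastEvaluatedKey` / `ExclusiveStartKey` pagination follow the low-level DynamoDB Query API.

```python
from typing import Callable, Iterator, Optional

# Hypothetical names; only the FROM#<sender> key format comes from the post.
TABLE_NAME = "mail"
INDEX_NAME = "by-sender"
KEY_ATTR = "gsi_pk"


def sender_query(sender: str, start_key: Optional[dict] = None) -> dict:
    """Build keyword arguments for a low-level DynamoDB Query call."""
    kwargs = {
        "TableName": TABLE_NAME,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": KEY_ATTR},
        "ExpressionAttributeValues": {":pk": {"S": f"FROM#{sender}"}},
    }
    if start_key is not None:
        # Resume from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = start_key
    return kwargs


def iter_sender_items(query: Callable[..., dict], sender: str) -> Iterator[dict]:
    """Yield every item for one sender, following LastEvaluatedKey pages."""
    start_key = None
    while True:
        page = query(**sender_query(sender, start_key))
        yield from page.get("Items", [])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break
```

Passing the query function in as an argument keeps the sketch testable without AWS credentials. In a real run it would presumably be `boto3.client("dynamodb").query`.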

This pass corrected Pass 1 in ways that surprised me:

  • A name I thought belonged to a financial advisor was actually a cold prospector working from mailing lists. Same with a second "advisor." Neither has an actual account relationship with me.
  • A name I'd attributed to one person was actually a different person with a similar last name; the original was a guest speaker at the event the second person organizes.
  • A "city official" turned out to be a cultural arts center mailing list with the same first name.
  • A debt-collection notice I'd half assumed was real was paired with cold-prospecting emails from a sales rep at the same alleged creditor, which strongly suggests a scam aimed at a legacy email alias.

The lesson here: Pass 1's summaries were good as orientation but wrong on details. Pass 2's per-thread reading was where the actual truth lived. Summaries are no substitute for the close read.

Pass 3 — keyword sweep for known gaps

After Passes 1 and 2 I had a list of things I expected to find but hadn't. Eight gap-buckets: estate, retirement, medical, real-estate, and a few more.

Built a small keyword index over the table, pulled the matching items per bucket, trimmed to ~25-60 per bucket, fanned out 8 parallel subagents. Each subagent's job was to either resolve the gap or report "still missing, here's what I see."
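A keyword index of this kind could be sketched as below. The keyword lists, the item attribute names (`subject`, `body`, `date`), and the newest-first trimming rule are assumptions for illustration; the post names the buckets and the ~25-60 cap but not the terms or how items were trimmed.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists; the post names the buckets, not the terms.
BUCKETS = {
    "estate": ["probate", "executor", "beneficiary"],
    "retirement": ["401k", "ira", "pension"],
    "medical": ["appointment", "prescription", "lab results"],
}
MAX_PER_BUCKET = 60  # upper end of the ~25-60 range mentioned above


def build_index(items: list[dict]) -> dict[str, list[dict]]:
    """Map each bucket to the items whose subject or body mentions a keyword."""
    patterns = {
        name: re.compile(
            r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
        )
        for name, words in BUCKETS.items()
    }
    index: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        text = f"{item.get('subject', '')}\n{item.get('body', '')}"
        for name, pattern in patterns.items():
            if pattern.search(text):
                index[name].append(item)
    return dict(index)


def trim(index: dict[str, list[dict]], limit: int = MAX_PER_BUCKET) -> dict[str, list[dict]]:
    """Keep the newest `limit` items per bucket (ISO date strings sort as text)."""
    return {
        name: sorted(hits, key=lambda i: i.get("date", ""), reverse=True)[:limit]
        for name, hits in index.items()
    }
```

Each trimmed bucket would then become the input for one of the 8 subagents. Short terms like `ira` match on word boundaries only, but they will still produce false positives that the subagent has to filter out.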

Of the 8 gaps, 5 resolved (one with the answer "there is no advisor, he's fully self-directed"). 3 remained unresolvable from email alone — they would need either a PDF body extraction pass or broader keyword terms.

What I'd do differently

A fourth pass on attachments. A lot of substantive info (tax forms, medical records) lives in PDFs that I never opened. The text body of those emails is just "your document is attached." Without extracting or OCRing those attachments, the extraction is blind to a real chunk of the inbox's content.

Also, the subagent fan-out pattern worked so well I want to formalize it as a tool. Right now it's bespoke per-job glue code. The core pattern — split a corpus into batches, assign one subagent per batch with a clear extraction schema, let them each write to a shared memory store, then audit the results — is generic. A small "fan-out runner" library would make this trivially reusable for other corpora (Slack history, call transcripts, document folders).
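One possible shape for such a runner is sketched below. Everything here is hypothetical: `MemoryStore` is an in-memory stand-in for the real memory store, `worker` is a placeholder for whatever launches a subagent, and the `Report` fields are just one guess at what an audit step would want to see.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from threading import Lock
from typing import Callable, Sequence


@dataclass
class MemoryStore:
    """In-memory stand-in for the shared memory store (hypothetical)."""
    entries: dict = field(default_factory=dict)
    _lock: Lock = field(default_factory=Lock, repr=False, compare=False)

    def write(self, topic: str, fact: str) -> None:
        with self._lock:  # workers write concurrently
            self.entries[topic] = fact


@dataclass
class Report:
    batches: int
    written: int
    failures: list  # (batch index, error text) pairs for the audit step


def fan_out(corpus: Sequence, batch_size: int,
            worker: Callable[[Sequence, MemoryStore], None],
            store: MemoryStore, max_workers: int = 8) -> Report:
    """Split `corpus` into batches, run one worker per batch, summarize."""
    batches = [corpus[i:i + batch_size] for i in range(0, len(corpus), batch_size)]
    before = len(store.entries)
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, batch, store) for batch in batches]
        for n, future in enumerate(futures):
            try:
                future.result()
            except Exception as exc:  # one bad batch should not sink the run
                failures.append((n, repr(exc)))
    return Report(len(batches), len(store.entries) - before, failures)


def classify(batch, store):
    """Placeholder worker; a real one would hand the batch to a subagent."""
    for sender in batch:
        store.write(f"people/{sender}", "placeholder fact")


store = MemoryStore()
report = fan_out([f"s{i}" for i in range(120)], 50, classify, store)
print(report.batches, report.written, report.failures)
```

Recording failures per batch instead of raising keeps the other batches' results and gives the audit step a list of what to rerun.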

The other lesson: trust no single-pass summary. Use cheap summaries to find the threads worth reading, then actually read them.