Adding email to the Memex (and a Dynamo pagination gotcha)
A few hours after shipping the Memex over iMessage and Signal, I added email. The interesting part wasn't the feature — it was a DynamoDB pagination bug I hit while writing the email backfill.
The setup
I already had a DynamoDB mail table holding parsed email metadata
(from a mail interceptor Lambda I'd built earlier). To get email into
the Memex I needed a one-shot backfill scanner and a 5-minute poller
that queries new rows via the date-sorted GSI.
Result: 7,466 emails added on top of the existing 71k messages. Total 78,865 rows in the index.
The bug
Backfill kept blowing up partway through with:
ValidationException: provided starting key does not match the range key predicate
My loop looked roughly like this:
cursor = "2023-01-01T00:00:00Z"
last_evaluated_key = None
while True:
    kwargs = {
        "KeyConditionExpression": Key("gsi3pk").eq("MAIL") & Key("gsi3sk").gt(cursor),
        "IndexName": "gsi3-date",
    }
    if last_evaluated_key:
        kwargs["ExclusiveStartKey"] = last_evaluated_key
    resp = table.query(**kwargs)
    for item in resp["Items"]:
        process(item)
        cursor = item["gsi3sk"]  # <-- this is the trap
    last_evaluated_key = resp.get("LastEvaluatedKey")
    if not last_evaluated_key:
        break
Spot the issue? I was advancing cursor inside the loop while
DynamoDB was still paginating with ExclusiveStartKey. On the next
page, Dynamo validates that the start key satisfies the
KeyConditionExpression. By then the predicate has become
gsi3sk > <new-cursor>, but the start key still points at the last
item of the previous page, whose gsi3sk is not greater than
<new-cursor>. Validation fails and the whole pagination dies.
The fix
Hold the KeyConditionExpression fixed for an entire query session. Only advance the cursor after pagination completes:
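from boto3.dynamodb.conditions import Key

# `table` (a boto3 Table resource) and `process` are defined elsewhere.
cursor = "2023-01-01T00:00:00Z"
while True:
    # One query session: the predicate is built once and never changes
    # while this session is paginating.
    kwargs = {
        "KeyConditionExpression": Key("gsi3pk").eq("MAIL") & Key("gsi3sk").gt(cursor),
        "IndexName": "gsi3-date",
    }
    last_evaluated_key = None
    page_max_sk = cursor
    while True:
        if last_evaluated_key:
            kwargs["ExclusiveStartKey"] = last_evaluated_key
        resp = table.query(**kwargs)
        for item in resp["Items"]:
            process(item)
            # Track progress separately from the frozen predicate.
            page_max_sk = max(page_max_sk, item["gsi3sk"])
        last_evaluated_key = resp.get("LastEvaluatedKey")
        if not last_evaluated_key:
            break
    # The session returned nothing newer than the cursor: we are caught up.
    if page_max_sk == cursor:
        break
    cursor = page_max_sk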
Outer loop advances cursor between query sessions, inner loop paginates a single session with a frozen predicate. Done.
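One way to make the frozen predicate hard to violate is to hide pagination behind a small generator that takes the query arguments once per session. This is a minimal sketch, not the Memex code: query_all_pages is a hypothetical helper name, and it assumes only that table has a boto3-style query(**kwargs) method whose response carries Items and an optional LastEvaluatedKey.

```python
from typing import Any, Dict, Iterator


def query_all_pages(table: Any, **query_kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Yield every item of one query session, following LastEvaluatedKey.

    Hypothetical helper: the caller's arguments are copied once, so the
    key condition cannot change between pages of the same session.
    """
    kwargs = dict(query_kwargs)  # frozen copy for this session
    while True:
        resp = table.query(**kwargs)
        yield from resp.get("Items", [])
        last_evaluated_key = resp.get("LastEvaluatedKey")
        if not last_evaluated_key:
            return
        # Only the start key changes from page to page.
        kwargs["ExclusiveStartKey"] = last_evaluated_key
```

With something like this, the backfill's outer loop only has to remember the largest gsi3sk it saw and build a fresh KeyConditionExpression for the next session.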
Other things I learned that day
The mail table has two row patterns. INBOX#recipient for real
emails and a separate sentinel pattern for dedup tracking. Always
filter to the real rows or you'll embed nothing-rows with zero text.
Check the actual GSI names. I had a memory note saying my GSIs were
named gsi1/gsi2/gsi3. They were actually named gsi1-thread,
gsi2-from, gsi3-date, so the note saved nobody from a typo. Updated
the note.
WAL files are sneaky. Before copying a SQLite DB across hosts, run
PRAGMA wal_checkpoint(TRUNCATE). Otherwise the .db alone may
reference pages still living in the WAL that didn't come along, and
you'll get database disk image is malformed on the other side.
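The WAL step can be scripted with Python's standard sqlite3 module. This is a rough sketch under assumptions: checkpoint_and_copy is a hypothetical helper name, the copy here is a local file copy standing in for whatever cross-host transfer is actually used, and nothing else is writing to the database between the checkpoint and the copy.

```python
import shutil
import sqlite3


def checkpoint_and_copy(src_path: str, dst_path: str) -> None:
    """Flush the WAL into the main database file, then copy that file."""
    conn = sqlite3.connect(src_path)
    try:
        # The pragma returns one row: (busy, wal_frames, checkpointed_frames).
        busy, _, _ = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        if busy:
            # Another connection blocked the checkpoint; the .db file
            # alone may still be missing pages.
            raise RuntimeError("checkpoint blocked by another connection")
    finally:
        conn.close()
    shutil.copyfile(src_path, dst_path)
```

Checking the busy flag matters because a blocked checkpoint does not raise on its own; it just reports that it could not finish.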
What I'd do differently
I knew about ExclusiveStartKey-must-match-predicate, but I'd never been bitten by it because most of my Dynamo loops use the start key as the only progress signal. The moment you start mutating the key predicate inside a loop, this trap appears. I'd factor any "advance the predicate" logic outside the pagination loop on principle now.