The fallback that wasn't
I run a job-hunt pipeline that scrapes Hacker News's monthly "Who is hiring?" thread, scores each comment, and the high-scoring ones get an auto-generated cover letter sent to whatever email or ATS link the comment includes.
Each comment gets stored in DynamoDB twice. Once at scrape time, where I pull the raw text and keep it as raw_context. And again at apply time, where I re-pull the comment from HN's Firebase API to get the freshest body (HN comments get edited; ATSs and emails get added late; the scrape can be a week stale).
The two-source design exists because Firebase is the source of truth. raw_context is the cache. Or, in the comments of the code:
"""Re-pull the full posting text. For HN we know the comment id from
the URL; for Indeed we have the raw_context that was harvested at
ingest."""
Re-pull. Treat the cache as a fallback. Clean.
The bug
The apply pipeline ran tonight and skipped one of the 18 highest- scoring listings with the reason no full text available. That listing had 614 characters of raw_context stored — short, but well above the 60-character floor the apply logic enforces. So why did fetch_full_text return nothing?
Here's the function, before tonight's fix:
def fetch_full_text(listing):
url = listing.get("url", "")
src = listing.get("source", "")
if src == "hn-whoishiring" or "news.ycombinator.com" in url:
m = re.search(r"id=(\d+)", url)
if not m:
return listing.get("raw_context", "")
try:
with urllib.request.urlopen(HN_FIREBASE.format(m.group(1)), timeout=20) as r:
data = json.load(r)
text = data.get("text") or ""
# strip HTML, unescape, etc.
return text
except Exception:
return listing.get("raw_context", "")
return listing.get("raw_context", "")
Spot it? The raw_context fallback only fires in two cases:
- The URL has no
id=parameter (so Firebase can't be queried at all). - The Firebase call raises an exception.
There's a third case the function doesn't handle: Firebase returns a 200 OK with text being null or an empty string. The comment was deleted, or the user edited it down, or there's a Firebase quirk on some old posts. In any of those cases, text = data.get("text") or "" becomes "", the HTML-strip regex returns "", and the function returns "" to the caller — even though raw_context still has the 614 characters we cached at scrape time.
The label "fallback" lied. It wasn't a fallback. It was a "return-this-if-the-API-throws-an-exception" branch dressed up to look like a fallback.
The fix
One line:
return text if len(text) >= len(raw) else raw
Take whichever source has more content. If Firebase is genuinely the fresher copy, it'll be at least as long as raw_context 95% of the time and we keep using it. If Firebase came back empty or shorter, raw_context wins. The caller never sees a result shorter than what we already had cached.
The full diff is six lines, including hoisting raw_context to a local at the top of the function. That's it.
The lesson
A "fallback" labeled fallback isn't a fallback if you don't actually compare. It's a substitute that fires only when the primary literally explodes. The class of failure that isn't an explosion — the empty-but-successful response, the 200 with a null body, the "yes I'm reachable, no I don't have anything" — slips right through.
The general pattern, when you have two sources and one is supposed to be richer:
- Don't prefer one over the other by source identity. Prefer the one that's actually richer this call.
- A length check is usually the dumbest sufficient heuristic. Sometimes a non-empty check is enough. Sometimes you need smarter (compare hashes, compare timestamps, compare schemas).
- The point is you compare. The label "fallback" without a compare is a single point of failure with extra steps.
I had this same bug in a slightly different form three weeks ago in the resume renderer (preferred a hand-edited PDF over a generated one, until the hand-edited PDF was a week old and the generated one had new content). I didn't recognize it tonight as the same shape until I was writing this post. The lesson didn't generalize the first time. Writing it down is the second pass.