The fallback that wasn't — Chester Frazier

I run a job-hunt pipeline that scrapes Hacker News's monthly "Who is hiring?" thread, scores each comment, and the high-scoring ones get an auto-generated cover letter sent to whatever email or ATS link the comment includes.

Each comment gets stored in DynamoDB twice. Once at scrape time, where I pull the raw text and keep it as raw_context. And again at apply time, where I re-pull the comment from HN's Firebase API to get the freshest body (HN comments get edited; ATSs and emails get added late; the scrape can be a week stale).

The two-source design exists because Firebase is the source of truth. raw_context is the cache. Or, in the comments of the code:

"""Re-pull the full posting text. For HN we know the comment id from
the URL; for Indeed we have the raw_context that was harvested at
ingest."""

Re-pull. Treat the cache as a fallback. Clean.

The bug

The apply pipeline ran tonight and skipped one of the 18 highest- scoring listings with the reason no full text available. That listing had 614 characters of raw_context stored — short, but well above the 60-character floor the apply logic enforces. So why did fetch_full_text return nothing?

Here's the function, before tonight's fix:

def fetch_full_text(listing):
    url = listing.get("url", "")
    src = listing.get("source", "")
    if src == "hn-whoishiring" or "news.ycombinator.com" in url:
        m = re.search(r"id=(\d+)", url)
        if not m:
            return listing.get("raw_context", "")
        try:
            with urllib.request.urlopen(HN_FIREBASE.format(m.group(1)), timeout=20) as r:
                data = json.load(r)
            text = data.get("text") or ""
            # strip HTML, unescape, etc.
            return text
        except Exception:
            return listing.get("raw_context", "")
    return listing.get("raw_context", "")

Spot it? The raw_context fallback only fires in two cases:

The URL has no id= parameter (so Firebase can't be queried at all).
The Firebase call raises an exception.

There's a third case the function doesn't handle: Firebase returns a 200 OK with text being null or an empty string. The comment was deleted, or the user edited it down, or there's a Firebase quirk on some old posts. In any of those cases, text = data.get("text") or "" becomes "", the HTML-strip regex returns "", and the function returns "" to the caller — even though raw_context still has the 614 characters we cached at scrape time.

The label "fallback" lied. It wasn't a fallback. It was a "return-this-if-the-API-throws-an-exception" branch dressed up to look like a fallback.

The fix

One line:

return text if len(text) >= len(raw) else raw

Take whichever source has more content. If Firebase is genuinely the fresher copy, it'll be at least as long as raw_context 95% of the time and we keep using it. If Firebase came back empty or shorter, raw_context wins. The caller never sees a result shorter than what we already had cached.

The full diff is six lines, including hoisting raw_context to a local at the top of the function. That's it.

The lesson

A "fallback" labeled fallback isn't a fallback if you don't actually compare. It's a substitute that fires only when the primary literally explodes. The class of failure that isn't an explosion — the empty-but-successful response, the 200 with a null body, the "yes I'm reachable, no I don't have anything" — slips right through.

The general pattern, when you have two sources and one is supposed to be richer:

Don't prefer one over the other by source identity. Prefer the one that's actually richer this call.
A length check is usually the dumbest sufficient heuristic. Sometimes a non-empty check is enough. Sometimes you need smarter (compare hashes, compare timestamps, compare schemas).
The point is you compare. The label "fallback" without a compare is a single point of failure with extra steps.

I had this same bug in a slightly different form three weeks ago in the resume renderer (preferred a hand-edited PDF over a generated one, until the hand-edited PDF was a week old and the generated one had new content). I didn't recognize it tonight as the same shape until I was writing this post. The lesson didn't generalize the first time. Writing it down is the second pass.