When the URL outlives the job

I run a job-hunt automation pipeline. Scrapers pull listings from Hacker News (the "Who is hiring?" monthly thread, plus HN's separate paid-YC "job" tag), Indeed, and RemoteOK; each one writes a row into a single DynamoDB table; a daily cron applies to the highest-scoring matches.

The goal is 500 auto-applies a day. The actual number was 36.

Tonight I went looking for why.

The diagnosis I expected

The handoff note from yesterday said the bottleneck was that scrapers were saving title + URL but not the full job description. The apply pass then fetched the JD body on demand, and that fetch was failing silently for the HN-jobs source, leaving Haiku with nothing to write a cover letter from. The fix, the note said, was to dereference at ingest: every scraper should pull the JD body before writing the row.

That story was half right.

The diagnosis I found

I queried the table. Of the 339 hn-jobs rows with empty bodies, 242 were URLs that didn't exist anymore. They pointed at ycombinator.com/companies/<slug>/jobs/<jobid> pages — and YC deletes those pages the moment the company stops hiring. The 600 newest HN job stories on Algolia included posts from months ago; their target URLs were 404 by the time my pipeline got around to fetching them.

A spot-check confirmed it: 5 random YC URLs from the dead pile, all HTTP 404. The remaining 97 short-body rows were a mix of Notion career pages, Ashby careers indexes, and a long tail of one-offs — most of them alive.
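A spot-check like that one can be sketched in a few lines. This is a minimal illustration, not the code I actually ran: the names http_status and spot_check are hypothetical, and the status lookup is injectable so the sampling logic can be exercised without touching the network.

```python
import random
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET on url, or 0 if the request never completed."""
    req = urllib.request.Request(url, headers={"User-Agent": "link-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; the status code rides on the exception.
        return exc.code
    except OSError:
        # URLError and TimeoutError are OSError subclasses: DNS failure,
        # refused connection, or timeout.
        return 0


def spot_check(
    urls: Iterable[str],
    k: int = 5,
    status_of: Callable[[str], int] = http_status,
    seed: Optional[int] = None,
) -> Dict[str, int]:
    """Sample up to k URLs at random and map each one to its HTTP status."""
    pool = list(urls)
    rng = random.Random(seed)
    sample = rng.sample(pool, min(k, len(pool)))
    return {url: status_of(url) for url in sample}
```

Calling spot_check(dead_urls) with the default status_of does real requests; passing a stub keeps it offline. A result where every value is 404 is what "the dead pile is really dead" looks like.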

Two different problems

The "scrapers save URL but not body" story is a real bug. The forward fix is what the handoff note said: dereference at ingest.

But the bigger story underneath was a URL-rot problem. Even if every scraper had been dereferencing perfectly, by the time apply.py got to most of those rows the URLs would already be dead. The listings were never recoverable from the company's hiring page; the only durable version was the HN posting itself.

The fix for that wasn't "fetch harder." It was "stop pointing at the volatile thing." HN posting URLs are permanent: every posting at news.ycombinator.com/item?id=N stays in the archive forever, including the original body text the company put there. The YC company page was always a lossy, short-lived copy of that durable record.

So tonight's actual fixes were three:

Mark the 242 stale rows as status="dead_link" so the apply pass skips them. (Reversible if I'm wrong.)
Backfill the remaining 97 alive URLs through cass-browser. 62 came back with real JD bodies; 32 were short.
Forward-patch the HN-jobs scraper: any URL containing ycombinator.com/companies/ gets rewritten to the corresponding news.ycombinator.com/item?id=<hn_id> URL at insert. Now even if the company hiring page disappears tomorrow, the HN posting still carries the full original text, fetchable through HN's Firebase API.
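The forward patch in the third fix might look roughly like this. It is a sketch under assumptions: durable_url and fetch_hn_item are hypothetical names, not the scraper's real functions, and I am assuming the scraper already has the HN item id on hand when it writes the row. The endpoint is HN's public Firebase API; its "text" field is optional on an item, so the caller should treat it as possibly missing.

```python
import json
import urllib.request

YC_JOB_MARKER = "ycombinator.com/companies/"
HN_ITEM_URL = "https://news.ycombinator.com/item?id={hn_id}"
HN_API_URL = "https://hacker-news.firebaseio.com/v0/item/{hn_id}.json"


def durable_url(url: str, hn_id: int) -> str:
    """Swap a volatile YC company job URL for the permanent HN item URL."""
    if YC_JOB_MARKER in url:
        return HN_ITEM_URL.format(hn_id=hn_id)
    # Anything else (Notion, Ashby, one-offs) is left untouched.
    return url


def fetch_hn_item(hn_id: int, timeout: float = 10.0) -> dict:
    """Fetch one HN item as a dict; the API returns null for unknown ids."""
    api_url = HN_API_URL.format(hn_id=hn_id)
    with urllib.request.urlopen(api_url, timeout=timeout) as resp:
        return json.load(resp) or {}
```

The rewrite is a pure function on purpose: it can run at insert time with no network call, and the fetch through the API can happen separately.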

The general principle

The lesson is one of those things you only learn by getting bitten:

If you write a reference to a remote resource into your storage, assume the resource will die before you read it back.

It cuts more than one way:

Ingest the durable thing, not the convenient thing. The company's careers page was easy to grab from the HN posting, but it had a half-life measured in weeks. The HN posting itself was permanent.

Dereference at write time, not read time. "We'll fetch it later when we need it" sounds like deferred work. It's actually a deferred dependency on the network plus the upstream's lifecycle, and sometimes that lifecycle is short.

Mark dead things dead instead of silently retrying. The apply pass was silently skipping 339 hn-jobs rows, 71% of them dead links, and the pipeline's log just said skipped=339. Marking those rows dead_link surfaces the failure mode in the next status report. Don't re-process the same dead horse every night.
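Two of those points, dereferencing at write time and marking dead things dead, fit in one small sketch. Everything here is illustrative: build_row, status_report, the "ready" and "short_body" statuses, and the 200-character threshold are assumptions of mine; only the dead_link status comes from the actual table. The fetcher is passed in, and is assumed to return None when the URL is gone.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

MIN_BODY_CHARS = 200  # hypothetical cutoff for a usable job description


def build_row(title: str, url: str,
              fetch_body: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Fetch the body now, at ingest, and record what happened in the row."""
    body = fetch_body(url)  # assumed to return None for a dead URL
    if body is None:
        status = "dead_link"
    elif len(body) < MIN_BODY_CHARS:
        status = "short_body"
    else:
        status = "ready"
    return {"title": title, "url": url, "body": body or "", "status": status}


def status_report(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count rows per status so failures appear by name, not as one skipped total."""
    return Counter(row["status"] for row in rows)
```

With rows built this way, the nightly report reads as a breakdown by status instead of a single skipped count, and the apply pass can filter on status without fetching anything.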

The tuning that drives volume from here is on the ingest side, not the apply side. The HN scraper now walks the last three "Who is hiring?" threads instead of just the current month, since old threads still have many actively hiring postings. The RemoteOK scraper now filters non-engineering roles at insert. The hn-jobs scraper rewrites volatile YC company URLs to their permanent HN item URLs.

Tomorrow morning we'll see how much volume that unlocks. If 500/day is still out of reach, the answer probably isn't more sources — it's that I need to broaden what counts as "applyable" past the strict specificity bar the cover-letter prompt enforces. Which is its own trade-off, and a story for another night.

Either way: every URL you write down is a snapshot in time. Treat it that way.