Whitelist beats blacklist when the input variety is unbounded

I run a job-hunt automation pipeline. Three scrapers (Hacker News, Indeed, RemoteOK) each pull listings into a single table; a daily cron job applies to the highest-scoring matches. Each scraper has a title-level taxonomy filter that drops obviously non-engineering postings at ingest, so the apply pass doesn't waste tokens auto-replying to a "Luxury Massage Therapist" listing.

The filter, when I inherited it, was a blacklist. A tuple of reject tokens: sales, recruiter, nurse, driver, country director, the usual. New noise meant a new line. Over a couple of months it grew to about forty tokens. Still missing things.

The actual ratio

A quick query against tonight's pool: 102 RemoteOK rows in bad_fit, 101 still sitting at status=new. Sampling the new ones turned up "Entrenador a Deportivo" (Spanish for sports trainer), "Investigative Journalist", "Chief Operating Officer", "Game Tester", "Data Annotator". None of those tokens were in the blacklist. The fix — the obvious fix — would be to add more tokens. Twenty more. Then twenty more after that, when next month's pollution surfaces a new genre.

That's the wrong shape of fix.

Why blacklists rot

When you're filtering an input source whose vocabulary is bounded, a blacklist is fine. "Reject statuses I haven't implemented yet" against a state enum with eight values is easy: you list the seven you don't want, you cover the space, and you move on.

When you're filtering an input source whose vocabulary is unbounded, like the title of a job posting, where any company on the internet can write whatever it wants, a blacklist is a treadmill. Each genre of pollution gets caught only after someone runs into it. The filter is structurally a step behind the noise.

A whitelist inverts the responsibility. Instead of "tell me what to reject," it's "tell me what to accept." The token vocabulary you actually want — engineer, developer, sre, devops, platform, infrastructure, backend, frontend, full stack, founding, staff, principal — is small, stable, and almost exclusively yours to choose. Pollution doesn't need to be named.
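Here's a minimal sketch of that difference, not the pipeline's actual code. The token tuples are the ones named in this post (the real blacklist had about forty entries); the function names and the choice to match tokens as whole words, case-insensitively, are my assumptions for illustration.

```python
import re

# Reject tokens named in the post; the real list was about forty entries long.
BLACKLIST = ("sales", "recruiter", "nurse", "driver", "country director")

# Allow tokens named in the post.
WHITELIST = (
    "engineer", "developer", "sre", "devops", "platform", "infrastructure",
    "backend", "frontend", "full stack", "founding", "staff", "principal",
)


def compile_tokens(tokens):
    """Build one case-insensitive pattern that matches any token as a whole word or phrase.

    Whole-word matching is an assumption; substring matching would also work
    but lets short tokens like "sre" fire inside unrelated words.
    """
    alternation = "|".join(re.escape(token) for token in tokens)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


BLACKLIST_RE = compile_tokens(BLACKLIST)
WHITELIST_RE = compile_tokens(WHITELIST)


def blacklist_keeps(title):
    """A blacklist keeps anything it has not been told to reject."""
    return BLACKLIST_RE.search(title) is None


def whitelist_keeps(title):
    """A whitelist keeps only what it has been told to accept."""
    return WHITELIST_RE.search(title) is not None


# The titles sampled from the status=new rows.
SAMPLED = [
    "Entrenador a Deportivo",
    "Investigative Journalist",
    "Chief Operating Officer",
    "Game Tester",
    "Data Annotator",
]

for title in SAMPLED:
    print(f"{title!r}: blacklist keeps={blacklist_keeps(title)}, "
          f"whitelist keeps={whitelist_keeps(title)}")
```

With these token lists, all five sampled titles slip past the blacklist and all five are dropped by the whitelist, without a single new token being added.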

Hard checks

Belt and suspenders: I kept the blacklist for clearly bad strings that might collide with a legitimate match. "Sales engineer" still fires the blacklist's sales token even though the title contains engineer. That's intentional. I don't want to compose cover letters for sales-engineering roles: the comp model is different and the work is mostly demos.

The whitelist is the second stage: if a title passes the blacklist AND contains at least one allow-token, the row stays. Otherwise it drops. Tested against the existing pool: of the 101 RemoteOK rows still sitting at status=new, 100 failed the whitelist. One, a legitimate "Senior Vue Developer", passed. Keeping one row for every hundred dropped is roughly what I'd expect for a high-noise source.
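The two stages can be sketched as a single predicate. This is an illustration under assumptions, not the code that runs in the pipeline: keep_title and the helper names are hypothetical, the token tuples are the ones named in this post, and matching is whole-word and case-insensitive by my choice.

```python
import re

# Stage 1 tokens: hard rejects that win even when an allow-token is present.
REJECT_TOKENS = ("sales", "recruiter", "nurse", "driver", "country director")

# Stage 2 tokens: a title must contain at least one of these to stay.
ALLOW_TOKENS = (
    "engineer", "developer", "sre", "devops", "platform", "infrastructure",
    "backend", "frontend", "full stack", "founding", "staff", "principal",
)


def _pattern(tokens):
    """Compile tokens into one case-insensitive, whole-word alternation."""
    alternation = "|".join(re.escape(token) for token in tokens)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


_REJECT = _pattern(REJECT_TOKENS)
_ALLOW = _pattern(ALLOW_TOKENS)


def keep_title(title: str) -> bool:
    """Return True if the row should stay, False if it should drop."""
    # Stage 1: the blacklist runs first, so "Sales Engineer" is rejected
    # even though it contains the allow-token "engineer".
    if _REJECT.search(title):
        return False
    # Stage 2: anything that names no allow-token drops by default.
    return _ALLOW.search(title) is not None


if __name__ == "__main__":
    for title in ("Senior Vue Developer", "Sales Engineer", "Game Tester"):
        print(title, "->", "keep" if keep_title(title) else "drop")
```

Running the blacklist first is what makes the collision case behave: the order decides that a reject token outranks an allow token, not the other way around.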

When this generalizes

This applies anywhere upstream input is human-generated and unbounded. A blacklist there needs an ever-growing token list just to keep up. A whitelist needs only the legitimate vocabulary, which is usually small and stable.

Email rules. Issue-tracker auto-triage. Spam filtering on a noisy mailing list. Even firewall ACLs in some pattern languages, though that one's a topic for another night.

The thing I keep relearning: when a filter is slowly bleeding into incoherence, the problem isn't usually "the filter is too lax." It's that the filter is the wrong shape.