When the AI Follows Your Instructions Almost Exactly

Running a fully automated job application pipeline means debugging at scale. When something goes wrong, it usually shows up as a statistic before it shows up as a story.

Tonight that statistic was: 12% of listings processed by my apply pipeline were being marked bad_fit due to what the logs called "meta-analysis bleed." The implication was that Claude Haiku had started outputting reasoning or analysis instead of the cover letter itself. That's happened before — Haiku occasionally writes a paragraph analyzing the role before writing the actual letter, and a structural gate blocks those from sending.

Except when I pulled the actual stored outputs for each flagged listing, nearly all of them were perfectly good cover letters. Here's a real example, for a score-100 listing at Lila Sciences:

"I'm Chester Frazier, a senior IC engineer in Ocean Springs, MS, currently shipping Frogger — a multi-tenant AI platform auto-resolving 4,500+ IT tickets monthly across 50+ client tenants. I've built production systems solo for 10+ years: 14 shipped iOS apps, GuardSuite (10-product Windows SaaS), and Cass, my private LLM agent platform running unattended in production. I'm not actively unhappy in my current role — I'm reaching out because..."

That's a strong letter. The kind I'd actually send. It's specific, it's concrete, it hits the required structure. The QA gate rejected it on one rule:

if "my name is chester" not in cover.lower()[:400]:
    return False, "opener does not contain 'My name is Chester' in first 400 chars"

The model wrote "I'm Chester Frazier" instead of "My name is Chester Frazier."

The system prompt says to start with "My name is Chester Frazier." Haiku reads it, understands it, and produces a letter that opens with Chester's name, in Chester's voice, in the right structure — just with "I'm" instead of "My name is." A human reader wouldn't notice. The phrase-match gate did.

This is a form of LLM behavioral drift that's easy to miss: the model follows your instructions at the semantic level while deviating at the syntactic level. The intent is satisfied, the format is wrong.

The fix has two parts. First, a normalization step that runs before the gate:

# Normalize opener variants to "My name is Chester Frazier"
cover = re.sub(r"^I'm Chester Frazier", "My name is Chester Frazier", cover, flags=re.IGNORECASE)
cover = re.sub(r"^Chester Frazier\b", "My name is Chester Frazier", cover, flags=re.IGNORECASE)

This is safe. It only fires when the letter literally starts with one of these patterns — valid letters, just wrong phrasing. It doesn't touch meta-analysis outputs (which start with "I need to clarify..." or "I can write a strong letter here...") or CANNOT_WRITE sentinels.

Second, a stronger prompt instruction:

OUTPUT FORMAT: Output ONLY the cover letter body. Your response MUST begin with the exact phrase "My name is Chester Frazier". The first word of your response must be "My".

The first fix handles today's accumulated bad_fit listings. The second fix prevents new ones.

Thirteen listings rescued from bad_fit back to surfaced, including Lila Sciences and Canary Technologies (both scored 100), Reddit (80), Everforth (77), and Tavily (75). They'll hit tomorrow's morning apply cron with the normalization in place.

Three listings stayed as bad_fit — the ones where the content was actually wrong: Schema Labs (Haiku wrote a meta-analysis preamble), PlantingSpace (asked which of four roles to apply for), and LiveEO (correctly wrote CANNOT_WRITE for a Berlin-relocation requirement, but the sentinel was wrapped in backticks that defeated the string check — also fixed tonight).

The lesson isn't that Haiku is unreliable. It's that phrase-matching gates need to be as flexible as the format you're actually trying to enforce. If you're checking that a cover letter sounds like Chester, checking whether the letter opens with a specific exact string is a fragile proxy. The right gate is semantic ("does this letter introduce Chester by name, in first person, in the first sentence?"), not lexical ("does this exact 26-character string appear?").

I'll probably tighten the gate accordingly. For now, the normalization handles it.