The smoke test caught a real prod bug on the first run

ElevateGuard is one of the products I run on the side. It has a Windows agent that customers download from https://dl.elevateguard.tech/stable/elevateguard.exe, plus a version.json next to it telling the auto-updater what version this binary is and what its SHA256 is. The auto-updater downloads the binary and verifies the SHA matches before installing. Standard stuff.

The deploy script for the agent is a friendly bash file. It cross-compiles the binary for Windows, computes the SHA, writes a fresh version.json with that SHA, uploads both files to S3, invalidates CloudFront, prints "deployed successfully." Done.

The script has been running this way for about six months. There's a TODO sitting in my system that's been bugging me for most of that time:

Add pre-deploy URL smoke test to ElevateGuard release flow (prevent missing-binary incidents).

The "missing-binary incidents" part is the real driver. I'd seen at least one case where the deploy reported success but a customer auto-updater coughed up an integrity error a few hours later. I couldn't reproduce it, blamed CloudFront, moved on. The TODO sat there.

The shape of the fix

A smoke test for this kind of deploy isn't complicated. After the upload completes, you go ask the public URL what it's serving and verify it matches what you just shipped. The interesting design question is what to verify, not how.

For the agent surface, the answer is: download what version.json says to download, compute its SHA on the way in, and check that SHA against three things — what version.json claims, what the deploy script computed locally, and the expected version. If any of the three diverge, the deploy aborts before declaring success.

That covers a surprising number of failure modes:

Silent S3 upload error. Binary upload failed, version.json upload succeeded. The site serves a SHA mismatch.
Stale binary, fresh manifest. Someone re-ran a partial deploy that updated version.json without re-uploading the binary. Same failure mode.
Bucket policy regression. Public users get 403 even though the deploying CLI (with creds) gets 200. The smoke test hits the URL anonymously, so it sees the customer's view.
CloudFront serving stale content too long. The smoke test uses cache-busting query strings, so it bypasses CF cache for the verification — and retries for 60s if the origin isn't responding right yet.

I built it as a separate script (post-deploy-smoke-test.sh) so each of my four deploy scripts (agent, api, web, marketing) could call it with a target name. For the agent target, the deploy script also passes EXPECTED_VERSION and EXPECTED_SHA as env vars so the smoke test can verify what's currently served matches what was just built.

About 150 lines of bash. Took an hour.

The first run

Once I had it wired up, I ran the smoke test standalone against production — not because I was deploying, just to make sure the checks didn't false-positive on a healthy system.

It failed.

[agent] GET https://dl.elevateguard.tech/stable/version.json
  served version: 0.1.0
  served sha:     260ac8e3b48ef71085f2b23b73399bbea58ca7a9da4a8b192622318a236944df
  download_url:   https://dl.elevateguard.tech/stable/elevateguard.exe
[agent] downloading binary and verifying sha
  downloaded sha: 15b505aac9d248cc8ed861f5350937d1c7ae9e5ea3cd5abf724eecc293000ceb
  FAIL: downloaded binary sha 15b505aac9d2...
        != version.json sha 260ac8e3b48e...

The version.json was claiming a SHA the actual binary didn't have. My first reaction was, of course, that the smoke test was broken. CloudFront, presumably, was serving an old binary out of some edge cache while serving the fresh manifest. I went to verify that theory.

$ aws s3 cp s3://elevateguard-agent-dist/stable/version.json -
{"version": "0.1.0",
 "sha256": "260ac8e3b48ef71085f2b23b73399bbea58ca7a9da4a8b192622318a236944df",
 ...
 "updated_at": "2026-05-22T01:55:00Z"}

$ aws s3 cp s3://elevateguard-agent-dist/stable/elevateguard.exe - | sha256sum
15b505aac9d248cc8ed861f5350937d1c7ae9e5ea3cd5abf724eecc293000ceb

That's the S3 origin. No CloudFront involved. The version.json on the bucket has been claiming a SHA the binary on the same bucket doesn't have. For four days.

And there's no binary on the bucket — anywhere, in any prefix — that hashes to 260ac8e3.... Whoever (probably some half-completed deploy I ran and forgot about) wrote that version.json was asserting a SHA for a binary that doesn't exist.

The customer experience: auto-updater fetches version.json, sees "new version 0.1.0," downloads the binary, computes the SHA, mismatches, rejects the download, logs an integrity error. The customer doesn't see anything; their installed version just stops updating. Silent failure. The exact failure mode the TODO existed to prevent.

The lessons

I write smoke tests like this professionally. I know they're important. I had a TODO that explicitly identified this failure mode. I still didn't add the test for six months. And in that six months, the failure happened, and I never knew, because the deploy script said "deployed successfully" and nobody complained loudly enough.

The lesson isn't "add smoke tests." Of course add smoke tests. The lesson is sharper than that:

A deploy script that prints "success" without verifying the public surface is lying to you, and the lie compounds. I trusted the script. Customers trusted the auto-updater. Both of us were operating on outdated truth.

The first run of a guardrail is also a checkup. Adding a smoke test isn't just defense for the future — it's a free audit of the present. Anything currently broken that the test would have caught is already broken whether or not you add the test. Adding the test makes you find out.

TODO drift is dangerous in proportion to your trust in your own operational discipline. I trusted myself to "get to it." Six months later, the answer to "did I get to it" was "no, and production has been wrong for four days as a result." If your TODO references a known failure mode, treat it like an open incident.

The fix for the actual drift is straightforward: re-run the deploy. The new smoke test will gate it, so the next deploy either works correctly or fails noisily. The pattern is now in place to prevent the next instance of this from going silently to four days.

I should have written this test six months ago.