Cass-failover: the Route 53 trick I should have used years ago

I live in Ocean Springs, Mississippi. We're on the Gulf, an hour from Mobile and an hour from New Orleans. From June through November, the question isn't whether a hurricane is coming — it's how many. Katrina took down power on this coast for two to three weeks in some neighborhoods. Sally in 2020 was a long week.

My homelab — a Proxmox cluster in a closet — runs Cass, the personal AI assistant I use for everything from "what tickets did I touch today" to "remind me to buy ham for breakfast burritos." Cass is on a single container (CT 212) in that closet. When power goes down, Cass goes down. When Cass goes down, the iPhone shortcut I use to talk to her times out, the watchOS complication grays out, and every cron she runs stops cleanly until I come back.

For two years I've told myself I'd fix this. Hurricane season is in nine days. So this weekend I finally did.

The architecture

The goal: when the homelab goes dark, traffic to cass.cwfrazier.com flips to a cheap EC2 instance in us-east-1, which serves a cut-down Cass that can still answer chat questions from her DynamoDB memory. When the homelab comes back, traffic flips back.

The trick is Route 53 failover routing, which I'd somehow ignored for ten years.

Set up:

A t4g.nano in us-east-1 running uvicorn + Caddy + Let's Encrypt, pointed at the same FastAPI backend the homelab runs. Elastic IP pinned at 34.199.222.48.
Route 53 health check on https://cass.cwfrazier.com/healthz — the primary endpoint, served from the homelab through a Cloudflare tunnel.
Two A records for cass.cwfrazier.com:
  - Primary (failover routing policy: PRIMARY), pointing at the Cloudflare tunnel. Associated with the health check.
  - Secondary (failover routing policy: SECONDARY), pointing at the EC2 elastic IP. No health check.
When three consecutive health checks, run 30 seconds apart, fail, the primary is taken out of rotation and DNS resolvers (eventually, once their cached answers expire) get the secondary.
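
As a rough sketch, the setup above can be written down as the two payloads Route 53 expects: a health check config and a change batch holding the failover pair. The domain, Elastic IP, 30-second interval and three-failure threshold come from the post. The primary IP, the health check ID, the 60-second TTL and the SetIdentifier names are placeholders I made up for illustration, not the real values.

```python
import json

DOMAIN = "cass.cwfrazier.com."
SECONDARY_IP = "34.199.222.48"  # the Elastic IP named in the post
PRIMARY_IP = "192.0.2.10"       # placeholder (documentation range), not the real primary target
HEALTH_CHECK_ID = "00000000-0000-0000-0000-000000000000"  # placeholder ID


def health_check_config():
    """HTTPS check of /healthz every 30 s, unhealthy after 3 consecutive failures."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": DOMAIN.rstrip("."),
        "Port": 443,
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    }


def failover_record(role, ip, health_check_id=None, ttl=60):
    """Build one failover A record; role is 'PRIMARY' or 'SECONDARY'."""
    record = {
        "Name": DOMAIN,
        "Type": "A",
        "SetIdentifier": f"cass-{role.lower()}",  # assumed naming scheme
        "Failover": role,
        "TTL": ttl,  # assumed; a short TTL keeps resolvers from caching the dead answer for long
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def change_batch():
    """Primary carries the health check; secondary has none."""
    return {
        "Comment": "Failover pair for Cass",
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("PRIMARY", PRIMARY_IP, HEALTH_CHECK_ID)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("SECONDARY", SECONDARY_IP)},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(change_batch(), indent=2))
```

With boto3, these dicts would be handed to `create_health_check(CallerReference=..., HealthCheckConfig=health_check_config())` and `change_resource_record_sets(HostedZoneId=..., ChangeBatch=change_batch())` on a `route53` client. The hosted zone ID and caller reference are left out here because the post doesn't give them.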

That's it. The trick is the failover routing policy itself: DNS failover has been in Route 53 since 2013, the policy costs nothing beyond the health check, and it Just Works.

Why I waited so long

I waited because every time I'd think about hurricane prep for Cass, I'd anchor on the wrong problem. I'd think: "I need to put a full copy of the backend in the cloud, with a synced database, with cert-manager, with deploys-from-git, with…" And then I'd close the tab.

The actual problem is much smaller. The cloud copy of Cass doesn't have to be Cass. It has to be enough Cass that during a multi-week outage, my routines don't break and my data isn't lost. Specifically:

Chat works (because Cass on AWS reads the same DynamoDB tables).
The SOS pipeline works (because SOS already lives in Lambda).
Memory writes still go to DDB (already true).
The homelab-only tools — my Signal bridge, my Ring chime, my iMessage scraper — gracefully say "homelab unreachable, ask me later" instead of crashing.

What I don't need on AWS: the full tool set. The local-LAN stuff. The half-dozen home-automation hooks. Those go down when the lab goes down, and that's correct — they have no value during a power outage because the things they control are also offline.

The drill

After standing it up I ran the drill: ssh to the homelab, invert the healthz response. Watch Route 53. Watch my phone.
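
One way to make "invert the healthz response" a one-line operation is a drill flag: while a marker file exists, the health endpoint reports failure. This is a sketch under my own assumptions. The flag path and the response bodies are hypothetical, and the post doesn't say how the real endpoint is inverted.

```python
import os
import tempfile

# Hypothetical flag file: `touch` it to start a drill, `rm` it to end one.
DRILL_FLAG = os.path.join(tempfile.gettempdir(), "cass-drill")


def healthz(flag_path=DRILL_FLAG):
    """Return (status_code, body). Reports 503 for as long as the drill flag exists."""
    if os.path.exists(flag_path):
        return 503, {"status": "drill: reporting unhealthy"}
    return 200, {"status": "ok"}


if __name__ == "__main__":
    print(healthz())
```

In a FastAPI route, the tuple would map onto a `JSONResponse(content=body, status_code=code)`. Route 53 treats any non-2xx/3xx status as a failed check, so the 503 is enough to trip the failover.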

Cloudflare-side healthz started failing at 18:51Z. By 18:51:30Z, R53 had marked the primary unhealthy. By 18:52:18Z, my phone was talking to the EC2 nano. That's a 78-second outage on a "Katrina-class" power event. I'll take it.
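
The arithmetic behind that number, as a small sketch. It assumes "18:51Z" means 18:51:00Z exactly; the other two timestamps are the ones quoted above.

```python
from datetime import datetime


def at(hms):
    """Parse an HH:MM:SS timestamp (all times are UTC on the same day)."""
    return datetime.strptime(hms, "%H:%M:%S")


failing = at("18:51:00")        # healthz starts failing (assumed :00 seconds)
unhealthy = at("18:51:30")      # Route 53 marks the primary unhealthy
phone_on_ec2 = at("18:52:18")   # phone is talking to the EC2 nano

detection = (unhealthy - failing).total_seconds()
propagation = (phone_on_ec2 - unhealthy).total_seconds()
total = (phone_on_ec2 - failing).total_seconds()

print(f"detection:   {detection:.0f} s")
print(f"propagation: {propagation:.0f} s")
print(f"total:       {total:.0f} s")
```

Split this way, 30 seconds of the outage is Route 53 noticing and the remaining 48 seconds is DNS answers catching up on the client side.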

Restoring was even faster — un-invert the response, R53 saw three healthy beats in ~45 seconds, traffic flipped back.

What's next

Phase 2 (this week) is the adapter layer: every tool module that talks to a homelab-only resource gets wrapped in a @homelab_only decorator that probes Tailscale and short-circuits with a uniform "homelab unreachable" response when the lab is dark. That way Cass doesn't stack-trace on a missing socket; she just says "Signal isn't reachable right now — try again when the homelab is back."
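
A minimal sketch of what such a decorator could look like. The post only says it "probes Tailscale", so the probe here is an assumption: a short TCP connect to a made-up Tailscale address and port. The response shape and the `send_signal` tool are also hypothetical.

```python
import functools
import socket

# Hypothetical probe target: a homelab node's Tailscale IP and an open port.
HOMELAB_PROBE = ("100.64.0.1", 22)


def homelab_reachable(address=HOMELAB_PROBE, timeout=1.0):
    """Return True if a TCP connection to the homelab opens within the timeout."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections and unreachable hosts
        return False


def homelab_only(resource, probe=homelab_reachable):
    """Short-circuit a tool with a uniform reply when the homelab is dark."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not probe():
                return {
                    "ok": False,
                    "error": "homelab unreachable",
                    "message": f"{resource} isn't reachable right now. "
                               "Try again when the homelab is back.",
                }
            return func(*args, **kwargs)
        return wrapper
    return decorator


@homelab_only("Signal")
def send_signal(text):
    """Hypothetical tool that would talk to the Signal bridge on the LAN."""
    return {"ok": True, "sent": text}
```

Taking the probe as a parameter keeps the decorator testable without a network, and would let a slow probe be swapped for a cached one later. The probe runs on every call, so the one-second timeout is the worst-case delay each wrapped tool adds while the lab is down.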

Plus: an SMS-fallthrough path so texts to Cass during an outage still get a clean Haiku-tier reply from the nano. The Pinpoint inbound webhook already routes to a Lambda; rerouting to the nano during a homelab outage is a single environment variable.

Cost

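The t4g.nano is $3.07/mo on-demand or ~$2/mo on a 1-year RI. The EBS is $0.80/mo. The Route 53 health check is $0.50/mo. Call it $4/mo for those three. The one line item that is no longer free is the Elastic IP: since February 2024 AWS bills every public IPv4 address at $0.005/hr (about $3.65/mo), associated or not.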
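
The sum, spelled out as a quick sketch. The monthly figures are the ones quoted above. The $0.0042 hourly rate and the 730-hour month are my assumptions about where the $3.07 comes from.

```python
HOURS_PER_MONTH = 730          # assumed billing convention
T4G_NANO_HOURLY = 0.0042       # assumed us-east-1 on-demand rate, USD

# Monthly line items quoted in the post, in USD.
monthly = {
    "t4g.nano (on-demand)": 3.07,
    "EBS volume": 0.80,
    "Route 53 health check": 0.50,
}

nano_from_hourly = round(T4G_NANO_HOURLY * HOURS_PER_MONTH, 2)
total = round(sum(monthly.values()), 2)

for item, cost in monthly.items():
    print(f"{item:<24} ${cost:.2f}")
print(f"{'total':<24} ${total:.2f}")
```

The three priced items come to $4.37, which is where the "call it $4" rounding comes from.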

I'd been mentally pricing this at "I'll need to spin up a real ECS Fargate service, it'll be $40/mo before I even put data on it." Wrong. The smallest version is the version that works.

What I should have done years ago

Spent 2 hours, paid $4/mo, slept better through hurricane season.

There's a general lesson here that I keep relearning: the over-engineered version of a solution costs more than the under-engineered one in both money and procrastination tax. The under-engineered one is the one that actually exists.