Hurricane failover for a homelab assistant without paying Bedrock prices
My homelab assistant lives on an LXC in the garage. Hurricane season runs June through November on the Gulf Coast. A two-week power loss takes my whole stack offline, which is exactly when I'd most want SMS access to it.
What was happening
The whole UI, chat surface, and tool layer ran on one Proxmox host. SMS and voice already flowed through AWS Pinpoint and Connect, so those would survive an outage; the web app, chat, and any "ask the assistant something" surface would not. I needed a way to keep at least a holding page and ideally a degraded chat path running when the homelab is dark, without paying a Bedrock bill the rest of the year.
What I found
Bedrock was the obvious first idea. It's also wildly expensive at real chat volumes: hundreds of dollars per day during heavy use. The whole point of running the assistant on my own subscription via the Claude Agent SDK is that the marginal cost per chat is roughly zero. A failover that costs more than the homelab when active is a failover I'll turn off.
The corrected plan: a tiny t4g.nano in us-east-1, running uvicorn + Caddy, holding an Elastic IP (EIP). Route 53 health-check failover between my home IP and the EIP. Deterministic ~$4.50/month standing cost regardless of usage. When the homelab comes back, Route 53 flips traffic back in under a minute (~45 seconds in my first drill).
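For a sense of how long a real failover could take, as opposed to a hand-triggered one, here is a rough back-of-envelope sketch. It assumes the health check uses Route 53's standard 30-second request interval and the default failure threshold of 3 (the actual settings are not shown in this post), plus the 60-second record TTL used in the records. The function name is made up for illustration.

```python
def worst_case_failover_seconds(request_interval: int, failure_threshold: int, ttl: int) -> int:
    """Rough upper bound: time to detect the outage plus time for cached DNS answers to expire."""
    detection = request_interval * failure_threshold
    return detection + ttl


# Assumed settings: 30 s interval, threshold of 3 failed checks, 60 s TTL.
estimate = worst_case_failover_seconds(request_interval=30, failure_threshold=3, ttl=60)
print(estimate)  # 150
```

Flipping the health check by hand skips the detection phase, so a drill can come in well under this bound.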
The fix
Two Route 53 failover A records for the same hostname. Only the primary carries the health check; the secondary is served whenever the primary is unhealthy:
# health check against the homelab /health endpoint
aws route53 create-health-check ... \
--health-check-config 'Type=HTTPS_STR_MATCH,...,SearchString="ok":true'
# primary points at the homelab, health-checked
# secondary is a plain A record for the EIP, no health check
aws route53 change-resource-record-sets ... <<JSON
{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "x.example.com", "Type": "A", "TTL": 60,
      "Failover": "PRIMARY",
      "HealthCheckId": "...",
      "SetIdentifier": "homelab",
      "ResourceRecords": [{"Value": "<home-ip>"}]
    }},
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "x.example.com", "Type": "A", "TTL": 60,
      "Failover": "SECONDARY",
      "SetIdentifier": "nano",
      "ResourceRecords": [{"Value": "<eip>"}]
    }}
  ]
}
JSON
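Because the health check does a literal string match, the /health body has to contain "ok":true byte for byte, with no space after the colon. A minimal sketch of building such a body, assuming a hypothetical `health_body` helper and a `surface` field like the one the drill checks for; the real endpoint may look different:

```python
import json

# Substring the Route 53 health check searches for (taken from the SearchString above).
SEARCH_STRING = '"ok":true'


def health_body(surface: str, ok: bool = True) -> str:
    """Serialize the health payload compactly so the string match sees no stray spaces."""
    return json.dumps({"ok": ok, "surface": surface}, separators=(",", ":"))


body = health_body("homelab")
print(body)                   # {"ok":true,"surface":"homelab"}
print(SEARCH_STRING in body)  # True

# The default json.dumps separators put a space after the colon, which would not match.
print(SEARCH_STRING in json.dumps({"ok": True}))  # False
```

Route 53 only searches the first 5,120 bytes of the response body, so keeping the payload this small also keeps the match reliable. It is worth checking what your framework actually emits before relying on the match.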
The nano runs the same FastAPI app, but tool calls that reach into the homelab (anything that needs the Synology, the Mac mini, or any local-only service) are wrapped in a failover_only.degraded("X") helper that raises a clean error instead of hanging. DynamoDB-backed surfaces (memory, todos, history) work as-is because the table is already in us-east-1.
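A minimal sketch of what such a helper might look like, not the actual implementation. It assumes the process learns which surface it is on from an environment variable (here a hypothetical `ASSISTANT_SURFACE`); the `DegradedError` class, the decorator form, and `read_nas_file` are likewise made up for illustration:

```python
import functools
import os


class DegradedError(RuntimeError):
    """Raised when a tool needs a homelab-only service while running on the failover host."""


def on_failover_host() -> bool:
    # Assumption: the nano sets ASSISTANT_SURFACE=nano in its service environment.
    return os.environ.get("ASSISTANT_SURFACE", "homelab") == "nano"


def degraded(service: str):
    """Decorator: fail fast with a clear message instead of waiting on an unreachable host."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            if on_failover_host():
                raise DegradedError(f"{service} is unavailable while running on the failover host")
            return func(*args, **kwargs)
        return inner
    return wrap


@degraded("Synology")
def read_nas_file(path: str) -> str:
    # Placeholder for a call that only works inside the homelab network.
    return f"contents of {path}"
```

On the homelab the wrapped function runs normally. On the nano it raises `DegradedError` immediately, which the API layer can turn into a short "not available right now" reply.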
Drill procedure (run before every hurricane season):
aws route53 update-health-check --health-check-id "$HC" --inverted
# ~30s later: traffic on nano. Verify /health returns surface=nano.
aws route53 update-health-check --health-check-id "$HC" --no-inverted
# ~30-45s later: traffic back on homelab.
First drill flipped in ~30s and reverted in ~45s. The homelab nginx was never touched in either direction.
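To put a number on those flip times, one option is to poll /health and time how long it takes for the reported surface to change. A sketch, assuming the endpoint returns JSON with a `surface` field (as the drill comment suggests); `fetch_surface` and `wait_for_surface` are hypothetical names:

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_surface(url: str) -> Optional[str]:
    """Return the 'surface' field from the health endpoint, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp).get("surface")
    except (OSError, ValueError, AttributeError):
        # Network errors, invalid JSON, or a non-object payload all count as "no answer".
        return None


def wait_for_surface(
    want: str,
    fetch: Callable[[], Optional[str]],
    timeout: float = 300.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until fetch() reports `want`; return elapsed seconds, or None on timeout."""
    start = clock()
    while clock() - start <= timeout:
        if fetch() == want:
            return clock() - start
        sleep(interval)
    return None
```

Calling `wait_for_surface("nano", lambda: fetch_surface("https://x.example.com/health"))` right after inverting the health check returns roughly how many seconds the flip took, to within the polling interval. The local resolver may cache the old answer for up to the TTL, so the measurement reflects what a client on that machine would see.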
What I'd do differently
I prototyped this on Lambda first. Lambda is great for a holding page, but the Claude Agent SDK needs a persistent process for its OAuth session, so the chat surface had to move to the nano anyway. If I'd thought about the OAuth lifecycle on day one I would have skipped the Lambda detour. Lesson: when picking a failover surface, the first question is "does the thing I want to keep running fit the execution model," not "what's the cheapest compute."