The CT-is-running-but-the-app-is-dead 502 pattern

FILE 0x3D · THE CT-IS-RUNNING-BUT-THE-APP-IS-DEAD 502 PATTERN

April 30, 2026 · proxmox, nginx, debugging, homelab

Recurring failure mode on the homelab: the assistant's domain returns 502 from every browser, but pct list says the LXC is running and systemctl status shows the unit as loaded. The service inside the container is dead. The container around it is fine.

What was happening

The shape is always the same:

curl https://x.example.com/health → 502 Bad Gateway
nginx error log on the proxy: connect() failed (111: Connection refused) to <lxc-ip>:8088
pct list on the Proxmox host: CT is running
pct exec <ctid> -- systemctl status <service>, run from the Proxmox host: dead, "code=exited, status=1/FAILURE"

The CT itself is alive: networking works, you can shell in, files are there. It's just that the uvicorn process inside has crashed and systemd hasn't restarted it (or can't). The cause might be an OOM kill that systemd never followed with a restart, an unhandled exception during startup, or a Restart=on-failure policy that already gave up after hitting its start limit.
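The checks above can be folded into one small classifier. This is a minimal sketch, not the author's tooling: the function name classify_502 and its output labels are hypothetical, and it assumes the three observations have already been collected as plain strings.

```shell
#!/bin/sh
# Hypothetical helper: classify an outage from three observations.
#   $1 = HTTP status seen by the client
#   $2 = container state ("running", "stopped", ...)
#   $3 = service state as printed by `systemctl is-active`
classify_502() {
    http_code=$1 ct_state=$2 svc_state=$3
    if [ "$http_code" != "502" ]; then
        echo "not-this-pattern"      # some other failure, or no failure
    elif [ "$ct_state" != "running" ]; then
        echo "container-down"        # the CT itself is the problem
    elif [ "$svc_state" != "active" ]; then
        echo "app-dead"              # CT fine, service inside is not
    else
        echo "look-elsewhere"        # CT and service fine: check the proxy
    fi
}

# Possible way to gather the inputs (placeholders as in the post):
#   curl -s -o /dev/null -w '%{http_code}' https://x.example.com/health
#   ssh root@<proxmox> "pct status <ctid>" | awk '{print $2}'
#   ssh root@<proxmox> "pct exec <ctid> -- systemctl is-active <service>"
```

The pattern this post describes is the "app-dead" branch: a 502 at the edge, a running container, and a service that is anything other than active.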

The adjacent symptom that fooled me the first few times: "notifications arrive but the chat UI is empty." That's the same root cause. APNs pushes are asynchronous and cached, so they reach the phone even with the upstream dead. The chat UI then tries to fetch /conversations/.../messages from the dead backend and gets the same 502.

What I found

There's a clear escalation ladder. Always try the smallest hammer first because the biggest one is genuinely slow (roughly nine minutes for a clean shutdown of two dozen containers).

# 1. Restart the service inside the container.
#    Fixes ~90% of cases.
ssh root@<proxmox> "pct exec <ctid> -- systemctl restart <service>"

# 2. Reboot the container.
#    Use when step 1 hangs or errors.
ssh root@<proxmox> "pct reboot <ctid>"

# 3. Reboot the Proxmox host itself.
#    Only when pveproxy is unreachable too or host load is
#    sustained >5, indicating broader thrashing.
ssh root@<proxmox> "systemctl reboot"
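The first two rungs can be scripted so that escalation only happens when the health check still fails. The following is a sketch under assumptions: the variables PVE_HOST, CTID, SERVICE, HEALTH_URL and SETTLE, and the function names, are all hypothetical, and the host reboot is deliberately left as a manual decision.

```shell
#!/bin/sh
# Seconds to wait after each action before probing health (assumed default).
SETTLE=${SETTLE:-10}

# Hypothetical wrappers around the commands from the ladder above.
restart_service() {
    ssh "root@${PVE_HOST:?set PVE_HOST}" \
        "pct exec ${CTID:?set CTID} -- systemctl restart ${SERVICE:?set SERVICE}"
}
reboot_ct() {
    ssh "root@${PVE_HOST:?set PVE_HOST}" "pct reboot ${CTID:?set CTID}"
}
healthy() {
    # -f turns HTTP errors such as 502 into a non-zero exit status.
    curl -fsS --max-time 5 "${HEALTH_URL:?set HEALTH_URL}" >/dev/null
}

climb_ladder() {
    # Rung 1: restart the service, then verify.
    if restart_service && sleep "$SETTLE" && healthy; then
        echo "fixed: service restart"
        return 0
    fi
    # Rung 2: reboot the container, then verify.
    if reboot_ct && sleep "$SETTLE" && healthy; then
        echo "fixed: container reboot"
        return 0
    fi
    # Rung 3 is never automated here.
    echo "still down: consider a host reboot by hand"
    return 1
}
```

Keeping each rung behind its own function makes it easy to swap in a different health probe or to dry-run the logic with stubs before pointing it at a real host.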

The fix

For this specific service, I added Restart=always and RestartSec=5s plus a reasonable StartLimitBurst so single-shot crashes self-heal without me noticing. The escalation ladder still applies for genuine wedges, but it gets exercised much less often.

# Start-limit settings belong in [Unit] on systemd 230 and later.
[Unit]
StartLimitBurst=10
StartLimitIntervalSec=600

[Service]
Restart=always
RestartSec=5s
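It is worth checking what these numbers mean for a service that dies on every start. The sketch below is a rough estimate only: the function name start_limit_eta is hypothetical, and it assumes the process exits almost immediately, so start attempts are spaced RestartSec apart.

```shell
#!/bin/sh
# Hypothetical helper: estimate when a crash loop trips the start limit.
#   $1 = StartLimitBurst
#   $2 = RestartSec in seconds
#   $3 = StartLimitIntervalSec in seconds
start_limit_eta() {
    burst=$1 restart_sec=$2 interval=$3
    # Approximate time of the first start attempt that gets refused.
    eta=$((burst * restart_sec))
    if [ "$eta" -lt "$interval" ]; then
        echo "gives up after ~${eta}s"
    else
        echo "never trips the limit"
    fi
}

start_limit_eta 10 5 600   # prints: gives up after ~50s
```

So a hard crash loop is abandoned in under a minute with these values, and the unit then stays failed, which is the "already gave up" state from earlier. A manual start inside the same interval may need systemctl reset-failed <service> first to clear the counter.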

What I'd do differently

For a long time I debugged each 502 as if it were a new problem: "Maybe it's nginx, maybe it's DNS, maybe the LXC died." Writing the pattern down in a runbook turned a ten-minute investigation into a one-minute reflex. The lesson is small: once a failure mode has shown up three times, the next instance will have the same shape, and the time spent documenting the ladder pays for itself at the next outage.