The CT-is-running-but-the-app-is-dead 502 pattern
Recurring failure mode on the homelab: the assistant's domain
returns 502 from every browser, but pct list says the LXC is
running and systemctl status says the service is loaded. The
service inside the container is dead. The container around it is
fine.
What was happening
The shape is always the same:
- curl https://x.example.com/health → 502 Bad Gateway
- nginx error log on the proxy: connect() failed (111: Connection refused) to <lxc-ip>:8088
- pct list on the Proxmox host: CT is running
- pct exec <id> -- systemctl status <service> on the host: dead, "code=exited, status=1/FAILURE"
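The checklist above can be collapsed into one verdict. This is a minimal sketch, not the script I actually run: the function names and the PROXMOX, CTID, SERVICE and URL variables are placeholders made up for illustration, and it assumes `pct status` prints a line of the form `status: running`.

```shell
#!/bin/sh
# Turn the three observations into a single verdict.
# Arguments: HTTP status code, CT status, systemd unit state.
classify_502() {
    http_code=$1 ct_status=$2 unit_state=$3
    if [ "$http_code" != "502" ]; then
        echo "not-this-pattern"
    elif [ "$ct_status" != "running" ]; then
        echo "container-down"
    elif [ "$unit_state" = "active" ]; then
        echo "service-up-look-elsewhere"
    else
        echo "service-dead-in-live-ct"
    fi
}

# Gather the observations and classify them.
# PROXMOX, CTID, SERVICE and URL are hypothetical placeholders.
check_502() {
    http_code=$(curl -s -o /dev/null -w '%{http_code}' "$URL")
    ct_status=$(ssh "root@$PROXMOX" "pct status $CTID" | awk '{print $2}')
    unit_state=$(ssh "root@$PROXMOX" "pct exec $CTID -- systemctl is-active $SERVICE")
    classify_502 "$http_code" "$ct_status" "$unit_state"
}
```

Only the "service-dead-in-live-ct" verdict is the pattern described here; anything else means the 502 has a different cause.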
The CT itself is alive: networking works, you can shell in, files
are there. It's just that the uvicorn process inside has crashed
and systemd hasn't restarted it (or can't). Possibly an OOM kill
that never gets attributed to the unit, possibly an unhandled
exception during startup, possibly a Restart=on-failure that
already gave up after too many retries.
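To tell those causes apart, systemd keeps the reason a unit last stopped in its Result property. The sketch below maps the values relevant here to the causes listed above; the function name and the CTID and SERVICE variables are placeholders, and the wording of each explanation is my own reading rather than anything systemd prints.

```shell
#!/bin/sh
# Translate a systemd Result value into a likely cause.
explain_result() {
    case "$1" in
        oom-kill)        echo "killed by the OOM killer" ;;
        start-limit-hit) echo "systemd gave up after too many restarts" ;;
        exit-code)       echo "process exited non-zero; read the journal" ;;
        signal)          echo "killed by a signal; check the kernel log for an OOM kill" ;;
        success)         echo "last run exited cleanly" ;;
        *)               echo "other result: $1" ;;
    esac
}

# Usage on the Proxmox host (CTID and SERVICE are placeholders):
#   explain_result "$(pct exec "$CTID" -- systemctl show -p Result --value "$SERVICE")"
#   pct exec "$CTID" -- journalctl -u "$SERVICE" -n 50 --no-pager
```

The journal tail is what actually shows a startup traceback; the Result value only narrows down where to look.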
The adjacent symptom that fooled me the first few times:
"notifications arrive but the chat UI is empty." That's the same
root cause: APNs pushes are async/cached, so they reach the phone
even with the upstream dead. The chat UI then tries to fetch
/conversations/.../messages from the dead backend and gets the
same 502.
What I found
There's a clear escalation ladder. Always try the smallest hammer first because the biggest one is genuinely slow (roughly nine minutes for a clean shutdown of two dozen containers).
# 1. Restart the service inside the container.
# Fixes ~90% of cases.
ssh root@<proxmox> "pct exec <ctid> -- systemctl restart <service>"
# 2. Reboot the container.
# Use when step 1 hangs or errors.
ssh root@<proxmox> "pct reboot <ctid>"
# 3. Reboot the Proxmox host itself.
# Only when pveproxy is unreachable too or host load is
# sustained >5, indicating broader thrashing.
ssh root@<proxmox> "systemctl reboot"
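The first two rungs are easy to script. The sketch below is one possible shape, under assumptions: it uses the health endpoint returning 200 as the "fixed" criterion (the ladder above escalates when a step hangs or errors, which this does not detect), and PROXMOX, CTID, SERVICE, URL and HEALTH_WAIT are placeholder variables. The host reboot is left out on purpose, since it is the slow step and deserves a human decision.

```shell
#!/bin/sh
# Run a command on the Proxmox host. PROXMOX is a placeholder.
run_on_host() { ssh "root@$PROXMOX" "$1"; }

# Succeeds only when the health endpoint answers 200. URL is a placeholder.
healthy() {
    [ "$(curl -s -o /dev/null -w '%{http_code}' "$URL")" = "200" ]
}

climb_ladder() {
    # Rung 1: restart the service inside the container.
    run_on_host "pct exec $CTID -- systemctl restart $SERVICE"
    sleep "${HEALTH_WAIT:-10}"
    if healthy; then echo "fixed: service restart"; return 0; fi

    # Rung 2: reboot the container.
    run_on_host "pct reboot $CTID"
    sleep "${HEALTH_WAIT:-10}"
    if healthy; then echo "fixed: container reboot"; return 0; fi

    # Rung 3 (host reboot) is deliberately not automated.
    echo "still failing: consider a host reboot by hand"
    return 1
}
```

The default ten-second wait is a guess; a container reboot may need longer before the service answers.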
The fix
For this specific service, I added Restart=always and
RestartSec=5s plus a reasonable StartLimitBurst so single-shot
crashes self-heal without me noticing. The escalation ladder still
applies for genuine wedges, but it gets exercised much less often.
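# systemd.unit(5) documents the StartLimit* keys under [Unit];
# placed under [Service] they may be ignored with only a warning.
[Unit]
StartLimitIntervalSec=600
StartLimitBurst=10

[Service]
Restart=always
RestartSec=5s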
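systemd skips keys it does not recognise in a section and only logs a warning, so it is worth confirming that the policy actually loaded. systemd.unit(5) documents the StartLimit* keys under [Unit], which makes their section the first thing to check if the burst value stays at its default. A minimal check, assuming the values above; the function name and SERVICE are placeholders.

```shell
#!/bin/sh
# Read KEY=VALUE property lines on stdin and confirm the restart
# policy from the unit file is what systemd actually loaded.
check_restart_policy() {
    props=$(cat)
    echo "$props" | grep -qx 'Restart=always' \
        || { echo "Restart policy not applied"; return 1; }
    echo "$props" | grep -qx 'StartLimitBurst=10' \
        || { echo "StartLimitBurst not applied"; return 1; }
    echo "policy loaded"
}

# Usage inside the container (SERVICE is a placeholder):
#   systemctl daemon-reload
#   systemctl show -p Restart -p StartLimitBurst "$SERVICE" | check_restart_policy
```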
What I'd do differently
For a long time I'd debug each 502 as if it were a new problem: "Maybe it's nginx, maybe it's DNS, maybe the LXC died." Writing down the pattern in a runbook turned a ten-minute investigation into a one-minute reflex. The lesson is small: when a failure mode has shown up three times, the next instance will have the same shape, and the time you spend documenting the ladder pays for itself at the next outage.