When the Proxmox dashboard disagrees with reality
My homelab status dashboard pulls container/VM state from the Proxmox API. One day it showed three containers as stopped that I was actively connected to. The dashboard wasn't wrong about the API response — the API itself was lying.
What was happening
The dashboard listed:
- A web dashboard container: shown as stopped — but it was serving the very page I was reading.
- A memory store container: shown as stopped — but a service on another host was talking to it in real time.
- A duplicate of a running container (same name, different ID): shown as stopped, never started.
So: two containers reported in the wrong state, plus one leftover from stale config. The first two were genuine misreports. The third was just an orphan I should have removed months earlier.
What I found
Proxmox's /api2/json/cluster/resources endpoint pulls state from pveproxy's cached view, which is updated by the cluster's gossip. When the host is heavily loaded, that gossip can fall behind by a noticeable margin — long enough that a running container shows as stopped if it didn't emit a heartbeat in the most recent window.
In my case the host's load average was sitting around 21 on 24 cores, close to 90% saturation: not red-line, but enough to delay the gossip path. The actual container state was correct (running pct status <id> returned running immediately), but the cluster/resources endpoint was serving a stale snapshot.
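The saturation figure is just the load average divided by the core count. A minimal sketch of that arithmetic, assuming a Unix-like host; `load_ratio` and `host_load_ratio` are hypothetical helper names for illustration, not part of any Proxmox tooling:

```python
import os


def load_ratio(load_avg: float, cores: int) -> float:
    """Return the load average as a fraction of available cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def host_load_ratio() -> float:
    """Ratio for the current host, using the 1-minute load average."""
    one_min, _five_min, _fifteen_min = os.getloadavg()
    # os.cpu_count() can return None; fall back to 1 in that case.
    return load_ratio(one_min, os.cpu_count() or 1)


if __name__ == "__main__":
    # The numbers from this incident: load ~21 on 24 cores.
    print(f"{load_ratio(21, 24):.1%}")  # prints 87.5%
```

On Linux the load average also counts tasks blocked in uninterruptible I/O, so a high ratio points at contention in general rather than CPU use alone.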
The orphan container was unrelated — I'd created a second container with the same name months earlier while testing, never started it, never removed it. It quietly hung around showing "stopped" because it had genuinely never been started.
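An orphan like this is easy to spot mechanically by looking for names that map to more than one VMID. A rough sketch, assuming the input is a list of dicts shaped like cluster/resources entries with "vmid" and "name" keys; `find_duplicate_names` is a hypothetical helper and the sample data is made up:

```python
from collections import defaultdict


def find_duplicate_names(resources):
    """Group guest VMIDs by name and return the names used more than once."""
    by_name = defaultdict(list)
    for entry in resources:
        # Entries without a vmid or name (e.g. storage, nodes) are skipped.
        if "vmid" not in entry or "name" not in entry:
            continue
        by_name[entry["name"]].append(entry["vmid"])
    return {
        name: sorted(vmids)
        for name, vmids in by_name.items()
        if len(vmids) > 1
    }


if __name__ == "__main__":
    sample = [  # made-up data for illustration
        {"vmid": 101, "name": "web-dash", "status": "running"},
        {"vmid": 117, "name": "web-dash", "status": "stopped"},
        {"vmid": 102, "name": "memstore", "status": "running"},
    ]
    print(find_duplicate_names(sample))  # prints {'web-dash': [101, 117]}
```

A duplicate name is only a hint. Whether the extra container is actually unused still needs checking by hand.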
The fix
Three pieces. For the stale data, query pct status <id> per container directly instead of trusting cluster/resources. It's slower, but it's authoritative:
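import subprocess

def real_container_state(vmid):
    """Return the state pct reports for one container, e.g. "running"."""
    r = subprocess.run(
        ["pct", "status", str(vmid)],
        # check=True raises CalledProcessError if pct exits non-zero
        capture_output=True, text=True, check=True,
    )
    # Output is "status: running" or "status: stopped".
    return r.stdout.strip().split(": ", 1)[1]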
For the orphan, I ran pct destroy <id> after confirming it was actually unused.
For the dashboard, I now fall back to pct status per VMID when cluster/resources says something's stopped but recent metrics suggest otherwise (e.g., it pushed a heartbeat to another monitoring system within the last 60 seconds). That handles the stale-cache case without running pct status for every container on every refresh.
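That fallback rule might look something like the sketch below. It is not the dashboard's actual code: `effective_state`, `pct_status`, and the `last_heartbeat` timestamp (a Unix time from whatever separate monitoring system is in use) are assumptions made for illustration. The direct check is passed in as a parameter so the decision logic can be exercised without a Proxmox host.

```python
import subprocess
import time

HEARTBEAT_WINDOW_S = 60  # "within the last 60 seconds"


def pct_status(vmid):
    """Ask pct directly; output looks like "status: running"."""
    r = subprocess.run(
        ["pct", "status", str(vmid)],
        capture_output=True, text=True, check=True,
    )
    return r.stdout.strip().split(": ", 1)[1]


def effective_state(vmid, cluster_status, last_heartbeat,
                    now=None, check=pct_status):
    """Return the state to display for one container.

    cluster_status is what cluster/resources reported. last_heartbeat is
    a Unix timestamp from a separate monitoring system (hypothetical
    here), or None if no heartbeat was ever seen.
    """
    now = time.time() if now is None else now
    # Only a "stopped" report is ever second-guessed.
    if cluster_status != "stopped":
        return cluster_status
    # No recent heartbeat: nothing contradicts "stopped".
    if last_heartbeat is None or now - last_heartbeat > HEARTBEAT_WINDOW_S:
        return cluster_status
    # A recent heartbeat contradicts the cache, so pay for the direct query.
    return check(vmid)
```

With this shape, a container that is genuinely stopped and silent never triggers a pct call, so the slow path runs only when two sources disagree.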
What I'd do differently
Treat aggregated cluster endpoints as advisory, not authoritative. They're great for "give me a fast overview of 50 things" but bad for "is this specific thing actually running right now." For status decisions that drive automated actions (restart, alert, page someone), always hit the per-resource endpoint. The extra latency is worth it.
Also: a host running at 90% load is going to surface in surprising places. The dashboard issue was a symptom; the underlying question — "why is this Proxmox host saturated?" — is the more interesting one to chase.