Draining a job queue before redeploying

FILE 0xDF·DRAINING A JOB QUEUE BEFORE REDEPLOYING

May 25, 2026 · homelab, deploy, operations

A long-running service that does work in the background needs a deploy strategy that doesn't drop work on the floor. This is the rule I landed on after restarting one too many times in the middle of useful jobs.

What was happening

The service in question runs background jobs that can take 10–15 minutes each. The old deploy was a simple "git pull, systemctl restart." Which meant: any job in flight when I deployed was killed mid-step. Some jobs were idempotent and just got picked back up on the next worker boot. Some weren't — they did partial work, left a half-written artifact, and the cleanup was manual.

The phrase that stuck with me: "if you're hot-deploying, you owe it to the in-flight requests to not eat them."

What I found

The deploy needs to do three things, in order, that the naive restart doesn't:

Stop accepting new jobs. Set a flag the worker checks before pulling from the queue. The HTTP intake keeps responding, just queues without processing.
Wait for in-flight to finish. With a cap. If a job has been running longer than the longest reasonable job duration, abort the deploy — that's a sign something is stuck and a deploy will make it worse.
Restart. Only after the worker pool is idle.

The cap matters. Without it, a deploy waiting for a hung job can wait forever, and you've turned a 30-second deploy into a 45-minute incident.

The fix

A pre-deploy health check the deploy script must pass:

#!/usr/bin/env bash
set -euo pipefail

HEALTH_URL="http://localhost:8000/health/jobs"
MAX_DRAIN_MINUTES=10
ABORT_IF_OLDER_THAN_MINUTES=5

# 1. Tell the service to stop accepting new work
curl -fsS -X POST "$HEALTH_URL/pause"

# 2. Check for stuck jobs that should abort the deploy
OLDEST_AGE=$(curl -fsS "$HEALTH_URL" | jq '.oldest_running_minutes // 0')
if (( OLDEST_AGE > ABORT_IF_OLDER_THAN_MINUTES )); then
  echo "Aborting deploy: job running for ${OLDEST_AGE}m, > ${ABORT_IF_OLDER_THAN_MINUTES}m cap"
  curl -fsS -X POST "$HEALTH_URL/resume"
  exit 1
fi

# 3. Wait for in-flight to drain, with a cap
DEADLINE=$(( $(date +%s) + MAX_DRAIN_MINUTES * 60 ))
while (( $(date +%s) < DEADLINE )); do
  IN_FLIGHT=$(curl -fsS "$HEALTH_URL" | jq '.running')
  [[ "$IN_FLIGHT" == "0" ]] && break
  sleep 10
done

if [[ "$IN_FLIGHT" != "0" ]]; then
  echo "Deploy aborted: ${IN_FLIGHT} job(s) still running after ${MAX_DRAIN_MINUTES}m"
  curl -fsS -X POST "$HEALTH_URL/resume"
  exit 1
fi

# 4. Now safe to restart
sudo systemctl restart my-service

The service exposes /health/jobs as JSON:

{
  "accepting": true,
  "running": 2,
  "queued": 14,
  "oldest_running_minutes": 3
}

And /health/jobs/pause flips accepting to false. The worker checks that flag before each pull.

Two implementation details worth calling out:

The pause is voluntary. If you ship the deploy and your worker pool ignores accepting, you've added complexity without safety. The worker has to actually check.
The abort-on-stuck branch is critical. Without it, the deploy script becomes the new failure mode: a stuck job pins the queue indefinitely and now you're learning at 11 PM that your deploys hang.

What I'd do differently

I'd add this to a new service on day one of accepting any work that takes longer than a request. The cost is small (one health endpoint, one shell loop) and the alternative is the slow accumulation of "deploy windows where you have to make sure no one's running anything important." That's not a strategy; that's hoping.

The other thing: write the resume-on-failure paths. If the pause succeeds but the drain wait times out, you have to remember to unpause. The bash above does it on the abort branches, but a half-broken deploy script that pauses and then crashes will leave your service quietly not processing work. Make resume idempotent and safe to call as a manual recovery step.