Disk full on the staging dir at 01:14 in the morning
I run a small S3-compatible proxy that fans writes across several cloud accounts to dodge per-account upload caps. Overnight, every job stalled at 0 B/s. The cause was more than a thousand 64 MB multipart-upload parts piled onto a 94 GB root filesystem.
What was happening
The proxy stages incoming multipart uploads as files on local disk while they assemble, then forwards each completed part to one of the backend accounts. The staging dir was on the root filesystem of the host, with no eviction:
/opt/proxy/data/parts/ ← 93 GB of unflushed multipart parts
At 01:14 CDT, the root FS hit 100%. Postgres (also on the host) panicked the moment it couldn't extend its WAL. All six rclone sync jobs stalled at 0 B/s: they were blocked on PUTs the proxy couldn't accept, because the proxy was blocked on write() calls the kernel was refusing.
What I found
Three independent things failed at once:
- No quota on the staging dir. I'd assumed multipart parts were ephemeral — accepted, forwarded, deleted within seconds. In a sync storm with many concurrent uploads, parts accumulate faster than they drain.
- Staging on the same volume as Postgres. A full disk took the database down too, which then prevented the proxy from recording completed multipart commits, which then prevented the proxy from deleting parts. Classic deadlock from sharing a volume.
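- No janitor. Stale multipart uploads (started but never completed) were never reaped. S3 itself keeps incomplete uploads until they are aborted; the common convention is a lifecycle rule that aborts them after about 7 days, and the proxy had no equivalent. Orphaned parts rows in the DB also accumulated.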
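For the first failure, here is a minimal sketch of what a staging budget could look like. It is not what the proxy actually runs: the names staged_bytes and can_accept_part and the budget_bytes parameter are hypothetical, and it assumes every staged part is an ordinary file somewhere under the staging dir.

```python
import os


def staged_bytes(staging_dir):
    """Sum the sizes of all files under staging_dir, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(staging_dir):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except FileNotFoundError:
                # The part was forwarded and deleted while we were scanning.
                pass
    return total


def can_accept_part(staging_dir, part_size, budget_bytes):
    """Return True only if staging this part keeps the directory within budget."""
    return staged_bytes(staging_dir) + part_size <= budget_bytes
```

When can_accept_part returns False, the proxy could reject the PUT with a retryable error (S3 uses 503 SlowDown for this) instead of accepting a part it has no room to stage. Walking the directory on every request is wasteful at scale; a running counter updated on write and delete would be the obvious next step.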
The fix
Quick: symlink the staging dir onto a much bigger ZFS pool that had a couple of terabytes of headroom. That unblocked everything within a minute.
systemctl stop proxy
mkdir -p /tank/proxy-parts
rsync -a /opt/proxy/data/parts/ /tank/proxy-parts/
mv /opt/proxy/data/parts /opt/proxy/data/parts.old
ln -s /tank/proxy-parts /opt/proxy/data/parts
systemctl start proxy
Then the cleanup: 33K successfully-committed objects (kept), 211 new objects from the sync that had been in flight (kept), 9 stale multipart uploads (dropped), 848 orphan parts rows (dropped).
Then the hygiene. Cut the per-job rclone transfer count from 10 to 5 so backpressure flows more predictably. Set the multipart chunk size to 32 MB, half the size of the 64 MB parts that had piled up, so each in-flight part pins less staging space. Deployed a janitor as an hourly systemd timer:
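# /usr/local/bin/proxy-janitor
import time

from db import pg  # the proxy's own Postgres helper module

ONE_DAY = 86_400  # seconds

def reap_stale_uploads():
    # Drop uploads with no activity in the last day.
    # Assumes last_activity_at holds epoch seconds, matching time.time().
    cutoff = time.time() - ONE_DAY
    pg.execute("""
        DELETE FROM multipart_uploads
        WHERE last_activity_at < %s
    """, (cutoff,))

def reap_orphan_parts():
    # Drop parts rows whose parent upload no longer exists.
    # NOT EXISTS stays correct even if an upload_id is ever NULL.
    pg.execute("""
        DELETE FROM parts p
        WHERE NOT EXISTS (
            SELECT 1 FROM multipart_uploads u
            WHERE u.upload_id = p.upload_id
        )
    """)

if __name__ == "__main__":
    # Order matters: reaping uploads first turns their parts into orphans.
    reap_stale_uploads()
    reap_orphan_parts()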
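The janitor as shown only removes database rows. If the part files on disk are not cleaned up elsewhere, a companion sweep could look roughly like this. It is a sketch under assumptions: sweep_stale_files is a hypothetical name, and it treats a file's mtime as its last activity, which may not match how the proxy actually tracks uploads.

```python
import os
import time


def sweep_stale_files(staging_dir, max_age_seconds, now=None):
    """Delete files under staging_dir whose mtime is older than max_age_seconds.

    Returns the number of bytes freed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    freed = 0
    for root, _dirs, files in os.walk(staging_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                info = os.lstat(path)
                if info.st_mtime < cutoff:
                    os.remove(path)
                    freed += info.st_size
            except FileNotFoundError:
                # Another process removed the file first; nothing left to do.
                continue
    return freed
```

Calling sweep_stale_files("/opt/proxy/data/parts", 86_400) from the same hourly timer would keep the disk and the database on the same one-day horizon. The now parameter exists only to make the function easy to test.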
What I'd do differently
The root cause was a quiet assumption — "multipart staging is transient" — that turned out to be load-dependent. It was true at two concurrent uploads. It was catastrophically false at ten. I'd add disk-usage alerting on staging dirs as a standing pattern, not just root-volume monitoring, and I'd put the staging dir on its own volume so a runaway staging directory can't take down the database it shares a host with.
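A staging-volume check along those lines can be very small. The sketch below is illustrative only: classify_usage and check_volume are hypothetical names, and the 70% and 90% thresholds are placeholders I picked for the example, not tuned values. Because shutil.disk_usage reports on the whole filesystem containing the path, the result only isolates staging once the staging dir lives on its own volume.

```python
import shutil


def classify_usage(used, total, warn=0.70, crit=0.90):
    """Map a used/total byte pair to 'ok', 'warn', or 'crit'."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = used / total
    if fraction >= crit:
        return "crit"
    if fraction >= warn:
        return "warn"
    return "ok"


def check_volume(path, warn=0.70, crit=0.90):
    """Classify the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total, warn, crit)
```

Plugging in the numbers from this incident, 93 GB used of 94 GB classifies as "crit", and it would have crossed "warn" well before 01:14 if anything had been watching.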
The other lesson: every "transient" data structure eventually becomes permanent if nothing is sweeping it. A janitor is cheap enough that there should be one for every workload that creates intermediate artifacts, even ones the happy path swears it deletes.