Disk full on the staging dir at 01:14 in the morning

May 21, 2026 · s3, postgres, rclone, incident

I run a small S3-compatible proxy that fans writes across several cloud accounts to dodge per-account upload caps. Overnight, every job stalled at 0 B/s. The cause was a thousand 64 MB multipart-upload parts piled onto a 94 GB root filesystem.

What was happening

The proxy stages incoming multipart uploads as files on local disk while they assemble, then forwards each completed part to one of the backend accounts. The staging dir was on the root filesystem of the host, with no eviction:

/opt/proxy/data/parts/   ←  93 GB of unflushed multipart parts

At 01:14 CDT, the root FS hit 100%. Postgres (also on the host) panicked the moment it couldn't extend its WAL. All six rclone sync jobs stalled at 0 B/s: they were waiting on PUTs the proxy couldn't accept, because the proxy's own write() calls were failing with ENOSPC.

What I found

Three independent things failed at once:

No quota on the staging dir. I'd assumed multipart parts were ephemeral — accepted, forwarded, deleted within seconds. In a sync storm with many concurrent uploads, parts accumulate faster than they drain.
Staging on the same volume as Postgres. A full disk took the database down too, which then prevented the proxy from recording completed multipart commits, which then prevented the proxy from deleting parts. Classic deadlock from sharing a volume.
No janitor. Stale multipart uploads (those started but not completed within 7 days, the cutoff commonly used in S3 lifecycle rules; S3 itself keeps incomplete uploads until they are aborted) were never reaped. Orphaned parts rows in the DB also accumulated.
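
The first gap could be closed with an admission check before a part is staged. The following is a minimal sketch, not the proxy's actual code: `staged_bytes`, `can_accept_part`, and the quota value are hypothetical names introduced here. A real proxy would more likely track usage incrementally than walk the directory on every request.

```python
import os

def staged_bytes(staging_dir: str) -> int:
    """Sum the sizes of all regular files under staging_dir."""
    total = 0
    for root, _dirs, files in os.walk(staging_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except FileNotFoundError:
                # A part was forwarded and deleted while we were walking.
                pass
    return total

def can_accept_part(staging_dir: str, part_size: int, quota_bytes: int) -> bool:
    """True if staging part_size more bytes keeps usage within quota_bytes."""
    return staged_bytes(staging_dir) + part_size <= quota_bytes
```

A PUT handler that gets False back could reply with an S3-style 503 so that clients such as rclone back off and retry, which turns a nearly full disk into backpressure instead of an outage.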

The fix

Quick: symlink the staging dir onto a much bigger ZFS pool that had a couple of terabytes of headroom. That unblocked everything within a minute.

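# Stop the proxy so nothing writes to the staging dir mid-copy.
systemctl stop proxy
mkdir -p /tank/proxy-parts
# Copy the existing parts onto the pool, preserving attributes.
rsync -a /opt/proxy/data/parts/ /tank/proxy-parts/
# Keep the original as parts.old; it still occupies root until it is removed.
mv /opt/proxy/data/parts /opt/proxy/data/parts.old
# Point the old path at the pool.
ln -s /tank/proxy-parts /opt/proxy/data/parts
systemctl start proxy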

Then the cleanup: 33K successfully committed objects (kept), 211 new objects from the sync that had been in flight (kept), 9 stale multipart uploads (dropped), 848 orphan parts rows (dropped).

Then the hygiene. Knocked the per-job rclone transfer count down from 10 to 5 so backpressure flows more predictably. Bumped the multipart chunk size to 32 MB so fewer parts pile up per upload. Deployed a janitor as an hourly systemd timer:

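# /usr/local/bin/proxy-janitor
import time
from db import pg

ONE_DAY = 86_400  # seconds

def reap_stale_uploads():
    # Drop uploads with no activity in the last day.
    # Assumes last_activity_at stores epoch seconds; a timestamptz column
    # would need to_timestamp(%s) in the comparison instead.
    cutoff = time.time() - ONE_DAY
    pg.execute("""
        DELETE FROM multipart_uploads
         WHERE last_activity_at < %s
    """, (cutoff,))

def reap_orphan_parts():
    # Drop parts rows whose parent upload is gone. NOT EXISTS avoids the
    # NOT IN pitfall where a single NULL upload_id makes the filter match nothing.
    pg.execute("""
        DELETE FROM parts p
         WHERE NOT EXISTS (
               SELECT 1 FROM multipart_uploads u
                WHERE u.upload_id = p.upload_id)
    """)

if __name__ == "__main__":
    # Uploads go first so the parts rows they leave behind are caught in the same run.
    reap_stale_uploads()
    reap_orphan_parts()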
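
The janitor above only touches database rows. A companion sweep for the staged files themselves might look like the sketch below. It assumes that part files live somewhere under the staging directory and that file modification time is a fair stand-in for last activity; the proxy's real layout may differ, and `sweep_stale_files` is a hypothetical name.

```python
import os
import time
from typing import List, Optional

def sweep_stale_files(staging_dir: str, max_age_seconds: float,
                      now: Optional[float] = None) -> List[str]:
    """Delete files under staging_dir older than max_age_seconds.

    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(staging_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if now - os.path.getmtime(path) > max_age_seconds:
                    os.remove(path)
                    removed.append(path)
            except FileNotFoundError:
                # The proxy deleted the part between listing and removal.
                pass
    return removed
```

Called with the same one-day cutoff as the database reaper, this would keep disk contents and the parts table from drifting apart.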

What I'd do differently

The root cause was a quiet assumption — "multipart staging is transient" — that turned out to be load-dependent. It was true at two concurrent uploads. It was catastrophically false at ten. I'd add disk-usage alerting on staging dirs as a standing pattern, not just root-volume monitoring, and I'd put the staging dir on its own volume so a runaway staging directory can't take down the database it shares a host with.
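
A staging-directory check along these lines could feed whatever alerting is already in place. This is only a sketch: the function names are hypothetical and the 80% threshold is an arbitrary placeholder, not a value from the incident. Note that `shutil.disk_usage` reports on the whole filesystem containing the path, so it only isolates staging once staging has its own volume.

```python
import shutil

def usage_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def should_alert(path: str, threshold: float = 0.80) -> bool:
    """True when the filesystem holding `path` is at or above `threshold` full."""
    return usage_fraction(path) >= threshold
```

Run from a timer against the staging path, this would have fired well before the volume reached 100%.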

The other lesson: every "transient" data structure eventually becomes permanent if nothing is sweeping it. A janitor is cheap enough that there should be one for every workload that creates intermediate artifacts, even ones the happy path swears it deletes.