Five cron jobs sharing one atomic-rename temp file
Five Python pollers, one per messaging source, all on * * * * *. Each one writes a shared cursors file to remember where it left off. The dashboard showed iMessage looping over the same four rows every minute. Two bugs in one file.
What was happening
I have a "memex" job that ingests messages from iMessage, Signal, email, Teams, and Messenger into a unified S3 index. Each source has its own cron-driven poller. All five write a single cursors.json so the rest of the system has one place to ask "where is each source up to?"
The pollers were written separately and converged on the same idiom:
def save_cursors(c):
tmp = "/opt/.../cursors.json.tmp"
with open(tmp, "w") as f:
json.dump(c, f)
os.replace(tmp, "/opt/.../cursors.json")
This is the classic write-temp-then-rename pattern. It's atomic with respect to the final filename. It is not atomic with respect to anything else.
The dashboard's iMessage cron-tail had been showing this:
poll ok: 4 new (cursor 89632 -> 89636)
poll ok: 4 new (cursor 89632 -> 89636)
poll ok: 4 new (cursor 89632 -> 89636)
Same four rows uploaded to S3 every minute. The cursor was apparently advancing inside the script but resetting before the next run.
What I found
Two failure modes, both triggered when two pollers fire in the same cron second:
-
Both writers race the rename. Poller A and Poller B both write
cursors.json.tmp. A renames first. B'sos.replaceagainst an already-replaced target sometimes hits aFileNotFoundErroron its own tmp because of the way cron'd processes happened to interleave. Loud, but harmless — you see the traceback and assume that's the bug. -
The silent clobber. Both pollers load
cursors.jsoninto memory at roughly the same time. Both mutate their own source's section. Both write the entire dict back. Whichever writes second wipes whichever wrote first. Symptom: the iMessage cursor is "saved" then immediately reset by the email poller writing back a stale snapshot. The four rows above 89632 get re-uploaded forever.
The second bug is the one that hurt because it produces no error. Just duplicate uploads and the wrong cursor.
The fix
A shared helper that takes an flock and re-reads the file inside the lock:
# _cursor_io.py
import fcntl, json, os, pathlib
CURSORS = pathlib.Path("/opt/.../cursors.json")
LOCK = pathlib.Path("/opt/.../.cursors.lock")
def update_cursor(mutate):
"""Mutate cursors.json under an exclusive flock.
`mutate(d)` should modify the dict in place. Only that caller's
section should be touched; other sources' sections are preserved.
"""
LOCK.touch(exist_ok=True)
with open(LOCK, "w") as lk:
fcntl.flock(lk, fcntl.LOCK_EX)
try:
d = json.loads(CURSORS.read_text()) if CURSORS.exists() else {}
mutate(d)
tmp = CURSORS.with_suffix(f".tmp.{os.getpid()}")
tmp.write_text(json.dumps(d, indent=2))
os.replace(tmp, CURSORS)
finally:
fcntl.flock(lk, fcntl.LOCK_UN)
Each poller migrates from "write the whole dict" to "update just my section":
# poll_imessage.py
from _cursor_io import update_cursor
def save_cursors(c):
update_cursor(lambda d: d.update({"imessage": c["imessage"]}))
Two properties matter here:
- The flock serializes the read-modify-write so two pollers can't both base their write on the same pre-state.
- The per-PID tmp filename means even simultaneous writes can't collide on the temp filesystem entry.
os.replaceis still atomic against the final filename, and the lock guarantees the order.
After the fix the iMessage line collapsed from the repeating 89632 -> 89636 to poll ok: 0 new (cursor=89636), which is what "nothing new since last tick" should look like.
What I'd do differently
The atomic-rename idiom is correct for the single-writer case and almost never correct for the multi-writer case. Every time I write .tmp followed by os.replace, I should ask: is anyone else writing to this same file? If yes, the rename doesn't save you. You need a lock around the read-modify-write, or you need to split the file so each writer owns a separate path.
The upload side dedupes on a native message ID, so the duplicate uploads from before the fix only cost bytes on S3 — tens of KB per day, not a real problem. That dedup almost certainly hid this bug for longer than it should have. A "0 new" log line with a non-advancing cursor would have screamed at me weeks earlier.