Hanging the API at scale: SQLite was the wrong default
A custom backup engine I wrote was happily writing tens of thousands of chunk rows to SQLite. Then I bumped concurrency from two backup jobs to ten, and the API process hung.
What was happening
The engine breaks each backup source into ~4 MB chunks, encrypts them, and packs ~16 chunks per 64 MB pack. Each pack write is something like:
- 16
INSERT INTO chunks(content-hash, pack id, offset, length) - 1
INSERT INTO packs - 1
UPDATE snapshots
Per job. With ten jobs running concurrently, the database was
seeing thousands of writes per second across many connections.
WAL mode helps with readers-vs-writers, but it still serializes
all writes — only one writer holds the write lock at a time. Every
other connection sits on SQLITE_BUSY until the timeout fires,
then retries.
The retry storm cascaded into the FastAPI process. The async event
loop was blocked behind synchronous SQLite calls (even via
aiosqlite, the underlying syscall is serial), and the API
endpoints couldn't service health checks. The server looked dead.
What I found
Two questions worth asking before reaching for "let's migrate the database":
- Is the workload actually concurrent writes? Yes. Every chunk is an independent insert, and there are many jobs running in parallel by design.
- Is SQLite's serialization the bottleneck? Strace and lock diagnostics both said yes. Single-writer was the wall.
Postgres handles the same workload in its sleep. Multi-writer is the default. The schema was simple enough to translate by hand.
The fix
Postgres 17 on the host. Same schema, with three small SQL changes:
BIGSERIALids instead ofINTEGER PRIMARY KEYINSERT OR IGNORE→INSERT ... ON CONFLICT DO NOTHING- Quote
"offset"everywhere because it's a reserved word in Postgres
Engine swap from aiosqlite to asyncpg via a tiny adapter that
translates ? placeholders to $1, $2, ... and reshapes the
tuple-vs-args call conventions. About fifty lines.
One-shot data migration copied existing rows over:
async def copy_table(pg, sqlite_path, table, columns):
async with aiosqlite.connect(sqlite_path) as src:
cur = await src.execute(f"SELECT {','.join(columns)} FROM {table}")
rows = await cur.fetchall()
if not rows:
return 0
placeholders = ",".join(f"${i+1}" for i in range(len(columns)))
await pg.executemany(
f"INSERT INTO {table} ({','.join(columns)}) VALUES ({placeholders}) "
f"ON CONFLICT DO NOTHING",
rows,
)
return len(rows)
Migrated cleanly: a few hundred packs, a hundred thousand chunks, a dozen snapshots, a hundred-plus jobs. asyncpg pool sized 4-40 connections, which is comfortable headroom for the engine's peak. The old SQLite file got left in place read-only for a rollback window.
What I'd do differently
I picked SQLite for the original prototype because it was zero config. That was right for the first week. I should have moved to Postgres the first time I scaled past two concurrent jobs, not the first time the API hung. The "switch the database" step is a one-evening job when the schema is small; it gets worse the longer you wait, and the failure mode is wedged production rather than a friendly error.
Standing rule for the next greenfield project that does write-heavy work: SQLite is fine for the prototype, but the asyncpg adapter goes in week two regardless of whether the workload "needs" it yet.