Hanging the API at scale: SQLite was the wrong default

FILE 0x67·HANGING THE API AT SCALE: SQLITE WAS THE WRONG DEFAULT

May 19, 2026 · sqlite, postgres, backup, async

A custom backup engine I wrote was happily writing tens of thousands of chunk rows to SQLite. Then I bumped concurrency from two backup jobs to ten, and the API process hung.

What was happening

The engine breaks each backup source into ~4 MB chunks, encrypts them, and packs ~16 chunks per 64 MB pack. Each pack write is something like:

16 INSERT INTO chunks (content-hash, pack id, offset, length)
1 INSERT INTO packs
1 UPDATE snapshots

Per job. With ten jobs running concurrently, the database was seeing thousands of writes per second across many connections. WAL mode helps with readers-vs-writers, but it still serializes all writes — only one writer holds the write lock at a time. Every other connection sits on SQLITE_BUSY until the timeout fires, then retries.

The retry storm cascaded into the FastAPI process. The async event loop was blocked behind synchronous SQLite calls (even via aiosqlite, the underlying syscall is serial), and the API endpoints couldn't service health checks. The server looked dead.

What I found

Two questions worth asking before reaching for "let's migrate the database":

Is the workload actually concurrent writes? Yes. Every chunk is an independent insert, and there are many jobs running in parallel by design.
Is SQLite's serialization the bottleneck? Strace and lock diagnostics both said yes. Single-writer was the wall.

Postgres handles the same workload in its sleep. Multi-writer is the default. The schema was simple enough to translate by hand.

The fix

Postgres 17 on the host. Same schema, with three small SQL changes:

BIGSERIAL ids instead of INTEGER PRIMARY KEY
INSERT OR IGNORE → INSERT ... ON CONFLICT DO NOTHING
Quote "offset" everywhere because it's a reserved word in Postgres

Engine swap from aiosqlite to asyncpg via a tiny adapter that translates ? placeholders to $1, $2, ... and reshapes the tuple-vs-args call conventions. About fifty lines.

One-shot data migration copied existing rows over:

async def copy_table(pg, sqlite_path, table, columns):
    async with aiosqlite.connect(sqlite_path) as src:
        cur = await src.execute(f"SELECT {','.join(columns)} FROM {table}")
        rows = await cur.fetchall()
    if not rows:
        return 0
    placeholders = ",".join(f"${i+1}" for i in range(len(columns)))
    await pg.executemany(
        f"INSERT INTO {table} ({','.join(columns)}) VALUES ({placeholders}) "
        f"ON CONFLICT DO NOTHING",
        rows,
    )
    return len(rows)

Migrated cleanly: a few hundred packs, a hundred thousand chunks, a dozen snapshots, a hundred-plus jobs. asyncpg pool sized 4-40 connections, which is comfortable headroom for the engine's peak. The old SQLite file got left in place read-only for a rollback window.

What I'd do differently

I picked SQLite for the original prototype because it was zero config. That was right for the first week. I should have moved to Postgres the first time I scaled past two concurrent jobs, not the first time the API hung. The "switch the database" step is a one-evening job when the schema is small; it gets worse the longer you wait, and the failure mode is wedged production rather than a friendly error.

Standing rule for the next greenfield project that does write-heavy work: SQLite is fine for the prototype, but the asyncpg adapter goes in week two regardless of whether the workload "needs" it yet.