Cutting a full-book TTS render down to one CLI
I have a habit of writing things — long-form, novel-shaped things — and then not being able to listen to them on a walk. The commercial-quality audiobook pipelines are gated behind narrators and contracts. I wanted local, fast, free, and good enough that my ears wouldn't bleed on a fourteen-chapter manuscript.
What was happening
The state of local neural TTS is genuinely good now. Kokoro produces narration that's a little flat compared to a human but miles better than the robotic SAPI voices I grew up with. The problem isn't quality, it's the pipeline glue.
Out of the box, neural TTS gives you "synthesize this paragraph" or maybe "synthesize this chapter." You still need to:
- split a manuscript into chapters
- decide on a voice
- run synthesis per chapter
- concatenate the chapter WAVs into a stable order
- transcode to MP3 with sensible bitrate for spoken-word
- wrap into M4B with chapter markers
- and ideally watch a progress bar so you know whether the next three hours are going to be productive or wasted
That was a directory full of half-broken scripts before I made it one CLI.
What I found
Three things matter for a usable render pipeline:
- Predictable chunking. You don't want to feed the model a whole chapter at once: it'll OOM on a long one and you have to start over. Chunk on natural boundaries (paragraphs, then sentences) up to a token budget per chunk. Within a chapter the chunks concatenate cleanly because they share a voice and seed.
- Resumable per-chunk state. Every chunk gets a stable hash based on its text + voice + seed. If the WAV for that hash already exists in the cache directory, skip it. So a crash in chapter 12 doesn't cost you chapters 1-11 again.
- Real-time progress against a clear denominator. "Chunk N of M, audio rendered: H:MM:SS, render time: H:MM:SS, realtime factor: X.X" is enough information to know if the run is healthy. My target is ~3.5x realtime on the Mac mini; anything under 2x means something thermal-throttled.
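A minimal sketch of the "paragraphs, then sentences" chunking described above. The function name `chunk_text`, the whitespace word count standing in for a real tokenizer, and the regex-based sentence split are all assumptions for illustration, not the tool's actual implementation.

```python
import re


def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def chunk_text(text: str, max_tokens: int = 420) -> list[str]:
    """Pack paragraphs into chunks of at most max_tokens, falling back to
    sentence boundaries for paragraphs that are too long on their own."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0

    def flush() -> None:
        nonlocal current, used
        if current:
            chunks.append(" ".join(current))
        current, used = [], 0

    for para in re.split(r"\n\s*\n", text):
        para = " ".join(para.split())  # collapse hard wraps inside a paragraph
        if not para:
            continue
        if approx_tokens(para) <= max_tokens:
            units = [para]
        else:
            # Naive sentence split; a single sentence over budget still
            # becomes one oversized chunk rather than being cut mid-sentence.
            units = re.split(r"(?<=[.!?])\s+", para)
        for unit in units:
            n = approx_tokens(unit)
            if current and used + n > max_tokens:
                flush()
            current.append(unit)
            used += n
    flush()
    return chunks
```

Because the split is deterministic, the same manuscript always yields the same chunks, which is what lets a per-chunk cache work across runs.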
The fix
The chapter-level orchestrator is the part that turned a pile of scripts into a tool. Stripped-down version:
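import time
from hashlib import sha256
from pathlib import Path

# chunk_paragraphs, synthesize, wav_duration, fmt_secs, ffmpeg_concat and
# ffmpeg_to_mp3 are helpers defined elsewhere in the tool.

def render_chapter(chapter_path: Path, voice: str, out_dir: Path) -> Path:
    text = chapter_path.read_text(encoding="utf-8")
    chunks = chunk_paragraphs(text, max_tokens=420)
    cache = out_dir / "cache" / chapter_path.stem
    cache.mkdir(parents=True, exist_ok=True)
    wavs = []
    started = time.monotonic()
    total_audio_s = 0.0
    for i, chunk in enumerate(chunks, start=1):
        # Cache key: voice + chunk text. The prose above also lists the seed;
        # this stripped-down version leaves it out.
        h = sha256(f"{voice}|{chunk}".encode()).hexdigest()[:16]
        wav = cache / f"{i:04d}-{h}.wav"
        if not wav.exists():  # rendered on an earlier run: skip
            synthesize(chunk, voice=voice, out=wav)
        audio_s = wav_duration(wav)
        total_audio_s += audio_s
        wavs.append(wav)
        # Cached chunks add audio but almost no elapsed time, so rtf reads
        # high on a resumed run.
        elapsed = time.monotonic() - started
        rtf = (total_audio_s / elapsed) if elapsed else 0
        print(
            f" chunk {i}/{len(chunks)}: "
            f"audio={fmt_secs(total_audio_s)} "
            f"render={fmt_secs(elapsed)} "
            f"rtf={rtf:.1f}x"
        )
    # Join the chunk WAVs in order, then encode the chapter as mono MP3.
    concat_wav = out_dir / f"{chapter_path.stem}.wav"
    ffmpeg_concat(wavs, concat_wav)
    mp3 = out_dir / f"{chapter_path.stem}.mp3"
    ffmpeg_to_mp3(concat_wav, mp3, bitrate="64k", channels=1)
    return mp3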
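One gap in a plain "skip if the file exists" check is that a crash mid-synthesis can leave a truncated WAV that later runs treat as finished. The sketch below shows one way to close it, and also folds the seed into the key as the prose describes. `chunk_key`, `synthesize_cached`, the `synth` callback and the `.part` suffix are hypothetical names, not part of the tool.

```python
import os
from hashlib import sha256
from pathlib import Path
from typing import Callable


def chunk_key(text: str, voice: str, seed: int) -> str:
    # Any change to text, voice, or seed yields a different key.
    return sha256(f"{voice}|{seed}|{text}".encode("utf-8")).hexdigest()[:16]


def synthesize_cached(
    text: str,
    voice: str,
    seed: int,
    cache_dir: Path,
    synth: Callable[[str, Path], None],
) -> tuple[Path, bool]:
    """Return (wav_path, was_cached). `synth` is whatever writes the audio."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    wav = cache_dir / f"{chunk_key(text, voice, seed)}.wav"
    if wav.exists():
        return wav, True
    # Write under a temporary name, then rename. os.replace is atomic on the
    # same filesystem, so the final name only ever points at a complete file.
    tmp = wav.with_name(wav.name + ".part")
    synth(text, tmp)
    os.replace(tmp, wav)
    return wav, False
```

If the synthesis backend picks its output format from the file extension, the temporary name would need to keep a `.wav` suffix (for example `name.part.wav`) instead.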
The book-level driver iterates chapters in order, then packages them into a single M4B with chapter markers from ffmetadata:
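import subprocess
from pathlib import Path

def build_m4b(mp3s: list[Path], book_meta: dict, out_path: Path):
    metadata = render_ffmetadata(mp3s, book_meta)
    cmd = [
        "ffmpeg", "-y",
        # Input 0 is a concat-demuxer list file, so ffmpeg has to be told to
        # read it as one; -safe 0 allows absolute paths inside the list.
        "-f", "concat", "-safe", "0", "-i", str(concat_list_file(mp3s)),
        # Input 1 is the ffmetadata file carrying tags and chapter markers.
        "-f", "ffmetadata", "-i", str(metadata),
        "-map", "0:a",
        "-map_metadata", "1",
        "-map_chapters", "1",
        "-codec:a", "aac", "-b:a", "64k",
        "-movflags", "+faststart",
        str(out_path),
    ]
    subprocess.run(cmd, check=True)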
64 kbps mono is the right starting bitrate for narration. 32 kbps sounds tinny on most earbuds; 96 kbps and up is wasted on speech. Stereo is similarly wasted: narration is one voice in one channel, so encoding it as stereo either doubles the file size or halves the bits each channel gets.
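The size trade-off is simple arithmetic: bitrate times duration, divided by eight. A quick sketch, using a hypothetical 10-hour book and ignoring container overhead:

```python
def audio_megabytes(hours: float, kbps: int) -> float:
    # bits per second * seconds / 8 = bytes; reported in decimal megabytes.
    return kbps * 1000 * hours * 3600 / 8 / 1_000_000


for kbps in (32, 64, 96):
    print(f"{kbps} kbps: {audio_megabytes(10, kbps):.0f} MB for a 10-hour book")
# 32 kbps: 144 MB for a 10-hour book
# 64 kbps: 288 MB for a 10-hour book
# 96 kbps: 432 MB for a 10-hour book
```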
What I'd do differently
The first version tried to be clever about parallelizing chunk synthesis across CPU cores. It didn't help much: the model is already saturating the GPU on the Mac mini, and multi-process synthesis just made the progress output unreadable. Single-threaded with good progress logging beat parallel-with-no-visibility on every dimension I cared about, including total wall time.
The other lesson, which I keep relearning across projects: any multi-hour batch job should print enough state that you can tell from across the room whether it's healthy. "rtf=3.5x" is more useful than a spinner because three weeks from now I'll remember what 3.5x means and I won't remember what spinner-state 4 means.
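As an illustration of that kind of status line, here is a small sketch that formats the counters and flags a slow run. `fmt_secs` and `status_line` are hypothetical stand-ins for the tool's own helpers, and the 2.0 warning threshold is taken from the "under 2x" rule of thumb above.

```python
def fmt_secs(seconds: float) -> str:
    # H:MM:SS, truncating fractional seconds.
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


def status_line(i: int, n: int, audio_s: float, elapsed_s: float,
                warn_below: float = 2.0) -> str:
    # rtf here means seconds of audio produced per second of wall time.
    rtf = audio_s / elapsed_s if elapsed_s > 0 else 0.0
    flag = "  <-- slow" if 0 < rtf < warn_below else ""
    return (f"chunk {i}/{n}: audio={fmt_secs(audio_s)} "
            f"render={fmt_secs(elapsed_s)} rtf={rtf:.1f}x{flag}")


print(status_line(3, 10, 420.0, 120.0))
# chunk 3/10: audio=0:07:00 render=0:02:00 rtf=3.5x
```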