Downsampling the way out of a CNN out-of-memory loop

FILE 0x99·DOWNSAMPLING THE WAY OUT OF A CNN OUT-OF-MEMORY LOOP

May 9, 2026 · python, computer-vision, homelab

Doorbell camera snapshots come in at 1536x1536. I wanted to run them through a CNN face detector to label visitors. The detector kept getting OOM-killed by the kernel before it could process a single frame.

What was happening

When I ran dlib's CNN face detector (the cnn_face_detection_model_v1 class, loaded with the mmod_human_face_detector.dat weights) on a 1536x1536 image, inference allocated more memory than the small box had available. The kernel OOM killer dropped the Python process mid-inference. No traceback, just a missing PID.

dmesg | tail -3
Out of memory: Killed process 14210 (python3) ...

The cost of CNN face detection scales with input resolution. The HOG detector is cheaper and works at native resolution, but my first test with HOG on a few frames returned 0 faces. That made sense, because the camera was triggering on cars and moving leaves, not people. I wanted the CNN detector specifically because it tolerates the angles a doorbell camera gives you better than HOG does.

What I found

I could either pay for more RAM in that LXC, or downsample before inference. Downsampling is the obvious choice, because face detection doesn't need full sensor resolution. A face that's 80 pixels wide in a frame scaled down to 720 pixels on its long side is still very detectable. The CNN model needs enough pixels per face to find features; below about 40 pixels per face it starts missing them. At 720 pixels, with a doorbell at typical doorway distance, faces sit somewhere in the 100–300px range, which is plenty.
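
To make the numbers above concrete, here is a small back-of-the-envelope sketch. The constant names are mine, and it assumes the 1536-pixel native frame and 720-pixel target from this post; the pixel-count ratio is only a rough proxy for how much less work the detector has to do, not a measured memory figure.

```python
NATIVE_SIDE = 1536   # native snapshot size (square)
TARGET_SIDE = 720    # long side after downsampling
MIN_FACE_PX = 40     # rough lower limit quoted above

scale = TARGET_SIDE / NATIVE_SIDE             # 0.46875
pixel_ratio = (NATIVE_SIDE / TARGET_SIDE) ** 2  # about 4.55x fewer pixels


def native_width(small_width, scale):
    """Width in native pixels of something `small_width` wide after scaling."""
    return small_width / scale


# A face must be at least this wide in the native frame to stay above
# the 40px limit once downsampled (about 85 native pixels).
min_native_face = native_width(MIN_FACE_PX, scale)

print(f"scale factor:        {scale}")
print(f"pixel reduction:     {pixel_ratio:.2f}x")
print(f"min native face:     {min_native_face:.1f}px")
print(f"100-300px at 720 is: {native_width(100, scale):.0f}-"
      f"{native_width(300, scale):.0f}px native")
```

So the 100–300px faces at the reduced size correspond to roughly 213–640px faces in the native frame, well clear of the point where downsampling would start to cost detections.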

The fix

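import cv2
import dlib

cnn = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

TARGET_LONG_SIDE = 720  # pixels on the longer edge after downsampling

def detect_faces(image_path):
    img = cv2.imread(image_path)  # 1536x1536, BGR channel order
    if img is None:
        raise FileNotFoundError(f"could not read image: {image_path}")
    h, w = img.shape[:2]
    # never upscale: frames already at or below the target keep their size
    scale = min(1.0, TARGET_LONG_SIDE / max(h, w))
    small = cv2.resize(img, (round(w * scale), round(h * scale)),
                       interpolation=cv2.INTER_AREA)
    # dlib expects RGB, OpenCV loads BGR
    small = cv2.cvtColor(small, cv2.COLOR_BGR2RGB)

    # second argument is dlib's upsample count (upsample once before detecting)
    detections = cnn(small, 1)

    # scale boxes back to original coordinates; round rather than truncate
    # so the boxes are not biased toward the top-left corner
    for d in detections:
        box = d.rect
        yield {
            "left":   round(box.left()   / scale),
            "top":    round(box.top()    / scale),
            "right":  round(box.right()  / scale),
            "bottom": round(box.bottom() / scale),
            "confidence": d.confidence,
        }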

Memory usage dropped to well within budget, and CPU inference went from getting OOM-killed to about 800ms per frame. I scale the detected bounding boxes back to native coordinates before passing them to the recognizer stage, which runs on the full-resolution crop.
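
For the full-resolution crop, one detail is worth guarding against: a detector box for a face near the border can extend past the frame edge, and NumPy reads a negative slice start as an index from the far side, which silently yields an empty or wrong crop. The helper below is a minimal sketch, not the recognizer code from this pipeline. crop_box is a hypothetical name, it takes the dict shape yielded by detect_faces, and it assumes right and bottom can be used as exclusive slice ends.

```python
import numpy as np


def crop_box(image, box):
    """Return the part of `image` covered by `box`, clamped to the image bounds.

    `box` is a dict with integer "left", "top", "right", "bottom" keys in
    native pixel coordinates. Right and bottom are treated as exclusive
    slice ends (an assumption of this sketch).
    """
    h, w = image.shape[:2]
    left = max(0, min(w, box["left"]))
    top = max(0, min(h, box["top"]))
    right = max(left, min(w, box["right"]))
    bottom = max(top, min(h, box["bottom"]))
    return image[top:bottom, left:right]


# A box that spills over the left and bottom edges of a 1536x1536 frame.
frame = np.zeros((1536, 1536, 3), dtype=np.uint8)
spilled = {"left": -20, "top": 10, "right": 100, "bottom": 2000}
print(crop_box(frame, spilled).shape)
```

A box that lies entirely outside the frame comes back as a zero-sized array, so the caller can check crop.size before handing it to the recognizer.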

What I'd do differently

Two things I'd build in earlier next time. First, a per-frame timing log, so I can spot an inference cost regression when I bump model versions. Second, the bounding-box rescaling step is the kind of place that quietly accumulates off-by-one errors. The check is to run a blank test image of known size through the pipeline's scaling and assert that a known box round-trips back to within a pixel of itself. It's a cheap test, and it catches a whole class of "the recognizer crop is just slightly wrong" bugs that would otherwise look like model quality issues.
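
Both ideas fit in a few lines. This is a sketch under my own assumptions rather than code from the running pipeline: timed, to_small, to_native and max_round_trip_error are hypothetical names, the timing sink is a plain list, and the round trip exercises only the coordinate arithmetic for every position along one axis instead of pushing pixels through the detector. It uses round() in both directions; with int() truncation the same 1536-to-720 round trip drifts further (coordinate 1000 comes back as 998).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink):
    """Append (label, elapsed milliseconds) to `sink` when the block exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.append((label, (time.perf_counter() - start) * 1000.0))


def to_small(v, scale):
    """Map a native coordinate into the downsampled frame."""
    return round(v * scale)


def to_native(v, scale):
    """Map a downsampled coordinate back into the native frame."""
    return round(v / scale)


def max_round_trip_error(native_side, target_side):
    """Worst drift, in native pixels, over every coordinate on one axis."""
    scale = min(1.0, target_side / native_side)
    return max(
        abs(to_native(to_small(x, scale), scale) - x)
        for x in range(native_side + 1)
    )


timings = []
with timed("round_trip_check", timings):
    worst = max_round_trip_error(1536, 720)

assert worst <= 1, f"box coordinates drift by {worst}px"
for label, ms in timings:
    print(f"{label}: {ms:.2f} ms")
```

In the real pipeline the same timed block would wrap the detector call for each frame, and the list would be replaced by whatever logging the rest of the system already uses.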