Downsampling the way out of a CNN out-of-memory loop
Doorbell camera snapshots come in at 1536x1536. I wanted to run them through a CNN face detector to label visitors. The detector kept getting OOM-killed by the kernel before it could process a single frame.
What was happening
Using dlib's CNN face detector (the cnn_face_detection_model_v1 model) on a 1536² image, the inference allocation blew past the small box's available RAM. The kernel OOM-killer would drop the Python process mid-inference. No traceback, just a missing PID.
dmesg | tail -3
Out of memory: Killed process 14210 (python3) ...
The CNN detector's memory and compute cost grow with the number of input pixels. The HOG detector is cheaper and works at native res, but my first test on a few frames with HOG returned 0 faces. That made sense, because the camera was triggering on cars and moving leaves, not people. I wanted CNN specifically because it tolerates the angles a doorbell camera gives you better than HOG does.
What I found
I could either pay for more RAM in that LXC, or downsample before inference. Downsampling is the obvious choice: face detection doesn't need full sensor resolution. A face that's 80 pixels wide in a frame downsampled to 720px on its long side is still very detectable. The CNN model wants enough pixels per face to find features; below about 40px per face it starts missing them. At 720px on a doorbell at typical doorway distance, faces sit somewhere in the 100–300px range, which is plenty.
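To put numbers on that trade-off, here is a small sketch of the arithmetic. The function name `downsample_plan` and its parameters are made up for illustration; the 720px target and the roughly 40px minimum face width are the figures from the paragraph above, not values read out of dlib.

```python
def downsample_plan(width, height, target_long_side=720, min_face_px=40):
    """Work out what a downsample to target_long_side costs and saves.

    min_face_px is the approximate per-face floor quoted in the post
    (an observed rule of thumb, not a dlib constant).
    """
    # Cap at 1.0 so frames that are already small are never upscaled.
    scale = min(1.0, target_long_side / max(width, height))
    new_w, new_h = int(width * scale), int(height * scale)
    # How many times fewer pixels the detector has to process.
    pixel_ratio = (width * height) / (new_w * new_h)
    # Smallest face width, in native pixels, that still clears the floor.
    min_native_face = min_face_px / scale
    return scale, (new_w, new_h), pixel_ratio, min_native_face


scale, size, ratio, min_face = downsample_plan(1536, 1536)
print(f"scale={scale:.3f}, size={size}, {ratio:.2f}x fewer pixels, "
      f"faces must be >= {min_face:.0f}px wide at native resolution")
```

For the 1536x1536 snapshots this works out to a 720x720 input with about 4.5 times fewer pixels, and a face needs to be roughly 85px wide in the original frame to stay above the 40px floor after scaling.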
The fix
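import cv2
import dlib

# Load the CNN (MMOD) face detector once at import time, not per frame.
cnn = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

TARGET_LONG_SIDE = 720  # pixels on the longer edge after downsampling

def detect_faces(image_path):
    img = cv2.imread(image_path)  # 1536x1536, BGR channel order
    if img is None:
        raise FileNotFoundError(f"could not read image: {image_path}")
    h, w = img.shape[:2]
    # Cap the factor at 1.0 so smaller frames are never upscaled.
    scale = min(1.0, TARGET_LONG_SIDE / max(h, w))
    small = cv2.resize(img, (int(w * scale), int(h * scale)),
                       interpolation=cv2.INTER_AREA)
    # dlib expects RGB; OpenCV loads images as BGR.
    rgb = cv2.cvtColor(small, cv2.COLOR_BGR2RGB)
    # Second argument: how many times dlib upsamples the image before detecting.
    detections = cnn(rgb, 1)
    # scale boxes back to original coordinates (round instead of truncating)
    for d in detections:
        box = d.rect
        yield {
            "left": round(box.left() / scale),
            "top": round(box.top() / scale),
            "right": round(box.right() / scale),
            "bottom": round(box.bottom() / scale),
            "confidence": d.confidence,
        }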
Memory usage dropped to well within budget, and CPU inference went from an OOM kill to about 800 ms per frame. I scale the detected bounding boxes back to native coords before passing them to the recognizer stage, which runs on the full-resolution crop.
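To keep an eye on numbers like these, a thin wrapper around the detector can log wall time and peak memory per frame. This is a minimal sketch, not what the pipeline actually runs: `timed_frames` and `fake_detect` are hypothetical names, and `fake_detect` only stands in for the real detector so the example runs without dlib or a model file. It relies on the standard-library `resource` module, so it assumes a Unix host.

```python
import logging
import resource
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("face_timing")


def timed_frames(detect, frame_paths):
    """Call detect(path) for each frame and log time and peak RSS."""
    timings = []
    for path in frame_paths:
        start = time.perf_counter()
        # list() drains a generator so the detection work is inside the timer.
        results = list(detect(path))
        elapsed_ms = (time.perf_counter() - start) * 1000
        # ru_maxrss is the process-wide peak so far, in kilobytes on Linux.
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        log.info("%s: %d faces, %.0f ms, peak RSS %d kB",
                 path, len(results), elapsed_ms, peak_kb)
        timings.append((path, len(results), elapsed_ms))
    return timings


def fake_detect(path):
    """Hypothetical stand-in for the real detector; yields one fixed box."""
    yield {"left": 0, "top": 0, "right": 10, "bottom": 10, "confidence": 1.0}


timings = timed_frames(fake_detect, ["frame_001.jpg", "frame_002.jpg"])
```

Because peak RSS only ever goes up over the life of the process, it is most useful for seeing how close a run gets to the container's limit, not for comparing individual frames.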
What I'd do differently
Two things I'd build in earlier next time. First, a per-frame timing log, so I can spot an inference cost regression when I bump model versions. Second, a test for the bounding-box rescaling step, which is the kind of place that quietly accumulates off-by-one errors: take a blank test image, push a known box through the same scale-down and scale-up arithmetic, and assert that it comes back to within a pixel of itself. It's a cheap test, and it catches a whole class of "the recognizer crop is just slightly wrong" bugs that would otherwise look like model quality issues.
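The round-trip check could look something like the sketch below. It tests only the coordinate arithmetic, with no image or detector involved; the helper names and the example box are invented for illustration, and it assumes coordinates are rounded to the nearest pixel in both directions.

```python
def native_to_small(box, scale):
    """Map a (left, top, right, bottom) box into downsampled coordinates."""
    return tuple(round(v * scale) for v in box)


def small_to_native(box, scale):
    """Map a box from downsampled coordinates back to native coordinates."""
    return tuple(round(v / scale) for v in box)


def max_round_trip_error(box, scale):
    """Largest per-coordinate drift, in native pixels, after a round trip."""
    restored = small_to_native(native_to_small(box, scale), scale)
    return max(abs(a - b) for a, b in zip(box, restored))


scale = 720 / 1536  # same factor the detector uses for 1536x1536 frames
known_box = (400, 300, 700, 650)  # arbitrary example box in native pixels
assert max_round_trip_error(known_box, scale) <= 1

# Sweep every coordinate value a 1536-pixel axis can take.
worst = max(max_round_trip_error((v, v, v, v), scale) for v in range(1537))
assert worst <= 1
```

Swapping `round` for `int` truncation in both helpers makes the sweep fail: coordinate 2 maps to 0 in the small image and comes back as 0, a two-pixel drift.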