Face recognition on the front door with an active-learning loop

May 9, 2026 · homelab, ml, ring

I wanted my front-door doorbell to do something more interesting than "someone moved." Specifically: "the dog walker is here," or "an unknown person is at the door, look now." Off-the-shelf face recognition on a doorbell either ships footage to a cloud service or requires a beefy GPU. I didn't want the first and didn't have the second.

What was happening

A doorbell ding-and-motion stream is mostly noise. Cars, leaves, the shadow of a tree, the mail carrier walking past on the sidewalk. The useful signal is a tiny minority of events. Without filtering, the notification stream becomes useless within a day — and I learned this the hard way the first time I shipped notifications without a throttle.

The pipeline I wanted: poll the doorbell, capture video for every event, extract a few frames, run face detection, and only push a notification when an unknown face shows up. Known faces get recorded silently and go in the log.

What I found

Running on a 2 GB LXC, the CNN-based face detector OOMs immediately. The HOG-based detector fits in memory but is meaningfully worse at side profiles and small faces. The way to make HOG work is to compensate on the reference side: more reference photos per person, captured under realistic conditions (backlit, side profile, hat on), and number_of_times_to_upsample=2 on the face-location call in the encoder, because active-learning crops are small (typically 250–500 px).

dlib doesn't have a wheel for Debian 12 + Python 3.11 by default, so pip tries to compile it from source and fails because cmake isn't in the LXC. The trick: pip install dlib-bin (a community-built prebuilt wheel, ~4 MB), then pip install --no-deps face-recognition face-recognition-models. Don't let face-recognition pull dlib as a dependency or you'll be back on the compile-from-source path. Because --no-deps skips every dependency, not just dlib, the remaining ones (numpy, Pillow, Click) have to be installed separately if they aren't already present.

I also lost an afternoon to a wrong API endpoint. The library's async_recording_download() hits an older endpoint that 404s on newer hardware. The fix was to fetch the signed URL with async_recording_url() and stream it down with aiohttp directly.
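A minimal sketch of that workaround, assuming the signed URL has already been fetched with async_recording_url(). The helper names stream_to_file and download_recording, the ".part" suffix, the chunk size, and the timeout are illustrative choices, not part of any library.

```python
from pathlib import Path


async def stream_to_file(session, url: str, dest: Path, chunk_size: int = 64 * 1024) -> int:
    """Stream `url` to `dest` in chunks and return the number of bytes written.

    `session` is expected to behave like an aiohttp.ClientSession.
    """
    # Write to a sibling ".part" file first so a partial download
    # never looks like a finished recording.
    tmp = dest.with_name(dest.name + ".part")
    written = 0
    async with session.get(url) as resp:
        resp.raise_for_status()  # a 404 raises here instead of saving an error body
        with tmp.open("wb") as fh:
            async for chunk in resp.content.iter_chunked(chunk_size):
                fh.write(chunk)
                written += len(chunk)
    tmp.replace(dest)  # rename into place once the whole body is on disk
    return written


async def download_recording(signed_url: str, dest: Path, timeout_s: float = 120.0) -> int:
    """Open a short-lived aiohttp session and download one recording."""
    # Imported lazily so stream_to_file can be exercised without the dependency.
    import aiohttp

    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await stream_to_file(session, signed_url, dest)
```

Streaming in chunks keeps memory flat on a small container, which matters more here than raw download speed.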

The fix

The pipeline, end to end:

  1. A cron'd poller asks the doorbell history endpoint for new events and keeps a per-device cursor in a JSON file.
  2. For each new event, download the recording, extract a frame at ~1s with ffmpeg, and write both to the NAS.
  3. A separate cron'd worker picks up pending rows, samples five frames (t = 2, 4, 7, 10, 14 s), and runs HOG face detection on each.
  4. For any face found, compare its encoding against the reference DB. Match if distance ≤ 0.55.
  5. Three outcomes: identified (record name + distance), unknown (crop the face, drop it in a _pending/ directory), or no_face (most events).
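
Steps 4 and 5 can be sketched with plain numpy, since the distance in question is the Euclidean distance between encoding vectors (which is what face_recognition.face_distance computes). The function names and the shape of the reference DB (a list of names alongside an array of encodings) are assumptions for illustration; only the 0.55 threshold and the three outcome labels come from the steps above.

```python
import numpy as np

MATCH_THRESHOLD = 0.55  # step 4: match if distance <= 0.55


def classify_face(encoding, ref_names, ref_encodings, threshold=MATCH_THRESHOLD):
    """Classify one face encoding against the reference DB.

    Returns ("identified", name, distance) or ("unknown", None, distance).
    """
    if len(ref_names) == 0:
        # Empty reference DB: everything is unknown and there is no distance to report.
        return ("unknown", None, None)
    refs = np.asarray(ref_encodings, dtype=float)
    dists = np.linalg.norm(refs - np.asarray(encoding, dtype=float), axis=1)
    best = int(np.argmin(dists))
    distance = float(dists[best])
    if distance <= threshold:
        return ("identified", ref_names[best], distance)
    return ("unknown", None, distance)


def classify_event(face_encodings, ref_names, ref_encodings, threshold=MATCH_THRESHOLD):
    """Classify every face found in an event's sampled frames.

    An event with no detected faces yields a single ("no_face", None, None) result.
    """
    if len(face_encodings) == 0:
        return [("no_face", None, None)]
    return [classify_face(e, ref_names, ref_encodings, threshold) for e in face_encodings]
```

Returning one result per face leaves the policy for mixed events (a known person standing next to a stranger) to the caller, which the steps above don't specify.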

The interesting part is step 5's "unknown" branch. Cropped unknown faces land in a labeling queue, served back through chat with the image inline:

# Surface the crop:
"Unknown face from <timestamp>. Who is this?  [image]"
# When I reply with a name, the crop is mv'd into faces/<name>/
# and the encoder regenerates the .encodings.npy cache file.
# DDB rows for that event flip from unknown back to pending and
# get re-processed, this time matching.

That's the active-learning loop. The model doesn't need retraining; the reference set grows organically from "real photos at my actual front door under actual lighting." After a few weeks the false-unknown rate drops to almost nothing.
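
The file-moving half of that loop might look like the sketch below. It assumes the encodings cache lives at faces/.encodings.npy and that deleting it is enough to make the encoder rebuild it on its next run; the function name and the name sanitization are illustrative. Flipping the event rows back to pending is not shown.

```python
import re
import shutil
from pathlib import Path

CACHE_NAME = ".encodings.npy"  # assumed to sit directly under the faces/ root


def apply_label(crop: Path, name: str, faces_root: Path) -> Path:
    """Move a labeled crop into faces/<name>/ and drop the stale encodings cache."""
    # Reduce the chat reply to a safe directory name so a label like
    # "../etc" cannot escape the faces/ root.
    safe = re.sub(r"[^a-z0-9_-]+", "_", name.strip().lower()).strip("_")
    if not safe:
        raise ValueError(f"unusable label: {name!r}")

    person_dir = faces_root / safe
    person_dir.mkdir(parents=True, exist_ok=True)
    target = person_dir / crop.name
    if target.exists():
        raise FileExistsError(target)
    shutil.move(str(crop), str(target))

    # Remove the cache so the next encoder run regenerates it with the new photo.
    (faces_root / CACHE_NAME).unlink(missing_ok=True)
    return target
```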

A failure that taught me something: HOG misread a pair of plaid pants as a face. That exposed the "contaminated reference encoding" failure mode: when a reference photo doesn't actually have a face in it, the system will still happily produce a meaningless distance number against whatever pixels were there. The fix was twofold: add a "_rejected" bucket for confirmed false positives so they don't keep poisoning the pending queue, and verify that a reference photo contains a face before generating its encoding.
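
Both guards can be sketched without tying the code to a specific detector. Here locate and encode are injected callables; with the face_recognition package they could wrap face_locations(image, number_of_times_to_upsample=2, model="hog") and face_encodings(image, known_face_locations=[box])[0]. The function names are hypothetical, and requiring exactly one face per reference photo is a stricter assumption than "contains a face."

```python
import shutil
from pathlib import Path


def encode_references(photos, locate, encode):
    """Encode only reference photos in which exactly one face is detected.

    Returns (encoded, skipped):
      encoded: list of (photo, encoding)
      skipped: list of (photo, face_count) for photos with zero or several faces
    """
    encoded, skipped = [], []
    for photo in photos:
        boxes = locate(photo)
        if len(boxes) != 1:
            # No face means the encoding would be meaningless; several faces
            # make it ambiguous whose encoding this is.
            skipped.append((photo, len(boxes)))
            continue
        encoded.append((photo, encode(photo, boxes[0])))
    return encoded, skipped


def reject_crop(crop: Path, rejected_dir: Path) -> Path:
    """Move a confirmed false positive out of the pending queue."""
    rejected_dir.mkdir(parents=True, exist_ok=True)
    target = rejected_dir / crop.name
    shutil.move(str(crop), str(target))
    return target
```

Reporting skipped photos instead of silently dropping them makes it obvious when a reference directory has picked up a crop like the plaid pants.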

What I'd do differently

I'd build the notification side last, not first. The first version sent a push for every recognized face, every recognized event, every unknown event — and within an hour I had hundreds of messages and had to kill the notifier. The right rule is: one push per event maximum, and only for the unknown class until I've explicitly opted in to known-person pings. Edge-triggered, dedup-by-event-id, log everything else.
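
That rule is small enough to sketch directly. The class name, the send callable, and the message text are illustrative; the seen-set here lives in memory, so a cron-driven worker would need to persist it between runs.

```python
class EventNotifier:
    """At most one push per event, and only for unknowns unless opted in."""

    def __init__(self, send, notify_known=False):
        self._send = send                  # callable that delivers one message
        self._notify_known = notify_known  # explicit opt-in for known-person pings
        self._pushed = set()               # event ids that already produced a push
        self.log = []                      # every outcome is logged, pushed or not

    def handle(self, event_id, outcome, name=None):
        """Record the outcome and return True if a push was sent."""
        self.log.append((event_id, outcome, name))
        if event_id in self._pushed:
            return False  # dedup by event id
        if outcome == "unknown":
            message = f"Unknown person at the door (event {event_id})"
        elif outcome == "identified" and self._notify_known:
            message = f"{name} is at the door (event {event_id})"
        else:
            return False  # no_face, or a known person without opt-in
        self._pushed.add(event_id)
        self._send(message)
        return True
```

For example, feeding it no_face, unknown, and unknown again for the same event id produces exactly one push, while all three outcomes still land in the log.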