Face recognition on the front door with an active-learning loop
I wanted my front-door doorbell to tell me something more interesting than "someone moved." Specifically: "the dog walker is here," or "an unknown person is at the door, look now." Off-the-shelf face recognition for a doorbell either ships footage to a cloud service or requires a beefy GPU. I had neither.
What was happening
A doorbell ding-and-motion stream is mostly noise. Cars, leaves, the shadow of a tree, the mail carrier walking past on the sidewalk. The useful signal is a tiny minority of events. Without filtering, the notification stream becomes useless within a day. I learned this the hard way the first time I shipped notifications without a throttle.
The pipeline I wanted: poll the doorbell, capture video for every event, extract a few frames, run face detection, and only push a notification when an unknown face shows up. Known faces get recorded silently and go in the log.
What I found
Running on a 2 GB LXC, the CNN-based face detector OOMs immediately. The HOG-based detector fits in memory but is meaningfully worse at side profiles and small faces. The way to make HOG work is to bias the reference set: more reference photos per person, captured under realistic conditions (backlit, side profile, hat on), plus number_of_times_to_upsample=2 on the face-detection call in the encoder, because active-learning crops are small (typically 250–500 px).
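A minimal sketch of that detection call, assuming the face_recognition package (the one installed in the next step) is doing the work. The helper name locate_faces and its optional fr argument are hypothetical; fr exists only so the function can be tried with a stand-in module on a machine without dlib.

```python
def locate_faces(image, fr=None):
    """Return HOG face boxes as (top, right, bottom, left) tuples.

    `image` is an RGB array, e.g. from face_recognition.load_image_file().
    `fr` is a hypothetical hook for passing a stand-in module; by default
    the real face_recognition package is imported lazily.
    """
    if fr is None:
        import face_recognition as fr  # pulls in dlib, so import only when needed
    # model="hog" is the CPU detector that fits in a small container;
    # upsampling twice helps it find faces in small crops.
    return fr.face_locations(image, number_of_times_to_upsample=2, model="hog")
```

Upsampling makes detection slower, so it is worth confirming on real crops that the extra pass actually finds more faces.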
dlib doesn't have a prebuilt wheel for Debian 12 + Python 3.11 by default, so pip tries to compile it from source and fails because cmake isn't in the LXC. The trick: pip install dlib-bin (a community-built prebuilt wheel, ~4 MB), then pip install --no-deps face-recognition face-recognition-models. Don't let face-recognition pull in dlib as a dependency, or you'll be back on the compile-from-source path.
I also lost an afternoon to a wrong API endpoint. The doorbell library's async_recording_download() hits an older endpoint that returns 404 on newer hardware. The fix was to fetch the signed URL with async_recording_url() and stream the file down with aiohttp directly.
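The streaming half of that workaround might look like the sketch below. It assumes session is an aiohttp.ClientSession; the helper name stream_to_file, the ".part" temp-file convention, and the chunk size are illustrative choices, not taken from the original setup.

```python
from pathlib import Path


async def stream_to_file(session, url, dest, chunk_size=64 * 1024):
    """Stream `url` to `dest` in chunks; `session` is an aiohttp.ClientSession."""
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")
    async with session.get(url) as resp:
        resp.raise_for_status()  # fail loudly on a 404 instead of saving an error page
        with tmp.open("wb") as fh:
            async for chunk in resp.content.iter_chunked(chunk_size):
                fh.write(chunk)
    tmp.replace(dest)  # rename only after the download completed
    return dest


# Usage sketch. `device` and `recording_id` are assumptions; the exact
# arguments to async_recording_url() depend on the doorbell library.
#   url = await device.async_recording_url(recording_id)
#   async with aiohttp.ClientSession() as session:
#       await stream_to_file(session, url, "event.mp4")
```

Writing to a temporary name first means a worker scanning the directory never picks up a half-downloaded recording.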
The fix
The pipeline, end to end:
- A cron'd poller asks the doorbell history endpoint for new events and keeps a per-device cursor in a JSON file.
- For each new event, download the recording, extract a frame at ~1s with ffmpeg, and write both to the NAS.
- A separate cron'd worker picks up pending rows. It samples five frames (t = 2, 4, 7, 10, and 14 s) and runs HOG face detection on each.
- For any face found, compare its encoding against the reference DB. Match if distance ≤ 0.55.
- Three outcomes: identified (record name + distance), unknown (crop the face, drop it in a _pending/ directory), or no_face (most events).
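The frame sampling and matching steps above can be sketched as follows. This is an illustration under assumptions, not the actual worker: the function names are hypothetical, the sample times are taken to be seconds, references is assumed to map a name to a list of encodings, and distance is plain Euclidean distance between encodings.

```python
import math

SAMPLE_TIMES = (2, 4, 7, 10, 14)  # assumed to be seconds into the recording
MATCH_THRESHOLD = 0.55            # maximum distance that still counts as a match


def ffmpeg_frame_cmd(video, t, out):
    """Build an ffmpeg command that writes one frame at `t` seconds to `out`."""
    return ["ffmpeg", "-y", "-ss", str(t), "-i", str(video),
            "-frames:v", "1", str(out)]


def classify_face(encoding, references, threshold=MATCH_THRESHOLD):
    """Compare one face encoding against every reference encoding."""
    best_name, best_dist = None, math.inf
    for name, refs in references.items():
        for ref in refs:
            dist = math.dist(encoding, ref)  # Euclidean distance
            if dist < best_dist:
                best_name, best_dist = name, dist
    if best_dist <= threshold:
        return ("identified", best_name, best_dist)
    return ("unknown", None, best_dist)


def classify_event(encodings, references):
    """Return one outcome per detected face, or a single no_face outcome."""
    if not encodings:
        return [("no_face", None, None)]
    return [classify_face(enc, references) for enc in encodings]
```

Each command from ffmpeg_frame_cmd can be handed to subprocess.run(cmd, check=True), once per entry in SAMPLE_TIMES.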
The interesting part is the last step's "unknown" branch. Cropped unknown faces land in a labeling queue, served back through chat with the image inline:
# Surface the crop:
"Unknown face from <timestamp>. Who is this? [image]"
# When I reply with a name, the crop is mv'd into faces/<name>/
# and the encoder regenerates the .encodings.npy cache file.
# DDB rows for that event flip from unknown back to pending and
# get re-processed, this time matching.
That's the active-learning loop. The model doesn't need retraining; the reference set grows organically from "real photos at my actual front door under actual lighting." After a few weeks the false-unknown rate drops to almost nothing.
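A minimal sketch of what happens when a label comes back, assuming the event store can be treated as a list of dicts with event_id and status keys and that a single .encodings.npy cache sits at the top of the faces/ directory. Both are assumptions, as is the function name apply_label; deleting the cache stands in for whatever the real encoder does to regenerate it.

```python
import shutil
from pathlib import Path


def apply_label(crop, name, faces_dir, events, event_id):
    """Move a labeled crop into faces/<name>/ and requeue its event."""
    faces_dir = Path(faces_dir)
    person_dir = faces_dir / name
    person_dir.mkdir(parents=True, exist_ok=True)
    dest = person_dir / Path(crop).name
    shutil.move(str(crop), str(dest))

    # Assumption: removing the cache forces the encoder to rebuild it
    # from the reference photos on its next run.
    (faces_dir / ".encodings.npy").unlink(missing_ok=True)

    # Flip the event from unknown back to pending so the worker
    # re-processes it against the larger reference set.
    for row in events:
        if row["event_id"] == event_id and row["status"] == "unknown":
            row["status"] = "pending"
    return dest
```

In a real deployment the name should be validated before it is used as a directory name, since it arrives as free text from chat.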
A failure that taught me something: HOG misread a pair of plaid pants as a face. That is the "contaminated reference encoding" failure mode: when a reference photo doesn't actually have a face in it, the system will happily produce a meaningless distance number against whatever pixels were there. The fix was to add a "_rejected" bucket for confirmed false positives so they don't keep poisoning the pending queue, and to verify that a reference photo contains a face before generating its encoding.
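Both halves of that fix can be sketched briefly, again assuming the face_recognition package. The names encode_reference and reject_crop are hypothetical, and so is the optional fr argument, which only exists so the guard can be exercised with a stand-in module. Note that a detector-based check only catches photos where no face is found at all; a false positive the detector itself believes in still needs a human to send it to the rejected bucket.

```python
import shutil
from pathlib import Path


def encode_reference(image, fr=None):
    """Return an encoding for a reference photo, or None if no face is found."""
    if fr is None:
        import face_recognition as fr  # imported lazily because it needs dlib
    boxes = fr.face_locations(image, number_of_times_to_upsample=2, model="hog")
    if not boxes:
        return None  # refuse to encode a photo with no detectable face
    # Use the first detected face; encoding only detected boxes avoids
    # producing numbers from arbitrary pixels.
    return fr.face_encodings(image, known_face_locations=boxes)[0]


def reject_crop(crop, rejected_dir):
    """Move a confirmed false positive out of the pending queue."""
    rejected_dir = Path(rejected_dir)
    rejected_dir.mkdir(parents=True, exist_ok=True)
    dest = rejected_dir / Path(crop).name
    shutil.move(str(crop), str(dest))
    return dest
```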
What I'd do differently
I'd build the notification side last, not first. The first version sent a push for every recognized face, every recognized event, every unknown event. Within an hour I had hundreds of messages and had to kill the notifier. The right rule is: one push per event maximum, and only for the unknown class until I've explicitly opted in to known-person pings. Edge-triggered, dedup-by-event-id, log everything else.
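That rule is small enough to sketch. The class below is a hypothetical illustration: send is any callable that delivers a push, and the set of already-notified event ids lives in memory, so a cron-driven worker would need to persist it between runs.

```python
import logging

log = logging.getLogger("doorbell.notify")


class Notifier:
    """At most one push per event, unknown faces only unless opted in."""

    def __init__(self, send, notify_known=False):
        self.send = send                  # callable that delivers the push
        self.notify_known = notify_known  # explicit opt-in for known people
        self._notified = set()            # event ids already pushed

    def handle(self, event_id, outcome, name=None):
        """Return True if a push was sent for this call."""
        if event_id in self._notified:
            log.info("event %s already notified, skipping", event_id)
            return False
        if outcome == "unknown":
            message = "Unknown person at the door"
        elif outcome == "identified" and self.notify_known:
            message = f"{name} is at the door"
        else:
            log.info("event %s (%s): logged only", event_id, outcome)
            return False
        self._notified.add(event_id)
        self.send(message)
        return True
```

With the defaults, Notifier(send).handle("e1", "unknown") pushes once, and a second call with the same event id is logged and dropped.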