Vesta is a single-page Flask app that turns an RTSP camera, or any uploaded video, into a small local security-camera platform: live person detection, autonomous threat-triggered recording, a recordings library with AI threat scoring, and natural-language search. I built it for Stratford Preparatory to deter and detect theft across multiple cameras, and it runs entirely on local hardware, with no cloud services involved.
YOLO handles the cheap gating: detecting people and motion. When the system is armed, a person dwelling on the live feed triggers autonomous recording, and the resulting clip is queued for analysis. A Qwen-style vision-language model (VLM) running on llama.cpp handles the higher-level threat assessment.
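As a rough illustration of the dwell trigger described above, here is a minimal sketch rather than Vesta's actual code. The class name `DwellTrigger`, the 3-second dwell threshold, and the half-second grace window for detector dropouts are all assumptions made for the example. It takes one per-frame "person detected" flag at a time, as a YOLO pass might produce, and fires once when a person has stayed in view long enough while the system is armed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DwellTrigger:
    """Fires once when a person has been continuously present long enough."""

    dwell_seconds: float = 3.0  # assumed threshold, not taken from Vesta
    grace_seconds: float = 0.5  # assumed tolerance for brief detector dropouts
    armed: bool = False
    _first_seen: Optional[float] = None
    _last_seen: Optional[float] = None
    _fired: bool = False

    def update(self, person_detected: bool, now: float) -> bool:
        """Feed one detection result; return True when recording should start."""
        if not self.armed:
            self._reset()
            return False
        if person_detected:
            if self._first_seen is None:
                self._first_seen = now
            self._last_seen = now
        elif self._last_seen is not None and now - self._last_seen > self.grace_seconds:
            # The person has been gone longer than the grace window: start over.
            self._reset()
            return False
        if (
            self._first_seen is not None
            and not self._fired
            and now - self._first_seen >= self.dwell_seconds
        ):
            self._fired = True  # fire only once per continuous presence
            return True
        return False

    def _reset(self) -> None:
        self._first_seen = None
        self._last_seen = None
        self._fired = False
```

In use, the live-feed loop would call `update()` once per processed frame with the current timestamp and start a recording whenever it returns `True`. Firing only once per continuous presence keeps a person who lingers from spawning a new clip on every frame.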
How analysis works
YOLO first keeps only the frames that contain people, with sampling biased toward high-motion frames. Those frames are tiled into 4x4 temporal mosaics that span the clip's timeline. The vision model captions each mosaic, and a final pass scores the clip from 0 to 100 and writes a plain-language threat assessment. The result is a recordings library you can search in plain English: a query like "people tampering with the water heater" returns the clips that match.
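The frame selection and tiling steps can be sketched roughly as follows. This is an illustrative approximation, not Vesta's implementation: the function names `pick_frames` and `tile_mosaic` are hypothetical, and the bucketing strategy (split the person frames into equal time buckets and keep the highest-motion frame from each) is just one assumed way to favor motion while still spanning the whole clip. Only NumPy is used; the per-frame person flags and motion scores are assumed to come from an earlier detection pass.

```python
import numpy as np


def pick_frames(has_person, motion, count=16):
    """Return up to `count` frame indices in time order.

    Frames containing a person are split into `count` consecutive time
    buckets, and the highest-motion frame in each bucket is kept.
    """
    candidates = np.flatnonzero(np.asarray(has_person, dtype=bool))
    if candidates.size == 0:
        return []
    motion = np.asarray(motion, dtype=float)
    picked = []
    for bucket in np.array_split(candidates, min(count, candidates.size)):
        picked.append(int(bucket[np.argmax(motion[bucket])]))
    return picked


def tile_mosaic(frames, grid=4):
    """Tile up to grid*grid equally sized HxWxC frames into one image.

    Frames are placed row by row in the order given, so reading the mosaic
    left to right, top to bottom follows the clip's timeline. Unused cells
    stay black.
    """
    frames = list(frames)[: grid * grid]
    h, w, c = frames[0].shape
    canvas = np.zeros((grid * h, grid * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        row, col = divmod(i, grid)
        canvas[row * h:(row + 1) * h, col * w:(col + 1) * w] = frame
    return canvas
```

Each resulting mosaic is a single image, so the vision model can be asked to caption sixteen moments of the clip in one request instead of sixteen separate ones.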
