I built a hands-free HUD for smart glasses that runs a real-world speedrun timer and auto-splits based on what the camera sees. Demo scenario: making sushi.
Demo: https://www.youtube.com/watch?v=NuOVlyr-e1w
Repo: https://github.com/RealComputer/GlassKit
I initially tried a multimodal LLM for scene understanding, but the latency and consistency were not good enough for this use case, so I switched to a small object detection model (fine-tuned RF-DETR). It just runs an inference loop on the camera feed. This also makes on-device/offline use feasible (today it still runs via a local server).
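For anyone curious about the auto-split part, the core loop is roughly this shape. This is a minimal sketch, not the repo code: detect() is a hypothetical stand-in for the fine-tuned RF-DETR model, and the split conditions/labels are made up for the sushi demo.

    import time
    import cv2  # OpenCV, used here just for camera capture

    def detect(frame) -> set[str]:
        # Hypothetical: plug the fine-tuned object detector in here and
        # return the set of class labels it sees in this frame.
        return set()

    # Ordered splits: each fires once all of its labels appear in a frame.
    # Labels are illustrative, not the real demo classes.
    SPLITS = [
        ("rice cooked", {"rice", "bowl"}),
        ("fish sliced", {"knife", "fish"}),
        ("roll assembled", {"roll", "bamboo_mat"}),
    ]

    def run_timer(camera_index: int = 0) -> None:
        cap = cv2.VideoCapture(camera_index)
        start = time.monotonic()
        current = 0
        while current < len(SPLITS) and cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                continue
            name, required = SPLITS[current]
            if required <= detect(frame):  # all required objects seen -> split
                print(f"SPLIT {name}: {time.monotonic() - start:0.2f}s")
                current += 1
        cap.release()

    if __name__ == "__main__":
        run_timer()

In practice you'd also want some debouncing (e.g. require the condition to hold for a few consecutive frames) so a single noisy detection doesn't trigger a split.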