I did try inferring depth from strategically placed round colored stickers. With a single fixed camera, I didn't see noteworthy results until I cranked the resolution up to 1080p or higher with FP64 inputs. I used a U-Net architecture with skip connections between the 1/4, 1/8, and 1/16 resolution samples. Diagrams here: https://photos.app.goo.gl/xLmE9Xp7UYmekBwV6
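
For a rough sense of scale, here is a minimal sketch of the feature-map sizes those skip connections would carry. The function name `skip_shapes`, the 1920x1080 frame size, and the choice to floor odd dimensions are all assumptions for illustration; the actual layer shapes are in the linked diagrams.

```python
def skip_shapes(height, width, factors=(4, 8, 16)):
    """Return the (height, width) of the feature map at each downsampling factor.

    Assumes each level floors the dimension (no padding), which is one common
    convention; a padded encoder would round up instead.
    """
    return [(height // f, width // f) for f in factors]


# Hypothetical 1080p input frame (1080 rows, 1920 columns).
for factor, shape in zip((4, 8, 16), skip_shapes(1080, 1920)):
    print(f"1/{factor} resolution: {shape[0]} x {shape[1]}")
```

Running the same function on a much smaller frame shows how few pixels are left at the 1/16 level, which is where a small sticker can shrink to less than one feature-map cell.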

Keep in mind that PrintNanny's video encoding and neural network pipelines run on-device, so real-world results favor models that still perform well when quantized to u8 at 320p resolution. The constraints are part of the fun, but they create a bias towards techniques that retain signal when the model is compressed. Segmentation and depth models naturally lose more fidelity in that scenario than a dead-simple CNN/SSD pumping out bounding box proposals.
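
As a toy illustration of why fine-grained outputs suffer under u8 quantization, the sketch below applies plain affine quantization to a few made-up depth values. This is not PrintNanny's pipeline; the helper names, the 0 to 10 metre range, and the sample values are assumptions chosen only to show the effect.

```python
def quantize_u8(values, lo, hi):
    """Affine-quantize floats in [lo, hi] to integer codes in 0..255."""
    scale = (hi - lo) / 255.0
    return [max(0, min(255, round((v - lo) / scale))) for v in values]


def dequantize_u8(codes, lo, hi):
    """Map u8 codes back to floats; anything finer than one step is lost."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]


# Hypothetical depth values in metres, quantized over an assumed 0..10 m range.
depths = [2.000, 2.005, 2.010, 6.000]
codes = quantize_u8(depths, 0.0, 10.0)
restored = dequantize_u8(codes, 0.0, 10.0)

# The three nearby depths fall inside one quantization step (about 0.039 m),
# so they share a single code and cannot be told apart after dequantizing.
print(codes)
print(restored)
```

A bounding box detector mostly needs coarse coordinates and a class score, so it tolerates a step size like this far better than a dense depth or segmentation map does.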