What I mean is you can probably get this to work with pure prompt engineering, a...

What I mean is you can probably get this to work with pure prompt engineering, and English language just by saying stuff like "Here's an image with correct hold positions", and then submit another image that says "Are the positions in this image correct." I just meant basic image understanding of AI, like what OpenAI has.

I realize there's an infinite number of ways to accomplish this that would be more complex. What I was stating is the simplest possible way being pure prompt engineering. You could even try with OpenAI right now, I didn't try.