With so much recent focus by OpenAI/Google on AI's visual capabilities, does anyone know when we might see an OCR product as good as Whisper is for voice transcription? (Or has that already happened?) I had to convert some PDFs and MP3s to text recently and was struck by the vast difference in output quality. Whisper's transcription was near-flawless, while all the OCR software I tried struggled with formatting, missed words, and made many errors.
You might enjoy this breakdown of the lengths one person went to in order to take advantage of the iOS Vision API and create a local web service for transcribing some very challenging memes:
We use GPT-4o for data extraction from documents, and it's really good. I published a small library that does a lot of the document conversion and output parsing: https://npmjs.com/package/llm-document-ocr
For straight OCR it works really well, but at the end of the day it's still not 100% accurate.
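For anyone curious what the GPT-4o route looks like without a library, here is a rough sketch of the request body for one page image. To be clear, this is not the API of llm-document-ocr. `buildOcrRequest`, the prompt wording, and the assumption that the page is already a base64-encoded PNG are all made up for illustration. Only the message shape (a `text` part plus an `image_url` part carrying a data URL) follows OpenAI's Chat Completions format.

```typescript
// Sketch only: builds the JSON body for a Chat Completions request that asks
// GPT-4o to transcribe a single page image. Names and prompts are hypothetical.

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatRequest {
  model: string;
  temperature: number;
  messages: { role: "system" | "user"; content: string | ContentPart[] }[];
}

// Assumption: `pageBase64` is a base64-encoded PNG of one document page.
function buildOcrRequest(pageBase64: string, model: string = "gpt-4o"): ChatRequest {
  return {
    model,
    temperature: 0, // lower temperature to reduce run-to-run variation
    messages: [
      {
        role: "system",
        content: "Transcribe the page exactly. Do not summarize or correct the text.",
      },
      {
        role: "user",
        content: [
          { type: "text", text: "Return the text of this page as plain text." },
          // The image travels inline as a data URL.
          { type: "image_url", image_url: { url: `data:image/png;base64,${pageBase64}` } },
        ],
      },
    ],
  };
}
```

You would POST the result as JSON to `https://api.openai.com/v1/chat/completions` with an `Authorization: Bearer <your key>` header and read the text from `choices[0].message.content`. Since the output is still not 100%, it is worth spot-checking pages against the source rather than trusting it blindly.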