> The system used optical character recognition—the same technology that lets you search for a word in a PDF file

That's not correct, at least for "digitally-born PDFs" that were made on a computer and haven't been scanned. In that case, the PDF can be parsed directly, without OCR, to get text. That's what a tool like PyPDF2 does, for example.
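To make that concrete, here is a rough sketch of why no OCR is needed: in a digitally-born PDF the page text is typically stored as string operands to text-showing operators such as `Tj` inside the page's content stream, usually Flate (zlib) compressed. The sample stream and the helper names (`extract_text`, `contains_word`) are made up for illustration, and real PDFs add font encodings, `TJ` arrays, hex strings and so on, which is why you'd use a proper library in practice.

```python
import re
import zlib
from typing import List

# Hypothetical, hand-written page content stream (not taken from a real file).
SAMPLE_STREAM = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET"

# A literal string "( ... )" followed by the Tj (show text) operator.
# Escaped characters such as \( and \) are allowed inside the string.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_text(content_stream: bytes) -> List[str]:
    """Return the literal strings shown by Tj operators in a content stream."""
    strings = []
    for match in TJ_PATTERN.finditer(content_stream):
        raw = match.group(1)
        # Undo the backslash escapes for parentheses and backslashes.
        unescaped = re.sub(rb"\\([()\\])", rb"\1", raw)
        strings.append(unescaped.decode("latin-1"))
    return strings


def contains_word(content_stream: bytes, word: str) -> bool:
    """Plain substring search over the extracted strings; no image analysis."""
    return any(word in s for s in extract_text(content_stream))


# Streams are usually stored compressed; Flate streams use the zlib format.
compressed = zlib.compress(SAMPLE_STREAM)
decoded = zlib.decompress(compressed)

print(extract_text(decoded))
print(contains_word(decoded, "world"))
```

Searching is then just string matching on what comes out. OCR only enters the picture when the page is an image, e.g. a scan or a screenshot.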



It sounds like they were parsing screenshots that workers submitted by SMS.


I'm not disputing that they used OCR. What's wrong is the claim that searching for text in a PDF relies on OCR; it usually doesn't.



