
Tesseract is great in certain situations, but a lot comes down to having a robust preprocessing pipeline and a correction flow. I wrote about this a few years back:

https://chris.improbable.org/2014/3/17/content-search-on-a-b...

https://blogs.loc.gov/thesignal/2014/08/making-scanned-conte...

Basically the big problems are getting the content deskewed (even a slight rotation will cause accuracy to plummet, which is a problem when there's page curl or a flaw in the original printing process), breaking the text into clean segments (non-trivial in newspaper layouts), and dealing with noise from dust or from content on the other side of the page bleeding through. The collection I’ve worked with the most (https://chroniclingamerica.loc.gov/) also had a lot of problems because much of it was scanned from microfilm rather than from the original pages. Tesseract 4 is better, but in my testing you aren’t going to see revolutionary improvements without investing in tooling to identify segments and clean them up before passing them to Tesseract.
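
For illustration, a minimal sketch of that kind of preprocessing, assuming OpenCV, NumPy, and pytesseract are installed. The blur and threshold parameters are made-up starting points rather than tuned values, "page.png" is a stand-in filename, and minAreaRect's angle convention differs across OpenCV versions, so check yours:

    import cv2
    import numpy as np
    import pytesseract

    def deskew(gray):
        # Estimate skew from the min-area rectangle around dark pixels.
        # Note: minAreaRect's angle convention changed around OpenCV 4.5,
        # so verify this branch against your version.
        coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        if angle < -45:
            angle = -(90 + angle)
        else:
            angle = -angle
        h, w = gray.shape
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        return cv2.warpAffine(gray, m, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)

    img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
    img = deskew(img)
    # Median blur knocks out dust specks; adaptive thresholding handles
    # bleed-through from the reverse side better than a global cutoff.
    img = cv2.medianBlur(img, 3)
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 31, 15)
    print(pytesseract.image_to_string(img))

Segmentation is the part this leaves out, and it's the hard part for newspapers: you'd crop each article region and run it through Tesseract separately rather than feeding the whole page.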

Since that entire collection is public domain and freely available for download (https://chroniclingamerica.loc.gov/data/ or s3://ndnp-batches), researchers have run various ML tools over it. That definitely looks promising, but it is not a silver bullet by any means. There are some trained files available here, along with a large public S3 dataset:

https://news-navigator.labs.loc.gov/
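
If you want to poke at the raw batches, here's a hedged sketch for listing the public bucket with boto3. Unsigned access and the us-east-1 region are assumptions on my part; the bucket name comes from the s3://ndnp-batches URL above:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous client; the data is public, so no credentials needed.
    s3 = boto3.client("s3", region_name="us-east-1",
                      config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(Bucket="ndnp-batches", MaxKeys=20)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])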


