
I OCRed your comment with Tesseract:

``` I've been really impressed with Tesseract - | used it last month to add invisible OCR text (1) to scanned PDFs I reference a lot. My scans are quite good, but | was still impressed with the accuracy.

| also OCRed the TOC, playing with the page segmentation setting (2) in the terminal until | got output I could copy & paste to add a navigable table of contents.

1: with the help of https://github.com/ocrmypdfiOCRmyPDE

2: https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.h..., “ Using different Page Segmentation Modes” ```

This kind of mirrors my earlier experience with Tesseract: if it can't get OCRing a screenshot right, what can it get right? It's not like "I used" is such a rare phrase either, but it replaced the I with a pipe.



>if it can't get OCRing a screenshot right, what can it get right?

Book scans, which is what it was designed for.

If you read the fine manual, you would see that they suggest the _minimum_ resolution to run it on is an x-height of 20 pixels, and screens seldom have one higher than 10 pixels. With those settings I got the following out of OP's comment:

     I've been really impressed with Tesseract - I used it last month to add invisible OCR text (1) to scanned PDFs I reference a lot. My scans
    are quite good, but I was still impressed with the accuracy.
    
    I also OCRed the TOC, playing with the page segmentation setting (2) in the terminal until I got output I could copy & paste to adda
    navigable table of contents.
    
    1: with the help of https://github.com/ocrmypdt/OCRmyPDF
    
    2: https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.h..., “ Using different Page Segmentation Modes”
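
For anyone who wants to reproduce this, here is a rough sketch of the arithmetic, not the exact steps I ran. It assumes you have measured the x-height of the text in your screenshot yourself; `upscale_factor` and `resize_percent` are hypothetical helper names, and the 20-pixel default is just the minimum mentioned above.

```python
import math


def upscale_factor(x_height_px: float, min_x_height_px: float = 20.0) -> int:
    """Return the smallest whole-number scale that lifts the measured
    x-height to at least the minimum (assumed 20 px, per the comment above)."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never scale down: text that is already large enough keeps a factor of 1.
    return max(1, math.ceil(min_x_height_px / x_height_px))


def resize_percent(x_height_px: float, min_x_height_px: float = 20.0) -> str:
    """Format the factor as a percentage string, the form many image
    tools accept for a resize argument (an assumption about your tool)."""
    return f"{upscale_factor(x_height_px, min_x_height_px) * 100}%"


if __name__ == "__main__":
    # A typical screenshot with a 10 px x-height needs doubling.
    print(upscale_factor(10), resize_percent(10))
```

Resize the screenshot by that amount in whatever image tool you like, then run something like `tesseract big.png out` on the enlarged file (the file names here are placeholders).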


I OCRed an unprocessed screenshot from the chapter's table of contents (1), which gave me (2). The collated table of contents (3) was error-free, but as your example shows, this OCR isn't good enough to skip checking and proofreading.

1: https://nexus.armylane.com/files/vogue-sewing-11-toc-screens...

2: https://nexus.armylane.com/files/tesseract-ocr-output.png

3: https://nexus.armylane.com/files/vogue-toc.txt



