
This reminds me of how, many years ago, I learned about the world record holder for computer optical character recognition (OCR) accuracy.

The computer scientists took as their target an Eastern European chess journal that printed move-by-move reports of tournament chess matches. They incorporated a crude chess engine into the recognition step, estimating the likelihood of possible next moves and combining that with the OCR engine's likelihood estimate that the printed characters were particular glyphs. Despite the very low quality of the printing, the journal had very high quality editing, so the source material was self-consistent. Completely illegible characters could mostly be filled in as the only sensible game moves that were allowed. It took hundreds of hours of human review time to find a single OCR mistake from this process!
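A minimal sketch of that combination idea, with made-up numbers and move priors (not taken from the paper): score each legal move by the chess engine's prior times the OCR glyph likelihoods for its printed notation, and keep the best-scoring move.

```python
# Illustrative only: combine a chess engine's move prior with per-glyph OCR
# likelihoods and pick the move that best explains the smudged print.
import math

def score_move(san, engine_prior, glyph_likelihoods):
    """Log-score of one candidate: log P(move) + sum of log P(glyph | char)."""
    if len(san) != len(glyph_likelihoods):
        return float("-inf")               # crude handling of length mismatch
    score = math.log(engine_prior)
    for ch, probs in zip(san, glyph_likelihoods):
        score += math.log(probs.get(ch, 1e-6))   # tiny floor for unseen glyphs
    return score

# Hypothetical OCR output for a badly printed move: per-position probabilities
# that the blob on the page is a given character.
ocr = [
    {"N": 0.4, "K": 0.35, "R": 0.25},      # first glyph is nearly illegible
    {"f": 0.9, "t": 0.1},
    {"3": 0.8, "8": 0.2},
]

# Hypothetical legal moves in the current position, with crude engine priors.
candidates = [("Nf3", 0.5), ("Kf3", 0.01), ("Rf3", 0.1), ("e4", 0.39)]

best = max(candidates, key=lambda m: score_move(m[0], m[1], ocr))
print(best[0])   # -> "Nf3": the game context rescues the unreadable glyph
```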



That's really awesome, but it feels like cheating to call it the OCR record holder. It's really OCR + context. However, it would be interesting to apply the same idea at the word and sentence level of written language. I'm guessing there are people who already do this.
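A small illustration of that word-level analogue, under the assumption of a toy word-frequency prior (the vocabulary and probabilities are invented for the example): rescore the OCR engine's character guesses so that common words beat visually similar nonsense strings.

```python
# Illustrative only: word-frequency prior rescoring of OCR character guesses.
import math

vocab = {"the": 0.05, "thc": 1e-7, "tho": 1e-4}   # toy unigram priors

# Per-position OCR likelihoods for a smudged three-letter word.
ocr = [
    {"t": 0.95, "f": 0.05},
    {"h": 0.9, "n": 0.1},
    {"e": 0.4, "c": 0.35, "o": 0.25},             # last glyph is ambiguous
]

def score(word):
    if len(word) != len(ocr):
        return float("-inf")
    return math.log(vocab.get(word, 1e-9)) + sum(
        math.log(probs.get(ch, 1e-6)) for ch, probs in zip(word, ocr)
    )

print(max(vocab, key=score))   # -> "the": the prior breaks the e/c/o tie
```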


> it feels like cheating to call it the OCR record holder

This is how humans recognize text, though. For the most part, humans don't try to read languages they don't understand. To deny a computer access to context is like asking a human to transcribe a language they don't understand.


I'm a pretty fast typist, but if you ask me to transcribe Latin-script text that is gibberish, or in a language I don't know... not so fast. A lot of it is how much I have to slow down to accurately recognize the characters.


I would guess your typing also gets significantly slower. Mine does; I've gone through periods where I had trouble typing the word "in" because my muscle memory turned it into "int".


> It took hundreds of hours of human review time to find a single OCR mistake from this process!

This stands out to me as improbable. Not that the error rate could be that low, but that they actually had humans spend hundreds of hours checking the accuracy of difficult character recognition. How did that happen?


Put a handful of grad students in a room for a week and you have hundreds of hours right there.


I searched out the article: "Reading Chess", 1990, HS Baird and Ken Thompson. (Yes, that Ken Thompson).

http://doc.cat-v.org/bell_labs/reading_chess/reading_chess.p...

It doesn't actually quantify the human proofreading time. I might have recalled incorrectly; I heard about this in the late 1990s as a war story from another OCR researcher.


It's an embarrassing problem to have a system with "accuracy too high to measure"!



