
OCR software tends to have problems with the ligatures used for "s" in old fonts, so it often transcribes them as "f".

More expensive OCR software tends to get around this by using probability models to guess the correct word (e.g. "best" is much more likely than "beft"), but I'm guessing they're using some mid-range OCR package.
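
To make the "probability model" idea concrete, here is a minimal sketch, not how any particular OCR package does it. It assumes a toy word-frequency table (WORD_COUNTS, invented for illustration) and two hypothetical helpers, candidates and correct: every "f" in a token might be a misread long s, so try all the f/s spellings and keep the most frequent one.

```python
from itertools import product

# Hypothetical toy counts; a real system would use a large corpus
# and probably a context-aware language model.
WORD_COUNTS = {
    "best": 900,
    "first": 800,
    "fast": 300,
    "of": 5000,
    "transactions": 120,
}

def candidates(word):
    """Yield every spelling obtained by keeping each 'f' or swapping it for 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)

def correct(word):
    """Return the most frequent candidate (lowercased), or the word unchanged
    if no candidate appears in the frequency table."""
    best = max(candidates(word.lower()), key=lambda w: WORD_COUNTS.get(w, 0))
    return best if WORD_COUNTS.get(best, 0) > 0 else word

print(correct("beft"))
print(correct("Tranfactions"))
```

A word with many "f"s produces 2^n candidates, so a real implementation would prune the search rather than enumerate it.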



Seriously? Did you even look at the pages in question?

They are reproductions of the originals, and maintain the original fonts and orthography. This includes the long s, as well as certain ligatures (like ct), and has absolutely nothing to do with their choice of OCR software. In fact, there's no indication that OCR software was used at all.


The indexing seems to be based on OCR; for example, try the keyword search "Tranfactions".


He didn't say it did, but it is an issue when converting and indexing old documents. Google's ngrams, for instance, regularly transcribes the long s as 'f', so you have to search for both possibilities.
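
Searching for both possibilities can be scripted. A rough sketch, assuming the simplified rule that every lowercase "s" except a word-final one was printed as a long s and then misread as "f" (actual typographic conventions had more exceptions); long_s_ocr_form and search_variants are hypothetical names, not part of any search API:

```python
def long_s_ocr_form(word):
    """Replace every lowercase 's' except a word-final one with 'f'.

    Simplifying assumption: the long s was used everywhere except at the
    end of a word, and the OCR misread every one of them as 'f'.
    """
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last

def search_variants(word):
    """Return the modern spelling plus its likely OCR'd form, without duplicates."""
    variants = [word]
    ocr_form = long_s_ocr_form(word)
    if ocr_form != word:
        variants.append(ocr_form)
    return variants

print(search_variants("Transactions"))
print(search_variants("best"))
```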



