You're saying 'the text', without normalizing the rows and columns (basically tab-, space-, or newline-delimited text with a sporadic number of lines per row), was all you needed to send? I still have to normalize my tables even for GPT-4, I guess because I have weird merged rows and columns that layer grouping info on top of the table data itself.
Exactly. Just sent raw Tesseract output, with no formatting or "fix the OCR text" step. So the data looked like:
```
col1col2col3\nrow label\tdatapoint1\tdatapoint2...
```
Very messy.
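Roughly, the whole flow was something like this (a sketch, assuming pytesseract and the openai client; the file name and prompt are placeholders):

```
# Sketch: raw Tesseract text straight into GPT-4, no normalization pass.
# Assumes pytesseract and the openai package; file name and prompt are illustrative.
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Raw OCR text: tabs/spaces/newlines, sporadic lines per row, no cleanup.
raw_text = pytesseract.image_to_string(Image.open("scanned_table.png"))

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Extract the table from this OCR text as CSV."},
        {"role": "user", "content": raw_text},
    ],
)
print(response.choices[0].message.content)
```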
I don't think this generalizes at the same 100% accuracy across arbitrary OCR output (it can be _really_ bad). I'm still planning on doing a first pass with a better table-OCR system like Textract, Document AI, PaddlePaddle's table recognition, etc., which should improve accuracy.
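For the Textract route, that first pass would look roughly like this (a sketch using boto3's analyze_document with the TABLES feature; the file name is a placeholder and AWS credentials are assumed to be configured):

```
# Sketch: first-pass table OCR with AWS Textract via boto3.
import boto3

textract = boto3.client("textract")

with open("scanned_table.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],  # ask for table structure, not just raw text
    )

# Textract returns a flat list of blocks; CELL blocks carry row/column indices,
# which is exactly the structure that raw Tesseract output lacks.
cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
for cell in cells:
    print(cell["RowIndex"], cell["ColumnIndex"], cell.get("Confidence"))
```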
Yeah, my use cases are in the really bad category. I've been building parsers for a while, and I've basically given up and fallen back to manually specifying "rows of interest, if present" logic. Camelot got so close, but I ended up building my own control layer on top of pdfminer.six to accommodate my documents (I'd still recommend Camelot if you're exploring). It absolutely sucks needing to be so specific out of the gate, but at least the context rarely changes.
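For anyone still exploring, getting started with Camelot is just a few lines (a sketch; the file name and page range are placeholders):

```
# Sketch: basic Camelot table extraction; file name is a placeholder.
import camelot

# flavor="lattice" works on ruled tables; use "stream" for whitespace-separated ones.
tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")

print(tables[0].parsing_report)  # accuracy/whitespace metrics for the first table
df = tables[0].df                # the table as a pandas DataFrame
print(df.head())
```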
What is the source of these nasty docs? I am also working on a layer above pdfminer.six to parse tables. It seems like this task is never done. LLMs have had mixed results for me too. I am focused on documents containing invoices, income statements, etc., from the real estate industry.
My email is in my profile if you want to reach out and compare notes!
Definitely tried that way too; it didn't work - my tables are pretty dang dumb. Merged cells, confidence intervals, weird characters in the cell fields that change based on the row values and break a simple regex test. It really needs a billion-dollar-company solution, but I'm about to punt it to the moon because it's never fully done.