It’s the big unsolved problem and nobody’s talking about it. I’ve had some decent success asking an expensive model to generate the chunks and combining that with document location, and my next plan for an upcoming project is to do that hierarchically, but there’s no generally accepted solution yet.
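To make the "model generates chunks + document location" idea concrete, here's a rough sketch of the shape I mean. Everything here is hypothetical naming on my part (`Chunk`, `attach_locations`); the one real assumption is that the model echoes chunk text verbatim so you can re-find each chunk's offset in the source, which needs validation in practice:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str                 # chunk text, as emitted by the model
    source_path: str          # where the source file lives on disk
    section_path: list = field(default_factory=list)  # heading trail, e.g. ["2. Methods", "2.3 Sampling"]
    char_start: int = -1      # offset of the chunk in the source document, -1 if unknown

def attach_locations(doc_text: str, source_path: str, chunk_texts: list) -> list:
    """Re-find each model-generated chunk in the original document so every
    chunk carries its location metadata. A sketch only: if the model
    paraphrased instead of copying verbatim, we keep the chunk but lose the offset."""
    chunks, cursor = [], 0
    for t in chunk_texts:
        start = doc_text.find(t, cursor)
        if start == -1:
            # Model didn't echo verbatim; fall back to a chunk with no offset.
            chunks.append(Chunk(t, source_path))
            continue
        cursor = start + len(t)
        chunks.append(Chunk(t, source_path, char_start=start))
    return chunks
```

The hierarchical version would just populate `section_path` from the heading structure as you descend, so a chunk can be embedded together with its trail of ancestors.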
RAG’s big problem is turning PDFs into chunks, both the parsing problem and the chunking problem. I paid someone to do the parsing part into markdown for a project recently (including table data summaries) and it worked well. MathPix has a good API for this, but it only works sensibly for PDFs that don’t have insane layouts, and many do.
The data source I have is a filesystem with docs, PDFs, graphs etc.
Will need to expand folder names and file abbreviations, do repetitive analysis to find footers and headers, locate titles on first pages, and dedupe a lot. It seems like some kind of content+hierarchy+keywords+subtitle will need to be vectorized, like a card catalog.
If there's a list of techniques and their optimal use cases I haven't found it.
I started writing one for the day job, but then graphRAG happened, and Gartner is saying all RAG will be graphRAG.
You can't fight Gartner, no matter how wrong they are, so the work stopped; now everything is a badly implemented graph.
That's a long way of saying: if there is a comparison out there, a link would be most appreciated.