It’s the big unsolved problem and nobody’s talking about it. I’ve had some decent success asking an expensive model to generate the chunks and combining that with document location, and my next plan for an upcoming project is to do that hierarchically, but there’s no generally accepted solution yet.
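To make the "model generates chunks + document location" idea concrete, here's a rough sketch of the shape I mean. Everything here is hypothetical naming on my part (`Chunk`, `attach_locations`); the one real assumption is that the model echoes chunk text verbatim so you can re-find each chunk's offset in the source, which needs validation in practice:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str                 # chunk text, as emitted by the model
    source_path: str          # where the source file lives on disk
    section_path: list = field(default_factory=list)  # heading trail, e.g. ["2. Methods", "2.3 Sampling"]
    char_start: int = -1      # offset of the chunk in the source document, -1 if unknown

def attach_locations(doc_text: str, source_path: str, chunk_texts: list) -> list:
    """Re-find each model-generated chunk in the original document so every
    chunk carries its location metadata. A sketch only: if the model
    paraphrased instead of copying verbatim, we keep the chunk but lose the offset."""
    chunks, cursor = [], 0
    for t in chunk_texts:
        start = doc_text.find(t, cursor)
        if start == -1:
            # Model didn't echo verbatim; fall back to a chunk with no offset.
            chunks.append(Chunk(t, source_path))
            continue
        cursor = start + len(t)
        chunks.append(Chunk(t, source_path, char_start=start))
    return chunks
```

The hierarchical version would just populate `section_path` from the heading structure as you descend, so a chunk can be embedded together with its trail of ancestors.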
RAG’s big problem is turning PDFs into chunks, both the parsing problem and the chunking problem. I paid someone to do the parsing part into markdown for a project recently (including table data summaries) and it worked well. MathPix has a good API for this, but it only works sensibly for PDFs that don’t have insane layouts, and many do.
The data source I have is a filesystem with docs, PDFs, graphs etc.
Will need to expand folder names and file abbreviations, do repetitive analysis to find footers and headers, locate titles on first pages, and dedupe a lot. It seems like some kind of content+hierarchy+keywords+subtitle will need to be vectorized, like a card catalog.
If there's a list of techniques and their optimal use cases I haven't found it.
I started writing one for the day job, but then graphRAG happened, and Gartner is saying all RAG will be graphRAG.
You can't fight Gartner, no matter how wrong they are, so the work stopped; now everything is a badly implemented graph.
That's a long way of saying: if there is a comparison out there, a link would be most appreciated.