
I chunk pages and generate embeddings for each chunk. So there's no real size limit per page.


The more detail, the better. If `<section>` elements are found you chunk those? Do you do it recursively or do you stop after a certain level? And when section elements don't exist, you use `<h1>`, `<h2>`, etc. to infer logical chunks?


Having looked at a lot of HTML pages, I noticed that sections are not really the default. I rely on headings (h1, h2, ...) to chunk each page. Each chunk has its heading hierarchy attached to it. There are a lot of optimizations that could be done at that level.
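
To make the heading-based approach concrete, here is a minimal sketch of one way it could work, not the actual implementation described above. It uses only Python's standard `html.parser`. The names `HeadingChunker` and `chunk_html` are hypothetical, and the output shape (a list of dicts with `headings` and `text`) is an assumption made for illustration. It does not filter `<script>` or `<style>` content and does not cap chunk size, both of which a real pipeline would probably need before generating embeddings.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingChunker(HTMLParser):
    """Hypothetical helper: split a page into chunks at each heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []       # finished chunks: {"headings": [...], "text": "..."}
        self._path = []        # current heading hierarchy as (level, title) pairs
        self._text = []        # body text collected since the last heading
        self._heading = None   # (level, parts) while inside an <hN> element

    def _flush(self):
        # Close the current chunk, normalizing whitespace; skip empty chunks.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.chunks.append(
                {"headings": [title for _, title in self._path], "text": text}
            )
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()  # a new heading ends the previous chunk
            self._heading = (int(tag[1]), [])

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading is not None:
            level, parts = self._heading
            title = " ".join("".join(parts).split())
            # Drop headings at the same or a deeper level, then push this one.
            while self._path and self._path[-1][0] >= level:
                self._path.pop()
            self._path.append((level, title))
            self._heading = None

    def handle_data(self, data):
        if self._heading is not None:
            self._heading[1].append(data)  # text belongs to the heading title
        else:
            self._text.append(data)        # text belongs to the chunk body


def chunk_html(html):
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit the trailing chunk after the last heading
    return parser.chunks
```

Usage: `chunk_html("<h1>Guide</h1><p>Intro.</p><h2>Install</h2><p>Run it.</p>")` yields one chunk per heading, each carrying the list of headings above it, so a chunk under "Install" is tagged with both "Guide" and "Install". Keeping the hierarchy as a list makes it easy to prepend it to the chunk text before embedding, which is one of the optimizations that could be tried.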


I'm just guessing, but I would think that whatever semantics lead to the highest search rank in Google's algorithm are what you're most likely to find out in the wild.



