
I’m not sure how you go from a word to a URL range. Range implies contiguous, but how can you make that happen for a bunch of words without keeping track of a list of URLs for each word (or URL ids; the idea is the same)?


The trick is that the list of URLs for each word already is in the URLs file.

The URLs in a range are sorted. A sorted list (or list-range) forms an implicit set-like data structure, where you can do binary searches to test for existence.

Consider a words file with two words, "hello" and "world", corresponding to the ranges (0,3), (3,6). The URLs file contains URLs 1, 5, 7, 2, 5, 8.

The first range corresponds to the URLs 1, 5, 7; and the second 2, 5, 8.

If you search for hello world, it will first pick a range, say the one for "hello", containing (1, 5, 7); and then do binary searches in the second range -- the one for "world", containing (2, 5, 8) -- to find the overlap.

This seems like it would be very slow, but since you can trivially find the sizes of the ranges, it's always possible to process them in order of increasing size: 10 x log(100000) is a lot smaller than 100000 x log(10).
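The scheme above can be sketched in Java, in memory and with hypothetical names (the real system presumably works on files): one flat array of URL ids stands in for the URLs file, and the words file becomes a map from word to (start, end) offsets. Because each range is sorted, `Arrays.binarySearch` can test membership inside a range of the big array without copying anything out.

```java
import java.util.Arrays;
import java.util.Map;

public class WordIndex {
    // Membership test inside one sorted range of the flat URLs array:
    // the sorted range behaves like an implicit set, no copying needed.
    static boolean contains(int[] urls, int from, int to, int id) {
        return Arrays.binarySearch(urls, from, to, id) >= 0;
    }

    public static void main(String[] args) {
        // Illustrative layout, not the author's actual file format:
        // URLs file = one flat array; words file = word -> (start, end).
        int[] urls = {1, 5, 7, 2, 5, 8};
        Map<String, int[]> ranges = Map.of(
            "hello", new int[]{0, 3},   // urls[0..3) = 1, 5, 7
            "world", new int[]{3, 6});  // urls[3..6) = 2, 5, 8

        int[] w = ranges.get("world");
        System.out.println(contains(urls, w[0], w[1], 5));  // true
        System.out.println(contains(urls, w[0], w[1], 9));  // false
    }
}
```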


Hm, OK, I understand more now, but how do you perform the "binary search"? Just loop over the URL ids?

Funny I also selected "hello" and "world" above! Xo

My system is also written in Java btw!

Here are example results of my word search:

http://root.rupy.se/node/data/word/four

http://root.rupy.se/node/data/word/only

etc.


I'll get back to your email in a while, I've got a ton of messages I'm working through.

But yeah, in short pseudocode:

  for url in range-for-"hello":
    if binary-search (range-for-"world", url):
      yield url
I do use streams, but that is the bare essence of it.
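Spelled out as runnable Java (a sketch of the pseudocode above, not the author's actual code, though it uses streams as mentioned): walk the smaller range and keep each id that a binary search finds in the other range.

```java
import java.util.Arrays;

public class Intersect {
    // Intersect two sorted URL-id ranges: stream the ids of the smaller
    // range, keep those that binary search finds in the larger one.
    static int[] search(int[] smallRange, int[] largeRange) {
        return Arrays.stream(smallRange)
                     .filter(url -> Arrays.binarySearch(largeRange, url) >= 0)
                     .toArray();
    }

    public static void main(String[] args) {
        // The "hello" and "world" ranges from the example above.
        int[] hello = {1, 5, 7};
        int[] world = {2, 5, 8};
        System.out.println(Arrays.toString(search(hello, world)));  // [5]
    }
}
```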


So every time you insert a new URL for a word, you have to update the range for every single other word, since the URL file will be shifted?


Are the n-grams always at most n=2 bigrams?


No, I actually count the n-grams as distinct words (up to 4-grams). The main limiter for that is space, so I only extract "canned" n-grams from some tags.

I would first search for the bigram hello_world (an O(1) array lookup), and then for documents merely containing the words hello and world (usually not a good search result); the latter is the algorithm I'm describing in the parent comment.
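That ranking could be sketched like this (hypothetical names and a toy in-memory index, not the author's code): look up the exact n-gram first, then append the word-intersection hits that the n-gram lookup didn't already cover.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

public class BigramFirst {
    // Exact n-gram hits (an O(1) lookup under the key "a_b") come first,
    // then documents that merely contain both words.
    static List<Integer> search(String a, String b,
                                Map<String, List<Integer>> index) {
        List<Integer> exact = index.getOrDefault(a + "_" + b, List.of());
        List<Integer> wordHits = index.getOrDefault(a, List.of()).stream()
            .filter(index.getOrDefault(b, List.of())::contains)
            .filter(id -> !exact.contains(id))  // drop ids already ranked
            .toList();
        return Stream.concat(exact.stream(), wordHits.stream()).toList();
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = Map.of(
            "hello_world", List.of(7),          // exact phrase occurrences
            "hello", List.of(1, 5, 7),
            "world", List.of(2, 5, 7, 8));
        System.out.println(search("hello", "world", index));  // [7, 5]
    }
}
```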


Makes sense. Every time you insert a new URL for a word you have to update the ranges for every other word since the URL file will be shifted?



