Writing a full-text search engine using Bloom filters (2013)

praveen9920 · on June 10, 2020

IMDB had (not sure about current situation) an elegant but not so sophisticated solution for movie search on their website

They generated lot of json files with all the permutations of alpha numerics upto three characters( like a.json, b.json, aa.json, sta.json), when user types initial character they fetch the corresponding json which has more relevant results.

They do "actual" search only after three characters but most of the time you would find the result in first three characters. This is not a full-text search but has its own advantages.

Bloom filters seems to have similar advantages.

sudhirj · on June 10, 2020

For those who want to do something like this, the word / library you'll need to search for is "trigrams". Make trigrams of all your relevant search terms, then run a job to populate a JSON to your S3/CDN for each of the trigrams, with the most relevant documents sorted inside each one. You have a theoretical max of 26x26x26 = 17K files, but usually many possible trigrams aren't used in English.

Can keep modifying the list over time to take into account, recency, new data, etc. But this is a pretty scalable solution for the most part. Rebuilding your entire index on S3 should still cost you less than a dollar.

cetra3 · on June 10, 2020

I feel like using a straight bloom filter will get you some false positives and may not be a right fit, but it is an interesting take & might be good for a first-pass.

mdbook (https://github.com/rust-lang/mdBook) uses elasticlunr (http://elasticlunr.com/) as an inverted index and is generated statically.

The same is done for zola (https://www.getzola.org/documentation/content/search/).

luu · on June 10, 2020

It's possible to use a bloom filter variant of this for text search (for example, Bing does this, see https://danluu.com/bitfunnel-sigir.pdf for details).

If you wanted a very small bundle to use with a static site, I don't think it's obvious that a bloom filter variant is a bad approach.

I mean, yeah, this thing that the author said "made for a fun hour of Saturday night hacking" is probably not the optimal solution, but that would've been true whether or not the author chose to build something based on bloom filters or an inverted index.

johanvts · on June 10, 2020

I think a few false positives should be fine for the use case. Those files can then just be downloaded and searched in full, still cuts out almost all the traffic as wanted.

inertiatic · on June 10, 2020

Lunr and elasticlunr are awesome projects. I was part of a search team and we were banging our heads against the wall trying to figure a way to serve a new feature request (basically some advanced autocomplete functionality) in a way that fit in with what we were currently maintaining.

It took me literally half an hour to implement this using elasticlunr on the browser. We will went with the backend solution as it felt more extendable, but this got me thinking there's some abuse of ES for things you could easily do that way.

acidbaseextract · on June 10, 2020

What's your feeling on the status of elasticlunr as a project? I've been eying the the functionality for a project, but it hasn't received updates in quite a few years.

lukevp · on June 10, 2020

I have yet to see development of a js search engine continue for the long term. Used FlexSearch in a recent project but it too appears to have been abandoned. Mini search seems nice and it’s a bit newer. It has a focus on low memory usage so it may also work well for this scenario of a static index.

inertiatic · on June 13, 2020

Sorry, I don't have an opinion, I haven't kept up with them, and you put more research into this that I have.

dspillett · on June 10, 2020

> I feel like using a straight bloom filter will get you some false positives

That is exactly the property of bloom filter searches (unless I'm misunderstanding) - you should never get a false negative but can expect false positives. The design of the filter will define how many false positives are likely for a given lookup in a given dataset.

> might be good for a first-pass.

Exactly. If the number of false positives is low then you can perform an exhaustive search on the few results to exclude the wrong ones. You won't miss anything as there are no false negatives.

Of course defining the bloom filter such that the number of false positives is low enough to "win" by a significant margin might not be easy, particularly for growing rather than static data, just like choosing an optimal hash function often isn't.

stavros · on June 10, 2020

That's very interesting, thank you. I'm hearing about Zola a lot and it all sounds good, do you have experience with it?

The only thing that worries me about it is that e.g. with Lektor, I can just write a small Python template filter if I need something, whereas Zola isn't extensible at all (as far as I know). Has that been a problem in practice?

cetra3 · on June 11, 2020

Yes, my personal blog is written with zola: https://cetra3.github.io/.

Source code to the blog is on github: https://github.com/cetra3/cetra3.github.io

I haven't had a need to customise zola itself, as the template language it uses (tera) is quite usable with macros etc...

stavros · on June 11, 2020

That's a great example, thank you! I love how fast Zola is, Lektor takes forever.

memexy · on June 10, 2020

Thanks for the links and descriptions.

6510 · on June 10, 2020

I took letter combinations aa ab ac ... zx zy zz z0 z1 z2 ... Created a file for each letter combination aa.dat ... zz.dat ...

(26 letters + 10 digits) * (26 letters + 10 digits + space/wildcard) = 1369

Peanuts for Apache.

Each of my 10 000ish articles got a single bit in those files where the offset is the article number.

A search query like foo bar would involve the files " f", "fo", "oo", "o ", " b", "ba", "ar", "r "

These are only 8 files! for a whooping 8 kb-ish of data.

We then perform the AND operation. 0000000 01000000 00000000

Remaining articles are pulled [in reverse order] and checked for the presence of the exact phrase or the presence of keywords.

The php then flushes

  '<br><a href="'.$articleNum.'.html">'.$title.'</a>'

to the page.

(I also did a version with a file with url's in it. Then the first url is article 1 or bit 1. Fun for stuff linked to from the blog)

For longer articles one could make 3 digit files. 37x37x37 is little over 50 000 files. Each query only involves a few of those.

Sadly one cant push bits to files but for a new article bits are to be appended to each data file eventually. I just add the new articles as false positives until I have a bunch of them.

The advantage of the disadvantage in that each page has to be checked for the keywords is that if it is gone it wont show up in the results. Even a highly accurate search solution shall have false positives.

http://blog.go-here.nl/search-blog.php?q=aliens

view source for laughs

compressedgas · on June 10, 2020

You have 2008 script tags on that page. I know why but you could have at least had them remove themselves from the DOM after they were done adding their content.

> Sadly one cant push bits to files but for a new article bits are to be appended to each data file eventually.

You can it is just that each of the files has to be padded out to a multiple of full bytes. Just read the last byte and OR in the new bit at its offset.

The method you are using is called q-gram filtration.

The main optimization you could do is to use the specificities of the grams to prioritize the ANDing so that the most specific grams are chosen first. This can allow early stopping for misses.

You can also use multiple threads to read and AND in parallel after a AND tree order has been decided from the specificities.

I would also remark that it is possible to implement approximate matching or error tolerant search with q-grams. That is to rank results according to edit distance or Levenshtein distance from the query.

It is also possible to do regular expression search over them as Russ Cox explains how he implement the Google Code Search: Regular Expression Matching with a Trigram Index or How Google Code Search Worked https://swtch.com/~rsc/regexp/regexp4.html This is not specific to 3-grams. It can be made to work with any gram size or even indexes consisting of variable size grams.

6510 · on June 11, 2020

> You have 2008 script tags on that page. I know why but you could have at least had them remove themselves from the DOM after they were done adding their content.

Unless I've missed something I don't think it works like that. It is just a html document, it comes as is. If you modify the dom the source stays the same?

Also, the html parser does a wonderful job ignoring things like comments and display:none nodes / script nodes. It just doesn't do anything with them. After the innerHTML is modified there is nothing further to do with each script. It is just ignored.

I've tried things like adding a 5 mb comment to the end of a page to see the behavior.

6510 · on June 11, 2020

thanks!

I had lots of ideas but chose to keep it simple enough to un-break brute force. Ideas: where letters are merged into one. Ones that appear to frequently (a,o,e are one letter) and/or rare ones (q is just a k). Ones that sit side by side on the keyboard (dootbell). Discard the order of the characters (so that "becuase" might just work) Typos that dont look like typos (1 is equal to l) and find ways to scale the solution to fit the texts or chunks thereof.

6510 · on June 10, 2020

I forget to mention: All the articles are static html and the static data files can be parsed by js too.

bawolff · on June 10, 2020

Reminds me a lot of some of the research into distributed full text search in p2p networks (e.g. http://rboutaba.cs.uwaterloo.ca/Papers/Journals/2009/Reaz09.... )

Probably some of the lessons from that domain could be applied here

memexy · on June 10, 2020

Abstract mentions bloom filters directly so your analogy makes sense

> Efficient discovery of information, based on partial knowledge, is a challenging problem faced by many large scale distributed systems. This paper presents Plexus, a peer-to-peer search protocol that provides an efficient mechanism for advertising a bitsequence (pattern), and discovering it using any subset of its 1-bits. A pattern (e.g., Bloom filter) summarizes the properties (e.g., keywords, service description) associated with a shared object (e.g., document, service).

But I think that paper is more about discovery than search. It would be good for a tagging system but I didn't read enough to know if it would be useful for a more general search index.

bawolff · on June 10, 2020

Search in essence is just trying to find the document that has all the right keywords.

Well also trying to rank them but that's kind of harder in this context.

nurettin · on June 10, 2020

How is this (2013)? Content has comments about 2018 node ecosystem.

stavros · on June 10, 2020

It came to me in a dream long ago.

eddieoz · on June 10, 2020

I'm building a decentralised blog, blockchain + ipfs based.

As I'm using markdown as the example above, probably I can apply this method to update the search database index when the post is published, and then enable the client-side search in the platform.

I'll try it using this approach.

https://github.com/eddieoz/549-personal-dx-blog dex blog: eddieoz.com (or eddieoz.crypto using unstoppable extension)

stavros · on June 10, 2020

Elasticlunr sounds like a more mature solution, maybe you'd like to try that.

byteface · on June 10, 2020

Thanks for sharing, I'd heard of bloom filters but not really looked into them. I also experimented years ago in client side search on gpu. seems very doable. I didn't use any datastructures but considered it. started a boyer moore like shader here... https://github.com/byteface/GTP

works really well. even 9 years later. doubt anything compares.

kragen · on June 10, 2020

I like the idea of just downloading a summary file of the whole website and searching it on the client side. It seems like it can only scale up to some limited extent — it should be feasible for 1000 or 10000 pages, but maybe once you have 100 000 pages it will start to be a bit top-heavy.

In 2006 I built the regular kind of posting-list inverted index for my web pages at http://canonical.org/~kragen/inverted-index.html. I just used regular HTML <table>s for the representation, trusting to Content-Encoding: gzip to squeeze out the resulting redundancy, and the index is split into many different files so that the client can snarf just the part of the index that interests it. Building the index (over my smallish set of pages) was surprisingly easy, although I don't think I ever implemented the actual client-side JS search engine.

The same year, I also wrote a search engine using a kind of structure similar to a Bloom filter per post, similar to the one Stavros describes, but transposed and encoded for better locality of reference; source code is in https://www.mail-archive.com/kragen-hacks@canonical.org/msg0... and the algorithm is outlined in https://www.mail-archive.com/kragen-tol@canonical.org/msg001....

Three weeks ago, I wrote https://gitlab.com/kragen/derctuo/-/blob/master/recursive-bl... describing a data structure, not yet tested, for using a bunch of Bloom filters for indexing large data sets. This structure seems like it might be capable of competing with traditional database sorted-column indexes for ad-hoc query evaluation, perhaps even exceeding their performance on some queries, though at a larger space cost. I hadn't thought about combining that with my 2006 client-side inverted-index search approach, but it's designed to have reasonable locality of reference, so it might work.

Related comment thread: https://news.ycombinator.com/item?id=23473303

— ⁂ —

Bloom filters are a reasonable thing to try when looking for the "frequency-domain representation of thinking", but I've long thought Pentti Kanerva's Fully Distributed Representation was interesting, in which each atom is encoded by a large random bitvector, attribute-value pairs are encoded by XORs of the atomic bitvectors, and aggregates of multiple attributes are encoded as the bitwise majority rule (with ties broken at random) of all their attribute-value pairs.

You can recover the value of an attribute from an aggregate by XORing the aggregate with the attribute's atom and examining its Hamming distance from all the candidate atoms, a process which Kanerva originally suggested you could train a neural network to be able to do in better than linear time. The atoms actually stored for that attribute will have a Hamming distance many, many standard deviations from 50%, while the false matches will all be within, say, six standard deviations.

memexy · on June 10, 2020

Sounds interesting. Any links and references for "Fully Distributed Representation"? This is what I found when I Googled the quoted phrase: http://www.cap-lore.com/RWC97-kanerva.pdf.

Skimming the paper this reminds me of word2vec and similar techniques for embedding words into vector spaces but instead of using XOR the embedding is accomplished with another neural network. Here's a link to a good description of the embedding process: https://towardsdatascience.com/neural-network-embeddings-exp....

kragen · on June 10, 2020

Yup, that's the paper — I probably read it at the time from Norm's site, QEPD. More recently Pentti's been working on sparse distributed representations, which as I understand it greatly simplify the matching problem.

I think the latent-spaces approach of ANNs is pretty interesting but I don't feel like I understand it well enough. It's a bit lower dimensional than Kanerva's proposal, though.

breuderink · on June 11, 2020

Here is a more recent paper [1] from Pentti Kanerva that gives a great introduction: http://www.rctn.org/vs265/kanerva09-hyperdimensional.pdf. More papers can be found with the query "vector symbolic architecture". I find it utterly fascinating.

[1] Kanerva, Pentti. “Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors.” Cognitive Computation 1 (2009): 139-159.

bergstromm466 · on June 10, 2020

> I like the idea of just downloading a summary file of the whole website and searching it on the client side.

I can't wait for this to exist. The Holographic chain by ceptr.org (Arthur Brock + Eric Harris-Braun) is the closest to an agent centric web [also called Protocol Cooperativism] that I've come across so far, it's truly distributed [1]. Have you come across them?

[1] https://medium.com/holochain/holochain-reinventing-applicati...

6510 · on June 11, 2020

Nice project. My gut says an agent based system is the right approach but this implementation doesn't go far enough. The agent is reduced to a world view while it should imho involve a willingness to do favors for other agents. It would be far less complicated (and dare I say less toxic) without a market. The list of systems all try to establish reliability/trust in their own way but we know who to trust irl. The service should just slow down if to few people in your circles are willing to contribute resources.

bergstromm466 · on June 11, 2020

> it should imho involve a willingness to do favors for other agents.

Absolutely. Because it is agent centric it can do this completely transparently. Since the individual agent is only the starting point, what you describe will definitely be developed on top of it, as I would agree that this is a common use case.

The real groundbreaking breakthrough with this project is that they are putting the individual user back in the middle, enabling peer-to-peer networks organized into a beautiful symphony sometimes referred to as Protocol Cooperativism. It is in contrast to Platform Capitalism - what we have today, and which takes humongous rents through corporate middlemen and third parties.

Building out from the individual agent allows, for example, cooperative Open Value Networks to be built (Sensorica is working on this), where users have complete agency/autonomy over the protocols and agreements they use, as well as users being able to temporarily hand off decision-making to others if they trust them, to act on their behalf, reflecting how they work together in the real world.

Basically this project is allowing everybody to carry a copy of the rules. This allows the protocols to rapidly evolve and change to meet the shifting needs of individuals and their communities.

There is a project building on the work of Ceptr called Holo-REA (developed out of http://valueflo.ws), which is doing exactly what you describe above.

Here is a story to quickly explain their work:

http://mikorizal.org/ValueFlows-Story.pdf

If you read my comment, I'd be grateful to know if it helped clarify some things. Does it help you see how what you described above is possible?

nwmcsween · on June 10, 2020

So why not just use minhashing like google does? There is hyperminhash and a newer algo on GitHub (I forget the name) with even better properties.

kragen · on June 10, 2020

Because I didn't know about MinHash, even though it dates from 1997! Thank you!

umvi · on June 10, 2020

Doesn't seem like it would be that hard... Couldn't the server just run grep if a user searches? Grep could scan every static file in milliseconds

stevekemp · on June 10, 2020

(If you go down this route you might prefer ag/the silver surfer/similar, as being faster than grep.)

I've actually reported a couple of security bugs in servers in the past, which were caused by people running grep for searches.

Imagine you have a tree of web-content at /var/www, when a user searches for "foo" you run "grep -r -h foo /var/www", then you parse out the resulting filenames.

Now imagine what happens if your user searches for "foo /tmp; rm -rf /". Whenever you pass user-input to a command-line application you need to be damn sure you know what you're doing - if the shell is involved in parsing your command you'll have problems.

kragen · on June 10, 2020

This is a good point. Even if you don't have a shell injection vulnerability, you may end up with a path traversal vulnerability. In this particular case I think you're probably pretty safe as long as you're careful to only pass a single argument to the grep process — but it depends on how exactly grep parses its arguments. If there's a way to sneak ../ in there, you might get hurt.

Bleh. I can see why people install WAFs.

stavros · on June 10, 2020

I've lately become allergic to things that I can only partially reason about. Sure, I'm probably going to be fine, but I can't guarantee I'll be fine, and the blackhats are always more knowledgeable than me about obscure bits of shell/Unix behavior, so I just shy away and use a less free-form approach.

djur · on June 10, 2020

If you're running grep it isn't a static site anymore.

Nyubis · on June 10, 2020

The big difference here is that the bloom filter implementation runs entirely client-side. It saves a roudtrip per search.

bule0bird · on June 10, 2020

The indexing task can be done during static site generation and results shipped as resources, but the query part (search string parsing, ranking, snippet generation, etc) needs some code running locally (browser, js). We've used Minimal Perfect Hash Function with bloom filter to reduce the mem of a service from ~27G to ~8G, as the data was static. Probably something like that can be used here too, to make indices small.

ga6840 · on June 10, 2020

Microsoft had a SIGIR best paper on using Bloom filter for searching.