IMDb had (not sure about the current situation) an elegant but not especially sophisticated solution for movie search on their website.
They generated a lot of JSON files, one for every alphanumeric permutation of up to three characters (like a.json, b.json, aa.json, sta.json). When the user types the initial character, they fetch the corresponding JSON, which has the most relevant results.
They do an "actual" search only after three characters, but most of the time you would find the result within the first three characters. This is not a full-text search, but it has its own advantages.
For those who want to do something like this, the word / library you'll need to search for is "trigrams". Make trigrams of all your relevant search terms, then run a job that populates a JSON file on your S3/CDN for each trigram, with the most relevant documents sorted inside each one. You have a theoretical max of 26x26x26 = 17,576 files, but many possible trigrams simply don't occur in English.
You can keep modifying the lists over time to take recency, new data, etc. into account. This is a pretty scalable solution for the most part, and rebuilding your entire index on S3 should still cost you less than a dollar.
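To make that concrete, here's a minimal sketch of such a build job (Python, with a made-up corpus and crude term-frequency ranking; not anyone's actual pipeline):

    import json
    import re
    from collections import defaultdict
    from pathlib import Path

    # Hypothetical corpus: (url, searchable text) pairs.
    docs = [
        ("/star-wars", "Star Wars A New Hope"),
        ("/star-trek", "Star Trek The Motion Picture"),
        ("/stargate", "Stargate"),
    ]

    def trigrams(text):
        """All 3-character windows of each lowercased alphanumeric word."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            for i in range(len(word) - 2):
                yield word[i:i + 3]

    # Count trigram occurrences per document as a crude relevance stand-in.
    buckets = defaultdict(lambda: defaultdict(int))
    for url, text in docs:
        for gram in trigrams(text):
            buckets[gram][url] += 1

    # One small JSON file per trigram, most "relevant" documents first;
    # sync the output directory to S3/your CDN afterwards.
    out = Path("search-index")
    out.mkdir(exist_ok=True)
    for gram, counts in buckets.items():
        ranked = sorted(counts, key=counts.get, reverse=True)
        (out / f"{gram}.json").write_text(json.dumps(ranked))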
I feel like using a straight bloom filter will get you some false positives and may not be the right fit, but it is an interesting take & might be good for a first-pass.
It's possible to use a bloom filter variant of this for text search (for example, Bing does this, see https://danluu.com/bitfunnel-sigir.pdf for details).
If you wanted a very small bundle to use with a static site, I don't think it's obvious that a bloom filter variant is a bad approach.
I mean, yeah, this thing that the author said "made for a fun hour of Saturday night hacking" is probably not the optimal solution, but that would've been true whether the author chose to build something based on bloom filters or on an inverted index.
I think a few false positives should be fine for the use case. Those files can then just be downloaded and searched in full, which still cuts out almost all the traffic, as intended.
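As a sketch of that filter-then-verify flow (the per-page filter size, hashing scheme, and page list below are assumptions, not the article's implementation):

    import hashlib

    M, K = 1024, 3  # bits per page filter, hash functions (tuning assumptions)

    def positions(word):
        """K bit positions for a word, derived from one SHA-256 digest."""
        digest = hashlib.sha256(word.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

    def build_filter(text):
        bits = 0
        for word in text.lower().split():
            for p in positions(word):
                bits |= 1 << p
        return bits

    def might_contain(bits, word):
        return all(bits >> p & 1 for p in positions(word.lower()))

    # Hypothetical site: page -> text. In reality only the filters ship up
    # front; the pages stay on the server until a candidate needs verifying.
    pages = {"/a": "bloom filters are neat", "/b": "static sites are fast"}
    filters = {url: build_filter(text) for url, text in pages.items()}

    query = "bloom"
    candidates = [url for url, bits in filters.items() if might_contain(bits, query)]
    # Second pass: fetch only the candidate pages and search them in full,
    # discarding any false positives.
    confirmed = [url for url in candidates if query in pages[url].lower()]
    print(confirmed)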
Lunr and elasticlunr are awesome projects. I was part of a search team and we were banging our heads against the wall trying to figure out a way to serve a new feature request (basically some advanced autocomplete functionality) in a way that fit in with what we were currently maintaining.
It took me literally half an hour to implement this using elasticlunr in the browser. We went with the backend solution in the end, as it felt more extensible, but this got me thinking there's some abuse of ES for things you could easily do that way.
What's your feeling on the status of elasticlunr as a project? I've been eyeing the functionality for a project, but it hasn't received updates in quite a few years.
I have yet to see development of a JS search engine continue for the long term. I used FlexSearch in a recent project, but it too appears to have been abandoned. MiniSearch seems nice and it's a bit newer; it has a focus on low memory usage, so it may also work well for this scenario of a static index.
> I feel like using a straight bloom filter will get you some false positives
That is exactly the property of bloom filter searches (unless I'm misunderstanding) - you should never get a false negative but can expect false positives. The design of the filter will define how many false positives are likely for a given lookup in a given dataset.
> might be good for a first-pass.
Exactly. If the number of false positives is low then you can perform an exhaustive search on the few results to exclude the wrong ones. You won't miss anything as there are no false negatives.
Of course, defining the bloom filter such that the number of false positives is low enough to "win" by a significant margin might not be easy, particularly for growing rather than static data, just as choosing an optimal hash function often isn't.
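The textbook sizing approximations are easy enough to play with for a first estimate, e.g.:

    import math

    def bloom_parameters(n, p):
        """Standard approximations: bits m and hash count k for n items
        at target false-positive rate p."""
        m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
        k = max(1, round(m / n * math.log(2)))
        return m, k

    # e.g. 5,000 distinct words per page at a 1% false-positive rate:
    print(bloom_parameters(5000, 0.01))  # roughly 48,000 bits (~6 KB) and 7 hashes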
That's very interesting, thank you. I'm hearing about Zola a lot and it all sounds good; do you have experience with it?
The only thing that worries me about it is that e.g. with Lektor, I can just write a small Python template filter if I need something, whereas Zola isn't extensible at all (as far as I know). Has that been a problem in practice?
(I also did a version with a file with URLs in it. Then the first URL is article 1, or bit 1. Fun for stuff linked to from the blog.)
For longer articles one could make three-character files. 37x37x37 is a little over 50,000 files. Each query only involves a few of those.
Sadly one can't push bits to files, but for a new article bits have to be appended to each data file eventually. I just add the new articles as false positives until I have a bunch of them.
The advantage of the disadvantage that each page has to be checked for the keywords is that if a page is gone it won't show up in the results. Even a highly accurate search solution will have false positives.
You have 2008 script tags on that page. I know why but you could have at least had them remove themselves from the DOM after they were done adding their content.
> Sadly one can't push bits to files, but for a new article bits have to be appended to each data file eventually.
You can, it's just that each of the files has to be padded out to a multiple of full bytes. Just read the last byte and OR in the new bit at its offset.
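Something like this, say (a rough sketch; it assumes you track elsewhere which bit index belongs to the new article):

    from pathlib import Path

    def append_bit(path, bit_index, value):
        """Set bit `bit_index` in a byte-padded data file, growing it as needed.
        Bits are packed most-significant-first within each byte."""
        byte_index, offset = divmod(bit_index, 8)
        Path(path).touch()
        with open(path, "r+b") as f:
            f.seek(0, 2)                       # jump to end of file
            while f.tell() <= byte_index:      # pad out to whole bytes
                f.write(b"\x00")
            if value:
                f.seek(byte_index)
                current = f.read(1)[0]
                f.seek(byte_index)
                f.write(bytes([current | (0x80 >> offset)]))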
The method you are using is called q-gram filtration.
The main optimization you could do is to use the specificities of the grams to prioritize the ANDing so that the most specific grams are chosen first. This can allow early stopping for misses.
You can also use multiple threads to read and AND in parallel, after an AND tree order has been decided from the specificities.
I would also remark that it is possible to implement approximate matching or error tolerant search with q-grams. That is to rank results according to edit distance or Levenshtein distance from the query.
It is also possible to do regular expression search over them, as Russ Cox explains in his write-up of how he implemented Google Code Search: "Regular Expression Matching with a Trigram Index, or How Google Code Search Worked" (https://swtch.com/~rsc/regexp/regexp4.html). This is not specific to 3-grams; it can be made to work with any gram size, or even with indexes consisting of variable-size grams.
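A toy version of the filtration step, with plain Python sets standing in for the per-gram bitmaps and rarity-ordered intersection providing the early stop:

    def qgrams(text, q=3):
        text = text.lower()
        return {text[i:i + q] for i in range(len(text) - q + 1)}

    # Hypothetical postings: gram -> set of document ids containing it.
    docs = {1: "doorbell chime", 2: "bell tower", 3: "door handle"}
    index = {}
    for doc_id, text in docs.items():
        for gram in qgrams(text):
            index.setdefault(gram, set()).add(doc_id)

    def candidates(query, q=3):
        grams = qgrams(query, q)
        if not grams:
            return set(docs)                  # query too short to filter on
        # Most specific (smallest posting list) grams first,
        # so an empty intersection stops the ANDing early.
        postings = sorted((index.get(g, set()) for g in grams), key=len)
        result = set(postings[0])
        for posting in postings[1:]:
            result &= posting
            if not result:
                break
        return result

    # Candidates still need verifying against the actual documents.
    print(candidates("doorbell"))  # -> {1}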
> You have 2008 script tags on that page. I know why but you could have at least had them remove themselves from the DOM after they were done adding their content.
Unless I've missed something, I don't think it works like that. It's just an HTML document; it comes as is. If you modify the DOM, the source stays the same?
Also, the HTML parser does a wonderful job of ignoring things like comments and display:none nodes / script nodes. It just doesn't do anything with them. After the innerHTML is modified, there is nothing further to do with each script; it is just ignored.
I've tried things like adding a 5 MB comment to the end of a page to see the behavior.
I had lots of ideas but chose to keep it simple enough that the brute-force check still works. Ideas: merge letters into one, either ones that appear too frequently (a, o, e become one letter) and/or rare ones (q is just a k), or ones that sit side by side on the keyboard (dootbell); discard the order of the characters (so that "becuase" might just work); handle typos that don't look like typos (1 is equal to l); and find ways to scale the solution to fit the texts or chunks thereof.
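For what it's worth, the letter-merging part could look something like this as a normalization applied to both index terms and queries; the merge table is just the examples above, not a tuned alphabet:

    # Collapse characters before indexing/querying; applied identically to both
    # sides, so "dootbell" and "doorbell", or "becuase" and "because", collide
    # on purpose. The merge table is illustrative only.
    MERGES = str.maketrans({
        "o": "a", "e": "a",      # frequent vowels become one letter
        "q": "k",                # rare letters folded into lookalikes
        "1": "l",                # typos that don't look like typos
        "t": "r",                # keyboard neighbours (dootbell ~ doorbell)
    })

    def normalize(word, ignore_order=False):
        word = word.lower().translate(MERGES)
        return "".join(sorted(word)) if ignore_order else word

    print(normalize("dootbell") == normalize("doorbell"))             # True
    print(normalize("becuase", True) == normalize("because", True))   # True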
The abstract mentions Bloom filters directly, so your analogy makes sense:
> Efficient discovery of information, based on partial knowledge, is a challenging problem faced by many large scale distributed systems. This paper presents Plexus, a peer-to-peer search protocol that provides an efficient mechanism for advertising a bit sequence (pattern), and discovering it using any subset of its 1-bits. A pattern (e.g., Bloom filter) summarizes the properties (e.g., keywords, service description) associated with a shared object (e.g., document, service).
But I think that paper is more about discovery than search. It would be good for a tagging system but I didn't read enough to know if it would be useful for a more general search index.
I'm building a decentralised blog, blockchain- and IPFS-based.
As I'm using Markdown, like the example above, I can probably apply this method to update the search index when a post is published, and then enable client-side search in the platform.
Thanks for sharing; I'd heard of bloom filters but hadn't really looked into them. I also experimented years ago with client-side search on the GPU. It seems very doable. I didn't use any data structures, but I considered it. I started a Boyer-Moore-like shader here: https://github.com/byteface/GTP
It works really well, even 9 years later. I doubt anything compares.
I like the idea of just downloading a summary file of the whole website and searching it on the client side. It seems like it can only scale up to some limited extent — it should be feasible for 1000 or 10000 pages, but maybe once you have 100 000 pages it will start to be a bit top-heavy.
In 2006 I built the regular kind of posting-list inverted index for my web pages at http://canonical.org/~kragen/inverted-index.html. I just used regular HTML <table>s for the representation, trusting to Content-Encoding: gzip to squeeze out the resulting redundancy, and the index is split into many different files so that the client can snarf just the part of the index that interests it. Building the index (over my smallish set of pages) was surprisingly easy, although I don't think I ever implemented the actual client-side JS search engine.
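For anyone wanting to try the same structure today, here's a compressed sketch; JSON shards stand in for the HTML <table> representation, and sharding by first letter is just one way to split the files, not exactly what that 2006 index did:

    import json
    import re
    from collections import defaultdict
    from pathlib import Path

    # Hypothetical pages: url -> text.
    pages = {"/a.html": "bloom filters and inverted indexes",
             "/b.html": "static site search with an inverted index"}

    # Word -> {url: [positions]} postings.
    postings = defaultdict(dict)
    for url, text in pages.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            postings[word].setdefault(url, []).append(pos)

    # Split the index into one file per leading letter so the client only
    # fetches the shard(s) covering its query terms.
    shards = defaultdict(dict)
    for word, plist in postings.items():
        shards[word[0]][word] = plist

    out = Path("idx")
    out.mkdir(exist_ok=True)
    for letter, words in shards.items():
        (out / f"{letter}.json").write_text(json.dumps(words))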
Three weeks ago, I wrote https://gitlab.com/kragen/derctuo/-/blob/master/recursive-bl... describing a data structure, not yet tested, for using a bunch of Bloom filters for indexing large data sets. This structure seems like it might be capable of competing with traditional database sorted-column indexes for ad-hoc query evaluation, perhaps even exceeding their performance on some queries, though at a larger space cost. I hadn't thought about combining that with my 2006 client-side inverted-index search approach, but it's designed to have reasonable locality of reference, so it might work.
Bloom filters are a reasonable thing to try when looking for the "frequency-domain representation of thinking", but I've long thought Pentti Kanerva's Fully Distributed Representation was interesting, in which each atom is encoded by a large random bitvector, attribute-value pairs are encoded by XORs of the atomic bitvectors, and aggregates of multiple attributes are encoded as the bitwise majority rule (with ties broken at random) of all their attribute-value pairs.
You can recover the value of an attribute from an aggregate by XORing the aggregate with the attribute's atom and examining its Hamming distance from all the candidate atoms, a process which Kanerva originally suggested you could train a neural network to be able to do in better than linear time. The atoms actually stored for that attribute will have a Hamming distance many, many standard deviations from 50%, while the false matches will all be within, say, six standard deviations.
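A small numeric sketch of that scheme, using brute-force nearest-atom lookup instead of the trained network, and an aggregate of three pairs so no tie-breaking is needed:

    import random

    D = 10_000  # dimensionality; high enough that random vectors are ~orthogonal
    random.seed(0)

    def atom():
        return [random.getrandbits(1) for _ in range(D)]

    def xor(a, b):
        return [x ^ y for x, y in zip(a, b)]

    def majority(vectors):
        """Bitwise majority rule over an odd number of vectors."""
        return [1 if sum(bits) * 2 > len(vectors) else 0 for bits in zip(*vectors)]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    atoms = {name: atom() for name in
             ["colour", "red", "shape", "round", "taste", "sweet"]}

    # Encode {colour: red, shape: round, taste: sweet} as one D-bit vector.
    record = majority([xor(atoms["colour"], atoms["red"]),
                       xor(atoms["shape"], atoms["round"]),
                       xor(atoms["taste"], atoms["sweet"])])

    # To recover the colour, unbind with the attribute and find the nearest atom:
    # the true value sits far closer in Hamming distance than any false match.
    probe = xor(record, atoms["colour"])
    print(min(atoms, key=lambda name: hamming(probe, atoms[name])))  # -> red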
Sounds interesting. Any links and references for "Fully Distributed Representation"? This is what I found when I Googled the quoted phrase: http://www.cap-lore.com/RWC97-kanerva.pdf.
Skimming the paper, this reminds me of word2vec and similar techniques for embedding words into vector spaces, except that instead of using XOR, the embedding is accomplished with another neural network. Here's a link to a good description of the embedding process: https://towardsdatascience.com/neural-network-embeddings-exp....
Yup, that's the paper — I probably read it at the time from Norm's site, QEPD. More recently Pentti's been working on sparse distributed representations, which as I understand it greatly simplify the matching problem.
I think the latent-spaces approach of ANNs is pretty interesting but I don't feel like I understand it well enough. It's a bit lower dimensional than Kanerva's proposal, though.
Here is a more recent paper [1] from Pentti Kanerva that gives a great introduction: http://www.rctn.org/vs265/kanerva09-hyperdimensional.pdf. More papers can be found with the query "vector symbolic architecture". I find it utterly fascinating.
[1] Kanerva, Pentti. “Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors.” Cognitive Computation 1 (2009): 139-159.
> I like the idea of just downloading a summary file of the whole website and searching it on the client side.
I can't wait for this to exist. Holochain, by ceptr.org (Arthur Brock + Eric Harris-Braun), is the closest to an agent-centric web [also called Protocol Cooperativism] that I've come across so far; it's truly distributed [1]. Have you come across them?
Nice project. My gut says an agent-based system is the right approach, but this implementation doesn't go far enough. The agent is reduced to a world view, while it should imho involve a willingness to do favors for other agents. It would be far less complicated (and dare I say less toxic) without a market. The listed systems all try to establish reliability/trust in their own way, but we know whom to trust irl. The service should just slow down if too few people in your circles are willing to contribute resources.
> it should imho involve a willingness to do favors for other agents.
Absolutely. Because it is agent centric it can do this completely transparently. Since the individual agent is only the starting point, what you describe will definitely be developed on top of it, as I would agree that this is a common use case.
The real breakthrough with this project is that they are putting the individual user back in the middle, enabling peer-to-peer networks organized into a beautiful symphony sometimes referred to as Protocol Cooperativism. It is in contrast to Platform Capitalism, which is what we have today and which takes humongous rents through corporate middlemen and third parties.
Building out from the individual agent allows, for example, cooperative Open Value Networks to be built (Sensorica is working on this), where users have complete agency/autonomy over the protocols and agreements they use, and can temporarily hand off decision-making to others they trust to act on their behalf, reflecting how they work together in the real world.
Basically this project is allowing everybody to carry a copy of the rules. This allows the protocols to rapidly evolve and change to meet the shifting needs of individuals and their communities.
There is a project building on the work of Ceptr called Holo-REA (developed out of http://valueflo.ws), which is doing exactly what you describe above.
(If you go down this route you might prefer ag (The Silver Searcher) or similar, as they're faster than grep.)
I've actually reported a couple of security bugs in servers in the past, which were caused by people running grep for searches.
Imagine you have a tree of web content at /var/www; when a user searches for "foo" you run "grep -r -l foo /var/www", then you parse out the resulting filenames.
Now imagine what happens if your user searches for "foo /tmp; rm -rf /". Whenever you pass user-input to a command-line application you need to be damn sure you know what you're doing - if the shell is involved in parsing your command you'll have problems.
This is a good point. Even if you don't have a shell injection vulnerability, you may end up with a path traversal vulnerability. In this particular case I think you're probably pretty safe as long as you're careful to only pass a single argument to the grep process — but it depends on how exactly grep parses its arguments. If there's a way to sneak ../ in there, you might get hurt.
I've lately become allergic to things that I can only partially reason about. Sure, I'm probably going to be fine, but I can't guarantee I'll be fine, and the blackhats are always more knowledgeable than me about obscure bits of shell/Unix behavior, so I just shy away and use a less free-form approach.
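For reference, the boring way to stay on the safe side of both problems (Python here, paths illustrative): pass the query as a single argv element with no shell involved, and keep the search root fixed.

    import subprocess

    SEARCH_ROOT = "/var/www"  # fixed; never derived from user input

    def search(query: str) -> list[str]:
        # An argv list with shell=False means the query is one literal argument,
        # so "foo /tmp; rm -rf /" is just a weird search string.
        # "--" stops grep from treating a leading "-" in the query as an option.
        result = subprocess.run(
            ["grep", "-r", "-l", "--", query, SEARCH_ROOT],
            capture_output=True, text=True,
        )
        return result.stdout.splitlines()

    print(search("foo /tmp; rm -rf /"))  # harmless: no such string, no matches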
The indexing task can be done during static site generation and the results shipped as resources, but the query part (search-string parsing, ranking, snippet generation, etc.) needs some code running locally (browser, JS).
We've used a Minimal Perfect Hash Function with a bloom filter to reduce the memory of a service from ~27 GB to ~8 GB, as the data was static. Something like that can probably be used here too, to make the indices small.
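As a rough illustration of why that works for static data, here's a toy "hash and displace" minimal perfect hash with a one-byte fingerprint per slot standing in for the bloom-filter membership check; a real MPHF library is far more compact and robust, so treat this only as a sketch of the idea.

    import hashlib

    def h(seed, key, n):
        data = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        return int.from_bytes(data[:8], "big") % n

    def build_mph(keys):
        """Toy 'hash and displace': find a per-bucket seed that sends every key
        to its own slot, so the keys themselves never need to be stored."""
        n = len(keys)
        buckets = [[] for _ in range(n)]
        for key in keys:
            buckets[h(0, key, n)].append(key)
        seeds, taken = [0] * n, [False] * n
        for b in sorted(range(n), key=lambda b: -len(buckets[b])):
            if not buckets[b]:
                continue
            seed = 1
            while True:
                slots = [h(seed, key, n) for key in buckets[b]]
                if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                    break
                seed += 1
            seeds[b] = seed
            for s in slots:
                taken[s] = True
        return seeds

    def slot(seeds, key):
        n = len(seeds)
        return h(seeds[h(0, key, n)], key, n)

    def fingerprint(key):
        return hashlib.sha256(f"fp:{key}".encode()).digest()[0]  # one byte per key

    keys = ["alpha", "beta", "gamma", "delta", "epsilon"]
    values = {k: len(k) for k in keys}                 # any static payload

    seeds = build_mph(keys)
    fp_table, val_table = [0] * len(keys), [None] * len(keys)
    for k in keys:
        s = slot(seeds, k)
        fp_table[s], val_table[s] = fingerprint(k), values[k]

    def get(key):
        s = slot(seeds, key)
        if fp_table[s] != fingerprint(key):            # rejects ~255/256 unknown keys
            return None
        return val_table[s]

    print(get("gamma"), get("zeta"))                   # -> 3 None (usually)

The point is the same as with the real thing: for a static key set you keep only seeds, fingerprints, and values in memory, never the keys themselves.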
> They generated a lot of JSON files, one for every alphanumeric permutation of up to three characters (like a.json, b.json, aa.json, sta.json). When the user types the initial character, they fetch the corresponding JSON, which has the most relevant results.
> They do an "actual" search only after three characters, but most of the time you would find the result within the first three characters. This is not a full-text search, but it has its own advantages.
Bloom filters seem to have similar advantages.