Since TinEye pre-computes the hashes, do they use something like Redis to retrieve information? Redis seems perfect for such quick results, using the hash as the key and a URL or some kind of object as the value.
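A minimal sketch of that key-value scheme with redis-py, purely to illustrate the idea; the hex key format and the "phash:" prefix are my own assumptions, not anything TinEye has published:

    import redis

    # Assumed scheme: hex-encoded perceptual hash -> set of image URLs.
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def store(image_hash: int, url: str) -> None:
        # Distinct images can share a perceptual hash by design,
        # so keep a set of URLs per hash rather than a single value.
        r.sadd(f"phash:{image_hash:016x}", url)

    def lookup(image_hash: int) -> set:
        return r.smembers(f"phash:{image_hash:016x}")

    store(0x8F3A2C5D1E4B6A79, "https://example.com/cat.jpg")
    print(lookup(0x8F3A2C5D1E4B6A79))

Note this only answers exact-match lookups, which is what the replies below push back on.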
It seems to me that what you want is some kind of spatial index - you might not get an exact match on the hash, but instead get one that's one or two bits away, and you'll want something better than linear search to find it.
That would be a good idea as long as you have a good distance between images (d(image1, image2) being small should be equivalent to image1 and image2 being similar). I have little experience in image processing, but this is generally a hard problem with complex signals (as images, music and speech are).
Most of what is described in the article concerns dimension reduction, whose goal overlaps with the above.
Yes, but his remarks suggest that Hamming distance does not behave very nicely for pictures after his transformation (d(i1, i2) == 5 -> small, d(i1, i2) == 10 -> big). As I am not familiar with image signals, I don't know whether this can somehow be mitigated with other techniques. Of course, the only way to be sure would be to actually test it, but I would not bet much on it.
Another issue I forgot to mention is that each "point" (image) is a k-dimensional vector. Organizing those points so that the ones close to a query can be retrieved quickly becomes intractable. Conventional techniques like kd-trees break down when k is above a few units, because of the curse of dimensionality. This introduces a kind of "non-linearity" in your distance which makes comparing points in a k-dimensional space quite different from doing so in 2/3-dimensional spaces.
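To make that concrete, here's a quick numpy sketch (my own illustration, not from the article): for uniformly random points, the nearest and farthest neighbours converge in distance as k grows, which is exactly what starves kd-tree-style pruning.

    import numpy as np

    rng = np.random.default_rng(0)
    for k in (2, 10, 100, 1000):
        # 1000 random points and one query, all uniform in the unit cube.
        points = rng.random((1000, k))
        query = rng.random(k)
        d = np.linalg.norm(points - query, axis=1)
        # As k grows, min and max distances converge: everything is
        # roughly equally far away, so spatial pruning stops helping.
        print(f"k={k:5d}  nearest={d.min():.3f}  farthest={d.max():.3f}  "
              f"ratio={d.min() / d.max():.3f}")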
In the case of Hamming distance there is much more information to work with than in a bare metric. You can dump all the bitstrings into a suffix array, break the query into K pieces and search for an exact match for each piece. The resulting list of matches is usually small enough to just compute the Hamming distance against each element. I've used this on a corpus of 2m LaTeX strings with good results. See my above post for more details.
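For anyone curious, here's a sketch of that pigeonhole trick in Python. I'm using plain per-piece dict indexes instead of a suffix array, so the data structure differs from the parent's setup, but the query logic is the same: if two 64-bit hashes differ in fewer than K bits, at least one of the K pieces must match exactly.

    from collections import defaultdict

    K = 4                    # pieces per 64-bit hash
    PIECE_BITS = 64 // K     # finds matches within K - 1 = 3 bits

    def pieces(h: int):
        # Split a 64-bit hash into K fixed-width chunks.
        mask = (1 << PIECE_BITS) - 1
        return [(i, (h >> (i * PIECE_BITS)) & mask) for i in range(K)]

    index = [defaultdict(set) for _ in range(K)]  # one exact-match table per piece

    def insert(h: int) -> None:
        for i, p in pieces(h):
            index[i][p].add(h)

    def query(h: int, max_dist: int = K - 1):
        # Pigeonhole: if hamming(h, x) <= K - 1, the differing bits fall
        # into at most K - 1 pieces, so some piece of x equals the
        # corresponding piece of h. The union of the K exact lookups
        # therefore contains every true match; filter it by real distance.
        candidates = set()
        for i, p in pieces(h):
            candidates |= index[i][p]
        return [x for x in candidates if bin(h ^ x).count("1") <= max_dist]

    insert(0x0123456789ABCDEF)
    print(query(0x0123456789ABCDEE))   # 1 bit away -> found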
Doesn't scaling the images down (to 32x32 for the pHash approach) achieve essentially the same thing? Images that differ only slightly will likely scale down to the same thumbnail to begin with, and the resulting hash still bears some relationship to that thumbnail — so you should be able to look at similar hashes to find similar inputs.
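For concreteness, a minimal average-hash sketch along those lines (pHash proper takes a DCT of the 32x32 thumbnail and keeps the low-frequency corner; this simpler variant just thresholds an 8x8 thumbnail against its mean, but it shows the scale-down effect):

    from PIL import Image

    def average_hash(path: str) -> int:
        # Scale down and drop colour: small perturbations of the
        # original mostly disappear at this resolution.
        img = Image.open(path).convert("L").resize((8, 8), Image.LANCZOS)
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        # One bit per pixel: brighter or darker than the mean.
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (p > mean)
        return bits

    # Near-duplicates should then land a small Hamming distance apart:
    # bin(average_hash(a) ^ average_hash(b)).count("1")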
Often you get things like images that have had text added to them (e.g. I often use TinEye to find the original source image that someone has added a caption to), which means that the thumbnails will differ to some degree.
That's exactly what I'm saying - a simple index will only find exact hash matches, but what you want is images with "similar" hashes. If you define "similar" as a low Hamming distance - i.e. a small number of bit differences - and you want to find these image hashes with something better than a linear search / exhaustive combinatorial bit-twiddling, then you'll need to be smarter about your index.
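One concrete "smarter index" along those lines (an illustration, not a claim about what TinEye actually runs) is a BK-tree keyed on Hamming distance, which uses the triangle inequality to prune whole subtrees:

    class BKTree:
        # Metric tree over Hamming distance between 64-bit hashes.
        def __init__(self, h: int):
            self.hash = h
            self.children = {}   # edge distance -> subtree

        def add(self, h: int) -> None:
            node = self
            while True:
                d = bin(node.hash ^ h).count("1")
                if d == 0:
                    return                       # already present
                if d in node.children:
                    node = node.children[d]
                else:
                    node.children[d] = BKTree(h)
                    return

        def search(self, h: int, radius: int, out=None):
            # Triangle inequality: only children whose edge distance
            # lies within [d - radius, d + radius] can contain matches.
            if out is None:
                out = []
            d = bin(self.hash ^ h).count("1")
            if d <= radius:
                out.append(self.hash)
            for edge, child in self.children.items():
                if d - radius <= edge <= d + radius:
                    child.search(h, radius, out)
            return out

    tree = BKTree(0x0123456789ABCDEF)
    tree.add(0x0123456789ABCDEE)      # 1 bit away
    tree.add(0xFEDCBA9876543210)      # far away
    print(tree.search(0x0123456789ABCDEF, radius=2))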