Since TinEye pre-computes the hashes, do they use something like Redis to retrieve information? Redis seems perfect for such quick results, using the hash as the key and a URL or some kind of object as the value.
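A minimal sketch of that key-value scheme with redis-py, purely to illustrate the idea; the hex key format and the "phash:" prefix are my own assumptions, not anything TinEye has published:

    import redis

    # Assumed scheme: hex-encoded perceptual hash -> set of image URLs.
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def store(image_hash: int, url: str) -> None:
        # Distinct images can share a perceptual hash by design,
        # so keep a set of URLs per hash rather than a single value.
        r.sadd(f"phash:{image_hash:016x}", url)

    def lookup(image_hash: int) -> set:
        return r.smembers(f"phash:{image_hash:016x}")

    store(0x8F3A2C5D1E4B6A79, "https://example.com/cat.jpg")
    print(lookup(0x8F3A2C5D1E4B6A79))

Note this only answers exact-match lookups, which is what the replies below push back on.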
It seems to me that what you want is some kind of spatial index - you might not get an exact match on the hash, but instead get one that's one or two bits away, and you'll want something better than linear search to find it.
That would be a good idea as long as you have a good distance between images (d(image1, image2) being small should be equivalent to image1 and image2 being similar). I have little experience in image processing, but this is generally a hard problem with complex signals (as images, music and speech are).
Most of what is described in the article concerns dimension reduction, whose goal overlaps with the above.
Yes, but his remarks suggest that Hamming distance does not behave very nicely for pictures after his transformation (d(i1, i2) == 5 -> small, d(i1, i2) == 10 -> big). As I am not familiar with image signals, I don't know whether this can somehow be mitigated with other techniques. Of course, the only way to be sure would be to actually test it, but I would not bet much on it.
Another issue I forgot to mention is that each "point" (image) is a k-dimensional vector. Organizing those points so that the ones close to a query can be retrieved quickly becomes intractable. Conventional techniques like kd-trees break down when k is above a few units, because of the curse of dimensionality. This introduces a kind of "non-linearity" in your distance which makes comparing points in a k-dimensional space quite different from doing so in 2/3-dimensional spaces.
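To make that concrete, here's a quick numpy sketch (my own illustration, not from the article): for uniformly random points, the nearest and farthest neighbours converge in distance as k grows, which is exactly what starves kd-tree-style pruning.

    import numpy as np

    rng = np.random.default_rng(0)
    for k in (2, 10, 100, 1000):
        # 1000 random points and one query, all uniform in the unit cube.
        points = rng.random((1000, k))
        query = rng.random(k)
        d = np.linalg.norm(points - query, axis=1)
        # As k grows, min and max distances converge: everything is
        # roughly equally far away, so spatial pruning stops helping.
        print(f"k={k:5d}  nearest={d.min():.3f}  farthest={d.max():.3f}  "
              f"ratio={d.min() / d.max():.3f}")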
In the case of Hamming distance there is much more information to work with than in a bare metric. You can dump all the bitstrings into a suffix array, break the query into K pieces and search for an exact match for each piece. The resulting list of matches is usually small enough to just compute the Hamming distance against each element. I've used this on a corpus of 2m LaTeX strings with good results. See my above post for more details.
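For anyone curious, here's a sketch of that pigeonhole trick in Python. I'm using plain per-piece dict indexes instead of a suffix array, so the data structure differs from the parent's setup, but the query logic is the same: if two 64-bit hashes differ in fewer than K bits, at least one of the K pieces must match exactly.

    from collections import defaultdict

    K = 4                    # pieces per 64-bit hash
    PIECE_BITS = 64 // K     # finds matches within K - 1 = 3 bits

    def pieces(h: int):
        # Split a 64-bit hash into K fixed-width chunks.
        mask = (1 << PIECE_BITS) - 1
        return [(i, (h >> (i * PIECE_BITS)) & mask) for i in range(K)]

    index = [defaultdict(set) for _ in range(K)]  # one exact-match table per piece

    def insert(h: int) -> None:
        for i, p in pieces(h):
            index[i][p].add(h)

    def query(h: int, max_dist: int = K - 1):
        # Pigeonhole: if hamming(h, x) <= K - 1, the differing bits fall
        # into at most K - 1 pieces, so some piece of x equals the
        # corresponding piece of h. The union of the K exact lookups
        # therefore contains every true match; filter it by real distance.
        candidates = set()
        for i, p in pieces(h):
            candidates |= index[i][p]
        return [x for x in candidates if bin(h ^ x).count("1") <= max_dist]

    insert(0x0123456789ABCDEF)
    print(query(0x0123456789ABCDEE))   # 1 bit away -> found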
Doesn't scaling the images down (to 32x32 for the pHash approach) achieve essentially the same thing? Images that differ only slightly will likely scale down to the same thumbnail to begin with, and the resulting hash still bears some relationship to that thumbnail — so you should be able to look at similar hashes to find similar inputs.
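For concreteness, a minimal average-hash sketch along those lines (pHash proper takes a DCT of the 32x32 thumbnail and keeps the low-frequency corner; this simpler variant just thresholds an 8x8 thumbnail against its mean, but it shows the scale-down effect):

    from PIL import Image

    def average_hash(path: str) -> int:
        # Scale down and drop colour: small perturbations of the
        # original mostly disappear at this resolution.
        img = Image.open(path).convert("L").resize((8, 8), Image.LANCZOS)
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        # One bit per pixel: brighter or darker than the mean.
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (p > mean)
        return bits

    # Near-duplicates should then land a small Hamming distance apart:
    # bin(average_hash(a) ^ average_hash(b)).count("1")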
Often you get things like images that have had text added to them (e.g. I often use TinEye to find the original source image that someone has added a caption to), which means that the thumbnails will differ to some degree.
That's exactly what I'm saying - a simple index will only find exact hash matches, but what you want is images with "similar" hashes. If you define "similar" as a low Hamming distance - i.e. a small number of bit differences - and you want to find these image hashes with something better than a linear search / exhaustive combinatorial bit-twiddling, then you'll need to be smarter about your index.
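One concrete "smarter index" along those lines (an illustration, not a claim about what TinEye actually runs) is a BK-tree keyed on Hamming distance, which uses the triangle inequality to prune whole subtrees:

    class BKTree:
        # Metric tree over Hamming distance between 64-bit hashes.
        def __init__(self, h: int):
            self.hash = h
            self.children = {}   # edge distance -> subtree

        def add(self, h: int) -> None:
            node = self
            while True:
                d = bin(node.hash ^ h).count("1")
                if d == 0:
                    return                       # already present
                if d in node.children:
                    node = node.children[d]
                else:
                    node.children[d] = BKTree(h)
                    return

        def search(self, h: int, radius: int, out=None):
            # Triangle inequality: only children whose edge distance
            # lies within [d - radius, d + radius] can contain matches.
            if out is None:
                out = []
            d = bin(self.hash ^ h).count("1")
            if d <= radius:
                out.append(self.hash)
            for edge, child in self.children.items():
                if d - radius <= edge <= d + radius:
                    child.search(h, radius, out)
            return out

    tree = BKTree(0x0123456789ABCDEF)
    tree.add(0x0123456789ABCDEE)      # 1 bit away
    tree.add(0xFEDCBA9876543210)      # far away
    print(tree.search(0x0123456789ABCDEF, radius=2))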