Show HN: Imagededup – Finding duplicate images made easy (github.com/idealo)
290 points by datitran on Oct 7, 2019 | 57 comments


We've just open-sourced our library imagededup, a Python package that simplifies the task of finding exact and near duplicates in an image collection.

It includes several hashing algorithms (PHash, DHash, etc.) and convolutional neural networks, an evaluation framework to judge the quality of deduplication, easy plotting functionality for duplicates, and a simple API.

We're really excited about this library because finding duplicate images is a very important task in computer vision and machine learning. For example, severe duplication can create extreme biases in the evaluation of your ML model (check out the CIFAR-10 problem). Please try out our library, star it on GitHub and spread the word! We'd love to get feedback.


This looks really interesting. Can you give us an idea of the performance? E.g. roughly how long would it take to process 1 million 1920×1080 JPEGs, without a GPU and with a GPU?

What is the scaling like? E.g. what if it was 10 million?


Scale is unfortunately not the focus of the current implementation; we plan to address it in future releases. Regarding speed and memory, the current state is:

1. Hashing methods: generating hashes is quick (a couple of seconds on about 10K images). The tricky part is the retrieval of duplicates, which on the same 10K dataset takes a few minutes. (I would refrain from giving exact numbers since this was done on a local machine, not a great environment for benchmarking.)
2. CNN: here, the time-consuming part is encoding generation, which, in the absence of a GPU, takes much longer (a couple of minutes on 10K images). The retrieval part is pretty quick, but requires memory.

So, at this point, using this package locally on more than a couple of thousand images is not a good idea. We will, however, address the scale aspect in future releases. Thanks for your question.


Doesn't sound too bad. But can you elaborate on the last part, about retrieval requiring memory and not scaling to more than a couple of thousand images?

How much memory would you need for ~2000 images, how slow does it get, etc.

Thx


Retrieval using CNNs requires computing a cosine similarity matrix. So, for 'n' images, a matrix of size n x n needs to be stored in memory, and the storage requirement blows up quadratically as 'n' increases. We have already made some optimizations to reduce the memory footprint, but there are clear upper limits (we haven't experimented to find them). As for numbers, the CIFAR-10 example in the repo was run in a Google Colab notebook: the dataset has 60K images and ended up consuming about 6 GB of RAM using the CNN. However, since there hasn't yet been any principled benchmarking of the package, I would treat these numbers as only marginally indicative of its capabilities. A better way to figure out whether it works for you is to try it on your own dataset, starting small and increasing the dataset size gradually.
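
To make the quadratic growth concrete, here is a minimal NumPy sketch (illustrative only; the encoding dimension d is an assumed value, not necessarily what the package uses):

    import numpy as np

    # Full n x n cosine-similarity matrix: memory grows as n^2.
    n, d = 10_000, 1024                        # n images, d-dimensional encodings (d assumed)
    encodings = np.random.rand(n, d).astype(np.float32)

    # Normalise rows so that a plain dot product equals cosine similarity.
    encodings /= np.linalg.norm(encodings, axis=1, keepdims=True)

    sim = encodings @ encodings.T              # n x n float32 matrix
    print(f"{sim.nbytes / 1e9:.2f} GB")        # ~0.4 GB at n=10k, ~14 GB at n=60k (before any optimizations)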


Use approximate nearest neighbor. The FAISS library is good.


Thanks for the pointer, will check it out.


Couldn't you add images>threshold to a dict/map as you iterate, rather than building a complete matrix, then iterating through that?


We tried that approach, but it was way too slow.


"The tricky part is the retrieval of duplicates, which on the same 10K dataset takes a few minutes"

"Doesn't sound too bad."

It doesn't? As a word of caution: if a 10K dataset takes a few minutes, I would be careful about how long 10MM pictures take. I predict it is not 10MM/10K times a few minutes.


How is the hash represented? Have you looked at NN libraries like faiss [0] and NGT [1]? They can quite easily handle a nearest-neighbor search over 10 million vectors and, from my understanding, they turn their vectors into some kind of hash that is then searched.

[0] - https://github.com/facebookresearch/faiss

[1] - https://github.com/yahoojapan/NGT


The hashes are 16-character hexadecimal strings. I had a quick look at the faiss package and it looks promising. We would consider it for the next versions.
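
For what it's worth, faiss has a binary index that searches by Hamming distance, which maps naturally onto 64-bit (16-hex-character) hashes. A rough sketch with made-up hash values:

    import numpy as np
    import faiss  # pip install faiss-cpu

    # Made-up 16-hex-character (64-bit) hashes, packed into 8 bytes each.
    hex_hashes = ["9f1726ac83d0e3b1", "9f1726ac83d0e3b3", "0123456789abcdef"]
    codes = np.array([list(bytes.fromhex(h)) for h in hex_hashes], dtype="uint8")

    index = faiss.IndexBinaryFlat(64)      # dimension given in bits
    index.add(codes)

    # Nearest neighbours of the first hash; D contains Hamming distances.
    D, I = index.search(codes[:1], 2)
    print(D, I)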


If you're interested in collaboration I'd be happy to help with a prod-focused version. My work has a need for a shardable daemon for dedup tasks. My personal email is in my description and I'm also available via josh@xix.ai.

We also have an image heavy production use case that would be able to yield some nice metrics from this tool.


I use Cython (CPU) to brute-force 400,000 pHashes of images. It takes somewhere between 100 and 200 ms to search.


I only realized now that the 100-200 ms time you refer to is for a single search and not for 400,000 searches. The package already achieves this brute-force speed. In fact, the package also implements a BK-tree, which, depending on the distance threshold passed, can drastically reduce the search time. Moreover, the search through the BK-tree is also parallelized in the package (each image's hash is searched through the tree independently after the tree is constructed). On one of the example datasets containing 10K images, with a distance threshold of 10 (for 64-bit hashes), the retrieval time per image was < 50 ms.
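
For illustration, here is a minimal BK-tree sketch for this kind of Hamming-distance retrieval (the idea only, not the package's actual implementation):

    # Each node is (hash, {edge_distance: child_node}).
    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    class BKTree:
        def __init__(self):
            self.root = None

        def add(self, h: int):
            if self.root is None:
                self.root = (h, {})
                return
            node = self.root
            while True:
                d = hamming(h, node[0])
                if d not in node[1]:
                    node[1][d] = (h, {})
                    return
                node = node[1][d]

        def query(self, h: int, threshold: int):
            # The triangle inequality lets us skip subtrees whose edge
            # distance lies outside [d - threshold, d + threshold].
            results, stack = [], [self.root] if self.root else []
            while stack:
                node = stack.pop()
                d = hamming(h, node[0])
                if d <= threshold:
                    results.append((node[0], d))
                for edge, child in node[1].items():
                    if d - threshold <= edge <= d + threshold:
                        stack.append(child)
            return results

    tree = BKTree()
    for h in (0x9F1726AC83D0E3B1, 0x9F1726AC83D0E3B3, 0x0123456789ABCDEF):
        tree.add(h)
    print(tree.query(0x9F1726AC83D0E3B1, threshold=10))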


The 100-200ms time I referred to was indeed a single search. The difference is, it's on a single core. Cython definitely makes the hamming distance function faster.


That sounds good! Would be great if you can share the code, or even better, make a PR to the repo.


The implementation is trivial; for speed we use:

    # Expose GCC's builtin popcount (number of set bits), callable without the GIL.
    cdef extern int __builtin_popcountll(unsigned long long) nogil

    # Hamming distance between two 64-bit hashes = popcount of their XOR.
    dist = __builtin_popcountll(key ^ phash)

It would only take a couple of minutes to fill out the rest.


How long did the generating take?


We store pHashes in a database, but just quickly checking on my laptop, it takes between 1 and 2 ms to generate a single image pHash. IO could be significant for many images.
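
For comparison, timing a single hash, e.g. with the imagehash library (which also comes up further down the thread; the file name is hypothetical):

    import time
    import imagehash            # pip install ImageHash
    from PIL import Image

    start = time.perf_counter()
    h = imagehash.phash(Image.open("photo.jpg"))   # 64-bit perceptual hash
    print(h, f"{(time.perf_counter() - start) * 1000:.1f} ms")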


Thanks! This is a very important problem space for us, as we do document processing, and double/triple uploads are hard to detect and create unnecessary effort.

Can you elaborate a little bit on the performance you've observed? Can it work iteratively, basically asking one image at a time "have we already seen this?"

Would it work for text documents of varying quality as well, or is this unfeasible?


Do you use OCR on the documents?


Did you consider using the PhotoDNA hash algorithm for finding duplicates? If you've heard of it and ruled it out, I'd love to know why. While it was designed for a very different (and dark) purpose, it seems like it might do well for this task.


Just had a look, thanks for the pointer. And yes, it was designed for a dark purpose. Will try to find the comparisons with the currently implemented hashing methods and see if there's merit to implementing it.


Thanks Dat for open-sourcing it! Once again I'm shamelessly linking my Kaggle kernel with `imagededup` installed: https://www.kaggle.com/nulldata/image-duplicates-without-dee.... This one uses `PHash()` and runs in under 88 seconds.

Anyone can simply fork this kernel, add their dataset to it, and perform the task!


Very nice. If I may ask for clarification on the CNN method (which seems to work quite well): you're taking a CNN that's built for classification but removing the last layer (which I believe is fully connected?) so that you only take the "internal features", is that correct? I would have expected some kind of autoencoder to fit here, but it's very interesting that this works.


This is the same idea that underlies style transfers and metrics like the FID (which is used to judge generative networks' outputs on their similarity to the test set).

The idea is that the activations within an image recognition network are similar for images that are similar and so you can measure the distance between two images in a space that has some semantic meaning.


This is the concept behind "transfer learning": taking a fully trained CNN (ResNet, Inception, MobileNet, etc.) and removing the last layer or two. What I don't understand yet is when to use which trained network, i.e. whether they have different features that make them suited to different application domains.
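
For illustration, the "remove the classification head" trick looks roughly like this in Keras (a sketch only; MobileNet is an arbitrary choice here and may not match what the package actually uses):

    import numpy as np
    import tensorflow as tf

    # Pretrained ImageNet network with the fully connected head removed;
    # global average pooling turns the last feature maps into one vector.
    model = tf.keras.applications.MobileNet(
        weights="imagenet", include_top=False, pooling="avg"
    )

    # Dummy batch of two 224x224 RGB images (stand-ins for real photos).
    images = np.random.rand(2, 224, 224, 3).astype("float32") * 255.0
    images = tf.keras.applications.mobilenet.preprocess_input(images)

    embeddings = model(images).numpy()     # shape (2, 1024) feature vectors
    a, b = embeddings
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(embeddings.shape, cosine)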


I don't think there's a lot to say about this: you want pretraining on a large dataset and on a task that's similar to yours. Probably largeness is somewhat more important. These things aren't an exact science at this point.


Any idea if there is something that supports animated gifs?


This looks nice, and seems to support the most common methods for fingerprinting/hashing. This comes with some heavy dependencies, though (which is reasonable):

    install_requires=[
        'numpy==1.16.3',
        'Pillow==6.0.0',
        'PyWavelets==1.0.3',
        'scipy==1.2.1',
        'tensorflow==2.0.0',
        'tqdm==4.35.0',
        'scikit-learn==0.21.2',
        'matplotlib==3.1.1',
    ],

A while ago, I asked about something like this (or more precisely about the underlying methods) here on SO:

https://stackoverflow.com/questions/4196453/simple-and-fast-...

There are some interesting discussions. (Nowadays, such a question would have been closed...)


If this is a library then locking those dependencies down is not great. Does it really need Pillow 6.0.0, and not Pillow 6.0.1?

As this is being consumed by larger applications that may have dependencies that conflict with these, they should be much more liberal.

https://github.com/idealo/imagededup/pull/36


Yes, you're right, thanks for pointing this out!


For a split second I was excited and terrified because I thought this was a very similarly named Go project I wrote a while ago. It’s nowhere near as fancy but is very fast and has a very similar name.

I don't have the background in imaging these people likely have, but mine works by breaking an image into an X-by-X map of average colors and comparing those. I wrote it specifically because I needed to find similar images of different aspect ratios, and at the time I couldn't find anything.

https://github.com/donatj/imgdedup
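
The idea, in a rough Python/Pillow sketch (illustrative only; not the Go implementation, and the file names are hypothetical):

    import numpy as np
    from PIL import Image

    def color_grid(path: str, grid: int = 8) -> np.ndarray:
        # Resizing with a box filter averages the colors inside each cell.
        img = Image.open(path).convert("RGB").resize((grid, grid), Image.BOX)
        return np.asarray(img, dtype=np.float32)

    def grid_distance(path_a: str, path_b: str) -> float:
        # Mean absolute per-channel difference; lower means more similar.
        return float(np.abs(color_grid(path_a) - color_grid(path_b)).mean())

    print(grid_distance("photo.jpg", "photo_resized.jpg"))

Because both images are reduced to the same grid before comparison, this also tolerates (moderate) differences in resolution and aspect ratio.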


If anyone is interested in a GUI, XnView MP does this too. https://www.journeybytes.com/2018/03/how-to-find-duplicate-p...


This came a bit late. I recently decided I had to sort all my photos, which I usually just dump into one big photos folder. Since I use the camera on my phone a lot and also get a lot of media through WhatsApp, the collection was getting a bit big.

I made a script to calculate the hash of every file; if it found a double, it would move it to a separate duplicates folder. This worked reasonably well, but I couldn't stop thinking there must already be more than one ready-made solution for this.
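
That kind of exact-duplicate script boils down to something like this (folder names are assumptions):

    import hashlib
    import shutil
    from pathlib import Path

    seen = {}
    dup_dir = Path("duplicates")
    dup_dir.mkdir(exist_ok=True)

    for path in Path("photos").iterdir():
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            # Exact byte-level double: move it out of the main folder.
            shutil.move(str(path), str(dup_dir / path.name))
        else:
            seen[digest] = path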


Quicker than hashing the file, you might want to compare EXIF data (extracted with exiftool). I have been comparing the image date/time (to the second) as tagged by the camera, and when I find a duplicate, I keep the one with the largest image size. I've not worked out how to deal with images without EXIF tags. I understand Shotwell hashes the thumbnail to find dupes. The security-camera software Motion has some image comparison to determine whether the camera image has changed since the last one; I think it was visual in nature rather than hash-based, since webcams are "noisy".
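
A rough sketch of that date/time grouping, using Pillow instead of exiftool (the folder name is an assumption, and tag 306 is the basic EXIF DateTime field, used here as a stand-in for the camera's DateTimeOriginal):

    from collections import defaultdict
    from pathlib import Path
    from PIL import Image

    groups = defaultdict(list)
    for path in Path("photos").glob("*.jpg"):
        with Image.open(path) as img:
            stamp = img.getexif().get(306)   # "YYYY:MM:DD HH:MM:SS" or None
        if stamp:
            groups[stamp].append(path)

    for stamp, paths in groups.items():
        if len(paths) > 1:
            keep = max(paths, key=lambda p: p.stat().st_size)  # keep the largest file
            print(stamp, "keep:", keep, "dupes:", [p for p in paths if p != keep])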


Here's a broad and perhaps a bit naive question on this:

Reddit, Imgur, and any other site that takes in significant amounts of images from a significant number of users... do they attempt to do this? To de-dupe images and instead create virtual links?

At face value it'd seem like a crazy amount of physical disk space savings, but maybe the processing overhead is too expensive?


They would not do deduplication like this, because it is based on similar images, but they probably do (and should) dedupe by exact image hash.


I once built an image comparison feature into a webpage that had uploads. What I did was scale down all images (for comparison only) to something like 100x100, and I think I made them black and white, but I'm not sure about that last detail. I'd then XOR one thumbnail with another to compare their level of similarity. I didn't come up with this myself; I put together a few pieces of information from around the web... as with about 100% of the things I build ;).

Not perfect, but it worked pretty well for images that were exactly the same. Of course it isn't as advanced as Imagededup.
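
Something along those lines (a rough sketch of the described approach; file names are hypothetical):

    import numpy as np
    from PIL import Image

    def thumb(path: str) -> np.ndarray:
        # 100x100 grayscale thumbnail as a uint8 array.
        return np.asarray(Image.open(path).convert("L").resize((100, 100)))

    a, b = thumb("upload_a.jpg"), thumb("upload_b.jpg")
    diff = np.bitwise_xor(a, b)
    # Count differing bits: fewer set bits means more similar images.
    print(np.unpackbits(diff).sum())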


Tumblr did something similar, but only for exact matches. You can tell if it’s a legacy image or not by looking for a hash in the image url path.

Legacy style: https://66.media.tumblr.com/tumblr_m61cvzNYF81qg0jdoo1_640.g...

New style: https://66.media.tumblr.com/76451d8fee12cd3c5971e20bb8e236e3...


People do deduplicate files to save on space, but it's usually based on an exact byte match using MD5 or SHA-256. Some don't, due to privacy issues: https://news.ycombinator.com/item?id=2438181 (e.g., the MPAA could upload all their torrented movies and see which ones upload instantly, to prove that your system has their copyrighted files)

There's no way to make the UX work out for images that are only similar. Would be pretty wild to upload a picture of myself just to see a picture of my twin used instead.

But I do wonder if it's possible to deduplicate different resolutions of an image that only differ in upscaling/downscaling algorithm and compression level used (thereby solving the jpeg erosion problem: https://xkcd.com/1683/)


The CNN methods in the package are particularly robust to resolution differences. In fact, if a simple up/downscale is all that differentiates two images, even the hashing algorithms can be expected to do a good job.


The CPU cost would far outweigh the storage cost.


Only if you go by similarity. If you go by exact dedupes using hashes (which they almost certainly do), the CPU cost is trivial.


Nice project!

I wonder how much of it could be adapted to finding duplicate documents, e.g. homework assignments, CVs, etc. Presumably, the hashing would have to be adapted slightly. But by how much?


For documents I wouldn't do anything listed here. I would just compare the contents on a line-by-line or word-by-word basis.


This is really nice. I have a semi-abandoned project that uses pHash to canvass my photo library and flag duplicates, and I'm quite likely to use this instead.

Now if only Apple hadn’t repeatedly broken PyObjC over the years...


I’ve been trying to sort through multiple backups of my photo library for several years and such a tool will be very useful.


This too is my use case. Some photos came back from Google Takeout, some were jpeg-recompressed, some renamed and date/time-shifted by jhead. I'd love to be able to group them, select the highest quality or pixel count, and then canonically order by date.

G


Could this be used to order a large set of images by similarity?


As I pointed out in other comments, the current implementation does not focus on the scale problem. However, using the 'scores' attribute of the 'find_duplicates' function, you can obtain the Hamming distance/cosine similarity and then use that to sort. For more, please refer to the docs.


The API does not seem to support this, but it should be easy to hack in (return not only the list of duplicates but also the actual distance to the target image, then sort the results by distance).


The API already supports returning the Hamming distances/cosine similarities along with the duplicate file list, which can be used to sort the files. Please refer to the docs for the 'find_duplicates' function for more.
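
Roughly, that looks like the following (a sketch pieced together from this thread; parameter names and the exact output format should be checked against the docs):

    from imagededup.methods import PHash

    phasher = PHash()
    duplicates = phasher.find_duplicates(
        image_dir="path/to/images",     # hypothetical directory
        max_distance_threshold=10,      # Hamming-distance cutoff for 64-bit hashes
        scores=True,                    # also return the distances
    )

    # e.g. {'a.jpg': [('a_copy.jpg', 2), ('a_small.jpg', 7)], ...}
    for filename, matches in duplicates.items():
        for other, distance in sorted(matches, key=lambda m: m[1]):
            print(filename, "~", other, "distance", distance)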


Did you evaluate the 'imagehash' [1] library prior to working on this? Any limitations/concerns? The additional CNN seems to be the main difference between the two libraries.

[1] https://github.com/JohannesBuchner/imagehash


Yes, before developing this package, we were also using that great library for hash generation. There are a bunch of differences compared to imagehash:

1. Added CNNs, as you mentioned.
2. Took care of housekeeping functions like efficient retrieval (using a BK-tree, also parallelized).
3. Added plotting abilities for visualizing duplicates.
4. Added the possibility to evaluate the deduplication algorithm, so the user can judge deduplication performance on a custom dataset (with classification and information-retrieval metrics).
5. Allowed changing thresholds to better capture the idea of a 'duplicate' for specific use cases.


Sweet. Just what I needed to remove duplicate memes from my phone.



