Did you evaluate the 'imagehash' [1] library before working on this? Any limitations or concerns? The additional CNN seems to be the main difference between the two libraries.
Yes, before developing the package we were using this great library for hash generation ourselves. Our package differs from imagehash in several ways:
1. Added a CNN, as you mentioned
2. Took care of housekeeping functions like efficient retrieval (using a BK-tree, also parallelized)
3. Added plotting abilities for visualizing duplicates
4. Added the ability to evaluate a deduplication algorithm, so that the user can judge deduplication performance on a custom dataset (with classification and information retrieval metrics)
5. Allowed thresholds to be changed to better capture the idea of a 'duplicate' for specific use cases
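
To make points 2 and 5 a bit more concrete, here is a rough sketch of BK-tree retrieval over Hamming distance with an adjustable threshold. It is not the package's actual code: the names BKTree, hamming and find_duplicates, the 8-bit hash values and the file names are all made up for illustration, and hashes are assumed to be plain integers.

```python
# Illustrative sketch only: not the actual code of either library.
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Minimal BK-tree keyed on Hamming distance (hypothetical helper)."""

    def __init__(self) -> None:
        # Each node is a tuple: (name, hash_value, {edge_distance: child_node})
        self.root = None

    def add(self, name: str, value: int) -> None:
        if self.root is None:
            self.root = (name, value, {})
            return
        node = self.root
        while True:
            d = hamming(value, node[1])
            child = node[2].get(d)
            if child is None:
                node[2][d] = (name, value, {})
                return
            node = child

    def query(self, value: int, max_distance: int) -> List[Tuple[str, int]]:
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            name, node_value, children = stack.pop()
            d = hamming(value, node_value)
            if d <= max_distance:
                results.append((name, d))
            # Triangle inequality: a match can only sit under an edge whose
            # label lies in [d - max_distance, d + max_distance].
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results, key=lambda r: (r[1], r[0]))


def find_duplicates(hashes: Dict[str, int], max_distance: int) -> Dict[str, List[str]]:
    """Map each file name to the other files within max_distance bits."""
    tree = BKTree()
    for name, value in hashes.items():
        tree.add(name, value)
    return {
        name: [other for other, _ in tree.query(value, max_distance) if other != name]
        for name, value in hashes.items()
    }


# Made-up 8-bit hashes; real perceptual hashes are typically longer.
hashes = {
    "a.jpg": 0b10110010,
    "a_copy.jpg": 0b10110011,
    "a_edit.jpg": 0b10110100,
    "b.jpg": 0b01001101,
}
strict = find_duplicates(hashes, max_distance=1)
loose = find_duplicates(hashes, max_distance=2)
```

With the stricter threshold only a_copy.jpg is grouped with a.jpg; loosening it to 2 also pulls in a_edit.jpg, which is the kind of tuning point 5 refers to.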
[1] https://github.com/JohannesBuchner/imagehash