[ACCEPTED]-Near-Duplicate Image Detection-cbir

Accepted answer
Score: 71

There has been a lot of research on image 14 searching and similarity measures. It's 13 not an easy problem. In general, a single 12 int won't be enough to determine if images 11 are very similar. You'll have a high false-positive 10 rate.

However, since there has been a lot 9 of research done, you might take a look 8 at some of it. For example, this paper (PDF) gives 7 a compact image fingerprinting algorithm 6 that is suitable for finding duplicate images 5 quickly and without storing much data. It 4 seems like this is the right approach if you 3 want something robust.

If you're looking 2 for something simpler, but definitely more 1 ad-hoc, this SO question has a few decent ideas.

Score: 50

I would recommend considering moving away 73 from just using an RGB histogram.

A better 72 digest of your image can be obtained if 71 you take a 2d Haar wavelet of the image 70 (its a lot easier than it sounds, its just 69 a lot of averaging and some square roots 68 used to weight your coefficients) and just 67 retain the k largest weighted coefficients 66 in the wavelet as a sparse vector, normalize 65 it, and save that to reduce its size. You 64 should rescale R G and B using perceptual 63 weights beforehand at least or I'd recommend 62 switching to YIQ (or YCoCg, to avoid quantization 61 noise) so you can sample chrominance information 60 with reduced importance.

You can now use 59 the dot product of two of these sparse normalized 58 vectors as a measure of similarity. The 57 image pairs with the largest dot products 56 are going to be very similar in structure. This 55 has the benefit of being slightly resistant 54 to resizing, hue shifting and watermarking, and 53 being really easy to implement and compact.

You 52 can trade off storage and accuracy by increasing 51 or decreasing k.

Sorting by a single numeric 50 score is going to be intractable for this 49 sort of classification problem. If you think 48 about it it would require images to only 47 be able to 'change' along one axis, but 46 they don't. This is why you need a vector 45 of features. In the Haar wavelet case its 44 approximately where the sharpest discontinuities 43 in the image occur. You can compute a distance 42 between images pairwise, but since all you 41 have is a distance metric a linear ordering 40 has no way to express a 'triangle' of 3 39 images that are all equally distant. (i.e. think 38 of an image that is all green, an image 37 that is all red and an image that is all 36 blue.)

That means that any real solution 35 to your problem will need O(n^2) operations 34 in the number of images you have. Whereas 33 if it had been possible to linearize the 32 measure, you could require just O(n log 31 n), or O(n) if the measure was suitable 30 for, say, a radix sort. That said, you don't 29 need to spend O(n^2) since in practice you 28 don't need to sift through the whole set, you 27 just need to find the stuff thats nearer 26 than some threshold. So by applying one 25 of several techniques to partition your 24 sparse vector space you can obtain much 23 faster asymptotics for the 'finding me k 22 of the images that are more similar than 21 a given threshold' problem than naively 20 comparing every image against every image, giving 19 you what you likely need... if not precisely 18 what you asked for.

In any event, I used 17 this a few years ago to good effect personally 16 when trying to minimize the number of different 15 textures I was storing, but there has also 14 been a lot of research noise in this space 13 showing its efficacy (and in this case comparing 12 it to a more sophisticated form of histogram 11 classification):


If you need better accuracy 10 in detection, the minHash and tf-idf algorithms 9 can be used with the Haar wavelet (or the 8 histogram) to deal with edits more robustly:


Finally, Stanford 7 has an image search based on a more exotic 6 variant of this kind of approach, based 5 on doing more feature extraction from the 4 wavelets to find rotated or scaled sections 3 of images, etc, but that probably goes way 2 beyond the amount of work you'd want to 1 do.


Score: 15

I implemented a very reliable algorithm 24 for this called Fast Multiresolution Image Querying. My (ancient, unmaintained) code 23 for that is here.

What Fast Multiresolution 22 Image Querying does is split the image into 21 3 pieces based on the YIQ colorspace (better 20 for matching differences than RGB). Then 19 the image is essentially compressed using 18 a wavelet algorithm until only the most 17 prominent features from each colorspace 16 are available. These points are stored in 15 a data structure. Query images go through 14 the same process, and the prominent features 13 in the query image are matched against those 12 in the stored database. The more matches, the 11 more likely the images are similar.

The algorithm 10 is often used for "query by sketch" functionality. My 9 software only allowed entering query images 8 via URL, so there was no user interface. However, I 7 found it worked exceptionally well for matching 6 thumbnails to the large version of that 5 image.

Much more impressive than my software 4 is retrievr which lets you try out the FMIQ algorithm 3 using Flickr images as the source. Very 2 cool! Try it out via sketch or using a source 1 image, and you can see how well it works.

Score: 10

A picture has many features, so unless you 29 narrow yourself to one, like average brightness, you 28 are dealing with an n-dimensional problem 27 space.

If I asked you to assign a single 26 integer to the cities of the world, so I 25 could tell which ones are close, the results 24 wouldn't be great. You might, for example, choose 23 time zone as your single integer and get 22 good results with certain cities. However, a 21 city near the north pole and another city 20 near the south pole can also be in the same 19 time zone, even though they are at opposite 18 ends of the planet. If I let you use two 17 integers, you could get very good results 16 with latitude and longitude. The problem 15 is the same for image similarity.

All that 14 said, there are algorithms that try to cluster 13 similar images together, which is effectively 12 what you're asking for. This is what happens 11 when you do face detection with Picasa. Even 10 before you identify any faces, it clusters 9 similar ones together so that it's easy 8 to go through a set of similar faces and 7 give most of them the same name.

There is 6 also a technique called Principle Component 5 Analysis, which lets you reduce n-dimensional 4 data down to any smaller number of dimensions. So 3 a picture with n features could be reduced 2 to one feature. However, this is still not 1 the best approach for comparing images.

Score: 8

There's a C library ("libphash" - http://phash.org/) that 7 will calculate a "perceptual hash" of 6 an image and allow you to detect similar 5 images by comparing hashes (so you don't 4 have to compare each image directly against 3 every other image) but unfortunately it 2 didn't seem to be very accurate when I tried 1 it.

Score: 5

You have to decide what is "similar." Contrast? Hue?

Is 21 a picture "similar" to the same 20 picture upside-down?

I bet you can find a 19 lot of "close calls" by breaking 18 images up into 4x4 pieces and getting an 17 average color for each grid cell. You'd 16 have sixteen scores per image. To judge 15 similarity, you would just do a sum of squares 14 of differences between images.

I don't think 13 a single hash makes sense, unless it's against 12 a single concept like hue, or brightness, or 11 contrast.

Here's your idea:

0499994 <- possible dupe
0499999 <- possible dupe

First of all, I'm 10 going to assume these are decimal numbers 9 that are R*(2^16)+G*(2^8)+B, or something 8 like that. Obviously that's no good because 7 red is weighted inordinately.

Moving into HSV space would be better. You 6 could spread the bits of HSV out into the hash, or you could just 5 settle H or S or V individually, or you 4 could have three hashes per image.

One more 3 thing. If you do weight R, G, and B. Weight 2 green highest, then red, then blue to match 1 human visual sensitivity.

Score: 5

In the age of web services you could try 1 http://tineye.com

Score: 2

The question Good way to identify similar images? seems to provide a solution 1 for your question.

Score: 1

i assumed that other duplicate image search 5 software performs an FFT on the images, and 4 stores the values of the different frequencies 3 as a vectors:

Image1 = (u1, u2, u3, ..., un)
Image2 = (v1, v2, v3, ..., vn)

and then you can compare two 2 images for equalness by computing the distance between 1 the weight vectors of two images:

distance = Sqrt(
     (u1-v1)^2 +
     (u2-v2)^2 +
     (u2-v3)^2 +
Score: 1

One solution is to perform a RMS/RSS comparison 12 on every pair of pictures required to perform 11 a bubble sort. Second, you could perform 10 an FFT on each image and do some axis averaging 9 to retrieve a single integer for each image 8 which you would use as an index to sort 7 by. You may consider doing whatever comparison 6 on a resized (25%, 10%) version of the original 5 depending on how small a difference you 4 choose to ignore and how much speedup you 3 require. Let me know if these solutions 2 are interesting, and we can discuss or I 1 can provide sample code.

Score: 1

Most modern approaches to detect Near duplicate 9 image detection use interesting points detection 8 and descriptors describing area around such 7 points. Often SIFT is used. Then you can quatize 6 descriptors and use clusters as visual word 5 vocabulary.

So if we see on ratio of common 4 visual words of two images to all visual 3 words of these images you estimate similarity 2 between images. There are a lot of interesting 1 articles. One of them is Near Duplicate Image Detection: minHash and tf-idf Weighting

Score: 1

For example using IMMI extension and IMMI 4 you can examine many different ways how 3 to measure similarity between images: http://spl.utko.feec.vutbr.cz/en/component/content/article/46-image-processing-extension-for-rapidminer-5

By 2 defining some threshold and selecting some 1 method you can measure similarity.

More Related questions