[ACCEPTED]-Detect duplicate MP3 files with different bitrates and/or different ID3 tags?-id3

Accepted answer
Score: 16

The exact same question that people at the old AudioScrobbler and currently at MusicBrainz have been working on for a long time. For the time being, the Python project that can aid in your quest is Picard, which will tag audio files (not only MPEG-1 Layer 3 files) with a GUID (actually, several of them); from then on, matching the tags is quite simple.

If you prefer to do it as a project of your own, libofa might be of help.
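Once Picard has written MusicBrainz identifiers into the tags, deduplication reduces to grouping files by that ID. A minimal sketch, assuming the files have already been tagged by Picard; the grouping logic is plain Python, and the commented-out mutagen call shows one way to read the ID:

```python
from collections import defaultdict

def group_by_recording_id(tagged_files):
    """Group file paths by their MusicBrainz recording ID.

    tagged_files: iterable of (path, mbid) pairs; mbid may be None
    for untagged files, which are reported separately.
    """
    groups = defaultdict(list)
    untagged = []
    for path, mbid in tagged_files:
        if mbid:
            groups[mbid].append(path)
        else:
            untagged.append(path)
    # Any group holding more than one path is a duplicate candidate.
    duplicates = {m: paths for m, paths in groups.items() if len(paths) > 1}
    return duplicates, untagged

# With mutagen installed, the ID can be read from a Picard-tagged file like:
#   from mutagen.easyid3 import EasyID3
#   mbid = EasyID3(path).get("musicbrainz_trackid", [None])[0]

files = [
    ("a_128k.mp3", "rec-1"),
    ("a_320k.mp3", "rec-1"),
    ("b.mp3", "rec-2"),
    ("untagged.mp3", None),
]
dups, rest = group_by_recording_id(files)
print(dups)  # {'rec-1': ['a_128k.mp3', 'a_320k.mp3']}
print(rest)  # ['untagged.mp3']
```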

Score: 5

Like the others said, simple checksums won't detect duplicates with different bitrates or ID3 tags. What you need is an audio fingerprinting algorithm. The Python Audioprocessing Suite has such an algorithm, but I can't say anything about how reliable it is.

http://rudd-o.com/new-projects/python-audioprocessing

Score: 3

For tag issues, Picard may indeed be a very good bet. If, having identified two potentially duplicate files, what you want is to extract bitrate information from them, have a look at mp3guessenc.

Score: 2

I don't think simple checksums will ever work:

  1. ID3 tags will affect the md5
  2. Different encoders will encode the same song different ways - so the checksums will be different
  3. Different bit-rates will produce different checksums
  4. Re-encoding an mp3 to a different bit-rate will probably sound terrible and will certainly be different to the original audio compressed in one step.
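Point 1 is easy to demonstrate: the ID3 tag is part of the file's bytes, so two files with identical audio but different tags hash differently. A minimal illustration with fake byte strings standing in for MP3 files:

```python
import hashlib

# Fake "files": identical audio payload, different ID3v1-style trailers.
audio = b"\xff\xfb" * 1000            # stand-in for identical MP3 frames
file_a = audio + b"TAGSome Title A"   # stand-in for one set of tags
file_b = audio + b"TAGSome Title B"   # same audio, different tags

md5_a = hashlib.md5(file_a).hexdigest()
md5_b = hashlib.md5(file_b).hexdigest()
print(md5_a == md5_b)  # False: checksums differ despite identical audio
```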

I think you'll have to compare ID3 tags, song length, and filenames.
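A sketch of that heuristic: normalize the tags and round the length so near-identical files collide on the same key. The field names and the 2-second tolerance are illustrative assumptions:

```python
def duplicate_key(artist, title, length_seconds):
    """Build a fuzzy grouping key from tags and song length.

    Length is rounded to the nearest 2 seconds so small decoder
    differences still map duplicates to the same bucket.
    """
    norm = lambda s: " ".join(s.lower().split())
    return (norm(artist), norm(title), round(length_seconds / 2))

k1 = duplicate_key("The Beatles", "Let It Be", 243.1)
k2 = duplicate_key("the  beatles", "Let it be ", 243.9)
print(k1 == k2)  # True: same bucket despite tag/length noise
```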

Score: 2

Re-encoding at the same bit rate won't work; in fact it may make things worse, as transcoding (which is what re-encoding at a different bitrate is called) changes the nature of the compression: recompressing an already compressed file is going to lead to a significantly different file.

This is a little out of my league, but I would approach the problem by looking at the wave pattern of the MP3, either by converting the MP3 to an uncompressed .wav or maybe by just running the analysis on the MP3 file itself. There should be a library out there for this. Just a word of warning: this is an expensive operation.

Another idea: use ReplayGain to scan the files. If they are the same song, they should be tagged with the same gain. This will only work on the exact same song from the exact same album. I know of several cases where reissues are remastered at a higher volume, thus changing the replaygain.

EDIT:
You might want to check out http://www.speech.kth.se/snack/, which apparently can do spectrogram visualization. I imagine any library that can visualize a spectrogram can help you compare them.

This link from the official python page may also be helpful.

Score: 2

The Dejavu project is written in Python and does exactly what you are looking for.

https://github.com/worldveil/dejavu

It also supports many common formats (.wav, .mp3, etc.) as well as finding the time offset of the clip in the original audio track.

Score: 1

I'm looking for something similar and I found this:
http://www.lastfm.es/user/nova77LF/journal/2007/10/12/4kaf_fingerprint_(command_line)_client

Hope it helps.

Score: 1

I'd use length as my primary heuristic. That's what iTunes does when it's trying to identify a CD using the Gracenote database. Measure the lengths in milliseconds rather than seconds. Remember, this is only a heuristic: you should definitely listen to any detected duplicates before deleting them.
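A sketch of that first pass, flagging tracks whose millisecond lengths fall within a small tolerance of each other; the 500 ms tolerance is an illustrative assumption, and flagged pairs still need a listen:

```python
def length_duplicates(tracks, tolerance_ms=500):
    """Return candidate duplicate pairs whose lengths differ by at
    most tolerance_ms. tracks: iterable of (name, length_ms) pairs."""
    ordered = sorted(tracks, key=lambda t: t[1])
    pairs = []
    # After sorting, near-duplicates are adjacent, so one linear
    # scan over neighbours is enough for a first pass.
    for (name_a, len_a), (name_b, len_b) in zip(ordered, ordered[1:]):
        if len_b - len_a <= tolerance_ms:
            pairs.append((name_a, name_b))
    return pairs

candidates = length_duplicates([("a.mp3", 243100),
                                ("b.mp3", 243400),
                                ("c.mp3", 251000)])
print(candidates)  # [('a.mp3', 'b.mp3')]
```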

Score: 1

You can use the successor to PUID from MusicBrainz, called AcoustID:

AcoustID is an open source project that aims to create a free database of audio fingerprints with mapping to the MusicBrainz metadata database and provide a web service for audio file identification using this database...

...fingerprints along with some metadata necessary to identify the songs to the AcoustID database...

You will find various client libraries and examples for the web service at https://acoustid.org/
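The web service is a plain HTTP lookup against api.acoustid.org/v2/lookup, passing your client key, the track duration, and a fingerprint produced by Chromaprint's fpcalc. A sketch that only assembles the request URL (the parameter names follow the AcoustID web service documentation; the key and fingerprint are placeholders):

```python
from urllib.parse import urlencode

LOOKUP_URL = "https://api.acoustid.org/v2/lookup"

def build_lookup(client_key, fingerprint, duration_seconds):
    """Assemble an AcoustID lookup request URL.

    client_key is your registered application key (placeholder here);
    fingerprint is the compressed string printed by `fpcalc`.
    """
    params = {
        "client": client_key,
        "meta": "recordings",   # ask for matching MusicBrainz recordings
        "duration": int(duration_seconds),
        "fingerprint": fingerprint,
    }
    return LOOKUP_URL + "?" + urlencode(params)

url = build_lookup("YOUR_API_KEY", "AQADtEmi...", 243)
print(url)
```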

Score: 0

First you need to decode them into PCM and ensure they have a specific sample rate, which you can choose beforehand (e.g. 16 kHz). You'll need to resample songs that have a different sample rate. A high sample rate is not required, since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.

You can use the following commands for that:

ffmpeg -i audio1.mkv -ar 16000 -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -ar 16000 -c:a pcm_s24le output2.wav

And below is code to get a number from 0 to 100 for the similarity of two audio files using Python. It works by generating fingerprints from the audio files and comparing them using cross-correlation.

It requires Chromaprint and FFmpeg installed.

# correlation.py
import subprocess
import sys

import numpy

# seconds of audio to fingerprint
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation;
# offsets with less overlap are skipped
min_overlap = 20
# report a match when cross correlation has a peak exceeding threshold
threshold = 0.5


# calculate fingerprint via Chromaprint's fpcalc
def calculate_fingerprints(filename):
    out = subprocess.run(
        ['fpcalc', '-raw', '-length', str(sample_time), filename],
        capture_output=True, text=True).stdout
    fingerprint_index = out.find('FINGERPRINT=') + 12
    # convert fingerprint to a list of integers
    return list(map(int, out[fingerprint_index:].split(',')))


# return correlation between equal-length prefixes of the two lists
def correlation(listx, listy):
    if not listx or not listy:
        raise ValueError('Empty lists cannot be correlated.')
    size = min(len(listx), len(listy))
    listx, listy = listx[:size], listy[:size]

    covariance = 0
    for x, y in zip(listx, listy):
        # count matching bits between the two 32-bit fingerprint words
        covariance += 32 - bin(x ^ y).count('1')
    return covariance / float(size) / 32


# return cross correlation with listy offset from listx,
# or None when the overlap is below min_overlap
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        listy = listy[-offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        return None
    return correlation(listx, listy)


# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        raise ValueError(
            'span >= sample size: %i >= %i; '
            'reduce span or increase sample_time.'
            % (span, min(len(listx), len(listy))))
    return [cross_correlation(listx, listy, offset)
            for offset in numpy.arange(-span, span + 1, step)]


# return index of the maximum value in the list, ignoring None entries
def max_index(listx):
    best = 0
    for i, value in enumerate(listx):
        if value is not None and (listx[best] is None or value > listx[best]):
            best = i
    return best


def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print('max_corr_index =', max_corr_index,
          'max_corr_offset =', max_corr_offset)
    # report matches
    if corr[max_corr_index] is not None and corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i'
              % (source, target, corr[max_corr_index], max_corr_offset))


def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)
    corr = compare(fingerprint_source, fingerprint_target, span, step)
    get_max_corr(corr, source, target)


if __name__ == '__main__':
    correlate(sys.argv[1], sys.argv[2])

Code converted into Python 3 from: https://shivama205.medium.com/audio-signals-comparison-23e431ed2207

Now you need to pick a threshold, for example 90%, and if a pair passes the threshold, assume it's a duplicate.
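With the correlation peak scaled to a percentage, that decision is a one-liner; the 90% cutoff is just the example value above, to be tuned against your own library:

```python
def is_duplicate(correlation_score, cutoff_percent=90):
    """correlation_score is the 0.0-1.0 peak from the script above;
    return True when it clears the percentage cutoff."""
    return correlation_score * 100 >= cutoff_percent

print(is_duplicate(0.93))  # True
print(is_duplicate(0.55))  # False
```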
