Topic: Unique identifier (Read 12728 times)

Unique identifier

Hi all!

Does anyone know a way or a tool to calculate a unique identifier for an MP3 file (something like a CRC) which

is NOT sensitive to
- changes of filename or file date
- MP3 volume adjustment (e.g. with MP3Gain)
- ID3 tag changes

and IS sensitive only to the music content (e.g. trimming should alter the result)?


I would use it to resync my collection with my friends' collections, which partially consist of "equal" files that each of us has altered in some way (added tags, renamed, ...).

Thanks in advance.

Unique identifier

Reply #1
Hmm... it should be easy enough to code if someone is up for it.

Unique identifier

Reply #2
MusicBrainz uses a patented method to do that.
It's a 'Jump to Conclusions Mat'. You see, you have this mat, with different CONCLUSIONS written on it that you could JUMP TO.

Unique identifier

Reply #3
I think a tool that calculates the MD5 of the audio stream would be enough. You only need to find out where the actual stream starts and where it ends.
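A minimal sketch of that idea in Python, assuming the only metadata present is an ID3v2 tag at the start and an ID3v1 tag at the end. The function names `audio_region` and `audio_md5` are made up for illustration; APEv2 tags, Lyrics3 blocks and Xing/LAME info frames are not handled here.

```python
import hashlib

def audio_region(data: bytes) -> bytes:
    """Return the bytes between a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte,
    # then a 4-byte "sync-safe" size (only 7 bits used per byte).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer flag: 10 more bytes after the tag body
            start += 10
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]

def audio_md5(data: bytes) -> str:
    """MD5 hex digest of the audio region only."""
    return hashlib.md5(audio_region(data)).hexdigest()
```

For a file on disk one would call something like `audio_md5(open(path, "rb").read())`. A hash like this survives renaming and ID3 edits, but not a volume change that rewrites bytes inside the frames.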
"To understand me, you'll have to swallow a world." Or maybe your words.

Unique identifier

Reply #4
Quote
I think a tool that calculates the MD5 of the audio stream would be enough. You only need to find out where the actual stream starts and where it ends.


But this does not rule out the MP3Gain volume change, which I think is impossible to rule out. Probably only if there are tags which show the amount of the change, and the program can temporarily undo these changes. Even that does not guarantee that the two files will be hashed at the same volume level.
Life is Real...
(But not in audio :) )

Unique identifier

Reply #5
Quote
Quote
I think a tool that calculates the MD5 of the audio stream would be enough. You only need to find out where the actual stream starts and where it ends.


But this does not rule out the MP3Gain volume change, which I think is impossible to rule out. Probably only if there are tags which show the amount of the change, and the program can temporarily undo these changes. Even that does not guarantee that the two files will be hashed at the same volume level.

Isn't ReplayGain information stored in APEv2 / ID3v2 tags?

Unique identifier

Reply #6
getID3() computes an md5sum of the audio data alone. It does this for all audio formats.

MORG uses this md5sum to compare two audio collections over the net.

Unique identifier

Reply #7
Quote
Quote
I think a tool that calculates the MD5 of the audio stream would be enough. You only need to find out where the actual stream starts and where it ends.


But this does not rule out the MP3Gain volume change, which I think is impossible to rule out. Probably only if there are tags which show the amount of the change, and the program can temporarily undo these changes. Even that does not guarantee that the two files will be hashed at the same volume level.

Here's a quick test: Take an MP3 with an "improper" level, and run a copy of it through an older version of MP3Gain (older, so that the gain change is applied to the entire file and not simply stored in the tags). Now compare the original and the copy in a hex editor, and you should see that the majority of the data is unchanged; only the bytes describing volume have been modified.

  So such software would need the ability to parse the data and ignore these occasional changes, as well as the ability to identify the actual start and end of the audio stream.
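The hex-editor comparison can also be automated. Below is a small sketch (the helper names `diff_positions` and `changed_fraction` are hypothetical) that lists the byte offsets at which two equally long buffers differ, so one can see how sparse the changes really are.

```python
def diff_positions(a: bytes, b: bytes) -> list:
    """Offsets at which two equally long byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs differ in length; compare regions of equal size")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

def changed_fraction(a: bytes, b: bytes) -> float:
    """Share of bytes that differ, between 0.0 and 1.0."""
    return len(diff_positions(a, b)) / len(a) if a else 0.0
```

Since the gain field is not byte-aligned, a single changed value may show up as one or two differing bytes.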

    - M.

Unique identifier

Reply #8
Quote
But this does not rule out the MP3Gain volume change, which I think is impossible to rule out. Probably only if there are tags which show the amount of the change, and the program can temporarily undo these changes. Even that does not guarantee that the two files will be hashed at the same volume level.

It is easy to do, and I don't think it merits a patent.

Since the scalefactors are changed by mp3gain, and they are logarithmic values, you can simply hash the deltas between them (which are constant) ;-)

It doesn't take a rocket scientist..  or am I one?   

Edit:  or if you don't want the hassle, just hash the MDCT coefficients (which is plenty enough to recognize an audio file within 99.99999999% probability anyway)  ;-)
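The delta idea above can be sketched in a few lines of Python. This assumes the volume tool adds the same constant to every value of some per-granule logarithmic gain field and never clips at the ends of the range; `delta_digest` is a made-up name, and extracting the values from the bitstream is not shown.

```python
import hashlib

def delta_digest(gains) -> str:
    """MD5 over the differences between consecutive 8-bit gain values.

    Adding one constant to every value leaves each difference unchanged,
    so the digest is the same before and after such a shift.
    """
    deltas = [(b - a) & 0xFF for a, b in zip(gains, gains[1:])]
    return hashlib.md5(bytes(deltas)).hexdigest()
```

For example, `[150, 152, 149]` and `[154, 156, 153]` produce the same digest, while a sequence with a different shape does not.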

Unique identifier

Reply #9
As for the MusicBrainz method, it is of course much more complex than a simple selective hash. I think it involves some psychoacoustic stuff so that it can reliably find the same "id" for a song encoded by two different encoders.
It's a 'Jump to Conclusions Mat'. You see, you have this mat, with different CONCLUSIONS written on it that you could JUMP TO.

Unique identifier

Reply #10
Thanks all.

@numlock: yours is what I was thinking about, too, but I don't have enough experience to extract the scalefactors. Can you point me to a resource where this is described in a little detail? (I do understand C++, although not very well.)

@sshd: I will look into it.

Of course it would be good if I can work on raw mp3 data without the need to decode.

Unique identifier

Reply #11
Quote
As for the MusicBrainz method, it is of course much more complex than a simple selective hash. I think it involves some psychoacoustic stuff so that it can reliably find the same "id" for a song encoded by two different encoders.


I once found some docs describing the latest semantic sound matching algorithms for MPEG-7. They were resistant to
- equalization / volume change
- band limitation (due to lossy encoding / FM transmission)
- partial playback

The idea:
- split the signal into subbands
- check for tonality in each subband
- record these time/frequency varying tonality values

But AFAIR they did not generate UIDs or anything like that.

The docs claimed good matching performance.
Unfortunately I don't have the link anymore.
I found it while searching for something else.

bye,
SebastianG

Unique identifier

Reply #12
Quote
Thanks all.

@numlock: yours is what I was thinking about, too, but I don't have enough experience to extract the scalefactors. Can you point me to a resource where this is described in a little detail? (I do understand C++, although not very well.)

@sshd: I will look into it.

Of course it would be good if I can work on raw mp3 data without the need to decode.


Grab the MP3 specification draft at Gabriel's mp3-tech.org page.
There are so-called global_gain factors in the side-info block.

As far as I remember, MP3 frames are built like this:

- 32 bit header
- (optional) crc16 checksum (present if the protection bit in the header is ZERO)
- side-info block (17 bytes for mono, 32 bytes for stereo, in MPEG-1 files)
- main-data
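As a rough illustration of the layout above, here is a Python sketch that decodes the 32-bit header of an MPEG-1 Layer III frame and derives where the side-info block sits. `parse_frame_header` and the dictionary keys are hypothetical names; MPEG-2/2.5 and Layers I/II are deliberately rejected, and free-format bitrates are not supported.

```python
# Bitrate (kbit/s) and sample-rate tables for MPEG-1 Layer III only.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]

def parse_frame_header(header: bytes) -> dict:
    """Decode the 4-byte header of one MPEG-1 Layer III frame."""
    if len(header) < 4:
        raise ValueError("need 4 header bytes")
    word = int.from_bytes(header[:4], "big")
    if word >> 21 != 0x7FF:                 # 11 sync bits, all set
        raise ValueError("no frame sync")
    if (word >> 19) & 0x3 != 3 or (word >> 17) & 0x3 != 1:
        raise ValueError("not MPEG-1 Layer III")
    has_crc = ((word >> 16) & 1) == 0       # protection bit ZERO means CRC present
    bitrate = BITRATES_KBPS[(word >> 12) & 0xF]
    sample_rate = SAMPLE_RATES_HZ[(word >> 10) & 0x3]
    if bitrate is None or sample_rate is None:
        raise ValueError("unsupported bitrate or sample rate index")
    padding = (word >> 9) & 1
    mono = ((word >> 6) & 0x3) == 3         # channel mode 3 is single channel
    return {
        "has_crc": has_crc,
        "mono": mono,
        "side_info_offset": 4 + (2 if has_crc else 0),
        "side_info_size": 17 if mono else 32,
        "frame_length": 144 * bitrate * 1000 // sample_rate + padding,
    }
```

For the common header bytes `FF FB 90 00` (128 kbit/s, 44.1 kHz, stereo, no CRC) this gives a 417-byte frame with the side info starting at byte 4.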

bye,
SebastianG

Unique identifier

Reply #13
My understanding is that the LAME header contains a CRC generated when the MP3 is first created. Not sure if it includes optional LAME-generated ID3 tags or not, but if you don't use LAME to tag, it obviously won't. This CRC value won't get changed by any tagging or MP3Gaining. So just read it - it's already there!

Unique identifier

Reply #14
@jebus: well, sadly this CRC is optional (and it seems that in the real world it is quite often NOT included) and only 16 bits, which is too small for a UID.

But I was successful in using mpg123-source and the scalefactors to make a 32bit-hash.

I wasn't able to solve two problems so far, maybe someone has an idea:

- how many scalefactors are there per granule? There are max 39, but I didn't find how to get the actually used number, and I'm not sure if the unused are REALLY initialised.

- what is the latest/best source of mpg123 regarding syncing? In my version from LAME CVS there were some problems, because it synced on an ID3V2 tag and then of course reported bad layer specs.

thx

Unique identifier

Reply #15
@ WarpEnterprises
I like your idea for a MP3-unique-ID tool, but:

Why do you build the hash using the scalefactors? As NumLOCK wrote, that's the part of the MP3 frames that is changed by mp3gain. You should use the actual samples (MDCT coefficients) instead.

And the second thing: why use a 32bit hash when there are MD5 and other hashes that are "more unique"?

Didn't you know that ID3V2 tags are EVIL? You probably want to strip them from your files and use APE2 tags instead. Hm, just joking...

Unique identifier

Reply #16
- I don't know how to do MD5. Links welcome!

- even ID3V1 is enough for me, but the V2 at least shouldn't crash my app. BTW I saw that there is a new version for the relevant mpg123 (common.c), I will try it out.

- AFAIK mp3gain changes the "global gain" factor, not the scalefactors (those numbers for the used frequency bands) - but I may be wrong there. Do you have some further reference?

- it was too complicated for me to use the MDCTs. You see, I come from the VBA-world

Unique identifier

Reply #17
Quote
- I don't know how to do MD5. Links welcome!

I haven't done a web search yet, but I'm sure there are many ready-to-use implementations of a widely used algorithm like MD5.

Quote
- AFAIK mp3gain changes the "global gain" factor, not the scalefactors (those numbers for the used frequency bands) - but I may be wrong there. Do you have some further reference?

OK, maybe I'm wrong here. A look at the sources of mp3gain should reveal how it works.

Quote
- it was too complicated for me to use the MDCTs. You see, I come from the VBA-world

Don't worry. You don't need to compute the MDCT, just use the samples from the MP3 frame (the input for the inverse MDCT stage of the decoder) for the hash.

Unique identifier

Reply #18
Quote
- I don't know how to do MD5. Links welcome!

Well, I found this the other day:

http://www.frez.co.uk/freecode.htm#md5

It's for VB, but it might be useful still.

I used it to MD5-hash passwords in a database for an ASP login page.

Unique identifier

Reply #19
Quote
Well, I found this the other day:

http://www.frez.co.uk/freecode.htm#md5

It's for VB, but it might be useful still.


Don't use this, as it does not support files (though it's not that difficult to add this feature yourself) and is way too slow. Instead, check
http://rspsoftware.clic3.net/

and their RSPChecksum OCX control. This one is lightning fast, but unfortunately you can't define the byte interval for files. I have already sent them a request for this feature, since I am having the same problem as you.
I have started to write a simple(?) application for cataloguing MP3s and files in general, with a great emphasis on duplicate detection (at the moment it works great for ordinary files). Support for ID3v1 and ID3v2 tags is almost ready; only the checksum over parts of a file is missing, which I hope will be there in a few days (well, it depends on the guys at RSPSoftware).

Unique identifier

Reply #20
Nice to see I'm not alone thinking on this!
@matyas: well, looks nice, but not for free.

Anyway, I think I am finished for my purposes.

This is how I settled and why:

- 32bit is enough for me. This fits nicely in the table (as LONG datatype) of my MS Access frontend. It can be easily expanded by ~8bit by using the total frame count as additional key (which is independent of my UID), so you get a 40bit key space.

- @JensRex: your MD5 link is nice(!) and useable, but I don't need any anti-reverse or short-input protection (which as I read is THE strength of MD5), so any function which distributes the input data evenly is OK I think.

- my algo does this: take a 32bit number, XOR its first bit with the first bit of the input value, then rotate the 32bit number left by one (shift left and re-enter the MSB at the bottom). Then advance to the next input number.

- I scan the first 200 frames and use the LSB of the 39 scalefactor values of each.
Those scalefactors are definitely NOT touched by mp3gain, I tried it out.

- Maybe I have to scan more frames (~1000); I will check at what point it slows down due to disk access and processing time. I must be sure to read past any digital silence at the beginning of a song.

- I use mpg123-lib from LAME-cvs, and disabled (quite ugly) all processing after the extraction of the scalefactors and made a DLL to be called from Access.
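A sketch of how the XOR-and-rotate step described above might look in Python. This is one reading of the description, not the actual DLL code: it assumes "first bit" means the least significant bit, a start value of zero, and that the scalefactor values have already been extracted by some decoder (not shown). `rotl32` and `uid32` are made-up names.

```python
def rotl32(x: int, n: int = 1) -> int:
    """Rotate a 32-bit value left by n bits (the MSB re-enters at the bottom)."""
    x &= 0xFFFFFFFF
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def uid32(values, seed: int = 0) -> int:
    """Fold the least significant bit of each input value into a 32-bit number."""
    h = seed & 0xFFFFFFFF
    for v in values:
        h ^= v & 1        # XOR the lowest bit with the input's lowest bit
        h = rotl32(h)     # rotate left by one position
    return h
```

The result is unsigned; to store it in a signed 32-bit VBA Long, values of 2**31 and above would need 2**32 subtracted. Because each input contributes a single bit and the mixing is linear, this spreads data far less thoroughly than MD5, which matches the stated trade-off.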


OT: one single problem I couldn't solve: how can I prevent the MS VC from decorating the function name with e.g. @4 ? What's the correct "extern" declaration?

Unique identifier

Reply #21
AccessFrontend: other people would need to buy Access :-(

I am working on a VB frontend that should be at least a bit faster. I am thinking of creating an MD5 just from the mp3 content of the file (without any id3 tags), but that - If I understand it correctly - includes the global gain factor. Where do I have to search for those numbers to be able to exclude them?
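One possible way to locate those numbers, sketched in Python under stated assumptions: MPEG-1 Layer III only, with the side-info layout as I recall it from the spec (9 bits main_data_begin, 5 private bits for mono or 3 for stereo, 4 scfsi bits per channel, then 59 bits per granule and channel, in which part2_3_length takes 12 bits, big_values 9 bits and global_gain the next 8). Please verify against the specification before relying on it. All function names are hypothetical.

```python
def read_bits(data: bytes, bit_offset: int, width: int) -> int:
    """Read `width` bits, most significant bit first, starting at `bit_offset`."""
    value = 0
    for i in range(bit_offset, bit_offset + width):
        value = (value << 1) | ((data[i >> 3] >> (7 - (i & 7))) & 1)
    return value

def global_gain_offsets(mono: bool) -> list:
    """Bit offsets of every global_gain field inside one side-info block."""
    nch = 1 if mono else 2
    base = 9 + (5 if mono else 3) + 4 * nch   # bits before the first granule
    return [base + (gr * nch + ch) * 59 + 21  # 12 + 9 bits precede global_gain
            for gr in range(2) for ch in range(nch)]

def global_gains(side_info: bytes, mono: bool) -> list:
    """The 8-bit global_gain values of one frame, granule by granule."""
    if len(side_info) != (17 if mono else 32):
        raise ValueError("unexpected side-info size")
    return [read_bits(side_info, off, 8) for off in global_gain_offsets(mono)]

def mask_global_gains(side_info: bytes, mono: bool) -> bytes:
    """Copy of the side info with all global_gain bits cleared, for hashing."""
    out = bytearray(side_info)
    for off in global_gain_offsets(mono):
        for i in range(off, off + 8):
            out[i >> 3] &= ~(0x80 >> (i & 7)) & 0xFF
    return bytes(out)
```

Feeding `mask_global_gains(...)` plus the untouched main data of each frame into MD5 would then ignore a pure global_gain change.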

Unique identifier

Reply #22
Access: well, in my "environment" it is really common, very quick to develop and I think you could even use a free runtime version.

But back to the main thing:

download DLL + source
usage in VB(A):

Declare Function Mp3UID Lib "c:\myprojects\mp3uid\release\mp3uid.DLL" Alias "mp3uid" (ByVal FileName$) As Long
...
uid = Mp3UID(Path_and_FileName)

I don't fully understand the mp3 specs, so I used the mpg123 source and inserted my code at the right place. I couldn't switch off all of the audio processing because an mp3 file is structured like this:

-each mp3 consists of frames
--each frame has a header (syncword, layer, bitrate,...)
---after the header an optional(!) checksum
----then the sideinfo block, which describes how to interpret the main data
-----then the main data, which is stored BITwise (not BYTE-wise) and quite recursive

The stream is not parsed first and then decoded; those two steps are coded together.

So it is not trivial to find the scalefactors or the other data, I didn't even find a way to see easily how many scalefactors are used (it varies from frame to frame), so I init them to zero (this is not done in the original code) and use all possible.

Unique identifier

Reply #23
Hmm this is gonna be good. Maybe even I could make use of it.
I suppose this dll is fully freeware.  How fast is it? Say, for example how long does it take to process 100 files each having 5MB in average?

Unique identifier

Reply #24
Of course it's free.
on my 900MHz it does ~30 files/sec.
the total length is irrelevant as I scan only the first 200 frames (~5 secs).
As I mentioned this will give dups for files containing long digital silence at the beginning.
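The "200 frames (~5 secs)" figure can be checked with a little arithmetic, assuming MPEG-1 Layer III frames of 1152 samples each and a 44.1 kHz sample rate (`scanned_seconds` is just an illustrative helper):

```python
SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III

def scanned_seconds(frames: int, sample_rate_hz: int = 44100) -> float:
    """Playing time covered by the given number of frames."""
    return frames * SAMPLES_PER_FRAME / sample_rate_hz
```

200 frames come to about 5.2 seconds, and the ~1000 frames considered earlier would cover roughly 26 seconds, so leading silence longer than that would still produce duplicates.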

 