Getting deterministic audio hash of MP3

2022-01-19 05:38:25

I have many GBs of MP3s scattered about. I was thinking of converting them to PCM and running an MD5 of the PCM to determine if they are exact dupes of other MP3s I have. I can't do a file hash because the tags can change. Or maybe I can get some kind of hash from the compressed audio data? I don't want a "fingerprint" that can for example consider different masters of the same song to be the same.

Is decoding deterministic? Is it at least deterministic within the same modern decoder? I would probably use LAME or FFmpeg, but I am a noob so maybe there's something better.

Are lossy formats generally deterministic? Are there any examples of non deterministic decoders?

https://stackoverflow.com/questions/25303201/does-lossy-decompression-always-generate-same-output

https://hydrogenaud.io/index.php?topic=121774.0

Re: Getting deterministic audio hash of MP3

Reply #1 – 2022-01-19 06:35:43

Quote from: channel on 2022-01-19 05:38:25

I have many GBs of MP3s scattered about. I was thinking of converting them to PCM and running an MD5 of the PCM to determine if they are exact dupes of other MP3s I have. I can't do a file hash because the tags can change. Or maybe I can get some kind of hash from the compressed audio data?

You can try https://foobar.hyv.fi/?view=foo_audiomd5 . As far as I understand, it can hash the mp3s and it can hash the decoded PCM.

There are mp3 duplicate finders around. Acoustic fingerprinting is more advanced, so less advanced software will have to resort to decoding and comparing. However,
* I have found that some of them fail to identify taggedwithid3v1.mp3 and taggedwithid3v2.mp3
* Others actually just decode the beginning of the file. If the first minute matches, then that is sufficient eh? Not in my case, when I had a data loss with files' individual blocks being overwritten. thissong.mp3 would be corrupted in an easily detected way, but thatsong.mp3 would play as usual until 3:45 and then encounter the stream of Some Other Song. Some duplicate finders would identify thatsong.mp3 with the one from my backups out of comparing only the first minute or whatever.
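For hashing the compressed audio data rather than the decoded PCM, here is a rough Python sketch: skip a leading ID3v2 tag and a trailing ID3v1 tag, then MD5 everything in between, so the whole stream is compared and not just the first minute. The function names are made up for illustration, and it deliberately ignores APEv2 and Lyrics3 tags, so files carrying those would still hash differently.

```python
import hashlib

def audio_span(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 syncsafe size bytes.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
    # ID3v1 tag: fixed 128 bytes at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]

def compressed_md5(path: str) -> str:
    """MD5 of the file contents between the ID3 tags (hypothetical helper)."""
    with open(path, "rb") as f:
        return hashlib.md5(audio_span(f.read())).hexdigest()
```

With this, taggedwithid3v1.mp3 and taggedwithid3v2.mp3 would get the same hash as long as the bytes between the tags are identical.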

Still if you are happy with first running a duplicate finder to detect and delete from your old computer's C:\user\me\Desktop\oldjunk and then scrutinizing closer what is left to see if there is any folder that should have been copied to your new library ... DupeGuru is one I have used.

Quote from: channel on 2022-01-19 05:38:25

Is decoding deterministic? Is it at least deterministic within the same modern decoder? I would probably use LAME or FFmpeg, but I am a noob so maybe there's something better.

Short version: With the same decoder (same build on same OS) it is. It should be, since MP3 is "defined" in terms of how it decodes.

But then the "fine print":
* Output has a finite resolution, and there will be roundoff errors. "Nobody" will knock an mp3 decoder for only being accurate to the 24th bit. Except hashes will cry wolf if two decoders round off differently.
* I was playing around with mp3packer, which losslessly repacks mp3s (like CBR into VBR) without actually decoding them, and while they technically represent the same output stream, again there were round-off errors even with 32-bit floating-point decoding.
* Also if you have applied fixes for broken headers ... problem is, once these things are broken, some players will make the guess that it is supposed to be audio, some will make the guess that it is not, and if you overwrite using this tool and later decode using that tool, they may differ. Even if it is just a malformed tiny padding chunk that would not play at any volume ... and who knows, what if you in one version of the file used the same tagging application that would write some other bits to that same chunk? A decoder interpreting both as audio would think they are different.
* And also gapless playback ... headers might dictate that the audio be shifted a bit to the left or right. foo_bitcompare can align the two and tell you that they are the same once the offset is applied, but then you need to know which two files to compare.
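The offset problem can be illustrated with a small Python sketch: given two already-decoded sample sequences, search a small range of shifts for one where the overlapping parts match exactly. This only shows the idea and is not how foo_bitcompare is implemented; `find_offset`, `max_shift` and the half-length overlap rule are assumptions of mine.

```python
def find_offset(a, b, max_shift=4096):
    """Return a shift s with a[i + s] == b[i] on the whole overlap, or None.

    a and b are sequences of samples (e.g. lists of ints).
    Positive s means `a` has s extra leading samples compared to `b`.
    """
    # Require a decent overlap so a tiny coincidental match does not count.
    min_overlap = max(1, min(len(a), len(b)) // 2)
    # Try small shifts first: 0, -1, 1, -2, 2, ...
    for s in sorted(range(-max_shift, max_shift + 1), key=abs):
        a_part = a[s:] if s >= 0 else a
        b_part = b if s >= 0 else b[-s:]
        n = min(len(a_part), len(b_part))
        if n >= min_overlap and a_part[:n] == b_part[:n]:
            return s
    return None
```

A plain hash cannot do this, which is why two files that differ only in their gapless offset will still come out as different.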

Quote from: channel on 2022-01-19 05:38:25

Are lossy formats generally deterministic? Are there any examples of non deterministic decoders?

They typically are deterministic (modulo roundoff as above), but yes there are examples of non-deterministic decoding by design.
And you have bugware like Microsoft WMA, which might encode two copies of the same file differently.

For non-deterministic by design: First a deterministic example, what MPEG-4 AAC calls perceptual noise substitution. Noise sounds like noise, and can be substituted by instructions that say "play noise with these parameters".
But such a thing can actually be left to the decoder to generate. If I understand correctly, the Musepack format provides for that in its lower "radio" modes.

Re: Getting deterministic audio hash of MP3

Reply #2 – 2022-01-19 08:54:47

I second foo_audiomd5. I've used it for this purpose recently.
Note that not all mp3s play nice with it and you might get a mismatch when you verify them. In my case it was caused by Lyrics3v2 tags. You can fix/remove that with something like Mp3tag.

Re: Getting deterministic audio hash of MP3

Reply #3 – 2022-01-19 11:22:29

I would suggest using either the integer MP3 decoder in FFmpeg or libMAD, and hashing the integer PCM that comes out of it. The two are not interchangeable with each other: you need to compare the same decoder to itself.
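A sketch of the FFmpeg route in Python, assuming an `ffmpeg` binary on the PATH and that its fixed-point MP3 decoder is the one named `mp3` (with `mp3float` being the floating-point one); check `ffmpeg -decoders` on your build before relying on that. The function names are mine. It decodes to raw signed 16-bit PCM on stdout and MD5s that stream.

```python
import hashlib
import subprocess

def decode_command(path, decoder="mp3"):
    """Build an ffmpeg argv that writes raw 16-bit little-endian PCM to stdout."""
    return [
        "ffmpeg", "-v", "error",
        "-c:a", decoder,   # input option: force this decoder (assumed fixed-point)
        "-i", path,
        "-map", "0:a:0",   # first audio stream only, skips embedded cover art
        "-f", "s16le", "-c:a", "pcm_s16le",
        "-",               # write to stdout
    ]

def hash_stream(stream, chunk_size=1 << 16):
    """MD5 a binary stream chunk by chunk so large files need little memory."""
    md5 = hashlib.md5()
    while chunk := stream.read(chunk_size):
        md5.update(chunk)
    return md5.hexdigest()

def pcm_md5(path):
    """Decode one file with ffmpeg and return the MD5 of its PCM output."""
    proc = subprocess.Popen(decode_command(path), stdout=subprocess.PIPE)
    digest = hash_stream(proc.stdout)
    if proc.wait() != 0:
        raise RuntimeError(f"ffmpeg failed on {path}")
    return digest
```

Hashes from this are only comparable with other hashes made by the same decoder, ideally the same FFmpeg build.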

I also suggest that if using libMAD, use the unclipped 4.28 fixed point output of the library, rather than down-converting it to anything else.

A fixed point decoder is more likely to be deterministic than a floating point decoder, especially when the floating point code runs on different FPUs or uses different floating point instruction sets. As long as any SIMD optimization uses the same integer precision, an integer decoder should be fine.
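Why fixed point is the safer bet can be shown with a toy Python example (not decoder code): floating point addition is not associative, so code that sums the same terms in a different order, for instance on a different SIMD path, can land on a different last bit, while integer addition gives the same result in any order. The 28 fractional bits mirror the 4.28 format mentioned above; `to_fixed` is a hypothetical helper.

```python
FRACTION_BITS = 28  # 4.28 fixed point: 28 bits after the binary point

def to_fixed(x: float) -> int:
    """Quantize a float to an integer with 28 fractional bits."""
    return round(x * (1 << FRACTION_BITS))

terms = [0.1, 0.2, 0.3]

# Same three floats, two summation orders, two different results.
float_left = (terms[0] + terms[1]) + terms[2]
float_right = terms[0] + (terms[1] + terms[2])

# Same terms quantized to integers first: order no longer matters.
fixed = [to_fixed(t) for t in terms]
fixed_left = (fixed[0] + fixed[1]) + fixed[2]
fixed_right = fixed[0] + (fixed[1] + fixed[2])

print(float_left == float_right)  # False
print(fixed_left == fixed_right)  # True
```

One differing bit anywhere in the PCM is enough to change the MD5 completely, which is why the float case matters for hashing even though it is inaudible.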
