
Verify FLACs with no embedded MD5?

I've recently noticed that some (most?) of my purchased FLAC files do not have embedded MD5 hash data.

Normally I like to use AudioTester (http://www.vuplayer.com/other.php) to bulk check FLACs for corruption. I probably check far more often than I need to, since I never find bad files, lol. I used to have an MD5/SHA file for every single album but that was a pain in the butt to update after every tag change. I also use CUETools to verify ripped stuff.. but not very helpful for web content, heh.

So I guess my question is: is there a way to verify that a FLAC file isn't corrupt without the embedded MD5 data? According to https://xiph.org/flac/features.html, FLAC files contain other CRC data...

When I use FLAC frontend to test one of these bastard md5less files I get

Quote
WARNING, cannot check MD5 signature since it was unset in the STREAMINFO
ok

Does "ok" mean it's okay? lol :)
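
For anyone wanting to bulk-find the MD5-less files first, here is a minimal Python sketch that reads the STREAMINFO block and reports whether the MD5 field is unset (all zero bytes). It assumes a native FLAC stream that starts directly with the `fLaC` marker (no ID3v2 tag prepended); the function name `read_streaminfo` is just an illustration and not part of any existing tool.

```python
def read_streaminfo(data: bytes) -> dict:
    """Parse the STREAMINFO block from the first 42 bytes of a native FLAC file."""
    if len(data) < 42 or data[:4] != b"fLaC":
        raise ValueError("not a native FLAC stream (missing fLaC marker or too short)")
    block_type = data[4] & 0x7F                  # low 7 bits: 0 means STREAMINFO
    length = int.from_bytes(data[5:8], "big")    # 24-bit block length
    if block_type != 0 or length != 34:
        raise ValueError("first metadata block is not a 34-byte STREAMINFO")
    body = data[8:42]
    # Bytes 10..17 pack: 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(body[10:18], "big")
    md5 = body[18:34]
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),   # 0 means "unknown"
        "md5_hex": md5.hex(),
        "md5_set": any(md5),                          # all zeros means unset
    }
```

Usage would be something like `read_streaminfo(open("track.flac", "rb").read(42))` (the filename is a placeholder); a result with `md5_set` false is the case flac warns about above.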

Re: Verify FLACs with no embedded MD5?

Reply #1
Without MD5, there are corruptions that will be noticed very well still, and there are some that will go unnoticed. If you can re-download in a different format like .wav, you can run foo_bitcompare against the FLAC and you'd be safe. Consider that if re-downloading is free (free for you, that is; vendors that supply FLAC with no MD5 must expect extra bandwidth cost from re-downloading).

Also, if foo_bitcompare shows differences, take note of the peak difference in dB; for all I know, they might be using a file conversion that dithers, which has happened to me. If two 16-bit files differ at a peak of about -90 dB, then dither is the reason, and it isn't corruption.


Re: Verify FLACs with no embedded MD5?

Reply #3
Without MD5, there are corruptions that will be noticed very well still, and there are some that will go unnoticed.

Thanks for the reply. Can you give any examples of what will and will not be noticed?

Cheers :)

Re: Verify FLACs with no embedded MD5?

Reply #4
Frames are completely covered by a CRC-8 of the header and a CRC-16 of the whole frame, header included. With or without MD5 there could be corruption in the metadata, but that's about it. The main problem with a missing MD5 is that if corruption is introduced to the audio in a way the CRCs can't detect, there's no second, more expensive check to find it (excluding spectrum analysis or whatever).
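
To make the two checksums concrete, here is a minimal bitwise sketch in Python. The polynomials (0x07 for the CRC-8, 0x8005 for the CRC-16, both starting from zero and without bit reflection) are what the FLAC format documentation specifies as far as I know; the sample bytes are made up and are not a real FLAC frame.

```python
def crc8(data: bytes) -> int:
    """CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def crc16(data: bytes) -> int:
    """CRC-16, polynomial x^16 + x^15 + x^2 + 1 (0x8005), initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8005) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Made-up stand-in for frame bytes; flipping a single bit changes the CRC-16.
frame = bytes(range(64))
damaged = bytes([frame[0] ^ 0x01]) + frame[1:]
print(hex(crc16(frame)), hex(crc16(damaged)))
```

A decoder recomputes both values per frame and compares them with the stored ones, which is the check that runs when you test a file, MD5 or no MD5.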

If a frame is corrupted accidentally, the CRCs will almost certainly not match. But on the off chance that a CRC collision occurs and the frame size is unaffected (required so that the next frame is decoded as normal), undetectable corruption may be introduced. Something like 98% of a FLAC file is residual, so that's probably the most likely source of undetectable corruption. Corruption elsewhere in the frame is unlikely to be undetectable or relevant: if it hits critical model bits, decoding probably fails spectacularly; if it hits reserved bits meant to make the stream syncable, it doesn't result in audio corruption.

The most likely way for undetectable residual corruption to happen accidentally is when Rice codes are not used to store the frame's residual. Then the residual of every sample is stored in n bits, and any of the n × sample_count bits can be flipped at random until the CRC-16 happens to match while the stored values are no longer the originals. When Rice codes are used (which is the majority of the time), the residual encoding takes a variable number of bits, so it is far less likely that accidental corruption results in a valid frame with the same footprint as the original.
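
As a toy illustration of why a 16-bit CRC alone cannot rule this out, the Python sketch below takes a fixed-width payload, flips one bit, and then searches the 65536 possible values of the last two bytes for the one that restores the original CRC-16. It only models the checksum: the payload is made-up data, not a FLAC frame, and the helper names are hypothetical.

```python
def crc16_update(crc: int, data: bytes) -> int:
    """Continue a CRC-16 (polynomial 0x8005, no reflection) from state `crc`."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8005) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def find_colliding_payload(payload: bytes, flip_index: int = 0) -> bytes:
    """Return a same-length payload that differs but has the same CRC-16.

    Flips one bit at `flip_index`, then brute-forces the final two bytes.
    `flip_index` must fall before the last two bytes.
    """
    if len(payload) < 3 or not 0 <= flip_index < len(payload) - 2:
        raise ValueError("need at least 3 bytes and a flip before the 2-byte tail")
    target = crc16_update(0, payload)
    forged = bytearray(payload)
    forged[flip_index] ^= 0x01
    # CRC state after everything except the tail; reused for every candidate.
    state = crc16_update(0, bytes(forged[:-2]))
    for tail in range(0x10000):
        candidate = tail.to_bytes(2, "big")
        if crc16_update(state, candidate) == target:
            forged[-2:] = candidate
            return bytes(forged)
    raise RuntimeError("no tail found (not expected for this polynomial)")


original = bytes(range(32))
forged = find_colliding_payload(original)
print(forged != original, crc16_update(0, forged) == crc16_update(0, original))
```

Put the other way round: random damage that keeps the length intact has roughly a 1 in 65536 chance of slipping past the CRC-16, which is why the accidental case is rare but not impossible.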

So as long as a file passes validation, accidental corruption is unlikely enough not to worry about, even without MD5. Intentional corruption, on the other hand, is trivial: a tool could corrupt the residual (non-Rice-encoded very easily, Rice-encoded not much harder) of frames in an existing FLAC file so that it still passes validation. Not even bad vendors that add watermarks would need to do this, as there are easier ways to add a watermark to a FLAC file without having to encode the entire file: they would selectively re-encode just the frames they want to watermark. Not including an MD5 would allow them to do that very cheaply, so lack of MD5 is a bit of a red flag IMO.

But if you want to be extra safe from bit rot, maybe because you have thousands of non-MD5 FLAC files, you could use a filesystem that has checksums at the block level (Btrfs/ZFS; on Windows that would be ReFS with integrity streams rather than NTFS, I think), and also do backups. By adding an extra layer of checksums you're making it that much more unlikely that corruption occurs undetected. Users who care about file integrity should be using these filesystems anyway when possible.

Re: Verify FLACs with no embedded MD5?

Reply #5
and there are some that will go unnoticed
Can you tell any examples of this?
Before I posted I should have thought once more about how stupid the following obvious example is. Yes, it has happened that some application got the MD5 wrong, so that later verification would scream bloody murder, but an error in the MD5 calculation itself is something where ... uhm, no harm is done if the MD5 is omitted, compared to written wrong.

But apart from that: if STREAMINFO does not tell the duration, how do you know whether the last N frames are missing?
Someone recently posted a file with unidentifiable garbage after the last frame (strangely, it was not identifiable as an ID3v1 tag of the kind EAC used to throw in). How can we tell that it is garbage appended after the last frame and not a corrupted remainder of the audio? MD5-sum the audio without that presumed trailing garbage: if it matches, that verifies the guess that yep, we have all the audio.

Re: Verify FLACs with no embedded MD5?

Reply #6
If the stream uses a fixed blocksize there's only a 1/blocksize chance that the last block isn't smaller than the fixed blocksize. So most of the time you can tell where the end of a stream is if it came to that.

Re: Verify FLACs with no embedded MD5?

Reply #7
Good point. For whatever reason (maybe you just even pointed out why?!), there is no standard block size that is a multiple of the CD frame size of 588. Had 4704 = 8*588 been in the subset ...
(Edit: 588 being divisible by 4 boosts the chances of ending in normal-size block, but your point still stands.)
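
The arithmetic behind the last two replies can be sketched in a few lines of Python. The assumption (mine, for illustration) is that stream lengths are spread evenly over multiples of some granule: 1 sample for arbitrary audio, 588 samples for CD-sourced audio. The function names are hypothetical.

```python
from math import gcd


def last_block_size(total_samples: int, blocksize: int) -> int:
    """Size of the final block of a fixed-blocksize stream."""
    remainder = total_samples % blocksize
    return remainder or blocksize


def full_last_block_odds(blocksize: int, granule: int = 1) -> int:
    """Return N such that about 1 in N stream lengths ends on a full block.

    Lengths are assumed to be uniformly spread over multiples of `granule`.
    """
    return blocksize // gcd(blocksize, granule)


print(full_last_block_odds(4096))        # arbitrary lengths
print(full_last_block_odds(4096, 588))   # CD-frame-aligned lengths
print(full_last_block_odds(4704, 588))   # hypothetical 8*588 block size
```

With 4096-sample blocks, CD-aligned lengths end on a full block about 1 time in 1024 rather than 1 in 4096, because 588 and 4096 share the factor 4. A 4704-sample block would make it 1 in 8, so a full final block would say much less about truncation.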

Re: Verify FLACs with no embedded MD5?

Reply #8
I’ve long thought there should be some sort of ARDB/CTDB, but for verifying WEB releases. There’s really no documentation of what the CRCs should be for any given release.

The only major source I know of for FLAC files that don’t have the MD5 set in the STREAMINFO block is Qobuz, and even then it seems to be caused by some unofficial “ripping” tools.

I have tested these out. In 99% of cases, you will end up downloading an absolutely identical file (in terms of CRC-matching audio data) to what you would get if you legitimately purchased and downloaded the file. The only difference is that the MD5 is already calculated and stored in the purchased file, as it should be, and unset/missing in the other.


To remedy this, one can simply re-encode the file after downloading. This is also a good opportunity to remove embedded album art.


I should note, there is one risk with this… you can’t assume everything on there is error-free, and without a stored checksum, you really don’t know what you’re getting. If you downloaded a corrupted file without a checksum and then re-encoded it, the “new” file would register as having a valid checksum if ever verified again.


I do know of one example of a track that is unfortunately stored as corrupted on Qobuz’s servers. I have not publicly disclosed it as I don’t know of any other examples of a corrupt track which can be used for testing, so it would be a shame if it were to be pulled/replaced.

Downloading the file legitimately (or with some less common tools) you will clearly see that the file is corrupt upon verification.

Downloading it with common public “ripping” tools… with no checksum to verify against, there is no indication anything is wrong, and you are none the wiser. And I doubt this is the only such file out of however many millions on their platform.



Simple answer: if you are obsessive-compulsive about having bit-perfect and 100% accurate lossless files… stay away from any storefront or other method of obtaining files that doesn’t store an MD5 :)

 

Re: Verify FLACs with no embedded MD5?

Reply #9
Back to this
If the stream uses a fixed blocksize there's only a 1/blocksize chance that the last block isn't smaller than the fixed blocksize. So most of the time you can tell where the end of a stream is if it came to that.
First: I don't think there is any user-friendly tool that checks this. If you want to verify a bunch of files, you would then have to script up something that parses the flac -a output and screams if the last frame is a full 4096 samples. So while it is possible to detect, it will go undetected in ... every end-user tool?
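
A rough Python sketch of such a script follows. It assumes the analysis text has one line per frame that starts with `frame=N` and carries a `blocksize=N` field somewhere on the same line; that layout is my assumption about the flac -a output and should be checked against what your flac version actually writes. The function names are made up.

```python
import re

# Assumed line shape: "frame=12<TAB>offset=...<TAB>blocksize=4096<TAB>..."
FRAME_LINE = re.compile(r"^frame=\d+\s.*?\bblocksize=(\d+)\b")


def frame_blocksizes(analysis_text: str) -> list:
    """Collect the blocksize of every frame line, in file order."""
    sizes = []
    for line in analysis_text.splitlines():
        match = FRAME_LINE.match(line)
        if match:
            sizes.append(int(match.group(1)))
    return sizes


def last_frame_is_full(analysis_text: str) -> bool:
    """True if the final frame is as large as the largest earlier frame.

    A full final frame is only a hint that audio may be missing, not proof.
    With fewer than two frames there is nothing to compare, so return False.
    """
    sizes = frame_blocksizes(analysis_text)
    if len(sizes) < 2:
        return False
    return sizes[-1] >= max(sizes[:-1])
```

You would feed it the text of each analysis file and list the files where it returns true; for fixed-blocksize streams, those are the rare cases where the end of the stream cannot be confirmed from the last block alone.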

And then I encountered a corner case here - it isn't exactly the MD5 that makes the difference, but it is still worth worrying about if you try to recover careless deletions. (Yes I have backup. But in such a situation, recover first and ask questions later.)
Here is what looks to have happened:
* File1.flac had been partially overwritten by File2.flac.
* Indeed, it "contains" the entire File2.flac "undamaged" - tags and audio - and that is how it starts out: first the "File2" contents and then the rest of File1. But the filename is File1.flac, so if it verifies, it looks like File1.flac survived without damage - when you look at filename and not tags.
* foobar2000 version 1.6.12 with foo_verifier 1.4.2 thinks it is OK. So does VUPlayer's audiotester.exe. Both use FLAC 1.3.
* foobar2000 version 2.0beta20 (built-in verification) uses FLAC 1.4 and reads through "what should have been EOF" and finds the following:
MD5: 6124F86A9185876DE34B0BE2D4D652A8
CRC32: DC93F4FF
Warning: Garbage at the end of file (ID3 tag?)
Warning: Garbage at the end of file (ID3 tag?)
Warning: Garbage at the end of file (ID3 tag?)
Warning: Garbage at the end of file (ID3 tag?)
Warning: Garbage at the end of file (ID3 tag?)
Warning: Garbage at the end of file (ID3 tag?)
Error: MD5 mismatch
Error: Reported length is inaccurate : 2:31.750000 vs 4:27.292494 decoded