Foolproof MP3 validation?

I recently found a C# sample on the Internet for verifying whether a file is actually an MP3:

Sample here.

It essentially works by looking through the file 4 bytes at a time and ensuring that seven bit patterns within are in the range of values dictated by a 'proper' MP3 header. I modified this source for my own code and found that while it works fine (identifying MP3 files), if given a large enough file (i.e. a WAV), it is likely that at some point 4 bytes will be returned that match the criteria, giving a false positive.

Is there any better way to verify whether a file is an MP3? Something more 'hardened'? If so, is there any source available for perusal? Thanks.

Reply #1
1. Before scanning for a valid MPEG header, one could check for some known file headers (.wav, .aiff, etc.).

2. It makes sense to limit the maximum scan distance to the largest possible MP3 frame size (plus some extra for a possible file header).

3. Once the first MP3 header has been found, it is possible to calculate the frame size and check for the second header. It should start exactly where the current frame ends.
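To make point 3 concrete, here is a minimal sketch in Python (not the C# sample's code, and not taken from any poster's program). It handles Layer III headers only, rejects free-format streams, and uses the standard Layer III frame length formula. The function names and the 4096-byte scan limit are my own assumptions for illustration.

```python
# Layer III bitrate tables in kbit/s, indexed by the 4-bit bitrate index.
# Index 0 ("free format") and index 15 (invalid) are rejected in this sketch.
BITRATES_V1 = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
BITRATES_V2 = [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160]
# Sampling rates keyed by the 2-bit version field:
# 3 = MPEG-1, 2 = MPEG-2, 0 = MPEG-2.5 (1 is reserved).
SAMPLE_RATES = {
    3: [44100, 48000, 32000],
    2: [22050, 24000, 16000],
    0: [11025, 12000, 8000],
}


def frame_length(data, pos):
    """Return the Layer III frame length in bytes if a plausible frame
    header starts at data[pos], otherwise None."""
    if pos + 4 > len(data):
        return None
    b0, b1, b2 = data[pos], data[pos + 1], data[pos + 2]
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:  # 11-bit sync word
        return None
    version = (b1 >> 3) & 0x03
    layer = (b1 >> 1) & 0x03
    bitrate_index = (b2 >> 4) & 0x0F
    rate_index = (b2 >> 2) & 0x03
    padding = (b2 >> 1) & 0x01
    if version == 1 or layer != 1:  # reserved version, or not Layer III
        return None
    if bitrate_index in (0, 15) or rate_index == 3:
        return None
    table = BITRATES_V1 if version == 3 else BITRATES_V2
    bitrate = table[bitrate_index] * 1000
    sample_rate = SAMPLE_RATES[version][rate_index]
    coefficient = 144 if version == 3 else 72
    return coefficient * bitrate // sample_rate + padding


def find_first_frame(data, max_scan=4096):
    """Return the offset of the first header that is followed by a second
    plausible header exactly one frame later, or None if none is found
    within max_scan bytes (an arbitrary limit chosen for this sketch)."""
    for pos in range(min(max_scan, len(data))):
        length = frame_length(data, pos)
        if length is not None and frame_length(data, pos + length) is not None:
            return pos
    return None
```

For a 128 kbit/s, 44.1 kHz MPEG-1 frame without padding, `frame_length` gives 144 * 128000 // 44100 = 417 bytes, so the second header is expected 417 bytes after the first. Requiring more than two consecutive headers would cut false positives further at the cost of reading more data.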

By the way, the C# source mentioned above has at least one error (wrong sampling frequency values for MPEG2.5).

Reply #2
I also don't think a Xing header is in all VBR files.

--

That program is pretty cool.  I've never actually sat down with C# code before, and as a first impression, I find it very readable.  Though it helps that I like Java.

Reply #3
As metaller said, the only way to be reasonably certain that it is in fact an MP3 file is to walk through some or all of the file, skipping from frame header to frame header to verify that the next frame is where you expect it to be. If it is, great, keep on going; if it isn't, assume the previous frame header was invalid and keep scanning from that point until you find the next thing that looks like a frame header. For broken or non-MP3 files this can get quite intense with the byte-by-byte I/O, so abort after running through __kB of data with no valid frame headers and call it broken.

At least that's what getID3() does (you can browse module.audio.mp3.phps at http://www.getid3.org)

Reply #4
Thanks everyone for your replies. You've answered my question. This has led me to my next question, however, which is, how deep into an MP3 file should one reasonably go before finding a header? 1K bytes? 10K?

Would it be reasonable logic to look N bytes into an MP3, and if no valid header is found when the Nth byte is returned to assume it is not an MP3? Or can the first MP3 header start deep into the file (and does it usually)?

Thanks again for all the help.

Reply #5
Unfortunately, with those damned ID3v2 tags, you can't be sure how much space they take
(some mad people even embed pictures).
But the size of the tag is stored in its header, so in that case you can seek forward and look further.

An MPEG header must be present in every frame of the file.

I don't know whether XING or FHG VBR tags are also valid MP3 frames, or just sit in front of the data...
If they are, you'd only need to read about 33 bytes of the file, and more only in the case of ID3v2.
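As a rough illustration of the "seek forward" idea, here is a small Python sketch that reads the size field from an ID3v2 header. The tag starts with "ID3", and its size is stored as a 4-byte "synchsafe" integer (7 useful bits per byte) that excludes the 10-byte header itself. The function name is made up for this example, and the sketch trusts the size field, which a mangled tag could get wrong.

```python
def id3v2_tag_size(data):
    """Return the total number of bytes occupied by a leading ID3v2 tag,
    or 0 if the buffer does not start with a well-formed ID3v2 header."""
    if len(data) < 10 or data[:3] != b"ID3":
        return 0
    major_version = data[3]
    flags = data[5]
    size_bytes = data[6:10]
    # In a synchsafe integer the top bit of every byte must be clear.
    if any(b & 0x80 for b in size_bytes):
        return 0
    size = (
        (size_bytes[0] << 21)
        | (size_bytes[1] << 14)
        | (size_bytes[2] << 7)
        | size_bytes[3]
    )
    # ID3v2.4 may append a 10-byte footer, signalled by flag bit 0x10.
    footer = 10 if major_version >= 4 and (flags & 0x10) else 0
    return 10 + size + footer
```

The returned value is the offset at which the scan for the first MPEG frame header would start. Since some taggers leave garbage behind, it is probably still wise to scan from that offset rather than to expect a header at exactly that byte.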

Reply #6
Call me a pragmatist.  How about just playing the file......

Reply #7
Quote
I don't know whether XING or FHG VBR tags are also valid MP3 frames, or just sit in front of the data...
If they are, you'd only need to read about 33 bytes of the file, and more only in the case of ID3v2.

The Xing VBR header is embedded in a valid (but empty) MP3 frame, but 33 bytes is not enough: you wouldn't be able to identify even this header (for MPEG-1 stereo files it starts at byte offset 36 of that frame).
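The offset of 36 comes from the 4-byte frame header plus the 32 bytes of Layer III side information in an MPEG-1 stereo frame. A small Python sketch of the lookup follows; the function names are invented for this example, and it assumes the frame has no CRC. Checking for "Info" in addition to "Xing" is my addition, since LAME writes that identifier in CBR files.

```python
def xing_offset(header):
    """Given at least the 4 header bytes of a Layer III frame, return the
    byte offset from the start of the frame where a Xing tag would begin."""
    version = (header[1] >> 3) & 0x03          # 3 = MPEG-1, else MPEG-2/2.5
    mono = ((header[3] >> 6) & 0x03) == 3      # channel mode 3 = single channel
    if version == 3:
        side_info = 17 if mono else 32
    else:
        side_info = 9 if mono else 17
    return 4 + side_info                       # 4-byte header + side info


def has_xing_tag(frame):
    """Return True if the frame carries a "Xing" or "Info" identifier at
    the expected offset."""
    offset = xing_offset(frame)
    return frame[offset:offset + 4] in (b"Xing", b"Info")
```

So the minimum read to spot the tag in an MPEG-1 stereo file is 40 bytes from the start of the first frame (offset 36 plus the 4-byte identifier), and less for mono or MPEG-2 files.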

Reply #8
I agree with the other posters here and would like to summarize the suggestions:

(1) Recognize the "usual trash" in front of the MP3 data and skip it. ID3v2 tags and WAV headers are most common. ID3v2 has a size field for this purpose. WAV files must be interpreted using a "chunk parser"; the MP3 data is situated inside the "data" chunk.

(2) "Unknown trash" in front of the MP3 data should be skipped by searching for an MPEG frame header (byte-wise scan).

(3) To recognize false positive matches from step (2), calculate the frame size from the header and look for a second frame header right after the first frame. That second frame header must have the same basic properties (MPEG version, layer, sampling frequency, number of channels) as the first header.

(4) If a false positive match was recognized in step (3), go back to step (2). Scan a few KBytes (not much needed) of the file before you give up and say "not an MP3 file".
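The "chunk parser" from step (1) can be sketched in a few lines of Python. This is only an illustration under the usual RIFF layout (4-byte chunk ID, 4-byte little-endian size, payload padded to an even length), not the code from DelfMPEG. The function name is hypothetical.

```python
import struct


def find_wav_data_chunk(data):
    """Return (offset, size) of the payload of the "data" chunk in a
    RIFF/WAVE buffer, or None if the buffer is not RIFF/WAVE or has no
    "data" chunk."""
    if len(data) < 12 or data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        return None
    pos = 12  # first chunk follows the 12-byte RIFF/WAVE preamble
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (chunk_size,) = struct.unpack_from("<I", data, pos + 4)
        if chunk_id == b"data":
            return pos + 8, chunk_size
        # Skip the chunk header, its payload, and the pad byte that
        # follows an odd-sized payload.
        pos += 8 + chunk_size + (chunk_size & 1)
    return None
```

The frame header scan from steps (2) to (4) would then start at the returned offset. A plain PCM WAV would fail that scan within a few KBytes, while an MP3 stream wrapped in a WAV container would pass.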


The MP3 player program that I wrote a few years ago (for Amiga computers with Delfina DSP sound boards) uses the strategy described above. If you want to look at the source code you'll find it here:
http://ftp.uni-bremen.de/aminet/dirs/amine...ay/DelfMPEG.lha
The file handling part is written in C, not very "beautiful code" but it works... 


Reply #10
Quote
I agree with the other posters here and would like to summarize the suggestions:

...

(4) if a false positive match was recognized in step (3) then go back to step (2). scan a few KBytes (not much needed) of the file before you give up and say "not an MP3 file".


Thanks smack, very well put. This was my conclusion as well after reading the posts, but I hadn't gotten so far in formalizing it. It strikes me as being faster as well as the 'best guess' possible (at the cost of a little more intensive logic).

I'm curious as to why MP3 wouldn't have been designed with some identifying tag or byte/bit sequence at some pre-determined point in the file like RIFF/WAV has. This certainly makes it easy to identify programmatically. Was it out of a desire for flexibility, and/or to allow other data streams to be embedded within (or allow MP3 to be embedded), that they didn't do this?

Not terribly important, I guess, I was just curious.

Reply #11
MP3s do have an identifying bit/byte sequence (well, a range of possible values fitting a certain pattern) that's at the beginning of each frame, and a song/clip is simply a number of frames appended to each other. The problem is that ID3v2 is stuffed at the beginning of the file which breaks that nice pattern, and that wouldn't be so bad if it wasn't for the fact that out of the dozens of ID3v2 taggers out there, several have managed to mangle the beginning of MP3s in different ways (usually manifested as garbage data left behind) - this is primarily what makes it difficult to reliably identify MP3s.

Reply #12
Quote
I'm curious as to why MP3 wouldn't have been designed with some identifying tag or byte/bit sequence at some pre-determined point in the file like RIFF/WAV has

Because an "mp3 file" is not a container, but only a raw bitstream. A container for MPEG audio has only been defined in MPEG-4 (the mp4 file format).

Reply #13
Quote
Quote
I'm curious as to why MP3 wouldn't have been designed with some identifying tag or byte/bit sequence at some pre-determined point in the file like RIFF/WAV has

Because an "mp3 file" is not a container, but only a raw bitstream. A container for MPEG audio has only been defined in MPEG-4 (the mp4 file format).

Ah. Makes better sense now. I guess I had never quite gotten that difference worked out before. Thanks for making it clear(er).