Topic: TrID - File Identifier

TrID - File Identifier

(Sorry for my terrible English.)

Hi!

I'm developing a utility to identify binary files. I think it differs from the others around because it has no hard-coded rules; instead, it can "learn" new file types/formats simply by scanning a group of files of a specific type.

Example:

I want TrID to be able to identify .FLAC files, so:

- I collect a variety of .FLAC files (different sizes, encoding options, some mono, some stereo, etc.) and put them in a directory
- then I simply run: TrIDScan \temp_flac_files\*.flac

It automatically scans all the files, searching for similar patterns. The results are saved in a file named "newtype.trid.xml". Simply rename the file to something like "audio-flac.trid.xml", and edit it so that the "type" element contains something like "FLAC Lossless Audio File".

The new "definition" can then be used by TrID, along with all the others already collected, to identify any specified file.
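To illustrate the idea, here is a minimal Python sketch. It is not TrID's actual algorithm, only an assumption about the simplest form such "learning" could take: keep every byte offset near the start of the file where all sample files share the same value. The function names and the 64-byte window are hypothetical.

```python
from pathlib import Path


def learn_common_bytes(samples, window=64):
    """Return {offset: byte_value} for offsets where every sample agrees."""
    if not samples:
        return {}
    # Only compare offsets that exist in every sample.
    limit = min(window, min(len(s) for s in samples))
    first = samples[0]
    return {
        i: first[i]
        for i in range(limit)
        if all(s[i] == first[i] for s in samples)
    }


def matches(data, pattern):
    """True if data carries the learned value at every learned offset."""
    # An empty pattern carries no information, so it never matches.
    return bool(pattern) and all(
        off < len(data) and data[off] == val for off, val in pattern.items()
    )


def learn_from_directory(directory, glob="*.flac", window=64):
    """Read the first `window` bytes of each matching file and learn from them."""
    samples = [p.read_bytes()[:window] for p in Path(directory).glob(glob)]
    return learn_common_bytes(samples, window)
```

With a more varied set of samples, fewer offsets survive, so the pattern gets less specific but also less likely to reject valid files.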

I have already collected various definitions (for EXE files, some bitmap formats, etc.), but what better place than Hydrogen Audio to ask for help with new definitions for any existing audio format?

TrID and TrIDScan are currently in an early beta state, but I think they can already be useful.
The .NET Framework is required (or Mono under Linux, although TrID does not seem to work correctly with it at the moment).

If you want to try it and play with it, feel free to download it from here: http://mark0.ngi.it/software-net-trid.html#download
Any comments are welcome!

The page is only in Italian for now, but I hope the screenshots help to present some information in an understandable form. As you can see, my English is not very good!

If you produce a definition, send it to my e-mail address and I'll add it to the library of available definitions. My address: marcopon@nospam@myrealbox.com

Thx for your attention,
Bye!

P.S. If there is a better-suited section for a post like this, tell me. Thx!

Reply #1
Just an addition.
I have adjusted my home page, so maybe now it's a little clearer.

I think that if we can build definitions for various versions of a file format (e.g. MPC / Musepack), it could help to determine whether a file that doesn't play was made with an outdated / unsupported version, or whether it was renamed with a wrong extension, etc.

Bye!

Reply #2
I have added various definitions for audio files. As of now, these are the known audio formats:
Code:
Ext   File Type
---------------------------------
FLAC  Free Lossless Audio Codec
LA    LosslessAudio Compressor
OFR   OptimFROG encoded Audio
OGG   OGG Vorbis Audio
RK    RKAU encoded Audio
RM/RA Real Audio
SHN   Shorten encoded Audio
SPX   Speex encoded audio
VOC   Creative Voice (Audio) File
WAV   RIFF/WAVe standard Audio
WMA   Windows Media Audio
WV    WavPack encoded Audio


Here you can find a list of all supported files types:
http://mark0.ngi.it/soft-trid-deflist.html

And now there are Win32 versions of TrID & TrIDScan, so there is no need to have the .NET Framework installed.
They can be downloaded from here:
http://mark0.ngi.it/soft-trid.html

If someone could generate definitions for other audio formats (or even non-audio ones, but audio is the topic here!), I will be more than happy to add them to the list.

Bye!

Reply #3
More new audio file types are now recognized:
Code:
AIFF  AIFF (Audio Interchange File Format)
APE   Monkey's Audio
AU    NeXT/Sun uLaw/AUdio format
BONK  BONK lossless/lossy audio compressor
IFF   Amiga IFF 8SVX Audio

I wish to thank James Heinrich for some of these.

I have just put up a page for TrID in English: TrID Home Page (ENG)

If anyone can and wants to contribute other audio file type definitions, I will promptly add them to the public database (now counting over 230 definitions), for everyone's convenience.

Thx,
Bye!

Reply #4
Some new audio/music formats added to the database:

Code:
ADA   Advanced Digital Audio compressed audio
APC   APAC compressed audio
DAX   DAX compressed audio
KAR   Karaoke MIDI
MCP   MUSICompress encoded audio
MIDI  MIDI Music
MP3   MP3 Xing Encoder
MPC   Musepack Encoder (SV7.0)
PAC   LPAC - Lossless Predictive Audio Compression
RBS   Propellerhead Software ReBirth Song
REX   ReCycled Audio Loop Export
RPS   Propellerhead Software Reason Song
SKYT  SKYT/Drifters Packer song
SPC   Super NES music/audio data
UAX   Unreal Audio
VQF   TwinVQF


The total number of definitions/file types is now over 350:
Definitions list

Bye!


Reply #6
This kind of learning algorithm to spot file types sounds really cool.

And it got me thinking!

If it can recognise so much, perhaps it can also learn to spot whether a VBR MP3 file is encoded with LAME --alt-preset VBR and so on, and compare it to lower quality VBR not using the alt-presets.

In other words, it could become a trained, possibly refined tool like encspot. I guess it depends how much of the file it looks at. An oversized id3v2 header tag could occupy megabytes if it contains images!

Even with encspot, it's possible to deduce more by looking at lowpass etc, which is retrieved from the LAME/XING VBR header frame.
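As a rough illustration of the two points above, here is a Python sketch that skips a leading ID3v2 tag (however large) and then looks for the "Xing"/"Info" marker of a VBR header frame. It is only a sketch under assumptions: the function names and the 256-byte search window are hypothetical, and the plain substring search is much cruder than what a dedicated tool like EncSpot would do.

```python
def id3v2_tag_size(data: bytes) -> int:
    """Total size in bytes of a leading ID3v2 tag, or 0 if none is present."""
    if len(data) < 10 or data[:3] != b"ID3":
        return 0
    size_bytes = data[6:10]
    # The size is a "synchsafe" integer: four bytes, 7 usable bits each.
    if any(b & 0x80 for b in size_bytes):
        return 0  # high bit set: not a valid synchsafe size
    size = (
        (size_bytes[0] << 21)
        | (size_bytes[1] << 14)
        | (size_bytes[2] << 7)
        | size_bytes[3]
    )
    # ID3v2.4 may append a 10-byte footer, signalled by bit 4 of the flags.
    footer = 10 if data[5] & 0x10 else 0
    return 10 + size + footer


def has_xing_marker(data: bytes, search_window: int = 256) -> bool:
    """Crude check for a Xing/Info header just after any ID3v2 tag."""
    start = id3v2_tag_size(data)
    head = data[start:start + search_window]
    return b"Xing" in head or b"Info" in head
```

Because the tag size is read from the tag header itself, the check does not depend on how many megabytes of embedded images the tag contains, as long as the audio data after it is available.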

Reply #7
Quote
I created some IDs from rare file formats.

Thanks, you have mail!

Bye!

Reply #8
Quote
In other words, it could become a trained, possibly refined tool like encspot. I guess it depends how much of the file it looks at. An oversized id3v2 header tag could occupy megabytes if it contains images!

Even with encspot, it's possible to deduce more by looking at lowpass etc, which is retrieved from the LAME/XING VBR header frame.

TrID can easily identify Xing MP3s because that encoder puts a very "particular" header in the file. Other than that, I think identifying MP3s is a task that requires a very specialized tool (like EncSpot).
TrID has no hard-coded rules, so I don't think there is much room for improvement here (apart from the ID3 tag).

Bye!

Reply #9
I tested TrID some days ago with some of my Ogg Vorbis files, and they were recognized as an obscure file type (I had never heard of it before and can't remember it now) with 100% probability. I'll test it again in a few days and let you know about it.

Reply #10
I think your project is very interesting, but your approach might be wrong as the results are not very good:


FLAC 0.3 - 0.4
100.0% (.FRO) A-Robots Fighting Robot Object (3/3)


FLAC 0.5 - 0.9
Unknown


LA 0.2
100.0% (.PIF) Windows Program Information (126/18)


LA 0.4
65.2% (.LA) La Lossless Audio compressed (v0.4) (4018/2)
32.7% (.LA) La Lossless Audio compressed (generic) (2018/2)
  2.0% (.PIF) Windows Program Information (126/18)


Monkey's Audio 3.80
Unknown

Reply #11
Yes, it all depends on how accurate a definition is.
This is why I asked here at HydrogenAudio for new or better definitions, especially for audio formats, as I'm sure that some of you have very extensive collections of audio files.

For example, to obtain a FLAC definition, I tried to collect a variety of .FLAC files, made with different encoding options, stereo or mono sources, etc. Let's say I ended up with some 20 files.
Then I ran TrIDScan against that bunch of files, and it identified some recurring patterns. It may be that different versions of the encoder alter the value of some bytes (maybe simply a version flag), or some other things.

If someone has a bigger and more varied collection of FLAC files, for example, they could surely build a better definition. Or even several definitions, one for each encoder version, or for files with some tag or other.

I will promptly update the public DB with every new definition sent, giving proper credit on the definitions list page.

Thx,
Bye!

Reply #12
I have updated the Ogg Vorbis & FLAC definitions; if someone wants to try them and tell me whether they work better...
Here are the links:

229KB ZIP
85KB RAR

Thx,
Bye!

Reply #13
Now LA v0.2 encoded files are correctly detected.
The LA "generic" definition was updated too; it works for all current and old LA formats, and should work with future releases with no problems.

An LA v0.2 file is now recognized as follows:
Code:
66.7% (.LA) La Lossless Audio compressed (v0.2) (4007/2)
33.3% (.LA) La Lossless Audio compressed (generic) (2004/2)
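
Judging only from the numbers in the listing, each percentage looks like a definition's points divided by the total points of all matching definitions (4007 and 2004 here). The Python sketch below works from that assumption, which the thread does not confirm; the function name is hypothetical and the second number in the brackets is not interpreted.

```python
def match_percentages(points):
    """Turn {definition_name: points} into {definition_name: percent of total}."""
    total = sum(points.values())
    if total == 0:
        return {}
    return {name: 100.0 * p / total for name, p in points.items()}


result = match_percentages({"LA v0.2": 4007, "LA generic": 2004})
for name, percent in result.items():
    print(f"{percent:5.1f}% {name}")
```

The same arithmetic also reproduces the 65.2% / 32.7% / 2.0% split reported earlier in the thread for the LA 0.4 file (4018, 2018 and 126 points).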


Thanks to Michael Bevin for letting me download the old 0.2 version!

Btw, I have added over 50 definitions for various bitmap formats. Now about 80 bitmap file types are recognized, from Texas Instruments "*.92i" to X Windows "*.xwd".

Bye!

Reply #14
A new audio format added:

Code:
LQT  Liquid Audio


I have scanned only 4 spare .LQT files. If someone has some more files at hand and can try TrID on them to see whether it identifies them correctly, thanks in advance.
The total number of file types is now near 470.

288KB ZIP
100KB RAR

Bye!