Topic: Unique identifier (Read 15247 times)

Unique identifier

Reply #25
...And a 20 minute mp3 will have the same CRC as the first 20 seconds of this mp3, made with mp3directcut.... Am I right?
Life is Real...
(But not in audio :) )


Unique identifier

Reply #27
@gambit: no, it is sensitive to gain change
@kalmark: yes, they get the same checksum. But as I mentioned earlier, the total frame count can (and should) be used as an ADDITIONAL key, and then the pair is unique again. In principle it would be better to scan the whole file, but my implementation is too slow for that.
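
A minimal sketch of the "checksum plus frame count" idea, for illustration only. The names `TrackKey` and `track_key`, the 64 KB prefix length and the frame counts below are assumptions, not taken from the poster's program, and the frame count is expected to come from a separate MP3 parser that is not shown here.

```python
import zlib
from typing import NamedTuple


class TrackKey(NamedTuple):
    """Composite identifier: CRC of the leading audio bytes plus total frame count."""
    prefix_crc: int
    frame_count: int


def track_key(audio: bytes, frame_count: int, prefix_len: int = 64 * 1024) -> TrackKey:
    # Only the first prefix_len bytes are checksummed, which keeps the scan fast.
    # frame_count must be supplied by an MP3 frame parser (hypothetical, not shown).
    return TrackKey(zlib.crc32(audio[:prefix_len]) & 0xFFFFFFFF, frame_count)


# Stand-in data: a "full" track and a cut that keeps only its beginning.
full = bytes(range(256)) * 1024
cut = full[:100_000]

key_full = track_key(full, frame_count=46000)  # made-up frame counts
key_cut = track_key(cut, frame_count=766)

print(key_full.prefix_crc == key_cut.prefix_crc)  # the prefix checksums collide
print(key_full == key_cut)                        # the composite keys do not
```

The cut file still shares its prefix checksum with the original, so the frame count is what separates the two keys.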

I'm sure it can be made MUCH faster - does anyone know of code which is clearly separated into parsing and decoding?


Unique identifier

Reply #29
I volunteer for KBFR here in lovely Boulder, Colorado. I'm very interested in finding out more about what people are doing to detect duplicate mp3s/oggs/flacs/wavs/etc. for two reasons: (1) Our file server is a four-disk RAID array totaling 447 GB of storage, and it's getting close to capacity. Dropping some money on a new hard disk wouldn't be so bad, except we'd have to rebuild the box we use: get a new case and power supply, reformat and reinstall the RAID array from backup, etc. (2) I had the opportunity to do some machine learning research this last semester on probabilistic classifiers. You can read about some of it here

I'm fairly confident that I could write a program using some of the stuff I was researching that would give you the true probability of two mp3s being duplicates. Of course, this would require some training data and putting some thought into a good feature set.

I can generate a good-sized training set from the current KBFR library. Since I'd imagine most people have the same issues with duplicate mp3s that we do, the SVM model that I build from it (SVMs are the way of the future) would probably apply pretty well to most people's libraries. But they could generate their own training data if they wanted to.

Obvious features would include differences in the ID3 tag information, the filename, the length, etc. But the results could probably be improved by adding features based on things more intrinsic to the audio itself, which would be necessary to make useful comparisons between songs encoded in different ways (VBR, ABR, CBR, different bit rates, sample frequencies, low-pass settings, etc.).

MD5s for each of the sound files would be a good feature to include, but I'm wondering if any of you have other bright ideas? I know Audacity has a built-in beat finder, and code that I could legally pilfer. So it would be possible to use a measure of the frequency of the beats found. Any other ideas along the same lines?

Any code I write I'll release under the GPL, and I'll give credit to anyone who gives me an idea I use in the program.

Unique identifier

Reply #30
It seems like in the near future there will be a lot of software with a more or less similar feature set. My own application is half-way done; I am just waiting for a third-party module to be finished.

Unique identifier

Reply #31
I made another bit of software for finding dups.
This one compares words (=titles) or letters (=artists), but unlike a comparison done by sorting it does this:
- convert to lowercase
- strip all chars which are not a-z 0-9
- compare each entry with every other and find the best fit (by "shifting" the words)
- find neighbour words that are equal

So you get a value of 0-100 for each pair.
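
A rough sketch of this kind of scoring, not the code from the zip above. It follows the first two steps (lowercase, keep only a-z and 0-9) but swaps in Python's standard `difflib.SequenceMatcher` for the "shifting" and neighbour-word steps, so the scores will differ from the tool's. The function names and the threshold of 80 are assumptions.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def similarity(a: str, b: str) -> int:
    """Return a 0-100 score for how closely two normalized strings match."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(ratio * 100)


def likely_duplicates(titles: dict[int, str], threshold: int = 80) -> list[tuple[int, int, int]]:
    """Compare every entry with every other; keys play the role of the index column."""
    hits = []
    for (key_a, title_a), (key_b, title_b) in combinations(titles.items(), 2):
        score = similarity(title_a, title_b)
        if score >= threshold:
            hits.append((key_a, key_b, score))
    return sorted(hits, key=lambda hit: -hit[2])


library = {1: "Hello, World!", 2: "hello world", 3: "Something Else"}
for key_a, key_b, score in likely_duplicates(library):
    print(key_a, key_b, score)
```

Comparing every pair is quadratic in the number of entries, which is fine for a few thousand titles but slow for much larger lists.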
The exe takes a file for input and creates one for the result (and it needs no MS Access).
The first column is used as an index (key) number column.

Maybe it helps someone.

http://www.avisynth.org/warpenterprises/fi...ch_20040610.zip

Unique identifier

Reply #32
Wow great. I had some questions, but after reading the readme file, all are answered. so... Gotta try it out.

Unique identifier

Reply #33
Quote
It seems like in the near future there will be a lot of software with a more or less similar feature set. My own application is half-way done; I am just waiting for a third-party module to be finished.


Does anyone know whether any of those programs ever showed up?
Heh, I am also working on one :/

Unique identifier

Reply #34
@sn0wman:
Yes, it was finished long ago :-) and I have used it to sort out duplicates from more than 4000 mp3s.

The principle is that I let "The Godfather" build its DB (together with the MD5 hashes, which do not include the tags - this is already HALF OF THE SUCCESS). Then, with my own Access frontend to the Godfather DB, I look up and delete the duplicates, either automatically or semi-automatically (list the dupes, listen to "A" and "B" and delete the one that I don't like).

The only reason this frontend was not published is that it is very much a "developer version" without any really intuitive way to configure it. As far as I know, the maker of The Godfather intends to set up a page for such small utilities, where my frontend will also be listed.

Matyas

Unique identifier

Reply #35
Is tag skipping the only thing The Godfather does when computing the MD5? My app returns a different MD5 for the same purpose.

Unique identifier

Reply #36
I would say omitting the tags while computing the MD5 has great advantages. For TOTAL binary (or MD5 hash based) comparison there are a lot of other utilities out there, but as far as I know TGF is the only application that can detect the same music even if tagged differently or not tagged at all.
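
To illustrate the idea (this is not TGF's actual code, which is not shown in this thread), here is a small sketch that hashes only the bytes between an ID3v2 tag at the start of the file and an ID3v1 tag at the end. The function names are made up for the example; the tag layouts follow the published ID3 formats.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return the bytes between a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2: 10-byte header = "ID3", 2 version bytes, 1 flag byte, 4-byte syncsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # flag bit: a 10-byte footer follows the tag
            start += 10
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_md5(data: bytes) -> str:
    """MD5 of the file contents with ID3v1/ID3v2 tags left out."""
    return hashlib.md5(strip_id3(data)).hexdigest()


# Stand-in "audio" plus synthetic tags, just to show that tagging does not change the hash.
audio = b"\xff\xfb" + bytes(500)
id3v2 = b"ID3\x04\x00\x00" + bytes([0, 0, 0, 20]) + bytes(20)
id3v1 = b"TAG" + bytes(125)

print(audio_md5(audio) == audio_md5(id3v2 + audio + id3v1))
```

Two tools will only agree on such a hash if they skip exactly the same set of tags, so any tag type one of them ignores (for example APE tags on mp3) leads to different values.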

Regards,
Matyas

Unique identifier

Reply #37
I mean my application also does that, so the values it returns should match The Godfather's, but they do not.

Unique identifier

Reply #38
I have implemented my own jukebox, which computes an MD5 sum of the first 64 KB of song data in mp3 and Ogg Vorbis files. ID3 tags and Ogg comments are of course skipped. I have tried this with 6000 songs and did not detect any collisions.

This feature is used when songs are moved or renamed. The jukebox is then able to rediscover songs referenced in the statistics or in a playlist.
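
A small sketch of how such a rediscovery step could look; it is not taken from JavaTunes. It assumes the tag or comment block has already been removed from each file's bytes, and the names `fingerprint` and `relocate` are invented for the example.

```python
import hashlib

FINGERPRINT_BYTES = 64 * 1024  # hash only the first 64 KB of song data


def fingerprint(audio: bytes) -> str:
    """MD5 of the leading song data (tags are assumed to be stripped already)."""
    return hashlib.md5(audio[:FINGERPRINT_BYTES]).hexdigest()


def relocate(old_index: dict[str, str], current_files: dict[str, bytes]) -> dict[str, str]:
    """Map each previously known path to the path where the same audio lives now.

    old_index:     fingerprint -> path recorded earlier (e.g. in a playlist)
    current_files: path -> tag-free audio bytes found on disk now
    """
    by_print = {fingerprint(audio): path for path, audio in current_files.items()}
    return {old: by_print[fp] for fp, old in old_index.items() if fp in by_print}


song = b"\x01" * 70_000
other = b"\x02" * 70_000
old_index = {fingerprint(song): "a/old.mp3"}

print(relocate(old_index, {"b/new.mp3": song, "c/other.mp3": other}))
```

Because only the first 64 KB is hashed, two files that differ only later on get the same fingerprint, which is the trade-off discussed earlier in the thread.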

http://www.stigc.dk/projects/JavaTunes/

Unique identifier

Reply #39
OK, I have checked it at a low level. It omits only the ID3 tags and no others, or at least not APE2.

Unique identifier

Reply #40
Quote
OK, I have checked it at a low level. It omits only the ID3 tags and no others, or at least not APE2.


As this is not a standard, TGF does NOT support mp3 files with APE2 tags. Only ID3v1 and ID3v2 tags are recognized on mp3 files. APE tags are processed correctly on files where they natively belong.

I can almost hear some people saying, "Foobar uses APE for mp3". Yes it does, but name at least one mobile device that supports APE tags for mp3. I understand that it has a lot of advantages, but it also has disadvantages, and not being compatible with most players is definitely one of the biggest. BUT! I do not want to start another thread arguing about this. There are a lot of such threads here, so please take this with understanding.

Matyas

Unique identifier

Reply #41
Hey, there is nothing to argue about. I was worried about these results because they should have been the same and weren't, so I thought something was still wrong in my routines. But it is all OK.
However (even though I don't use TGF, so treat this as a user remark), I also think it doesn't matter in this case whether APE is widely supported or not. What are these MD5s generated for in TGF? To make a unique identifier, to look up dups, and to check file integrity, all at once. Player support has nothing to do with that. If you have already done the work for ID3v1 and ID3v2, it is easy to implement APE too.
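
For what it's worth, here is a hedged sketch of what skipping an APEv2 tag at the end of an mp3 might look like. It is based on my reading of the APEv2 layout (a 32-byte footer starting with "APETAGEX", a little-endian size field that covers the items plus the footer, and a flag bit marking an optional 32-byte header), and the function name is made up. It only handles a tag at the end of the file, optionally followed by an ID3v1 tag.

```python
import struct

APE_PREAMBLE = b"APETAGEX"
APE_HAS_HEADER = 0x80000000  # flag bit 31: a 32-byte header precedes the items


def strip_apev2(data: bytes) -> bytes:
    """Remove a trailing APEv2 tag, keeping any ID3v1 tag that follows it."""
    end = len(data)
    tail = b""
    # An ID3v1 tag, when present, sits after the APE tag at the very end.
    if end >= 128 and data[end - 128:end - 125] == b"TAG":
        tail = data[end - 128:]
        end -= 128
    if end >= 32 and data[end - 32:end - 24] == APE_PREAMBLE:
        # Footer fields after the preamble: version, size, item count, flags.
        _version, size, _count, flags = struct.unpack("<4I", data[end - 24:end - 8])
        tag_len = size + (32 if flags & APE_HAS_HEADER else 0)
        if tag_len <= end:  # ignore implausible sizes instead of cutting into audio
            end -= tag_len
    return data[:end] + tail


# Synthetic example: stand-in audio followed by a header, 10 bytes of items and a footer.
audio = b"\xff\xfb" + bytes(300)
items = b"x" * 10
header = APE_PREAMBLE + struct.pack("<4I", 2000, 42, 1, 0xA0000000) + bytes(8)
footer = APE_PREAMBLE + struct.pack("<4I", 2000, 42, 1, 0x80000000) + bytes(8)

print(strip_apev2(audio + header + items + footer) == audio)
```

Running this before the ID3 handling would let the hash cover the same audio bytes whether or not the file carries APE tags.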

 

Unique identifier

Reply #42
You're right, I have to admit.