HydrogenAudio

Hydrogenaudio Forum => Scientific Discussion => Topic started by: DrDoogie on 2003-11-20 15:53:02

Title: Comparing two cd databases
Post by: DrDoogie on 2003-11-20 15:53:02
Hi all!

My challenge / problem is this:
1. I have freedb (which I can parse with Perl to output "artist<TAB>album"), around 140K entries; each entry is a CD
2. I have "another" database (which is also in the format "artist<TAB>album"), more than 10K entries

I want to find out how many of the entries in 1. can also be found in 2. Unfortunately, I have no experience in generating dynamic patterns, nor in using Spell / ISpell (which check for spelling errors).

So I am curious which Perl packages (or other tools) you would recommend as the approach most likely to succeed.

Database 2. is currently so messy that I can only find around 5% of its entries in 1.

At this stage, I am grateful for any and all suggestions.

PS: Oh, and I use Linux, so I have access to all the tools available for that platform.
Title: Comparing two cd databases
Post by: Jasper on 2003-11-21 09:08:52
Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates, or even clean up the entire table.
Title: Comparing two cd databases
Post by: DrDoogie on 2003-11-21 22:29:06
Quote
Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates, or even clean up the entire table.

Mmm, I suppose I could use some "case-insensitive 'like'" stuff in MySQL, but why?

Perhaps you don't know what a regular expression is.

Say that you have the name of an artist in two formats:
A. "Mike Oldfield"
B. "Oldfield, Mike"

In order to match these two, you need a regular expression.
For instance, this substitution rewrites format B into format A:
Code: [Select]
s/([^,]*),\s(.*)/$2 $1/
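
A minimal sketch of how that substitution could feed the actual comparison, assuming both databases are available as lists of "artist<TAB>album" lines. The names normalize_key and count_matches are hypothetical helpers, not part of any package, and the extra clean-up steps (lowercasing, dropping punctuation, collapsing spaces) are assumptions about what "messy" means here:

```perl
use strict;
use warnings;

# Hypothetical helper: reduce one "artist<TAB>album" line to a comparison key.
sub normalize_key {
    my ($line) = @_;
    my ($artist, $album) = split /\t/, $line, 2;
    $album = '' unless defined $album;
    $artist =~ s/^([^,]*),\s*(.*)$/$2 $1/;    # "Oldfield, Mike" -> "Mike Oldfield"
    my $key = lc "$artist\t$album";           # compare case-insensitively
    $key =~ s/[^\w\t ]+//g;                   # drop punctuation (and any newline)
    $key =~ s/ +/ /g;                         # collapse runs of spaces
    $key =~ s/^ +| +$| +(?=\t)|(?<=\t) +//g;  # trim spaces around both fields
    return $key;
}

# Hypothetical helper: count the lines of @$first whose key also occurs in @$second.
sub count_matches {
    my ($first, $second) = @_;
    my %seen;
    $seen{ normalize_key($_) } = 1 for @$second;
    return scalar grep { $seen{ normalize_key($_) } } @$first;
}
```

It would be called as count_matches(\@freedb_lines, \@other_lines). Note that the comma swap mangles artists whose names legitimately contain commas (for example "Crosby, Stills, Nash & Young"), and that \w only covers ASCII letters unless the input is decoded first, so this is a starting point rather than a complete matcher.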


Also, to handle various erroneous entries in the album title, I have so far come up with some other patterns, which I read from a file like this:
Code: [Select]
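# Each line of the pattern file is: 'pattern'<TAB>'replacement'[<TAB>'modifier']
while (<album_patterns>) {
       chomp;
       next if /^$/ || /^#/;    # skip blank lines and comments
       my ($pattern, $replacement, $modifier) = split /\t/;
       # strip the surrounding single quotes; the modifier field may be absent
       for ($pattern, $replacement, $modifier) {
               s/^'(.*)'$/$1/ if defined;
       }
       # note: $modifier is parsed but not stored or used yet
       $albumPatterns{$pattern} = $replacement;
}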
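
The loop above only stores the pairs; since the replacements contain $1, $2 and so on as plain text, applying them takes one extra step. A rough sketch of one way to do it, assuming a hash shaped like %albumPatterns; apply_patterns is a hypothetical helper, and the double /e evaluation is one possible technique, not necessarily the one used in the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: run every stored pattern over one album title.
# $patterns is a hash reference shaped like %albumPatterns (pattern => replacement).
sub apply_patterns {
    my ($title, $patterns) = @_;
    for my $pattern (sort keys %$patterns) {
        # Escape ", \ and @ so the replacement survives being wrapped in
        # double quotes; '$' is left alone so $1, $2, ... still interpolate.
        (my $quoted = $patterns->{$pattern}) =~ s/(["\\\@])/\\$1/g;
        # The first /e builds a double-quoted string literal from the
        # replacement; the second /e evaluates it, filling in the captures.
        $title =~ s/$pattern/'"' . $quoted . '"'/gee;
    }
    return $title;
}
```

Because the second evaluation runs text from the pattern file as Perl code, this only makes sense for a pattern file you wrote yourself. The patterns are applied in sorted key order here, which is an arbitrary choice; if the order matters, an array of pairs would be a better container than a hash.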


These are the patterns, though I should note that they are not finished yet. Also, the Unicode setup on my box is broken, so I have had to write the patterns in a somewhat clumsy way:
Code: [Select]
# year
#'(\D('[1-9]\d|[1-9]\d{3}))'    '[YEAR: $1]'
# yearspan
#'(\D('[1-9]\d|[1-9]\d{3}))(\s*.?\s*)(('[1-9]\d|[1-9]\d{3})\D?)'        '[YEARSPAN: $2$3$5]'
# volumenumber
#'[Vv]ol(ume|\.)[\W\s]?(\d*|[a-zA-Z]*)' '[VOLUMENUMBER $2]'
# volumespan
#'[Vv]ol(\.|ume)?s?[\W\s]+(\w+)(.*[Vv]ol(\.|ume)?s?)?(\W+(\w+))' '[VOLUMESPAN: $2_$6]'