Comparing two cd databases

Hi all!

My challenge / problem is this:
1. I have freedb (and can parse it, with perl, to output "artist<TAB>album"), around 140K entries - each entry is a CD
2. I have "another" database (which is also in the format "artist<TAB>album"), more than 10K entries

I want to find out how many of the entries in database 1 can also be found in database 2. Unfortunately, I have no experience in generating patterns dynamically, nor in using Spell / ISpell (spelling checkers).

So I am curious which tools (Perl packages or anything else) you would recommend as the approach most likely to succeed.

Database 2 is currently so messy that I can find only around 5% of its entries in database 1.

At this stage, I am grateful for any and all suggestions.

PS! Oh, and I use Linux, so I have access to all the tools available for that platform.
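
For illustration, the comparison described above could start from something like the following. This is only a minimal sketch under assumptions: `make_key` and `count_matches` are made-up names, the sample lines are invented, and the normalization (lowercasing, turning anything that is not an ASCII letter or digit into a space) is one possible choice that will not cope with non-ASCII titles.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce an "artist<TAB>album" line to a comparison key.
sub make_key {
    my ($line) = @_;
    my ($artist, $album) = split /\t/, $line, 2;
    for my $field ($artist, $album) {
        $field = '' unless defined $field;
        $field = lc $field;              # ignore case
        $field =~ s/[^a-z0-9]+/ /g;      # punctuation and other characters become spaces
        $field =~ s/^\s+|\s+$//g;        # trim leading and trailing whitespace
    }
    return "$artist\t$album";
}

# Count how many lines of @$first have a key that also occurs in @$second.
sub count_matches {
    my ($first, $second) = @_;
    my %seen;
    $seen{ make_key($_) } = 1 for @$second;
    my $count = 0;
    for my $line (@$first) {
        $count++ if $seen{ make_key($line) };
    }
    return $count;
}

# Invented sample data standing in for the two parsed databases.
my @freedb = ("Mike Oldfield\tTubular Bells", "Pink Floyd\tThe Wall", "Yes\tFragile");
my @other  = ("mike oldfield\tTubular Bells!", "PINK FLOYD\tthe wall");

print count_matches(\@freedb, \@other), "\n";   # prints 2
```

Because the lookup goes through a hash, each database is read only once, which should be comfortable for 140K and 10K entries. Exact matching on normalized keys will still miss misspellings, so it gives a lower bound on the overlap rather than the final answer.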

Reply #1
Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates or even clean up the entire table.

 

Reply #2
Quote
Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates or even clean up the entire table.

Mmm, I suppose I could use some "case-insensitive 'like'" stuff in MySQL, but why?

Perhaps you don't know what a regular expression is.

Say that you have the name of an artist in two formats:
A. "Mike Oldfield"
B. "Oldfield, Mike"

To match these two, you need a regular expression, for instance this substitution:
Code:
s/([^,]*),\s(.*)/$2 $1/
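
As a minimal sketch of how a substitution like this could be used for matching, the helper below brings both spellings to the same form before comparing. `normalize_artist` is a made-up name, and the anchored, slightly stricter regex and the lowercasing are assumptions added for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "Last, First" into "First Last" and lowercase the result.
sub normalize_artist {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
    $name =~ s/^([^,]+),\s*(.+)$/$2 $1/;   # "Oldfield, Mike" -> "Mike Oldfield"
    return lc $name;
}

for my $name ("Mike Oldfield", "Oldfield, Mike") {
    print normalize_artist($name), "\n";   # both print "mike oldfield"
}
```

Note that any name containing a comma is swapped, so a band such as "Earth, Wind & Fire" would come out as "wind & fire earth". A list of exceptions would be needed for such cases.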


Also, for various kinds of erroneous entries in the album title, I have so far come up with some other patterns, which I read from a file like this:
Code:
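while (my $line = <album_patterns>) {
    chomp $line;
    next if $line =~ /^$/ || $line =~ /^#/;    # skip blank lines and comments
    my ($pattern, $replacement, $modifier) = split /\t/, $line;
    # a missing column would otherwise be undef; then strip the surrounding quotes
    for ($pattern, $replacement, $modifier) {
        $_ = '' unless defined $_;
        s/^'(.*)'$/$1/;
    }
    $albumPatterns{$pattern} = $replacement;   # $modifier is read but not used yet
}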
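
One way the loaded pairs might then be applied to a title is sketched below. `apply_pattern` is a made-up name and the "disc" pattern is invented for the example, not taken from the pattern file. The sketch expands `$1`, `$2`, ... in the replacement by hand instead of evaluating the replacement as Perl code, which is a design choice of the sketch, and it replaces only the first match.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one pattern/replacement pair to a title.
# "$1", "$2", ... in the replacement are filled in from the capture groups.
sub apply_pattern {
    my ($text, $pattern, $replacement) = @_;
    return $text unless $text =~ /$pattern/;   # no match: leave the title alone
    my ($start, $end) = ($-[0], $+[0]);        # offsets of the whole match
    # copy the capture groups out before the next regex overwrites them
    my @caps = map {
        defined $-[$_] ? substr($text, $-[$_], $+[$_] - $-[$_]) : ''
    } 1 .. $#+;
    (my $expanded = $replacement) =~ s/\$(\d+)/$caps[$1 - 1] \/\/ ''/ge;
    substr($text, $start, $end - $start, $expanded);   # splice the replacement in
    return $text;
}

# An invented pattern for illustration only.
my %albumPatterns = ('\s*\([Dd]isc\s+(\d+)\)' => ' [DISC: $1]');

my $title = "Tubular Bells (Disc 2)";
for my $pattern (sort keys %albumPatterns) {
    $title = apply_pattern($title, $pattern, $albumPatterns{$pattern});
}
print "$title\n";   # prints "Tubular Bells [DISC: 2]"
```

Since a hash has no defined order, the patterns are applied here in sorted key order. If the order of the lines in the pattern file matters, an array of pairs would be a better container than `%albumPatterns`.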


These are the patterns, though I should note that they are not finished yet. Also, the Unicode setup on my box is broken, so I have to write the patterns in a somewhat clumsy way:
Code:
# year
#'(\D('[1-9]\d|[1-9]\d{3}))'    '[YEAR: $1]'
# yearspan
#'(\D('[1-9]\d|[1-9]\d{3}))(\s*.?\s*)(('[1-9]\d|[1-9]\d{3})\D?)'        '[YEARSPAN: $2$3$5]'
# volumenumber
#'[Vv]ol(ume|\.)[\W\s]?(\d*|[a-zA-Z]*)' '[VOLUMENUMBER $2]'
# volumespan
#'[Vv]ol(\.|ume)?s?[\W\s]+(\w+)(.*[Vv]ol(\.|ume)?s?)?(\W+(\w+))' '[VOLUMESPAN: $2_$6]'