Topic: CUETools DB

CUETools DB

I have only now become aware of Gregory S. Chudov's effort to develop CTDB (CUETools DB). I am very excited about this, as I have actually been suggesting this to spoon (dBpoweramp's developer) for over 5 years.

http://db.cuetools.net/about.php

for others that have missed it:
Quote
What's it for?
You have probably heard about AccurateRip, a wonderful database of CD rip checksums, which helps you make sure your CD rip is an exact copy of the original CD. What it can tell you is how many other people got the same data when copying this CD. CUETools Database is an extension of this idea.
What are the advantages?

    * The most important feature is the ability not only to detect, but also to correct small amounts of errors that occurred in the ripping process.
    * It's free of the offset problems. You don't even need to set up offset correction for your CD drive to be able to verify and, what's more important, submit rips to the database. Different pressings of the same CD are treated as the same disc by the database; it doesn't care.
    * Verification results are easier to deal with. There are exactly three possible outcomes: rip is correct, rip contains correctable errors, rip is unknown (or contains errors beyond repair).
    * If there's a match, you can be certain it's really a match, because in addition to the recovery record the database uses a well-known CRC32 checksum of the whole CD image (except for 10*588 offset samples in the first and last seconds of the disc). This checksum is used as a rip ID in CTDB.

What are the downsides and limitations?

    * CUETools DB doesn't bother with tracks. Your rip as a whole is either good/correctable, or it isn't. If one of the tracks is damaged beyond repair, CTDB cannot tell which one.
    * If your rip contains errors, the verification/correction process will involve downloading about 200kb of data, which is much more than it takes for AccurateRip.
    * The verification process is slower than with AR.
    * The database was just born and at the moment contains far fewer CDs than AR.

How many errors can a rip contain and still be repairable?

    * That depends. The best case scenario is when there's one continuous damaged area up to 30-40 sectors (about half a second) long.
    * The worst case scenario is 4 non-continuous damaged sectors in (very) unlucky positions.

What information does the database contain per each submission?

    * CD TOC (Table Of Contents), i.e. length of every track.
    * Offset-finding checksum, i.e. a small (16 byte) recovery record for a set of samples throughout the CD, which makes it possible to detect the offset difference between the rip in the database and your rip, even if your rip contains some errors.
    * CRC32 of the whole disc (except for some leadin/leadout samples).
    * Submission date, artist, title.
    * 180kb recovery record, which is stored separately and accessed only when verifying a broken rip or repairing it.
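
The whole-disc CRC32 mentioned in the quote can be illustrated with a minimal Python sketch. This is not the CTDB code: it assumes a raw 16-bit stereo PCM image (4 bytes per sample) and reads "10*588 samples" as 5880 samples trimmed from each end. The function name `disc_crc32` is made up for the example.

```python
import zlib

SAMPLES_PER_SECTOR = 588                 # stereo samples in one CD sector
BYTES_PER_SAMPLE = 4                     # 16-bit left + 16-bit right
EDGE_SAMPLES = 10 * SAMPLES_PER_SECTOR   # assumed: skipped at EACH end


def disc_crc32(pcm: bytes) -> int:
    """CRC32 of a raw PCM disc image, ignoring EDGE_SAMPLES at both ends."""
    edge = EDGE_SAMPLES * BYTES_PER_SAMPLE
    if len(pcm) <= 2 * edge:
        raise ValueError("image is too short to trim the edge samples")
    # Differences confined to the trimmed edges do not affect the result.
    return zlib.crc32(pcm[edge:len(pcm) - edge]) & 0xFFFFFFFF
```

Because the edges are left out, two rips that differ only in the first or last few sectors still get the same ID under these assumptions.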


CUETools DB

Reply #2
Chudov,
What type of error correction are you using?

How did you decide on 180kb? It would be interesting to gather data and figure how much an average damaged disc is missing and where the sweet spot would be for recovery.

Any thought of adding more info to the database? What about a CRC of every 1, 10, or 50 MB of data, as well as the whole-disc CRC? This would allow better identification of where the damage is.

Do discs have to pass AR before being added to the CTDB?

 

CUETools DB

Reply #3
Of course CTDB is open, and the code required to use it is LGPLed as all CUETools libraries. The only problem is it's in C#, i wonder if i will have to provide a .dll with C interface at some point. The algorithm is not very simple, there's quite a lot of code.

The basic algorithm is Reed-Solomon code on 16-bit words. Unfortunately, 32-bit Reed-Solomon is extremely slow, and 16-bit Reed-Solomon can be used directly only on chunks of up to 64k words == 128kbytes. So i have to process the file as a matrix with rows of 10 sectors (5880 samples == 11760 words/columns). Such matrix has up to ~30000 rows for a 70 minute CD, so i can use 16-bit Reed-Solomon for each column independently. Using the same notation as in wikipedia it's a (65536,65528) code, which produces 8 words for each column. So the total size is 8*11760*16bit = 188160 bytes.
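
The sizes quoted above can be checked with a few lines of Python. This is only arithmetic on the numbers from the post (588 samples per sector, 2 words per stereo sample, 10 sectors per row, 8 parity words per column); the function name `layout` and the 75 sectors per second figure for audio CDs are added for the example.

```python
SAMPLES_PER_SECTOR = 588   # stereo samples per CD sector
WORDS_PER_SAMPLE = 2       # one 16-bit word per channel
SECTORS_PER_ROW = 10       # row length chosen in the post
SECTORS_PER_SECOND = 75    # standard audio CD rate (assumption of this sketch)
PARITY_WORDS = 8           # recovery words per column


def layout(minutes: int) -> tuple[int, int, int]:
    """Return (columns, rows, recovery_record_bytes) for a disc of given length."""
    columns = SECTORS_PER_ROW * SAMPLES_PER_SECTOR * WORDS_PER_SAMPLE
    sectors = minutes * 60 * SECTORS_PER_SECOND
    rows = -(-sectors // SECTORS_PER_ROW)       # ceiling division
    record_bytes = PARITY_WORDS * columns * 2   # 2 bytes per 16-bit word
    return columns, rows, record_bytes


print(layout(70))  # (11760, 31500, 188160)
```

The record size depends only on the row length and the number of parity words, not on the length of the disc.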

An N-word recovery record can detect and correct up to N/2 erroneous words, so this 8-word recovery record can correct up to 4 errors in each column. N cannot be much smaller, but it also cannot be much larger, because processing time grows proportionally to N, so N=8 was chosen as the highest value which is still "fast enough" - close to FLAC decoding speed.
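
To see how the per-column limit produces both the best case (a continuous burst of up to 40 sectors) and the worst case (a handful of scattered sectors), here is a small Python sketch. It is a model of the counting argument only, not of the actual Reed-Solomon decoder: it assumes 11760 columns, at most 4 correctable words per column, and pessimistically treats every word of a damaged sector as wrong. `is_repairable` and `sector_words` are names made up for the example.

```python
from collections import Counter

COLUMNS = 11760                  # 16-bit words per row (10 sectors)
WORDS_PER_SECTOR = 588 * 2
MAX_ERRORS_PER_COLUMN = 8 // 2   # N/2 for the 8-word record


def sector_words(sector: int) -> range:
    """Indices of all 16-bit words belonging to one sector."""
    start = sector * WORDS_PER_SECTOR
    return range(start, start + WORDS_PER_SECTOR)


def is_repairable(damaged_words) -> bool:
    """True if no column contains more damaged words than can be corrected."""
    per_column = Counter(i % COLUMNS for i in set(damaged_words))
    return all(n <= MAX_ERRORS_PER_COLUMN for n in per_column.values())


def damage(sectors):
    """All word indices covered by the given damaged sectors."""
    return [w for s in sectors for w in sector_words(s)]


print(is_repairable(damage(range(40))))             # True: 40-sector burst
print(is_repairable(damage(range(41))))             # False: one sector too many
print(is_repairable(damage([0, 10, 20, 30])))       # True: 4 sectors, same columns
print(is_repairable(damage([0, 10, 20, 30, 40])))   # False: a fifth hit per column
```

A continuous burst spreads evenly over the columns, while scattered sectors that land on the same position in different rows pile up in the same columns, which is why the two limits are so far apart.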

Row size doesn't have such impact on performance, so it can be easily extended in the future, so that popular CDs can have larger recovery records. Current size was chosen so that if database contained as many entries as AccurateRip, it would fit on a 1TB drive. I also took into account that making records larger only helps in best-case scenario when the damage is sequential (scratches etc). When damage occurs at random points, fixing it requires larger N, not larger row size, but this has a performance impact. So the current record size was chosen to be more or less balanced.
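
As a rough check of the 1TB figure, the following Python sketch counts how many recovery records of the current size fit on such a drive. It assumes 1 TB means 10^12 bytes and ignores the small per-entry metadata and any filesystem overhead; `records_that_fit` is a name made up for the example.

```python
RECORD_BYTES = 188_160   # 8 words * 11760 columns * 2 bytes


def records_that_fit(drive_bytes: int, record_bytes: int = RECORD_BYTES) -> int:
    """Number of whole recovery records that fit in the given space."""
    return drive_bytes // record_bytes


print(records_that_fit(10**12))  # 5314625
```

So under these assumptions the current record size leaves room for a little over 5.3 million discs per terabyte.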

Is there a point in better identification of where the damage is, when the database is unable to fix it?

Discs don't have to pass AR before being added to the CTDB, AR is used only as a kind of proof that there is a physical CD with such content when adding with CUETools.
CD Rippers can add CDs to CTDB even if AR doesn't know them. There is already a number of CDs in database submitted by CUERipper, some of them have confidence 1 - that means they didn't pass AR check or weren't found in AR.
CUETools 2.1.6

CUETools DB

Reply #4
Quote
Is there a point in better identification of where the damage is, when the database is unable to fix it?


Not for RS repair, however for the ripper, this would allow re-ripping of the part of the disc where CRCs do not match and therefore are the problem areas.

Quote
Discs don't have to pass AR before being added to the CTDB, AR is used only as a kind of proof that there is a physical CD with such content when adding with CUETools.
CD Rippers can add CDs to CTDB even if AR doesn't know them. There is already a number of CDs in database submitted by CUERipper, some of them have confidence 1 - that means they didn't pass AR check or weren't found in AR.


My reason for suggesting that the DB should only include AR confirmed discs is to verify that the correction data will fix a disc to the correct state. Also, it may help limit the size of the database by only adding correct discs.

Quote
Row size doesn't have such impact on performance, so it can be easily extended in the future, so that popular CDs can have larger recovery records.


I would argue that less popular discs may warrant more data as they may be less replaceable...

How is meta-data handled in your database since this info is also saved?

CUETools DB

Reply #5
Quote
Not for RS repair, however for the ripper, this would allow re-ripping of the part of the disc where CRCs do not match and therefore are the problem areas.

This information will only be useful for rippers specially designed for it. I'm hoping very much that such rippers as EAC will support CTDB at some point, but i doubt that this support will go beyond simple verification/submission. Besides, ripper usually knows where the problem areas are (using C2 error pointers and comparing results from several passes).

Quote
My reason for suggesting that the DB should only include AR confirmed discs is to verify that the correction data will fix a disc to the correct state. Also, it may help limit the size of the database by only adding correct discs.

You can never be sure that correction data will fix a disc to the correct state. As with AccurateRip, all you can be sure of is that a certain number of submissions have the same data. If you rip a CD with EAC and there were errors, your incorrect rip will appear in AccurateRip database at some point, after that you can submit your incorrect rip to CTDB (if it accepts rips with confidence 1). Besides, there are CDs that are absent in AR database, and i want CTDB to be able to handle them.
As for the impact on database size, we will have to see how it goes. Maybe at some point i will have to do periodic purges of old unconfirmed submissions (with confidence 1).

Quote
I would argue that less popular discs may warrant more data as they may be less replaceable...

There are a lot more unpopular CDs than popular ones, so we can double the amount of data stored for CDs with > 100 submissions and the database will only grow by 10%.
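
The 10% figure follows if roughly one entry in ten has more than 100 submissions. That share is inferred from the numbers in the sentence above, not stated in the post. A minimal Python sketch of the arithmetic, with `growth_factor` as a made-up name:

```python
def growth_factor(popular_fraction: float, multiplier: float = 2.0) -> float:
    """Relative database size after enlarging records of popular discs only."""
    unpopular = 1.0 - popular_fraction
    return unpopular + popular_fraction * multiplier


# Assumed: 10% of entries are "popular"; their records are doubled.
print(round(growth_factor(0.10), 2))  # 1.1
```

Doubling the records of the unpopular 90% instead would grow the database by 90%, which is the point being made here.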

Quote
How is meta-data handled in your database since this info is also saved?

For now it only keeps artist/title information. The CTDB server also has a replicated MusicBrainz database clone. For the moment all this is used only in the web interface, which helps manage the database. I plan to improve integration with MusicBrainz when the next version of the MusicBrainz database schema comes out. It would be nice to fetch all the necessary data in one request to a single server. CUERipper currently has to contact 4 different databases (AR, CTDB, MusicBrainz, FreeDB), which can sometimes take a lot of time.
CUETools 2.1.6

CUETools DB

Reply #6
Gregory,
Thanks for the reply.

Any thought of changing to a track based system? Or length of disc/track based system?

I would think other developers would be more likely to develop with this in mind if it could correct single tracks. This would allow for a burst rip, check against AR and if not accurate, a repair with CTDB.

With the current setup, I guess the best workflow would be:

- burst rip entire discs
- check all tracks against AR
- if not all accurate, check number of C2 errors
- if errors are small enough correct with CTDB
- if not, go into re-rip/secure rip mode and try to recover errors
- if some errors are recovered but not all, try to repair again with CTDB
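
The steps above can be sketched as a small decision function in Python. This is only an illustration of the proposed workflow, not code from any ripper: the inputs (`ar_all_accurate`, `c2_error_count`) and the threshold `max_repairable_errors` are hypothetical, and a real ripper would have to supply them.

```python
from enum import Enum


class Action(Enum):
    DONE = "rip verified, nothing to do"
    REPAIR = "repair with CTDB"
    RERIP = "secure re-rip, then try CTDB repair again"


def next_action(ar_all_accurate: bool,
                c2_error_count: int,
                max_repairable_errors: int) -> Action:
    """Pick the next step after a burst rip of the whole disc."""
    if ar_all_accurate:
        return Action.DONE        # every track matched AccurateRip
    if c2_error_count <= max_repairable_errors:
        return Action.REPAIR      # damage looks small enough for CTDB
    return Action.RERIP           # too much damage: recover what we can first


print(next_action(False, 3, 4).value)  # repair with CTDB
```

After a re-rip, the same function would be called again with the new error count, which covers the last step of the list.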


Also, have you thought of keeping track of how many errors are in the "average" disc to get a better idea of how much error recovery data to keep in the DB?

CUETools DB

Reply #7
In the future, WILL THERE BE something that may detect by disc ID, which catalogue number & country a pressing actually is.

Example:
Disc ID 9F0BDB0D - the following pressings are a match of each other:
Jean Michel Jarre, Téo & Téa, CAT# 2561699766 Country EU
Jean Michel Jarre, Téo & Téa, CAT# 4607173157591 Country Russia

Or is this kind of thing way too distant?

CUETools DB

Reply #8
As I said elsewhere, consolidation is a good thing.  If the audio data stream is identical, it shouldn't matter what the pressing is, even if the pressings have different TOCs.

There is a problem however.  Some pressings may differ by more than an offset, TOC or even different amounts of non-null samples at the edges of the disc; some pressings actually have transfer errors that exist on the glass master.  We certainly don't want CUETools "correcting" a track that was ripped correctly that has no audible glitch with data that came from a later generation pressing that has an audible glitch.  If I had a pressing ripped accurately with an audible glitch, I would personally prefer that it got corrected using data from a different pressing removing that glitch.  I don't suspect that there's an easy solution to this, but I think people should be aware that this is a real phenomenon.

CUETools DB

Reply #9
Well said, and if there was a way to identify these defective pressings, by CAT# and country, it should be easier to avoid them or mark them as such. It would take one step of further work from a development point-of-view, nonetheless very useful and informative.

CUETools DB

Reply #10
A tool to scan audio for clicks/pops characteristic of scratches or DRM would be a useful tool. Scanning rips, since most people can't/don't listen to a rip to check it, would be very helpful.

CUETools DB

Reply #11
dBpoweramp forum CUETools DB thread
Quote
The fact it is not track based is a real issue, to make it track based it would have to store 10x more correction data, which would make it un-practical.


Would it be reasonable to consider adding smaller track based correction files to the database? I'm not sure how large they would have to be to make any sense. The computational time required to do a correction decreases with file size, and would therefore be cut down significantly by repairing a track vs a disc. Having smaller track based correction files would increase the DB size, but not as much as if each correction file were as large as the entire disc correction file. This is where knowing how large an average un-recoverable error is would really help.

CUETools DB

Reply #12
Quote
Also, have you thought of keeping track of how many errors are in the "average" disc to get a better idea of how much error recovery data to keep in the DB?

In the "average" disc there are no errors. I don't see how to gather reliable statistics on this. It's possible to gather statistics of actual repairs done, but that will only tell us the number of errors when it's small enough to be correctable. There's no way to tell how many errors there were if a rip cannot be repaired.

Quote
In the future, WILL THERE BE something that may detect by disc ID, which catalogue number & country a pressing actually is.

Next version of Musicbrainz will keep track of TOC <=> release relationships. CTDB could improve on that using not only TOC, but offset and actual audio data, however there are very few discs which have the same TOC with different audio data, and offset is not that important in my opinion. I hope Musicbrainz will do the job just fine.

Quote
We certainly don't want CUETools "correcting" a track that was ripped correctly that has no audible glitch with data that came from a later generation pressing that has an audible glitch.

Yes, this is a good example of why CTDB repair should be used very carefully. If it ain't broken, don't fix it.

Quote
A tool to scan audio for clicks/pops characteristic of scratches or DRM would be a useful tool. Scanning rips, since most people can't/don't listen to a rip to check it, would be very helpful.

I'm afraid i don't see any realistic ways to do this.

Quote
The fact it is not track based is a real issue, to make it track based it would have to store 10x more correction data, which would make it un-practical.

Would it be reasonable to consider adding smaller track based correction files to the database?

Mr. Spoon is right. Average CD consists of 8-9 tracks, and each track requires the same amount of correction data as the whole disc. Making correction data 10 times smaller will make it useless, it won't be able to fix any significant glitch. And keeping the same amount of correction data for each track would make database too large. If we can allow for 10 times more database space, we could instead make larger correction records for the whole disc.

Besides, CTDB is mostly aimed at CD archiving, making sure you have the exact copy of your CDs on your HD. If you rip one track from a CD, it's much less important to have a bit-exact copy.
CUETools 2.1.6

CUETools DB

Reply #13
Quote
The fact it is not track based is a real issue, to make it track based it would have to store 10x more correction data, which would make it un-practical.

Would it be reasonable to consider adding smaller track based correction files to the database?

Mr. Spoon is right. Average CD consists of 8-9 tracks, and each track requires the same amount of correction data as the whole disc. Making correction data 10 times smaller will make it useless, it won't be able to fix any significant glitch. And keeping the same amount of correction data for each track would make database too large. If we can allow for 10 times more database space, we could instead make larger correction records for the whole disc.

Besides, CTDB is mostly aimed at CD archiving, making sure you have the exact copy of your CDs on your HD. If you rip one track from a CD, it's much less important to have a bit-exact copy.


Thanks for the replies. I understand that a track based approach would require a much larger DB.

I can't speak for spoon, but the way I read his comment was that

1. Most of his users rip tracks, not discs
2. If they can't rip just one track, the feature is worthless to them
3. If it's not useful to his large customer base he may be less interested in supporting CTDB

I know you don't benefit from having CTDB support in dBpoweramp, but I would very much like to see this.

When I originally proposed the idea to spoon I had suggested/recommended a distributed p2p storage of Reed-Solomon recovery files. Would it be feasible to keep a central database and point to torrents for the R-S recovery files (not audio)? This would allow for track based files without a serious concern over the size of the recovery files. It would even be possible to allow for 5+ seconds' worth of repair data per track.


CUETools DB

Reply #14
We all benefit from CTDB support in as many applications as possible, because its usefulness depends on how many people submit their results to it.

I have no statistics to prove it, but i assume that the vast majority of CD rips made with dbpoweramp or any other software are rips of the whole CD, even if ripped to separate files.

Distributed storage is definitely an interesting idea and it's worth considering, but it has many problems.
First, it would make CTDB software too complicated. Developers are much more likely to support something simple and straightforward which doesn't require that much code.
Second, recovery files for rare CDs that were submitted only by one or two people will be unavailable most of the time.
Third, how do we convince people not only to submit their data to the database, but to permanently allocate disk space and bandwidth for their submissions?
What happens when someone's HDD crashes? Do we lose his submissions?
CUETools 2.1.6

CUETools DB

Reply #15
spoon has said in the past that the majority of his users only rip a couple of tracks. We see a very biased user base here at HA. Even the active users on the dBpoweramp forums represent a biased user base.

Clearly a distributed storage solution is not ideal. Maybe a hybrid solution would be best. A central DB where every recovery file is uploaded would be the only solution to losing disk space. Enforcing a certain ratio would be the only way to encourage uptime and shares. A system where users allocate disk space and the central system fills that space would more evenly distribute and preserve the data.

Of course storage continues to get cheaper and the projected size of the database will only occur after a matter of years, at which time storage will be even cheaper.

What about selling access to the repair files to pay for the cost of storage, bandwidth and maintaining the database?

CUETools DB

Reply #16
After my last post I thought about this some more. I really think the best thing may be a paid solution, similar to the one dBpoweramp uses for premium meta-data:

1. non-commercial users pay a $3-5 annual fee to access repair files
2. commercial users (users using batch ripper) pay ~$0.05/repair file

This may make it cost effective to have a large, track based database without the distributed storage.

CUETools DB

Reply #17
Quote
After my last post I thought about this some more.


No, you shouldn't. We only deal with FOSS in this case. At least until now.

Quote
License:
GNU General Public License (GPL), GNU Library or Lesser General Public License (LGPL)

CUETools DB

Reply #18
Wouldn't change the software license at all. There would just be a cost to access the database. Besides, doesn't matter what I say. Just an idea/suggestion.

Could offer current full disc repair size for free and track based with a charge.

CUETools DB

Reply #19
That can be feasible even with GPL (dependency through the network with a character oriented protocol), but a lot of users (count me in) may stop submitting recovery files to a proprietary base if there is no free and full access by 3rd party apps.  AccurateRip allows free access to 3rd party software.

CUETools DB

Reply #20
NP, I am sure you, zfox, would not mind providing free hosting of 10+ TB of data storage and the bandwidth as well. Then we can all enjoy the benefits for free. It's very kind of you to offer to do this.

BTW, AR's DB size is a fraction of what CUETools storage requirements and bandwidth would be.

CUETools DB

Reply #21
methinks the record companies would have a little issue with the sampling and reconstruction of their copyrighted audio

CUETools DB

Reply #22
Quote
Current size was chosen so that if database contained as many entries as AccurateRip, it would fit on a 1TB drive.

@Eli
How much does such a HD drive cost? You answer that.
How many new CD releases per month? 1000? Recovery data submission bandwidth is under control.

There is also no need to have a track based DB (10x size) for archival purposes.

CUETools DB

Reply #23
Quote
methinks the record companies would have a little issue with the sampling and reconstruction of their copyrighted audio

This is indeed an issue. Even if they know they cannot fight that in courts, a C&D letter may arrive.

CUETools DB

Reply #24
Quote
methinks the record companies would have a little issue with the sampling and reconstruction of their copyrighted audio


There is no sampling. There is no audio storage. Without the nearly complete audio a repair file is worthless. It can't be played. There is no copyright issue. This database is live and already happening. It may be below the radar. There is no legal violation. That's not to say big business couldn't bury it financially, but there would really be little to no incentive to attack it. This is the main reason that spoon gave for not going forward with this type of idea when I suggested it years ago though.


Quote
@Eli
How much does such a HD drive cost? You answer that.
How many new CD releases per month? 1000? Recovery data submission bandwidth is under control.

There is also no need to have a track based DB (10x size) for archival purposes.


Clearly the current DB is reasonable for 1 person to host. The idea was to charge for a larger, more robust DB in order to cover its costs. As I suggested, maybe the best solution would be to offer free access to the current DB structure, and for-fee access to the larger track based DB.