HydrogenAudio

CD-R and Audio Hardware => CD Hardware/Software => Topic started by: spoon on 2008-02-21 18:14:23

Title: AccurateRip - Future Direction
Post by: spoon on 2008-02-21 18:14:23
It has been brought to my attention that the CRC used in AccurateRip is not doing its job properly. In layman's terms, the right channel rolls out of the CRC calculation every 1.5 seconds (the 1st sample's right channel is used 100%, by the 65,535th sample it is not used at all, at the 65,536th sample it is used 100% again, and this repeats over and over). It is estimated that effectively 3% of the data is not getting into the CRC (even at 97% coverage, I stand behind AccurateRip as better than most (all?) C2 implementations). Going back over the early AccurateRip code, it seems the design of the CRC is fine, just not the implementation (the L and R channels were supposed to go in separately, but were optimized to both go in together without bringing down the upper 32 bits).

Steve will post his detailed findings on this discovery.

It is a relatively easy fix (detailed below); however, this presents an opportunity which was not around when AccurateRip was first implemented (at that time the understanding of different CD pressings, and how they behave, was almost non-existent).

----------------------------
1. Fix: Correct the algorithm so all the data is used. Both the new and old CRCs are calculated; the new is checked first, the old second (with less accuracy). New submissions would effectively appear as different pressings in the database.
----------------------------
2. Fix: Change the CRC algorithm to something like CRC32. The reason it was not used in the first place was that tracks 2 to x-1 would match the CRC presented in EAC, but the first and last tracks never would, causing confusion; the CRC could be XOR'd to avoid this confusion.
----------------------------
3. Fix & Additional Development: Use CRC32 alongside the old CRC (there is lots of data in the database). The new CRC32 would go into a parallel 2nd database, increasing the strength of the CRC to almost 64 bits (not taking the flaw into account). On the back end there are few changes to be made; both databases share the same design.
----------------------------
4. Fix & Additional Development: Use a different hash (MD5, SHA-1); these would increase the storage of the database by up to 5x (160 bits for SHA-1).
----------------------------
5. Brainstorm a method of having a hash which would be resistant to pressings, yet still be feasible for a CD ripper that rips track-by-track rather than whole-CD (and without the need to read outside of the track).
----------------------------
6. ???

Bear in mind the existing database before construction takes up some 14 GB.
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-21 19:48:30
My information about the implementation of the AccurateRip CRC algorithm derives from Christopher Key's ARcue.pl Perl script, available at
http://www.hydrogenaudio.org/forums/index.php?showtopic=53583

For each incoming data sample, the Right and Left 16-bit info is grouped into a single 32 bit word.  That word is multiplied by another 32 bit number, called frame_offset in the code, which is really just the sample's address in the file, i.e. the sample number.

All that's done is to multiply the two numbers together.  This results in a 64 bit product of which only the bottom 32 bits are preserved when loading it back into the $CRC integer variable.  A running total of the sample times its address is kept to produce the final CRC.

This algorithm is not a CRC at all, but a checksum, and a badly implemented one.  The problem is that the multiply shifts the high order bits of the sample out of the 32 bit window that is stored in $CRC and never rotates that data back into the low order bits.  It's really only half of a checksum.  This means that many of the high order bits of the right channel data do not participate at all in the checksum calculation.
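Based on this description (and the ARcue.pl reading of the algorithm), a minimal Python sketch of the flawed checksum might look like the following; the packing order (right channel in the high 16 bits) and 1-based sample numbering are assumptions drawn from the discussion, not verified against the original source:

```python
MASK32 = 0xFFFFFFFF

def arcs_flawed(samples):
    """Flawed AccurateRip checksum as described above: a running sum of
    sample-word * sample-number, keeping only the low 32 bits of each
    64-bit product (the high bits are simply discarded)."""
    crc = 0
    for i, (left, right) in enumerate(samples, start=1):
        word = ((right & 0xFFFF) << 16) | (left & 0xFFFF)  # R in high half
        crc = (crc + word * i) & MASK32  # high 32 bits of product lost here
    return crc
```

With this version, changing the right channel of sample number 65,536 does not change the result at all, exactly the insensitivity derived below.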

In more detail, let D represent the 32 bit data word, and A (for Address) represent the data word's position in the file.  The calculation is that the CRC increment will equal D * A.  Now let's partition D into a high order m bit portion DH and a low order (32 - m) bit portion DL.  We'll reverse partition A into a high order 32 - m bits AH and a low order m bits AL.


D = ---DH---|----DL------- = DH * 2^(32-m) + DL
A = ------AH-----|---AL---- = AH * 2^m    + AL

Now multiply D times A.  The result is:

(DH * 2^(32-m) + DL) * (AH * 2^m + AL)  =  (DH * AH) * 2^32 + (DH * AL) * 2^(32-m) + (DL * AH) * 2^m + DL * AL

Take the low order 32 bits of the above product and the first term disappears, leaving

Low_32-bits_of ( (DH * AL) * 2^(32-m) + (DL * AH) * 2^m + DL * AL )

Now what happens if m low order bits of A are zero? Meaning that AL itself is zero.  The product then becomes

Low_32-bits_of (DL * AH * 2^m)

and the result is insensitive to all m bits of DH, which no longer appear in the result.

So how does this manifest itself? Every other sample has an even value of A, so that AL = 0 for m=1, and the high order bit of D is shifted out of the calculation.  Every fourth sample has two LSBs of A = 0, making m=2, shifting two MSBs of D out of consideration.  Every eighth sample has 3 MSBs shifted into oblivion, and so on.  Every 2^16 = 65,536 samples, the entire 16 bits of the right channel of the sample are shifted away, meaning that they have no effect at all on the final checksum.  This happens once every 65536 / 44100 = ~1.5 seconds.

Adding all these missed bits together, we find that 3% of the bits in the file do not participate in the final Accurate Rip CRC.  This is a coverage of only 97%, where a properly designed 32 bit CRC, or even a simpler checksum for that matter would give 99.99999997% !

This low coverage may explain some of the problems that people are reporting where their ripping software is indicating errors and Accurate Rip says that everything is fine.

So what to do next?  Choose an algorithm that at least has proper coverage given 32 bits, such as CRC-32.  Possibly go to a longer CRC such as CRC-64 or even to a 128 bit hash like MD5.  The now obsolete cryptographic properties of MD5 are not relevant here.  It would make a fine "CRC" with ridiculously high coverage percentage.

See Wikipedia for more details:
http://en.wikipedia.org/wiki/Cyclic_redundancy_check (http://en.wikipedia.org/wiki/Cyclic_redundancy_check)
http://en.wikipedia.org/wiki/MD5 (http://en.wikipedia.org/wiki/MD5)
http://en.wikipedia.org/wiki/Cryptographic_hash_function (http://en.wikipedia.org/wiki/Cryptographic_hash_function)
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-21 20:05:31
Reading Spoon's initial post to this thread I see that the intention was to calculate left and right channel checksums independently, so that the product of two 16 bit numbers would fit in a 32 bit integer, whose upper sixteen bits could then be shifted down and added to the lower to make a correct checksum with full coverage.

So this is really a code optimization bug.

It's really a shame that C2 is done so poorly in most CD-ROM readers, and I agree that 97% coverage is probably better, but we can get to "nine nines" of coverage by fixing the bug.
Title: AccurateRip - Future Direction
Post by: Fandango on 2008-02-21 23:01:17
Concerning spoon's point #5: Considering that most different pressings are really bit-identical except for a few samples at the beginning and the end of the CD, due to a different offset in production, wouldn't rolling hashes help to find the right offset from which to start calculating the checksum? Something similar to what rsync does. Then the new AR dll would calculate two track CRCs: one with the "corrected offset", to match a pressing already in the database, and one without offset correction ("corrected" doesn't mean read offset correction here, of course).

The required rolling hashes for the first samples of a CD could be added to the existing database whenever Track 1 of that CD is successfully ripped with the new version of AR by a user.
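For reference, the rsync-style weak checksum Fandango mentions can be slid across the data one position at a time in O(1) per step. A generic sketch over bytes (the window size and 16-bit component widths here are illustrative choices, not anything AR specifies):

```python
def weak_sum(block):
    """rsync-style weak checksum of a window: two 16-bit running sums."""
    n = len(block)
    s1 = sum(block) & 0xFFFF                                     # plain sum
    s2 = sum((n - i) * b for i, b in enumerate(block)) & 0xFFFF  # weighted sum
    return s1, s2

def roll(s1, s2, out_byte, in_byte, window):
    """Slide the window forward one position without recomputing."""
    s1 = (s1 - out_byte + in_byte) & 0xFFFF
    s2 = (s2 - window * out_byte + s1) & 0xFFFF
    return s1, s2
```

A ripper could roll such a sum across the first samples of track 1 to locate a pressing offset cheaply, then compute the full track checksum from that point.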
Title: AccurateRip - Future Direction
Post by: sbooth on 2008-02-22 00:01:48
I personally vote for using the MD5 hash of the PCM data.  The properties of MD5 are well known, and as Steve Gabriel points out the cryptographic weaknesses of the algorithm are irrelevant in this case.  There are many reference implementations and there are test suites for verification.  It seems silly to reinvent the wheel when so many excellent hashing and CRC algorithms exist.
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-22 03:52:05
I lean toward option 4, using MD5 as well, but I wonder about the database expansion factor of 4x.  You said the database is now 14 GB, yet the Accurate Rip website says there are "only" about 460K CDs.  That's maybe 6 million tracks, which means that each track consumes 2.3 KB in the database.  Is that right?  What portion of the DB is devoted to actual CRCs?

Also, how prevalent is the multiple pressing problem?  On a typical CD, how many variants are there?  Does the most common variant present itself as a simple offset of all the samples, as if there were a write offset error at the production facility and all the samples are otherwise identical?
Title: AccurateRip - Future Direction
Post by: spoon on 2008-02-22 09:45:09
The DB stores: comp ident (a big, long OLE-type identifier), CRC of track + CRC of offset. I used to be able to construct the final database using huge (GBs) RAM disks, which was nice and fast; it grew beyond that point, and now it takes some 24-48 hours to construct the final database.

Pressing differences are a big problem for AR (the whole CD is shifted by a certain number of samples, perhaps 1000 in some examples); they give wrong offsets when finding a drive's offset (even with 3 discs, people can still get the wrong offset, say 1 in 3000). For common drives with a strongly known offset, AR will only key to the known offset.

About database sizes, 14GB now, in 2-3 years if things were left as they are it could be 40 GB, if the plans for '6.' (see below) were implemented, that would be 80GB.

(have a cup of coffee before reading this...)

6. I think I have the solution! As it stands, the database holds for each track (forget pressings for the moment) a track CRC (which has the flaw) and an offset-finding CRC (which does not have the flaw).

I will be talking about 2 databases, side by side, the existing database is DB1 and new is DB2

[DB1] Work should be done in EAC and dBpoweramp ASAP to correct the flaw; each program should calculate 2 CRCs, the old one and the new one. Only the new one should be submitted once the fix is implemented. The old CRCs would in time be replaced by the new CRCs in the same database.

[DB2] In addition, two CRC32s should be generated:

[CRC1][..............CRC2............][CRC1]

So CRC1 is the first (say) 5 frames and the last 5 frames of the track, and CRC2 is all the track. These 2 CRCs could be submitted to a 2nd database, where CRC1 will go into the current offset-finding slot: no changes on the backend (apart from creating the 2nd database)!

Why do this? It would allow a match if your CD is a different pressing and not actually in the database. No rolling CRCs are needed, as the CRC from the existing database that is used to find drive offsets can find the offset of the pressing, and as long as it is within +-5 frames, the pressing can be verified. It also has the benefit that for track 1 (which currently is only calculated from 5 frames in), on any drive with a + offset the CRC1 would be correct, so all of track 1 could be verified in its entirety (not possible for the last track, as the majority of drives cannot overread).
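A sketch of how the two CRC32s might be computed. This follows the diagram (CRC1 over the edge frames, CRC2 over the remainder, per Steve's later reading), uses zlib's standard CRC-32, and assumes 588 samples per frame of 16-bit stereo PCM; the function name and the exact split are illustrative, not Spoon's implementation:

```python
import zlib

FRAME_BYTES = 588 * 4  # one CD frame: 588 stereo 16-bit samples

def pressing_crcs(pcm, frames=5):
    """CRC1 over the first and last `frames` frames, CRC2 over the middle."""
    edge = frames * FRAME_BYTES
    crc1 = zlib.crc32(pcm[:edge] + pcm[-edge:]) & 0xFFFFFFFF
    crc2 = zlib.crc32(pcm[edge:-edge]) & 0xFFFFFFFF
    return crc1, crc2
```

A pressing shifted by fewer than 5 frames would change CRC1 but could still match on CRC2 after offset correction, giving the partial pressing match (10 frames unverified) described in the post.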

When I started AccurateRip, the idea of pressings altering the audio data was not known (to me). If you had 40 different pressings of the same CD (possible with worldwide releases over 10 years), that lowers the 1 in 4 billion chance of a clash for a working 32-bit CRC routine to about 1 in 100 million; adding the 2nd CRC would boost the CRC to effectively 64 bits. Then AccurateRip could return:

Match using old CRC method,
Partial pressing match (10 frames of the file missing)
Match using CRC fix method (32 bit), in addition CRC32 match (on CRC1 and CRC2, so the whole track)

All that would need to be done is a method of showing which of the above to the end user.
Changing to MD5 would mean rewriting the whole backend, and there is about 30x more code on the backend - to keep the database clean of rogue data, such as drives configured with wrong offsets.
Title: AccurateRip - Future Direction
Post by: flacflac on 2008-02-22 10:37:34
Hi Spoon,

great to see you asking for features or suggestions. 

AR is an excellent tool to make sure the rips are done well, but I see a few problems with the way submissions are handled. It's possible you already implemented remedies for these without me noticing, but here we go...:

"Confidence 1"

I often find myself ripping a perfect CD on my (of course...) perfect system, with all tracks ripping at 100% in EAC secure mode Test & Copy, but in the end only 50% of the tracks are "accurately ripped (confidence 1)" and the other 50% do not match. It feels like someone submitted results from a suboptimal rip, tainting the database with lousy AR checksums. Are all the other, better AR results not being registered? Is it a first come, first database entry thing?

This goes into a similar direction as the multiple pressings: Why not save all submissions of one CD, rank them, and tell the user "accurate, confidence 6", "not accurate, confidence 1", i.e. there are 6 others with apparently the same rip result, while there is one submission, that reported a different CRC.  This could go in addition to your idea of creating separate sums for the main part of a track and the first and last 5 frames.


Different Pressings

Your idea sounds great, but I think we should definitely test it in order to figure out whether those 5 frames are enough. If you need some beta testing, let me know!

Thank you.
Title: AccurateRip - Future Direction
Post by: spoon on 2008-02-22 11:18:16
>Are all the other, better AR results not being registered? Is it a first come - first database entry thing?

No, when someone verifies your rip results they would appear in the database.

>"accurate, confidence 6", "not accurate, confidence 1",

The confidence 1 is just from one of the pressings; it is likely your rip failed because it did not match the confidence-6 entry (which is in the database).

5 frames is 2940 samples. There might be CDs with a wider range, but it would be wrong to try to accommodate all the pressings out there (plus a pressing near to one outside the range would verify it).
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-22 17:14:02
This idea sounds pretty good, but I don't understand it fully yet.  How is the offset detection "CRC" computed?  How does it help in finding drive and pressing offsets?

First I think we need a terminology change.  CRC to me means specifically an algorithm based on polynomial division.  What you are calling a CRC is a type of checksum. I suggest calling your algorithm ARCS for Accurate Rip CheckSum. 

The algorithm names will be

ARCS-F  Accurate Rip CheckSum - Flawed
ARCS    Accurate Rip CheckSum - correct
ARO  AR Offset detection checksum (this is the part I don't understand yet)
CRC32  An actual polynomial division based CRC such as CRC-32 IEEE 802.3

ARCS-F adds together increments that are

Low_32_bits_of(sample * i)

ARCS correct should add

Low_32_bits_of(sample * i) + High_32_bits_of(sample * i)

(that is, the upper half of the 64-bit product shifted down and added to the lower half).
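As a concrete (hypothetical) rendering of ARCS-F versus corrected ARCS in Python; the packing of the right channel into the high 16 bits follows the earlier description, and the function names are just the labels proposed above:

```python
MASK32 = 0xFFFFFFFF

def arcs_f(samples):
    """ARCS-F: keep only the low 32 bits of each sample-word * i product."""
    crc = 0
    for i, (left, right) in enumerate(samples, start=1):
        word = ((right & 0xFFFF) << 16) | (left & 0xFFFF)
        crc = (crc + word * i) & MASK32
    return crc

def arcs(samples):
    """Corrected ARCS: fold the product's high 32 bits back into the sum."""
    crc = 0
    for i, (left, right) in enumerate(samples, start=1):
        word = ((right & 0xFFFF) << 16) | (left & 0xFFFF)
        product = word * i                       # full 64-bit product
        crc = (crc + (product & MASK32) + (product >> 32)) & MASK32
    return crc
```

Unlike ARCS-F, the corrected version stays sensitive to the right channel even at sample numbers that are multiples of 2^16.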

Let's name the slots in the two DBs:

DB1-TC  Data Base 1 Track data Checksum
DB1-OD  Data Base 1 Offset detection Checksum
DB2-TC
DB2-OD

So what I think you're proposing is (I've edited this multiple times, sorry if confusing)

DB1-TC currently gets ARCS-F of full track data minus 5 leading frames if a first track and minus 5 ending frames if a last track.  Slowly replace with either ARCS correct or CRC-32 of same data?
DB1-OD gets ARO (maybe the same as ARCS, don't know how this works)
DB2-TC gets CRC32 of track data, always minus 5 leading and 5 ending frames, or do you really mean ARCS of that data?
DB2-OD gets either CRC32 or ARCS or something else? based on the 5 leading and 5 ending frames
Title: AccurateRip - Future Direction
Post by: spoon on 2008-02-22 21:31:36
The offset detection is simply a CRC (checksum) of 1 frame in the first track; with it the program can hunt for the right offset.

> Slowly replace with either ARCS correct or CRC-32 of same data?

Correct ARCS.

>DB2-TC gets CRC32 of track data, always minus 5 leading and 5 ending frames, or do you really mean ARCS of that data?

CRC32

>DB2-OD gets either CRC32 or ARCS or something else? based on the 5 leading and 5 ending frames

CRC32, correct.
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-22 22:05:43
The offset detection is simply a CRC (checksum) of 1 frame in the first track; with it the program can hunt for the right offset.

So each track entry in the database has the ARCS of a single frame of the first track on the CD?  AR scans all offsets, +- about 2000, of the first track's data through that frame window to find a match, to ensure that the read offset is set right?  You are also saying that the current contents of the DB1-OD field were computed with ARCS and not ARCS-F, so that only DB1-TC has the bug?  Which frame number of track 1 did you pick?
Quote
>DB2-OD gets either CRC32 or ARCS or something else? based on the 5 leading and 5 ending frames
CRC32, correct.

So now you slide data through a 10 frame window (5 leading and 5 trailing) looking for a pressing match?

I assume the pressing scan has to do something different for a first or last track, such as only calculate DB2-OD on a leading or trailing 5 frames.
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-22 22:21:53
[DB2] In addition a 2xCRC32's should be generated:

[CRC1][..............CRC2............][CRC1]

So CRC1 is the first say 5 frames and last 5 frames of the track, CRC2 is all the track. These 2 CRCs could be submitted to a 2nd database, where the CRC1 will go into the current offset finding slot, no changes on the backend! (apart from creating the 2nd database)

At first I wondered why you needed two CRC32s for DB2, you could just calculate CRC2 for the entire track.  But it's starting to dawn on me that you're trying to preserve the database logic exactly, so you need to calculate CRC2 for the middle frames only, just like in DB1.  You add on the missing ten frames with CRC1.  It's not used in the pressing scan at all.  That can use the existing DB1-OD field like you said.

Sorry to be so dense.
Title: AccurateRip - Future Direction
Post by: Eli on 2008-02-23 15:27:19
some other ideas have been expressed over at dbpoweramp
http://forum.dbpoweramp.com/showthread.php?t=16463 (http://forum.dbpoweramp.com/showthread.php?t=16463)
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-23 21:15:03
In the DB are stored comp ident - big long OLE type identifier, CRC of track + crc of offset.

Given this record structure, you have 128 bits for the computer ID, and 32 each for the checksums, so it seems that adding another checkword of 32 bits or even 128 bits, only expands the size by less than 2x. Obviously I'm missing some info about what's really going on.

How many records are in the DB?  How long is a record?  There are 16 M tracks in the DB and its size is 14 GB, so that tells me that there are an average of 5 entries per track.  This is probably distributed as a "long tail", with many discs having only 1 entry (confidence 1) and a few popular ones with thousands of submissions.

Just out of curiosity, do you have any easily available statistics on the long-tail structure?  How many of the tracks are sitting at confidence level 1 right now, vs. 2 or 10 or 100?
Title: AccurateRip - Future Direction
Post by: Eli on 2008-02-24 00:03:33
If we are able to have one rolling checksum that will cover different pressings, how much would that shrink the db?
Title: AccurateRip - Future Direction
Post by: spoon on 2008-02-24 18:11:58
>DB1-OD field were computed with ARCS and not ARCS-F

They are calculated over ~544 samples, so the bug does not come into play; the offset frame is just 1 frame, about 50 in (off the top of my head).

>There are 16 M tracks in the DB and its size is 14 GB, so that tells me that there are an average of 5 entries per track

I would say that is right. The 2nd database I would keep separate from the first, so it would need the extra overhead.

>one rolling checksum that will cover different pressings, how much would that shrink the db?

It would not; the above design calls for each pressing to be stored individually. A match on a pressing not in the database is only a partial match (missing 10 frames).
Title: AccurateRip - Future Direction
Post by: Steve Gabriel on 2008-02-24 18:48:52
They are calculated over ~544 samples, so the bug does not come into play, the offset is just 1 frame about 50 in (off the top of my head).

If the same code was used for DB1-OD, then the bug will show up.  Remember that every even-numbered sample shifts the MSB of the right channel out of the result.  This happens even inside a single frame, since you are multiplying by sample number.  A sample whose number is a multiple of 2^m has m bits shifted out.  The sample number being used is the low order 32 bits of the offset from the beginning of the file, not just the offset from the frame beginning.

DB1-OD's purpose is solely offset detection, so having at least complete coverage of the left channel data is probably good enough.
Quote
>There are 16 M tracks in the DB and its size is 14 GB, so that tells me that there are an average of 5 entries per track

I would say that is right.

Does this mean that there is one record for every submission (based on computer ID) for each track?  If a new submission comes in that's already there, does it just update the CRC fields for that ID?
Title: AccurateRip - Future Direction
Post by: spoon on 2008-02-24 21:25:58
>DB1-OD's purpose is solely offset detection, so having at least complete coverage of the left channel data is probably good enough.

Exactly, even if only the left channel was used on its own, it would work.

Yes each submission has a record, resubmissions are dropped, not replaced.
Title: AccurateRip - Future Direction
Post by: Cerebus on 2008-03-08 21:11:28
Is there ANY way that we can avoid using the DISCID in the catalog?  The presence or absence of a data track (and its size) makes checking audio files for AccurateRip accuracy impossible with the current database structure when only the audio files are available.
Title: AccurateRip - Future Direction
Post by: spoon on 2008-03-09 09:11:02
We used everything from the CD TOC to cut the number of collisions, including the data track on CD Extra.
Title: AccurateRip - Future Direction
Post by: skamp on 2008-03-09 11:11:18
4. Fix & Additional Development: Use a different hash (MD5, SHA-1); these would increase the storage of the database by up to 5x (160 bits for SHA-1).

At least with MD5 we would be able to check them directly against MD5 sums already stored in FLAC and WavPack files.
With SHA-1, you would enforce a new standard for file identification that would benefit from hardware implementations that bring the computational cost of hashing down to zero (MD5 will fade away from hardware chips because of its cryptographic weakness). Since I'm always more inclined to look ahead instead of back, I vote for SHA-1.
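For context on skamp's first point: FLAC stores an MD5 of the decoded audio in its STREAMINFO block, computed over the raw interleaved samples. A sketch of producing a comparable hash from 16-bit stereo samples (the little-endian, left-then-right packing is how FLAC hashes CD audio, but treat the details here as an assumption):

```python
import hashlib
import struct

def pcm_md5(samples):
    """MD5 over raw PCM: each (left, right) pair packed as two
    little-endian signed 16-bit integers, interleaved."""
    h = hashlib.md5()
    for left, right in samples:
        h.update(struct.pack('<hh', left, right))
    return h.hexdigest()
```

An AR database keyed on such a hash could be checked directly against the value a FLAC or WavPack file already carries, without re-ripping.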
Title: AccurateRip - Future Direction
Post by: Cerebus on 2008-03-10 14:25:30
Quote
We used everything from the CD TOC to cut the number of collisions, including the data track on CD Extra.


That doesn't mean it's the right thing to do.  I understand the issues with collisions, but I think there are very few collisions that hinge on the presence or absence of a data track in the TOC, and the usefulness of AR goes up significantly if you can test an arbitrary set of lossless audio data against the db.  A few more collisions versus a significant improvement in functionality should be considered...
Title: AccurateRip - Future Direction
Post by: bilbo on 2008-03-10 15:04:30
Yes each submission has a record, resubmissions are dropped, not replaced.

If I understand this correctly, you may be rejecting good submissions. From reading the posts, many people first rip in burst mode, and if an error shows up against the AR database, they then use secure mode. Assuming that the re-rip is good, the first submission is bad but would be accepted by the database; then, when the good submission is submitted, it would be rejected. If this is correct, you are rejecting a substantial amount of good data.
Title: AccurateRip - Future Direction
Post by: spoon on 2008-03-10 15:08:42
How would the system know if it is a good submission? If the tracks are already in the database with a confidence of 2 or higher, adding 1 more does not add much in terms of value, and maintaining one computer's 100 rips of a dodgy CD as they attempt to recover it does not make good sense.

-----
@Cerebus

If you have the album name and artist you can get a TOC from MusicBrainz, do it that way around.
Title: AccurateRip - Future Direction
Post by: Eli on 2008-03-10 15:16:02
How would the system know if it is a good submission?


If the submission is from either dBpoweramp or EAC and ripped in secure mode with secure results, especially if the whole disc is ripped and secure. This should be less of an issue with AR2 though since it looks like we will be able to cross-check somewhat between pressings.
Title: AccurateRip - Future Direction
Post by: bilbo on 2008-03-10 15:24:48
Using EAC as an example: the report to the AR database could contain the EAC comparison results. If the copy and test CRCs matched, and the AR CRC differs between the submissions, you could replace the original submission with the CRC-matched results.
Title: AccurateRip - Future Direction
Post by: Eli on 2008-03-10 16:31:33
T&C does not account for consistent errors
Title: AccurateRip - Future Direction
Post by: greynol on 2008-03-10 18:41:27
T&C does not account for consistent errors

But it does account for inconsistent errors, whereas the original submission may not take into account the possibility for any errors.  Remember, Bilbo is talking about replacing a submission made with the same user ID.

However, I find such an implementation unnecessarily cumbersome.  I have faith that someone else will either come along and add a confidence to the rip or provide a new checksum if the original one wasn't correct and eventually someone else will bump out the bad result with a submission that matches.

AR is great because of agreeing submissions by multiple users, not by submissions by single users even if they are "secure".
Title: AccurateRip - Future Direction
Post by: bilbo on 2008-03-10 21:50:00
I am just concerned with the confidence levels. Say 52 people rip a disc and submit the results.
Two used secure mode (and their results match), and 50 used burst mode. All the burst-mode rips failed the AR checksum. These 50 people re-rip in secure mode and the resulting CRCs all match the first two. Under the current system, if another person checks a rip, he will only get a confidence level of 2, whereas it should really be 52, because the good second submissions were rejected.

Another option that would help would be to allow users to block the submission of their first results if they are going to re-rip.
Title: AccurateRip - Future Direction
Post by: greynol on 2008-03-10 22:24:44
<quoting myself for the hundredth time>
A confidence of 2 is good enough in my book, a confidence of 1 is fine with me as well so long as it wasn't my submission.
</quoting myself for the hundredth time>

FWIW, when I submit results, I do so manually.  I only send the results that I want to send; the rest are blown away prior to ripping a title I plan on submitting.  Personally, it doesn't matter to me if others aren't as picky.  Out of submissions from 50 individual users with the same pressing, at least one of them is bound to be accurate, even if it was done in burst mode.  I think this is more realistic than saying all 50 people who get bad rips in burst mode are going to get good rips by virtue of switching to a secure mode.
Title: AccurateRip - Future Direction
Post by: greynol on 2008-03-10 22:53:53
Anyhow, as far as checksums go, I think CRC32 is fine and if you're worried that people will be concerned that the checksums for the first and last track don't match the one given by your ripper, just NOT them.
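The "just NOT them" idea is a one-line transform on a standard CRC-32; a sketch (the function name is made up):

```python
import zlib

def ar_crc32(pcm):
    """Standard CRC-32, bitwise-NOTed so the stored value can never be
    mistaken for the ripper's own CRC of the same data."""
    return ~zlib.crc32(pcm) & 0xFFFFFFFF
```

Matching against the database then just means applying the same NOT on both sides; the error-detection strength of the underlying CRC is unchanged.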
Title: AccurateRip - Future Direction
Post by: Eli on 2008-03-10 23:45:21
I am just concerned with the confidence levels. Say 52 people rip a disc and submit the results.
Two used secure mode (and their results match), and 50 used burst mode. All the burst-mode rips failed the AR checksum. These 50 people re-rip in secure mode and the resulting CRCs all match the first two. Under the current system, if another person checks a rip, he will only get a confidence level of 2, whereas it should really be 52, because the good second submissions were rejected.

Another option that would help, would be to allow the block submission of the first results, if they are going to re-rip.


I may be wrong, but I think if you re-rip and re-submit it would be added to the database. If it matches, it will add to the match count. If it does not match any previous rips it is added to the offline database, or if there are no previous submissions for that disc (which in this case there are) it would be added to the main database.

What I have suggested, is that if a rip is accurate but does not match any AR submissions, especially if this is the case for the entire disc, then it should be added to the Active AR Database that is used for matching even if it is only 1 submission. But it seems as though this may not be an issue as we will be able to cross check between pressings.
Title: AccurateRip - Future Direction
Post by: bilbo on 2008-03-10 23:56:11
@Eli
I brought this up because of Spoon's statement earlier in the thread:

"Yes each submission has a record, resubmissions are dropped, not replaced."
Title: AccurateRip - Future Direction
Post by: Eli on 2008-03-11 01:34:23
@Eli
I brought this up because of Spoon's statement earlier in the thread:

"Yes each submission has a record, resubmissions are dropped, not replaced."


Thanks, I did not notice that. I agree, it's not a good idea, especially since re-rips are probably more likely to be accurate, as additional measures have probably been taken to correct bad rips.
Title: AccurateRip - Future Direction
Post by: greynol on 2008-03-11 02:19:10
Especially since re-rips are probably more likely to be accurate as additional measures have probably been taken to correct bad rips.
That's an excellent point.

I say replace the old submission unless the old one has been verified with a submission from a different user.

I don't think the answer lies in submitting T&C data or information about the ripping configuration.
Title: AccurateRip - Future Direction
Post by: CoolHandZeke on 2008-03-15 23:58:18
MD5!  Widely used, 128-bit hash value...

Used extensively in the field of computer/digital forensics.  Certainly would benefit dBpowerAmp, IMHO. 
Title: AccurateRip - Future Direction
Post by: Jean Tourrilhes on 2008-03-16 06:02:28
Hi,

I already posted that on the dBpoweramp forums, but the discussion seems dead on that side...

While you are thinking of improving AccurateRip, I have a few suggestions.

I think it would be interesting to tell if the AccurateRip match is composed only of drives of the same type, or if there is a diversity of drive types.

This would help in two ways:

1) If I re-rip the same disc on the same drive, I would know whether or not I match my previous result.

2) If you assume that drives can have repeatable firmware bugs, one could gauge the confidence of a rip with more certainty, as more drive diversity is better.

For example, I can assume that Plextor is pretty popular for ripping. If I rip with a Plextor and I match with confidence 4, and all 4 other matches were also done on Plextors, I get less confidence than if all 4 were on different drives. See related discussions on reading small block sizes...

One quick way to implement this: for each AR record, you associate an offset field. For the first rip, when the AR record is created, you set the record offset to the offset of the drive that did the rip. For subsequent rips, if the offset is the same, you don't change anything; if the offset of the new rip differs from the record offset, you invalidate the offset field. Then, when checking the AR record, I can compare the AR record offset with my own offset.
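The per-record offset scheme above could be sketched like this. This is purely a hypothetical illustration: the struct layout, the `OFFSET_MIXED` sentinel, and the function names are my own invention, not the actual AR database design.

```c
#include <stdint.h>

#define OFFSET_MIXED INT32_MIN  /* sentinel: submissions came from drives with differing offsets */

typedef struct {
    uint32_t crc;          /* track CRC stored in this AR record */
    uint32_t confidence;   /* number of matching submissions */
    int32_t  drive_offset; /* offset of the first submitting drive, or OFFSET_MIXED */
} ARRecord;

/* Fold a new submission's drive offset into the record: keep the offset
   while all submitters agree, invalidate it on the first mismatch. */
void ar_record_add_submission(ARRecord *rec, int32_t submitter_offset)
{
    rec->confidence++;
    if (rec->drive_offset != submitter_offset)
        rec->drive_offset = OFFSET_MIXED;
}

/* At verification time: did at least one submitter use a different drive offset than mine? */
int ar_record_has_drive_diversity(const ARRecord *rec, int32_t my_offset)
{
    return rec->drive_offset == OFFSET_MIXED || rec->drive_offset != my_offset;
}
```

One field per record is enough for the yes/no question Jean poses ("is my match only from drives like mine?"), at the cost of not knowing *how many* distinct drive types matched.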

In a similar vein, you could keep in the record some idea of the diversity of ripping programs and ripping modes used. As we know, some programs in some modes may introduce repeatable errors. Again, greater diversity gives greater confidence.

Thanks again for AccurateRip, and good luck with the new version...

Jean
Title: AccurateRip - Future Direction
Post by: rpp3po on 2010-01-30 15:03:02
Has this been fixed in the April 2009 rewrite? Which route did you take regarding the CRC implementation? I missed this thread back in 2008. Basically, a plain CRC, just correctly implemented, should have been fine. Even saving two CRCs, calculated with both the new and old schemes, shouldn't be that much data nowadays.
Title: AccurateRip - Future Direction
Post by: greynol on 2010-01-30 21:46:22
No and it's not going to be fixed.  If it used a CRC then CUETools wouldn't be able to do what it does.
Title: AccurateRip - Future Direction
Post by: spoon on 2010-01-31 09:10:53
There is no reason such a fix could not go in transparently, though it would have to be implemented in all the respective programs. Taking EAC as an example, the development time would be better spent on allowing EAC to check across pressings; that makes AR much more useful.
Title: AccurateRip - Future Direction
Post by: greynol on 2010-01-31 19:44:18
Change the hash calculation to CRC32 or something similar, and quick & easy checking against multiple pressings through calculation goes out the window.
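For context, "checking through calculation" works because the existing AccurateRip checksum is essentially a position-weighted sum, which can be slid across candidate pressing offsets incrementally rather than recomputed from scratch. A rough sketch of that idea, under my own simplifying assumptions (plain low-32-bit weighted sum, ignoring the folded-in high bits and AR's first/last-track sample skipping):

```c
#include <stdint.h>
#include <stddef.h>

/* Compute C(o) = sum_{i=0}^{n-1} (i+1) * s[o+i]  (mod 2^32)
   for offsets o = 0..max_offset, in O(1) per offset after the first.
   Caller must supply at least n + max_offset samples in s,
   and max_offset + 1 slots in out. */
void ar_weighted_sum_all_offsets(const uint32_t *s, size_t n,
                                 size_t max_offset, uint32_t *out)
{
    uint32_t c = 0, w = 0;  /* c = weighted sum, w = plain sum of current window */
    for (size_t i = 0; i < n; i++) {
        c += (uint32_t)(i + 1) * s[i];
        w += s[i];
    }
    out[0] = c;
    for (size_t o = 0; o < max_offset; o++) {
        /* Slide the window one sample right: every weight drops by 1
           (subtract w), and the new last sample enters with weight n. */
        c = c - w + (uint32_t)n * s[o + n];
        w = w - s[o] + s[o + n];
        out[o + 1] = c;
    }
}
```

A generic CRC32 has no such per-offset recurrence in ordinary integer arithmetic, which is the trade-off being discussed here.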
Title: AccurateRip - Future Direction
Post by: spoon on 2010-01-31 20:20:33
The existing algorithm can be fixed while maintaining the ability to calculate different pressings. I will do it next week for dBpoweramp R14, so R14 will submit to the database only with the new fixed routine, but can check both old and new. This will be branded as AR2: the fixed CRC as well as the ability to check pressings (by using the offset-finding CRC). I will contact Andre to see if he is interested in updating EAC.
Title: AccurateRip - Future Direction
Post by: greynol on 2010-01-31 20:24:45
What about CUETools and XLD?  You know they don't use the offset finding hash (which is not a CRC), right?
Title: AccurateRip - Future Direction
Post by: spoon on 2010-01-31 21:26:15
No they do not; they hammer through around 4000 CRCs looking for a match. Nothing is stopping them from using the offset-finding CRC to find the offset before calculating.
Title: AccurateRip - Future Direction
Post by: spoon on 2010-01-31 21:33:01
Here is the proposed new CRC calculation, which should still allow the fast parallel calculation:

Code: [Select]
    // Proposed AR2 checksum: multiply each complete 32-bit sample (left +
    // right channel) by its 1-based position and fold both halves of the
    // 64-bit product into the running sum, so no channel bits roll out.
    DWORD CalcCRCNew(const DWORD *Samples, DWORD SampleCount)
    {
        DWORD AC_CRCNEW = 0;
        DWORD MulBy = 1;
        for (DWORD i = 0; i < SampleCount; i++)
        {
            DWORD Value = Samples[i]; // complete 32-bit sample comprising left and right channels

            unsigned __int64 CalcCRCNEW = (unsigned __int64)Value * (unsigned __int64)MulBy;
            DWORD LOCalcCRCNEW = (DWORD)(CalcCRCNEW & 0xFFFFFFFF);
            DWORD HICalcCRCNEW = (DWORD)(CalcCRCNEW >> 32);
            AC_CRCNEW += HICalcCRCNEW;
            AC_CRCNEW += LOCalcCRCNEW;

            MulBy++;
        }
        return AC_CRCNEW;
    }


Or I could switch to CRC32, which would deal with null samples better, and a) encourage the use of the offset-finding CRC, b) field 1001 questions about why tracks 1 and n do not match the CRC32 calculation.
Title: AccurateRip - Future Direction
Post by: Skybrowser on 2010-02-01 00:45:57
I'm slightly new and have found this topic quite interesting, even if I can't understand every detail. This may sound like a dumb question, but for the sake of the newer people to the scene: what does this mean for us? For the CDs we have already ripped, if they say accurate in EAC or CUETools under AccurateRip, for all intents and purposes... are they accurate? Based on the information I'm seeing in this topic, I'm assuming anything with a confidence of 2 or higher can be trusted, based on the sheer odds that it's almost impossible for 2 people to have the exact same errors and match in AR with bad rips, with you being the third person to match them with the exact same errors. Am I right in assuming this?

Also, someone made a comment about CDs with data tracks. I have had severe problems verifying all the tracks on any CD with a data track. Is this a common problem, and is there a current workaround for verification? Or is this one of the reasons you are reworking the code for your many ripping programs?

I'd like to thank all of you for the hard work you put into your software, and this website. Discovering all of this has refueled my passion for music in many ways, and given new life to my old CDs.

Cheers.

Title: AccurateRip - Future Direction
Post by: greynol on 2010-02-01 02:39:45
For the CDs we have already ripped, if they say accurate in EAC or CUETools under AccurateRip, for all intents and purposes... are they accurate?
Yes.

Based on the information I'm seeing in this topic, I'm assuming anything with a confidence of 2 or higher can be trusted, based on the sheer odds that it's almost impossible for 2 people to have the exact same errors and match in AR with bad rips, with you being the third person to match them with the exact same errors. Am I right in assuming this?
Confidence of 1 is all that's needed provided it wasn't your submission.

Also, someone made a comment about CDs with data tracks. I have had severe problems verifying all the tracks on any CD with a data track. Is this a common problem, and is there a current workaround for verification?
Is this a problem because the data track is part of the copy protection?  If so then TOS #9 says we aren't allowed to discuss it.
Title: AccurateRip - Future Direction
Post by: Skybrowser on 2010-02-01 09:45:56
Is TOS #9 American legislation? Because I live in Canada, and as far as I know there's a little thing called freedom of speech. But as far as I can tell it's not to do with copy protection; it's many CDs with data tracks, whether they be music videos or what have you.
Title: AccurateRip - Future Direction
Post by: Akkurat on 2010-02-01 12:05:42
Quote
Is TOS #9 American Legislation?
No. Already discussed to death in here (http://www.hydrogenaudio.org/forums/index.php?showtopic=73353) & here (http://www.hydrogenaudio.org/forums/index.php?showtopic=75849).

Summary: the TOS is a set of rules that users of this board agree to follow when signing up. It has nothing to do with the laws in your country (which you should follow). TOS #9 protects the admins/owners from any litigation in the country where the HA server resides. All in all, "HA TOS are the "laws" of our little society which keeps it all together without all going down the drain." (quoting myself).


About the data track verification problems: you do have a proper logfile, right? I'm assuming that you do your verifications with CUETools. Other than not having a (proper) logfile (or the CD being copy protected), I haven't heard of any verification problems with data track CDs that don't fall under the normal verification problems that affect all CDs.

EDIT: fixed info.
Title: AccurateRip - Future Direction
Post by: greynol on 2010-02-01 16:54:29
Actually a CUE sheet does little good with discs that have data tracks.  You need a log file that shows the start and length of each track (including the data track).
Title: AccurateRip - Future Direction
Post by: Akkurat on 2010-02-01 17:02:30
Ahh, brain fart, of course logfile, not cuesheet, fixed the previous post, thanks greynol.
Title: AccurateRip - Future Direction
Post by: Gregory S. Chudov on 2010-02-09 07:57:08
Here is the proposed new CRC calculation, which should still allow the fast parallel calculation:

Or I could switch to CRC32, which would deal with null samples better, and a) encourage the use of the offset-finding CRC, b) field 1001 questions about why tracks 1 and n do not match the CRC32 calculation.


Greetings, Mr. Spoon.

If you haven't made a final decision yet, I have a couple of things to say.

First, I think the current AccurateRip checksum is enough for all intents and purposes. I've given my reasons in the other thread (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=78300&view=findpost&p=684494).

I don't think it's worth the effort to replace it with that 64-bit one.

Concerning CRC32, I think I found a way to do a parallel calculation of it (fast enough for offset detection). I can give some technical details if you are interested.
Title: AccurateRip - Future Direction
Post by: spoon on 2010-02-09 09:22:21
I do not think there is a need for parallel calculation, as the drive offset-finding CRC can be used to find the exact pressing offset.
Title: AccurateRip - Future Direction
Post by: hellokeith on 2010-02-14 07:39:16
Spoon,

This is a suggestion.  I do realize what areas it could impact in terms of time, complexity, and cost.  Just throwing it out there.

I posted this in another thread regarding source of AR submissions.
Quote
AR does not provide any indication of the source of submissions. Should it? Ultimately that is a question only the maintainers of AR will answer, so my speculation is really quite irrelevant. But if it did..

On 2006 June 28 at 14:47:19 this album was submitted from IP 213.49.53.xxx from an Optiarc drive with offset -7.

All I have to do is look up that class C IP block for a general geolocation (city level) and look at the drive type, and I know immediately whether it was me, a friend, or a complete stranger in Botswana. See 2 entries with all-different field values, and you would know immediately that this is no coincidence.
Title: AccurateRip - Future Direction
Post by: viktor on 2010-02-14 11:19:17
About using MD5 or SHA-1:

http://en.wikipedia.org/wiki/MD5#Collision_vulnerability (http://en.wikipedia.org/wiki/MD5#Collision_vulnerability)

I don't know if that's what Steve called irrelevant, but I'd feel hesitant to use a "broken" hash algorithm.
Title: AccurateRip - Future Direction
Post by: Gregory S. Chudov on 2010-02-14 12:31:19
There's no need for a strong hash algorithm in AccurateRip. In fact, there's no need for any hash algorithm in AccurateRip; hashes are needed only for cryptography. For applications such as AccurateRip, developers traditionally, and with good reason, use various CRCs. The sets of requirements for CRCs and hashes are different; they are good for different purposes. It would be wrong to say that a hash is stronger than a CRC; their strengths and weaknesses are different. For example, CRC32 guarantees a different checksum for any error burst up to 32 bits long. No hash function can guarantee that. In terms of collisions, a CRC is no weaker than any hash of the same size.
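For reference, the CRC32 under discussion is the standard reflected CRC-32 (polynomial 0xEDB88320) used by zip and displayed by EAC. A minimal bitwise sketch, just to make the comparison concrete:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), the variant used by
   zip and EAC. Real implementations use a lookup table; this is the
   simplest correct form. */
uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            /* shift right; XOR in the polynomial if the low bit was set */
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The burst-error guarantee follows from the polynomial-division construction; a generic 32-bit hash can only offer a 1-in-2^32 *probability* of catching such an error, with no guarantee.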
Title: AccurateRip - Future Direction
Post by: spoon on 2010-02-14 15:01:56
Also, if you wanted extra bits, you could potentially use the other pressings to match; popular CDs might have 14 CRCs there, and you could match all 14.

Not sure why it is so important to have the IP address of a submission; surely you know if you have ripped and submitted?
Title: AccurateRip - Future Direction
Post by: hellokeith on 2010-02-15 01:44:08
Not sure why it is so important to have the IP address of a submission; surely you know if you have ripped and submitted?


Due to a hard drive crash a few years back, I can guarantee I've ripped and submitted at least a few of the same CDs on the same PC with the same drive but a different OS. You also have the issue of people ripping the same disc on multiple machines in the same household, or ripping the same disc on their work PC.

I don't know if IP address is the perfect solution, but some detail of submission source would go a long way.
Title: AccurateRip - Future Direction
Post by: radu on 2010-02-16 14:30:36
So... what stage of development is this in? When can we expect a public release?
Title: AccurateRip - Future Direction
Post by: spoon on 2010-02-16 14:40:01
The different-pressing detection code is already in beta (dBpoweramp R14). For the new CRC I have written the code but not tested it, so a few weeks. EAC would also need updating, which is separate.
Title: AccurateRip - Future Direction
Post by: spoon on 2010-02-16 22:38:47
Details on the current AR CRC calculation routines (yet to be updated to the new CRC calculation):

http://forum.dbpoweramp.com/showthread.php?t=20641 (http://forum.dbpoweramp.com/showthread.php?t=20641)
Title: AccurateRip - Future Direction
Post by: Gregory S. Chudov on 2010-02-17 08:45:52
Why create another CRC that is not much stronger than the previous one? If you want a strong CRC, why not CRC32?