Improving compression of multiple similar songs



Improving compression of multiple similar songs

2021-11-12 01:33:27

Hello. This is my first post ever on this forum. Maybe you know about the tar program. It combines multiple files/directories into one. After that one can compress it with gzip, zstd, xz, etc. When compressing similar files, the compression achieved by .tar.gz will be better than that of .zip. That's because .zip compresses each file separately, while .tar.gz compresses the whole .tar archive.
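To make the .zip vs. .tar.gz point concrete, here is a minimal sketch using Python's zlib (the same DEFLATE algorithm gzip uses). The two "files" are made-up byte strings, not real archives, and real tar headers are ignored; the only thing it shows is per-file versus whole-stream compression.

```python
import random
import zlib

rng = random.Random(0)

# Two hypothetical "similar files": file_b is a copy of file_a
# with a handful of bytes changed.
file_a = bytes(rng.randrange(256) for _ in range(10_000))
changed = bytearray(file_a)
for pos in range(0, len(changed), 1000):
    changed[pos] ^= 0xFF
file_b = bytes(changed)

# .zip style: every file is compressed on its own.
separate_size = len(zlib.compress(file_a, 9)) + len(zlib.compress(file_b, 9))

# .tar.gz style: the files are concatenated first and compressed once,
# so matches in file_b can point back into file_a.
solid_size = len(zlib.compress(file_a + file_b, 9))

print("separate:", separate_size, "bytes")
print("solid:   ", solid_size, "bytes")
```

Note that DEFLATE only looks back 32 KiB, so the files here are kept small on purpose. With larger files the second copy would be out of reach for gzip, while compressors with bigger windows (zstd, xz) could still find it.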

So I've got an idea: suppose there are multiple similar songs, maybe from the same band or album. Would applying the same principle improve compression of the songs? If yes, then how much would lossy vs. lossless algorithms benefit from the technique?

P. S. I'm just a layman who is interested in video/audio codecs, but I'm not familiar with codecs' internals.

Re: Improving compression of multiple similar songs

Reply #1 – 2021-11-12 01:48:47

Practical audio/video codecs only analyze a very short amount of material (milliseconds to at most a few seconds) at a time to keep memory usage reasonable. Using more memory allows you to exploit longer term correlations in the data stream and so improves compression somewhat, but it usually isn't worthwhile.

Re: Improving compression of multiple similar songs

Reply #2 – 2021-11-12 03:06:57

The closest thing is .cue + audio file (.flac, for example).
By doing that you can save a couple of bytes, maybe?
Like saratoga said, it's just not worth it.

Re: Improving compression of multiple similar songs

Reply #3 – 2021-11-12 07:34:41

For example, the FLAC codec never looks back more than 32 samples, and in subset CDDA files (which are the overwhelming majority) no more than 12 samples. 12 samples is 0.3 milliseconds. So, with FLAC, you're not going to get any compression benefit.
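The 0.3 ms figure follows directly from the CD sample rate. A small sketch of the arithmetic, assuming 44.1 kHz CDDA (the function name is just for illustration):

```python
def lookback_ms(samples: int, sample_rate_hz: int = 44_100) -> float:
    """Duration in milliseconds covered by `samples` at the given rate."""
    return samples / sample_rate_hz * 1000.0


print(f"{lookback_ms(12):.2f} ms")  # subset limit mentioned above
print(f"{lookback_ms(32):.2f} ms")  # absolute limit mentioned above
```

That comes to roughly 0.27 ms and 0.73 ms, many orders of magnitude shorter than the distance between two songs or two choruses.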

If you want to do this, you should create a format that removes long-term redundancy (by comparing blocks and subtracting one from the other) and then feeds the result through FLAC. However, real-time decoding of such a file would be very difficult.
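A minimal sketch of the "compare blocks and subtract" step, assuming the two blocks are already aligned and of equal length. The sample values are invented, and finding which block to subtract from which is the hard part that this does not address.

```python
def subtract_block(target, reference):
    """Return the sample-wise difference target - reference."""
    if len(target) != len(reference):
        raise ValueError("blocks must have the same length")
    return [t - r for t, r in zip(target, reference)]


def restore_block(residual, reference):
    """Undo subtract_block: reference + residual gives the target back."""
    return [d + r for d, r in zip(residual, reference)]


# Hypothetical blocks: the second is nearly a copy of the first.
reference = [0, 1200, 2300, 3100, 2300, 1200, 0, -1200]
target = [0, 1201, 2300, 3099, 2300, 1200, 1, -1200]

residual = subtract_block(target, reference)
print(residual)
print(restore_block(residual, reference) == target)
```

The residual is small and cheap to encode only when the blocks really are near-identical sample for sample. Also note that the difference of two 16-bit samples can need 17 bits, so the residual stream is not plain CDDA any more.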

Re: Improving compression of multiple similar songs

Reply #4 – 2021-11-12 09:41:31

I've been thinking of the same, and maybe I even posted it here - or maybe someone else did. And the answer, like others have already hinted at:

Audio formats are supposed to be played back in real time. They aren't made for you to decompress your entire collection every time you want to play back a song. If you had a workstation where you would make a demo tape with "record chorus once, insert it five times in a song" then obviously it would save space if you could point 2,3,4,5 back to the first occurrence and save all that space.
But if you are streaming the song by successively delivering chunks of the encoded file, and someone tunes in after chorus 1, they wouldn't have the data. So it isn't "streamable" from file - imagine the buffering.

I am not saying that it wouldn't be possible to do what you suggest, but it would require other workarounds - and besides, for end-users: how much space would you really save? If you have two different masterings of the same album, I doubt it would be much; if you have fourteen different rips all with errors of a single song and want to keep them all until CUETools gets a correction file that can be useful, then sure - but this isn't a big part of your hard drive, hopefully, and then overall gains for implementing such a monster would be quite small.

In theory it could be done at file system level, with a file system constructed specifically for the purpose of compressing and deduplicating audio. That sounds like more than a Master's thesis, to say the least ... and compression and block-level deduplication demand quite a bit of computing power - saving bytes is not done for free. Then the file-system compressed bytes would be decompressed prior to streaming. (Buffer issue? Full file in RAM before you play.)

Re: Improving compression of multiple similar songs

Reply #5 – 2021-11-12 20:27:48

You might try some archiver with solid mode and a large dictionary, but they look for identical data rather than similar data. And even a slight difference in any audio characteristic will result in non-identical data.
Such an idea seems more suited for lossy codecs, but so far nothing like that exists (maybe deep learning codecs will be something like that). Also, something like that would rather be an archiver, with poor seeking, quite a lot of I/O etc.
MIDI and module files are something similar - they consist of samples which you play on different notes, volumes and effects, create sequences which may repeat etc. But you make them like that at the time of creation - you can't quite convert something to MIDI or mod.

Re: Improving compression of multiple similar songs

Reply #6 – 2021-11-13 00:36:17

Quote from: rutra80 on 2021-11-12 20:27:48

Such an idea seems more suited for lossy codecs, but so far nothing like that exists

In principle, a hybrid lossy/lossless codec with a correction file does something like that. By design when encoding, of course.

Re: Improving compression of multiple similar songs

Reply #7 – 2021-11-13 01:58:33

Even many lossless codecs are internally a lossy predictor with correction data. And even Fraunhofer tried to do something lossless with MP3 when they made MP3HD, which was doomed to fail because it was a closed standard.

Something like MP3HD requires a reference MP3 decoder that works purely in integer space, and produces consistent 16 bit or greater output given a particular MP3 stream, regardless of which CPU type it is compiled for, since the compressed residue correction data needs to precisely work with a uniform lossy input.
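The "lossy layer plus correction data" structure can be sketched with a toy example. This is not how MP3HD or any real codec works internally; the lossy layer here is just a coarse quantiser, and `STEP` is an arbitrary illustrative value. What it does show is why the lossy decoder has to be bit-exact: the correction data is only valid against one specific lossy output.

```python
STEP = 64  # hypothetical quantiser step; larger means a coarser lossy layer


def decode_lossy(lossy):
    """Integer-only lossy decoder, so every platform gets the same output."""
    return [q * STEP for q in lossy]


def encode(samples):
    """Split samples into a coarse lossy layer and exact correction data."""
    lossy = [s // STEP for s in samples]
    correction = [s - d for s, d in zip(samples, decode_lossy(lossy))]
    return lossy, correction


def decode_lossless(lossy, correction):
    """Lossy output plus correction data restores the original exactly."""
    return [d + c for d, c in zip(decode_lossy(lossy), correction)]


samples = [0, 100, -100, 32767, -32768]
lossy, correction = encode(samples)
print(decode_lossy(lossy))
print(correction)
print(decode_lossless(lossy, correction) == samples)
```

If `decode_lossy` produced even slightly different values on another CPU (as a floating-point MP3 decoder can), adding the same correction data would no longer give back the original samples.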

Re: Improving compression of multiple similar songs

Reply #8 – 2021-11-18 23:39:47

Experiment: Crossing channels between the same song in two different masterings, to see if bitrate would go up a lot.
If yes - an indication that deduplication wouldn't compress much (due to the different masterings); if no - an indication that it could. Not a rigorous test by any means, not sure how valuable - but at least an indication.

And edit: here I had to change my conclusion because ... actually, stereo decorrelation doesn't help more than about a percent on these albums. Bummer, should have thought of that.

Oh well, here is what I did initially:

I have three King Diamond albums where the remasters have exactly the same track lengths - to the sample! - as the original CD. Evidently Mr. LaRocque Allhage has gone back to files and just applied some new settings. They do measure quite different:
originals compress to 918 kbit/s at FLAC -8, RG album gains are like -5, -9, -7, dynamic ranges are 13, 9, 10;
remasters compress to 1026 kbit/s, also at -8, RG album gains are like -9, -11, -12, dynamic ranges 9, 8, 6.
(They are Spider's Lullabye, Voodoo, House of God.)

Using ffmpeg, I took each CD and created a stereo .wav with the left channel being the left channel from the original master, and the right channel being the right channel from the remaster. Then compressed.
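The exact ffmpeg command is not given in the post. As a stand-in, the same channel crossing can be sketched in plain Python on raw PCM, assuming both inputs are interleaved 16-bit little-endian stereo of equal length (reading and writing the .wav containers is left out):

```python
import struct


def cross_channels(pcm_a: bytes, pcm_b: bytes) -> bytes:
    """Build a stereo stream with the left channel of A and the right of B."""
    if len(pcm_a) != len(pcm_b) or len(pcm_a) % 4:
        raise ValueError("need two equally long 16-bit stereo PCM buffers")
    out = bytearray(len(pcm_a))
    for i in range(0, len(pcm_a), 4):
        out[i:i + 2] = pcm_a[i:i + 2]          # left sample from A
        out[i + 2:i + 4] = pcm_b[i + 2:i + 4]  # right sample from B
    return bytes(out)


# Two tiny made-up buffers with two stereo frames each.
original = struct.pack("<4h", 1, 2, 3, 4)
remaster = struct.pack("<4h", 10, 20, 30, 40)
crossed = cross_channels(original, remaster)
print(struct.unpack("<4h", crossed))
```

This only works as a test because the track lengths match to the sample; it does nothing about a possible offset between the two masterings.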

Numbers:
* The albums - all 2x3 averaged: 972 kbit/s in flac.exe -8, and TAK -p4m test encoding reported 67.46 percent.
* The ones I generated crossing channels: 983 kbit/s, resp. 68.27 percent.
* The 2x3 mono files average to 492 kbit/s. Multiply that by two, and ...
* So then I realized the idea was quite a bit flawed, and generated stereo files with left channel from old & left channel from new. How would stereo decorrelation then change? Not much: 985 kbit/s, while the left-channel mono files are 493 kbit/s.
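Spelling out the "multiply that by two" step: the saving from stereo decorrelation is the gap between the stereo bitrate and twice the mono bitrate. A quick sketch with the figures above (pairing the all-channel mono average with the crossed files is an approximation):

```python
def stereo_saving_percent(stereo_kbps: float, mono_kbps: float) -> float:
    """Saving of a stereo encode relative to two independent mono encodes."""
    dual_mono = 2 * mono_kbps
    return (dual_mono - stereo_kbps) / dual_mono * 100.0


print(f"{stereo_saving_percent(972, 492):.2f} %")  # the actual albums
print(f"{stereo_saving_percent(983, 492):.2f} %")  # left old + right new
print(f"{stereo_saving_percent(985, 493):.2f} %")  # left old + left new
```

So the real albums gain only about 1.2 percent from stereo decorrelation, and the crossed files about 0.1 percent. With so little to lose in the first place, the test says little either way about how similar the two masterings are.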

What I did not do: try to align peaks. So there could be an offset misalignment.

Re: Improving compression of multiple similar songs

Reply #9 – 2022-10-09 12:48:25

Apologies for necro-posting, I think it's relevant.

Quote from: ktf on 2021-11-12 07:34:41

For example, the FLAC codec never looks back more than 32 samples, and in subset CDDA files (which are the overwhelming majority) no more than 12 samples. 12 samples is 0.3 milliseconds. So, with FLAC, you're not going to get any compression benefit.

If you want to do this, you should create a format that removes long-term redundancy (by comparing blocks and subtracting one from the other) and after removing redundancy, feeding it through FLAC. However, real-time decoding of such a file would be very difficult.

I actually do have a pet project that removes redundancy from audio, but it's niche. The use case is compressing variants of old CD games together, which often have CDDA tracks that would mostly/fully match if it weren't for different factory write offsets (sometimes also variable pregaps and track lengths on particularly poorly mastered content, *cough* Sega CD). The audio portion of the format is a metadata file and a separate file of all the remaining samples concatenated. As an example the entirety of audio from Sega CD is roughly 112GiB, which de-duplicated shrinks to around 64GiB with barely any exactly-matching tracks. If anything some other systems are on average slightly more efficient as there is less janky mastering getting in the way of efficient de-duplication.
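The post does not describe the matching algorithm or the metadata layout, so the following is only a toy sketch of the general idea: find the shift at which a variant lines up with a track that is already stored, then keep just the shift plus whatever samples the stored track does not cover. All names and the dict layout are hypothetical, and a real tool would need hashing rather than brute force.

```python
def find_shift(reference, variant, max_shift=8, min_overlap=4):
    """Return s with variant[i] == reference[i + s] on the overlap, or None."""
    for shift in range(-max_shift, max_shift + 1):
        start = max(0, -shift)
        end = min(len(variant), len(reference) - shift)
        if end - start < min_overlap:
            continue
        if variant[start:end] == reference[start + shift:end + shift]:
            return shift
    return None


def store_variant(reference, variant):
    """Describe variant as shift + uncovered head/tail samples."""
    shift = find_shift(reference, variant)
    if shift is None:  # no match: keep every sample
        return {"shift": None, "head": list(variant), "tail": []}
    start = max(0, -shift)
    end = min(len(variant), len(reference) - shift)
    return {"shift": shift, "length": end - start,
            "head": list(variant[:start]), "tail": list(variant[end:])}


def rebuild(reference, entry):
    """Bit-exact reconstruction of the variant from reference + entry."""
    if entry["shift"] is None:
        return entry["head"]
    ref_start = len(entry["head"]) + entry["shift"]
    shared = list(reference[ref_start:ref_start + entry["length"]])
    return entry["head"] + shared + entry["tail"]


# Same audio, but the variant was written two samples later on the disc.
reference = [0, 0, 5, 6, 7, 8, 9, 0]
variant = [0, 0, 0, 0, 5, 6, 7, 8]
entry = store_variant(reference, variant)
print(entry)
print(rebuild(reference, entry) == variant)
```

Here only two head samples and a small metadata record need storing for the second track, which is the kind of saving that per-track compression can never find.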

The idea is to access the tracks either through decoding the archive back to bin+cue, or realtime via a FUSE implementation (one neat thing about that is it's trivial to transparently output the original bin and wav file simultaneously). I came to HA initially for info to add some audio compression options so that the (possibly very large) sample file doesn't need to be decompressed to be used. Realtime decoding is solved as long as the audio format has good seek performance; the main issue to solve now is cleanly dealing with very large sample files, i.e. larger than the audio format can handle. I'm hitting TAK's 2^30 sample limit, and even FLAC's 36-bit limit may not be enough for the top-end expected input (PSX has roughly 800GiB of audio and it's unlikely it can be deduplicated to <256GiB). WavPack can handle up to 40 bits, which is great, but less efficiency does add up when the files are large. The only answer is probably to bite the bullet and split the file into ~4GiB chunks so they will all work. Hopefully TAK's and WavPack's seek performance is good.
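The 256 GiB and ~4 GiB figures drop out of the sample-count limits once you assume CDDA, where one sample (counted across both channels) takes 4 bytes. A sketch of that arithmetic, using the limits as quoted in the post:

```python
BYTES_PER_SAMPLE = 2 * 2  # CDDA: 2 channels x 16 bits


def max_pcm_gib(sample_count_bits: int) -> float:
    """Largest CDDA stream, in GiB, for a sample counter of the given width."""
    return (2 ** sample_count_bits) * BYTES_PER_SAMPLE / 2 ** 30


for name, bits in [("TAK", 30), ("FLAC", 36), ("WavPack", 40)]:
    print(f"{name}: {max_pcm_gib(bits):g} GiB of CDDA")
```

That gives 4 GiB, 256 GiB and 4096 GiB respectively, which is why ~4 GiB chunks are the common denominator that fits all three formats.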

Re: Improving compression of multiple similar songs

Reply #10 – 2022-10-09 14:32:51

Part of the problem is that slightly different signals might give quite different encoded files, so you would have to kinda know what corresponds to what.
Or deduplicate the uncompressed PCM, which is much simpler.

How did you get the 112 to 64? Is that PCM -> deduplicated PCM?
If so, how does it compare to using FLAC/WavPack/TAK on each CD the normal way?

Hacking the formats isn't straightforward. FLAC compresses as dual-mono when channel count is not 2. WavPack can only do mid+side pairwise. This is the likely reason why TAK does so well at http://www.audiograaf.nl/losslesstest/revision%205/Average%20of%20all%205.1%20surround%20sources.pdf - it does find a correlation matrix. Also MPEG4-ALS does quite well (forget about the rightmost entry, it is all too slow).
But TAK is closed-source.

Now if you had the TAK algorithm and wanted to FUSE it, you could do the following when you have "at most three near-identical CDs":
- a "file" points to a TAK file and assigns to it two of the channels in that file. Furthermore, it has an offset, and a .cue and maybe a metadata chunk. With six channels, three CDs take up channel FL&FR, BL&BR, C&LFE. TAK decorrelates them as well as it can.
- upon decoding, you extract the channels in question. Which is quite a bit more CPU-intensive, but still less than decoding a stereo Monkey's Extra High file.
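The channel bookkeeping in the two steps above can be sketched without any codec at all. This is only the packing and extraction of channel pairs on made-up sample tuples; TAK itself, the offsets and the .cue handling are not modelled, and the function names are hypothetical.

```python
def pack_three_cds(cds):
    """Merge three equally long stereo streams into six-channel frames.

    Each CD is a list of (left, right) tuples; CD 0 lands on channels 0-1,
    CD 1 on channels 2-3 and CD 2 on channels 4-5.
    """
    if len(cds) != 3 or len({len(cd) for cd in cds}) != 1:
        raise ValueError("need exactly three streams of equal length")
    return [a + b + c for a, b, c in zip(*cds)]


def extract_cd(frames, cd_index):
    """Pull one CD's stereo pair back out of the six-channel frames."""
    first = 2 * cd_index
    return [frame[first:first + 2] for frame in frames]


cds = [
    [(1, 2), (3, 4)],      # pressing A
    [(5, 6), (7, 8)],      # pressing B
    [(9, 10), (11, 12)],   # pressing C
]
frames = pack_three_cds(cds)
print(frames)
print(extract_cd(frames, 1))
```

Playing back one CD means decoding all six channels and throwing four away, which is where the extra CPU cost mentioned above comes from.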

Thomas Becker has indicated something about 7.1 support: https://hydrogenaud.io/index.php/topic,122334.msg1009931.html#msg1009931

Re: Improving compression of multiple similar songs

Reply #11 – 2022-10-09 14:45:27

Speaking of game CDs utilizing CDDA, PC-Engine CDs have a track that warns that the CD is not supposed to be played on traditional CD players. Many CDs used the same female voice, though some games used customized tracks with different messages. So for games sharing the (audibly) identical female voice, the track can be recycled, if bit-perfection of this track is not a requirement.

Many PSX games however used 37.8kHz XA-ADPCM so converting them to flac and such will make the size bigger.
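A rough sketch of why re-encoding would grow the files, assuming the common 4-bit stereo XA-ADPCM mode and ignoring its per-block headers (both are simplifications, so treat the numbers as ballpark only):

```python
def bitrate_kbps(sample_rate_hz: int, channels: int, bits_per_sample: int) -> float:
    """Raw bitrate in kbit/s for a fixed-width sample stream."""
    return sample_rate_hz * channels * bits_per_sample / 1000.0


adpcm_kbps = bitrate_kbps(37_800, 2, 4)   # nominal 4-bit XA-ADPCM payload
pcm_kbps = bitrate_kbps(37_800, 2, 16)    # the same audio decoded to 16-bit PCM

print(f"ADPCM: {adpcm_kbps:.1f} kbit/s, PCM: {pcm_kbps:.1f} kbit/s")
print(f"break-even lossless ratio: {adpcm_kbps / pcm_kbps:.0%}")
```

A lossless codec would have to squeeze the decoded PCM to about a quarter of its size just to break even, well below the 55 to 68 percent figures reported elsewhere in this thread.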

Re: Improving compression of multiple similar songs

Reply #12 – 2022-10-09 15:06:16

Quote from: Porcus on 2022-10-09 14:32:51

But TAK is closed-source.

Such a shame, too. That doesn't achieve anything other than kill any possibility of proliferation of usage and support for the format.

Re: Improving compression of multiple similar songs

Reply #13 – 2022-10-09 16:08:33

Quote from: Porcus on 2022-10-09 14:32:51

Part of the problem is that slightly different signals might give quite different encoded files, so you would have to kinda know what corresponds to what.
Or deduplicate the uncompressed PCM, which is much simpler.

How did you get the 112 to 64? Is that PCM -> deduplicated PCM?
If so, how does it compare to using FLAC/WavPack/TAK on each CD the normal way?

Hacking the formats isn't straightforward. FLAC compresses as dual-mono when channel count is not 2. WavPack can only do mid+side pairwise. This is the likely reason why TAK does so well at http://www.audiograaf.nl/losslesstest/revision%205/Average%20of%20all%205.1%20surround%20sources.pdf - it does find a correlation matrix. Also MPEG4-ALS does quite well (forget about the rightmost entry, it is all too slow).
But TAK is closed-source.

Now if you had the TAK algorithm and wanted to FUSE it, you could do the following when you have "at most three near-identical CDs":
- a "file" points to a TAK file and assigns to it two of the channels in that file. Furthermore, it has an offset, and a .cue and maybe a metadata chunk. With six channels, three CDs take up channel FL&FR, BL&BR, C&LFE. TAK decorrelates them as well as it can.
- upon decoding, you extract the channels in question. Which is quite a bit more CPU-intensive, but still less than decoding a stereo Monkey's Extra High file.

Thomas Becker has indicated something about 7.1 support: https://hydrogenaud.io/index.php/topic,122334.msg1009931.html#msg1009931

The audio is separate to allow for external audio compression, but baking in realtime decode support was an afterthought. De-duplication is done on PCM. The 112 to 64 is all PCM; the 64 is a concatenation of unique tracks and the start and/or end of tracks that mostly match other tracks (mostly the missing data at either end is small, sometimes it's not); silent sectors at the start of a track are removed. The 64 GiB PCM compresses with flac 1.3.2 to ~55% using -6. I haven't tested the tracks individually but it seems fine: there's something like 10000 audio segments, so maybe 8000 audio frames with one or more transitions in them that are probably encoded poorly. Basically a rounding error. It's all CDDA; it shouldn't matter that some tracks are true stereo and some dual-mono, AFAIK, as flac (and probably all common formats?) adapts to the input per frame.

By FUSE I mean libfuse, it allows you to mount whatever you like to an empty directory as a virtual filesystem (so by creating a driver for a custom format, the user/external-programs can access the contents without having to know anything about the custom format or that it even exists).

Quote from: bennetng on 2022-10-09 14:45:27

Speaking of game CDs utilizing CDDA, PC-Engine CDs have a track that warns that the CD is not supposed to be played on traditional CD players. Many CDs used the same female voice, though some games used customized tracks with different messages. So for games sharing the (audibly) identical female voice, the track can be recycled, if bit-perfection of this track is not a requirement.

Many PSX games however used 37.8kHz XA-ADPCM so converting them to flac and such will make the size bigger.

Bit-perfection is a requirement; identical tracks (audio or data, though identical data tracks rarely exist) are deduped per track. I'm only talking about handling the pure CDDA ATM. Didn't have plans to go too deeply down the custom-compression rabbit hole, but now that you mention it, if lossless ADPCM compressors exist (?) it would be worth at least looking into. PSX ADPCM data is stored in mode 2 form 2 data sectors, meaning they are in a different file and mixed in with FMVs and whatever else used form 2. There's an additional hurdle, not deeply investigated, in that XA sectors were designed so they can be interleaved for streaming, so a track may not be contiguous in that file.

Quote from: doccolinni on 2022-10-09 15:06:16

Quote from: Porcus on 2022-10-09 14:32:51
But TAK is closed-source.

Such a shame, too. That doesn't achieve anything other than kill any possibility of proliferation of usage and support for the format.

That is off-putting; wine can't be used everywhere, and that is why the example above only used flac. However, in this case only decode is being integrated for now; as long as the ffmpeg code is sound I'll probably end up using that.
