Topic: AI language models can exceed PNG and FLAC in lossless compression, says study

AI language models can exceed PNG and FLAC in lossless compression, says study

Quote
In an arXiv research paper titled "Language Modeling Is Compression," researchers detail their discovery that the DeepMind large language model (LLM) called Chinchilla 70B can perform lossless compression on image patches from the ImageNet image database to 43.4 percent of their original size, beating the PNG algorithm, which compressed the same data to 58.5 percent. For audio, Chinchilla compressed samples from the LibriSpeech audio data set to just 16.4 percent of their raw size, outdoing FLAC compression at 30.3 percent.
~ https://arstechnica.com/information-technology/2023/09/ai-language-models-can-exceed-png-and-flac-in-lossless-compression-says-study/
Quis custodiet ipsos custodes?

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #1
Scroll down to read the "Promoted Comments" :D

Also:
Quote
A chart of compression test results provided by DeepMind researchers in their paper. The chart illustrates the efficiency of various data compression techniques on different data sets, all initially 1GB in size.
So, a corpus of about 2 CDs.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #2
I must say the results of this study are very.... odd.

Apparently FLAC compresses the ImageNet data better than PNG, gzip and LZMA when offered in chunks of 2048 bytes? That seems very unlikely? I'd say the results of LZMA, gzip, PNG and FLAC are suspiciously similar considering that they work with completely different methods. Sure, PNG is more or less a general-purpose compressor with a specific context-aware filter on top, but how would you even feed non-image data to a PNG compressor? That context-aware filter depends on the height and width of the image, which are not applicable to non-image data.
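
For what it's worth, here is one plausible way arbitrary bytes could be fed to a PNG compressor (purely a guess at their method; the 32x64 shape and the use of PIL are my assumptions):

Code:
import io
from PIL import Image

# Hypothetical: reshape a 2048-byte chunk into a 32x64 8-bit grayscale
# image and measure how many bytes PNG needs to store it losslessly.
def png_size(chunk: bytes, w: int = 64, h: int = 32) -> int:
    img = Image.frombytes("L", (w, h), chunk)   # 1 byte per pixel
    buf = io.BytesIO()
    img.save(buf, format="PNG", optimize=True)
    return buf.getbuffer().nbytes

Done this way, PNG's filters see a fake "image" whose height and width are arbitrary, which is exactly the mismatch described above.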

Also, I don't really understand the 'chunking' with the audio data. It seems to me they have chopped up the data into chunks of 2048 bytes and concatenated them. If the data is 16 bits per sample, that means only 1024 samples per chunk, which really isn't representative of any kind of audio. Similarly for pictures:

Quote
We extract contiguous patches of size 32 × 64 from all images, flatten them, convert them to grayscale (so that each byte represents exactly one pixel) to obtain samples of 2048 bytes. We then concatenate 488 821 of these patches, following the original dataset order, to create a dataset of 1 GB.
This doesn't seem in any way representative of any real-world use case? Image and audio data don't fit 1 byte per sample most of the time, so the data seems 'crafted' to me.
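
A minimal sketch of the preprocessing that quote describes (my reading of it, using PIL and numpy; not the authors' code):

Code:
import numpy as np
from PIL import Image

# 32x64 patches, grayscale, 1 byte per pixel -> 2048-byte samples.
def patches(path):
    img = np.asarray(Image.open(path).convert("L"))
    h, w = img.shape
    for y in range(0, h - 31, 32):
        for x in range(0, w - 63, 64):
            yield img[y:y+32, x:x+64].tobytes()   # 32*64 = 2048 bytes

Each "sample" is then just 32 rows of 64 pixels, so most of the 2D structure that PNG's filters rely on is gone before PNG ever sees the data.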

Finally, I don't understand where the 107% figure comes from for FLAC when compressing noise. It does much better than that. When I compress noise as an 8bps single channel stream with a blocksize of 2048 (as suggested in the paper) I get only 0.5% overhead, not 7%.
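
For reference, a rough way to repeat that noise test (a sketch, not the exact commands used above; assumes the flac CLI is on PATH):

Code:
import os, subprocess

# 1 MiB of random bytes, treated as 8-bit mono raw audio.
with open("noise.raw", "wb") as f:
    f.write(os.urandom(2**20))

# Blocksize 2048, as suggested in the paper.
subprocess.run(["flac", "-f", "--force-raw-format", "--endian=little",
                "--sign=signed", "--channels=1", "--bps=8",
                "--sample-rate=16000", "--blocksize=2048",
                "-o", "noise.flac", "noise.raw"], check=True)

overhead = os.path.getsize("noise.flac") / 2**20 - 1
print(f"FLAC overhead on noise: {overhead:.1%}")   # ~0.5%, not 7%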
Music: sounds arranged such that they construct feelings.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #3
The point of this study seems to be that when you have a language modelling tool - "foundation models" in their lingo - it represents the "language" it models in a compressed way. And apparently not only the language it was trained on - which was written English, not recordings of spoken English. Supporting the notion that these models learn how to handle language generally, not just one particular language.

They did not at all try to compress a varied set of music; their test set was ten hours of speech (mono, sampled at 16 kHz), so the "2 CDs" objection is moot.

I downloaded it - it is delivered as FLAC in a .tar.gz, saving a few percentage points due to ... padding! @ktf, if the noise was split into similar-length files, then I think it would make up five-ish percentage points, explaining the 107?
The "clean" set (whatever that means), 2620 files, was compressed to 142 kbit/s, and the "other" set, 2939 files, to 136. They used flac 1.2.1, which doesn't matter much to the results: recompressing with 1.4.3 at -5 and -8p loses or gains 1 kbit/s. OptimFROG at --preset 8 gets them down to 130 and 123, respectively. So yeah, that is about as low as it goes among audio compressors.

Of course their compressor itself - with the parameters - is yuuuuuge:
* The smallest of the Chinchilla models is twice this data set - 2 GB - and saves 5.5 points over FLAC. It breaks even with FLAC only once you have 36 GB of speech.
* The biggest is 140 GB. So you need 1.5 TB of speech to compress before it overtakes FLAC.
... and imagine the computing power required.
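
That break-even arithmetic, spelled out (a sketch; the 9.3-point saving for the big model is the paper's 30.3 percent for FLAC minus the 21.0 percent discussed later in this thread - quoted figures, not independently verified):

Code:
# The model weights must be shipped or stored like a giant decoder
# dictionary, so the LLM only "wins" once the bytes it saves over FLAC
# exceed the size of the model itself.
def break_even_gb(model_gb, points_saved):
    return model_gb / (points_saved / 100.0)

print(break_even_gb(2, 5.5))     # ~36 GB of speech for the smallest model
print(break_even_gb(140, 9.3))   # ~1505 GB, i.e. ~1.5 TB, for the biggest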

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #4
So I downloaded the test-other.tar.gz
original
351852295 bytes

All of the following were run through foobar2000's "optimize file layout + minimize file size":

original
327764251 bytes

flac 1.4.3 -8
325770912 bytes

flac 1.4.3 -8b1024
317449945 bytes

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #5
Quote
The point of this study seems to be that when you have a language modelling tool - "foundation models" in their lingo - it represents the "language" it models in a compressed way. And apparently not only the language it was trained on - which was written English, not recordings of spoken English. Supporting the notion that these models learn how to handle language generally, not just one particular language.
Sure, I get that. Very interesting.

However, the study then goes on to compare the "performance" of their tool (just the file size, no other parameters) with various other compressors. The problem is, the input data has been mangled to suit this tool to the point where the specialized compressors can no longer play to their strengths. If FLAC compresses an image better than PNG, the input "image" can no longer be considered representative. And when the data isn't representative, why bother comparing at all?
Music: sounds arranged such that they construct feelings.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #6
Quote
Apparently FLAC compresses the ImageNet data better than PNG, gzip and LZMA when offered in chunks of 2048 bytes? That seems very unlikely?

PNG and gzip use DEFLATE, and DEFLATE puts a rather big Huffman table at the front of each block whenever a dynamic-Huffman block is used as the encoding method (the typical case). FLAC doesn't carry such a codebook, so at a 2 KB chunk size it is understandable that not having a Huffman table can be favorable.
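
A quick way to see that per-block cost (a sketch; the input file name is a placeholder):

Code:
import zlib

data = open("chunked_dataset.bin", "rb").read()   # placeholder input
chunks = [data[i:i+2048] for i in range(0, len(data), 2048)]

whole   = len(zlib.compress(data, 9))
per2048 = sum(len(zlib.compress(c, 9)) for c in chunks)
print(f"whole: {whole/len(data):.1%}, 2048-byte chunks: {per2048/len(data):.1%}")
# Each chunk pays for its own Huffman table, so the chunked ratio comes
# out noticeably worse - the effect described above.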

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #7
The compression rates for the LibriSpeech corpus in Table 1 are the most surprising.
According to the paper, LZMA2 is better than FLAC at compressing the LibriSpeech corpus at normal settings (chunk size ∞) on 16 kHz mono 16-bit audio: LZMA2 29.9%, FLAC 30.9%. Can that be true?
I decoded test-other\LibriSpeech\test-other\367\293981\367-293981-0003.flac to WAV and compressed it with 7z (LZMA2); the file sizes were FLAC 200,379 bytes, WAV 306,444 bytes, 7z 210,684 bytes.
For test-other\LibriSpeech\test-other\8280\266249\8280-266249-0035.flac: FLAC 324,562 bytes, WAV 553,164 bytes, 7z 364,055 bytes. That puts LZMA2 at roughly 66-69% of the raw size, nowhere near 29.9%.
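
For anyone wanting to repeat that check in one go (a sketch; Python's lzma container differs slightly from 7-Zip's LZMA2, but the sizes land in the same range):

Code:
import lzma, wave

# One of the files above, decoded from FLAC to WAV beforehand.
with wave.open("367-293981-0003.wav", "rb") as w:
    pcm = w.readframes(w.getnframes())

print("raw bytes :", len(pcm))
print("lzma bytes:", len(lzma.compress(pcm, preset=9)))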

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #8
Quote
According to the paper, LZMA2 is better than FLAC at compressing the LibriSpeech corpus at normal settings (chunk size ∞) on 16 kHz mono 16-bit audio: LZMA2 29.9%, FLAC 30.9%. Can that be true?
Weird. The downloaded FLAC files are around 139 kbit/s, but uncompressed they would be 256 kbit/s (16 bits × 16 kHz), since they are mono.

Quote
flac 1.4.3 -8b1024
317449945 bytes
-b1024 should be a good fit, if I understood correctly (and from the above, obviously I don't).
Apparently they took 2048-byte chunks, and with 16-bit mono that corresponds to 1024 samples (64 ms at 16 kHz).

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #9
The downloads, converted to various formats. Does not match that table at all:

1 238 070 220 bytes: converted to .wav
   975 648 494 bytes: .wav folder compressed to   .zip  with 7-zip at "Ultra" setting.  That's LZMA, each file independently, right?
   783 452 740 bytes: .wav folder compressed to   .7z   with 7-zip at "Ultra" setting.  That also compresses inter-file patterns.
   720 180 419 bytes: (149 kbit/s) WavPack -hhx6
   694 949 312 bytes: downloaded flac (extracted from .tar.gz, .flac files only) with padding
   650 590 982 bytes: (134 kbit/s) flac-1.4.3-win\Win64\flac.exe -f -8pe -r8 -b1024  -A "subdivide_tukey(5)" --no-padding
 

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #10
Quote
The downloads, converted to various formats. Does not match that table at all:
Thanks Porcus for testing all the samples.

ratio       bytes            comments
9304.80%    115,200,000,000  1000 hours of 16-bit mono 16 kHz, uncompressed
80.86%      1,001,105,408    2048 bytes * 488,821 chunks; quote from the paper: "We chunk the samples into batches of 2048 bytes and gather 488 821 such chunks into dataset of size 1 GB."
100.00%     1,238,070,220    converted to .wav
78.80%      975,648,494      .wav folder compressed to .zip with 7-zip at "Ultra" setting (LZMA, each file independently)
63.28%      783,452,740      .wav folder compressed to .7z with 7-zip at "Ultra" setting (also compresses inter-file patterns)
58.17%      720,180,419      (149 kbit/s) WavPack -hhx6
56.13%      694,949,312      downloaded flac (extracted from .tar.gz, .flac files only) with padding
52.55%      650,590,982      (134 kbit/s) flac-1.4.3-win\Win64\flac.exe -f -8pe -r8 -b1024 -A "subdivide_tukey(5)" --no-padding
Hmm... it doesn't match the "LZMA2 is 29.9%, FLAC 30.9%" claim. Not even close. No classical program seems to be able to compress 16-bit mono 16 kHz speech that far.
My hypothesis is that the authors wrongfully converted the LibriSpeech samples to stereo.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #11
Quote
compressed to .zip with 7-zip at "Ultra" setting. That's LZMA, each file independently, right?
Assuming you specified LZMA, yes. (Changing the compression level to "Ultra" doesn't change the compression method, and the default method for .zip is Deflate.)

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #12
Quote from: Kamedo2
My hypothesis is that the authors wrongfully converted the LibriSpeech samples to stereo
Or to 32 bit, since they wrote that they used Python to compress to FLAC (converting to float like Matlab/Octave?). Not sure what the default FLAC preset is in Python (Edit: apparently -5), but surely not -8 or -8pe or whatever. Can someone try compression with FLAC -1? If that results in roughly 60-61% compression ratio (i.e., twice their reported value), then it's starting to get reproducible.

Quote from: Porcus
-b1024 should be a good fit, if I understood correctly (and from the above, obviously I don't).
Oh, I think you understand very well how this should be done, but I'm not sure about the authors of that paper. Don't expect them to use any elaborate command line such as yours. Probably just "-1", like I wrote above.

Quote from: ktf
It seems to me they have chopped up the data into chunks of 2048 bytes and concatenated them
Ah, yes, if they did that, they inserted clicks (waveform discontinuities) between many such chunks, reducing the compressibility for FLAC, possibly to 60% or so.
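
A toy illustration of those discontinuities (my assumption about their pipeline, as described in the quote above):

Code:
import numpy as np

# Two unrelated "recordings", interleaved in 1024-sample chunks, the way
# 2048-byte batching would interleave 16-bit mono audio.
a = (20000 * np.sin(np.linspace(0, 200, 4096))).astype(np.int16)
b = (15000 * np.sin(np.linspace(1.5, 90, 4096))).astype(np.int16)
mixed = np.concatenate([a[:1024], b[:1024], a[1024:2048], b[1024:2048]])

d = np.abs(np.diff(mixed.astype(np.int32)))
print("jumps at chunk joins:", d[1023::1024])   # large spikes
print("typical step        :", int(d.mean()))   # much smaller

Every such spike is a click that a linear predictor cannot anticipate, so bits get wasted at each chunk boundary.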

Chris
If I don't reply to your reply, it means I agree with you.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #13
Quote from: Kamedo2
My hypothesis is that the authors wrongfully converted the LibriSpeech samples to stereo
Or to 32 bit, since they wrote that they used Python to compress to FLAC (converting to float like Matlab/Octave?). Not sure what the default FLAC preset is in Python, but surely not -8 or -8pe or whatever. Can someone try compression with FLAC -1? If that results in roughly 60-61% compression ratio (i.e., twice their reported value), then it's starting to get reproducible.
741 296 664 bytes. (Including 45 561 564 bytes of padding. Using flac.exe 1.2.1, as that was the version in the download.) 144 kbit/s. If they used 32-bit (integer), the bit rate would be 512. Maybe ... ?

Quote from: Porcus
-b1024 should be a good fit, if I understood correctly (and from the above, obviously I don't).
Oh, I think you understand very well how this should be done, but I'm not sure about the authors of that paper.
Sure. Rather, I was thinking: is there anything that could bring it down to the thirties? If nothing gets there even with high effort, then they can't simply have been that lucky.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #14
Nope, no way a "real" 30% is possible while remaining lossless, I would claim. But, when measured relative to 512 kbps as the "lossless reference" and taking your results (albeit not on a chunk-concatenated subset, if I understand correctly what you compressed), 134/512 ≈ 26.2% is possible with FLAC. So 26 or 27% with an extremely fast state-of-the-art codec using a dozen parameters per frame vs. 24.9% with an extremely slow whatever-network with 1 BILLION parameters, which aren't even accounted for in that percentage.

Hmm...

Chris
If I don't reply to your reply, it means I agree with you.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #15
Quote
For audio, Chinchilla compressed samples from the LibriSpeech audio data set to just 16.4 percent of their raw size,...
Can anyone point me to the place in the evaluation section where they actually achieve the 16.4%? The lowest percentage I can find is 21.0.

Chris
If I don't reply to your reply, it means I agree with you.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #16
something something entropy something... ^^
And so, with digital, computer was put into place, and all the IT that came with it.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #17
Everyone keeps talking about 16bps, but as I read it, the paper suggests that the samples have been converted to 8bps. They did that for PNG as well: the paper specifically mentions converting to greyscale to have 8bps. In other words, one byte per symbol.
Music: sounds arranged such that they construct feelings.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #18
Here come the times when, for encoding, you shouldn't consider only the size of the encoded files, but the executable and dictionary/model size too. Computing power and memory bandwidth requirements have risen significantly as well. Efficiently, perhaps?

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #19
Yes, self-extracting archives like the ones WavPack and OptimFROG offer are probably not viable for this technology  O:)
Music: sounds arranged such that they construct feelings.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #20
Quote
Here come the times when, for encoding, you shouldn't consider only the size of the encoded files, but the executable and dictionary/model size too. Computing power and memory bandwidth requirements have risen significantly as well. Efficiently, perhaps?
Their table does that. I gave the sizes in Reply #3, but who knows if I (or they) got it right.

Quote
Yes, self-extracting archives like the ones WavPack and OptimFROG offer are probably not viable for this technology  O:)
The sfx feature was removed in WavPack 5, and I don't know if anyone misses it.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #21
Quote
Everyone keeps talking about 16bps, but as I read it, the paper suggests that the samples have been converted to 8bps.
Where?  I've been following this thread and think you mean 8kbps, or 16-bit (sample resolution), but I tried reading back and the thread appears to have been mangled – post timestamps are wrong and posts are missing.  I know for sure this thread did not originate only 3 days ago, yet the timestamp of the first post is "2023-09-29 01:08:28".
It's your privilege to disagree, but that doesn't make you right and me wrong.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #22
Quote
Quote
Everyone keeps talking about 16bps, but as I read it, the paper suggests that the samples have been converted to 8bps.
Where?  I've been following this thread

You probably need to check the paper itself, it is here: https://arxiv.org/abs/2309.10668

I am not sure if the paper suggests 8 bits per sample. It does say "(𝐶 = 2048 bytes, i.e., 2048 tokens of 8 bits", but that might just explain what a "byte" is.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #23
Quote
Quote
Everyone keeps talking about 16bps, but as I read it, the paper suggests that the samples have been converted to 8bps.
Where?

Under 3.1
Quote
[...] 𝐶 = 2048 bytes, i.e., 2048 tokens of 8 bits that represent the ASCII characters [...] We extract contiguous patches of size 32 × 64 from all images, flatten them, convert them to grayscale (so that each byte represents exactly one pixel) to obtain samples of 2048 bytes.
It doesn't say it out loud, but it seems reasonable to assume that a model trained on ASCII works with data points that are a single byte. This is reinforced by the statement that the PNG data is 1 byte per pixel. Why would they restrict PNG to 1 byte per pixel, but do audio with 2 bytes per sample?

All in all, I think the results are highly specific and cannot be generalized to audio and image data in general, unless the model can learn that data points might be larger than one byte. If it was easy to do that, why would they restrict the PNG data to 1 byte per pixel?
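
To make the one-byte-per-token point concrete (my illustration, not from the paper): 16-bit samples get split across two tokens.

Code:
import struct

samples = [100, -3, 1200]             # three 16-bit PCM sample values
raw = struct.pack("<3h", *samples)    # little-endian int16 encoding
print(list(raw))                      # [100, 0, 253, 255, 176, 4]
# Each sample turns into two unrelated-looking byte tokens; the model
# has to learn on its own that pairs of tokens belong together.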
Music: sounds arranged such that they construct feelings.

Re: AI language models can exceed PNG and FLAC in lossless compression, says study

Reply #24
So what was the "everyone" then?  I took that to mean within this thread.
It's your privilege to disagree, but that doesn't make you right and me wrong.