Topic: Limits of Lossless Compression...?

Limits of Lossless Compression...?

Reply #25
Guys,

Before even thinking about discussing the theoretical limits of lossless compression, you should absolutely read up on Shannon's work and information theory. I'm a bit rusty, but I think the following holds (English is not my first language, so some expressions may seem, eh, weird):

1. You cannot compress (losslessly) beyond the quantity of information contained in the material you are trying to compress.
2. Entropy determines that information quantity (a rough sketch follows this list).
3. Block size doesn't matter: whether you compress a byte at a time or a kilobyte at a time, the result will be approximately the same (using optimal compression methods).
3a. Encoding format doesn't matter: if you take some data and write it down using binary numbers, ASCII codes, UTF-16 codes or even smoke signals, the compressed result will be approximately the same (using optimal compression methods).
4. Theoretically optimal compression methods get very close to maximum compression (as the amount of data grows towards infinity, they converge to maximum compression).
5. Huffman coding is an optimal compression method
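
As a rough illustration of points 1 and 2 (my own sketch, not anything authoritative): the empirical order-0 Shannon entropy of a byte stream gives a lower bound on what any lossless coder that treats bytes independently can achieve on it. The `data` value below is just a hypothetical placeholder.

Code:
import math
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    """Empirical order-0 entropy: H = -sum p(x) * log2 p(x) over byte values."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

data = b"abracadabra" * 100   # hypothetical input
h = order0_entropy_bits_per_byte(data)
print(f"{h:.3f} bits/byte, so at least {h * len(data) / 8:.0f} bytes "
      f"for a byte-by-byte coder (original: {len(data)} bytes)")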

Someone mentioned using Pi for compression. Unfortunately, Pi itself is a maximum-entropy source. If you try to transmit your compressed information to someone who has never heard of Pi, you have to transmit Pi itself, which defeats the purpose of compressing with Pi.
The same logic applies to compression with "sound patterns": if the sound is complicated enough, you will have to encode it using all of your patterns, and you have to transmit the patterns themselves. The net effect will always be a bit (or more) over the theoretically best compression.

I hope I didn't mix things up, but this should get you going.

Limits of Lossless Compression...?

Reply #26
Quote
5. Huffman coding is an optimal compression method

How can it be optimal when there are better things (arithmetic coding)?

Quote
Unfortunately Pi itself is a maximum entropy source. If you try transmit your compressed information to someone who's never heard of Pi, you have to transmit the Pi itself

But the formula to generate Pi can be transferred without taking too many bits.

Limits of Lossless Compression...?

Reply #27
Quote
Quote
5. Huffman coding is an optimal compression method

How can it be optimal when there are better things (arithmetic coding)?
Huffman coding, range encoding, and arithmetic coding are all considered optimal, I believe (Huffman coding is essentially a specialised form of arithmetic coding, and range encoding is almost identical to arithmetic coding). However, if you already have a fairly good idea of the stream's statistics, simpler static codes such as Rice coding can be the better choice.
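
Since Rice coding comes up a lot for lossless audio, here is a rough sketch of the idea (my own illustration, using one common convention): a Rice code with parameter k writes the quotient n >> k in unary, then the low k bits of n in binary. It is near-optimal when the values are roughly geometrically distributed, which is typically the case for prediction residuals.

Code:
def rice_encode(n: int, k: int) -> str:
    """Rice code for a non-negative integer n with parameter k:
    q = n >> k ones, a terminating zero, then the k low bits of n."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

# A few residuals encoded with k = 2:
for n in [0, 3, 7, 18]:
    print(n, "->", rice_encode(n, 2))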

Limits of Lossless Compression...?

Reply #28
Quote
How can it be optimal when there are better things (arithmetic coding)?
[a href="index.php?act=findpost&pid=344014"][{POST_SNAPBACK}][/a]
Huffman coding is optimal among codes that assign each symbol an integral number of bits. Arithmetic coding can spend fractional bits per symbol, and (I think) is optimal in that setting.
From the Huffman wiki entry:
Quote
Assertions of the optimality of Huffman coding should be phrased carefully, because its optimality can sometimes accidentally be over-stated. For example, arithmetic coding ordinarily has better compression capability, because it does not require the use of an integer number of bits for encoding each source symbol. LZW coding can also often be more efficient, particularly when the input symbols are not independently-distributed, because it does not depend on encoding each input symbol one at a time (instead, it batches up a variable number of input symbols into each encoded syntax element). The efficiency of Huffman coding also depends heavily on having a good estimate of the true probability of the value of each input symbol.
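
To put a number on the integer-bit limitation (my own worked example, not from the article): for a binary source with $p(0)=0.9$ and $p(1)=0.1$, any Huffman code for single symbols must spend a whole bit per symbol, while the entropy is
$$H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469 \text{ bits/symbol},$$
so an arithmetic coder can get close to 0.47 bits per symbol where single-symbol Huffman is stuck at 1 (unless you group symbols into blocks).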


@schuberth:
You were probably referring to my post regarding pi. What you pointed out was exactly what I was trying to say: someone who has never heard of pi won't be able to compress it. It's got quite a large amount of entropy.
However, it would be trivial to incorporate a pi generator into a compression codec, and then the only thing that would need to be transmitted is the range of digits to be decoded.

Basically, although pi is (as far as we know) maximum-entropy, it is actually immensely compressible, given the proper algorithm.
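
A toy sketch of the decoder side of that idea (my own illustration using the mpmath library; a real codec would use a digit-extraction or spigot algorithm instead of recomputing from scratch): the "compressed file" is just an (offset, length) pair.

Code:
from mpmath import mp

def pi_digits(offset: int, length: int) -> str:
    """Return `length` decimal digits of pi, starting `offset` digits after the point."""
    mp.dps = offset + length + 10      # a few guard digits
    digits = str(mp.pi)[2:]            # drop the leading "3."
    return digits[offset:offset + length]

print(pi_digits(0, 10))   # -> 1415926535
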
"We demand rigidly defined areas of doubt and uncertainty!" - Vroomfondel, H2G2

Limits of Lossless Compression...?

Reply #29
@Omnion, @sheh

Theoretically, and strictly speaking about compressing information, you have to transmit the dictionary as well, and that means Pi itself. Now, it is true that you could transmit some definition of the number Pi (e.g. take a circle and its diameter, ... you get the picture), but that is based on interpreting the information, not just compressing it. Think about the poor aliens. So Shannon is theoretically right.

On the other hand, there are a couple of universal constants, so you may be right. If we assume that these information sources (e.g. Pi, the gravitational constant, the base of the natural logarithm, Planck's constant; I think the rest are derived from these) are omnipresent, you shouldn't have to transmit them, and you could construct an algorithm which makes use of them.

Limits of Lossless Compression...?

Reply #30
Quote
Basically, although pi is (as far as we know) maximum-entropy, it is actually immensely compressible, given the proper algorithm.

IMO, pi is not compressible at all because of its "maximum entropy" feature.

You basically have two choices when you want to use pi in a data decompressor: either store pi at "full" length somewhere or compute it when required. The latter makes it look as if pi were compressed to the size of the pi-calculation routine, but in fact you have to invest time and electricity into the computation of pi when you need a part of it. While doing this, you have to "throw away" huge portions of pi because you don't need them for that particular purpose. This "data loss" occurs at the same time as electricity is converted into heat - therefore entropy increases while pi is calculated.

OK, I know that the word entropy is used in thermodynamics and information theory for "slightly" different things but there definitely is a relationship. (see also Wikipedia: entropy - http://en.wikipedia.org/wiki/Entropy)

edit: typo...

Limits of Lossless Compression...?

Reply #31
Quote
3a. Encoding format doesn't matter: if you take some data and write it down using binary numbers, ASCII codes, UTF-16 codes or even smoke signals, the compressed result will be approximately the same (using optimal compression methods).


This is really not true in practice. Basically, a compressor builds a model of the data; if it assumes the wrong sample size, its probability estimates get much worse. That's why simple PPM is so bad on binary data, for example.
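
A rough order-0 illustration of why the assumed symbol size matters to a simple model (my own sketch, not from the post): the same UTF-16 text looks much more expensive to a model that works on single bytes than to one that works on 16-bit units.

Code:
import math
from collections import Counter

def entropy_bits_per_symbol(symbols) -> float:
    """Empirical order-0 entropy of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog " * 50
utf16 = text.encode("utf-16-le")

per_byte = entropy_bits_per_symbol(utf16)                       # 1-byte symbols
per_unit = entropy_bits_per_symbol(
    [utf16[i:i + 2] for i in range(0, len(utf16), 2)])          # 2-byte symbols

print(f"1-byte symbols: {2 * per_byte:.2f} bits per character")
print(f"2-byte symbols: {per_unit:.2f} bits per character")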

Limits of Lossless Compression...?

Reply #32
Quote
Quote
3a. Encoding format doesn't matter: if you take some data and write it down using binary numbers, ASCII codes, UTF-16 codes or even smoke signals, the compressed result will be approximately the same (using optimal compression methods).


This is really not true in practice. Basically, a compressor builds a model of the data; if it assumes the wrong sample size, its probability estimates get much worse. That's why simple PPM is so bad on binary data, for example.
[a href="index.php?act=findpost&pid=344088"][{POST_SNAPBACK}][/a]


Note the optimal compression methods. And remember that I was speaking from the information-theory angle: if you compress a large enough amount of data, the representation of the data doesn't matter. The catch is in the "large enough"; that is a relative statement that depends on the compression algorithm used. So we are both right (or we are both wrong)

BTW, what is ppm?

Limits of Lossless Compression...?

Reply #33
AFAIK Pi is compressible because it is reducible to a simple formula



Edit: I guess I didn't read this carefully enough:

Quote
You basically have two choices when you want to use pi in a data decompressor: either store pi at "full" length somewhere or compute it when required. The latter makes it look as if pi were compressed to the size of the pi-calculation routine, but in fact you have to invest time and electricity into the computation of pi when you need a part of it. While doing this, you have to "throw away" huge portions of pi because you don't need them for that particular purpose. This "data loss" occurs at the same time as electricity is converted into heat - therefore entropy increases while pi is calculated.
[a href="index.php?act=findpost&pid=344087"][{POST_SNAPBACK}][/a]


This may be correct. Hell, I'm no information theory expert. We certainly do need one in this thread!

Limits of Lossless Compression...?

Reply #34
OK, I'm always speaking from a practical point of view. Compression tries to capture the nature of the data, but there is no such thing as an optimal model: compressing UTF-16 with an English-text model makes things much worse. PPM means prediction by partial matching and is a modelling technique.

Limits of Lossless Compression...?

Reply #35
Quote
AFAIK Pi is compressible because it is reducible to a simple formula
Except that calculating pi to 10 decimal places using the Leibniz formula takes around 10 billion operations, which would make the compression extremely inefficient.
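
For reference (my addition, not in the original post), the Leibniz series is
$$\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots = \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1},$$
and its error after $n$ terms is on the order of $1/n$, so 10 correct decimal places really does take on the order of $10^{10}$ terms.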

Limits of Lossless Compression...?

Reply #36
The Leibniz formula might be inefficient, but there are other formulae that are much more efficient. I recall one that produced something like 10 decimal places of pi per iteration.

Anyhow, the observed practical limit of lossless compression seems to be about 50-60% for most sources, regardless of encoding technique. I'm not sure why people are so obsessed with further compressing lossless, but I rather doubt we'd get below 50%, given my experiences with lossless compression of many types of file. Lossy audio, on the other hand...

Limits of Lossless Compression...?

Reply #37
Quote
Quote
Basically, although pi is (as far as we know) maximum-entropy, it is actually immensely compressible, given the proper algorithm.
[a href="index.php?act=findpost&pid=344061"][{POST_SNAPBACK}][/a]

IMO, pi is not compressible at all because of its "maximum entropy" feature.

You basically have two choices when you want to use pi in a data decompressor: either store pi at "full" length somewhere or compute it when required. The latter makes it look as if pi were compressed to the size of the pi-calculation routine, but in fact you have to invest time and electricity into the computation of pi when you need a part of it. While doing this, you have to "throw away" huge portions of pi because you don't need them for that particular purpose. This "data loss" occurs at the same time as electricity is converted into heat - therefore entropy increases while pi is calculated.
[a href="index.php?act=findpost&pid=344087"][{POST_SNAPBACK}][/a]

Ah ha. That kind of compressibility.  I guess in terms of total universal entropy, you're right. However, I was talking about file size compressibility: the amount of data needed to be transmitted can be made very small, thereby putting the burden of creating all that extra entropy on the processor.

@Canar:
The Chudnovsky formula adds ~14 digits of pi per term. However, the terms themselves are pretty complex:
$$\frac{(6k)!\,(13591409 + 545140134k)}{(k!)^3\,(3k)!\,(-640320)^{3k}}$$
summed for $k$ from 0 to $\infty$.

[ edit: Actually, I suppose that's not too complex, since you can save the factorials for use in the next iteration. You just need to throw around some REALLY big numbers. ]
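
To make the "save the factorials for use in the next iteration" remark concrete, here is a rough term-by-term sketch in Python (my own illustration; serious implementations use binary splitting instead):

Code:
from decimal import Decimal, getcontext

def chudnovsky_pi(digits: int) -> Decimal:
    """Approximate pi with the Chudnovsky series (~14 digits per term)."""
    getcontext().prec = digits + 10              # guard digits
    total = Decimal(0)
    fact6k, fact3k, factk = 1, 1, 1              # (6k)!, (3k)!, k!
    power = 1                                    # (-640320)^(3k)
    for k in range(digits // 14 + 2):
        num = Decimal(fact6k * (13591409 + 545140134 * k))
        den = Decimal(fact3k * factk**3 * power)
        total += num / den
        # carry the factorials and the power over to term k+1
        for i in range(6 * k + 1, 6 * k + 7):
            fact6k *= i
        for i in range(3 * k + 1, 3 * k + 4):
            fact3k *= i
        factk *= k + 1
        power *= (-640320) ** 3
    # 1/pi = 12 * total / 640320^(3/2)
    return (Decimal(640320) ** 3).sqrt() / (12 * total)

print(chudnovsky_pi(50))
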
"We demand rigidly defined areas of doubt and uncertainty!" - Vroomfondel, H2G2

Limits of Lossless Compression...?

Reply #38
Quote
Anyhow, the observed practical limit of lossless compression seems to be about 50-60% for most sources, regardless of encoding technique. I'm not sure why people are so obsessed with further compressing lossless, but I rather doubt we'd get below 50%, given my experiences with lossless compression of many types of file. Lossy audio, on the other hand...


I would say that's fair enough; I left a link to a small section taken from one of Shannon's papers on the previous page. Is there really any reason to get more than 50% anyway, unless we are invoking some sort of mathematical transform? If people are that obsessed about it, they should use a hybrid lossless codec with correction to satisfy their needs. The original poster, however, was speaking in terms of theoretical limits with everything else aside, and Shannon's work highlights that quite well. I was discussing this with someone once and kept having to say "no! entropy in thermodynamics is not exactly the same thing as entropy in information theory, although they are somewhat related".
budding I.T professional

 

Limits of Lossless Compression...?

Reply #39
Quote
Quote

You basically have two choices when you want to use pi in a data decompressor: either store pi at "full" length somewhere or compute it when required. The latter makes it look as if pi were compressed to the size of the pi-calculation routine, but in fact you have to invest time and electricity into the computation of pi when you need a part of it. While doing this, you have to "throw away" huge portions of pi because you don't need them for that particular purpose. This "data loss" occurs at the same time as electricity is converted into heat - therefore entropy increases while pi is calculated.

Ah ha. That kind of compressibility.  I guess in terms of total universal entropy, you're right. However, I was talking about file size compressibility: the amount of data needed to be transmitted can be made very small, thereby putting the burden of creating all that extra entropy on the processor.

And it seems that I was wrong with that: it is not necessary to calculate irrelevant digits of Pi (http://en.wikipedia.org/wiki/Pi) and to "throw them away". The BBP formula can calculate any binary or hexadecimal digit of pi without calculating the digits preceding it. So, the use of Pi for data compression might be feasible for the decompression part.
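
For reference (my addition), the Bailey-Borwein-Plouffe formula is
$$\pi = \sum_{k=0}^{\infty} \frac{1}{16^k}\left(\frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6}\right),$$
and the $16^{-k}$ factor is what allows the $n$-th hexadecimal digit to be extracted without computing the earlier ones.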

The problem seems to be on the compression side of the codec: it would have to scan a HUGE number of digits of Pi for a match with the data to be compressed. Using current knowledge and technology, this is simply not feasible for data longer than a few bytes. Have a look at Pi-Search, for instance.