Asking how data compression works is a bit like asking how long a piece of string is. There are many different techniques for crunching data down so that it takes up less space in memory or on disk, with each suited to a different situation.
There are two broad kinds of data compression. "Lossy" compression achieves its gains mainly by throwing away - "losing" - some of the information. This only works for things like photos, audio and video, where it is possible to sacrifice some quality by throwing away details most people won't even miss. Digital music and video, including digital TV, DVD and Blu-ray, all use lossy compression.
"Lossless" compression, on the other hand, promises to restore compressed data to a pristine copy of the original.
But if you don't throw away any information, how can you compress anything?
Well, it turns out that digital information is not typically stored as efficiently as it could be. All digital information is stored as bytes, but not all of those bytes convey useful information. Some parts are just padding that take up space, while other parts are redundant and repeat what is already there. Lossless compression simply removes this unnecessary data.
One common variety of lossless compression, known as dictionary compression, recognises that a lot of digital data contains repeated sequences of bytes. In a human readable document, these might be words and phrases. A dictionary compressor finds these repeated byte sequences in its input, adding them to a "dictionary" as it goes. The dictionary appears just once in the output and each repeated sequence is replaced wherever it appears, with a note telling the decompressor where in the dictionary it can find the actual bytes.
Dictionary compression is commonly used in zip files and GIF and PNG graphics files.
Another common kind of compression, invented in the 1950s by United States computer scientist David Huffman, is more subtle. In English, some letters occur more often than others. That is why a Scrabble set has 12 "e" tiles, but only one "z". It is also why the Morse code for "e" is a single dot, while "z" is two dashes and two dots.
Using shorter signals for the more frequent letters makes Morse messages shorter and Huffman coding does the same thing for any kind of digital data.
A Huffman compressor analyses its input and works out which bytes occur most frequently. Those are then given a very short representation in the output, with less frequent bytes given longer and longer representations.
Generally, the output is a lot shorter as a whole.
A related technique called arithmetic coding can do even better, removing almost all the slack from the data it is given.
A lot of the things we take for granted in the digital age, such as digital cameras, VOIP and MP3, would not be practical without data compression to save storage space and transmission time.
- © Fairfax NZ News