Description Length


When we encode information as strings of 1's and 0's we might wonder if we're doing this in the best way possible, e.g., reducing the amount of bandwidth used to transmit these strings. For example, let's say we would like to transmit only one of four characters, A, B, C, or D.

We can pick a fixed length encoding where
A = 00
B = 01
C = 10
D = 11

For a particular input string (e.g., ABAC) the ratio of the length of the output string (00010010) to the length of the input string is the description length of that string. In this case we need 8 bits to encode 4 input characters. With a fixed-length encoding such as the one above, the description length will always be 2, because we need 2 bits to encode each character.

But can we do better? Do we really need 2 bits (on the average) to encode symbols A, B, C, and D?

The average description length is the average number of bits needed per input character, given a particular encoding and the source "language". For example, think of English as the source language. The letter 'e' occurs much more often than 'z', so if we use fewer bits for 'e' and more bits for 'z', on the average we would expect a lower description length than with a fixed-length encoding. Some encodings will have better average description lengths than others, so we want to answer the following question: "What is the minimal average description length for a set of symbols and the frequencies with which they occur?"
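For illustration (the frequencies here are made up): suppose A occurs half the time, B a quarter of the time, and C and D an eighth of the time each. With the variable-length code A = 0, B = 10, C = 110, D = 111, the average description length is 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 1.75 bits per character, beating the 2 bits of the fixed-length encoding. And because no codeword is a prefix of another, the output can still be decoded unambiguously.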

We turn to Information Theory for the answer, which gives us a way to calculate this average description length.

Information Content

[Images: solid_color.gif and 2d_barcode.jpg, a solid color vs. a 2D barcode]

It's mathematically possible to calculate an information content value for any set of things, based on the probabilities of certain values occurring. For example, if something is always the same, it would have a minimum information content.

Let's think first about binary events and binary numbers: the Binary Entropy function

A string of coin tosses (1=heads, 0=tails) of a perfect coin would have maximum information content (H=1); a binary string that was always 000000... would have minimal (H=0) information content. Real-life situations generally lie somewhere in between.

Shannon's Entropy gives us a generalized value for the information content of any representation. If we use base 2 for the logarithm, it is equivalent to the minimum average number of bits required to store a set of information.

So Shannon converts a set of probabilities into an information content value. The probabilities can be derived via statistical analysis if not known a priori.
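As a quick sketch (the probabilities below are just illustrative), the entropy H = -sum(p * log2(p)) can be computed directly from a set of probabilities:

import math

def entropy(probabilities):
    # Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability symbols.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))                # 1.0: a fair coin toss carries 1 bit
print(entropy([1.0]))                     # 0.0: a certain outcome carries no information
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0: four equally likely symbols need 2 bits each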

Why does this matter? Well, Shannon's Entropy can tell us:
  • The maximum theoretical amount of compression possible, i.e., the minimum size to which a sequence can be compressed
  • The minimum average number of bits to store a sequence of things, i.e., the minimal average description length is equal to the entropy!

It is of particular use if we use base 2 for the logarithms, because then it tells us the average number of bits we need (based on the probabilities, rather than the maximum) to store something. For example, think about tossing a coin. What is the entropy (base 2) of 3 coin tosses? What would it be if we had an imperfect coin that came up heads 75% of the time?
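Working these out: a perfect coin has H = 1 bit per toss, so 3 independent tosses carry 3 bits. For the 75% coin, H = -(0.75 log2 0.75 + 0.25 log2 0.25) ≈ 0.81 bits per toss, or roughly 2.4 bits for the 3 tosses, so the biased sequence is more compressible.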

Here are some more examples:
  • To store English characters (27 including space) we need 5 bits (log2(27) rounded up). Yet, based on statistical analysis, the H value (base 2) for the English language is between 0.6 and 1.3 (recently estimated as 1.1). So with a perfect compression algorithm we should only need a little over 1 bit per character! This is why text compression works so well.

[Figure: frequencies of English letters]

  • However, the entropy of DNA sequences seems to be much closer to the maximum: we would need 2 bits to encode the 4 DNA letters, and the entropy appears to be somewhere between 1.84 and 1.95, indicating nature is very efficient at encoding! (A quick check of these numbers follows below.)
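Checking the arithmetic behind both bullets:

import math

# A fixed-length code over 27 English characters (26 letters + space) needs 5 bits...
print(math.ceil(math.log2(27)))   # 5
# ...versus an estimated ~1.1 bits of entropy per character: a big gap, hence good compression.

# DNA has only 4 letters, so a fixed-length code needs exactly 2 bits...
print(math.log2(4))               # 2.0
# ...and an estimated entropy of 1.84-1.95 bits leaves very little room for compression.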

Compression


Compression can be Lossy or Lossless: we have already covered this when considering images and videos. However, compression can be applied to anything that has uncertainty (Shannon's Entropy) > 0. The main concerns are the reduction in storage space and bandwidth versus the time it takes to compress and decompress.

Here are some of the more important techniques for lossless compression:

For more, including common implementations, see the Wikipedia article.
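As one concrete example that ties back to the entropy discussion, here is a minimal sketch of Huffman coding, which assigns shorter codewords to more frequent symbols (an illustrative sketch, not a production implementation; the message string is just an example):

import heapq
from collections import Counter

def huffman_code(text):
    # Build a prefix code giving shorter codewords to more frequent symbols.
    freq = Counter(text)
    # Each heap entry is [total_frequency, tie_breaker, [symbol, code], [symbol, code], ...]
    heap = [[f, i, [sym, ""]] for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)          # the two least frequent subtrees...
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]       # ...get prefixed with 0 and 1 respectively
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], counter] + lo[2:] + hi[2:])
        counter += 1
    return dict((sym, code) for sym, code in heap[0][2:])

message = "ABAACABADAB"
codes = huffman_code(message)
encoded = "".join(codes[ch] for ch in message)
print(codes)
print(len(encoded) / len(message))   # average description length in bits per character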

Encryption (Cryptography)


When we talk about encryption, we are really talking about the field of Cryptography - or "hiding information".

Historically, there has been a cat-and-mouse game of clever people inventing encryption algorithms, and clever people breaking them: for example the classic tale of Alan Turing at Bletchley Park cracking the Nazi Enigma Machine.

Generally, any method will involve the use of an Encryption Key which both parties know. The security of the encryption (against brute force decryption) generally depends on the size of the encryption key (or more importantly, its entropy). For now, a size of 128 bits is considered sufficient: computers for the foreseeable future won't be able to run brute force attacks against such keys in any reasonable time.

A major development in 1976 was Public Key Cryptography which removes the need for an initial sharing of an encryption key.

The most common modern encryption methods are RSA, DES, and AES.

Symmetric cryptography

Let's discuss several terms in the context of symmetric cryptography, where two communicating parties Alice and Bob share a secret key called k. For example this key can be 128 bits long and is used as input to the underlying encryption algorithm such as AES.

The message m to be encrypted is called the plaintext, and the encrypted version of the message is called the ciphertext.

Thus the encryption algorithm ENC takes message m and shared key k as input, and outputs ciphertext c, i.e., ENC(m,k) = c
Likewise the decryption algorithm DEC takes ciphertext c and shared key k as input, and outputs the original plaintext m, i.e., DEC(c, k) = m.
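As a concrete sketch (assuming the pyca/cryptography package is installed), the Fernet recipe wraps AES with a shared symmetric key:

from cryptography.fernet import Fernet

# Alice and Bob share this key ahead of time; Fernet uses AES under the hood.
k = Fernet.generate_key()

m = b"attack at dawn"              # plaintext
c = Fernet(k).encrypt(m)           # ENC(m, k) = c, the ciphertext
assert Fernet(k).decrypt(c) == m   # DEC(c, k) = m recovers the plaintext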

Encryption algorithm
Consider the Caesar Cipher, which shifts letters in the alphabet by some number of characters.


Plaintext : ABCDEFG....XYZ
Ciphertext: DEFGHIJ....ABC

Bruteforce attack and key space
In this case the key is "shift by 3 places". The message 'BAD CAB' would be encrypted to 'EDGFDE'. The ciphertext looks cryptic, but we can all agree this cipher is easy to break. To recover plaintext from a ciphertext, we could simply try all the keys (how many possible keys are there?) and see if any of the decryptions yields intelligible English. This process of trying all the keys is called the brute force attack since it does not try to break the cipher in a clever way, and instead just tries every single key until it succeeds in decrypting the message.

This is the bruteforce attack: we have tried all the possible keys to decrypt the message. The key space for an algorithm is the set of all possible keys, and thus the number of keys tried in the brute force attack is equal to the size of the key space. Note that we are making an assumption about the adversary: we assume the adversary is computationally bound and must use modern computers to crack the cipher. That is, the adversary is a "mere mortal" like the rest of us and cannot try keys faster than modern hardware allows.
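A sketch of both the cipher and the brute force attack (caesar_encrypt is a helper written just for illustration; there are only 25 nontrivial shifts to try):

import string

def caesar_encrypt(text, shift):
    # Shift each letter by `shift` places, wrapping around; leave other characters alone.
    alphabet = string.ascii_uppercase
    table = str.maketrans(alphabet, alphabet[shift:] + alphabet[:shift])
    return text.translate(table)

ciphertext = caesar_encrypt("BAD CAB", 3)
print(ciphertext)   # EDG FDE (the space is left alone here)

# Brute force: try every possible key and eyeball which decryption reads as English.
for key in range(1, 26):
    print(key, caesar_encrypt(ciphertext, -key % 26))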

It was pretty easy to brute force the key space of the Caesar Cipher, but consider the following variation called a substitution cipher, where the ciphertext row is a random permutation of the alphabet, e.g.,


Plaintext : ABCDEFG....XYZ
Ciphertext: TIZPBUY....QEL


What does the key look like (what would Alice and Bob have to agree on as their secret to successfully encrypt and decrypt messages to each other)?
What is the size of the key space? This key space is large enough to make a brute force attack impractical (at least in Caesar's times, now it is borderline feasible).
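The key is the entire bottom row, a permutation of the 26 letters, so the key space is all 26! permutations. A quick calculation shows why brute force is hopeless by hand:

import math

keys = math.factorial(26)
print(keys)              # about 4 x 10^26 possible substitution keys
print(math.log2(keys))   # ~88.4, i.e., roughly a 2^88 key space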

Attacking the algorithm instead
But just because we cannot bruteforce the key space doesn't mean we should give up. Perhaps there is a weakness in the algorithm itself! It turns out it took about 1,000 years for somebody to figure out how to break this substitution cipher (or at least it took someone that long to admit it publicly).

It turns out one can exploit frequency analysis (see the histogram above!): in the table above E maps to B, so one would expect the letter B to occur most often in the ciphertext. With enough ciphertext, the histogram of letter frequencies in the ciphertext will match that of plain English, leaving just the work of mapping the correct letters to be done. This attack is amazingly simple.
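A sketch of the attack, counting letter frequencies in the ciphertext (the ciphertext string here is just a stand-in; the longer the intercepted text, the better this works):

from collections import Counter

ciphertext = "IBZTXB LU WKB PTZZ..."   # hypothetical intercepted text
counts = Counter(ch for ch in ciphertext if ch.isalpha())

# The most common ciphertext letters likely stand for E, T, A, O, ... in the plaintext.
for letter, count in counts.most_common(5):
    print(letter, count)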

Kerckhoffs's principle
Common wisdom tells us we should not rely on "security by obscurity" of the encryption algorithm. Kerckhoffs's Principle tells us we should assume the algorithm is public, and that the security of the cryptosystem should rely only on the secrecy of the key. The general idea is that a publicly known algorithm has to stand up to scrutiny in the public forum. If the algorithm is kept secret, it will likely have weaknesses that public scrutiny would have caught, and if the attacker gets a hold of the algorithm, the security of all communications under that algorithm is at risk. This is what happened with the Enigma in World War II.

Key lengths
So one must use vetted algorithms such as AES, the current accepted standard for symmetric cryptography, and use keys that are long enough to prevent brute force attacks on modern computers. For example, NIST recommends 128-bit keys (the size of the key space here is 2^128, which is a lot of brute force work for an attacker) in the short term, but for security well beyond 2030 one ought to use 256-bit keys.

Entropy of keys
But not all keys are created equal (don't use a key like 10101010101010....10). You want keys that are random, i.e., the entropy of a 128-bit key should be 128 bits. In other words, each bit in the key should be picked perfectly at random. For example, what is the entropy of your 128-bit key if you pick 16 alphabetical characters at random? That yields a key space of only 52^16 keys, which is only about 2^91. In other words, such keys would have only 91 bits of information encoded into 128 bits, and would be within reach of a brute force attacker.
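Checking that arithmetic:

import math

# 16 characters, each drawn uniformly from 52 upper- and lower-case letters
print(16 * math.log2(52))   # ~91.2 bits of entropy, well short of 128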

Consider two real examples:
Closed source code: Netscape's SSL implementation generated a 128-bit key using only 47 bits of randomness derived from the time of day, process ID, and parent process ID.
Open source code: Kerberos used 56-bit DES keys derived from only 32 bits of data (process ID, host ID, key count), and some of these bits were predictable leading to really only 20 bits of randomness. That would take only seconds to brute force.

Lesson: don't cook up your own scheme to derive randomness. Use standard sources of randomness provided by the operating system, for example.
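For example, in Python one would lean on the secrets module, which draws from the operating system's randomness source, rather than rolling one's own scheme:

import secrets

key = secrets.token_bytes(16)   # 16 random bytes = a full 128 bits of entropy
print(key.hex())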

Key Distribution Problem

How many keys do n people need to exchange beforehand to have secure communication amongst themselves? Ans: "n Choose 2" = n(n-1)/2. That's a lot of keys. 30,000 students would have to set up 449,985,000 keys! When and how would they do that?
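Checking the arithmetic:

import math

print(math.comb(30000, 2))   # 449985000 pairwise keys for 30,000 students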

The key distribution problem can be greatly alleviated through public key cryptography or asymmetric cryptography.

Public Key Cryptography

In public key cryptography, each communicating entity has a keypair consisting of a public key k and a private key s. The public key k is publicized widely, and can be used to encrypt data to the owner of the key. Knowledge of the public key does not reveal any knowledge of the private key to the (computationally bound) attacker, and only Alice can decrypt messages encrypted with her public key, by using her private key.

Enc(k, m) = c
Dec(s, c) = m
(note Enc uses public key k, and Dec uses private key s)
So now n people can advertise their n public keys, and anybody can encrypt messages to them! We have gone from n(n-1)/2 keys to n keys that need to be exchanged.

But of course, life isn't easy: how do we know the purported "Alice's public key" is actually Alice's public key? One still needs to obtain Alice's key securely so that we're not encrypting with the wrong public key. This is the problem Public Key Infrastructure (PKI) tries to solve, although we do not discuss it here.

RSA is the main algorithm in use for public key cryptography. The security of the algorithm relies on the assumption that factoring large numbers is hard. Part of the public key includes a modulus N = pq, where p and q are large prime numbers. Factoring N would allow the attacker to infer the private key from the public key, but N is chosen to be a very large number: NIST currently recommends a 2048-bit modulus, which provides security roughly comparable to a 112-bit symmetric key, and 3072 bits or more for security beyond 2030.

One may wonder why a 2048-bit modulus provides far fewer than 2048 bits of security. The reason is that factoring algorithms are clever enough to factor large numbers much faster than a simple brute force attack, i.e., the algorithm doesn't have to try all the numbers from 1 to sqrt(N), and can skip huge ranges of candidates (e.g., we already know even numbers can be skipped). This is why RSA moduli must be so much longer than symmetric keys.
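A sketch using recent versions of the pyca/cryptography package (assumed installed; the message and padding choices are just illustrative):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Alice generates her keypair; the 2048-bit modulus N = p*q is part of the public key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Anyone holding the public key can encrypt to Alice...
c = public_key.encrypt(b"meet at noon", oaep)
# ...but only the private key can decrypt.
assert private_key.decrypt(c, oaep) == b"meet at noon"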

One-time pads

It turns out we do have an unbreakable encryption scheme, called the one-time pad.

The one-time pad XORs the plaintext with a random key to yield a random looking ciphertext. The recipient can XOR the ciphertext with the same key and recover the plaintext. For example,


Plaintext : 101000010000
Key : 101010000101
Ciphertext : 000010010101
XOR again : 101000010000

If each bit of the key is picked perfectly at random, the plaintext cannot be recovered even by a brute-force attacker! This is because every possible plaintext has a corresponding key that may have been used, so the attacker has no idea which key is the correct one. An attacker trying all the keys will produce all possible plaintexts and will not know which plaintext is the one being transmitted. In other words, the attacker has no more information than he/she had before the message was intercepted.
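A sketch of the scheme (xor_bytes is a helper written for illustration; the pad comes from the operating system's randomness source):

import secrets

def xor_bytes(data, pad):
    # The pad must be exactly as long as the message, and must never be reused.
    assert len(pad) == len(data)
    return bytes(d ^ p for d, p in zip(data, pad))

message = b"BAD CAB"
pad = secrets.token_bytes(len(message))          # truly random, single-use key
ciphertext = xor_bytes(message, pad)
assert xor_bytes(ciphertext, pad) == message     # XORing with the same pad recovers the plaintext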

Caveats
Notice here the length of the key must be the same as the length of the plaintext. So to exchange a 1 GB file, Alice and Bob would have to agree on a 1 GB key beforehand. These are very large symmetric keys, which further exacerbates the key distribution problem.

Furthermore, these are indeed one time pads, that is, the key can be used to encrypt messages only once! If Alice encrypts two plaintext messages P1 and P2 with the same key, an attacker can compute C1 XOR C2, the XOR of the two ciphertexts, to yield P1 XOR P2! Now a glorified frequency analysis on the XOR of the plaintexts can yield clues about the plaintext, similar to how we saw the substitution cipher was broken.


So in general, even though we have a theoretically perfectly secure encryption scheme, it is impractical because Alice and Bob have to agree on large random keys beforehand, and these keys can be used only once. Symmetric schemes such as AES, with only 256-bit keys that can be reused, provide a much better solution, but now we must hope the attacker can neither brute force the key nor attack the algorithm itself.

Digital signatures, storing passwords, and hash functions

A cryptographic hash function is a deterministic procedure that takes an arbitrary number of bits as input (called the message) and gives a fixed-size output (called the digest). For example, SHA-1 gives an output of 160 bits. In addition, this function must satisfy at least the three properties of collision resistance, pre-image resistance, and second pre-image resistance. These properties come in handy when it comes to storing passwords and signing data, for example.
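For example, with Python's hashlib (SHA-1 is shown to match the 160-bit digest size mentioned above, though see the caveats at the end of this section):

import hashlib

digest = hashlib.sha1(b"an arbitrarily long message ...").hexdigest()
print(digest)            # 40 hex characters regardless of the input length
print(len(digest) * 4)   # 160 bits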


Consider the compromise of a password database. If passwords are stored as cryptographic hashes, it is hard to invert the hash and recover the password (pre-image resistance) or generate some other password with the same hash (second pre-image resistance). Nonetheless, the attacker might spend time and space to build a large dictionary of 8-character inputs (for example) and their corresponding hashes for a popular choice of hash function such as SHA-1. Now if a password database (or an individual hash) is compromised, the attacker can perform a quick lookup in this dictionary. The addition of a salt as input to the hash function forces the attacker to build a new dictionary for each possible salt value. The length of the salt can be increased to make such precomputation attacks infeasible for the attacker. For example, a 1-byte salt already requires 2^8 dictionaries. OS X currently uses SHA-1 as the hash function with a 4-byte salt, requiring 2^32 precomputed dictionaries.
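A sketch of salted password hashing (PBKDF2 is used here as the hash; the password and iteration count are illustrative):

import hashlib
import os

password = b"correct horse battery staple"
salt = os.urandom(4)   # a 4-byte salt forces 2^32 separate precomputed dictionaries

# Store (salt, stored_hash); the attacker must redo the dictionary work per salt value.
stored_hash = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)

# Verifying a login attempt recomputes the hash with the stored salt.
attempt = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)
assert attempt == stored_hash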

While one could certainly sign large messages using a scheme such as RSA, it is too inefficient. Computing the cryptographic hash of the message is much faster, and one can sign the hash instead. The resultant signature is thus faster to compute and also much smaller (a fixed, small size). Collision resistance ensures that Alice cannot later claim she signed some other message m2 instead of m1 (where hash(m2) = hash(m1)). Similarly, second pre-image resistance prevents somebody else from forging signatures on some new message m2 and claiming it came from Alice (who had originally signed m1).
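Continuing the pyca/cryptography sketch from the RSA section above (a fresh keypair is generated here just for the example; the padding choice is illustrative), signing runs only the hash of the message through RSA:

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
message = b"I, Alice, agree to the terms."

# Only the hash of the message is actually signed with RSA.
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)
signature = private_key.sign(message, pss, hashes.SHA256())

# Anyone with Alice's public key can verify; this raises InvalidSignature on a forgery.
private_key.public_key().verify(signature, message, pss, hashes.SHA256())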

But how can one construct these hash functions? Various algorithms have been proposed, such as SHA-1 and MD5, but recent attacks have shown that their collision resistance has been broken. In SHA-1's case, it should ideally take about 2^80 tries to generate a collision by brute force, but in 2005 researchers showed this could be done in 2^69 tries. In MD5's case, it should ideally take 2^64 tries to generate a collision by brute force, but instead collisions can be generated "in seconds" today. SHA-1 should work for now. NIST will announce the winner for SHA-3 in 2012.