We have seen how the logarithm of the number of possible signals (discrete case) increases *linearly* with time.
Let's now consider the information source. How can it be described mathematically? We want to know, for example, how many bits per second a given source produces.
In telegraphy we send sequences of letters, but they are not completely random: they have statistical structure, for example that of the English language. In English, the letter E occurs more frequently than Q, and the pair TH more often than XP.
This structure can be exploited for encoding. For example, we can assign the shortest symbols to the most common letters. Telegraphy already does something like this: the most common English letter, E, is encoded as a single dot, while infrequent letters like Q, X or Z get longer sequences of dots and dashes. Likewise, standard greetings or anniversary telegrams can be encoded as short code numbers, saving a lot of transmission time.
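As a toy illustration of the saving, here is a minimal Python sketch comparing a frequency-matched code with one that ignores frequencies. The letter frequencies and Morse codeword lengths below are rough illustrative values, not exact figures.

```python
# Toy illustration: frequent letters get short codewords, rare ones long.
# Frequencies and Morse lengths are rough illustrative values.

freq = {"E": 0.12, "T": 0.09, "A": 0.08, "Q": 0.001, "X": 0.0015, "Z": 0.0007}
morse_len = {"E": 1, "T": 1, "A": 2, "Q": 4, "X": 4, "Z": 4}  # symbols per letter

# Expected codeword length, renormalising over this subset of letters:
total = sum(freq.values())
avg = sum(freq[c] / total * morse_len[c] for c in freq)
print(f"average symbols per letter: {avg:.2f}")  # about 1.3

# A code that ignored frequencies would spend the worst-case 4 symbols
# on every letter; matching lengths to frequencies cuts the average cost.
```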
A discrete source generates its message symbol by symbol, choosing each successive symbol according to certain probabilities. In general, the choice of the next symbol may depend on the preceding choices (the source can have a lot of memory, or only remember the last symbol, etc.).
A physical or mathematical system in which the sequence is governed by probabilities is called a *stochastic* process.
Examples:
1) Natural written languages such as English, German or Chinese.
2) Continuous sources that have been discretised, sampled or quantised. For example, a digital audio recorder, which approximates a continuous signal through very fast discrete sampling.
3) Mathematical models in which we define an abstract stochastic process that generates sequences of symbols. For example, a source in which each letter is chosen with a probability that depends on the preceding letter. Writing p(i) for the probability of letter i, pᵢ(j) for the transition probability that letter i is followed by j, and p(i,j) for the probability of the digram i,j, these quantities are related by:
$$p(i,j) = p(i)\,p_i(j)$$

$$p(i) = \sum_j p(i,j) = \sum_j p(j,i) = \sum_j p(j)\,p_j(i)$$

$$\sum_j p_i(j) = \sum_i p(i) = \sum_{i,j} p(i,j) = 1$$
Let's work through an example:
Transition probabilities pᵢ(j), i.e. the probability that letter i is followed by letter j:

| i \ j | A   | B   | C    |
|-------|-----|-----|------|
| A     | 0   | 4/5 | 1/5  |
| B     | 1/2 | 1/2 | 0    |
| C     | 1/2 | 2/5 | 1/10 |

Single-letter probabilities, i.e. the frequency of each letter i:

| i | p(i)  |
|---|-------|
| A | 9/27  |
| B | 16/27 |
| C | 2/27  |

Digram probabilities p(i,j), i.e. the frequency of the consecutive pair i,j:

| i \ j | A    | B     | C     |
|-------|------|-------|-------|
| A     | 0    | 4/15  | 1/15  |
| B     | 8/27 | 8/27  | 0     |
| C     | 1/27 | 4/135 | 1/135 |
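As a quick sanity check (not part of the original argument), here is a small Python sketch that verifies the relations above hold for these tables, using exact fractions to avoid floating-point noise:

```python
from fractions import Fraction as F

letters = "ABC"
pt = {  # transition probabilities p_i(j), copied from the first table
    "A": {"A": F(0), "B": F(4, 5), "C": F(1, 5)},
    "B": {"A": F(1, 2), "B": F(1, 2), "C": F(0)},
    "C": {"A": F(1, 2), "B": F(2, 5), "C": F(1, 10)},
}
p = {"A": F(9, 27), "B": F(16, 27), "C": F(2, 27)}  # single-letter p(i)

# Digram probabilities follow from p(i,j) = p(i)·p_i(j):
pij = {i: {j: p[i] * pt[i][j] for j in letters} for i in letters}
assert pij["B"]["A"] == F(8, 27) and pij["C"]["C"] == F(1, 135)

# p(i) is consistent with the transitions: p(i) = Σ_j p(j)·p_j(i)
for i in letters:
    assert p[i] == sum(p[j] * pt[j][i] for j in letters)

# Everything normalises to 1:
assert sum(p.values()) == 1
assert sum(pij[i][j] for i in letters for j in letters) == 1
print("all relations hold")
```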
A representative sequence with these parameters should show B as the predominant single letter, and BA and BB (each with digram probability 8/27) as the most frequent consecutive pairs. A possible sequence would be
ABBABABABABABABBBABBBBBABABABABABBBACACABBABBBBABBABACBBBABA
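For illustration, here is a minimal sketch of how such a sequence can be produced: a Markov chain that samples each letter from the transition row of the previous one. The function name and seed are ours; any run gives a different but statistically similar sequence.

```python
import random

# Transition probabilities p_i(j) from the table above, as floats.
transitions = {
    "A": {"A": 0.0, "B": 0.8, "C": 0.2},
    "B": {"A": 0.5, "B": 0.5, "C": 0.0},
    "C": {"A": 0.5, "B": 0.4, "C": 0.1},
}

def generate(n, start="A", seed=None):
    rng = random.Random(seed)
    out = [start]
    for _ in range(n - 1):
        row = transitions[out[-1]]  # distribution given the last letter
        out.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(out)

print(generate(60, seed=0))  # a 60-letter sequence like the one above
```

With this chain, B dominates and C stays rare, matching the single-letter frequencies in the tables.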
A next step in complexity would require trigram frequencies: the choice of a letter would have *memory* of the two previous choices. A trigram frequency would be written as p(i,j,k); alternatively, transition probabilities like pᵢⱼ(k) could be used. The process generalises to n-grams.
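To make that generalisation concrete, here is the same sampling idea with memory of the last two letters. The trigram table pᵢⱼ(k) below is made up purely for illustration (two letters only, so every reachable pair has a row):

```python
import random

# Made-up trigram transition probabilities p_ij(k): the next letter k
# is drawn from a distribution conditioned on the last two letters (i, j).
trigram = {
    ("A", "A"): {"A": 0.1, "B": 0.9},
    ("A", "B"): {"A": 0.5, "B": 0.5},
    ("B", "A"): {"A": 0.4, "B": 0.6},
    ("B", "B"): {"A": 0.8, "B": 0.2},
}

def generate_trigram(n, start=("A", "B"), seed=None):
    rng = random.Random(seed)
    out = list(start)
    for _ in range(n - 2):
        row = trigram[tuple(out[-2:])]  # condition on the last two letters
        out.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(out)

print(generate_trigram(40, seed=0))
```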
These are artificial languages, but they are useful as successive approximations to a natural language. The zeroth-order approximation uses equiprobable, independent letters. The first-order approximation keeps the letters independent but gives them the probabilities with which they appear in the natural language; in English, for example, E would have probability about 0.12 and W about 0.02. At this order there is still no memory of previous letters and no digram frequency enforced. It is the second-order approximation that introduces digram structure, or in other words, memory of the last letter. The third order requires trigram structure, and so on.
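Here is a small sketch contrasting the first two orders. The English letter frequencies used are rough textbook values (E ≈ 0.12, W ≈ 0.02, ...), included only to make the contrast visible:

```python
import random
import string

# Rough English letter frequencies (approximate textbook values).
english_freq = {
    "E": 0.127, "T": 0.091, "A": 0.082, "O": 0.075, "I": 0.070,
    "N": 0.067, "S": 0.063, "H": 0.061, "R": 0.060, "D": 0.043,
    "L": 0.040, "U": 0.028, "C": 0.028, "M": 0.024, "W": 0.024,
    "F": 0.022, "G": 0.020, "Y": 0.020, "P": 0.019, "B": 0.015,
    "V": 0.010, "K": 0.008, "J": 0.002, "X": 0.002, "Q": 0.001,
    "Z": 0.001,
}

rng = random.Random(0)

# Zeroth order: independent, equiprobable letters.
zeroth = "".join(rng.choices(string.ascii_uppercase, k=50))

# First order: still independent, but with English letter frequencies.
first = "".join(rng.choices(list(english_freq),
                            weights=list(english_freq.values()), k=50))

print("0th:", zeroth)
print("1st:", first)
```

Even at first order the output visibly drifts toward English: lots of E, T and A, almost no Q or Z, though still with no digram structure.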