# Start from Information Theory

It has been a long time since my last update. Part of the reason is that I needed to work on my postgraduate paper, but, well, I'm a lazy man anyway 😛 Recently I've been reading *Pattern Recognition and Machine Learning* and *The Beauty of Mathematics*, which made me want to write some things down to help me understand them better 🙂

The first thing I want to talk about is information theory. If we want to predict whether an event will happen, the most straightforward way is to use historical data to estimate its probability distribution $p(x)$. Now suppose we want to evaluate the information content of observing a particular value $x$: we need a quantity $h(x)$ that is a monotonic function of the probability $p(x)$ and expresses that information content. To pin $h(x)$ down, consider two events $x$ and $y$ that are unrelated to each other: the information gained from observing both of them should be the sum of the information gained from each separately, so $h(x, y) = h(x) + h(y)$ while $p(x, y) = p(x)p(y)$. From these two relations we get $h(x) = -\log_{2}p(x)$, and you can see that $h(x)$ is measured in "bits".

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information they transmit in the process is obtained by taking the expectation of $h(x)$ with respect to the distribution $p(x)$, and is given by $H[x] = -\sum_{x}p(x)\log_{2}p(x)$. This important quantity is called the entropy of the random variable x. Consider a random variable x having 8 possible states, each of which is equally likely. In order to communicate the value of x to a receiver, we would need to transmit a message of length 3 bits, and indeed the entropy is $H[x] = -8 \times \frac{1}{8}\log_{2}\frac{1}{8} = 3 \text{ bits}$. Furthermore, consider a nonuniform distribution over the 8 states $\{a, b, c, d, e, f, g, h\}$ with probabilities $(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64})$: its entropy is 2 bits, and the code 0, 10, 110, 1110, 111100, 111101, 111110, 111111 gives shorter strings to the more probable states, achieving an average message length of exactly 2 bits.
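As a quick sanity check on these numbers, here is a small Python sketch (the `entropy_bits` helper is my own, and I'm assuming the standard nonuniform distribution $(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64})$ over the eight states) that computes both entropies and the expected length of the variable-length code:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H[x] = -sum p(x) log2 p(x), in bits (0*log 0 treated as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over 8 states: each p = 1/8, entropy = 3 bits.
uniform = [1 / 8] * 8
print(entropy_bits(uniform))  # 3.0

# Nonuniform distribution over the same 8 states (assumed, as in the lead-in).
nonuniform = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy_bits(nonuniform))  # 2.0

# Lengths of the code words 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]
avg_len = sum(p * l for p, l in zip(nonuniform, code_lengths))
print(avg_len)  # 2.0 — the expected code length equals the entropy
```

Note how the code spends its short strings on the probable states: that is exactly why its average length can reach the entropy lower bound.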

So we have the idea of entropy; now let's look at the related notion of differential entropy. If we quantize a continuous variable $x$ into bins of width $\Delta$, the resulting discrete entropy contains a term $-\ln\Delta$, which diverges in the limit $\Delta \rightarrow 0$. This reflects the fact that specifying a continuous variable very precisely requires a large number of bits. For a density defined over multiple continuous variables, denoted collectively by the vector $\mathbf{x}$, the differential entropy is given by $H[\mathbf{x}] = -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x}$.
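To see the diverging $\ln\Delta$ term concretely, here is a minimal numerical sketch (the function names and the choice of a standard Gaussian are my own, not from the book): discretizing a density $p(x)$ into bins of width $\Delta$ gives a discrete entropy of roughly $H[x] - \ln\Delta$ in nats, which blows up as $\Delta$ shrinks.

```python
import math

def discrete_entropy_nats(pdf, delta, lo=-10.0, hi=10.0):
    """Entropy of the discretized density: bin i carries mass ~ p(x_i) * delta."""
    h = 0.0
    x = lo
    while x < hi:
        p = pdf(x + delta / 2) * delta  # midpoint approximation of the bin mass
        if p > 0:
            h -= p * math.log(p)
        x += delta
    return h

gauss = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
h_diff = 0.5 * math.log(2 * math.pi * math.e)  # differential entropy of N(0, 1)

for delta in (0.5, 0.1, 0.01):
    h = discrete_entropy_nats(gauss, delta)
    # H_discrete ~ H_differential - ln(delta): diverges as delta -> 0
    print(delta, h, h_diff - math.log(delta))
```

The discrete entropy and $H[x] - \ln\Delta$ track each other closely, so keeping only the finite part $H[x] = -\int p(x)\ln p(x)\,dx$ is what gives the differential entropy.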

Suppose we have a joint distribution $p(x, y)$ from which we draw pairs of values of x and y. If a value of x is already known, then the additional information needed to specify the corresponding value of y is given by $-\ln p(y|x)$. Thus the average additional information needed to specify y can be written as $H[y|x] = -\sum_{x, y}p(x, y)\ln p(y|x)$, which is called the conditional entropy of y given x. It is easily seen, using the product rule $p(x, y) = p(y|x)p(x)$, that the conditional entropy satisfies the relation $H[x, y] = H[y|x] + H[x]$.
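The chain rule is easy to verify numerically. A tiny sketch, using a made-up joint distribution over two binary variables (the numbers are arbitrary, just for illustration):

```python
import math

def H(probs):
    """Entropy in nats: -sum p ln p (0 ln 0 treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A hypothetical joint distribution p(x, y) over x in {0, 1}, y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x) = sum over y of p(x, y).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Conditional entropy H[y|x] = -sum p(x, y) ln p(y|x), with p(y|x) = p(x, y) / p(x).
H_y_given_x = -sum(p * math.log(p / p_x[x]) for (x, y), p in p_xy.items())

H_xy = H(p_xy.values())
H_x = H(p_x.values())

# Chain rule: H[x, y] = H[y|x] + H[x]
print(H_xy, H_y_given_x + H_x)  # the two numbers agree
```

The identity holds for any joint distribution, since $\ln p(x, y) = \ln p(y|x) + \ln p(x)$ term by term under the expectation.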