Consider some unknown distribution p(x), and suppose that we have modelled this using an approximating distribution q(x). If we use q(x) to construct a coding schemem for the purpose of transmitting values of x to a receiver, then the average additional amount of information required to specify the value of x as a result of using q(x) instead of the true distribution p(x) is given by: and it’s known as relative entropy or Kullback-Leibler divergence or KL divergence. You could also define it as we could give some conclusion in here: \n
1: The value of KL is zero if p(x) and q(x) are exactly same function.
2: If the difference between p(x) and q(x) is larger, the relative entropy will become bigger, otherwise, it will decrease if the variance is smaller.
3: If p(x) and q(x) are distribution function, the relative entropy could been used to measure the difference between them.
The thing need to point out is the relative entropy is not symmetrical quantity, that is to say
Now consider the joint distribution between two sets of variables x and y given by p(x,y). If the sets of variables are independent, then their joint distribution will factorize into the product if their marginals p(x, y) = p(x)p(y). If the variables are not independent, we can gain some idea of whether they are ‘close’ to being independent by considering KL divergence between the joint distribution and the product of the marginals, given by: or we can just say $I(X; Y) = H(X) – H(X|Y)$