

The entropy of a thing is the asymptotic average of the logarithm of the number of ways that the thing occurs. Thus, if there are two independent event generators A and B each generating events a and b with uniform probabilities for each generator, and there are $n_A$ events possible for the first generator, $n_B$ for the second, then there are $n_A n_B$ equally probable possibilities for both generators taken together. Consider N events from $A \times B$, i.e. N pairs (a,b). There are $n_A^N$ ways that events from A can occur, and $n_B^N$ ways events from B can occur. There are $(n_A n_B)^N$ ways that events from $A \times B$ can occur. Thus, the entropy of the joint process is
\[ S(A \times B) = \lim_{N \to \infty} \frac{1}{N} \log (n_A n_B)^N = \log n_A + \log n_B = S(A) + S(B), \]
which is an example of a very important and useful property of entropy: the entropy of independent processes is an additive quantity, whereas the number of ways is a multiplicative quantity. As a further example, consider the probabilities $p_i$, where $\sum_i p_i = 1$ and $p_i \ge 0$ for each i. Sample this distribution N times and find that there are $N_i$ events of type i. The number of ways to see distinct vectors $(N_1, \dots, N_n)$ is
\[ \frac{N!}{N_1! \, N_2! \cdots N_n!}, \]
The logarithm of this is easily computed, and the asymptotics are simply found from Stirling's approximation (equation 6.1.37 of [3]), so that
\[ \log \frac{N!}{N_1! \cdots N_n!} \approx -N \sum_i p_i \log p_i. \]
Clearly there were N events in this, so the average of it is the entropy that was defined above:
\[ S = -\sum_i p_i \log p_i. \]
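A minimal numerical sketch of this counting argument can be written down directly; the helper names `entropy` and `log_multinomial` are my own, not from the text:

```python
import math

def entropy(p):
    """Shannon entropy -sum p_i log p_i (natural log), skipping zero terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def log_multinomial(counts):
    """log of N! / (N_1! ... N_n!), computed via log-gamma to avoid overflow."""
    n_total = sum(counts)
    return math.lgamma(n_total + 1) - sum(math.lgamma(c + 1) for c in counts)

p = [0.5, 0.25, 0.25]
N = 100_000
counts = [round(N * pi) for pi in p]  # N_i = N p_i event counts

# the per-event log number of ways approaches the entropy as N grows
print(log_multinomial(counts) / N)
print(entropy(p))
```

The two printed values agree to several decimal places, illustrating that the Stirling asymptotics kick in quickly for large N.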
Now consider the joint process of two random variables, denoted A and B before, except now the two processes will not necessarily be independent. Generate N events from the joint distribution $p(a_i, b_j)$, with $p(a_i) = \sum_j p(a_i, b_j)$ and similarly $p(b_j) = \sum_i p(a_i, b_j)$. Clearly, the joint entropy of this process is given by $S(A,B) = -\sum_{i,j} p(a_i, b_j) \log p(a_i, b_j)$. How does this relate to S(A) and S(B)? Clearly, the number of ways that two things may occur is no more than the product of the numbers of ways each occurs individually, and this is certainly reflected in this case by the fact that
\[ S(A,B) \le S(A) + S(B), \]
with equality holding iff A is independent of B. The proof is trivial: simply note that $\log x \le x - 1$, consider $p(a)p(b)/p(a,b)$, and show that its average logarithm over $p(a,b)$, namely $\sum_{a,b} p(a,b) \log [p(a)p(b)/p(a,b)]$, is non-positive (where $p(a) = \sum_b p(a,b)$, etc.). See also the generalization of this, the reduced entropy relationship theorem, below.
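The subadditivity bound can be verified numerically for a concrete case; the joint distribution below is a hypothetical example of my own, not one from the text:

```python
import math

def entropy(p):
    """Shannon entropy -sum p log p (natural log), skipping zero terms."""
    return -sum(x * math.log(x) for x in p if x > 0)

# a hypothetical correlated joint distribution p(a, b) on 2x2 outcomes
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# marginals p(a) = sum_b p(a,b) and p(b) = sum_a p(a,b)
pa = {a: sum(v for (x, _), v in joint.items() if x == a) for a in (0, 1)}
pb = {b: sum(v for (_, y), v in joint.items() if y == b) for b in (0, 1)}

S_joint = entropy(joint.values())
S_A, S_B = entropy(pa.values()), entropy(pb.values())
print(S_joint, S_A + S_B)  # S(A,B) <= S(A) + S(B), strictly here
```

Because the example distribution is correlated, the inequality is strict; replacing `joint` with the product of the marginals would make the two values equal.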

Now, let's consider what happens when we are given one of the two outcomes of each event (a,b) from $A \times B$ consistently, and we want to deduce the other. To be specific, let the events from $A \times B$ be generated N times, and let the value of B be seen each time. What is the asymptotic average log number of ways to see A given that B is seen each time? Well, let $N_j$ be the number of times that $b_j$ occurs, and for these occurrences of $b_j$, let $N_{i|j}$ count the occurrences of $a_i$. Define the vectors $\vec{N} = (N_1, \dots, N_m)$ and $\vec{N}_{|j} = (N_{1|j}, \dots, N_{n|j})$; then for a fixed value $b_j$ of B there are
\[ \frac{N_j!}{N_{1|j}! \cdots N_{n|j}!} \]
ways that the A values could be distributed for this $b_j$. Taking the product of these numbers of ways gives us the number of ways that the A values could be distributed given the B values. This is
\[ \prod_j \frac{N_j!}{\prod_i N_{i|j}!}. \]
Taking the logarithm and doing the asymptotics gives us
\[ \log \prod_j \frac{N_j!}{\prod_i N_{i|j}!} \approx -N \sum_{i,j} p(a_i, b_j) \log p(a_i | b_j), \]
where we have $p(a_i|b_j) = p(a_i,b_j)/p(b_j)$. Averaging by dividing by N gives us the entropy of A given B, or $S(A|B) = -\sum_{i,j} p(a_i,b_j) \log p(a_i|b_j)$. Note that $S(A|B) = S(A,B) - S(B)$. Similarly, $S(B|A) = S(A,B) - S(A)$. Note that we could have found the log number of ways that A could occur given $b_j$ as $-N_j \sum_i p(a_i|b_j) \log p(a_i|b_j)$, and then noted that asymptotically $N_j \approx N p(b_j)$ to average this and find the result above.
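As a minimal numerical sketch of the relation $S(A|B) = S(A,B) - S(B)$ (the joint distribution and the helper name `entropy` below are illustrative choices, not from the text):

```python
import math

def entropy(p):
    """Shannon entropy -sum p log p (natural log), skipping zero terms."""
    return -sum(x * math.log(x) for x in p if x > 0)

# a hypothetical correlated joint distribution p(a, b) on 2x2 outcomes
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# marginal p(b) = sum_a p(a, b)
pb = {b: sum(v for (_, y), v in joint.items() if y == b) for b in (0, 1)}

# S(A|B) from the definition: -sum_{a,b} p(a,b) log p(a|b)
S_A_given_B = -sum(v * math.log(v / pb[b]) for (a, b), v in joint.items())
S_joint = entropy(joint.values())
S_B = entropy(pb.values())
# chain rule: S(A|B) = S(A,B) - S(B)
print(S_A_given_B, S_joint - S_B)
```

The two printed values agree, confirming that conditioning removes exactly the marginal uncertainty of B from the joint uncertainty.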

After working through these examples, the interpretation of entropy as an uncertainty, an additive quantity representing the state of ignorance of the outcome, is straightforward. For example, if A is determined by B, then there is no uncertainty in A given B, so immediately $S(A|B) = 0$; further, there is no more uncertainty in the joint distribution than there is in the distribution of B, i.e. S(A,B) = S(B). Finally, note that the quantity $S(A) - S(A|B)$ gives the uncertainty change between not knowing B and knowing B, and is called the mutual information. It is symmetric in its arguments, and can be written as
\[ M(A,B) = S(A) + S(B) - S(A,B) = \sum_{i,j} p(a_i, b_j) \log \frac{p(a_i, b_j)}{p(a_i)\, p(b_j)}. \]
The mutual information is clearly a quantity that, for two random variables, can be labeled the information about one variable that is in the other, and vice versa. It is the information that each random variable shares with the other. In section 3.12 higher-order information functions of this nature, the information correlation functions, are defined, and these can be interpreted as the information shared among a set of random variables.
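The equality of the two forms of the mutual information can be checked numerically; the joint distribution below is a hypothetical example of my own, not one from the text:

```python
import math

def entropy(p):
    """Shannon entropy -sum p log p (natural log), skipping zero terms."""
    return -sum(x * math.log(x) for x in p if x > 0)

# a hypothetical correlated joint distribution p(a, b)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pa = {a: sum(v for (x, _), v in joint.items() if x == a) for a in (0, 1)}
pb = {b: sum(v for (_, y), v in joint.items() if y == b) for b in (0, 1)}

# sum form: sum p(a,b) log [ p(a,b) / (p(a) p(b)) ]
M_sum = sum(v * math.log(v / (pa[a] * pb[b])) for (a, b), v in joint.items())
# entropy form: S(A) + S(B) - S(A,B)
M_ent = entropy(pa.values()) + entropy(pb.values()) - entropy(joint.values())
print(M_sum, M_ent)
```

Both forms give the same positive number for this correlated pair, and would give zero for an independent pair, consistent with the subadditivity bound above.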

There are several other information functions that are of interest. We may define the redundancy of one random variable in another as the mutual information of the two. We might also define the normalized redundancy of two random variables as the mutual information divided by the joint information (entropy), M(A,B)/S(A,B). This is a quantity that has value zero only for independent processes, and has value one when one process completely determines the other. For two or more random variables the redundancy has been defined as the sum of the single entropies minus the joint entropy, $\sum_i S(A_i) - S(A_1, \dots, A_n)$ [66, 93]. This redundancy is distinctly different from that of the information correlation functions to be defined in section 3.12. When there are only two processes this is the mutual information. A measure of correlation has been defined as 1-S(B|A)/S(A) [16]. Note that this is asymmetric in the processes. It is 0 when the entropy of B given A is equal to the entropy of A, which for identically distributed variables occurs only when they are independent. A symmetric function with similar properties is 2(1-S(A,B)/(S(A)+S(B))) = 2M(A,B)/(S(A)+S(B)).
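These derived measures can be computed side by side; as a sketch under an assumed joint distribution (the distribution and variable names below are illustrative, not from the text):

```python
import math

def entropy(p):
    """Shannon entropy -sum p log p (natural log), skipping zero terms."""
    return -sum(x * math.log(x) for x in p if x > 0)

# a hypothetical correlated joint distribution p(a, b)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pa = {a: sum(v for (x, _), v in joint.items() if x == a) for a in (0, 1)}
pb = {b: sum(v for (_, y), v in joint.items() if y == b) for b in (0, 1)}

S_A, S_B = entropy(pa.values()), entropy(pb.values())
S_AB = entropy(joint.values())
M = S_A + S_B - S_AB                       # mutual information

norm_red = M / S_AB                        # normalized redundancy M(A,B)/S(A,B)
corr = 1 - (S_AB - S_A) / S_A              # 1 - S(B|A)/S(A), asymmetric
sym_corr = 2 * (1 - S_AB / (S_A + S_B))    # symmetric variant, = 2M/(S_A+S_B)
print(norm_red, corr, sym_corr)
```

For this example all three lie strictly between 0 and 1, and the symmetric variant agrees with 2M(A,B)/(S(A)+S(B)) as stated above.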

David Wolf

Tue Mar 25 08:11:49 CST 1997