
Convexity properties of entropy

Before continuing, it is necessary to quantify the convexity properties of the entropy. The following theorem [16] is used several times in this work and demonstrates these properties succinctly.

Theorem. Log sum inequality. Given non-negative $a_i$, $b_i$, $i = 1, \ldots, n$,
\[ \sum_{i=1}^{n} a_i \log\frac{a_i}{b_i} \ge \left( \sum_{i=1}^{n} a_i \right) \log\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \]
with equality iff $a_i / b_i = \mathrm{const}$, $i = 1, \ldots, n$.

Proof: By Jensen's inequality and the convexity of $f(x) = x \log x$,
\[ \sum_{i=1}^{n} \alpha_i f(t_i) \ge f\left( \sum_{i=1}^{n} \alpha_i t_i \right) \qquad (2.12) \]
for $\alpha_i \ge 0$ and $\sum_{i=1}^{n} \alpha_i = 1$. Let
\[ \alpha_i = \frac{b_i}{\sum_{j=1}^{n} b_j}, \qquad t_i = \frac{a_i}{b_i} \]
in equation 2.12, along with continuity (if any $a_i$, $b_i$ are zero), to find the result. QED.
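
The log sum inequality can be checked numerically. The following is a minimal sketch, not part of the original text: the helper name `log_sum_gap`, the sequence length and the random test values are all illustrative assumptions. It evaluates the difference between the two sides of the inequality for a generic pair of positive sequences and for a proportional pair, where the equality condition should hold.

```python
import math
import random


def log_sum_gap(a, b):
    """Return (left side - right side) of the log sum inequality.

    a and b are equal-length sequences of strictly positive numbers.
    """
    lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * math.log(sum(a) / sum(b))
    return lhs - rhs


random.seed(0)  # fixed seed so the illustration is reproducible
a = [random.uniform(0.1, 2.0) for _ in range(5)]
b = [random.uniform(0.1, 2.0) for _ in range(5)]

# Generic sequences: the gap should be strictly positive.
gap_generic = log_sum_gap(a, b)

# Proportional sequences (a_i / b_i constant): the gap should vanish.
gap_proportional = log_sum_gap(a, [3.0 * ai for ai in a])

print(gap_generic, gap_proportional)
```

The proportional case reproduces the equality condition only up to floating-point rounding, so the second value is expected to be of the order of machine precision rather than exactly zero.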

Reduced probability distribution functions are defined as integrals of the complete distribution over some subset of its variables. For example, equation 2.8 has already made use of two such reduced distributions. The next theorem makes explicit the connection between the entropies of different orders based on these reduced distributions. Let $p_A$ be the distribution function reduced over those variables not in the set $A$. As usual, define the reduced entropy as
\[ S_A = -\int p_A \log p_A \, dx_A \]
The following theorem demonstrates the relationship between entropies based on different reductions of the distribution function.

Theorem. Reduced entropy relationship. Given the full distribution $p(x_1, \ldots, x_n)$ and the distributions reduced from it, $p_{A \cup B}$, $p_A$ and $p_B$, where A and B are disjoint sets of random variables with elements in $\{x_1, \ldots, x_n\}$, then
\[ S_{A \cup B} \le S_A + S_B \]
with equality when $x_A$ is independent of $x_B$.

Proof: From the continuous version of the log sum theorem we have
\begin{eqnarray*}
\int p_{A \cup B} \log\frac{p_{A \cup B}}{p_A \, p_B} \, dx_A \, dx_B
  &\ge& \left( \int p_{A \cup B} \, dx_A \, dx_B \right) \log\frac{\int p_{A \cup B} \, dx_A \, dx_B}{\int p_A \, p_B \, dx_A \, dx_B} \\
  &=& 0
\end{eqnarray*}
From this the theorem follows almost immediately, since the left-hand side equals $S_A + S_B - S_{A \cup B}$. QED.

Note that in the discrete case a looser upper bound occurs when $p_B$ is replaced by $1/d$, where $d$ is the product of the number of objects in each summation over $x_i$ with $x_i \in B$. This also gives rise to the interpretation of the entropy as a dimension: here $d$ or $\log d$ is the number of degrees of freedom in the stochastic variable $x_B$, which leads to interesting results regarding the dimension of chaotic time series; see [63].
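
A small numerical check of the discrete case is sketched below. It reads the theorem as the subadditivity bound S(A,B) <= S(A) + S(B) and the looser bound as S(A,B) <= S(A) + log d, where d is the number of values the second variable can take; this reading, the 2x3 joint table and all names in the code are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical joint table: joint[i][j] = P(A = i, B = j), entries sum to 1.
joint = [[0.30, 0.10, 0.05],
         [0.05, 0.20, 0.30]]

p_a = [sum(row) for row in joint]            # distribution reduced to A
p_b = [sum(col) for col in zip(*joint)]      # distribution reduced to B
d = len(p_b)                                 # number of values B can take

s_ab = entropy([p for row in joint for p in row])
s_a = entropy(p_a)
s_b = entropy(p_b)

# Product of the reduced distributions: the independent case,
# for which the subadditivity bound should hold with equality.
independent = [pa * pb for pa in p_a for pb in p_b]
s_independent = entropy(independent)

print("S_AB           =", s_ab)
print("S_A + S_B      =", s_a + s_b)
print("S_A + log d    =", s_a + math.log(d))
print("S_AB (indep.)  =", s_independent)
```

Replacing the reduced distribution of B by the uniform value 1/d can only raise the bound, because log d is the largest entropy a variable with d values can have.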

Intuitively, the reduced distributions contain less information than the full distribution. This is correct in the discrete case, since the conditional entropy of the removed variables given the retained ones is non-negative, so that the reduced entropy is no greater than the full entropy. Per degree of freedom, however, the comparison runs the other way: the entropy per degree of freedom of the reduced distributions is no smaller than that of the full distribution, as is shown in the next theorem.

Theorem. Reduced entropy per degree of freedom relationship. Given a full distribution over n objects $p(x_1, \ldots, x_n)$, $n \ge 2$, with the probabilities $p(x_k \mid x_{k-1}, \ldots, x_{k-r})$ independent of k (the conditional density is dependent only on the values of its arguments, not the position in the sequence of variables) for $1 \le r < k$ and $k \le n$, then
\[ \frac{S_n}{n} \le \frac{S_{n-1}}{n-1} \]

Proof: Because of the shift invariance, we may indicate the dependence on r neighboring variables $x_{k-1}, \ldots, x_{k-r}$ by the subscript r, so that $S_r$ is the entropy of any r consecutive variables. Similarly, indicate conditioning of $x_k$ on $x_{k-1}, \ldots, x_{k-r}$ by the subscript $|r$. Expand $S_n$
\[ S_n = S_1 + S_{|1} + S_{|2} + \cdots + S_{|n-1} = \sum_{r=0}^{n-1} S_{|r} \qquad (2.18) \]
Note that $S_{|n-1}$ is conditioned on a superset of the variables that the other conditioned entropies above are conditioned on, i.e. that $S_{|n-1} \le S_{|r}$ for $r \le n-1$ (note $S_{|0} = S_1$). Thus rearrange equation 2.18 as
\begin{eqnarray*}
\frac{S_{n-1}}{n-1} - \frac{S_n}{n}
  &=& \frac{1}{n(n-1)} \left[ n S_{n-1} - (n-1) S_n \right] \\
  &=& \frac{1}{n(n-1)} \sum_{r=0}^{n-2} \left( S_{|r} - S_{|n-1} \right) \ge 0
\end{eqnarray*}
where the difference is non-negative because each term in the sum is non-negative. QED.
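
The quantities in this proof can be computed explicitly for a simple shift-invariant process. The sketch below assumes a two-state stationary Markov chain; the transition matrix `T`, its stationary distribution `PI` and every name in the code are illustrative choices, not taken from the text. It computes the block entropies of n consecutive variables by direct enumeration, forms the conditional entropies as successive differences, and lists the entropy per variable so that the ordering can be inspected directly.

```python
import itertools
import math

# Hypothetical two-state chain: T[i][j] = P(next = j | current = i).
T = [[0.9, 0.1],
     [0.4, 0.6]]
# Stationary distribution of T (PI = PI * T), so the process is shift invariant.
PI = [0.8, 0.2]


def block_entropy(n):
    """Entropy in nats of n consecutive variables, by enumerating all sequences."""
    total = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        p = PI[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            p *= T[prev][cur]
        total -= p * math.log(p)
    return total


N = 6
# S[n] is the block entropy of n variables; S[0] = 0 by convention.
S = [0.0] + [block_entropy(n) for n in range(1, N + 1)]

# cond[r] is the entropy of one variable conditioned on its r predecessors;
# cond[0] is the unconditioned single-variable entropy S[1].
cond = [S[r + 1] - S[r] for r in range(N)]

# Entropy per variable for block lengths 1..N.
per_dof = [S[n] / n for n in range(1, N + 1)]

for n in range(1, N + 1):
    print(n, round(cond[n - 1], 6), round(per_dof[n - 1], 6))
```

Because the chain is first order, the conditional entropies stop changing once one predecessor is included; a process with longer memory would show a more gradual decrease.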

The statement used in the proof of the theorem above, that conditioning on more variables does not increase the entropy, is a direct consequence of the log sum inequality. In chapter 9, on estimating unknown parameters from data, we prove that the uncertainty about the unknown parameters does not generally decrease when the inference is based upon more data (that is, when the distribution of the unknown parameters is conditioned upon more data). This result may seem to contradict the theorem above. However, the data are a specific instance of the possible data and are thus not averaged over, while the conditional entropy averages over the conditioning variables as well.
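
The distinction between conditioning on a specific instance and averaging over the conditioning variable can be seen in a two-variable example. The sketch below uses a hypothetical 2x2 joint table, with all names chosen for illustration: observing one particular value of Y raises the entropy of X above its unconditioned value, while the conditional entropy averaged over Y stays below it.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical joint table: joint[y][x] = P(X = x, Y = y).
joint = [[0.0, 0.75],
         [0.125, 0.125]]

p_y = [sum(row) for row in joint]            # P(Y = y)
p_x = [sum(col) for col in zip(*joint)]      # P(X = x)

# Entropy of X with no conditioning.
h_x = entropy(p_x)

# Entropy of X given each specific observed value of Y (not averaged).
h_x_given_each_y = [entropy([p / py for p in row])
                    for row, py in zip(joint, p_y)]

# Conditional entropy: the average of the above over the values of Y.
h_x_given_y = sum(py * h for py, h in zip(p_y, h_x_given_each_y))

print("H(X)            =", h_x)
print("H(X | Y = 0)    =", h_x_given_each_y[0])
print("H(X | Y = 1)    =", h_x_given_each_y[1])
print("H(X | Y) (mean) =", h_x_given_y)
```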

An interesting application of the reduced entropy per degree of freedom is given in [98], where an overall complexity measure for a time series is developed as the sum of these entropies.


David Wolf

Tue Mar 25 08:11:49 CST 1997