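Theorem. Log sum inequality. Given non-negative $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},$$
with equality iff $a_i / b_i$ is constant.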
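Proof: By Jensen's inequality and the convexity of $f(x) = x \log x$,
$$\sum_i \alpha_i f(x_i) \ge f\left(\sum_i \alpha_i x_i\right)$$
for $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$.
Let $\alpha_i = b_i / \sum_j b_j$ and $x_i = a_i / b_i$ in equation 2.12,
along with continuity (if any $a_i$, $b_i$ are zero) to find the result. QED.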
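The inequality and its equality condition can be checked numerically. The following is a minimal sketch, not part of the original text: the helper name `log_sum_sides`, the random test values, and the proportionality constant 2.5 are all illustrative assumptions, and strictly positive inputs are used so that no limiting argument is needed.

```python
import math
import random


def log_sum_sides(a, b):
    """Return (lhs, rhs) of the log sum inequality for positive sequences a, b."""
    lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * math.log(sum(a) / sum(b))
    return lhs, rhs


random.seed(0)  # fixed seed so the run is repeatable
a = [random.uniform(0.1, 5.0) for _ in range(6)]
b = [random.uniform(0.1, 5.0) for _ in range(6)]

lhs, rhs = log_sum_sides(a, b)
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}, lhs >= rhs: {lhs >= rhs}")

# Equality case: a_i / b_i is the same constant for every i.
b_prop = [2.5 * ai for ai in a]
lhs_eq, rhs_eq = log_sum_sides(a, b_prop)
print(f"proportional case: lhs - rhs = {lhs_eq - rhs_eq:.2e}")
```

In the proportional case the two sides agree up to floating-point rounding, which matches the equality condition of the theorem.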
Reduced probability distribution functions are obtained by integrating
the complete distribution over some subset of its variables.
For example, equation 2.8
has already made use of the reduced distributions
and
.
The next theorem makes explicit the connection between the entropies of
different orders based on these reduced distributions. Let
be the distribution function reduced over those variables not in the set
indicated. As usual, define the reduced entropy as
The following theorem demonstrates the relationship between entropies
based on different reductions of the distribution function.
Theorem. Reduced entropy relationship. Given the full distribution
and the distributions reduced from it,
,
and
,
where A and B are sets of random variables with elements in
then
with equality when
is independent of
.
Proof: From the continuous version of the log sum theorem we have
From this the theorem follows almost immediately. QED.
Note that in the discrete case a looser upper bound occurs when
is replaced by 1/d, where d is the product of the number
of objects in each summation over
with
.
This also gives rise to the interpretation of the entropy as a dimension
- here d or
is the number of degrees of freedom in the stochastic variable
,
which leads to interesting results regarding the dimension of chaotic time
series, see [63].
Intuitively, the reduced distributions contain less information than
the full distribution. This is correct, since
,
indicating that
,
the reduced entropy being less than the full entropy. But there is another
sense in which the reduced distribution contains less information: the
entropy per degree of freedom is less for the reduced distributions, as
is shown in the next theorem.
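The first sense described above can be checked numerically in the discrete case. The sketch below is an illustration only: the joint table `joint` is hypothetical, entropies are discrete Shannon entropies in nats, and the names `h_x`, `h_y`, `h_xy` are introduced here rather than taken from the text. It forms the reduced distributions by summing out the dropped variable, then compares the reduced entropies with the full entropy. It also checks the standard subadditivity bound, which becomes an equality for independent variables.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of an iterable of probabilities; 0 log 0 := 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical joint distribution p(x, y) on a 2 x 3 grid (rows: x, columns: y).
joint = [
    [0.10, 0.20, 0.10],
    [0.25, 0.05, 0.30],
]

# Reduced distributions: sum the joint over the variable that is dropped.
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

h_xy = entropy(p for row in joint for p in row)
h_x = entropy(p_x)
h_y = entropy(p_y)

print(f"H(x,y) = {h_xy:.4f}, H(x) = {h_x:.4f}, H(y) = {h_y:.4f}")
print("reduced <= full:", h_x <= h_xy and h_y <= h_xy)
print("full <= sum of reduced:", h_xy <= h_x + h_y)

# Independent case: the joint is the product of its marginals,
# so H(x,y) equals H(x) + H(y) up to rounding.
product = [[px * py for py in p_y] for px in p_x]
h_prod = entropy(p for row in product for p in row)
print(f"independent case: H(x,y) - H(x) - H(y) = {h_prod - h_x - h_y:.2e}")
```

The comparison "reduced <= full" relies on the discrete setting, where conditional entropies are non-negative.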
Theorem. Reduced entropy per degree of freedom relationship. Given a
full distribution over n objects
,
,
with the probabilities
independent of k (the conditional density is dependent only on the values
of its arguments, not the position in the sequence of variables) for
and
,
then
Proof: Because of the shift invariance, we may indicate the dependence
on r neighbor subscripts
by the subscript r. Similarly, indicate conditioning of
on
by the subscript
.
Expand
Note that
is conditioned on a superset of the variables that the other conditioned
entropies above are conditioned on, i.e. that
for
(note
).
Thus rearrange equation 2.18
as
where the difference is clearly positive because each term in the
sum is positive. QED.
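The two ingredients of this proof, the chain-rule expansion and the ordering of the conditional entropies, can be illustrated for a discrete, shift-invariant process. The sketch below assumes a two-state, first-order Markov chain started in its stationary distribution. The transition table `P`, the distribution `pi`, and the function `block_entropy` are hypothetical choices made for illustration and are not taken from the text. It computes the joint entropy $H_n$ of $n$ consecutive variables by direct enumeration, the entropy per variable $H_n / n$, and the conditional entropies $H_n - H_{n-1}$.

```python
import math
from itertools import product

# Hypothetical two-state transition probabilities P[i][j] = p(next = j | current = i).
P = [[0.9, 0.1],
     [0.3, 0.7]]
# Stationary distribution of P, so the chain is shift invariant.
pi = [0.75, 0.25]


def block_entropy(n):
    """Entropy (nats) of the joint distribution of n consecutive variables."""
    total = 0.0
    for seq in product(range(2), repeat=n):
        p = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            p *= P[prev][cur]
        if p > 0.0:
            total -= p * math.log(p)
    return total


H = [block_entropy(n) for n in range(1, 7)]  # H[0] is H_1, H[1] is H_2, ...
per_variable = [h / n for n, h in enumerate(H, start=1)]
# Chain rule: H_n - H_{n-1} is the entropy of x_n conditioned on x_1..x_{n-1}.
conditional = [H[0]] + [H[n] - H[n - 1] for n in range(1, len(H))]

for n, (h, c) in enumerate(zip(per_variable, conditional), start=1):
    print(f"n={n}: H_n/n = {h:.4f}, H(x_n | x_1..x_(n-1)) = {c:.4f}")
```

In this discrete example the conditional entropies do not increase as more variables are conditioned on, and $H_n / n$, being the average of the first $n$ conditional entropies, does not increase with $n$ either.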
The statement used in the proof above, that conditioning on more variables decreases the entropy, is a direct consequence of the log sum inequality. In chapter 9, on estimating unknown parameters from data, we prove that the uncertainty about the unknown parameters does not generally decrease when the inference is based upon more data (that is, when the distribution of the unknown parameters is conditioned upon more data). This result may seem to contradict the theorem above. However, the data are a specific instance of the possible data and are thus not averaged over, whereas the conditional entropy also averages over the conditioning variables.
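The distinction between conditioning on a specific instance and averaging over the conditioning variable can be seen in a small discrete example. The sketch below uses a hypothetical joint distribution `joint` over two binary variables, chosen for illustration and measured in bits. All names in it are assumptions introduced here. For one particular value of $y$ the entropy of $x$ goes up, while the conditional entropy averaged over $y$ goes down.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0.0)


# Hypothetical joint distribution p(x, y); keys are (x, y).
joint = {(1, 1): 0.0, (2, 1): 0.75, (1, 2): 0.125, (2, 2): 0.125}
xs, ys = (1, 2), (1, 2)

p_x = [sum(joint[(x, y)] for y in ys) for x in xs]
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

h_x = entropy_bits(p_x)
# Entropy of x given one specific observed value of y (not averaged over y).
h_x_given = {y: entropy_bits(joint[(x, y)] / p_y[y] for x in xs) for y in ys}
# Conditional entropy: the average of the above over the distribution of y.
h_x_given_y = sum(p_y[y] * h_x_given[y] for y in ys)

print(f"H(x)       = {h_x:.4f} bits")
for y in ys:
    print(f"H(x | y={y}) = {h_x_given[y]:.4f} bits")
print(f"H(x | y)   = {h_x_given_y:.4f} bits")
```

Here observing $y = 2$ leaves $x$ uniformly distributed, so the uncertainty about $x$ rises to one bit, above $H(x) \approx 0.544$ bits, even though the averaged conditional entropy is only 0.25 bits.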
An interesting application of the reduced entropy per degree of freedom is given in [98], where an overall complexity measure for a time series is developed as the sum of these entropies.