The measure of the uncertainty of (and confusion about) the unknown
parameters is the entropy of the distribution of the unknown parameters.
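Working in the probability density framework, the entropy of the distribution
of the parameters $\theta$ after seeing n data samples $D_n = (x_1, \ldots, x_n)$
is given by
$$H_n = -\int d\theta \; p(\theta \mid D_n) \log p(\theta \mid D_n).$$
The change in the uncertainty of the parameters upon seeing the nth
data sample is then given by
$$\Delta H_n = H_n - H_{n-1}.$$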
At first it might be thought that this uncertainty should always decrease.
However, this is not the case. Suppose, for example, that the mean $\theta$ of a
gaussian distribution is to be inferred, while the width of the gaussian
is known. Suppose further that n=2, and that the data samples $x_1$ and $x_2$
happen to lie very far apart from each other, which can happen by chance
for gaussian distributed data. In this case, $p(\theta \mid x_1)$
is going to be more sharply peaked than $p(\theta \mid x_1, x_2)$.
This is because the one sample
inference puts density at $x_1$,
while the two sample
inference puts density at both of the data locations, which are far apart.
In fact, taking the uniform distribution for the parameter prior, the one
sample inference is a single gaussian bump centered on the data, while,
if the two data samples are very far apart, the two sample inference is
two identical gaussian bumps (having half the height and the same width
as the one sample inference gaussian bump), and the entropy of the two
sample inference will be one bit greater than that of the one sample inference.
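The one-bit figure can be checked numerically. The sketch below is only an illustration, not part of the original derivation: it takes the two densities exactly as described above (a single gaussian bump, and two identical bumps of half the height) and integrates $-p \log_2 p$ on a grid. The width `sigma`, the data values `x1` and `x2`, and the integration range are arbitrary assumptions, chosen so that the two bumps do not overlap.

```python
import math

def gauss(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def entropy_bits(density, lo, hi, steps=50_000):
    """Differential entropy -integral of p*log2(p), midpoint rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        p = density(lo + (i + 0.5) * dx)
        if p > 0.0:  # skip points where the density underflows to zero
            total -= p * math.log2(p) * dx
    return total

sigma = 1.0            # assumed known width
x1, x2 = -20.0, 20.0   # hypothetical, widely separated data values

def one_bump(t):
    # single gaussian bump centered on the first data value
    return gauss(t, x1, sigma)

def two_bumps(t):
    # two identical bumps, each with half the height of the single bump
    return 0.5 * gauss(t, x1, sigma) + 0.5 * gauss(t, x2, sigma)

h_one = entropy_bits(one_bump, -40.0, 40.0)
h_two = entropy_bits(two_bumps, -40.0, 40.0)
print("entropy difference in bits:", h_two - h_one)
```

With these settings the printed difference is very close to 1, the entropy of the fair binary choice between the two bumps.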
Sometimes new data leads to increased confusion about the parameters.
Now, consider what happens when the average over data sets
of the change in uncertainty (confusion) is taken. The average of interest
is the average change in confusion when a new data sample is seen, given
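by
$$\langle \Delta H_n \rangle = \int dD_n \; p(D_n) \left( H_n - H_{n-1} \right), \tag{9.3}$$
where $D_n = (x_1, \ldots, x_n)$ is the data set, $H_n$ is the entropy of the parameter distribution $p(\theta \mid D_n)$, and $H_{n-1}$ depends only on $D_{n-1}$.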
The next step is to show that the average change in the confusion about
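the parameters is negative. To do this, note that for any function
$f(\theta)$, which may also depend on the earlier data $D_{n-1} = (x_1, \ldots, x_{n-1})$,
$$\begin{aligned}
\int dx_n \; p(x_n \mid D_{n-1}) \int d\theta \; p(\theta \mid D_n) f(\theta)
&= \int d\theta \; f(\theta) \int dx_n \; p(\theta, x_n \mid D_{n-1}) \\
&= \int d\theta \; p(\theta \mid D_{n-1}) f(\theta).
\end{aligned} \tag{9.4}$$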
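The identity of equation 9.4 says that averaging the post-sample expectation of a function over the predictive distribution of the new sample recovers the pre-sample expectation. The sketch below checks this numerically under stated assumptions: it swaps in a discrete stand-in model (a coin whose bias `theta` lives on a grid, with binary samples), so integrals become sums. The grid size, the three earlier samples, and the test function `f` are all hypothetical choices.

```python
import math

# Hypothetical stand-in model: theta is a coin bias on a grid in (0, 1).
thetas = [(k + 0.5) / 50 for k in range(50)]

def likelihood(x, theta):
    """Probability of the binary sample x (1 = heads) given the bias theta."""
    return theta if x == 1 else 1.0 - theta

def update(prior, x):
    """Bayes update: return (posterior on the grid, predictive probability of x)."""
    joint = [likelihood(x, t) * p for t, p in zip(thetas, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

def expect(dist, func):
    """Expectation of func(theta) under a distribution on the grid."""
    return sum(p * func(t) for t, p in zip(thetas, dist))

def f(theta):
    # arbitrary test function of the parameter
    return math.sin(7.0 * theta) + theta * theta

# p(theta | D_{n-1}): uniform prior updated on three earlier samples (all heads)
before = [1.0 / len(thetas)] * len(thetas)
for x in (1, 1, 1):
    before, _ = update(before, x)

# Right side of the identity: expectation before the new sample is seen.
rhs = expect(before, f)

# Left side: average over the new sample of the expectation after seeing it.
lhs = 0.0
for x in (0, 1):
    after, predictive = update(before, x)
    lhs += predictive * expect(after, f)

print(lhs, rhs)
```

The two printed numbers agree to rounding error, because the predictive probability and the posterior multiply back to the joint distribution of the parameter and the new sample.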
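Now, expand the average change in the confusion written in equation 9.3
(with $D_n = (x_1, \ldots, x_n)$ the data and $\theta$ the parameters) as
$$\langle \Delta H_n \rangle = -\int dD_n \; p(D_n) \int d\theta \; p(\theta \mid D_n) \log p(\theta \mid D_n) + \int dD_{n-1} \; p(D_{n-1}) \int d\theta \; p(\theta \mid D_{n-1}) \log p(\theta \mid D_{n-1})$$
and simplify the inner integral of the second term on the right side
using the identity proven in equation 9.4, with $f(\theta) = \log p(\theta \mid D_{n-1})$ and $p(D_{n-1}) \, p(x_n \mid D_{n-1}) = p(D_n)$,
to find
$$\langle \Delta H_n \rangle = -\int dD_n \; p(D_n) \int d\theta \; p(\theta \mid D_n) \log p(\theta \mid D_n) + \int dD_n \; p(D_n) \int d\theta \; p(\theta \mid D_n) \log p(\theta \mid D_{n-1}).$$
Collect the logarithms and note that the integral over $\theta$
is a Kullback-Leibler distance, and allows us to apply the information
inequality (for probability densities $p(\theta)$ and $q(\theta)$,
$\int d\theta \; p(\theta) \log \left[ p(\theta) / q(\theta) \right] \ge 0$)
to find that the average change in the confusion about the parameters is
negative (strictly, non-positive):
$$\langle \Delta H_n \rangle = -\int dD_n \; p(D_n) \int d\theta \; p(\theta \mid D_n) \log \frac{p(\theta \mid D_n)}{p(\theta \mid D_{n-1})} \le 0.$$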
Because negative uncertainty is information, a negative change in confusion
corresponds to a positive change in information. Thus we have proven the
following theorems:
Theorem: Information increases on the average. Although in particular data the information about the parameters may decrease upon seeing a new data sample, on the average the information about the parameters increases upon seeing a new data sample.
Theorem: Average information increase is the Kullback-Leibler distance. The average increase in the information about the parameters is the average of the Kullback-Leibler distance between the parameter distributions conditioned on the data after and before the new sample is seen.
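Both statements can be illustrated numerically. The sketch below is a minimal illustration under stated assumptions, not the derivation itself: it uses a discrete stand-in model (a coin bias `theta` on a grid, binary samples) instead of a continuous density, conditions on four hypothetical earlier samples, and averages only over the new sample. It computes the entropy change for each possible new sample, the average change, and the average Kullback-Leibler distance between the distributions after and before the new sample.

```python
import math

# Hypothetical stand-in model: theta is a coin bias on a grid in (0, 1).
thetas = [(k + 0.5) / 50 for k in range(50)]

def update(prior, x):
    """Bayes update: return (posterior on the grid, predictive probability of x)."""
    like = [t if x == 1 else 1.0 - t for t in thetas]
    joint = [l * p for l, p in zip(like, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

def entropy_bits(dist):
    """Entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)

def kl_bits(p, q):
    """Kullback-Leibler distance from q to p, in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0.0)

# Distribution before the new sample: uniform prior updated on four heads.
before = [1.0 / len(thetas)] * len(thetas)
for x in (1, 1, 1, 1):
    before, _ = update(before, x)

h_before = entropy_bits(before)
changes = {}       # entropy change for each possible new sample
avg_change = 0.0   # average entropy change over the new sample
avg_kl = 0.0       # average KL distance, after versus before
for x in (0, 1):
    after, predictive = update(before, x)
    changes[x] = entropy_bits(after) - h_before
    avg_change += predictive * changes[x]
    avg_kl += predictive * kl_bits(after, before)

print("change after a tail:", changes[0])
print("change after a head:", changes[1])
print("average change:", avg_change, " minus average KL:", -avg_kl)
```

In this setup an unexpected tail raises the entropy while another head lowers it, and the average change is negative and equal, up to rounding, to minus the average Kullback-Leibler distance.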