
 

Inferring unknown parameters from data

What is given in general is data, not the true data generating system, nor its true parameters. The class of possible data generating systems is usually somewhat understood; the values of the parameters for any particular system are usually less well understood. For instance, the system generating the data may be known to give Gaussian distributed data, while the mean and width parameters of the Gaussian are unknown. The data does not determine these parameters, but it indicates them, and if we proceed intelligently we may infer them. With a sufficient amount of data the parameters are determined as well as is possible, and in this large data case almost any method for estimating the parameters will suffice. What is crucial is the method used to estimate the parameters when the data does not strongly determine them. For instance, how would you estimate the probability that the next bit will be a one after seeing only three bits generated by a random bit generator which could be generating ones with any probability $p$? How would you estimate the uncertainty of the guessed value of $p$?
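One way to make the three-bit question concrete is sketched below. It assumes a uniform prior on $p$, which the text does not specify; under that assumption the posterior is a Beta distribution, its mean is the predictive probability of a one, and its standard deviation is one measure of the uncertainty in $p$. The function name and the observed bits are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def beta_posterior_summary(ones, zeros):
    """Posterior mean and variance of p, ASSUMING a uniform prior on [0, 1].

    With a uniform prior the posterior is Beta(ones + 1, zeros + 1).
    """
    a = Fraction(ones + 1)
    b = Fraction(zeros + 1)
    mean = a / (a + b)  # also the predictive probability that the next bit is 1
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

bits = [1, 1, 0]  # hypothetical data: three observed bits
ones = sum(bits)
zeros = len(bits) - ones
mean, variance = beta_posterior_summary(ones, zeros)

print("P(next bit is 1) =", mean)
print("posterior standard deviation of p =", float(variance) ** 0.5)
```

A different prior would give different numbers; with only three bits the choice of prior matters a great deal, which is the point of the question.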

The nuts and bolts of estimating something unknown from data lie within Bayes' theorem. Bayes' theorem is simply two ways of writing the joint probability distribution of data and parameters, but most importantly in our application it relates the probability of the unknown parameters to the known data. Bayesian estimation methodology differs widely from likelihood based methods in that it is this distribution of parameters given data, $P(\text{parameters} \mid \text{data})$, that is used in making guesses about the parameters, rather than the likelihood $P(\text{data} \mid \text{parameters})$. The likelihood is simply what we use to generate simulated data after the unknown parameters are guessed. Bayes' theorem may be written
\[
P(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\, P(\text{parameters})}{P(\text{data})}
\]
where $P(\text{parameters})$ is called the prior probability of the parameters. In all Bayesian methods it is the choice of prior that reflects everything that was known about the parameters before the data was seen. For any given data set, $P(\text{data})$ is a constant and can be found by summing over the parameters as in $P(\text{data}) = \sum_{\text{parameters}} P(\text{data} \mid \text{parameters})\, P(\text{parameters})$.
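The relationship between prior, likelihood, and posterior described above can be sketched numerically. The example below is only an illustration: it assumes a single parameter (the probability of a one for a bit generator), restricts it to a hypothetical grid of eleven candidate values, and assumes a uniform prior over that grid. None of these choices come from the text.

```python
from fractions import Fraction

def posterior_on_grid(grid, prior, ones, zeros):
    """Apply Bayes' theorem over a finite set of candidate parameter values."""
    # Likelihood of one particular bit sequence with the given counts.
    likelihood = [p ** ones * (1 - p) ** zeros for p in grid]
    # Joint probability of data and parameter: likelihood times prior.
    joint = [l * q for l, q in zip(likelihood, prior)]
    # Probability of the data: the joint summed over the parameter values.
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    return posterior, evidence, likelihood

grid = [Fraction(i, 10) for i in range(11)]    # hypothetical candidate values
prior = [Fraction(1, len(grid))] * len(grid)   # assumed uniform prior
posterior, evidence, likelihood = posterior_on_grid(grid, prior, ones=2, zeros=1)

for p, post in zip(grid, posterior):
    print(f"p = {float(p):.1f}  posterior = {float(post):.4f}")
```

The two ways of writing the joint distribution agree entry by entry: the posterior times the probability of the data equals the likelihood times the prior.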

Referring back to likelihood methods mentioned before, if you have a demon who is generating data using a known type of generating system but with the parameters for the system unknown to you, and the demon states the probability distribution from which it chose the parameters for the data generating system, then the Bayesian methodology is provably optimal when the prior is taken to be the demon's parameter choice distribution. Likelihood methods will tend to be too guided by the data, overfitting the parameters to match the data too closely, and they are provably not optimal in general.
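The demon argument can be checked by exact enumeration in a small case. The sketch below makes several assumptions not found in the text: the demon picks the probability of a one uniformly from a hypothetical grid of eleven values, three bits are observed, and estimates are scored by expected squared error. Under squared error the Bayesian estimate is the posterior mean, and it is compared with the maximum likelihood estimate, the observed fraction of ones.

```python
from fractions import Fraction
from math import comb

def expected_squared_errors(grid, prior, n):
    """Exact expected squared error of the ML and posterior-mean estimates.

    The expectation is over both the demon's parameter choice (the prior)
    and the n bits generated with that parameter.
    """
    ml_risk = Fraction(0)
    bayes_risk = Fraction(0)
    for k in range(n + 1):  # k = number of ones observed
        # Joint probability of (parameter value, k ones in n bits).
        joint = [q * comb(n, k) * p ** k * (1 - p) ** (n - k)
                 for p, q in zip(grid, prior)]
        evidence = sum(joint)
        posterior_mean = sum(p * j for p, j in zip(grid, joint)) / evidence
        ml_estimate = Fraction(k, n)
        ml_risk += sum(j * (ml_estimate - p) ** 2 for p, j in zip(grid, joint))
        bayes_risk += sum(j * (posterior_mean - p) ** 2 for p, j in zip(grid, joint))
    return ml_risk, bayes_risk

grid = [Fraction(i, 10) for i in range(11)]    # hypothetical demon choices
prior = [Fraction(1, len(grid))] * len(grid)   # demon's stated distribution (assumed uniform)
ml_risk, bayes_risk = expected_squared_errors(grid, prior, n=3)

print("ML expected squared error:   ", float(ml_risk))
print("Bayes expected squared error:", float(bayes_risk))
```

The maximum likelihood estimate commits to 0 or 1 whenever all three bits agree, which is the overfitting described above; the posterior mean is pulled back toward the prior.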

In the next section the question of whether data always provides an increasing amount of information about the unknown parameters is answered. The answer is no. Sometimes more data leads to more confusion. However, on the average more data does provide more information about the unknown parameters. What is meant here by information, confusion, and average is quantified, and the theorem showing that on the average information about the unknown parameters increases upon seeing data is proved.
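A small numerical case illustrates both halves of this claim. The sketch assumes, ahead of the definitions in the next section, that confusion is measured by the Shannon entropy of the distribution over parameters; the two candidate generators and the prior are hypothetical values chosen so that one possible observation is surprising.

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * log2(q) for q in dist if q > 0)

def update(prior, likelihoods):
    """Bayes update; returns the posterior and the probability of the observation."""
    joint = [q * l for q, l in zip(prior, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

# Hypothetical setup: two candidate generators, emitting a one with
# probability 0.9 or 0.1, and a prior strongly favouring the first.
p_one = [0.9, 0.1]
prior = [0.9, 0.1]

post_after_one, prob_one = update(prior, p_one)
post_after_zero, prob_zero = update(prior, [1 - p for p in p_one])

h_prior = entropy(prior)
h_after_one = entropy(post_after_one)
h_after_zero = entropy(post_after_zero)
h_expected = prob_one * h_after_one + prob_zero * h_after_zero

print("entropy before data:       ", h_prior)
print("entropy after seeing a 0:  ", h_after_zero)
print("expected entropy after bit:", h_expected)
```

Seeing a zero leaves the two candidates roughly equally plausible, so the entropy rises above its prior value; averaged over both possible observations, the entropy still falls.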

Later sections apply Bayesian methodology to the task at hand of estimating entropy, information correlation functions, chi-squared, moments, and correlations from finite data. Finite data really is the only interesting case in problems of inference. With large data, almost any method, even a provably suboptimal one, will find good estimates. Developing methods of inference that make use of all available information is the elegant side of statistical inference. Practically, data is expensive. Thus the large data case is ignored.



David Wolf
Tue Mar 25 08:11:49 CST 1997