The Maximum Entropy Principle: An Introduction


By S. Tarani

One of the most difficult parts of the inductive process is prediction. There is a constant need for, and a corresponding interest in, additional and more robust forecasting methodologies. Forecasting is always conditional on the method we use, on the data we use, on the "stability" of the phenomenon being forecast and, in the case of subjective forecasting, on the degree of prior knowledge and subjectivity/bias. Forecasting is always linked to the assessment of uncertainty and, for that reason, the measurement of uncertainty constitutes a precondition for understanding and evaluating predictions of all kinds. One method to generate forecasts is through the use of the concept of entropy. The Maximum Entropy (MaxEnt) approach of Jaynes, dating back to 1957, is a powerful, albeit not so common, approach to forecasting. Jaynes was after some deep epistemological questions about the nature of reality and about the reason that the physical laws took the shape they did, and he worked very hard to provide a coherent theory of uncertainty and estimation based on the MaxEnt principle. In recent years the MaxEnt approach has found many interesting applications in forecasting, in machine learning and in the modeling of real-world processes.

We shall give the most elementary introduction to how this method might be used to extract useful information from our data, to then be used in forecasting. For this, let us suppose that we have a binary time series [math] \left\{x_{\tau}\right\}_{\tau=1}^{t}[/math] taking values in the set [math] {\cal S}=\left\{-1, 1\right\}[/math]. Now, it should be clear that for such a time series forecasting is equivalent to computing a probability, since [math] \mathsf{E}(x_{t+1}) \doteq 1\cdot\mathbb{P}(x_{t+1}=1)-1\cdot\mathbb{P}(x_{t+1}=-1)=2\cdot\mathbb{P}(x_{t+1}=1)-1=2p-1[/math]. For simplicity we shall assume that this probability [math] p[/math] is constant rather than time-varying, so that we are forecasting the unconditional mean. The most obvious estimator and forecast in the above case is the sample mean. However, this standard approach might be using less information than is available from our data. Can we improve upon this? Using the MaxEnt approach we possibly can (depending, of course, on the nature of our time series!).
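To make the sample-mean baseline concrete, here is a minimal sketch in Python, using a simulated series; the series length, seed and true probability are made up for the illustration:

```python
import numpy as np

# A toy illustration of the baseline above, using a simulated {-1, 1} series.
rng = np.random.default_rng(42)
x = rng.choice([-1, 1], size=200, p=[0.4, 0.6])  # true P(x = 1) = 0.6

p_hat = (1 + x.mean()) / 2   # since E(x) = 2p - 1, we have p = (1 + E(x)) / 2
forecast = 2 * p_hat - 1     # the implied forecast of x_{t+1}
print(f"sample-mean estimate of p: {p_hat:.3f}, forecast: {forecast:+.3f}")
```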

A discrete probability distribution, like the one of our example above, can be described through its entropy function, a measure of information and uncertainty: extreme values of [math] p[/math] (close to 0 or 1) imply lower uncertainty and entropy, while values close to 0.5 imply maximum entropy, that is, maximum uncertainty. As we have no other prior information on the probability distribution, we can search for possible values of [math] p[/math] via the maximization of entropy subject to moment constraints -- that is, subject to constraints on the properties of our data. This is the essence of the MaxEnt method.
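A quick numerical check of this claim, evaluating the entropy formula given below at a few illustrative values of [math] p[/math]:

```python
import numpy as np

# Entropy peaks at p = 0.5 (log 2 ~ 0.6931) and falls toward zero
# as p approaches 0 or 1.
def entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(f"p = {p:.2f}  ->  S(p) = {entropy(p):.4f}")
```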

We want to maximize the discrete information entropy:

[math] S(p) = -p\log(p) - (1-p)\log(1-p)[/math]

subject to moment constraints of the form:

[math]\widehat{\mu}_{k} = 1^{k}\cdot p + (-1)^{k}(1-p)[/math]

where the left-hand side is the sample moment estimator, i.e., [math] \widehat{\mu}_{k}=(1/t)\sum_{\tau=1}^{t}x_{\tau}^{k}[/math]. Note that the constraint is such that the theoretical moment matches the sample moment, which is what allows us to recover the corresponding MaxEnt probability distribution. Thus, contrary to the plain sample mean, in the MaxEnt approach we can extract additional information from higher moments when computing the desired probability [math] p[/math]. The above system can be solved numerically with the use of Lagrange multipliers, and there are open-source software packages that can easily do this.
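As a minimal sketch of this numerical solution, the following uses SciPy's SLSQP solver (which handles the Lagrangian machinery internally) in place of a hand-rolled multiplier scheme; the names maxent_probability and neg_entropy are illustrative, not from any particular package:

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(q):
    """Negative of the entropy S(p) = -p log p - (1 - p) log(1 - p)."""
    p = q[0]
    return p * np.log(p) + (1 - p) * np.log(1 - p)

def maxent_probability(x, moments=(1,)):
    """Maximize entropy subject to matching the chosen sample moments."""
    # Each constraint imposes: theoretical k-th moment = sample k-th moment,
    # i.e. 1^k * p + (-1)^k * (1 - p) = (1/t) * sum(x_tau ** k).
    constraints = [
        {"type": "eq",
         "fun": lambda q, k=k: q[0] + (-1) ** k * (1 - q[0]) - np.mean(x ** k)}
        for k in moments
    ]
    eps = 1e-9  # keep p strictly inside (0, 1) so the log terms stay finite
    result = minimize(neg_entropy, x0=[0.5], method="SLSQP",
                      bounds=[(eps, 1 - eps)], constraints=constraints)
    return result.x[0]

# Example: a simulated series with true P(x = 1) = 0.7
rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=500, p=[0.3, 0.7])
p_hat = maxent_probability(x, moments=(1,))
print(f"MaxEnt estimate of p: {p_hat:.3f}, forecast E(x): {2 * p_hat - 1:+.3f}")
```

With only the first-moment constraint, the constraint alone pins down the solution, [math] p=(1+\widehat{\mu}_{1})/2[/math], so the MaxEnt estimate coincides with the sample-mean estimator; further constraints would be added through the moments argument.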

In conclusion, with the MaxEnt principle we model and forecast by maximizing uncertainty subject to sample constraints. We could say that we are trying to be minimally biased, allowing the widest possible range of configurations of our forecasting system subject to constraints on the average behavior that we observe in our data.