Maulana's personal blog

Deriving probability distribution from entropy


The main motivation is to show that, starting from a general principle, we can recover already known probability distributions. This is useful to understand why a formula has to be the way it is.

Uniform distribution

Before information theory was established, statisticians had already declared that there should be no prior bias when defining the probability of an event. In other words, every possibility has to be treated fairly, unless you explicitly include a bias. Any random variable that follows this principle is called uniformly distributed.

For any $N$ possible discrete states, the probability of each state occurring is exactly $\frac{1}{N}$. Examples are a coin toss or a die roll: each side must have an equal chance of coming up.
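As a quick numerical sanity check (the die and the bias values are my own example, not from the argument above), we can compute the entropy of a fair die and a biased one and see that the uniform assignment gives the larger entropy:

```python
import math

def discrete_entropy(probs):
    """Shannon entropy H = -sum p ln p, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A fair six-sided die versus an arbitrarily biased one.
fair = [1 / 6] * 6
biased = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

print(discrete_entropy(fair))    # ln(6), the maximum for 6 states
print(discrete_entropy(biased))  # strictly smaller
```

Any deviation from the uniform assignment lowers the entropy, which is the discrete version of the claim this article derives for the continuous case.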

This is usually accepted as fact, without questioning the reason.

From the perspective of information theory, an observation can be measured with information entropy. Differential entropy can be used to measure a continuous probability distribution, and is defined as:

\begin{align} H(p)=-\int_X p(x) \ln(p(x))\, dx \end{align}

where $p(x)$ is the probability density of the event at $X=x$.
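The definition can be checked numerically. As an illustration (the exponential density is my own example, chosen because its entropy is exactly 1 nat), a simple trapezoidal quadrature sketch:

```python
import math

def differential_entropy(p, a, b, n=200000):
    """Approximate H(p) = -∫ p ln p dx with the trapezoidal rule on [a, b]."""
    h = (b - a) / n
    def f(x):
        px = p(x)
        return -px * math.log(px) if px > 0 else 0.0
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

# For p(x) = e^{-x} on [0, ∞), H = ∫ x e^{-x} dx = 1 nat exactly.
# The tail beyond x = 40 is negligible, so we truncate there.
H = differential_entropy(lambda x: math.exp(-x), 0.0, 40.0)
print(H)  # ≈ 1.0
```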

By perceiving this as an optimization problem, we can reason as follows: since entropy always increases during an observation, the probability distribution must stabilize, once all the information has been gained, at the point where entropy is at its maximum.

From a physical perspective, an analogous situation is two bodies in thermal contact: the final temperature is reached when the entropy is maximized.

If we treat $H(p)$ as a functional, in the sense of variational calculus, we can use the method of Lagrange multipliers to derive the probability distribution.

The Lagrangian (function) is constructed from the entropy plus added linear constraints. If we set each constraint to 0, we find the optimum points where the partial derivatives are 0.

The most fundamental constraint on a probability distribution is that all the chances, added together, must sum up to 1. In other words, $\int_X p(x)\,dx = 1$. The constraint function $g(x)=1-\int_X p(x)\,dx$ is just a rearrangement of that statement into the form $g(x)=0$.

Our Lagrangian function $F(x,p)$ then becomes:

\begin{align} F(x,p)&=H(p) + \lambda_0\, g(x) \\ F(x,p)&=-\int_X p(x) \ln(p(x))\, dx + \lambda_0 \left(1-\int_X p(x)\,dx\right) \end{align}

Now, since this Lagrangian contains integrals, technically we should carry out the integration first to obtain the correct function. But the expression for $F(x,p)$ is neatly an integral along $dx$. In physics, by the stationary-action principle (historically known as the least-action principle), the action $S$ of a Lagrangian $L$ is defined as $S=\int L(x,p)\,dx$.

\begin{align} F(x,p)&=H(p) + \lambda_0\, g(x) \\ F'(x,p)&=\frac{d}{dx} F(x,p)=-p(x) \ln(p(x)) - \lambda_0\, p(x) \\ S &= \int_X \left( -p(x) \ln(p(x)) - \lambda_0\, p(x) \right) dx \end{align}

By matching this with the action expression, we see that we can use the derivative of our previous Lagrangian $F$ as our new Lagrangian $L$.

\begin{align} L(x,p)&=F'(x,p) \\ L(x,p)&= -p(x) \ln(p(x)) - \lambda_0\, p(x) \end{align}

Applying the Lagrange multiplier method, we take the partial derivative with respect to $p$ and set it to zero.

\begin{align} \frac{\partial}{\partial p}L(x,p)&=0 \\ -\ln(p(x)) - 1 - \lambda_0 &= 0 \\ \ln(p(x)) &=-1-\lambda_0 \\ p(x) &=e^{-1-\lambda_0} \\ p(x) &=A \end{align}

Note that since $1+\lambda_0$ is a constant, $e^{-1-\lambda_0}$ is a constant as well; we simply rename it $A$. Using the constraint $g(x)$ we can then find $p(x)$:

\begin{align} \int_X p(x)\,dx &= 1 \\ \int_X A \,dx &= 1 \\ \Delta x &= \frac{1}{A} \end{align}

Here $\Delta x$ is the range of the integral, since $A$ is constant. If $X$ is a continuous random variable, then $\Delta x$ is just a segment from $x=a$ to $x=b$, so $\Delta x = b-a$ and $p(x)=\frac{1}{b-a}$. This is exactly the uniform distribution we are familiar with.
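A small numerical check of this result (the interval $[2, 6]$ is my own example): the constant density $A = \frac{1}{b-a}$ normalizes to 1, and its entropy comes out to $\ln(b-a)$.

```python
import math

a, b = 2.0, 6.0
A = 1.0 / (b - a)  # the constant density p(x) = A derived above

# Normalization: ∫ A dx over [a, b] = A * (b - a) should equal 1.
norm = A * (b - a)

# Entropy of the constant density: -∫ A ln A dx = -A ln(A) * (b - a) = ln(b - a).
H = -(A * math.log(A)) * (b - a)
print(norm, H, math.log(b - a))
```

The entropy grows with the width of the interval, which matches the intuition that a wider range of equally likely outcomes carries more uncertainty.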


Normal distribution

The article would not be complete if we didn't derive the normal distribution. The normal distribution is the simplest continuous probability distribution with a mean and variance suitable for standard statistical analysis, and it shows up in most cases in nature. From the information theory perspective, it is a consequence of nature tending to maximize information entropy, as in heat exchange or energy distribution.

The first constraint is the same as for the uniform distribution: the integral over all of $X$ must sum to 1. So $\int_X p(x)\,dx=1$, and thus $g_0(x)=1-\int_X p(x)\,dx$.

The second constraint adds the assumption that not every chance is equal: the average has to be at the center.

This comes from a naive intuition about the notion of "average". For example, suppose you have a dataset of people's heights; you imagine that more people exist with heights around the center of the height range. For a distribution function, this means $\int_X x\, p(x)\,dx=c_1$. In physics (typically mass distribution), it just means that the first moment (center of mass) of the distribution exists. Here we set the first moment to a constant $c_1$. To summarize, $g_1(x)=c_1-\int_X x \, p(x)\,dx$.

The third constraint is that the distribution has a definite spread around the mean, with a symmetric shape. In terms of moments, this means the second moment is fixed: $\int_X x^2 \,p(x)\,dx=c_2$ for some constant $c_2$. So $g_2(x)=c_2-\int_X x^2 \,p(x)\,dx$.

Note the interesting thing about these constraints: each is a constant minus an integral over $p(x)\,dx$. We can therefore use the same approach to get a new Lagrangian $L(x,p)$. Because we take the derivative along $dx$ and then the partial derivative with respect to $p$, these constants do not matter at all in the end, whatever their values. Using the same approach, the Lagrangian $L(x,p)$ is:

\begin{align} L(x,p)&=-p(x) \ln(p(x)) -\lambda_0\, p(x) - \lambda_1 x\,p(x) - \lambda_2 x^2\,p(x) \\ \frac{\partial}{\partial p}L(x,p)&= -\ln(p(x)) - 1 - \lambda_0 -\lambda_1 x -\lambda_2 x^2 \\ 0&= -\ln(p(x)) - 1 - \lambda_0 -\lambda_1 x -\lambda_2 x^2 \\ p(x) &= e^{-1-\lambda_0 -\lambda_1 x - \lambda_2 x^2} \\ p(x) &= A e^{-\lambda_1 x - \lambda_2 x^2} \end{align}

Again, the last step happens because $e^{-1-\lambda_0}$ is a constant, so we just rename it $A$.

To find the values of the $\lambda_n$, you would "normally" integrate and match the results with the constants $c_n$. But that approach is circular if we have prior assumptions about the normal distribution. We won't use it, and will instead work from more fundamental assumptions.

We can't solve the first constraint yet. It is essentially a normalization, and there are two Lagrange multipliers that need to be solved first.

We start with the second constraint. Notice that if the distribution has an average value, then the probability at the average has to be the highest, since it is the most common value. And if it is the highest value, the derivative of the distribution at that point is 0. Let us give this point the arbitrary name $\mu$, so $x=\mu$.

\begin{align} p(x) &= A e^{-\lambda_1 x - \lambda_2 x^2} \\ \frac{d}{dx}p(x) &= -(\lambda_1+2\lambda_2 x)\, p(x) \\ p'(\mu) &= -(\lambda_1+2\lambda_2 \mu)\, p(\mu) \\ 0 &= -(\lambda_1+2\lambda_2 \mu) \\ \lambda_1 &= - 2\lambda_2 \mu \end{align}

We now have an expression for $\lambda_1$.

Next, the third constraint. Using the same principle, we assume the distribution has inflection points, where the second derivative vanishes, in two places. Because the shape of the distribution is symmetric around its center, the second derivative has two zeroes, to the left and right of the center at the same distance from it. We already know the center is at $x=\mu$, so these two positions are $x=\mu+\sigma$ and $x=\mu-\sigma$, where $\sigma$ is just an arbitrary distance from $\mu$.

\begin{align} p'(x) &= -2\lambda_2 (x - \mu)\, p(x) \\ p''(x) &= -2\lambda_2\, p(x) + 4 \lambda_2^2 (x-\mu)^2 p(x) \\ p''(\mu+\sigma) = p''(\mu-\sigma) &= -2\lambda_2\, p(\mu+\sigma) + 4 \lambda_2^2 \sigma^2 p(\mu+\sigma) \\ 0 &= - 1 + 2 \lambda_2 \sigma^2 \\ \lambda_2 &=\frac{1}{2\sigma^2} \end{align}
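We can verify this claim numerically with finite differences (the values of $\mu$ and $\sigma$ below are arbitrary choices of mine, and $A$ is set to 1 since normalization does not affect where $p''$ vanishes):

```python
import math

mu, sigma = 1.5, 0.7                 # arbitrary example values
lam2 = 1.0 / (2 * sigma ** 2)        # λ2 from the derivation above
lam1 = -2 * lam2 * mu                # λ1 from the previous step

def p(x):
    # Unnormalized density A e^{-λ1 x - λ2 x²}, with A = 1 for this check.
    return math.exp(-lam1 * x - lam2 * x * x)

def second_derivative(f, x, h=1e-4):
    # Central finite-difference approximation of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

# p'' should vanish at the inflection points x = μ ± σ.
print(second_derivative(p, mu + sigma))  # ≈ 0
print(second_derivative(p, mu - sigma))  # ≈ 0
```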

Plugging this back into $p(x)$:

\begin{align} p(x) &= A e^{-\lambda_1 x - \lambda_2 x^2} \\ &= A e^{\frac{2\mu x}{2 \sigma^2} - \frac{x^2}{2\sigma^2}} \\ &= A e^{\frac{\mu^2}{2\sigma^2}-\frac{\mu^2}{2\sigma^2}+\frac{2\mu x}{2 \sigma^2} - \frac{x^2}{2\sigma^2}} \\ &= Ae^{\frac{\mu^2}{2\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \\ &= B e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \end{align}

An explanation: in the third line we added and subtracted $\frac{\mu^2}{2\sigma^2}$ to complete the square; in the fourth line we rearranged the exponent into a squared expression; in the last line we renamed $Ae^{\frac{\mu^2}{2\sigma^2}}=B$ because it is just a constant.

Finally, applying the first constraint

\begin{align} \int_X p(x)\, dx &= \int_X B e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}dx \\ 1 &= \int_X B e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}dx \\ &= B \sigma \sqrt{2} \int_Y e^{-y^2}dy \\ &= B \sigma \sqrt{2 \pi} \\ B&= \frac{1}{ \sigma \sqrt{2 \pi} } \end{align}

An explanation: in the third line we made the change of variables $Y=\frac{X-\mu}{\sigma \sqrt{2}}$; the next line follows because the definite integral $\int_{-\infty}^\infty e^{-x^2}dx=\sqrt{\pi}$. That integral probably deserves its own article.
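We can at least confirm the Gaussian integral numerically, without proving it. A minimal sketch using the trapezoidal rule (the truncation at $\pm 10$ is my own choice; the tails beyond that are negligible):

```python
import math

def gauss_integral(a=-10.0, b=10.0, n=100000):
    """Trapezoidal approximation of ∫ e^{-x²} dx over [a, b]."""
    h = (b - a) / n
    f = lambda x: math.exp(-x * x)
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

print(gauss_integral(), math.sqrt(math.pi))  # both ≈ 1.77245
```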

Thus we recovered the normal distribution:

\begin{align} p(x)= \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \end{align}

The interesting thing about this derivation is that it "predicts" the dataset, rather than being deduced from a dataset. The formula arises because when the observation is in a saturated state, the distribution has to take this functional form.

No wonder the normal distribution is the most common one we see in nature: it is the next-simplest distribution, with minimal assumptions, that can fit more data.
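One way to see this maximality concretely: among distributions with the same variance, the normal has the largest differential entropy. Below I compare it against the uniform and Laplace distributions using their standard closed-form entropies (these closed forms are well-known results, not derived in this article):

```python
import math

sigma = 1.0  # common standard deviation for all three densities

# Closed-form differential entropies (in nats) at variance σ²:
H_normal = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
H_uniform = math.log(sigma * math.sqrt(12))      # uniform on an interval of width σ√12
H_laplace = 1 + math.log(sigma * math.sqrt(2))   # Laplace with scale b = σ/√2

# The normal distribution comes out on top.
print(H_normal, H_laplace, H_uniform)
```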

To see what I really mean, let's use the recovered formula to calculate the differential entropy.

\begin{align} H(p)&=-\int_X p(x) \ln(p(x))\,dx \\ &=-\int_X p(x) \ln\left(\frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\right)dx \\ &=-\int_X p(x) \ln\left(\frac{1}{\sigma \sqrt{2 \pi}}\right)dx + \int_X p(x)\, \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\,dx \\ &=\ln(\sigma \sqrt{2 \pi})\int_X p(x)\,dx + \frac{1}{2} \int_X \left(\frac{x-\mu}{\sigma}\right)^2 p(x) \,dx \\ &=\frac{1}{2}\ln( 2 \pi \sigma^2) + \frac{1}{2} \\ &=\frac{1}{2}\ln( 2 \pi e \sigma^2) \end{align}

As you can see, the entropy depends only on the standard deviation $\sigma$. This means that different sets of data will tend to approach the same entropy and the same distribution if they have the same variance.


Written by Rizky Maulana Nugraha
Software Developer. Currently remotely working from Indonesia.