Maulana's personal blog

Deriving probability distribution from entropy


The main motivation is to show that, starting from a general principle, we can recover already known probability distributions. This is useful to understand why a formula has to be the way it is.

Uniform distribution

Before information theory was established, statisticians had already declared that there should be no prior bias when defining the probability of an event. In other words, every possibility has to be treated fairly, unless you explicitly include a bias. Any random variable that follows this principle is called uniformly distributed.

For any $N$ possible discrete states, the probability of each state occurring is exactly $\frac{1}{N}$. Examples are a coin toss or a die roll: each side must have an equal chance of coming up.
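As a quick numerical sanity check (the die and the bias values are my own example, not from the argument above), we can compute the entropy of a fair die and a biased one and see that the uniform assignment gives the larger entropy:

```python
import math

def discrete_entropy(probs):
    """Shannon entropy H = -sum p ln p, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A fair six-sided die versus an arbitrarily biased one.
fair = [1 / 6] * 6
biased = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

print(discrete_entropy(fair))    # ln(6), the maximum for 6 states
print(discrete_entropy(biased))  # strictly smaller
```

Any deviation from the uniform assignment lowers the entropy, which is the discrete version of the claim this article derives for the continuous case.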

This is usually accepted as fact, without questioning the reason.

From the perspective of information theory, an observation can be measured with information entropy. Differential entropy can be used to measure a continuous probability distribution, and is defined as:

\begin{align} H(p)=-\int_X p(x) \ln(p(x))\, dx \end{align}

where $p(x)$ is the probability density of the event at $X=x$.
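The definition can be checked numerically. As an illustration (the exponential density is my own example, chosen because its entropy is exactly 1 nat), a simple trapezoidal quadrature sketch:

```python
import math

def differential_entropy(p, a, b, n=200000):
    """Approximate H(p) = -∫ p ln p dx with the trapezoidal rule on [a, b]."""
    h = (b - a) / n
    def f(x):
        px = p(x)
        return -px * math.log(px) if px > 0 else 0.0
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

# For p(x) = e^{-x} on [0, ∞), H = ∫ x e^{-x} dx = 1 nat exactly.
# The tail beyond x = 40 is negligible, so we truncate there.
H = differential_entropy(lambda x: math.exp(-x), 0.0, 40.0)
print(H)  # ≈ 1.0
```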

By perceiving this as an optimization problem, we can reason as follows: since entropy always increases during an observation, the probability distribution must stabilize, once all the information has been gained, at the point where entropy is at its maximum.

From a physical perspective, an analogous situation is two bodies in thermal contact: the final temperature is reached when the entropy is maximized.

If we treat $H(p)$ as a functional, in the sense of variational calculus, we can use the method of Lagrange multipliers to derive the probability distribution.

The Lagrangian (function) is constructed from the entropy plus added linear constraints. If we set each constraint to 0, we find the optimum points where the partial derivatives are 0.

The most fundamental constraint on a probability distribution is that all the chances, added together, must sum up to 1. In other words, $\int_X p(x)\,dx = 1$. The constraint function $g(x)=1-\int_X p(x)\,dx$ is just a rearrangement of that statement into the form $g(x)=0$.

Our Lagrangian function $F(x,p)$ then becomes:

\begin{align} F(x,p)&=H(p) + \lambda_0\, g(x) \\ F(x,p)&=-\int_X p(x) \ln(p(x))\, dx + \lambda_0 \left(1-\int_X p(x)\,dx\right) \end{align}

Now, since this Lagrangian contains integrals, technically we should carry out the integration first to obtain the correct function. But the expression for $F(x,p)$ is neatly an integral along $dx$. In physics, by the stationary-action principle (historically known as the least-action principle), the action $S$ of a Lagrangian $L$ is defined as $S=\int L(x,p)\,dx$.

\begin{align} F(x,p)&=H(p) + \lambda_0\, g(x) \\ F'(x,p)&=\frac{d}{dx} F(x,p)=-p(x) \ln(p(x)) - \lambda_0\, p(x) \\ S &= \int_X \left( -p(x) \ln(p(x)) - \lambda_0\, p(x) \right) dx \end{align}

By matching this with the action expression, we see that we can use the derivative of our previous Lagrangian $F$ as our new Lagrangian $L$.

\begin{align} L(x,p)&=F'(x,p) \\ L(x,p)&= -p(x) \ln(p(x)) - \lambda_0\, p(x) \end{align}

Applying the Lagrange multiplier method, we take the partial derivative with respect to $p$ and set it to zero.

\begin{align} \frac{\partial}{\partial p}L(x,p)&=0 \\ -\ln(p(x)) - 1 - \lambda_0 &= 0 \\ \ln(p(x)) &=-1-\lambda_0 \\ p(x) &=e^{-1-\lambda_0} \\ p(x) &=A \end{align}

Note that since $1+\lambda_0$ is a constant, $e^{-1-\lambda_0}$ is a constant as well; we simply rename it $A$. Using the constraint $g(x)$ we can then find $p(x)$:

\begin{align} \int_X p(x)\,dx &= 1 \\ \int_X A \,dx &= 1 \\ \Delta x &= \frac{1}{A} \end{align}

Here $\Delta x$ is the range of the integral, since $A$ is constant. If $X$ is a continuous random variable, then $\Delta x$ is just a segment from $x=a$ to $x=b$, so $\Delta x = b-a$ and $p(x)=\frac{1}{b-a}$. This is exactly the uniform distribution we are familiar with.
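A small numerical check of this result (the interval $[2, 6]$ is my own example): the constant density $A = \frac{1}{b-a}$ normalizes to 1, and its entropy comes out to $\ln(b-a)$.

```python
import math

a, b = 2.0, 6.0
A = 1.0 / (b - a)  # the constant density p(x) = A derived above

# Normalization: ∫ A dx over [a, b] = A * (b - a) should equal 1.
norm = A * (b - a)

# Entropy of the constant density: -∫ A ln A dx = -A ln(A) * (b - a) = ln(b - a).
H = -(A * math.log(A)) * (b - a)
print(norm, H, math.log(b - a))
```

The entropy grows with the width of the interval, which matches the intuition that a wider range of equally likely outcomes carries more uncertainty.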


Normal distribution

The article would not be complete if we didn't derive the normal distribution. The normal distribution is the simplest continuous probability distribution with a mean and variance suitable for standard statistical analysis, and it shows up in most cases in nature. From the information theory perspective, it is a consequence of nature tending to maximize information entropy, as in heat exchange or energy distribution.

The first constraint is the same as for the uniform distribution: the integral over all of $X$ must sum to 1. So $\int_X p(x)\,dx=1$, and thus $g_0(x)=1-\int_X p(x)\,dx$.

The second constraint adds the assumption that not every chance is equal: the average has to be at the center.

This comes from a naive intuition about the notion of "average". For example, suppose you have a dataset of people's heights; you imagine that more people exist with heights around the center of the height range. For a distribution function, this means $\int_X x\, p(x)\,dx=c_1$. In physics (typically mass distribution), it just means that the first moment (center of mass) of the distribution exists. Here we set the first moment to a constant $c_1$. To summarize, $g_1(x)=c_1-\int_X x \, p(x)\,dx$.

The third constraint is that the distribution has a definite spread around the mean, with a symmetric shape. In terms of moments, this means the second moment is fixed: $\int_X x^2 \,p(x)\,dx=c_2$ for some constant $c_2$. So $g_2(x)=c_2-\int_X x^2 \,p(x)\,dx$.

Note the interesting thing about these constraints: each is a constant minus an integral over $p(x)\,dx$. We can therefore use the same approach to get a new Lagrangian $L(x,p)$. Because we take the derivative along $dx$ and then the partial derivative with respect to $p$, these constants do not matter at all in the end, whatever their values. Using the same approach, the Lagrangian $L(x,p)$ is:

\begin{align} L(x,p)&=-p(x) \ln(p(x)) -\lambda_0\, p(x) - \lambda_1 x\,p(x) - \lambda_2 x^2\,p(x) \\ \frac{\partial}{\partial p}L(x,p)&= -\ln(p(x)) - 1 - \lambda_0 -\lambda_1 x -\lambda_2 x^2 \\ 0&= -\ln(p(x)) - 1 - \lambda_0 -\lambda_1 x -\lambda_2 x^2 \\ p(x) &= e^{-1-\lambda_0 -\lambda_1 x - \lambda_2 x^2} \\ p(x) &= A e^{-\lambda_1 x - \lambda_2 x^2} \end{align}

Again, the last step happens because $e^{-1-\lambda_0}$ is a constant, so we just rename it $A$.

To find the values of the $\lambda_n$, you would "normally" integrate and match the results with the constants $c_n$. But that approach is circular if we have prior assumptions about the normal distribution. We won't use it, and will instead work from more fundamental assumptions.

We can't solve the first constraint yet. It is essentially a normalization, and there are two Lagrange multipliers that need to be solved first.

We start with the second constraint. Notice that if the distribution has an average value, then the probability at the average has to be the highest, since it is the most common value. And if it is the highest value, the derivative of the distribution at that point is 0. Let us give this point the arbitrary name $\mu$, so $x=\mu$.

\begin{align} p(x) &= A e^{-\lambda_1 x - \lambda_2 x^2} \\ \frac{d}{dx}p(x) &= -(\lambda_1+2\lambda_2 x)\, p(x) \\ p'(\mu) &= -(\lambda_1+2\lambda_2 \mu)\, p(\mu) \\ 0 &= -(\lambda_1+2\lambda_2 \mu) \\ \lambda_1 &= - 2\lambda_2 \mu \end{align}

We now have an expression for $\lambda_1$.

Next, the third constraint. Using the same principle, we assume the distribution has inflection points, where the second derivative vanishes, in two places. Because the shape of the distribution is symmetric around its center, the second derivative has two zeroes, to the left and right of the center at the same distance from it. We already know the center is at $x=\mu$, so these two positions are $x=\mu+\sigma$ and $x=\mu-\sigma$, where $\sigma$ is just an arbitrary distance from $\mu$.

\begin{align} p'(x) &= -2\lambda_2 (x - \mu)\, p(x) \\ p''(x) &= -2\lambda_2\, p(x) + 4 \lambda_2^2 (x-\mu)^2 p(x) \\ p''(\mu+\sigma) = p''(\mu-\sigma) &= -2\lambda_2\, p(\mu+\sigma) + 4 \lambda_2^2 \sigma^2 p(\mu+\sigma) \\ 0 &= - 1 + 2 \lambda_2 \sigma^2 \\ \lambda_2 &=\frac{1}{2\sigma^2} \end{align}
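We can verify this claim numerically with finite differences (the values of $\mu$ and $\sigma$ below are arbitrary choices of mine, and $A$ is set to 1 since normalization does not affect where $p''$ vanishes):

```python
import math

mu, sigma = 1.5, 0.7                 # arbitrary example values
lam2 = 1.0 / (2 * sigma ** 2)        # λ2 from the derivation above
lam1 = -2 * lam2 * mu                # λ1 from the previous step

def p(x):
    # Unnormalized density A e^{-λ1 x - λ2 x²}, with A = 1 for this check.
    return math.exp(-lam1 * x - lam2 * x * x)

def second_derivative(f, x, h=1e-4):
    # Central finite-difference approximation of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

# p'' should vanish at the inflection points x = μ ± σ.
print(second_derivative(p, mu + sigma))  # ≈ 0
print(second_derivative(p, mu - sigma))  # ≈ 0
```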

Plugging this back into $p(x)$:

\begin{align} p(x) &= A e^{-\lambda_1 x - \lambda_2 x^2} \\ &= A e^{\frac{2\mu x}{2 \sigma^2} - \frac{x^2}{2\sigma^2}} \\ &= A e^{\frac{\mu^2}{2\sigma^2}-\frac{\mu^2}{2\sigma^2}+\frac{2\mu x}{2 \sigma^2} - \frac{x^2}{2\sigma^2}} \\ &= Ae^{\frac{\mu^2}{2\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \\ &= B e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \end{align}

An explanation: in the third line we added and subtracted $\frac{\mu^2}{2\sigma^2}$ to complete the square; in the fourth line we rearranged the exponent into a squared expression; in the last line we renamed $Ae^{\frac{\mu^2}{2\sigma^2}}=B$ because it is just a constant.

Finally, applying the first constraint

\begin{align} \int_X p(x)\, dx &= \int_X B e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}dx \\ 1 &= \int_X B e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}dx \\ &= B \sigma \sqrt{2} \int_Y e^{-y^2}dy \\ &= B \sigma \sqrt{2 \pi} \\ B&= \frac{1}{ \sigma \sqrt{2 \pi} } \end{align}

An explanation: in the third line we made the change of variables $Y=\frac{X-\mu}{\sigma \sqrt{2}}$; the next line follows because the definite integral $\int_{-\infty}^\infty e^{-x^2}dx=\sqrt{\pi}$. That integral probably deserves its own article.
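We can at least confirm the Gaussian integral numerically, without proving it. A minimal sketch using the trapezoidal rule (the truncation at $\pm 10$ is my own choice; the tails beyond that are negligible):

```python
import math

def gauss_integral(a=-10.0, b=10.0, n=100000):
    """Trapezoidal approximation of ∫ e^{-x²} dx over [a, b]."""
    h = (b - a) / n
    f = lambda x: math.exp(-x * x)
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

print(gauss_integral(), math.sqrt(math.pi))  # both ≈ 1.77245
```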

Thus we recovered the normal distribution:

\begin{align} p(x)= \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \end{align}

The interesting thing about this derivation is that it "predicts" the dataset, rather than being deduced from a dataset. The formula arises because when the observation is in a saturated state, the distribution has to take this functional form.

No wonder the normal distribution is the most common one we see in nature: it is the next-simplest distribution, with minimal assumptions, that can fit more data.
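One way to see this maximality concretely: among distributions with the same variance, the normal has the largest differential entropy. Below I compare it against the uniform and Laplace distributions using their standard closed-form entropies (these closed forms are well-known results, not derived in this article):

```python
import math

sigma = 1.0  # common standard deviation for all three densities

# Closed-form differential entropies (in nats) at variance σ²:
H_normal = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
H_uniform = math.log(sigma * math.sqrt(12))      # uniform on an interval of width σ√12
H_laplace = 1 + math.log(sigma * math.sqrt(2))   # Laplace with scale b = σ/√2

# The normal distribution comes out on top.
print(H_normal, H_laplace, H_uniform)
```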

To see what I really mean, let's use the recovered formula to calculate the differential entropy.

\begin{align} H(p)&=-\int_X p(x) \ln(p(x))\,dx \\ &=-\int_X p(x) \ln\left(\frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\right)dx \\ &=-\int_X p(x) \ln\left(\frac{1}{\sigma \sqrt{2 \pi}}\right)dx + \int_X p(x)\, \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\,dx \\ &=\ln(\sigma \sqrt{2 \pi})\int_X p(x)\,dx + \frac{1}{2} \int_X \left(\frac{x-\mu}{\sigma}\right)^2 p(x) \,dx \\ &=\frac{1}{2}\ln( 2 \pi \sigma^2) + \frac{1}{2} \\ &=\frac{1}{2}\ln( 2 \pi e \sigma^2) \end{align}

As you can see, the entropy depends only on the standard deviation $\sigma$. This means that different sets of data will tend to approach the same entropy and the same distribution if they have the same variance.


Written by Rizky Maulana Nugraha
Software Developer. Currently remotely working from Indonesia.