
Deriving probability distribution from entropy: part 3

30-Dec-2023

This is another follow-up article, continuing from part 2.

Previously we used 3 constraints to derive the Poisson Distribution from the Maximum Entropy Principle, but I intentionally omitted what happens if we use only 2 of them. This article will cover that case.

(Negative) Exponential Distribution

In the previous article, we stated 3 constraints to derive the Poisson Distribution. This time, we will only use two of them:

  1. It must satisfy the normalization axiom (total probability over all $k$ is 1)
  2. It must have an average value (mean)

Summarizing our Lagrangian constraints:

$$
\begin{align*}
C_1 &= \lambda_0 \left(1-\sum_k p(k)\right) \\
C_2 &= \lambda_1 \left(\mu-\sum_k k \, p(k)\right)
\end{align*}
$$

Our full entropy function is as follows (I omit the $k$ input notation for brevity).

$$
\begin{align*}
H(k,p) &= -\sum_k p \, \ln(p) + C_1 + C_2 \\
H(k,p) &= -\sum_k p \, \ln(p) + \lambda_0 \left(1-\sum_k p\right) + \lambda_1 \left(\mu-\sum_k k \, p\right)
\end{align*}
$$

Before we continue, notice that these two constraints are just basic constraints. There is no inherent assumption about what the distribution should look like for a particular dataset.

We will begin adding assumptions as we go along.

First, notice that previously we were dealing with a discrete distribution. However, nothing prevents us from applying these same two constraints to a continuous distribution. So let's replace $k$ with $x$ and use integrals instead of sums.

$$
\begin{align*}
H(x,p) &= -\int_X p \, \ln(p) \, dx + \lambda_0 \left(1-\int_X p \, dx\right) + \lambda_1 \left(\mu-\int_X x \, p \, dx\right)
\end{align*}
$$

Just like in the first article, because it's a continuous distribution, we can treat the integrand of $H(x,p)$ as a Lagrangian density and apply the stationary action principle. Our Lagrangian then becomes:

$$
\begin{align*}
L(x,p) &= -p \, \ln(p) - \lambda_0 \, p - \lambda_1 \, x \, p
\end{align*}
$$

The partial derivative of the Lagrangian with respect to $p$ must be 0:

$$
\begin{align}
\frac{\partial}{\partial p} L(x,p) &= 0 \\
-\ln(p) - 1 - \lambda_0 - \lambda_1 \, x &= 0 \\
\ln(p) &= -1 - \lambda_0 - \lambda_1 \, x \\
p(x) &= e^{-1-\lambda_0} \, e^{-\lambda_1 x} \\
&= A \, e^{-\lambda_1 x}
\end{align}
$$

From step (4) to (5), we substitute $e^{-1-\lambda_0}=A$, because it is a constant.
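If you want to double-check the stationarity condition, here is a minimal sympy sketch (my own addition; the original derivation is done by hand, and sympy is assumed to be available):

```python
# Symbolic check that maximizing the Lagrangian density gives an exponential.
import sympy as sp

p, x, lam0, lam1 = sp.symbols("p x lambda_0 lambda_1", positive=True)

# Lagrangian density: L(x, p) = -p ln(p) - lambda_0 p - lambda_1 x p
L = -p * sp.log(p) - lam0 * p - lam1 * x * p

# Solve dL/dp = 0 for p
print(sp.solve(sp.Eq(sp.diff(L, p), 0), p))
# [exp(-lambda_0 - lambda_1*x - 1)], i.e. A * exp(-lambda_1 * x)
```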

To apply constraint 1, we need to define the integration bounds. Remember that this is going to be a probability distribution. We can't use the lower bound $x=-\infty$, because the integral would diverge there, so there would be no normalizing factor.

The next reasonable thing to do is to integrate from $0$ to $\infty$.

$$
\begin{align}
1 &= \int_X p(x) \, dx \\
&= \int_0^\infty A \, e^{-\lambda_1 x} \, dx \\
&= \left. -\frac{A}{\lambda_1} \, e^{-\lambda_1 x} \right\rvert_0^\infty \\
&= \frac{A}{\lambda_1} \\
\lambda_1 &= A
\end{align}
$$

Then applying constraint 2:

$$
\begin{align}
\mu &= \int_X x \, p(x) \, dx \\
&= \int_0^\infty x \, A \, e^{-A x} \, dx \\
&= \left. -x \, e^{-A x} \right\rvert_0^\infty + \int_0^\infty e^{-A x} \, dx \\
&= \left. -\frac{1}{A} \, e^{-A x} \right\rvert_0^\infty \\
\mu &= \frac{1}{A}
\end{align}
$$
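Both constraint integrals can also be verified symbolically. A small sketch, again assuming sympy (not part of the original derivation):

```python
# Symbolic check of the two constraint integrals.
import sympy as sp

x, A, lam1 = sp.symbols("x A lambda_1", positive=True)

# Constraint 1: the total probability is A/lambda_1, hence lambda_1 = A
print(sp.integrate(A * sp.exp(-lam1 * x), (x, 0, sp.oo)))  # A/lambda_1

# Constraint 2: with lambda_1 = A, the mean comes out as 1/A
print(sp.integrate(x * A * sp.exp(-A * x), (x, 0, sp.oo)))  # 1/A
```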

Our probability distribution then becomes:

$$
\begin{align*}
\tag{P} p(x) = \frac{1}{\mu} e^{-\frac{x}{\mu}}
\end{align*}
$$
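As a quick numerical sanity check of (P), here is a sketch with an arbitrary example value $\mu = 2.5$ (my addition, assuming numpy and scipy):

```python
# Numerical check that (P) integrates to 1 and has mean mu.
import numpy as np
from scipy import integrate

mu = 2.5  # arbitrary example value
p = lambda x: np.exp(-x / mu) / mu

total, _ = integrate.quad(p, 0, np.inf)                  # constraint 1
mean, _ = integrate.quad(lambda x: x * p(x), 0, np.inf)  # constraint 2
print(total)  # ~ 1.0
print(mean)   # ~ 2.5
```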

Some properties of Negative Exponential Distribution

Although we only used basic constraints, the resulting probability distribution is related to the Poisson distribution and has a similar form. The Negative Exponential Distribution (NED) is mainly used to describe a Poisson process as well. In comparison, the Poisson distribution (PD) describes how many decay events occur in a fixed time interval, while the NED describes the time interval between decay events. In fact, the NED can be derived from the PD.

Deriving Negative Exponential Distribution from Poisson Distribution

Our Poisson distribution for a fixed time interval is this:

$$
\begin{align*}
\tag{PD} p(k,r) &= e^{-r} \frac{r^k}{k!}
\end{align*}
$$

The above formula assumes that $r$ is the average total number of events in a fixed time interval. Previously, we took the time unit itself to be the unitary value 1. If we want to introduce the time parameter $t$, then the Poisson distribution takes the form:

$$
\begin{align*}
\tag{PD'} p(k,r) &= e^{-rt} \frac{(rt)^k}{k!}
\end{align*}
$$

Notice that $rt=\lambda$, which is the average total number of events (without the time unit). Another way to think about it: $\lambda$ has a unit of "count", while $r$ is a "rate" with a unit of "count per time". For example, if $r = 2$ events per second and $t = 3$ seconds, then $\lambda = rt = 6$ events. It's just that in our previous article we defined the time unit to be 1, so $r$ and $\lambda$ had the same numerical value.

In the NED, the average has the unit "time per count", which means the average time interval between events. So, if we declare this parameter as $\mu$, then $\mu=\frac{1}{r}$.

$$
\begin{align*}
p(k,r) &= e^{-rt} \frac{(rt)^k}{k!} \\
p(k,\mu) &= e^{-\frac{t}{\mu}} \frac{(\frac{t}{\mu})^k}{k!}
\end{align*}
$$

The interesting insight is that if we set $k=0$, we get $p(0,\mu)$: the probability that "nothing" happened in the chosen interval, that is, the next event time $T$ still lies beyond $t$.

This means that if we want the probability that the next event has happened by then, we subtract it from $1$.

$$
\begin{align*}
p(k=0,\mu) &= e^{-\frac{t}{\mu}} \frac{(\frac{t}{\mu})^0}{0!} \\
p(k=0,t,\mu) &= e^{-\frac{t}{\mu}} \\
p(t < T,\mu) &= e^{-\frac{t}{\mu}} \\
p(t \ge T,\mu) &= 1 - e^{-\frac{t}{\mu}}
\end{align*}
$$

But this means $p(t\ge T)$ is the "probability of the next event having happened" and it is a cumulative probability. To avoid confusion, we rename the function $cp$.

To get the probability density from the cumulative probability, we just take the derivative with respect to the variable. In this case, it is $t$.

$$
\begin{align*}
cp(t \ge T,\mu) &= 1 - e^{-\frac{t}{\mu}} \\
cp'(t \ge T,\mu) &= \frac{1}{\mu} e^{-\frac{t}{\mu}} \\
\tag{NED} p(t,\mu) &= \frac{1}{\mu} e^{-\frac{t}{\mu}}
\end{align*}
$$

It has the same form as our result (P), if we rename $t$ to $x$.
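To see the PD–NED relationship numerically, here is a small simulation sketch (my addition, assuming numpy; the rate and window values are arbitrary). It uses the fact that, given the number of events in a window, the event times of a homogeneous Poisson process are uniformly distributed, and then checks that the gaps between consecutive events average to $\mu = 1/r$:

```python
# Simulate a Poisson process and inspect the waiting times between events.
import numpy as np

rng = np.random.default_rng(42)
r = 4.0                 # rate: average events per unit time
T = 10_000.0            # total observation window
n = rng.poisson(r * T)  # number of events in the window

# Given n events, their times are uniform on [0, T]
times = np.sort(rng.uniform(0, T, size=n))
gaps = np.diff(times)

print(gaps.mean())  # ~ mu = 1/r = 0.25
print(gaps.std())   # ~ 0.25 as well: the NED's std equals its mean
```

The last printed line anticipates the variance result below: for the exponential distribution, the standard deviation equals the mean.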

Variance

As usual, we find the second moment $\mu_2$:

$$
\begin{align}
E[x^2] = \mu_2 &= \int_X x^2 \, p(x) \, dx \\
&= \int_X x^2 \, \frac{1}{\mu} e^{-\frac{x}{\mu}} \, dx \\
&= \left. -x^2 \, e^{-\frac{x}{\mu}} \right\rvert_0^\infty + 2 \int_X x \, e^{-\frac{x}{\mu}} \, dx \\
&= 2\mu \int_X x \, \frac{1}{\mu} \, e^{-\frac{x}{\mu}} \, dx \\
&= 2\mu \cdot \mu \\
&= 2\mu^2
\end{align}
$$

The second-to-last step just reuses the definition $\mu=\int_X x \, \frac{1}{\mu} \, e^{-\frac{x}{\mu}} \, dx$.

The variance is the second moment minus the square of the first moment.

$$
\begin{align}
Var(x) &= E[x^2] - E[x]^2 \\
&= 2\mu^2 - \mu^2 \\
&= \mu^2
\end{align}
$$

Since the variance is $\mu^2$, the standard deviation is $\mu$.
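The same numerical setup as before can confirm this (my addition; $\mu = 2.5$ is again an arbitrary example):

```python
# Numerical check of E[x^2] = 2*mu^2 and Var(x) = mu^2.
import numpy as np
from scipy import integrate

mu = 2.5
p = lambda x: np.exp(-x / mu) / mu

m2, _ = integrate.quad(lambda x: x**2 * p(x), 0, np.inf)
print(m2)          # ~ 2 * mu**2 = 12.5
print(m2 - mu**2)  # variance ~ mu**2 = 6.25
```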

Characteristic function

Since the distribution is continuous, we need to find the characteristic function using an integral.

$$
\begin{align}
E[e^{itx}] &= \int_X e^{itx} \, \frac{1}{\mu} e^{-\frac{x}{\mu}} \, dx \\
&= \frac{1}{\mu} \int_X e^{x \frac{it\mu - 1}{\mu}} \, dx \\
&= \left. \frac{1}{it\mu - 1} \, e^{x \frac{it\mu - 1}{\mu}} \right\rvert_0^\infty \\
&= \frac{1}{1 - it\mu}
\end{align}
$$
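We can compare this closed form against a direct numerical evaluation of $E[e^{itx}]$ (my addition; scipy's quad works on real integrands, so the real and imaginary parts are integrated separately, and the values of $\mu$ and $t$ are arbitrary):

```python
# Numerical check of the characteristic function against 1/(1 - i*t*mu).
import numpy as np
from scipy import integrate

mu, t = 2.5, 0.7  # arbitrary example values
p = lambda x: np.exp(-x / mu) / mu

re, _ = integrate.quad(lambda x: np.cos(t * x) * p(x), 0, np.inf)
im, _ = integrate.quad(lambda x: np.sin(t * x) * p(x), 0, np.inf)

print(complex(re, im))        # numerical E[e^{itx}]
print(1 / (1 - 1j * t * mu))  # closed form; the two should agree
```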


Written by Rizky Maulana Nugraha
Software Developer. Currently working remotely from Indonesia.