Deriving probability distribution from entropy: part 3

30-Dec-2023

This is another follow-up article from this part 2.

Previously we used 3 constraints to derive Poisson Distribution from Maximum Entropy Principle, but I intentionally omitted what happens if we only used 2 constraints. This article will talk about it.

(Negative) Exponential Distribution

In the previous article, we stated 3 constraints to derive Poisson Distribution. This time, we will only use two of them:

It must satisfy the normalization axiom (total probability for all $k$ is 1)
It must have an average value (mean)

Summarizing our Lagrangian constraints:

\begin{align*} C_1 &= \lambda_0 (1-\sum_k p(k)) \\ C_2 &= \lambda_1 (\mu-\sum_k k \, p(k)) \\ \end{align*}

Our full entropy function is as follows (I omit the $k$ input notation for brevity).

\begin{align*} H(k,p) &= -\sum_k p \, \ln(p) + C1 + C2 \\ H(k,p) &= -\sum_k p \, \ln(p) + \lambda_0 (1-\sum_k p) + \lambda_1 (\mu-\sum_k k \, p) \\ \end{align*}

Before we continue, let us notice that these two constraints were just basic constraints. There are no inherent assumptions about what the distribution should look like from a particular datasets.

We will begin adding assumptions as we go along.

First, notice that previously we are dealing with discrete distribution. However, currently there’s nothing to prevent us from assuming that these two constraints can’t be applied to a continuous distribution? So let’s just replace $k$ as $x$ and use integrals instead of sum.

\begin{align*} H(x,p) &= -\int_X p \, \ln(p) \, dx + \lambda_0 (1-\int_X p\, dx) + \lambda_1 (\mu-\int_X x \, p\, dx) \\ \end{align*}

Just like in the first article, because it’s a continuous distribution, we can take derivative of $H(x,p)$ by x, and then do stationary action principle. Our Lagrangian then becomes:

\begin{align*} L(x,p) &= -p \, \ln(p) - \lambda_0 \, p - \lambda_1 \, x \, p \\ \end{align*}

The partial derivative of the Lagrangian over p, must be 0

\begin{align} \frac{\partial}{\partial{p}} L(x,p) &= 0 \\ - \ln(p) - 1 - \lambda_0 - \lambda_1 \, x &= 0 \\ \ln(p) &= - 1 - \lambda_0 - \lambda_1 \, x \\ p(x) &= e^{-1-\lambda_0} \, e^{-\lambda_1 \,x} \\ &= A \, e^{-\lambda_1 \,x} \\ \end{align}

Step (4) to (5), we replace $e^{-1-\lambda_0}=A$ , because it was a constant.

To apply back constraint 1, we need to define the integral boundary. Remember that this is going to be a probability distribution. We can’t use lower bound $x=-\infty$ , because the integral result will be unbounded (it will be pretty big), so we won’t have normalizing factor.

The next reasonable thing to do, is to integrate it from $0$ to $\infty$ .

\begin{align} 1 &= \int_X p(x) \, dx \\ &= \int_0^\infty A \, e^{-\lambda_1 \,x} \, dx \\ &= \left. -\frac{A}{\lambda_1} \, e^{-\lambda_1 \,x} \right\rvert_0^\infty \\ &= \frac{A}{\lambda_1} \\ \lambda_1 &= A \end{align}

Then applying constraint 2:

\begin{align} \mu &= \int_X x \, p(x) \, dx \\ &= \int_0^\infty x A \, e^{-A \,x} \, dx \\ &= \left. -x\, e^{-A \,x} \right\rvert_0^\infty + \int_0^\infty e^{-A \,x} \, dx \\ &= \left. -\frac{1}{A} \, e^{-A \,x} \right\rvert_0^\infty \\ \mu &= \frac{1}{A} \\ \end{align}

Our probability distribution then becomes:

\begin{align*} \tag{P} p(x)=\frac{1}{\mu}e^{-\frac{x}{\mu}} \\ \end{align*}

Some properties of Negative Exponential Distribution

Although we only used basic constraints for these, the probability distribution is also related Poisson distribution and has similar form. Negative Exponential Distribution (NED), is mainly used to describe Poisson process as well. In comparison, Poisson distribution (PD) is used to describe how many decay events in a fixed time interval. Meanwhile, NED is used to describe the time interval between each decay events. In fact, NED can be derived from PD.

Deriving Negative Exponential Distribution from Poisson Distribution

Our Poisson distribution for fixed time interval is this:

\begin{align*} \tag{PD} p(k,r) &= e^{-r}\frac{r^k}{k!} \end{align*}

The above formula is under assumptions that $r$ is the average total events in a fixed time interval. Previously, we use whatever time units as the unitary value 1. If we want to introduce the time parameter $t$ , then Poisson distribution will take the form:

\begin{align*} \tag{PD} p(k,r) &= e^{-rt}\frac{(rt)^k}{k!} \end{align*}

Notice that $rt=\lambda$ , which is the average total events (without the time unit). Another way to think about it, is that $\lambda$ will have unit called “count”, while $r$ is a “rate”, and it will have unit called “count per time”. It’s just that in our previous article, we define the time unit to be 1, so $r$ and $\lambda$ has the same numerical value.

In NED, the average has unit “time per count”, which means the average time interval for each event. So, if we declare this parameter as $\mu$ , then $\mu=\frac{1}{r}$ .

\begin{align*} p(k,r) &= e^{-rt}\frac{(rt)^k}{k!} \\ p(k,\mu) &= e^{-\frac{t}{\mu}}\frac{(\frac{t}{\mu})^k}{k!} \\ p(k,\mu) &= e^{-\frac{t}{\mu}}\frac{(\frac{t}{\mu})^k}{k!} \\ \end{align*}

The interesting insight was that, if we set $k=0$ , we will have $p(0,\mu)$ , where it means that this is the probability of “nothing” happened in the chosen interval $t=T$ .

This means, if we want to calculate the probability of the next event, we subtract it from $1$ .

\begin{align*} p(k=0,\mu) &= e^{-\frac{t}{\mu}}\frac{(\frac{t}{\mu})^0}{0!} \\ p(k=0,t=t,\mu) &= e^{-\frac{t}{\mu}} \\ p(t\lt T,\mu) &= e^{-\frac{t}{\mu}} \\ p(t\ge T,\mu) &= 1-e^{-\frac{t}{\mu}} \\ \end{align*}

But this means, $p(t\ge T)$ is “probability of the next event happening” and it is a cumulative probability. To avoid confusion, we rename the function as $cp$ .

To get the probability density from the cumulative probability, we just take derivative with respect to the variable. In this case, it was $t$ .

\begin{align*} cp(t\ge T,\mu) &= 1-e^{-\frac{t}{\mu}} \\ cp'(t\ge T,\mu) &= \frac{1}{\mu}e^{-\frac{t}{\mu}} \\ \tag{NED} p(t,\mu) &= \frac{1}{\mu}e^{-\frac{t}{\mu}} \\ \end{align*}

It has the same form with our NED, if we rename $t$ into $x$ .

Variance

As usual, we find the second moment $\mu_2$

\begin{align} E[x^2] = \mu_2 &= \int_X x^2 \, p(x)\, dx\\ &= \int_X x^2 \,\frac{1}{\mu}e^{-\frac{x}{\mu}} \, dx \\ &= -\left. x^2 e^{-\frac{x}{\mu}} \right\rvert_0^\infty + 2\int_X x \,e^{-\frac{x}{\mu}} \, dx \\ &= 2\mu \int_X x \, \frac{1}{\mu} \,e^{-\frac{x}{\mu}} \, dx \\ &= 2\mu \mu \\ &= 2\mu^2 \\ \end{align}

Step (19) to (20) is just reusing definition of $\mu=\int_X x \, \frac{1}{\mu} \,e^{-\frac{x}{\mu}} \, dx$

The variance is the second moment subtracted by the square of the first moment.

\begin{align} Var(x) &= E[x^2] - E[x]^2 \\ &= 2\mu^2 - \mu^2 \\ &= \mu^2 \\ \end{align}

Since the variance is $\mu^2$ , then the standard deviation is $\mu$ .

Characteristic function

Since the distribution is continuous, we need to find the function using integral.

\begin{align} E[e^{itx}] &= \int_X e^{itx} \, \frac{1}{\mu}e^{-\frac{x}{\mu}} \, dx \\ &=\frac{1}{\mu} \int_X e^{x\frac{it\mu-1}{\mu}} \, dx \\ &=\left. \frac{1}{it\mu-1} \, e^{x\frac{it\mu-1}{\mu}} \right\rvert_0^\infty \\ &=\frac{1}{1-it\mu} \\ \end{align}