
Deriving probability distribution from entropy: part 2

29-Dec-2023

Yesterday, I was planning to make a follow-up article to part 1. So I googled my previous article to reread it first.

Accidentally, I found this awesome article talking about the same thing! Heck, it even uses a similar title: “Deriving probability distribution using the Principle of Maximum Entropy”. So I guess this method is very common, since that article is dated 2017.

Poisson distribution

Originally, I only planned to write up to the Normal distribution. But at some point I saw a question in my Twitter feed about why atoms have to “randomly” decay, with half of them gone in a given interval. It seems as if they “break causality”, since they decay randomly without a cause.

Before we derive the distribution, let’s talk about the characteristics of a Poisson process.

Atomic decay is actually a perfect example of that. Let’s say a single nucleus decays with a certain probability $p$. We now concern ourselves with the probability for a group of nuclei. What is the decay rate of the whole group? The probability is not simply the sum over each individual nucleus: as time goes on the normalization factor changes, because the total number of nuclei changes. So the distribution actually measures the probability of observing a given total number of independent events, when the probability of a single event is known.

Our first constraint is still the normalization axiom.

The second constraint is taken from the context. A Poisson process is characterized by a “memoryless” property. In the case of $k$ event occurrences, each event doesn’t care about the history of previous events. In the case of the atomic decay above (that means $k=1$), a single decay event doesn’t care whether the current number of atoms is 1000 or just 2. It will decay with the same probability. However, since what humans observe is the time interval between occurrences, the distribution we record is “the probability of the number of nuclei becoming half, given a specified interval $t$”. So the input of the distribution is time instead of frequency, which makes it appear somewhat counterintuitive to apply entropy to it.

From what the process describes, we know that the distribution is a function of two variables, $k$ and $t$. But since the occurrence count is a discrete variable, $p(k,t)$ is a discrete probability distribution. Our entropy formula is a plain sum, instead of the integral we used in the previous article.

Next, since we want the probability of $k$ events happening in a fixed interval, we need to introduce another parameter. Notice that in the case of radioactive decay events, what we observe is time, so we have the variable $t$. But the distribution we want to build is for a fixed interval $t$. That means we need to express it the other way around: for a given interval $t$, there can be a different number of decay events. Let’s suppose that the average is $r$. That means the average total number of events over a variable duration $t$ becomes $rt$.

For practical purposes, $t$ is usually set to a specific unit of time that $r$ can be converted into. For example, $t$ can be 1 second, 1 minute, or 1 hour, and the value of $r$ will match accordingly. So, as a value, we can also say that $t=1$ for most usages. I am being pedantic about the unit purely out of habit from physics.

In other words, it is okay to swap the parameter $t$ with $r$ as long as you understand what it means. The parameter $r$ here acts as a sort of “frequency” in the probabilistic sense: it counts how many events happen on average per unit of time (much like the Hertz unit).
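To make the role of $r$ concrete, here is a small simulation sketch of my own (not part of the derivation): draw exponential waiting times with rate $r$, then count how many events land in each unit-time interval. The per-interval counts average out to roughly $r$.

```python
import numpy as np

rng = np.random.default_rng(42)
r = 5.0        # assumed average number of events per unit time
T = 10_000     # total observation time, in unit-time intervals

# A Poisson process has exponentially distributed waiting times between events.
waits = rng.exponential(scale=1.0 / r, size=int(2 * r * T))
arrivals = np.cumsum(waits)
arrivals = arrivals[arrivals < T]

# Count how many events land in each unit-time interval.
counts, _ = np.histogram(arrivals, bins=np.arange(0, T + 1))
print("average events per unit time:", counts.mean())  # ≈ r
```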

We are ready with the constraints.

Now consider what happens with the entropy for a single decay event. Entropy is additive over the self-information of each outcome. Since we are using a discrete distribution now, we count each term individually.

\begin{align}
H(p)=-\sum_{k,r} p(k,r) I(p(k,r))
\end{align}

where $I$ is the Shannon self-information we are currently considering.

When a single decay event happens, we gain information from the following constraints:

  1. It must satisfy the normalization axiom (the total probability over all $k$ is 1)
  2. It must have a fixed average frequency/rate of $r$ total events per unit time
  3. For very large $k$, the probability of that many events happening must be increasingly small

Constraints 1 and 2 are straightforward because they are similar to the derivations in the previous article, where we derived the Uniform Distribution and the Gaussian Distribution.

But the key here is the third constraint, which is the defining property of a Poisson process. If we only measure a single decay event (that is, $k=1$), then the constraint disappears, since the information is surely 0: the probability of measuring a single event, given that we observed one, is a certainty. This is what happens with the Uniform Distribution.

However, a Poisson process specifically tries to measure $k \ge 1$.

Since the condition is more specific, we can intuitively guess that the maximum entropy must be less than the entropy of the Uniform Distribution.

Let us walk through how we gain this information. If $k=1$, we observed only 1 decay event, which is a certainty relative to the observation. That means:

I_1(p(1, r))=\ln{\frac{1}{1}}=0

The probability of us observing the second event within the same interval should not be affected by the previous event. This is due to the memoryless property we talked about earlier. It then becomes like a coin flip, where the chance of the second event happening is just $\frac{1}{2}$.

Extending the analogy to a certain number of events $k$: the probability of the $k$-th event happening is just $\frac{1}{k}$, just like in a uniform distribution. But this information needs to add up, because in our case $k$ corresponds to the total number of events, not just the $k$-th event.

In summary, our third constraint involves adding up all the $I_k$ observed for $k$ events.

\begin{align*}
I_k(p(k)) &= \ln{\frac{1}{p(k)}} \\
I_k(p(k)) &= \ln{k} \\
\sum_k I_k(p(k,r)) &= \ln{k!}
\end{align*}
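As a quick numerical check of that last line (my own sketch), summing the per-event information $\ln k$ over successive events indeed reproduces $\ln k!$:

```python
import math

k = 7
# Accumulate the self-information of each successive event: ln(1) + ln(2) + ... + ln(k)
cumulative_info = sum(math.log(i) for i in range(1, k + 1))
print(cumulative_info)              # ≈ 8.525
print(math.log(math.factorial(k)))  # ln(k!), the same value
```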

Let’s summarize our Lagrangian constraints:

\begin{align*}
C_1 &= \lambda_0 (1-\sum_k p(k)) \\
C_2 &= \lambda_1 (r-\sum_k k \, p(k)) \\
C_3 &= \lambda_2 (c-\sum_k I_k(p(k)) \, p(k)) = \lambda_2 (c-\sum_k \ln(k!) \, p(k))
\end{align*}

Our full entropy function is as follows (I omit the $k$ input notation for brevity).

\begin{align*}
H(k,p) &= -\sum_k p \, \ln(p) + C_1 + C_2 + C_3 \\
H(k,p) &= -\sum_k p \, \ln(p) + \lambda_0 (1-\sum_k p) + \lambda_1 (r-\sum_k k \, p) + \lambda_2 (c-\sum_k \ln(k!) \, p)
\end{align*}

This is a little bit different from the previous article, where we constructed the Lagrangian by differentiating with respect to $x$ to eliminate the constants. We can’t do that now, since the parameter $k$ is discrete. So instead, we treat $H(k,p)$ itself as a Lagrangian.

\begin{align}
\frac{\partial}{\partial p} H(k, p) &= 0 = - \sum_k 1 - \sum_k \ln(p) - \lambda_0 (\sum_k 1) - \lambda_1 (\sum_k k) - \lambda_2 (\sum_k \ln(k!))
\end{align}

Notice that each term has a sum over $k$, due to the fact that we observe $k$ total events, instead of just one.

But the maximum entropy principle implies that the Lagrangian condition applies at each $k$-th event as well. So we could pick any arbitrary $k$, and the equation should still hold. This allows us to remove all the sum signs (we observe a specific $k$-th event):

\begin{align}
0 &= - 1 - \ln(p) - \lambda_0 - \lambda_1 \, k - \lambda_2 \, \ln(k!) \\
\ln(p) &= -1 - \lambda_0 - \lambda_1 \, k - \lambda_2 \, \ln(k!) \\
p &= e^{-1 - \lambda_0} \, e^{-\lambda_1 k} \, (k!) ^{-\lambda_2}
\end{align}

We got an expression, but it would be difficult to find each constant. We don’t even know yet where $r$ fits in it.

For now, if we set $k=0$, then the probability function becomes:

p(0) = e^{-1-\lambda_0}

But this probability must be affected by the average rate $r$. If the average event rate is high, the probability of observing no events should be increasingly small. So we have some intuition that $\lambda_0$ is related to $r$ proportionally. We will replace the exponent with a function $-\nu(r)$:

p(0) = e^{-\nu}

Now, let’s say $k=1$. Then

p(1) = e^{-\nu} \, e^{-\lambda_1}

But the probability of a single event happening in a Poisson distribution is exactly the probability of “no event happened” times the average number of events, like this:

p(1)= r \, p(0)

For $k=2$, if we think about it, the probability of that total number of events must be equal to the probability of the previous total ($k-1$), times a factor that depends only on the average and the value of $k$.

To illustrate this, let’s say the average number of events is 5. If we know $p(1)$, then intuitively the value of $p(2)$ must be higher, because $k=2$ total events is still below the average of 5. If we know $p(5)$, we can also expect $p(6)$ to have a lower value/chance, because $k=6$ is above the average of 5. Basically, the factor should look like $(\frac{r}{k})^\alpha$.

From this hint and the general formula above, we can guess that the Poisson distribution has some kind of recurrence relation:

p(k) = e^{-\lambda_1} \, k^{-\lambda_2} \, p(k-1)

Since the factor in front of $p(k-1)$ should take a general form like $(\frac{r}{k})^\alpha$, it means that $\lambda_2 = \alpha$ and $e^{-\frac{\lambda_1}{\lambda_2}} = r$.

Now, from this intuition alone, we can’t really decide the value of $\alpha$. But it’s a pretty good intuition.

The definite way to get the values of these constants is by applying the constraints above.

For the first constraint, the sum of all the probabilities must equal one:

\begin{align}
1 &= \sum_k^\infty e^{-\nu} \, e^{-\lambda_1 k} \, (k!) ^{-\lambda_2} \\
e^{\nu} &= \sum_k^\infty e^{-\lambda_1 k} \, (k!) ^{-\lambda_2}
\end{align}

This is a difficult equality to solve. To sum up the right-hand side using an integral, we would need some kind of generating function. Instead of that approach, we know that $e^x$ can be expanded as an infinite sum. So we will use that, and let the sum index match $k$:

\begin{align}
e^{\nu} &= \sum_k^\infty e^{-\lambda_1 k} \, (k!) ^{-\lambda_2} \\
\sum_k^\infty \frac{\nu^k}{k!} &= \sum_k^\infty e^{-\lambda_1 k} \, (k!) ^{-\lambda_2}
\end{align}

Suppose that we match it term by term (the $k$-th term on the left with the $k$-th term on the right). Then the only reasonable result is that $\lambda_2=1$ and $e^{-\lambda_1 k}=\nu^k$.

From our previous intuitive guess, we can conclude that $\nu=r$ and $\alpha=1$. But if we want to make sure, we use the second constraint.

The average $r$ must be equal to the first moment of the distribution (by definition).

\begin{align}
r &= \sum_k^\infty k \, p(k) \\
&= \sum_k^\infty k \, e^{-\nu} \, \frac{\nu^k}{k!} \\
&= e^{-\nu} \sum_k^\infty k \, \frac{\nu^k}{k!} \\
&= e^{-\nu} \sum_k^\infty \frac{\nu^k}{(k-1)!} \\
&= e^{-\nu} \, \nu \sum_k^\infty \frac{\nu^{(k-1)}}{(k-1)!} \\
&= e^{-\nu} \, \nu \sum_{j+1}^\infty \frac{\nu^{j}}{j!} \\
&= e^{-\nu} \, \nu \sum_{j}^\infty \frac{\nu^{j}}{j!} \\
&= e^{-\nu} \, \nu \, e^{\nu} \\
r &= \nu
\end{align}

The index-shifting steps (from $k-1$ to $j$) are justified because if we sum $k$ to $\infty$, the index $k-1$ can be replaced with an arbitrary index $j$, which does not change the definition of the $e^{x}$ expansion as an infinite series. So we can replace the sum with $e^\nu$.

We get the conclusion that $r=\nu$.

To wrap up, the probability distribution that maximizes entropy under these constraints is the Poisson distribution:

\begin{align*}
\tag{P} p(k,r) = e^{-r} \frac{r^k}{k!}
\end{align*}
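As a quick sanity check of the final form, here is a small sketch of my own (comparing against scipy.stats.poisson as a reference): the formula is normalized, has mean $r$, and matches the standard Poisson pmf.

```python
import math
import numpy as np
from scipy.stats import poisson

r, K_MAX = 4.2, 60   # K_MAX truncates the infinite sum; the tail terms are negligible
k = np.arange(K_MAX + 1)
p = np.exp(-r) * r**k / np.array([math.factorial(int(ki)) for ki in k])

print(p.sum())                            # ≈ 1, the normalization constraint
print((k * p).sum())                      # ≈ r, the mean constraint
print(np.allclose(p, poisson.pmf(k, r)))  # True: matches the standard Poisson pmf
```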

Some properties of Poisson Distribution

It is quite straightforward to derive the Poisson distribution using the Maximum Entropy Principle, because the third constraint is very special.

The third constraint uniquely defines the distribution, just from the memoryless property of the Poisson process. As described above, since the self-information content doesn’t depend on the previous event, each new event is a simple uniform draw over the existing outcomes. For each new event, the new information adds up, with the entropy contribution equal to the probability of the $k$-th Poisson event times the information accumulated if we treat the event as cumulative successive events up to $k$.

If we only look at the final form of the Poisson distribution, it might not be clear where the factor $\frac{1}{k!}$ comes from. But using information theory, it is clear that the term $-\ln k!$ is the amount of information needed to reduce the entropy contributed by counting successive events. It is the bias needed to constrain the distribution. If this term did not exist, constraints 1 and 2 alone would give a solution of the form $p(k) \propto e^{-\lambda_1 k}$ (a geometric-type distribution), and with constraint 1 alone we would recover the Uniform Distribution.
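To see this numerically, here is a rough sketch of my own (truncated support, scipy's SLSQP optimizer, and the constraint value for the $\ln k!$ term taken from a target Poisson, so it only checks self-consistency): with all three constraints the maximizer lands on the Poisson pmf, while dropping the third constraint yields a geometrically decaying solution instead.

```python
import numpy as np
from math import lgamma
from scipy.optimize import minimize
from scipy.stats import poisson

r, K_MAX = 3.0, 30
k = np.arange(K_MAX + 1)
log_kfact = np.array([lgamma(ki + 1.0) for ki in k])   # ln(k!)
target = poisson.pmf(k, r)
c = float(np.sum(log_kfact * target))                  # E[ln k!] under the target Poisson

def neg_entropy(p):
    return float(np.sum(p * np.log(p + 1e-15)))        # minimize -H(p)

base = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # constraint 1: normalization
    {"type": "eq", "fun": lambda p: np.sum(k * p) - r},  # constraint 2: mean equals r
]
info = [{"type": "eq", "fun": lambda p: np.sum(log_kfact * p) - c}]  # constraint 3

p0 = np.full(K_MAX + 1, 1.0 / (K_MAX + 1))
bounds = [(0.0, 1.0)] * (K_MAX + 1)

with_info = minimize(neg_entropy, p0, method="SLSQP", bounds=bounds,
                     constraints=base + info).x
without_info = minimize(neg_entropy, p0, method="SLSQP", bounds=bounds,
                        constraints=base).x

print(np.abs(with_info - target).max())      # small: recovers the Poisson pmf
print(without_info[:5] / without_info[1:6])  # roughly constant ratios: geometric decay
```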

From the final formula (P), we get some interesting properties.

Recurrence relation

We already guessed the recurrence relation; it turns out to be of the form:

p(k)=\frac{r}{k} \, p(k-1)

An interesting corollary is that when $k=r$, it turns out that

p(r) = p(r-1)

An easier way to understand this relation is to view it in terms of information:

I(k) - I(k-1) = - \ln \frac{r}{k}

The information gain from event $k-1$ to $k$ has to be negative if $k$ is less than the average, because it is more common. The information gain from $k-1$ to $k$ has to be positive if $k$ is greater than the average, because it is rarer/more surprising.

In the special case of $k=r$, the relative information is zero, meaning that $I(k)$ and $I(k-1)$ convey the same amount of information around the average total number of events.
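A tiny numerical illustration of the recurrence and the information difference (my own sketch; note that $p(r)=p(r-1)$ holds exactly only for integer $r$):

```python
import math

def poisson_pmf(k, r):
    return math.exp(-r) * r**k / math.factorial(k)

r = 5  # integer rate, so p(r) = p(r-1) exactly
for k in range(1, 9):
    recurrence_ok = math.isclose(poisson_pmf(k, r), (r / k) * poisson_pmf(k - 1, r))
    info_gain = math.log(poisson_pmf(k - 1, r)) - math.log(poisson_pmf(k, r))  # I(k) - I(k-1) = -ln(r/k)
    print(k, recurrence_ok, round(info_gain, 3))  # negative below r, zero at k = r, positive above

print(math.isclose(poisson_pmf(r, r), poisson_pmf(r - 1, r)))  # True: p(r) = p(r-1)
```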

Variance

Another interesting property of the Poisson distribution is that the process itself implies the mean and variance have to be the same, namely $r$. This is because if $r=0$, it is not possible to infer that the process is a Poisson process at all. We need to observe the smallest unit of time for which the average number of events is non-zero in order to get a variance.

However, one can calculate the variance using the second moment $\mu_2$:

\begin{align}
E[k^2] = \mu_2 &= \sum_k^\infty k^2 \, p(k) \\
&= \sum_k^\infty k^2 \, e^{-r} \, \frac{r^k}{k!} \\
&= e^{-r} \, \sum_k^\infty k^2 \frac{r^k}{k!} \\
&= e^{-r} \, \sum_k^\infty k \frac{r^k}{(k-1)!} \\
&= e^{-r} \, \sum_k^\infty (k-1+1)\, \frac{r^k}{(k-1)!} \\
&= e^{-r} \, \left( \sum_k^\infty (k-1)\, \frac{r^k}{(k-1)!} + \sum_k^\infty \frac{r^k}{(k-1)!} \right) \\
&= e^{-r} \, r \, \left( \sum_k^\infty (k-1)\, \frac{r^{k-1}}{(k-1)!} + \sum_k^\infty \frac{r^{k-1}}{(k-1)!} \right) \\
&= e^{-r} \, r \, \left( r \sum_k^\infty \frac{r^{k-2}}{(k-2)!} + e^r \right) \\
&= e^{-r} \, r \, \left( r\, e^r + e^r \right) \\
&= r^2 + r
\end{align}

The variance is the second moment minus the square of the first moment.

\begin{align}
Var(k) &= E[k^2] - E[k]^2 \\
&= r^2 + r - r^2 \\
&= r
\end{align}

So we get the variance $r$ again, as expected, meaning the standard deviation is $\sqrt{r}$.
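A quick simulation sketch (my own, using numpy's Poisson sampler) showing the mean and variance both landing at $r$:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 7.5
samples = rng.poisson(lam=r, size=1_000_000)

print(samples.mean())  # ≈ 7.5
print(samples.var())   # ≈ 7.5 as well: mean and variance coincide at r
print(samples.std())   # ≈ sqrt(7.5) ≈ 2.74
```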

Characteristic function

A characteristic function completely determines how a probability distribution behaves. For the Poisson distribution, we compute it as follows:

\begin{align}
E[e^{itk}] &= \sum_k^\infty e^{itk} \, e^{-r} \, \frac{r^k}{k!} \\
&= \sum_k^\infty \frac{\left(e^{itk} \, e^{-r} \, r^k \right)}{k!} \\
&= e^{-r} \, \sum_k^\infty \frac{\left(e^{itk} \, r^k \right)}{k!} \\
&= e^{-r} \, \sum_k^\infty \frac{\left(e^{it} \, r \right)^k}{k!} \\
&= e^{-r} \, e^{e^{it} \, r} \\
\phi(t) &= e^{r \left(e^{it} - 1 \right)}
\end{align}
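A small numerical check of this closed form (my own sketch): compute $E[e^{itk}]$ directly from the pmf on a truncated support and compare it against $e^{r(e^{it}-1)}$.

```python
import math
import numpy as np

r, K_MAX = 3.0, 80   # truncate the infinite sum; the tail is negligible
k = np.arange(K_MAX + 1)
pmf = np.exp(-r) * r**k / np.array([math.factorial(int(ki)) for ki in k])

for t in (0.1, 0.7, 2.3):
    direct = np.sum(np.exp(1j * t * k) * pmf)    # E[e^{itk}] summed term by term
    closed = np.exp(r * (np.exp(1j * t) - 1.0))  # phi(t) = exp(r (e^{it} - 1))
    print(t, abs(direct - closed) < 1e-9)        # True for each t
```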

The characteristic function shows an interesting property. Suppose we have multiple different Poisson distributions. The only thing that distinguishes them is the value of $r$, the average rate of events per unit time. If we have multiple independent groups of Poisson events and observe them over the same unit of time, the total number of events is additive, and the resulting sum is again Poisson distributed.

Summing independent random variables (that is, convolving their distributions) corresponds to multiplying their characteristic functions. For $n$ identical independent groups:

\begin{align}
\phi(t)^n &= (e^{r \left(e^{it} - 1 \right)})^n \\
&= e^{n \, r \left(e^{it} - 1 \right)}
\end{align}

The last result is just the characteristic function of the same Poisson probability distribution, but with parameter $nr$ as its mean.

As an example, suppose we have $N$ clumps of radioactive elements, each with decay event rate $r=5$. That means on average there will be 5 decay events per unit time in each clump. It also means that if we observe these clumps together, there will be on average $5N$ decay events per the same unit time.

It seems obvious, but now we know why it is true: it is just a consequence of how the characteristic functions multiply together.
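A short simulation sketch (mine) of $N$ independent clumps, each Poisson with rate 5; their combined count behaves like a single Poisson variable with rate $5N$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, r, trials = 4, 5.0, 200_000

# Each row holds the counts from N independent clumps over one unit-time observation.
clumps = rng.poisson(lam=r, size=(trials, N))
combined = clumps.sum(axis=1)   # total decay events across all clumps

print(combined.mean())          # ≈ N * r = 20
print(combined.var())           # ≈ 20 as well, consistent with Poisson(N r)
```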

This property is usually known “backwards” as the “infinite divisibility” property. Suppose we have a group of radioactive elements with an average of 10 decay events; then we can divide the group into 5 groups in which we observe 2 decay events each, on average.

In my opinion, the term infinitely divisible is somewhat misleading here, because the probability distribution in this case is a discrete one. It can only be meaningfully divided until each group has on average 1 decay event, which is the bare minimum for us to observe a Poisson process within this unit of time.



Written by Rizky Maulana Nugraha
Software Developer. Currently remotely working from Indonesia.