Maulana's personal blog
Main feedMath/PhysicsSoftware Development

Deriving probability distribution from entropy: part 2


Yesterday, I was planning to make a follow-up article from this part 1. So I googled my previous article to reread it first.

Accidentally I found this awesome article talking about the same thing!. Heck, it even uses similar title: “Deriving probability distribution using the Principle of Maximum Entropy”. So, I guess this method is very common, since the article itself is dated 2017.

Poisson distribution

Originally, I only planned to write up to Normal distribution. But at some point, I saw some question in Twitter feeds regarding why atomic decay rate has to “randomly” decay by half in a given interval. Because it seems they “break causality” since they decay randomly without cause.

Before we derive the distribution, let’s talk about the characteristic of Poisson process.

The atomic decay is actually the perfect example of that. Let’s say a single nucleus decay with a certain probability pp. We are now concern ourselves with the probability of a group of nuclei. What was the decay rate of the whole group? The probability is not the sum of each individual nucleus. As time goes on the normalization factor changes, because the total nucleus changes. So the distribution actually measures the probability of a given size of the total independent events, if a single probability of one event is known.

Our first constraint is still the normalization axiom.

Second constraint is taken from the context. A Poisson process is characterized by a “memoryless” property. In the case of kk event occurrence, each event doesn’t care about the history of previous event. In the case of atomic decay above (that means k=1k=1), a single decay event doesn’t care if the current size of the atoms is 1000 or just 2. It will decay with the same probability. However, since what human observe is a time interval between each occurrence, the distribution we recorded is “probability of the size of the nuclei to become half, given a specified interval tt”. So, the input of the distribution is time instead of frequency, thus it appears somewhat counterintuitive to apply entropy on it.

From what the process describes, we know that the distribution is a function of two variables, kk and tt. But since occurrence is a discrete variables, then p(k,t)p(k,t) is a discrete probability distribution. Our entropy formula is the plain simple sum, instead of integral that we used in the previous article.

Next, since we want to count the probability of kk events happens in a fixed interval, we need to introduce another parameters. Notice that in the case of radioactive decay events, what we observed is time. So we have variable tt. But the distribution we want to make is for a fixed tt intervals. That means, we need to express it backwards. For a given interval tt that means there can be different number of decay events happens/possible. Let’s just suppose that the average is rr. That would mean the total number of average events happens if we have variable tt, becomes rtrt.

For practical purposes, tt is usually set to a specific unit of time, in which rr can be converted into. For example, tt can be 1 second or 1 minute or 1 hour, then the value of rr will match accordingly. So, as a value, we can also say that t=1t=1 for most usages. I was being pedantic about the unit, because it was just a habit from physics.

In other words, it is okay to swap parameter tt with rr as long as you understand what it means. The parameter rr here acts as some sort fo “frequency” in the sense of probability distribution. It counts how many events happens on average, for a unit of time (kind of like Hertz unit).

We are ready with the constraints.

Now consider what happened with the entropy for a single decay event. Entropy is additive from the relative information. We are using discrete distribution now, so we need to count it individually.

H(p)=k,rp(k,r)I(p(k,r))\begin{align} H(p)=-\sum_{k,r} p(k,r) I(p(k,r)) \end{align}

Where II is the Shannon’s self-information we currently considers.

When a single decay events happened, we will have information from the following constraints:

  1. It must satisfy the normalization axiom (total probability for all kk is 1)
  2. It must have a fixed average frequency/rate of rr total events for a unit time
  3. For kk a very large number, the probability of multiple events happened must be increasingly small

Constraints 1 and 2 were straightforward because it is similar with other derivation from previous article, in which we derive the Uniform Distribution and Gaussian Distribution.

But the key here is the third constraint, which is the property of a Poisson process. If we only measure a single decay events (which is k=1k=1), then the constraint disappear. Since the information is surely 0 because of the probability of measuring a single event, is a certainty. This is what happened with Uniform Distribution.

However, a Poisson process specifically tried to measure if k1k \ge 1.

Since the condition is more specific. We can intuitively guess that the maximum entropy must be less than the entropy of Uniform Distribution.

Let us simulate how we gain this information. If k=1k=1, we observed only 1 decay events, which is a certainty, relative to the observation. That means:

I1(p(1,r))=ln11=0I_1(p(1, r))=\ln{\frac{1}{1}}=0

The probability of us observing the second event within the same interval, should not be affected by the previous event. This is due to the memory-less property we are talking about earlier. Then it becomes like a coin flip where the chance of the second event happening is just 12\frac{1}{2}.

Extending the analogy to a certain number of events, kk. The probability of the kk-th event happened is just 1k\frac{1}{k}, just like in uniform distribution. But this information needs to add up because kk corresponds to the total number of events, not the kk-th event, in our case.

In summary our 3rd constraint involved adding all IkI_k currently observed for kk events.

Ik(p(k))=ln1p(k)Ik(p(k))=lnkkIk(p(k,r))=lnk!I_k(p(k)) = \ln{\frac{1}{p(k)}} \\ I_k(p(k)) = \ln{k} \\ \sum_k I_k(p(k,r)) = \ln{k!} \\

Let’s summarize our Lagrangian constraints:

C1=λ0(1kp(k))C2=λ1(rkkp(k))C3=λ2(ckIk(p(k))p(k))=λ2(ckln(k!)p(k))\begin{align*} C_1 &= \lambda_0 (1-\sum_k p(k)) \\ C_2 &= \lambda_1 (r-\sum_k k \, p(k)) \\ C_3 &= \lambda_2 (c-\sum_k I_k(p(k)) \, p(k)) = \lambda_2 (c-\sum_k \ln(k!) \, p(k)) \\ \end{align*}

Our full entropy function is as follows (I omit the kk input notation for brevity).

H(k,p)=kpln(p)+C1+C2+C3H(k,p)=kpln(p)+λ0(1kp)+λ1(rkkp)+λ2(ckln(k!)p)\begin{align*} H(k,p) &= -\sum_k p \, \ln(p) + C1 + C2 + C3 \\ H(k,p) &= -\sum_k p \, \ln(p) + \lambda_0 (1-\sum_k p) + \lambda_1 (r-\sum_k k \, p) + \lambda_2 (c-\sum_k \ln(k!) \, p) \end{align*}

A little bit different from previous article where we construct the Lagrangian by differentiating with respect to xx to eliminate the constants. We can’t do this now, since the parameter kk is discrete. So instead, we treat H(k,p)H(k,p) as a Lagrangian.

pH(k,p)=0=k1kln(p)λ0(k1)λ1(kk)λ2(kln(k!))\begin{align} \frac{\partial}{\partial p} H(k, p) &= 0 = - \sum_k 1 - \sum_k \ln(p) - \lambda_0 (\sum_k 1) - \lambda_1 (\sum_k k) - \lambda_2 (\sum_k \ln(k!)) \end{align}

Notice that each terms has sum over kk, due to the fact that we observed kk total events, instead of just one.

But, the maximum entropy principle implies that at each kk-th event, the Lagrangian condition applies as well. So, we could pick any arbitrary kk, and the equation should still be the same. This allows us to remove all the sum sign (we observe on specific kk-th event)

0=1ln(p)λ0λ1kλ2ln(k!)ln(p)=1λ0λ1kλ2ln(k!)p=e1λ0eλ1k(k!)λ2\begin{align} 0 &= - 1 - \ln(p) - \lambda_0 - \lambda_1 \, k - \lambda_2 \, \ln(k!) \\ \ln(p) &= -1 - \lambda_0 - \lambda_1 \, k - \lambda_2 \, \ln(k!) \\ p &= e^{-1 - \lambda_0} \, e^{-\lambda_1 k} \, (k!) ^{-\lambda_2} \\ \end{align}

We got an expression, but it would be difficult to find each constant. We haven’t even know where we are going to put rr in it.

For now, if we set k=0k=0, then the probability function becomes:

p(0)=e1λ0p(0) = e^{-1-\lambda_0}

But this probability must have been affected by the average rate rr. If the average rate events is high, the probability for no events observed, should be increasingly small. So we have some intuition that λ0\lambda_0 is related with rr proportionally. We will replace it with a function ν(r)-\nu(r)

p(0)=eνp(0) = e^{-\nu}

Now, let’s say k=1k=1. Then

p(1)=eνeλ1p(1) = e^{-\nu} \, e^{-\lambda_1}

But the probability of single event happened in Poisson distribution is exactly the probability of “no event happened” times the average of events. Like this:

p(1)=rp(0)p(1)= r \, p(0)

For k=2k=2, if we think about it, the probability of total events must be equal to the probability of the previous total events (k1k-1), times a factor that depends on just the average and the kk-th value.

To illustrate this. Let’s say the average of events is 5. If we know p(1)p(1), then intuitively we can say that p(2)p(2) value must be higher, because k=2k=2 total events is still below average of 5. If we know p(5)p(5) we can also think that p(6)p(6) will have lower value/chance, because k=6k=6 is above the average of 5. Basically the factors should look like (rk)α(\frac{r}{k})^\alpha.

From this hint and the general formula above. We can guess that Poisson distribution has some kind of recurrence relation:

p(k)=eλ1kλ2p(k1)p(k) = e^{-\lambda_1} \, k^{-\lambda_2} \, p(k-1)

Since the factor in front of p(k1)p(k-1) should take a general form like (rk)α(\frac{r}{k})^\alpha. It means that λ2=α\lambda_2 = \alpha and eλ1λ2=re^{-\frac{\lambda_1}{\lambda_2}} = r.

Now, from this intuition alone, we can’t really decide the value of α\alpha. But it’s a pretty good intuition.

The definite way on how we got the value of these constants. Is by applying back the constraints above.

For the first constraint. The sum of all the probabilities must equal to one:

1=keνeλ1k(k!)λ2eν=keλ1k(k!)λ2\begin{align} 1 &= \sum_k^\infty e^{-\nu} \, e^{-\lambda_1 k} \, (k!) ^{-\lambda_2} \\ e^{\nu} &= \sum_k^\infty e^{-\lambda_1 k} \, (k!) ^{-\lambda_2} \end{align}

This is a difficult equality to solve. To sum up the right hand side using integral, we need some kind of generating function. Instead of that approach, we know that exe^x can be broken apart as infinite sum. So we will use that, and let the sum index to match k:

eν=keλ1k(k!)λ2kνkk!=keλ1k(k!)λ2\begin{align} e^{\nu} &= \sum_k^\infty e^{-\lambda_1 k} \, (k!) ^{-\lambda_2} \\ \sum_k^\infty \frac{\nu^k}{k!} &= \sum_k^\infty e^{-\lambda_1 k} \, (k!) ^{-\lambda_2} \\ \end{align}

Suppose that we match it term by term (kk left side by kk right side). Then the only reasonable results is that λ2=1\lambda_2=1 and eλ1k=νke^{-\lambda_1 k}=\nu^k.

From our previous intuitive guess, we can conclude that ν=r\nu=r and α=1\alpha=1. But if we want to make sure, then we use the second constraints.

The average rr must be equal to the first moment of the distribution (by definition).

r=kkp(k)=kkeννkk!=eνkkνkk!=eνkνk(k1)!=eννkν(k1)(k1)!=eννj+1νjj!=eννjνjj!=eννeνr=ν\begin{align} r &= \sum_k^\infty k \, p(k) \\ &= \sum_k^\infty k \, e^{-\nu} \, \frac{\nu^k}{k!} \\ &= e^{-\nu} \sum_k^\infty k \, \frac{\nu^k}{k!} \\ &= e^{-\nu} \sum_k^\infty \frac{\nu^k}{(k-1)!} \\ &= e^{-\nu} \, \nu \sum_k^\infty \frac{\nu^{(k-1)}}{(k-1)!} \\ &= e^{-\nu} \, \nu \sum_{j+1}^\infty \frac{\nu^{j}}{j!} \\ &= e^{-\nu} \, \nu \sum_{j}^\infty \frac{\nu^{j}}{j!} \\ &= e^{-\nu} \, \nu \, e^{\nu} \\ r &= \nu \\ \end{align}

Step (13), (14), (15), and (16), were justified because if we sum kk to \infty, the index k1k-1 can be replaced with arbitrary index jj, in which it will not change the definition of exe^{x} expansion as infinite series. So, we could replace the sum into eνe^\nu.

We got the conclusion that r=νr=\nu.

To wrap up, a Poisson probability distribution will maximize entropy using this formula:

p(k,r)=errkk!(P)\begin{align*} \tag{P} p(k,r) = e^{-r} \frac{r^k}{k!} \end{align*}

Some properties of Poisson Distribution

It is quite straightforward to derive Poisson distribution using Maximum Entropy Principle, because the third constraint is very special.

The third constraint uniquely define the distribution, just from the memory-less property of Poisson process. Just like we have described above, since the self-information content doesn’t depend on previous event, the new event is a simple uniform distribution over the existing information. For each new event, the new information adds up with the entropy equal to the probability of the kk-th Poisson event, times the information added up if we treat the event as cumulative successive events up to kk.

If we only look at the final form of Poisson distribution, it might not be clear from where the factors 1k!\frac{1}{k!} comes from. But, using information theory, it is clear that the term lnk!-\ln k! is the amount of information necessary to reduce the entropy from successive events being counted. It is the bias needed to constraint the distribution. Suppose that this term don’t exist, then we will recover Uniform Distribution from constraint 1 and 2.

From the final formula of PP. We got some interesting properties

Recurrence relation

We already guess the recurrence relation, it turns out to be in the form of:

p(k)=rkp(k1)p(k)=\frac{r}{k} \, p(k-1)

An interesting corollary is when k=rk=r, it turns out that

p(r)=p(r1)p(r) = p(r-1)

An easier way to understand this relation is if we view it in terms of information:

I(k)I(k1)=lnrkI(k) - I(k-1) = - \ln \frac{r}{k}

Information gain from event k1k-1 to kk has to be negative if kk is less than the average, because it was more common. Information gain from k1k-1 to kk has to be positive if kk is greater than average, because it is much rare/surprising.

In the special case of k=rk=r it implies that the relative information is zero, meaning that I(k)I(k) and I(k1)I(k-1) convey the same amount of information around the average total events.


One other interesting property from Poisson distribution is that the process itself implies that the mean and variance has to be the same, which is rr. This is because if r=0r=0, then it is not possible to infer that the process is a Poisson process. We need to observe the smallest unit of time possible for the average events to be non-zero, in order to get a variance.

However, one can calculate the variance using the second moment μ2\mu_2

E[k2]=μ2=kk2p(k)=kk2errkk!=erkk2rkk!=erkkrk(k1)!=erk(k1+1)rk(k1)!=er(k(k1)rk(k1)!+krk(k1)!)=err(k(k1)rk1(k1)!+krk1(k1)!)=err(rkrk2(k2)!+er)=err(rer+er)=r2+r\begin{align} E[k^2] = \mu_2 &= \sum_k^\infty k^2 \, p(k) \\ &= \sum_k^\infty k^2 \, e^{-r} \, \frac{r^k}{k!} \\ &= e^{-r} \, \sum_k^\infty k^2 \frac{r^k}{k!} \\ &= e^{-r} \, \sum_k^\infty k \frac{r^k}{(k-1)!} \\ &= e^{-r} \, \sum_k^\infty (k-1+1)\, \frac{r^k}{(k-1)!} \\ &= e^{-r} \, \left( \sum_k^\infty (k-1)\, \frac{r^k}{(k-1)!} + \sum_k^\infty \frac{r^k}{(k-1)!} \right) \\ &= e^{-r} \, r \, \left( \sum_k^\infty (k-1)\, \frac{r^{k-1}}{(k-1)!} + \sum_k^\infty \frac{r^{k-1}}{(k-1)!} \right) \\ &= e^{-r} \, r \, \left( r \sum_k^\infty \frac{r^{k-2}}{(k-2)!} + e^r \right) \\ &= e^{-r} \, r \, \left( r\, e^r + e^r \right) \\ &= r^2 + r \\ \end{align}

The variance is the second moment subtracted by the square of the first moment.

Var(k)=E[k2]E[k]2=r2+rr2=r\begin{align} Var(k) &= E[k^2] - E[k]^2 \\ &= r^2 + r - r^2 \\ &= r \\ \end{align}

So we got the variance rr again, as expected. Meaning, the standard deviation is r\sqrt{r}

Characteristic function

A characteristic function completely determines how probability distribution behaves. For this Poisson distribution, we compute it as follows:

E[eitk]=keitkerrkk!=k(eitkerrk)k!=erk(eitkrk)k!=erk(eitr)kk!=ereeitrϕ(t)=er(eit1)\begin{align} E[e^{itk}] &= \sum_k^\infty e^{itk} \, e^{-r} \, \frac{r^k}{k!} \\ &= \sum_k^\infty \frac{\left(e^{itk} \, e^{-r} \, r^k \right)}{k!} \\ &= e^{-r} \, \sum_k^\infty \frac{\left(e^{itk} \, r^k \right)}{k!} \\ &= e^{-r} \, \sum_k^\infty \frac{\left(e^{it} \, r \right)^k}{k!} \\ &= e^{-r} \, e^{e^{it} \, r} \\ \phi(t) &= e^{r \left(e^{it} - 1 \right)} \\ \end{align}

The characteristic function shows an interesting property. Suppose we have multiple different Poisson distribution. The characteristic difference is only defined by its rr value, which is the rate average events per unit time. If we have multiple independent group of Poisson events, and we observe it using the same unit time, it turns out that the total number of events is additive. Which means the probability is also additives.

Summing the probability distribution corresponds to multiplying the characteristic function.

ϕ(t)n=(er(eit1))n=enr(eit1)\begin{align} \phi(t)^n &= (e^{r \left(e^{it} - 1 \right)})^n \\ &= e^{n \, r \left(e^{it} - 1 \right)} \\ \end{align}

The last result is just a characteristic function of the same Poisson probability distribution, but with parameter nrnr as it’s mean.

As an example, suppose we have NN clump of radioactive elements with decay event rate r=5r=5. It means, on average there will be 5 decay events per unit time. That would also mean, if we observe these clumps together, there will be on average 5N5N decay events per the same unit time.

It seems obvious, but now we know why this is true. It was just a consequence on how the characteristic function multiplied together.

This property is usually known “backwards” as “infinitely divisible property”. Suppose we have a group of radioactive elements with average of 10 decay events, then we can divide the group into 5 group where we can observe 2 decay events each.

In my opinion, the term infinitely divisible here is somewhat misleading because the probability distribution in this case is a discrete one. So it can only be meaningfully divided until each group has 1 decay events, which is the bare minimum for us to observe Poisson process for this unit of time.

Rizky Maulana Nugraha

Written by Rizky Maulana Nugraha
Software Developer. Currently remotely working from Indonesia.
Twitter FollowGitHub followers