Deriving probability distribution from entropy: part 2
29-Dec-2023
Yesterday, I was planning to make a follow-up article to this part 1, so I googled my previous article to reread it first.
Accidentally, I found this awesome article talking about the same thing! Heck, it even uses a similar title: “Deriving probability distribution using the Principle of Maximum Entropy”. So I guess this method is very common, since that article is dated 2017.
Poisson distribution
Originally, I only planned to write up to the Normal distribution. But at some point I saw a question in my Twitter feed asking why atoms have to “randomly” decay, such that half of them are gone in a given interval. It seems as if they “break causality”, since they decay randomly without a cause.
Before we derive the distribution, let’s talk about the characteristics of a Poisson process.
Atomic decay is actually the perfect example of it. Let’s say a single nucleus decays with a certain probability. We now concern ourselves with the probability for a group of nuclei: what is the decay rate of the whole group? It is not simply the sum over each individual nucleus. As time goes on, the normalization factor changes, because the total number of nuclei changes. So the distribution actually measures the probability of a given total number of independent events, when the probability of a single event is known.
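Before going further, here is a quick simulation sketch to make that picture concrete (this is just my own illustration; all the numbers in it are arbitrary assumptions): every surviving nucleus decays independently with the same small probability per time step, and we count how many decay events land in each fixed observation interval.

```python
import numpy as np

# Toy model of memoryless decay (illustrative values, not physical ones):
# every surviving nucleus decays with the same small probability each step.
rng = np.random.default_rng(42)

p_decay = 5e-8            # per-nucleus decay probability per time step (assumed)
n_nuclei = 1_000_000      # initial number of nuclei (assumed)
steps_per_interval = 100  # time steps that make up one observation interval
n_intervals = 1_000       # how many intervals we observe

alive = n_nuclei
counts = []
for _ in range(n_intervals):
    decays = 0
    for _ in range(steps_per_interval):
        d = rng.binomial(alive, p_decay)  # decays during this time step
        alive -= d                        # the population (normalization) shrinks
        decays += d
    counts.append(decays)

counts = np.array(counts)
# For a Poisson-like count distribution, the mean and variance come out close.
print("mean events per interval:", counts.mean())
print("variance of events per interval:", counts.var())
```

With these numbers the depletion over the whole run is negligible, so the event rate stays roughly constant; the histogram of `counts` is exactly the kind of “number of events per fixed interval” distribution we are about to derive.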
Our first constraint is still the normalization axiom.
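Written out in the notation I will use below, with $p_n$ standing for the probability of observing $n$ events in the interval, this reads:

$$\sum_{n=0}^{\infty} p_n = 1$$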
The second constraint is taken from the context. A Poisson process is characterized by a “memoryless” property: each event doesn’t care about the history of the previous events. In the case of the atomic decay above, a single decay event doesn’t care whether the current number of atoms is 1000 or just 2; it will decay with the same probability. However, since what humans observe is the time interval between occurrences, the distribution we record is “the probability of the number of nuclei becoming half, given a specified interval”. So the input of the distribution is time instead of a count of occurrences, which makes it appear somewhat counterintuitive to apply entropy to it.
From what the process describes, we know that the distribution is a function of two variables: the number of occurrences $n$ and the time interval $t$. But since the occurrence count $n$ is a discrete variable, $p_n$ is a discrete probability distribution. Our entropy formula is then a plain simple sum, instead of the integral that we used in the previous article.
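So, as a sketch in that same $p_n$ notation, the quantity we maximize is the plain discrete entropy:

$$S = -\sum_{n=0}^{\infty} p_n \ln p_n$$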
Next, since we want to count the probability of events happening in a fixed interval, we need to introduce another parameter. Notice that in the case of radioactive decay events, what we observe is time, so we have the variable $t$. But the distribution we want to make is for a fixed interval. That means we need to express it the other way around: for a given interval $t$, there can be a different number of decay events. Let’s just suppose that the average is $\lambda$ events per unit time. That would mean the average total number of events in an interval of length $t$ becomes $\lambda t$.
For practical purposes, $t$ is usually set to a specific unit of time that the observations can be converted into. For example, the unit can be 1 second or 1 minute or 1 hour, and the value of $\lambda$ will match accordingly. So, as a value, we can also say that $\lambda t = \lambda$ for most usages. I was being pedantic about the unit only because it is a habit from physics.
In other words, it is okay to swap the parameter $\lambda t$ with $\lambda$ as long as you understand what it means. The parameter $\lambda$ here acts as some sort of “frequency” in the sense of a probability distribution: it counts how many events happen on average per unit of time (kind of like the Hertz unit).
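In the same notation, the second constraint (a fixed average number of events per unit interval) can be written as something like:

$$\sum_{n=0}^{\infty} n\, p_n = \lambda \qquad \text{(or } \lambda t \text{ for a general interval } t\text{)}$$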
We are ready with the constraints.
Now consider what happens with the entropy for a single decay event. Entropy is additive in the self-information of each outcome. We are using a discrete distribution now, so we need to count it outcome by outcome:

$$S = \sum_n p_n I_n, \qquad I_n = -\ln p_n$$

where $I_n$ is the Shannon self-information we are currently considering.
When a single decay event happens, we will have information from the following constraints:
- It must satisfy the normalization axiom (the total probability over all $n$ is 1)
- It must have a fixed average frequency/rate of total events per unit time
- For a very large number of events, the probability of that many events happening must be increasingly small
Constraints 1 and 2 are straightforward because they are similar to the derivations in the previous article, in which we derived the Uniform distribution and the Gaussian distribution.
But the key here is the third constraint, which is the defining property of a Poisson process. If we only measure a single decay event (that is, $n = 1$), then the constraint disappears: its information is surely 0, because the probability of measuring that single event is, relative to the observation, a certainty. This is what happened with the Uniform distribution.
However, a Poisson process specifically tries to measure what happens when $n > 1$.
Since the condition is more specific, we can intuitively guess that the maximum entropy must be less than the entropy of the Uniform distribution.
Let us simulate how we gain this information. If $n = 1$, we observed only 1 decay event, which is a certainty relative to the observation. That means:

$$I_1 = -\ln 1 = 0$$

The probability of us observing the second event within the same interval should not be affected by the previous event. This is due to the memoryless property we talked about earlier. It then becomes like a coin flip, where the chance of the second event happening is just $\frac{1}{2}$.
Extending the analogy to a certain number of events $n$: the probability of the $n$-th event happening is just $\frac{1}{n}$, just like in a uniform distribution. But this information needs to add up, because in our case $n$ corresponds to the total number of events, not just the $n$-th event.
In summary, our 3rd constraint involves adding up all the self-information currently observed for $n$ events.
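Sketching that out with the $1/k$ reasoning above for the $k$-th event, the accumulated self-information for $n$ total events piles up into a factorial, so the third constraint amounts to fixing the average of $\ln n!$:

$$\sum_{k=1}^{n} \ln k = \ln n! \qquad\Rightarrow\qquad \sum_{n=0}^{\infty} p_n \ln n! = \text{const.}$$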
Let’s summarize our Lagrangian constraints:
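Collected in one place (same notation as above), they are roughly:

$$\sum_{n} p_n = 1, \qquad \sum_{n} n\, p_n = \lambda, \qquad \sum_{n} p_n \ln n! = \text{const.}$$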
Our full entropy function is as follows (I omit the input notation for brevity).
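A sketch of that function, using $\alpha$, $\beta$, $\gamma$ as labels for the Lagrange multipliers (my own labels, picked so they don’t collide with the rate $\lambda$):

$$\mathcal{L} = -\sum_n p_n \ln p_n + \alpha\Big(\sum_n p_n - 1\Big) + \beta\Big(\sum_n n\, p_n - \lambda\Big) + \gamma\Big(\sum_n p_n \ln n! - C\Big)$$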
This is a little bit different from the previous article, where we constructed the Lagrangian by differentiating with respect to the continuous variable to eliminate the constants. We can’t quite do that here, since the parameter $n$ is discrete. So instead, we treat the whole constrained sum above as the Lagrangian.
Notice that each term has a sum over $n$, due to the fact that we observed $n$ total events, instead of just one.
But the maximum entropy principle implies that the Lagrangian condition applies at each $n$-th event as well. So we could pick any arbitrary $n$, and the equation should still be the same. This allows us to remove all the sum signs (we observe a specific $n$-th event).
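Differentiating with respect to a single $p_n$ and setting the result to zero (still with the $\alpha$, $\beta$, $\gamma$ labels from the sketch above) gives something like:

$$-\ln p_n - 1 + \alpha + \beta n + \gamma \ln n! = 0
\qquad\Rightarrow\qquad
p_n = e^{\alpha - 1}\, e^{\beta n}\, (n!)^{\gamma}$$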
We got an expression, but it would be difficult to find each constant. We don’t even know yet where we are going to put $\lambda$ in it.
For now, if we set $n = 0$, then the probability function becomes:
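In the sketch above, setting $n = 0$ kills both the $\beta n$ term and the $\gamma \ln n!$ term (since $0! = 1$), leaving:

$$p_0 = e^{\alpha - 1}$$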
But this probability must be affected by the average rate $\lambda$. If the average rate of events is high, the probability of observing no events at all should be increasingly small. So we have some intuition that $p_0$ is directly tied to $\lambda$. We will replace it with a function of $\lambda$.