I don’t have a degree in theoretical physics. The posts below are purely my personal opinion and do not reflect any contemporary academic view. Read them for your own amusement.
My very first experience with the concept of “entropy” came from a chemistry class in high school. At that time, I was told that entropy is a measure of randomness and/or chaos within a thermodynamic system. This is overly simplified, as you might already know. The “real” definition of entropy is built on the relation between microscopic and macroscopic descriptions. Physically, entropy is a measure of how microscopic arrangements of energy emerge into macroscopic quantities. It is a brilliant landmark of classical and statistical physics.
The formula of physical entropy can be written as:

$$S = k_B \ln W$$

$S$ is the entropy, $k_B$ is a physical constant called the Boltzmann constant, and $W$ corresponds to the number of microstate arrangements of the energy. It is a little bit difficult for me to explain what microstates are. However, you can think of them as the link between macroscopic quantities, which are things that we can measure directly, and microscopic quantities, which are the things that we cannot measure directly because there are so many of them.
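To make the numbers concrete, here is a minimal Python sketch of the formula (the function name and the toy value of $W$ are my own; only the CODATA value of the Boltzmann constant is standard):

```python
import math

k_B = 1.380649e-23  # Boltzmann constant in J/K (CODATA value)

def boltzmann_entropy(W: float) -> float:
    """Entropy S = k_B * ln(W) for a system with W equally likely microstates."""
    return k_B * math.log(W)

# A toy system with a million accessible microstates
print(boltzmann_entropy(1e6))  # ~1.9e-22 J/K
```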
Let’s imagine a box filled with air. It will have very many molecules, let’s just say N of them. Each of these air molecules has kinetic energy. All of them combined means the air inside the box has internal energy. But we can’t measure the individual kinetic energy of each air molecule. It’s just impossible. We can, however, measure the total internal energy.
The air inside the box will have a pressure P. This is a macroscopic quantity because it is attributed to the whole body of air inside the box, not to individual molecules. From pressure and volume together we can calculate the internal energy of the air, because for an ideal gas the internal energy is proportional to the product PV. Thus there should be some kind of relationship between the individual energies of the molecules and the energy that we measure directly.
Naturally, given N molecules, we expect some kind of statistical distribution of the energy levels of the molecules. To think about this concretely: suppose we have 4 molecules with energy levels ranging between 0 (the molecule is not moving) and some $E_{\max}$, the maximum energy a molecule can have. Then it may happen that 1 molecule has 0 energy, 2 molecules have 1 Joule of energy each, and 1 molecule has 2 Joules of energy. For this kind of arrangement, we can associate an average energy of the system. One way to calculate it is simply to divide the total energy of the system by the total number of molecules.
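For the 4-molecule arrangement above the arithmetic is tiny; here is a sketch in Python (the variable names are mine):

```python
energies = [0.0, 1.0, 1.0, 2.0]  # energy of each molecule, in Joules

total_energy = sum(energies)                   # 4 J
average_energy = total_energy / len(energies)  # 1 J per molecule
print(total_energy, average_energy)
```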
If we plot the energy level on the x axis and the number of particles that have that energy level on the y axis, we get something that we can call the spectral density graph of the energy. It is essentially a statistical distribution. If we imagine that the universe is fair, there is no reason for god to favor one particle over another, so each particle has an equal chance of having a certain energy level. This means the spectral graph can be read as a probability distribution!
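Concretely, dividing the particle counts by N turns that histogram into a probability distribution. A small sketch continuing the toy example (only the standard library is used):

```python
from collections import Counter

energies = [0.0, 1.0, 1.0, 2.0]
counts = Counter(energies)   # {0.0: 1, 1.0: 2, 2.0: 1} particles per energy level
N = len(energies)

# Normalize the counts so they sum to 1: a probability distribution over energy levels
probabilities = {E: n / N for E, n in counts.items()}
print(probabilities)  # {0.0: 0.25, 1.0: 0.5, 2.0: 0.25}
```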
Let’s stop for a moment and wonder. The concept above links the microscopic arrangement (what physicists call microstates) to its macroscopic values. A certain macroscopic property, like pressure or internal energy, can only emerge from a certain set of possible microscopic configurations (although there are many of them). Once we understand that, we can appreciate Boltzmann’s genius insight. But why does any of this have anything to do with information or computer science?
Well, you can guess that by yourself. After all, science is driven by curiosity. One of the more general definitions of physical entropy is the Gibbs entropy, and it is written like this:

$$S = -k_B \sum_i p_i \ln p_i$$

Here $p_i$ is the probability that a given microstate $i$ occurs for a certain energy level $E_i$. For the classical thermodynamic situation where all $W$ microstates are equally probable, $p_i = 1/W$ and the sum reduces to our previous entropy definition $S = k_B \ln W$.
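A minimal numerical sketch of the Gibbs formula, showing that a uniform distribution over $W$ microstates recovers $k_B \ln W$ (the function and variable names are mine):

```python
import math

k_B = 1.380649e-23  # Boltzmann constant in J/K

def gibbs_entropy(probabilities):
    """S = -k_B * sum_i p_i * ln(p_i); zero-probability states are skipped."""
    return -k_B * sum(p * math.log(p) for p in probabilities if p > 0)

# A uniform distribution over W microstates recovers the Boltzmann form S = k_B * ln(W)
W = 1000
uniform = [1 / W] * W
print(gibbs_entropy(uniform))
print(k_B * math.log(W))  # identical value
```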
Now, in computer science an entirely different concept was introduced: the Shannon information quantity. When I first saw it, I had a huge mind-blown moment. It is talking about a different thing, a different concept, yet the formula looks exactly the same:

$$H = -\sum_i p_i \log_2 p_i$$

Look at it: the only difference is that the constant disappears (and the base of the logarithm, which only changes the unit). If you think about it, it really makes sense. The constant is needed in the physical definition because you already use energy units for the macroscopic quantity, meaning the derivation goes from the top down. In the information/computer science definition there is no constant, because you define the thing from the bottom up.
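Here is a quick sketch that computes both quantities for the same distribution, to make the “same formula up to a constant” point concrete (the side-by-side comparison is my own illustration):

```python
import math

k_B = 1.380649e-23  # J/K

def shannon_entropy(probs):
    """H = -sum_i p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gibbs_entropy(probs):
    """S = -k_B * sum_i p_i * ln(p_i), in J/K."""
    return -k_B * sum(p * math.log(p) for p in probs if p > 0)

probs = [0.25, 0.5, 0.25]
H = shannon_entropy(probs)
S = gibbs_entropy(probs)
print(H)                        # 1.5 bits
print(S / (k_B * math.log(2)))  # same 1.5: S differs only by the factor k_B * ln(2)
```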
So why do two different concepts use the same formula? Maybe they are closely related? The quantity $H$ here is sometimes called the information entropy, obviously because the scientists before us noticed the resemblance to physical entropy. In fact they are so interchangeable that it feels seamless and natural to think of a physical arrangement of things as being, by itself, information.
But, mind you, one important reason why I think this is really cool is the possibility of deeper connections. We are used to thinking that arrangements of physical objects contain information. But what about the other way around? Could it be that physical phenomena emerge from encoded information, and that they are just one and the same thing?
I learned about physical entropy way before I knew about Shannon’s definition of information entropy. The physical derivation of entropy is lengthy and involves many physical terms such as microstates/macrostates, pressure, volume, internal energy, etc. In contrast, information entropy (IMHO) can be derived and deduced easily from first principles. It is very elegant and based on pure mathematical modelling. It really surprised me that such different derivations result in the same formula.
The physical derivation might be difficult to follow for people unfamiliar with physics, but the derivation of information entropy is very intuitive.
We start by familiarizing ourselves with what data and information really mean. In layman’s terms, data is a collection of quantities. Information, meanwhile, is how we interpret data. Information can change depending on how much data we have. For example, in the dry season we know that rain is unlikely. So based on past historical data, the chance that it will rain tomorrow in the dry season is very low, and our information concludes that tomorrow is not likely to be rainy. However, suppose tomorrow it rains hard. That means we have new data. Our information changes because the new data is interesting (it rained despite the low probability). If after that we get a week straight of rain, we have to change our information again: maybe something is going horribly wrong with the climate. That is the kind of example I mean.
But the explanation above is qualitative. We need something that can be measured quantitatively, something that lets us predict. We want to create a mathematical model that fits our concept of information. For this, we define 4 basic properties of information. These serve as the fundamental principles and base axioms.
- If an event has a low chance of happening, observing that event is more meaningful and impactful than information from a high-occurrence event.
This is very intuitive to understand. We know that the sun rises in the east because we see it every day. We assume it’s going to be like that tomorrow. However, if for some reason the sun rose in the west tomorrow, people would freak out, because this is such important information. Could it be judgment day?
- There is no such thing as new information that deletes previous information. New information can only update it.
If you retrieve new data, it will always add information. It’s reasonable to think that new data will not give you amnesia. It will just make you wiser.
- An event that happens every time can be safely ignored, because it’s not that important.
You can think about the sun like before. Or, for something more concrete: you know that you have a nose and your eyes actually see your nose, but your brain totally ignores it in your vision. This is because seeing your nose 100% of the time is not important for you. Another way to think about it: if you watch a sport where one team always wins, the matches are not interesting, because you already know that the next match is going to be won by that team.
- Information is additive. If you receive new information from two independent events, the informational values simply add together.
This might be less obvious, so think of it like this: if you have two different independent hobbies, you can enjoy both of them additively. There’s no reason to force yourself to enjoy just one of them. Another way to think of it is by watching the news. When you learn about some event in city A and some event in city B, and those are independent events, then you hold both pieces of information as if both happened at the same time (see the small numerical sketch after this list).
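To preview where this is going, here is a small numerical sketch of the additivity property using the $-\log_2 p$ measure that will be derived below (the probabilities are made up for illustration):

```python
import math

def info_bits(p: float) -> float:
    """Information content (surprisal) of an event with probability p, in bits."""
    return -math.log2(p)

p_a = 0.5    # event in city A
p_b = 0.125  # independent event in city B

# For independent events, P(A and B) = P(A) * P(B),
# and the information contents add: I(A and B) = I(A) + I(B).
print(info_bits(p_a) + info_bits(p_b))  # 1 + 3 = 4 bits
print(info_bits(p_a * p_b))             # also 4 bits
```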
Now let’s convert those statements into mathematical statements, a model. Because we rely so much on statistical events to define information, it is natural to choose probability as our mathematical language.
We define information as a function $I(p)$, whose input is the probability $p$ of an event happening. That means information is a function which satisfies:
- $I(p)$ should decrease or stay the same (monotonically decrease) as $p$ becomes more probable.
- $I(p)$ is always additive; that means it can never be a negative value, so $I(p) \ge 0$.
- $I(1) = 0$: certain things have no informational value.
- $I(a \cdot b) = I(a) + I(b)$: the information from two independent events $a$ and $b$ is just the addition of both independent pieces of information.
From these 4 properties we can guess what kind of function $I$ will be. Property number 4 is the biggest hint of all and, in my opinion, the most remarkable.
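Spoiling the punchline slightly: the function that falls out of the derivation below is $I(p) = -\log p$ (up to the choice of logarithm base). Here is a quick numerical check, which I added myself, that this candidate satisfies all four properties:

```python
import math

def I(p: float) -> float:
    """Candidate information function: I(p) = -log2(p)."""
    return -math.log2(p)

# 1. Monotonically decreasing: rarer events carry more information
assert I(0.01) > I(0.5) > I(0.99)

# 2. Non-negative for any valid probability 0 < p <= 1
assert all(I(p) >= 0 for p in (0.001, 0.25, 0.5, 1.0))

# 3. Certain events carry no information
assert I(1.0) == 0.0

# 4. Additive over independent events: I(a*b) == I(a) + I(b)
a, b = 0.3, 0.7
assert math.isclose(I(a * b), I(a) + I(b))

print("All four properties hold for I(p) = -log2(p)")
```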
Start with property #4. Because $a$ and $b$ are independent events, we can take the partial derivative with respect to each of them in turn. Assuming that $I$ can be differentiated twice, we get:

$$\frac{\partial^2}{\partial a\,\partial b} I(ab) \;=\; I'(ab) + ab\,I''(ab) \;=\; \frac{\partial^2}{\partial a\,\partial b}\big[I(a) + I(b)\big] \;=\; 0$$

By substituting $p = ab$, justified by the fact that $ab$ is just a joint probability (so it can be an input of the function $I$), we now have a simple differential equation: