Disclaimer:

I don’t have a degree in theoretical physics. The posts below are purely my personal opinion and do not reflect any contemporary academic view. Read them for your own amusement.

Intro

The very first experience I had with the concept of “entropy” came from a chemistry class in high school. At that time, I was told that entropy is a measure of randomness and/or chaos within a thermodynamic system. This is overly simplified, as you might already know. The “real” definition of entropy comes in two flavors, one based on macroscopic relations and one based on microscopic ones. In the physical sense, entropy is a measure of how the microscopic arrangement of energy gives rise to a macroscopic quantity. It is a brilliant landmark of classical and statistical physics.

The formula of physical entropy can be written as:

S = k \ln(W)

$S$ is the entropy, $k$ is a physical constant called the Boltzmann constant, and $W$ is the number of microstates, that is, the number of microscopic arrangements of the energy that are consistent with what we observe. It is a little bit difficult for me to explain what microstates are. However, you can think of them as the link between macroscopic quantities, which are things that we can measure directly, and microscopic quantities, which are the things that we cannot measure directly because there are so many of them.

Let’s imagine a box with air. It will contain very many molecules; let’s just say $N$ of them. Each of these air molecules has kinetic energy. All of them combined give the air inside the box its internal energy. But we can’t measure the individual kinetic energy of each air molecule. It’s just impossible. We can, however, measure the total internal energy.

The air inside the box has a pressure $P$. This is a macroscopic quantity because it is attributed to the whole body of air inside the box, not to individual molecules. For an ideal gas, the internal energy is proportional to the product of pressure and volume, so these two measurements give us a direct way to calculate it. Thus there should be some kind of relationship between the individual energies of the molecules and the energy that we directly measure.

Naturally, given $N$ molecules, we expect some kind of statistical distribution for the energy level of each molecule. To make this easier to think about, suppose we have 4 molecules with energy levels ranging from 0 (the molecule is not moving) to some $E_{max}$, the maximum energy a molecule can have. Then it may happen that 1 molecule has 0 energy, 2 molecules have 1 joule of energy each, and 1 molecule has 2 joules of energy. With this kind of arrangement, we can associate an average energy of the system. One way to calculate it is to divide the total energy of the system by the total number of molecules.
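The four-molecule example can be checked with a few lines of Python. This is only a minimal sketch: the name `occupation` is my own illustrative choice, and the numbers are the ones from the example.

```python
# Occupation numbers from the example: energy level in joules -> number of molecules.
occupation = {0.0: 1, 1.0: 2, 2.0: 1}

total_molecules = sum(occupation.values())
total_energy = sum(energy * count for energy, count in occupation.items())

# Average energy per molecule = total energy / number of molecules.
average_energy = total_energy / total_molecules

# Fraction of molecules found at each energy level.
fractions = {energy: count / total_molecules for energy, count in occupation.items()}

print(total_energy, average_energy, fractions)
```

The total energy is 4 joules shared by 4 molecules, so the average comes out as 1 joule per molecule.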

If we plot the energy level on the x axis and the number of particles that have that energy level on the y axis, we get something that we can call the spectral density graph of energy. It is essentially a statistical distribution. If we imagine that the universe is fair, there is no reason for god to favor one particle over another, so each particle has an equal chance of having a certain energy level. This means that the spectral graph, once divided by the total number of particles, can be read as a probability distribution!

Let’s stop for now and wonder for a second. The concept above links the microscopic arrangement (or what physicists call microstates) with its macroscopic values. Certain macroscopic properties, like pressure or internal energy, can only emerge from certain microscopic configurations (although there are many of them). Once we understand that, we can appreciate Boltzmann’s genius insight. However, what does it have to do with information or computer science?

Well, you can guess that by yourself. After all, science is driven by curiosity. One of the general definitions of physical entropy is the Gibbs entropy, and it is written like this:

S = -k \sum_i p_i \ln p_i

Here $p_i$ is the probability that a given microstate $i$, with energy level $E_i$, occurs. For the classical thermodynamic situation, it just reduces to our previous entropy definition, and the microstate probability terms simplify to just $W$.
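To see that reduction explicitly, assume (as a sketch, not a full derivation) that there are $W$ accessible microstates and that each one is equally likely, so $p_i = \frac{1}{W}$ for every $i$. The sum then has $W$ identical terms:

S = -k \sum_{i=1}^{W} \frac{1}{W} \ln \frac{1}{W} = -k \ln \frac{1}{W} = k \ln W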

Now, in computer science an entirely different concept was introduced. It is called the Shannon information quantity. When I first saw it, my mind was blown. It talks about a different thing, a different concept, yet the formula looks exactly the same.

H = - \sum_i p_i \log_b p_i

Look at it: the only difference is that the constant disappears (and the logarithm can use any base $b$). If you think about it, it really makes sense. The constant is needed in the physical definition because you already use energy units for the macroscopic quantity, meaning the derivation goes from the top down. In the information/computer science definition, there is no constant because you define the thing from the bottom up.

So why do two different concepts use the same formula? Maybe they are closely related? The quantity $H$ here is sometimes called the information entropy, obviously because scientists before us noticed the resemblance to physical entropy. In fact, they are so interchangeable that it feels seamless and natural to think that a physical arrangement of things is by itself information.

But, mind you, one important reason why I think this is really cool is the possibility of deeper connections. We are used to thinking that arrangements of physical objects contain information. But what about the other way around? Can it be that physical phenomena emerge from encoded information, and that they are one and the same thing?

The derivation of the Information Entropy

I learned about physical entropy way before I knew about Shannon’s definition of information entropy. The physical derivation of entropy is lengthy and involves many physical terms such as microstate/macrostate, pressure, volume, internal energy, etc. In contrast, information entropy (IMHO) can be derived and deduced easily from first principles. It is very elegant and based on pure mathematical modelling. It really surprises me that such different derivations result in the same formula.

The physical derivation might be difficult to follow for people unfamiliar with physics, but the information entropy is very intuitive.

We start by familiarizing ourselves with what information and data really mean. In layman’s terms, data is a collection of quantities. Meanwhile, information is how we interpret data. Information can change depending on how much data we have. For example, in the dry season we know that it is less likely to rain. So, based on historical data, the chance that it will rain tomorrow in the dry season is very low. Our information says that tomorrow is not likely to be rainy. However, it turns out that tomorrow it rains hard. That means we have new data. Our information changes because the new data is interesting (it rained even though the probability was low). If after that we get a straight week of rain, we have to update our information again. Something must be going horribly wrong with the climate to cause this.

But the explanation above is qualitative. We need something that can be measured quantitatively, used for predictions, and so on. We want to create a mathematical model that fits our concept of information. For this, we define 4 basic properties of information. These serve as the fundamental principles and base axioms.

If an event has a low chance of happening, the information we gain from observing it is more meaningful and impactful than the information from an event that happens often.

This is very intuitive. We know that the sun rises in the east because we see it every day. We assume it is going to be like that tomorrow. However, if for some reason the sun rises in the west tomorrow, people would freak out because this is such important information. Could it be judgment day?

There is no such thing as new information that deletes previous information. It can only update it.

If you retrieve new data, it will always add new information. It’s reasonable to think that new data will not give you amnesia. It will just make you wiser.

An event that happens every time can be safely ignored, because it’s not that important.

You can think about the sun like before. Or, for something more concrete: you know that you have a nose, and your eyes actually see your nose, but your brain totally ignores it in your vision. This is because seeing your nose 100% of the time is not important for you. Another way to think about it: if you watch a sports match where one team always wins, the match is not interesting, because you kind of know that the next match is going to be won by that team.

Information is additive. If you receive new information from two independent events, the informational values simply add together.

This might be less obvious, so think of it like this: if you have two different, independent hobbies, you can enjoy both of them additively. There’s no reason for you to force yourself to enjoy just one of them. Another way to think of it is the news. When you learn from the news about some event in city A and another in city B, and those are independent events, then you have both pieces of information, as if both happened at the same time.

Now let’s convert those statements into mathematical statements or models. Because we rely so often on statistical events to define information, it is natural to choose statistics and probability as our mathematical language.

We define information as a function $I$ whose input is the probability $p_i$ that an event $i$ happens. That means information is a function $I(p_i)$ which satisfies:

1. $I(p_i)$ should decrease or stay the same (monotonically decrease) as event $i$ becomes more probable.
2. $I(p_i) \geq 0$, information can never be a negative value
3. $I(1)=0$, certain things have no informational value
4. $I(p_a p_b)=I(p_a)+I(p_b)$, information from two independent events $a$ and $b$ is just the sum of the two independent pieces of information

From these 4 properties we can guess what kind of function $I$ will be. Property number 4 is the biggest hint of all and, in my opinion, the most remarkable.

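Start with property #4. Because $a$ and $b$ are independent events, we can vary $p_a$ and $p_b$ independently and take the partial derivative with respect to each: first with respect to $p_a$, then with respect to $p_b$. Assuming that $I$ can be differentiated twice, we get: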

I(p_a p_b)=I(p_a)+I(p_b) \\ p_b I'(p_a p_b)= I'(p_a) \\ I'(p_a p_b) + p_a p_b I''(p_a p_b)=0 \\

By substituting $p_a p_b = x$, justified by the fact that $x$ is just a joint probability (so it can be an input of the function $I$), we now have a simple differential equation:

I'(x)+xI''(x)=0 \\ \frac{dx}{x} = - \frac{dI'}{I'} \\ \ln x = - \ln \frac{I'}{I_0'} \\ \frac{dI}{dx}=\frac{I_0'}{x} \\ I(x)=I_0' \ln x

Here $I_0'$ is an integration constant, and the second integration constant vanishes because $I(1)=0$. By choosing $I_0'$ to be less than 0, we satisfy all 4 properties. The derivation is really simple and straightforward, and we can conclude that the information quantity is a function $I(p_i)=-k \ln p_i$ with $k$ an arbitrary positive constant.
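As a sanity check, here is a minimal Python sketch that tests $I(p)=-k \ln p$ against the 4 properties on a few sample probabilities. The function name `self_information` and the sample values are my own choices, and checking a handful of numbers is of course not a proof.

```python
import math

def self_information(p, k=1.0):
    """Information content I(p) = -k * ln(p) of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -k * math.log(p)

# Property 1: rarer events carry more information.
rare, common = self_information(0.01), self_information(0.9)

# Property 2: information is never negative.
non_negative = all(self_information(p) >= 0.0 for p in (0.1, 0.5, 0.99, 1.0))

# Property 3: a certain event carries no information.
certain = self_information(1.0)

# Property 4: information from independent events adds up.
p_a, p_b = 0.5, 0.2
joint = self_information(p_a * p_b)
separate = self_information(p_a) + self_information(p_b)

print(rare > common, non_negative, certain == 0.0, math.isclose(joint, separate))
```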

The function $I(p_i)$ represents the information content of an event $i$. However, for practical reasons we almost always deal with a collection of events rather than a single event. If we have $N$ events, we know the information content of each one individually. But what we are really asking is: do they have collective information, that is, information that represents the whole collection of data/events?

The most straightforward way of modelling that is to calculate the average. If we have $N$ pieces of information, let’s just calculate the average, right? We will now define the average information $H$.

H= \sum_i p_i I(p_i)

Note that the formula above is just a statistical average. Nothing fancy. If you have $N$ quantities, where each quantity $i$ has probability $p_i$ and value $I(p_i)$, the average is just a weighted sum. In the lingo of probability theory, this is called the expected value.

But now, notice what happens when we expand $I(p_i)$ .

H= -k \sum_i p_i \ln p_i

This is an analog of physical entropy. So, information entropy is actually the average of information. This part is kind of mind-blowing for me. In high school, I always thought of entropy as a difficult concept in chemistry/physics. It was an abstract measure of randomness, something that I couldn’t concretely imagine. But if we define it this way, it fits naturally. Information entropy is just a fancy way of saying average information. I would even bet that we call this information entropy because the physical term came before the computer science one. If information theory had come first, physical entropy might have been called average physical information.
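The weighted sum can be written as a short Python function. This is a sketch under my own naming (`average_information`), and the three-outcome distribution is a made-up example. The `base` argument plays the role of the constant: choosing base $b$ is the same as choosing $k = \frac{1}{\ln b}$.

```python
import math

def average_information(probabilities, base=2.0):
    """Average information H = -sum(p_i * log_base(p_i)) of a distribution."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 never occur, so they contribute nothing to the average.
    return sum(-p * math.log(p, base) for p in probabilities if p > 0.0)

# Hypothetical source with three outcomes.
distribution = [0.5, 0.25, 0.25]

h_bits = average_information(distribution)               # base 2, unit: bits
h_nats = average_information(distribution, base=math.e)  # natural log, k = 1

print(h_bits, h_nats)
```

For this distribution the average works out to 1.5 bits: half of the time you learn 1 bit, and the other half you learn 2 bits.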

Continuing a little bit: even if we know what average information means, it is helpful to imagine why we need to consider it or use it. The average information is important because it tells us whether a collection of data is important or not. Simply speaking, if the average information is 0, the data does not have any informational significance. This is because if the average is 0, we can be sure that every event that can actually occur has 0 informational value, since information cannot be a negative number. In contrast, if the average information is very high, the dataset has high informational value, meaning most of the data points carry a lot of information. It is such an intuitive concept.

Do we have some kind of maximum value for average information? Yes! For a fixed number of possible outcomes, the average information is largest when every outcome is equally probable, because then no outcome can be guessed better than any other.
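A quick numerical comparison makes the maximum visible. The sketch below uses three made-up four-outcome sources and compares their average information with $\log_2 4$, the value for equally probable outcomes. It only illustrates the claim on these examples and does not prove it.

```python
import math

def entropy_bits(probabilities):
    """Average information in bits; zero-probability outcomes are skipped."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0.0)

# Hypothetical four-outcome sources, from fully predictable to uniform.
sources = {
    "certain": [1.0, 0.0, 0.0, 0.0],
    "skewed": [0.7, 0.1, 0.1, 0.1],
    "uniform": [0.25, 0.25, 0.25, 0.25],
}

results = {name: entropy_bits(p) for name, p in sources.items()}
upper_bound = math.log2(4)  # log2(number of outcomes)

for name, h in results.items():
    print(f"{name}: {h:.3f} bits (upper bound {upper_bound:.3f})")
```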

Consider a fair coin toss. There are two possible outcomes, A and B: A if the coin lands heads, B if it lands tails. Do the coin toss often enough and you will have a statistical distribution. If the coin is really fair, $p_A=p_B=\frac{1}{2}$ because there are two equally likely outcomes. If we want to encode this information in bits (binary digits), we use the logarithm in base 2, which corresponds to choosing $k=\frac{1}{\ln 2}$. That means the average information is:

H = -k \sum_i p_i \ln p_i = - \sum_i p_i \log_2 (p_i) \\ H = - p_A \log_2 p_A - p_B \log_2 p_B \\ H = - \frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} \\ H = 1

What about a die roll? It has 6 possible outcomes or states, each with probability $\frac{1}{6}$. In bits, the average information is:

H = - \log_2 \frac{1}{6} \\ H \approx 2.58

Now, there could be an AHA moment in your head if you are a computer science student. There is a reason why I said the unit of $H$ is bits. For the die roll, you need on average about 2.58 bits to perfectly convey the outcome. But in a digital system a single roll can only be stored in a whole number of bits, so at minimum you need 3 bits.

For example:

000 —> if dice roll results in 1

001 —> if dice roll results in 2

010 —> if dice roll results in 3

011 —> if dice roll results in 4

100 —> if dice roll results in 5

101 —> if dice roll results in 6

The encoding above is one possible way to represent the 6 results of a die roll. Of course, the codes 110 and 111 are not mapped to any result, because 3 bits is more than what the average information requires.
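The table above can be generated programmatically. The following is a minimal sketch; the variable names are my own, and assigning codewords in counting order is just one arbitrary choice among many valid mappings.

```python
import math

outcomes = [1, 2, 3, 4, 5, 6]

# Smallest whole number of bits that can label every outcome.
bits_needed = math.ceil(math.log2(len(outcomes)))

# Assign consecutive binary codewords, matching the table above.
codebook = {face: format(index, f"0{bits_needed}b") for index, face in enumerate(outcomes)}

# Codewords of that length that are left without a die face.
all_codes = [format(i, f"0{bits_needed}b") for i in range(2 ** bits_needed)]
unused = [code for code in all_codes if code not in codebook.values()]

print(bits_needed, codebook, unused)
```

With 3 bits there are 8 possible codewords but only 6 faces, so two codewords stay unused.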

From this concept, you can now easily map to the concept of data compression. Lossless data compression means encoding a message with fewer bits while still carrying the same information. A long message and a short message have the same informational value if the information they carry is exactly the same. If that is the case, sending the shorter message is more efficient.
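To make this concrete, here is a small sketch using Huffman coding. Huffman coding is not discussed in this post; I use it only as one well-known lossless scheme. The four-symbol source and its probabilities are made up. The code compares the average codeword length with the average information of the source.

```python
import heapq
import math

def huffman_code_lengths(probabilities):
    """Return the Huffman codeword length (in bits) for each symbol."""
    # Heap entries: (probability, unique tie breaker, symbols in this subtree).
    heap = [(p, i, [symbol]) for i, (symbol, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = {symbol: 0 for symbol in probabilities}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, symbols1 = heapq.heappop(heap)
        p2, _, symbols2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        for symbol in symbols1 + symbols2:
            lengths[symbol] += 1
        heapq.heappush(heap, (p1 + p2, counter, symbols1 + symbols2))
        counter += 1
    return lengths

# Hypothetical four-symbol source with a skewed distribution.
source = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

lengths = huffman_code_lengths(source)
average_length = sum(source[s] * lengths[s] for s in source)
entropy = sum(-p * math.log2(p) for p in source.values())

print(lengths, average_length, entropy)
```

A fixed-length code would spend 2 bits on every symbol. Here the frequent symbol gets a short codeword and the rare ones get long codewords, and for this particular source the average length equals the average information, 1.75 bits per symbol.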

The most amazing thing about all of this is that the meaning of the message is not important at all! If two messages have the same informational value or average information, they are the same message as far as this theory is concerned.

Previously, I mentioned that physical entropy historically provided the insight to define information entropy. But now, let’s think the other way around. What if we lived in a world where computer science was developed first, as a foundation for statistical physics? The link to recover physical entropy from average information would be to think of a physical process as a computer simulation. If I think of a physical process (or life, philosophically speaking) as just a computer simulation, I can understand physical entropy better.

Deriving physical entropy from average information

Consider a system of many particles or molecules inside a box with fixed volume. The air particles inside are moving, so each particle has a certain kinetic energy. If we assume that the particles are independent of each other, we get a statistical/probability distribution that tells us how many particles share the same energy level.

We can then think of each particle as a datum, and the collection of particles as data. The information that we get from each particle is its energy level. The information function $I(p_i)$ is a function of the probability of the energy state being inspected. From that, it follows that the average information should correspond to the average energy of the system somehow.

Just like with the coin toss or die roll, we have different states of energy, and we call each energy state $E_i$. Because this is an isolated system, the total energy should not change, whatever happens between the particles.

The probability $p_i$ corresponds to the fraction of particles that have energy state $E_i$. We use the general definition of Shannon’s entropy/average information:

H= -k \sum_i p_i \ln p_i

However, Boltzmann said there is no reason for a particle to prefer a specific energy level $E_i$. That means every energy level is equally probable. So, if the process is given a long enough time, it will reach thermal equilibrium, where every energy level $E_i$ has the same probability. That means every $p_i$ is the same, and the formula reduces to:

H = -k \ln p

Here $p$ is now a single probability shared by all states. As we can see, the average information $H$ evolves given enough time. However, since information is additive, it can only increase. This means $H$ is now at its maximum available value. We conclude that there is something constraining the average information so that it reaches a maximum value.

Now remember that so far we have only calculated the internal self-information of the system, which is the individual information of the particles. But in practice, in physics, what we can directly observe or measure are the macroscopic quantities, not the individual energies of the particles (because there are just too many of them and they are impossible to track!). If we think of an observation or measurement process as some sort of noise-free communication, we arrive at an important link: the average information that gets passed to the observer must stay the same.

Because the information contained inside the box must be the same as the average information that the observer measures, whatever quantity the observer can measure has to be directly proportional to the average information value. For an ideal gas system, we have 3 main macroscopic quantities: pressure, volume, and temperature.

In thermodynamics, specifically the first law of thermodynamics, the energy moving in and out of the system must be conserved, implying:

dE = \delta Q + \delta W

There are several different formulations, but they all say the same thing. The left side is the change in internal energy of our system, and the right side is the heat and the work that go into or out of the system.

Note that for our system of ideal gas, the volume and pressure stay the same (the box doesn’t change). This means the only macroscopic quantity that can correspond to our average information is the temperature. Because we assume that the act of observation corresponds to a transfer of information, the temperature that we observe changes. This implies that there is heat transfer.

Let’s summarize for now. Because the pressure and volume remain the same, there is no work. That means $\delta W = 0$.

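There is a heat exchange. Because we transfer information from the box to the observer, the observer receives a positive amount of heat. With the sign convention of the equation above, where $\delta Q$ is the heat added to the box, that means $\delta Q < 0$ for the box. Some energy is transferred to the observer, and thus the internal energy of the box decreases.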

So what happens? As $H$ evolves, the total energy decreases and the average information increases to a maximum value. When $H$ is at its maximum and can’t increase any more, the average information stays the same and the temperature becomes stable in equilibrium, because it corresponds to the average information.

Let’s suppose we want to find the final energy distribution $p$. Because the number of particles doesn’t change (well, obviously), if the distribution changes, that can only be caused by a change in total energy. However, we already said that the average information stays the same. This can only mean that we must find a way in which, when the temperature changes, the total energy changes but the average information stays the same.

Remember that we can’t observe internal energy directly, but we do observe the temperature as a macroscopic quantity. Relating this to the average information of energy that is communicated by the system, we get the following relation:

E \propto H T

This reads as: the actual total energy of the system, which we can’t observe directly, should be proportional to the average information of energy $H$ times the macroscopic quantity that we measure, $T$ (temperature). Why $T$ and not $\frac{1}{T}$? We would expect (by common sense in everyday life) that if the temperature increases, so does the internal energy. Hot things are more destructive anyway. We could also bring in other macroscopic variables, like pressure and volume. But in our case we have set the pressure and volume to remain fixed. Well, actually only the volume is fixed, but regardless, that means no net work is done by the system even if the pressure changes.

From this, in thermal equilibrium where we have our energy distribution $p$, if there is a change in temperature and total energy from state A to B, and we substitute the formula for $H$, we get:

\frac{E}{T}=-k \ln p \\ p = e^{-\frac{E}{kT}}

Remember that a probability distribution should sum to 1. That means we should patch it with a normalizing factor $Z$, where $Z$ is the total we get if we add up $e^{-\frac{E}{kT}}$ over all the available energy levels.

p = \frac{1}{Z} e^{-\frac{E}{kT}}
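The normalization step is easy to try numerically. Below is a minimal sketch: the function name `boltzmann_distribution` is mine, and the four evenly spaced energy levels are made-up values chosen to be of the same order as $kT$ at room temperature.

```python
import math

BOLTZMANN_K = 1.380649e-23  # Boltzmann constant in J/K

def boltzmann_distribution(energies, temperature):
    """Return p_i = exp(-E_i / (k T)) / Z for each energy level E_i."""
    weights = [math.exp(-e / (BOLTZMANN_K * temperature)) for e in energies]
    partition_function = sum(weights)  # the normalizing factor Z
    return [w / partition_function for w in weights]

# Hypothetical, evenly spaced energy levels (in joules) at room temperature.
energies = [0.0, 2.0e-21, 4.0e-21, 6.0e-21]
probabilities = boltzmann_distribution(energies, temperature=300.0)

print(probabilities, sum(probabilities))
```

After dividing by $Z$ the probabilities add up to 1, and the higher energy levels come out less likely than the lower ones.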

The probability distribution above can change depending on how you count the energy levels. For example, for the kinetic energy of an ideal gas in 3 dimensions, with molecules of mass $m$ and velocity $v$, it becomes (I will spare you the derivation):

p = ( \frac{m}{2\pi k T})^{\frac{3}{2}} e^{-\frac{mv^2}{2kT}}
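One way to convince yourself that the prefactor is right is to check numerically that this density integrates to 1. The sketch below reads the formula as a probability density over the velocity vector, so summing over all directions contributes a shell volume of $4 \pi v^2 dv$ at each speed $v$. The molecule mass (roughly one nitrogen molecule), the temperature, the cutoff speed, and the step count are all my own assumptions.

```python
import math

BOLTZMANN_K = 1.380649e-23  # Boltzmann constant in J/K

def velocity_density(speed, mass, temperature):
    """Maxwell-Boltzmann probability density per unit volume of velocity space."""
    prefactor = (mass / (2.0 * math.pi * BOLTZMANN_K * temperature)) ** 1.5
    return prefactor * math.exp(-mass * speed ** 2 / (2.0 * BOLTZMANN_K * temperature))

def total_probability(mass, temperature, max_speed=5000.0, steps=50000):
    """Integrate the density over all directions and all speeds up to max_speed."""
    dv = max_speed / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * dv  # midpoint rule
        # 4*pi*v^2*dv is the volume of a thin spherical shell in velocity space.
        total += 4.0 * math.pi * v ** 2 * velocity_density(v, mass, temperature) * dv
    return total

# Assumed values: roughly the mass of one N2 molecule, at room temperature.
nitrogen_mass = 4.65e-26  # kg
total = total_probability(nitrogen_mass, temperature=300.0)

print(total)
```

The cutoff of 5000 m/s is far above typical molecular speeds at this temperature, so the part of the distribution that gets cut off is negligible.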

So now we have the probability distribution $p$. If we look back at the average information formula and match it with the classical thermodynamic entropy $S$, we see that average information is in fact physical entropy in this case.

To be more specific, the classical thermodynamic entropy is defined as:

dS = \frac{\delta Q}{T}

If we interpret it as average information, it becomes much more intuitive. The change in average information that the observer retrieves should correspond to the heat transferred (more heat, more information), and be inversely proportional to the current average information that the observer possesses, which is the temperature.

Once we assume that heat is just information transfer, it’s like we are comparing the average information of the system ( $S$ ) with the average information that the observer currently has ( $T$ ). In thermal equilibrium it would make sense that no heat is transferred, because both system and observer have achieved a mutual understanding and have the same average information.

This would also directly explain why the second law of thermodynamics has to be that way. It is just a knowledge transfer.

If you have a system with a certain entropy/average information and it is connected to an environment, information will flow as long as there is a difference in average information, until both systems have the same level of information. However, each system can only become wiser. That’s why the average information can only increase, and hence entropy can only increase or stay the same.

Naturally, information will flow using energy as the medium (via heat transfer) from the system with low entropy (less information) to the system that has high entropy (more information). It’s just the same as communication in computer science. To achieve mutual understanding, the agent that doesn’t know anything about the other agent must receive more information.

Well, to put it in layman’s terms: if you are stupid about your surroundings, you will learn more from your environment. Entropy is that.

In another article, maybe it’s possible to make some interactive simulation to see the results.