
Sketch of Proof: Deriving the Prime Counting Function from First Principles



This document attempts to reconstruct the path to the explicit formula for the prime counting function $\pi(x)$, building from first principles using information theory, empirical observations, and Fourier analysis. We explore a possible discovery narrative: deriving mathematical tools as they become necessary, then revealing their modern names.

Important disclaimer: This is an exploratory sketch, not a rigorous proof. We stand on the shoulders of giants—Gauss, Riemann, Chebyshev, and countless others who developed these profound ideas over centuries. Our goal is to explore one possible path of understanding, knowing that the actual historical development was far more complex and rigorous. There may be oversimplifications or gaps in this presentation; for complete rigor, consult standard texts on analytic number theory.

The exploration moves from empirical observations (Gauss’s tables) through information-theoretic reasoning (Maximum Entropy principle) to spectral analysis (Fourier transforms), attempting to arrive at Riemann’s explicit formula. Each step asks: “How might we naturally discover this if we approached it fresh, with these particular tools?”


1. Empirical Starting Point: The Staircase Shape

Historical Context: Counting Primes Before Modern Theory

In the late 18th century, mathematicians including Gauss and Legendre compiled extensive tables of prime numbers. By manually checking divisibility, they identified all primes up to several hundred thousand. Gauss, reportedly as young as 15 or 16 years old, studied these tables not to find individual primes, but to understand the pattern of their distribution.

The prime counting function $\pi(x)$ counts how many primes exist less than or equal to $x$. For example:

  • $\pi(10) = 4$, because the primes up to 10 are $\{2, 3, 5, 7\}$
  • $\pi(100) = 25$
  • $\pi(1000) = 168$
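These values are easy to check directly. A minimal sketch using the sieve of Eratosthenes (the helper names are ours) reproduces the table:

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: return all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for m in range(n * n, limit + 1, n):
                is_prime[m] = False
    return [n for n, flag in enumerate(is_prime) if flag]

def prime_pi(x):
    """The prime counting function pi(x)."""
    return len(primes_up_to(int(x)))

print(prime_pi(10), prime_pi(100), prime_pi(1000))  # 4 25 168
```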

The Staircase Pattern

When we plot $\pi(x)$ as a function of $x$, we observe a distinctive staircase curve. The function remains constant between primes, then suddenly jumps by 1 each time we encounter a prime number. This creates a step function with an irregular, yet not completely random, pattern of jumps.

Despite the irregularity of individual steps, Gauss noticed something remarkable: when viewed from afar, the staircase appears to follow a smooth, upward trend. The density of primes seems to decrease as numbers get larger, but in a predictable way.

Gauss’s Empirical Observation

Through careful analysis of his tables, Gauss conjectured that the average density of primes near a large number $x$ is approximately $\frac{1}{\ln x}$. This means the number of primes up to $x$ should be roughly:

$$\pi(x) \approx \int_2^x \frac{dt}{\ln t}$$

This integral is called the logarithmic integral $\mathrm{Li}(x)$. For large $x$, it behaves asymptotically like $\frac{x}{\ln x}$, giving us the famous asymptotic relation:

$$\pi(x) \sim \frac{x}{\ln x}$$

This conjecture was not proven until 1896 (independently by Hadamard and de la Vallée Poussin), when it became known as the Prime Number Theorem (PNT).
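Both approximations can be compared numerically; a sketch using a simple trapezoidal rule for $\mathrm{Li}(x)$ (the step count and helper names are our choices) shows $\mathrm{Li}(x)$ tracking $\pi(x)$ much more tightly than $\frac{x}{\ln x}$ does:

```python
import math

def Li(x, steps=10000):
    """Logarithmic integral Li(x) = ∫_2^x dt/ln t, via the trapezoidal rule."""
    h = (x - 2.0) / steps
    total = 0.5 * (1 / math.log(2.0) + 1 / math.log(x))
    for i in range(1, steps):
        total += 1 / math.log(2.0 + i * h)
    return total * h

def prime_pi(x):
    """pi(x) by trial division -- slow but fine for small x."""
    count = 0
    for n in range(2, int(x) + 1):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

x = 1000
print(prime_pi(x), round(Li(x), 1), round(x / math.log(x), 1))  # 168 176.6 144.8
```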

Wave Decomposition Strategy

The staircase pattern suggests a natural mathematical approach. In Fourier analysis, we know that well-behaved step functions can be decomposed into a superposition of waves—a smooth principal component plus oscillatory corrections:

$$\pi(x) = \underbrace{\text{principal wave}}_{\text{smooth trend}} + \underbrace{\text{oscillatory modes}}_{\text{the steps}}$$

Gauss’s empirical observation (later the PNT) gives us the principal wave: $\frac{x}{\ln x}$. This represents the “DC component,” or average trend. But what creates the steps? What are the frequencies and amplitudes of the oscillatory modes that combine to produce the sharp jumps exactly at prime locations?

This is our quest: to derive the oscillatory corrections using first principles.

Next Step: Finding the Right Coordinate System

To analyze oscillations using Fourier methods, we need to choose an appropriate coordinate system and measure. The key question is: what symmetry does the prime distribution actually respect? Is it translation-invariant (constant spacing) or something else?


2. First Principle: Scale-Invariant Measure

The Problem of Measuring Prime Density

Consider a naive question: “What is the density of primes?” If we measure density as “primes per unit interval,” we immediately run into a problem. The density in the interval $[100, 200]$ (which contains 21 primes) is very different from the density in $[1000, 1100]$ (which contains only 16 primes), even though both intervals have the same width of 100.

This suggests that prime density is not translation-invariant: moving the interval while keeping its width constant changes the density.

Discovering Scale Invariance

Let’s try a different comparison. Consider these two intervals:

  • $[100, 200]$: width = 100, midpoint = 150, scale factor = 2
  • $[1000, 2000]$: width = 1000, midpoint = 1500, scale factor = 2

Even though the absolute widths differ by a factor of 10, both intervals represent a doubling of the lower bound. Empirically, when we count primes in these intervals and normalize appropriately, the densities become comparable!
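We can put numbers on this claim. Dividing each interval's prime count by the Gauss-style expected count $\text{width}/\ln(\text{midpoint})$ (the normalization choice is ours) brings the two intervals into close agreement:

```python
import math

def count_primes(a, b):
    """Count primes p with a < p <= b, by trial division."""
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    return sum(1 for n in range(a + 1, b + 1) if is_prime(n))

# prints: 100 200 21 1.05  /  1000 2000 135 0.99
for a, b in [(100, 200), (1000, 2000)]:
    count = count_primes(a, b)
    expected = (b - a) / math.log((a + b) / 2)  # Gauss: density ~ 1/ln x
    print(a, b, count, round(count / expected, 2))
```

Both normalized densities land close to 1, even though the raw counts (21 vs. 135) differ by a factor of six.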

This observation suggests that prime density respects scale invariance (multiplicative symmetry) rather than translation invariance (additive symmetry).

The Maximum Entropy Approach

In information theory, the Maximum Entropy principle states that when we have incomplete information, we should use the probability distribution that respects our known constraints while making no additional assumptions (maximizing entropy). This gives us the “least biased” framework for analysis.

Here, our constraint is scale invariance: the measure should treat multiplicative shifts symmetrically. The question becomes: what measure $d\mu$ on the real line respects multiplicative symmetry?

Deriving the Scale-Invariant Measure

For translation invariance, the natural measure is simply $dx$ (the standard Lebesgue measure). For scale invariance, we need a measure that remains unchanged under multiplicative scaling.

Consider the transformation $x \mapsto \lambda x$ for constant $\lambda > 0$. Under this scaling:

  • Standard measure: $dx \mapsto \lambda \, dx$ (scales with $\lambda$)
  • Logarithmic measure: $d(\ln x) = \frac{dx}{x} \mapsto \frac{d(\lambda x)}{\lambda x} = \frac{dx}{x}$ (invariant!)

Therefore, the natural scale-invariant measure is:

$$dt = \frac{dx}{x}$$

Integrating both sides:

$$t = \ln(x) + C$$

Since we’re interested in primes starting from 2 (the smallest prime), we can set $C = -\ln(2)$, giving us $t = \ln(x/2)$. For simplicity, we’ll often just work with $t = \ln(x)$, understanding that the constant affects only the reference point.
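A one-line numerical check of the invariance (the endpoints and scale factor below are arbitrary test values): the measure of $[a, b]$ under $dx/x$ is $\ln(b/a)$, which is unchanged when both endpoints are scaled by $\lambda$:

```python
import math

def log_measure(a, b):
    """Measure of [a, b] under dt = dx/x, i.e. ∫_a^b dx/x = ln(b/a)."""
    return math.log(b / a)

a, b, lam = 100.0, 200.0, 7.0
print(log_measure(a, b))              # ln 2 ≈ 0.6931
print(log_measure(lam * a, lam * b))  # same value: the measure is scale-invariant
```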

Working in Log-Space

This derivation tells us that logarithmic space is the natural coordinate system for studying prime distribution. In this coordinate system:

  • Integers are located at positions $t_n = \ln(n)$
  • Multiplicative relationships become additive: $\ln(ab) = \ln(a) + \ln(b)$
  • Scale-invariant properties become translation-invariant in $t$-space

This fundamental choice of measure will guide all subsequent analysis.

Next Step: Information Content of Individual Primes

Now that we have the right measure for density, we need to quantify the “information content” of each prime. When we discover that a number is prime, how much information have we gained? This requires understanding what it means to “locate” a prime in our logarithmic space.


3. Deriving the Information Weight Function

The Question: How Much Information is a Prime?

Imagine we’re systematically checking integers to determine which are prime. We’ve already tested all integers up to some value, and now we examine the next integer $n$. If it turns out to be prime, we’ve gained information—we’ve reduced our uncertainty about the location of primes.

But how much information? In information theory, the information gained from an event (its surprisal, or self-information) is quantified by Shannon’s formula: if an event has probability $p$, the information we gain when it occurs is

$$I = -\ln(p)$$

(using the natural logarithm). This is measured in nats, the natural-logarithm equivalent of bits.

Estimating the Probability of Primality

In our log-space coordinate system $t = \ln(x)$, integers are not uniformly spaced. They’re located at $t_n = \ln(n)$, and the spacing between consecutive integers near $n$ is:

$$\delta t_n = \ln(n+1) - \ln(n) = \ln\left(1 + \frac{1}{n}\right)$$

For large $n$, using the approximation $\ln(1 + \epsilon) \approx \epsilon$ for small $\epsilon$:

$$\delta t_n \approx \frac{1}{n}$$

Now, we derived that the natural measure in our space is $dt = \frac{dx}{x}$. At integer location $n$, the “weight” or “probability measure” of that position is proportional to the spacing $\delta t_n \approx \frac{1}{n}$.

If we model “discovering” a prime as randomly selecting a position in this measure, then finding a prime $p$ corresponds to an event with probability weight $\rho(p) \propto \frac{1}{p}$.

Calculating Information Content

The information content of finding the prime $p$ is:

$$I(p) = -\ln(\rho(p)) = -\ln\left(\frac{1}{p}\right) = \ln(p)$$

This is a beautiful result: the information content of discovering the prime $p$ is simply its natural logarithm.

Extension to Prime Powers

Consider prime powers $p^k$ for $k \geq 1$. The number $p^k$ is “constructed” from the prime $p$ exactly $k$ times (via unique prime factorization). If we think of information as additive—finding $p$ twice gives twice the information—then:

$$I(p^k) = k \cdot \ln(p)$$

Defining the Von Mangoldt Function

We formalize this by defining a function $\Lambda(n)$ that assigns an information weight to each positive integer:

$$\Lambda(n) = \begin{cases} \ln(p) & \text{if } n = p^k \text{ for some prime } p \text{ and positive integer } k \\ 0 & \text{otherwise} \end{cases}$$

This function is zero for $n = 1$ and for any number divisible by two distinct primes (like $6 = 2 \times 3$), because such numbers are “constructed from multiple different primes,” not powers of a single prime. Only prime powers carry this pure information signature.
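The case analysis translates directly into code. This sketch (the function name is ours) finds the smallest prime factor of $n$ and checks whether $n$ is a pure power of it:

```python
import math

def von_mangoldt(n):
    """Λ(n) = ln p if n = p^k for a prime p and k >= 1, else 0."""
    if n < 2:
        return 0.0
    for p in range(2, int(math.sqrt(n)) + 1):
        if n % p == 0:
            # p is the smallest prime factor; n must be a pure power of p
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
    return math.log(n)  # no factor found: n itself is prime

print(von_mangoldt(8))   # ln 2 ≈ 0.693 (8 = 2^3)
print(von_mangoldt(12))  # 0.0 (12 = 2^2 · 3 is not a prime power)
```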

Historical Note and Modern Name

What we’ve just derived from information-theoretic first principles is known in analytic number theory as the von Mangoldt function, named after the German mathematician Hans von Mangoldt who introduced it in the 1890s in the context of studying prime distribution and the Riemann zeta function.

Our derivation shows why this particular function is natural—it’s not an arbitrary choice, but rather the unique function that correctly weighs integers by their information content in log-space under scale-invariant measure.

Next Step: Accumulating Information

We now have the information weight $\Lambda(n)$ for individual integers. To understand the global distribution pattern of primes, we need to accumulate this information over ranges. How much total information is encoded in all integers up to $x$?


4. Building the Cumulative Information Function

Defining Total Information Up to $x$

Having established that each prime power $p^k$ carries information weight $\Lambda(p^k) = \ln(p)$, we naturally want to accumulate this information to understand the distribution pattern. Define:

$$\psi(x) = \sum_{n \leq x} \Lambda(n) = \sum_{p^k \leq x} \ln(p)$$

where the second sum runs over all prime powers $p^k$ (primes and their integer powers) not exceeding $x$.

This function $\psi(x)$ represents the total information content encoded in all integers up to $x$, weighted by their fundamental prime structure.

Properties and Behavior

Let’s understand how $\psi(x)$ behaves:

Jumps at prime powers: The function is constant between prime powers, then jumps by $\ln(p)$ whenever we encounter $p^k$:

  • At $x = 2$: jump of $\ln(2)$
  • At $x = 3$: jump of $\ln(3)$
  • At $x = 4 = 2^2$: jump of $\ln(2)$ (second power of 2)
  • At $x = 5$: jump of $\ln(5)$
  • At $x = 7$: jump of $\ln(7)$
  • At $x = 8 = 2^3$: jump of $\ln(2)$ (third power of 2)

Dominant contribution from first powers: Most of the contribution to $\psi(x)$ comes from the primes themselves ($k=1$), not from higher powers. This is because:

$$\psi(x) = \sum_{p \leq x} \ln(p) + \sum_{p^2 \leq x} \ln(p) + \sum_{p^3 \leq x} \ln(p) + \cdots$$

The number of primes up to $x$ is roughly $\frac{x}{\ln x}$ (by the PNT), but the number of prime squares up to $x$ is only $\pi(\sqrt{x}) \approx \frac{2\sqrt{x}}{\ln x}$, the number of prime cubes $\pi(x^{1/3}) \approx \frac{3 x^{1/3}}{\ln x}$, and so on. These higher-power terms are increasingly negligible.

Relationship to $\pi(x)$: The prime counting function $\pi(x)$ counts primes with unit weight, while $\psi(x)$ weights them logarithmically. The two are related, but $\psi(x)$ will turn out to have cleaner analytic properties.
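Accumulating $\Lambda(n)$ by brute force makes the linear trend visible; this sketch (reusing a simple $\Lambda$ implementation, names ours) shows $\psi(x)/x$ approaching 1:

```python
import math

def von_mangoldt(n):
    """Λ(n) = ln p if n = p^k for a prime p and k >= 1, else 0."""
    if n < 2:
        return 0.0
    for p in range(2, int(math.sqrt(n)) + 1):
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
    return math.log(n)

def psi(x):
    """Second Chebyshev function ψ(x) = Σ_{n ≤ x} Λ(n)."""
    return sum(von_mangoldt(n) for n in range(2, int(x) + 1))

for x in [10, 100, 1000]:
    print(x, round(psi(x), 2), round(psi(x) / x, 3))
```

The ratio $\psi(x)/x$ creeps toward 1, which is the Prime Number Theorem in the form $\psi(x) \sim x$ derived below.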

Why Use $\psi(x)$ Instead of $\pi(x)$?

At first glance, we’ve moved away from our original goal (analyzing $\pi(x)$). Why introduce this new function $\psi(x)$?

Several reasons:

  1. Natural from information theory: the weights $\ln(p)$ emerged naturally from our Maximum Entropy framework
  2. Multiplicative structure: $\psi(x)$ respects the multiplicative structure of integers (prime factorization)
  3. Analytic properties: as we’ll see, $\psi(x)$ has better behavior under Fourier transforms
  4. Recoverable: we can convert back to $\pi(x)$ from $\psi(x)$ later (Section 8)

Historical Note and Modern Name

The function $\psi(x)$ we’ve constructed is called the Chebyshev function (specifically, the second Chebyshev function), named after Pafnuty Chebyshev, who used it in the 1850s in his work on prime number distribution. Chebyshev proved important bounds on $\pi(x)$ using this function, coming tantalizingly close to proving the Prime Number Theorem decades before it was actually proven.

Our derivation shows this function isn’t an arbitrary mathematical trick—it’s the natural cumulative information function when primes are viewed through the lens of information theory.

Next Step: Spectral Decomposition

We’ve claimed that $\psi(x)$ can be decomposed into a principal wave (smooth trend) plus oscillatory modes (creating the steps). To find these modes, we need to compute the frequency spectrum of our information weight function $\Lambda(n)$. This requires Fourier analysis in the appropriate space.


5. Fourier Analysis: Finding the Frequency Spectrum

The Goal: Decomposing Information into Frequencies

We want to understand $\psi(x) = \sum_{n \leq x} \Lambda(n)$ as a wave superposition. In classical Fourier analysis, we transform a signal from the time/space domain to the frequency domain to understand its spectral components. Here, we’ll transform $\Lambda(n)$ from the “integer domain” to a “frequency domain” parameterized by a complex frequency $s$.

5.1 The Forward Transform: Signal to Frequency

Direction and Analogy: In standard continuous Fourier analysis, the forward transform (signal → frequency) is:

$$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t) \, e^{-i\omega t} \, dt$$

Note the negative exponent in $e^{-i\omega t}$—this is characteristic of the forward transform.

Our Discrete Case: Our signal $\Lambda(n)$ is defined only at discrete integer points $n = 1, 2, 3, \ldots$. We cannot evaluate $\Lambda(2.5)$ or $\Lambda(\pi)$—only integers have prime factorizations! Therefore, we must use a sum rather than an integral:

$$F(s) = \sum_{n=1}^{\infty} \Lambda(n) \cdot (\text{kernel})$$

Finding the Kernel: Recall that in log-space, the integer $n$ is at position $t_n = \ln(n)$, so $n = e^{t_n}$. By analogy with the continuous case, where the kernel is $e^{-i\omega t}$, our kernel should be:

$$e^{-s t_n} = e^{-s \ln n} = n^{-s}$$

Here $s$ plays the role of a complex frequency (a generalization of $i\omega$).

Our Forward Transform:

$$F(s) = \sum_{n=1}^{\infty} \Lambda(n) \, n^{-s} = \sum_{n=1}^{\infty} \frac{\Lambda(n)}{n^s}$$

This is a Dirichlet series, the discrete analogue of the Laplace transform or Mellin transform. The negative power $n^{-s}$ corresponds to the negative exponent in the forward Fourier transform.

5.2 Connecting to Statistical Mechanics

Challenge: Computing $F(s) = \sum \frac{\Lambda(n)}{n^s}$ directly seems difficult because $\Lambda(n)$ is zero for most integers. Can we find a better way to understand this sum?

Analogy with Statistical Mechanics: In statistical mechanics, the partition function encodes all information about a system:

$$Z = \sum_{\text{states } i} e^{-\beta E_i}$$

where $E_i$ is the energy of state $i$, and $\beta = \frac{1}{kT}$ is the inverse temperature. The average energy is related to the partition function by:

$$\langle E \rangle = -\frac{\partial \ln Z}{\partial \beta} = -\frac{Z'(\beta)}{Z(\beta)}$$

Our Analogue: In our setup:

  • States are labeled by the integers $n$
  • The “energy” of state $n$ is its information content $\Lambda(n)$
  • The parameter $s$ plays the role of the inverse temperature $\beta$
  • The weighting factor $n^{-s} = e^{-s \ln n}$ resembles the Boltzmann factor $e^{-\beta E}$

Our sum $F(s) = \sum \frac{\Lambda(n)}{n^s}$ represents the average information per state, weighted by $n^{-s}$. By analogy with statistical mechanics, this should be related to the derivative of some partition function:

$$F(s) = -\frac{d}{ds} \ln Z(s) = -\frac{Z'(s)}{Z(s)}$$

for an appropriate partition function $Z(s)$.

5.3 Solving for the Partition Function

Setting Up the Differential Equation: We have:

$$-\frac{Z'(s)}{Z(s)} = \sum_{n=1}^{\infty} \frac{\Lambda(n)}{n^s}$$

Equivalently:

$$\frac{d}{ds} \ln Z(s) = -\sum_{n=1}^{\infty} \frac{\Lambda(n)}{n^s}$$

Using the Prime Power Structure: Recall that $\Lambda(p^k) = \ln(p)$ for prime powers. We can reorganize the sum:

$$\sum_{n=1}^{\infty} \frac{\Lambda(n)}{n^s} = \sum_{p \text{ prime}} \sum_{k=1}^{\infty} \frac{\ln(p)}{p^{ks}}$$

Therefore:

$$\frac{d}{ds} \ln Z(s) = -\sum_{p} \sum_{k=1}^{\infty} \frac{\ln(p)}{p^{ks}} = -\sum_{p} \ln(p) \sum_{k=1}^{\infty} \frac{1}{p^{ks}}$$

Integrating Both Sides: We integrate with respect to $s$:

$$\ln Z(s) = -\sum_{p} \ln(p) \int \sum_{k=1}^{\infty} \frac{1}{p^{ks}} \, ds$$

The inner integral, term by term:

$$\int p^{-ks} \, ds = \frac{p^{-ks}}{-k \ln(p)} = -\frac{1}{k \ln(p)} \, p^{-ks}$$

Therefore:

$$\ln Z(s) = -\sum_{p} \ln(p) \sum_{k=1}^{\infty} \left( -\frac{1}{k \ln(p)} \, p^{-ks} \right) = \sum_{p} \sum_{k=1}^{\infty} \frac{p^{-ks}}{k}$$

(The constant of integration is fixed by requiring $\ln Z(s) \to 0$ as $s \to +\infty$, where every term $p^{-ks}$ vanishes.)

Recognizing the Series: We have the Taylor series:

$$-\ln(1-x) = \sum_{k=1}^{\infty} \frac{x^k}{k} \quad \text{for } |x| < 1$$

Setting $x = p^{-s}$:

$$\ln Z(s) = \sum_{p} \sum_{k=1}^{\infty} \frac{p^{-ks}}{k} = -\sum_p \ln(1 - p^{-s}) = \ln \prod_p \frac{1}{1 - p^{-s}}$$

Solution - The Euler Product:

$$Z(s) = \prod_{p \text{ prime}} \frac{1}{1 - p^{-s}}$$

This beautiful product formula over primes is called the Euler product, discovered by Leonhard Euler in 1737.

Connection to All Integers: By the fundamental theorem of arithmetic (unique prime factorization), when we expand this product, each integer appears exactly once:

$$Z(s) = \prod_p \frac{1}{1-p^{-s}} = \prod_p \left(1 + p^{-s} + p^{-2s} + p^{-3s} + \cdots\right) = \sum_{n=1}^{\infty} \frac{1}{n^s}$$

Every integer $n$ has a unique factorization $n = \prod_i p_i^{k_i}$, contributing $\prod_i p_i^{-k_i s} = n^{-s}$ to the sum.
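The identity can be verified numerically at a real point such as $s = 2$, where the sum side is known to equal $\frac{\pi^2}{6}$; truncating the product at the primes below 1000 (our choice of cutoff) already agrees to about three decimal places:

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for m in range(n * n, limit + 1, n):
                is_prime[m] = False
    return [n for n, f in enumerate(is_prime) if f]

s = 2.0
product = 1.0
for p in primes_up_to(1000):
    product *= 1.0 / (1.0 - p ** -s)  # one Euler factor per prime

print(product, math.pi ** 2 / 6)  # both ≈ 1.645
```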

Historical Note and Modern Name

What we just discovered—solving for the partition function from first principles—is the Riemann zeta function:

$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} = \prod_{p \text{ prime}} \frac{1}{1-p^{-s}}$$

It is named after Bernhard Riemann, who in his famous 1859 paper extended this function to the complex plane and discovered its profound connection to prime distribution. Euler had studied $\zeta(s)$ for real values in the 1730s, but Riemann’s insight was to treat $s$ as a complex variable and study the function’s zeros.

Our Discovery: The frequency spectrum of the information function $\Lambda(n)$ is:

$$F(s) = -\frac{\zeta'(s)}{\zeta(s)} = \sum_{n=1}^{\infty} \frac{\Lambda(n)}{n^s}$$

The zeros and poles of $\zeta(s)$ encode the entire structure of prime distribution!
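We can sanity-check this spectrum at a real frequency such as $s = 2$ by computing both sides with brute-force truncation (the cutoffs are our choices); $\zeta'(s)$ is obtained from its defining series $-\sum \frac{\ln n}{n^s}$:

```python
import math

def von_mangoldt(n):
    """Λ(n) = ln p if n = p^k for a prime p and k >= 1, else 0."""
    if n < 2:
        return 0.0
    for p in range(2, int(math.sqrt(n)) + 1):
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
    return math.log(n)

s, N = 2.0, 20000
lhs = sum(von_mangoldt(n) / n ** s for n in range(2, N))

# -zeta'(s)/zeta(s) from the defining series: zeta'(s) = -Σ ln(n)/n^s
zeta = sum(1.0 / n ** s for n in range(1, N))
zeta_prime = -sum(math.log(n) / n ** s for n in range(2, N))
rhs = -zeta_prime / zeta

print(round(lhs, 3), round(rhs, 3))  # both ≈ 0.57
```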

Next Step: The Inverse Transform

We’ve found the frequency spectrum $F(s) = -\frac{\zeta'(s)}{\zeta(s)}$. Now we need to invert the transform to recover $\psi(x) = \sum_{n \leq x} \Lambda(n)$ in the original space. What is the formula for the inverse transform?


6. Inverse Transform: From Frequency Back to Signal

The Challenge: Recovering the Cumulative Sum

We have the frequency spectrum:

$$F(s) = -\frac{\zeta'(s)}{\zeta(s)} = \sum_{n=1}^{\infty} \frac{\Lambda(n)}{n^s}$$

But we want the cumulative sum $\psi(x) = \sum_{n \leq x} \Lambda(n)$, not the infinite series. How do we extract only the terms with $n \leq x$?

6.1 Deriving the Inverse Transform

Strategy from Complex Analysis: Consider the product:

$$F(s) \cdot x^s = \sum_{n=1}^{\infty} \Lambda(n) \cdot \frac{x^s}{n^s} = \sum_{n=1}^{\infty} \Lambda(n) \left(\frac{x}{n}\right)^s$$

Each term involves $(x/n)^s$. When $n < x$, the ratio $x/n > 1$; when $n > x$, the ratio $x/n < 1$. If we could somehow “pick out” the terms with $x/n > 1$, we’d extract exactly the terms with $n \leq x$.

The Heaviside Step Function in the Complex Domain: There’s a beautiful identity from complex analysis, related to Cauchy’s integral formula:

$$\frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} \frac{y^s}{s} \, ds = \begin{cases} 1 & \text{if } y > 1 \\ 0 & \text{if } 0 < y < 1 \end{cases}$$

This integral along a vertical line in the complex plane acts as a Heaviside step function in $y$ (at $y = 1$ it gives $\frac{1}{2}$). The parameter $c > 0$ must be chosen so the vertical line $\operatorname{Re}(s) = c$ lies in the region where the integral converges.
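The step-function behavior can be seen numerically by truncating the contour at height $T$; the choices of $c$, $T$, and grid spacing below are ours, and convergence is slow (error roughly $O(1/T)$), but the 1-versus-0 dichotomy is already clear:

```python
import cmath

def perron_step(y, c=2.0, T=300.0, dt=0.01):
    """Truncated Perron integral (1/2πi) ∫ y^s/s ds along Re(s) = c."""
    n_steps = int(2 * T / dt)
    total = 0j
    for k in range(n_steps + 1):
        s = complex(c, -T + k * dt)  # s = c + it
        total += y ** s / s * dt     # crude Riemann sum
    # ds = i·dt, so the 1/(2πi) prefactor becomes 1/(2π); keep the real part
    return (total / (2 * cmath.pi)).real

print(round(perron_step(2.0), 2))  # close to 1 (y > 1)
print(round(perron_step(0.5), 2))  # close to 0 (y < 1)
```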

Applying to Our Problem: We apply this identity with $y = x/n$:

$$\frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} F(s) \cdot \frac{x^s}{s} \, ds = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} \left(\sum_n \Lambda(n) \frac{(x/n)^s}{s}\right) ds$$

Assuming we can interchange sum and integral (valid for appropriate $c$, here $c > 1$):

$$= \sum_n \Lambda(n) \cdot \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} \frac{(x/n)^s}{s} \, ds$$

By the step-function identity:

  • When $n < x$: the integral equals 1, contributing $\Lambda(n)$
  • When $n > x$: the integral equals 0, contributing nothing

Result:

$$\psi(x) = \sum_{n \leq x} \Lambda(n) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} F(s) \cdot \frac{x^s}{s} \, ds$$

This is the inverse transform we sought!

Historical Note and Modern Name

What we just derived is known as Perron’s formula, named after Oskar Perron who formalized it in 1908. It’s the discrete analogue of the inverse Mellin transform. Our derivation from Cauchy’s integral formula shows why this particular contour integral formula is natural—it’s precisely the tool needed to extract truncated sums from infinite series.

6.2 Applying to Our Specific Case

Substituting $F(s) = -\frac{\zeta'(s)}{\zeta(s)}$:

$$\psi(x) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} \left(-\frac{\zeta'(s)}{\zeta(s)}\right) \cdot \frac{x^s}{s} \, ds$$

This is a complex contour integral along the vertical line $\operatorname{Re}(s) = c$ (with $c > 1$) in the complex plane. To evaluate it, we’ll use the residue theorem from complex analysis.

6.3 Evaluating via Residue Theorem

Strategy: We shift the contour of integration to the left (decreasing the real part), picking up residues at the poles of the integrand $-\frac{\zeta'(s)}{\zeta(s)} \cdot \frac{x^s}{s}$.

Poles of the Integrand:

  1. Simple pole at $s = 1$: The Riemann zeta function has a simple pole at $s = 1$ because the harmonic series $\sum \frac{1}{n}$ diverges. Near $s = 1$, $\zeta(s) \approx \frac{1}{s-1}$, and therefore $-\frac{\zeta'(s)}{\zeta(s)} \approx \frac{1}{s-1}$.

    The residue at $s = 1$ of $-\frac{\zeta'(s)}{\zeta(s)} \cdot \frac{x^s}{s}$ is: $\operatorname{Res}_{s=1} = \frac{x^1}{1} = x$

  2. Poles at the non-trivial zeros of $\zeta(s)$: Whenever $\zeta(\rho) = 0$, the function $-\frac{\zeta'(s)}{\zeta(s)}$ has a pole. For a simple zero, the residue of $-\frac{\zeta'(s)}{\zeta(s)}$ is $-1$, so: $\operatorname{Res}_{s=\rho} = -\frac{x^\rho}{\rho}$

    There are infinitely many such zeros!

  3. Poles at the trivial zeros of $\zeta(s)$: Through analytic continuation, $\zeta(s)$ extends to the entire complex plane and has zeros at the negative even integers $s = -2, -4, -6, \ldots$ (called “trivial zeros”). Each contributes a residue: $\operatorname{Res}_{s=-2k} = -\frac{x^{-2k}}{-2k} = \frac{x^{-2k}}{2k}$

    For large $x$, these terms decay rapidly (like $\frac{1}{x^2}, \frac{1}{x^4}, \ldots$) and contribute negligibly. Their sum has a closed form: $\sum_{k=1}^{\infty} \frac{x^{-2k}}{2k} = -\frac{1}{2}\ln(1-x^{-2})$

    This uses the Taylor series $\ln(1-y) = -\sum_{k=1}^{\infty} \frac{y^k}{k}$ with $y = x^{-2}$.

  4. Pole at $s = 0$: The integrand has a pole at $s = 0$ from two sources:

    • the factor $\frac{1}{s}$ in $\frac{x^s}{s}$
    • the behavior of $\zeta(s)$ near $s = 0$

    To evaluate this residue, we need two values that follow from the functional equation of $\zeta(s)$:

    • $\zeta(0) = -\frac{1}{2}$
    • $\zeta'(0) = -\frac{1}{2}\ln(2\pi)$

    Near $s = 0$ we have $\zeta(s) \approx -\frac{1}{2} - \frac{1}{2}\ln(2\pi) \, s + O(s^2)$, so $-\frac{\zeta'(0)}{\zeta(0)} = -\ln(2\pi)$. Since $\frac{x^s}{s}$ has a simple pole with residue 1 at $s = 0$, the residue of the full integrand is: $\operatorname{Res}_{s=0} = -\ln(2\pi)$

    Note: The functional equation and these specific values come from deeper properties of $\zeta(s)$ that we haven’t derived here. In a complete treatment, one would prove the functional equation using Poisson summation or theta-function techniques. For our sketch, we state these known values.
Collecting All Residues: By the residue theorem, summing the contributions from all poles:

$$\psi(x) = \underbrace{x}_{\text{pole at } s=1} - \underbrace{\sum_{\rho} \frac{x^\rho}{\rho}}_{\text{non-trivial zeros}} \underbrace{-\ \ln(2\pi)}_{\text{pole at } s=0} \underbrace{-\ \frac{1}{2}\ln(1-x^{-2})}_{\text{trivial zeros}}$$

where the sum runs over all non-trivial zeros $\rho$ of $\zeta(s)$ (the zeros in the “critical strip” $0 < \operatorname{Re}(s) < 1$).

For large $x$, the last term becomes negligible, since $\ln(1-x^{-2}) \approx -x^{-2} \to 0$.

This is a remarkable result! Let’s understand what it means.
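To watch the waves at work, we can truncate the sum to the first ten zeros $\rho_k = \frac{1}{2} + i\gamma_k$ (imaginary parts taken from published tables, rounded; pairing each zero with its conjugate gives a contribution $2\operatorname{Re}(x^{\rho}/\rho)$ per pair) and compare against a direct computation of $\psi(x)$. Even ten zero pairs typically land within a few units of the true value:

```python
import math

# Imaginary parts of the first ten non-trivial zeros (rounded published values)
GAMMAS = [14.134725, 21.022040, 25.010858, 30.424876, 32.935062,
          37.586178, 40.918719, 43.327073, 48.005151, 49.773832]

def von_mangoldt(n):
    """Λ(n) = ln p if n = p^k for a prime p and k >= 1, else 0."""
    if n < 2:
        return 0.0
    for p in range(2, int(math.sqrt(n)) + 1):
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
    return math.log(n)

def psi_exact(x):
    """ψ(x) computed directly from its definition."""
    return sum(von_mangoldt(n) for n in range(2, int(x) + 1))

def psi_explicit(x):
    """x - Σ_ρ x^ρ/ρ - ln(2π) - ½ln(1 - x⁻²), truncated to ten zero pairs."""
    waves = 0.0
    for g in GAMMAS:
        rho = complex(0.5, g)
        waves += 2 * (x ** rho / rho).real  # zero plus its conjugate
    return x - waves - math.log(2 * math.pi) - 0.5 * math.log(1 - x ** -2)

x = 100.5  # chosen between jumps of the staircase
print(round(psi_exact(x), 2), round(psi_explicit(x), 2))
```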

Next Step: Interpreting the Formula

We’ve derived an explicit formula for $\psi(x)$ involving the zeros of the zeta function. This formula separates the principal wave from oscillatory corrections—exactly what we set out to find. Let’s interpret each term and understand the wave decomposition.


7. The Explicit Formula: Wave Decomposition Revealed

The Complete Formula

Our derivation has produced:

$$\psi(x) = x - \sum_{\rho} \frac{x^\rho}{\rho} - \ln(2\pi) - \frac{1}{2}\ln(1-x^{-2})$$

where $\rho$ ranges over all non-trivial zeros of the Riemann zeta function.

7.1 Interpreting Each Component

Principal Wave: $x$

The term $x$ is the dominant component for large $x$. This gives the asymptotic linear growth of $\psi(x)$. Recall that $\psi(x) \approx \sum_{p \leq x} \ln(p)$, and by the Prime Number Theorem there are roughly $\frac{x}{\ln x}$ primes up to $x$, each contributing an average logarithm close to $\ln x$. This gives:

$$\psi(x) \approx \frac{x}{\ln x} \cdot \ln x = x$$

This is our “DC component”—the smooth background trend.

Oscillatory Modes: $-\sum_{\rho} \frac{x^\rho}{\rho}$

This is the spectacular discovery! Each non-trivial zero $\rho = \beta + i\gamma$ of the zeta function contributes one oscillatory mode:

$$\frac{x^\rho}{\rho} = \frac{x^{\beta + i\gamma}}{\beta + i\gamma} = \frac{x^\beta \cdot e^{i\gamma \ln x}}{\beta + i\gamma}$$

Let’s unpack this:

  • Amplitude: $x^\beta$ (grows with $x$ if $\beta > 0$, decays if $\beta < 0$)
  • Oscillation: $e^{i\gamma \ln x}$ is a complex exponential
  • Frequency in log-space: in terms of $t = \ln(x)$, this is $e^{i\gamma t}$, a pure wave with frequency $\frac{\gamma}{2\pi}$

Each zero contributes a wave! The imaginary part $\gamma$ determines the frequency of oscillation, while the real part $\beta$ determines how the amplitude scales with $x$.

Small Corrections: The terms $-\ln(2\pi)$ and $-\frac{1}{2}\ln(1-x^{-2})$ are small corrections that become negligible for large $x$.

7.2 The Physical Picture: Resonant Frequencies

The formula reveals a profound physical picture. The staircase pattern of primes is created by an infinite superposition of waves, each corresponding to a zero of the zeta function:

$$\psi(x) = \underbrace{x}_{\text{smooth trend}} - \underbrace{\sum_{\rho} \frac{x^\rho}{\rho}}_{\text{infinitely many waves}} + O(1)$$

In log-space $t = \ln(x)$, the oscillatory term becomes:

$$-\sum_{\rho} \frac{e^{\beta t} \cdot e^{i\gamma t}}{\rho}$$

Relative to the main term $e^t$, these are damped oscillations (when $\beta < 1$) with discrete frequencies $\gamma_1, \gamma_2, \gamma_3, \ldots$. The zeros of the zeta function are the resonant frequencies of the prime distribution!

The “unpredictability” of primes—the irregular spacing of the staircase steps—is encoded in this infinite sum. The waves conspire, through constructive and destructive interference, to create sharp jumps exactly at prime locations.

7.3 The Riemann Hypothesis: Spectral Uniformity

Empirical Observation: When mathematicians numerically compute the zeros $\rho = \beta + i\gamma$, they find something astonishing: all computed non-trivial zeros lie exactly on the line $\beta = \frac{1}{2}$.

The first few zeros are approximately:

  • $\rho_1 = \frac{1}{2} + 14.135i$
  • $\rho_2 = \frac{1}{2} + 21.022i$
  • $\rho_3 = \frac{1}{2} + 25.011i$

Billions of zeros have been computed, and all lie on the critical line $\operatorname{Re}(s) = \frac{1}{2}$.

The Riemann Hypothesis: In his 1859 paper, Riemann conjectured that all non-trivial zeros have real part $\frac{1}{2}$. This is the famous Riemann Hypothesis, one of the most important unsolved problems in mathematics (and one of the Millennium Prize Problems, with a $1 million prize).

Implication for Prime Distribution: If RH is true, all oscillatory terms have the same amplitude scaling:

xρρx1/2ρ\frac{x^\rho}{\rho} \sim \frac{x^{1/2}}{\rho}

This means:

ψ(x)=x+O(xlog2x)\psi(x) = x + O(\sqrt{x} \cdot \log^2 x)

Information-Theoretic Interpretation: The Riemann Hypothesis states that the prime distribution has maximal spectral uniformity. All wave components share the same minimal amplitude growth (x\sqrt{x}); equivalently, they decay as fast as possible relative to the main term xx. This is a profound statement about symmetry and entropy: the primes are distributed as “randomly” as possible while still maintaining their deterministic structure. If RH is true, primes achieve a kind of optimal “pseudorandomness.”
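As a sanity check, the bound ψ(x)=x+O(xlog2x)\psi(x) = x + O(\sqrt{x} \cdot \log^2 x) can be probed numerically for small xx. A sketch (the implied constant is taken as 1 purely for illustration, which is an assumption):

```python
import math

def psi(x):
    """Chebyshev psi via trial division (fine at this small scale)."""
    total = 0.0
    for p in range(2, x + 1):
        if all(p % d for d in range(2, int(math.isqrt(p)) + 1)):
            pk = p
            while pk <= x:
                total += math.log(p)
                pk *= p
    return total

for x in [100, 1000, 10000]:
    err = abs(psi(x) - x)
    bound = math.sqrt(x) * math.log(x) ** 2
    print(x, round(err, 2), round(err / bound, 4))  # ratio stays well below 1
```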

Historical Note and Modern Name

The formula we derived:

ψ(x)=xρxρρ+(small terms)\psi(x) = x - \sum_{\rho} \frac{x^\rho}{\rho} + \text{(small terms)}

is Riemann’s explicit formula (specifically for the Chebyshev ψ\psi function), first stated in Riemann’s groundbreaking 1859 paper “On the Number of Primes Less Than a Given Magnitude.” Our derivation from information theory and Fourier analysis shows why this formula is natural—it’s simply the inverse transform of the frequency spectrum, evaluated via residues.

Next Step: Recovering π(x)\pi(x)

We set out to understand the prime counting function π(x)\pi(x), but we’ve derived a formula for ψ(x)\psi(x) instead. While ψ(x)\psi(x) was more natural for our analysis, we need to convert back to answer our original question. How do we recover π(x)\pi(x) from ψ(x)\psi(x)?


8. Recovering the Prime Counting Function π(x)\pi(x)

The Relationship Between π(x)\pi(x) and ψ(x)\psi(x)

Recall the definitions:

Prime counting function: π(x)=#{px:p prime}\pi(x) = \#\{p \leq x : p \text{ prime}\}

This counts primes with unit weight (each prime contributes 1).

Chebyshev function: ψ(x)=pkxln(p)\psi(x) = \sum_{p^k \leq x} \ln(p)

This weights each prime power pkp^k by ln(p)\ln(p) (the information weight).

8.1 Understanding the Conceptual Difference

The key difference between π(x)\pi(x) and ψ(x)\psi(x) is what they count:

π(x)\pi(x): Counts each prime once (unit weight)

  • Example: π(10)=4\pi(10) = 4 (counting 2, 3, 5, 7)

ψ(x)\psi(x): Weights each prime power pkp^k by ln(p)\ln(p) (information weight)

  • At n=2n=2: Λ(2)=ln(2)\Lambda(2) = \ln(2)
  • At n=3n=3: Λ(3)=ln(3)\Lambda(3) = \ln(3)
  • At n=4=22n=4 = 2^2: Λ(4)=ln(2)\Lambda(4) = \ln(2)
  • At n=5n=5: Λ(5)=ln(5)\Lambda(5) = \ln(5)
  • At n=7n=7: Λ(7)=ln(7)\Lambda(7) = \ln(7)
  • At n=8=23n=8 = 2^3: Λ(8)=ln(2)\Lambda(8) = \ln(2)
  • At n=9=32n=9 = 3^2: Λ(9)=ln(3)\Lambda(9) = \ln(3)

Therefore: ψ(10)=3ln(2)+2ln(3)+ln(5)+ln(7)≈7.83\psi(10) = 3\ln(2) + 2\ln(3) + \ln(5) + \ln(7) \approx 7.83
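The table above is easy to verify directly. A minimal sketch of the von Mangoldt weight and the resulting sum (which evaluates to 3ln2+2ln3+ln5+ln77.833\ln 2 + 2\ln 3 + \ln 5 + \ln 7 \approx 7.83):

```python
import math

def mangoldt(n):
    """Lambda(n) = ln(p) if n is a power of a single prime p, else 0."""
    for p in range(2, n + 1):
        if n % p == 0:
            while n % p == 0:   # strip out the smallest prime factor
                n //= p
            return math.log(p) if n == 1 else 0.0
    return 0.0

# psi(10) as the cumulative information weight up to 10
psi10 = sum(mangoldt(n) for n in range(2, 11))
print(round(psi10, 4))   # ≈ 7.832
```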

The challenge: How do we convert from ψ(x)\psi(x) (which includes all prime powers) back to π(x)\pi(x) (which counts only primes)?

8.2 Introducing the First Chebyshev Function

To bridge ψ(x)\psi(x) and π(x)\pi(x), we introduce an intermediate function that counts only first powers of primes:

θ(x)=pxln(p)\theta(x) = \sum_{p \leq x} \ln(p)

This function θ(x)\theta(x) is called the first Chebyshev function (while ψ(x)\psi(x) is the second).

Connection to π(x)\pi(x): The function θ(x)\theta(x) is directly related to π(x)\pi(x) through a Stieltjes integral:

θ(x)=2xln(t)dπ(t)\theta(x) = \int_2^x \ln(t) \, d\pi(t)

This integral accumulates ln(p)\ln(p) each time π\pi jumps by 1 at a prime pp.

Example: For x=10x = 10: θ(10)=ln(2)+ln(3)+ln(5)+ln(7)≈5.35\theta(10) = \ln(2) + \ln(3) + \ln(5) + \ln(7) \approx 5.35

Compared to ψ(10)≈7.83\psi(10) \approx 7.83, we see that θ\theta excludes the contributions from the prime powers 4,8,94, 8, 9, which together contribute 2ln(2)+ln(3)≈2.492\ln(2) + \ln(3) \approx 2.49.

8.3 Relating ψ\psi to θ\theta via Prime Powers

The Forward Relationship: We can express ψ\psi as a sum of θ\theta evaluated at fractional powers:

ψ(x)=k=1θ(x1/k)\psi(x) = \sum_{k=1}^{\infty} \theta(x^{1/k})

Why this works: Recall that ψ(x)=pkxln(p)\psi(x) = \sum_{p^k \leq x} \ln(p). Reorganizing by the power kk:

ψ(x)=k=1px1/kln(p)=k=1θ(x1/k)\psi(x) = \sum_{k=1}^{\infty} \sum_{p \leq x^{1/k}} \ln(p) = \sum_{k=1}^{\infty} \theta(x^{1/k})

  • θ(x)\theta(x) contributes ln(p)\ln(p) for each prime pxp \leq x (k=1k=1 term)
  • θ(x1/2)\theta(x^{1/2}) contributes ln(p)\ln(p) for each prime pxp \leq \sqrt{x} (k=2k=2 term, accounting for p2p^2)
  • θ(x1/3)\theta(x^{1/3}) contributes ln(p)\ln(p) for each prime px3p \leq \sqrt[3]{x} (k=3k=3 term, accounting for p3p^3)
  • And so on…

Verification: For x=10x=10:

  • θ(10)=ln(2)+ln(3)+ln(5)+ln(7)\theta(10) = \ln(2) + \ln(3) + \ln(5) + \ln(7)
  • θ(101/2)=θ(3.16...)=ln(2)+ln(3)\theta(10^{1/2}) = \theta(3.16...) = \ln(2) + \ln(3) (primes 3.16\leq 3.16)
  • θ(101/3)=θ(2.15...)=ln(2)\theta(10^{1/3}) = \theta(2.15...) = \ln(2) (primes 2.15\leq 2.15)
  • θ(101/k)=0\theta(10^{1/k}) = 0 for k4k \geq 4

Summing: ψ(10)=3ln(2)+2ln(3)+ln(5)+ln(7)\psi(10) = 3\ln(2) + 2\ln(3) + \ln(5) + \ln(7)
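The same identity can be verified at larger scale. A Python sketch (helper names are ours):

```python
import math

def primes_upto(x):
    """All primes <= x by trial division (fine at this scale)."""
    return [p for p in range(2, int(x) + 1)
            if all(p % d for d in range(2, int(math.isqrt(p)) + 1))]

def theta(x):
    """First Chebyshev function: sum of ln(p) over primes p <= x."""
    return sum(math.log(p) for p in primes_upto(x))

def psi(x):
    """Second Chebyshev function: sum of ln(p) over prime powers p^k <= x."""
    total = 0.0
    for p in primes_upto(x):
        pk = p
        while pk <= x:
            total += math.log(p)
            pk *= p
    return total

x = 10000
kmax = int(math.log2(x))          # x^(1/k) drops below 2 beyond this point
rhs = sum(theta(x ** (1 / k)) for k in range(1, kmax + 1))
print(round(psi(x), 6), round(rhs, 6))   # the two sides agree
```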

8.4 Möbius Inversion: From ψ\psi to θ\theta

We have the forward relationship ψ(x)=k=1θ(x1/k)\psi(x) = \sum_{k=1}^{\infty} \theta(x^{1/k}). To recover π(x)\pi(x), we need to invert this: express θ\theta in terms of ψ\psi.

This is a classic application of Möbius inversion.

The Möbius Function: Define the Möbius function μ(n)\mu(n):

μ(n)={1if n=1(1)kif n=p1p2pk (product of k distinct primes)0if n has a squared prime factor\mu(n) = \begin{cases} 1 & \text{if } n = 1 \\ (-1)^k & \text{if } n = p_1 p_2 \cdots p_k \text{ (product of } k \text{ distinct primes)} \\ 0 & \text{if } n \text{ has a squared prime factor} \end{cases}

Examples: μ(1)=1\mu(1) = 1, μ(2)=1\mu(2) = -1, μ(3)=1\mu(3) = -1, μ(4)=0\mu(4) = 0 (since 4=224 = 2^2), μ(6)=1\mu(6) = 1 (since 6=2×36 = 2 \times 3).

Möbius Inversion Formula: If g(x)=k=1f(x1/k)g(x) = \sum_{k=1}^{\infty} f(x^{1/k}), then:

f(x)=k=1μ(k)g(x1/k)f(x) = \sum_{k=1}^{\infty} \mu(k) g(x^{1/k})

Applying to our case: Since ψ(x)=k=1θ(x1/k)\psi(x) = \sum_{k=1}^{\infty} \theta(x^{1/k}), we have:

θ(x)=k=1μ(k)ψ(x1/k)\theta(x) = \sum_{k=1}^{\infty} \mu(k) \psi(x^{1/k})

This is the inverse relationship we need!

Physical interpretation: The Möbius function acts as an “inclusion-exclusion” operator, removing the overcounting from higher prime powers. The alternating signs (1)k(-1)^k implement this correction.
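A numerical check of the inversion, with the Möbius function implemented directly from the definition above (Python sketch; helper names are ours):

```python
import math

def mobius(n):
    """mu(n): 0 if n has a squared prime factor, else (-1)^(#prime factors)."""
    if n == 1:
        return 1
    sign, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0       # squared prime factor
            sign = -sign
        d += 1
    return -sign if n > 1 else sign

def primes_upto(x):
    return [p for p in range(2, int(x) + 1)
            if all(p % d for d in range(2, int(math.isqrt(p)) + 1))]

def theta(x):
    return sum(math.log(p) for p in primes_upto(x))

def psi(x):
    total = 0.0
    for p in primes_upto(x):
        pk = p
        while pk <= x:
            total += math.log(p)
            pk *= p
    return total

x = 1000
kmax = int(math.log2(x))
recovered = sum(mobius(k) * psi(x ** (1 / k)) for k in range(1, kmax + 1))
print(round(theta(x), 6), round(recovered, 6))   # theta recovered from psi
```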

8.5 From θ\theta to π\pi via Integration by Parts

Now we relate θ(x)\theta(x) to π(x)\pi(x) using integration by parts on the Stieltjes integral:

θ(x)=2xln(t)dπ(t)\theta(x) = \int_2^x \ln(t) \, d\pi(t)

Integration by parts formula: udv=uvvdu\int u \, dv = uv - \int v \, du

Let u=ln(t)u = \ln(t) and dv=dπ(t)dv = d\pi(t). Then du=dttdu = \frac{dt}{t} and v=π(t)v = \pi(t):

θ(x)=[ln(t)π(t)]2x−∫2xπ(t)dtt\theta(x) = \left[\ln(t) \cdot \pi(t)\right]_2^x - \int_2^x \pi(t) \cdot \frac{dt}{t}

=ln(x)π(x)−ln(2)π(2)−∫2xπ(t)tdt= \ln(x) \cdot \pi(x) - \ln(2) \cdot \pi(2) - \int_2^x \frac{\pi(t)}{t} \, dt

Since π(2)=1\pi(2) = 1:

π(x)=θ(x)ln(x)+1ln(x)2xπ(t)tdt+ln(2)ln(x)\pi(x) = \frac{\theta(x)}{\ln(x)} + \frac{1}{\ln(x)} \int_2^x \frac{\pi(t)}{t} \, dt + \frac{\ln(2)}{\ln(x)}

The integral term can be removed by one more integration by parts (partial summation), which turns it into an integral against θ\theta. The result is the exact identity:

π(x)=θ(x)ln(x)+∫2xθ(t)tln2(t)dt\pi(x) = \frac{\theta(x)}{\ln(x)} + \int_2^x \frac{\theta(t)}{t \ln^2(t)} \, dt
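Because θ\theta is a step function that is constant between consecutive primes, the integral can be evaluated in closed form on each interval (the antiderivative of 1/(tln2t)1/(t \ln^2 t) is 1/lnt-1/\ln t), and the relation checks out exactly. A Python sketch (helper names are ours):

```python
import math

def primes_upto(x):
    return [p for p in range(2, x + 1)
            if all(p % d for d in range(2, int(math.isqrt(p)) + 1))]

x = 1000
ps = primes_upto(x)
theta_x = sum(math.log(p) for p in ps)

# Integral of theta(t)/(t ln^2 t) over [2, x]: theta equals theta(p_i)
# on each interval [p_i, p_{i+1}), and the antiderivative of
# 1/(t ln^2 t) is -1/ln(t).
integral, theta_p = 0.0, 0.0
for i, p in enumerate(ps):
    theta_p += math.log(p)
    b = ps[i + 1] if i + 1 < len(ps) else x
    integral += theta_p * (1 / math.log(p) - 1 / math.log(b))

pi_from_theta = theta_x / math.log(x) + integral
print(len(ps), round(pi_from_theta, 6))   # both equal pi(1000) = 168
```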

8.6 Combining Everything: The Final Formula

We now have all the pieces to express π(x)\pi(x) in terms of the explicit formula for ψ(x)\psi(x). Let’s combine them systematically.

Step 1: Start with the explicit formula for ψ(x)\psi(x) (from Section 7):

ψ(x)=xρxρρln(2π)+O(x2)\psi(x) = x - \sum_{\rho} \frac{x^\rho}{\rho} - \ln(2\pi) + O(x^{-2})

Step 2: Apply Möbius inversion to get θ(x)\theta(x) (from Section 8.4):

θ(x)=n=1μ(n)ψ(x1/n)\theta(x) = \sum_{n=1}^{\infty} \mu(n) \psi(x^{1/n})

Substituting the explicit formula:

θ(x)=n=1μ(n)[x1/nρxρ/nρln(2π)+O(x2/n)]\theta(x) = \sum_{n=1}^{\infty} \mu(n) \left[x^{1/n} - \sum_{\rho} \frac{x^{\rho/n}}{\rho} - \ln(2\pi) + O(x^{-2/n})\right]

The dominant term is:

n=1μ(n)x1/n=xx1/2x1/3x1/5+x1/6+\sum_{n=1}^{\infty} \mu(n) x^{1/n} = x - x^{1/2} - x^{1/3} - x^{1/5} + x^{1/6} + \cdots

(The signs come from μ(n)\mu(n), and only squarefree integers contribute.)

For large xx, the x1/nx^{1/n} terms for n2n \geq 2 are much smaller than xx, so θ(x)=x+O(x)\theta(x) = x + O(\sqrt{x}) up to the oscillatory terms.

Step 3: Convert θ(x)\theta(x) to π(x)\pi(x) (from Section 8.5):

π(x)=θ(x)ln(x)+2xθ(t)tln2(t)dt\pi(x) = \frac{\theta(x)}{\ln(x)} + \int_2^x \frac{\theta(t)}{t \ln^2(t)} \, dt

Step 4: Substitute θ\theta in terms of ψ\psi:

The calculation is involved, but the key observation is that each term ψ(x1/n)\psi(x^{1/n}) in the Möbius inversion contributes to π(x)\pi(x) through the integral transform. The result, after combining all terms, is:

π(x)=n=1μ(n)n[ψ(x1/n)ln(x1/n)+2x1/nψ(t)tln2(t)dt]\pi(x) = \sum_{n=1}^{\infty} \frac{\mu(n)}{n} \left[\frac{\psi(x^{1/n})}{\ln(x^{1/n})} + \int_2^{x^{1/n}} \frac{\psi(t)}{t \ln^2(t)} \, dt\right]

Step 5: Express using the Logarithmic Integral:

Substituting the leading behavior θ(t)t\theta(t) \approx t, the combination ylny+∫2ydtln2t\frac{y}{\ln y} + \int_2^y \frac{dt}{\ln^2 t} can be expressed more elegantly. Define the logarithmic integral:

Li(y)=2ydtlnt\text{Li}(y) = \int_2^y \frac{dt}{\ln t}

Through integration by parts and careful analysis, the smooth (non-oscillatory) part of the relationship becomes:

π(x)=n=1μ(n)nLi(x1/n)\pi(x) = \sum_{n=1}^{\infty} \frac{\mu(n)}{n} \text{Li}(x^{1/n})
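This Möbius-weighted sum can be evaluated numerically, truncating once x1/nx^{1/n} drops below 2 (where the offset Li\text{Li} vanishes), and it already tracks π(x)\pi(x) closely. A Python sketch using the document's offset Li\text{Li} (lower limit 2) via a simple midpoint quadrature; helper names are ours:

```python
import math

def mobius(n):
    """mu(n) from the definition in the text."""
    if n == 1:
        return 1
    sign, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0
            sign = -sign
        d += 1
    return -sign if n > 1 else sign

def Li(y, steps=20000):
    """Offset logarithmic integral: integral of dt/ln(t) from 2 to y (midpoint rule)."""
    if y <= 2:
        return 0.0
    h = (y - 2) / steps
    return h * sum(1 / math.log(2 + (i + 0.5) * h) for i in range(steps))

def smooth_pi(x):
    """sum_n mu(n)/n * Li(x^(1/n)), truncated where x^(1/n) <= 2."""
    total, n = 0.0, 1
    while x ** (1 / n) > 2:
        total += mobius(n) / n * Li(x ** (1 / n))
        n += 1
    return total

def pi_exact(x):
    return sum(1 for p in range(2, x + 1)
               if all(p % d for d in range(2, int(math.isqrt(p)) + 1)))

for x in [100, 1000]:
    print(x, pi_exact(x), round(smooth_pi(x), 2))
```

The residual difference is exactly what the oscillatory Li(xρ)\text{Li}(x^\rho) terms in the next step account for.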

Step 6: Substitute the explicit formula ψ(x)=xρxρρ+O(1)\psi(x) = x - \sum_{\rho} \frac{x^\rho}{\rho} + O(1):

When we substitute and collect terms:

π(x)=Li(x)ρLi(xρ)ln(2)+O(xln2x)\pi(x) = \text{Li}(x) - \sum_{\rho} \text{Li}(x^\rho) - \ln(2) + O\left(\frac{x}{\ln^2 x}\right)

where the sum runs over all non-trivial zeros ρ\rho of ζ(s)\zeta(s).

Understanding the Terms:

  • Li(x)\text{Li}(x): The principal term (Gauss’s empirical observation!)
  • ρLi(xρ)-\sum_{\rho} \text{Li}(x^\rho): Oscillatory corrections, one for each zeta zero
  • ln(2)-\ln(2): Small constant correction
  • Error term: smaller than the principal term by a factor of ln(x)\ln(x)

Historical Note and Modern Name

This is Riemann’s explicit formula for π(x)\pi(x), the culminating result of Riemann’s 1859 paper. It gives an exact expression for the number of primes up to xx in terms of:

  1. The logarithmic integral Li(x)\text{Li}(x) (the principal wave)
  2. Corrections from each zero of the zeta function (the oscillatory modes)
  3. Möbius-weighted corrections for prime powers

Gauss’s Observation Recovered: The principal term Li(x)\text{Li}(x) is exactly what Gauss empirically conjectured in the late 18th century! Gauss noticed that Li(x)\text{Li}(x) approximates π(x)\pi(x) far better than xlnx\frac{x}{\ln x}. Riemann’s formula explains why, and supplies the precise oscillatory corrections.

The Möbius inversion π(x)=n=1μ(n)nLi(x1/n)\pi(x) = \sum_{n=1}^{\infty} \frac{\mu(n)}{n} \text{Li}(x^{1/n}) is particularly elegant: it shows how counting primes (with multiplicity 1) requires carefully subtracting the overcounting from prime powers using the inclusion-exclusion principle encoded in μ(n)\mu(n).

8.7 The Complete Picture

We now have the full wave decomposition of the prime counting function:

π(x)=Li(x)smooth trend(Gauss’s conjecture)ρLi(xρ)oscillatory corrections(from zeta zeros)+O(1)\pi(x) = \underbrace{\text{Li}(x)}_{\substack{\text{smooth trend} \\ \text{(Gauss's conjecture)}}} - \underbrace{\sum_{\rho} \text{Li}(x^\rho)}_{\substack{\text{oscillatory corrections} \\ \text{(from zeta zeros)}}} + O(1)

The staircase of primes emerges from:

  1. A smooth principal wave (the logarithmic integral)
  2. An infinite superposition of oscillatory modes, one for each zero of the Riemann zeta function

The zeros “conspire” to create the irregular jumps at exactly the right locations—the prime numbers.

Next Step: Reviewing the Complete Chain

We’ve completed the derivation from empirical observations to the explicit formula. Let’s step back and review the entire logical chain, understanding how each piece fits together.


9. Summary: The Complete Derivation Chain

What We Explored from First Principles

Starting from Gauss’s observation of a staircase pattern in prime tables, we attempted to reconstruct the theory through a logical chain of reasoning. Each “rediscovery” below acknowledges that these tools were actually developed by brilliant mathematicians over centuries:

Section 1: Empirical Foundation

  • Observation: π(x)\pi(x) exhibits a staircase pattern
  • Gauss’s conjecture: π(x)xlnx\pi(x) \sim \frac{x}{\ln x} (smooth trend)
  • Strategy: Decompose into principal wave + oscillatory modes (Fourier perspective)

Sections 2-3: Information Theory - Finding the Right Measure

  • Exploration: Prime density appears scale-invariant, not translation-invariant
  • Maximum Entropy derivation: Natural measure is dt=dxxdt = \frac{dx}{x} → work in log-space t=ln(x)t = \ln(x)
  • Information content: Finding prime pp has information value I(p)=ln(p)I(p) = \ln(p)
  • Rediscovered: Von Mangoldt function Λ(n)=ln(p)\Lambda(n) = \ln(p) for n=pkn = p^k

Section 4: Cumulative Information

  • Definition: ψ(x)=nxΛ(n)\psi(x) = \sum_{n \leq x} \Lambda(n) = total information up to xx
  • Properties: Jumps by ln(p)\ln(p) at each prime power, cleaner analytically than π(x)\pi(x)
  • Rediscovered: Chebyshev ψ\psi function

Section 5: Spectral Analysis - Forward Transform

  • Discrete Fourier transform: F(s)=Λ(n)nsF(s) = \sum \frac{\Lambda(n)}{n^s} (Dirichlet series)
  • Statistical mechanics analogy: F(s)F(s) is average information, related to partition function Z(s)Z(s)
  • Solving differential equation: ZZ=F(s)-\frac{Z'}{Z} = F(s)Z(s)=11psZ(s) = \prod \frac{1}{1-p^{-s}}
  • Rediscovered: Riemann zeta function ζ(s)=1ns\zeta(s) = \sum \frac{1}{n^s} and Euler product
  • Result: Frequency spectrum is F(s)=ζ(s)ζ(s)F(s) = -\frac{\zeta'(s)}{\zeta(s)}

Section 6: Inverse Transform

  • Goal: Recover ψ(x)=nxΛ(n)\psi(x) = \sum_{n \leq x} \Lambda(n) from spectrum
  • Derivation from Cauchy’s formula: Heaviside step function in complex plane
  • Rediscovered: Perron’s formula
  • Evaluation via residue theorem:
    • Pole at s=1s=1: contributes principal term xx
    • Zeros ρ\rho of ζ(s)\zeta(s): each contributes oscillatory term xρρ-\frac{x^\rho}{\rho}

Section 7: The Explicit Formula

  • Result: ψ(x)=xρxρρ+O(1)\psi(x) = x - \sum_{\rho} \frac{x^\rho}{\rho} + O(1)
  • Interpretation: Each zero ρ=β+iγ\rho = \beta + i\gamma is a resonant frequency creating one wave xβeiγlnxx^\beta e^{i\gamma \ln x}
  • Rediscovered: Riemann’s explicit formula for ψ(x)\psi(x)
  • Riemann Hypothesis: All zeros on line β=1/2\beta = 1/2 → maximal spectral uniformity

Section 8: Recovery of π(x)\pi(x)

  • Introduced θ(x)\theta(x): First Chebyshev function θ(x)=pxln(p)=2xln(t)dπ(t)\theta(x) = \sum_{p \leq x} \ln(p) = \int_2^x \ln(t) \, d\pi(t)
  • Prime power relationship: ψ(x)=k=1θ(x1/k)\psi(x) = \sum_{k=1}^{\infty} \theta(x^{1/k}) (accounting for all powers)
  • Möbius inversion: θ(x)=k=1μ(k)ψ(x1/k)\theta(x) = \sum_{k=1}^{\infty} \mu(k) \psi(x^{1/k}) (removing overcounting)
  • Rediscovered: Möbius function and Möbius inversion
  • Integration by parts: π(x)=θ(x)lnx+θ(t)tln2tdt\pi(x) = \frac{\theta(x)}{\ln x} + \int \frac{\theta(t)}{t \ln^2 t} dt
  • Combined formula: π(x)=n=1μ(n)nLi(x1/n)\pi(x) = \sum_{n=1}^{\infty} \frac{\mu(n)}{n} \text{Li}(x^{1/n})
  • Final result: π(x)=Li(x)ρLi(xρ)+O(1)\pi(x) = \text{Li}(x) - \sum_{\rho} \text{Li}(x^\rho) + O(1)
  • Rediscovered: Riemann’s explicit formula for π(x)\pi(x)
  • Recovered: Gauss’s empirical observation Li(x)\text{Li}(x) as the principal term

Key Insights

Mathematical:

  • The “unpredictability” of primes is encoded as an infinite superposition of waves
  • The zeros of ζ(s)\zeta(s) are the resonant frequencies needed to build the staircase
  • Riemann Hypothesis = all waves have the same amplitude scaling (x\sqrt{x})

Physical:

  • Each zero is like a harmonic oscillator with frequency γ\gamma and damping β\beta
  • The zeros “conspire” through interference to create sharp jumps at prime locations
  • Prime distribution is a spectral phenomenon

Information-Theoretic:

  • Information content of prime pp is ln(p)\ln(p) (natural measure in log-space)
  • ψ(x)\psi(x) represents cumulative information encoded in integers
  • RH states primes achieve optimal “pseudorandomness”—maximal entropy consistent with deterministic structure

Standing on the Shoulders of Giants

Every major function we explored—von Mangoldt, Chebyshev, Riemann zeta, Perron formula, Möbius function—represents centuries of mathematical genius. In our reconstruction, these tools appeared to emerge “naturally” from our framework:

  1. Information Theory (Maximum Entropy) suggested weights and measures
  2. Empirical observation (Gauss’s tables) provided the principal wave
  3. Fourier Analysis offered the spectral decomposition
  4. Inclusion-Exclusion principle (through Möbius inversion) handled prime powers

But we must remember: the actual historical development was far more complex, involving countless dead ends, breakthroughs, and insights we’ve glossed over. Our “natural discovery” is only possible because we already know where we’re going. The original mathematicians—Euler, Gauss, Riemann, Chebyshev, Möbius, and others—deserve full credit for these profound discoveries.


10. Notes on Rigor and Methodology

Philosophy of This Derivation

This document is an exploratory sketch, not a rigorous proof. We emphasize discovery and intuition over formal rigor. Our goal has been to explore: “If 18th-century mathematicians had known information theory and Fourier analysis, how might they have approached Riemann’s formula?”

Important caveat: This reconstruction is inevitably colored by hindsight. We may have glossed over subtleties, made unjustified leaps, or oversimplified complex arguments. The actual mathematics requires much more care than presented here. If you find errors or gaps, please consult rigorous texts on analytic number theory.

What We Intentionally Omitted

For brevity and clarity, we’ve omitted several technical details that a fully rigorous treatment would require:

Convergence and Analytic Properties:

  • Convergence conditions for Dirichlet series (requires Re(s)>1\text{Re}(s) > 1 for absolute convergence)
  • Analytic continuation of ζ(s)\zeta(s) to the entire complex plane (requires functional equation)
  • Precise growth bounds for error terms
  • Justification of term-by-term integration and interchange of limits

Complex Analysis:

  • Rigorous contour deformation argument (shifting the integration line)
  • Behavior of ζ(s)\zeta(s) near s=1s=1 (requires Laurent expansion)
  • Proof that the shifted contour contributes negligibly
  • Detailed residue calculations

Number-Theoretic Subtleties:

  • Smoothing at discontinuities (ψ(x)\psi(x) vs ψ0(x)\psi_0(x)—the averaged version)
  • Prime powers vs primes (higher powers contribute O(x)O(\sqrt{x}))
  • Error bounds in the explicit formula

Functional Equation:

  • The functional equation ζ(s)=2sπs1sin(πs2)Γ(1s)ζ(1s)\zeta(s) = 2^s \pi^{s-1} \sin(\frac{\pi s}{2}) \Gamma(1-s) \zeta(1-s)
  • Location of trivial zeros (at negative even integers)
  • Symmetry of non-trivial zeros about Re(s)=1/2\text{Re}(s) = 1/2

What We Emphasized

Complete Logical Chain:

  • Every mathematical object was derived, not assumed
  • Each step flows naturally from the previous
  • The “next step” transitions make the logic explicit

First Principles Approach:

  • Information Theory (Maximum Entropy) for measures and weights
  • Fourier Analysis for spectral decomposition
  • Complex Analysis for inversion (Cauchy’s formula, residue theorem)

Rediscovery Narrative:

  • Pose problems naturally
  • Derive solutions from first principles
  • Reveal modern names afterward

Physical and Information-Theoretic Intuition:

  • Waves and frequencies (spectral picture)
  • Information content and entropy (MaxEnt picture)
  • Statistical mechanics analogies (partition functions)

How This Differs from Standard Treatments

Standard Rigorous Approach:

  1. Define ζ(s)\zeta(s) and carefully prove its properties
  2. Establish functional equation and analytic continuation with full rigor
  3. Apply to prime counting with precise error bounds

Our Exploratory Approach:

  1. Study primes directly (empirical observation)
  2. Explore how ζ(s)\zeta(s) might emerge (partition function analogy)
  3. Attempt to understand what it tells us about primes

The standard approach is more rigorous, more complete, and mathematically correct. Our approach sacrifices rigor for intuition, attempting to show one possible path of understanding. We cannot claim our approach “reveals why these tools exist”—that’s overstating it. Rather, we hope it provides one possible intuitive framework among many.

For readers who want the full rigorous treatment:

  1. Prime Number Theorem - Complete proof and history
  2. Riemann’s 1859 paper - The original (translated)
  3. Analytic Number Theory - General field overview
  4. Edwards, Riemann’s Zeta Function - Comprehensive modern treatment
  5. Davenport, Multiplicative Number Theory - Standard graduate textbook

Final Thought

Mathematics, at its best, reveals deep truths that were always there, waiting to be understood. The explicit formula for primes was implicit in the structure of arithmetic—but it took Riemann’s extraordinary insight (and decades of prior work by Euler, Gauss, and others) to see it.

Our exploration attempted to show one possible path through this beautiful mathematics, using information theory and Fourier analysis as guides. But we stand in awe of the original discoverers who found these truths without hindsight, without modern tools, through pure mathematical intuition and relentless dedication.

The connection between the zeros of the zeta function and the distribution of primes—between an analytic function in the complex plane and the multiplicative structure of integers—remains one of the most profound and mysterious phenomena in mathematics. We hope this sketch, despite its inevitable shortcomings, helps readers appreciate the beauty of this connection.


End of Sketch
