(Throughout this post I use the conventions $\log \equiv \log_e$ and $\lg \equiv \log_2$.)

In my post on Shannon entropy I introduced the notion of *information entropy* with respect to discrete random variables. There we derived, albeit in a hand-wavy fashion, the entropy formula:

$H(X) = H(p_1,\ldots,p_n) = -\sum_{j=1}^{n} p_j \cdot \lg p_j \enspace .$

I hinted at the fact that $H$ is maximized when the underlying random variable $X$ is distributed uniformly. We saw this in the coin-toss example, where any deviation from a fair coin reduced the entropy.

My aim in this post is to prove (at least in the discrete case) that $H$ is indeed maximized when $X$ is distributed uniformly.

As an aside, one thing I neglected to mention in my original post on entropy is how $H$ is computed when $p = 0$. By convention, $-p\cdot \lg p \equiv 0$ when $p = 0$.
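To make the convention concrete, here is a minimal Python sketch (my own illustration, not from the original post) of an entropy function that implements the convention by simply dropping the $p = 0$ terms:

```python
import math

def entropy(probs):
    """Shannon entropy in bits, with the convention -p * lg(p) = 0 when p = 0."""
    # Skipping the p == 0 terms implements the convention directly.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit; a certain outcome carries 0 bits.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([1.0, 0.0]))  # 0.0 -- the p = 0 term contributes nothing
```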

This agrees with our intuition, since the case $p = 0$ is analogous to the case $p = 1$ in that the event will almost surely *not* occur. And recall that $-1\cdot \lg 1 = 0$. So $p = 0$ can be seen as the symmetric case to $p = 1$, making the above convention not far-fetched.

You can actually show, using methods of indeterminate forms, that $p\cdot \lg p \to 0$ as $p \to 0^+$. So if we interpret the product as a limit as $p\to 0$, the above equality ceases to be a convention and becomes a fact.
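A quick numeric check (a sketch of my own, not part of the argument) makes the limit plausible:

```python
import math

# p * lg(p) shrinks toward 0 as p -> 0+, consistent with the limit argument.
for p in (1e-3, 1e-6, 1e-9, 1e-12):
    print(f"p = {p:.0e}   p * lg(p) = {p * math.log2(p):.3e}")
```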

First we need to quickly review a standard log inequality.

$\log x \leq x - 1 \enspace \forall x > 0 \enspace ,$

where equality holds *if and only if* $x = 1$.

The proof follows directly from the definition,

$\log x = \int_1^x \frac{dt}{t} \enspace .$

If $x\geq 1$ then

$\log x = \int_1^x \frac{dt}{t} \leq \int_1^x dt = x - 1 \enspace .$

If $0 < x < 1$ then

$\log x = \int_1^x \frac{dt}{t} = -\int_x^1 \frac{dt}{t} \leq -\int_x^1 dt = -(1-x) = x - 1 \enspace ,$

where the inequality follows from the fact that

$t\in [x,1] \Rightarrow \frac{1}{t} \geq 1 \Rightarrow -\frac{1}{t} \leq -1 \enspace .$

If $x = 1$ it follows trivially that $\log x = x - 1$. We want to show, however, that $x = 1$ is the *only* root.

For this we define the auxiliary function

$f(x) = \log x - x + 1 \enspace .$

Suppose that $f$ has *two* roots: $x_0 \neq 1$ and the known root $x = 1$, so that $f(x_0) = f(1) = 0$.

Assume, without loss of generality, that $x_0 < 1$.

According to Rolle’s theorem, since $f$ is continuous on $[x_0,1]$, differentiable on $(x_0,1)$, and $f(x_0) = f(1)$, there must exist a point $\xi \in (x_0,1)$ such that

$f'(\xi) = \frac{1}{\xi} - 1 = 0 \enspace .$

However, the only root of $1/\xi - 1 = 0$ is $\xi = 1$.

But $1 \notin (x_0,1)$.

*Contradiction.*

Therefore, $x = 1$ is the *only* root of $f$.
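As a sanity check on the inequality (my own snippet, not part of the proof), we can sample a few points and confirm that the bound holds, strictly away from $x = 1$:

```python
import math

# log(x) <= x - 1 for all x > 0, with equality only at x = 1.
for x in (0.01, 0.5, 0.99, 1.0, 1.01, 2.0, 100.0):
    assert math.log(x) <= x - 1
    if x != 1.0:
        assert math.log(x) < x - 1  # strict inequality away from x = 1

print("log(x) <= x - 1 held at all sampled points")
```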

Next we want to show that $\lg n$ is an *upper bound* for $H$.

By the change of base formula I can write

$\lg(n) = \frac{\log(n)}{\log(2)} \enspace .$

Suppose for now that $p_j > 0$ for each $j = 1,\ldots, n$.

Then,

$\begin{aligned} H(X) - \lg n &= H(p_1,\ldots,p_n) - \lg n \\ &= -\sum p_j \cdot \lg p_j - \lg n \\ &= -\sum p_j \cdot \frac{\log p_j }{\log 2} - \frac{\log n }{\log 2} \\ &= -\frac{1}{\log 2}\cdot \sum p_j \cdot\left(\log p_j + \log n \right) \qquad \small{\text{since } \sum p_j = 1} \\ &= -\frac{1}{\log 2}\cdot \sum p_j \cdot \log\left(p_j \cdot n \right) \qquad \star \\ &= \frac{1}{\log 2} \cdot \sum p_j \cdot \log \frac{1}{p_j \cdot n} \\ &\leq \frac{1}{\log 2} \cdot \sum p_j \cdot\left(\frac{1}{p_j \cdot n} - 1 \right) \qquad \small{\text{by the inequality from step 1}} \\ &= \frac{1}{\log 2} \cdot \sum \left(\frac{1}{n} - p_j \right) \\ &= \frac{1}{\log 2}\cdot\left(n\cdot\frac{1}{n} - 1\right) \\ &= \frac{1}{\log 2}\cdot\left(1 - 1\right) \\ &= 0 \enspace , \end{aligned}$

so $H(X) \leq \lg n$. $\square$

Now suppose $p_j = 0$ for some $j$. Recall from the preliminary discussion above that $-p\cdot\lg p \equiv 0$ by convention. Then from step $\star$ above simply note that

$p_j = 0 \Rightarrow -p_j\cdot \log\left(p_j \cdot n\right) = 0 < \frac{1}{n} - 0 = \frac{1}{n} - p_j \enspace ,$

which is precisely the term-by-term inequality required in the last step. The bound therefore still holds.

Finally, as discussed earlier, equality is reached *if and only if* every application of $\log x \leq x - 1$ above is an equality, i.e. when $\frac{1}{p_j \cdot n} = 1$ for each $j$.

That is, when $p_j = \frac{1}{n}$ for each $j$.

That is, when $X$ is *uniformly distributed*.

$\square$
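To close, here is a small numeric demonstration (my own sketch) that $H \leq \lg n$ for arbitrary distributions, with equality at the uniform one:

```python
import math
import random

def entropy(probs):
    # Convention: the p == 0 terms contribute nothing.
    return sum(-p * math.log2(p) for p in probs if p > 0)

n = 4

# The uniform distribution attains the bound H = lg(n) exactly.
assert abs(entropy([1 / n] * n) - math.log2(n)) < 1e-12

# Randomly drawn distributions never exceed lg(n).
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    total = sum(raw)
    probs = [r / total for r in raw]
    assert entropy(probs) <= math.log2(n) + 1e-12

print("H <= lg(n) held for all sampled distributions")
```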