A Cryptographic Conundrum


Optimal Communication

Let there be Alice and Bob, whose mission this time is to communicate in the most efficient way possible, given that Alice holds a bunch of random outcomes.

Alice wants to tell Bob about a sequence of \(n\) independent random outcomes sampled from a known distribution \(Q\). To keep things concise, they’ve agreed on a secret binary language. Now, the plot thickens with the concept of entropy. The entropy of \(Q\) is the expected number of bits necessary per random variable using the optimal code as \(n\) goes to \(\infty\). There is also the relative entropy between distributions \(P\) and \(Q\): we can think of it as the extra baggage in message length Alice and Bob have to lug around if they mistakenly believe the random variables are sampled from \(P\) instead of \(Q\).

Let $P$ be a measure on $[N]$ with $\sigma$-algebra $2^{[N]}$ and $X: [N] \to [N]$ be the identity random variable, i.e., $X(\omega) = \omega$. Since a binary code is used to convey the message, they might encode each value of $X$ with a binary code function $c: [N] \to \{0,1\}^* $, where $\{0,1\}^{*} $ is the set of finite sequences of zeros and ones. $c$ must be injective (so different values of $X$ are never confused) and prefix free (so no codeword is a prefix of another). The latter is needed simply because Bob has to know where one symbol ends and the next begins when multiple samples are sent.

We know that the easiest choice is to use $\lceil \log (N) \rceil$ bits no matter what the value of $X$ is, but if $X$ is far from uniform, say $P(X = 1) = 0.99$, then no matter what the rest of the probabilities look like, it is preferable to use a code shorter than $\lceil \log (N) \rceil$ bits for $X = 1$. A natural objective to formulate is

$$ c^* = \arg\min_c \mathbb{E}_{i \sim P } [ length(c(i)) ]. $$

It is well known that this optimization problem is solved by Huffman coding, and the optimal value satisfies:

$$ H_2(P) \leq \sum_{i=1}^N p_i length(c^*(i)) \leq H_2(P) + 1, $$

where $H_2(P)$ is the entropy of $P$

\[ H_2(P) = \sum_{i=1, p_i > 0}^N - p_i \log(p_i) . \]

The naive idea of using a code of uniform length is recovered only when $P$ is uniformly distributed. Why the restriction $p_i > 0$? Think about $\lim_{x \to 0^+} x\log(x) = 0$, or think of $H_2(P)$ as a kind of expectation, which should not change under perturbations on a set of measure $0$.
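To make this concrete, here is a minimal sketch in Python (standard library only; `huffman_lengths` is an ad hoc helper written for this post, not a library function) that computes Huffman codeword lengths for a heavily skewed $P$ and checks $H_2(P) \leq \sum_i p_i\, length(c^*(i)) \leq H_2(P) + 1$:

```python
import heapq
import math

def huffman_lengths(probs):
    """Optimal prefix-free codeword lengths for a probability vector (Huffman's algorithm)."""
    # Heap entries are (probability, tiebreak counter, symbols contained in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every symbol in the merged subtree sinks one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

P = [0.99, 0.004, 0.003, 0.002, 0.001]
lengths = huffman_lengths(P)
entropy = -sum(p * math.log2(p) for p in P if p > 0)
avg_len = sum(p * l for p, l in zip(P, lengths))
print(lengths)             # short code for the likely symbol, longer for the rare ones
print(entropy, avg_len)    # entropy <= avg_len <= entropy + 1
```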

The entropy $H_2(P)$ is a fundamental quantity. It is based on $\log_2$ since we are talking about binary codes; sometimes it is more convenient to scale it with the natural logarithm. The Shannon source coding theorem tells us (informally) that compressing $n$ i.i.d. samples from $P$ into fewer than $n H_2(P)$ bits inevitably results in lost information, so any coding whose average cost is $H_2(P)$ bits per symbol is unimprovable.

Relative Entropy

Now, imagine Alice and Bob in a parallel universe where they use a code that is optimal for $X$ sampled from distribution $Q$, while $X$ is actually sampled from $P$. Here comes the notion of relative entropy between $P$ and $Q$: it measures how much longer the messages are expected to be when using the optimal code for $Q$ instead of the optimal code for $P$. Let $p_i = P(X= i)$ and $q_i = Q(X= i)$; assuming Shannon’s coding, the relative entropy can be written as

$$ D(P,Q) = \sum_{i \in [N]: p_i > 0} - p_i \log(q_i) - (\sum_{i \in [N]: p_i > 0} - p_i \log(p_i)) = \sum_{i \in [N]: p_i > 0} p_i \log(\frac{p_i}{q_i}) $$

From Jensen’s inequality ($\log$ is concave, so $\mathbb{E}_P [\log(\frac{q_i}{p_i})] \leq \log( \mathbb{E}_P [ \frac{q_i}{p_i}])$) or from the optimality of the coding, $D(P,Q) \geq 0$. This quantity is also called the KL divergence in other contexts. A question remains: what if $p_i = q_i = 0$ for some $i \in [N]$? Well, such an $i$ need not be considered, since under both $P$ and $Q$ it lies in a measure-zero set. Also, the necessary and sufficient condition for $D(P,Q)$ to be finite is that whenever $q_i = 0$, we also have $p_i = 0$. In measure-theoretic language, this condition means that $P$ is absolutely continuous with respect to $Q$.
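As a small numerical sanity check (a sketch assuming `numpy` is available), the relative entropy is exactly the extra expected message length paid for using the code matched to $Q$ when the samples actually come from $P$:

```python
import numpy as np

def relative_entropy(p, q):
    """D(P, Q) in bits for discrete distributions; assumes p_i > 0 implies q_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([1 / 3, 1 / 3, 1 / 3])

cost_with_wrong_code = -np.sum(P * np.log2(Q))  # expected bits with the code matched to Q
cost_with_right_code = -np.sum(P * np.log2(P))  # expected bits with the code matched to P
print(cost_with_wrong_code - cost_with_right_code)  # extra baggage in message length ...
print(relative_entropy(P, Q))                       # ... equals D(P, Q), and is >= 0
```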

Now we jump out of the story and consider an arbitrary measurable space $(\Omega, \mathcal{F})$. The support of $P$ might not be finite, or even countable. Defining entropy along the same path is hard, since infinitely many symbols would be needed. This fundamental difficulty resolves itself if we directly consider relative entropy.

Formally, consider discretizations of the sample space $\Omega$, i.e., measurable maps $X : \Omega \to [N]$. The relative entropy can then be defined as $$ D(P,Q) = \sup_{N \in \mathbb{N}^+} \sup_{X} D(P_X, Q_X), $$ where $P_X$ and $Q_X$ are the pushforward measures, i.e., $P_X(\mathcal{I}) = P( X^{-1}(\mathcal{I}))$ for every $\mathcal{I} \in 2^{[N]}$, and similarly for $Q_X$. In other words, the relative entropy measures how well Bob can distinguish between $P$ and $Q$ from the ‘‘codes’’ he receives, no matter how Alice does the encoding. This quantity has profound meaning in a ton of applications.
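Here is a minimal sketch of this supremum, assuming `scipy` is available, and anticipating the Gaussian example worked out below (whose relative entropy, using the natural logarithm, is $\frac{(\mu_1-\mu_2)^2}{2\sigma^2}$). We push two Gaussians forward through ever finer binnings of the real line and watch the discretized relative entropy climb toward that value:

```python
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 1.0, 1.0
closed_form = (mu1 - mu2) ** 2 / (2 * sigma ** 2)  # = 0.5 for these parameters

def discretized_relative_entropy(n_bins):
    """Relative entropy between the pushforwards of P and Q under a binning of the line."""
    edges = np.concatenate(([-np.inf], np.linspace(-10, 10, n_bins - 1), [np.inf]))
    p = np.diff(norm.cdf(edges, loc=mu1, scale=sigma))  # P_X over the bins
    q = np.diff(norm.cdf(edges, loc=mu2, scale=sigma))  # Q_X over the bins
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for n in [2, 4, 16, 256]:
    # Each discretized value is <= closed_form (coarsening loses information)
    # and approaches it as the partition is refined.
    print(n, discretized_relative_entropy(n), "<=", closed_form)
```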

Theorem 1

Let $(\Omega, \mathcal{F})$ be a measurable space, and let $P$ and $Q$ be probability measures on this space. Then

$$ D(P,Q) = \begin{cases} \int \log(\frac{dP}{dQ}(\omega)) dP(\omega) ,\ \ &\text{if } P \ll Q; \\\ \infty, \quad &\text{otherwise} \end{cases} \label{thm1} \tag{1}$$

When calculating the relative entropy, densities are what one actually works with. Let $\lambda$ be a $\sigma$-finite measure dominating both $P$ and $Q$, and write $p = \frac{dP}{d\lambda}$ and $q = \frac{dQ}{d\lambda}$. If $P \ll Q$, the chain rule for Radon-Nikodym derivatives gives

$$ D(P,Q) = \int p \log(\frac{p}{q}) d\lambda \label{eq2} \tag{2} $$

Such a $\lambda$ can always be found; for example, $\lambda = P+Q$ always dominates both $P$ and $Q$.
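A short sketch (assuming `numpy`) of $\eqref{eq2}$ in the discrete case with $\lambda = P + Q$, checking that it agrees with the direct definition:

```python
import numpy as np

P = np.array([0.5, 0.5, 0.0])
Q = np.array([0.25, 0.5, 0.25])

lam = P + Q                                                  # a dominating measure (total mass 2)
p = np.divide(P, lam, out=np.zeros_like(P), where=lam > 0)   # density dP/dlambda
q = np.divide(Q, lam, out=np.zeros_like(Q), where=lam > 0)   # density dQ/dlambda

mask = p > 0
via_lambda = np.sum(lam[mask] * p[mask] * np.log(p[mask] / q[mask]))  # formula (2)
direct     = np.sum(P[mask] * np.log(P[mask] / Q[mask]))              # discrete definition
print(via_lambda, direct)                                    # the two agree
```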

Note that relative entropy measures a kind of distance from $P$ to $Q$, but it cannot be treated as a metric, since it lacks properties such as the triangle inequality and symmetry. Nevertheless, it often serves the same purpose.

Examples

Consider two Gaussian distributions with means $\mu_1$ and $\mu_2$ and common variance $\sigma^2$:

$$ D(\mathcal{N}(\mu_1 , \sigma^2), \mathcal{N}(\mu_2 , \sigma^2)) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}. $$

The quadratic term matches our intuition about such a ‘‘distance’’.
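This can be checked directly from $\eqref{eq2}$ with $\lambda$ the Lebesgue measure and $p, q$ the two Gaussian densities:

$$ \log\frac{p(x)}{q(x)} = \frac{(x-\mu_2)^2 - (x-\mu_1)^2}{2\sigma^2} = \frac{2x(\mu_1 - \mu_2) + \mu_2^2 - \mu_1^2}{2\sigma^2}, $$

and taking the expectation over $x \sim \mathcal{N}(\mu_1, \sigma^2)$ replaces $x$ by $\mu_1$, which yields $\frac{2\mu_1(\mu_1 - \mu_2) + \mu_2^2 - \mu_1^2}{2\sigma^2} = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}$.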

Consider two Bernoulli random variables with means $p, q \in [0,1]$; then:

$$ D(\mathcal{B}(p), \mathcal{B}(q)) = p\log(\frac{p}{q}) + (1-p)\log(\frac{1-p}{1-q}), $$

where we adopt the convention $0 \log(\cdot) = 0$.
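A tiny sketch (assuming `numpy`) with this convention baked in, which also illustrates the asymmetry mentioned earlier:

```python
import numpy as np

def bernoulli_relative_entropy(p, q):
    """D(B(p), B(q)) using the natural logarithm, with the convention 0 * log(.) = 0."""
    total = 0.0
    for a, b in [(p, q), (1 - p, 1 - q)]:
        if a > 0:
            total += a * np.log(a / b)
    return total

print(bernoulli_relative_entropy(0.3, 0.5), bernoulli_relative_entropy(0.5, 0.3))  # asymmetric
print(bernoulli_relative_entropy(0.0, 0.5))   # the convention handles p = 0 gracefully
```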

To Wrap It Up

Now we come to the inequality dictating the capacity of deception, an inequality that connects the relative entropy to the hardness of hypothesis testing, stated in the following theorem.

Theorem 2 (Bretagnolle-Huber Inequality)

Let $P$ and $Q$ be probability measures on the same measurable space $(\Omega, \mathcal{F})$, and let $A \in \mathcal{F}$ be an arbitrary event. Then

$$ P(A) + Q(A^c) \geq \frac{1}{2} \exp (-D(P,Q)) $$

where $A^c = \Omega \backslash A$ is the complement of $A$.

Proof

For reals $a, b$, abbreviate $a \vee b := \max\{ a, b\}$ and $a \wedge b := \min\{a,b\}$. If $D(P,Q) = \infty$, the inequality holds trivially. If not, then $P \ll Q$ by Theorem 1, equation $\eqref{thm1}$. Let $\nu = P+Q$, and let $p =\frac{ dP }{d\nu}$ and $q =\frac{ dQ }{d\nu}$ be the Radon-Nikodym derivatives. By $\eqref{eq2}$, the relative entropy is

$$ D(P,Q) = \int p \log(\frac{p}{q}) d\nu $$

Sometimes we drop $\nu$ for brevity, writing it as $\int p \log(\frac{p}{q})$. It turns out that a stronger result suffices:

$$ \int p\wedge q \geq \frac{1}{2} \exp(-D(P,Q)). $$

Why? Because $\int p \wedge q = \int_A p\wedge q + \int_{A^c} p \wedge q \leq \int_A p + \int_{A^c} q = P(A) + Q(A^c)$. First, we use the Cauchy-Schwarz inequality and the identity $pq = (p\wedge q) (p\vee q)$:

$$ \left( \int \sqrt{pq}\right)^2 = \left( \int \sqrt{(p\wedge q) (p\vee q)} \right)^2 \leq \left( \int p \wedge q\right) \left( \int p \vee q\right). $$

Also, using the identity $p\wedge q + p \vee q = p + q$, we have $\int p \vee q = 2 - \int p\wedge q \leq 2$, so the Cauchy-Schwarz bound above gives $\int p\wedge q \geq \frac{1}{2}\left(\int \sqrt{pq}\right)^2$. Now, using Jensen’s inequality and some elementary manipulation, we arrive at:

$$\begin{align} \left(\int \sqrt{pq}\right)^2 & = \exp(2 \log \int \sqrt{pq}) = \exp(2 \log\int_{p> 0} p \sqrt{\frac{q}{p}}) \nonumber \\\ & \geq \exp(2\int_{p > 0} p \frac{1}{2} \log(\frac{q}{p})) = \exp(-\int_{pq > 0} p \log(\frac{p}{q})) \nonumber \\\ & = \exp(-\int p \log (\frac{p}{q})) = \exp(-D(P,Q)). \end{align}$$

Since $P\ll Q$, $q = 0$ implies $p = 0$, so $p>0$ implies $q > 0$ and hence $pq > 0$, which justifies restricting the integrals to $\{p > 0\}$. Combining the last two displays concludes the proof.

Here is a bit of intuition. If $P$ and $Q$ are close, we expect $P(A) + Q(A^c)$ to be large, close to $1$, and exactly how large is quantified by this theorem. Also, the left-hand side is symmetric in $P$ and $Q$, so we can always replace $D(P,Q)$ by $D(Q,P)$; since $D(P,Q)$ is not symmetric, this sometimes gives a stronger inequality.
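To close, here is a minimal numerical check of the theorem (assuming `scipy`), using the two Gaussians from the examples and events of the threshold form $A = \{x > t\}$:

```python
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 1.0, 1.0
D = (mu1 - mu2) ** 2 / (2 * sigma ** 2)   # relative entropy from the Gaussian example
lower_bound = 0.5 * np.exp(-D)            # right-hand side of the theorem

# Threshold events A = {x > t}: declare "Q" when the observed sample exceeds t.
ts = np.linspace(-5.0, 6.0, 2001)
total_error = norm.sf(ts, loc=mu1, scale=sigma) + norm.cdf(ts, loc=mu2, scale=sigma)  # P(A) + Q(A^c)

print(total_error.min(), ">=", lower_bound)   # about 0.617 >= about 0.303
```

The minimum over thresholds is attained near the midpoint of the two means and stays comfortably above the bound; for equal variances the event minimizing $P(A) + Q(A^c)$ is in fact of this threshold form.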

Next post

Nesterov