What (on earth) is a Sufficient Statistic?


It sometimes drives me insane to hear engineering or economics people throw the term “sufficient statistics” around in seminars. The intuition, which most people do get right, is that such a function summarizes the information in the data. Impeccable, alright. But the terminology itself comes with a very rigorous definition. If you do not plan to give the rigorous justification of why something is sufficient in the statistical sense, I would very much prefer that you use other words, like “sufficient information” or “sufficient signal” if necessary.

Anyhow, in case we are really gonna use it, let’s talk about what a sufficient statistic is. Given a measurable space $(\Omega, \mathcal{F})$, where $\Omega$ is the parameter space that labels all the possible statistical models $\{P_\theta\}$, we call a function, often denoted $T$, that maps the data $X \sim P_\theta(\cdot) \in \Delta (\mathcal{X})$ to some ‘‘conclusion’’ a statistic. E.g., from the Boston housing prices data we can compute the average price, and that is a statistic; we can also calculate the empirical variance, and that’s also a statistic. To be more general, we restrict the data to lie in a measurable space $(\mathcal{X}, \mathcal{B})$ and the statistical outcome to lie in a measurable space $(\mathcal{T}, \mathcal{C})$.

Definition 1

Let $(\mathcal{T} , \mathcal{C})$ be a measurable space such that the $\sigma$-field $\mathcal{C}$ contains all singletons. A measurable mapping $T : \mathcal{X} \to \mathcal{T}$ is called a statistic.

Usually we can think of $\mathcal{X}/\mathcal{T}$ as subspaces of $\mathbb{R}^d$ and $\mathcal{B}/\mathcal{C}$ as their Borel $\sigma$-algebras. Suppose the distributions $P_\theta$ have densities $f_{X|\Theta}(\cdot|\theta)$ w.r.t. a measure $\nu$, and so does the distribution of $T = T(X)$. The idea is that $T$ should say everything about the $\Theta$-data-generating process: with $t = T(x)$, the conditional probability $$ f_{X|T, \Theta} (x|t, \theta) = \frac{f_{X, T | \Theta} (x, t | \theta)}{f_{T| \Theta} ( t|\theta)} = \frac{f_{X | \Theta} (x | \theta)}{f_{T| \Theta} ( t|\theta)} $$ remains the same for all $\theta \in \Omega$ (the second equality holds because $T$ is a deterministic function of $X$). In plain words: no matter what the statistical model $P_\theta$ is, knowing the likelihood of the generated data, $f_{X | \Theta} (x | \theta)$, is equivalent to knowing the likelihood of the calculated statistic, $f_{T| \Theta} ( t|\theta)$, and once we have $t$ we don’t even care what the data looks like, since the statistic is sufficient. Simple, right? To claim some quantities are sufficient statistics, one does not need to state a rigorous definition like the one below, but one should at least check that the ratio of conditional probabilities above does not depend on $\theta$, because in some cases it simply does.
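A quick worked example (the standard textbook one, added here for illustration): let $X = (X_1, \dots, X_n)$ be i.i.d. Bernoulli$(\theta)$ and $T(x) = \sum_{i=1}^{n} x_i$. Then $$ f_{X|T, \Theta}(x | t, \theta) = \frac{\theta^{t} (1-\theta)^{n-t}}{\binom{n}{t} \theta^{t} (1-\theta)^{n-t}} = \binom{n}{t}^{-1}, $$ which is free of $\theta$: given the number of successes, every arrangement of them is equally likely, so the count alone is sufficient.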

Definition 2.

Suppose there exist versions of the conditional distributions $\mu_{X|\Theta,T} (\cdot \mid \theta, t)$ and a function $r : \mathcal{B} \times \mathcal{T} \to [0, 1]$ such that

  • $r(\cdot, t)$ is a probability on $\mathcal{B}$ for each $t \in \mathcal{T}$,
  • $r(B, \cdot)$ is measurable for each $B \in \mathcal{B}$, and
  • for each $\theta \in \Omega$ and $B \in \mathcal{B}$, $\mu_{X|\Theta,T} (B \mid \theta, t) = r(B, t)$ for $\mu_{T | \Theta}(\cdot \mid \theta)$-a.e. $t$.

Then $T$ is called a sufficient statistic for $\Theta$ (in the classical sense).
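To make Definition 2 concrete, here is a minimal numerical sketch in the Bernoulli example above; the code is my own illustration (plain Python, no statistics library), and it checks that the conditional law of $X$ given $T = t$ is one fixed probability $r(\cdot, t)$, whatever $\theta$ is.

```python
import itertools

# Minimal sketch, assuming the i.i.d. Bernoulli example with T(x) = sum(x):
# the conditional law of X given T = t should be a single r(., t), free of theta.
n, t = 4, 2
for theta in (0.2, 0.5, 0.8):
    # joint weights f_{X|Theta}(x | theta) over all sequences with T(x) = t
    weights = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in itertools.product((0, 1), repeat=n)
        if sum(x) == t
    }
    total = sum(weights.values())
    r = {x: w / total for x, w in weights.items()}
    print(theta, set(r.values()))  # always {1/6}: uniform over the C(4,2)=6 sequences
```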

Notice that we haven’t really discussed whether our setting is Frequentist or Bayesian yet, in the sense that we don’t know whether there is a prior measure on $\Omega$. But this definition is considered the Frequentist version by default.

Now let’s look at the Bayesian setting, where we have a prior $\mu_\Theta(\cdot) \in \Delta(\Omega)$.

Definition 3.

A statistic $T$ is called a sufficient statistic for the parameter $\Theta$ (in the Bayesian sense) if, for every prior $\mu_\Theta$, there exist versions of the posterior distributions $\mu_{\Theta|X}$ and $\mu_{\Theta|T}$ such that, for every $A \in \mathcal{F}$, we have $$ \mu_{\Theta|X} (A | x) = \mu_{\Theta|T} (A|T(x)) \quad\quad \mu_X\text{-a.s.} $$ where $\mu_X$ is the marginal distribution of $X$.

When there are densities (taken w.r.t. the prior $\mu_\Theta$), the equality looks like this $$ \begin{aligned} \mu_{\Theta \mid X}(A \mid x) & =\int_{A} f_{\Theta \mid X}(\theta \mid x) \mu_{\Theta}(d \theta) , \\ \mu_{\Theta \mid T}(A \mid t) & =\int_{A} f_{\Theta \mid T}(\theta \mid t) \mu_{\Theta}(d \theta) . \end{aligned} $$ Therefore, for any element $x$ in the support of $\mu_X$, it must hold that $ f_{\Theta \mid X}(\theta \mid x) = f_{\Theta \mid T}(\theta \mid T(x)) $, which collapses to the condition in the Frequentist setting.
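Continuing the Bernoulli example with a $\mathrm{Beta}(\alpha, \beta)$ prior (a standard conjugate computation, spelled out for illustration): $$ f_{\Theta \mid X}(\theta \mid x) \propto \underbrace{\theta^{\sum_i x_i} (1-\theta)^{n - \sum_i x_i}}_{\text{likelihood}} \cdot \underbrace{\theta^{\alpha - 1} (1-\theta)^{\beta - 1}}_{\text{prior}}, $$ so the posterior is $\mathrm{Beta}\big(\alpha + T(x),\ \beta + n - T(x)\big)$ with $T(x) = \sum_i x_i$; it depends on the data only through $T(x)$, exactly as Definition 3 requires.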

Now, suppose someone gives you a parameterized family of statistical models $\{P_\theta : \theta \in \Omega\}$, all of which have densities w.r.t. some measure $\nu$. How do we find a sufficient statistic? Or, how do we check whether a statistic is sufficient? The following theorem will help.

Factorization Theorem

$T$ is a sufficient statistic iff there exist non-negative measurable functions $h$ and $g$ such that $$ f_{X|\Theta} (x| \theta ) = h(x) g(\theta , T(x)). $$

Proof

Sufficiency: $$ \begin{aligned} \frac{d \mu_{\Theta \mid X}}{d \mu_{\Theta}}(\theta \mid x) & =\frac{f_{X \mid \Theta}(x \mid \theta)}{\int_{\Omega} f_{X \mid \Theta}(x \mid \theta') \mu_{\Theta}(d \theta')} \\ & =\frac{h(x) g(\theta, T(x))}{\int_{\Omega} h(x) g(\theta', T(x)) \mu_{\Theta}(d \theta')} \\ & =\frac{g(\theta, T(x))}{\int_{\Omega} g(\theta', T(x)) \mu_{\Theta}(d \theta')}\end{aligned} $$ hence the posterior depends on $x$ only through $T(x)$, which is exactly sufficiency in the Bayesian sense;

Necessity: by Bayes’ rule (with the posterior density taken w.r.t. the prior) and then sufficiency of $T$, $$ f_{X | \Theta}(x | \theta)= f_{\Theta \mid X}(\theta \mid x) f_{X}(x)= \underbrace{f_{X}(x)}_{h(x)} \underbrace{f_{\Theta \mid T}(\theta | T(x))}_{g(\theta, T(x))} . \qquad \square $$
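The classic application of the theorem (a standard example, stated here as an illustration): an i.i.d. sample from $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$. Expanding the square, $$ f_{X|\Theta}(x | \theta) = (2\pi\sigma^2)^{-n/2} \exp\bigg( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2} \bigg), $$ so with $h(x) = 1$ and $T(x) = \big(\sum_i x_i, \sum_i x_i^2\big)$ the factorization holds, and this pair (equivalently, the sample mean and empirical variance from the Boston example) is sufficient for $(\mu, \sigma^2)$.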

In particular, a finite-dimensional sufficient statistic always exists for an exponential family: with $T(x) = (t_i(x))_{i=1}^d$, the density already comes factorized, $$ f_{X \mid \Theta}(x \mid \theta) = \underbrace{h(x)}_{h(x)} \underbrace{c(\theta) \exp \bigg( \sum_{i=1}^{d} \theta_{i} t_{i}(x) \bigg) }_{g(\theta, T(x))}. $$
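For instance, the i.i.d. Bernoulli model from earlier fits this template (a standard rewriting): $$ \theta^{\sum_i x_i} (1-\theta)^{n - \sum_i x_i} = \underbrace{1}_{h(x)} \cdot \underbrace{(1-\theta)^{n} \exp \bigg( \log\tfrac{\theta}{1-\theta} \sum_{i=1}^{n} x_i \bigg)}_{g(\theta, T(x))}, $$ with natural parameter $\log\frac{\theta}{1-\theta}$ and, once again, $T(x) = \sum_i x_i$.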

To summarize: a statistic is just a function; when we say it is sufficient, we have to tell people for which (statistical) model and which parameter it is sufficient, even though sometimes this is intuitive and does not need all the fuss.