# Computing Sample Variance: Why Divide by N - 1?

## Variance Estimation¶

In statistics we know that the mean and variance of a population $Y$ are defined to be:

\left\{ \begin{aligned} \text{Mean}(Y) &= \mu = \frac{1}{N} \sum_{i=1}^{N} Y_i \\ \text{Var}(Y) &= \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \mu)^2 \\ \end{aligned} \right.

where $N$ is the size of the population.

Given the population $Y$, we can draw a sample $X$ and compute statistics for $X$:

\left\{ \begin{aligned} \text{Mean}(X) &= \bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j \\ \text{Var}(X) &= s^2 = \frac{1}{n - 1} \sum_{j=1}^{n} (X_j - \bar{X})^2 \\ \end{aligned} \right.

where lowercase $n$ is the size of the sample, typically a much smaller number than $N$. One detail that is often not clearly explained in introductory statistics is why we should divide by $n - 1$ instead of $n$ in the calculation for the sample variance.

## Why divide by n - 1?¶

It turns out that we should divide by $n - 1$ because dividing by $n$ would give us a biased estimator of the population variance. Let's look at a concrete example before diving into the math for why. Let's say we have a population of 100,000 data points. These can represent, for instance, a movie rating for each of 100,000 people.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
figsize(15, 5)

In [2]:
import pandas as pd
import numpy as np

In [3]:
np.random.seed(42)

In [4]:
N = 100000  # size of population

In [5]:
population = pd.Series(np.random.randint(1, 11, N))


We can easily calculate the population mean and population variance:

In [6]:
population.mean()

Out[6]:
5.5155599999999998
In [7]:
((population - population.mean()) ** 2).sum() / N

Out[7]:
8.2592178864000001

Note that we are dividing by $N$ in the variance calculation, also that in numpy or pandas this is the same as simply using the method var with ddof=0

In [8]:
population.var(ddof=0)

Out[8]:
8.2592178864000001

where ddof=0 means to divide by $N$, and ddof=1 means to divide by $N - 1$.

## Simulation¶

As usual in statistics, the population parameters are often unknown. But we can estimate them by drawing samples from the population. Here we are drawing a random sample of size $30$. As of version 0.16.1, pandas has a convenient Series.sample() function for this:

In [9]:
samples = {}
n = 30  # size of each sample
num_samples = 500  # we are drawing 500 samples, each with size n
for i in range(num_samples):
samples[i] = population.sample(n).reset_index(drop=True)

In [10]:
samples = pd.DataFrame(samples)
samples.T.tail()

Out[10]:
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
495 4 6 3 7 9 1 3 6 8 7 ... 2 9 4 5 3 1 3 5 9 6
496 8 3 5 4 7 6 10 4 1 5 ... 5 8 1 5 3 10 2 5 9 2
497 8 4 7 5 3 10 5 7 9 2 ... 6 9 10 3 10 3 2 6 5 8
498 3 3 6 6 1 4 6 8 10 8 ... 8 1 3 8 3 6 4 9 7 3
499 10 5 1 3 10 7 3 6 8 9 ... 7 7 7 4 9 6 7 8 9 2

5 rows × 30 columns

As we expect, if we average all the sample means we can see that the it is a good estimate for the true population mean:

In [11]:
df = pd.DataFrame({'estimated mean': pd.expanding_mean(samples.mean()),
'actual population mean': pd.Series(population.mean(), index=samples.columns)})
df.plot(ylim=(4.5, 6.5))

Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x107a29198>

Now let's compare the results we would get by using the biased estimator (dividing by $n$) and the unbiased estimator (dividing by $n-1$)

In [12]:
df = pd.DataFrame({'biased var estimate (divide by n)': pd.expanding_mean(samples.var(ddof=0)),
'unbiased var estimate (divide by n - 1)': pd.expanding_mean(samples.var(ddof=1)),
'actual population var': pd.Series(population.var(ddof=0), index=samples.columns)})
df.plot(ylim=(6.5, 10.5))

Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x107a292b0>

We can see that the biased estimator (dividing by $n$) is clearly not estimating the true population variance as accurately as the unbiased estimator (dividing by $n-1$).

## Mathematical Proof¶

To prove that dividing by $n - 1$ is an unbiased estimator, we need to show that expected value of the estimaor is indeed $\sigma^2$: $$E(s^2) = E\left(\frac{1}{n - 1} \sum_{j=1}^{n} (X_j - \bar{X})^2\right) = \sigma^2$$

First we'll need to recall a few basic properties of expectation and variance:

\left\{ \begin{aligned} & E(Z_1 + Z_2) = E(Z_1) + E(Z_2), \text{ for any } Z_1, Z_2 \\ & \text{Var}(a Z) = a^2 \text{Var}(Z), \text{ for any } Z \\ & \text{Var}(Z_1 + Z_2) = \text{Var}(Z_1) + \text{Var}(Z_2), \text{ if } Z_1 \text{ and } Z_2 \text{ are independent} \\ \end{aligned} \right.

Also, the following is a useful form for variance: $$\text{Var}(Z) = E((Z - E(Z))^2) = E(Z^2 - 2ZE(Z) + E(Z)^2) = E(Z^2) - E(Z)^2$$

This is equivalent to $$E(Z^2) = \text{Var}(Z) + E(Z)^2$$

Using the above properties we can now simplify the expression for $E(s^2)$:

\begin{aligned} E(s^2) = E\left(\frac{1}{n - 1} \sum_{j=1}^{n} (X_j - \bar{X})^2\right) = & \frac{1}{n - 1} E \left( \sum_{j=1}^{n} (X_j^2 - 2X_j\bar{X} + \bar{X}^2) \right) \\ = & \ \frac{1}{n - 1} E \left( \sum_{j=1}^{n} X_j^2 - 2n\bar{X}^2 + n\bar{X}^2 \right) \\ = & \ \frac{1}{n - 1} E \left( \sum_{j=1}^{n} X_j^2 - n\bar{X}^2 \right) \\ = & \ \frac{1}{n - 1} \left[ E \left( \sum_{j=1}^{n} X_j^2 \right) - E \left( n\bar{X}^2 \right) \right] \\ = & \ \frac{1}{n - 1} \left[ \sum_{j=1}^{n} E \left( X_j^2 \right) - n E \left( \bar{X}^2 \right) \right] \\ \end{aligned}

Now notice that the first term can be simplied as:

\begin{aligned} \sum_{j=1}^{n} E \left( X_j^2 \right) = & \sum_{j=1}^{n} \left( Var(X_j) + E(X_j)^2 \right) \\ = & \sum_{j=1}^{n} \left( \sigma^2 + \mu ^2 \right) \\ = & \ n \sigma^2 + n \mu ^2 \\ \end{aligned}

Using the same trick, the second term becomes:

\begin{aligned} E(\bar{X}^2) = & \ Var(\bar{X}) + E(\bar{X})^2 \\ = & Var(\frac{1}{n} \sum_{j=1}^{n} X_j) + \mu ^2 \\ = & \frac{1}{n^2} Var(\sum_{j=1}^{n} X_j) + \mu ^2 \\ = & \frac{1}{n^2} \sum_{j=1}^{n} Var(X_j) + \mu ^2, \text{ because all } X_j\text{'s are independent} \\ = & \frac{1}{n^2} n\sigma^2 + \mu ^2 \\ = & \frac{1}{n} \sigma^2 + \mu ^2 \\ \end{aligned}

Plugging the two terms back we finally get:

\begin{aligned} E(s^2) = & \ \frac{1}{n-1} \left[ \sum_{j=1}^{n} E \left( X_j^2 \right) - n E \left(\bar{X}^2 \right) \right] \\ = & \ \frac{1}{n-1} \left[n \sigma^2 + n \mu ^2 - n \left( \frac{1}{n} \sigma^2 + \mu ^2 \right) \right] \\ = & \ \frac{1}{n-1} \left[n \sigma^2 + n \mu ^2 - \sigma^2 - n \mu ^2 \right] \\ = & \ \sigma^2 \\ \end{aligned}

Dividing by $n-1$ gives us an unbiased estimate for the population variance indeed!

## Source of Bias¶

One intuitive way to think about why the bias exists is to notice that we generally don't actually know the true population mean $\mu$, and therefore the sample variance is being computed using the estimated mean $\bar{X}$. However the quadratic form $\sum_{j=1}^{n} (X_j - a)^2$ is actually minimized by $a = \bar{X}$, which means that whatever the true population mean $\mu$ is, we will always have

$$\sum_{j=1}^{n} (X_j - \mu)^2 \geq \sum_{j=1}^{n} (X_j - \bar{X})^2$$

Therefore we are underestimating the true variance because we don't know the true mean.

In fact, we can see that we are underestimating by exactly $\sigma^2$ on average:

\begin{aligned} E\left(\sum_{j=1}^{n} (X_j - \mu)^2\right) = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X} + \bar{X} - \mu)^2\right) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 + \sum_{j=1}^{n} 2(X_j - \bar{X})(\bar{X} - \mu) + \sum_{j=1}^{n} (\bar{X} - \mu)^2 \right) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 + \sum_{j=1}^{n} (\bar{X} - \mu)^2 \right) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + E \left(\sum_{j=1}^{n} (\bar{X} - \mu)^2 \right) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + \sum_{j=1}^{n} E \left((\bar{X} - \mu)^2 \right) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + \sum_{j=1}^{n} \left( \text{Var} (\bar{X} - \mu) + E (\bar{X} - \mu)^2 \right) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + \sum_{j=1}^{n} \left( \text{Var} (\bar{X}) + E (\bar{X} - \mu)^2 \right) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + \sum_{j=1}^{n} \text{Var} (\bar{X}) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + n \text{Var} (\bar{X}) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + n \text{Var} (\frac{1}{n} \sum_{j=1}^{n} X_j) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + n \frac{1}{n^2} \sum_{j=1}^{n} \text{Var} (X_j) \\ = & \ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2 \right) + \sigma^2 \\ \end{aligned}

Combined with the result we have from the proof in the previous section, we can see that if we somehow magically knew the true mean $\mu$, dividing by $n$ would be unbiased:

\begin{aligned} E\left(\frac{1}{n} \sum_{j=1}^{n} (X_j - \mu)^2\right) = & \ \frac{1}{n} E \left(\sum_{j=1}^{n} (X_j - \mu)^2\right) \\ = & \ \frac{1}{n} \left[ E \left(\sum_{j=1}^{n} (X_j - \bar{X})^2\right) + \sigma^2 \right] \\ = & \ \frac{1}{n} \left[ (n - 1) \sigma^2 + \sigma^2 \right] \\ = & \ \sigma^2 \\ \end{aligned}

However since we don't know the true mean and are using the estimated mean $\bar{X}$ instead, we'd need to divide by $n - 1$ to correct for the bias. This is also known as Bessel's correction.