What really is the Expected Value?
I’ve been spending time reading statistician and professor Norman Matloff’s works, appreciating his knack for going “behind the scenes” on statistics concepts in Probability and Statistics for Data Science. His exposition of expected values in Chapter 3 was of particular interest. Unlike other sources that define the expected value (a.k.a. the mean) of a discrete random variable (say, \(X\)) using the well-known formula
\[\begin{equation} \large E(X) = \sum_{x} x\cdot P(X = x) \tag{1} \end{equation}\]
with \(x\) ranging over the sample space of \(X\), Matloff sets the record straight by demonstrating that this formula is actually derived from a more fundamental one:
\[\begin{equation} \large E(X) = \lim_{n \to\infty}\frac{X_1 + X_2 + ... + X_n}{n} \tag{2} \end{equation}\]
with \(n\) being the number of (random) samples drawn. Not only is this the actual definition of expected value, it also reflects intuition and applies to both discrete and continuous random variables! The mean/expected value of a random variable is the result of repeatedly taking random samples (\(X_1\), \(X_2\), …) and averaging all the outcomes. As the number of random samples approaches \(\infty\), we approach the exact expected value.
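To see \((2)\) in action, here's a quick simulation sketch of my own (Python with NumPy; the die example is mine, not Matloff's): repeatedly sample a fair six-sided die, whose expected value is \(3.5\), and watch the sample average settle toward that value as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# X = face showing on a fair six-sided die, so E(X) = (1 + 2 + ... + 6) / 6 = 3.5.
# Average n independent samples X_1, ..., X_n for increasingly large n.
for n in [10, 1_000, 100_000, 1_000_000]:
    samples = rng.integers(1, 7, size=n)   # uniform over {1, ..., 6}
    print(f"n = {n:>9,}  sample average = {samples.mean():.4f}")
```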
Let’s demonstrate how \((2)\) implies \((1)\).
Setting the Stage
Matloff offers the following scenario. Suppose we have 10 fair coins. We flip all 10 and count the number of heads. Then we do it again, repeatedly, each time recording the number of heads observed. What would be the long-run average number of heads?
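Before unpacking the derivation, here's a quick empirical peek (again my own sketch, not code from the book): simulate many trials of the 10-coin toss and average the head counts. The average settles near \(5\).

```python
import numpy as np

rng = np.random.default_rng(0)

# One trial = toss 10 fair coins and count the heads.
# Average the head counts over an increasing number of trials.
for num_trials in [100, 10_000, 1_000_000]:
    heads_per_trial = rng.binomial(n=10, p=0.5, size=num_trials)
    print(f"n = {num_trials:>9,}  average heads = {heads_per_trial.mean():.4f}")
```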
His explanation was somewhat terse, so I’d like to flesh out his logic. First, let’s get some notation out of the way.
- Define \(\large X_i\) to be \(\#\) of heads (\(0, 1, 2, ..., 10\)) observed on the \(i^{th}\) trial (\(i = 1, 2, 3, ...\)). The total number of trials is \(n\).
Note that a trial represents a single instance of tossing the 10 coins. After five trials we might observe:
\(X_1 = 6, X_2 = 6, X_3 = 2, X_4 = 5, X_5 = 1\)
This means 6 heads on the first trial, 6 heads on the second trial, 2 heads on the third trial, and so on.
At this point we could simply take the average across our trials:
\(\large \frac{6+6+2+5+1}{5} = \frac{20}{5} = 4\)
Of course, with only five trials, this isn’t a good estimate of the mean. Perhaps more importantly, computing the average in this way obscures the answer to the question:
for each possible outcome (i.e., number of heads) \(0, 1, 2, ..., 10\), how many times did it occur after 5 trials?
Looking back at the five trials, let’s make a frequency table describing how many of the trials resulted in 0, 1, …, 10 heads.
| Number of Heads | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency of Occurrence (in five trials) | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 |
For example, 2 trials (\(X_1\) and \(X_2\)) resulted in 6 heads. We can compute the average number of heads as follows:
\[\large \frac{0(0) + 1(1) + 2(1) + 3(0)+4(0)+5(1)+6(2)+7(0)+8(0)+9(0)+10(0)}{0+1+1+0+0+1+2+0+0+0+0} = \frac{20}{5} = 4\]
This is a weighted average, where each outcome \(0, 1, ..., 10\) is weighted by its frequency of occurrence. Note that the denominator sums to \(n\), the total number of trials.
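Here's that weighted average computed in code, using the frequency counts from the table above (a minimal sketch; the numbers are exactly those from the five trials):

```python
import numpy as np

outcomes = np.arange(11)                                      # possible head counts: 0, 1, ..., 10
frequencies = np.array([0, 1, 1, 0, 0, 1, 2, 0, 0, 0, 0])     # how often each count occurred in 5 trials

# Weighted average: each outcome weighted by its frequency, divided by the total number of trials.
weighted_average = (outcomes * frequencies).sum() / frequencies.sum()
print(weighted_average)   # 4.0, matching the plain average of the five trial results
```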
The above data was for only five trials, so let’s generalize this to \(n\) trials using some notation.
- Define \(\large K_{j,n}\) to be the number of times we observe \(j\) heads after \(n\) trials, with \(j = 0, 1, ..., 10\) and \(n\) being the (positive integer) number of trials.
Tying this back to the above example, \(n\) was \(5\) and \(K_{6,5} = 2\) (i.e., how many times did we get \(6\) heads in the \(5\) trials? \(2\) times.)
Generalizing
Armed with this notation, let’s generalize the above frequency table for any number of trials \(n\).
| Number of Heads | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency of Occurrence (in \(n\) trials) | \(K_{0,n}\) | \(K_{1,n}\) | \(K_{2,n}\) | \(K_{3,n}\) | \(K_{4,n}\) | \(K_{5,n}\) | \(K_{6,n}\) | \(K_{7,n}\) | \(K_{8,n}\) | \(K_{9,n}\) | \(K_{10,n}\) |
The weighted average computation now involves a limit as \(n \rightarrow \infty\), since we want the weighted average over an arbitrarily large number of trials.
\[\begin{equation} \large E(X) = \lim_{n\rightarrow \infty}\frac{0(K_{0,n}) + 1(K_{1,n}) + 2(K_{2,n}) + ... + 10(K_{10,n})}{n} \tag{3} \end{equation}\]
Let’s break this up:
\[\begin{equation} \large \lim_{n\rightarrow \infty}\left ( \frac{0\cdot K_{0,n}}{n} + \frac{1\cdot K_{1,n}}{n} + ... + \frac{10\cdot K_{10,n}}{n} \right ) \tag{4} \end{equation}\]
Then apply two properties of limits (the limit of a sum is the sum of the limits, and constant factors can be pulled out of a limit) to get:
\[\begin{equation} \large 0\cdot \lim_{n\rightarrow \infty}\frac{K_{0,n}}{n} + 1\cdot \lim_{n\rightarrow \infty}\frac{K_{1,n}}{n} + ... + 10\cdot \lim_{n\rightarrow \infty}\frac{K_{10,n}}{n} \tag{5} \end{equation}\]
How do we simplify this? By recognizing that \(\large \lim_{n\rightarrow \infty}\frac{K_{j,n}}{n} = P(j)\)! Each limit in \((5)\) is just the long-run proportion of trials in which we observe \(j \in \{0, 1, ..., 10\}\) heads, and that long-run proportion is the very definition of probability.
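This long-run-proportion claim is easy to verify numerically. The sketch below (mine, not from the book) compares \(K_{j,n}/n\) for a large \(n\) against the theoretical \(P(j)\), using the fact that the number of heads in 10 fair, independent coin flips follows a Binomial\((10, 1/2)\) distribution:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

num_trials = 1_000_000
trials = rng.binomial(n=10, p=0.5, size=num_trials)       # heads counted in each trial

# K_{j,n} / n: proportion of the n trials that produced exactly j heads
proportions = np.bincount(trials, minlength=11) / num_trials

# P(j) for a Binomial(10, 1/2) random variable: C(10, j) * (1/2)^10
probabilities = [comb(10, j) * 0.5**10 for j in range(11)]

for j in range(11):
    print(f"j = {j:>2}  K_j,n / n = {proportions[j]:.4f}  P(j) = {probabilities[j]:.4f}")
```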
Therefore,
\[\begin{equation} \large E(X) = \sum_{j=0}^{10}j\cdot P(j) \tag{6} \end{equation}\]
Remember that \(j\) represents all the values the random variable \(X\) can take on! This is a slight change in notation from Equation (1) as I didn’t want to overload the letter \(x\) in the derivations above.
To conclude,
\[\begin{equation} \large E(X) = \sum_{j} j\cdot P(X = j) \tag{7} \end{equation}\]
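As a final check, plugging the binomial probabilities for the 10-coin example into \((7)\) reproduces the long-run average from the earlier simulation (again assuming the head count is Binomial\((10, 1/2)\)):

```python
from math import comb

# E(X) = sum over j of j * P(X = j), with X = number of heads in 10 fair coin flips
expected_value = sum(j * comb(10, j) * 0.5**10 for j in range(11))
print(expected_value)   # 5.0, matching the simulated long-run averages above
```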
Wrapping Up
Matloff illuminated a formula I took for granted and demonstrated its true origin. I hope this presentation draws out the details more clearly.