\( \newcommand{\prob}{\mathcal{P}} \newcommand{\expect}{\mathbb{E}} \)

Unexpected Average

Let's think about the average.

The average is often the first number that's presented to us when we're looking at a summary of a dataset. Average wage, average number of children per woman, average number of cars per person.

In mathematical jargon, the average is also known as expectation. It is the value that we expect to get when we draw a random element from the dataset.

But that's quite counter-intuitive: we often tend to conflate the expectation with quantiles. The \(p\)-quantile of a random variable \(X\) is a value \(x\) such that \(\prob(X \lt x) = p\) (or, equivalently, \(\prob(X \geq x) = 1-p\)). The quantile answers the question "what is the value \(x\) such that a given random variable comes out below \(x\) with probability \(p\)?"
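To see the difference numerically, here is a minimal sketch (assuming Python with numpy; the lognormal shape is just an arbitrary stand-in for a skewed, wage-like dataset) comparing the mean with the median, i.e. the \(0.5\)-quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
# A skewed, wage-like sample: most values are small, a few are huge.
wages = rng.lognormal(mean=4.0, sigma=1.0, size=100_000)

mean = wages.mean()               # the expectation (average)
median = np.quantile(wages, 0.5)  # the 0.5-quantile
frac_at_least_mean = (wages >= mean).mean()

print(f"mean   = {mean:.1f}")
print(f"median = {median:.1f}")
print(f"fraction >= mean = {frac_at_least_mean:.2f}")  # well below 0.5 here
```

For a right-skewed sample like this one, the mean lands well above the median, so a randomly drawn value reaches the average noticeably less than half of the time.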

Sometimes we hear "the average wage is $100" and intuitively think "there's a good chance we'll be earning at least $100".

So, what is the relationship between expectation and quantiles?

Consider a company with \(k\) employees and one CEO, where every employee earns $100, and the CEO earns $200. The average wage at this company is $$ \expect W = \frac{100k + 200}{k+1} = 100 \left(1 + \frac{1}{k+1}\right) \gt 100 $$ The probability \( \prob(W \geq \expect W ) \) is \(\frac{1}{k+1}\): you have to be the CEO.
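A quick numerical sanity check of this arithmetic (a throwaway Python sketch; the numbers are just the ones from the example):

```python
k = 1000                      # number of ordinary employees
wages = [100] * k + [200]     # every employee earns $100, the CEO $200

avg = sum(wages) / len(wages)                                  # 100 * (1 + 1/(k+1))
frac_at_least_avg = sum(w >= avg for w in wages) / len(wages)

print(avg)                # ~100.0999
print(frac_at_least_avg)  # 1/(k+1) ~ 0.000999: you have to be the CEO
```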

This shows that the probability that a random variable attains at least its expectation can be arbitrarily close to 0. But can it be zero itself?

Intuition says that would be extremely weird—it would mean that almost surely (with probability 1) the outcome is smaller than the expected outcome. In order to verify that, we need to dive into probability theory.

Let us assume \(X\) is a random variable with finite expectation \(\expect X \) and \(\prob(X \lt \expect X ) = 1\). We have the set equality $$ \{X \lt \expect X \} = \bigcup_{n=1}^\infty \left\{X \lt \expect X - \frac{1}{n}\right\}, $$ and by \(\sigma\)-(sub)additivity of probability at least one of the sets on the right-hand side must have non-zero probability: there is some \(x = \expect X - \frac{1}{n} \lt \expect X \) with $$ \prob(X \lt x) = p \gt 0. $$ The sets \(\{X \lt x\}\), \(\{x \leq X \lt \expect X \}\) and \(\{\expect X \leq X\}\) partition the sample space, and the last one has probability zero by assumption. By monotonicity of the integral we then have $$\begin{eqnarray} \expect X = \hspace{-2em}\int\limits_{ \substack{ \{X \lt x\} \, \cup\\ \{x \leq X \lt \expect X \} \, \cup\\ \{\expect X \leq X\} } } \hspace{-1.5em}X \, d\prob \leq p x + (1-p) \expect X + 0 = \expect X + p(x - \expect X ) \lt \expect X , \end{eqnarray}$$ a contradiction.
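The finite-sample analogue of this is easy to check empirically: in any finite dataset the maximum is at least the mean, so the empirical probability of drawing a value greater than or equal to the mean is at least \(1/n\). A small sketch (assuming numpy; the lognormal samples are an arbitrary choice) that illustrates the statement, not the proof:

```python
import numpy as np

rng = np.random.default_rng(1)

# For any finite sample, max(data) >= mean(data), so the empirical
# probability of drawing a value >= the mean is at least 1/n > 0.
for _ in range(5):
    data = rng.lognormal(size=1_000)
    frac_at_least_mean = (data >= data.mean()).mean()
    assert frac_at_least_mean > 0
    print(f"{frac_at_least_mean:.3f}")
```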

Conclusion: if you know the average of some data, all you can be a priori certain of is that there is a non-zero probability that some of the data is greater than or equal to the average.

Weird Content

But there's more.

In the above proof, \(\sigma\)-additivity is essential. While \(\sigma\)-additivity is baked into the definition of a measure (of which probability is a special case), there is a generalization, called a content, which only requires finite additivity. For contents, the proof breaks down, so the natural question is whether there is a content, and a "random variable" integrated against it, that is fully concentrated strictly to the left of its expectation.

Let \(x\) be a real number. Consider the content \(\lambda_x\) on the ring of sets generated by the open intervals of \(\mathbb{R}\), defined by $$ \lambda_x \big( (x-\varepsilon, x) \big) = 1 $$ for all \(\varepsilon \gt 0\), and by setting \(\lambda_x\) to zero wherever else possible. In particular, \(\lambda_x \big( (x-\delta, x-\varepsilon) \big) = 0\) for all \(\delta \gt \varepsilon \gt 0\), including \(\delta = \infty\).
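To get a feel for what finite additivity does and does not allow here, the following Python sketch is my own simplified model of \(\lambda_x\): it only handles half-open intervals \([a, b)\) rather than the full ring, and the names `X0` and `content` are ad hoc. It shows that the content adds up across a finite split of \((x-1, x)\), but fails \(\sigma\)-additivity on a countable one:

```python
X0 = 0.0  # the point x; any real number would do

def content(a, b):
    """lambda_x of the half-open interval [a, b): 1 if the interval
    reaches up to x from the left, 0 otherwise (and 0 if empty)."""
    return 1 if a < b and a < X0 <= b else 0

# Finite additivity on a simple split of (x-1, x):
#   [x-1, x) = [x-1, x-0.5) ∪ [x-0.5, x)
print(content(X0 - 1, X0))                                # 1
print(content(X0 - 1, X0 - 0.5) + content(X0 - 0.5, X0))  # 0 + 1 = 1

# ... but sigma-additivity fails: [x-1, x) is the disjoint countable union
# of the pieces [x - 1/n, x - 1/(n+1)) for n = 1, 2, ..., each of content 0.
# Summing (finitely many of) the piece contents can never reach 1:
pieces = sum(content(X0 - 1/n, X0 - 1/(n + 1)) for n in range(1, 10_000))
print(pieces)  # 0, while content(X0 - 1, X0) == 1
```

This failure of \(\sigma\)-additivity is exactly the loophole the next example exploits.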

Consider (for instance, many others would work as well) the function \(X:\mathbb{R} \rightarrow \mathbb{R}\) defined by $$ X(t) = \begin{cases} x - 1 & \text{if } t \leq x - 1 \\ t & \text{if } x - 1 \leq t \leq x \\ x & \text{if } x \leq t \end{cases}$$ \(X\) is integrable w.r.t. \(\lambda_x\) (whatever that actually means), and it is not too hard to see that \(\int X \, d\lambda_x = x\): \(X\) can be approximated from below by simple functions whose integral is \(x - \varepsilon\), for arbitrarily small \(\varepsilon \gt 0\), because the only piece of any partition that carries content is the one of the form \((x-\varepsilon, x)\), on which \(X \geq x - \varepsilon\).
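The approximation argument can also be rendered numerically (again only a sketch in the simplified half-open-interval model from above; `lower_simple_integral` is an ad-hoc name): splitting \((x-1, x)\) into \(n\) equal pieces, only the last piece carries any content, so the lower simple-function integral equals \(x - 1/n\) and tends to \(x\).

```python
X0 = 0.0  # the point x

def content(a, b):
    """lambda_x of [a, b): 1 iff the interval reaches up to x from the left."""
    return 1 if a < b and a < X0 <= b else 0

def X(t):
    """The piecewise-linear function from the text."""
    return max(X0 - 1, min(t, X0))

def lower_simple_integral(n):
    """Integral of a simple function below X: split [x-1, x) into n equal
    pieces and use the left-endpoint value of X (its infimum, since X is
    non-decreasing) on each piece.  Everything outside [x-1, x) only meets
    sets of content 0 and contributes nothing."""
    total = 0.0
    for k in range(n):
        a = X0 - 1 + k / n
        b = X0 - 1 + (k + 1) / n
        total += X(a) * content(a, b)
    return total

for n in (1, 10, 100, 1000):
    print(n, lower_simple_integral(n))  # x - 1/n, approaching x = 0.0 from below
```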

So, \(X\) (and \(\lambda_x\)) is fully concentrated left of \(x\), but its expectation is \(x\).

A way to think about this is that \(\lambda_x\) is concentrated infinitely close to \(x\). Probability can be seen as weight, and the expectation is then the center of gravity. The proof above tells us that probability is well behaved: the weight cannot lie entirely strictly to one side of the center of gravity. The example with \(\lambda_x\) shows that if we relax \(\sigma\)-additivity, we get the counter-intuitive phenomenon of the full weight sitting infinitely close to the center of gravity, yet entirely to its left.

This exercise also shows that even though it may be unclear at first why something as complicated and abstract as \(\sigma\)-additivity is required of measures and probabilities, it is actually the more natural notion when compared to mere finite additivity.