What does a typical person look like?

The curse of dimensionality

There is a famous tale1 about the U.S. army gathering measurements from thousands of soldiers to design a new fighter jet cockpit. They considered the average measurements in 10 different dimensions, thinking they’d design a cockpit which would fit the “average” soldier. However, when testing the cockpit, they realized that it didn’t fit anyone! Their crucial mistake was forgetting that as the number of dimensions increases, the likelihood of being far from the average in at least one dimension also increases.

Let’s model this problem mathematically by assuming everyone is defined by $N$ independent attributes $X_i$, each following a Gaussian distribution $\mathcal{N}(0,1)$. As expected, $\mathbb{E}[X_i] = 0$ for every attribute. What this tells us is that a population’s values for attribute $i$ are zero on average. But at the same time, the probability of a given person being $\varepsilon$-close to the average in all dimensions is $P(|X_i| < \varepsilon \text{ for all } i) = \gamma_\varepsilon^N$, where $\gamma_\varepsilon = \frac{2}{\sqrt{2\pi}} \int_0^\varepsilon e^{-x^2/2}\,\mathrm{d}x$. This probability decreases exponentially with the number of dimensions $N$.
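To see this exponential decay concretely, here is a minimal sketch in Python (standard library only; the function names `gamma_eps` and `all_close_prob` are my own). It uses the identity $\gamma_\varepsilon = \operatorname{erf}(\varepsilon/\sqrt{2})$ for a standard Gaussian:

```python
import math

def gamma_eps(eps):
    # P(|X| < eps) for X ~ N(0, 1); equals erf(eps / sqrt(2))
    return math.erf(eps / math.sqrt(2))

def all_close_prob(eps, n_dims):
    # probability that all n_dims independent attributes are eps-close to average
    return gamma_eps(eps) ** n_dims

for n in (1, 10, 100):
    print(f"N={n}: P(within eps=1 in every dimension) = {all_close_prob(1.0, n):.2e}")
```

With $\varepsilon = 1$ (one standard deviation), a single attribute is close to average about 68% of the time, but across 10 attributes the probability of being close in all of them already drops to roughly 2%.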

What the army should have measured is what a ‘typical’ person looks like instead of the average point. This can be done by focusing on the average distance of points to the origin, which essentially tells you how far the average person is from this mythical perfectly average person. Taking the squared $L^2$ distance $R^2 = \sum_i X_i^2$, and remembering that the variance of each variable is 1, then $\mathbb{E}[R^2] = N$. Moreover, the variance of the squared distance is

$$
\begin{aligned}
\sigma(R^2)^2 &= \sum_i \sigma(X_i^2)^2 \\
&= \sum_i \left( \mathbb{E}[X_i^4] - \mathbb{E}[X_i^2]^2 \right) \\
&= \sum_i (3 - 1) = 2N~,
\end{aligned}
$$

using the fact that the fourth moment of a standard Gaussian is $\mathbb{E}[X_i^4] = 3$.

So the average squared distance grows linearly while the standard deviation grows only as $\sqrt{N}$, which means the relative thickness of the shell shrinks as $1/\sqrt{N}$. This is a really important result: as the number of dimensions grows, typical data points cluster in a thin shell far from the origin! This is exactly what happened with the cockpit design: most pilots they tried it on were in that shell, far away from the design they chose.
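A quick Monte Carlo check makes the shell visible (a minimal sketch, standard library only; `shell_stats` is my own name): the mean of $R^2$ tracks $N$, its standard deviation tracks $\sqrt{2N}$, and the relative width of the shell shrinks as the dimension grows.

```python
import math
import random

def shell_stats(n_dims, n_samples=4000, seed=0):
    # estimate mean and standard deviation of R^2 = sum_i X_i^2
    # for i.i.d. X_i ~ N(0, 1), by direct sampling
    rng = random.Random(seed)
    r2 = [sum(rng.gauss(0, 1) ** 2 for _ in range(n_dims))
          for _ in range(n_samples)]
    mean = sum(r2) / n_samples
    var = sum((x - mean) ** 2 for x in r2) / n_samples
    return mean, math.sqrt(var)

for n in (10, 100, 1000):
    mean, std = shell_stats(n)
    # mean ~ n, std ~ sqrt(2n): the relative width std/mean shrinks as 1/sqrt(n)
    print(f"N={n}: mean={mean:.1f}, std={std:.1f}, relative width={std / mean:.3f}")
```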

What remains to be understood is: given this distance, how ‘weird’ is a typical person? To answer this question, we need to reframe the way we draw samples. When measuring a new soldier, you can think of each attribute measurement as a sample from the distribution $\mathcal{N}(0,1)$. You can then use the fact that, for a Gaussian distribution, around 68% of the mass lies within $1\sigma$ of the mean and 95% within $2\sigma$. With the 10 dimensions measured by the army, a typical soldier will then be quite average in about 6-7 of the dimensions, a bit off the average in 2-3 of them, and an outlier in whatever remains. Moreover, it’s quite likely that two random people will have different attributes playing each role (which is consistent with the average being zero in every dimension). It’s really not surprising, then, that all the pilots complained about the cockpit.
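This per-dimension profile is easy to verify by simulation (again a sketch, standard library only; the helper name is my own): draw many 10-dimensional ‘people’ and count, on average, how many of their attributes fall within $1\sigma$, between $1\sigma$ and $2\sigma$, and beyond $2\sigma$.

```python
import random

def typicality_profile(n_dims=10, trials=10000, seed=0):
    # average number of attributes within 1 sigma, between 1 and 2 sigma,
    # and beyond 2 sigma, for a random "person" with i.i.d. N(0, 1) attributes
    rng = random.Random(seed)
    within1 = between12 = beyond2 = 0
    for _ in range(trials):
        for _ in range(n_dims):
            x = abs(rng.gauss(0, 1))
            if x < 1:
                within1 += 1
            elif x < 2:
                between12 += 1
            else:
                beyond2 += 1
    return within1 / trials, between12 / trials, beyond2 / trials

# roughly (6.8, 2.7, 0.5): average in most dimensions, an outlier in some
print(typicality_profile())
```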

Is dimensionality really a curse?

In this experiment, the dimensionality of the problem played against them and led them to waste resources on a design that was bound to fail. This phenomenon is a consequence of what is commonly called the ‘curse of dimensionality’, which essentially states that as the number of dimensions increases, data becomes more and more sparse, diluted within an exponentially increasing volume. In data science and machine learning, this is a common issue we face, and data scientists often resort to dimensionality reduction techniques to mitigate it. But it’s important to keep in mind that dimensionality is also one of the reasons why machine learning models can produce value.

Most classification problems can be thought of as separation problems: given a set of labelled data points, how can you split the space so that each part matches the provided labels as well as possible? This is the kind of question that methods like SVMs, decision trees and neural networks are built to answer. Of course, if your data is spread roughly uniformly throughout the space, this will be a really hard problem to solve, and dimensionality will indeed be a curse. But in general, data is worth studying precisely because it contains patterns that make it cluster. In that case, dimensionality will actually help you: what ends up diluted in the vast empty space is not the individual data points, but the clusters themselves. In other words, dimensionality helps make your data more separable. This is such a powerful insight that we sometimes even introduce more dimensions (usually along with some kernel to make sure the data is not spread uniformly) to make the problem easier to solve. So dimensionality may sometimes be a curse, but it generally is a blessing.
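To make the separability point concrete, here is a toy sketch (standard library only; all names are my own): two classes arranged as a disc inside a ring cannot be split by any threshold on a single coordinate, but adding one engineered dimension, the squared radius $r^2 = x^2 + y^2$, makes them perfectly separable. This is the idea behind kernel methods.

```python
import math
import random

def make_data(n_per_class=200, seed=0):
    # toy 2-D data: an inner disc (label 0) surrounded by a ring (label 1);
    # the classes are not separable by a threshold on x alone
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n_per_class):
        theta = rng.uniform(0, 2 * math.pi)
        r = rng.uniform(0.0, 1.0)  # inner disc
        points.append((r * math.cos(theta), r * math.sin(theta)))
        labels.append(0)
        theta = rng.uniform(0, 2 * math.pi)
        r = rng.uniform(2.0, 3.0)  # outer ring
        points.append((r * math.cos(theta), r * math.sin(theta)))
        labels.append(1)
    return points, labels

def threshold_accuracy(scores, labels, threshold):
    # classify as 1 whenever the score exceeds the threshold
    preds = [int(s > threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

points, labels = make_data()
# best threshold on the raw x coordinate: far from perfect
best_x = max(threshold_accuracy([x for x, _ in points], labels, t / 10)
             for t in range(-35, 36))
# the added dimension r2 = x^2 + y^2 separates the classes perfectly
r2 = [x * x + y * y for x, y in points]
print(best_x, threshold_accuracy(r2, labels, 2.0))
```

Any threshold on $r^2$ between 1 and 4 separates the two classes, while no threshold on $x$ comes close.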

Footnotes

  1. Rose, T. (2016). The End of Average: How to Succeed in a World That Values Sameness. United Kingdom: Penguin Books Limited.