M.E. Irizarry-Gelpí

Physics impostor. Mathematics interloper. Husband. Father.

Lecture Notes from Stat2.1x


These are my lecture notes from a MOOC I took some time ago on statistics. It is very embarrassing to admit that, as a holder of a doctorate in theoretical physics, I know very little statistics. Or rather, I learned some in the past and have forgotten most of it. It is interesting to try to learn something basic now, with all the mathematical maturity I have acquired over the years. This post is based on this .pdf file.

Introduction

In this section I introduce variables and their different types. I also introduce a graphical way to describe a single categorical variable.

Terminology

A variable is a value or characteristic that can differ from individual to individual. One can have quantitative variables, which take numerical values, often with units of measurement. Quantitative variables can be

  • continuous: values can be arbitrarily close to each other.
  • discrete: values are separated from each other by a fixed amount.

One can also have qualitative or categorical variables, whose values have no particular order or ranking in relation to each other. Some examples of qualitative variables are: favorite color, gender, nationality, etc. However, there are also qualitative ordinal variables, which do not take numerical values but do have a natural order or ranking. For example, the temperature can be low, medium, or high.

Bar Graphs

A bar graph is a common way of graphically describing a categorical variable. One draws a bar for each value of the category. The human eye is good at comparing areas, so in order to get an accurate reading from a bar graph it is important that each rectangular bar have the correct area. By keeping the widths of the bars equal, one ensures that not only the height but also the area of each bar is proportional to the number of individuals in that category. A good bar graph gives you an accurate reading of the relative proportions in each category (the percents).

Histograms

In this section I introduce the histogram which is a graphical device to visualize the distribution of a quantitative variable, analogous to the bar graph used for a qualitative variable.

Stem and Leaf Plot

Consider the following data set:

$$ 48, \qquad 59, \qquad 63, \qquad 63, \qquad 63, \qquad 67. $$

More compactly, you can arrange it as

$$ \begin{split} 4 \; &| \; 8 \\ 5 \; &| \; 9 \\ 6 \; &| \; 3 3 3 7 \end{split} $$

This is called a stem and leaf plot. This type of graphical representation is useful when the range of the data is relatively small. It retains all the data, which is not necessarily a good thing.
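
Here is a minimal Python sketch (my own illustration) that builds such a plot, assuming two-digit data:

```python
# A minimal stem-and-leaf sketch: split each two-digit value into a
# tens "stem" and a ones "leaf", then print one row per stem.
from collections import defaultdict

def stem_and_leaf(values):
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    for stem in sorted(stems):
        print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))

stem_and_leaf([48, 59, 63, 63, 63, 67])
# 4 | 8
# 5 | 9
# 6 | 3 3 3 7
```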

Drawing Histograms

| Annual Income of U.S. Adults in 2010 (thousands of dollars) | Percent |
| --- | --- |
| 0 - 10 | 20 |
| 10 - 25 | 28 |
| 25 - 50 | 27 |
| 50 - 100 | 18 |
| 100 - 150 | 7 |

In the table above you have the distribution of annual incomes of U.S. adults in the year 2010. Whenever you are presented with data in this way you need to know the endpoint convention: whether the income intervals contain the left endpoint but not the right, or the right endpoint but not the left. In this particular example the left endpoint convention was used. The percent of people that earn more than $150,000 is very small and has been swept into the last interval.

We would like to use something like a bar graph to show how the incomes are distributed. But notice that each income interval has a different width. This means that we should not use the percentages as the heights of our bars. Instead, the area of each bar should correspond to the percentage in that interval. Since each bar is a rectangle, we have

$$ \text{area} = \text{height} \times \text{width}. $$

The width of the first interval in the table above is $10,000 and it contains 20% of the data. Thus the height of the bar for this interval is 2% per thousand dollars. Using bars of this type leads to a graphical representation of the data called a histogram. A histogram shows how a quantitative variable is distributed over all its values.
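
To make this concrete, here is a short Python sketch that computes the height of every bar from the table above:

```python
# Histogram bar heights for the income table: height = percent / width,
# in percent per thousand dollars.
intervals = [(0, 10), (10, 25), (25, 50), (50, 100), (100, 150)]
percents = [20, 28, 27, 18, 7]

for (lo, hi), pct in zip(intervals, percents):
    height = pct / (hi - lo)   # percent per thousand dollars
    print(f"{lo}-{hi}: {height:.2f} % per $1000")
# the first bar: 20 / 10 = 2.00 % per $1000, as computed above
```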

Units and Density

It is important to specify the units along the horizontal axis of a histogram. Note that the vertical axis does not measure percents. The vertical axis measures percent per unit of the horizontal variable, which is a density.

Sometimes one can use the assumption that inside each bar of a histogram the percent is uniformly distributed over the interval. This assumption is more reliable for the bars with smaller widths than for the bars with larger widths.
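
For example, under this assumption the incomes between 10 and 20 thousand dollars account for

$$ 28\% \times \frac{20 - 10}{25 - 10} \approx 18.7\% $$

of the data.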

Percentiles

The median is the halfway point in the data set. It is also known as the 50th percentile. For example, consider the list

$$ 0, \qquad 2, \qquad 4, \qquad 7, \qquad 7. $$

There are five values, so the median is the third value, which is 4. If we add another value to this list, we end up with six values, and a list of six values does not have a clear halfway point. We deal with this ambiguity by defining the \(p\)-th percentile of a list of numbers as the smallest number that is at least as large as \(p\)% of the list. For example, 2 is the 40th percentile, 4 is the 60th percentile and 7 is the 80th percentile of the list. Thus 4 is also the 50th percentile.
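
Here is a small Python sketch of this percentile rule, applied to the list above:

```python
# The p-th percentile: the smallest entry that is at least as large
# as p% of the list (the convention defined above).
def percentile(values, p):
    xs = sorted(values)
    n = len(xs)
    for x in xs:
        if sum(1 for v in xs if v <= x) / n >= p / 100:
            return x
    return xs[-1]

data = [0, 2, 4, 7, 7]
print(percentile(data, 40))  # 2
print(percentile(data, 50))  # 4
print(percentile(data, 80))  # 7
```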

The 25th percentile is also known as the lower quartile and the 75th percentile is also known as the upper quartile. The interval between the 25th percentile and the 75th percentile is known as the interquartile range. Note that the interquartile range contains 50% of the data.

Measures of Location

In this section I introduce the average and its properties. I also discuss Markov's inequality.

Median and Mode

Given a data set, one sometimes would like to know roughly where the distribution is located (i.e. along the histogram). We have already seen that the median describes the halfway point of the data: 50% of the data is before the median and 50% of the data is after the median. The median measures a sort of center value of the distribution. Another important measure is the mode, which corresponds to the value(s) with the largest frequency. One can have more than one mode, in which case the histogram has several high peaks. A unimodal distribution is a distribution that has only one peak.

Average

Consider a list of \(N\) numbers \(x_{j}\). The average of the list (also known as the mean) is denoted by \(\mu_{x}\) and is defined as

$$ \mu_{x} \equiv \frac{1}{N} \sum_{j = 1}^{N} x_{j}. $$

Note that the average carries the same units as the numbers \(x_{j}\) in the list.

Some properties of the average:

  • It is not true that half of the numbers in the list are below average and half are above average. That is, in general the average is not equal to the median.
  • In general, the average might not be equal to any of the numbers in the list.
  • The average might not correspond to a possible value of the variable being measured (e.g. an average of 2.5 children).
  • One should not round the value of the average in order to obtain a whole unit.

One way to understand the meaning of the average is as a smoother: it tells you how much each individual would contribute to the total if every contribution were equal.

If one of the numbers in the list is changed, say by adding an amount \(C\), the average of the list changes. One does not need to know which entry is changed; the net result is that the sum of the entries increases by \(C\). Thus the value of the average becomes

$$ \mu \rightarrow \mu + \frac{C}{N}. $$

The interpretation of this change is that one takes the amount \(C\) and divides it into \(N\) equal parts.
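
For example, take the list \(1, 2, 3\), with \(N = 3\) and \(\mu = 2\). Adding \(C = 3\) to any one of the entries gives

$$ \mu \rightarrow 2 + \frac{3}{3} = 3. $$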

Comparing Averages

A longitudinal data set is obtained when the same individuals are followed over time. On the other hand, a cross-sectional data set is the snapshot of all individuals at a given time. When working with different groups of individuals, each with its own average, it is wise first to understand how the groups are related (i.e. whether the data is longitudinal or cross-sectional) before comparing the numerical averages.

Combining Averages

One should not simply take the average of a list of averages. Consider two lists of numbers. The first list has \(N_{1}\) numbers with average \(\mu_{1}\) and the second list has \(N_{2}\) numbers with average \(\mu_{2}\). The combined average \(\mu\) corresponds to the weighted sum

$$ \mu = \left(\frac{N_{1}}{N_{1} + N_{2}}\right) \mu_{1} + \left( \frac{N_{2}}{N_{1} + N_{2}} \right) \mu_{2}. $$

Only for the case \(N_{1} = N_{2}\) does \(\mu\) correspond to the average of \(\mu_{1}\) and \(\mu_{2}\).
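
Here is a quick Python sketch of the weighted combination, with made-up group sizes and averages:

```python
# Combining two group averages with the correct weights.
def combined_average(n1, mu1, n2, mu2):
    return (n1 * mu1 + n2 * mu2) / (n1 + n2)

# made-up groups: 10 values averaging 50, and 40 values averaging 80
print(combined_average(10, 50, 40, 80))  # 74.0
print((50 + 80) / 2)                     # 65.0: the naive answer is wrong
```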

Average versus Median

An outlier is a data point that lies outside the general range of the data. The median is unaffected by outliers. However, the average is affected by each data point.

A right-skewed distribution is an asymmetric distribution with a right-hand tail. The average is greater than the median for such distributions. You can instead have left-skewed distributions with a left-hand tail. For these distributions the average is less than the median.

Average and Histograms

The median in a histogram corresponds to the point that divides the histogram into two sections of equal area. One can think of the average as the equilibrium point of the histogram: the pivot point where the "weight" on each side of the histogram balances. Thus for asymmetric distributions the equilibrium point is not at the center and thus the average is off-centered.

Markov's Inequality

For a distribution with a long tail, one question that you can ask is how much of the data is contained in the tail. For example, consider the following statement:

The average age of a group of people is 20 years.

You know that age distributions are asymmetric with right-hand tails. What fraction of the people are more than 80 years old? You cannot answer this question exactly without looking at the data. But you can use an inequality due to Markov to provide a bound on this fraction. Since 80 years is 4 times the average, you want to know what fraction of the data is at least 4 times the average. According to Markov, this fraction cannot be greater than \(1/4\).

In general, let \(k\) be a positive number. Markov's inequality says that if a list with average \(\mu\) has only non-negative entries, then the fraction \(F(k \mu)\) of entries that are greater than or equal to \(k \mu\) cannot be greater than \(1 / k\). That is,

$$ F(k \mu) \leq \frac{1}{k}. $$

This result is an example of a tail bound and it is most useful for large values of \(k\).
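
Here is a quick numerical check of the bound, on a made-up list of ages with average 20:

```python
# Markov's bound on a made-up list of non-negative ages with average 20.
ages = [2, 5, 10, 15, 18, 20, 22, 25, 30, 53]
mu = sum(ages) / len(ages)    # 20.0
k = 4
fraction = sum(1 for a in ages if a >= k * mu) / len(ages)
print(fraction, "<=", 1 / k)  # 0.0 <= 0.25
```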

Measures of Spread

Given a data set, you know that the median provides the halfway point of the data and the average provides the equilibrium point of the histogram. Both of these quantities describe a notion of a center. When looking at a data point it is useful to understand how far away from these centers that data point is. In this section I introduce the standard deviation, which is a useful measure of the spread of the data. I will also discuss Chebyshev's inequality and the notion of standard units.

Range and Interquartile Range

The simplest measure of spread is the range: the interval over which the data is distributed. One obtains the range by subtracting the minimum value from the maximum value:

$$ \text{range} = \text{maximum value} - \text{minimum value}. $$

The range can be divided into four equal parts with the boundaries at the lower quartile (25th percentile), the median (50th percentile) and the upper quartile (75th percentile). As mentioned earlier, the interval between the lower and upper quartiles is called the interquartile range and it contains 50% of the data.

Standard Deviation

Given a number \(x_{j}\) in a list of numbers with average \(\mu\), the deviation from average \(d_{j}\) tells you roughly how far that number is from the average,

$$ d_{j} \equiv x_{j} - \mu. $$

Note that \(d_{j}\) can be positive or negative, depending on whether \(x_{j}\) is above or below average. By definition, the average of the list of deviations is zero. A more useful quantity is the standard deviation \(\sigma\), which is defined as the square root of the average of the squares of the deviations from average. That is,

$$ \sigma \equiv \sqrt{\frac{1}{N} \sum_{j = 1}^{N} d_{j}^{2}} = \sqrt{\frac{1}{N} \sum_{j = 1}^{N} (x_{j} - \mu)^{2}}. $$

In other words, \(\sigma\) corresponds to the root-mean-square of the deviations from average. The square of \(\sigma\) is called the variance. Note that \(\sigma\) is always non-negative and has the same units as the values in the list.

The standard deviation (SD) measures roughly how far off the entries are from average. No matter what the values of a list are, the vast majority of the entries will be in the range \(\mu \pm n \sigma\) with \(n\) a small positive integer.
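
Here is a small Python sketch that computes the SD as the r.m.s. of the deviations, on a made-up list:

```python
# The SD as the root-mean-square of deviations from average,
# on a made-up list.
import math

xs = [2, 4, 4, 4, 5, 5, 7, 9]
mu = sum(xs) / len(xs)                    # 5.0
deviations = [x - mu for x in xs]
print(sum(deviations) / len(xs))          # 0.0: deviations average to zero
sigma = math.sqrt(sum(d * d for d in deviations) / len(xs))
print(sigma)                              # 2.0
```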

Chebyshev's Inequality

Markov's inequality provides a bound on the fraction of the data that is at least \(k\) averages in size, but it only constrains one side of the distribution. The SD measures roughly how far away from the average one is in both directions (i.e. left and right). Consider a list of numbers \(x_{j}\) with average \(\mu\) and SD \(\sigma\). Let \(n\) be a positive integer. Chebyshev's inequality says that the fraction \(F(n \sigma)\) of the data that is outside the interval between \(\mu - n \sigma\) and \(\mu + n \sigma\) cannot be greater than \(1/n^{2}\). That is,

$$ F(n \sigma) \leq \frac{1}{n^{2}}. $$

Note that Markov's inequality gives a bound on one tail, while Chebyshev's inequality gives a bound on the two tails together. Of course, you can use the two-tail bound on each tail separately. For example, outside of the interval between \(\mu - 3\sigma\) and \(\mu + 3 \sigma\) we have at most \(1/9 \approx 11\%\) of the data. This means that at least 89% of the data must be inside the interval!
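
Here is a quick check of the two-tail bound, on the same made-up list used above:

```python
# Chebyshev's two-tail bound on a made-up list with mu = 5 and sigma = 2.
import math

xs = [2, 4, 4, 4, 5, 5, 7, 9]
mu = sum(xs) / len(xs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
for n in (1, 2, 3):
    outside = sum(1 for x in xs if abs(x - mu) > n * sigma) / len(xs)
    print(n, outside, "<=", 1 / n ** 2)  # the fraction never exceeds 1/n^2
```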

Standard Units

The numbers \(x_{j}\) in a list typically carry units, and one can use many different units to describe the same data. In order to change the units of \(x_{j}\) one performs an affine transformation of the form

$$ x_{j}' = a x_{j} + b. $$

Let us first consider the transformation with \(a = 1\) and \(b \neq 0\). Under this transformation the average \(\mu\) becomes \(\mu' = \mu + b\) and the SD \(\sigma\) remains invariant. On the other hand, under the transformation with \(a \neq 0\) and \(b = 0\) the average \(\mu\) becomes \(\mu' = a \mu\) and the SD \(\sigma\) becomes \(\sigma' = |a| \sigma\).
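
Here is a quick Python check of these transformation rules, with made-up values of \(a\) and \(b\):

```python
# How the average and SD change under x' = a*x + b, on a made-up list.
import math

def mean_sd(xs):
    mu = sum(xs) / len(xs)
    return mu, math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

xs = [2, 4, 4, 4, 5, 5, 7, 9]    # mu = 5.0, sigma = 2.0
a, b = -3, 10
mu, sigma = mean_sd(xs)
mu2, sigma2 = mean_sd([a * x + b for x in xs])
print(mu2, "=", a * mu + b)          # -5.0 = -5.0
print(sigma2, "=", abs(a) * sigma)   # 6.0 = 6.0
```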

There is an affine transformation that takes us to a system of units where the average is at the origin and the units of measurement are standard deviations. This system of units is called standard units. Given a list of numbers \(x_{j}\) with average \(\mu\) and SD \(\sigma\), the list of numbers \(z_{j}\) in standard units is

$$ z_{j} = \frac{x_{j} - \mu}{\sigma} \quad \Longrightarrow \quad x_{j} = \mu + \sigma z_{j}. $$

Standard units measure how many SDs one is from average. This is sometimes called the \(z\)-score. Note that by construction, in standard units the average of the list is 0 and the SD is 1. Thus, when working in standard units you expect most of the data to take values inside the interval between \(-4\) and \(4\).
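
Here is a sketch that converts the same made-up list to standard units and verifies that the average becomes 0 and the SD becomes 1:

```python
# Standard units: z = (x - mu) / sigma has average 0 and SD 1.
import math

xs = [2, 4, 4, 4, 5, 5, 7, 9]
mu = sum(xs) / len(xs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
zs = [(x - mu) / sigma for x in xs]
print(sum(zs) / len(zs))                            # 0.0
print(math.sqrt(sum(z * z for z in zs) / len(zs)))  # 1.0
```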

Normal Curve

In this section I introduce the normal curve. This is a very important curve that describes many distributions that have the shape of a bell.

Standard Normal Curve

There are many types of normal curves. The standard normal curve is described by the equation

$$ f(z) \equiv \frac{1}{\sqrt{2 \pi}} \exp{\left(- \frac{1}{2} z^{2} \right)}, \qquad - \infty < z < \infty. $$

Note that \(f(z) = f(-z)\) which means that this curve describes a symmetric distribution around the origin. Thus the equilibrium point is at \(z = 0\). We have points of inflection at \(z = \pm 1\) (the points where \(f''(z)\) changes sign). The area under the standard normal curve is 1:

$$ \int\limits_{-\infty}^{\infty} \mathrm{d}z \, f(z) = 1. $$

Some useful facts about the standard normal curve (reproduced numerically in the sketch after this list) are:

  • Inside the interval \(-1 < z < 1\) one has about 68.27% of the data.
  • Inside the interval \(-2 < z < 2\) one has about 95.45% of the data.
  • Inside the interval \(-3 < z < 3\) one has about 99.73% of the data.
  • Using the symmetry of \(f(z)\) and the facts above, each of the tails \(z > 1\) and \(z < -1\) has about 16% of the data. Similarly, each of the tails \(z > 2\) and \(z < -2\) has about 2.5% of the data.
  • The 95th percentile is roughly at \(z = 1.65\). Symmetry implies that the 5th percentile is roughly at \(z = -1.65\).
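
The sketch below reproduces these percentages; it uses the fact that the area under the standard normal curve between \(-z\) and \(z\) equals \(\operatorname{erf}(z/\sqrt{2})\):

```python
# Areas under the standard normal curve via the error function.
import math

def area_within(z):
    return math.erf(z / math.sqrt(2))  # area between -z and z

for z in (1, 2, 3):
    print(z, f"{100 * area_within(z):.2f}%")     # 68.27%, 95.45%, 99.73%
print(f"{100 * (1 - area_within(1)) / 2:.0f}%")  # ~16% in each tail beyond 1
```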

Normal Curves

As you might suspect, the standard normal curve is a normal curve written in terms of standard units. So a general normal curve with average \(\mu\) and SD \(\sigma\) is of the form

$$ f(x|\mu, \sigma) \equiv \frac{1}{\sqrt{2 \pi} \sigma} \exp{\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right]}, \qquad - \infty < x < \infty. $$

The general normal curve \(f(x|\mu, \sigma)\) has an equilibrium point at \(x = \mu\) and inflection points at \(x = \mu \pm \sigma\). Note that a normal curve is completely specified by the value of the mean \(\mu\) and the value of the SD \(\sigma\).

If the histogram of a data set has the shape of a bell, then after computing its average \(\mu\) and SD \(\sigma\) one can use the normal curve \(f(x| \mu, \sigma)\) to approximate the distribution of values in the histogram. The approximation will be good in some parts of the histogram and poor in others. Such distributions are said to be approximately normal. It is important for the bell to be centered around a value and not be skewed with more data on one side.

Chebyshev Revisited

In most cases, though, the distribution is not normal or even approximately normal. But Chebyshev's bound always holds. Remember that it is better to have a correct bound than a bad approximation.

Relations Between Two Variables

In this section I will look at problems with two variables and introduce a graphical way to study them, called a scatter diagram, and a numerical way to study whether there is a relation between the two variables, called the correlation coefficient.

Scatter Diagrams

With bivariate data, each individual contributes a pair of values. Graphically we can represent this with a scatter diagram. A scatter diagram has two axes, each corresponding to one of the pair of values, so each individual is represented by a point on the plane. In a scatter diagram one looks for clusters where points accumulate. If the two variables are related, then the scatter diagram will exhibit a trend or pattern. If there is any relation between the variables, we have an association. We have a positive association when above-average values of one variable tend to go with above-average values of the other, and thus the scatter goes up. We could also have a negative association, where above-average values of one variable tend to go with below-average values of the other, and thus the scatter goes down. The simplest example of association is a linear association, where the scatter diagram is clustered around a straight line.

Scatter diagrams with outliers are hard to analyze because the outliers can be misleading about trends.

Correlation Coefficient

Now I turn to the question of how tightly clustered a scatter diagram is. The correlation coefficient \(r\) is a measure of linear association that takes values between \(-1\) and \(1\). When \(r = -1\) you have a perfect negative linear association and when \(r = 1\) you have a perfect positive linear association. If \(r = 0\) then there is no linear association between the two variables in the scatter diagram. Let \(x_{j}\) and \(y_{j}\) be two lists of \(N\) numbers with averages \(\mu_{x}\) and \(\mu_{y}\) and SDs \(\sigma_{x}\) and \(\sigma_{y}\). In standard units you have

$$ z_{j} = \frac{x_{j} - \mu_{x}}{\sigma_{x}}, \qquad w_{j} = \frac{y_{j} - \mu_{y}}{\sigma_{y}}. $$

The correlation coefficient \(r\) is defined as the average of the products of \(z_{j}\) and \(w_{j}\):

$$ r \equiv \frac{1}{N} \sum_{j = 1}^{N} z_{j} w_{j} = \frac{1}{N} \sum_{j = 1}^{N} \left( \frac{x_{j} - \mu_{x}}{\sigma_{x}} \right) \left( \frac{y_{j} - \mu_{y}}{\sigma_{y}} \right). $$

Since \(r\) is defined in terms of variables that are in standard units, it has no units. Note that \(\sigma_{x}\) and \(\sigma_{y}\) can be viewed as the magnitudes of \(N\)-dimensional vectors \(V_{x}\) and \(V_{y}\) with components

$$ (V_{x})_{j} = \frac{x_{j} - \mu_{x}}{\sqrt{N}}, \qquad (V_{y})_{j} = \frac{y_{j} - \mu_{y}}{\sqrt{N}} \quad \Longrightarrow \quad \sigma_{x} = |V_{x}|, \qquad \sigma_{y} = |V_{y}|; $$

and thus \(r\) can be understood as the inner product of the two unit vectors \(V_{x}/|V_{x}|\) and \(V_{y}/|V_{y}|\). This inner product equals the cosine of the angle between the two vectors, and thus \(-1 \leq r \leq 1\).
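
Here is a small Python sketch that computes \(r\) directly from the definition, on a made-up pair of lists with a perfect linear relation:

```python
# The correlation coefficient as the average product of z-scores.
import math

def correlation(xs, ys):
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    return sum((x - mu_x) * (y - mu_y)
               for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]       # made up so that y = 2x exactly
print(correlation(xs, ys))  # 1.0 (up to floating point)
```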

Adding a constant to either of the two lists in the scatter diagram shifts the overall position of the scatter diagram, so \(r\) is invariant under this transformation. Since \(r\) is defined in standard units, multiplying either of the two lists by a positive constant also leaves \(r\) invariant. However, if you multiply one of the lists (but not both) by a negative constant then the sign of \(r\) will change (but its magnitude remains invariant).

Association is not Causation

If two variables have a nonzero correlation \(r\), then they are related to each other in some way, but that does not mean that one causes the other.

It is important to emphasize that \(r\) is a measure of linear association and thus should only be used in the case of an apparent linear association. Two variables are correlated if they are linearly related.

Note that given a data set for which \(r = 0\), adding outliers can lead to a noticeable change in \(r\).

A scatter plot of averages, rather than of individuals, can show artificially tight clustering and thus an inflated correlation.

Regression

If two variables are correlated, then given the value of one of them, you can compute an estimate of the other with a regression, one of the most commonly used statistical techniques.

Univariate Estimation

Consider the following problem. You are told that a data set of heights of people has an average of 67 inches and an SD of 3 inches. Then one of these people is picked and you are asked to estimate that person's height. A natural estimate is the average, 67 inches. There will most likely be an error in your estimate,

$$ \text{error} = \text{actual} - 67 \text{ inches}. $$

When using the average, the error corresponds to the deviation from average introduced above. Thus, the rough size of the errors is the SD. If one makes another estimate \(c\), then the error is

$$ \text{error} = \text{actual} - c. $$

How is \(c\) determined? The best value of \(c\) should make the errors as small as possible. It is a mathematical fact that the root-mean-square of the errors is smallest when you choose \(c\) to be equal to the average. In this sense, the average is a least squares estimate.
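
Here is a quick numerical check of this fact, on a made-up list of heights with average 67 inches:

```python
# The r.m.s. error of a constant estimate c is smallest at the average.
import math

heights = [61, 64, 66, 67, 68, 70, 73]  # made-up list with average 67
mu = sum(heights) / len(heights)

def rms_error(c):
    return math.sqrt(sum((h - c) ** 2 for h in heights) / len(heights))

for c in (64, 66, mu, 68, 70):
    print(c, round(rms_error(c), 3))    # the minimum occurs at c = 67.0
```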

Bivariate Estimation

Consider now a scatter diagram. Given the value of one variable, say \(x_{j}\), you would like to estimate the value of the other variable, say \(y_{j}\). In order to do this, you look at the column of values above \(x_{j}\). The value of \(y_{j}\) must be within the range of values in the column over \(x_{j}\). A natural estimate is the average of this column. This estimate is good when the spread of values in the column is small.

Footballs

When a scatter diagram has a football shape, then the distributions of both the horizontal and vertical variables can be approximated with a normal curve. The distribution along each vertical strip (column) or horizontal strip (row) is also roughly normal.

Regression Line

The regression line is the line that goes through the averages of the vertical strips. The regression line depends only on the geometry of the oval that outlines the football shape in the scatter diagram. In standard units, the oval is centered at the origin. Given a list of numbers \(x_{j}\), you go to standard units to obtain \(z_{j}\). In standard units the regression line passes through the origin with slope \(r\), so the estimate of \(y_{j}\) in standard units is \(r z_{j}\).
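
Back in the original units, the estimate reads \(\hat{y} = \mu_{y} + r \sigma_{y} (x - \mu_{x}) / \sigma_{x}\). Here is a small sketch; the averages, SDs and correlation below are made up for illustration:

```python
# Regression estimate: slope r in standard units, converted back.
def regression_estimate(x, mu_x, sigma_x, mu_y, sigma_y, r):
    z = (x - mu_x) / sigma_x         # x in standard units
    return mu_y + sigma_y * (r * z)  # estimate r*z, back in y units

# made-up father/son heights: averages 68 and 69 inches,
# SDs both 2.7 inches, correlation 0.5
print(regression_estimate(72, 68, 2.7, 69, 2.7, 0.5))  # 71.0
```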

Regression Effect

When using a regression line to estimate values of a variable, you see the regression effect: those points which have high values in one variable tend not to be quite as high in the other variable. This happens whenever there is a spread about a straight line.

Galton

Sir Francis Galton was into eugenics (!). He made the following observation:

Fathers who are tall tend to have sons who are not quite that tall, on average.

This observation is known as regression to mediocrity.

On average, points which have lower values in one variable tend to be slightly higher in the other variable. This is the mirror image of the regression effect, and together these observations are known as regression towards the mean.

Regression Fallacy

The belief that the regression effect is due to some external cause, rather than natural variability, is known as the regression fallacy.

Error in the Regression Estimate

How good can a regression estimate be? In this section you study the error made when using a regression estimate.

Least Squares

Given a scatter diagram with an oval shape, we might draw a line to estimate the values of one of the variables from values of the other variable. The error at a point \(x_{j}\) is defined as the difference between the actual value \(y_{j}\) and the estimate \(E(x_{j})\) from the line:

$$ \epsilon_{j} = y_{j} - E(x_{j}). $$

The quantity \(\epsilon_{j}\) is also known as the residual. The root-mean-square of the list of residuals \(\epsilon_{j}\) gives you a rough size of the error. The regression line corresponds to the line with the smallest r.m.s. error.

Residuals

There is a short formula for the r.m.s. of the residuals:

$$ \chi = \sqrt{1 - r^{2}} \sigma_{y}. $$

Note that for perfect linear association (\(r = \pm 1\)) you have \(\chi = 0\). If \(r = 0\), then you should not be using a linear regression in the first place; indeed, \(\chi = \sigma_{y}\) when \(r = 0\), which means that you are doing no better than predicting the average of \(y_{j}\) all the time. For all other values of \(r\), \(\chi\) is a fraction of the SD of \(y_{j}\).
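
Here is a sketch that checks this formula against the r.m.s. of the actual residuals, on a made-up data set:

```python
# Verify chi = sqrt(1 - r^2) * sigma_y against the r.m.s. residual
# of the regression line, on a made-up data set.
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
n = len(xs)
mu_x, mu_y = sum(xs) / n, sum(ys) / n
sd_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
r = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

estimates = [mu_y + r * sd_y * (x - mu_x) / sd_x for x in xs]
residuals = [y - e for y, e in zip(ys, estimates)]
rms = math.sqrt(sum(e * e for e in residuals) / n)
print(rms, math.sqrt(1 - r ** 2) * sd_y)  # both print 0.8485...
```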

With a normal distribution you know that 68% of the data is contained in the interval extending one SD from the average in both directions. For a scatter diagram, the regression line plays the role of the average and \(\chi\) plays the role of the SD. You can look at the strip in the scatter diagram centered on the regression line, obtained by shifting the regression line one \(\chi\) up and one \(\chi\) down. This strip contains about 68% of the data. Thus, for about 68% of the points the estimate will be correct to within one \(\chi\). Similarly, for about 95% of the points the estimate will be correct to within two \(\chi\).

Residual Plot

The residual plot is a scatter diagram made with the residuals. By construction, this plot shows no trend or linear relation. The residual plot is useful for spotting nonlinearity and other properties of the data.