M.E. Irizarry-Gelpí


Lecture Notes from Stat2.2x


This post contains my lecture notes from Stat2.2x, a MOOC I took in 2013 to learn the basics of probability.

Two Fundamental Rules

In this section I introduce probability and two basic rules to aggregate probability.

Probability

The goal of probability theory is to understand and quantify randomness. There is no universally accepted answer to the question "what is probability?". If all possible outcomes are equally likely, then the probability for a given event to occur is given by

$$ \text{probability of an event } = \frac{\text{number of outcomes in the event}}{\text{total number of outcomes}}. $$

This definition already illustrates that probabilities are real numbers between 0 and 1 because they are fractions. Another approach to defining probability is via frequency theory. Consider an experiment that consists of performing a coin toss multiple times. In the long run you find that the outcome of the experiment is heads roughly half of the time. This provides a way of defining the probability of an outcome as the long-run fraction of times that it occurs as the experiment is repeated indefinitely.
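The frequency-theory picture can be illustrated with a short simulation (a minimal Python sketch; the seed and toss counts are arbitrary choices):

```python
import random

def heads_fraction(num_tosses, seed=0):
    """Toss a fair coin num_tosses times and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses

# The fraction of heads settles near 1/2 as the tosses accumulate.
print(heads_fraction(100))
print(heads_fraction(100_000))
```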

However, many events cannot be repeated (e.g. sporting events). For this type of event you can use subjective probabilities, which are essentially opinions: a subjective probability is defined as the degree of belief that an event might happen. It is important to note that subjectivity propagates throughout an analysis (i.e. the slightest subjective element will yield a completely subjective analysis).

More abstractly, a probability \(P(A)\) for an event \(A\) to happen is a real number such that

$$ 0 \leq P(A) \leq 1. $$

The case \(P(A) = 0\) means that the event \(A\) cannot happen, while \(P(A) = 1\) means that \(A\) is certain to happen.

Standard Set of Cards

Let me introduce the standard set of cards. You have 52 cards which split into 4 suits:

$$ \clubsuit, \qquad {\color{red} \diamondsuit}, \qquad {\color{red} \heartsuit}, \qquad \spadesuit. $$

Each suit has 13 ranks:

$$ A, \quad 2, \quad 3, \quad 4, \quad 5, \quad 6, \quad 7, \quad 8, \quad 9, \quad 10, \quad J, \quad Q, \quad K. $$

There is only one card that is of rank \(A\) and suit \({\color{red} \heartsuit}\). However, you get more cards if you look for cards of rank \(A\) or suit \({\color{red} \heartsuit}\): all 13 cards of the suit \({\color{red} \heartsuit}\) plus the 3 other rank \(A\) cards, giving 16. This example illustrates how the connective and is more restrictive than the connective or.

A well shuffled deck is one where all 52 cards are equally likely to appear. This means that initially each card has a probability of \(1/52\) to be drawn.
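These counts are easy to verify by enumerating the deck (a small Python sketch; the suit and rank labels are just illustrative strings):

```python
# Build a standard 52-card deck and count the cards matching "ace AND hearts"
# versus "ace OR hearts", matching the counts quoted in the text.
suits = ["clubs", "diamonds", "hearts", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [(rank, suit) for suit in suits for rank in ranks]

ace_and_hearts = [c for c in deck if c[0] == "A" and c[1] == "hearts"]
ace_or_hearts = [c for c in deck if c[0] == "A" or c[1] == "hearts"]

print(len(deck))            # 52
print(len(ace_and_hearts))  # 1
print(len(ace_or_hearts))   # 16
```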

Addition Rule

Consider asking for the probability of drawing an \(A\) or a \(K\). Since there are four \(A\)'s and four \(K\)'s, there are eight cards in the deck that satisfy the condition. Thus the probability is \(8/52\). Note that you can write

$$ P(A \text{ or } K) = P(A) + P(K) = \frac{4}{52} + \frac{4}{52} = \frac{8}{52}. $$

Thus you find that you can write the probability of an or event as the sum of the probabilities of the individual events. This result is correct but not general. Consider another example: what is the probability of drawing an \(A\) or a \({\color{red} \heartsuit}\)? You know that the result is \(16/52\). However, if you naively add the probability to obtain an \(A\) (\(4/52\)) and the probability to obtain a \({\color{red} \heartsuit}\) (\(13/52\)) you find a wrong probability of \(17/52\). The error is that there is one card that is being counted twice: the \(A\) of \({\color{red} \heartsuit}\). Thus you can fix the addition rule by writing

$$ P(A \text{ or } {\color{red} \heartsuit}) = P(A) + P({\color{red} \heartsuit}) - P(A \text{ and } {\color{red} \heartsuit}). $$

This is an example of inclusion-exclusion.
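The inclusion-exclusion computation can be checked with exact fractions (a sketch using Python's `fractions` module):

```python
from fractions import Fraction

# Inclusion-exclusion for drawing an ace or a heart from a 52-card deck.
p_ace = Fraction(4, 52)
p_heart = Fraction(13, 52)
p_ace_and_heart = Fraction(1, 52)  # the ace of hearts, counted twice otherwise

p_ace_or_heart = p_ace + p_heart - p_ace_and_heart
print(p_ace_or_heart)  # 4/13, i.e. 16/52
```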

Both of these examples can be visualized in terms of Venn diagrams. In the first example you find that the set of \(A\)'s and the set of \(K\)'s are disjoint: they share nothing in common. You say that the two events are mutually exclusive. However, in the second example you find that the set of \(A\)'s and the set of \({\color{red} \heartsuit}\)'s overlap. This overlap was counted twice in our problem, so you subtract it once.

In summary you can write down some general rules. The addition rule allows you to write

$$ P(a \text{ or } b) = P(a) + P(b). $$

if \(a\) and \(b\) are mutually exclusive. More generally, you have the inclusion-exclusion formula:

$$ P(a \text{ or } b) = P(a) + P(b) - P(a \text{ and } b). $$

That is, if \(a\) and \(b\) are mutually exclusive, then

$$ P(a \text{ and } b) = 0. $$

Another useful thing to introduce is the complement rule:

$$ P(\text{not } a) = 1 - P(a). $$

This is useful when it is easier to count the ways for an event \(a\) to not happen.

Multiplication Rules

The addition rule involves the probability \(P(a \text{ and } b)\) for two events \(a\) and \(b\) to occur. In this subsection I will study some cases where this type of probability can be computed.

Consider \(N_{T}\) tickets inside a box. Each ticket has a color. You can draw a ticket with replacement (returning the ticket to the box) or without replacement (keeping the ticket outside the box). You perform two draws without replacement and would like to compute the probability that the first draw gives a green ticket and the second draw gives a red ticket. The probability to draw a green ticket first is

$$ P(\text{draw green first}) = \frac{N_{\text{green}}}{N_{T}}. $$

Given that a green ticket was drawn, the number of red tickets remains the same but the total number of tickets is reduced by 1. The probability to draw a red ticket given that a green ticket was drawn is thus

$$ P(\text{draw red second}) = \frac{N_{\text{red}}}{N_{T} - 1}. $$

This probability is conditioned on the first event occurring and thus is not independent or isolated. It is called a conditional probability. The probability for the entire event is:

$$ P(\text{draw green first, red second}) = \left( \frac{N_{\text{green}}}{N_{T}} \right) \left( \frac{N_{\text{red}}}{N_{T} - 1} \right). $$

This example illustrates the multiplication rule: the probability for events \(a\) and \(b\) to occur is given by the product of the probability for \(a\) times the conditional probability for \(b\) given that \(a\) occurred:

$$ P(a \text{ and } b) = P(a) \times P(b|a). $$
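The ticket example above can be sketched directly (the ticket counts 3, 2, and 5 below are made up for illustration):

```python
from fractions import Fraction

def p_green_then_red(n_green, n_red, n_other):
    """Probability of green first, red second when drawing twice without
    replacement: P(green) * P(red | green already drawn)."""
    n_total = n_green + n_red + n_other
    return Fraction(n_green, n_total) * Fraction(n_red, n_total - 1)

print(p_green_then_red(3, 2, 5))  # (3/10) * (2/9) = 1/15
```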

Conditional versus Unconditional

Consider the following situation. You have five tickets; one of them is red and the other four are green. You would like to know the probability of drawing a red ticket on the second trial, with or without replacement. Drawing with replacement is easy: the probability to draw a red ticket is \(1/5\). The problem appears to be more complicated when drawing without replacement. In order to obtain a red ticket on the second draw, you must draw a green ticket on the first draw. Thus, the probability to draw a red ticket on the second trial is

$$ \frac{4}{5} \times \frac{1}{4} = \frac{1}{5}, $$

which is the same probability as drawing with replacement. This result is surprising, but it holds as long as the outcome of the first draw remains unknown. This is an example of symmetry in sampling without replacement: even when sampling without replacement, as long as no information is given about the outcomes of the earlier draws, the probability for a given event on a later trial is the same as with replacement, and the chances are unconditional (even though the question "what is the probability of getting a red ticket on the second trial?" appears to be conditional).
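The symmetry claim can be verified by brute-force enumeration of all orderings of the five tickets (a minimal Python sketch; `itertools.permutations` treats the identical green tickets as distinct, which does not change the fraction):

```python
from fractions import Fraction
from itertools import permutations

# Enumerate all orderings of 5 tickets (1 red, 4 green) and count those
# with the red ticket in the second position.
tickets = ["red", "green", "green", "green", "green"]
orderings = list(permutations(tickets))
red_second = sum(1 for o in orderings if o[1] == "red")
print(Fraction(red_second, len(orderings)))  # 1/5
```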

Bayes' Rule

You can rewrite the multiplication rule as

$$ P(b|a) = \frac{P(a \text{ and } b)}{P(a)}. $$

This result is known as Bayes' rule and it can be used to find the conditional probability of an event at an earlier stage, given the result of a later stage. The equation can be read as: the probability of \(b\) given that \(a\) has happened can be found by dividing the probability for \(a\) and \(b\) to happen by the probability for \(a\) to happen.

Random Sampling

In this section, I introduce the notion of independence along with different probability distributions.

Independence

Many conditional probabilities can be found without using Bayes' rule. If you are asked for the unconditional probability of an event, it does not matter whether you sample with or without replacement. However, once you consider conditional probabilities you have to distinguish sampling with replacement from sampling without replacement. Samples taken with replacement are independent; samples taken without replacement are dependent. Examples of independent trials: tossing a coin, rolling a die, drawing with replacement. Examples of dependent trials: dealing cards from a deck, drawing without replacement.

A rough definition of independence: two random quantities are independent if knowing how one of them turned out does not change the chances for the other. More explicitly, two events \(A\) and \(B\) are independent if

$$ P(B|A) = P(B|\text{not } A) = P(B). $$

The multiplication rule simplifies when the two events are independent:

$$ P(A \text{ and } B) = P(A) P(B). $$

That is, for two independent events \(A\) and \(B\), the probability for both events to happen is the product of the probability for each event to happen.

Sampling with Replacement: Binomial Distribution

In this subsection I will focus on trials that have two possible outcomes: success or failure. This type of trial is also known as a Bernoulli trial. As an example, consider rolling a die 3 times and counting the number of times you get a six. Success in the trial means you get a six (with probability \(1/6\)) and failure means you do not get a six (with probability \(5/6\)). The number of sixes \(n_{6}\) is our random variable: depending on the outcomes, it can take one of four values,

$$ n_{6} = 0, \qquad n_{6} = 1, \qquad n_{6} = 2, \qquad n_{6} = 3. $$

You would like to know the probability to get each of the four possible values of \(n_{6}\). If \(n_{6} = 0\), you did not get any six. The probability for each independent failure is \(5/6\) so the probability to get no sixes is

$$ P(n_{6} = 0) = \left( \frac{5}{6} \right)^{3}. $$

How about the probability to get only one six? Well, you can get the six on either the first, the second or the third roll. These three events are mutually exclusive and each has probability \((1/6)(5/6)^{2}\), so the probability to get 1 six is

$$ P(n_{6} = 1) = 3 \left( \frac{1}{6} \right) \left( \frac{5}{6} \right)^{2}. $$

Note the factor \(3\). This factor corresponds to the 3-choose-1 ways of placing the single six among the rolls. Similarly, you find

$$ P(n_{6} = 2) = 3 \left( \frac{1}{6} \right)^{2} \left( \frac{5}{6} \right), $$

and

$$ P(n_{6} = 3) = \left( \frac{1}{6} \right)^{3}. $$

It is impractical to make a list of all possible outcomes in every problem. The binomial formula nicely summarizes the probabilities you wish to compute. Consider \(n\) independent Bernoulli trials, each trial with chance of success given by \(p\). The chance for \(k\) successes in \(n\) trials is given by

$$ P_{n}(k) = {n \choose k} p^{k} (1-p)^{n-k} = \frac{\Gamma(n+1)}{\Gamma(k+1) \Gamma(n - k + 1)} p^{k} (1-p)^{n-k}. $$

This is simply the product of the probabilities for \(k\) successes and \(n - k\) failures times an overall combinatorial factor that counts the different orders in which the successes can occur. You can use the binomial theorem to show that

$$ \sum_{k = 0}^{n} P_{n}(k) = (p + 1 - p)^{n} = 1. $$

That is, the sum of the \(n + 1\) probabilities \(P_{n}(k)\) is correctly normalized.
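The binomial formula and its normalization can be checked numerically (a sketch using `math.comb`, applied to the three-die-roll example above):

```python
from math import comb

def binomial_pmf(n, k, p):
    """P(k successes in n independent trials with success chance p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Three die rolls, counting sixes (p = 1/6), as in the text.
pmf = [binomial_pmf(3, k, 1/6) for k in range(4)]
print(abs(pmf[0] - (5/6)**3) < 1e-12)  # P(no sixes) = (5/6)^3
print(abs(sum(pmf) - 1.0) < 1e-12)     # the four probabilities sum to 1
```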

Sampling without Replacement: Hypergeometric Distribution

Drawing at random without replacement is called simple random sampling. The binomial formula is useful when sampling with replacement. The analogous result for sampling without replacement is the hypergeometric formula. This is derived as follows. Consider a population of size \(N\). This population has two types of members: good and bad. Let \(G\) be the number of good members in the population (thus, \(N - G\) is the number of bad members). You make a simple random sample of size \(n\) and would like to know what is the probability that this simple random sample has \(g\) good members (and thus \(n - g\) bad members). The number \(\mathcal{N}_{n}\) of possible simple random samples of size \(n\) out of a population of size \(N\) is

$$ \mathcal{N}_{n} = {N \choose n} = \frac{\Gamma(N + 1)}{\Gamma(n + 1) \Gamma(N - n + 1)}. $$

Similarly, out of \(G\) good elements, the number \(\mathcal{N}_{g}\) of ways of choosing \(g\) is

$$ \mathcal{N}_{g} = {G \choose g} = \frac{\Gamma(G + 1)}{\Gamma(g + 1) \Gamma(G - g + 1)}. $$

For each of these \(\mathcal{N}_{g}\) ways of choosing \(g\) good members, you have

$$ \mathcal{N}_{n - g} = {N - G \choose n - g} = \frac{\Gamma(N - G + 1)}{\Gamma(n - g + 1) \Gamma(N - G - n + g + 1)} $$

ways of choosing \(n - g\) bad members. Thus, the probability that the simple random sample has \(g\) good elements is given by

$$ P_{N}(n, g, G) = \frac{\mathcal{N}_{g} \times \mathcal{N}_{n - g}}{\mathcal{N}_{n}}. $$

Note that

$$ \sum_{g = 0}^{G} P_{N}(n, g, G) = 1. $$

That is, the sum of the \(G + 1\) probabilities \(P_{N}(n, g, G)\) is also correctly normalized.

A special case of the hypergeometric formula tells you that the probability for a given member of the population to be in the sample is simply the fraction of the population being sampled. This follows from the case \(g = G = 1\):

$$ P_{N}(n, 1, 1) = \frac{\mathcal{N}_{1} \times \mathcal{N}_{n - 1}}{\mathcal{N}_{n}} = \frac{n}{N}. $$

More generally, the case \(g = G = k\) with \(k \geq 1\) gives

$$ P_{N}(n, k, k) = \prod_{l = 0}^{k - 1} \left(\frac{n - l}{N - l} \right) = \frac{n (n - 1) \cdots (n - k + 1)}{N (N - 1) \cdots (N - k + 1)}, $$

which corresponds to the familiar result for drawing \(n\) out of \(N\) without replacement.
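The hypergeometric formula, its normalization, and the \(g = G = 1\) special case can all be checked with exact arithmetic (a sketch; the card-deck numbers are borrowed from the examples above):

```python
from math import comb
from fractions import Fraction

def hypergeom_pmf(N, G, n, g):
    """P(g good members in a simple random sample of size n
    from a population of size N containing G good members)."""
    return Fraction(comb(G, g) * comb(N - G, n - g), comb(N, n))

# Normalization: probabilities for g = 0..G sum to 1 (here N=52, G=4, n=5).
print(sum(hypergeom_pmf(52, 4, 5, g) for g in range(5)) == 1)  # True

# Special case g = G = 1: the chance a given card lands in a 5-card hand.
print(hypergeom_pmf(52, 1, 5, 1))  # 5/52
```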

Geometric Distribution

Both the binomial and hypergeometric distributions are useful when you are counting the number of successes/failures in a fixed number of trials. You could study another situation: fixing the number of successes/failures and counting the number of trials it takes to reach that number. These problems involve waiting time random variables.

Consider the case of independent trials. You have an event that occurs with success chance \(p\) (and thus failure chance \(1-p\)). The smallest number of successes that you can have is 1. The probability for one success after performing exactly \(k\) trials is given by the product of \(k-1\) failures and 1 success trial. Thus, the probability is given by

$$ P_{k}(p) = (1-p)^{k-1} p. $$

This is called the geometric distribution. Note that since you have independent trials (with replacement), the number of failures before the first success can be any nonnegative integer. This leads to

$$ \sum_{k = 1}^{\infty} P_{k}(p) = p \left( \frac{1}{1 - (1 - p)} \right) = 1; $$

which shows that \(P_{k}(p)\) has the correct normalization.

Now consider a slightly different problem: finding the probability that more than \(n\) trials are needed to get a single success. First recall the geometric sum,

$$ \sum_{k = 1}^{n} r^{k-1} = \frac{1 - r^{n}}{1 - r}. $$

Thus

$$ Q_{n} \equiv \sum_{k = 1}^{n} P_{k}(p) = p \left( \frac{1 - (1-p)^{n}}{1 - (1-p)} \right) = 1 - (1-p)^{n}. $$

This corresponds to the probability that the first success occurs within the first \(n\) trials. Hence the probability \(R_{n}\) that more than \(n\) trials are needed to get a single success is

$$ R_{n} = 1 - Q_{n} = (1 - p)^{n}. $$

This answer corresponds to the probability to get \(n\) failures in a row.
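The identity \(R_{n} = (1-p)^{n}\) can be confirmed by summing the geometric distribution directly (a sketch; the choice \(p = 1/6\), as if waiting for a six, is arbitrary):

```python
def geometric_pmf(k, p):
    """P(first success on trial k) for independent trials."""
    return (1 - p)**(k - 1) * p

p = 1/6  # e.g. waiting for a six when rolling a die
# Q_n: chance the first success arrives within n = 10 trials.
Q_10 = sum(geometric_pmf(k, p) for k in range(1, 11))
# R_n = 1 - Q_n should equal (1-p)^n, the chance of n failures in a row.
print(abs((1 - Q_10) - (1 - p)**10) < 1e-12)  # True
```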

Negative Binomial Distribution

The geometric distribution is used when you wish to know the probability for a single success after a given number of trials. You can now generalize to the problem of finding the probability for \(r\) successes after \(k\) trials with the chance of success \(p\). This probability is the same as finding \(r - 1\) successes after \(k - 1\) trials and then a success on the last trial. You can use the product rule to find

$$ P_{k}(p, r) = { k - 1 \choose r - 1} p^{r} (1 - p)^{k - r}, \qquad r \geq 1, \qquad k \geq r. $$

This distribution is called the negative binomial distribution. Note that

$$ \sum_{k = r}^{\infty} P_{k}(p, r) = 1, $$

which shows that the \(P_{k}(p, r)\) are correctly normalized.
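The normalization can be checked by truncating the infinite sum (a sketch with arbitrary values \(r = 3\), \(p = 0.5\); the cutoff at 200 trials makes the neglected tail negligible):

```python
from math import comb

def neg_binomial_pmf(k, r, p):
    """P(the r-th success arrives on trial k)."""
    return comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)

# Truncated normalization check: summing over k = r, r+1, ... approaches 1.
r, p = 3, 0.5
total = sum(neg_binomial_pmf(k, r, p) for k in range(r, 200))
print(abs(total - 1.0) < 1e-9)  # True
```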

Waiting Times and Sampling Without Replacement

Now I turn to waiting time problems when sampling without replacement (e.g. dealing cards). First, consider the case of getting one success after \(k\) trials. This is equivalent to getting \(k - 1\) failures followed by success. The probability for this event is found by using the product rule:

$$ P_{k} = \left[ \prod_{j = 1}^{k-1} P(F|j - 1) \right] P(S| k - 1); $$

where \(P(F|j)\) stands for the probability for failure given \(j\) previous failures and \(P(S|j)\) stands for the probability for success given \(j\) previous failures. Since you are sampling without replacement, the events are not independent and you need to handle conditional probabilities. Note that unlike the case of sampling with replacement, the variable \(k\) is bounded by the population size (i.e. you can draw at most 51 cards before finding a given card).

Next, consider the case of getting \(n\) successes after \(k\) trials. This is equivalent to getting \(n - 1\) successes in the first \(k - 1\) trials and then success in the last trial. Again, you use the product rule:

$$ P_{k}(n) = P(n - 1 \text{ successes in } k - 1) \times P(S| n - 1 \text{ successes in } k - 1). $$

If you have cards, you can use the hypergeometric formula to obtain the first factor.

The Law of Averages and Expected Values

In this section I study the notion of expected value, and some results.

The Law of Averages

Up to this point, the sizes of the population and sample you worked with were relatively small. You could compute probabilities exactly. In order to make use of statistics you need to work with large random samples. Because of the combinatorial nature of some probabilities, the numbers become intractable for large samples. You thus require a set of approximations.

The simplest case of a large sample is a large sample of success/failure trials. As an example, consider a large number of coin tosses. You know the frequency theory interpretation of probability. This is also sometimes known as the law of averages. Applied to the coin toss, the law of averages says that as you keep tossing the coin, in the long run you get about half heads.

Before going into more details about the law of averages, it's time to discuss some misconceptions. Consider tossing a coin 99 times and obtaining 99 heads. What can you say about the hundredth toss? The law of averages does not say that you are due a tail in order to balance the previous heads: each coin toss is independent of the previous outcomes and the probability for a head is \(1/2\). There is no contradiction with the law of averages because although 100 is a large number it is still a finite number and it leads to a short run.

Consider performing \(2N\) coin tosses. You can compute the probability to obtain \(N\) heads and \(N\) tails by using the binomial formula:

$$ P_{2N}(N) = \frac{\Gamma(2N + 1)}{\Gamma(N + 1) \Gamma(N + 1)} \left( \frac{1}{2} \right)^{2N}. $$

As \(2N\) becomes large, the probability to obtain exactly \(N\) heads becomes very small. This contradicts what you wrongly expect (that the probability tends to \(1/2\) as \(N \rightarrow \infty\)). The resolution is that the law of averages does not say that you will obtain exactly \(1/2\), but about \(1/2\). That is, there is room for error.

Now, consider performing \(100\) coin tosses and asking what is the chance of getting between \(50 - 5 = 45\) and \(50 + 5 = 55\) heads. The probability is given by adding different contributions from the binomial distribution:

$$ \sum_{k = 45}^{55} P_{100}(k) \approx 0.7287. $$

Similarly, you can perform \(1000\) coin tosses to find what is the chance of getting between \(500 - 5 = 495\) and \(500 + 5 = 505\) heads. You find:

$$ \sum_{k = 495}^{505} P_{1000}(k) \approx 0.2720. $$

The probability is still decreasing. This example is meant to illustrate how the law of averages says nothing about the actual number of heads, but about the fraction of heads.

Consider instead performing \(100\) coin tosses and asking what is the chance of getting \(50\% \pm 5\%\) heads. This will give the same probability as before,

$$ \sum_{k = 45}^{55} P_{100}(k) \approx 0.7287. $$

Next, perform \(1000\) coin tosses and again ask what is the chance of getting \(50\% \pm 5\%\) heads. Five percent of \(1000\) is 50, so you now need to add 101 contributions:

$$ \sum_{k = 450}^{550} P_{1000}(k) \approx 0.9986. $$

Now you find the probability is increasing and is close to \(1\).
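The binomial sums quoted above can be reproduced directly (a Python sketch; for these sizes, exact integer arithmetic via `math.comb` is still fast):

```python
from math import comb

def binomial_pmf(n, k, p=0.5):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A band of +/- 5 heads around the mean captures less probability as n grows...
print(round(sum(binomial_pmf(100, k) for k in range(45, 56)), 4))    # 0.7287
print(round(sum(binomial_pmf(1000, k) for k in range(495, 506)), 4))
# ...but a band of +/- 5% captures more and more probability.
print(round(sum(binomial_pmf(1000, k) for k in range(450, 551)), 4)) # 0.9986
```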

Here is a more complete statement of the law of averages applied to coin tossing: As you keep tossing, in the long run, the chance that the fraction of heads is in the range

$$ \frac{1}{2} \pm \text{ fixed amount} $$

converges to \(1\). The fixed amount in this statement can be arbitrarily small.

The law of averages applies to more general settings than coin tosses. Consider independent, repeated, success/failure trials where the probability for success is \(p\). The law of averages says that as the number of trials increases, the chance that the fraction of successes is in the range

$$ p \pm \epsilon $$

converges to \(1\) for arbitrary \(0 < \epsilon < 1\) as long as \(\epsilon\) is kept fixed.

Expected Value of Random Sum

A random variable is a variable that takes values which are subject to chance. I will denote random variables with uppercase letters from the end of the Latin alphabet. Every random variable has a probability distribution that contains the probability for the random variable to take any of its possible values.

As an example, consider rolling a die. The random variable \(X\) corresponds to the number of spots that is drawn after a roll. This random variable can take six possible values: \(1\), \(2\), \(3\), \(4\), \(5\) and \(6\). If the die is fair, then each value of \(X\) is equally likely with probability \(1/6\). The long run average value of \(X\) is found by multiplying each value that \(X\) can take by its probability and adding all possible outcomes:

$$ \begin{split} E(X) &= \sum_{X} X P(X) \\ &= 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{7}{2} = 3.5. \end{split} $$

This is known as the expected value of \(X\) (also known as the expectation of \(X\)). The quantity \(E(X)\) has the same properties as the average:

  • It has the same units as the variable.
  • Its value may not correspond to a possible value that the random variable can take.
  • It corresponds to the equilibrium point of the (probability) histogram.

Consider now performing \(N\) rolls. On the \(i\)-th roll you obtain the number of spots \(X_{i}\). The total number of spots obtained is given by adding the \(N\) random variables \(X_{i}\):

$$ S_{N} = \sum_{i = 1}^{N} X_{i}. $$

Each random variable \(X_{i}\) has expectation value \(E(X_{i}) = 3.5\), so the expected value of this sum is

$$ E(S_{N}) = \sum_{i = 1}^{N} E(X_{i}) = 3.5 \times N. $$

The smallest value that \(S_{N}\) can take is \(N\), since this corresponds to getting all 1's. This, however, has very small probability. Similarly, the largest value that \(S_{N}\) can take is \(6N\), corresponding to getting all 6's. The expected value \(E(S_{N})\) is exactly the midpoint between the two extreme cases (due to the symmetry of the distribution governing this process).

More generally, given \(N\) independent and identically distributed (i.i.d.) random variables \(X_{i}\), the expectation of the sum is given by

$$ E(S_{N}) = N E(X), $$

since the expectation of each random variable is the same.
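The die example can be worked through with exact fractions (a minimal sketch):

```python
from fractions import Fraction

# Expected value of one die roll: each face value times its probability 1/6.
e_x = sum(Fraction(x, 6) for x in range(1, 7))
print(e_x)  # 7/2

# Expected value of the sum of N i.i.d. rolls is N * E(X).
N = 10
print(N * e_x)  # 35
```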

Expected Value of Random Average

After computing the sum of all the values that a random variable takes after \(N\) trials, you can compute the sample average:

$$ A_{N} = \frac{1}{N} \sum_{i = 1}^{N} X_{i}. $$

The expected value of \(A_{N}\) is

$$ E(A_{N}) = \frac{1}{N} \sum_{i = 1}^{N} E(X_{i}) = E(X). $$

That is, the expected value of the sample average is the same as the population average. Note that \(E(A_{N})\) does not depend on the number of trials.

Central Limit Theorem

In this section I will introduce a measure of error and discuss an important theorem: the central limit theorem.

Standard Error

You know how to provide an estimate for the value of a random variable. Now you need to provide a measure of error in this estimate. This error measures how far off the random variable is from the expected value. So you arrive at the chance error:

$$ \text{chance error } \equiv \Delta X = X - E(X). $$

Note that \(\Delta X\) is also a random variable, but its expected value is zero:

$$ E(\Delta X) = E(X) - E(E(X)) = 0. $$

This is analogous to the fact that the average of the deviations from the average is zero. Back in Stat2.1x I introduced the root-mean-square of the deviations as a measure of spread. This led to the standard deviation. Stretching this analogy further, I will introduce the standard error:

$$ SE(X) \equiv \sqrt{E((\Delta X)^{2})}. $$

That is, the standard error corresponds to the square root of the expected value of the square of the chance error. Note that \(SE(X)\) corresponds to the standard deviation of the box.

The standard error measures the rough size of the chance error in \(X\): roughly how far off \(X\) is from the expected value \(E(X)\). Regardless of the probability distribution of \(X\), with high probability the value of \(X\) will be in the range \(E(X) \pm k \, SE(X)\) for some small integer \(k\). This statement is analogous to the one for averages and standard deviations. You can also apply Chebyshev's inequality: the probability \(P\) for \(X\) to be inside the interval \(E(X) \pm k \, SE(X)\) is bounded:

$$ P \geq 1 - \frac{1}{k^{2}}. $$

for any \(k \geq 1\).

Random Sum

Consider the following situation: You make \(n\) draws with replacement from a box. The average of the box is denoted by \(\mu\) and the SD of the box is denoted by \(\sigma\). As you saw earlier, the expected value of the sum of the draws is

$$ (\text{number of draws}) \times (\text{average of the box}) = n \mu. $$

The standard error of the sum of the draws is

$$ \sqrt{\text{number of draws}} \times (\text{SD of the box}) = \sqrt{n} \sigma. $$
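These two formulas can be sanity-checked by simulation (a sketch with a hypothetical box `[0, 1, 2, 3, 4]`; the seed and replication counts are arbitrary):

```python
import random
from statistics import mean, pstdev

# Hypothetical box of tickets; draw n times with replacement, record the sum.
box = [0, 1, 2, 3, 4]
mu, sigma = mean(box), pstdev(box)
n = 100

rng = random.Random(42)
sums = [sum(rng.choice(box) for _ in range(n)) for _ in range(20_000)]

# Theory predicts E(sum) = n * mu and SE(sum) = sqrt(n) * sigma.
print(n * mu, round((n ** 0.5) * sigma, 2))
print(round(mean(sums), 1), round(pstdev(sums), 1))
```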

Normal Approximation of Binomial Distribution

For suitable values of \(n\) and \(p\), the histogram of the binomial distribution resembles the histogram of the normal distribution. Recall that the probability for \(k\) successes in \(n\) trials with probability of success \(p\) is given by

$$ P_{n}(k) = \frac{\Gamma(n + 1)}{\Gamma(k + 1) \Gamma( n - k + 1)} p^{k} (1 - p)^{(n - k)}, \qquad 0 \leq k \leq n. $$

You can promote \(k\) to a random variable \(X\). The possible values that \(X\) can take are the integers between \(0\) and \(n\) (including the boundary points). Thus, the expectation value \(E(X)\) is

$$ E(X) = \sum_{X = 0}^{n} X P_{n}(X) = \sum_{X = 0}^{n} \frac{X \Gamma(n + 1)}{\Gamma(X + 1) \Gamma( n - X + 1)} p^{X} (1 - p)^{(n - X)} = n p. $$

The chance error \(\Delta_{X}\) is

$$ \Delta_{X} = X - E(X) = X - np, $$

so the expected value of the square of the chance error is

$$ E(\Delta_{X}^{2}) = \sum_{X = 0}^{n} \Delta_{X}^{2} P_{n}(X) = \sum_{X = 0}^{n} \frac{\left( X - n p \right)^{2} \Gamma(n + 1)}{\Gamma(X + 1) \Gamma( n - X + 1)} p^{X} (1 - p)^{(n - X)} = n p (1 - p), $$

and thus, the standard error of the binomial distribution is

$$ SE(X) = \sqrt{E(\Delta_{X}^{2})} = \sqrt{n p (1 - p)}. $$

Comparing this with the standard error of a random sum you find that \(\sigma = \sqrt{p (1 - p)}\) which is largest for \(p = 1/2\) (giving \(\sigma = 1/2\)). This is the case of the fair coin.

For large values of \(n\), it becomes impractical to compute the exact values of the probabilities from \(P_{n}(k)\). However, since you have already noticed that the distribution resembles a normal distribution, you can approximate the binomial distribution with a normal distribution. To specify the normal distribution you need the average \(\mu\) and the SD \(\sigma\). You can use

$$ \mu = n p, \qquad \sigma = \sqrt{n p (1 - p)}. $$

The claim is that the normal distribution,

$$ P_{N}(X|n, p) = \frac{1}{\sqrt{2 \pi n p (1 - p)}} \exp{\left[- \frac{1}{2} \left( \frac{X - n p}{\sqrt{n p (1 - p)}} \right)^{2} \right]}; $$

is an approximation of the exact binomial distribution

$$ P_{B}(X|n, p) = \frac{\Gamma(n + 1)}{\Gamma(X + 1) \Gamma( n - X + 1)} p^{X} (1 - p)^{(n - X)}. $$

When plotting the binomial distribution, you draw bars that are centered at the values that \(X\) can take. Since \(X\) takes integer values, the edges of the bars are located at \(X \pm 1/2\). When you use the approximate normal distribution you can draw a continuous curve, but you should apply the continuity correction: the ranges on the axis of the normal distribution run between the edges of the bars, not the centers of the bars. For example, consider a binomial distribution with \(n = 100\) and \(p = 0.5\). You wish to obtain the probability that \(X\) is between \(45\) and \(55\) (inclusive). It is easy to find that

$$ E(X) = 50, \qquad SE(X) = 5. $$

Since the number of trials is large, you can use an approximate normal distribution. Thus, the problem reduces to finding the area between the bar centered at \(45\) and the bar centered at \(55\). The left edge of the first bar is at \(44.5\) and the right edge of the second bar is at \(55.5\). This accounts for the continuity correction. You convert these values to normal units and use the standard normal curve.
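The continuity-corrected computation can be carried out with the error function (a sketch; `normal_cdf` is a small helper defined here, not a library routine):

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 100, 0.5
mu, se = n * p, sqrt(n * p * (1 - p))  # mu = 50, se = 5

# Continuity correction: P(45 <= X <= 55) uses the bar edges 44.5 and 55.5.
approx = normal_cdf((55.5 - mu) / se) - normal_cdf((44.5 - mu) / se)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(45, 56))
print(round(approx, 4), round(exact, 4))  # both round to 0.7287
```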

The de Moivre-Laplace theorem formalizes the relation between the binomial and normal distributions. It roughly says that as the number of trials increases, the probability histogram for the binomial distribution looks like a normal curve with average \(\mu = n p\) and SD \(\sigma = \sqrt{n p (1 - p)}\). This theorem is a special case of the central limit theorem.

Central Limit Theorem

In plain English, the central limit theorem says that the probability histogram for the sum of a large number of draws at random with replacement from a box is approximately normal, regardless of the content of the box.

More specifically, let \(X_{j}\) (with \(j\) taking \(n\) values) be independent and identically distributed, each with expected value \(\mu\) and standard error \(\sigma\). Denote

$$ S_{n} = \sum_{j = 1}^{n} X_{j}. $$

Then for large \(n\), the probability distribution of \(S_{n}\) is approximately normal with mean \(n \mu\) and SD \(\sqrt{n} \sigma\), no matter what the distribution of each \(X_{j}\) is.

Scope of Normal Approximation

There is not a fixed number of trials after which the central limit theorem becomes accurate. For a given value of \(n\) and \(p\) it could be the case that the resulting binomial distribution is skewed and the normal approximation does not hold. A good way to determine whether normality is appropriate is to compute the expected value \(\mu\) and the standard error \(\sigma\) and then check whether the range \(\mu \pm 3 \sigma\) stays within the valid values (between \(0\) and \(n\)).

Accuracy of Simple Random Sampling

In this section I will discuss the accuracy of some random sampling schemes.

Accuracy of Sample Average

Previously you looked at the sample sum: the sum of the values a random variable takes over the trials. The sample average is the sample sum divided by the number of trials. The expected value of the sample average is the box average.

A population is a list of numbers. Consider \(n\) draws at random with replacement from a population that has average \(\mu\) and SD \(\sigma\). The sample sum is denoted by \(S\). You have already seen that the expected value of \(S\) is \(E(S) = n \mu\) and the standard error is \(SE(S) = \sqrt{n} \sigma\). According to the central limit theorem, if \(n\) is large, the distribution of \(S\) is roughly normal. Now consider the sample mean defined as \(M \equiv S / n\). The expected value of \(M\) is \(E(M) = \mu\) and the standard error is \(SE(M) = \sigma / \sqrt{n}\). Note that \(E(M)\) is independent of the number of trials, but \(SE(M)\) becomes smaller as \(n\) increases. That is, the sample mean becomes a sharper estimate of \(\mu\) as the number of trials grows, while the sample sum becomes more variable (the standard error \(SE(S)\) grows with \(n\)).
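The opposite behavior of the two standard errors is worth seeing side by side. With a hypothetical box average \(\mu = 2\) and SD \(\sigma = 3\):

```python
from math import sqrt

mu, sigma = 2.0, 3.0  # hypothetical box average and SD
for n in (25, 100, 400):
    se_sum = sqrt(n) * sigma    # SE of the sample sum: grows with n
    se_mean = sigma / sqrt(n)   # SE of the sample mean: shrinks with n
    print(n, se_sum, se_mean)
```

Each time \(n\) is multiplied by 4, `se_sum` doubles while `se_mean` is cut in half.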

A special case is when the population consists of zeroes and ones. You make \(n\) draws with replacement from a population where the fraction of ones is \(p\) and the fraction of zeroes is \(1-p\). The population mean is \(p\) and the population SD is given by \(\sqrt{p (1 - p)}\). Consider the sample sum random variable \(S\). The expected value is \(E(S) = n p\) and the standard error is

$$ SE(S) = \sqrt{n p (1 - p)}. $$

Now consider the sample mean random variable \(M\). The expected value is \(E(M) = p\) and the standard error is

$$ SE(M) = \sqrt{\frac{p (1 - p)}{n}}. $$

As before, \(E(M)\) is independent of \(n\) and \(SE(M)\) becomes smaller as \(n\) is increased.
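A simulation confirms these formulas for a 0-1 box. The values \(n = 400\) and \(p = 0.3\) below are made up for illustration:

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)

n, p = 400, 0.3
# 2000 repetitions of n draws with replacement from a 0-1 box
# where the fraction of ones is p; record the sample mean each time.
means = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(2000)]

print(mean(means), p)                       # E(M) = p
print(stdev(means), sqrt(p * (1 - p) / n))  # SE(M) = sqrt(p(1-p)/n)
```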

Sampling Without Replacement

When you repeat the same task over and over (like throwing a die or tossing a coin) you use sampling with replacement. However, when you sample people it makes more sense to use sampling without replacement.

Consider making \(n\) draws without replacement from a population of size \(N\) that contains two types of members: good and bad. The number of good members in the population is \(G\), and the random variable \(X\) is the number of good members among the \(n\) in the sample; \(X\) follows the hypergeometric distribution. You can have either \(n \geq G\) or \(n < G\). For now, assume that \(n \geq G\), so \(X\) can take the \(G + 1\) values \(0, 1, \ldots, G\). The expected value of \(X\) is

$$ E(X) = \sum_{X = 0}^{G} X P_{N}(n, X, G) = n \left( \frac{G}{N} \right). $$

This corresponds to \(n\) times the fraction of good members, which is analogous to the result for the binomial distribution. The square of the standard error is

$$ SE^{2}(X) = \sum_{X = 0}^{G} \left( X - \frac{n G}{N} \right)^{2} P_{N}(n, X, G) = n \left( \frac{G}{N} \right) \left(1 - \frac{G}{N} \right) \left( \frac{N - n}{N - 1} \right). $$

Thus, the standard error is

$$ SE(X) = \sqrt{n \left( \frac{G}{N} \right) \left(1 - \frac{G}{N} \right) \left( \frac{N - n}{N - 1} \right)}. $$

Comparing this result with the standard error of the sample sum for a binary population, you see the appearance of a new factor, the finite population correction:

$$ F(n, N) \equiv \sqrt{\frac{N - n}{N - 1}}. $$

Note that \(F(n, N) \leq 1\). You can also look at the expected value of the sample mean \(Y \equiv X / n\):

$$ E(Y) = \frac{1}{n} \sum_{X = 0}^{G} X P_{N}(n, X, G) = \frac{G}{N}. $$

Thus, the square of the standard error is

$$ SE^{2}(Y) = \sum_{X = 0}^{G} \left( \frac{X}{n} - \frac{G}{N} \right)^{2} P_{N}(n, X, G) = \frac{1}{n} \left( \frac{G}{N} \right) \left(1 - \frac{G}{N} \right) \left( \frac{N - n}{N - 1} \right). $$

Hence the standard error is

$$ SE(Y) = \sqrt{\frac{1}{n} \left( \frac{G}{N} \right) \left(1 - \frac{G}{N} \right) \left( \frac{N - n}{N - 1} \right)}. $$

Again, the finite population correction appears when comparing with the standard error of the sample mean for draws with replacement from a binary population.
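These hypergeometric formulas can be verified by brute force on a small example. The population below (\(N = 20\), \(G = 8\), \(n = 5\)) is made up, and the enumeration handles both the \(n \geq G\) and \(n < G\) cases by capping \(X\) at \(\min(n, G)\):

```python
from math import comb, sqrt

# Small hypothetical population: N = 20 members, G = 8 good, n = 5 draws.
N, G, n = 20, 8, 5

# Exact hypergeometric distribution of X, the number of good members drawn.
pmf = {x: comb(G, x) * comb(N - G, n - x) / comb(N, n)
       for x in range(0, min(n, G) + 1)}

ev = sum(x * p for x, p in pmf.items())
var = sum((x - ev) ** 2 * p for x, p in pmf.items())

frac = G / N
print(ev, n * frac)  # E(X) = n * (G/N)
# SE(X) with the finite population correction:
print(sqrt(var), sqrt(n * frac * (1 - frac) * (N - n) / (N - 1)))
```

Both pairs of numbers agree exactly, as the formulas promise.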

The typical situation in statistics involves a very large population of size \(N\) and taking a relatively small simple random sample of size \(n\). Thus \(N - n \approx N\) and \(N - 1 \approx N\), so the finite population correction gives \(F(n, N) \approx 1\).
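Two made-up cases make the point numerically: sampling 1000 people out of a million barely differs from sampling with replacement, while sampling 90 out of 100 shrinks the standard error by a factor of about 3:

```python
from math import sqrt

def fpc(n, N):
    """Finite population correction sqrt((N - n) / (N - 1))."""
    return sqrt((N - n) / (N - 1))

# Small sample from a huge population: correction is negligible.
print(fpc(1000, 1_000_000))
# Sampling most of a small population: correction matters a lot.
print(fpc(90, 100))
```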

Accuracy

For an estimate to be more accurate, the standard error must be smaller. Consider simple random sample averages. In general, the standard error for the sample average is

$$ SE(A) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}. $$

As you increase the sample size \(n\), both \(\sigma / \sqrt{n}\) and the finite population correction become smaller. Thus, as you increase the sample size, the sample average becomes more accurate.

Suppose you wish to increase the accuracy by a factor of \(k\). This means that you want the standard error to decrease by a factor of \(k\). Suppose that \(F(n, N) \approx 1\). Then

$$ SE(A) \approx \frac{\sigma}{\sqrt{n}}. $$

So you can accomplish an increase in accuracy by increasing the size of the sample from size \(n\) to size \(k^{2} n\). Accuracy is expensive. A related observation is the square root law: If you multiply the sample size by a factor \(k\), the accuracy goes up by a factor \(\sqrt{k}\).
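The square root law in numbers, with a hypothetical population SD of \(\sigma = 8\) and a desired accuracy gain of \(k = 3\):

```python
from math import sqrt

sigma = 8.0  # hypothetical population SD
n = 100      # original sample size
k = 3        # desired factor of improvement in accuracy

se_before = sigma / sqrt(n)
# To cut the standard error by a factor of k, multiply
# the sample size by k squared.
se_after = sigma / sqrt(k**2 * n)

print(se_before, se_after, se_before / se_after)  # ratio is k
```

To triple the accuracy you need nine times the sample: accuracy is indeed expensive.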