Descriptive Statistics of NY Lottery Data

Tue 17 July 2018
Data
#pandas, #altair, #NY

I am going to use four data sets to review descriptive statistics. Each data set contains winning numbers for a particular lottery in the State of New York. I got the data from data.ny.gov. Here is the list of data sets:

Lotto: 1,750 drawings between 09/2001 and 06/2018
Pick 10: 11,456 drawings between 01/1987 and 06/2018
Quick Draw: 702,625 drawings between 01/2013 and 06/2018
Take 5: 7,745 drawings between 01/1992 and 06/2018

What a great way to spend a Friday the 13th.

Lottery Drawings

Each of the four lotteries has a different kind of drawing:

Lotto: Draw 6 numbers without replacement (1 to 59)
Pick 10: Draw 20 numbers without replacement (1 to 80)
Quick Draw: Draw 20 numbers without replacement (1 to 80)
Take 5: Draw 5 numbers without replacement (1 to 39)

A given lottery consists of a draw of \(n\) objects without replacement from a population of \(N\) distinct objects. In a fair lottery, each object has an equal chance of being drawn.

Hypergeometric Distribution

The hypergeometric distribution describes drawing without replacement. Let me first consider the simplest case, with two kinds of objects: red and blue. Consider a population of size \(N\), with \(R\) of the objects red and \(N - R\) blue. You draw a sample of size \(n\). What is the probability that the sample contains \(r\) red objects and \(n - r\) blue objects? Out of \(N\) objects, the number of possible samples \(\mathcal{N}_{n}\) of size \(n\) is

\begin{equation*} \mathcal{N}_{n} = {N \choose n} = \frac{\Gamma(N+1)}{\Gamma(n+1)\Gamma(N - n + 1)} \end{equation*}

Similarly, out of \(R\) red objects, the number of possible samples \(\mathcal{N}_{r}\) of size \(r\) is

\begin{equation*} \mathcal{N}_{r} = {R \choose r} = \frac{\Gamma(R+1)}{\Gamma(r+1)\Gamma(R - r + 1)} \end{equation*}

For each of these \(\mathcal{N}_{r}\) ways of choosing \(r\) red objects you have \(\mathcal{N}_{n-r}\) ways of choosing \(n-r\) blue objects:

\begin{equation*} \mathcal{N}_{n-r} = {N-R \choose n-r} = \frac{\Gamma(N-R+1)}{\Gamma(n-r+1)\Gamma(N-R - n+r + 1)} \end{equation*}

Thus, the probability that a drawing of \(n\) objects has \(r\) red ones is given by

\begin{equation*} P(n, r, R, N) = \frac{\mathcal{N}_{r} \times \mathcal{N}_{n-r}}{\mathcal{N}_{n}} = \frac{ \Gamma(R + 1) \Gamma(N - R + 1) \Gamma(n + 1) \Gamma(N - n + 1) }{\Gamma(r + 1) \Gamma(R - r + 1) \Gamma(n - r + 1) \Gamma(N - R - n + r + 1) \Gamma(N + 1)} \end{equation*}

Note that

\begin{equation*} \sum_{r = 0}^{R} P(n, r, R, N) = 1 \end{equation*}

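That is, the probabilities are normalized. Terms with \(r > n\) or \(n - r > N - R\) vanish, because a Gamma function in the denominator then has a pole.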
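
The formula and its normalization can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `hypergeom_pmf` and the example values for \(N\), \(R\) and \(n\) are made up for illustration. It evaluates the Gamma-function expression through `math.lgamma` and compares it with the binomial-coefficient form from `math.comb`.

```python
from math import comb, exp, isclose, lgamma


def hypergeom_pmf(n, r, R, N):
    """P(n, r, R, N) from the Gamma-function formula (hypothetical helper)."""
    # Outside the support a Gamma function in the denominator has a pole,
    # which corresponds to probability zero.
    if r < 0 or r > R or r > n or n - r > N - R:
        return 0.0
    log_num = lgamma(R + 1) + lgamma(N - R + 1) + lgamma(n + 1) + lgamma(N - n + 1)
    log_den = (
        lgamma(r + 1) + lgamma(R - r + 1) + lgamma(n - r + 1)
        + lgamma(N - R - n + r + 1) + lgamma(N + 1)
    )
    return exp(log_num - log_den)


# Made-up example: 20 objects, 5 of them red, sample of size 8.
N, R, n = 20, 5, 8
probs = [hypergeom_pmf(n, r, R, N) for r in range(R + 1)]
exact = [comb(R, r) * comb(N - R, n - r) / comb(N, n) for r in range(R + 1)]

print(all(isclose(p, e) for p, e in zip(probs, exact)))  # both forms agree
print(sum(probs))  # equals 1 up to floating-point rounding
```

Working with logarithms of the Gamma function avoids overflow for large arguments, at the cost of a small rounding error in the result.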

Consider the special case \(R = 1\):

\begin{equation*} P(n, r, 1, N) = \frac{ \Gamma(2) \Gamma(N) \Gamma(n + 1) \Gamma(N - n + 1) }{\Gamma(r + 1) \Gamma(2 - r) \Gamma(n - r + 1) \Gamma(N - n + r) \Gamma(N + 1)} \end{equation*}

Then the sample either has the red object (\(r = 1\)) or it does not (\(r = 0\)). You have

\begin{equation*} P(n, 0, 1, N) = \frac{ \Gamma(2) \Gamma(N) \Gamma(n + 1) \Gamma(N - n + 1) }{\Gamma(1) \Gamma(2) \Gamma(n + 1) \Gamma(N - n) \Gamma(N + 1)} = \frac{N - n}{N} \end{equation*}

\begin{equation*} P(n, 1, 1, N) = \frac{ \Gamma(2) \Gamma(N) \Gamma(n + 1) \Gamma(N - n + 1) }{\Gamma(2) \Gamma(1) \Gamma(n) \Gamma(N - n + 1) \Gamma(N + 1)} = \frac{n}{N} \end{equation*}

That is, the probability for a given member of the \(N\)-population to appear in the \(n\)-sample is the corresponding fraction of the population. Another special case is \(R = r = k\) with \(1 \leq k \leq n \leq N\):

\begin{equation*} P(n, k, k, N) = \prod_{j = 0}^{k-1} \frac{n - j}{N - j} \end{equation*}

This is the familiar sampling-without-replacement probability.
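
Both special cases are easy to confirm with exact rational arithmetic. This is a small sketch under stated assumptions: the helper names `hypergeom_pmf` and `all_red_product` are hypothetical, and the Take 5 sizes (\(N = 39\), \(n = 5\)) are used only as an example.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(n, r, R, N):
    """Exact P(n, r, R, N) via binomial coefficients (valid for 0 <= r <= n)."""
    return Fraction(comb(R, r) * comb(N - R, n - r), comb(N, n))


def all_red_product(n, k, N):
    """Product formula for the case R = r = k."""
    result = Fraction(1)
    for j in range(k):
        result *= Fraction(n - j, N - j)
    return result


N, n = 39, 5  # Take 5 sizes, used only as an example

# Case R = 1: the single red object is absent or present in the sample.
single_absent = hypergeom_pmf(n, 0, 1, N) == Fraction(N - n, N)
single_present = hypergeom_pmf(n, 1, 1, N) == Fraction(n, N)

# Case R = r = k: compare with the product formula for every allowed k.
products_match = all(
    hypergeom_pmf(n, k, k, N) == all_red_product(n, k, N) for k in range(1, n + 1)
)

print(single_absent, single_present, products_match)  # → True True True
```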

For a lottery, instead of red and blue objects you have winning and non-winning numbers. Out of the \(N\) numbers, only \(n\) are winning numbers. So the probability of drawing \(n\) numbers and having all of them be winning numbers is the case \(k = n\) above:

\begin{equation*} P(n, N) = \prod_{j = 0}^{n-1} \frac{n - j}{N - j} \end{equation*}

For the Lotto you have

\begin{equation*} P(6, 59) = \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{59 \times 58 \times 57 \times 56 \times 55 \times 54} = \frac{1}{45,057,474} \end{equation*}

Similarly, for Take 5 you have

\begin{equation*} P(5, 39) = \frac{5 \times 4 \times 3 \times 2 \times 1}{39 \times 38 \times 37 \times 36 \times 35} = \frac{1}{575,757} \end{equation*}

For Pick 10, 20 numbers are drawn (out of 80), but you only need to match 10. That is, the draw must contain all 10 winning numbers along with 10 losing numbers. The probability for this is

\begin{equation*} {80 - 10 \choose 20 - 10} / {80 \choose 20} = \frac{49,588,065,527}{441,914,517,776,521,790} \approx \frac{1}{8,911,711} \end{equation*}

Note that these three probabilities are for the case of winning the jackpot.
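
The three jackpot probabilities can be reproduced with exact integer arithmetic. This is a minimal sketch using only Python's standard library; the variable names are my own.

```python
from fractions import Fraction
from math import comb

# Lotto: all 6 drawn numbers (out of 59) are the winning ones.
lotto = Fraction(1, comb(59, 6))

# Take 5: all 5 drawn numbers (out of 39) are the winning ones.
take5 = Fraction(1, comb(39, 5))

# Pick 10: the 20 drawn numbers (out of 80) contain all 10 winning numbers,
# so the other 10 drawn numbers come from the 70 losing ones.
pick10 = Fraction(comb(80 - 10, 20 - 10), comb(80, 20))

print(lotto)              # → 1/45057474
print(take5)              # → 1/575757
print(float(1 / pick10))  # roughly 8.9 million to one
```

`Fraction` keeps every value exact, which matters for Pick 10: the binomial coefficient for 20 out of 80 is too large to be represented exactly as a floating-point number.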

NYS Lotto

You can load the data into a Pandas DataFrame:

import pandas as pd

df = pd.read_csv('ny-lotto.csv')

The Lotto data consists of 1,750 records with four columns:

Draw Date
Winning Numbers
Bonus
Extra

I am going to ignore the Draw Date. The Winning Numbers column has string values. The Bonus and Extra columns each have integer values. However, the Extra column only has 339 non-missing values.

Extra Column

You can extract the Extra column from the full dataframe:

extra = pd.DataFrame()
extra['Extra'] = df['Extra']
extra = extra.dropna()

The last line drops the records with missing values. You can get a quick summary with .describe(). From this you learn that the mean value is a bit over 31, which makes sense since the possible values range from 1 to 59, with midpoint 30. The minimum is 1 and the maximum is 59, which confirms that both extreme values were drawn at least once. More tellingly, the quartiles are close to where you expect them: the 25% quartile is between 16 and 17, the median (50%) is 31, and the 75% quartile is 47. If the draw is meant to be fair, with each of the 59 values equally likely, then these statistics are promising. But because the data set is small, a bar chart of the counts is not very uniform:

[Figure: bar chart of counts for the Extra column]

Although a few values appear less often than most values, I am not going to conclude there was any bias.
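
For reference, the summary statistics of a perfectly uniform draw on 1 to 59 can be computed directly. This is a small sketch with Python's standard `statistics` module. The `'inclusive'` method is chosen on the assumption that it matches the linear interpolation .describe() uses by default.

```python
import statistics

values = range(1, 60)  # the 59 equally likely values

mean = statistics.mean(values)
q1, q2, q3 = statistics.quantiles(values, n=4, method='inclusive')

print(mean, q1, q2, q3)  # → 30 15.5 30.0 44.5
```

The observed quartiles quoted above are within a few units of these ideal values.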

Bonus Column

You can extract the Bonus column from the full dataframe:

bonus = pd.DataFrame()
bonus['Bonus'] = df['Bonus']

Unlike the Extra column, there are no missing values in the Bonus column. You can get a quick summary with .describe(). From this you learn that the mean value is a bit over 30, which makes sense since the possible values range from 1 to 59, with midpoint 30. The minimum is 1 and the maximum is 59, which confirms that both extreme values were drawn at least once. More tellingly, the quartiles are right where you expect them: the 25% quartile is 15, the median (50%) is 30, and the 75% quartile is 45. If the draw is meant to be fair, with each of the 59 values equally likely, then these statistics are promising. Since you have 1,750 records, the bar chart with counts is more uniform:

[Figure: bar chart of counts for the Bonus column]

It is interesting that a number close to 50 has a higher count in both the Extra and Bonus columns.
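
The difference in uniformity between the Extra and Bonus charts is roughly what plain counting noise would predict. The sketch below assumes a fair draw, so that the count for each value is binomial with probability 1/59. The helper name `count_noise` is hypothetical, and the record counts (339 and 1,750) are the ones quoted above.

```python
from math import sqrt


def count_noise(draws, values=59):
    """Expected count per value, its standard deviation, and their ratio."""
    p = 1 / values                   # chance that a fair draw hits a given value
    expected = draws * p             # binomial mean
    std = sqrt(draws * p * (1 - p))  # binomial standard deviation
    return expected, std, std / expected


for label, draws in [('Extra', 339), ('Bonus', 1750)]:
    expected, std, rel = count_noise(draws)
    print(f'{label}: expected count {expected:.1f} +/- {std:.1f} ({rel:.0%})')
```

Under this assumption the bars scatter by about 41% around the expected count for Extra but only about 18% for Bonus, so a bumpier Extra chart is not by itself a sign of bias.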

Winning Numbers

The data in the Winning Numbers column needs to be transformed a bit:

wn = df['Winning Numbers']
# Split each string on spaces, then convert the pieces to integers
wn = wn.apply(lambda x: x.split(' '))
wn = wn.apply(lambda x: [int(n) for n in x])

This takes you from a string with six numbers to a list with six int values. In order to put each number drawn in its own column, I used the following:

numbers = pd.DataFrame()

for i in range(len(wn[0])):
    numbers[str(i)] = wn.apply(lambda x: x[i])
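
The same table can also be built in one chained expression with the pandas string methods. This is a sketch of an alternative, not the code used for this post. The two rows in `sample` are made up and only mimic the assumed format of six space-separated numbers.

```python
import pandas as pd

# Made-up rows in the assumed "six space-separated numbers" format
# (these are not real drawings).
sample = pd.DataFrame(
    {'Winning Numbers': ['03 11 24 38 47 59', '01 02 17 29 30 55']}
)

numbers = (
    sample['Winning Numbers']
    .str.split(expand=True)  # one column per number, values still strings
    .astype(int)             # convert every column to integers
    .rename(columns=str)     # column labels '0', '1', ... as in the loop above
)

print(numbers)
```

With `expand=True` the split returns a DataFrame directly, so no intermediate column of lists is needed.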

The first thing you can do is get the quick summary with .describe(). But you find something interesting: there appears to be a bias in each column! For example, in the first column the mean is between 8 and 9, and the maximum is 41. Here is a bar chart with counts:

[Figure: bar charts of counts for the Winning Numbers columns]

I am using Altair to produce these plots:

import altair as alt

alt.Chart(numbers).mark_bar().encode(
    alt.X(alt.repeat('column'), type='quantitative'),
    alt.Y(aggregate='count', type='quantitative'),
).repeat(
    column=['0', '1'],
)

The bias is due to the fact that each sequence of winning numbers is sorted. I wish I could understand this bias better. It seems to be related to a noncentral hypergeometric distribution.
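
One way to get a handle on the sorted columns, offered here as a sketch rather than a full explanation, is to treat column \(k\) as the \(k\)-th smallest of \(n\) numbers drawn without replacement from 1 to \(N\). For that number to equal \(x\), exactly \(k - 1\) of the drawn numbers must be below \(x\) and \(n - k\) above it, which happens with probability \({x - 1 \choose k - 1}{N - x \choose n - k} / {N \choose n}\). The helper name `order_stat_pmf` is hypothetical.

```python
from fractions import Fraction
from math import comb


def order_stat_pmf(x, k, n, N):
    """P(k-th smallest of n numbers drawn without replacement from 1..N is x)."""
    # k - 1 numbers must fall below x and n - k above it; math.comb returns 0
    # whenever there are not enough numbers on one side.
    return Fraction(comb(x - 1, k - 1) * comb(N - x, n - k), comb(N, n))


N, n = 59, 6  # Lotto

# Expected value of each sorted position under a fair draw.
means = [
    sum(x * order_stat_pmf(x, k, n, N) for x in range(1, N + 1))
    for k in range(1, n + 1)
]

for k, mean in enumerate(means, start=1):
    print(f"column '{k - 1}': expected value {float(mean):.2f}")
```

Under these assumptions the first column has an expected value of 60/7, about 8.57, which is in line with the observed mean between 8 and 9. A fair draw followed by sorting is therefore enough to produce the apparent bias.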

NYS Take 5

Similar steps can be taken with the Take 5 data. Here you have five winning numbers ranging from 1 to 39. Again, the winning numbers are sorted, so the bar charts with counts show a bias:

[Figure: bar charts of counts for the Take 5 Winning Numbers columns]