Introduction to Big Data and Data Science


These are my notes from lecture 1 of the MOOC CS100.1x Introduction to Big Data with Apache Spark by BerkeleyX.

The outline of this lecture is:

  • Brief History of Data Analysis
  • Big Data and Data Science: Why All the Excitement?
  • Where Does Big Data Come From?

Full disclosure: I have a technical background, but I do not have much experience with statistics or working with data. Hopefully, that will change soon. In the meantime, I am providing my own interpretation of the material presented in this lecture.

It is always a good idea to look back and give a brief history of how things have changed leading up to the current state. In the mid-1930s R.A. Fisher published "The Design of Experiments", along with some statistical tests. I suppose this was on the topic of designing experiments that would accurately lead to proving or disproving a hypothesis and thus gaining knowledge. Fisher is also credited with the statement "correlation does not imply causation". This is a statement about how one interprets an outcome from a statistical experiment.

Towards the end of the 1930s W.E. Deming proposed the idea of quality control using statistical sampling. I guess this is an application of statistics to improving the quality of products made by a business, so it serves as an example of how statistics can make a company better.

Later, in the 1950s, H.P. Luhn proposed in "A Business Intelligence System" using indexing and information retrieval methods with text and data for business intelligence. According to Wikipedia:

Business Intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes.

So apparently Luhn was one of the first to suggest interacting with business data in a way similar to how it is done today.

In 1977 J.W. Tukey wrote the book "Exploratory Data Analysis", which later led to the development of the programming languages S, S-PLUS, and R. According to Wikipedia:

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

So when you do EDA, you are basically getting to know the data.
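
For instance, a minimal first pass at EDA in Python might look like the sketch below. This is just one convenient way to do it with pandas, and the file name "measurements.csv" is made up.

```python
# A minimal EDA sketch with pandas; "measurements.csv" is a hypothetical file.
import pandas as pd
import matplotlib.pyplot as plt

# Load the data set into a DataFrame.
df = pd.read_csv("measurements.csv")

# Get to know the data: size, column types, and missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Summarize the main characteristics of the numeric columns.
print(df.describe())

# Visual methods: a histogram for every numeric column.
df.hist(figsize=(10, 6))
plt.show()
```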

As the years passed, technology and businesses changed greatly. Towards the end of the 1980s H. Dresner proposed a modern approach to BI.

The book on machine learning by T. Mitchell from 1997 is mentioned. I guess in 2015 it is still a bestseller. Perhaps this was one of the first signs that machine learning could be useful and important for businesses.

The search engine Google was started in 1996 by two PhD candidates at Stanford University. Google search was immensely helpful in making the Internet useful for everyone. More abstractly, it helped organize the content and information on the Internet. It can also be viewed as an example of a big data problem.

In 2007 Microsoft released the eBook The Fourth Paradigm on data-driven science. The idea is that besides the traditional "small science" and "mid science" projects, there are "big science" projects like the Large Hadron Collider. New tools and new methods are needed in order to make new discoveries. Presumably these remarks about data-driven science can be adapted to data-driven businesses and industry.

Then in 2009 P. Norvig and others published "The Unreasonable Effectiveness of Data" where they presented the idea that "multiple small models and lots of data is much more effective than building very complex models".

So extracting information from data is a sure way to obtain evidence and knowledge. Well, not always. There are some steps that must be taken in order for the analysis of data to lead to correct conclusions. There is the example of a study by A. Keys, who found a correlation between the fat calories consumed and deaths from degenerative heart disease. But there was a lot of controversy about this study because, among other things, he only studied a subset of all the countries involved (a selection bias?) and also failed to consider other factors. Always remember: correlation does not imply causation.

Like most tasks, there are common ways to do the task wrong, but there are also ways of doing it right. There are ways of extracting correct information from data. Many companies have been acquiring lots of data for many years. With the advent of powerful analytical tools, it is hoped that all of this data will lead to better businesses.

One of the things that can be done with lots of data is nowcasting. Traditionally, data is used for forecasting: you use data to build a model that predicts the future. With nowcasting, you collect a lot of data and build a model to explain what is happening right now.

An example of this is Google Flu Trends, where data from Google searches was used to determine when influenza outbreaks were happening. Traditionally, outbreaks are declared by the CDC after receiving data from state health departments, who receive data from county health departments, who receive data from local town departments, who receive data from the doctors and hospitals that treat sick people. In this traditional process it takes many weeks for the data to travel up the hierarchy, so there is a long period of time during which, if there is an outbreak, nothing is being done about it. Google developed a model by analyzing search data during known outbreaks and extracting common search terms associated with people who have the flu. Eventually, in 2010, they were able to predict an outbreak two weeks before the CDC.

For a while the model agreed very well with CDC data, about 97% agreement. But during one time period the model disagreed with CDC data by 200%, i.e. it was predicting twice as many flu cases as the CDC was finding. The reason was that people were reading and searching about the flu on the Internet and skewing the inputs to the model. After taking these factors into account, the model became accurate again. Later, similar models were developed to predict other outbreaks, like Ebola.

The take-away from this example is that just because a given model works well once, it does not necessarily mean that it will always work well. A healthy dose of skepticism seems to be a good thing to have when working with data, along with a willingness to always allow room for improvement.
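
To make the nowcasting idea a bit more concrete, here is a toy sketch (not Google's actual model, and with made-up numbers): fit a linear regression from weekly flu-related search-term frequencies to officially reported flu cases, then use this week's search activity to estimate this week's cases before the official numbers arrive.

```python
# Toy nowcasting sketch (not Google's actual model): regress reported flu
# cases on the weekly frequencies of a few flu-related search terms, then
# use this week's search data to estimate cases before official data arrive.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: rows are weeks, columns are search-term
# frequencies (e.g. "fever", "cough", "flu symptoms"), in searches per 1000.
searches_past = np.array([
    [1.2, 0.8, 0.3],
    [2.5, 1.9, 1.1],
    [4.0, 3.2, 2.6],
    [1.0, 0.7, 0.2],
])
# Officially reported flu cases for the same weeks (arriving with a lag).
cases_past = np.array([120, 310, 560, 100])

model = LinearRegression().fit(searches_past, cases_past)

# "Nowcast": estimate this week's cases from this week's search activity,
# weeks before the official count works its way up the reporting hierarchy.
searches_now = np.array([[3.1, 2.4, 1.8]])
print(model.predict(searches_now))
```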

Google Flu Trends is an example where search trends were used to make accurate predictions (in this case, about flu outbreaks). This does not mean that every analysis of search trends will lead to correct conclusions: correlation does not imply causation. There is the example of some researchers from Princeton University who used search trends for "MySpace" to predict the demise of another social network, Facebook. This essentially assumed that the decline in searches for MySpace was the cause of MySpace's demise. Facebook responded in turn by providing a bunch of examples of "causal correlations".

Increasingly, more and more things are done over the Internet. Whether it is via a computer, a smartphone, or a tablet, many aspects of a person's interaction with other people or services over the Internet are recorded as data. Not all of this data is analyzed. I will refer to this data as online activity data. Another source of data is users who produce content. This data is referred to as user-generated content. A single user might not generate much data, but considering all the users of a single service leads to very large data sets. Yet another source of big data is any "big science" project like the LHC.

Graphs are convenient ways to encode connections between objects. In a social network like Facebook, graphs can be very large.
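
As a small illustration, here is a sketch of a tiny friendship graph stored as an adjacency list. The names are made up, and a real social graph would of course have billions of edges.

```python
# A tiny "friendship" graph as an adjacency list (names are made up).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

# The degree of a node is the number of connections it has.
for person, contacts in friends.items():
    print(person, len(contacts))

# Friends-of-friends of alice who are not already her friends.
fof = set().union(*(friends[f] for f in friends["alice"])) - friends["alice"] - {"alice"}
print(fof)
```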

Log files are another source of big data. These are files generated by an application that contain a record of the activity performed by that application. For example, some web server logs contain a record of every click. Accounting for many servers leads to very large data sets.
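
As a rough illustration of how such logs might be summarized, here is a sketch that counts requests per URL in a log following the Common Log Format. The file name "access.log" is hypothetical, and real log formats vary between servers.

```python
# Count requests per URL in a web server log (Common Log Format).
from collections import Counter

hits = Counter()
with open("access.log") as log:
    for line in log:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # skip malformed lines
        request = parts[1]        # e.g. 'GET /index.html HTTP/1.1'
        fields = request.split()
        if len(fields) >= 2:
            hits[fields[1]] += 1  # the requested URL

# Print the ten most requested URLs.
for url, count in hits.most_common(10):
    print(count, url)
```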

An emerging source of big data is the so-called Internet of Things, where objects are equipped with sensors that gather all sorts of data.