Data Science vs Data Analysis vs Data Mining: What’s the Difference?

People new to data science often ask me this question. Because these are subjective, evolving terms, it will never have a definitive answer. However, I think of it like this:

Data analysis is literally just the act of drawing an inference from some data. Something as simple as looking at a set of 10 numbers and calculating their average can constitute data analysis.
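
As a minimal sketch of that example in Python (the ten values are made up for illustration, and the variable names are my own):

```python
from statistics import mean

# Ten made-up measurements; any small set of numbers would do.
values = [12, 15, 11, 14, 13, 16, 12, 15, 14, 18]

# The "analysis": summarize the data with a single number.
average = mean(values)
print(average)
```

Even this one-line summary is an inference drawn from data, which is all that data analysis requires.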

Data mining is, most generally, data analysis that is partially or fully automated. It is strongly associated with large datasets, as you would expect, since the ability to automate analysis is particularly useful when the data is large.

Data science is the most nebulous term of the three. It’s better to think of data science as a craft rather than a specific activity. The ultimate aim of a data scientist is to draw inferences from data; in that sense, they are data analysts. But a data scientist is also equipped with the knowledge and skills to manage this process from end to end:

  1. to gather the data, and store and process it until it is in a form suitable for analysis,
  2. to perform the analysis, and
  3. to present the results of the analysis in a manner useful to the person who needs it.
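
To make the three steps concrete, here is a toy sketch in Python using only the standard library. Everything in it is hypothetical: the `RAW_CSV` data, the column names, and the function names `gather_and_prepare`, `analyze`, and `present` are invented for illustration, and a real pipeline would read from files, databases, or APIs rather than an inline string.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw data, including the blank and malformed values
# that real-world data tends to contain.
RAW_CSV = """region,sales
north,120
south,95
north,
south,105
east,n/a
north,130
"""

def gather_and_prepare(raw_text):
    """Step 1: load the raw records and keep only rows with a usable number."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        try:
            rows.append((record["region"], float(record["sales"])))
        except ValueError:
            continue  # skip blank or malformed sales values
    return rows

def analyze(rows):
    """Step 2: compute the average sales per region."""
    by_region = defaultdict(list)
    for region, sales in rows:
        by_region[region].append(sales)
    return {region: mean(values) for region, values in by_region.items()}

def present(summary):
    """Step 3: format the result for the person who needs it."""
    lines = [f"{region}: {avg:.1f}" for region, avg in sorted(summary.items())]
    return "\n".join(lines)

rows = gather_and_prepare(RAW_CSV)
summary = analyze(rows)
report = present(summary)
print(report)
```

Note that even in this tiny example, step 1 is the only function that has to deal with messy input; the analysis and presentation steps can assume clean data.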

Much of the reason that data science has emerged as a separate discipline is the transition of data analysis from a data-poor world to a data-rich one. The transition has been extremely swift. People who were trained extensively to perform steps 2 and 3, because they learned to work in a world where those steps were the bottleneck, are now choked by their inability to do step 1 well, because of the sheer volume, variety, and velocity of the data. Conventional data processing methods do not scale to data-rich environments. A commonly repeated rule of thumb in the industry is that in data analysis, 90% of the time is spent preparing the data, and 10% of the time is spent doing actual science. In my experience, these figures are not exaggerated.

Data scientists can not only perform all of steps 1 to 3 but, importantly, should be able to do so in a way that scales, so that the human effort is redistributed more effectively between the steps. This is one of the best ways to tell whether you have hired a true data scientist, or merely a statistician posing as one.