Recently, an increasing number of businesses have been challenged to deal with large amounts of data. Not surprisingly, many businesses have adopted the methods of “data analytics” or data science to meet this challenge.
Data science has a foundation in applied mathematics. One of its biggest differentiators from BI tools and other visual methods is that it rests on methods and algorithms with real mathematical meat behind them. That does not mean that data science needs to be complex; the simplest approaches are often the most effective. But it does mean that real math is being done on the data.
Data wrangling (also called data cleaning or data munging) is one of the most critical elements of data science. To do real math on the data, you often need very clean and customized data formats. Data wrangling typically requires familiarity with basic file read/write operations (command line/bash) as well as some sort of scripting (Python, Perl, etc.). Data wrangling can also be productized via a tailored ETL process.
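As a rough illustration of the scripting side of data wrangling, here is a minimal Python sketch using only the standard library. The sample data, the column names, and the `clean_rows` helper are all hypothetical, invented for this example; real cleaning rules depend entirely on the data at hand.

```python
import csv
import io

# Hypothetical raw export: padded headers, stray whitespace,
# a missing value, and a non-numeric amount.
RAW = """Customer , Amount ,Region
 alice, 10.50 ,east
bob,,west
 carol, n/a ,East
dave, 7 , WEST
"""

def clean_rows(text):
    """Return a list of dicts with normalized keys and typed values."""
    reader = csv.reader(io.StringIO(text))
    # Normalize header names: trim whitespace and lowercase.
    header = [name.strip().lower() for name in next(reader)]
    cleaned = []
    for row in reader:
        if len(row) != len(header):
            continue  # skip malformed or blank lines
        record = dict(zip(header, (cell.strip() for cell in row)))
        try:
            record["amount"] = float(record["amount"])
        except ValueError:
            continue  # drop rows whose amount is missing or not numeric
        record["region"] = record["region"].lower()
        cleaned.append(record)
    return cleaned

rows = clean_rows(RAW)
for r in rows:
    print(r)
```

Dropping bad rows is only one possible policy, chosen here for brevity. Depending on the analysis, it may be more appropriate to impute the missing values or flag them for review.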
Communication and data visualization skills, applied at each step of building an analytic solution, are also an important ingredient of data science. There is nothing more challenging or frustrating than an analytic solution that is difficult to communicate, and we have all experienced this on numerous levels.
The ability to work in a distributed computing environment is a new but very necessary component of data science. Analytic approaches that are not built with distributed computing in mind may scale today, but their longevity is questionable given current trends in data generation and collection.
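One way to picture what "built with distributed computing in mind" can mean is to express a calculation as independent per-partition summaries plus a merge step. The sketch below is a hedged, single-process illustration of that map/reduce pattern. The function names and the sample partitions are made up for this example, and in a real cluster each partition would live on a different machine and be handled by a distributed framework.

```python
from functools import reduce

def partial_stats(chunk):
    """Map step: summarize one partition as (sum, count)."""
    return (sum(chunk), len(chunk))

def merge(a, b):
    """Reduce step: combine two partial summaries."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(partitions):
    """Compute a global mean from per-partition summaries only."""
    total, count = reduce(merge, map(partial_stats, partitions), (0.0, 0))
    return total / count if count else float("nan")

# Hypothetical data, already split across three "nodes".
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean = distributed_mean(partitions)
print(mean)
```

The point of the design is that no step ever needs the full data set in one place: each partition produces a small summary, and only the summaries are combined.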
The data science approach is always analytic in nature, focused on distilling substantive expertise into actionable business decisions. For this reason, I think one of the greatest qualifiers for data science is a practical commitment to the scientific method. What is powerful about the scientific method is that its logic is very easy to formalize and even productize. It starts with asking business questions and doing some initial research on the most appropriate approaches and methods. Eventually this leads to hypothesis construction. A hypothesis can then be tested with real-world data using applied math and legitimate statistical methods.
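To make the hypothesis-testing step concrete, here is a minimal sketch of a two-sample permutation test in Python, using only the standard library. It is one of many legitimate statistical methods, picked here for simplicity. The `permutation_test` function, the sample numbers, and the "control" and "variant" labels are all hypothetical.

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Null hypothesis: the group labels are exchangeable, i.e. both
    samples come from the same distribution.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical measurements for two groups.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
variant = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

p_value = permutation_test(control, variant)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unusual if the two groups were really the same. It does not by itself show that the difference matters to the business, which is where the original question comes back in.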
Honest and healthy skepticism is one of the greatest tenets of data science, and one that most organizations choose to ignore. It’s very easy to build an analytic solution that works, but having the courage and humility to question and test underlying assumptions is critical. Selecting the most appropriate approach and method is often the most important first step in the data science process, but being a healthy skeptic and bouncing ideas off of others is the most critical last step.