Storage is cheap, connectivity is easy, and technology is everywhere, so data is plentiful. The next step, and often the most confounding one, is doing something useful with that data, so this series of posts will explore some of the techniques and methods used to transform raw data into meaningful information that enables positive business action. This involves both characterizing the data itself and using it to gain insight into present and future trends. Although machine learning currently gets much of the limelight, it is difficult to create an accurate and useful predictive model without understanding the data used to build it. So I believe it is useful to begin with a discussion of distributions, univariate distributions in particular, and how they are used for statistical description and inference.

Descriptive statistics provide a rigorous quantitative description of a collected data sample. Each descriptive statistic distills a large amount of data into a meaningful summary, such as the mean time to failure. While this gives you a convenient and easy-to-understand estimate of reliability, it doesn't tell you whether several machines in your sample failed within a few days while one or two extremely reliable examples lasted considerably longer, or whether every machine failed within a couple of hours of the average time to failure. An excellent way to summarize the properties of a single variable is to examine its frequency distribution.
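As a quick illustration, two samples can share the same mean time to failure while telling very different stories once dispersion is considered. The lifetime numbers below are invented for the sake of the example; Python's standard `statistics` module makes the contrast easy to see:

```python
import statistics

# Two hypothetical samples of machine lifetimes in hours (made-up numbers).
# Both have the same mean time to failure, but very different dispersion.
clustered = [98, 99, 100, 101, 102]   # every machine fails near the mean
skewed = [2, 3, 4, 5, 486]            # early failures plus one long-lived outlier

mean_clustered = statistics.mean(clustered)   # 100
mean_skewed = statistics.mean(skewed)         # also 100
sd_clustered = statistics.stdev(clustered)    # small: ~1.6 hours
sd_skewed = statistics.stdev(skewed)          # huge: ~215.8 hours
```

The mean alone cannot distinguish these two samples; the standard deviation (and, more fully, the frequency distribution) can.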

The outcomes of most stochastic processes can be described and modeled at a basic level by various types of distributions. The univariate frequency distribution of a set of outcomes provides a convenient summary of the data itself, along with the information needed to approximate a probability distribution and thus link the sample to the population. This allows a deeper understanding of both past and present states, as well as some predictive power over possible future states. While this should be considered only the beginning when developing a predictive analytics model, much better results from more powerful and intricate machine learning techniques are possible with a good understanding of the data itself.
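To make the link between a frequency distribution and an approximate probability distribution concrete, here is a minimal sketch using invented defect counts: the relative frequency of each observed outcome serves as an estimate of its probability in the population.

```python
from collections import Counter

# Hypothetical observed outcomes: defects found per inspected unit.
outcomes = [0, 1, 0, 2, 1, 0, 0, 1, 3, 0]

freq = Counter(outcomes)  # univariate frequency distribution
n = len(outcomes)

# Relative frequencies approximate the underlying probability distribution.
empirical = {k: v / n for k, v in sorted(freq.items())}
# empirical == {0: 0.5, 1: 0.3, 2: 0.1, 3: 0.1}
```

With only ten observations these estimates are rough, but as the sample grows the empirical distribution converges on the population distribution.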
Distributions are characterized by a measure of central tendency (most often the mean) and the dispersion around that central tendency, usually quantified by the standard deviation. With just these two numbers, the range of nearly all possible outcomes can be bounded: Chebyshev's inequality guarantees that, for any distribution, at most 1/k² of the probability mass lies more than k standard deviations from the mean, and the weak law of large numbers justifies treating the sample mean as an estimate of the expected value. Moreover, these two numbers also yield a minimal baseline predictive model and a method for detecting outliers and anomalies. Simply predicting the mean (expected value), or another measure of central tendency such as the median or mode, establishes a minimum acceptable accuracy against which to judge future models as they are developed. Additionally, hypothesis (A/B) testing, effect-size estimation, correlation with other variables, and experimental power calculations are all possible at this point.
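The following sketch shows how the mean and standard deviation alone supply a Chebyshev bound, an anomaly flag, and a baseline prediction. The sample values and the k = 2 threshold are assumptions chosen for illustration:

```python
import statistics

# Hypothetical sensor readings; one point is suspiciously large.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 19.5]

mu = statistics.mean(sample)
sigma = statistics.stdev(sample)

# Chebyshev's inequality: for ANY distribution, at most 1/k^2 of the
# probability mass lies more than k standard deviations from the mean.
k = 2
chebyshev_bound = 1 / k**2  # at most 25% of outcomes beyond 2 std devs

# Flag potential outliers/anomalies as points beyond k standard deviations.
outliers = [x for x in sample if abs(x - mu) > k * sigma]

# Predicting the mean for every observation is the minimal baseline model;
# its mean absolute error is the floor any later model must beat.
baseline_mae = statistics.mean(abs(x - mu) for x in sample)
```

Here the 19.5 reading is flagged as an outlier, and `baseline_mae` records the error a "just predict the mean" model incurs, which is the benchmark for anything more sophisticated.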

The type of probability distribution that best models the sample also gives us insight into the very nature of the feature space we are examining. Take, for example, a coin flip. We know intuitively that there are two possible outcomes, so a single flip follows a Bernoulli distribution, a special case of the binomial distribution, and the number of heads in many flips follows a binomial distribution. And in one of the more common classification techniques, logistic regression, the conditional probability of assigning a class is modeled as a Bernoulli distribution. Many naturally occurring processes, such as tree heights or whale weights, have outcomes following a normal distribution. In fact, by the central limit theorem, sums and averages of many independent random variables with finite variance tend toward a normal distribution as the sample size grows. If you are studying frequency-rank data generated by humans, such as article citations, urban populations, or internet page rank, it is extremely likely that your data will follow a Zipfian distribution, a discrete analogue of the Pareto distribution. Event frequencies within fixed intervals of time or space are modeled quite well by Poisson distributions, and the waiting time between events, such as the failure of a machine, is usually approximated by an exponential distribution.
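The Poisson/exponential connection can be checked with a short simulation (the event rate of 2.0 per unit time is an arbitrary assumption): drawing exponential waiting times and counting arrivals per unit interval yields counts that average out to roughly the rate, as a Poisson distribution predicts.

```python
import random
import statistics
from collections import Counter
from itertools import accumulate

random.seed(0)
rate = 2.0  # assumed average number of events per unit time

# Waiting times between events in a Poisson process are exponential,
# with mean 1/rate.
waits = [random.expovariate(rate) for _ in range(200_000)]
arrivals = list(accumulate(waits))  # cumulative event times

# Counting events per unit interval yields Poisson-distributed counts
# whose mean should be approximately `rate`.
horizon = int(arrivals[-1])  # only count fully observed intervals
counts = Counter(int(t) for t in arrivals if t < horizon)
per_interval = [counts.get(i, 0) for i in range(horizon)]

mean_wait = statistics.mean(waits)          # close to 0.5 == 1/rate
mean_count = statistics.mean(per_interval)  # close to 2.0 == rate
```

The same pattern, exponential gaps between events and Poisson counts of events, shows up in machine failures, support-ticket arrivals, and web-server traffic.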

We have thus far considered only univariate frequency distributions of outcomes, and only a few of the probability distribution functions that can be used to model them, but clearly investigating the distributions inherent in newly obtained data samples should be considered one of the more useful and informative steps in developing a predictive model. In addition, a robust statistical description helps both to understand the data sample itself and to establish the minimum acceptable predictive power of any models developed afterward.

Stay tuned for the next installment in this data science series coming soon.