In my last post, I presented a brief outline of a few ways to explore a single variable through descriptive methods, such as the mean and variance of the distribution of values for that variable. Now I’ll touch on some descriptive ways to evaluate the relationship between two variables, focusing on two primary measures: covariance and the correlation coefficient. Eventually I’ll use these concepts to delve into more interesting methods, such as regression, that are built from the simple building blocks presented thus far.
Recall that although the mean describes the average value, the individual values observed for a variable are scattered around it. Each individual value has some deviation from the mean, and the variance summarizes the overall size of those deviations (it is the average squared deviation). Intuitively, if two variables are associated in magnitude, then they can be considered to ‘covary’ with one another. That is, when one variable is far from its mean, the other variable tends to be far from its own mean as well, whether in the same or the opposite direction. A simple way to test for this type of relatedness is to calculate the covariance: for each pair of observations, multiply the two deviations together, sum those products, and divide by n - 1 (the unbiased sample estimate). If the result is positive, then the two variables tend to deviate from their means in the same direction, while a negative result indicates they tend to deviate in opposite directions.

Note that scale matters. A covariance of 10 when measured in meters means something quite different than when measured in grams or counts or some other unit. This makes comparison of the covariance across different variables very difficult. Moreover, please keep in mind that covariance has no information to impart regarding causality. We cannot derive causality or influence from covariance alone. While determination of causality is extremely difficult (and often impossible), a scale-invariant measurement of covariance is fairly easy to obtain.
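To make the covariance calculation and its dependence on scale concrete, here is a minimal Python sketch. The function name `sample_covariance` and the height and weight numbers are made up purely for illustration; they are not taken from any real data set.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two paired sequences.

    Sums the products of paired deviations from each mean and divides
    by n - 1 rather than n.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical paired measurements (illustrative values only).
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [50.0, 58.0, 66.0, 77.0, 84.0]

# Same heights expressed in centimeters instead of meters.
heights_cm = [h * 100 for h in heights_m]

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)

print("covariance with height in meters:     ", cov_m)
print("covariance with height in centimeters:", cov_cm)
```

Both results are positive, since taller hypothetical subjects are also heavier, but the second is about 100 times the first even though the underlying relationship is identical. Only the unit changed, which is exactly why raw covariances are hard to compare.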
Ideally, we would like to be able to obtain a measure of covariance whose magnitude can be compared across many variables, or indeed (sometimes) even separate experiments. A standardized measure of covariance is called a correlation coefficient, which is obtained by dividing the covariance by the product of the two standard deviations. There are a few variations on this theme, including Pearson’s product-moment correlation, Spearman’s rho, and Kendall’s tau, all of which are bivariate correlation coefficients ranging from -1 to 1, with 0 indicating no linear (Pearson) or monotonic (Spearman, Kendall) association. Generally, Pearson’s correlation coefficient can be used for any interval data. However, if the significance of the correlation is to be tested, then both variables should be normally distributed, unless one of them is dichotomous, that is, categorical with only two classes. In this case, Pearson’s method can still be used, but some care must be taken when considering whether the categorical variable calls for a point-biserial correlation (a true, discrete dichotomy) or a biserial correlation (a dichotomy imposed on an underlying continuum). Spearman’s rho and Kendall’s tau are both non-parametric, and Kendall’s tau should be preferred when your data set is small and exhibits a large number of tied ranks.

As with covariance, correlation does not imply causation. This cannot be overstated. Correlation does not imply causation. First, there is no way to tell, in a statistical or mathematical manner, which of the correlated variables influences the other. Nothing in the determination of the correlation coefficient (by any of the three methods) provides insight or justification for choosing the direction of influence. Second is the third-variable problem. There is always the possibility of a third variable, measured or unseen, that influences both correlated variables. Partial correlation and semi-partial correlation methods can be used to control for the effects of a third variable, if it is known.
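Returning to the three coefficients described above, the following Python sketch shows one way each could be computed from scratch. The helper names (`pearson_r`, `average_ranks`, `spearman_rho`, `kendall_tau_b`) and the toy data are assumptions made for illustration. The Kendall function implements the tau-b variant, which is one common way of adjusting for ties. In practice a statistics library would normally be used instead, and none of these functions performs a significance test.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations.

    The (n - 1) factors cancel, so plain sums of squares are used here.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks of the data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def kendall_tau_b(xs, ys):
    """Kendall's tau-b: based on concordant and discordant pairs, tie-adjusted."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: counts toward neither term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when a variable is constant")
    return (concordant - discordant) / denom


# Toy data: y increases with x every time, but not along a straight line.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
tau = kendall_tau_b(xs, ys)
print("Pearson r:    ", r)
print("Spearman rho: ", rho)
print("Kendall tau-b:", tau)
```

Because the toy relationship is perfectly monotonic but curved, the two rank-based coefficients reach 1 while Pearson’s r falls somewhat short of it. Rescaling either variable (say, multiplying `xs` by 100) leaves all three values unchanged, which is the scale invariance that raw covariance lacks.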
These simple descriptive techniques provide the underpinning for many more complex analytic and statistical methods such as regression and hypothesis testing, which I will examine in more detail in future posts.
This is the second installment in Matt’s Data Science blog series. For the first portion, please click here.