In previous posts, we took a brief look at how some of the more common descriptive statistics and correlation methods are used to investigate data. Now that we know about the mean (or expected value) of a variable, and how to determine whether two variables vary together, we can begin our foray into machine learning and predictive analytics. Instead of simply examining the correlation between variables, we can use regression to learn how to predict the value of an outcome variable based on input explanatory variables (or features). Regression analysis is one method by which we seek to discover approximate relationships between variables, describing how the outcome changes with variation in the explanatory variable(s). Once those relationships are determined, we can predict what a future outcome might be based only on explanatory variable information. Regression is also extremely useful when you have the power to change the input features in order to obtain a desired outcome.
Simple linear regression assumes a linear relationship between two variables: the outcome variable Y and the predictor variable X. We use training data of known X and Y values to quantify this relationship in order to predict unknown (or at least unseen) values of Y from known values of X. For example, we might want to predict home price from indoor area, or the fuel efficiency of a vehicle from its engine displacement or power. In the most general sense, we are predicting the outcome based on a model plus some error; thus, as we reduce the error, we get better predictions. The mathematical formulation for the ith Y is shown below:
yi = β0 + β1 xi + εi
which is, of course, the familiar straight line equation, where the coefficients β0 and β1 are the intercept and the slope (or gradient), respectively, and εi is the error between the predicted and actual yi. The actual quantity being minimized is the total error across all i examples in the historical observed data (training data); the error is minimized by varying the β values to find the pair with the least total discrepancy between observed and predicted. The model created in this process produces the line of predicted Y values as a function of the measured X, β0, and β1, as the figure below demonstrates, displaying the trend of fuel efficiency (mpg) as a function of the weight of several 1970s-era cars1. Intuitively, we ‘know’ that the fuel efficiency of a car generally decreases as its weight increases, but by completing a regression analysis we can quantify the relationship: in our example, a typical car loses about 5.3 mpg per 1000 lbs. of added weight. There are other factors to consider, but many of those can also be included in higher-dimensional regression analyses.
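As a sketch of how such a fit might be computed in practice, the snippet below uses NumPy's least-squares polynomial fit on a small set of made-up weight/mpg pairs. The numbers are illustrative only, not the actual Motor Trend data:

```python
import numpy as np

# Hypothetical car data: weight in 1000s of lbs and fuel efficiency in mpg.
# These values are invented for illustration, not the Motor Trend dataset.
weight = np.array([2.2, 2.6, 3.0, 3.2, 3.6, 4.1, 4.5])
mpg = np.array([28.0, 25.5, 22.0, 21.0, 18.5, 15.0, 13.5])

# np.polyfit with deg=1 solves the least-squares problem in closed form,
# returning the slope (beta1) and intercept (beta0) of the best-fit line.
beta1, beta0 = np.polyfit(weight, mpg, deg=1)

print(f"mpg ≈ {beta0:.2f} + ({beta1:.2f}) * weight")
```

For data like this, the slope comes out negative, matching the intuition that heavier cars get fewer miles per gallon.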
Notice that the line generally goes through the center of the points, but actually touches very few of them. We could construct a model such that the line passes through every training point simply by adding more coefficients, but then the error associated with predicting new, previously unseen, Y values is likely to be larger. Generally, we seek the best balance between accuracy on the training data and flexibility for fitting future data, commonly called the bias-variance tradeoff; the exact balance point desired depends on the use case at hand. The regression line, or line of best fit, is the line which lies ‘closest’ to all observed points. Closeness can be measured in a variety of ways, most commonly by the method of least squares, where the goal is to minimize the squared error between the predicted values (the line) and the observed values (the points). There are several methods by which this line can be determined. If the data are well behaved, we can simply solve the normal equations for the closed-form solution, but iterative optimization methods, such as gradient descent, are often used to find the coefficients which minimize the error between predicted and observed. Gradient descent finds the lowest total error by using the derivatives of the error function to determine in which direction to shift the coefficients; the coefficients are adjusted, and the new error calculated. Think of a whirlpool, with the trajectory of the total error as a function of the coefficients moving down to the lowest point, although the error surface is rarely so simple and normally has many low points, only one of which is the global minimum.
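A minimal gradient descent loop for simple linear regression might look like the following sketch; the data, learning rate, and iteration count are assumptions chosen so the loop converges:

```python
import numpy as np

# Illustrative data, roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b0, b1 = 0.0, 0.0  # start both coefficients at zero
lr = 0.01          # learning rate: how far to step along the gradient

for _ in range(5000):
    pred = b0 + b1 * x
    err = pred - y
    # Partial derivatives of the mean squared error with respect to b0, b1.
    grad_b0 = 2 * err.mean()
    grad_b1 = 2 * (err * x).mean()
    # Step each coefficient downhill, against its gradient.
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(f"intercept ≈ {b0:.2f}, slope ≈ {b1:.2f}")
```

With enough iterations on this well-behaved quadratic error surface, the loop lands on the same coefficients the closed-form least-squares solution would give.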
Now we can predict the approximate y for an individual x by using the equation above, substituting in the x value for which you would like to determine y along with the coefficients recovered through regression. This can be visualized by finding the x value on the x-axis, moving vertically until the regression line is reached, then moving horizontally until the y-axis is reached. This is the y value predicted for the observed x, illustrated in the figure below by the blue dashed lines.
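Making a prediction from the fitted line is just arithmetic. The coefficient values below are assumed for illustration (they are close to what a fit of the Motor Trend data yields):

```python
# Assumed coefficients for illustration: intercept ~37.3 mpg,
# slope ~-5.3 mpg per 1000 lbs.
beta0, beta1 = 37.3, -5.3

weight = 3.0  # a 3,000 lb car, in 1000s of lbs
predicted_mpg = beta0 + beta1 * weight
print(f"predicted fuel efficiency: {predicted_mpg:.1f} mpg")  # ≈ 21.4 mpg
```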
We have considered the case of regression in one (finding the mean) or two (simple linear regression) dimensions, but linear regression can be extended to an arbitrary number of dimensions such that there are many explanatory variables and one outcome variable (multiple linear regression), or several outcome variables (multivariate linear regression), though the latter is less common and requires somewhat different techniques. It is, of course, more difficult to visualize in many dimensions, but the same general process applies. We have, thus far, assumed that the relationship between the outcome and the features is linear and additive. We can account for a variety of curvilinear relationships by using polynomial regression; interestingly, the model itself is still linear in the coefficients in this case, so standard linear regression techniques can still be used for fitting. We can also apply a specific function to the feature in order to model certain behavior, such as is done in logistic regression. Other regression methods used to model general non-linear behavior include local regression and piecewise (spline) regression. Generalized additive models (GAMs) allow a separate function to be applied to each feature, so an almost infinite variety of mathematical relationships can be evaluated; they are also useful alongside mixed-effects analyses that examine statistical differences between groups in the presence of confounding variables. Note, however, that while we can, and often do, include interaction terms to model non-additive relationships, restricting the evaluation space to only additive terms allows us to investigate each explanatory variable and its effect on the outcome independently while holding the other variables constant. There are also a number of methods for shrinking coefficient estimates and selecting the features most important to the outcome prediction. Collectively, these are called penalized regression methods and include techniques such as the lasso and ridge regression, among others.
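To see why polynomial regression is still linear regression: the squared term is just another column of inputs, so the same least-squares machinery fits it. The data below are synthetic, generated from a known quadratic so the recovered coefficients can be checked:

```python
import numpy as np

# Synthetic, noiseless data from a known quadratic: y = 1 + 2x - 0.5x^2.
x = np.linspace(0, 4, 20)
y = 1.0 + 2.0 * x - 0.5 * x**2

# deg=2 fits y = b0 + b1*x + b2*x^2; the model is non-linear in x but
# linear in the coefficients, so ordinary least squares still applies.
# np.polyfit returns the highest-degree coefficient first.
b2, b1, b0 = np.polyfit(x, y, deg=2)

print(f"y ≈ {b0:.2f} + {b1:.2f}x + ({b2:.2f})x²")
```

Because the data are noiseless, the fit recovers the generating coefficients almost exactly; with real data the recovered values would only approximate the underlying relationship.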
Regression analysis is a suite of methods with which we can examine the observed relationships between explanatory variables and one or more continuous outcome variables. Predicting the outcome is often the main goal of a regression analysis, but the coefficients frequently have meaning as well, and their values are informative. We have only touched on the simplest of cases involving regression, but the main ideas of simple linear regression propagate through even the most complex ones. Next time we will look at how to investigate a categorical or class-based outcome, and how basic classification methods are used to assign group membership.
1. Henderson and Velleman (1981). Building multiple regression models interactively. Biometrics, 37, 391–411. Data from the 1974 Motor Trend US magazine.