A common type of predictive analytics is called classification. In contrast to regression, where the goal is to predict a variable continuous outcome using features associated with that outcome, classification attempts to predict ‘group membership’ where the outcome is categorical in nature. For example, we might be interested in predicting what repair would fix a broken-down vehicle based on the sensor readings and the error codes from the time just before the breakdown. Or perhaps the rating a person would give a movie, based on the ratings they have given other movies, and characteristics about the movies.
As with regression, the feature space, that is the variables that are used to predict the outcome category, can be arbitrarily extended to many dimensions, and the outcome is usually restricted to a single outcome or category per example. While multi-label classification is possible, we will focus only on single label classification for this post. Methods used in classification projects include ensemble methods (random forest, gradient-boosted trees and so on), maximal margin classifiers (such as support vector machines), multi-layer perception based methods (neural network, deep learning), naive Bayes, and regression based techniques (logistic regression) among others. One technique that is both relatively common and easy to implement is called k-nearest neighbors.
k-nearest neighbors (knn) is based on the distance between the features for each example. Thus, examples that are ‘near’ to each other in n-dimensional feature space, are more likely to belong to the same class. Notably, knn is a non-generalizing technique, and does not, as with most other methods, attempt to construct a generalized model by finding coefficients (recall the slope and intercept from the linear regression post), but instead ‘remembers’ all of the training data, and predicts the class of new examples purely by locations and distance. It has, however, been used to quite good effect when the decision boundary is highly irregular, as there is no function that strictly defines the decision boundary across all examples.
The figure below illustrates a simple example of how knn might classify a new data point. If we have two groups distributed in two dimensions (the orange and green points) that are gathered from labeled examples, a new data point (grey point) would be assigned to a group based on the closest three known points, using k = 3. Thus, the new data point would be assigned to the ‘orange’ group, as the three nearest points are all orange.
Here, the distances are calculated using simple ‘brute-force’, as the number of both points and dimensions is small, so we can calculate the distance from the new unlabeled point to every other point fairly easily. There are more computationally efficient methods for larger datasets, such as K-D tree or ball tree, but here an optimized brute-force algorithm suffices. Because the number of dimensions is also small, we can use the Euclidean distance metric, which is simply the square root of the sum of the squared differences of Cartesian locations in all measured dimensions. This does highlight a major flaw in knn (as well as many other machine learning techniques) called the curse of dimensionality. Intuitively, as the number of dimensions increases, so does the distance between labels embedded in the defined n-dimensional space, and quickly every point becomes quite far away from every other point, making any relative distance quite difficult to determine.
Some additional items to consider about a classification exercise. In general, we must have a set of labeled data points with which to train the model. In the figure above, we already know that the colored points belong to their respective groups, and that information is used to determine the label of any new points. If no labels are known, then unsupervised (or in some cases semi-supervised) learning techniques, such as clustering, are more appropriate. In addition, the feature or input data should generally be related to the labels in come manner, even if the relationship is unknown or difficult to determine. In fact, the main goal of classification is to uncover these relationships as a function of prediction, even if they are primarily unintelligible at first glance. Moreover, as either the number of feature dimensions and/or the number of different labels increases, so must the number of labeled examples. In effect, we need to find a good signal in the noise. There is no firm rule about how much data is needed, as it depends on many factors, but this concept will be explored in a future post, so stay tuned.
To watch our latest Data Science video “Why Machine Learning Cannot Solve All Our Problems” click here.