In my last blog post of 2015, I outlined the difference, in IoT scenarios, between Business Intelligence and Data Analytics. These terms are frequently used interchangeably and I felt it was important to provide a better understanding of their meaning.
With that task completed, in this new post I’d like to take a deeper dive into some of the methods that can be used to perform analytics on data.
When we talk to technical customers, they frequently jump straight to a discussion of machine learning (ML). ML is heavily hyped and can appear to be a ‘silver bullet’ that solves any problem. The reality is quite different. First, machine learning consists of multiple approaches underneath the larger ML umbrella. Second, machine learning may not be the most prudent approach; for example, a rule engine with carefully crafted rules may be a better solution.
The right engagement model depends on the situation. What matters is how we select an approach and why we choose to solve a problem in a specific way. For more context on approaches and methods, below is a list of the general classes of analytic problems which we have looked to include in DataV. For our customers, we’re able to draw on a number of different approaches, ranging from methods in applied statistics to machine learning algorithms, depending on which approach and method is most efficient and effective at solving the business problem.
The numbered items below are approaches, and the underlying methods are described beneath each.
1. Applied Statistics – In this approach, we collect and analyze data through sampling in order to generalize metrics about a population:
Sigma Analysis – This is a very simple but powerful way to use statistics to detect outliers in real time. When you characterize the average value of some measurement, it’s often helpful to understand the variance as well. Knowing the variance helps you look at observations and measurements in real time to determine how many standard deviations (“sigmas”) these observations are away from the mean.
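As a minimal sketch of the idea (the sensor readings and the three-sigma threshold below are invented for illustration), a new measurement can be scored by how many standard deviations it sits from the historical mean:

```python
import statistics

def sigma_outlier(history, new_value, threshold=3.0):
    """Return (sigmas, flagged): how many standard deviations new_value
    is from the mean of history, and whether that exceeds the threshold."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    sigmas = abs(new_value - mean) / sigma
    return sigmas, sigmas > threshold

# Hypothetical temperature readings from a sensor, in degrees C
readings = [21.0, 20.5, 21.2, 20.8, 21.1, 20.9, 21.0, 20.7]
sigmas, is_outlier = sigma_outlier(readings, 25.0)
```

A reading of 25.0 sits many sigmas from this tight cluster around 20.9 and would be flagged, while a reading of 21.0 would not.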
Statistical Hypothesis Testing – This is a method for testing whether an observation of a designed random variable is statistically significant, or unlikely to have occurred by chance. This is a powerful way to determine if a measured value is likely to be meaningful for making a business decision.
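A one-sample z-test is one simple instance of this method; the cycle times and the historical baseline of 10.0 below are invented for the example:

```python
import math
import statistics

def one_sample_z_test(sample, hypothesized_mean):
    """Two-sided z-test: is the sample mean significantly different from
    the hypothesized mean? Returns (z statistic, p-value)."""
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - hypothesized_mean) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Do these machine cycle times differ from the historical mean of 10.0 s?
times = [10.4, 10.6, 10.3, 10.8, 10.5, 10.7, 10.4, 10.6]
z, p = one_sample_z_test(times, 10.0)
significant = p < 0.05
```

A small p-value suggests the shift in cycle time is unlikely to have occurred by chance, which is exactly the kind of signal you would want before making a business decision.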
Analysis of Variance – This method determines whether differences exist between means for different groups. Similar applications as statistical hypothesis testing, but useful in comparing across multiple groups for statistical significance.
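The F statistic at the heart of one-way ANOVA compares variance between group means to variance within groups; this sketch (with made-up throughput numbers for three machine configurations) computes it directly:

```python
import statistics

def one_way_anova_f(groups):
    """F statistic: between-group variance over within-group variance.
    Large values suggest the group means genuinely differ."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((v - statistics.mean(g)) ** 2 for v in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical throughput measured on three machine configurations
f_stat = one_way_anova_f([
    [50, 52, 49, 51],
    [54, 55, 53, 56],
    [48, 47, 49, 46],
])
```

In practice the F statistic is compared against an F distribution to get a p-value, but even the raw statistic shows these three configurations behave differently.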
Lag Variogram – This method determines the periodicity of a process, which is useful for characterizing processes of unknown period or duration.
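A closely related trick, sketched here with a toy repeating signal, uses autocorrelation at increasing lags: the lag at which a series best matches a shifted copy of itself is an estimate of its period.

```python
def autocorrelation(series, lag):
    """Correlation of a series with a copy of itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def estimate_period(series, max_lag):
    """The lag with the strongest self-similarity = estimated period."""
    return max(range(1, max_lag + 1),
               key=lambda lag: autocorrelation(series, lag))

# A process that repeats every 4 samples
signal = [0, 1, 2, 1] * 10
period = estimate_period(signal, max_lag=8)
```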
2. Probability Theory – This approach involves the analysis of random processes related to a population in order to characterize likely or expected observations.
Markov Chain Modeling – This method characterizes transition states in a process where the future state depends only on the current state and is powerful when expected transitions involve a finite number of states.
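The transition states can be captured as a matrix of probabilities; this sketch (the machine states and probabilities are invented) propagates a distribution over states forward in time:

```python
# Hypothetical machine that moves between 'idle', 'running', and 'fault'.
# Each row gives the probabilities of moving to the next state.
transitions = {
    "idle":    {"idle": 0.6, "running": 0.4, "fault": 0.0},
    "running": {"idle": 0.3, "running": 0.6, "fault": 0.1},
    "fault":   {"idle": 0.8, "running": 0.0, "fault": 0.2},
}

def step_distribution(dist, transitions):
    """One Markov step: the next distribution depends only on the
    current one, not on any earlier history."""
    nxt = {state: 0.0 for state in transitions}
    for state, p in dist.items():
        for target, t in transitions[state].items():
            nxt[target] += p * t
    return nxt

# Start certainly idle; distribution after two steps
dist = {"idle": 1.0, "running": 0.0, "fault": 0.0}
for _ in range(2):
    dist = step_distribution(dist, transitions)
```

After two steps the model assigns a 4% chance of being in the fault state, the kind of forward-looking estimate that makes Markov chains useful for planning.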
Decision Tree Modeling – This structure is very popular for visualizing downstream probabilities. We make use of a branching graph structure to model all possible consequences of a process, with associated probabilities for each branch. It is useful for characterizing downstream probabilities at the leaves of the tree.
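Multiplying probabilities down each branch gives the probability of reaching each leaf; this sketch uses an invented maintenance scenario (alert vs. no alert, then failure vs. no failure):

```python
# Each branch is (probability, child); a child is either a nested dict
# of branches or a leaf outcome.
tree = {
    "alert": (0.3, {
        "failure":    (0.5, "replace_part"),
        "no_failure": (0.5, "inspect"),
    }),
    "no_alert": (0.7, {
        "failure":    (0.05, "replace_part"),
        "no_failure": (0.95, "continue"),
    }),
}

def leaf_probabilities(node, prob=1.0, leaves=None):
    """Multiply branch probabilities down to every leaf of the tree."""
    if leaves is None:
        leaves = {}
    for branch_prob, child in node.values():
        if isinstance(child, dict):
            leaf_probabilities(child, prob * branch_prob, leaves)
        else:
            leaves[child] = leaves.get(child, 0.0) + prob * branch_prob
    return leaves

probs = leaf_probabilities(tree)
```

Here the downstream probability of needing to replace the part is 0.3 × 0.5 + 0.7 × 0.05 = 0.185, read directly off the leaves.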
3. Unsupervised Machine Learning – This approach includes algorithms that find hidden patterns or structures in large data sets using clustering, classification, and other statistically “heavy” methods.
Clustering – This method discovers patterns in data by grouping elements so that the elements in each cluster are more similar to one another than to elements in the other clusters.
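K-means is one of the simplest clustering algorithms; this deliberately minimal one-dimensional sketch (the sensor readings are invented) alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points:

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups of sensor readings
data = [1.0, 1.2, 0.9, 1.1, 9.8, 10.1, 10.0, 9.9]
centers = k_means(data, k=2)
```

The two discovered centroids land near 1.05 and 9.95, the centers of the two hidden groups, without any labels being supplied.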
Data Mining – An automated process for identifying anomalies or hidden rules in data based purely on statistics. Typically, there is little reliance on theory or subject matter expertise in data mining approaches. Data mining can be useful for developing hypotheses, but may be dangerous as a holistic solution.
Random Forest Modeling – This method is an ensemble variant of decision tree modeling wherein many trees are constructed from random subsets of the data and its attributes. The aggregated vote of the trees, which is a better predictor of classes than any single tree, is the output of the model.
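A deliberately simplified sketch of the ensemble idea, using invented one-dimensional labeled readings and decision “stumps” (single-threshold trees) trained on bootstrap samples, with a majority vote at the end:

```python
import random
import statistics

def train_stump(sample):
    """A decision stump on 1-D labeled data: threshold at the midpoint
    of the two class means."""
    lows = [x for x, y in sample if y == 0]
    highs = [x for x, y in sample if y == 1]
    return (statistics.mean(lows) + statistics.mean(highs)) / 2

def random_forest_predict(x, data, n_trees=25, seed=1):
    """Train each stump on a bootstrap sample; take the majority vote."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in range(len(data))]
        # Re-sample until both classes appear in the bootstrap
        while len({y for _, y in sample}) < 2:
            sample = [rng.choice(data) for _ in range(len(data))]
        votes += 1 if x > train_stump(sample) else 0
    return 1 if votes > n_trees / 2 else 0

# Labeled readings: class 0 clusters near 1.0, class 1 near 5.0
data = [(0.9, 0), (1.1, 0), (1.0, 0), (4.9, 1), (5.1, 1), (5.0, 1)]
prediction = random_forest_predict(4.8, data)
```

Each stump sees slightly different data, so the vote is more robust than any single threshold.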
4. Supervised Machine Learning – This approach leverages algorithms that emulate the decision-making and reasoning skills of human beings by programmatically capturing hidden preferences and rules from labeled examples.
Classification – This method identifies which class an element belongs to given a training set of classes, based on attributes of that element and comparison to the other elements in each class.
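A nearest-centroid classifier is one simple member of this family; the training set below (vibration and temperature readings per machine state) is invented for the sketch:

```python
import math

def centroid(points):
    """Mean position of a set of 2-D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def classify(element, training_set):
    """Assign `element` to the class whose centroid is nearest
    by Euclidean distance."""
    centroids = {label: centroid(pts) for label, pts in training_set.items()}
    return min(centroids,
               key=lambda label: math.dist(element, centroids[label]))

# Hypothetical training set: (vibration, temperature) per machine state
training = {
    "healthy": [(0.2, 40), (0.3, 42), (0.25, 41)],
    "failing": [(0.9, 70), (1.0, 72), (0.95, 71)],
}
label = classify((0.85, 68), training)
```

The new reading sits far closer to the “failing” centroid, so that class is assigned.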
Predictive Coding – This method actively trains an algorithm as to which attributes are most important about an event or data element based on a human’s interaction determining which elements from random subsets are most meaningful.
Reinforced Learning – This is a hybrid machine learning method where a training set is identified by an unsupervised algorithm. The training set is then refined through a predictive coding process in which a human reinforces or discourages the learning.
5. Natural Language Processing – The NLP approach adds structure, computation, and quantities to traditional language in order to create analytic opportunity.
Term Frequency/Document Frequency Matrix – This method characterizes how anomalous a document is based on the ratios of words used in that document compared to the ratios of words used in all documents throughout a corpus.
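A bare-bones sketch of the comparison (the tiny corpus of maintenance notes is invented): compute each word’s ratio within a document and within the whole corpus, and score the document by how far those ratios diverge.

```python
from collections import Counter

def term_ratios(text):
    """Relative frequency of each word in a text."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def anomaly_score(document, corpus):
    """Sum of gaps between a document's word ratios and the corpus-wide
    ratios: higher means more unusual vocabulary."""
    doc = term_ratios(document)
    corp = term_ratios(" ".join(corpus))
    return sum(abs(ratio - corp.get(word, 0.0))
               for word, ratio in doc.items())

corpus = [
    "pump pressure normal",
    "pump pressure stable",
    "valve pressure normal",
]
routine = anomaly_score("pump pressure normal", corpus)
unusual = anomaly_score("catastrophic leak detected", corpus)
```

A note reusing common corpus vocabulary scores low; a note full of words the corpus has never seen scores high.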
Sentiment Analysis – This includes methods that determine the sentiment of written text based on the words used and structure of the speech.
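The simplest such methods are lexicon-based: each word carries a polarity, and the text is scored by summing them. The tiny lexicon below is invented for illustration; real sentiment systems also model negation and sentence structure.

```python
# Hypothetical sentiment lexicon: word -> polarity
lexicon = {"great": 1, "good": 1, "reliable": 1,
           "bad": -1, "broken": -1, "failure": -1}

def sentiment_score(text):
    """Sum lexicon polarities over the words in a text;
    positive totals suggest positive sentiment."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())

positive = sentiment_score("the new sensor is great and very reliable")
negative = sentiment_score("bad batch, another failure reported")
```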
Topic Tagging – This method includes algorithms that determine the topic of a document based on the associations of words used and comparisons to “word bags” of interest.
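In its simplest form, the comparison is just set overlap: the document is tagged with the topic whose word bag it shares the most words with. The two word bags below are invented for the sketch.

```python
# Hypothetical topic "word bags"
word_bags = {
    "maintenance": {"repair", "replace", "inspect", "schedule", "part"},
    "energy": {"power", "consumption", "kwh", "load", "grid"},
}

def tag_topic(document, word_bags):
    """Tag a document with the topic whose word bag overlaps it most."""
    words = set(document.lower().split())
    scores = {topic: len(words & bag) for topic, bag in word_bags.items()}
    return max(scores, key=scores.get)

topic = tag_topic("schedule a repair and replace the worn part", word_bags)
```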
6. Network Analysis – This approach analyzes the structure of a network graph to determine the relationships between nodes and edges.
Network Descriptive Statistics – This method calculates descriptive measures to characterize network position and examine the change and evolution of position over time.
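Degree centrality is one of the most basic of these measures: each node’s connection count divided by the maximum possible. The small device network below is invented for the sketch.

```python
from collections import defaultdict

# Edges of a small, hypothetical undirected device network
edges = [("gateway", "sensor_a"), ("gateway", "sensor_b"),
         ("gateway", "sensor_c"), ("sensor_a", "sensor_b")]

def degree_centrality(edges):
    """Each node's degree divided by (n - 1): a 1.0 means the node is
    directly connected to every other node."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

centrality = degree_centrality(edges)
```

Recomputing this over time shows how a node’s position in the network evolves, e.g. whether the gateway remains the hub.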
7. Geospatial Statistics – This approach provides analysis of data that has geographical or spatial relevance.
Kernel Density – This method graphically measures the point density of multiple observations in multiple dimensions. Kernel density can be extended to linear density and other creative variants; ultimately it is useful for “hot spot” characterization in two and three dimensions.
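A one-dimensional Gaussian kernel density estimate shows the mechanics; the incident locations (kilometer markers along a road) and the bandwidth are invented for the sketch:

```python
import math

def kernel_density(x, observations, bandwidth=1.0):
    """Gaussian kernel density estimate at point x (1-D): each
    observation contributes a bump, and bumps pile up where points
    are dense."""
    norm = len(observations) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - obs) / bandwidth) ** 2)
               for obs in observations) / norm

# Hypothetical incident locations along a road (km markers)
incidents = [2.1, 2.3, 2.2, 2.0, 7.5, 2.4]
hot = kernel_density(2.2, incidents, bandwidth=0.5)
cold = kernel_density(5.0, incidents, bandwidth=0.5)
```

The density near km 2.2, where incidents cluster, dwarfs the density at km 5.0, which is exactly the “hot spot” picture the method is used for.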
Local Outlier Factor – This method determines how likely an observation is given its proximity to its nearest neighbors, and is a powerful way to look for outlier observations in dense, regularly measured data. Nearest neighbors can be considered spatially, but the notion can also be extended to temporal proximity, scalar proximity, etc.
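A simplified one-dimensional sketch of the idea (not the full LOF algorithm, and the readings are invented): compare a point’s average distance to its k nearest neighbors against the same quantity for those neighbors. Scores near 1 mean the point is as dense as its surroundings; scores well above 1 suggest an outlier.

```python
def knn_distance(point, points, k):
    """Average distance from `point` to its k nearest neighbors."""
    dists = sorted(abs(point - p) for p in points if p != point)
    return sum(dists[:k]) / k

def local_outlier_score(point, points, k=3):
    """Simplified LOF: the point's k-NN distance divided by the average
    k-NN distance of those same neighbors."""
    neighbors = sorted(points, key=lambda p: abs(point - p))
    neighbors = [p for p in neighbors if p != point][:k]
    neighbor_avg = sum(knn_distance(p, points, k) for p in neighbors) / k
    return knn_distance(point, points, k) / neighbor_avg

readings = [10.0, 10.1, 10.2, 9.9, 10.05, 14.0]
outlier_score = local_outlier_score(14.0, readings)
normal_score = local_outlier_score(10.0, readings)
```

The 14.0 reading sits far from an otherwise dense neighborhood and scores high, while a reading inside the cluster scores near 1.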
In closing, I hope that I’ve given you a sense of the variety of approaches and methods that can be brought to bear on data in an IoT scenario to drive better business outcomes. Analytics for the sake of analytics is never the right approach. Choose a solution that helps you understand a business problem, then deploy the most appropriate approach and method to achieve success!