Statistical Learning & Data Mining

Exponent statisticians possess experience with both traditional and recently developed tools for data mining, and we monitor the state of the art in this rapidly evolving area of research. 

Advances in computing technology, and the consequent ability to create, store, and access ever-larger volumes of data in modern business and science, have created new opportunities and challenges. Data mining refers to the application of statistical methods to extract implicit, previously unknown, and potentially useful information from data. 

Such methods can be applied to a diverse set of problems: diagnosing which disease afflicts a patient based on the values of biological markers and observable symptoms; predicting how much credit a consumer will use on their home equity line based on their credit score, financial history, and loan characteristics; and understanding which products are most likely to be purchased simultaneously by supermarket customers to achieve optimal placement of items in the store. The primary objective of the research may be advancement of knowledge or prediction of future events.

The learning involved in the modeling process may be supervised—i.e., guided by the presence of an outcome variable of interest—or unsupervised—i.e., conducted with no meaningful designation of inputs and outputs. Depending on the nature of the learning involved and the type of outcome variable, problems in data mining can be labeled as one of three types:

  • Classification 
  • Regression
  • Clustering

Classification problems involve a qualitative (or categorical) outcome variable. Common data mining methods for addressing classification problems include logistic regression, classification trees, and neural networks. In logistic regression, the probability of observing a particular outcome is modeled explicitly as a function of one or more input variables. Classification trees are constructed by repeated binary splits of the data, beginning with the complete dataset, with the goal of creating descendant subsets that are more homogeneous in outcomes than the parent subsets. If knowledge discovery is the goal, interest may then focus on the input variables used in defining the criteria for splitting. A neural network is essentially a nonlinear statistical model, in which the central idea is to extract linear combinations of the input variables as derived features and then model the response as a nonlinear function of these features. Because neural networks do not represent explicit links between inputs and outputs, they are inherently better suited to prediction than to knowledge discovery.
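The idea that logistic regression models an outcome probability explicitly as a function of the inputs can be sketched in a few lines. The following is an illustrative pure-Python fit by gradient ascent on the log-likelihood, using made-up toy data and a single input variable; a real analysis would use an established statistical package rather than this hand-rolled routine.

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x))) to 0/1 outcomes
    by gradient ascent on the log-likelihood (illustrative only)."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)        # gradient component for the intercept
            g1 += (y - p) * x    # gradient component for the slope
        b0 += lr * g0 / len(xs)
        b1 += lr * g1 / len(xs)
    return b0, b1

def predict_prob(b0, b1, x):
    """Modeled probability of the outcome at input value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Toy data: the outcome becomes more likely as x grows.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

After fitting, `predict_prob` returns a small probability at low values of x and a large one at high values, which is exactly the explicit input-to-probability link the text describes.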

Regression problems involve a quantitative outcome variable and are typically addressed using linear regression, regression trees, or neural networks. Linear regression models express the outcome as a linear function of input variables. Although linear regression was developed largely in the pre-computer era, these models are still valued for their simplicity and interpretability. Regression trees, like classification trees, use recursive partitioning algorithms to define a set of rules linking the output to inputs. Similarly, neural networks are readily adapted to accommodate quantitative outputs.
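The simplicity and interpretability of linear regression come through in the single-input case, where the least-squares coefficients have a closed form. The sketch below uses hypothetical noise-free data so the fit recovers the generating line exactly; it is meant to illustrate the method, not serve as a general-purpose implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x (single input, closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx          # slope: covariance over variance
    b0 = my - b1 * mx       # intercept: line passes through the means
    return b0, b1

# Hypothetical data generated from y = 3 + 2x with no noise,
# so the fitted coefficients match the generating line.
xs = [0, 1, 2, 3, 4]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)
```

The fitted slope and intercept are directly interpretable—here, each unit increase in the input adds b1 units to the predicted outcome—which is the interpretability advantage the text notes.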

Clustering, or data segmentation, involves the unsupervised learning task of grouping observations (objects) into subsets or clusters. Distinctions between input and output quantities are not meaningful. Instead, the goal is to create clusters in which the objects within a cluster are more similar to one another than to objects in different clusters. K-means clustering, one of the most widely used algorithms, selects k objects, each of which initially represents a cluster mean, or centroid. Each of the remaining objects is assigned to the closest cluster as measured by its distance to the cluster centroid; the centroids are then recomputed, and the assignment step is repeated until the clusters stabilize. A different, though related, task involves reducing the number of variables, or the dimensionality, of the data. The method of principal components is frequently used to find a small number of linear combinations of the original variables that retain most of the information in the full data set.
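The alternating assign-and-update loop of k-means can be sketched compactly. The version below works on one-dimensional toy data and seeds the centroids with the first k points; practical implementations use better seeding and a proper convergence check, so this is purely illustrative.

```python
def kmeans_1d(points, k=2, iters=20):
    """Plain k-means on 1-D data: seed centroids with the first k points,
    then alternate cluster assignment and centroid updates (illustrative)."""
    centroids = sorted(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Toy data with two obvious groups, around 1 and around 10.
points = [0.9, 1.0, 1.1, 9.8, 10.0, 10.2]
centroids, clusters = kmeans_1d(points)
```

On this data the loop settles quickly, with one centroid near 1 and the other near 10, each cluster containing the three points closest to it.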

The application of statistical methods to very large data sets with many variables heightens the concern that statistically significant associations will be found by chance (i.e., false positives) and will not be indicative of true underlying relationships. In supervised learning problems of classification and regression, these concerns are addressed by validation techniques such as cross-validation—that is, by dividing the data into a training subset used to build a prediction model and a test subset used to evaluate the model’s performance on data it has not seen. Methods have also been developed to account for the expected number of falsely significant findings (e.g., by controlling the false discovery rate) when evaluating a large series of statistical comparisons.
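The train/test division the text describes can be illustrated with a simple holdout split—the basic building block of cross-validation, which repeats such splits across the data. The sketch below fits a single-variable least-squares line on the first seven of ten hypothetical observations and then measures error only on the three held-out points the model never saw.

```python
def fit_line(xs, ys):
    """Single-variable least squares: y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical data scattered around y = 2x + 1.
devs = [0.1, -0.1, 0.2, 0.0, -0.2, 0.1, 0.0, -0.1, 0.2, -0.2]
data = [(x, 2 * x + 1 + d) for x, d in zip(range(10), devs)]

# Holdout split: train on the first seven points, test on the last three.
train, test = data[:7], data[7:]
b0, b1 = fit_line([x for x, _ in train], [y for _, y in train])

# Evaluate on data the model never saw during fitting.
test_mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in test) / len(test)
```

Because the test error is computed on held-out observations, it gives an honest estimate of predictive performance rather than rewarding a model for memorizing chance patterns in the training data.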