Data Visualization

The findings of a study are often most powerfully rendered not by complex statistical models but by simple visual displays. Data graphics visually display measured quantities by combining points, lines, a coordinate system, numbers, symbols, words, shading, and color (Tufte, The Visual Display of Quantitative Information, 2nd edition, 2001). Well-conceived graphical displays show large amounts of data efficiently and coherently, avoid distortion, and induce the viewer to focus on the substance of the intended message. Exponent statisticians possess the experience and software tools, extending well beyond the capabilities of Microsoft® Excel, to create effective visual displays for purposes of exploration, confirmation, and communication.

The most suitable display will depend on the types of data that have been gathered and the roles of key variables. In the case of a single variable or quantity of interest, bar graphs and pie charts can be used to display categorical data, in which each observation is classified into one of several groups or categories (e.g., brand, product model, geographic area). For quantitative variables that take on values in an interval of the number line, other types of displays are needed. Stemplots are quick-to-construct displays that preserve the numerical values of the observations and are especially well suited for small data sets. Histograms, which resemble bar graphs of categorical data, break the values of a quantitative variable into intervals and display only the count or percent of observations that fall into each interval. Dotplots combine features of stemplots and histograms by representing one or more occurrences of a numerical value by a single dot along the number line. Boxplots summarize a distribution with a box containing the middle 50% of the data and lines extending from the box to more extreme values on both sides.
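
The contrast between a stemplot, which keeps every observed value, and a histogram, which keeps only interval counts, can be illustrated with a short Python sketch. The data values, the function names (stemplot, histogram_counts), and the choice of tens as stems are assumptions made for illustration only; this is not a description of any particular software tool.

```python
from collections import defaultdict


def stemplot(values):
    """Return text lines of a stemplot for non-negative integers.

    Each value is split into a stem (all digits but the last) and a
    leaf (the last digit), so the original values remain readable.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines


def histogram_counts(values, start, width, bins):
    """Count observations in each interval [start + k*width, start + (k+1)*width)."""
    counts = [0] * bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < bins:  # values outside the covered range are ignored
            counts[k] += 1
    return counts


# Hypothetical sample, invented for this illustration.
scores = [62, 67, 71, 74, 74, 78, 83, 85, 85, 89, 91, 104]

for line in stemplot(scores):
    print(line)
print(histogram_counts(scores, start=60, width=10, bins=5))
```

In the stemplot output each observation can still be read off exactly, whereas the histogram counts report only how many observations fall in each interval of width 10.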

The task of displaying data becomes more challenging as the number of variables or quantities of interest increases. Time series (i.e., measurements of a variable taken at regular intervals) are typically displayed by adding a second axis, so that the display shows the time of each reading as well as the primary quantity of interest. Such time plots can help to reveal the main features of a time series, particularly the presence of seasonal variation or a long-term trend. Scatterplots are used frequently in correlation and regression analyses to display simultaneously the values of two quantitative variables and to aid in judging the strength of any linear or nonlinear association between them. Relationships between a quantitative variable and a categorical variable usually can be portrayed effectively by creating multiple displays of the quantitative data, one for each category, and placing them adjacent to one another on a common scale. Similar strategies may be deployed in cases involving two categorical variables.
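
As a rough sketch of how a quantitative variable might be compared across categories, the Python example below computes the five numbers behind a boxplot (minimum, first quartile, median, third quartile, maximum) separately for each category, so that the summaries can be placed side by side on a common scale. The records and function names are hypothetical, and the quartile rule used here (medians of the lower and upper halves of the sorted data) is only one of several conventions in use; statistical packages may report slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for at least two observations.

    Quartiles are taken as the medians of the lower and upper halves of
    the sorted data, excluding the overall median when n is odd. This is
    an assumed convention; other quartile definitions exist.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


def summaries_by_group(records):
    """Group (category, value) pairs and summarize each category."""
    groups = {}
    for category, value in records:
        groups.setdefault(category, []).append(value)
    return {c: five_number_summary(v) for c, v in groups.items()}


# Hypothetical measurements for two product models, invented for illustration.
records = [
    ("A", 12), ("A", 15), ("A", 11), ("A", 19), ("A", 14),
    ("B", 22), ("B", 18), ("B", 25), ("B", 21),
]

for category, summary in summaries_by_group(records).items():
    print(category, summary)
```

Each tuple corresponds to one box in a set of side-by-side boxplots; because all summaries are in the same units, differences in center and spread between categories can be read directly.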

Displays of multivariate data employ various features to represent additional dimensions. In a scatterplot of two quantitative variables, for example, different colors or symbols may be used to represent a third, categorical variable. For three-dimensional quantitative data, a third axis may be added, and perspective may be used to create the higher-dimensional effect in the display. Alternatively, such data can be effectively represented by contour plots, in which each two-dimensional contour corresponds to a selected value of the third variable. These ideas can be extended using such tools as scatterplot matrices, while dynamic graphics offer an interactive environment to explore high-dimensional data.
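
The layout of a scatterplot matrix can be sketched in a few lines of Python. The example assumes the common convention that the panel in a given row and column plots the column's variable on the horizontal axis and the row's variable on the vertical axis; the variable names and the function name are hypothetical, and the sketch only enumerates the panels rather than drawing them.

```python
from itertools import permutations


def scatterplot_matrix_panels(variables):
    """Map each off-diagonal grid cell (row, col) to the (x, y) pair it plots.

    Assumed convention: x comes from the column's variable and y from
    the row's variable. Diagonal cells are omitted because they would
    plot a variable against itself.
    """
    panels = {}
    for row, col in permutations(range(len(variables)), 2):
        panels[(row, col)] = (variables[col], variables[row])
    return panels


# Hypothetical variable names, invented for illustration.
variables = ["height", "weight", "age"]

for cell, (x, y) in sorted(scatterplot_matrix_panels(variables).items()):
    print(cell, f"x={x}, y={y}")
```

With k variables the matrix contains k(k - 1) off-diagonal panels, so every pair of variables appears twice, once in each orientation.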

The experienced analyst relies on informative plots and numerical summaries of data not only to communicate results but also to check for anomalies in data collection, to identify important associations between variables, and to evaluate assumptions underlying candidate statistical models. Without thorough exploratory analysis, subtle features of the data may be overlooked, with potentially unfortunate consequences for the study’s findings and conclusions.