14  Data Visualization

The purpose of data visualization is to communicate quantitative and qualitative data accurately and in an easily understandable way. Design is not the primary purpose of data visualization: it is used to convey information without distracting from it. A visualization should be useful, appealing, and never misleading.

We discuss data visualization after data summarization because—like summarization—visualization is a means of reducing a large amount of information to the essential elements.

Gelman and Unwin (2013) discuss the difference between statistical graphics and information visualization and the goals of data visualization.

The authors point out an important distinction between presentation graphics and exploratory graphics. Presentation graphics are a small number of graphics prepared for a potentially large audience. They are highly polished and designed. Exploratory graphing is about producing a large number of graphics for an audience of one—yourself. Speed, flexibility, and alternative views of the data are of the essence. During exploratory work, you already have your own undivided attention; attention-grabbing graphical elements are not necessary. But views that stimulate your interest to dig deeper into the story the data are trying to tell are important. On the other hand, when preparing graphics for a presentation to company executives, you do not have their attention yet. Data visualization can be a great assistant in grabbing their attention and telling the story.

Choosing the right amount of information in your visualizations.

14.1 Visual Learning

It is often said that “a picture is worth more than a thousand words”. That does not mean that you should always choose a data visualization over a tabular data summary or over a descriptive statement. To choose the best medium for communicating information, you need to understand what medium works best for the audience. Tabular and graphical displays complement each other, or in the words of Gelman and Unwin (2013):

a picture may be worth a thousand words, but a picture plus 1000 words is more valuable than two pictures or 2000 words.

According to the VARK theory, there are four predominant learning styles: Visual, Auditory (aural), Read/Write, and Kinesthetic (tactile). Kinesthetic learners learn best through experience and interactions—they are hands-on learners. Read/Write learners prefer text as input and output; they interact with material through reading and writing. Aural learners prefer to retain information via listening—lectures, podcasts, discussions, talking out loud. Visual learners, finally, retain information best when it is presented in the form of graphs, figures, images, charts, photos, and videos.

Most humans are visual learners; one estimate claims that 65% of us learn best through visual means. Visuals add speed to communication. If the mode of communication matches the preferred learning style, information is retained more easily, the learning process is more engaging and fun, and the memory created is stronger. As a visual learner you are more likely to recall a graphic than a paragraph of text.

Why is that?

The answer lies in the way in which we process information in our sub-conscious and conscious minds. The sub-conscious is uncontrolled, always on, and effortless—it is your brain on autopilot. The conscious mind is where the hard work happens; it requires effort to engage.

We can take in much more information through sight than through any other sense. The amount of information processed in the conscious mind is much lower compared to the sub-conscious mind for any of the senses, as shown in the following figure. That means the sub-conscious acts as a filter for information, passing to the conscious mind that which needs more in-depth processing and engagement. The combination of processing power and bandwidth is why sight is most suited for understanding data.

14.2 Twenty Apple Trees

Visualization is key to describing data in data analytics. Tables are how computers process information; they are not how we process information best. However, do not discount showing numbers in tables: whether for raw data or summarized data, tabular displays have their place. Let’s look at an example.

Table 14.1 displays the diameters in inches of 20 apples from 2 trees over six measurement periods. The measurements are spaced 2 weeks apart and were collected at the Winchester Agricultural Experiment Station of Virginia Tech. The total data set contains 80 apples from 10 trees.

Table 14.1: Diameters in inches of twenty apples from two trees over six two-week measurement periods.
Tree Apple Period 1 Period 2 Period 3 Period 4 Period 5 Period 6
1 1 2.90 2.90 2.90 2.93 2.94 2.94
2 2.86 2.90 2.93 2.96 2.99 3.01
3 2.75 2.78 2.80 2.82 2.82 2.84
4 2.81 2.84 2.88 2.92 2.92 2.95
5 2.75 2.78 2.80 2.82 2.83 2.90
6 2.92 2.96 2.96 3.02 3.02 3.04
7 3.08
8 3.04 3.10 3.11 3.15 3.18 3.21
9 2.78 2.82 2.83 2.86 2.87
10 2.76 2.78 2.82 2.85 2.86 2.87
11 2.79 2.86 2.88 2.93 2.95 3.98
12 2.76 2.81 2.82 2.86 2.90 2.90
2 1 2.84 2.89 2.92 2.93 2.95
2 2.75 2.80 2.82 2.84 2.86 2.86
3 2.78 2.81 2.84 2.85 2.87 2.90
4 2.84 2.86 2.86
5 2.83 2.88 2.89 2.92 2.93 2.93
6 2.80 2.86 2.89 2.92 2.93 2.95
7 2.86 2.89 2.92 2.96 2.96 2.99
8 2.75 2.80 2.83 2.85 2.86 2.88

If we were to display the entire data in a table, it would use four times as much space and it would be difficult to comprehend the data—to see what is going on. However, the tabular display is useful in some respects:

  • The exact values are shown.

  • We see that there are missing values for some apples. For example, apple #14 on tree 1 has only one diameter measurement, at the first occasion. Apple #15 on tree 2 has three measurements.

  • Once measurements are missing, they remain missing, at least for the apples displayed in the table. That suggests that apples dropped out of the study: perhaps they fell to the ground, were eaten, or were harvested.

  • Apple identifiers are not unique across the trees. Apples with id 11, 15, and 17 appear on tree #1 and on tree #2. That is important information; if we want to calculate summary statistics for apples, we must also take tree numbers into account. The technical term for this arrangement is that apple ids are nested within the tree ids.

Other aspects of the data that are difficult to ascertain from the table:

  • Apple diameters should not decrease over time. You need to scan every row to check for possible violations; they would suggest measurement errors.

  • What do the trends over time look like? Are there significant changes in apple diameter over the 12-week period? If so, what is the shape of the trend?

  • Do apples grow at similar rates on the different trees?

  • What does the data from the other trees look like?

  • How many apples were measured on each tree?

We can see trends much better than we can read trends.

Figure 14.2 shows a trellis plot of the diameters of all eighty apples over the twelve-week study period. This type of graph is also called a lattice plot or a conditional plot. The display is arranged by one or more conditioning variables, and a separate plot is generated for the data associated with each value of the conditioning variables. Here, the trellis plot is conditioned on the tree number, producing ten scatter plots of apple diameters, one for each tree. The plots are not unrelated, however. They share the same \(x\)-axis and \(y\)-axis to facilitate comparisons across the plots.
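If you want to produce a trellis display like this yourself, here is a minimal sketch using seaborn’s relplot(). It assumes the apple data have been loaded into a Polars DataFrame named apples with columns Tree, appleid, measurement, and diameter, as in the query shown later in this section; the panel layout (col_wrap, panel size) is a choice, not a requirement.

import seaborn as sns
import matplotlib.pyplot as plt

# One panel (cell) per tree; all panels share the same x- and y-axes
g = sns.relplot(
    data=apples.to_pandas(),          # convert to Pandas for broad seaborn compatibility
    x="measurement", y="diameter",
    col="Tree", col_wrap=5,           # ten trees arranged in two rows of five panels
    height=2, aspect=1
)
g.set_axis_labels("Measurement occasion", "Diameter (in)")
plt.show()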

From the trellis plot we can easily see information that is difficult to read from the tabular display:

  • The varying number of apples per tree

  • There is a definite trend in apple growth over time, and it appears to be linear. Every two weeks each apple seems to grow by a steady amount; the amount differs between apples and trees.

  • There is variability between the trees and variability among the apples from the same tree. The apple-to-apple variability is small on tree #2, and it is larger on trees #9 and #10, for example.

  • All eighty apples are shown in a compact display.

  • The smallest diameter seems to be around 2.8 inches. In fact, these apples are a subset of a larger sample of apples, limited to fruit with a diameter of at least 2.75 inches at the initial measurement.

  • The measurements are evenly spaced.

The graph is helpful to see patterns: trends, groupings, variability.

It does make consuming some information more difficult. For example, we lose track of the actual diameter measurements. We see which measurements are larger and smaller (the pattern), but not their actual values. Showing the actual values to two decimal places is not the purpose of the graph. If we want to know the diameter of apple id 4 on tree #7 at time 5, we can query the data to retrieve the exact value.

import duckdb
import polars as pl

# Connect to the course database and load the apples table into a Polars DataFrame
con = duckdb.connect(database="../ads5064.ddb")
apples = con.sql("SELECT * FROM apples").pl()

# Exact diameter of apple id 4 on tree #7 at measurement occasion 5
apples.filter((pl.col("Tree")==7) & (pl.col("appleid")==4) & (pl.col("measurement")==5))
shape: (1, 4)
Tree appleid measurement diameter
i64 i64 i64 f64
7 4 5 2.92

We could add labels to the data symbols in the graph, or even replace the circles with labels that show the actual value, but this would make the graph really busy and messy to read.

We also have lost information about which data point belongs to which apple. Since diameters grow over time, our brain naturally interpolates the sequence of dots. For the largest measurements on, say, tree #7 and tree #10, we are easily convinced that the dots belong to the same apple. We cannot be as confident when data points group together more closely, and we cannot be absolutely certain that the largest observations on tree #7 and tree #10 belong to the same apple. And without further identifying information, we do not know which apple that is.

There are other ways in which the graphic can be enhanced or improved:

  • Add trends over time for individual apples and/or trees. That can help identify a model and show the variability between and within trees.

  • Add the word “Tree” to the trellis labels if it is not clear what the numbers in the grey bars refer to. A downside is that repeating the word “Tree” ten times does not add new information beyond the first cell of the plot.

  • Vary the plotting symbols or colors within a cell of the lattice to show which data points belong to the same apple. (Is it better to vary colors or symbols or both?)

  • Add an apple id, maybe in the right margin of each cell, aligned with the last measurement. (Would that work if an apple is not observed at the last measurement?)

The table and the trellis graph display the raw data. We can also choose to work with summaries. Suppose we are interested in understanding the distribution of apple diameters by measurement time. The following statements compute several descriptive statistics from the diameters at each measurement occasion.

q = (
    apples.lazy()
    .filter(pl.col("diameter").is_not_null())   # keep only non-missing diameters
    .group_by("measurement")
    .agg(
        pl.col("diameter").count().alias('count'),
        pl.col("diameter").mean().alias('mean'),
        pl.col("diameter").min().alias('min'),
        pl.col("diameter").quantile(0.25).alias('q1'),
        pl.col("diameter").quantile(0.5).alias('median'),
        pl.col("diameter").quantile(0.75).alias('q3'),
        pl.col("diameter").max().alias('max')
        )
    .sort("measurement")
)

q.collect()
shape: (6, 8)
measurement count mean min q1 median q3 max
i64 u32 f64 f64 f64 f64 f64 f64
The output shows that the number of observations contributing at each measurement time decreases; this is expected as apples are lost during the study. The location statistics (mean, min, max, median, Q1, and Q3) all increase over time, showing that the average apple grows.

A visual display that conveys the same information, and shows the distributions as well as the trend over time more clearly, is an arrangement of box plots by measurement time.
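A minimal sketch of such a display, assuming the apples DataFrame loaded earlier in this section, uses seaborn’s boxplot():

import seaborn as sns
import matplotlib.pyplot as plt

# One box plot of apple diameters per measurement occasion
ax = sns.boxplot(data=apples.to_pandas(), x="measurement", y="diameter")
ax.set_xlabel("Measurement occasion")
ax.set_ylabel("Diameter (in)")
plt.show()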

The grey box in the center of the box plot extends from the first to the third quartile; its extent is the inter-quartile range (IQR = \(Q_3 - Q_1\)). The box thus covers the central 50% of the distribution. The median is shown as the mid line of the box. The upper and lower extensions of the box are called the whiskers of the box plot. They extend to the largest and smallest observations, respectively, that are within 1.5 times the IQR from the edge of the box. If there are no dots plotted beyond the whiskers, the whiskers fall on the max and min, respectively. “Outliers” are shown as dots above and below the whiskers. In the apple data, they appear only at the upper end of the measurements. These outliers are not incorrect observations; they are simply unusual given the probability distribution. If you collect enough samples from any distribution, you expect to draw a certain number of values from the tails of the distribution.

The box plot does not reveal the exact values of the points plotted, but it allows us to see the pattern in the data more easily than the tabular display of descriptive statistics. Except for outlier information, the tabular display contains the same information as the box plots.

Many decisions go into creating a good data visualization. What should we pay attention to? How do we choose a good plot type? How much embellishment of the graph is too little or too much?

14.3 Process and Best Practices

The data visualization process starts with defining the intention of the visualization:

  • What is the context for the graphic? Are you in exploratory mode to learn about the data, or are you in presentation mode to tell the story of the data?

  • What is the statistical purpose of the visualization? Are you in a classification, regression, or clustering application? Are you describing the distributions and relationships in the data or the result of training a model?

  • What is the data to be visualized? Is it raw data or summarized data or data derived from a model?

The next step in the process is to examine candidate visualizations and to select a concept. Libraries of graphics examples come in handy at this step. Here are some sites that show worked graphics examples:

  • R Graph Gallery: an extensive collection of graphics in R using base R graphing functions and ggplot2.

  • Python Graph Gallery: an extensive collection of graphics in Python organized like the R gallery. Also offers tutorials for Matplotlib, Seaborn, Plotly, and Pandas.

  • MakeoverMonday: this is a social data visualization project in the UK that posts a new data set every Monday and invites users to visualize the data. Several visualizations are chosen and displayed in the gallery and the blog each week. The site is also a great resource for data sets.

  • Data Viz Project: a collection of data visualizations to get inspired and find the right visualization for your data. The project is run by Ferdio, an infographic and data viz company in Copenhagen.

  • Dataviz Inspiration: Hundreds of impactful data visualization projects from the person behind the R Graph Gallery. With code where available.

  • Information is Beautiful: great visualizations of good news, positive trends, and uplifting statistics.

  • Datylon’s Inspiration: Datylon is a data visualization platform, this page shows visualizations by category created with Datylon.

The next step is to implement the chosen design and to perform the Trifecta checkup discussed in more detail below. Finally, it is a good idea to test the visualization with others and to get constructive feedback on what works and what does not. Does the visualization achieve the goals set out in the first step of the process? What attracts the audience, what distracts the audience?

Pre-attentive attributes

Pre-attentive attributes are attributes such as length, width, size, shape, etc., that our brain processes almost instantaneously and with little effort.

Visual processing from light entering to comprehension. Source.

When information (light) enters the retina, it is processed in iconic memory, a short-term buffer and processor that maintains an image of the world and enriches the information. This is where pre-attentive attributes are processed, rapidly and in a massively parallel fashion, in the sub-conscious mind. The visual information is then passed on to visual working memory, short-term storage that is combined with information from long-term memory for conscious processing.

Effective data visualizations encode as much information as possible through pre-attentive attributes.

To encode quantitative values, length and position (in 2-dimensional space) are best as they are naturally interpreted as quantitative. Lines of different lengths are more easily interpreted as smaller and larger values than lines of different widths. Shapes are not useful as a pre-attentive attribute to encode quantitative values. Is a circle larger than a square? That takes more conscious processing to answer and slows down comprehension of the visualization.

On the other hand, if we want to identify groups of data points that belong together, then shape or color are very effective pre-attentive attributes, as well as proximity and connecting points.

You cannot infer actual values from pre-attentive attributes such as length or size; we only get a greater/lesser impression. Most of the time that is sufficient. Information about actual values has to be added through text, labels, and grid lines. These are not pre-attentive attributes but learned symbols that require conscious processing. The effort to comprehend a visualization increases with the addition of non-pre-attentive attributes. You should weigh whether the additional processing effort is justified relative to the information gained. A labeled axis requires fewer annotations and less mental processing than labeling every data point with its actual value.

If your visualization is used in a context where comparisons are required, the choice of attributes and features determines whether comparisons are more accurate or more generic. The most accurate comparisons are possible using 2-dimensional position and length attributes. Color intensity, color hue, area, and volume allow more generic comparisons but are not useful when accuracy is required.

A good example of the importance of pre-attentive attributes is displaying quantitative information in pie charts versus bar charts.

The pie chart in Figure 14.5 displays five values and uses area to compare and color to differentiate. Can you tell from the chart whether foo is larger than bar? How does ipsum compare to bar and lorem?

Figure 14.5: A pie chart displaying five values.

Since it is difficult to make accurate comparisons based on area, why not add labels to the chart? While we are at it, we can also dress up the display by using more color and 3-dimensional plotting.

A 3-dimensional pie chart. Yikes.

Adding percentages to the labels allows us to compare the values and conclude which slices of the pie are larger and which slices are smaller. But if we show the percentages, then why use a graphic in the first place? By using labels to show values the chart requires as much cognitive engagement as a table of values:

The information from the pie chart as a tabular display.
Category foo bar baz lorem ipsum
Percentage 20 24 8 32 16

The more vibrant colors do not add to the comprehension of the data and the 3-dimensional display makes things worse: comparing values based on volume is more difficult than comparing values based on area which in turn is more difficult than comparing values based on length.

Quantitative values can be encoded as pre-attentive attributes, and comparisons are most accurate for lengths and 2-dimensional positions. Figure 14.6 visualizes the data using pre-attentive attributes. The values are easily distinguished based on the length (height) of the bars. The categories have been ordered by magnitude. No color is necessary to distinguish the categories; the labels are sufficient.

Figure 14.6: A nice bar chart using the pre-attentive attribute length to convey differences.
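For reference, here is a minimal sketch of how a bar chart like Figure 14.6 could be drawn with matplotlib, using the percentages from the table above; the ordering and the single muted color follow the figure, while the remaining styling choices are assumptions.

import matplotlib.pyplot as plt

# Values from the pie chart example, ordered from largest to smallest
categories  = ["lorem", "bar", "foo", "ipsum", "baz"]
percentages = [32, 24, 20, 16, 8]

fig, ax = plt.subplots()
ax.bar(categories, percentages, color="grey")   # length encodes the value; no color coding needed
ax.set_ylabel("Percentage")
plt.show()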

Chartjunk

The term chartjunk was coined by Edward Tufte in his influential (cult) 1983 book “The Visual Display of Quantitative Information” (Tufte 1983, 2001). Chartjunk refers to the elements of a data visualization that are not necessary to comprehend the information. Chartjunk distracts from the information the graph wants to convey. Tufte, who taught at Princeton together with John W. Tukey, subscribed to minimalist design: if it does not add anything to the interpretation of the data, do not add it to the chart.

The excessive use of colors, heavy grid lines, ornamental shadings, unnecessary color gradients, excessive text, background images, excessive annotations and decorations are examples of chartjunk. Not all graphical elements are chartjunk—you should ask yourself before adding an element to a graphic: is it helpful? Text annotations can be extremely helpful, but overdoing it can lead to busy charts that are not intelligible. By adding too many text labels, the label avoidance algorithm of graphing software might place labels in areas of the graph where they are misleading. If grid lines are not necessary for the interpretation of the graphic, leave them off. If grid lines are helpful, add them in a color or with transparency that does not distract from the data in the graph.

Unfortunately, it is all too easy to add colors, styles, annotations, grid lines, inserts, legends, etc. to graphics. Software makes it easy to overdo it.

Tufte’s war on chartjunk needs to be moderated for our purposes. The minimalist view that anything unnecessary to comprehend the information is junk goes too far. We must keep the purpose of the data visualization in mind. In exploratory mode, you generate lots of graphics and different views of the data to learn about the data, find patterns, and stimulate thought—the audience is you. Naturally, we eschew adding too many elements to graphics; everything is about speed and flexibility, and polish takes time. In presentation mode, the data visualization needs to grab attention and open the door for the audience to engage with the data. Annotations, colors, labels, headlines, and titles, which would be chartjunk in exploratory mode, have a different role and place in presentation mode. They might not be necessary to comprehend the information but can reduce the cognitive burden the audience members have to expend.

Tufte also introduced the concept of the data-ink ratio: a visualization should maximize the amount of ink it uses on displaying data and minimize the amount of ink used on embellishments and annotations.
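Tufte defines the ratio as

\[
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}},
\]

a quantity that minimalist design pushes toward 1.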

Figure 14.7 is a junkified version of the bar chart in Figure 14.6—it is full of chartjunk and devotes too much ink to things other than data:

  • The legend is not necessary; categories are identified through the labels.
  • The axis label “Category” is not necessary, it is clear what is being displayed.
  • The vertical orientation of the axis label and the horizontal orientation of the categories is visually distracting.
  • Varying the colors of the bars is distracting and does not add new information. The relevant information for comparison is the length of the bars.
  • The grid lines are intrusive and add too much ink to the plot.
Figure 14.7: A junkified bar chart.

The Trifecta checkup

The data visualization expert Kaiser Fung created a framework for critiquing data visualizations. It is recommended that you run your visualizations through this checkup. It rests on three simple investigations:

  1. What question are we trying to answer (Q)?

  2. What do the data say (D)?

  3. What does the visualization say (V)?

Hopefully, the answers to the three lines of inquiry are the same. The framework is arranged in a Question—Data—Visualization triangle, in what Fung calls the junk charts trifecta:

Figure 14.8: The Junk Charts Trifecta Checkup according to Kaiser Fung. Source.

A good visualization scores on all three dimensions, Q—D—V. It poses an interesting question, uses quality data that are relevant to the question, and visualization techniques that follow best practices and bring out the relevant information (without adding chartjunk). Tufte’s concerns about chartjunk and maximizing the data-ink ratio fall mostly in the V corner of the trifecta. But even the best visualization techniques are useless when applied to bad or irrelevant data or attempting to answer an irrelevant or ill-formed question.

An example of a graphic that fails on all three dimensions is discussed by Fung here and shown below. This graphic appeared in the Wall Street Journal.

Figure 14.9: Citi Bike riders.
  1. Q—What question are we trying to answer? How many riders use Citi Bike during a weekday? That is not a very interesting question, unless you are a city planner. Even then, you are more interested in when and where the bikes are used, rather than in some overall number. The chart breaks the daily usage down over time. We see that most rides occur in the morning and early evening—going to work and leaving work. That too is not very interesting and not at all surprising.

  2. D—What do the data say? The data were collected on two days in the fall. What does this represent? Certainly not the average usage over the year. How were those two days selected? Where was the data collected? Randomly throughout the city, only downtown, in certain districts? What was sampled? The days of the months? The riders? The city districts?

  3. V—What does the visualization say? Does the graphic answer the question using best practices for data visualization? There is much going wrong here.

    • The city background is chartjunk, an unnecessary embellishment that does not add any information.

    • Similarly, the bicycle icon is unnecessary. It might be moderately helpful in clarifying that “Bike” refers to bicycle and not motorbike, but that could be made clear without adding a graphics element.

    • The blue dots and the connecting lines are a real problem. How are the connections between the dots drawn? Are the segments (some curved, some jagged) based on observations taken at those times? If so, the data should be displayed. If not, what justifies connecting the dots in an irregular way?

    • The scale of the data is misleading. If there were a vertical axis with labels, one could clearly see that the dots are not plotted on an even scale. The points labeled 65 and 166.5 differ by about 100 riders, yet they are drawn farther apart vertically than the points labeled 166.5 and 366.5, which differ by about 200 riders. Once we discover this, we can no longer trust the vertical placement of the dots. Instead, we must do mental arithmetic to interpret the data by value. Displaying the data in a table would have had the same effect.


Figure 14.10 is a screenshot from a local TV newscast in Virginia; it shows a bar chart of the five most active years in terms of number of tornadoes. Does this graphic pass the trifecta checkup?

  1. Q—What question are we trying to answer? Is 2024 an unusual year in terms of tornado activity?

  2. D—What do the data say? 2024 is among the top-5 years of tornado activity. But we do not know whether this is for the entire U.S., the entire world, or for the state of Virginia. It is probably not the latter; 1,000 reported tornadoes per year in Virginia is a bit much. How far along are we in 2024 when this was reported? 2024 might not be an unusual year compared to the top-5 if the report was issued at the end of tornado season; 2024 might be an outlier if the report was issued early in the tornado season.

  3. V—What does the visualization say? There is a lot going on in the visualization. The bars are ordered from high to low, which disrupts the ordering by time. The eye is naturally drawn to the year labels at the bottom of the bars and has to work overtime to make sense of the chronology. When data are presented in a temporal context our brain wants the data arranged by time (Figure 14.11). The asterisk near the number 1,109 for 2024 leads nowhere. We associate it with the text in the bar for that year anyway, so the asterisk is not necessary. The visualization does not tell us what geographic region the numbers belong to. Is this global data or U.S. data?

Figure 14.11: Tornado frequencies arranged chronologically.

The problem with using a bar chart for select years only is that it does not give an accurate impression of how far apart the data points are in time. The first two bars are 3 years apart, the next two bars are 6 years apart. Figure 14.12 fixes that issue by spacing the bars appropriately.

Figure 14.12: Tornado frequencies arranged chronologically and spaced correctly.
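One way to achieve that spacing is to place the bars at their numeric year positions instead of at evenly spaced categorical positions. The sketch below illustrates the idea; the years and counts are placeholders, not the values from the newscast.

import matplotlib.pyplot as plt

# Placeholder years and counts for illustration only -- not the newscast data
years  = [2001, 2004, 2010, 2017, 2024]
counts = [980, 1020, 950, 1005, 1100]

fig, ax = plt.subplots()
ax.bar(years, counts)                 # numeric x-axis: gaps between bars reflect elapsed time
ax.set_xlabel("Year")
ax.set_ylabel("Number of tornadoes")
plt.show()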

Infographics

Infographics are often guilty of adding extraneous information or displaying data in sub-optimal (nonsensical) ways. Below is another example from Junk Charts. The graphic visualizes the 2022 oil production measured in 1,000 barrels per day by country. The data are projected onto a barrel. Countries are grouped by geographic region and by an industry-specific classification into OPEC, non-OPEC, and OPEC+ countries. The geographic regions are delineated on the barrel surface with thick white lines. Thin white lines mark polygons associated with each country. Aggregations by geographic region are shown below the barrel. The industry-specific categorization is displayed as colored rings around the country flag.

Shapes are not a good pre-attentive attribute to display values, and polygons are particularly difficult to comprehend. Presumably, the size of the country polygons is proportional to their oil production—but who knows, there is no way of validating this. The use of polygons increases the cognitive burden to comprehend the visualization.

The visualization contains duplicated information:

  • Each polygon is labeled with the countries’ oil production; this information duplication is required to make sense of the data because the polygon area is difficult to interpret.

  • Greater/lesser oil production by country is also displayed through the size of the map inserts and the font (boldness and font size) of the country names.

  • The country information is duplicated unnecessarily. Countries are shown by name and with their flags. Some country names are abbreviated, and this adds extra mental processing to identify the country. If you are not familiar with working with country codes, identifying the oil production for Angola is tricky (AGO).

  • The geographic summaries display totals as labels and graphically by stacking barrel symbols; each barrel corresponds to 1,000 barrels of oil produced per day.

14.4 Choosing the Data Visualization

Andrew Abela published the Chart Chooser, a great visualization to help select the appropriate data visualization based on the data type and the goal of the visualization (Figure 14.14, Abela (2020)). An online version with templates for PowerPoint and Excel is available here.

To select a chart type, start in the center of the chooser with the purpose of the visualization. Do you want to compare items or variables? Do you want to see the distribution of one or more variables? Do you want to show the relationship between variables or how totals disaggregate?

The next two figures show adaptations of the Chart Chooser for continuous and discrete target variables by Prof. Van Mullekom (Virginia Tech).

Suppose you wish to display the monthly temperatures in four cities, Chicago, Houston, San Diego, and Death Valley. The goal is to compare the temperature throughout the year between the cities. According to the chart chooser for continuous target variables, a paneled scatterplot or overlaid time series plot could be useful. Since the data are cyclical, we could also consider a cyclical chart type.

A sub-optimal chart type, possibly inspired by plotting temperature, would be a “heat” map. Heat maps are used to display the values of a target variable across the values of two other variables, using color-type attributes (color gradient, transparency, …) to distinguish values. Here, temperature is the target variable, displayed across city and month.

Figure 14.17: Heat map of temperatures in four cities.

You can think of a heat map as the visualization of a matrix; the row-column grid defines the cells, and the color depends on the values in the cells. When properly executed, the heat map reveals patterns between the variables, such as hot spots. Correlation matrices are good examples for the use of heat maps. Also, when you are plotting large data sets, binning the data and using a heat map can reveal patterns while limiting the amount of memory needed to generate the graph. An example is a residual plot for a model with millions of data points.
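As a quick illustration of a heat map put to good use, here is a sketch of a correlation-matrix heat map for the numeric columns of the penguins data (the same data set used in the seaborn section below); the color map and annotation choices are assumptions, and the numeric_only option requires a recent version of Pandas.

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations of the numeric penguin measurements, shown as a heat map
penguins = sns.load_dataset("penguins")
corr = penguins.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()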

The problems with using the heat map in this example are:

  • There is no specific ordering between Chicago, San Diego, Houston, and Death Valley. The cities on the vertical axis are not arranged from North to South either. Death Valley Junction is further north than San Diego and Houston; Houston is the southern-most city of the four. Heat maps are best used when the vertical and horizontal axes can be interpreted in a greater/lesser sense or when both axes refer to the same categories.

  • The cyclical nature of the year is somewhat lost in the display. January connects to December in the same way that it connects to February.

  • Colors are not a good pre-attentive attribute for value comparisons. It is clear that the summer months are hotter in Death Valley than in the other cities, but how much hotter?

  • A lot of ink is spent on coloring the squares.

A simpler—and more informative—display of the same data is shown below: a line chart of temperature by month, with a separate line for each location. The differences between the cities are easier to see. The cyclical nature of the data is hinted at through the \(x\)-axis labels—the axis begins and ends with January. The grid lines help to identify the actual temperature values.

Figure 14.18: Line chart for temperatures in four cities.
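A sketch of how such a line chart could be built with seaborn follows. The monthly values below are rough, illustrative numbers, not the data behind Figure 14.18.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Rough illustrative monthly temperatures (deg F) -- placeholders, not the plotted data
temps = pd.DataFrame({
    "month": list(range(1, 13)) * 4,
    "city" : ["Chicago"] * 12 + ["Houston"] * 12 + ["San Diego"] * 12 + ["Death Valley"] * 12,
    "temp" : [31, 35, 46, 59, 70, 80, 84, 82, 75, 62, 48, 35] +
             [63, 67, 73, 80, 87, 92, 94, 95, 90, 82, 72, 64] +
             [65, 65, 66, 68, 69, 72, 75, 77, 76, 73, 69, 65] +
             [67, 73, 82, 90, 100, 110, 116, 114, 106, 92, 77, 65],
})

ax = sns.lineplot(data=temps, x="month", y="temp", hue="city", marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Temperature (F)")
plt.show()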

Data visualization is only one method of information visualization. The interactive periodic table of visualization methods presents methods for visualizing data, information, concepts, strategies, and metaphors in the form of a periodic table. Hover over an element to see an example of the visualization method.

14.5 Data Visualization with Python

A large number of Python tools are available for data visualization. You can find the open-source software tools at PyViz.org. From the PyViz page listing all open-source Python visualization tools, you can see that many of them are built on the same backends, primarily matplotlib, bokeh, and plotly.

Matplotlib

The matplotlib library was one of the first Python visualization libraries and is built on NumPy arrays and the SciPy stack. It pre-dates Pandas and was originally conceived as a Python alternative for MATLAB users; that explains the name and why it has a MATLAB-style API. However, it also has an object-oriented API that is used for complex visualizations.

While you can do anything in matplotlib, it does require a lot of boilerplate code for complex graphics; other libraries provide higher-level APIs to speed up the creation of good data visualizations. Packages such as seaborn are built on matplotlib, so the general vernacular and layout of a seaborn chart are the same as for matplotlib. You can find the extensive matplotlib documentation here.
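For reference, a minimal sketch of the object-oriented API: create a Figure and an Axes explicitly, then call plotting methods on the Axes. The sine curve is just an illustrative choice.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# Object-oriented style: the Figure and Axes are created explicitly and
# all plotting calls go through the Axes object
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
plt.show()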

Seaborn

The seaborn library has a higher-level API built on top of matplotlib and is deeply integrated with Pandas, remedying two of the frequent complaints about matplotlib. Seaborn alone will get you far, but code often calls matplotlib functions. A typical preamble in Python modules using seaborn is thus something like this:

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
import seaborn.objects as so

Seaborn categorizes plotting functions into figure-level and axes-level functions. In matplotlib vernacular, axes-level functions plot data onto a matplotlib.pyplot.Axes object. Figure-level functions such as relplot(), displot(), and catplot() interact with matplotlib through the seaborn FacetGrid object.

The figure-level functions, e.g., displot(), provide an interface to their axes-level functions, e.g., histplot(), and each figure-level module has a default axes-level function (histplot() in the case of displot()).

import seaborn as sns
import seaborn.objects as so

penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack")

The figure-level histogram is created by calling the displot() function. You can explicitly ask for histograms with kind="hist", or let the function default to producing histograms.

sns.displot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack", kind="hist")

A side-effect of using a figure-level function is that the figure owns the canvas. The legend is placed outside of the chart of the figure-level function and inside the chart of the axes-level function.

One advantage of figure-level functions is that they can create subplots easily. Dropping multiple="stack" and adding col="species" produces:

sns.displot(data=penguins, x="flipper_length_mm", hue="species", col="species")

Removing multiple="stack" from the axes-level chart produces three overlaid histograms that are difficult to interpret:

sns.histplot(data=penguins, x="flipper_length_mm", hue="species")

Plotly

Plotly is a Python library for interactive graphics, based on the d3.js JavaScript library. The makers of plotly also offer a commercial analytics platform built around their Dash framework, but plotly itself is open source and free to use.

To install plotly, assuming you are using pip to manage Python packages, simply run

pip install plotly

On my system I also had to

pip install --upgrade nbformat

and restart VSCode after the upgrade. You will know that this step is necessary when fig.show() throws an error about requiring a more recent version of nbformat than is installed.

The plotly library has two APIs, plotly graph objects and plotly express. The express API allows you to generate interactive graphics quickly with minimal code.

The southern_oscillation table contains monthly measurements of the southern oscillation index (SOI) from 1951 until today. The SOI is a standardized index based on sea-level pressures between Tahiti and Darwin, Australia. Although the two locations are nearly 5,000 miles apart, that pressure difference corresponds well to changes in ocean temperatures and coincides with El Niño and La Niña weather patterns. Prolonged periods of negative SOI values coincide with abnormally warm ocean waters typical of El Niño. Prolonged positive SOI values correspond to La Niña periods.

The following statements load the SOI data from the DuckDB database into a Pandas DataFrame and use the express API of plotly to produce a box plot of SOI values for each year.

import pandas as pd
import plotly.express as px

so = con.sql("SELECT * FROM southern_oscillation").df()
fig = px.box(x=so["year"],y=so["soi"])
fig.show() 

There is a lot happening with just one line of code. The graphic produced by fig.show() is interactive. Hovering over a box reveals the statistics from which the box was constructed. The buttons near the top of the graphic enable you to zoom in and out of the graphic, pan the view, and export it as a .png file.

The graph object API of plotly is more detailed, requiring a bit more code, but giving more control. With this API you initiate a visualization with go.Figure(), and update it with update_layout(). The following statements recreate the series of box plots above using plotly graph objects.

import plotly.graph_objects as go

fig = go.Figure(data=[go.Box(x=so["year"],y=so["soi"])])
fig.update_layout(title="Box Plots of SOI by Year",
                  xaxis_title="Year",
                  yaxis_title="SOI")
fig.show()

Next is a neat visualization that you do not see every day. A parallel coordinate plot represents each row of a data frame as a line that connects the values of the observation across multiple variables. The following statements produce this plot across the sepal and petal measurements of the Iris data. The three species are identified in the plot through colors, which requires a numeric value. Species 1 corresponds to I. setosa, species 2 corresponds to I. versicolor, and species 3 to I. virginica.

from functools import reduce
iris = con.sql("SELECT * FROM iris").df()

# recode species as numeric so it can be used as a value for color
unique_list = reduce(lambda l, x: l + [x] if x not in l else l, iris["species"], [])
res = [unique_list.index(i) for i in iris["species"]] 
colors = [x + 1 for x in res]

fig = px.parallel_coordinates(
    iris, 
    color=colors, 
    labels={"color"       : "Species",
            "sepal_width" : "Sepal Width", 
            "sepal_length": "Sepal Length", 
            "petal_width" : "Petal Width", 
            "petal_length": "Petal Length", },
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=2)

fig.update_layout(coloraxis_showscale=False)
fig.show()

The parallel coordinates plot shows that the petal measurements for I. setosa are smaller than for the other species and that I. setosa has fairly wide sepals compared to the other species. If you wish to classify iris species based on flower measurements, petal length and petal width seem like excellent candidates. Sepal measurements, on the other hand, are not as differentiating between the species.

Vega-Altair

Vega-Altair is a declarative visualization package for Python and is built on the Vega-Lite grammar. The key concept is to declare links between data columns and visual encoding channels such as the axes and colors. The library attempts to handle many things automatically, for example, deciding chart types based on column types in data frames. The following example is from the Vega-Altair documentation:

import altair as alt

from vega_datasets import data
cars = data.cars()

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
).interactive()

alt.Chart(cars) starts the visualization. mark_point() instructs to display points encoded as follows: Horsepower on the \(x\)-axis, Miles_per_Gallon on the \(y\)-axis and the color of the points associated with the Origin variable.

Pandas vs Polars

Matplotlib, Seaborn, Plotly, ggplot, and Altair (since v5+) can work with Polars DataFrames out-of-the-box. If you are running into problems passing a Polars DataFrame to a visualization routine that works fine with a Pandas DataFrame you can always convert using the .to_pandas() function. For example, the parallel coordinates plot in plotly express works with a Pandas DataFrame but generates an AttributeError with a Polars DataFrame. Using the .to_pandas() method took care of the problem.

iris = con.sql("SELECT * FROM iris").pl()

fig = px.parallel_coordinates(
    iris.to_pandas(), 
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=2)

fig.update_layout(coloraxis_showscale=False)
fig.show()

Grammar of Graphics (ggplot)

The grammar of graphics was described by statistician Leland Wilkinson and conceptualizes data visualization as a series of layers, in an analogy with linguistic grammar (Wilkinson 2005). Just like a sentence has a subject and a predicate, a scientific graph has constituent parts.

The grammar of graphics is helpful because it encourages us to think of a data visualization not in terms of the name of this plot or that chart type, but as a series of elements, depicted as layers.

Each graphic consists of at least the following layers:

  • The data itself

  • The mappings from data attributes to perceptible qualities

  • The geometrical objects that represent the data

In addition, we might apply statistical transformations to the data; we must place all objects in a 2-dimensional space and decide on presentation elements such as fonts and colors. And if the data consist of multiple groups, we need a faceting specification to organize the graphic or the page into multiple groups.

Figure 14.20 shows the layered representation of the grammar of graphics.

Thinking about data visualization in these terms is helpful because we get away from thinking about pie charts and box plots and line charts, and instead about how to organize the basic elements of a visualization.

R programmers are familiar with the grammar of graphics from the ggplot2 package. The grammar of graphics paradigm is implemented in Python in the plotnine library.

The following statements generate a data visualization from data frame df (layer 1). This data frame contains data from 196 observations of the optic nerve head from patients with and without glaucoma. The aes() function defines the aesthetics layer, associating variable eag with the \(x\)-axis and specifying variable Glaucoma as a grouping variable (eag is the global effective area of the optic nerve head measured on a confocal laser image). The geometries and statistics layers are added to the previous layers with the geom_density() function, requesting a kernel density plot. The scale_x_log10() function modifies the data-to-aesthetics mapping by applying a log10 scale to the \(x\)-axis. The result is a grouped density plot on the log10 scale.

from plotnine import ggplot, aes, geom_density, scale_x_log10   

df = con.sql("SELECT eag, Glaucoma from glaucoma").df()

(ggplot(df, aes(x="eag", group="factor(Glaucoma)", fill="factor(Glaucoma)"))
    + geom_density() 
    + scale_x_log10()
)
<Figure Size: (640 x 480)>

ggplot works out of the box with Polars DataFrames:

glauc_pl = con.sql("SELECT eag, Glaucoma from glaucoma").pl()

ggplot(glauc_pl) + aes(x="eag", group="factor(Glaucoma)", fill="factor(Glaucoma)") \
    + geom_density() + scale_x_log10()
<Figure Size: (640 x 480)>

14.6 Snafu or Misinformation?

Data visualizations are powerful tools; they allow the author to focus our attention on information. In choosing a visualization type, filling it with data, and annotating it, you exercise control over what is displayed and how it is interpreted. In 1954, Darrell Huff published one of the most widely sold statistics texts, “How to Lie with Statistics” (Huff 1954). Huff covered such topics as how to introduce bias through sampling, how to cherry-pick statistics (“the well-chosen average”), how to mistake correlation for causation, and how to distort with statistical graphics. He said

Many a statistic is false on its face. It gets by only because the magic of numbers brings about a suspension of common sense.

Examples of poorly designed graphics and misleading graphics abound. The reasons might be malfeasance, chicanery, disingenuity, or incompetence.

Example: Covid-19 Statistics

The following graphic aired on a local TV channel in North Carolina on April 5, 2020. It was the beginning of the Covid-19 pandemic and audiences were keen to hear about the local case counts. Is there anything wrong with the visualization?

Covid-19 cases per day as reported on April 5, 2020 by a local TV station.

The number of daily cases has been more or less steadily increasing since March 18 (33 cases) and two weeks later stands at 376 cases per day.

The placement of the bubbles seems odd. The first bubble, 33 cases, seems further away from the horizontal grid line than, say, the bubble for 112 cases on March 21 is from the grid line at 100 cases. Maybe it is the angle at which the graph is viewed, but 112 (March 21) and 116 (March 22) should be about halfway between 100 and 130; instead, they are drawn closer to the line at 100.

Spending a bit more time with the \(y\)-axis, you notice that the grid lines are evenly spaced, but that the reference labels are not equidistant. The differences between the grid labels are 30, 30, 10, 30, 30, 30, 50, 10, 50, 50, and 50 units. Why would someone do this?

In a blog post entitled “How to Spot Visualization Lies”, Nathan Yau gives numerous examples of how chart construction can mislead. For example, a common device to exaggerate differences between groups is to draw bar charts with a baseline different from zero. The height of the bar is the information conveyed by the bar chart, so bars should always start at zero (Figure 14.21, data from NOAA).

Figure 14.21: Three-month (February–April) tornado occurrence in the U.S. from 1950–2023.

When the baseline of the bar chart is changed, the length of the bars represents the difference from the baseline, and the data need to be presented as such. The coloring of the bars in Figure 14.22 draws extra attention to the fact that we are looking at deviations from a baseline.

Figure 14.22: Three-month (February–April) tornado occurrence in the U.S. from 1950–2023 compared to 1980.
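A sketch of that idea (bars that show deviations from a stated baseline, colored by sign) follows; the counts and the baseline are made-up values for illustration, not the NOAA data.

import numpy as np
import matplotlib.pyplot as plt

# Made-up counts for illustration only -- not the NOAA tornado data
years    = np.arange(2015, 2024)
counts   = np.array([310, 280, 365, 295, 410, 330, 290, 375, 345])
baseline = 300                                   # reference value, stated in the axis label

deviations = counts - baseline
colors = np.where(deviations >= 0, "firebrick", "steelblue")

fig, ax = plt.subplots()
ax.bar(years, deviations, color=colors)
ax.axhline(0, color="black", linewidth=0.8)      # draw the baseline
ax.set_xlabel("Year")
ax.set_ylabel(f"Tornado count relative to baseline of {baseline}")
plt.show()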

Statistics that are false on their face include comparisons that are based on absolute numbers but should be made on a relative scale. The absolute tornado numbers are comparable in Figure 14.21 because they relate to time intervals of the same length, February–April in each year. Someone could raise an objection here, because leap years have an extra day in February compared to non-leap years, so 1/4 of the bars have a basis of 90 days and 3/4 of the bars have a basis of 89 days. OK, sue me. The comparison between the bars is still pretty darn fair.

However, if you compare, say, crime statistics, cancer incidences, service subscribers, etc., between regions or states, then large absolute numbers in large regions are probably not a surprise. There being fewer crimes in Blacksburg, VA than in Chicago does not make Blacksburg a safer place to live. It is a safer place because there are fewer crimes per resident or per 10,000 residents in Blacksburg, VA compared to Chicago.