Data + Science

3/4/2017
Using Treemaps to Visualize Data

In a recent blog post here by my friend Andy Kriebel, Andy talks about Makeover Monday and discusses his views on treemaps and bubble charts. Several people asked on Twitter when these charts could be used, and a few, including Andy, responded "never". I decided to write this post to offer some background and a few scenarios where I think these charts work, specifically the treemap.

What are treemaps

The treemap was invented by Ben Shneiderman. Dr. Shneiderman created the treemap to visualize hierarchical data. He wanted "a compact visualization of directory tree structures", but other more common visualization methods did not work well for this. For example, the folder structure on a computer is a tree structure. The folder "My Documents" might contain "Pictures", "Videos", and "Documents". Those folders might also contain sub-folders and so on. The treemap was a way to visualize this large amount of data, many folders and subfolders, in an efficient way. He writes, "Tree structured node-link diagrams grew too large to be useful, so I explored ways to show a tree in a space-constrained layout." For more information about the history of treemaps see his article Treemaps for space-constrained visualization of hierarchies.

Martin Wattenberg created a slight variation on this, which Dr. Shneiderman referred to as a clustered treemap. This design is what most people today commonly refer to as a treemap.

The treemap orders its segments from largest to smallest and encodes the data using the size of each rectangle. Color is often used to encode additional data or to double-encode the same value.
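To make the size encoding concrete, here is a minimal sketch of a single-level layout. It is a simplified "slice" layout written for illustration, not the clustered layout that most tools use, and the function name `slice_layout` and the sample values are assumptions of mine. It sorts the values from largest to smallest and gives each one a rectangle whose area is proportional to its value.

```python
def slice_layout(values, x=0.0, y=0.0, width=1.0, height=1.0):
    """Lay out one level of a treemap as vertical slices, largest first.

    Returns a list of (label, x, y, w, h) tuples whose areas are
    proportional to the values.
    """
    total = sum(values.values())
    ordered = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    rects = []
    cursor = x  # left edge of the next slice
    for label, value in ordered:
        w = width * value / total  # slice width is the value's share of the total
        rects.append((label, cursor, y, w, height))
        cursor += w
    return rects


# Hypothetical values, for illustration only.
example = {"B": 30, "A": 50, "C": 20}
rects = slice_layout(example, width=100.0, height=10.0)
for label, rx, ry, rw, rh in rects:
    print(label, rx, ry, rw, rh)
```

Real treemap layouts also try to keep the rectangles close to square and nest one level inside another, which this sketch leaves out.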

The problems with treemaps

The biggest problem with treemaps (and bubble charts), as Andy points out in his blog post, is that using size to encode the data makes it impossible to make precise quantitative comparisons vs. using length/height of a bar or position of a dot or line. In other words, bar charts, dot plots and line charts offer a much better way of encoding data for precise comparisons. In addition, now that treemaps are common chart types in many business intelligence tools, they are often being used to show simple categorical comparisons that would be better visualized as a bar chart.
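A small piece of arithmetic illustrates part of the reason. This is a sketch with a hypothetical 10% difference between two values, not a figure from any dataset in this post. A bar grows in direct proportion to the value, while each side of a square grows only with the square root of the value, so the same difference is spread thinner when it is encoded as area.

```python
import math


def side_ratio(value_ratio):
    """Ratio of square side lengths when the areas differ by value_ratio."""
    return math.sqrt(value_ratio)


value_ratio = 1.10  # hypothetical: one value is 10% larger than the other
bar_ratio = value_ratio  # bar length scales directly with the value
square_ratio = side_ratio(value_ratio)  # each side scales with the square root

print(f"bar is {bar_ratio - 1:.1%} longer")
print(f"square side is {square_ratio - 1:.1%} longer")
```

With these assumed numbers the bar is 10% longer, but each side of the square is only about 4.9% longer.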

Let's look at an example. The treemap below is from a Makeover Monday viz here.

Now it's quiz time. See if you can answer the following questions quickly.

What are the top 5 countries?
What are the bottom 5 countries?
What is the selling price in Croatia?
Can you visualize the small difference between Sweden and Romania?

I think you'll find some of these are easier to answer than others. You probably had trouble naming the bottom 5 because, in this treemap, the smallest rectangles have no room for names and amounts. Croatia isn't labeled for the same reason. Sweden and Romania were hard to compare because their rectangles appear to be about the same size, so you had to rely on their order and labels to make that comparison.

Now look at the same data in comparison to a standard bar chart and ask the same questions.

These questions are much easier to answer with the bar chart. Even the small difference in the data for Sweden and Romania can now be seen. So using this example I completely agree with Andy. The bar chart is a much better way to show this comparison.

Hierarchical Data

As noted above, the original intent of the treemap was to visualize hierarchical data. Sometimes there is a need to show comparisons at several levels. In this next treemap example, see if you can answer the following questions.

Which Region has a larger population, Africa or the Americas?
Which Region has the smallest population?
What country in Africa has the largest population?
What are the top 3 most populated countries in the Americas?
What is the third most populated country in Asia?

There are several things to point out. First, notice that you are answering questions at two different levels in the data. You were able to answer questions comparing regions, but also countries inside of regions. This is an important distinction. If creating bar charts to replace this image then you would need several charts. You would need a bar chart comparing regions as well as 7 other bar charts to compare countries within regions to answer these same questions.

In addition, there is a part-to-whole comparison that you can now make, albeit in an estimated way, that you cannot make with bar charts. For example, even without labels on every country, you can see that the country of India with 1.2 billion people is slightly bigger than the entire region of Africa (all of the blue rectangles), which is 1.1 billion. This type of comparison is built in to the treemap, but would require special handling if creating bar chart comparisons across these different levels.
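The cross-level comparison described above, a single country against an entire region, amounts to comparing a leaf with the rolled-up total of another branch. A minimal sketch, assuming the hierarchy is stored as nested dictionaries; the region and country names and all of the values are placeholders, not real population figures:

```python
def node_total(node):
    """Sum all leaf values under a node; a leaf is a plain number."""
    if isinstance(node, dict):
        return sum(node_total(child) for child in node.values())
    return node


# Placeholder hierarchy: region -> country -> value (not real data).
tree = {
    "Region A": {"A1": 40, "A2": 35, "A3": 25},
    "Region B": {"B1": 120, "B2": 30},
}

region_a_total = node_total(tree["Region A"])  # rolled-up value of a whole region
leaf_b1 = tree["Region B"]["B1"]               # a single country in another region
whole = node_total(tree)                       # total across every region

print("B1 larger than all of Region A:", leaf_b1 > region_a_total)
print(f"B1 share of the whole: {leaf_b1 / whole:.0%}")
```

In a treemap these totals are simply the areas of the nested rectangles, so the reader gets the roll-up without any extra chart.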

Let's take another quiz and see if we can answer these questions with the same treemap, this time in the style of Hans Rosling.

Which country in each pair below has the larger population?

Nigeria or Vietnam?
Bangladesh or Germany?
Japan or Italy?
South Africa or Nepal?

As Hans Rosling did in his survey of countries, these countries were picked so that the population for one country is twice the size of the other. I picked South Africa and Nepal to showcase that even without the data label it is still possible to answer the question.

Even though it's possible to answer these questions, you probably noticed that it wasn't easy to find these countries. Your eyes had to search for each of them and then make the comparison. In addition, all of them had the country name listed in the box. I didn't ask about Malaysia, which is not labeled. This highlights another issue with treemaps, which is that in most cases there can't be a label for every box. This means that you have to rely on interactive features of the visualization, like tooltips or highlight functions, to make some of the comparisons, and therefore it becomes less useful when printed or displayed in static form.

Large numbers of categorical comparisons

There will be times when you will need to make categorical comparisons with a large number of categories, for example data for 50 states, 196 countries, or 3,100+ counties in the US. Bar charts are not useful in this situation. One solution is to show the top n or bottom n from the data. However, this is only a small subset of the entire data.

One good example of a treemap with 50 states is the electoral college. This particular dataset can be very difficult to visualize. It is often visualized on a choropleth map (shaded map), which distorts the comparison because of the size of the land in the different states. Others have tried visualizing it with cartograms, which distort the shape of the country, making it barely recognizable (here are some variations by the Financial Times). My co-author, Steve Wexler, for our upcoming book the Big Book of Dashboards, wrote a great blog post on this very topic. In his post, In Praise of Treemaps, he used a treemap to compare the electoral votes for the political parties for the 2012 election. It's very easy to see which party has over 50% of the electoral votes. Knowing that locating states in the treemap might be difficult, Steve added the list of the states on the right, making it easy for a user to find any state or set of states and highlight them.

Notice that the blue is more than half of the entire treemap. This is easy to see, but he also supplemented this with a bar chart underneath so the reader could make a precise comparison.

Even when the data isn't hierarchical a treemap can still be useful. In the next example I show 607 companies in the Consumer Financial Protection Bureau's complaint database. The treemap highlights that only a few companies make up over half of the complaints. I included a separate bar chart in this visualization for the top 10, not shown, but the treemap and stacked bar chart highlight a few things. First, the top 6 companies represent 60% of the complaints in the database. A precise comparison is not needed to see that Citibank has more complaints than Capital One. The data is actually quadruple encoded, so even if the reader can't easily compare the size of the rectangle, the order, label and color will assist the reader. The interactive version has a tooltip so the data can be seen for the boxes that are not labeled. An additional annotation "584 companies = 14.7%" was added to highlight the number of small companies that are in the bottom right corner of the treemap.

I'm not advocating that this is the only solution. In fact, a bar chart could also work well in this situation.

Notice in the bar chart solution that the 584 companies are aggregated into an "Other" category. This may or may not be the best way to visualize this information depending on the purpose of the visualization. For example, it would be impossible to highlight "USAA Bank" because it's aggregated in the "Other" category. This could be very important, especially if you work for USAA Bank and want to see how you rank against the other banks.
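The trade-off can be seen in a few lines of code. This is a minimal sketch, assuming the complaint counts are held in a dictionary; the helper names `top_n_with_other` and `share_of_top`, the company names, and the counts are all hypothetical and are not taken from the CFPB database.

```python
def top_n_with_other(counts, n):
    """Keep the n largest categories and roll the rest into "Other"."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    rows = ordered[:n]
    remainder = sum(value for _, value in ordered[n:])
    if remainder > 0:
        rows.append(("Other", remainder))
    return rows


def share_of_top(counts, n):
    """Fraction of the grand total contributed by the n largest categories."""
    ordered = sorted(counts.values(), reverse=True)
    return sum(ordered[:n]) / sum(ordered)


# Hypothetical complaint counts, for illustration only.
complaints = {
    "Company A": 400,
    "Company B": 300,
    "Company C": 150,
    "Company D": 100,
    "Company E": 50,
}

rows = top_n_with_other(complaints, 2)
top_share = share_of_top(complaints, 2)
print(rows)
print(f"top 2 share: {top_share:.0%}")
```

Once a company has been folded into the "Other" row, its individual value is no longer in the chart data, which is why it cannot be highlighted.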

A dot plot might be another alternative depending on the data. It can be very useful to show how one data point compares to all of the others. However, this approach has its downsides too. For example, in the case of the CFPB complaints, there are so many small companies that even when applying jitter to the dot plot it is difficult to see all of the companies near zero.
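For readers unfamiliar with jitter, here is a minimal sketch of the idea, using only Python's standard library. The function name `jitter_positions`, the `spread` parameter, and the sample values are assumptions for illustration. Each value keeps its true position on the measurement axis and receives a small random offset on the other axis, so that dots with similar values do not sit exactly on top of each other.

```python
import random


def jitter_positions(values, spread=0.4, seed=0):
    """Pair each value with a random offset in [-spread, spread].

    The value itself is unchanged; only the offset on the other axis is random.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(value, rng.uniform(-spread, spread)) for value in values]


# Hypothetical values with many points crowded near zero.
values = [1, 1, 2, 2, 3, 5, 8, 250, 900]
points = jitter_positions(values)
for value, offset in points:
    print(value, round(offset, 3))
```

The offsets have a fixed range, so when hundreds of points share nearly the same value they still overlap, which is the limitation described above.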

Conclusion

It's important to understand the strengths and weaknesses of treemaps if you are going to use them.

Strengths:
   1. Originally built for hierarchical data.
   2. Allows for non-precise comparisons between top level categories as well as comparisons within categories at a lower level.
   3. Can encode a large number of categories, hierarchical or not.
   4. Allows for easy secondary encoding with color.
   5. Allows for a part-to-whole comparison.
   6. Shows all of the data (non-aggregated) enabling highlight and tooltips when interactive.
   7. Can be used as a filter, highlight or navigation tool.

Weaknesses:
   1. Cannot make precise comparisons because it encodes data using size (and color) of rectangles.
   2. Smaller boxes cannot be labeled, making them hard to find and hard to compare.
   3. Unlabeled boxes are difficult or impossible to read without interactivity, especially in printed form.

Consider the Alternatives:

Bar Chart - if showing a small number of categories then a bar chart is almost always going to be better.

Bar Chart with "Other" Category - if there are too many categories for an effective bar chart and you are able to aggregate the data then grouping categories into an "Other" category can be very effective.

Bar Chart with Top N - if there are too many categories for an effective bar chart and only the top N are important then showing only the top N can be an effective solution.

Dot Plot - using a dot plot or a dot plot with jitter (aka jitter plot) can be a great way to show a large number of categorical comparisons without aggregating the data.

Small Multiples - multiple charts can be used to allow for different comparison levels in the data. For example, a series of bar charts.

Icicle charts - Adam McCann pointed out that another alternative might be an icicle chart. For the right number of categories this could be useful, but as the number of nodes increases this will be hard to visualize in the same space.

Other experts in the field of data visualization have written about treemaps. For example, Ben Shneiderman's article, Discovering Business Intelligence Using Treemap Visualizations, is featured here as a guest author on Stephen Few's PerceptualEdge.com.

Stephen Few discussed treemaps in his article, Tableau Veers from the Path. He writes, "Ben Shneiderman created treemaps to display large numbers of values that exceed the number that could be displayed more simply and effectively using a bar graph." He also shows an example of a treemap from his book Now You See It (Page 46).

As usual, Stephen is able to describe the treemap perfectly.

"When conventional graphs, such as bar graphs, cannot be used because there are too many items to represent as bars in a single graph or even a series of graphs on a single screen, treemaps solve the problem by making optimal use of screen space. Because they rely on pre-attentive attributes to encode values (area and color) that we can't compare precisely, we reserve such methods for circumstances when other more precise visualizations cannot be used, or precision isn't necessary." - from Now You See It, Page 46, Stephen Few, 2009.

"Treemaps can display a great deal of information quite powerfully but for a limited set of purposes. That is, treemaps were not designed to support precise quantitative comparisons, which we can't make based on relative size and color." - from Now You See It, Page 90, Stephen Few, 2009.

Update: Thank you Rob Radburn who reminded me that there has been some research in this area. Reference below.

Perceptual Guidelines for Creating Rectangular Treemaps by Nicholas Kong, Jeffrey Heer, and Maneesh Agrawala, 2010. This paper outlines many of the things discussed in this post, specifically that treemaps are useful when comparing across nodes or when there are very large numbers of categories to compare.

These are the key findings taken from this paper:

Leaf/Leaf comparisons - "Bar charts are more accurate than treemaps up to a density of 2,048 leaves, after which treemaps become equally accurate. At 4,096 leaves, treemaps become faster than bar charts—up to 5 seconds faster at 8,000 leaves."
Leaf/Non-Leaf comparisons - "Treemaps are more accurate than bar charts at all densities, but no faster."
Non-Leaf/Non-Leaf comparisons - "Treemaps are more accurate, but exhibit similar estimation times."

I'm hopeful that more research will be done in the future to help us understand these chart types better and how they can be used effectively. Like many other chart types (sankey diagrams, chord diagrams, node-link diagrams, etc.) treemaps have significant limitations. However, when used for the right purpose they can be effective charts for visualizing data.

I hope you find this information helpful. If you have any questions feel free to email me at Jeff@DataPlusScience.com

Jeffrey A. Shaffer
Follow on Twitter @HighVizAbility
