This is part 3 of my series on data literacy
Welcome back! In this installment, we’re going to walk through an example of the process for analyzing fitness of averages (which I described in part 2).
So let’s say we’ve been asked to create a report monitoring sales performance. The KPI’s that the company likes to use are the average sales per order and the average profit per order.
In an ideal world, it would be best to have a conversation about the usefulness of those KPIs – how well do these two KPIs actually measure sales performance? However, we’re going to take the strategy of starting small. We can do this by first optimizing the KPIs that are already being used. Improving how a company reports on and understands existing KPIs is the first step (of many) on that road to increasing data literacy maturity.
To do that, we’ll use that data exploration workflow to analyze fitness of averages for these two KPIs.
Time to use our expert tools: the boxplot and histogram.
Here’s an example of a boxplot for profit per order, by product subcategory.
So how can we tell if this boxplot distribution is adding useful information? Information that’s worth knowing, from a business perspective?
If all the points were within the interquartile range (the box part of the boxplot), or even within the bounds of the whiskers, then this boxplot wouldn’t be very interesting to us. That would mean that there isn’t much variation from the average. In such cases, the boxplot doesn’t give us any new information.
But in our example, there are some points (orders) that are far beyond the whiskers on both sides of the boxplot. This means there are extreme values that might be significantly skewing that overall average profit number.
We also need to look at variation among the groups. If the boxplots for each of the product subcategories looked pretty much the same, then it wouldn’t be very useful to look at this data by product subcategory.
When comparing categories like this, it’s important to look at two things:
- Is there a difference in the highest and lowest numbers across groups? Do some groups have much lower values than other groups, or much higher numbers?
- Is there a difference in the spread of the range across groups?
In our chart, there definitely appear to be some variations going on here. At a quick glance, it looks like binders, copiers, and machines would be interesting to explore and keep an eye on.
Next, we look at this distribution with a histogram.
Once again, we need to ask: is this visual tool providing us with any new information? And is this information worth knowing?
The answer would be no if a histogram shows a normal distribution.
Histograms with normally distributed data look symmetrical: the center bin would have the highest frequency, and the bins on each side would taper out evenly. If we had a distribution that looked like that, and you then added reference lines for the average and median, those lines would be right on top of each other. They would also be located at the literal center of the chart. If you printed out a histogram with a normal distribution, you could fold the paper exactly in half, and the crease would be along that middle bin that contains your average value.
Of course I’ve never seen that happen with a real data set, but there are definitely some situations where it’s pretty close.
In many other situations, there is a good amount of space between the average and the median. If for example you have a few very high outliers, they will push that mean average over to the right. This would also mean that most of the orders have much lower profit than the overall average. Reporting that simple average could mislead a company to think that they’re doing better than they actually are.
Let’s return to our example. In our boxplot analysis, we identified three product subcategories with potentially interesting distributions. Two of these – copiers and machines – are in the technology product category. In this next step, we’ll use a histogram to further analyze the distribution of profit in this product category. For products in the technology category, the average profit per order was $94. But with only a quick glance at our histogram , we can see that the most frequent profit was between $0 and $25. Those extreme high-profit outliers are skewing the overall average, making it appear higher than what’s truly the typical profit for technology products.
Before continuing with our analysis, I recommend spending some time optimizing your histogram. This process can be a bit time consuming, but it will be well worth it. If you selected bin size that is too large or too small, it can significantly mask the true shape of the data.
If the bin size is too small, visual scale can make the difference between bins appear bigger than it actually is, when in reality there is only a small difference in the frequency of those profit values. On the other hand, you need to make sure the bin interval isn’t too big, because it could hide patterns in the shape of the distribution. For example, if the most common profit amount is $0 to $25, but we sized our bins in $50 increments, then it would make it appear as if the most common profit amount was $0 to $50.
In workflow for optimizing histograms I explain how to optimize histograms in more detail.
Now that we’ve (hopefully) optimized our histogram, let’s get back to our analysis. So far, we found some potentially interesting things going on around the average profit, especially in the technology product category. Now we can use some more advanced visualizations of the distribution to see if we can uncover even more interesting things!
I like to call this one… the histobox.
It just looks super cool. But aside from that, it’s also a very useful way of looking at the distribution of data points within each histogram bin. So, you can see just how much variation there is within each bin, as well as between the different bins.
This can get even more interesting when you use it to compare different measures.
In this example, there are bins for the number of orders per customer. Each customer is then plotted as a circle in their applicable bin. Adding in a second measure of the average order on the Y axis gives us the ability to compare customers’ total orders to the average amount they spend on each order. Now you can start to see some interesting behavioral patterns here – both for individual points and for the differences in median for each bin.
Another next-step chart I like to use is a density plot. In Tableau, it’s actually a heatmap (aka highlight table), that uses two bins. Histograms let you see which bin(s) have the most common values. Density plots let you see what are the most commonly occurring combinations of values among two different measures.
In some situations, it might work better to create a scatterplot. You can turn this chart into a scatterplot by setting the bins to be continuous, and changing the mark type to circle. A scatterplot would probably work better if there is a large range of values, and if there are also a lot of values with no data. Just experiment to see which one works best.
In the example below I chose a heatmap. You can see how effective it is in making it easy to spot that the biggest category of orders is $0 to $25 in sales and $0 to $25 in profit.
So if we look at it by $5 increments, we can now see that the largest category is $5-$10 in sales, with $0-$5 in profit. We’ve gained some useful information by looking at how these two distributions intersect, and by doing a bit of experimenting with bin size.
We’ve now identified a good opportunity for moving beyond averages. We know that in our fictitious example, the sales and profit could be understood better by looking at more than just averages. In the next installment, I’ll show you how to gently integrate these new charts with existing reports, in a way that minimizes defensiveness and resistance to change.