I'm trying to build a boxplot based on a CSV file with a column `cat` with three categories and a column `val` with numerical values. If I load this data into Tableau Desktop, drag `cat` to columns and `val` to rows, then right-click on `val` and go to Measure > Median, the median for each category is shown correctly.
However, if I instead do Measure > Dimension, Tableau displays a boxplot with the incorrect median for category `a`. It says the median is 1.5 instead of 0 because it ignores repeated values by default. In order to fix this, I need to go to the Analysis tab and uncheck "Aggregate Measures".
Why does Tableau do this? I can't imagine a situation in which you would want to ignore repeated values while calculating the median. Is it something to do with how I have organised the data before putting it into Tableau? Am I doing it wrong?
After this experience, I don't trust Tableau. But a lot of people use it, so I really want to know best practice to avoid similar gotchas in future.
Example Data:
F1 | cat | val |
---|---|---|
0 | a | 0 |
1 | a | 0 |
2 | a | 3 |
3 | b | 4 |
4 | b | 5 |
5 | c | 6 |
Why are you treating Val as a dimension instead of a measure? That is the problem.
If you want to calculate some sort of aggregation function such as Mean, Max, Median etc., then you have to treat that field as a Measure. That's essentially what being a measure means.
Dimensions are used to partition or group data into sets of rows before the aggregation functions for the measures are calculated, exactly like the fields that follow the SQL GROUP BY keyword.
You can trust Tableau once you understand how it works. It only does what you ask it to do.
If you make val a dimension, then you are grouping (in the SQL sense) by category and value, getting one summary row for each unique combination of category and value, and then making a box plot based on those unique values. That is why the duplicate 0 in category a collapses into a single row and the median comes out as 1.5.
If instead you turn off aggregate queries, then Tableau does not use a GROUP BY clause in the generated SQL. Instead, it plots each value (so if you lasso a circle, you may find that you've selected many values drawn on top of each other). In that case, the box plot will be based on the individual values rather than aggregated values.
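To make the arithmetic concrete, here is a small Python sketch of the two cases using the example data from the question. It only mimics the effect of grouping by category and value (collapsing duplicates); it is not how Tableau computes anything internally, and the helper name `values_for` is made up for illustration.

```python
from statistics import median

# Example data from the question, as (cat, val) rows.
rows = [("a", 0), ("a", 0), ("a", 3), ("b", 4), ("b", 5), ("c", 6)]


def values_for(category, aggregated):
    """Return the values a box plot would see for one category.

    Hypothetical helper: aggregated=True imitates grouping by (cat, val),
    which leaves one row per distinct value; aggregated=False keeps
    every original row.
    """
    vals = [v for c, v in rows if c == category]
    if aggregated:
        vals = sorted(set(vals))
    return vals


median_grouped = median(values_for("a", aggregated=True))   # based on [0, 3]
median_raw = median(values_for("a", aggregated=False))      # based on [0, 0, 3]

print(median_grouped)  # 1.5
print(median_raw)      # 0
```

With val as a dimension, category a contributes only the distinct values 0 and 3, which gives 1.5. With aggregation turned off, all three rows are used and the median is 0.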
Some other points to keep in mind: most of the time, Tableau creates a SQL query and sends it to the data source, letting the database do the calculations and send back just the query results, which Tableau then presents visually. Most of the time, these are GROUP BY (aggregate) queries returning aggregated summary results. This lets you use Tableau to quickly summarize and view information about very large datasets efficiently. You can control the granularity (aka level of detail) of the results by choosing which fields are treated as dimensions, i.e. the fields that follow the GROUP BY keyword.
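As a rough illustration of how the choice of dimensions changes the level of detail, the sketch below runs two GROUP BY queries against an in-memory SQLite table holding the example data. The table name `data` and the exact SQL are assumptions for illustration only; the queries Tableau actually generates will look different.

```python
import sqlite3

# Example data from the question, as (cat, val) rows.
rows = [("a", 0), ("a", 0), ("a", 3), ("b", 4), ("b", 5), ("c", 6)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (cat TEXT, val INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# cat is the only dimension, val is a measure aggregated with SUM:
# one result row per category.
by_cat = conn.execute(
    "SELECT cat, SUM(val) FROM data GROUP BY cat ORDER BY cat"
).fetchall()

# Both cat and val are dimensions: one result row per distinct
# (cat, val) pair, so the duplicate ('a', 0) row collapses.
by_cat_val = conn.execute(
    "SELECT cat, val FROM data GROUP BY cat, val ORDER BY cat, val"
).fetchall()

conn.close()

print(by_cat)      # [('a', 3), ('b', 9), ('c', 6)]
print(by_cat_val)  # [('a', 0), ('a', 3), ('b', 4), ('b', 5), ('c', 6)]
```

The second result set has five rows rather than the original six, which is exactly the set of points the box plot in the question was drawn from.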
As you discovered, you can turn off aggregate queries, in which case the whole idea of dimensions and measures goes away, and Tableau just asks the data sources to send back rows at the same level of detail as the original table. Aggregate queries are by far the most common case, but sometimes it is useful to turn off aggregation.
So, bottom line: most calculations, including averages and medians, are performed by the data source. There are a few exceptions. For example, table calculations are performed locally upon the aggregate summary results. Another exception is the set of calculations available on the Analytics tab in the left margin, such as Box Plot.
Box Plot (quartile) calculations are performed locally by Tableau upon the summary results returned from the data source. So if you are using aggregate queries (the default), then the box plot is based on those aggregate values. If you turn off aggregation, then the box plot will be based on the individual values. That's the difference between, say, calculating the median of the daily sales totals over a month versus calculating the median of the individual sales transactions.
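The daily-totals comparison can be sketched in Python as well. The sales figures below are invented purely for illustration; the point is only that a median taken over aggregated rows and a median taken over raw rows answer different questions.

```python
from collections import defaultdict
from statistics import median

# Hypothetical transactions as (day, amount); made-up numbers.
sales = [
    ("2024-01-01", 10),
    ("2024-01-01", 20),
    ("2024-01-01", 30),
    ("2024-01-02", 5),
    ("2024-01-03", 100),
]

# Aggregated view: one summary row per day (like GROUP BY day with SUM).
daily_totals = defaultdict(int)
for day, amount in sales:
    daily_totals[day] += amount

# Median of the daily totals: 60, 5, 100.
median_of_daily_totals = median(list(daily_totals.values()))

# Disaggregated view: median of the individual transactions.
median_of_transactions = median([amount for _, amount in sales])

print(median_of_daily_totals)  # 60
print(median_of_transactions)  # 20
```

Neither number is wrong. Which one the box plot shows depends on whether the rows handed to it are summary rows or the original records.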