Calculating averages and percentiles
Use case
We want to understand the distribution of values for a certain numeric property
within a dataset. We're used to average values and intuitively understand how to
calculate them. However, we also know that average values can be misleading for
skewed (opens in a new tab) distributions which are common
in the real world: for example, 2.5 is the average value for both (1, 2, 3, 4)
and (0, 0, 0, 10)
.
So, it's usually better to use
percentiles (opens in a new tab). Parameterized by a
fractional number n = 0..1
, where the n-th percentile is equal to a value that
exceeds a specified ratio of values in the distribution. The
median (opens in a new tab) is a special case: it's defined
as the 50th percentile (n = 0.5
), and it can be casually thought of as "the
middle" value. 2.5 and 0 are the medians of (1, 2, 3, 4)
and (0, 0, 0, 10)
,
respectively.
Data modeling
Let's explore the data in the users
cube that contains various demographic
information about users, including their age:
[
{
"users.name": "Abbott, Breanne",
"users.age": 52,
},
{
"users.name": "Abbott, Dallas",
"users.age": 43,
},
{
"users.name": "Abbott, Gia",
"users.age": 36,
},
{
"users.name": "Abbott, Tom",
"users.age": 39,
},
{
"users.name": "Abbott, Ward",
"users.age": 67,
},
];
Calculating the average age is as simple as defining a measure with the built-in
avg
type.
Calculating the percentiles would require using database-specific functions.
However, almost every database has them under names of PERCENTILE_CONT
and
PERCENTILE_DISC
,
Postgres (opens in a new tab) and
Snowflake (opens in a new tab)
included. For BigQuery (opens in a new tab),
you'd need to use the APPROX_QUANTILES
function.
cubes:
- name: users
# ...
measures:
- name: avg_age
type: avg
sql: age
- name: median_age
type: number
sql: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY age)
- name: p95_age
type: number
sql: PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY age)
Result
Using the measures defined above, we can explore statistics about the age of our users.
[
{
"users.avg_age": "52.3100000000000000",
"users.median_age": 53,
"users.p95_age": 82
}
]
For this particular dataset, the average age closely matches the median age, and 95% of all users are younger than 82 years.
Source code
Please feel free to check out the
full source code (opens in a new tab)
or run it with the docker-compose up
command. You'll see the result, including
queried data, in the console.