Accelerating non-additive measures
Use case
We want to run queries against pre-aggregations only to ensure our application's superior performance. Usually, accelerating a query is as simple as including its measures and dimensions to the pre-aggregation definition.
Non-additive measures (e.g., average values or distinct counts) are a special case. Pre-aggregations with such measures are less likely to be selected to accelerate a query. However, there are a few ways to work around that.
Data modeling
Let's explore the users
cube that contains various measures describing users'
age:
- count of unique age values (
distinct_ages
) - average age (
avg_age
) - 90th percentile of age
(
p90_age
)
cubes:
- name: users
# ...
measures:
- name: distinct_ages
sql: age
type: count_distinct
- name: avg_age
sql: age
type: avg
- name: p90_age
sql: PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY age)
type: number
All of these measures are non-additive. Practically speaking, it means that the pre-aggregation below would only accelerate a query that fully matches its definition:
cubes:
- name: users
pre_aggregations:
- name: main
measures:
- distinct_ages
- avg_age
- p90_age
dimensions:
- gender
This query will match the pre-aggregation above and, thus, will be accelerated:
{
"measures": ["users.distinct_ages", "users.avg_age", "users.p90_age"],
"dimensions": ["users.gender"]
}
Meanwhile, the query below won't match the same pre-aggregation because it
contains non-additive measures and omits the gender
dimension. It won't be
accelerated:
{
"measures": ["users.distinct_ages", "users.avg_age", "users.p90_age"]
}
Let's explore some possible workarounds.
Replacing with approximate additive measures
Often, non-additive count_distinct
measures can be changed to have the
count_distinct_approx
type
which will make them additive and orders of magnitude more performant. This
count_distinct_approx
measures can be used in pre-aggregations. However, there
are two drawbacks:
- This type is approximate, so the measures might yield slightly different
results compared to their
count_distinct
counterparts. Please consult with your database's documentation to learn more. - The
count_distinct_approx
is not supported with all databases. Currently, Cube supports it for Athena, BigQuery, and Snowflake.
For example, the distinct_ages
measure can be rewritten as follows:
cubes:
- name: users
measures:
- name: distinct_ages
sql: age
type: count_distinct_approx
Decomposing into a formula with additive measures
Non-additive avg
measures can be rewritten as
calculated measures
that reference additive measures only. Then, this additive measures can be used
in pre-aggregations. Please note, however, that you shouldn't include avg_age
measure in your pre-aggregation as it renders it non-additive.
For example, the avg_age
measure can be rewritten as follows:
cubes:
- name: users
measures:
- name: avg_age
sql: "{age_sum} / {count}"
type: number
- name: age_sum
sql: age
type: sum
- name: count
type: count
pre_aggregations:
- name: main
measures:
- age_sum
- count
dimensions:
- gender
Providing multiple pre-aggregations
If the two workarounds described above don't apply to your use case, feel free to create additional pre-aggregations with definitions that fully match your queries with non-additive measures. You will get a performance boost at the expense of a slightly increased overall pre-aggregation build time and space consumed.
Source code
Please feel free to check out the
full source code (opens in a new tab)
or run it with the docker-compose up
command. You'll see the result, including
queried data, in the console.