Incrementally building pre-aggregations for a date range
Use case
In scenarios where a large dataset spanning multiple years is pre-aggregated with partitioning, it is often useful to only rebuild pre-aggregations between a certain date range (and therefore only a subset of all the partitions). This is because recalculating all partitions is often an expensive and/or time-consuming process.
This is most beneficial when using data warehouses with partitioning support (such as AWS Athena and Google BigQuery).
Data modeling
Let's use an example of a cube with a nested SQL query:
cubes:
- name: users_with_organizations
sql: >
WITH users AS (
SELECT
md5(company) AS organization_id,
id AS user_id,
created_at
FROM public.users
), organizations AS (
(
SELECT
md5(company) AS id,
company AS name,
MIN(created_at)
FROM
public.users
GROUP BY
1,
2
)
) SELECT
users.*,
organizations.name AS org_name
FROM
users
LEFT JOIN organizations
ON users.organization_id = organizations.id
pre_aggregations:
- name: main
dimensions:
- id
- organization_id
time_dimension: created_at
refresh_key:
every: 1 day
incremental: true
granularity: day
partition_granularity: month
build_range_start:
sql: SELECT DATE('2021-01-01')
build_range_end:
sql: SELECT NOW()
dimensions:
- name: id
sql: user_id
type: number
primary_key: true
- name: organization_id
sql: organization_id
type: string
- name: created_at
sql: created_at
type: time
The cube above pre-aggregates the results of the sql
property, and is
configured to incrementally build them as long as the date range is not before
January 1st, 2021.
However, if we only wanted to build pre-aggregations between a particular date
range within the users table, we would be unable to as the current configuration
only applies the date range to the final result of the SQL query defined in
sql
.
In order to do the above, we'll "push down" the predicates to the inner SQL
query using FILTER_PARAMS
in conjunction
with the build_range_start
and build_range_end
properties:
cubes:
- name: users_with_organizations
sql: >
WITH users AS (
SELECT
md5(company) AS organization_id,
id AS user_id,
created_at
FROM public.users
WHERE
{FILTER_PARAMS.users_with_organizations.created_at.filter('created_at')}
), organizations AS (
(
SELECT
md5(company) AS id,
company AS name,
MIN(created_at)
FROM
public.users
GROUP BY
1,
2
)
) SELECT
users.*,
organizations.name AS org_name
FROM
users
LEFT JOIN organizations
ON users.organization_id = organizations.id
# ...
Result
By adding FILTER_PARAMS
to the subquery inside the sql
property, we now
limit the initial size of the dataset by applying the filter as early as
possible. When the pre-aggregations are incrementally built, the same filter is
used to apply the build ranges as defined by build_range_start
and
build_range_end
.