Incrementally building pre-aggregations for a date range

Use case

In scenarios where a large dataset spanning multiple years is pre-aggregated with partitioning, it is often useful to only rebuild pre-aggregations between a certain date range (and therefore only a subset of all the partitions). This is because recalculating all partitions is often an expensive and/or time-consuming process.

This is most beneficial when using data warehouses with partitioning support (such as AWS Athena and Google BigQuery).

Data modeling

Let’s use an example of a cube with a nested SQL query:


cubes:
  - name: users_with_organizations
    sql: |
      WITH users AS (
        SELECT
          md5(company) AS organization_id,
          id AS user_id,
          created_at
        FROM public.users
      ), organizations AS (
        (
          SELECT
            md5(company) AS id,
            company AS name,
            MIN(created_at)
          FROM
            public.users
          GROUP BY
            1,
            2
        )
      ) SELECT
        users.*,
        organizations.name AS org_name
      FROM
        users
      LEFT JOIN organizations
        ON users.organization_id = organizations.id
 
    pre_aggregations:
      - name: main
        dimensions:
          - id
          - organization_id
        time_dimension: created_at
        refresh_key:
          every: 1 day
          incremental: true
        granularity: day
        partition_granularity: month
        build_range_start:
          sql: SELECT DATE('2021-01-01')
        build_range_end:
          sql: SELECT NOW()
 
    dimensions:
      - name: id
        sql: user_id
        type: number
        primary_key: true
 
      - name: organization_id
        sql: organization_id
        type: string
 
      - name: created_at
        sql: created_at
        type: time


cube('users_with_organizations', {
 
  sql: `
    WITH users AS (
      SELECT
        md5(company) AS organization_id,
        id AS user_id,
        created_at
      FROM public.users
    ),
    organizations AS (
      (
        SELECT
          md5(company) AS id,
          company AS name,
          MIN(created_at)
        FROM
          public.users
        GROUP BY
          1,
          2
      )
    )
    SELECT
      users.*,
      organizations.name AS org_name
    FROM
      users
    LEFT JOIN organizations
      ON users.organization_id = organizations.id
    `,
 
  pre_aggregations: {
    main: {
      dimensions: [CUBE.id, CUBE.organization_id]
      time_dimension: CUBE.created_at,
      refresh_key: {
        every: `1 day`,
        incremental: true
      },
      granularity: `day`,
      partition_granularity: `month`,
      build_range_start: { sql: `SELECT DATE('2021-01-01')` },
      build_range_end: { sql: `SELECT NOW()` }
    }
  },
 
  dimensions: {
    id: {
      sql: `user_id`,
      type: `number`
      primary_key: true
    },
 
    organization_id: {
      sql: `organization_id`,
      type: `string`
    },
 
    created_at: {
      sql: `created_at`,
      type: `time`
    }
  }
})

The cube above pre-aggregates the results of the sql property, and is configured to incrementally build them as long as the date range is not before January 1st, 2021.

However, if we only wanted to build pre-aggregations between a particular date range within the users table, we would be unable to as the current configuration only applies the date range to the final result of the SQL query defined in sql.

In order to do the above, we’ll “push down” the predicates to the inner SQL query using FILTER_PARAMS in conjunction with the build_range_start and build_range_end properties:


cubes:
  - name: users_with_organizations
    sql: |
      WITH users AS (
        SELECT
          md5(company) AS organization_id,
          id AS user_id,
          created_at
        FROM public.users
        WHERE
      {FILTER_PARAMS.users_with_organizations.created_at.filter('created_at')}
      ), organizations AS (
        (
          SELECT
            md5(company) AS id,
            company AS name,
            MIN(created_at)
          FROM
            public.users
          GROUP BY
            1,
            2
        )
      ) SELECT
        users.*,
        organizations.name AS org_name
      FROM
        users
      LEFT JOIN organizations
        ON users.organization_id = organizations.id
 
  # ...


cube("users_with_organizations", {
  sql: `
WITH users AS (
    SELECT
      md5(company) AS organization_id,
      id AS user_id,
      created_at
    FROM public.users
    WHERE ${FILTER_PARAMS.users_with_organizations.created_at.filter(
      "created_at"
    )}
),
organizations AS (
  (
    SELECT
      md5(company) AS id,
      company AS name,
      MIN(created_at)
    FROM
      public.users
    GROUP BY
      1,
      2
  )
)
SELECT
  users.*,
  organizations.name AS org_name
FROM
  users
LEFT JOIN organizations
  ON users.organization_id = organizations.id
`,
 
  // ...
})

Result

By adding FILTER_PARAMS to the subquery inside the sql property, we now limit the initial size of the dataset by applying the filter as early as possible. When the pre-aggregations are incrementally built, the same filter is used to apply the build ranges as defined by build_range_start and build_range_end.

Was this page useful?