Concepts
Cube borrows a lot of terminology from OLAP theory (opens in a new tab), and this document is intended for both newcomers and regular users to refresh their understanding.
We'll use a sample e-commerce database with two tables, orders
and
line_items
to illustrate the concepts throughout this page:
orders
id | status | completed_at | created_at |
---|---|---|---|
1 | completed | 2019-01-05 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
2 | shipped | 2019-01-17 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
3 | completed | 2019-01-27 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
4 | shipped | 2019-01-09 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
5 | processing | 2019-01-29 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
line_items
id | product_id | order_id | quantity | price | created_at |
---|---|---|---|---|---|
1 | 31 | 1 | 1 | 275 | 2019-01-31 00:00:00.000000 |
2 | 49 | 2 | 6 | 248 | 2021-01-20 00:00:00.000000 |
3 | 89 | 3 | 6 | 197 | 2021-11-25 00:00:00.000000 |
4 | 71 | 4 | 8 | 223 | 2019-12-23 00:00:00.000000 |
5 | 64 | 5 | 5 | 75 | 2019-04-20 00:00:00.000000 |
Cubes
Cubes represent datasets in Cube and are conceptually similar to views in
SQL (opens in a new tab). Cubes are usually declared in separate files with one
cube per file. Typically, a cube points to a single table in
your database using the sql_table
property:
cubes:
- name: orders
sql_table: orders
You can also use the sql
property to accommodate more
complex SQL queries:
cubes:
- name: orders
sql: >
SELECT *
FROM orders, line_items
WHERE orders.id = line_items.order_id
Each cube contains the definitions of its members: dimensions, measures, and segments. You can control the access to cubes and their members by configuring the member-level security.
Joins are used to define relations between cubes. Pre-aggregations are used to accelerate queries to cubes. Cubes and their members can be further referenced by views.
Note that cubes support extension, polymorphism, and data blending. Cubes can be defined statically and you can also build dynamic data models.
Cube supports data models that consist of thousands of cubes and views. For massive multi-tenancy configurations, e.g., with more than 1,000 tenants, consider using a multi-cluster deployment.
See the reference documentaton for the full list of cube parameters.
Views
Views sit on top of the data graph of cubes and create a facade of your whole data model with which data consumers can interact. They are useful for defining metrics, managing governance and data access, and controlling ambiguous join paths.
Views do not define their own members. Instead, they reference cubes by specific join paths and include their members. Optionally, you can also group members of a view into folders.
In the example below, we create the orders
view which includes select members
from base_orders
, products
, and users
cubes:
views:
- name: orders
cubes:
- join_path: base_orders
includes:
- status
- created_date
- total_amount
- total_amount_shipped
- count
- average_order_value
- join_path: base_orders.line_items.products
includes:
- name: name
alias: product
- join_path: base_orders.users
prefix: true
includes: "*"
excludes:
- company
Views do not define any pre-aggregations. Instead, they reuse pre-aggregations from underlying cubes.
View can be defined statically and you can also build dynamic data models.
See the reference documentaton for the full list of view parameters.
Dimensions
Dimensions represent the properties of a single data point in the cube.
The orders
table contains only dimensions, so representing them in the
orders
cube is straightforward:
cubes:
- name: orders
# ...
dimensions:
- name: id
sql: id
type: number
# Here we explicitly let Cube know this field is the primary key
# This is required for de-duplicating results when using joins
primary_key: true
- name: status
sql: status
type: string
Note that the id
dimension is defined as a primary key.
It is also possible to have more than one primary key dimension in a cube if
you'd like them all to be parts of a composite key.
The line_items
table also has a couple of dimensions which can be
represented as follows:
cubes:
- name: line_items
# ...
dimensions:
- name: id
sql: id
type: number
# Again, we explicitly let Cube know this field is the primary key
# This is required for de-duplicating results when using joins
primary_key: true
- name: order_id
sql: order_id
type: number
If needed, dimensions can be organized into hierarchies. Also, proxy dimensions are helpful for code reusability and subquery dimensions can be used to join cubes implicitly.
See the reference documentaton for the full list of dimension parameters.
Dimension types
Dimensions can be of different types, e.g., string
, number
, or time
. Often,
data types in SQL are mapped to dimension types in the following way:
Data type in SQL | Dimension type in Cube |
---|---|
timestamp , date , time | time |
text , varchar | string |
integer , bigint , decimal | number |
boolean | boolean |
See the dimension type reference for details.
Time dimensions
Time-based properties are modeled using dimensions of the time
type. They allow grouping the result set by a unit of
time (e.g., days, weeks, month, etc.), also known as the time dimension
granularity.
The following granularities are available by default for any time dimension:
year
, quarter
, month
, week
(starting on Monday), day
, hour
, minute
,
second
. You can also define custom granularities
and optionally expose them via proxy dimensions in case
you need to use weeks starting on Sunday, fiscal years, etc.
See this recipe for more custom granularity examples.
Here's how we can add time dimensions to the data model:
cubes:
- name: orders
# ...
dimensions:
- name: created_at
sql: created_at
type: time
# You can use this time dimension with all default granularities:
# year, quarter, month, week, day, hour, minute, second
- name: completed_at
sql: completed_at
type: time
# You can use this time dimension with all default granularities
# and an additional custom granularity defined below
granularities:
- name: fiscal_year_starting_on_february_01
interval: 1 year
offset: 1 month
Time dimensions are essential to enabling performance boosts such as partitioned pre-aggregations and incremental refreshes.
Measures
Measures represent the properties of a set of data points in the cube. To
add a measure called count
to our orders
cube, for example, we can do the
following:
cubes:
- name: orders
# ...
measures:
- name: count
type: count
In our LineItems
cube, we can also create a measure to sum up the total value
of line items sold:
cubes:
- name: line_items
# ...
measures:
- name: total
sql: price
type: sum
Calculated measures and subquery dimensions can be used for measure composition.
See the reference documentaton for the full list of measure parameters.
Measure types
Measures can be of different types, e.g., count
, sum
, or number
. Often,
aggregate functions in SQL are mapped to measure types in the following way:
Aggregate function in SQL | Measure type in Cube |
---|---|
AVG | avg |
BOOL_AND , BOOL_OR | boolean |
COUNT | count |
COUNT(DISTINCT …) | count_distinct |
APPROX_COUNT_DISTINCT | count_distinct_approx |
MAX | max |
MIN | min |
PERCENTILE_CONT , MEDIAN | number |
STRING_AGG , LISTAGG | string |
SUM | sum |
Any function returning a timestamp, e.g., MAX(time) | time |
See the measure type reference for details.
Measure additivity
Additivity is a property of measures that detemines whether measure values, once calculated for a set of dimensions, can be further aggregated to calculate measure values for a subset of these dimensions.
Measure additivity has an impact on pre-aggregation matching.
Additivity of a measure depends on its type. Only measures
with the following types are considered additive:
count
,
count_distinct_approx
,
min
,
max
,
sum
.
Measures with all other types are considered non-additive.
Example
Consider the following cube:
cubes:
- name: employees
sql: >
SELECT 1 AS id, 'Ali' AS first_name, 20 AS age, 'Los Gatos' AS city UNION ALL
SELECT 2 AS id, 'Bob' AS first_name, 30 AS age, 'San Diego' AS city UNION ALL
SELECT 3 AS id, 'Eve' AS first_name, 40 AS age, 'San Diego' AS city
measures:
- name: count
type: count
- name: avg_age
sql: age
type: avg
dimensions:
- name: city
sql: city
type: string
If we run a query that includes city
as a dimension and count
and avg_age
as measures, we'll get the following results:
city | count | avg_age |
---|---|---|
Los Gatos | 1 | 20 |
San Diego | 2 | 35 |
Then, if we remove the city
dimension from the query, we'll get the following
results:
count | avg_age |
---|---|
3 | 30 |
As you can see, the value of the count
measure that we've got for the second
query could have been calculated based on the results of the first one:
1 + 2 = 3
. It explains why the count
measure, having the count
type, is
considered additive.
However, the value of the avg_age
measure that we've got for the second query
can't be calculated based on the results of the first one: there's no way to
derive 30
from 20
and 35
. This is why the avg_age
measure, having the
avg
type, is considered non-additive.
Leaf measures
Measures that do not reference other measures are considered leaf measures.
By definition, all measures that only reference SQL columns and expressions are leaf measures. On the other hand, calculated measures might not necessarily be leaf measures because they can reference other measures.
Whether a query contains only additive leaf measures has an impact on pre-aggregation matching.
Joins
Joins define the relationships between cubes, which then allows accessing and
comparing properties from two or more cubes at the same time. In Cube, all joins
are LEFT JOIN
s.
An INNER JOIN
can be replicated with Cube; when making a Cube query, add a
filter for IS NOT NULL
on the required column.
In the following example, we are left-joining the line_items
cube onto our
orders
cube:
cubes:
- name: orders
# ...
joins:
- name: line_items
# Here we use the `CUBE` global to refer to the current cube,
# so the following is equivalent to `orders.id = line_items.order_id`
sql: "{CUBE}.id = {line_items.order_id}"
relationship: many_to_one
There are three types of join relationships
(one_to_one
, one_to_many
, and many_to_one
) and a few other
concepts such as the direction of joins and trasitive
joins pitfalls.
See the reference documentaton for the full list of join parameters.
Segments
Segments are pre-defined filters that are kept within the data model instead of a Cube query. They help to simplify queries and make it easy to reuse common filters across a variety of queries.
To add a segment which limits results to completed orders, we can do the following:
cubes:
- name: orders
# ...
segments:
- name: only_completed
sql: "{CUBE}.status = 'completed'"
See the reference documentaton for the full list of segment parameters.
Pre-aggregations
Pre-aggregations provide a powerful way to accelerate frequently used queries
and keep the cache up-to-date. Within a data model, they are defined using the
pre_aggregations
property:
cubes:
- name: orders
# ...
pre_aggregations:
- name: main
measures:
- count
dimensions:
- status
time_dimension: created_at
granularity: day
A more thorough introduction can be found in Getting Started with Pre-Aggregations.
See the reference documentaton for the full list of pre-aggregation parameters.