Query format in the SQL API
SQL API runs queries in the Postgres dialect that can reference those tables and columns.
By default since 1.0, the SQL API executes regular queries, queries with post-processing and queries with pushdown. This page explains their format and details if they are handled differently by the SQL API.
Query pushdown in the SQL API is available in public preview. Read more (opens in a new tab) in the blog.
Data model mapping
In the SQL API, each cube or view from the data model is represented as a table. Measures, dimensions, and segments are represented as columns in these tables.
Cubes and views
Given that you have a cube or a view called orders
, you can query it as if it's
a table:
SELECT * FROM orders;
Dimensions
Given that your cube or view has a dimension called status
, you can reference it
as a column in the SELECT
clause. Note that you'll also have to add it to the
GROUP BY
clause:
SELECT status
FROM orders
GROUP BY 1;
Measures
Given that your cube or view has a measure called count
, you can reference it
by wrapping with the MEASURE
aggregate function:
SELECT MEASURE(count)
FROM orders;
The SQL API allows aggregate functions on measures as long as they match measure types.
Aggregate functions
The special MEASURE
function works with measures of any type.
Measure columns can also be aggregated with the following aggregate functions that
correspond to measure types:
Measure type in Cube | Aggregate function in an aggregated query |
---|---|
avg | MEASURE or AVG |
boolean | MEASURE |
count | MEASURE or COUNT |
count_distinct | MEASURE or COUNT(DISTINCT …) |
count_distinct_approx | MEASURE or COUNT(DISTINCT …) |
max | MEASURE or MAX |
min | MEASURE or MIN |
number | MEASURE or any other function from this table |
string | MEASURE or STRING_AGG |
sum | MEASURE or SUM |
time | MEASURE or MAX or MIN |
If an aggregate function doesn't match the measure type, the following error
will be thrown: Measure aggregation type doesn't match
.
Segments
Segments are exposed as columns of the boolean
type (opens in a new tab).
Given that your cube or view has a segment called is_completed
, you can reference it
as a column in the WHERE
clause:
SELECT *
FROM orders
WHERE is_completed IS TRUE;
Joins
Please refer to this page for details on joins.
Post-processing and pushdown
Since 1.0 by default, the SQL API executes, regular queries, queries with post-processing and queries with pushdown.
Query post-processing
The following query is performing a SELECT
from the orders
cube:
SELECT
city,
SUM(amount)
FROM orders
WHERE status = 'shipped'
GROUP BY 1
For this query, the SQL API would transform SELECT
query fragments into a regular
query. It can be represented as follows in the REST API query
format:
{
"dimensions": [
"orders.city"
],
"measures": [
"orders.amount"
],
"filters": [
{
"member": "orders.status",
"operator": "equals",
"values": [
"shipped"
]
}
]
}
Because of this transformation, not all functions and expressions are supported
in query fragments performing SELECT
from cube tables. Please refer to the
SQL API reference to see whether a specific expression or function
is supported and whether it can be used in selection (opens in a new tab)
(e.g., WHERE
) or projection (opens in a new tab) (e.g., SELECT
) parts
of SQL queries.
For example, the following query won't work because the SQL API can't push down the
CASE
expression to Cube for processing. It is not possible to translate CASE
expressions in measures.
SELECT
city,
MAX(CASE
WHEN status = 'shipped'
THEN '2-done'
ELSE '1-in-progress'
END) AS real_status,
SUM(number)
FROM orders
CROSS JOIN users
GROUP BY 1;
You can leverage nested queries in cases like this. You can wrap your SELECT
statement from a cube table (inner query) into another SELECT
statement
(outer query) to perform calculations with expressions like CASE
.
This outer SELECT
is not part of the SQL query that being rewritten and
thus allows you to use more SQL functions, operators and expressions.
You can rewrite the above query as follows, making sure to wrap the original
SELECT
statement:
SELECT
city,
MAX(CASE
WHEN status = 'shipped' THEN '2-done'
ELSE '1-in-progress'
END) real_status,
SUM(amount) AS total
FROM (
SELECT
users.city AS city,
SUM(number) AS amount,
orders.status
FROM orders
CROSS JOIN users
GROUP BY 1, 3
) AS inner
GROUP BY 1, 2
ORDER BY 1;
--- You can also use CTEs to achieve the same result
The above query works because the CASE
expression is supported in SELECT
queries not querying cube tables.
Query pushdown
Query pushdown is currently in public preview, and the API and behavior may change in future versions.
Query pushdown is enabled by default since 1.0 and is controlled by
CUBESQL_SQL_PUSH_DOWN
environment variable.
Query pushdown currently has the following limitations:
- No support for joins between cubes (opens in a new tab).
- No support for custom aggregations in
number
measures (opens in a new tab).
Query pushdown provides a safe net for queries that can't be rewritten into combination of a regular query and post-processing. Such queries' SQL would be transpiled to target database query leveraging all target database capabilities for data processing.
During the rewrite process, Cube validates that the target database would support transpired SQL queries. If direct conversion is not possible, different SQL transformation rewrite rules can be applied to achieve successful translation. Please refer to the SQL API reference for the list of supported SQL functions and clauses. Support varies based on the target database.
Top-down and bottom-up evaluation
Fundamentally, every SQL operation results in a tabular data set. This is usually referred to as SQL operational closure or bottom-up SQL evaluation. However, for OLAP queries, most of the time, top-down evaluation is required. Top-down evaluation is whenever the outermost sub-query operation decides on how measures would be actually evaluated as opposed to innermost sub-query in case of standard SQL behavior.
To balance between SQL guarantees and OLAP requirements, Cube
- uses top-down evaluation from the innermost aggregation operation down to all ungrouped sub-queries,
- uses bottom-up evaluation from the innermost aggregation tabular result set up to the outermost sub-query.
This behavior is enabled whenever query pushdown is enabled. It's enabled by default since 1.0.
To drill-down on how this works, let's consider following example date model
cubes:
- name: orders
sql_table: ECOM.ORDERS
dimensions:
- name: id
sql: ID
type: number
primary_key: true
- name: status
sql: STATUS
type: string
description: The status of the order (completed etc)
- name: created_at
sql: "{CUBE}.CREATED_AT"
type: time
measures:
- name: count
type: count
- name: completed_count
type: count
filters:
- sql: "{CUBE}.STATUS = 'completed'"
- name: completed_percentage
type: number
sql: "(100.0 * {CUBE.completed_count} / NULLIF({CUBE.count}, 0))"
format: percent
And the query to the SQL API:
SELECT id, status, created_at, completed_percentage FROM orders
Such a query is considered an ungrouped query and would result in the following result set:
id | status | created_at | completed_percentage |
---|---|---|---|
1 | shipped | 2024-01-01 | 0.0 |
2 | completed | 2024-01-01 | 100.0 |
3 | completed | 2024-01-02 | 100.0 |
On the other hand, a typical query that various BI tools generate:
SELECT date_trunc('day', created_at), MEASURE(completed_percentage)
FROM (
SELECT id, status, created_at, completed_percentage FROM orders
) inner_query
GROUP BY 1
...would still yield correct results
date_trunc('day', created_at) | completed_percentage |
---|---|
2024-01-01 | 50.0 |
2024-01-02 | 100.0 |
For this particular query, inner_query
won't be evaluated as a table.
Instead, Cube would postpone its execution until wrapping GROUP BY
and would use only date_trunc('day', created_at)
as a dimension to evaluate completed_percentage
measure instead of full set of inner_query
columns id
,status
and created_at
.
To make it possible, Cube keeps track of ungrouped queries and evaluates them only on the first occurrence of a GROUP BY
query in case there's one.
Aggregated and non-aggregated queries
SQL API supports two types of queries against cube tables: aggregated
(those with GROUP BY
statement) and non-aggregated (those without).
Without query pushdown, queries that Cube runs against your database will always be aggregated,
regardless of whether you use aggregated (with GROUP BY
) or non-aggregated
queries with the SQL API.
Whenever you enable query pushdown, queries which do not contain GROUP BY
clause will be executed as ungrouped queries.
A non-aggregated query would only include bare column names in SQL:
SELECT
status, -- dimension
count -- measure
FROM orders
With query pushdown disabled, Cube will still use GROUP BY
to execute such a query.
Automatic use of GROUP BY
is disabled by default.
Whenever query pushdown is enabled, such query would run as ungrouped query.
As with REST API such queries do not use GROUP BY
and render measures as if those would be grouped by primary key of a cube.
If query pushdown is enabled, calculated number
, string
or time
measures queried by SQL API can't use aggregation function definitions with it's sql
paremeter.
Such measures can still reference other aggregate type measures though.
Aggregated query must aggregate all measure columns and group by
all dimension columns. You can use the special MEASURE
aggregate function
for measures of any type. This is quite convenient, especially
in case you're manually writing ad-hoc queries:
SELECT
status, -- dimension
MEASURE(count) -- measure
FROM orders
GROUP BY 1
If any measure columns are not aggregated or any dimension columns aren't included
in GROUP BY
, the following error will be thrown: Projection references non-aggregate values
. This is a standard SQL consistency check for the GROUP BY
operation, and it's enforced by the SQL API as well.
Filtering
Without query pushdown, Cube supports most simple equality operators like
=
, <>
, <
, <=
, >
, >=
as well as IN
and LIKE
operators.
Cube tries to push down all filters into a regular query.
In some cases, filtering can only be done during post-processing.
Time dimension filters will be converted to time dimension date ranges whenever
it's possible.
Ordering
Without query pushdown, Cube tries to push down all ORDER BY
statements into
a regular query.
Row limit edge case
If the ORDER BY
statement can't be pushed down, ordering would be performed during
post-processing. If there are more than 50,000 rows
in the result set, this can yield incorrect results.
However, given that queries to Cube are usually aggregated, this is a very rare case; anyway, please keep this limitation in mind when designing your queries.
Consider the following query. Because of the SUM(total_value) + 2
expression
in the projection of the outer query, thr SQL API can't push down ORDER BY
:
SELECT
status,
SUM(total_value) + 2 AS transformed_amount
FROM (
SELECT * FROM orders
) AS orders
GROUP BY status
ORDER BY status DESC
LIMIT 100;
You can use EXPLAIN
against the above query to look at the query plan.
As you can see, the sorting operation is done after the regular query and the projection:
+ GlobalLimitExec: skip=None, fetch=100
+- SortExec: [transformed_amount@1 DESC]
+-- ProjectionExec: expr=[status@0 as status, SUM(orders.total_value)@1 + CAST(2 AS Float64) as transformed_amount]
+--- CubeScanExecutionPlan