One of the key questions I raised in my recent blog post on the evolution of OLAP was about the communication protocol between a standalone semantic layer (the next generation of OLAP systems) and the data consumption layer.
Historically, OLAP servers relied on MDX. Many BI tools supported it, but as OLAP started to fade out with the rise of Big Data and later Cloud Data Warehouses, so did MDX. While some BI tools still support it today, most vendors actively discourage its use. Even Microsoft, the original creator of MDX, introduced a modern successor—DAX. DAX has since gained significant traction within the Microsoft ecosystem, yet it never achieved widespread adoption outside of it.
At the same time, we've witnessed a resurgence of SQL across the board. Not only did NoSQL databases fail to replace classic SQL-based OLTP databases like PostgreSQL and MySQL, but SQL has also solidified its position as the dominant protocol for data processing and transformation. Every major CDW is SQL-first, modern transformation tools are SQL-first, and even streaming architectures are increasingly moving toward SQL-first approaches.
Given this landscape, SQL is well-positioned as the ideal communication protocol between a universal semantic layer and BI tools. It's ubiquitous, its adoption is accelerating, and it's already the default language for data analysts. BI tools already support SQL, and the integration ecosystem is mature—virtually every programming language has libraries to communicate over SQL protocols.
SQL interface challenge
While SQL seems like the natural choice for a universal semantic layer interface, it comes with a fundamental challenge: SQL was designed primarily for OLTP queries and lacks native multidimensional primitives—particularly the concept of a measure.
It's easy to introduce a new aggregation function like MEASURE() or AGGREGATE() to query measures, but the real challenge lies in how the query is evaluated. SQL follows a different evaluation flow from OLAP-focused languages, making it impossible to correctly execute certain analytical queries that contain measures.
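As a quick sketch of what that could look like (the orders model and completed_percentage measure are assumed names, not a definitive API), a measure-aware query reads like ordinary SQL with one extra aggregate function:

```sql
-- Hypothetical measure-aware query: MEASURE() asks the semantic layer
-- to apply the measure's own aggregation logic
SELECT MEASURE(completed_percentage) AS completed_percentage
FROM orders;
```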
The key difference between SQL and OLAP query languages like DAX is the evaluation flow:
- SQL follows a bottom-up evaluation flow, where child calculation nodes are evaluated first, and their results are passed upward in the query tree. Parent nodes do not influence child node evaluation.
- DAX follows a top-down evaluation flow, where parent nodes influence child node calculations. This allows for context-aware aggregation, a crucial feature in multidimensional analysis.
To illustrate this issue, let’s look at an example. Suppose we have a completed_percentage measure already defined in the semantic layer. Now, we want to query it from a BI tool using SQL:
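The query below is an illustrative sketch rather than the exact output of any particular BI tool; the orders model, the city and product_category dimensions, and the nested shape are all assumed for the example. Many BI tools generate this kind of structure: an inner query that pulls data from the semantic layer at a finer grain, and an outer query that post-aggregates it.

```sql
-- Hypothetical BI-generated query (all names are assumed)
SELECT
  city,
  AVG(completed_percentage) AS completed_percentage
FROM (
  SELECT
    city,
    product_category,
    MEASURE(completed_percentage) AS completed_percentage
  FROM orders
  GROUP BY 1, 2
) AS orders_by_category
GROUP BY 1;
```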
Because SQL evaluates this bottom-up, the inner query runs first:
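Isolated from the sketch above, that inner query is:

```sql
-- Runs first under bottom-up evaluation: the measure is computed per
-- (city, product_category), a finer grain than the outer GROUP BY city
SELECT
  city,
  product_category,
  MEASURE(completed_percentage) AS completed_percentage
FROM orders
GROUP BY 1, 2;
```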
Here’s the problem:
- Since SQL evaluates the inner query first, it aggregates the completed_percentage measure too early, before the outer query processes the data at a higher level.
- The outer query then tries to aggregate the already-aggregated value, leading to incorrect results.
- In an OLAP-style top-down evaluation, this measure would be correctly computed at the final grouping level, rather than prematurely at the row level.
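To make this concrete with made-up numbers: if a city has 1,000 orders in one category with a 90% completion rate and 10 orders in another with a 10% rate, the outer AVG returns 50%, while the true completion rate across all 1,010 orders is roughly 89%.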
This mismatch in evaluation strategies is one of the biggest challenges in using SQL as a communication protocol for a standalone semantic layer.
Solution
The solution to this problem is to rewrite the incoming SQL query so that it can be correctly evaluated bottom-up. This means unwrapping all measure calls and ensuring aggregations are placed at the appropriate level in the query structure.
If we take our example query and rewrite it like this, we will get the correct results:
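Continuing the same hypothetical example, one possible rewrite collapses the nesting so the measure is evaluated directly at the outer query's grouping level:

```sql
-- The measure call is lifted to the final grouping level, so
-- completed_percentage is computed once per city from the underlying rows
SELECT
  city,
  MEASURE(completed_percentage) AS completed_percentage
FROM orders
GROUP BY 1;
```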
This transformation ensures that all calculations happen at the correct stage of query execution, avoiding the incorrect aggregation issues caused by SQL’s default bottom-up evaluation model.
Rewriting queries is not a new concept in the database world. It is a widely used technique in query optimization, where databases internally transform SQL queries to improve performance.
At Cube, we built our own term rewrite system that takes SQL queries with measures as input and produces valid, executable SQL that aligns with bottom-up evaluation. We also handle SQL dialect variations across different data sources—including Snowflake, Databricks, Redshift, and BigQuery—as each of these platforms has different syntax, functions, and optimization behaviors.
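As one small illustration of the kind of dialect difference such a system has to absorb (an assumed example, not Cube's actual generated SQL), even truncating a date to month is written differently across warehouses:

```sql
-- Snowflake / Redshift: unit first, as a string literal
SELECT DATE_TRUNC('month', order_date) AS order_month FROM orders;

-- BigQuery: column first, unit as a keyword
SELECT DATE_TRUNC(order_date, MONTH) AS order_month FROM orders;
```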
Our query rewriting system is part of Cube's open-source semantic layer engine, built in Rust and based on E-Graph theory. This allows us to efficiently explore multiple possible query rewrites and select the best transformation dynamically. We are also planning to publish our research behind this E-Graph-based term rewrite system soon, so stay tuned!