When I was reading Pedram’s deep dive on metrics layers a few weeks ago, I couldn’t help but question the tool selection. One could obviously blame my employer affiliation, which is hard to deny. However, the real reason was this summary statement:
All three tools have different trade-offs, and their strengths and weaknesses tell of the challenges a metrics layer faces.
While this is a true statement, it dismisses the fact that there’s an ergonomic, feature-complete, and accessible solution on the market that doesn’t suffer from the highlighted pitfalls. With that, I’m putting Cube in the spotlight and kindly requesting your feedback (or blowback) on social media.
In previous episodes
If you haven’t read the original deep dive on semantic layers, please do. I promise it’s gonna be a rewarding experience.
There, Pedram mentions some prior thought leadership on the topic and introduces a challenge for three tools: Looker, dbt Metrics, and Lightdash. They have to model and provide access to the Activation Metric, a crucial characteristic of an imaginary B2B SaaS product with users and workspaces. A workspace is considered active if its users perform some activation event, so we want to report on the activation rate daily, weekly, or monthly and see how it changes over time. Later, Pedram shows how Looker, dbt Metrics, and Lightdash handle the data modeling and elaborates on their strengths and pitfalls.
Here’s the workspace_details table matching the imaginary dataset that was used by all tools to calculate the metrics. I’ve generated 20 million rows and put them into Postgres, so we have something to work with:
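Roughly, here’s the shape I’m assuming: one row per workspace per reporting date, with a timestamp that is set once the workspace performs an activation event (the column names are assumptions on my part, not the original schema):

```sql
-- Assumed shape of the table (column names are illustrative):
CREATE TABLE public.workspace_details (
  workspace_id   integer,
  reporting_date date,       -- the date this row reports on
  activated_at   timestamp   -- NULL until the workspace performs an activation event
);
```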
Now, let’s talk about Cube.
Cube, the semantic layer
Cube is the semantic layer for building data applications, created to make data consistent and accessible. It solves the many-to-many problem data teams face (many data sources on one side, many data consumers on the other) by serving as a source of truth for metrics definitions, access control rules, and caching settings.
Cube connects to numerous data sources, from cloud data warehouses to streaming platforms, and delivers data to all kinds of data consumers via multiple APIs, including the SQL API. It is deliberately visualization-agnostic and delegates data presentation and visualization to BI tools, data notebooks, front-end apps, etc.
Cube is open source, and there’s also Cube Cloud, which provides a fully managed experience and additional tooling on Free, Standard, and Enterprise tiers (see pricing).
Data modeling
In Cube, we define metrics in data model files that seem to be heavily inspired by LookML.
At the top level, everything is grouped in “cubes” that are usually modeled over tables or views in the data warehouse but can also be modeled over pretty complex queries. Cubes include definitions for measures (quantitative features), dimensions (qualitative features), segments (reusable filters), and pre-aggregations (declarative definitions of rollups that drastically speed up queries, more on them later).
Here’s the data model for our activation metric:
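What follows is a sketch rather than a verbatim definition: the cube name, column references, and exact SQL expressions are assumptions on my part.

```javascript
// Sketch of a Cube data model for the activation metric;
// cube, column, and measure names are illustrative assumptions.
cube(`workspace_activation`, {
  sql: `SELECT * FROM public.workspace_details`,

  measures: {
    workspaces_count: {
      sql: `workspace_id`,
      type: `countDistinct`,
    },

    activated_workspaces_count: {
      sql: `workspace_id`,
      type: `countDistinct`,
      filters: [{ sql: `${CUBE}.activated_at IS NOT NULL` }],
    },

    activation_rate: {
      // ROUND is used purely for presentation here
      sql: `ROUND(100.0 * ${activated_workspaces_count} / NULLIF(${workspaces_count}, 0), 1)`,
      type: `number`,
    },
  },

  dimensions: {
    reporting_day: {
      sql: `reporting_date`,
      type: `time`,
    },
  },
});
```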
A few observations here:
- Similarly to Looker, we define measures as formulas, not as fully-formed SQL tables.
- While we can use “formatting” functions like ROUND in measure definitions, formatting options are deliberately limited. Cube puts the burden of data presentation on downstream tools.
- By default, we don’t have to specify how our reporting date should be broken down. We’ll be able to provide the desired time granularity at query time.
- Needless to say, we can define other cubes and joins between them.
Query syntax
Cube provides a set of APIs to fetch data: SQL API (mostly for BI tools and data notebooks), REST API, and GraphQL API (mostly for front-end applications). A simple query to the SQL API:
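(A sketch, assuming the cube and measure names from the data model above.)

```sql
-- Overall activation rate across all workspaces.
SELECT MEASURE(activation_rate)
FROM workspace_activation;
```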
Of course, we can run a more complex query, too:
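For example, the activation rate broken down by week might look like this (again, a sketch with the same assumed names):

```sql
-- Weekly activation rate; the time dimension is truncated at query time.
SELECT
  DATE_TRUNC('week', reporting_day) AS reporting_week,
  MEASURE(activation_rate) AS activation_rate
FROM workspace_activation
GROUP BY 1
ORDER BY 1;
```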
Queries would turn into JSON when expressed in the REST API syntax. First one:
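(Sketched with the same assumed member names.)

```json
{
  "measures": ["workspace_activation.activation_rate"]
}
```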
Second one:
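(The weekly breakdown moves into timeDimensions; member names are still the assumed ones.)

```json
{
  "measures": ["workspace_activation.activation_rate"],
  "timeDimensions": [
    {
      "dimension": "workspace_activation.reporting_day",
      "granularity": "week"
    }
  ]
}
```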
Since we haven’t connected a downstream tool yet, we can run queries and explore the metrics in Cube’s Playground with drop-down inputs for measures and dimensions:
We can instantly get the JSON queries for the REST API here:
More conveniently, we can also look up the SQL statement that would be generated by Cube and run against the data warehouse when the query is executed. No surprise here: Cube uses the provided query and data model definitions to compose the final query:
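For the weekly query above, it would be roughly along these lines, though Cube’s literal output differs in aliases and formatting:

```sql
-- A sketch of the SQL Cube composes from the query and the data model.
SELECT
  date_trunc('week', reporting_date) AS reporting_day_week,
  ROUND(
    100.0 * COUNT(DISTINCT CASE WHEN activated_at IS NOT NULL THEN workspace_id END)
          / NULLIF(COUNT(DISTINCT workspace_id), 0),
    1
  ) AS activation_rate
FROM public.workspace_details
GROUP BY 1
ORDER BY 1;
```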
Query acceleration
Obviously, queries can be expensive: costly to run, too slow to wait for, or both.
In that case, we can define a pre-aggregation (actually, the recommendation is always to use pre-aggregations in production). Here’s what we need to append to the data model:
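Something like this, using the assumed names from the sketch above:

```javascript
// Appended inside the cube definition above;
// the pre-aggregation and member names are illustrative assumptions.
preAggregations: {
  main: {
    measures: [CUBE.activation_rate],
    timeDimension: CUBE.reporting_day,
    granularity: `week`,
  },
},
```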
It instructs Cube to build a rollup and use it to serve queries that contain any subset of the provided measures and dimensions (namely, activation_rate and reporting_day with weekly granularity). Usually, a single pre-aggregation speeds up multiple queries. We can also configure freshness: a pre-aggregation update can be triggered on a schedule (e.g., every day) or when an arbitrary SQL statement yields a new result.
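In the data model, that freshness setting is a refreshKey inside the pre-aggregation; for example (values are illustrative):

```javascript
// Illustrative freshness settings for the pre-aggregation above:
refreshKey: {
  every: `1 day`,
  // Alternatively, rebuild whenever this statement yields a new result:
  // sql: `SELECT MAX(reporting_date) FROM public.workspace_details`,
},
```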
Now, the last query would return the same result, but a completely different query would be executed against the data warehouse:
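Something along these lines (a sketch; the table and column names Cube generates differ in detail):

```sql
-- Instead of aggregating 20 million raw rows, Cube reads
-- pre-computed values from the rollup table.
SELECT
  reporting_day_week,
  activation_rate
FROM prod_pre_aggregations.workspace_activation_main
ORDER BY 1;
```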
Truth be told, it wouldn’t be executed against the data warehouse at all. The prod_pre_aggregations.workspace_activation_main table is transparently created and maintained in Cube Store, a custom-built, massively distributed columnar storage for pre-aggregation data.
There’s an option to skip Cube Store and keep pre-aggregations in the data warehouse, but that’s not the most reasonable thing to do. With pre-aggregations in Cube Store, Cube can serve practically any query with sub-second latency (expect something between 50 ms and 500 ms) and allow for decent concurrency (say, 100 QPS or maybe 1000 QPS).
Okay, but what would Cube do if we queried for another granularity? Since it’s aware that the queried measures are defined with distinct counts, it won’t use the rollup and will transparently fall back to querying the data warehouse (there’s an option to disallow that at all times, if we need to).
Data consumers
Now, we’re ready to connect downstream tools. We can pick anything that supports Cube’s SQL API, and that’s practically any tool. Cube’s SQL API uses the Postgres dialect and wire protocol, so if something works with Postgres as an upstream source, it works with Cube:
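For instance, a quick sanity check from plain psql would look something like this (host, port, and credentials are placeholders; the cube name is the assumed one from above):

```sql
-- Connect to Cube's SQL API as if it were a Postgres database (placeholders):
--   psql -h <cube-host> -p 5432 -U cube -d db
-- ...and query the cube like a regular table:
SELECT MEASURE(activation_rate)
FROM workspace_activation;
```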
We have numerous options here. Out of respect for the first tech company ever where the first triple-name-tie is Claire, let’s connect Cube to Hex:
Indeed, Hex works with Cube:
In case we’re building a front-end app, we can directly fetch from the REST API endpoint or use the provided JavaScript client library:
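A minimal sketch with the @cubejs-client/core package; the API URL and token are placeholders, and the member names are the assumed ones from above:

```javascript
import cubejs from '@cubejs-client/core';

// Placeholders: use your own deployment URL and API token.
const cubejsApi = cubejs('CUBE_API_TOKEN', {
  apiUrl: 'https://example.cubecloud.dev/cubejs-api/v1',
});

// Member names assume the data model sketched above.
const resultSet = await cubejsApi.load({
  measures: ['workspace_activation.activation_rate'],
  timeDimensions: [
    {
      dimension: 'workspace_activation.reporting_day',
      granularity: 'week',
    },
  ],
});

console.log(resultSet.tablePivot());
```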
Now you can plot the data with a charting library of your choice:
Wrapping up
Hopefully, this write-up provides some insight into how Cube works and helps compare it to Looker, dbt Metrics, and Lightdash from the original deep dive.
Also, if you’re using dbt Metrics, please keep in mind that you can leverage Cube’s integration with dbt that transparently reads metrics definitions from dbt and automagically creates a data model from them.
Currently, Cube uses JSON-like markup for data model definitions and Looker-like metrics organization into cubes. Later in 2022, Cube is likely to introduce YAML markup for data models and an alternative approach to metrics.
With that, I do have an answer to the final question from Pedram’s article:
Can a metrics layer be universal enough to gain applicability across the data stack yet still be designed in such a way to be relevant to BI tools?
Yes. Try Cube.
The easiest way to get started with Cube is Cube Cloud. It provides managed hosting and convenient tools for Cube, including the Playground, the data model editor with syntax highlighting, and—more importantly—the ability to instantly fork data models and query private versions of APIs while the production version stays intact and serves queries.
Sign up for Cube Cloud on the free tier today and experience that yourself. Feel free to connect your production dataset or tinker with the Postgres instance used in this blog post: