When I was reading Pedram’s deep dive on metrics layers a few weeks ago, I couldn’t help but question the tool selection. One could obviously blame my employer affiliation, which is hard to deny. However, the real reason was this summary statement:
All three tools have different trade-offs, and their strengths and weaknesses tell of the challenges a metrics layer faces.
While this is a true statement, it dismisses the fact that there’s an ergonomic, feature-complete, and accessible solution on the market that doesn’t suffer from the highlighted pitfalls. With that, I’m putting Cube in the spotlight and kindly requesting your feedback (or blowback) on social media.
In previous episodes
If you haven’t read the original deep dive on semantic layers, please do. I promise it’s gonna be a rewarding experience.
There, Pedram mentions some prior thought leadership on the topic and introduces a challenge for three tools: Looker, dbt Metrics, and Lightdash. They have to model and provide access to the Activation Metric, a crucial characteristic of an imaginary B2B SaaS product with users and workspaces. A workspace is considered active if its users perform some activation event, so we want to report on the activation rate daily, weekly, or monthly and see how it changes over time. Later, Pedram shows how Looker, dbt Metrics, and Lightdash handle the data modeling and elaborates on their strengths and pitfalls.
Here’s the workspace_details table matching the imaginary dataset that was used by all tools to calculate the metrics. I’ve generated 20 million rows and put them into Postgres, so we have something to work with:
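Roughly, here’s the shape I’m assuming: one row per workspace per reporting date, with a timestamp that is set once the workspace performs an activation event (the column names are assumptions on my part, not the original schema):

```sql
-- Assumed shape of the table (column names are illustrative):
CREATE TABLE public.workspace_details (
  workspace_id   integer,
  reporting_date date,       -- the date this row reports on
  activated_at   timestamp   -- NULL until the workspace performs an activation event
);
```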
Now, let’s talk about Cube.
Cube, the semantic layer
Cube is the semantic layer for building data applications, created to make data consistent and accessible. It solves the many-to-many problem data teams face (many data sources on one side, many data consumers on the other) by serving as a source of truth for metrics definitions, access control rules, and caching settings.
Cube connects to numerous data sources, from cloud data warehouses to streaming platforms, and delivers data to all kinds of data consumers via multiple APIs, including the SQL API. It is deliberately visualization-agnostic and delegates data presentation and visualization to BI tools, data notebooks, front-end apps, etc.
Cube is open source, and there’s also Cube Cloud, which provides a fully managed experience and additional tooling on Free, Standard, and Enterprise tiers (see pricing).
Data modeling
In Cube, we define metrics in data model files that seem to be heavily inspired by LookML.
At the top level, everything is grouped in “cubes” that are usually modeled over tables or views in the data warehouse but can also be modeled over pretty complex queries. Cubes include definitions for measures (quantitative features), dimensions (qualitative features), segments (reusable filters), and pre-aggregations (declarative definitions of rollups that drastically speed up queries, more on them later).
Here’s the data model for our activation metric:
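What follows is a sketch rather than a verbatim definition: the cube name, column references, and exact SQL expressions are assumptions on my part.

```javascript
// Sketch of a Cube data model for the activation metric;
// cube, column, and measure names are illustrative assumptions.
cube(`workspace_activation`, {
  sql: `SELECT * FROM public.workspace_details`,

  measures: {
    workspaces_count: {
      sql: `workspace_id`,
      type: `countDistinct`,
    },

    activated_workspaces_count: {
      sql: `workspace_id`,
      type: `countDistinct`,
      filters: [{ sql: `${CUBE}.activated_at IS NOT NULL` }],
    },

    activation_rate: {
      // ROUND is used purely for presentation here
      sql: `ROUND(100.0 * ${activated_workspaces_count} / NULLIF(${workspaces_count}, 0), 1)`,
      type: `number`,
    },
  },

  dimensions: {
    reporting_day: {
      sql: `reporting_date`,
      type: `time`,
    },
  },
});
```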
A few observations here:
- Similarly to Looker, we define measures as formulas, not as fully-formed SQL tables.
- While we can use “formatting” functions like ROUND in measure definitions, formatting options are deliberately limited. Cube puts the burden of data presentation on downstream tools.
- By default, we don’t have to specify how our reporting date should be broken down. We’ll be able to provide the desired time granularity at query time.
- Needless to say, we can define other cubes and joins between them.
Query syntax
Cube provides a set of APIs to fetch data: SQL API (mostly for BI tools and data notebooks), REST API, and GraphQL API (mostly for front-end applications). A simple query to the SQL API:
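(A sketch, assuming the cube and measure names from the data model above.)

```sql
-- Overall activation rate across all workspaces.
SELECT MEASURE(activation_rate)
FROM workspace_activation;
```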
Of course, we can run a more complex query, too:
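For example, the activation rate broken down by week might look like this (again, a sketch with the same assumed names):

```sql
-- Weekly activation rate; the time dimension is truncated at query time.
SELECT
  DATE_TRUNC('week', reporting_day) AS reporting_week,
  MEASURE(activation_rate) AS activation_rate
FROM workspace_activation
GROUP BY 1
ORDER BY 1;
```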
Queries would turn into JSON when expressed in the REST API syntax. First one:
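(Sketched with the same assumed member names.)

```json
{
  "measures": ["workspace_activation.activation_rate"]
}
```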
Second one:
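(The weekly breakdown moves into timeDimensions; member names are still the assumed ones.)

```json
{
  "measures": ["workspace_activation.activation_rate"],
  "timeDimensions": [
    {
      "dimension": "workspace_activation.reporting_day",
      "granularity": "week"
    }
  ]
}
```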
Since we haven’t connected a downstream tool yet, we can run queries and explore the metrics in Cube’s Playground with drop-down inputs for measures and dimensions:
We can instantly get the JSON queries for the REST API here:
More conveniently, we can also look up the SQL statement that would be generated by Cube and run against the data warehouse when the query is executed. No surprise here: Cube uses the provided query and data model definitions to compose the final query:
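For the weekly query above, it would be roughly along these lines, though Cube’s literal output differs in aliases and formatting:

```sql
-- A sketch of the SQL Cube composes from the query and the data model.
SELECT
  date_trunc('week', reporting_date) AS reporting_day_week,
  ROUND(
    100.0 * COUNT(DISTINCT CASE WHEN activated_at IS NOT NULL THEN workspace_id END)
          / NULLIF(COUNT(DISTINCT workspace_id), 0),
    1
  ) AS activation_rate
FROM public.workspace_details
GROUP BY 1
ORDER BY 1;
```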
Query acceleration
Obviously, queries can be expensive: costly to run, too slow to wait for, or both.
In that case, we can define a pre-aggregation (actually, the recommendation is always to use pre-aggregations in production). Here’s what we need to append to the data model:
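Something like this, using the assumed names from the sketch above:

```javascript
// Appended inside the cube definition above;
// the pre-aggregation and member names are illustrative assumptions.
preAggregations: {
  main: {
    measures: [CUBE.activation_rate],
    timeDimension: CUBE.reporting_day,
    granularity: `week`,
  },
},
```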
It instructs Cube to build a rollup and use it to serve queries that contain any subset of the provided measures and dimensions (namely, activation_rate and reporting_day with weekly granularity). Usually, a single pre-aggregation speeds up multiple queries. We can also configure freshness: a pre-aggregation update can be triggered on a schedule (e.g., every day) or when an arbitrary SQL statement yields a new result.
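In the data model, that freshness setting is a refreshKey inside the pre-aggregation; for example (values are illustrative):

```javascript
// Illustrative freshness settings for the pre-aggregation above:
refreshKey: {
  every: `1 day`,
  // Alternatively, rebuild whenever this statement yields a new result:
  // sql: `SELECT MAX(reporting_date) FROM public.workspace_details`,
},
```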
Now, the last query would return the same result, but a completely different query would be executed against the data warehouse:
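Something along these lines (a sketch; the table and column names Cube generates differ in detail):

```sql
-- Instead of aggregating 20 million raw rows, Cube reads
-- pre-computed values from the rollup table.
SELECT
  reporting_day_week,
  activation_rate
FROM prod_pre_aggregations.workspace_activation_main
ORDER BY 1;
```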
Truth be told, it wouldn’t be executed against the data warehouse at all. The prod_pre_aggregations.workspace_activation_main table is transparently created and maintained in Cube Store, a custom-built, massively distributed columnar storage for pre-aggregation data.
There’s an option to skip Cube Store and keep pre-aggregations in the data warehouse, but that’s not the most reasonable thing to do. With pre-aggregations in Cube Store, Cube can serve practically any query with sub-second latency (expect something between 50 ms and 500 ms) and allow for decent concurrency (say, 100 QPS or maybe 1000 QPS).
Okay, but what would Cube do if we queried for another granularity? Since it’s aware that the queried measures are defined with distinct counts, it won’t use the rollup and will transparently fall back to querying the data warehouse (there’s an option to disallow that at all times, if we need to).
Data consumers
Now, we’re ready to connect downstream tools. We can pick anything that supports Cube’s SQL API, and that’s practically any tool. Cube’s SQL API uses the Postgres dialect and wire protocol, so if something works with Postgres as an upstream source, it works with Cube:
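For instance, a quick sanity check from plain psql would look something like this (host, port, and credentials are placeholders; the cube name is the assumed one from above):

```sql
-- Connect to Cube's SQL API as if it were a Postgres database (placeholders):
--   psql -h <cube-host> -p 5432 -U cube -d db
-- ...and query the cube like a regular table:
SELECT MEASURE(activation_rate)
FROM workspace_activation;
```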
We have numerous options here. Out of respect for the first tech company ever where the first triple-name-tie is Claire, let’s connect Cube to Hex:
Indeed, Hex works with Cube:
In case we’re building a front-end app, we can directly fetch from the REST API endpoint or use the provided JavaScript client library:
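A minimal sketch with the @cubejs-client/core package; the API URL and token are placeholders, and the member names are the assumed ones from above:

```javascript
import cubejs from '@cubejs-client/core';

// Placeholders: use your own deployment URL and API token.
const cubejsApi = cubejs('CUBE_API_TOKEN', {
  apiUrl: 'https://example.cubecloud.dev/cubejs-api/v1',
});

// Member names assume the data model sketched above.
const resultSet = await cubejsApi.load({
  measures: ['workspace_activation.activation_rate'],
  timeDimensions: [
    {
      dimension: 'workspace_activation.reporting_day',
      granularity: 'week',
    },
  ],
});

console.log(resultSet.tablePivot());
```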
Now you can plot the data with a charting library of your choice:
Wrapping up
Hopefully, this write-up provides some insight into how Cube works and helps compare it to Looker, dbt Metrics, and Lightdash from the original deep dive.
Also, if you’re using dbt Metrics, please keep in mind that you can leverage Cube’s integration with dbt that transparently reads metrics definitions from dbt and automagically creates a data model from them.
Currently, Cube uses JSON-like markup for data model definitions and Looker-like metrics organization into cubes. Later in 2022, Cube is likely to introduce YAML markup for data models and an alternative approach to metrics.
With that, I do have an answer to the final question from Pedram’s article:
Can a metrics layer be universal enough to gain applicability across the data stack yet still be designed in such a way to be relevant to BI tools?
Yes. Try Cube.
The easiest way to get started with Cube is Cube Cloud. It provides managed hosting and convenient tools for Cube, including the Playground, the data model editor with syntax highlighting, and—more importantly—the ability to instantly fork data models and query private versions of APIs while the production version stays intact and serves queries.
Sign up for Cube Cloud on the free tier today and experience that yourself. Feel free to connect your production dataset or tinker with the Postgres instance used in this blog post: