Risen from the ashes of the Great Unbundling of Airflow, data transformation tools serve an important role in modern data pipelines. (Note that I don’t dare to say “modern data stack” in late 2023.)
Among many promising peers, dbt is still the most popular data transformation tool, with almost 8,000 stars on GitHub, a vibrant community, and a tangible impact on the current state of data engineering; more on that below.
Data transformation upstream of the semantic layer
Nowadays, no one would be surprised to see a data transformation tool being used to define hundreds or thousands of staging and intermediate models, grind the data through a complex DAG, and finally output it to a warehouse as data marts.
Similarly, virtually no one would disagree that direct access to transformed data in the warehouse is the root of all evil. It leads to duplicated effort, disparate calculated metrics, poor performance, incoherent access controls, and generally unhappy data teams and business users. Luckily, these are exactly the issues that the semantic layer is designed to solve.
With a semantic layer, you would define your domain-specific data model on top of data marts, use it as the single source of truth for metric definitions, leverage consistent access controls and aggregate awareness, and expose metrics to all kinds of data consumers (BI tools, AI agents, and data apps) via a rich set of APIs.
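For a taste of what that looks like, here's a minimal sketch of a Cube data model defined on top of a hypothetical orders data mart; the table, measure, and dimension names are illustrative only:

```yaml
cubes:
  - name: orders
    sql_table: prod.orders  # hypothetical data mart built by dbt

    measures:
      # Metrics defined once, served consistently to every consumer
      - name: count
        type: count

      - name: total_revenue
        type: sum
        sql: amount

    dimensions:
      - name: status
        sql: status
        type: string
```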
Here's how a few Cube users describe their approach and experience with using dbt and Cube together:
We create a star schema with dbt and feed that to Cube. In Cube, we define metrics on top of that star schema.
We’ve chosen to expose views managed in dbt to Cube. These views are downstream from our fact and dimension tables. To define the views, we usually do `SELECT *` and then add the dimensions that we want to calculate on the fly. That way, all dimension logic is built in dbt, and we use Cube to handle metric definitions and joins.
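As a rough sketch, such a dbt model could look like the following (the table and column names here are hypothetical):

```sql
-- A thin view downstream of a dimension table: select everything,
-- then add dimensions calculated on the fly
{{ config(materialized='view') }}

select
    *,
    case
        when lifetime_value >= 1000 then 'vip'
        else 'standard'
    end as customer_tier  -- dimension logic lives in dbt
from {{ ref('dim_customers') }}
```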
Indeed, it's very heartwarming to know that members of the Cube community are successful in building their data pipelines with both tools. As a leader and fast mover among semantic layers, with more than 16,000 stars on GitHub, Cube has proven itself as a mature and feature-rich solution that integrates and works great with all kinds of data tools, including dbt.
All-new dbt & Cube integration
We’ve always wanted dbt and Cube users to have the best experience using both tools together. Recently, we introduced support for Python and Jinja in Cube, which laid the groundwork for this integration.
Today, we’re excited to announce the all-new dbt integration, based on Python and Jinja. It comes as the `cube_dbt` package, which simplifies defining the data model in the semantic layer on top of dbt models. We are also releasing a guide on using Cube with dbt that explains the integration steps in detail.
In a nutshell, the integration involves loading the metadata of a dbt project from `manifest.json`, selecting the necessary dbt models, and exposing them to Jinja:
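Here's a minimal sketch of what that can look like in a Cube project's Python configuration, assuming a locally available `manifest.json` and a hypothetical `marts/` path filter:

```python
from cube import TemplateContext
from cube_dbt import Dbt

template = TemplateContext()

# Load dbt project metadata and keep only the relevant models
# (the path filter is hypothetical; adjust it to your project layout)
dbt = Dbt.from_file('manifest.json').filter(paths=['marts/'])

# Expose the selected dbt models to Jinja templates
@template.function('dbt_models')
def dbt_models():
    return dbt.models
```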
The next step is to render dbt models as cubes with dimensions in YAML (with column types and primary keys automagically inferred from the dbt project metadata) and to enrich the data model with measures, joins, and pre-aggregations:
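As a sketch, a Jinja template in the data model can iterate over the exposed dbt models and render each one as a cube; the count measure below is just a placeholder for your own metric definitions:

```yaml
cubes:
  {% for model in dbt_models() %}
  - {{ model.as_cube() }}

    dimensions:
      {{ model.as_dimensions() }}

    measures:
      - name: count
        type: count
  {% endfor %}
```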
Needless to say, we encourage you to check the guide on using Cube with dbt and see how everything works in Cube Cloud or Cube Core today. Support for Python and Jinja comes with the Cube Core v0.34 release. (Please note that macOS on M1/M2 as well as Windows are not supported yet, but we’re actively working on bringing support for these platforms soon.)
You’re also very welcome to join our Slack community to share your thoughts and feedback on this integration. Finally, we’ve planned a webinar on October 11, 2023 to talk in depth about the dbt integration. Please don’t hesitate to come, see everything in action, and ask your questions.