Previously, we’ve discussed the recent emergence of headless BI, and reviewed headless BI’s four essential components. Now, let’s take a more in-depth look at one of those layers: data modeling. Here is an introduction to open source data modeling in Cube’s headless BI platform.
How data is modeled is the first fundamental element of the business intelligence stack, because it organizes and streamlines querying a data source. Additionally, the data model defines the relationships between each data entity, which forms the basis of all analysis of that data.
Data modeling is a vital component of the BI stack and must be consistent and universally accessible. And so, its logic should be defined only once, and should be defined in code. Furthermore, this code should be governed by version control, to enable safe iteration as an organization grows.
A data modeling language for Headless BI
Cube’s entire headless BI platform is open source—and this includes our open source data modeling language. This language is used to describe measures, dimensions, and relationships in a data source. Then, Cube servers use the model written in this language to generate SQL queries against data sources when responding to requests made to our REST, GraphQL, and SQL APIs.
Unlike other data modeling languages, Cube’s data modeling language is not tied to any specific interface, visualization, or user experience. This decoupling makes it possible to use the same model definitions across every application. Each data consumer can have a custom-tailored experience in the data application of their choice while still working with consistent and standardized definitions.
The design of Cube’s data modeling language
Cube’s data model is a relational graph of the entities that contain measures and dimensions.
In Cube, entities are called “cubes”—inspired by OLAP cubes. In essence, they are data tables with a semantic meta layer that describes both the cube’s measures and dimensions and its relationships to other cubes.
Cube operates on the foundational primitives of dimensions and measures. Users can define metrics by combining these primitives.
Dimensions are properties of entity objects, e.g., a user’s country or a product’s category.
Measures are aggregations, typically defined as SQL expressions or algebraic operations over other measures:
Finally, cubes are connected to each other and constitute a data graph.
When users construct metrics, the dimensions and measures they query don’t have to belong to the same cube. They can be located in different places on the data graph. Cube will generate appropriate SQL, including taking care of chasm / fan-outs, to join the underlying tables to respond to the request.
Cube’s approach is different from a query-oriented approach in which metrics are defined by a custom query and a bunch of attributes are assigned to that query. That approach essentially results in single cubes, each with a single measure and a subset of dimensions and filters. Because these cubes aren’t connected to one another, we can’t reuse dimensions, filters, joins, access controls, caching configuration, etc.
If we were to continue answering more questions and building more metrics, we would have to build more queries, with duplicated definitions and different join paths. At best, this would be repetitive; at worst, it would lead to inconsistent definitions and a combinatorial explosion of metrics and dimensions.
Instead, we’re focused on letting users define relationships and reuse definitions to build maintainable and scalable data models.
Cube’s data modeling language is based on JavaScript. This may come as a surprise, given that many data modeling frameworks and languages are based on YAML. However, as a markup language, YAML has no runtime and only limited capacity for code organization. This is a big limitation for complex and dynamic data models.
By running a data model in a JavaScript virtual machine, Cube can mutate the model at runtime both reactively and proactively, which enables dynamic data definitions, caching, and access control rules.
Cube’s data modeling language also manages caching. Users can define a set of measures and dimensions that are commonly used together, and materialize them as rollup tables. Since the caching rules are defined in code, it’s possible to manage them using best practices including version control, code reviews, and tests in isolated environments.
In Cube, data processing and storage are written in Rust, based on the Datafusion query execution framework, which uses Apache Arrow as its in-memory format.
A data modeling example
Let’s take a look at how to use Cube’s data modeling language.
We’ll model data for an imaginary e-commerce website. Assume the data warehouse has tables like orders
and users
with information about transactions and who placed them.
Orders
id | ID of the order |
user_id | ID of the user |
value | Value of the order in USD |
status | Status of the order, either “processing” or “completed” |
Users
id | ID of the user |
country | Country of the user |
We can represent these two tables in Cube’s modeling language as cubes with this configuration:
Let’s look at the major elements of the model.
The sql
property describes the table of data that will be used in cube, which references the underlying table within the data warehouse. It also can support arbitrary SQL queries in complex cases.
The cubes include defined measures and dimensions:
- Measures: quantitative data, such as the number of units sold, number of unique visits, revenue, and so forth
- Dimensions: properties of data entities, such as state, gender, product name, or units of time
To build data relationships, we define the joins
property on one of the cubes:
Querying the data model
With the above data model, it’s now possible to query Cube’s APIs to learn total value grouped by country, or a specific company‘s total value from completed orders.
To find out the total value of orders by country, we’d use the following JSON query:
You also can explore the data model from within a BI tool:
These examples are relatively simple, but a data model and queries can get very complex, with many measures, dimensions, and relationships.
By creating a data model foundation and exposing metrics via many APIs, Cube ensures that data is accessible and consistent for every downstream application.
- Want to explore Cube’s data modeling for yourself? Create a free Cube Cloud account now.
- Have questions or input? Drop us a line, join our Slack, or view our headless BI project on GitHub.
- Want updates like these in your inbox? Subscribe to our newsletter.