BigQuery is one of the most well-known data warehouses, and it excels at handling large datasets. However, due to its distributed architecture, it will never give you a sub-second response on any dataset.
For application developers, building an app directly on top of BigQuery means substantial wait times on dashboards and charts, especially dynamic ones (where users can select different date ranges or change filters). We can certainly tolerate that for internal apps, but not for the public-facing apps we ship to customers.
Still, we want to leverage BigQuery's cheap data storage and its ability to process large datasets without giving up performance. BigQuery will act as the single source of truth and store the raw data, while we add an acceleration layer on top of it. This layer will be based on Cube.js: we'll store only small, aggregated tables (so-called "external rollups" or "pre-aggregations") in Cube Store, which will give us the desired sub-second responses. We'll also build an example front-end app with a dashboard:
You can check out the demo and the source code on GitHub. Make sure to play with the date range and switchers: dynamic dashboards benefit the most from external rollups (pre-aggregations).
Cube.js provides external rollups (pre-aggregations) as part of its two-level caching system. The diagram below shows how Cube.js and Cube Store interact with BigQuery: the raw data is pre-aggregated asynchronously and put into Cube Store; when a query comes to the API, it is served from the pre-aggregations, without any requests to BigQuery.
To use the acceleration layer, we need to configure Cube.js to connect to BigQuery and specify the pre-aggregations that we want to build. If you are new to Cube.js, I recommend checking out this 101-style tutorial first and then coming back here. We are going to use the public Hacker News dataset in BigQuery for our app.
First, please make sure you have Docker installed on your machine. Running Cube.js with Docker is the recommended approach.
Second, let's create a new folder for your Cube.js app and navigate to it:
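For example (the folder name here is just an illustration; pick any name you like):

```shell
# Create a folder for the Cube.js app and enter it
mkdir bigquery-cubejs && cd bigquery-cubejs
```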
Then, create a new `docker-compose.yml` file with the Cube.js configuration. We'll use environment variables from the `.env` file and instruct Cube.js to connect to BigQuery:
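A minimal sketch of such a file (the image tag and port are common defaults; adjust them to your setup):

```yaml
version: '2.2'

services:
  cube:
    image: cubejs/cube:latest
    ports:
      # Port for the Cube.js API and the Developer Playground
      - 4000:4000
    env_file: .env
    volumes:
      # Mount the current folder so Cube.js picks up the schema files
      - .:/cube/conf
```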
In the `.env` file, we'll set the credentials for BigQuery. You can learn more about how to obtain them in the docs. Note that Cube.js needs write access to a schema in BigQuery where pre-aggregations will be temporarily stored before being loaded into Cube Store.
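Here's a sketch of what the `.env` file might contain; the placeholder values are assumptions and must be replaced with your own project ID, service account key file, and secret:

```
CUBEJS_DB_TYPE=bigquery
CUBEJS_DB_BQ_PROJECT_ID=your-project-id
CUBEJS_DB_BQ_KEY_FILE=/cube/conf/key.json
CUBEJS_DEV_MODE=true
CUBEJS_API_SECRET=your-secret
```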
That is all we need to let Cube.js connect to BigQuery. Also note that Cube Store is readily available for development purposes when the `CUBEJS_DEV_MODE` environment variable is set to `true`. Learn more about running Cube Store in production in the docs.
Now we're ready to create our first Cube.js data schema file. Cube.js uses the data schema to generate SQL that will be executed in the database. Let's create the `schema/Stories.js` file with the following contents:
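Here's a minimal sketch of such a schema, assuming the `full` table of the public Hacker News dataset; the exact table, its columns, and the `category` logic are illustrative and may differ in your setup:

```javascript
cube(`Stories`, {
  // Raw data: all Hacker News stories from the public dataset
  sql: `
    SELECT *
    FROM \`bigquery-public-data.hacker_news.full\`
    WHERE type = 'story'
  `,

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    // Illustrative categorization based on the story title
    category: {
      type: `string`,
      case: {
        when: [
          { sql: `STARTS_WITH(title, 'Show HN')`, label: `Show HN` },
          { sql: `STARTS_WITH(title, 'Ask HN')`, label: `Ask HN` },
        ],
        else: { label: `Other` },
      },
    },

    time: {
      sql: `timestamp`,
      type: `time`,
    },
  },
});
```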
Now we're ready to run Cube.js:
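Assuming the `docker-compose.yml` file sits in the current folder:

```bash
docker-compose up
```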
Then, let's navigate to the Developer Playground, a convenient web-based tool that helps explore the data, at localhost:4000. There, you can select the `Stories Count` measure and the `Category` dimension, alongside a time dimension named `Stories Time` for `All time` by `Month`, to build a chart as shown below:
If we inspect the generated SQL by clicking the `SQL` button, we'll see the following:
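The exact SQL depends on your schema and Cube.js version, but it should resemble this simplified sketch: a query that groups the raw BigQuery table by category and month:

```sql
SELECT
  CASE
    WHEN STARTS_WITH(title, 'Show HN') THEN 'Show HN'
    WHEN STARTS_WITH(title, 'Ask HN') THEN 'Ask HN'
    ELSE 'Other'
  END `stories__category`,
  DATETIME_TRUNC(DATETIME(`timestamp`), MONTH) `stories__time_month`,
  COUNT(*) `stories__count`
FROM (
  SELECT *
  FROM `bigquery-public-data.hacker_news.full`
  WHERE type = 'story'
) AS `stories`
GROUP BY 1, 2
ORDER BY 2 ASC
```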
This SQL shows that the query runs against the raw data inside BigQuery. Now, let's make it run against a pre-aggregated table inside Cube Store. To do that, we are going to define a pre-aggregation. Usually, this is done inside the same cube, but for the sake of this tutorial, so we can compare the performance with and without the pre-aggregation, let's create a new cube. We can do that inside the same file.
Add the following code to the `schema/Stories.js` file to declare a pre-aggregation and specify which measures and dimensions to include in it:
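A sketch of such a cube, extending `Stories` and declaring a `main` pre-aggregation; the `external: true` option tells Cube.js to store the rollup in Cube Store (depending on your version, this may already be the default):

```javascript
cube(`StoriesPreAgg`, {
  // Inherit the SQL, measures, and dimensions from the Stories cube
  extends: Stories,

  preAggregations: {
    main: {
      measures: [CUBE.count],
      dimensions: [CUBE.category],
      timeDimension: CUBE.time,
      granularity: `month`,
      // Store the rollup in Cube Store rather than in BigQuery
      external: true,
    },
  },
});
```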
Now, go to the Developer Playground, refresh the page, and select the same measures and dimensions as before (count, category, and time grouped by month), but from the `Stories Pre Agg` cube.
When this query runs for the first time, Cube.js will generate the pre-aggregation and load it into Cube Store. All subsequent requests will go directly to Cube Store. You can inspect the generated SQL, and it should look like the following:
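Roughly like this (a simplified sketch; the column aliases will differ in your output):

```sql
SELECT
  `stories_pre_agg__category`,
  `stories_pre_agg__time_month`,
  sum(`stories_pre_agg__count`) `stories_pre_agg__count`
FROM
  stb_pre_aggregations.stories_pre_agg_main
GROUP BY 1, 2
ORDER BY 2 ASC
```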
As you can see, the query goes to the `stb_pre_aggregations.stories_pre_agg_main` table inside Cube Store. You can play around with filters to see the performance boost of the accelerated query compared to the raw one.
You can also check this demo dashboard with multiple charts and compare its performance with and without pre-aggregations. The source code of the example dashboard is available on GitHub.
Happy hacking!