This article will show how to use the Firebolt data warehouse to generate insights from terabytes of Wikipedia page view logs. We will also use Cube to provide access control and deliver the data via Cube's REST API to end users of a web application built with React.
Here's the application we're going to build:
You can view the demo or browse the source code on GitHub.
Also, if you're already curious about Firebolt and Cube:
"Building lightning-fast data apps with Firebolt and Cube" webinar on September 15, 2022.
You'll see both tools in action, learn from their teams, and have a chance to discuss how Firebolt and Cube can enable sub-second analytics experiences for your company.
Check out the agenda and register today 👈
What is Firebolt?
Firebolt is a fully-managed data warehouse service offering a sub-second, highly concurrent analytics experience.
Performance. High performance while processing TBs of data is possible because Firebolt stores data in a custom Firebolt File Format (F3). This column-based format supports data compression and indexes the values while ingesting them into the data warehouse. The sparse index built during data ingestion, together with Firebolt's query optimization, allows the engine to quickly filter out large amounts of data while processing a query. Thus, it decreases the amount of data that needs to be scanned to produce the result.
Cloud. With Firebolt, we don’t need to worry about setting up the computing environment. Everything is automated, including shutdowns of an unused data warehouse to avoid unnecessary costs. If we need more processing power, we can switch the underlying database engine to a more powerful cluster using a human-friendly web UI within a few minutes. Similarly, when we don’t need massive processing power, we can deploy a cheaper cluster with a couple of clicks.
How to create a Firebolt account?
We’ll need a Firebolt account to run the code in this tutorial. If you want to follow along and don’t have an account yet, you will need to schedule a call with the Firebolt team using the contact form on their website.
What is Cube?
Cube is a headless business intelligence platform. Its data modeling layer allows us to hide the complexity of the data warehouse schema from data consumers. Additionally, we can use Cube’s access control layer to provide row-based data protection regardless of the upstream data source.
APIs. Cube can deliver data to any application via its REST, GraphQL, and SQL APIs. Every data consumer — and every user — receives the same data, no matter which access method they prefer.
With Cube, we can build a single data access platform powering data analytics dashboards and front-end applications. At the same time, Cube works as an abstraction layer between the data engineering team and the data consumers. Data engineers can modify the underlying data structures any way they want while using Cube to provide a consistent interface for the downstream users.
How to create a Cube account?
In addition to a Firebolt account, we also need access to Cube. In this case, you have three options. You can sign up to create a new account in Cube Cloud, deploy your own self-hosted instance of Cube, or schedule a call with the Cube team to discuss what would work best for your project.
What are we building?
Let’s use Firebolt to analyze the massive volume of page view logs from Wikipedia. (A single year of logs amounts to around 4 TB of data.) Additionally, we will use Cube to encapsulate the queries and provide access control.
We’ll try to figure out whether a sudden spike in the number of readers of a particular article correlates with events in the real world. For example, we’ll see whether more people than usual visited the Wikipedia page about Queen Elizabeth II on June 2nd, 2022.
Outlier detection
We won’t use any sophisticated algorithms. Instead, we will calculate the average number of views per article during a 30-day time window, along with the standard deviation. We will mark the number of views on a given day as an outlier when it exceeds the average plus three times the standard deviation calculated from the preceding 30 days. For instance, if an article averages 1,000 daily views with a standard deviation of 150, any day with more than 1,000 + 3 × 150 = 1,450 views counts as an outlier.
For example, let's say we have the following 30 days of values. The value on the 31st day is an outlier because it exceeds the outlier threshold defined in the previous paragraph.
This is not a perfect solution. This method makes sense when the data is normally distributed. However, as with many human activities, website views tend to follow a log-normal distribution. Finding outliers in a log-normal distributed dataset is a rather math-heavy operation.
If you are interested in getting better results in such cases, take a look at the “The use of outlier detection methods in the log-normal distribution for the identification of atypical varietal experiments” research paper by Andrzej Kornacki and Andrzej Bochniak from the University of Life Sciences in Lublin, Poland. Our approach may not be suitable for a business-critical production deployment, but it’s good enough for a technical tutorial.
Data ingestion
We will ingest the Wikipedia logs into an S3 bucket, transform the log data to a format supported by Firebolt, and load them into the data warehouse. After that, we will calculate the time window aggregations and search for the outliers. We will store all of them in a separate fact table inside the data warehouse.
Following industry custom, we run all of those operations with an Airflow DAG. Being mindful of your time, we won't describe this fairly standard step in detail.
Later, we will set up the Cube integration to access the calculated outliers. In Cube, we will define access control rules to limit the data a user can retrieve. Finally, we will expose the data via a REST API endpoint and build a web application displaying the number of outliers on a given day.
Data engineering inside Firebolt
After the page view logs are downloaded from the Wikipedia log server and stored in S3, we're ready to load them into Firebolt. But before that, we need to perform a one-time setup.
One-time setup: S3 integration and tables
Let’s log in to the Firebolt web UI and create a new database. Of course, if you already have a database, you can skip to the step where we create an external table.
In the web UI, we click the New Database button and fill out the form. Remember to choose the correct AWS region. You may also need to create a database engine if it doesn’t exist yet.
After creating the database, we start the Firebolt engine, wait until it boots up, and open the "SQL Workspace" tab.
In the SQL Workspace, we create a new external table backed by an S3 bucket. In the table definition, in addition to the column definition, we specify the AWS access credentials, file compression, and data format:
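Here's a sketch of what that external table definition could look like. The bucket URL, credentials, and column layout are assumptions for illustration; adjust them to match your setup:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS ex_pageviews (
    -- wiki code (e.g., 'en.wikipedia') and the article title
    wiki_code TEXT,
    article_title TEXT,
    -- number of views recorded in the log file
    views BIGINT
)
CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
URL = 's3://our-wikipedia-logs/pageviews/'  -- hypothetical bucket
OBJECT_PATTERN = '*.gz'
TYPE = (CSV)
COMPRESSION = GZIP;
```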
If you need help configuring proper access permissions, refer to the “Create an IAM permissions policy” section in Firebolt’s “Using AWS roles to access S3” documentation page. You can also use an AWS role instead of user credentials. This documentation page will help you configure it as well.
Now, we create the first fact table. It will be our first table using the performant F3 file format for storage:
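A sketch of the fact table, reusing the column layout above (the exact table and column names in the real project may differ):

```sql
CREATE FACT TABLE IF NOT EXISTS bm_cube_pageview_days (
    log_date DATE,
    wiki_code TEXT,
    article_title TEXT,
    views BIGINT
)
-- the primary index drives F3's sparse indexing
PRIMARY INDEX log_date, wiki_code, article_title;
```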
Our logs contain data from desktop and mobile clients. We can combine them using an aggregating index. Below, we define an aggregating index over three columns. Our index isn’t perfect, though. We have only two rows for every log date, wiki code, and article title.
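A sketch of such an index, assuming the fact table above; it groups the desktop and mobile rows into a single daily total per article:

```sql
CREATE AGGREGATING INDEX idx_pageview_days_daily ON bm_cube_pageview_days (
    log_date,
    wiki_code,
    article_title,
    SUM(views)
);
```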
Firebolt recommends creating indexes with low cardinality (low number of unique grouping keys). In our case, we cannot do anything about it, but please remember that we would get better performance if we could aggregate more rows into a single group and reduce the number of unique groups.
We will generate time-window aggregations. To avoid recalculating them all the time, we will store the results in yet another table:
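A sketch of the aggregates table; here, a hypothetical window_end column stores the last day of the 30-day window:

```sql
CREATE FACT TABLE IF NOT EXISTS bm_cube_pageviews_aggregates (
    window_end DATE,
    wiki_code TEXT,
    article_title TEXT,
    avg_views DOUBLE,
    stddev_views DOUBLE
)
PRIMARY INDEX window_end, wiki_code, article_title;
```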
We could include the mapping between the aggregation window and the next day in every data retrieval query, but we can also create a dimension table to store the mapping:
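A minimal sketch of the dimension table; it simply maps each aggregation window to the day that follows it:

```sql
CREATE DIMENSION TABLE IF NOT EXISTS bm_cube_date_mapping (
    window_end DATE,
    next_day DATE
);
```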
Finally, we want a table with all the outliers. We store outlier detection results in a table, so we can access them without running the filter query all the time. Here is our table:
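A sketch of the outliers table, under the same naming assumptions; daily_views stores the actual count, and threshold stores the computed cutoff:

```sql
CREATE FACT TABLE IF NOT EXISTS bm_cube_outliers (
    log_date DATE,
    wiki_code TEXT,
    article_title TEXT,
    daily_views BIGINT,
    threshold DOUBLE
)
PRIMARY INDEX log_date, wiki_code, article_title;
```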
Our schema looks like this:
Daily data ingestion
After the setup, we can start ingesting the data. Below, we show only the SQL queries used to load the data from S3. (Needless to say, these queries should be run with Airflow. See the DAG on GitHub.)
In the query, we read from the external table and copy the values into our pageviews table. Note that we extract the date from the source_file_name and use the file name to limit the amount of processed data. The `{{ ds_nodash }}` value comes from the Airflow task context:
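A sketch of the load query under the assumptions above; for simplicity, it derives the log date from the Airflow execution date (`{{ ds }}`) rather than parsing the file name:

```sql
INSERT INTO bm_cube_pageview_days
SELECT
    CAST('{{ ds }}' AS DATE) AS log_date,
    wiki_code,
    article_title,
    views
FROM ex_pageviews
-- filtering on the file name limits the amount of processed data
WHERE source_file_name LIKE '%{{ ds_nodash }}%';
```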
After the data ingestion, we aggregate the daily counts from the past 30 days for every article separately and store them in the `bm_cube_pageviews_aggregates` table. Again, we use the Airflow template variables to define the time window:
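A sketch of the aggregation query; the inner query combines desktop and mobile counts into one daily total (served by the aggregating index), and the outer query computes the window statistics:

```sql
INSERT INTO bm_cube_pageviews_aggregates
SELECT
    CAST('{{ ds }}' AS DATE) AS window_end,
    wiki_code,
    article_title,
    AVG(day_views) AS avg_views,
    STDDEV_SAMP(day_views) AS stddev_views
FROM (
    -- combine desktop and mobile counts into one daily total
    SELECT log_date, wiki_code, article_title, SUM(views) AS day_views
    FROM bm_cube_pageview_days
    WHERE log_date BETWEEN CAST('{{ macros.ds_add(ds, -30) }}' AS DATE)
                       AND CAST('{{ ds }}' AS DATE)
    GROUP BY log_date, wiki_code, article_title
) daily_counts
GROUP BY wiki_code, article_title;
```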
Finally, we update the dimension table to store the mapping between aggregated values and the daily count. In the dimension table, we want to refer to the daily count a day after the end of the time window:
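A minimal sketch of the mapping update: the window ending on the execution date points at the following day:

```sql
INSERT INTO bm_cube_date_mapping
VALUES (
    CAST('{{ ds }}' AS DATE),                     -- end of the window
    CAST('{{ macros.ds_add(ds, 1) }}' AS DATE)    -- the next day
);
```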
Outlier detection in Firebolt
Finally, we can perform the last stage of a daily load and calculate which pages had more visitors than we expected.
In our final query, we combine the values from the aggregated time-range table and the daily counts using the dimension table. This is also where we define what an “outlier” means in our business domain by calculating the views threshold:
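A sketch of the outlier query under the assumptions above; the threshold is the 30-day average plus three standard deviations:

```sql
INSERT INTO bm_cube_outliers
SELECT
    d.log_date,
    d.wiki_code,
    d.article_title,
    SUM(d.views) AS daily_views,
    MAX(a.avg_views + 3 * a.stddev_views) AS threshold
FROM bm_cube_pageview_days d
-- the dimension table links each day to its preceding 30-day window
JOIN bm_cube_date_mapping m
  ON m.next_day = d.log_date
JOIN bm_cube_pageviews_aggregates a
  ON a.window_end = m.window_end
 AND a.wiki_code = d.wiki_code
 AND a.article_title = d.article_title
WHERE d.log_date = CAST('{{ ds }}' AS DATE)
GROUP BY d.log_date, d.wiki_code, d.article_title
HAVING SUM(d.views) > MAX(a.avg_views + 3 * a.stddev_views);
```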
What makes Firebolt so fast?
Before we continue integrating Cube and Firebolt, let’s talk about the technology that makes Firebolt such a high-performing data warehouse.
First of all, is Firebolt fast? Let’s send a query to retrieve outliers in the English Wikipedia on `2022-06-10`:
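A query along these lines (the wiki code value is an assumption about how the dataset encodes the English Wikipedia):

```sql
SELECT article_title, daily_views
FROM bm_cube_outliers
WHERE wiki_code = 'en'
  AND log_date = '2022-06-10'
ORDER BY daily_views DESC;
```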
Firebolt finds and sorts over half a million rows in less than a second.
Also, note the tiny amount of scanned data: only 31 MB. How is that possible? Let’s take a look at the numbers. At the time of writing, I have almost 4 TB of data in my database. Of course, the compressed size is much smaller. Note the impressive compression rate of our `bm_cube_pageview_days` table: 10.5 times smaller than the uncompressed data!
Compression reduces the amount of data transmitted between cluster nodes and the underlying storage. However, compression alone wouldn’t be enough if we retrieved entire files every time. Usually, data warehouses or cloud query engines use partitions to limit the number of retrieved files. The Firebolt Cloud Data Warehouse Whitepaper explains how Firebolt fixes the problem and retrieves even smaller chunks of data:
Firebolt accesses much smaller ranges through the use of indexes. For example, Firebolt only fetches the data ranges it needs, not entire partitions or segments. What makes this possible is sparse indexing. In Firebolt, the unit of remote storage is a segment, and it can be any size (see the Data ingestion section below for more.) Within each segment are much smaller units called ranges. Firebolt only accesses much smaller ranges of data instead of larger segments. Storing only the ranges not only dramatically reduces data access times. It reduces the amount of data scanned, and scan times as well.
In Firebolt, we can define aggregating and join indexes to prepare pre-computed data. The query optimizer will automatically use the pre-computed data instead of retrieving raw data from storage. Of course, that’s not the only feature of the optimizer. The Firebolt whitepaper says:
The query optimizer evaluates whether query performance can be improved by reordering query operations, or by using indexes in place of operations. It reorders operations to take advantage of predicate pushdowns, and uses aggregating and join indexes to further reduce both data access and scans.
Connecting Cube to Firebolt
Let’s switch to Cube. After signing up for Cube Cloud, we open the Cube deployments dashboard and click Create Deployment:
In the next screen, we choose the deployment platform (AWS), the region, and the name of the Cube application:
We don’t have any Cube code yet, so we must start from scratch. Hence, we click Create in the second step of the setup:
Finally, we choose Firebolt in the database connection window:
In the final setup step, we provide the database name, database engine endpoint URL, the API URL, and our credentials:
After connecting to the database, we can choose to generate the schema automatically. In our case, we need only the outliers table, so that’s the only table we choose:
Let’s click Generate and wait until Cube deploys our application. When the app is ready, we open the “Playground” tab and test whether everything works as expected.
In the Playground, we configure a query and click Run. Note that we specify the Wiki code and limit the number of daily hits because we want to get more interesting results. The top page is usually Wikipedia's homepage:
It seems we have found the answer to our question about the Queen. Not entirely, though. The page `Queen_of_the_United_Kingdom` is just one of many redirects to the Wikipedia article about the British monarchy. Wikipedia's readers visited some of them more often than usual.
Since everything works correctly, we can switch to the “Schema” tab and define access control rules.
Access control in Cube
In the previous section, we filtered results by Wiki code. Let’s make it a permanent rule.
To secure our dataset, we’ll use row-based access control in Cube. We want to allow users to access outliers from only some Wikipedias. We’ll need to retrieve the Wikipedia names from the user’s security context and add a filter to every query issued by the user.
Let’s open the “Schema” tab and the `cube.js` file. In this file, we implement the `queryRewrite` function.
First, we must check whether the security context contains a list of Wikipedias and reject all queries without it. Next, we modify the user’s query by adding a filter limiting the data the user can see, and we return a modified query to overwrite the user’s query:
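A sketch of this logic; the cube and dimension names (`BmCubeOutliers.wikiCode`) are assumptions based on the generated schema:

```javascript
// cube.js
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    // reject queries that arrive without a list of allowed Wikipedias
    if (!securityContext.wikipedias) {
      throw new Error('No wikipedias found in the security context');
    }

    // add a mandatory filter limiting the data the user can see
    query.filters.push({
      member: 'BmCubeOutliers.wikiCode',
      operator: 'equals',
      values: securityContext.wikipedias,
    });

    // return the modified query to overwrite the user's query
    return query;
  },
};
```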
Now, we save the changes, commit them, and push the code to the repository. Click ... in the top-right corner, then Commit and push.
After a while, the application will get deployed in production, and we will no longer be able to run our previous test query.
Let’s open the previous query in the Playground and try running it anyway. In the result, we should see an error message because our security context doesn’t contain a list of Wikipedias:
In the Playground, we can modify the user’s context by clicking Add Security Context and using this as JSON:
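For example, a security context along these lines (the exact value depends on how the wiki codes are stored in the dataset):

```json
{
  "wikipedias": ["en"]
}
```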
Now, our query returns pages from the English Wikipedia and nothing else:
Exposing data to React application
Now we're all set to build a front-end app.
We will modify the data model to perform a last-mile data transformation: front-end engineers will thank us for sound dimensions exposed via the API. Then, we'll build a web application with React, the most popular JavaScript framework, and the nivo charting library, a modern production-ready data visualization tool.
Updating the data model
Let's go back to Cube Cloud, click Enter Development Mode, open the Schema tab, navigate to `schema/BmCubeOutliers.js`, and replace its contents with the following:
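Here's a sketch of what the updated schema could look like. The column names, the `region` and `url` derivations, and the pre-aggregation settings are assumptions for illustration; see the source code on GitHub for the real thing:

```javascript
cube(`BmCubeOutliers`, {
  // derive region and url from the raw columns for the front-end app
  sql: `
    SELECT
      *,
      SPLIT_PART(wiki_code, '.', 1) AS region,
      'https://' || wiki_code || '.org/wiki/' || article_title AS url
    FROM bm_cube_outliers
  `,

  measures: {
    count: { type: `count` },
  },

  dimensions: {
    date: { sql: `log_date`, type: `time` },
    region: { sql: `region`, type: `string` },
    title: { sql: `article_title`, type: `string` },
    url: { sql: `url`, type: `string` },
    views: { sql: `daily_views`, type: `number` },
  },

  // segments simplify selecting only a part of the whole dataset
  segments: {
    wikipedia: { sql: `${CUBE}.wiki_code LIKE '%.wikipedia'` },
    wiktionary: { sql: `${CUBE}.wiki_code LIKE '%.wiktionary'` },
  },

  // cache the slowly changing queries: all regions and all days
  preAggregations: {
    regions: {
      dimensions: [CUBE.region],
    },
    days: {
      timeDimension: CUBE.date,
      granularity: `day`,
    },
  },
});
```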
A few highlights of the code above:

- In the `sql` property, we use slightly more complex SQL than a regular "select star" query; `region` and `url` are derived from the dataset for the needs of the front-end app.
- In the `segments` property, we define a set of filters to simplify selecting only a part of the whole dataset, e.g., only Wikipedia articles or Wiktionary definitions; very handy for simplifying the queries in the front-end app code.
- In the `preAggregations` property, we configure caching for non-critical, slowly changing queries: the list of all regions and the list of all days. These queries would be run by every user on every web application load, so it makes sense to use the cache to prevent numerous identical queries, each traversing terabytes of data when fulfilled, from hitting the data warehouse. Unique, one-off queries would be run directly against Firebolt.
Also, let's replace the contents of `cube.js` with the following:
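A sketch of the updated `cube.js`, matching the logic described below (the `region` dimension name follows the schema sketch above):

```javascript
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    // if the security context lists allowed Wikipedias, enforce the
    // filter by region; otherwise, let the query through unchanged
    if (securityContext.wikipedias) {
      query.filters.push({
        member: 'BmCubeOutliers.region',
        operator: 'equals',
        values: securityContext.wikipedias,
      });
    }

    return query;
  },
};
```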
The new access control logic:
- Let's accept all queries, even if `securityContext.wikipedias` is not set.
- If `securityContext.wikipedias` is set, enforce the filter by `region`.
Time to hit ... in the top-right corner, then Commit and push. Once the Cube deployment is ready, let's get down to the front-end app.
Building a React application
Let's build a single-page application with React and provide end users with a nice UI to explore Wikipedia data.
Here it is! You can check the full source code on GitHub or instantly run it with `yarn dev`. You'll get your own copy of this demo application.
This is a typical React app with an entry point in `src/index.js`. Let's review the contents of this file:
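Here's a condensed sketch of the file, trimmed to a single query for brevity; the environment variable names and component structure are placeholders, so refer to GitHub for the full version:

```javascript
import React, { useState, useEffect } from 'react';
import ReactDOM from 'react-dom';
import cubejs from '@cubejs-client/core';
import { CubeProvider, useCubeQuery } from '@cubejs-client/react';

// establish the API connection with a JWT token and an API URL
const cubejsApi = cubejs(process.env.REACT_APP_CUBE_TOKEN, {
  apiUrl: process.env.REACT_APP_CUBE_API_URL,
});

// compose a Cube query for the list of regions
const regionsQuery = {
  dimensions: ['BmCubeOutliers.region'],
};

function App() {
  // a state variable for the fetched regions
  const [regions, setRegions] = useState([]);

  // fetch the result set via Cube's REST API
  const { resultSet: regionsResultSet } = useCubeQuery(regionsQuery);

  // wait for the data, transform it, and assign it to the state variable
  useEffect(() => {
    if (regionsResultSet) {
      setRegions(
        regionsResultSet
          .tablePivot()
          .map((row) => row['BmCubeOutliers.region'])
      );
    }
  }, [regionsResultSet]);

  return (
    <ul>
      {regions.map((region) => (
        <li key={region}>{region}</li>
      ))}
    </ul>
  );
}

ReactDOM.render(
  <CubeProvider cubejsApi={cubejsApi}>
    <App />
  </CubeProvider>,
  document.getElementById('root')
);
```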
A few highlights of the code above:

- The `@cubejs-client/core` and `@cubejs-client/react` packages contain convenient components and functions to interact with Cube's REST API in React applications. You can also use the REST API with other frameworks like Vue or Angular, vanilla JavaScript apps, or apps written in any other language. Using Cube's GraphQL API is also an option.
- `cubejsApi` helps establish an API connection; it's configured with a JWT token and an API URL. In production, Cube is often used with JSON Web Key Sets; integrations with Auth0 and AWS Cognito are available.
- A typical API interaction flow in a React app looks like this:
  - use `useState` to create a state variable (e.g., `regions`);
  - compose a Cube query (e.g., `regionsQuery`);
  - call `useCubeQuery` to fetch the result set (e.g., `regionsResultSet`);
  - use `useEffect` to wait for the data;
  - transform the data, if needed, with `tablePivot` and functions like `map`, `filter`, `reduce`, or `parseInt`;
  - assign the data to the state variable (e.g., with `setRegions`).
- If needed, you can derive queries from other queries (e.g., see how `articlesQuery` is based on `outliersQuery`) or make conditional changes to them.
- In the very end, the data is rendered. Browse the code of the very simple components used in this app:
  - `RegionSelector` renders a clickable list of country flags;
  - `CalendarChart` renders a clickable calendar;
  - `DayRegionIndicator` displays the text about "popular articles";
  - `LoadingIndicator` appears briefly while a query to Firebolt is in progress;
  - `ArticleList` displays the list of links to popular articles.
That's it! Now you know how this app is built:
Check it out! You'll surely get some insights from it.
What's next?
We encourage you to learn more about Firebolt and Cube:
- Check out Firebolt's website, beautiful docs, blog, and the whitepaper. Talk to them and take a step towards a sub-second analytics experience.
- Check out Cube's website, docs, blog, and example apps. Get started with Cube Cloud for free.
And most importantly, join Cube and Firebolt at the webinar:
"Building lightning-fast data apps with Firebolt and Cube" webinar on September 15, 2022.
You'll see both tools in action, learn from their teams, and have a chance to discuss how Firebolt and Cube can enable sub-second analytics experiences for your company.
Check out the agenda and register today 👈
Also, don't hesitate to join Cube's community of 6000+ data engineers and application developers on Slack and give Cube a star on GitHub. Good luck!