This article will show how to use the Firebolt data warehouse to generate insights from terabytes of Wikipedia page view logs. We will also use Cube to provide access control and deliver the data via Cube's REST API to end users of a web application built with React.
Here's the application we're going to build:
You can view the demo or browse the source code on GitHub.
Also, if you're already curious about Firebolt and Cube:
"Building lightning-fast data apps with Firebolt and Cube" webinar on September 15, 2022.
You'll see both tools in action, learn from their teams, and have a chance to discuss how Firebolt and Cube can enable sub-second analytics experiences for your company.
Check out the agenda and register today 👈
What is Firebolt?
Firebolt is a fully-managed data warehouse service offering a sub-second, highly concurrent analytics experience.
Performance. High performance while processing TBs of data is possible because Firebolt stores data in a custom Firebolt File Format (F3). This column-based format supports data compression and indexes the values while ingesting them into the data warehouse. The sparse index built during data ingestion, together with Firebolt's query optimization, allows the engine to quickly filter out large amounts of data while processing a query. Thus, it decreases the amount of data that needs to be scanned to produce the result.
Cloud. With Firebolt, we don’t need to worry about setting up the computing environment. Everything is automated, including shutdowns of an unused data warehouse to avoid unnecessary costs. If we need more processing power, we can switch the underlying database engine to a more powerful cluster using a human-friendly web UI within a few minutes. Similarly, when we don’t need massive processing power, we can deploy a cheaper cluster with a couple of clicks.
How to create a Firebolt account?
We’ll need a Firebolt account to run the code in this tutorial. If you want to follow along and don’t have an account yet, you will need to schedule a call with the Firebolt team using the contact form on their website.
What is Cube?
Cube is a headless business intelligence platform. Its data modeling layer allows us to hide the complexity of the data warehouse schema from data consumers. Additionally, we can use Cube’s access control layer to provide row-based data protection regardless of the upstream data source.
APIs. Cube can deliver data to any application via its REST, GraphQL, and SQL APIs. Every data consumer — and every user — receives the same data, no matter which access method they prefer.
With Cube, we can build a single data access platform powering data analytics dashboards and front-end applications. At the same time, Cube works as an abstraction layer between the data engineering team and the data consumers. Data engineers can modify the underlying data structures any way they want while using Cube to provide a consistent interface for the downstream users.
How to create a Cube account?
In addition to a Firebolt account, we also need access to Cube. In this case, you have three options. You can sign up to create a new account in Cube Cloud, deploy your own self-hosted instance of Cube, or schedule a call with the Cube team to discuss what would work best for your project.
What are we building?
Let’s use Firebolt to analyze the massive volume of page view logs from Wikipedia. (A single year of logs amounts to around 4 TB of data.) Additionally, we will use Cube to encapsulate the queries and provide access control.
We’ll try to figure out whether a sudden spike in the number of readers of a particular article correlates with events in the real world. For example, we’ll see whether more people than usual visited the Wikipedia page about Queen Elizabeth II on June 2nd, 2022.
Outlier detection
We won’t use any sophisticated algorithms. Instead, we will calculate the average number of views per article during a 30-day time window, along with the standard deviation. We will mark the number of views on a given day as an outlier when it exceeds the average plus three times the standard deviation calculated from the preceding 30 days. For instance, if an article averages 1,000 daily views with a standard deviation of 150, any day with more than 1,000 + 3 × 150 = 1,450 views counts as an outlier.
For example, let's say we have the following 30 days of values. The value on the 31st day is an outlier because it exceeds the outlier threshold defined in the previous paragraph.
This is not a perfect solution. This method makes sense when the data is normally distributed. However, as with many human activities, website views tend to follow a log-normal distribution. Finding outliers in a log-normal distributed dataset is a rather math-heavy operation.
If you are interested in getting better results in such cases, take a look at the “The use of outlier detection methods in the log-normal distribution for the identification of atypical varietal experiments” research paper by Andrzej Kornacki and Andrzej Bochniak from the University of Life Sciences in Lublin, Poland. Our approach may not be suitable for a business-critical production deployment, but it’s good enough for a technical tutorial.
Data ingestion
We will ingest the Wikipedia logs into an S3 bucket, transform the log data to a format supported by Firebolt, and load them into the data warehouse. After that, we will calculate the time window aggregations and search for the outliers. We will store all of them in a separate fact table inside the data warehouse.
Following industry custom, we run all of those operations with an Airflow DAG. Being mindful of your time, we won't describe this fairly standard step in detail.
Later, we will set up the Cube integration to access the calculated outliers. In Cube, we will define access control rules to limit the data a user can retrieve. Finally, we will expose the data via a REST API endpoint and build a web application displaying the number of outliers on a given day.
Data engineering inside Firebolt
After the page view logs are downloaded from the Wikipedia log server and stored in S3, we're ready to load them into Firebolt. But before that, we need to perform a one-time setup.
One-time setup: S3 integration and tables
Let’s log in to the Firebolt web UI and create a new database. Of course, if you already have a database, you can skip to the step where we create an external table.
In the web UI, we click the New Database button and fill out the form. Remember to choose the correct AWS region. You may also need to create a database engine if it doesn’t exist yet.
After creating the database, we start the Firebolt engine, wait until it boots up, and open the "SQL Workspace" tab.
In the SQL Workspace, we create a new external table backed by an S3 bucket. In the table definition, in addition to the column definition, we specify the AWS access credentials, file compression, and data format:
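Here's a sketch of what that external table definition could look like. The bucket URL, credentials, and column layout are assumptions for illustration; adjust them to match your setup:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS ex_pageviews (
    -- wiki code (e.g., 'en.wikipedia') and the article title
    wiki_code TEXT,
    article_title TEXT,
    -- number of views recorded in the log file
    views BIGINT
)
CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
URL = 's3://our-wikipedia-logs/pageviews/'  -- hypothetical bucket
OBJECT_PATTERN = '*.gz'
TYPE = (CSV)
COMPRESSION = GZIP;
```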
If you need help configuring proper access permissions, refer to the “Create an IAM permissions policy” section in Firebolt’s “Using AWS roles to access S3” documentation page. You can also use an AWS role instead of user credentials. This documentation page will help you configure it as well.
Now, we create the first fact table. It will be our first table using the performant F3 file format for storage:
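A sketch of the fact table, reusing the column layout above (the exact table and column names in the real project may differ):

```sql
CREATE FACT TABLE IF NOT EXISTS bm_cube_pageview_days (
    log_date DATE,
    wiki_code TEXT,
    article_title TEXT,
    views BIGINT
)
-- the primary index drives F3's sparse indexing
PRIMARY INDEX log_date, wiki_code, article_title;
```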
Our logs contain data from desktop and mobile clients. We can combine them using an aggregating index. Below, we define an aggregating index over three columns. Our index isn’t perfect, though. We have only two rows for every log date, wiki code, and article title.
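A sketch of such an index, assuming the fact table above; it groups the desktop and mobile rows into a single daily total per article:

```sql
CREATE AGGREGATING INDEX idx_pageview_days_daily ON bm_cube_pageview_days (
    log_date,
    wiki_code,
    article_title,
    SUM(views)
);
```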
Firebolt recommends creating indexes with low cardinality (low number of unique grouping keys). In our case, we cannot do anything about it, but please remember that we would get better performance if we could aggregate more rows into a single group and reduce the number of unique groups.
We will generate time-window aggregations. To avoid recalculating them all the time, we will store the results in yet another table:
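A sketch of the aggregates table; here, a hypothetical window_end column stores the last day of the 30-day window:

```sql
CREATE FACT TABLE IF NOT EXISTS bm_cube_pageviews_aggregates (
    window_end DATE,
    wiki_code TEXT,
    article_title TEXT,
    avg_views DOUBLE,
    stddev_views DOUBLE
)
PRIMARY INDEX window_end, wiki_code, article_title;
```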
We could include the mapping between the aggregation window and the next day in every data retrieval query, but we can also create a dimension table to store the mapping:
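A minimal sketch of the dimension table; it simply maps each aggregation window to the day that follows it:

```sql
CREATE DIMENSION TABLE IF NOT EXISTS bm_cube_date_mapping (
    window_end DATE,
    next_day DATE
);
```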
Finally, we want a table with all the outliers. We store outlier detection results in a table, so we can access them without running the filter query all the time. Here is our table:
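A sketch of the outliers table, under the same naming assumptions; daily_views stores the actual count, and threshold stores the computed cutoff:

```sql
CREATE FACT TABLE IF NOT EXISTS bm_cube_outliers (
    log_date DATE,
    wiki_code TEXT,
    article_title TEXT,
    daily_views BIGINT,
    threshold DOUBLE
)
PRIMARY INDEX log_date, wiki_code, article_title;
```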
Our schema looks like this:
Daily data ingestion
After the setup, we can start ingesting the data. Below, we show only the SQL queries used to load the data from S3. (Needless to say, these queries should be run with Airflow. See the DAG on GitHub.)
In the query, we read from the external table and copy the values into our pageviews table. Note that we extract the date from the source_file_name and use the file name to limit the amount of processed data. The `{{ ds_nodash }}` value comes from the Airflow task context:
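A sketch of the load query under the assumptions above; for simplicity, it derives the log date from the Airflow execution date (`{{ ds }}`) rather than parsing the file name:

```sql
INSERT INTO bm_cube_pageview_days
SELECT
    CAST('{{ ds }}' AS DATE) AS log_date,
    wiki_code,
    article_title,
    views
FROM ex_pageviews
-- filtering on the file name limits the amount of processed data
WHERE source_file_name LIKE '%{{ ds_nodash }}%';
```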
After the data ingestion, we aggregate the daily counts from the past 30 days for every article separately and store them in the `bm_cube_pageviews_aggregates` table. Again, we use the Airflow template variables to define the time window:
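A sketch of the aggregation query; the inner query combines desktop and mobile counts into one daily total (served by the aggregating index), and the outer query computes the window statistics:

```sql
INSERT INTO bm_cube_pageviews_aggregates
SELECT
    CAST('{{ ds }}' AS DATE) AS window_end,
    wiki_code,
    article_title,
    AVG(day_views) AS avg_views,
    STDDEV_SAMP(day_views) AS stddev_views
FROM (
    -- combine desktop and mobile counts into one daily total
    SELECT log_date, wiki_code, article_title, SUM(views) AS day_views
    FROM bm_cube_pageview_days
    WHERE log_date BETWEEN CAST('{{ macros.ds_add(ds, -30) }}' AS DATE)
                       AND CAST('{{ ds }}' AS DATE)
    GROUP BY log_date, wiki_code, article_title
) daily_counts
GROUP BY wiki_code, article_title;
```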
Finally, we update the dimension table to store the mapping between aggregated values and the daily count. In the dimension table, we want to refer to the daily count a day after the end of the time window:
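A minimal sketch of the mapping update: the window ending on the execution date points at the following day:

```sql
INSERT INTO bm_cube_date_mapping
VALUES (
    CAST('{{ ds }}' AS DATE),                     -- end of the window
    CAST('{{ macros.ds_add(ds, 1) }}' AS DATE)    -- the next day
);
```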
Outlier detection in Firebolt
Finally, we can perform the last stage of a daily load and calculate which pages had more visitors than we expected.
In our final query, we combine the values from the aggregated time-range table and the daily counts using the dimension table. This is also where we define what an “outlier” means in our business domain by calculating the views threshold:
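A sketch of the outlier query under the assumptions above; the threshold is the 30-day average plus three standard deviations:

```sql
INSERT INTO bm_cube_outliers
SELECT
    d.log_date,
    d.wiki_code,
    d.article_title,
    SUM(d.views) AS daily_views,
    MAX(a.avg_views + 3 * a.stddev_views) AS threshold
FROM bm_cube_pageview_days d
-- the dimension table links each day to its preceding 30-day window
JOIN bm_cube_date_mapping m
  ON m.next_day = d.log_date
JOIN bm_cube_pageviews_aggregates a
  ON a.window_end = m.window_end
 AND a.wiki_code = d.wiki_code
 AND a.article_title = d.article_title
WHERE d.log_date = CAST('{{ ds }}' AS DATE)
GROUP BY d.log_date, d.wiki_code, d.article_title
HAVING SUM(d.views) > MAX(a.avg_views + 3 * a.stddev_views);
```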
What makes Firebolt so fast?
Before we continue integrating Cube and Firebolt, let’s talk about the technology that makes Firebolt such a high-performing data warehouse.
First of all, is Firebolt fast? Let’s send a query to retrieve outliers in the English Wikipedia on `2022-06-10`:
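A query along these lines (the wiki code value is an assumption about how the dataset encodes the English Wikipedia):

```sql
SELECT article_title, daily_views
FROM bm_cube_outliers
WHERE wiki_code = 'en'
  AND log_date = '2022-06-10'
ORDER BY daily_views DESC;
```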
Firebolt finds and sorts over half a million rows in less than a second.
Also, note the tiny amount of scanned data: only 31 MB. How is that possible? Let’s take a look at the numbers. At the time of writing, I have almost 4 TB of data in my database. Of course, the compressed size is much smaller. Note the impressive compression rate of our `bm_cube_pageview_days` table: 10.5 times smaller than the uncompressed data!
Compression reduces the amount of data transmitted between cluster nodes and the underlying storage. However, compression alone wouldn’t be enough if we retrieved entire files every time. Usually, data warehouses or cloud query engines use partitions to limit the number of retrieved files. The Firebolt Cloud Data Warehouse Whitepaper explains how Firebolt fixes the problem and retrieves even smaller chunks of data:
Firebolt accesses much smaller ranges through the use of indexes. For example, Firebolt only fetches the data ranges it needs, not entire partitions or segments. What makes this possible is sparse indexing. In Firebolt, the unit of remote storage is a segment, and it can be any size (see the Data ingestion section below for more.) Within each segment are much smaller units called ranges. Firebolt only accesses much smaller ranges of data instead of larger segments. Storing only the ranges not only dramatically reduces data access times. It reduces the amount of data scanned, and scan times as well.
In Firebolt, we can define aggregating and join indexes to prepare pre-computed data. The query optimizer will automatically use the pre-computed data instead of retrieving raw data from storage. Of course, that’s not the only feature of the optimizer. The Firebolt whitepaper says:
The query optimizer evaluates whether query performance can be improved by reordering query operations, or by using indexes in place of operations. It reorders operations to take advantage of predicate pushdowns, and uses aggregating and join indexes to further reduce both data access and scans.
Connecting Cube to Firebolt
Let’s switch to Cube. After signing up for Cube Cloud, we open the Cube deployments dashboard and click Create Deployment:
In the next screen, we choose the deployment platform (AWS), the region, and the name of the Cube application:
We don’t have any Cube code yet, so we must start from scratch. Hence, we click Create in the second step of the setup:
Finally, we choose Firebolt in the database connection window:
In the final setup step, we provide the database name, database engine endpoint URL, the API URL, and our credentials:
After connecting to the database, we can choose to generate the schema automatically. In our case, we need only the outliers table, so that’s the only table we choose:
Let’s click Generate and wait until Cube deploys our application. When the app is ready, we open the “Playground” tab and test whether everything works as expected.
In the Playground, we configure a query and click Run. Note that we specify the Wiki code and limit the number of daily hits because we want to get more interesting results. The top page is usually Wikipedia's homepage:
It seems we have found the answer to our question about the Queen. Not entirely, though. The page `Queen_of_the_United_Kingdom` is just one of many redirects to the Wikipedia article about the British monarchy. Wikipedia's readers visited some of them more often than usual.
Since everything works correctly, we can switch to the “Schema” tab and define access control rules.
Access control in Cube
In the previous section, we filtered results by Wiki code. Let’s make it a permanent rule.
To secure our dataset, we’ll use row-based access control in Cube. We want to allow users to access outliers from only some Wikipedias. We’ll need to retrieve the Wikipedia names from the user’s security context and add a filter to every query issued by the user.
Let’s open the “Schema” tab and the `cube.js` file. In this file, we implement the `queryRewrite` function.
First, we must check whether the security context contains a list of Wikipedias and reject all queries without it. Next, we modify the user’s query by adding a filter limiting the data the user can see, and we return a modified query to overwrite the user’s query:
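A sketch of this logic; the cube and dimension names (`BmCubeOutliers.wikiCode`) are assumptions based on the generated schema:

```javascript
// cube.js
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    // reject queries that arrive without a list of allowed Wikipedias
    if (!securityContext.wikipedias) {
      throw new Error('No wikipedias found in the security context');
    }

    // add a mandatory filter limiting the data the user can see
    query.filters.push({
      member: 'BmCubeOutliers.wikiCode',
      operator: 'equals',
      values: securityContext.wikipedias,
    });

    // return the modified query to overwrite the user's query
    return query;
  },
};
```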
Now, we save the changes, commit them, and push the code to the repository. Click ... in the top-right corner, then Commit and push.
After a while, the application will get deployed in production, and we will no longer be able to run our previous test query.
Let’s open the previous query in the Playground and try running it anyway. In the result, we should see an error message because our security context doesn’t contain a list of Wikipedias:
In the Playground, we can modify the user’s context by clicking Add Security Context and using this as JSON:
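For example, a security context along these lines (the exact value depends on how the wiki codes are stored in the dataset):

```json
{
  "wikipedias": ["en"]
}
```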
Now, our query returns pages from the English Wikipedia and nothing else:
Exposing data to React application
Now we're all set to build a front-end app.
We will modify the data model to perform a last-mile data transformation: front-end engineers will thank us for sound dimensions exposed via the API. Then, we'll build a web application with React, the most popular JavaScript framework, and the nivo charting library, a modern production-ready data visualization tool.
Updating the data model
Let's go back to Cube Cloud, click Enter Development Mode, open the Schema tab, navigate to `schema/BmCubeOutliers.js`, and replace its contents with the following:
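Here's a sketch of what the updated schema could look like. The column names, the `region` and `url` derivations, and the pre-aggregation settings are assumptions for illustration; see the source code on GitHub for the real thing:

```javascript
cube(`BmCubeOutliers`, {
  // derive region and url from the raw columns for the front-end app
  sql: `
    SELECT
      *,
      SPLIT_PART(wiki_code, '.', 1) AS region,
      'https://' || wiki_code || '.org/wiki/' || article_title AS url
    FROM bm_cube_outliers
  `,

  measures: {
    count: { type: `count` },
  },

  dimensions: {
    date: { sql: `log_date`, type: `time` },
    region: { sql: `region`, type: `string` },
    title: { sql: `article_title`, type: `string` },
    url: { sql: `url`, type: `string` },
    views: { sql: `daily_views`, type: `number` },
  },

  // segments simplify selecting only a part of the whole dataset
  segments: {
    wikipedia: { sql: `${CUBE}.wiki_code LIKE '%.wikipedia'` },
    wiktionary: { sql: `${CUBE}.wiki_code LIKE '%.wiktionary'` },
  },

  // cache the slowly changing queries: all regions and all days
  preAggregations: {
    regions: {
      dimensions: [CUBE.region],
    },
    days: {
      timeDimension: CUBE.date,
      granularity: `day`,
    },
  },
});
```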
A few highlights of the code above:

- In the `sql` property, we use slightly more complex SQL than a regular "select star" query; `region` and `url` are derived from the dataset for the needs of the front-end app.
- In the `segments` property, we define a set of filters to simplify selecting only a part of the whole dataset, e.g., only Wikipedia articles or Wiktionary definitions; very handy for simplifying the queries in the front-end app code.
- In the `preAggregations` property, we configure caching for non-critical, slowly changing queries: the list of all regions and the list of all days. These queries would be run by every user on every web application load, so it makes sense to use the cache to prevent numerous identical queries, each traversing terabytes of data when fulfilled, from hitting the data warehouse. Unique, one-off queries would be run directly against Firebolt.
Also, let's replace the contents of `cube.js` with the following:
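A sketch of the updated `cube.js`, matching the logic described below (the `region` dimension name follows the schema sketch above):

```javascript
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    // if the security context lists allowed Wikipedias, enforce the
    // filter by region; otherwise, let the query through unchanged
    if (securityContext.wikipedias) {
      query.filters.push({
        member: 'BmCubeOutliers.region',
        operator: 'equals',
        values: securityContext.wikipedias,
      });
    }

    return query;
  },
};
```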
The new access control logic:
- Let's accept all queries, even if `securityContext.wikipedias` is not set.
- If `securityContext.wikipedias` is set, enforce the filter by `region`.
Time to hit ... in the top-right corner, then Commit and push. Once the Cube deployment is ready, let's get down to the front-end app.
Building a React application
Let's build a single-page application with React and provide end users with a nice UI to explore Wikipedia data.
Here it is! You can check the full source code on GitHub or instantly run it with `yarn dev`. You'll get your own copy of this demo application.
This is a typical React app with an entry point in `src/index.js`. Let's review the contents of this file:
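Here's a condensed sketch of the file, trimmed to a single query for brevity; the environment variable names and component structure are placeholders, so refer to GitHub for the full version:

```javascript
import React, { useState, useEffect } from 'react';
import ReactDOM from 'react-dom';
import cubejs from '@cubejs-client/core';
import { CubeProvider, useCubeQuery } from '@cubejs-client/react';

// establish the API connection with a JWT token and an API URL
const cubejsApi = cubejs(process.env.REACT_APP_CUBE_TOKEN, {
  apiUrl: process.env.REACT_APP_CUBE_API_URL,
});

// compose a Cube query for the list of regions
const regionsQuery = {
  dimensions: ['BmCubeOutliers.region'],
};

function App() {
  // a state variable for the fetched regions
  const [regions, setRegions] = useState([]);

  // fetch the result set via Cube's REST API
  const { resultSet: regionsResultSet } = useCubeQuery(regionsQuery);

  // wait for the data, transform it, and assign it to the state variable
  useEffect(() => {
    if (regionsResultSet) {
      setRegions(
        regionsResultSet
          .tablePivot()
          .map((row) => row['BmCubeOutliers.region'])
      );
    }
  }, [regionsResultSet]);

  return (
    <ul>
      {regions.map((region) => (
        <li key={region}>{region}</li>
      ))}
    </ul>
  );
}

ReactDOM.render(
  <CubeProvider cubejsApi={cubejsApi}>
    <App />
  </CubeProvider>,
  document.getElementById('root')
);
```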
A few highlights of the code above:

- The `@cubejs-client/core` and `@cubejs-client/react` packages contain convenient components and functions to interact with Cube's REST API in React applications. You can also use the REST API with other frameworks like Vue or Angular, vanilla JavaScript apps, or apps written in any other language. Using Cube's GraphQL API is also an option.
- `cubejsApi` helps establish an API connection; it's configured with a JWT token and an API URL. In production, Cube is often used with JSON Web Key Sets; integrations with Auth0 and AWS Cognito are available.
- A typical API interaction flow in a React app looks like this:
  - use `useState` to create a state variable (e.g., `regions`);
  - compose a Cube query (e.g., `regionsQuery`);
  - call `useCubeQuery` to fetch the result set (e.g., `regionsResultSet`);
  - use `useEffect` to wait for the data;
  - transform the data, if needed, with `tablePivot` and functions like `map`, `filter`, `reduce`, or `parseInt`;
  - assign the data to the state variable (e.g., with `setRegions`).
- If needed, you can derive queries from other queries (e.g., see how `articlesQuery` is based on `outliersQuery`) or make conditional changes to them.
- In the very end, the data is rendered. Browse the code of the very simple components used in this app:
  - `RegionSelector` renders a clickable list of country flags;
  - `CalendarChart` renders a clickable calendar;
  - `DayRegionIndicator` displays the text about "popular articles";
  - `LoadingIndicator` appears briefly while a query to Firebolt is in progress;
  - `ArticleList` displays the list of links to popular articles.
That's it! Now you know how this app is built:
Check it out! You'll surely get some insights from it.
What's next?
We encourage you to learn more about Firebolt and Cube:
- Check out Firebolt's website, beautiful docs, blog, and the whitepaper. Talk to them and take a step towards a sub-second analytics experience.
- Check out Cube's website, docs, blog, and example apps. Get started with Cube Cloud for free.
And most importantly, join Cube and Firebolt at the webinar:
"Building lightning-fast data apps with Firebolt and Cube" webinar on September 15, 2022.
You'll see both tools in action, learn from their teams, and have a chance to discuss how Firebolt and Cube can enable sub-second analytics experiences for your company.
Check out the agenda and register today 👈
Also, don't hesitate to join Cube's community of 6000+ data engineers and application developers on Slack and give Cube a star on GitHub. Good luck!