Serverless Analytics Benchmark of AWS Aurora Performance

It’s well known that doing analytics on production RDBMS instances is prohibitive due to lock and performance issues. The introduction of replication by many of the modern RDBMSs made this process a little bit easier, but not ideal. The next evolution step here is Serverless RDBMSs. Splitting processing power from the storage brings our understanding of how to work with a RDBMS to a new level. This gives us an opportunity for infinite DB infrastructure scaling and allows us to have both OLTP and OLAP workload at the same time.

In this article, we perform an analytics performance benchmark of AWS Aurora MySQL using the Cube.js Serverless Analytics Framework.

TL;DR: Unoptimized 100M rows test dataset ran by Serverless MySQL Aurora can be queried in 176 seconds. A query time of less than 200ms is achievable for the same dataset using multi-stage querying approach.

Aurora MySQL Setup

Setup of Aurora MySQL is fairly simple. Just go to your RDS Service in AWS Console and select create database.

Here you should select the Aurora and MySQL compatible version that supports Serverless deployment. Then choose Serverless and enter your credentials to access your DB. That’s it! Your Aurora instance is configured and ready to launch.

Preparing your dataset

For the purpose of testing, we’ll generate a 100M row dataset out of Sakila Sample Database. In order to do that, we’ll use slightly modified versions of Sakila tables.

First we’ll introduce the customer table:

CREATE TABLE customer (
  customer_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  store_id TINYINT UNSIGNED NOT NULL,
  first_name VARCHAR(45) NOT NULL,
  last_name VARCHAR(45) NOT NULL,
  email VARCHAR(50) DEFAULT NULL,
  address_id SMALLINT UNSIGNED NOT NULL,
  active BOOLEAN NOT NULL DEFAULT TRUE,
  create_date DATETIME NOT NULL,
  last_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY  (customer_id),
  KEY idx_fk_store_id (store_id),
  KEY idx_fk_address_id (address_id),
  KEY idx_last_name (last_name)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then the payment table:

CREATE TABLE payment (
  payment_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  customer_id SMALLINT UNSIGNED NOT NULL,
  staff_id TINYINT UNSIGNED NOT NULL,
  rental_id INT DEFAULT NULL,
  amount DECIMAL(5,2) NOT NULL,
  payment_date DATETIME NOT NULL,
  last_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY  (payment_id),
  KEY idx_fk_customer_id (customer_id),
  CONSTRAINT fk_payment_customer FOREIGN KEY (customer_id) REFERENCES customer (customer_id) ON DELETE RESTRICT ON UPDATE CASCADE
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

Quick hint: Aurora can’t be accessed using a public IP and is accessible only within VPC. To connect to it from your machine, you should use either VPN/SSH Tunnel or you can use AWS RDS query editor, which works just inside browser.

To fill the customer and payment tables we’ll use a sample of the first 600 customer rows and the first 500 payment rows from Sakila Sample Database.

We’ll then use the payment table to generate actual records for the 100M payment_big table. Its definition is very similar to payment:

CREATE TABLE payment_big (
  payment_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  customer_id SMALLINT UNSIGNED NOT NULL,
  staff_id TINYINT UNSIGNED NOT NULL,
  rental_id INT DEFAULT NULL,
  amount DECIMAL(5,2) NOT NULL,
  payment_date DATETIME NOT NULL,
  last_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY  (payment_id),
  KEY idx_fk_customer_id (customer_id),
  CONSTRAINT fk_payment_big_customer FOREIGN KEY (customer_id) REFERENCES customer (customer_id) ON DELETE RESTRICT ON UPDATE CASCADE
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

As MySQL doesn’t have a generator feature, we’ll introduce some helper views for that.

Integer number generator up to 16:

CREATE OR REPLACE VIEW generator_16
AS SELECT 0 n UNION ALL SELECT 1  UNION ALL SELECT 2  UNION ALL 
   SELECT 3   UNION ALL SELECT 4  UNION ALL SELECT 5  UNION ALL
   SELECT 6   UNION ALL SELECT 7  UNION ALL SELECT 8  UNION ALL
   SELECT 9   UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL
   SELECT 12  UNION ALL SELECT 13 UNION ALL SELECT 14 UNION ALL 
   SELECT 15;

Integer number generator up to 256:

CREATE OR REPLACE VIEW generator_256 AS SELECT ( hi.n * 16 + lo.n ) AS n FROM generator_16 lo , generator_16 hi;

To generate actual payments, we’ll use the following SQL, which generates 2,048,000 rows at once:

INSERT INTO payment_big (customer_id, staff_id, rental_id, amount, payment_date, last_update) 
SELECT customer_id, staff_id, rental_id, amount * 10 * RAND() as amount, TIMESTAMPADD(MINUTE, generator_256.n * 1000 + 1000 * RAND(), payment_date) as payment_date, last_update
FROM payment, generator_256, generator_16

We’ll call it 50 times to get a table with approximately 100M rows.

Cube.js setup

Cube.js is an open-source analytics framework that works as an intermediate between your analytics users and your database. It provides analytic SQL generation, query results caching and execution orchestration, data pre-aggregation, security, API for query results fetch, and visualization. We’ll use it mostly for caching and pre-aggregation to get a sense of how fast Aurora MySQL analytics can be and what the delay is between data ingestion and getting insight from it.

To create a Cube.js application, you’ll need to install Cube.js CLI and Serverless CLI for deployment:

$ npm install -g cubejs-cli
$ npm install -g serverless

Then, let’s create an app:

$ cubejs create aurora-benchmark -d mysql -t serverless

In serverless.yml you should define credentials to access your Aurora MySQL and Redis instance as well as provide vpc settings to access your resources. Please learn more about deployment here.

We’ll also introduce Cube.js schema definitions for our benchmark. schema/Customers.js:

cube(`Customers`, {
 sql: `select * from test.customer`,

 measures: {
   count: {
     type: `count`
   }
 },

 dimensions: {
   id: {
     sql: `customer_id`,
     type: `number`,
     primaryKey: true
   },

   email: {
     sql: `email`,
     type: `string`
   },

   name: {
     sql: `${CUBE}.first_name || ${CUBE}.last_name`,
     type: `string`
   }
 }
});

schema/Payments.js:

cube(`Payments`, {
 sql: `select * from test.payment_big`,

 joins: {
   Customers: {
     sql: `${CUBE}.customer_id = ${Customers}.customer_id`,
     relationship: `belongsTo`
   }
 },

 measures: {
   count: {
     type: `count`
   },

   totalAmount: {
     sql: `amount`,
     type: `sum`
   }
 },

 dimensions: {
   id: {
     sql: `payment_id`,
     type: `number`,
     primaryKey: true
   },

   date: {
     sql: `payment_date`,
     type: `time`
   }
 }
});

That’s everything we need to deploy it so let’s do it:

$ serverless deploy -v

If everything is set up correctly, you should see something like this in your shell:

Service Information
service: aurora-benchmark
stage: dev
region: us-east-1
stack: aurora-benchmark-dev
resources: 16
api keys:
  None
endpoints:
  GET - https://123456789a.execute-api.us-east-1.amazonaws.com/dev/
  ANY - https://123456789a.execute-api.us-east-1.amazonaws.com/dev/{proxy+}
functions:
  cubejs: aurora-benchmark-dev-cubejs
  cubejsProcess: aurora-benchmark-dev-cubejsProcess
layers:
  None

Go to your main GET endpoint link which should look like this: https://123456789a.execute-api.us-east-1.amazonaws.com/dev/

If everything is done right, you should see a code sandbox with a sample pie chart.

Let’s create our time series chart for the Payments cube. To do this let’s add moment import to index.js:

import moment from 'moment';

You should add it as a dependency as well using the blue Add Dependency button or package.json.

Then replace renderChart and query:

const renderChart = resultSet => (
  <Chart scale={{ category: { tickCount: 8 } }} height={400} data={resultSet.chartPivot()} forceFit>
    <Axis name="category" label={{ formatter: val => moment(val).format("MMM DD") }} />
    <Axis name="Payments.count" />
    <Tooltip crosshairs={{ type: 'y' }} />
    <Geom type="line" position="category*Payments.count" size={2} />
  </Chart>
);

const query = {
  measures: ["Payments.count"],
  timeDimensions: [{
    dimension: 'Payments.date',
    granularity: 'day',
    dateRange: ['2005-09-01', '2005-10-08']
  }]
};

After edits you should get an index.js similar to this:

import React from "react";
import ReactDOM from "react-dom";

import cubejs from "@cubejs-client/core";
import { QueryRenderer } from "@cubejs-client/react";
import { Chart, Axis, Tooltip, Geom, Coord, Legend } from "bizcharts";
import moment from 'moment';

const API_URL = "https://123456789a.execute-api.us-east-1.amazonaws.com/dev"; // change to your actual endpoint

const renderChart = resultSet => (
  <Chart scale={{ category: { tickCount: 8 } }} height={400} data={resultSet.chartPivot()} forceFit>
    <Axis name="category" label={{ formatter: val => moment(val).format("MMM DD") }} />
    <Axis name="Payments.count" />
    <Tooltip crosshairs={{ type: 'y' }} />
    <Geom type="line" position="category*Payments.count" size={2} />
  </Chart>
);

const query = {
  measures: ["Payments.count"],
  timeDimensions: [{
    dimension: 'Payments.date',
    granularity: 'day',
    dateRange: ['2005-09-01', '2005-10-08']
  }]
};

const cubejsApi = cubejs(
  "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpYXQiOjE1NTExODQ0NDksImV4cCI6MTU1MTI3MDg0OX0.KLkKp2pRnw9ZlrwMGkoBlpdgGy4eol7258aKVwJLPuM",
  { apiUrl: API_URL + "/cubejs-api/v1" }
);

const App = () => (
  <div style={{ textAlign: 'center', fontFamily: 'sans-serif' }}>
   <h1>Payments Count</h1>
    <QueryRenderer
      query={query}
      cubejsApi={cubejsApi}
      render={({ resultSet, error }) =>
        (resultSet && renderChart(resultSet)) ||
          (error && error.toString()) || <span>Loading...</span>
      }
    />
</div>
);

const rootElement = document.getElementById("root");
ReactDOM.render(<App />, rootElement);

If everything works well, you should see a line chart similar to this one:

Benchmark

At this point we have an unoptimized data set of 100M records in our payment_big table. We’ll use serverless logs to see how long it takes to process queries on the Aurora MySQL side as Cube.js writes these processing logs by default. To enable log tail we’ll use:

$ serverless logs -t -f cubejsProcess

If we request one month of data we’ll get a processing delay of 176 seconds:

2019-02-27T12:47:45.384Z	794618d5-5eb3-40ce-88f3-cce6d75786f1	Performing query completed:
{
    "queueSize": 2,
    "duration": 175900,
    "queryKey": [
        "SELECT\n      DATE_FORMAT(CONVERT_TZ(payments.payment_date, @@session.time_zone, '+00:00'), '%Y-%m-%dT00:00:00.000Z') `payments.date_date`, count(payments.payment_id) `payments.count`\n    FROM\n      test.payment_big AS payments\n  WHERE (payments.payment_date >= TIMESTAMP(convert_tz(?, '+00:00', @@session.time_zone)) AND payments.payment_date <= TIMESTAMP(convert_tz(?, '+00:00', @@session.time_zone))) GROUP BY 1 ORDER BY 1 ASC LIMIT 10000",
        [
            "2005-09-01T00:00:00Z",
            "2005-09-30T23:59:59Z"
        ],
        []
    ]
}

Not too bad for an unoptimized table of 100M data points. Let’s add an index for payment_date to see if it can affect this time:

CREATE INDEX payment_big_date ON payment_big (payment_date)

The same query with index will be processed in 31 seconds which is great but still not fast enough to build a great user experience. What we can do is introduce pre-aggregations for this table. To do this let’s just add:

preAggregations: {
 main: {
   type: `rollup`,
   measureReferences: [count],
   timeDimensionReference: date,
   granularity: `day`,
   partitionGranularity: `day`
 }
}

to schema/Payments.js so we can get:

cube(`Payments`, {
 sql: `select * from test.payment_big`,

 joins: {
   Customers: {
     sql: `${CUBE}.customer_id = ${Customers}.customer_id`,
     relationship: `belongsTo`
   }
 },

 measures: {
   count: {
     type: `count`
   },

   totalAmount: {
     sql: `amount`,
     type: `sum`
   }
 },

 dimensions: {
   id: {
     sql: `payment_id`,
     type: `number`,
     primaryKey: true
   },

   date: {
     sql: `payment_date`,
     type: `time`
   }
 },

 preAggregations: {
   main: {
     type: `rollup`,
     measureReferences: [count],
     timeDimensionReference: date,
     granularity: `day`,
     partitionGranularity: `day`
   }
 }
});

It’ll create a rollup table for each day of data and will refresh it incrementally. Here partitionGranularity instructs Cube.js to create rollup table per day of data. Rollup table itself will contain count measure and date dimensions with day granularity. In fact every rollup table here will contain just 1 row which contains count for specific day calculated. Let’s deploy it and see how it changes query processing times:

$ serverless deploy -v

If we request the same interval of data, Cube.js will build pre-aggregations first for each day, which takes 1.5 seconds per one day:

2019-02-27T13:33:19.267Z	11477db2-f66e-4278-9103-eefbbc513be3	Performing query completed:
{
    "queueSize": 1,
    "duration": 1578,
    "queryKey": [
        [
            "CREATE TABLE stb_pre_aggregations.payments_main20050928 AS SELECT\n      DATE_FORMAT(CONVERT_TZ(payments.payment_date, @@session.time_zone, '+00:00'), '%Y-%m-%dT00:00:00.000Z') `payments.date_date`, count(payments.payment_id) `payments.count`\n    FROM\n      test.payment_big AS payments\n  WHERE (payments.payment_date >= TIMESTAMP(convert_tz(?, '+00:00', @@session.time_zone)) AND payments.payment_date <= TIMESTAMP(convert_tz(?, '+00:00', @@session.time_zone))) GROUP BY 1",
            [
                "2005-09-28T00:00:00Z",
                "2005-09-28T23:59:59Z"
            ]
        ],
        [
            [
                {
                    "current_hour": "2019-02-27T13:00:00.000Z"
                }
            ]
        ]
    ]
}

And then it will query the union of all pre-aggregations tables, which takes less than 200ms:

2019-02-27T13:33:23.647Z	a4162f29-570d-495f-8ca4-34600869d8e7	Performing query completed:
{
    "queueSize": 1,
    "duration": 134,
    "queryKey": [
        "SELECT `payments.date_date` `payments.date_date`, sum(`payments.count`) `payments.count` FROM (SELECT * FROM stb_pre_aggregations.payments_main20050901 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050902 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050903 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050904 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050905 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050906 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050907 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050908 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050909 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050910 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050911 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050912 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050913 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050914 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050915 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050916 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050917 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050918 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050919 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050920 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050921 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050922 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050923 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050924 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050925 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050926 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050927 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050928 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050929 UNION ALL SELECT * FROM stb_pre_aggregations.payments_main20050930) as partition_union  WHERE (`payments.date_date` >= CONVERT_TZ(TIMESTAMP(convert_tz(?, '+00:00', @@session.time_zone)), @@session.time_zone, '+00:00') AND `payments.date_date` <= CONVERT_TZ(TIMESTAMP(convert_tz(?, '+00:00', @@session.time_zone)), @@session.time_zone, '+00:00')) GROUP BY 1 ORDER BY 1 ASC LIMIT 10000",
        [
            "2005-09-01T00:00:00Z",
            "2005-09-30T23:59:59Z"
        ],
       ...
}

This approach allows us to split the analytics workload into small reusable chunks to avoid DB burst with an incremental update strategy.

Conclusion

We were playing with just a test dataset and the results are very well correlated with what we see in production workloads. MySQL is capable of handling several billions of data points per table with an analytics workload pretty well. In this case, sub second query times are achievable with several seconds time to insight delay using properly organized query orchestration.

Although Serverless Aurora MySQL is capable of handling an enormous workload while stress testing, we discovered that workload routing algorithms aren’t smart enough yet to route queries to the least loaded nodes, which under certain circumstances can lead to partial database outages while analytics querying. For production environments we recommend you use provisioned Aurora MySQL with read replicas or the Parallel Query feature in order to split your OLTP and OLAP workload by design.

Aurora MySQL Setup

Preparing your dataset

Cube.js setup

Benchmark

Conclusion

Upgrade your data stack today