Essential factors to consider when building your own data warehouse
While developing Cube.js, an open source analytics framework, we've worked with many data warehouses. When our customers ask us what the best data warehouse is for their growing company, we base the answer on their specific needs. Usually, they need near real-time data at a low price, without the burden of maintaining data warehouse infrastructure. In that case, we advise them to use modern data warehouses such as Redshift, BigQuery, or Snowflake.

Most modern data warehouse solutions are designed to work with raw data, which allows you to re-transform data on the fly without re-ingesting the data stored in your warehouse.

In this post, we would like to dig deeper into the factors to consider while choosing a data warehouse. Here they are:
Volume of data
Dedicated human resources for support and maintenance
Scalability: horizontal vs. vertical
Pricing models
Data Volume
You need an estimate of the volume of data you will be dealing with. If it is clear that you will be working with datasets in the hundreds of terabytes or in petabytes, you are strongly advised to go for non-relational databases: support for huge datasets is ingrained in the architecture of such systems. On the other hand, many relational databases have excellent, time-proven query optimizers, and you can consider them as an option for an analytics warehouse as long as your dataset fits on a single node.

Let's look at some rough math relevant to dataset size (a quick back-of-envelope sketch follows the list below):
Postgres, MySQL, MSSQL, and many other RDBMSs have a sweet spot of up to 1 TB of data involved in analytics. Beyond this size, you may experience degraded performance.
Amazon Redshift, Google BigQuery, Snowflake, and Hadoop-based solutions handle datasets up to multiple petabytes in an optimal manner.
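To make the "does it fit on one node" question concrete, here is a minimal back-of-envelope sketch in Python. The table names, row counts, and average row sizes are purely illustrative assumptions, not measurements from any real workload.

```python
# Rough estimate of raw dataset size from per-table row counts and row widths.
# All numbers below are hypothetical placeholders.

TABLES = {
    # table name: (row count, average row size in bytes)
    "events": (2_000_000_000, 350),
    "users": (50_000_000, 1_200),
    "orders": (400_000_000, 600),
}

total_bytes = sum(rows * row_size for rows, row_size in TABLES.values())
total_tb = total_bytes / 1024**4

print(f"Estimated raw size: {total_tb:.2f} TB")
if total_tb < 1:
    print("Likely fits a single-node RDBMS such as Postgres or MySQL.")
else:
    print("Consider a distributed warehouse such as Redshift, BigQuery, or Snowflake.")
```

Keep in mind that indexes, replication, and intermediate query results add overhead on top of the raw size, so leave yourself a comfortable margin.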
On-premises vs Cloud
Another important aspect to evaluate is whether you have dedicated resources for the maintenance, support, and fixes of your database. This plays an important role in the comparison.
If you have dedicated resources for support and maintenance, you have many more options when choosing a database.
You can opt for building your own big data warehouse on top of something like Hadoop or Greenplum. These systems do require significant setup, engineering resources for maintenance, and skilled personnel. If you don't have dedicated resources for maintenance, your options narrow: we advise going for a modern data warehouse solution like Redshift, BigQuery, or Snowflake. As an administrator or user, you don't need to worry about deploying, hosting, resizing VMs, handling replication, or encryption. You can start using it by issuing SQL commands, as the sketch below shows.
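As a minimal sketch of what "just issue SQL" looks like in practice, here is a query against BigQuery using the official google-cloud-bigquery Python client. It assumes application default credentials are configured; the dataset and table name `my_dataset.events` and the column names are placeholders.

```python
# Querying a managed warehouse (BigQuery) with nothing but SQL --
# no cluster to provision, host, or resize.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT DATE(created_at) AS day, COUNT(*) AS events
    FROM `my_dataset.events`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""

# Run the query and iterate over the result rows.
for row in client.query(sql).result():
    print(row.day, row.events)
```

The equivalent with Redshift or Snowflake is the same idea: connect with a standard driver and send SQL; the provider handles the infrastructure underneath.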
Scalability
When you start working with a database, you expect it to be scalable enough to support your further growth. Broadly, database scalability can be achieved in two ways: horizontally or vertically.
Horizontal scalability refers to adding more machines, whereas vertical scalability means adding resources to a single node to increase its capability.
Redshift provides easy scaling options. With a few mouse clicks (or an API call, as sketched below), you can increase the number of nodes and configure them to meet your needs. Redshift scales very well until you hit about 100 TB of data processed at once in a query. Unlike some other data warehouse options, the computing capacity of your Redshift cluster will always depend on the number of nodes in the cluster.

This is where solutions such as BigQuery come into play. There is virtually no fixed cluster capacity: BigQuery can allocate up to 2,000 slots, which are roughly the equivalent of nodes in Redshift. Thanks to this multi-tenancy strategy, even when customers' concurrency demands grow, BigQuery scales seamlessly with them and can go beyond the limit of 2,000 slots if required. BigQuery relies on Colossus, Google's latest-generation distributed file system. Colossus allows BigQuery users to scale to dozens of petabytes of storage seamlessly, without paying the penalty of attaching much more expensive computing resources.

Snowflake is built on Amazon S3 cloud storage, and its storage layer holds all the diverse data, tables, and query results. Because this storage layer is engineered to scale completely independently of computing resources, maximum scalability can be achieved effortlessly for big data warehousing and analytics. In addition, Snowflake offers multiple virtual warehouses, at nearly any scale and concurrency, which can operate simultaneously on the same data while fully enforcing global, system-wide transactional integrity.
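For illustration, resizing a Redshift cluster can also be done programmatically via the AWS API rather than the console. The sketch below uses boto3; the cluster identifier, node type, node count, and region are placeholder values, and the call assumes your AWS credentials are already configured.

```python
# Sketch: scaling a Redshift cluster by changing its node count via the API.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.modify_cluster(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    NodeType="ra3.xlplus",                  # placeholder node type
    NumberOfNodes=6,                        # compute capacity tracks the node count
)
```

The key contrast with BigQuery and Snowflake is that here you explicitly choose and pay for a node count, whereas those services allocate compute capacity (slots or virtual warehouses) largely independently of where your data is stored.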
Pricing
If you go with a self-hosted option like Hadoop, your pricing will consist mostly of VM or hardware bills. AWS provides EMR, which is a cost-efficient option to consider for a Hadoop-based stack. Drilling down further into Redshift vs. BigQuery vs. Snowflake, all of them offer on-demand pricing, but each comes with its own flavor of pricing model. Amazon Redshift offers three pricing models:
On-Demand Pricing: no upfront commitments or costs; you simply pay an hourly rate depending on the type and number of nodes in your cluster. One factor that is often ignored is that rates vary by region. The rates cover both compute and data storage.
Spectrum Pricing: you pay only for the bytes scanned when querying data in Amazon S3.
Reserved Instance Pricing: if you are sure that you will be running Redshift for at least a few years, you can save up to 75% over on-demand rates by opting for reserved instance pricing.
Google BigQuery offers scalable, flexible pricing options and charges for data storage, streaming inserts, and querying data, while loading and exporting data are free of charge. Its pricing strategy is quite distinctive: a per-GB rate for storage and a per-bytes-scanned rate for queries. It also offers cost control mechanisms that let you cap your daily costs at an amount you choose, and a long-term pricing model as well.

Snowflake offers on-demand pricing, similar to BigQuery and Redshift Spectrum. Unlike BigQuery, compute usage is billed per second rather than by bytes scanned, with a minimum of 60 seconds. Snowflake decouples data storage from compute, so they are billed separately. The storage rate starts at around $40/TB/month for the standard edition and remains the same for other editions. Compute starts at $2.00 per hour for the standard edition and goes up to $4.00 per hour for the enterprise edition.
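To see how these models diverge, here is a rough, illustrative cost comparison for a hypothetical workload. The workload figures are invented, and the rates are approximations based on the list-style prices mentioned above; always check each vendor's current pricing before deciding.

```python
# Illustrative monthly cost comparison under two pricing models.
# All numbers are assumptions for the sake of the example.

stored_tb = 10                 # data kept in the warehouse
scanned_tb_per_month = 50      # bytes scanned by queries (scan-based billing)
compute_hours_per_month = 200  # hours a warehouse is actually running (time-based billing)

# BigQuery-style: pay per TB stored plus per TB scanned (assumed ~$20/TB stored, ~$5/TB scanned).
bigquery_style = stored_tb * 20 + scanned_tb_per_month * 5

# Snowflake-style: storage billed separately from compute
# (assumed ~$40/TB/month storage, $2.00/hour standard-edition compute).
snowflake_style = stored_tb * 40 + compute_hours_per_month * 2.00

print(f"Scan-based estimate:  ${bigquery_style:,.0f}/month")
print(f"Time-based estimate:  ${snowflake_style:,.0f}/month")
```

The takeaway is not which number is lower in this toy example, but that scan-heavy workloads and long-running, compute-heavy workloads favor different billing models.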
Conclusion
The general advice we usually give our customers for choosing a data warehouse is the following:
Use an index-optimized RDBMS such as Postgres, MySQL, or MSSQL when data volumes are well under 1 TB in total and well under 500M rows per analyzed table, and the whole database fits on a single node.
Use modern data warehouses like Redshift, BigQuery, or Snowflake when your data volume is between 1 TB and 100 TB. Also consider Hadoop with Hive, Spark SQL, or Impala if you have this expertise in-house and can allocate dedicated human resources to support it.
When your data volume is over 100 TB, use BigQuery, Snowflake, Redshift Spectrum, or a self-hosted Hadoop-based solution.