Adventures in Data Warehouse-as-a-Service

For cloud analytics, a key component is a high performance analytical data warehouse that enables a highly interactive experience for customers.

The last few years have seen the emergence of several Data Warehouse-as-a-Service offerings that promise elastic scaling and high performance.  BitYota and Snowflake are startups in this area likely implemented on AWS or some other cloud infrastructure provider.  AWS itself introduced Redshift about 2 years ago and claims thousands of customers already.  Even good old Microsoft claims to have a version of SQL Server in its Azure cloud that’s optimized for data warehousing.

At Numerify, we have been using Redshift as one of the components of our technology stack.  Redshift’s columnar storage technology and MPP architecture have worked well for us in processing large volumes of data.  The ability to scale as needed and the per-TB price are attractive features as well.  It’s not an easy road though – you do need to invest in understanding the limits and quirks of Redshift to fully leverage it.

For example, if you have a large number of smaller but wide tables Redshift can be terribly inefficient with storage.  Consider a 5 node cluster, each with 2 cores – in Redshift parlance this means there are 5 * 2 = 10 slices.  Assuming a distribution that spreads data evenly across all slices, a table with just a few rows but with 100 columns could use 1GB of space!  Redshift uses a massive 1MB block size (by comparison Oracle uses configurable size from 2 to 32KB) and each column is stored in a different block on each slice.  So,  10 slices * 100 columns * 1MB block size = 1GB.  Assuming all columns are INTs (of 4 bytes) and raw encoding (no compression) – each block can fit 262,144 rows and in turn 2,621,440 rows fit across 10 slices.  So 10 rows and 2.6M rows take up the same 1GB of space because of the large block size and columnar storage.

Though this may seem like an awful waste of space, it happens only when the database tables have a certain profile.  Maybe Redshift will allow for configurable block sizes in the future?

Another example is when you have queries with a high number of joins, you may get an error saying “m_num_steps < MAX_STEPS – exceeded maximum plan steps of 100”.  The exact same query will run without a hitch in Oracle or SQL Server.  The only option is to break up the query and reduce the number of joins – we have had to do that multiple times.   According to this AWS Forum post the Redshift engineering team is aware of this limitation.

Would love to hear about your experiences with Redshift or any of the other similar services out there.  Share your stories in the comments.

[Image credit: Unsplash.]

Related Blogs

Preventing IT outages with Smart IT
Posted by Rebekka Vuojolainen | June 14, 2018
Smart IT: What We Can Learn From These 2018 IT Outages
The year's not yet half over but dozens of major companies (and possibly many more) have already experienced serious IT outages. Some, including those at TSB Bank, Woolworths, and NHS...
system of record vs system of intelligence
Posted by Amit Shah | May 30, 2018
How Does a System of Record Fit Into a System of Intelligence?
CIOs are one of the main drivers of bringing the power of analytics to power digital transformation of the business. But like the cobbler’s children who had no shoes, the...
it operations analytics
Posted by Amit Shah | May 23, 2018
How Does IT Operations Analytics Fit Into a System of Intelligence?
IT operations analytics, or ITOA, is sometimes confused with a System of Intelligence for IT since the benefits realized (like improved user experience) often sound similar. However, each can contribute...