Adventures in Data Warehouse-as-a-Service

A key component of cloud analytics is a high-performance analytical data warehouse that can deliver a highly interactive experience for customers.

The last few years have seen the emergence of several Data Warehouse-as-a-Service offerings that promise elastic scaling and high performance. BitYota and Snowflake are startups in this area, likely implemented on AWS or another cloud infrastructure provider. AWS itself introduced Redshift about two years ago and already claims thousands of customers. Even good old Microsoft claims to have a version of SQL Server in its Azure cloud that is optimized for data warehousing.

At Numerify, we have been using Redshift as one of the components of our technology stack.  Redshift’s columnar storage technology and MPP architecture have worked well for us in processing large volumes of data.  The ability to scale as needed and the per-TB price are attractive features as well.  It’s not an easy road though – you do need to invest in understanding the limits and quirks of Redshift to fully leverage it.

For example, if you have a large number of small but wide tables, Redshift can be terribly inefficient with storage. Consider a 5-node cluster with 2 cores per node – in Redshift parlance this means there are 5 * 2 = 10 slices. Assuming a distribution that spreads data evenly across all slices, a table with just a few rows but 100 columns could use 1GB of space! Redshift uses a massive 1MB block size (by comparison, Oracle's block size is configurable from 2KB to 32KB), and each column is stored in a separate block on each slice. So 10 slices * 100 columns * 1MB block size = 1GB. Assuming all columns are 4-byte INTs with raw encoding (no compression), each block can fit 262,144 rows, and in turn 2,621,440 rows fit across the 10 slices. In other words, 10 rows and 2.6M rows take up the same 1GB of space because of the large block size and columnar storage.
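The back-of-the-envelope arithmetic is easy to reproduce. Here is a minimal sketch in Python using the numbers from the example above (10 slices, 100 INT columns, 1MB blocks); the constants are assumptions taken from that example, not values read from a live cluster.

```python
# Back-of-the-envelope estimate of the minimum on-disk footprint of a small
# but wide Redshift table, using the example's assumptions: 10 slices,
# 100 columns, a fixed 1 MB block size, and raw-encoded 4-byte INT columns.
SLICES = 10                    # 5 nodes * 2 slices per node
COLUMNS = 100                  # user-defined columns
BLOCK_BYTES = 1 * 1024 * 1024  # Redshift's 1 MB block size
INT_BYTES = 4                  # raw-encoded INT

# Each column needs at least one block on each slice, no matter how few rows it holds.
min_footprint_mb = SLICES * COLUMNS * BLOCK_BYTES // (1024 * 1024)
print(f"Minimum footprint: {min_footprint_mb} MB (~1 GB)")

# How many rows could fit in that same footprint if the blocks were full?
rows_per_block = BLOCK_BYTES // INT_BYTES      # 262,144
rows_per_cluster = rows_per_block * SLICES     # 2,621,440
print(f"Rows that would fill the same space: {rows_per_cluster:,}")
```

On a real cluster, you can sanity-check this against Redshift's SVV_TABLE_INFO system view, whose size column reports each table's footprint in 1MB blocks.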

Though this may seem like an awful waste of space, it only bites when your schema has this particular profile – many small but wide tables. Maybe Redshift will allow configurable block sizes in the future?

Another example: for queries with a large number of joins, you may get an error saying “m_num_steps < MAX_STEPS – exceeded maximum plan steps of 100”. The exact same query runs without a hitch in Oracle or SQL Server. The only option is to break the query up and reduce the number of joins per statement – we have had to do that multiple times. According to this AWS Forum post, the Redshift engineering team is aware of the limitation.
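To illustrate the workaround, here is a minimal sketch (assuming psycopg2 and a reachable Redshift cluster; the endpoint, credentials, table names, and columns are all hypothetical) that materializes the first group of joins into a temporary table and then runs the remaining joins against it, so that no single statement blows past the plan-step ceiling.

```python
# Sketch of splitting a many-join query into two smaller statements on Redshift.
# All connection details and table/column names below are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",  # hypothetical credentials
)

with conn, conn.cursor() as cur:
    # Step 1: resolve the first group of joins into a temporary table.
    cur.execute("""
        CREATE TEMP TABLE stage_orders AS
        SELECT o.order_id, o.customer_id, d.region, p.category
        FROM orders o
        JOIN order_details d ON d.order_id = o.order_id
        JOIN products p      ON p.product_id = d.product_id
    """)

    # Step 2: join the materialized result with the remaining tables.
    cur.execute("""
        SELECT c.segment, s.region, COUNT(*) AS order_count
        FROM stage_orders s
        JOIN customers c ON c.customer_id = s.customer_id
        GROUP BY c.segment, s.region
    """)
    for row in cur.fetchall():
        print(row)
```

Each statement then gets its own, smaller plan, which keeps the step count under the limit at the cost of an extra intermediate write.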

We would love to hear about your experiences with Redshift or any of the other similar services out there. Share your stories in the comments.

[Image credit: Unsplash.]
