Adventures in Data Warehouse-as-a-Service

A key component of any cloud analytics offering is a high-performance analytical data warehouse that enables a highly interactive experience for customers.

The last few years have seen the emergence of several Data Warehouse-as-a-Service offerings that promise elastic scaling and high performance.  BitYota and Snowflake are startups in this area, likely implemented on AWS or some other cloud infrastructure provider.  AWS itself introduced Redshift about two years ago and already claims thousands of customers.  Even good old Microsoft claims to have a version of SQL Server in its Azure cloud that’s optimized for data warehousing.

At Numerify, we have been using Redshift as one of the components of our technology stack.  Redshift’s columnar storage technology and MPP architecture have worked well for us in processing large volumes of data.  The ability to scale as needed and the per-TB price are attractive features as well.  It’s not an easy road though – you do need to invest in understanding the limits and quirks of Redshift to fully leverage it.

For example, if you have a large number of small but wide tables, Redshift can be terribly inefficient with storage.  Consider a 5-node cluster with 2 cores per node – in Redshift parlance this means there are 5 * 2 = 10 slices.  Assuming a distribution that spreads data evenly across all slices, a table with just a few rows but with 100 columns could use 1GB of space!  Redshift uses a massive 1MB block size (by comparison, Oracle’s block size is configurable from 2KB to 32KB), and each column is stored in a separate block on each slice.  So 10 slices * 100 columns * 1MB block size = 1GB.  Assuming all columns are 4-byte INTs with raw encoding (no compression), each block can fit 262,144 rows, which means 2,621,440 rows fit across the 10 slices.  In other words, 10 rows and 2.6M rows take up the same 1GB of space because of the large block size and columnar storage.
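To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python.  The node count, column count, and block size are just the figures from the example above, not values read from a live cluster:

```python
# Back-of-the-envelope estimate of Redshift's minimum storage footprint
# for a small but wide table, using the figures from the example above.

BLOCK_SIZE_BYTES = 1 * 1024 * 1024   # Redshift's 1MB block size
INT_SIZE_BYTES = 4                   # raw (uncompressed) 4-byte INT column

def min_table_footprint(nodes, cores_per_node, num_columns):
    """Minimum space a populated table occupies: one block per column per slice."""
    slices = nodes * cores_per_node
    return slices * num_columns * BLOCK_SIZE_BYTES

def rows_in_that_footprint(nodes, cores_per_node):
    """How many rows of 4-byte INTs fit before any column needs a second block."""
    rows_per_block = BLOCK_SIZE_BYTES // INT_SIZE_BYTES   # 262,144 rows per block
    return rows_per_block * nodes * cores_per_node        # one block per slice

footprint = min_table_footprint(nodes=5, cores_per_node=2, num_columns=100)
print(f"Minimum footprint: {footprint / 1024**3:.1f} GB")        # ~1 GB (1,000 blocks of 1MB)
print(f"Rows that fit in it: {rows_in_that_footprint(5, 2):,}")  # 2,621,440
```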

Though this may seem like an awful waste of space, it happens only when the database tables have a certain profile.  Maybe Redshift will allow for configurable block sizes in the future?

Another example: queries with a high number of joins may fail with an error saying “m_num_steps < MAX_STEPS – exceeded maximum plan steps of 100”.  The exact same query runs without a hitch in Oracle or SQL Server.  The only option is to break up the query and reduce the number of joins – we have had to do that multiple times.  According to this AWS Forum post, the Redshift engineering team is aware of the limitation.
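The workaround is essentially to materialize part of the join tree first and then join against the intermediate result.  Below is a rough sketch of the idea, not our actual queries: the connection details, table and column names, and the two-stage split are illustrative assumptions, and in practice the split only matters once a query has far more joins than shown here:

```python
# Sketch: splitting a join-heavy query into two stages so that no single
# statement exceeds Redshift's plan-step limit. All connection details and
# table/column names below are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="report_user", password="...")

with conn.cursor() as cur:
    # Stage 1: materialize one part of the join tree into a temp table.
    cur.execute("""
        CREATE TEMP TABLE stage1 AS
        SELECT f.order_id, f.order_date, f.amount, c.region, p.category
        FROM fact_orders f
        JOIN dim_customer c ON f.customer_id = c.customer_id
        JOIN dim_product  p ON f.product_id  = p.product_id
    """)

    # Stage 2: join the intermediate result with the remaining dimensions,
    # keeping the per-statement join count (and therefore plan steps) low.
    cur.execute("""
        SELECT s.region, s.category, d.fiscal_quarter, SUM(s.amount) AS revenue
        FROM stage1 s
        JOIN dim_date d ON s.order_date = d.date_key
        GROUP BY s.region, s.category, d.fiscal_quarter
    """)
    results = cur.fetchall()
```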

Would love to hear about your experiences with Redshift or any of the other similar services out there.  Share your stories in the comments.

[Image credit: Unsplash.]
