Adventures in Data Warehouse-as-a-Service
or cloud analytics, a key component is a high performance analytical data warehouse that enables a highly interactive experience for customers.
The last few years have seen the emergence of several Data Warehouse-as-a-Service offerings that promise elastic scaling and high performance. BitYota and Snowflake are startups in this area likely implemented on AWS or some other cloud infrastructure provider. AWS itself introduced Redshift about 2 years ago and claims thousands of customers already. Even good old Microsoft claims to have a version of SQL Server in its Azure cloud that’s optimized for data warehousing.
At Numerify, we have been using Redshift as one of the components of our technology stack. Redshift’s columnar storage technology and MPP architecture have worked well for us in processing large volumes of data. The ability to scale as needed and the per-TB price are attractive features as well. It’s not an easy road though – you do need to invest in understanding the limits and quirks of Redshift to fully leverage it.
For example, if you have a large number of smaller but wide tables Redshift can be terribly inefficient with storage. Consider a 5 node cluster, each with 2 cores – in Redshift parlance this means there are 5 * 2 = 10 slices. Assuming a distribution that spreads data evenly across all slices, a table with just a few rows but with 100 columns could use 1GB of space! Redshift uses a massive 1MB block size (by comparison Oracle uses configurable size from 2 to 32KB) and each column is stored in a different block on each slice. So, 10 slices * 100 columns * 1MB block size = 1GB. Assuming all columns are INTs (of 4 bytes) and raw encoding (no compression) – each block can fit 262,144 rows and in turn 2,621,440 rows fit across 10 slices. So 10 rows and 2.6M rows take up the same 1GB of space because of the large block size and columnar storage.
Though this may seem like an awful waste of space, it happens only when the database tables have a certain profile. Maybe Redshift will allow for configurable block sizes in the future?
Another example is when you have queries with a high number of joins, you may get an error saying “m_num_steps < MAX_STEPS – exceeded maximum plan steps of 100”. The exact same query will run without a hitch in Oracle or SQL Server. The only option is to break up the query and reduce the number of joins – we have had to do that multiple times. According to this AWS Forum post the Redshift engineering team is aware of this limitation.
Would love to hear about your experiences with Redshift or any of the other similar services out there. Share your stories in the comments.