Amazon Web Services fills out its big data cloud platform
Tony Baer, Principal Analyst, Ovum IT Enterprise solutions, Ovum
Announcements of new data platforms were highlighted at Amazon Web Services’ (AWS) first-ever “re:Invent” user conference this week. Among the headlines, AWS announced Amazon Redshift, a managed, petabyte-scale data warehouse service that includes technology components licensed from ParAccel. Amazon Redshift will deliver the power of massively parallel columnar databases at a commodity price for data warehousing customers.
AWS also announced AWS Data Pipeline, a utility to simplify orchestration of data flows between both AWS-based and on-premise data sources and AWS-based processing services.
These new platforms and services will help fill out and add connective tissue to the various AWS platforms and services. Now that Amazon is automating data flows, it should take the next step and add integrated SQL views to NoSQL data stores, something that other providers, including Microsoft, Cloudera, and Hortonworks, are already pursuing.
AWS’s evolving cloud data platforms
AWS’s development of data platforms for customers has been a journey that began with SimpleDB, a basic key-value NoSQL data store that was aimed at online applications and was capable of delivering more scale than MySQL. SimpleDB was enhanced over the years with features such as the addition of Memcached, which improved read latency from AWS’s S3 storage platform.
Earlier this year AWS introduced DynamoDB, which added new features such as automated provisioning, high availability, and high performance via solid-state drives (flash storage). AWS offers other data platforms tailored for varying customer needs, including Amazon Relational Database Service (RDS), which offers managed instances of MySQL, Oracle 11g, and Microsoft SQL Server; Amazon Elastic MapReduce (EMR), which offers Hadoop processing; ElastiCache, a scalable in-memory cache; and Amazon Simple Queue Service (SQS). This is in addition to options for clients to deploy their database and applications of choice on the EC2 infrastructure-as-a-service (IaaS) offering.
AWS’s announcements this week filled several key gaps, including automating the orchestration of data flows between the various AWS and on-premise platforms (AWS Data Pipeline), and a new SQL-based data warehousing service designed for scale (Amazon Redshift).
Making a play for enterprise data warehousing
Amazon Redshift is a managed, scalable data warehouse service that licenses technology components from ParAccel, an Advanced SQL analytics platform provider in which Amazon previously made an equity investment. (Amazon, however, maintains that there is no connection between the investment and the licensing arrangement.)
As one of the last of the first-generation Advanced SQL analytics platform providers to remain independent, ParAccel re-engineered the SQL platform to overcome scale and performance limitations for multi-terabyte data warehouses. It adapted the open source PostgreSQL platform for a columnar table store architecture. Columnar storage is well suited to data warehousing, where the need is to aggregate and gain insights from the values of a field across many records, rather than to retrieve individual records. Columnar data stores also lend themselves well to compression, multiplying effective capacity and accelerating performance.
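To make the columnar argument concrete, the following is a minimal sketch in Python, not a description of how ParAccel or Redshift is implemented. The sample rows, the field names, and the choice of run-length encoding as the compression scheme are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical sample records: (order_id, region, amount).
rows = [
    (1, "EU", 120.0),
    (2, "EU", 80.0),
    (3, "US", 200.0),
    (4, "US", 50.0),
    (5, "US", 150.0),
]

# Row layout: each record is stored together, so an aggregate over one
# field still has to walk through every full record.
row_total = sum(record[2] for record in rows)

# Columnar layout: each field is stored contiguously as its own array.
columns = {
    "order_id": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

# The same aggregate now reads only the single column it needs.
column_total = sum(columns["amount"])


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Values within one column tend to repeat, which is why columns compress well.
encoded_region = run_length_encode(columns["region"])
```

Here the five-entry region column collapses to two pairs, while both layouts return the same total. Production columnar engines use more sophisticated encodings, but the principle is the same.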
AWS uses a subset of ParAccel’s technology, focusing on delivering a low-cost, scalable, high-performance platform, and has not licensed any of the prepackaged analytics functions that ParAccel offers in its full product. Consistent with AWS practice, Redshift is intended to price Advanced SQL analytics as a commodity. It offers three linear pricing tiers, with the most economical being three-year reserved pricing at $999 per terabyte per year.
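As a rough illustration of what the quoted rate implies, the sketch below multiplies it out for a hypothetical 10-terabyte warehouse. It assumes cost scales strictly linearly with capacity and ignores node-size minimums, upfront payment terms, and data transfer charges, none of which are detailed here; the function name and the 10-terabyte figure are illustrative only.

```python
# Three-year reserved rate quoted in the text, in US dollars.
PRICE_PER_TB_YEAR_USD = 999
TERM_YEARS = 3


def reserved_cost(terabytes, years=TERM_YEARS, rate=PRICE_PER_TB_YEAR_USD):
    """Estimate cost assuming strictly linear pricing (an assumption)."""
    return terabytes * years * rate


# Hypothetical 10 TB warehouse.
annual_10tb = reserved_cost(10, years=1)  # 10 * 1 * 999
term_10tb = reserved_cost(10)             # 10 * 3 * 999
```

Under these assumptions, 10 terabytes works out to $9,990 per year, or $29,970 over the three-year term.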
At a fraction of the price of well-established data warehousing platforms such as Oracle or Teradata, Redshift is an aggressive play from AWS. However, Ovum believes the more immediate target is Microsoft, whose emerging Azure service is being beefed up for Big Data with the addition of the HDInsight service for Hadoop integration, and for Fast Data with the eventual availability of in-memory processing extensions of SQL Server. Both AWS and Microsoft are seeking to commercialize a new opportunity by delivering high-volume, high-performance Big Data analytics to a largely untapped market.
Starting to unify the data platform
The AWS catalog includes a variety of data platforms and services, and a variety of ways for processing the data. While the broad selection offers flexibility of deployment, it also adds complications when it comes to integrating or processing the data. This is where some of the new AWS platform integrations come in.
Redshift will have purpose-built integrations with Amazon DynamoDB and Amazon EMR. Amazon is also introducing AWS Data Pipeline, which automates the orchestration of data workflows between various AWS platforms, or between a client’s on-premise platforms and AWS targets.
For instance, if you have data in Amazon Relational Database Service (RDS) that you would like streamed into Amazon Redshift for SQL processing, AWS Data Pipeline provides a simplified tool for specifying data sources, targets, rules and conditions, and processing steps. The same would apply for moving data between custom applications or NoSQL data stores on Amazon EC2. AWS Data Pipeline provides the glue that the AWS collection of data platforms and services has lacked when coordinating data residing both in AWS and on-premise.
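The shape of such a workflow (a source, a target, conditions, and processing steps) can be sketched in a few lines of Python. This is a hypothetical model for illustration only and is not the AWS Data Pipeline API: the `Pipeline` class, its field names, and the in-memory stand-ins for an RDS table and a warehouse table are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Pipeline:
    """Hypothetical model of a data flow; not the AWS Data Pipeline API."""
    source: Callable[[], List[Record]]
    target: Callable[[List[Record]], None]
    conditions: List[Callable[[List[Record]], bool]] = field(default_factory=list)
    steps: List[Callable[[List[Record]], List[Record]]] = field(default_factory=list)

    def run(self) -> bool:
        records = self.source()
        # Skip the run entirely unless every condition holds.
        if not all(check(records) for check in self.conditions):
            return False
        # Apply the processing steps in the order they were declared.
        for step in self.steps:
            records = step(records)
        self.target(records)
        return True


# In-memory stand-ins for a relational source table and a warehouse table.
rds_rows: List[Record] = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 50.0},
]
warehouse: List[Record] = []

pipeline = Pipeline(
    source=lambda: list(rds_rows),
    target=warehouse.extend,
    conditions=[lambda records: len(records) > 0],
    steps=[lambda records: [r for r in records if r["amount"] is not None]],
)
loaded = pipeline.run()
```

In this sketch the run proceeds because the source is non-empty, the single step drops the record with a missing amount, and the remaining two records land in the stand-in warehouse. A managed service would add scheduling, retries, and connectors on top of this basic declaration.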
What about SQL/Hadoop convergence?
As AWS rolls out the connective tissue between its data platforms and services, the logical next step will be to provide tighter SQL/NoSQL integration. There is significant activity among Hadoop and Advanced SQL providers to make Hadoop more accessible, either via SQL or through common Microsoft Excel functions. Examples include:
- Microsoft bundling the Hortonworks Data Platform into its Azure cloud and providing direct Excel access using familiar tools such as PowerPivot.
- Cloudera’s Impala framework, aimed at reducing the need for an additional SQL data warehouse by implementing a massively parallel SQL framework that leverages Hadoop’s Hive metadata layer.
- Hortonworks acting as prime sponsor for HCatalog, an incubating Apache Hadoop project for enhancing the Hive metadata framework to make Hadoop appear more like an appendage of an SQL data warehouse. Teradata Aster, MapR, and ParAccel have already announced support for this framework.
Through Amazon EC2, AWS customers can mount any of the platforms to take advantage of these features in the cloud. However, with AWS’s effort to build its own platform-as-a-service with expansion or enhancement to its own data platforms, the next logical step will be for AWS to not only automate data workflows, but also provide similar levels of SQL/Hadoop data integration.