In-memory stepping up to big and fast data analytic crunch

Madan Sheina, Lead Analyst, Software – Information Management

Interest in in-memory computing platforms is rising. Until now, the primary driver has been the “need for speed,” helping companies to process relatively modest data sets more quickly.

Both traditional and newer in-memory vendors are looking to scale up these capabilities with innovative platforms that promise to meet the increasing demand for more accelerated, operationally driven analytics against Big Data residing in Hadoop environments.

Their efforts promise to overcome the limitations of traditional analytic architectures and, if successful, to bolster Hadoop’s credentials as a mainstream option for operational enterprise computing needs.

Two magnitudes of in-memory play into Hadoop

Two “magnitudes” of in-memory stand out – processing speed and data scale. In-memory has more or less proved itself as a viable platform for processing data more quickly – at speeds that traditional technologies simply cannot handle. For companies that require realtime analytics, in-memory computing is certainly becoming a valid economic option – in particular for Hadoop, which was originally conceived to work in batch mode but is now evolving from analyzing historical data to analyzing live (transactional) Big Data at terabyte and even petabyte scale.

We see two reasons, in particular, why Hadoop is an attractive target: linear scalability (which opens it up to Big Data analytics) and economics (it runs as open source on commodity hardware). These capabilities, however, are still at an embryonic stage of development.

Add into the mix cheaper DRAM (dynamic random access memory) and an open source framework that is cheaper than a traditional, proprietary CEP (complex event processing) platform and you have a perfect set of technical drivers to use this technology as a platform for enabling (near) realtime and scalable analytics on Hadoop clusters.

Providing such scale and acceleration is significant for processing Big and Fast Data; in particular it raises the potential for Hadoop to be used as a platform for realtime analytic processing on streaming data, minus the latency introduced by disk-based access.

HANA and other platforms enabling realtime analytics for Hadoop

IT giants continue to develop their in-memory offerings. SAP’s HANA is undoubtedly the highest-profile in-memory platform today, and much of that profile is down to SAP itself, whose backing lends it considerable credibility. IBM, meanwhile, is promoting its own in-memory accelerator for the DB2 database. However, these solutions can be expensive and tend to work alongside Hadoop – that is, accessing and querying data from it and processing that data externally in their own proprietary in-memory databases.

Enter smaller players like ScaleOut and GridGain, whose alternative in-memory data grids can be used as accelerators integrated within the Hadoop environment. Both combine in-memory database technology with grid infrastructure, leveraging memory to process data more quickly. Significantly, both also offer custom read and write data-processing capabilities, meaning that MapReduce code can be executed over in-memory data and results written back quickly to the HDFS store.
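
To make the grid-side processing concrete, the sketch below shows a map-and-reduce style aggregation executed entirely over data that is already resident in memory. It is a minimal, vendor-neutral illustration in plain Java – it uses no ScaleOut or GridGain APIs – of the kind of computation these grids distribute across node memory.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Vendor-neutral illustration: a "map" step that extracts a key from each
    // in-memory record and a "reduce" step that aggregates values per key,
    // without touching disk. Grid products spread this work across node memory.
    public class InMemoryMapReduceSketch {
        record Trade(String symbol, double amount) {}

        public static Map<String, Double> totalBySymbol(List<Trade> liveTrades) {
            return liveTrades.stream()                                 // data already held in RAM
                    .collect(Collectors.groupingBy(
                            Trade::symbol,                             // map: emit (symbol, amount)
                            Collectors.summingDouble(Trade::amount))); // reduce: sum per symbol
        }
    }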

Technically, these data grids are designed not as databases per se, but are implemented as Java object grids. Why? Because by not being databases they avoid the I/O and indexing overhead of databases. That design is reflected in Big Data grid solutions from established vendors – e.g. Oracle Coherence and IBM WebSphere eXtreme Scale. Vendors like ScaleOut and GridGain are entering the market with solutions based on the same design premise, but differentiated by open source licensing and by engineering specifically geared to run in conjunction with Hadoop, which cannot (not yet, at least) be classified as a “database” per se.
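
Stripped to its essentials, a Java object grid is a set of partitioned, concurrent maps holding live objects rather than indexed, disk-backed rows. The toy sketch below – our own simplification, not any vendor’s design – shows why reads and writes avoid disk I/O and index maintenance; real grids add replication, eviction, and cross-node distribution on top of this pattern.

    import java.util.concurrent.ConcurrentHashMap;

    // Toy in-memory object grid: keys are hashed to partitions, and each
    // partition is simply a concurrent map of live Java objects. In a
    // distributed grid, each partition would reside in a different node's RAM.
    public class TinyObjectGrid<K, V> {
        private final ConcurrentHashMap<K, V>[] partitions;

        @SuppressWarnings("unchecked")
        public TinyObjectGrid(int partitionCount) {
            partitions = new ConcurrentHashMap[partitionCount];
            for (int i = 0; i < partitionCount; i++) {
                partitions[i] = new ConcurrentHashMap<>();
            }
        }

        private ConcurrentHashMap<K, V> partitionFor(K key) {
            return partitions[Math.floorMod(key.hashCode(), partitions.length)];
        }

        public void put(K key, V value) { partitionFor(key).put(key, value); } // no write-ahead log, no index update
        public V get(K key) { return partitionFor(key).get(key); }             // pure in-memory lookup, no disk I/O
    }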

ScaleOut and GridGain are well suited for a variety of Fast Data processing applications

ScaleOut provides an in-memory data grid called hServer that lets organizations run Hadoop MapReduce on live data without having to install or manage layers of Hadoop. hServer, in effect, acts as a self-contained Hadoop MapReduce engine that removes the need to install and use the standard Hadoop open source distribution. And by bypassing Hadoop’s stock batch scheduler in favor of its own parallel processing engine, it is able (according to the vendor’s claims) to execute MapReduce jobs in seconds rather than minutes.
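
The MapReduce code itself remains the familiar Hadoop programming model; what hServer changes, according to the vendor, is where that code runs and how it is scheduled. As a point of reference, a stock word-count mapper and reducer written against the standard org.apache.hadoop.mapreduce API – nothing here is hServer-specific – looks like this:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Standard Hadoop MapReduce word count. Vendors such as ScaleOut claim to run
    // code like this against in-memory data rather than batch-scheduled HDFS splits.
    public class WordCount {
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    ctx.write(new Text(token), ONE);               // emit (word, 1)
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) total += c.get();     // sum the 1s per word
                ctx.write(word, new IntWritable(total));
            }
        }
    }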

GridGain is another in-memory computing solution provider targeting Big Data. According to the vendor, its namesake platform bypasses the inherent latency and complexity of having to replicate and transfer source data to HDFS for analysis. Instead it, like ScaleOut, taps data directly from the source servers, but it is differentiated by offering separate in-memory products matched to specific analytic use cases and processing loads. Its intent is to provide customized configurations of its technology that appeal to specific needs – typically transactional processing.

Regardless of the approach, the net result is dramatically accelerated data processing, which opens up Hadoop as a potential platform for a new set of more operationally focused workloads, potentially including streaming data for realtime applications – for example, hedge fund portfolio management, recommendation engines, fraud detection, consumer credit scoring, and purchase history analysis.

In-memory acceleration is no longer a rich man’s game

In-memory prices continue to tumble, but memory is still pricey compared to disk – though nowhere near the prohibitive cost that once excluded it as an alternative platform for data warehousing and analytic processing. This is one of the main reasons why the notion of data tiering has become increasingly popular among enterprise database and data management software vendors like IBM, Oracle, and SAP.
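
In practice, data tiering means keeping the hot working set in DRAM while colder data stays on cheaper disk. The sketch below is an illustrative simplification of that hot/cold promotion pattern – our own example, not any particular vendor’s implementation – with an LRU-bounded in-memory tier backed by a slower “cold” lookup.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.Function;

    // Illustrative hot/cold tiering: a bounded, LRU-ordered in-memory tier backed
    // by a slower cold lookup (e.g. a read from a disk-based store).
    public class TieredStore<K, V> {
        private final Function<K, Optional<V>> coldLookup;    // e.g. read from disk
        private final Map<K, V> hotTier;

        public TieredStore(int hotCapacity, Function<K, Optional<V>> coldLookup) {
            this.coldLookup = coldLookup;
            this.hotTier = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > hotCapacity;               // evict LRU entries to keep DRAM use bounded
                }
            };
        }

        public Optional<V> get(K key) {
            V hot = hotTier.get(key);
            if (hot != null) return Optional.of(hot);          // DRAM hit: fast path
            Optional<V> cold = coldLookup.apply(key);          // miss: fall back to disk
            cold.ifPresent(v -> hotTier.put(key, v));          // promote into the hot tier
            return cold;
        }
    }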

But we equally believe that alternatives to traditional analytic data-processing technologies, like ScaleOut and GridGain, which eliminate that processing overhead for Big Data, will be a boon for a wide range of applications and industries that rely on realtime data processing against Hadoop’s data infrastructure. Economics is playing a key role: memory prices are falling to the point where even single or multiple terabytes of DRAM are within the reach and budget of many organizations.

Ovum believes it will be the smaller companies in this space – like ScaleOut and GridGain – that will lead the way, with cost-effective solutions that also accommodate existing data infrastructure.

Further reading

SAP HANA 1.0, SPS6, IT014-002840 (December 2013)

2014 Trends to Watch: Big Data, IT014-002814 (October 2013)
