By Mark Davis, Distinguished Big Data Engineer, Dell Software Group, Santa Clara, California
Big data technologies are increasingly considered an alternative to the data warehouse. Surveys of large corporations and organizations bear out a strong desire to incorporate big data management approaches into their competitive strategy.
But what is the value that these companies see? Faster decision making, more complete information, and greater agility in the face of competitive challenges. Traditional data warehousing involved complex steps to curate and schematize data, combined with expensive storage and access technologies. A complete plan had to work through archiving, governance, visualization, master data management, OLAP cubes, and a range of different user expectations and project stakeholders. Managing these projects through to success also required coping with rapidly changing technology options. The end result was often failure.
With the big data stack, some of these issues are pushed back or simplified. For example, schematizing and merging data sources need not be considered up front in many cases, but can be done on demand. The concept of schema-on-read is based on a widely seen usage pattern for data that emerged from agile web startups. Log files from web servers needed to be merged with relational stores to provide predictive value about user “journeys” through the website. The log files could be left at rest in cheap storage on commodity servers beefed up with software replication capabilities. The data was touched only when parts of the logs needed to be merged or certain timeframes of access needed to be analyzed.
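The schema-on-read pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular product: the log format, the `RAW_LOGS` and `USERS` sample data, and the function names `read_with_schema` and `journeys` are all hypothetical. The raw lines stay as unparsed text, and a schema is applied only to the timeframe being analyzed, at the moment the records are merged with the relational data.

```python
import re
from datetime import datetime

# Hypothetical raw web-server log lines, left untouched "at rest".
RAW_LOGS = [
    "2014-03-01T10:00:05 user=17 path=/home",
    "2014-03-01T10:01:30 user=17 path=/products/42",
    "2014-03-01T10:02:10 user=23 path=/home",
    "corrupt line that matches no schema",
    "2014-03-02T09:15:00 user=17 path=/checkout",
]

# Hypothetical relational table: user id -> customer segment.
USERS = {17: "returning", 23: "new"}

# The "schema" lives in the reader, not in the stored data.
LINE = re.compile(r"^(?P<ts>\S+) user=(?P<user>\d+) path=(?P<path>\S+)$")


def read_with_schema(lines, start, end):
    """Parse lines at read time, yielding only records with start <= ts < end."""
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # lines that do not fit the schema are skipped on read
        ts = datetime.fromisoformat(match["ts"])
        if start <= ts < end:
            yield {"ts": ts, "user": int(match["user"]), "path": match["path"]}


def journeys(lines, users, start, end):
    """Merge parsed log records with the user table into per-user page journeys."""
    result = {}
    for rec in read_with_schema(lines, start, end):
        segment = users.get(rec["user"], "unknown")
        result.setdefault((rec["user"], segment), []).append(rec["path"])
    return result


# Only the records for 1 March are parsed into structured form and merged.
march_1 = journeys(RAW_LOGS, USERS, datetime(2014, 3, 1), datetime(2014, 3, 2))
print(march_1)
```

Because nothing is validated at load time, the malformed line costs nothing until a reader encounters it, and a different analysis could apply a different pattern to the same stored text.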
Distributing data processing on commodity hardware led to the obvious next step of moving parts of the data into memory or processing it as it streams through the system. This most recent evolution of the big data stack shares characteristics with high performance computing techniques, which have increasingly ganged together processors across interconnect fabrics rather than relying on custom processors tied to large collections of RAM. The BDAS (Berkeley Data Analytics Stack) exemplifies this new world of analytical processing. BDAS combines an in-memory, distributed processing engine called Spark, a streaming system called Spark Streaming, a graph processing library that layers on top of Spark called GraphX, and machine learning components called MLbase. Together these tools sit on top of Hadoop, which provides a resilient, replicated storage layer combined with resource management.
What can we expect in the future? Data warehousing purists have watched these developments with a combination of interest and some degree of skepticism. The skepticism arises because the solutions they have perfected through the years are not yet fully baked in the big data community. To them, it has seemed a bit like amateur hour.
But that is changing rapidly. Security and governance, for instance, have been weak parts of the big data story, but there are now security approaches that range from Kerberos protocols permeating the stack to integrated REST APIs with authentication at the edges of the clustered resources. Governance is likewise improving, with projects growing out of the interplay between open source contributors and enterprises that want to explore the tooling. We will continue to see a rich evolution of the big data world until it looks more and more like traditional data warehousing, but perhaps with a lower cost of entry and increased accessibility for developers and business decision makers.
About the author:
Mark Davis founded Kitenga, one of the first big data analytics startups, which was acquired by Dell Software Group in 2012, where he now serves as a Distinguished Engineer. Mark led Big Data efforts as part of the IEEE Cloud Computing Initiative, serves on the Intercloud Testbed Executive Committee, and contributes to the IEEE Big Data Initiative.