IBM’s stack on big data management and governance
The big data phenomenon presents a challenge for data management, a complex IT discipline that many organizations still struggle to deal with effectively today, even with their “small” data sets.
As data grows, so too does data uncertainty – and with it lowered quality and trust. IBM is in the process of extending its InfoSphere family of data integration and governance tooling, with a sharp focus on targeting the Hadoop framework.
However, technology is only part of the equation. While the end goal for managing and governing big data should be the same as that for traditional enterprise data – i.e. to provide the business with reliable data – the exact methods vary and continue to gel. IBM (and its customers) is still only a short way along the path of creating best practices for reconciling big data technology implementations with stewardship.
Taming big data – the next big challenge for data management
Many of the ills of corporate IT do not stem from application design or process, but rather from mismanagement of the data that feeds them. Poor data management has long been seen, though perhaps conveniently ignored, as the root cause of many IT failures.
The reality is that many enterprises find themselves conducting “reactive” data governance initiatives – trying to fix the problem after the fact, which is often too late.
Enterprise data warehouses and BI & analytic systems promised to provide a rich, trusted central source of decision-making information that organisations could use to drive their businesses forward. But without adequate data governance policies in place the decisions they made were misinformed. Lacking were controls to ensure data quality, consistent definitions (MDM), and even simple, basic metadata management.
The integration of different sets of such data without such controls has simply compounded the problem. Lack of data management technologies and practices is not the problem; rather, the question is whether existing tools and approaches to data stewardship, such as quality assurance, access control, auditing, and lifecycle management, will apply to the scale and variety of data in Hadoop.
IBM is extending its InfoSphere platform for Hadoop
It’s easy to slap a “big” label in front of existing information management technologies; this is a mistake many data management and analytic IT providers keep on making.
Companies need to think beyond the bandwagon. Solutions need to provide automated technology that will deal with big data efficiently and at scale; otherwise, enterprises will find themselves overwhelmed by data accumulation and by the intolerable amount of manual effort needed to govern the data properly.
Recently, IBM has rolled out several new products and initiatives across its information management stack that speak directly to big data management and governance, specifically through Hadoop integration:
- InfoSphere Information Server – IBM has provided deep integration between its premier data integration (ETL) and data quality platform and Hadoop environments, which are increasingly being positioned as the new platforms for big data management. That integration has been adapted so that transformation can occur in the most appropriate environment: Hadoop or Information Server.
- IBM is adapting its InfoSphere Guardium technology to help provide better protection of Hadoop data management and access – specifically for near-real-time activity monitoring. It has even rolled out an early version of software billed as “Hadoop Activity Monitoring” (HAM). Such technology is critical to addressing the gap in Hadoop’s security, which supports only Kerberos authentication and so fails to provide a complete picture of what actually happens to the data and who is using it. IBM’s HAM will prove to be a critical capability when Hadoop is used for storing sensitive data such as customer or financial records.
- IBM’s InfoSphere Optim software is also being adapted to support data masking for Hadoop environments. This is being made available on demand via an open API. Optim’s auditing and archiving capabilities are also being adapted to support “warm” (occasionally used) archived data in Hadoop. These capabilities support an emerging use case for Hadoop: as a low-cost data warehouse that keeps data that would otherwise be migrated to tape available for analytics.
- IBM’s master data management (MDM) has been integrated with IBM BigInsights (IBM’s Hadoop platform) to feed master data records and definitions into Hadoop. BigInsights can also be used for combing through publicly available data to identify unique records with which to populate MDM.
Ovum believes that MDM could, and should, be an ideal starting (and selling) point for big data governance IT projects.
Why? Because MDM can work as a logical middle ground for companies that want a single version of the truth about a customer, drawn from a variety of internal systems and external sources such as Facebook and Twitter. Hadoop-based MDM is therefore something IBM will have to address sooner rather than later.
Will big data management force a change in traditional data management thinking?
Hadoop is the next logical target for data management and integration platforms, as it is becoming a popular addition to the analytics environment. IBM’s moves make perfect sense; for instance, Guardium provides capability that is conspicuously absent from Hadoop.
The bigger question is whether the same techniques for managing, cleansing, securing, and tracking the utilisation of data will make sense in the Hadoop environment, as it is capable of storing a much broader variety of data.
Ovum’s research on data quality practices for Hadoop reveals a mixed bag: practices and approaches to cleansing data will vary based on the type of data, its value to the organization, and how it is consumed. Clearly, machine data will not demand the same degree of attention as customer data, while customer data treated as part of a broad social media focus group may in turn not demand the same rigour of attention as internally maintained records.
In some cases, the volume and variety of big data will merit a more probabilistic approach that assesses data quality at an aggregate, rather than per record, level.
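To make the aggregate idea concrete, here is a minimal sketch in Python, not drawn from any IBM product. It estimates the share of valid records from a random sample and attaches a confidence interval, rather than checking every record. The record layout, the `has_sensor_id` rule, and the `estimate_valid_fraction` helper are all hypothetical, chosen only for illustration.

```python
import math
import random


def estimate_valid_fraction(records, is_valid, sample_size, seed=0, z=1.96):
    """Estimate the share of valid records from a simple random sample.

    Returns (estimate, lower, upper) using a normal-approximation
    confidence interval; z=1.96 corresponds to roughly 95% confidence.
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    n = len(sample)
    p = sum(1 for record in sample if is_valid(record)) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical machine data: every tenth record is missing its sensor id.
records = [
    {"sensor_id": None if i % 10 == 0 else i, "reading": i * 0.5}
    for i in range(100_000)
]


def has_sensor_id(record):
    # Illustrative quality rule: a record is valid if it names its sensor.
    return record["sensor_id"] is not None


estimate, lower, upper = estimate_valid_fraction(
    records, has_sensor_id, sample_size=2_000
)
print(f"valid share ~ {estimate:.3f} (interval {lower:.3f} to {upper:.3f})")
```

The design choice is that only 2,000 of the 100,000 records are inspected, and the output is a quality score for the data set as a whole, not a verdict on each record. That trade-off may suit low-value machine data better than customer records.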
The onus is on IBM and other tools providers to support different approaches that may require different tools. IBM’s offerings are a good start in managing the diverse data and use cases that will be found in Hadoop implementations. But Ovum expects that this will become an area in which IBM and its rivals Oracle and Informatica, not to mention an emerging ecosystem of startup tools providers, will provide innovation.