IBM introduces flexibility to Big Data governance
IBM’s latest Big Data enhancements for its InfoSphere integration and governance product portfolio promote an emerging approach to reconciling Big Data with master data management (MDM). IBM’s official terminology is “building confidence” in Big Data. The innovation is a probabilistic approach based on the idea that, when ensuring the validity and integrity of Big Data, “perfect is the enemy of good.”
This is very much in line with our recommendations on managing Big Data sets where the actual lineage of the data may not be known with the same degree of certainty as internally generated data.
Highlights of IBM’s announcements include new “2-click” provisioning of data, information governance dashboards covering policy compliance status, integration with IBM’s existing data lifecycle management tools, enhanced data-masking capabilities, and a new “Big Match” MDM integration capability allowing ingested data to inherit policies associated with existing master data based on the probability of match.
These announcements are the first steps toward providing more flexible approaches to overall Big Data governance.
Once the floodgates open, knowing your data becomes a messy process
Data security has evolved ever since the advent of multi-tiered architectures that separated data from the application tiers. Although not universally practiced, data governance and stewardship technologies have steadily evolved over the past 20 years for the relational world.
They have received renewed scrutiny with various regulatory and compliance initiatives that have materialized in sectors such as financial services and healthcare, and they are being stressed further with the emergence of Big Data. Ovum’s coverage of Big Data is premised on the notion that it must become a first-class citizen of the enterprise, with data security and guardianship being firmly part of that.
The challenge is that Big Data opens up the spigot to a much greater variety of data streams and sources, many of which originate outside the organization. Yet, the difficulty of vouching for data doesn’t eliminate the need for data governance and careful stewardship. It just demands a new approach, because the data is flowing in from beyond the walled garden of the enterprise.
IBM is one of several vendors responding to this challenge. Its latest release of information management solutions for Hadoop and Big Data takes the first steps toward reconciling Big Data with existing enterprise master data.
The crux of the announcements
IBM’s announcements were grouped into three areas: integration, “visual context,” and privacy/security. The integration announcements included a new self-service “2-click” capability for data provisioning, where end users can select transformations that are pre-built by IT. This extends to Hadoop a capability that is already in practice in the data warehousing world.
What’s more interesting is the next step, a “Big Match” capability that was labeled a “statement of direction.” Big Match would assign a probability rating to each attempt to reconcile unknown Big Data with the known world of the enterprise’s master data. In turn, Big Data Catalog and Agile MDM would employ similar approaches to grouping newly ingested data from the wild and assigning metadata to it.
This announcement represents an important step forward because it is not practical to govern data that originates from multiple (and often external) sources in the same way – and with the same degree of certainty – as internally generated data from enterprise transaction systems, for example.
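To make the idea of policy inheritance by match probability concrete, the following is a minimal sketch, not IBM’s Big Match implementation, whose internals have not been disclosed. The record fields, the field weights, the 0.85 threshold, and the policy names are all hypothetical, and a plain string-similarity score stands in for a properly calibrated match probability.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class MasterRecord:
    """A known master-data entity and the governance policies attached to it."""
    name: str
    email: str
    policies: set = field(default_factory=set)


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and padding."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(incoming: dict, master: MasterRecord) -> float:
    """Weighted average of field similarities (weights are illustrative)."""
    return (0.4 * field_similarity(incoming["name"], master.name)
            + 0.6 * field_similarity(incoming["email"], master.email))


def inherit_policies(incoming: dict, masters: list, threshold: float = 0.85):
    """Return (best score, policies) for an ingested record.

    At or above the threshold, the record inherits the best match's policies;
    below it, the record is flagged for review instead of being trusted.
    """
    best = max(masters, key=lambda m: match_score(incoming, m))
    score = match_score(incoming, best)
    if score >= threshold:
        return score, set(best.policies)
    return score, {"quarantine-for-review"}


masters = [MasterRecord("Jane Doe", "jane.doe@example.com",
                        {"mask-email", "retain-7y"})]
score, policies = inherit_policies(
    {"name": "jane doe ", "email": "Jane.Doe@example.com"}, masters)
```

The point of the sketch is the shape of the decision rather than the scoring method: the score is retained alongside the inherited policies, so downstream users can see how much confidence to place in the match.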
The other announcements focus on “visual context” and “agile governance,” both of which relate to the new information governance dashboards for tracking key policy compliance indicators for specific data sets or systems. A related “statement of direction” – the development of a Big Data Catalog that categorizes metadata on Big Data sources – capitalizes on the IBM DataExplorer capability announced last year for crawling data sets to discover metadata.
Finally, there is “Agile Governance,” a capability for overseeing data masking that is described in well-worn, generic terminology (we think that IBM could have come up with more original branding here). IBM also plans to add “Agile MDM,” which would provide a means for rapidly assigning master data to raw ingested data.
Perfect is the enemy of good
Not surprisingly, the most significant parts of the announcements concern areas that are still works in progress. Together, they paint a picture of a more flexible approach by IBM to governing ingested Big Data. By incorporating probabilistic logic to quantify the level of confidence that one should have in the data, IBM’s approach acknowledges the reality that the lineage of externally sourced data cannot be judged with the same level of certitude as that of internally generated data.
There is another reason why such an approach makes sense. With Big Data analytics, the goal is often focused on getting the big, rather than the exact, picture. In some cases, such as classic Internet use cases of optimizing web search or ad placement, 100% certainty is unnecessary; other cases, such as identifying customer risk categories on which to make underwriting or credit decisions, require firmer data.
Nonetheless, even for sensitive questions that are ultimately subject to policy or regulatory scrutiny, it may be useful to construct that initial, coarse, big picture before refining data and analytics criteria, and conducting a formalized, auditable decision-making process.
Therefore, IBM’s approach, which provides a mechanism for quantifying the level of confidence in matching or categorizing data, is a good first step, and is consistent with an approach we outlined in our report, Hadoop and Data Quality: From Discovery to Precision.
With Big Data, data quality should be managed based on multiple factors such as the condition of the data, the criticality of the analysis, and the degree to which actions taken are subject to internal policy and/or regulatory compliance. Analytic conclusions and the decisions made from them should be couched in terms of the level of confidence in the data’s validity and currency.
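One way to picture this is as a rule that raises the confidence required of the data as the stakes rise. The sketch below is purely illustrative: the function names, the 0-to-1 criticality scale, the linear formula, and the 0.95 floor for regulated decisions are assumptions made for the example, not figures taken from IBM or from our report.

```python
def required_confidence(criticality: float, regulated: bool) -> float:
    """Minimum data confidence (0 to 1) needed before acting on an analysis.

    criticality: 0.0 for exploratory work, 1.0 for business-critical decisions.
    regulated:   True if the resulting action is subject to compliance review.
    """
    if not 0.0 <= criticality <= 1.0:
        raise ValueError("criticality must be between 0 and 1")
    base = 0.5 + 0.4 * criticality          # illustrative linear scale
    return max(base, 0.95) if regulated else base


def disposition(data_confidence: float, criticality: float, regulated: bool) -> str:
    """Decide whether a result can drive action or only guide exploration."""
    needed = required_confidence(criticality, regulated)
    return "act" if data_confidence >= needed else "explore-only"


# The same data set (confidence 0.7) is good enough for ad placement
# but not for a regulated underwriting decision.
ad_placement = disposition(0.7, criticality=0.2, regulated=False)
underwriting = disposition(0.7, criticality=0.2, regulated=True)
```

Under these assumptions the data itself is not simply accepted or rejected. The same data set can support a coarse, exploratory picture while being held back from a formal, auditable decision.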
IBM’s moves could be early incremental steps toward applying such an approach to reconciling and categorizing data. That in turn could become a building block towards a more formalized, adaptive governance strategy for Big Data.