How big data improves - and complicates - predictive analytics
By Andy Flint, FICO
Analytics depends on data — the more, the merrier. If we’re trying to model, say, the behaviour of customers responding to marketing offers or clicking through a website, we can build a far stronger model with 10,000 samples than with 100.
You would think, then, that the rise of Big Data and its seemingly inexhaustible supply of data would be every analyst’s dream. But Big Data poses its own challenges for modelling. Much of Big Data isn’t what we have historically thought of as “data” at all. In fact, 80% of Big Data is raw, unstructured information, such as text, and doesn’t neatly fit into the columns and rows that feed most modelling programs.
Here’s how data scientists seeking to harness Big Data for predictive modelling have addressed the challenges presented by a mass of messy data.
Turning words into numbers
Modelling programs, such as the programs FICO uses to build credit scoring models, fraud detection models and marketing models, rely on numbers. When we’re building models from textual sources — such as emails, Word documents, case notes — we have to intelligently transform that text into numbers or something that behaves like numbers.
Using text mining algorithms, we can scour large text data sources to find associations between words and outcomes. It’s perhaps easiest to look at a case of supervised modelling, where the outcome we are modelling for is known — for example, credit fraud or customer profitability. We use that outcome to direct the search for repeating terms (words or phrases) that have real signal strength — that is, they are often present in records associated with one side of our possible outcomes.
We then build up a term-document matrix, which lists all the unique terms in the corpus of text we are examining, across all the cases (or documents) in the analysis. Among this often enormous list of terms, which terms are used most frequently? More pointedly, which terms appear most reliably in connection with one outcome (e.g., purchases of product X) versus the other (no such purchases)?
From there, we reduce the problem by ranking the words or phrases by their signal strength and keeping only the strongest, so that we have a tractable number of features to select from in our formal model. Finally, the presence and frequency of these words and phrases can be represented as just another column of numbers in our traditional predictive model.
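The matrix-then-rank steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the whitespace tokeniser and the measure of "signal strength" (the gap in document frequency between the two outcomes) are all invented here, and a production text mining tool would likely use something more robust.

```python
from collections import Counter

# Hypothetical toy corpus: (document text, outcome); 1 = purchased product X.
DOCS = [
    ("customer asked about upgrade pricing", 1),
    ("upgrade requested and pricing confirmed", 1),
    ("customer asked to close account", 0),
    ("complaint about account fees", 0),
]

def tokenize(text):
    """Naive tokeniser (an assumption): lowercase and split on whitespace."""
    return text.lower().split()

def term_document_matrix(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(tokenize(text)) for text, _ in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

def rank_by_signal(docs, vocab, matrix):
    """Rank terms by the gap in document frequency between the two outcomes.

    This scoring rule is an illustrative stand-in for "signal strength".
    """
    pos = [row for row, (_, y) in zip(matrix, docs) if y == 1]
    neg = [row for row, (_, y) in zip(matrix, docs) if y == 0]
    scores = {}
    for j, term in enumerate(vocab):
        pos_rate = sum(1 for row in pos if row[j] > 0) / len(pos)
        neg_rate = sum(1 for row in neg if row[j] > 0) / len(neg)
        scores[term] = pos_rate - neg_rate
    # Strongest terms first, regardless of which outcome they point to.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

vocab, matrix = term_document_matrix(DOCS)
ranked = rank_by_signal(DOCS, vocab, matrix)
top_terms = [term for term, _ in ranked[:3]]
```

In this toy corpus, terms such as "upgrade" and "pricing" appear only in purchase records and "account" only in non-purchase records, so they rise to the top, while "customer" appears equally on both sides and carries no signal.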
We call that a “semantic scorecard” – it uses traditional scorecard methodology (turning traditional numeric and categorical data into characteristics and attributes, assigning weights to those attributes, and totalling the attributes to produce a score), augmented with features drawn from the raw, unstructured information. The challenging part is crunching the data to identify which words and phrases have the greatest signal strength.
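To make the scorecard idea concrete, here is a minimal sketch in Python. Every characteristic name, bin boundary and point weight below is hypothetical and chosen purely for illustration; it is not FICO's scorecard, only an example of binning values into attributes, weighting them and totalling the result, with one text-derived characteristic sitting alongside the traditional ones.

```python
# Hypothetical scorecard: each characteristic maps attribute bins to points.
# Bins are (low, high, points) and cover the half-open range [low, high).
SCORECARD = {
    "months_on_book": [(0, 12, 10), (12, 60, 25), (60, float("inf"), 40)],
    "utilisation_pct": [(0, 30, 35), (30, 80, 20), (80, float("inf"), 5)],
    # Text-derived characteristic: how often a high-signal term appears in notes.
    "term_count_dispute": [(0, 1, 30), (1, 3, 15), (3, float("inf"), 0)],
}

def attribute_points(bins, value):
    """Return the points of the bin that contains value."""
    for low, high, points in bins:
        if low <= value < high:
            return points
    raise ValueError(f"value {value} falls outside all bins")

def count_term(text, term):
    """Count whole-word occurrences of term, using a naive whitespace split."""
    return text.lower().split().count(term)

def score(record):
    """Total the points of the matching attribute for every characteristic."""
    return sum(
        attribute_points(bins, record[name]) for name, bins in SCORECARD.items()
    )

# Hypothetical case: two structured fields plus a free-text case note.
notes = "customer raised a dispute and a second dispute was logged"
record = {
    "months_on_book": 24,
    "utilisation_pct": 45,
    "term_count_dispute": count_term(notes, "dispute"),
}
total = score(record)
```

The point of the sketch is that, once the text has been reduced to a count, the scoring code treats it exactly like any other numeric characteristic.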
And as with any modelling challenge, it’s always critical to bring some knowledge and intuition to the problem. In text mining, one way to impart that domain knowledge is to introduce a synonym list, which aligns different words that carry the same meaning and can improve the accuracy of the analytics. For instance, while any credit analyst would know that “1 cycle” and “30 days” indicate the same level of payment delinquency, the lowly computer does not, at least not until we teach it. The synonym list is one way to interact with and iteratively improve the analysis and derive concepts and meaning from raw text.
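A synonym list can be as simple as a lookup applied before terms are counted. The sketch below is an assumption about how such a list might be wired in, not a description of any particular product: the canonical token name `delinq_30` is invented, and only the two variants mentioned above are included.

```python
import re

# Hypothetical synonym list: variant phrase -> canonical concept token.
SYNONYMS = {
    "1 cycle": "delinq_30",
    "30 days": "delinq_30",
}

def normalise(text, synonyms=SYNONYMS):
    """Replace each known variant with its canonical token.

    Longer variants are applied first so that a short variant cannot
    clobber part of a longer one. Word boundaries stop "30 days" from
    matching inside "130 days".
    """
    result = text.lower()
    for variant in sorted(synonyms, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        result = re.sub(pattern, synonyms[variant], result)
    return result

a = normalise("Account is 1 cycle past due")
b = normalise("Account is 30 days past due")
```

After normalisation both notes contain the same token, so the term-document matrix counts them as one concept rather than two unrelated phrases. Analysts can extend the dictionary as they review the results, which is the iterative loop described above.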
The unsupervised problem
Not every data analysis project starts with a clearly defined objective variable — the thing we’re solving for, like presence of fraud or customer profitability. In fact, many problems today start with a different question: What value can we get out of this mass of data?
We’re still looking for patterns, but they’re not necessarily directional patterns. This is where the machine-learning art of clustering comes in. Say you are a retailer looking for ways to categorize your customers based on 1 million purchases as well as demographic data. What’s the best way to group these customers so that you can find the best offers or promotions?
In this kind of case, we don’t know for sure what the “right” answer will look like in a large N-dimensional space. Through clustering technology, we can let the machine find the naturally occurring relationships. For example, a group of customers who reliably buy high-tech products as soon as they hit the shelves might distinctly emerge, and we’d likely regard them as our “early adopters”. As a group, their purchase patterns stand clearly apart from customers who focus their buying power on groceries and other living essentials. Clustering algorithms identify these patterns and suggest these distinct groups. The analyst can then examine those groupings to see if they make business sense, and reduce or expand the number of groupings to better fit the data, and better fit our business needs.
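As a rough illustration of clustering, the following sketch implements k-means, one common clustering algorithm, in plain Python. The article does not say which algorithm is used in practice, so treat this choice as an assumption. The customer data, the two features and the deterministic starting centroids are also invented for the example.

```python
import math

# Hypothetical customers: (share of spend on new tech, share on groceries).
CUSTOMERS = [
    (0.80, 0.10), (0.75, 0.15), (0.85, 0.05),  # tech-heavy shoppers
    (0.10, 0.80), (0.15, 0.70), (0.05, 0.85),  # grocery-heavy shoppers
]

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, max_iter=100):
    """Basic k-means; returns (cluster label per point, final centroids)."""
    # Deterministic start: evenly spaced input points. This is a simplification;
    # real implementations usually use random or k-means++ initialisation.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = mean(members)
    return labels, centroids

labels, centroids = kmeans(CUSTOMERS, k=2)
```

Changing `k` is the code-level equivalent of the analyst reducing or expanding the number of groupings; whether the resulting groups make business sense still has to be judged by a person.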
Is causality dead?
As machine learning takes over more of the pattern analysis in Big Data, some experts have concluded that we should stop worrying about causality (what’s causing the relationships), and just focus on the data relationships, the correlation.
In their book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Kenneth Cukier and Viktor Mayer-Schönberger cite examples such as the use of words in online searches to identify where a disease may be breaking out. They argue that the search for causation is too often misguided, and that much of the value in data can be mined just by focusing on correlation. No doubt, faster detection of outbreaks demonstrates that correlation alone can be profoundly valuable.
The best aspect of the debate on causation versus correlation is that it compels us to think critically about what we seek to learn from the data, and how those findings might drive our actions. A spike in web searches about shared medical symptoms coming from one location might be all the signal we need to initiate positive actions to treat and contain an outbreak.
But of course, the curious data scientist will want to go further, and analyse more data to reveal transmission rates, mechanisms of infection, and ultimately the root cause, because that could even prevent the next outbreak.
Andy Flint is a senior director for analytic product management at FICO, a leading global analytics software company.