Collecting large amounts of data can help you answer complex questions or better serve your customers. Sheer volume of data alone, however, does not guarantee good results. Problems related to imprecision or inconsistency are often compounded with large data sets, and predictive power can suffer if the ambition of the questions you ask grows faster than the data available to answer them.
When dealing with quantitative data, for instance, increasing the number of samples or the richness of your data source will tend to increase both the precision and the accuracy of your estimates. However, taking advantage of additional dimensions in your data, or increasing the dimensionality of your predictions, requires an amount of data that grows exponentially with the number of dimensions.
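To get a feel for that exponential growth, here is a minimal sketch in Python. It assumes each dimension is split into a fixed number of bins and that you want a fixed average number of samples in every cell of the resulting grid. The function name `samples_needed` and the default values (10 bins, 5 samples per cell) are illustrative choices, not figures from any real data set.

```python
def samples_needed(n_dims, bins_per_dim=10, samples_per_cell=5):
    """Samples required to average `samples_per_cell` in every grid cell.

    The grid has bins_per_dim ** n_dims cells, so the requirement
    grows exponentially with the number of dimensions.
    """
    return samples_per_cell * bins_per_dim ** n_dims


if __name__ == "__main__":
    for d in range(1, 6):
        print(f"{d} dimension(s): {samples_needed(d):,} samples")
```

Under these assumptions, each added dimension multiplies the data requirement by the number of bins, while adding samples only grows your data set linearly.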
Let’s say you don’t know anything about how weather works, and you want to predict whether tomorrow will be warm or cool. You have the full readout of a weather station today. Of course, if you did know how weather worked, you could already take advantage of that weather station to make a pretty good prediction. But you’re 100% naïve, and so you want to use the power of data and machine learning to make your prediction. So you pull up historical data for your weather station on this date for the past five years, and you look at the temperature on the following date for each of those years. On the strength of these data, you could now build a simple classifier to predict whether tomorrow will be warm (i.e. above some threshold), or a regressor to predict the temperature itself.
Continuing with this exercise, you dig up some more historical data, going back 25 years. You’ve quintupled your data; your warm/cool predictor is going to be a better one as a result. But what happens if you then try to quintuple the dimensionality of your prediction? Instead of asking just whether it will be warm tomorrow, you want to know whether it will be warm, whether it will be windy, whether it will be humid, whether it will be cloudy, and whether the wind will be from the east or the west. Five yes-or-no questions give you 2 × 2 × 2 × 2 × 2 = 32 possible weather conditions. Yet you have only 25 readings from your weather station. That’s fewer than one reading per possible weather condition! If each of the five conditions is completely independent from the others, this won’t be a problem. You can consider each separately. In reality, of course, they are not independent, and at least seven of the 32 combinations are guaranteed never to have shown up in your 25 years’ worth of training data. Your model may be completely unprepared for a hurricane or a sirocco or a blocking high.
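The arithmetic in this example can be checked with a short Python sketch. The condition names and the simulation are illustrative assumptions: in particular, the simulation draws each year's weather uniformly at random from all combinations, which real weather certainly does not do. It is only meant to show how much of the space 25 readings can leave uncovered.

```python
import itertools
import random

# Five yes/no questions about tomorrow (names are illustrative).
CONDITIONS = ["warm", "windy", "humid", "cloudy", "east_wind"]
N_READINGS = 25  # one historical reading per year

# Every possible joint outcome of the five conditions: 2 ** 5 == 32.
all_combos = list(itertools.product([False, True], repeat=len(CONDITIONS)))
n_combos = len(all_combos)

# 25 readings can cover at most 25 distinct combinations,
# so at least this many combinations are never observed.
min_unseen = n_combos - N_READINGS


def simulate_unseen(n_readings, rng):
    """Draw readings uniformly at random; count combinations never seen."""
    seen = {rng.choice(all_combos) for _ in range(n_readings)}
    return n_combos - len(seen)


rng = random.Random(0)  # fixed seed, purely for reproducibility
trials = [simulate_unseen(N_READINGS, rng) for _ in range(1000)]
avg_unseen = sum(trials) / len(trials)

print(f"possible combinations: {n_combos}")
print(f"readings per combination: {N_READINGS / n_combos:.2f}")
print(f"combinations guaranteed unseen: at least {min_unseen}")
print(f"average unseen in simulation: {avg_unseen:.1f}")
```

Seven unseen combinations is only the best case. Under the uniform assumption, the expected number of unseen combinations is 32 × (31/32)^25, or about 14.5, because some combinations turn up more than once while others never appear at all.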
This “curse of dimensionality” illustrates one of the pitfalls of a thoughtless dependence on Big Data. Theoretically, it may well be possible to predict anything, given sufficient data. However, in practice, we must pay careful attention to the nature of the questions being asked relative to the collection of data supporting them. Failure to do so can lead to disappointment or unanticipated disaster.
In cases of product recommendations and personalized email marketing (cases central to the work we do at Infinite Analytics), this might lead to a lot of time wasted trying to make complex predictions from sparse or poor data. Infinite Analytics overcomes these obstacles by collecting all of your rich customer history, using the latest technology to enrich your product catalog, and asking the right questions to transform your Big Data experience from something akin to throwing darts at a distant map to something more like a trip through a wondrous locale with a helpful guide.
To learn how Infinite Analytics can guide your data to their most productive uses, email firstname.lastname@example.org