We’ve written a lot on this blog about how we at Infinite Analytics build our recommendations and other data analysis services, from the interesting ways we use data from the social graph to some fun insights generated by our process. But as difficult as all this quantitative work is in a general sense, the problems we tackle are made that much richer by the scale of the datasets we use.
Our work routinely involves pools of data measured in hundreds of gigabytes or terabytes, which require nontrivial computing resources to work with effectively. For a recent project in collaboration with a large media company, for instance, we wrangled data on the scale of hundreds of gigabytes. It’s also worth noting that many other large-scale data projects owe their size to data that are intrinsically large, such as images or video, whereas our datasets tend to consist of very small individual records: often nothing more than a “who-what-when” triple. A terabyte of images might only be one family’s photo library; a terabyte of such small records, at a few dozen bytes each, can describe tens of billions of actions. So in addition to dealing with nominally large quantities of data, we solve problems involving unpronounceably large numbers of subsidiary calculations.
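To make that comparison concrete, here is a rough back-of-the-envelope sketch. The sizes are illustrative assumptions rather than measurements from our datasets: a hypothetical fixed-width triple of three 8-byte fields, and a hypothetical 4 MB average photo.

```python
# Back-of-the-envelope estimate: how many records fit in one terabyte?
# All sizes below are illustrative assumptions, not measured values.

TERABYTE = 10**12  # bytes (decimal terabyte)

# Hypothetical fixed-width "who-what-when" record:
# 8-byte user ID + 8-byte item ID + 8-byte timestamp.
TRIPLE_BYTES = 8 + 8 + 8

# Hypothetical average size of one photo: 4 MB.
PHOTO_BYTES = 4 * 10**6


def records_per_terabyte(record_bytes: int) -> int:
    """Return how many whole records of the given size fit in one terabyte."""
    return TERABYTE // record_bytes


triples = records_per_terabyte(TRIPLE_BYTES)
photos = records_per_terabyte(PHOTO_BYTES)

print(f"photos per TB:  {photos:,}")
print(f"triples per TB: {triples:,}")
print(f"ratio:          {triples // photos:,}x")
```

Under these assumptions, a terabyte holds about 250 thousand photos but roughly 42 billion triples, a difference of more than five orders of magnitude. Real records vary in size, and compression changes the picture, so treat the exact figures as a sketch only.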
Fortunately, we have a suite of internal tools that allows us to apply our ideas to data at this scale with only minimal additional effort. We start with some excellent open source tools (the Hadoop Distributed File System for reliably storing such large volumes of data, and the amazing work from the Apache Spark community) and add our own set of tools and workflows on top of them.
It’s not just about what we do: it’s also the scale at which we’re able to do it. We’d love to apply our technologies to your problems. Get in touch at email@example.com for more information.