CMS Seminar
"Big Data" traditionally refers to some combination of high volume of data, high velocity of change, and/or wide variety and complexity of the underlying data. Solving such problems has evolved into using paradigms like MapReduce on large clusters of compute nodes. More recently, a growing number of "Deep Data" problems have arisen where it is the relationships between objects, and not necessarily the collections of objects, that are important, and for which the traditional implementation techniques are unsatisfactory. This talk addresses a study of a class of such "challenge problems" first formulated by David Bayliss of LexisNexis, and what are their execution characteristics on both current and future architectures. The goal is to discover, to at least a first order approximation, what are the tall poles preventing a speedup of their solution. A variety or architectures are considered, ranging from standard server blades in large scale configurations, to emerging variations that leverage simpler and more energy efficient chip sets, through systems built on 3D chip stacks, and on to new architectures that were designed from the ground up to "follow the links." Such architectures are considered for two variants of such problems: a traditional partitioned data approach where data is "pre-boiled" to provide fast response, and one that uses very large graphs in very large shared memories. The results are not necessarily intuitive; the bottlenecks in such problems are not where current systems have the bulk of their capabilities or costs, nor where obvious near term upgrades will have major effects. Instead, it appears that only highly scalable memory-intensive architectures offer the potential for truly major gains in application performance.