First International Workshop on Hot Topics in Cloud Data Processing (HotCDP'12)

HotCDP'12: Hot Topics in Cloud Data Processing (Bern, Switzerland)

Program

Session 1 (09:00 - 10:30)

Keynote: Things that go bump in the night (and how to sleep through them)
Speaker: Michael Christian, Infrastructure Resilience, Yahoo!

There is a widespread belief in our industry that our data-centers are supremely reliable, capable of providing us true five-9s service, and that by hosting our platforms in these expensive, massively redundant locations, we will be safe from all ills. As a result, business continuity planning is often approached from a classic DR "Airplane Into Building" perspective, where insufficient thought, energy, planning, and practice are put into a solution never expected to be used.
The truth is that data-centers DO fail, sometimes for the oddest of reasons. This is not a knock against the infrastructure designers of our industry, who have created some of the most efficient, redundant, and innovative buildings in history; it is merely a statement of fact. As redundancy is added, complexity increases, adding more links to a system at the mercy of the weakest. Something as simple as a border router failure can effectively knock an entire building of compute offline, regardless of how many generators it has.
By starting under the assumption that a particular data-center WILL fail at some point, it becomes significantly easier to build platforms that can be quickly and transparently shifted from location to location. Where designing for the unthinkable leads to poorly thought-out solutions, designing for the everyday leads to usable and indeed useful tooling. Highly available internet platforms are not nearly as technically difficult as they are culturally difficult.
I'll interleave a history of massive outages during the last decade with proven recovery solutions and strategies, with the hope of swaying our collective industry from a disaster insurance approach to a truly always-on philosophy. This talk is loosely based on Chapter 17 of O'Reilly's Web Operations, by the same author/speaker.

Ant Rowstron (Microsoft Research, Cambridge), Dushyanth Narayanan (Microsoft Research, Cambridge), Austin Donnelly (Microsoft Research, Cambridge), Greg O'Shea (Microsoft Research, Cambridge) and Andrew Douglas (Microsoft Research, Cambridge). Nobody ever got fired for using Hadoop

Coffee break (10:30 - 11:00)

Session 2 (11:00 - 12:30)

Keynote: Programming and Debugging Large-Scale Data Processing Workflows
Speaker: Christopher Olston, Google

This talk gives an overview of my former team's work on large-scale data processing at Yahoo! Research. The talk begins by introducing two data processing systems we helped develop: PIG, a dataflow programming environment and Hadoop-based runtime, and NOVA, a workflow manager for Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at what can be done before, during and after execution of a data processing operation:
* Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig job to get a feel for what it will do, enabling certain kinds of mistakes to be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to balance several key factors (realism, conciseness and completeness) of the example data it produces.
* INSPECTOR GADGET is a framework for creating custom tools that monitor Pig job execution. We implemented a dozen user-requested tools, ranging from data integrity checks to crash cause investigation to performance profiling, each in just a few hundred lines of code.
* IBIS is a system that collects metadata about what happened during data processing, for post-hoc analysis. The metadata is collected from multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and processing elements at different granularities (e.g. tables vs. records; relational operators vs. reduce task attempts) and offer disparate ways of querying it. IBIS integrates this metadata and presents a uniform and powerful query interface to users.

Bio: Christopher Olston is a staff research scientist at Google, working on structured data. He previously worked at Yahoo! (principal research scientist) and Carnegie Mellon (assistant professor). He holds computer science degrees from Stanford (2003 Ph.D., M.S.; funded by NSF and Stanford fellowships) and UC Berkeley (B.S. with highest honors).
Olston just started at Google in November 2011, so he hasn't done anything there yet. At Yahoo!, Olston co-created Apache Pig, which is used for large-scale data processing by LinkedIn, Netflix, Salesforce, Twitter, Yahoo! and others, and is offered by Amazon as a cloud service. Olston gave the 2011 Symposium on Cloud Computing keynote, and won the 2009 SIGMOD best paper award. During his flirtation with academia, Olston taught undergrad and grad courses at Berkeley, Carnegie Mellon and Stanford, and signed several Ph.D. dissertations.

Malte Schwarzkopf (University of Cambridge) and Steven Hand (University of Cambridge). Bringing the cloud down to earth

Lunch break (12:30 - 14:00)

Session 3 (14:00 - 15:30)

Nathan Backman (Brown University), Ugur Cetintemel (Brown University) and Rodrigo Fonseca (Brown University). Managing Parallelism for Stream Processing in the Cloud

Masoud Saeida Ardekani (UPMC - LIP6), Marek Zawirski (INRIA & UPMC - LIP6), Pierre Sutra (INRIA & UPMC - LIP6) and Marc Shapiro (INRIA & UPMC - LIP6). The Space Complexity of Transactional Interactive Reads

Panel: Topic TBA