Sharing Knowledge without Sharing Data:
Software Platforms for Deploying Secure Multi-Party Computation at Scale
A common misconception that dominates our society is the belief that data sharing is a prerequisite to knowledge sharing. This is often translated into a false choice between data utility and data privacy, resulting in handicapping our ability to fully leverage data or else sacrificing privacy and confidentiality. In this talk, I will argue for an alternative tack that allows knowledge extraction from one or more data sets, which remain otherwise private, using a cryptographic technique called secure multi-party computation (MPC). I will begin by introducing MPC and by sharing my experience in deploying this technology for a study of the gender pay gap in Boston based on actual, confidential payroll data from hundreds of employers that cover tens of thousands of employees. Next, I will provide an overview of our work on integrating optimized MPC implementations of common programming constructs into open-source software libraries and big-data workflows in support of peer-to-peer and outsourced computation deployments in various settings and under different assumptions. Time permitting, I will show performance results from a number of at-scale deployments, including the computation of the Herfindahl-Hirschman Index (HHI) of market concentration used in antitrust regulation on over 1.3 billion transactions from three mutually distrusting companies, the analysis of cybersecurity risks associated with threat propagation between multiple private networks, and privacy-preserving navigation and ride sharing services.
Mosaics in Big Data
Stratosphere, Apache Flink, and Beyond
The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define “big data”, i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on varying fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies, and augments it with concepts drawn from compilers (e.g., iterations) and data stream processing, as well as (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result from these activities was Apache Flink, an open-source big data analytics platform and its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for database research community still remains. That is, how to maintain the ease-of-use despite the increasing heterogeneity and complexity of data analytics, involving specialized engines for various aspects of an end-to-end data analytics pipeline, including, among others, graph-based, linear algebra-based, and relational-based algorithms, and the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts.
Harnessing Multicores for Big Data Processing
Online analytics is on the rise in the big data world. As DRAM prices steadily drop, it becomes possible to hold growing amounts of data in main memory for advanced analysis. Yet processing large data feeds and answering queries about them in real-time remains a challenge. The current trend in server architecture, which favors parallelism over sequential speed, motivates parallelizing real-time analytics applications on multicore servers. Since data analysis is typically not ``embarrassingly parallel’’, naïve shared-nothing parallelization is often inefficient. On the other hand, in today’s NUMA architectures with per-core caches, tight coupling among parallel threads creates high synchronization overhead and slows down memory access. To be effective, solutions need to find a sweet spot where sufficient information is shared to allow the application to operate effectively and yet without excessive synchronization cost. I will give two examples of big data analytics algorithms that achieve such a sweet spot: data sketches and top-k document retrieval. Time permitting, I will also discuss the design of concurrent data structures for such applications.
Data sketches are approximate succinct summaries of long streams. They are widely used for processing massive amounts of data and answering statistical queries about it in real-time. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. I will present a generic approach to parallelizing data sketches efficiently, while bounding the error that such parallelism introduces. Utilizing relaxed semantics and the notion of strong linearizability we have proven our algorithm's correctness and analyzed the error it induces in two specific sketches. We have shown that our implementation achieves high scalability on multicore hardware while keeping the error small.
Big data processing applications often rely on a top-k retrieval building block, which selects (or approximates) the k highest scoring data items based on an aggregation of features. For example, in a web search query with n search terms, a document's score is the sum of its scores for the n terms. Top-k retrieval is often used to sift through massive amounts of data and identify a smaller subset of it for further (more detailed) analysis. Because it filters out the bulk of the data, despite its relative simplicity, the top-k retrieval stage constitutes the main performance bottleneck in many systems. I will present Sparta -- a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. I will describe a web search case study using the 50M-document ClueWeb09B dataset, where Sparta processes 12-term queries twice as fast as the state-of-the-art. On a tenfold bigger index, Sparta’s performance advantage increases.
Based in part on joint works with Dmitry Basin, Edward Bortnikov, David Carmel, Eshcar Hillel, Arik Rinberg, Hadar Serviansky, Gali Sheffi, and Alexander Spiegelman.