Data Cleaning in the Big Data era @ Barrault (Amphi Rubis)

Speaker: Paolo Papotti
Arizona State University
Date: 18/04/2016
Time: 3:00 pm - 4:00 pm

Abstract: In the “big data” era, data is often dirty for several reasons, such as typos, missing values, and duplicates. The intrinsic problem with dirty data is that it can lead to poor results in analytic tasks. Data cleaning is therefore an unavoidable step in data preparation if final applications, such as querying and mining, are to work on reliable data. Unfortunately, data cleaning is hard in practice and requires a great amount of manual work. Several systems have been proposed to increase automation and scalability in the process. They rely on a formal, declarative approach based on first-order logic: users provide high-level specifications of their tasks, and the systems compute optimal solutions without human intervention on the generated code. However, traditional “top-down” cleaning approaches quickly become impractical when dealing with the complexity and variety found in big data. In this talk, we first describe recent results in tackling data cleaning with a declarative approach. We then discuss how this experience has pushed several groups to propose new systems that recognize the central role of the users in cleaning big data.
Biography: Paolo Papotti is an Assistant Professor of Computer Science in the School of Computing, Informatics, and Decision Systems Engineering (CIDSE) at Arizona State University. He received his Ph.D. in Computer Science from Università degli Studi Roma Tre (Italy, 2007), and before joining ASU he was a senior scientist at Qatar Computing Research Institute. His research focuses on systems that assist users in complex, necessary tasks and that scale to large datasets with efficient algorithms and distributed platforms. His work has been recognized with two “Best of the Conference” citations (SIGMOD 2009, VLDB 2015) and with a best demo award at SIGMOD 2015. He is group leader for SIGMOD 2016 and associate editor for the ACM Journal of Data and Information Quality (JDIQ).