Doan Keynote

2018 Symposium Keynote Speaker

Wednesday, October 3 (morning plenary)

in BSOE Simularium

AnHai Doan is a Vilas Distinguished Achievement Professor of Computer Science at the University of Wisconsin-Madison. His interests cover databases, AI, and the Web, with a current focus on data integration, data science, big data, and machine learning.
 
AnHai received the ACM Doctoral Dissertation Award in 2003, a CAREER Award in 2004, and a Sloan Fellowship in 2007. He co-authored “Principles of Data Integration”, a textbook published by Morgan Kaufmann in 2012. AnHai was on the Advisory Board of Transformic, a Deep Web startup acquired by Google in 2005, and was Chief Scientist of Kosmix, a social media startup acquired by Walmart in 2011. From 2011 to 2014 he was Chief Scientist of WalmartLabs, a newly formed R&D lab at Walmart devoted to analyzing and integrating data for e-commerce.
 
Since 2015 he has devoted most of his effort to developing a research, teaching, and service agenda for data science at UW-Madison.
 

(See AnHai’s webpage)

 

Developing Open Source Software in Academia for Data Integration: Experience and Lessons Learned

Abstract:
Data integration (DI), also known as data preparation, wrangling, munging, and curation, is a fundamental challenge in data science. In this talk, I argue that the DI community must devote far more effort to building systems in order to truly advance the field. I describe a system-building agenda that we have been working on over the past three years at Wisconsin, focusing on entity matching (EM), a major challenge in DI. I describe how we develop cutting-edge EM solutions and implement them as software packages in the Python ecosystem of data science tools, and as micro- and macro cloud services that data science teams can easily deploy. A key theme underlying many of our solutions is the use of machine learning and user interaction techniques, with a focus on scaling up these techniques to work over very large datasets as well as over both structured and text data. I discuss the deployment of our EM systems at a Fortune 500 company and in many industrial and academic projects. Finally, I discuss our experience and lessons learned in developing open source software in academia, applying such software to real-world problems, and commercializing it.