I am with the Data Management, Exploration and Mining (DMX) group at Microsoft Research. I received my Ph.D. from the Database Group at the University of Wisconsin-Madison, under the supervision of Prof. Jeffrey Naughton. I have broad interests in database systems, data mining, and machine learning. I am currently working on query optimization, query processing, database system performance tuning, big data systems, distributed systems, data stream processing, and machine learning theory and systems. In the past, I have worked on various topics, including graph data management, personal data management, knowledge base construction, social network analysis, data privacy, entity matching in data integration, and database as a service in the cloud.
Selected Projects
- Autonomous Index Tuning for Database/Big Data Systems: [VLDB'22], [SIGMOD'22], [SIGMOD'22], [VLDB'21], [VLDB'20], [SIGMOD'19], [VLDB'18].
- Query Optimization for Data Stream Processing Systems: [ICDE'22], [VLDB'21], [CIDR'19].
- Cost Modeling and Query Optimization for Database Systems: [SIGMOD'16], [VLDB'14], [VLDB'13], [ICDE'13].
- Ease.ML: A Lifecycle Management System for MLDev and MLOps: [CIDR'21], [VLDB'21], [VLDB'20], [KDD'20], [VLDB'19], [SysML'19], [VLDB'18], [VLDB'18].
- Efficient and Scalable Machine Learning Systems: [SIGMOD'22], [KDD'21], [VLDB'21], [SIGMOD'21], [CIDR'21], [ICDE'20], [ICDE'19], [VLDB'18], [VLDB'17].
- Probase: A Probabilistic Taxonomy for Text Understanding: [SIGMOD'12], [TKDE'17], [ICDE'17].
Professional Services
I am a program committee member of the following conferences: SIGMOD (2020, 2018, 2017), VLDB (2023, 2020), ICDE (2023), SIGIR (2023, 2022, 2021, 2020), KDD (2023, 2022, 2021), WSDM (2023, 2021), WWW (2016), CIKM (2022, 2021, 2020, 2018, 2017).
2023
- Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise.
Cedric Renggli, Luka Rimanic, Luka Kolar, Wentao Wu, and Ce Zhang.
In Proceedings of the IEEE 39th International Conference on Data Engineering (ICDE 2023), 2023. [PDF][arXiv]
- Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines.
Bojan Karlas, David Dao, Matteo Interlandi, Bo Li, Sebastian Schelter, Wentao Wu, and Ce Zhang.
In arXiv Preprint, 2023. [arXiv]
2022
- Budget-aware Index Tuning with Reinforcement Learning.
Wentao Wu, Chi Wang, Tarique Siddiqui, Junxiong Wang, Vivek Narasayya, Surajit Chaudhuri, and Philip A. Bernstein.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2022): 1528-1541, 2022. [PDF][FULL]
- ISUM: Efficiently Compressing Large and Complex Workloads for Scalable Index Tuning.
Tarique Siddiqui, Saehan Jo, Wentao Wu, Chi Wang, Vivek Narasayya, and Surajit Chaudhuri.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2022): 660-673, 2022. [PDF][FULL]
- In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle.
Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, and Ce Zhang.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2022): 1286-1300, 2022. [PDF]
- DISTILL: Low-Overhead Data-Driven Techniques for Filtering and Costing Indexes for Scalable Index Tuning.
Tarique Siddiqui, Wentao Wu, Vivek Narasayya, and Surajit Chaudhuri.
In Proceedings of the VLDB Endowment, Vol. 15, No. 10 (VLDB 2022): 2019-2031, 2022. [PDF]
- Factor Windows: Cost-based Query Rewriting for Optimizing Correlated Window Aggregates.
Wentao Wu, Philip A. Bernstein, Alex Raizman, and Christina Pavlopoulou.
In Proceedings of the IEEE 38th International Conference on Data Engineering (ICDE 2022): 2723-2735, 2022. [PDF][FULL][arXiv]
- Data Science Through the Looking Glass: Analysis of Millions of GitHub Notebooks and ML.NET Pipelines.
Fotis Psallidas, Yiwen Zhu, Bojan Karlas, Jordan Henkel, Matteo Interlandi, Subru Krishnan, Brian Kroth, Venkatesh Emani, Wentao Wu, Ce Zhang, Markus Weimer, Avrilia Floratou, Carlo Curino, and Konstantinos Karanasos.
In SIGMOD Record, Vol. 51, No. 2: 30-37, 2022. [PDF][arXiv]
2021
- OpenBox: A Generalized Black-box Optimization Service.
Yang Li, Yu Shen, Wentao Zhang, Yuanwei Chen, Huaijun Jiang, Mingchao Liu, Jiawei Jiang, Jinyang Gao, Wentao Wu, Zhi Yang, Ce Zhang, and Bin Cui.
In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2021): 3209-3219, 2021. [PDF] [arXiv]
- Hyperspace: The Indexing Subsystem of Azure Synapse.
Rahul Potharaju, Terry Kim, Eunjin Song, Wentao Wu, Lev Novik, Apoorve Dave, Andrew Fogarty, Pouria Pirzadeh, Vidip Acharya, Gurleen Dhody, Jiying Li, Sinduja Ramanujam, Nicolas Bruno, Cesar Galindo-Legaria, Vivek Narasayya, Surajit Chaudhuri, Anil K. Nori, Tomas Talius, and Raghu Ramakrishnan.
In Proceedings of the VLDB Endowment, Vol. 14, No. 12 (VLDB 2021): 3043-3055, 2021. [PDF]
- VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition.
Yang Li, Yu Shen, Wentao Zhang, Jiawei Jiang, Yaliang Li, Bolin Ding, Jingren Zhou, Zhi Yang, Wentao Wu, Ce Zhang, and Bin Cui.
In Proceedings of the VLDB Endowment, Vol. 14, No. 11 (VLDB 2021): 2167-2176, 2021. [PDF] [arXiv]
- Optimization of Threshold Functions over Streams.
Walter Cai, Philip A. Bernstein, Wentao Wu, and Badrish Chandramouli.
In Proceedings of the VLDB Endowment, Vol. 14, No. 6 (VLDB 2021): 878-889, 2021. [PDF]
- Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions.
Bojan Karlas, Peng Li, Renzhi Wu, Nezihe Merve Gurel, Xu Chu, Wentao Wu, and Ce Zhang.
In Proceedings of the VLDB Endowment, Vol. 14, No. 3 (VLDB 2021): 255-267, 2021. [PDF] [arXiv]
- The Case for ML-Enhanced High-Dimensional Indexes.
Rong Kang, Wentao Wu, Chen Wang, Ce Zhang, and Jianmin Wang.
In Proceedings of the 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB@VLDB 2021), 2021. [PDF]
- Towards Demystifying Serverless Machine Learning Training.
Jiawei Jiang, Shaoduo Gan, Yue Liu, Fanlin Wang, Gustavo Alonso, Ana Klimovic, Ankit Singla, Wentao Wu, and Ce Zhang.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2021): 857-871, 2021. [PDF] [arXiv]
- Towards Understanding End-to-End Learning in the Context of Data: Machine Learning Dancing over Semirings & Codd's Table.
Wentao Wu and Ce Zhang.
In Proceedings of the Fifth Workshop on Data Management for End-To-End Machine Learning (DEEM@SIGMOD 2021): 1-4, 2021. [PDF]
- Magpie: Python at Speed and Scale using Cloud Backends.
Alekh Jindal, Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, Wentao Wu, and Hiren Patel.
In Conference on Innovative Data Systems Research (CIDR 2021), 2021. [PDF]
- Ease.ML: A Lifecycle Management System for MLDev and MLOps.
Leonel Aguilar, David Dao, Shaoduo Gan, Nezihe Merve Gurel, Nora Hollenstein, Jiawei Jiang, Bojan Karlas, Thomas Lemmin, Tian Li, Yang Li, Susie Rao, Johannes Rausch, Cedric Renggli, Luka Rimanic, Maurice Weber, Shuai Zhang, Zhikuan Zhao, Kevin Schawinski, Wentao Wu, and Ce Zhang.
In Conference on Innovative Data Systems Research (CIDR 2021), 2021. [PDF]
- A Data Quality-Driven View of MLOps.
Cedric Renggli, Luka Rimanic, Nezihe Merve Gurel, Bojan Karlas, Wentao Wu, and Ce Zhang.
In IEEE Data Engineering Bulletin, Vol. 44, No. 1: 11-23, 2021. [PDF] [arXiv]
- Model Averaging in Distributed Machine Learning: A Case Study with Apache Spark.
Yunyan Guo, Zhipeng Zhang, Jiawei Jiang, Wentao Wu, Ce Zhang, Bin Cui, and Jianzhong Li.
In VLDB Journal, Vol. 30, No. 4: 693-712, 2021. [PDF]
2020
- Helios: Hyperscale Indexing for the Cloud & Edge.
Rahul Potharaju, Terry Kim, Wentao Wu, Vidip Acharya, Steve Suh, Andrew Fogarty, Apoorve Dave, Sinduja Ramanujam, Tomas Talius, Lev Novik, and Raghu Ramakrishnan.
In Proceedings of the VLDB Endowment, Vol. 13, No. 12 (VLDB 2020): 3231-3244, 2020. [PDF]
- Ease.ml/snoopy in Action: Towards Automatic Feasibility Analysis for Machine Learning Application Development.
Cedric Renggli, Luka Rimanic, Luka Kolar, Wentao Wu, and Ce Zhang.
In Proceedings of the VLDB Endowment, Vol. 13, No. 12 (VLDB 2020): 2837-2840, 2020. [PDF]
- Building Continuous Integration Services for Machine Learning.
Bojan Karlas, Matteo Interlandi, Cedric Renggli, Wentao Wu, Ce Zhang, Deepak Mukunthu Iyappan Babu, Jordan Edwards, Chris Lauren, Andy Xu, and Markus Weimer.
In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2020): 2407-2415, 2020 (oral presentation, 44/756). [PDF]
- ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent.
Zhipeng Zhang, Wentao Wu, Jiawei Jiang, Lele Yu, Bin Cui, and Ce Zhang.
In Proceedings of the IEEE 36th International Conference on Data Engineering (ICDE 2020): 1513-1524, 2020. [PDF]
- A Note On Operator-Level Query Execution Cost Modeling.
Wentao Wu.
In arXiv Preprint, 2020. [arXiv]
2019
- AI Meets AI: Leveraging Query Executions to Improve Index Recommendations.
Bailu Ding, Sudipto Das, Ryan Marcus, Wentao Wu, Surajit Chaudhuri, and Vivek Narasayya.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2019): 1241-1258, 2019. [PDF]
- Serverless Event-Stream Processing over Virtual Actors.
Philip A. Bernstein, Todd Porter, Rahul Potharaju, Alejandro Z. Tomsic, Shivaram Venkataraman, and Wentao Wu.
In Conference on Innovative Data Systems Research (CIDR 2019), 2019. [PDF]
- Ease.ml/ci and Ease.ml/meter in Action: Towards Data Management for Statistical Generalization.
Cedric Renggli, Frances Ann Hubis, Bojan Karlas, Kevin Schawinski, Wentao Wu, and Ce Zhang.
In Proceedings of the VLDB Endowment, Vol. 12, No. 12 (VLDB 2019): 1962-1965, 2019. [PDF]
- Continuous Integration of Machine Learning Models with ease.ml/ci: Towards a Rigorous Yet Practical Treatment.
Cedric Renggli, Bojan Karlas, Bolin Ding, Feng Liu, Kevin Schawinski, Wentao Wu, and Ce Zhang.
In Proceedings of the 2nd SysML Conference (SysML 2019), 2019. [PDF] [arXiv]
- MLlib*: Fast Training of GLMs using Spark MLlib.
Zhipeng Zhang, Jiawei Jiang, Wentao Wu, Ce Zhang, Lele Yu, and Bin Cui.
In Proceedings of the IEEE 35th International Conference on Data Engineering (ICDE 2019): 1778-1789, 2019. [PDF]
- Quantitative Overfitting Management for Human-in-the-loop ML Application Development with ease.ml/meter.
Frances Ann Hubis, Wentao Wu, and Ce Zhang.
In arXiv Preprint, 2019. [arXiv]
2018
- Plan Stitch: Harnessing the Best of Many Plans.
Bailu Ding, Sudipto Das, Wentao Wu, Surajit Chaudhuri, and Vivek Narasayya.
In Proceedings of the VLDB Endowment, Vol. 11, No. 10 (VLDB 2018): 1123-1136, 2018. [PDF]
- Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads.
Tian Li, Jie Zhong, Ji Liu, Wentao Wu, and Ce Zhang.
In Proceedings of the VLDB Endowment, Vol. 11, No. 5 (VLDB 2018): 607-620, 2018. [PDF] [arXiv]
- MLBench: Benchmarking Machine Learning Services Against Human Experts.
Yu Liu, Hantian Zhang, Luyuan Zeng, Wentao Wu, and Ce Zhang.
In Proceedings of the VLDB Endowment, Vol. 11, No. 10 (VLDB 2018): 1220-1232, 2018. [PDF] [arXiv] [Datasets]
- Ease.ml in Action: Towards Multi-tenant Declarative Learning Services.
Bojan Karlas, Ji Liu, Wentao Wu, and Ce Zhang.
In Proceedings of the VLDB Endowment, Vol. 11, No. 12 (VLDB 2018): 2054-2057, 2018. [PDF]
2017
- Semantic Bootstrapping: A Theoretical Perspective.
Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu.
In Proceedings of the 33rd International Conference on Data Engineering (ICDE 2017): 7-8, 2017 (TKDE poster). [PDF]
- Semantic Bootstrapping: A Theoretical Perspective.
Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu.
In IEEE Transactions on Knowledge and Data Engineering, Vol. 29, No. 2: 446-457, 2017. [PDF]
- Towards Interactive Debugging of Rule-based Entity Matching.
Fatemah Panahi, Wentao Wu, AnHai Doan, and Jeffrey F. Naughton.
In Proceedings of the 20th International Conference on Extending Database Technology (EDBT 2017): 354-365, 2017. [PDF]
- MLog: Towards Declarative In-Database Machine Learning.
Xupeng Li, Bin Cui, Yiru Chen, Wentao Wu, and Ce Zhang.
In Proceedings of the VLDB Endowment, Vol. 10, No. 12 (VLDB 2017): 1933-1936, 2017. [PDF]
- How Good Are Machine Learning Clouds for Binary Classification with Good Features?
Hantian Zhang, Luyuan Zeng, Wentao Wu, and Ce Zhang.
In Proceedings of the 2017 Symposium on Cloud Computing (SoCC 2017): 649, 2017 (extended abstract). [PDF]
- An Overreaction to the Broken Machine Learning Abstraction: The ease.ml Vision.
Ce Zhang, Wentao Wu, and Tian Li.
In Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics (HILDA@SIGMOD 2017): 3:1-3:6, 2017. [PDF]
2011 - 2016
- Sampling-Based Query Re-Optimization.
Wentao Wu, Jeffrey F. Naughton, and Harneet Singh.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2016): 1721-1736, 2016. [PDF] [FULL] [arXiv] [Slides]
- On Debugging Non-Answers in Keyword Search Systems.
Akanksha Baid, Wentao Wu, Chong Sun, AnHai Doan, and Jeffrey F. Naughton.
In Proceedings of the 18th International Conference on Extending Database Technology (EDBT 2015): 37-48, 2015. [PDF]
- Revisiting Differentially Private Regression: Lessons From Learning Theory and their Consequences.
Xi Wu, Matthew Fredrikson, Wentao Wu, Somesh Jha, and Jeffrey F. Naughton.
In arXiv Preprint, 2015. [arXiv]
- Uncertainty Aware Query Execution Time Prediction.
Wentao Wu, Xi Wu, Hakan Hacigümüs, and Jeffrey F. Naughton.
In Proceedings of the VLDB Endowment, Vol. 7, No. 14 (VLDB 2014): 1857-1868, 2014. [PDF] [FULL] [arXiv] [Slides]
- Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads.
Wentao Wu, Yun Chi, Hakan Hacigümüs, and Jeffrey F. Naughton.
In Proceedings of the VLDB Endowment, Vol. 6, No. 10 (VLDB 2013): 925-936, 2013. [PDF] [FULL] [Slides]
- Predicting Query Execution Time: Are Optimizer Cost Models Really Unusable?
Wentao Wu, Yun Chi, Shenghuo Zhu, Junichi Tatemura, Hakan Hacigümüs, and Jeffrey F. Naughton.
In Proceedings of the 29th International Conference on Data Engineering (ICDE 2013): 1081-1092, 2013. [PDF] [FULL] [Slides]
- Probase: A Probabilistic Taxonomy for Text Understanding.
Wentao Wu, Hongsong Li, Haixun Wang and Kenny Q. Zhu.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2012): 481-492, 2012. [PDF] [FULL] [Slides]
- Context-aware Search for Personal Information Management Systems.
Jidong Chen, Wentao Wu, Hang Guo and Wei Wang.
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM 2012): 708-719, 2012. [PDF]
- iMecho: A Context-Aware Desktop Search System.
Jidong Chen, Hang Guo, Wentao Wu and Wei Wang.
In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011): 1269-1270, 2011. [PDF]
2010 and Prior
- K-Symmetry Model for Identity Anonymization in Social Networks.
Wentao Wu, Yanghua Xiao, Wei Wang, Zhenying He and Zhihui Wang.
In Proceedings of the 13th International Conference on Extending Database Technology (EDBT 2010): 111-122, 2010. [PDF] [FULL]
- iMecho: An Associative Memory Based Desktop Search System.
Jidong Chen, Hang Guo, Wentao Wu and Wei Wang.
In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009): 731-740, 2009. [PDF]
- Personalization As A Service: The Architecture and A Case Study.
Hang Guo, Jidong Chen, Wentao Wu and Wei Wang.
In Proceedings of the 1st International CIKM Workshop on Cloud Data Management (CloudDb 2009): 1-8, 2009. [PDF]
- Search Your memory! - An Associative Memory Based Desktop Search System.
Jidong Chen, Hang Guo, Wentao Wu and Chunxin Xie.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2009): 1099-1102, 2009. [PDF]
- Efficiently Indexing Shortest Paths by Exploiting Symmetry in Graphs.
Yanghua Xiao, Wentao Wu, Jian Pei, Wei Wang and Zhenying He.
In Proceedings of the 12th International Conference on Extending Database Technology (EDBT 2009): 493-504, 2009. [PDF]
- Efficient Algorithms for Node Disjoint Subgraph Homeomorphism Determination.
Yanghua Xiao, Wentao Wu, Wei Wang and Zhenying He.
In Proceedings of the 13th International Conference on Database Systems for Advanced Applications (DASFAA 2008): 452-460, 2008. [FULL] [arXiv]
- Structure-based Graph Distance Measures of High Degree of Precision.
Yanghua Xiao, Hua Dong, Wentao Wu, Momiao Xiong, Wei Wang and Baile Shi.
In Pattern Recognition, Vol. 41, No. 12: 3547-3561, 2008. [PDF]
- Symmetry-based Structure Entropy of Complex Networks.
Yanghua Xiao, Wentao Wu, Hui Wang, Momiao Xiong and Wei Wang.
In Physica A, Vol. 387, No. 11: 2611-2619, 2008. [PDF]
Unpublished and Miscellaneous
- A Brief Overview of Query Optimization.
Wentao Wu, 2018.
- Suppression Strikes Back: On the Interaction of Thresholding and Differential Privacy.
Xi Wu, Wentao Wu, Chen Zeng, and Jeffrey F. Naughton, 2015.
- Sampling-Based Cardinality Estimation Algorithms: A Survey and An Empirical Evaluation.
Wentao Wu, 2012.
- Probase: a Universal Knowledge Base for Semantic Search.
Zhongyuan Wang, Jiuming Huang, Hongsong Li, Bin Liu, Bin Shao, Haixun Wang, Jingjing Wang, Yue Wang, Wentao Wu, Jing Xiao, and Kenny Q. Zhu, 2010.
Theses