Bojan Karlaš / PhD student at ETH

I am a PhD student in the Systems Group at ETH Zurich, advised by Prof. Ce Zhang. My research revolves around data management systems for machine learning. I have done internships at Microsoft, Oracle, and Logitech.


Incomplete Databases
Studying how incomplete and inconsistent data affects machine learning, extending the database notion of certain answers over Codd tables to "certain predictions".
Read More
Testing of Machine Learning Models
Building continuous integration and testing methods that give rigorous statistical guarantees for machine learning models.
Read More



Ease.ML: A Lifecycle Management System for MLDev and MLOps
LA Melgar, D Dao, S Gan, NM Gürel, N Hollenstein, J Jiang, B Karlaš, T Lemmin, T Li, Y Li, S Rao, J Rausch, C Renggli, L Rimanic, M Weber, S Zhang, Z Zhao, K Schawinski, W Wu, C Zhang
[CIDR] Conference on Innovative Data Systems Research

We present Ease.ML, a lifecycle management system for machine learning (ML). Unlike many existing works, which focus on improving individual steps during the lifecycle of ML application development, Ease.ML focuses on managing and automating the entire lifecycle itself. We present user scenarios that have motivated the development of Ease.ML; the eight-step Ease.ML process that covers the lifecycle of ML application development; the foundation of Ease.ML in terms of a probabilistic database model and its connection to information theory; and our lessons learned, which we hope can inspire future research.

Paper Video BibTeX


RAB: Provable Robustness Against Backdoor Attacks
M Weber, X Xu, B Karlaš, C Zhang, B Li
[arXiv] arXiv preprint arXiv:2003.08904

Recent studies have shown that deep neural networks (DNNs) are vulnerable to various attacks, including evasion attacks and poisoning attacks. On the defense side, there has been intensive interest in provable robustness against evasion attacks. In this paper, we focus on improving model robustness against more diverse threat models. Specifically, we provide the first unified framework using smoothing functionals to certify model robustness against general adversarial attacks. In particular, we propose the first robust training process, RAB, to certify against backdoor attacks. We theoretically prove the robustness bound for machine learning models based on the RAB training process, analyze the tightness of the bound, and propose different smoothing noise distributions, such as Gaussian and uniform distributions. Moreover, we evaluate the certified robustness of a family of "smoothed" DNNs trained in a differentially private fashion. In addition, we theoretically show that for simpler models, such as K-nearest neighbor (KNN) models, it is possible to train robust smoothed models efficiently. For K = 1, we propose an exact algorithm to smooth the training process, eliminating the need to sample from a noise distribution. Empirically, we conduct comprehensive experiments on different machine learning models (DNNs, differentially private DNNs, and KNN models) on the MNIST, CIFAR-10, and ImageNet datasets to provide the first benchmark for certified robustness against backdoor attacks. We also evaluate KNN models on a spambase tabular dataset to demonstrate their advantages. Both the theoretic …
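For intuition, the general randomized-smoothing recipe behind such certificates replaces a base classifier with the majority vote of its predictions on noisy copies of the input. The sketch below shows plain test-time smoothing with Gaussian noise, a simplification of the paper's training-time RAB process; `smoothed_predict` and the toy `clf` are hypothetical names, not the paper's implementation:

```python
import random

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000, seed=None):
    """Majority vote of base_classifier over Gaussian perturbations of x."""
    rng = random.Random(seed)
    votes = {}
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = base_classifier(noisy)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy stand-in base classifier (hypothetical): sign of the feature sum.
clf = lambda x: int(sum(x) > 0)
```

The larger the vote margin between the top two labels, the larger the perturbation that provably cannot flip the smoothed prediction.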

Paper BibTeX
Building continuous integration services for machine learning
B Karlaš, M Interlandi, C Renggli, W Wu, C Zhang, DMI Babu, J Edwards, C Lauren, A Xu, M Weimer
[SIGKDD] Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Continuous integration (CI) has been a de facto standard for building industrial-strength software. Yet, there is little attention towards applying CI to the development of machine learning (ML) applications until the very recent effort on the theoretical side. In this paper, we take a step forward to bring the theory into practice.

Paper Promo Video Talk Video BibTeX
Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions
B Karlaš, P Li, R Wu, NM Gürel, X Chu, W Wu, C Zhang
[VLDB] Proceedings of the VLDB Endowment

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistent and incomplete information is ubiquitous in real-world datasets, and its impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP): a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed, and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed: we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds.
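To illustrate the checking query (Q1) for a 1-NN classifier, here is a hypothetical brute-force sketch: it enumerates every possible world induced by the missing cells, assuming each missing cell ranges over a small finite candidate domain, and reports a label only if all worlds agree. This is exponential in the number of missing cells, which is exactly what the paper's efficient algorithms avoid:

```python
from itertools import product

def certain_prediction_1nn(train, test_point, candidate_values):
    """Q1 by brute force: train is a list of ([features, possibly None], label);
    each None cell is completed with every value in candidate_values.
    Returns the certain 1-NN label if every possible world agrees, else None."""
    missing = [(i, j) for i, (feats, _) in enumerate(train)
               for j, v in enumerate(feats) if v is None]
    predictions = set()
    for fill in product(candidate_values, repeat=len(missing)):
        # Materialize one possible world.
        world = [(list(f), y) for f, y in train]
        for (i, j), v in zip(missing, fill):
            world[i][0][j] = v
        # 1-NN prediction in this world (squared Euclidean distance).
        _, label = min(world, key=lambda r: sum((a - b) ** 2
                                                for a, b in zip(r[0], test_point)))
        predictions.add(label)
    return predictions.pop() if len(predictions) == 1 else None

# 'a' is nearest in every completion, so the example is CP'ed:
print(certain_prediction_1nn([([0.0], "a"), ([None], "b")], [0.1], [3.0, 5.0]))
```

With candidate domain [0.05, 5.0] instead, the two worlds disagree ("b" vs. "a") and the function returns None, i.e., the example cannot be certainly predicted.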

Paper BibTeX
End-to-end Robustness for Sensing-Reasoning Machine Learning Pipelines
Z Yang, Z Zhao, H Pei, B Wang, B Karlaš, J Liu, H Guo, B Li, C Zhang
[arXiv] arXiv preprint arXiv:2003.00120

As machine learning (ML) is being applied to many mission-critical scenarios, certifying ML model robustness becomes increasingly important. Many previous works focus on the robustness of independent ML and ensemble models, and can only certify a very small magnitude of adversarial perturbation. In this paper, we take a different viewpoint and improve learning robustness by going beyond independent ML and ensemble models. We aim at promoting the generic Sensing-Reasoning machine learning pipeline, which contains both sensing components (e.g., deep neural networks) and reasoning components (e.g., Markov logic networks (MLN)) enriched with domain knowledge. Can domain knowledge help improve learning robustness? Can we formally certify the end-to-end robustness of such an ML pipeline? We first theoretically analyze the computational complexity of checking provable robustness in the reasoning component. We then derive provable robustness bounds for several concrete reasoning components. We show that for reasoning components such as MLN and a specific family of Bayesian networks, it is possible to certify the robustness of the whole pipeline even with a large magnitude of perturbation that cannot be certified by existing work. Finally, we conduct extensive real-world experiments on large-scale datasets to evaluate the certified robustness of Sensing-Reasoning ML pipelines.

Paper BibTeX
Online Active Model Selection for Pre-trained Classifiers
MR Karimi, NM Gürel, B Karlaš, J Rausch, C Zhang, A Krause
[arXiv] arXiv preprint arXiv:2010.09818

Given pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams. We establish several theoretical guarantees for our algorithm and extensively demonstrate its effectiveness in our experimental studies.

Paper BibTeX


Is advance knowledge of flow sizes a plausible assumption?
V Ðukić, SA Jyothi, B Karlaš, M Owaida, C Zhang, A Singla
[NSDI] 16th USENIX Symposium on Networked Systems Design and Implementation

Recent research has proposed several packet, flow, and coflow scheduling methods that could substantially improve data center network performance. Most of this work assumes advance knowledge of flow sizes. However, the lack of a clear path to obtaining such knowledge has also prompted some work on non-clairvoyant scheduling, albeit with more limited performance benefits.

Paper Video BibTeX
Continuous integration of machine learning models with ease.ml/ci: Towards a rigorous yet practical treatment
C Renggli, B Karlaš, B Ding, F Liu, K Schawinski, W Wu, C Zhang
[arXiv] arXiv preprint arXiv:1903.00278

Continuous integration is an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no different: it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as a first-class citizen. In this paper, we present ease.ml/ci, to the best of our knowledge the first continuous integration system for machine learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., a single-accuracy-point error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain-specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.
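For intuition on the labeling effort involved, a naive baseline follows from the standard two-sided Hoeffding bound: to estimate a model's accuracy within ±ε with probability at least 1 − δ, about ln(2/δ)/(2ε²) i.i.d. labeled test examples suffice. The sketch below is that textbook baseline, not ease.ml/ci's optimized estimators, and `labels_needed` is a hypothetical helper name:

```python
import math

def labels_needed(eps, delta):
    """Two-sided Hoeffding bound: number of i.i.d. labeled examples that
    suffice to estimate accuracy within +/- eps with probability >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# One accuracy point (eps = 0.01) at 0.999 reliability (delta = 0.001):
print(labels_needed(0.01, 0.001))
```

This baseline demands about 38,005 labels per test at those settings, which is why optimizations that bring the cost down toward the ~2K labels mentioned above matter in practice.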

Paper BibTeX
Data Science through the looking glass and what we found there
F Psallidas, Y Zhu, B Karlaš, M Interlandi, A Floratou, K Karanasos, W Wu, C Zhang, S Krishnan, C Curino, M Weimer
[arXiv] arXiv preprint arXiv:1912.09536

The recent success of machine learning (ML) has led to an explosive growth both in terms of new systems and algorithms built in industry and academia, and new applications built by an ever-growing community of data science (DS) practitioners. This quickly shifting panorama of technologies and applications is challenging for builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, by performing the largest analysis of DS projects to date, focusing on questions that can help determine investments on either side. Specifically, we download and analyze: (a) over 6M Python notebooks publicly available on GitHub, (b) over 2M enterprise DS pipelines developed within COMPANYX, and (c) the source code and metadata of over 900 releases from 12 important DS libraries. The analysis we perform ranges from coarse-grained statistical characterizations to analysis of library imports, pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret, and dare to draw a few (actionable, yet subjective) conclusions on (a) what systems builders should focus on to better serve practitioners, and (b) what technologies practitioners should bet on given current trends. We plan to automate this analysis and release associated tools and results periodically.

Paper BibTeX
AutoML from the service provider's perspective: Multi-device, multi-tenant model selection with GP-EI
C Yu, B Karlaš, J Zhong, C Zhang, J Liu
[AISTATS] 22nd International Conference on Artificial Intelligence and Statistics

AutoML has become a popular service provided by most leading cloud service providers today. In this paper, we focus on the AutoML problem from the service provider's perspective, motivated by the following practical consideration: when an AutoML service needs to serve multiple users with multiple devices at the same time, how can we allocate these devices to users in an efficient way? We focus on GP-EI, one of the most popular algorithms for automatic model selection and hyperparameter tuning, used by systems such as Google Vizier. The technical contribution of this paper is the first multi-device, multi-tenant algorithm for GP-EI that is aware of multiple computation devices and multiple users sharing the same set of computation devices. Theoretically, given N users and M devices, we obtain a regret bound of O((MIU(T, K) + M) · N^2 / M), where MIU(T, K) refers to the maximal incremental uncertainty up to time T for the covariance matrix K. Empirically, we evaluate our algorithm on two applications of automatic model selection, and show that it significantly outperforms the strategy of serving users independently. Moreover, when multiple computation devices are available, we achieve near-linear speedup when the number of users is much larger than the number of devices.
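GP-EI scores each candidate model or hyperparameter configuration by its expected improvement over the best observation so far under the Gaussian-process posterior. Below is a minimal sketch of that standard closed-form acquisition function (not the paper's multi-tenant scheduler; `expected_improvement` is an illustrative helper, not a library call):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI for maximization, given a Gaussian posterior N(mu, sigma^2)
    at a candidate point and the best objective value observed so far."""
    if sigma <= 0.0:
        return max(0.0, mu - best - xi)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal cdf
    return (mu - best - xi) * cdf + sigma * pdf

# A scheduler would evaluate the candidate with the largest EI next:
candidates = [(0.70, 0.05), (0.80, 0.10), (0.76, 0.02)]  # (posterior mean, std)
best_seen = 0.75
print(max(candidates, key=lambda c: expected_improvement(c[0], c[1], best_seen)))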

Paper BibTeX
Ease.ml/ci and Ease.ml/meter in action: towards data management for statistical generalization
C Renggli, FA Hubis, B Karlaš, K Schawinski, W Wu, C Zhang
[VLDB Demo] Proceedings of the VLDB Endowment

Developing machine learning (ML) applications is similar to developing traditional software: it is often an iterative process in which developers navigate within a rich space of requirements, design decisions, implementations, empirical quality, and performance. In traditional software development, software engineering is the field of study that provides principled guidelines for this iterative process. However, as of today, the counterpart of "software engineering for ML" is largely missing: developers of ML applications are left with powerful tools (e.g., TensorFlow and PyTorch) but little guidance regarding the development lifecycle itself. In this paper, we view the management of ML development lifecycles from a data management perspective. We demonstrate two closely related systems, ease.ml/ci and ease.ml/meter, that provide some "principled guidelines" for ML application development: ease.ml/ci is a continuous …

Paper BibTeX

2018

Ease.ml in action: towards multi-tenant declarative learning services
B Karlaš, J Liu, W Wu, C Zhang
[VLDB Demo] Proceedings of the VLDB Endowment

We demonstrate ease.ml, a multi-tenant machine learning service we host at ETH Zurich for various research groups. Unlike existing machine learning services, ease.ml presents a novel architecture that supports multi-tenant, cost-aware model selection that optimizes for minimizing the total regret of all users. Moreover, it provides a novel user interface that enables declarative machine learning at a higher level: users only need to specify the input/output schemata of their learning tasks, and ease.ml can handle the rest. In this demonstration, we present the design principles of ease.ml, highlight the implementation of its key components, and showcase how ease.ml can help ease machine learning tasks that often perplex even experienced users.

Code Paper BibTeX
Network Scheduling in the Dark
V Đukić, SA Jyothi, B Karlaš, M Owaida, C Zhang, A Singla
[SoCC] Proceedings of the ACM Symposium on Cloud Computing

Motivation. Advance knowledge of future events in a dynamic system can often be used to take actions that improve system performance. In data center networks, such knowledge could potentially benefit many problems, including routing and flow scheduling, circuit switching, packet scheduling in switch queues, and transport protocols. Indeed, past work on each of these topics has explored this, and in many cases, claimed significant improvements [1–3]. Nevertheless, little of this work has achieved deployment in data centers, which largely use techniques that are agnostic to traffic information, such as shortest path routing with randomization, and first-in-first-out queueing at switches. A significant roadblock for traffic-aware scheduling is that in practice, traffic characteristics can be hard to ascertain accurately in a timely fashion. In particular, past work on network flow and packet scheduling has assumed advance …

Paper BibTeX


The curious case of the PDF converter that likes Mozart: Dissecting and mitigating the privacy risk of personal cloud apps
H Harkous, R Rahman, B Karlaš, K Aberer
Proceedings on Privacy Enhancing Technologies

Third-party apps that work on top of personal cloud services, such as Google Drive and Dropbox, require access to the user's data in order to provide some functionality. Through detailed analysis of a hundred popular Google Drive apps from Google's Chrome store, we discover that the existing permission model is quite often misused: around two-thirds of analyzed apps are over-privileged, i.e., they access more data than they need in order to function. In this work, we analyze three different permission models that aim to discourage users from installing over-privileged apps. In experiments with 210 real users, we discover that the most successful permission model is our novel ensemble method that we call Far-reaching Insights. Far-reaching Insights inform users about the data-driven insights that apps can make about them (e.g., their topics of interest, collaboration and activity patterns, etc.). Thus, they seek to bridge the gap between what third parties can actually know about users and users' perception of their privacy leakage. The efficacy of Far-reaching Insights in bridging this gap is demonstrated by our results, as Far-reaching Insights prove to be, on average, twice as effective as the current model in discouraging users from installing over-privileged apps. In an effort to promote general privacy awareness, we deployed PrivySeal, a publicly available privacy-focused app store that uses Far-reaching Insights. Based on the knowledge extracted from the data of the store's users (over 115 gigabytes of Google Drive data from 1440 users with 662 installed apps), we also delineate the ecosystem for third-party cloud apps from the standpoint of …

Paper Video Website BibTeX



Research Intern / Microsoft Gray Systems Lab / Redmond, USA
Building a testing tool for ML models. Researching usage data from ML.NET feature engineering pipelines.


Research Intern / Oracle Labs / San Francisco Bay Area, USA
Developing an automated ensemble construction method for the Oracle Auto-ML system.

2016 - 2017

Research and Development Intern / Logitech / Lausanne, Switzerland
Applying machine learning and signal processing techniques to detect filler words in speech audio.

2015 - 2016

EPFL Research Scholar Program
Research Scholar / LCBB Lab / Lausanne, Switzerland
Applying graph theory and developing algorithms for assigning absolute orientations to genetic markers.

2014 - 2015

EPFL Research Scholar Program
Research Scholar / LSIR Lab / Lausanne, Switzerland
Designing, developing, and validating an improved app permissions dialog for Google Drive.

2012 - 2014

Software Design Engineer / Microsoft Development Center Serbia / Belgrade, Serbia
Worked in the SQL Server Parallel Data Warehouse (PDW) team on the development of a Microsoft big data solution. Participated in all phases of the software development cycle, collaborated with various teams in the US, worked with a large code base, and wrote maintainable production-quality code.


2018 - Present

Eidgenössische Technische Hochschule (ETH)
PhD in Computer Science / DS3 Lab / Zürich, Switzerland
Working on data management systems for machine learning.

2014 - 2017

École polytechnique fédérale de Lausanne (EPFL)
Master in Computer Science / Lausanne, Switzerland
Worked on many interesting and important projects.

2008 - 2014

School of Electrical Engineering, Belgrade University (ETF)
Bachelor in Software Engineering / Belgrade, Serbia
Worked on many interesting and important projects.


I speak English fluently, as well as intermediate German and French.
My native language is Serbian.
I enjoy running, hiking, books, and video games.