Bojan Karlaš
/ Postdoc at Harvard
I am a postdoctoral research fellow at Harvard University. I develop machine learning pipelines for processing biomedical data and extracting clinically meaningful insights. I work with Eugene Semenov at the Cutaneous Biology Research Center of Massachusetts General Hospital, David Liu at the Department of Medical Oncology of Dana-Farber Cancer Institute, and Kun-Hsing Yu at the Department of Biomedical Informatics of Harvard Medical School.
Previously, I did my Ph.D. at the Systems Group of ETH Zurich, working with Ce Zhang at the intersection of data management systems and machine learning. I designed and built systems for managing the machine learning development lifecycle, with a specific focus on data debugging.
Publications
2024
B Karlaš,
The Twelfth International Conference on Learning Representations
Abstract
When a machine learning (ML) model exhibits poor quality (e.g., poor accuracy or fairness), the problem can often be traced back to errors in the training data. Being able to discover the data examples that are the most likely culprits is a fundamental concern that has received a lot of attention recently. One prominent way to measure “data importance” with respect to model quality is the Shapley value. Unfortunately, existing methods only focus on the ML model in isolation, without considering the broader ML pipeline for data preparation and feature extraction, which appears in the majority of real-world ML code. This presents a major limitation to applying existing methods in practical settings. In this paper, we propose Datascope, a method for efficiently computing Shapley-based data importance over ML pipelines. We introduce several approximations that lead to dramatic improvements in terms of computational speed. Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them. We release our code as an open-source data debugging library available at https://github.com/easeml/datascope.
2023
M Weber,
[IEEE] IEEE Symposium on Security and Privacy (SP)
Abstract
Recent studies have shown that deep neural networks (DNNs) are highly vulnerable to adversarial attacks, including evasion and backdoor (poisoning) attacks. On the defense side, there has been intensive interest in both empirical and provable robustness against evasion attacks; however, provable robustness against backdoor attacks remains largely unexplored. In this paper, we focus on certifying robustness against backdoor attacks. To this end, we first provide a unified framework for robustness certification and show that it leads to a tight robustness condition for backdoor attacks. We then propose the first robust training process, RAB, to smooth the trained model and certify its robustness against backdoor attacks. Moreover, we evaluate the certified robustness of a family of “smoothed” models which are trained in a differentially private fashion, and show that they achieve better certified robustness bounds. In addition, we theoretically show that it is possible to train the robust smoothed models efficiently for simple models such as K-nearest neighbor classifiers, and we propose an exact smooth-training algorithm which eliminates the need to sample from a noise distribution. Empirically, we conduct comprehensive experiments for different machine learning (ML) models such as DNNs, differentially private DNNs, and K-NN models on MNIST, CIFAR-10 and ImageNet datasets (focusing on binary classifiers), and provide the first benchmark for certified robustness against backdoor attacks. In addition, we evaluate K-NN models on a spambase tabular dataset to demonstrate the advantages of the proposed exact algorithm. Both the theoretical analysis and the comprehensive benchmark on diverse ML models and datasets shed light on further robust learning strategies against training time attacks or other general adversarial attacks.
M Mazumder,
Advances in Neural Information Processing Systems
Abstract
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
S Schelter,
Abstract
Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors or fairness violations, which require reasoning about complex dependencies between their inputs and outputs. These issues are usually only detected in hindsight after deployment, after they caused harm in production. We demonstrate ArgusEyes, a system which enables data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. ArgusEyes instruments, executes and screens ML pipelines for declaratively specified pipeline issues, and analyzes data artifacts and their provenance to catch potential problems early before deployment to production. We demonstrate our system for three scenarios: detecting mislabeled images in a computer vision pipeline, spotting data …
L Oala,
[arXiv] arXiv preprint arXiv:2311.13028
Abstract
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
S Grafberger,
Data-centric Machine Learning Research (DMLR) Workshop at ICML 2023
Abstract
We argue for a declarative approach to simplify the application of data-centric ML in real-world scenarios, and present our prototypical system MLWHATIF, which takes a first step in this direction.
2022
F Psallidas,
[SIGMOD] SIGMOD Record
Abstract
The recent success of machine learning (ML) has led to an explosive growth of systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date, focusing on questions that can advance our understanding of the field and determine investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GitHub and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, fine-grained analysis of libraries and pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret and draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) what technologies practitioners should rely on.
S Eyuboglu,
[DEEM Workshop] Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
Abstract
The development workflow for today’s AI applications has grown far beyond the standard model training task. This workflow typically consists of various data and model management tasks. It includes a “data cycle” aimed at producing high-quality training data, and a “model cycle” aimed at managing trained models on their way to production. This broadened workflow has opened a space for already emerging tools and systems for AI development. However, as a research community, we are still missing standardized ways to evaluate these tools and systems. In a humble effort to get this wheel turning, we developed dcbench, a benchmark for evaluating systems for data-centric AI development. In this report, we present the main ideas behind dcbench, some benchmark tasks that we included in the initial release, and a short summary of its implementation.
Z Yang,
Advances in Neural Information Processing Systems
Abstract
Intensive algorithmic efforts have been made recently to enable rapid improvements in certified robustness for complex ML models. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN), so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state of the art.
S Schelter,
[CIDR Abstract] Conference on Innovative Data Systems Research
Abstract
Software systems that learn from data are being deployed in increasing numbers in industrial and institutional scenarios. Developing these machine learning (ML) applications imposes additional challenges beyond those of traditional software systems. The behavior of such applications very much depends on their input data, and they are based on systems and libraries from a relatively young data science ecosystem, which is rapidly evolving all the time. Experience shows that it is difficult to ensure that such ML applications are implemented correctly, and as a consequence, data scientists building these applications require fundamental system support.
2021
C Renggli,
[IEEE] IEEE Data Engineering Bulletin
Abstract
Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model and the quality of the data used to train or perform evaluations. In this work, we demonstrate how different aspects of data quality propagate through various stages of machine learning development. By performing a joint analysis of the impact of well-known data quality dimensions and the downstream machine learning process, we show that different components of a typical MLOps pipeline can be efficiently designed, providing both a technical and theoretical perspective.
LA Melgar,
[CIDR] Conference on Innovative Data Systems Research
Abstract
We present Ease.ML, a lifecycle management system for machine learning (ML). Unlike many existing works, which focus on improving individual steps during the lifecycle of ML application development, Ease.ML focuses on managing and automating the entire lifecycle itself. We present user scenarios that have motivated the development of Ease.ML, the eight-step Ease.ML process that covers the lifecycle of ML application development; the foundation of Ease.ML in terms of a probabilistic database model and its connection to information theory; and our lessons learned, which hopefully can inspire future research.
MR Karimi,
[AISTATS] International Conference on Artificial Intelligence and Statistics
Abstract
Given pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can also be used for online prediction tasks for both adversarial and stochastic streams. We establish several theoretical guarantees for our algorithm and extensively demonstrate its effectiveness in our experimental studies.
2020
B Karlaš,
[VLDB] Proceedings of the VLDB Endowment
Abstract
Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of “Certain Predictions” (CP) — a test data example can be certainly predicted (CP’ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP’ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed — we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of “data cleaning for machine learning (DC for ML).” We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% gap on average.
B Karlaš,
[SIGKDD] Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
Abstract
Continuous integration (CI) has been a de facto standard for building industrial-strength software. Yet, there is little attention towards applying CI to the development of machine learning (ML) applications until the very recent effort on the theoretical side. In this paper, we take a step forward to bring the theory into practice. We develop the first CI system for ML, to the best of our knowledge, that integrates seamlessly with existing ML development tools. We present its design and implementation details.
2019
V Ðukić,
[NSDI] 16th USENIX Symposium on Networked Systems Design and Implementation
Abstract
Recent research has proposed several packet, flow, and coflow scheduling methods that could substantially improve data center network performance. Most of this work assumes advance knowledge of flow sizes. However, the lack of a clear path to obtaining such knowledge has also prompted some work on non-clairvoyant scheduling, albeit with more limited performance benefits.
C Renggli,
[MLSYS] Proceedings of Machine Learning and Systems
Abstract
Continuous integration is an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no different: it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as first-class citizens. In this paper, we present ease.ml/ci, to our best knowledge, the first continuous integration system for machine learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., single accuracy point error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.
C Renggli,
[VLDB Demo] Proceedings of the VLDB Endowment
Abstract
Developing machine learning (ML) applications is similar to developing traditional software: it is often an iterative process in which developers navigate within a rich space of requirements, design decisions, implementations, empirical quality, and performance. In traditional software development, software engineering is the field of study which provides principled guidelines for this iterative process. However, as of today, the counterpart of “software engineering for ML” is largely missing: developers of ML applications are left with powerful tools (e.g., TensorFlow and PyTorch) but little guidance regarding the development lifecycle itself. In this paper, we view the management of ML development life-cycles from a data management perspective. We demonstrate two closely related systems, ease.ml/ci and ease.ml/meter, that provide some “principled guidelines” for ML application development: ci is a continuous …
C Yu,
[AISTATS] 22nd International Conference on Artificial Intelligence and Statistics
Abstract
AutoML has become a popular service that is provided by most leading cloud service providers today. In this paper, we focus on the AutoML problem from the service provider’s perspective, motivated by the following practical consideration: When an AutoML service needs to serve multiple users with multiple devices at the same time, how can we allocate these devices to users in an efficient way? We focus on GP-EI, one of the most popular algorithms for automatic model selection and hyperparameter tuning, used by systems such as Google Vizier. The technical contribution of this paper is the first multi-device, multi-tenant algorithm for GP-EI that is aware of multiple computation devices and multiple users sharing the same set of computation devices. Theoretically, given $N$ users and $M$ devices, we obtain a regret bound of $O\left((\mathbf{MIU}(T, K) + M)\frac{N^2}{M}\right)$, where $\mathbf{MIU}(T, K)$ refers to the maximal incremental uncertainty up to time $T$ for the covariance matrix $K$. Empirically, we evaluate our algorithm on two applications of automatic model selection, and show that our algorithm significantly outperforms the strategy of serving users independently. Moreover, when multiple computation devices are available, we achieve near-linear speedup when the number of users is much larger than the number of devices.
2018
B Karlaš,
[VLDB Demo] Proceedings of the VLDB Endowment
Abstract
We demonstrate ease.ml, a multi-tenant machine learning service we host at ETH Zurich for various research groups. Unlike existing machine learning services, ease.ml presents a novel architecture that supports multi-tenant, cost-aware model selection that optimizes for minimizing total regrets of all users. Moreover, it provides a novel user interface that enables declarative machine learning at a higher level: Users only need to specify the input/output schemata of their learning tasks and ease.ml can handle the rest. In this demonstration, we present the design principles of ease.ml, highlight the implementation of its key components, and showcase how ease.ml can help ease machine learning tasks that often perplex even experienced users.
2016
H Harkous,
Proceedings on Privacy Enhancing Technologies
Abstract
Third-party apps that work on top of personal cloud services, such as Google Drive and Dropbox, require access to the user’s data in order to provide some functionality. Through detailed analysis of a hundred popular Google Drive apps from Google’s Chrome store, we discover that the existing permission model is quite often misused: around two thirds of analyzed apps are over-privileged, i.e., they access more data than is needed for them to function. In this work, we analyze three different permission models that aim to discourage users from installing over-privileged apps. In experiments with 210 real users, we discover that the most successful permission model is our novel ensemble method that we call Far-reaching Insights. Far-reaching Insights inform the users about the data-driven insights that apps can make about them (e.g., their topics of interest, collaboration and activity patterns, etc.). Thus, they seek to bridge the gap between what third parties can actually know about users and users’ perception of their privacy leakage. The efficacy of Far-reaching Insights in bridging this gap is demonstrated by our results, as Far-reaching Insights prove to be, on average, twice as effective as the current model in discouraging users from installing over-privileged apps. In an effort to promote general privacy awareness, we deploy a publicly available privacy-oriented app store that uses Far-reaching Insights. Based on the knowledge extracted from data of the store’s users (over 115 gigabytes of Google Drive data from 1440 users with 662 installed apps), we also delineate the ecosystem for third-party cloud apps from the standpoint of developers and …
Community
Presentations
Data Debugging with Shapley importance over machine learning pipelines
Lightning talk outlining the ease.ml/datascope project and one of the key algorithms.
Venue: MIA seminar @ Broad Institute (2024)
Talk video
Data Science through the Looking Glass and what we found there
Presenting an extensive study of data science pipelines conducted at Microsoft.
Venue: Dutch Seminar on Data Systems Design (2022)
Talk video
Understanding Data Quality in the Area of Data-Centric AI (co-presenter)
Covering several methods we developed for tackling ML data quality issues.
Venue: DCAI Workshop (2021), Yu Lab @ Harvard Medical School (2021)
Talk video
DataScope: Scaling up Data Shapley over Machine Learning Pipelines
Virtual presentation introducing the ease.ml/datascope project.
Venue: Microsoft Joint Research Center Workshop 2021
Talk video
Introduction to Machine Learning
Six-part video series outlining key concepts in the field of machine learning.
Venue: YouTube channel of Modulos (2021)
Video playlist
Reviewing
VLDB International Conference on Very Large Data Bases:
2023, 2025
NeurIPS Neural Information Processing Systems:
2023, 2024
ICML International Conference on Machine Learning:
2024
ICLR International Conference on Learning Representations:
2024, 2025
CIKM Conference on Information and Knowledge Management:
2021, 2022
DEEM Workshop @ SIGMOD Workshop on Data Management for End-to-End Machine Learning:
2023, 2024
DBML Workshop @ ICDE International Workshop on Databases and Machine Learning:
2023, 2024
DataPerf Workshop @ ICML Benchmarking Data for Data-Centric AI:
2022
WiML Workshop @ NeurIPS Women in Machine Learning Workshop:
2019
VLDB Journal The International Journal on Very Large Data Bases: 2023
ACM JDIQ Journal of Data and Information Quality: 2023
ACM TODS ACM Transactions on Database Systems: 2024
DMLR Journal on Data-centric Machine Learning Research: 2024
Nature SREP Nature Scientific Reports: 2023
AACR CR AACR Cancer Research: 2024 (shadow reviewer)
AACR CCR AACR Clinical Cancer Research: 2024 (shadow reviewer)
Organization
Table Representation Learning Workshop @ NeurIPS (co-organizer)
Workshop featuring invited speakers, a panel and papers aimed at advancing relational tables as a first-class data modality for deep learning research.
Venue: NeurIPS (2022, 2023)
Website
Data Centric AI Workshop @ Stanford and ETH (co-organizer)
Two-day workshop featuring panels and keynotes by leading figures from academia and industry centered around the emerging field of data-centric AI.
Venue: Hybrid event supported by Stanford HAI and ETH AI Center (2021)
Website Benchmark
Employment
2024 - Present
Postdoctoral Scholar / Boston, USA
2024 - Present
Research Fellow (CWR) / Department of Medical Oncology / Boston, USA
Developing methods for detecting genomic biomarkers from whole-slide images.
2024 - Present
Non-Clinical Research Fellow / Cutaneous Biology Research Center / Boston, USA
Building a digital pathology pipeline for predicting clinical outcomes and associated biomarkers using whole-slide images of melanoma patients.
2022 - Present
Research Fellow / Harvard Medical School / Boston, USA
Developing methods for improving the quality of biomedical ML pipelines using data debugging techniques.
2019
Research Intern / Gray Systems Lab / Redmond, USA
Building a testing tool for ML models. Performing an extensive analysis of usage data from ML.NET feature engineering pipelines.
2018
Research Intern / Oracle Labs / San Francisco Bay Area, USA
Developing an automated ensemble construction method for the Oracle Auto-ML system.
2016 - 2017
Research and Development Intern / Lausanne, Switzerland
Applying machine learning and signal processing techniques to detect filler words in speech audio recordings.
2015 - 2016
Research Scholar / LCBB Lab / Lausanne, Switzerland
Applying graph theory and developing algorithms for assigning absolute orientations to genetic markers.
2014 - 2015
Research Scholar / LSIR Lab / Lausanne, Switzerland
Designing, developing, and validating an improved App Permissions Dialog for Google Drive.
2012 - 2014
Software Design Engineer / Microsoft Development Center Serbia / Belgrade, Serbia
Worked in the SQL Server Parallel Data Warehouse (PDW) team on the development of a Microsoft big data solution. Participated in all phases of the software development cycle, collaborated with various teams in the US, worked with a large code base, and wrote maintainable, production-quality code.
Education
2018 - 2022
PhD in Computer Science / DS3 Lab / Zürich, Switzerland
Thesis: Data Systems for Managing and Debugging Machine Learning Workflows
2014 - 2017
Master in Computer Science / Lausanne, Switzerland
Thesis: Machine Learning Models for Network Flow Size Prediction
2008 - 2014
Bachelor in Software Engineering / Belgrade, Serbia
Thesis: A Fast JSON Memory Object Model Implementation in C#
Personal
I speak English fluently, as well as intermediate German and French, and some beginner-level Spanish. My native language is Serbian. I enjoy running, hiking, skiing, scuba diving, books, and video games.