I am an Assistant Professor in the Department of Computer Science at ETH Zurich. I am a member of the ETH Systems Group, where I lead the Efficient Architectures and Systems Lab (EASL).
I work on computer systems for large-scale applications such as cloud computing services, data analytics, and machine learning. The goal of my research is to improve the performance and resource efficiency of cloud computing while making it easier for users to deploy and manage their applications. My research interests span operating systems, computer architecture, and their intersection with machine learning.
Before joining ETH, I spent a year as a Research Scientist at Google Brain. I completed my Ph.D. in Electrical Engineering at Stanford University, advised by Professor Christos Kozyrakis. My dissertation was on the design and implementation of fast, elastic storage for cloud computing. My Ph.D. was generously supported by the Microsoft Research PhD Fellowship and Stanford Graduate Fellowship. I earned my M.S. in Electrical Engineering at Stanford University in 2015. I graduated from the Engineering Science program at the University of Toronto in 2013, where I earned my Bachelor of Applied Science and Engineering.
If you are interested in joining the EASL research group, please email me (aklimovic@ethz.ch) with your CV. See below for research focus areas.
Cloud computing is undergoing a fundamental shift, driven by exponential growth in data and users and by increasing demand for cloud services that can automatically allocate and scale computing resources for jobs. An emerging wave of cloud computing, called serverless computing, enables users to focus on writing code for their applications while cloud providers manage resources based on application demands. On serverless computing platforms, users can simultaneously launch thousands of tiny, short-lived tasks and pay only for the resources their tasks actually consume per ~10ms time interval, as opposed to paying for pre-allocated virtual machines that have fixed ratios of compute, memory, and storage.
Research topics: What should an operating system for serverless computing look like? Scheduling millions of short-lived tasks to satisfy performance requirements and achieve high resource utilization poses interesting challenges. Serverless computing encourages a high degree of resource sharing across tenants, which raises performance and security isolation concerns. In addition, it is not yet clear what the right abstraction is for users to specify application performance requirements.
Machine learning (ML) jobs are an increasingly important class of applications in the cloud. Across domains such as image understanding and text translation, scaling machine learning models to a large number of parameters has been shown to dramatically improve accuracy when sufficiently large datasets are used. While significant work has focused on optimizing hardware and software for ML computations, data management is a common bottleneck. As organizations collect massive amounts of data, storing and ingesting data at this scale poses several challenges.
Research topics: How should we design distributed storage systems for machine learning to optimize end-to-end model training and inference? How can we avoid moving large amounts of data across the network — should we instead move computation closer to the data (near-storage computing)? How can multiple tenants safely share datasets and models with good performance guarantees?
Many of today’s computer systems use heuristics and hints to make decisions (e.g., to decide which resources to allocate for a task or which data to keep in a cache). As software applications and hardware platforms become more and more heterogeneous, designing heuristics is increasingly difficult. Yet due to growing heterogeneity, automating resource and data management is increasingly important. One promising approach is to learn resource management strategies by training machine learning models using system data collected while profiling or running applications.
Research topics: How can we leverage machine learning models to make systems-level decisions when such decisions often need to be made at microsecond timescales? How should we design APIs to make replacing or supplementing heuristics with machine learning model inference practical in computer systems?
PhD students:
Postdocs:
Masters students:
Bachelor students:
Visitors:
[SOSP] Dirigent: Lightweight Serverless Orchestration
Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic.
Proceedings of the ACM Symposium on Operating Systems Principles, 2024.
[ICML] DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic.
Proceedings of the International Conference on Machine Learning (ICML), 2024.
[ATC] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
Dan Graur*, Oto Mraz*, Sepehr Pourghannad, Muyu Li, Chandramohan A. Thekkath, Ana Klimovic.
Proceedings of the USENIX Annual Technical Conference (ATC), 2024.
[CIDR] Off-the-shelf Data Analytics on Serverless
Michael Wawrzoniak, Gianluca Moro, Rodrigo Bruno, Ana Klimovic, Gustavo Alonso.
Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2024.
[EuroSys] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
Foteini Strati, Xianzhe Ma, Ana Klimovic.
Proceedings of the European Conference on Computer Systems (EuroSys), 2024.
[SoCC] Function as a Function
Tom Kuchler, Michael Giardino, Timothy Roscoe, Ana Klimovic.
Proceedings of the ACM Symposium on Cloud Computing (SoCC), 2023.
[SoCC] tf.data service: A Case for Disaggregating ML Input Data Processing
Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa, Chandramohan A. Thekkath.
Proceedings of the ACM Symposium on Cloud Computing (SoCC), 2023.
[WORDS] Understanding the Neglected Cost of Serverless Cluster Management
Lazar Cvetković, Rodrigo Fonseca, Ana Klimovic.
Proceedings of the Workshop on Resource Disaggregation and Serverless (WORDS) at SOSP, 2023.
[WORDS] Enabling In-Vitro Serverless Systems Research
Dmitrii Ustiugov, Dohyun Park, Lazar Cvetković, Mihajlo Djokić, Hongyu Hè, Boris Grot, Ana Klimovic.
Proceedings of the Workshop on Resource Disaggregation and Serverless (WORDS) at SOSP, 2023.
[VLDB] NVM: Is it Not Very Meaningful for Databases?
Dimitrios Koutsoukos, Raghav Bhartia, Michal Friedman, Ana Klimovic, Gustavo Alonso.
Proceedings of the International Conference on Very Large Databases (VLDB), 2023.
[VLDB] Analyzing Vectorized Hash Tables Across CPU Architectures
Maximilian Böther, Lawrence Benson, Ana Klimovic, Tilmann Rabl.
Proceedings of the International Conference on Very Large Databases (VLDB), 2023.
[VLDB] SHiFT: An Efficient, Flexible Search Engine for Transfer Learning
Cedric Renggli, Xiaozhe Yao, Luka Kolar, Luka Rimanic, Ana Klimovic, Ce Zhang.
Proceedings of the International Conference on Very Large Databases (VLDB), 2023.
[SDA] Rethinking Serverless Computing: from the Programming Model to the Platform Design
Gustavo Alonso, Ana Klimovic, Tom Kuchler, Michael Wawrzoniak.
Proceedings of VLDB Workshop on Serverless Data Analytics, 2023.
[SDA] Ephemeral Per-query Engines for Serverless Analytics
Michael Wawrzoniak, Rodrigo Bruno, Ana Klimovic, Gustavo Alonso.
Proceedings of VLDB Workshop on Serverless Data Analytics, 2023.
[EuroMLSys] Towards a Platform and Benchmark Suite for Model Training on Dynamic Datasets
Maximilian Böther, Foteini Strati, Viktor Gsteiger, Ana Klimovic.
Proceedings of the Workshop on Machine Learning and Systems (EuroMLSys), 2023.
[ATC] Cachew: Machine Learning Input Data Processing as a Service
Dan Graur, Damien Aymon, Dan Kluser, Tanguy Albrici, Chandramohan A. Thekkath, Ana Klimovic.
Proceedings of the USENIX Annual Technical Conference (ATC), 2022.
[MLSys] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
Michael Kuchnik, Ana Klimovic, Jiri Simsa, George Amvrosiadis, Virginia Smith.
Proceedings of the Conference on Machine Learning and Systems (MLSys), 2022.
[DistributedML] Exploring Learning Rate Scaling Rules for Distributed ML Training on Transient Resources
Joel André*, Foteini Strati*, Ana Klimovic.
International Workshop on Distributed Machine Learning (DistributedML), 2022.
[VLDB] tf.data: A Machine Learning Data Processing Framework
Derek G. Murray, Jiri Simsa, Ana Klimovic, Ihor Indyk.
Proceedings of the International Conference on Very Large Databases (VLDB), August 2021. Presentation Video.
[VLDB] Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms
Dimitris Koutsoukos, Ingo Müller, Renato Marroquín, Ana Klimovic, Gustavo Alonso.
Proceedings of the International Conference on Very Large Databases (VLDB), 2021.
[ATC] SONIC: Application-aware Data Passing for Chained Serverless Applications
Ashraf Mahgoub, Karthick Shankar, Subrata Mitra, Ana Klimovic, Somali Chaterji, Saurabh Bagchi.
Proceedings of the USENIX Annual Technical Conference (ATC), July 2021.
[SIGMOD] Towards Demystifying Serverless Machine Learning Training
Jiawei Jiang*, Shaoduo Gan*, Yue Liu, Gustavo Alonso, Ana Klimovic, Ankit Singla, Wentao Wu, Ce Zhang.
Proceedings of ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), June 2021.
[TOS] RAIL: Predictable, Low Tail Latency for NVMe Flash
Heiner Litz, Javier Gonzalez, Ana Klimovic, Christos Kozyrakis.
ACM Transactions on Storage (TOS), Volume 17, Issue 1, January 2021.
[ATC] OPTIMUSCLOUD: Heterogeneous Configuration Optimization for Distributed Databases in the Cloud
Ashraf Mahgoub, Alexander Michaelson Medoff, Rakesh Kumar, Subrata Mitra, Ana Klimovic, Somali Chaterji, Saurabh Bagchi.
Proceedings of the USENIX Annual Technical Conference (ATC), July 2020.
[SPMA] Serverless Clusters: The Missing Piece for Interactive Batch Applications?
Ingo Müller, Rodrigo Bruno, Ana Klimovic, John Wilkes, Eric Sedlar, Gustavo Alonso.
Workshop on Systems for Post-Moore Architectures (SPMA), April 2020.
[ATC] Unification of Temporary Storage in the NodeKernel Architecture
Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, Ana Klimovic, Adrian Schuepbach, Bernard Metzler.
Proceedings of the USENIX Annual Technical Conference (ATC), Renton, WA, July 2019.
[Thesis] Fast, Elastic Storage for the Cloud
Ana Klimovic. Doctoral Dissertation, Stanford University, June 2019.
I presented my thesis work at several seminars, including the WICARCH seminar (presentation video).
[OSDI] Pocket: Elastic Ephemeral Storage for Serverless Analytics
Ana Klimovic, Yawen Wang, Christos Kozyrakis, Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle.
Proceedings of USENIX Operating Systems Design and Implementation (OSDI), Carlsbad, CA, October 2018.
[ATC] Understanding Ephemeral Storage for Serverless Analytics
Ana Klimovic, Yawen Wang, Christos Kozyrakis, Patrick Stuedi, Jonas Pfefferle, Animesh Trivedi.
Proceedings of the USENIX Annual Technical Conference (ATC), Boston, MA, July 2018.
[ATC] Selecta: Heterogeneous Cloud Storage Configuration for Data Analytics
Ana Klimovic, Heiner Litz, Christos Kozyrakis.
Proceedings of the USENIX Annual Technical Conference (ATC), Boston, MA, July 2018.
[MLSys] Learning Heterogeneous Cloud Storage Configuration for Data Analytics
Ana Klimovic, Heiner Litz, Christos Kozyrakis.
Non-archival proceedings of the inaugural Systems and Machine Learning conference (MLSys), Stanford, CA, February 2018.
[HotStorage] Understanding Rack-Scale Disaggregated Storage
Sergey Legtchenko, Hugh Williams, Kaveh Razavi, Austin Donnelly, Richard Black, Andrew Douglas, Nathanael Cheriere, Daniel Fryer, Kai Mast, Angela Demke Brown, Ana Klimovic, Andy Slowey, Antony Rowstron.
Proceedings of the USENIX Hot Topics in Storage and File Systems (HotStorage), Santa Clara, CA, July 2017.
[ASPLOS] ReFlex: Remote Flash ≈ Local Flash
Ana Klimovic*, Heiner Litz*, Christos Kozyrakis.
Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Xi'an, China, April 2017. Memorable Paper Award (awarded at NVMW'18).
[TOCS] The IX Operating System: Combining Low Latency, High Throughput, and Efficiency in a Protected Dataplane
Adam Belay, George Prekas, Mia Primorac, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, Edouard Bugnion.
ACM Transactions on Computer Systems, Volume 34, Issue 4, January 2017.
[EuroSys] Flash Storage Disaggregation
Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar.
Proceedings of the 11th European Conference on Computer Systems (EuroSys), London, UK, April 2016.
[OSDI] IX: A Protected Dataplane Operating System for High Throughput and Low Latency
Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, Edouard Bugnion.
Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Broomfield, CO, October 2014. Best Paper Award.
[FPT] Bitwidth-optimized Hardware Accelerators with Software Fallback
Ana Klimovic and Jason H. Anderson.
IEEE International Conference on Field-Programmable Technology (FPT), pp. 136-143, Kyoto, Japan, December 2013.
See the EASL GitHub for our research project code, including:
Outside of research, I enjoy: