I am an Assistant Professor in the Department of Computer Science at ETH Zurich. I am a member of the ETH Systems Group, where I lead the Efficient Architectures and Systems Lab (EASL).

I work on computer systems for large-scale applications such as cloud computing services, data analytics, and machine learning. The goal of my research is to improve the performance and resource efficiency of cloud computing while making it easier for users to deploy and manage their applications. My research interests span operating systems, computer architecture, and their intersection with machine learning.

Before joining ETH, I spent a year as a Research Scientist at Google Brain. I completed my Ph.D. in Electrical Engineering at Stanford University, advised by Professor Christos Kozyrakis. My dissertation was on the design and implementation of fast, elastic storage for cloud computing. My Ph.D. was generously supported by the Microsoft Research PhD Fellowship and Stanford Graduate Fellowship. I earned my M.S. in Electrical Engineering at Stanford University in 2015. I graduated from the Engineering Science program at the University of Toronto in 2013, where I earned my Bachelor of Applied Science and Engineering.

I am currently looking for a motivated postdoctoral researcher. If you are interested in joining the EASL research group, please email me at aklimovic@ethz.ch with your CV. See below for research focus areas.


Research Focus Areas

Cloud computing is undergoing a fundamental shift, driven by exponential growth in data and users and by increasing demand for cloud services that automatically allocate and scale computing resources for jobs. An emerging wave of cloud computing, called serverless computing, enables users to focus on writing code for their applications while cloud providers manage resources based on application demands. On serverless computing platforms, users can simultaneously launch thousands of tiny, short-lived tasks and pay only for the resources their tasks actually consume in each ~10 ms interval, as opposed to paying for pre-allocated virtual machines that have fixed ratios of compute, memory, and storage.

Research topics: What should an operating system for serverless computing look like? Scheduling millions of short-lived tasks to satisfy performance requirements and achieve high resource utilization poses interesting challenges. Serverless computing encourages a high degree of resource sharing across tenants, which raises performance and security isolation concerns. In addition, it is not yet clear what the right abstraction is for users to specify application performance requirements.

Machine learning (ML) jobs are an increasingly important class of applications in the cloud. Across domains such as image understanding and text translation, scaling machine learning models to a large number of parameters has been shown to dramatically improve accuracy when sufficiently large datasets are used. While significant work has focused on optimizing hardware and software for ML computations, data management is a common bottleneck. As organizations collect massive amounts of data, storing and ingesting data at this scale poses several challenges.

Research topics: How should we design distributed storage systems for machine learning to optimize end-to-end model training and inference? How can we avoid moving large amounts of data across the network — should we instead move computation closer to the data (near-storage computing)? How can multiple tenants safely share datasets and models with good performance guarantees?

Many of today’s computer systems use heuristics and hints to make decisions (e.g., to decide which resources to allocate for a task or which data to keep in a cache). As software applications and hardware platforms become more and more heterogeneous, designing heuristics is increasingly difficult. Yet due to growing heterogeneity, automating resource and data management is increasingly important. One promising approach is to learn resource management strategies by training machine learning models using system data collected while profiling or running applications.

Research topics: How can we leverage machine learning models to make systems-level decisions when such decisions often need to be made at microsecond timescales? How should we design APIs to make replacing or supplementing heuristics with machine learning model inference practical in computer systems?

Publications


[VLDB] tf.data: A Machine Learning Data Processing Framework
Derek G. Murray, Jiri Simsa, Ana Klimovic, Ihor Indyk.
Proceedings of the International Conference on Very Large Data Bases (VLDB), 2021.

[SIGMOD] Towards Demystifying Serverless Machine Learning Training
Jiawei Jiang*, Shaoduo Gan*, Yue Liu, Gustavo Alonso, Ana Klimovic, Ankit Singla, Wentao Wu, Ce Zhang.
Proceedings of ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2021.

[ATC] OPTIMUSCLOUD: Heterogeneous Configuration Optimization for Distributed Databases in the Cloud
Ashraf Mahgoub, Alexander Michaelson Medoff, Rakesh Kumar, Subrata Mitra, Ana Klimovic, Somali Chaterji, Saurabh Bagchi.
Proceedings of the USENIX Annual Technical Conference (ATC), July 2020.

[SPMA] Serverless Clusters: The Missing Piece for Interactive Batch Applications?
Ingo Müller, Rodrigo Bruno, Ana Klimovic, John Wilkes, Eric Sedlar, Gustavo Alonso.
Workshop on Systems for Post-Moore Architectures (SPMA), April 2020.

[ATC] Unification of Temporary Storage in the NodeKernel Architecture
Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, Ana Klimovic, Adrian Schuepbach, Bernard Metzler.
Proceedings of the USENIX Annual Technical Conference (ATC), Renton, WA, July 2019.

[Thesis] Fast, Elastic Storage for the Cloud
Ana Klimovic. Doctoral Dissertation, Stanford University, June 2019.

[OSDI] Pocket: Elastic Ephemeral Storage for Serverless Analytics
Ana Klimovic, Yawen Wang, Christos Kozyrakis, Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle.
Proceedings of USENIX Operating Systems Design and Implementation (OSDI), Carlsbad, CA, October 2018.

[ATC] Understanding Ephemeral Storage for Serverless Analytics
Ana Klimovic, Yawen Wang, Christos Kozyrakis, Patrick Stuedi, Jonas Pfefferle, Animesh Trivedi.
Proceedings of the USENIX Annual Technical Conference (ATC), Boston, MA, July 2018.

[ATC] Selecta: Heterogeneous Cloud Storage Configuration for Data Analytics
Ana Klimovic, Heiner Litz, Christos Kozyrakis.
Proceedings of the USENIX Annual Technical Conference (ATC), Boston, MA, July 2018.

[MLSys] Learning Heterogeneous Cloud Storage Configuration for Data Analytics
Ana Klimovic, Heiner Litz, Christos Kozyrakis.
Non-archival proceedings of the inaugural Systems and Machine Learning conference (MLSys), Stanford, CA, February 2018.

[HotStorage] Understanding Rack-Scale Disaggregated Storage
Sergey Legtchenko, Hugh Williams, Kaveh Razavi, Austin Donnelly, Richard Black, Andrew Douglas, Nathanael Cheriere, Daniel Fryer, Kai Mast, Angela Demke Brown, Ana Klimovic, Andy Slowey, Antony Rowstron.
Proceedings of the USENIX Hot Topics in Storage and File Systems (HotStorage), Santa Clara, CA, July 2017.

[ASPLOS] ReFlex: Remote Flash ≈ Local Flash
Ana Klimovic, Heiner Litz, Christos Kozyrakis.
Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Xi'an, China, April 2017. Memorable Paper Award (awarded at NVMW'18).

[TOCS] The IX Operating System: Combining Low Latency, High Throughput, and Efficiency in a Protected Dataplane
Adam Belay, George Prekas, Mia Primorac, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, Edouard Bugnion.
ACM Transactions on Computer Systems, Volume 34, Issue 4, January 2017.

[EuroSys] Flash Storage Disaggregation
Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar.
Proceedings of the 11th European Conference on Computer Systems (EuroSys), London, UK, April 2016.

[OSDI] IX: A Protected Dataplane Operating System for High Throughput and Low Latency
Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, Edouard Bugnion.
Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Broomfield, CO, October 2014.
Best Paper Award.

[FPT] Bitwidth-optimized Hardware Accelerators with Software Fallback
Ana Klimovic and Jason H. Anderson.
IEEE International Conference on Field-Programmable Technology (FPT), pp. 136-143, Kyoto, Japan, December 2013.


Open Source Software


  • ReFlex: a software system that enables fast, predictable access to remote Flash storage
  • Pocket: a distributed, elastic data store for ephemeral data, designed for serverless computing applications

Teaching



About Me


Outside of research, I enjoy:

  • Sports: tennis, volleyball, skiing, swimming, squash, ...
  • Travel: exploring new places and cultures
  • Music: piano, guitar, singing, violin
  • Art: painting, sketching