Publications

Systems-Aware Machine Learning
Programming the Cloud: Performance, Predictability, Security
Software-Defined Networking and Network Architecture
Education in Reproducible Research
Reliable Networked Systems
Energy-Efficient Networks
Network Monitoring and Application Identification
Extended abstracts
Book chapters
Copyright notice

Systems-Aware Machine Learning

LLM Optimization Without Data Sharing: A Split Learning Paradigm
O. Basubeit, A. Alenazi, B. Radovič, A. Milibari, M. Canini, Z. Khayyat.
In Proceedings of MetaCom’25, Aug 2025.

abstract
Resource-constrained organizations with vast datasets often face significant challenges in training and fine-tuning large language models (LLMs) due to privacy and compliance requirements. These organizations cannot share their sensitive data but still require external computational resources for model creation. In this paper, we investigate the effectiveness of split learning (SL) techniques for fine-tuning LLMs while maintaining data privacy. We compare three strategies: a centralized approach, where all fine-tuning occurs on a central server, and two split learning strategies (plain and U-shaped), where fine-tuning is distributed between a client and a cloud server, ensuring the client retains control over sensitive data. The U-shaped strategy enhances the plain split learning approach by ensuring labels are not shared with the server, offering stronger privacy protection. Our experiments empirically evaluate these methods, comparing model quality and training efficiency. The results show that all three strategies achieve comparable model performance, evaluated using the loss metric. Although the U-shaped SL incurs higher network transmission costs, it provides enhanced privacy guarantees. This presents a trade-off between privacy and communication efficiency, making U-shaped SL a strong candidate for privacy-preserving LLM fine-tuning in resource-constrained environments.
bibtex
```
@inproceedings{basubeit2025llmoptsl,
  author = {Omar Basubeit and Abdulaziz Alenazi and Boris Radovič and Ameera Milibari and Marco Canini and Zuhair Khayyat},
  title = {{LLM Optimization Without Data Sharing: A Split Learning Paradigm}},
  booktitle = {Proceedings of MetaCom'25},
  year = {2025},
  month = {Aug} 
}
```
Idle No More: Boosting Distributed Pipeline Training via FluidPipe
M. Aljahdali, M. Canini.
In Proceedings of HotDiML’25, Jul 2025.

abstract
Large foundation models, such as language models with billions of parameters, have become pivotal in various fields, including natural language processing, healthcare, and AI-driven applications. These models require many GPUs to train or fine-tune, which presents a challenge when obtaining GPUs within the same cloud datacenter or region is not viable. Moreover, organizations with limited computational resources may acquire cloud-based resources but cannot fully offload to the cloud due to privacy, legal or data sovereignty concerns.
In this paper, we propose FluidPipe, a distributed parallel training algorithm that is tailored for geo-distributed resources. Our algorithm takes inspiration from Federated Group Knowledge Transfer (FedGKT), a federated learning method that splits model training between clients and the server. Our approach greatly reduces the runtime over pipeline parallelism at the cost of slightly changing the learning objective. While effective, our approach still suffers from idle times due to synchronization and data communication delays, especially in geo-distributed scenarios where latency and bandwidth issues lead to substantial idle times. To address this we incorporate idle training, that is, carefully orchestrated additional training steps during idle times enhancing resource utilization.
bibtex
```
@inproceedings{aljahdali2025fluidpipe,
  author = {Mohammed Aljahdali and Marco Canini},
  title = {{Idle No More: Boosting Distributed Pipeline Training via FluidPipe}},
  booktitle = {Proceedings of HotDiML'25},
  year = {2025},
  month = {Jul} 
}
```
Analysis and Tuning of Knowledge Distillation for Efficient Collaborative Learning (Technical Report)
N. Alballa, A. M. Abdelmoniem, M. Canini.
In Proceedings of 2025 IEEE 45th International Conference on Distributed Computing Systems Workshops (ICDCSW), Jul 2025.

abstract
Knowledge Distillation (KD) has become a crucial technique for efficient collaborative learning in distributed systems, especially under data and system heterogeneity. Despite considerable attention, few works ystematically examine tuning strategies for KD hyperparameters or provide comprehensive empirical comparisons across different data distributions. In this paper, we conduct an in-depth study of multiple KD methods—Vanilla KD, Deep Mutual Learning (DML), Data Partitioning KD (DP-KD), and our Tuned KD approach—using diverse data partitions and transfer sets. We show that (1) hyperparameter tuning dramatically boosts KD performance, especially when teacher–student models have disparate accuracies, (2) different transfer set sizes and labeling conditions substantially affect outcomes. Our findings serve as practical guidelines for effectively applying KD to collaborative learning tasks.
bibtex
```
@inproceedings{alballa2025ekd,
title = {{Analysis and Tuning of Knowledge Distillation for Efficient Collaborative Learning}},
author = {Alballa, Norah and Ahmed M. Abdelmoniem and Canini, Marco},
booktitke = {The 2025 IEEE 45th International Conference on Distributed Computing Systems Workshops (ICDCSW)},
year = {2025},
}
```
Decentralized Personalized Federated Learning
S. Kharrat, M. Canini, S. Horvath.
In Proceedings of AISTATS’25, May 2025.

abstract
This work addresses the challenges of data heterogeneity and communication constraints in decentralized federated learning (FL). We introduce decentralized personalized FL (DPFL), a bi-level optimization framework that enhances personalized FL by leveraging combinatorial relationships among clients, enabling fine-grained and targeted collaborations. By employing a constrained greedy algorithm, DPFL constructs a collaboration graph that guides clients in choosing suitable collaborators, enabling personalized model training tailored to local data while respecting a fixed and predefined communication and resource budget. Our theoretical analysis demonstrates that the proposed objective for constructing the collaboration graph yields superior or equivalent performance compared to any alternative collaboration structures, including pure local training. Extensive experiments across diverse datasets show that DPFL consistently outperforms existing methods, effectively handling non-IID data, reducing communication overhead, and improving resource efficiency in real-world decentralized FL scenarios. The code can be accessed at: https://github.com/salmakh1/DPFL.
bibtex
```
@inproceedings{kharrat2025dpfl,
title = {{Decentralized Personalized Federated Learning}},
author = {Kharrat, Salma and Canini, Marco and Horv{\'a}th, Samuel},
booktitle={The 28th International Conference on Artificial Intelligence and Statistics (AISTATS)},
year={2025},
url={https://openreview.net/forum?id=cTAI6fpB9q}
}
```
Query-based Knowledge Transfer for Heterogeneous Learning Environments
N. Alballa, W. Zhang, Z. Liu, A. M. Abdelmoniem, M. Elhoseiny, M. Canini.
In Proceedings of ICLR’25, Apr 2025.

abstract
Decentralized collaborative learning under data heterogeneity and privacy constraints has rapidly advanced. However, existing solutions like federated learning, ensembles, and transfer learning, often fail to adequately serve the unique needs of clients, especially when local data representation is limited. To address this issue, we propose a novel framework called Query-based Knowledge Transfer (QKT) that enables tailored knowledge acquisition to fulfill specific client needs without direct data exchange. It employs a data-free masking strategy to facilitate the communication-efficient query-focused knowledge transformation while refining task-specific parameters to mitigate knowledge interference and forgetting. Our experiments, conducted on both standard and clinical benchmarks, show that QKT significantly outperforms existing collaborative learning methods by an average of 20.91% points in single-class query settings and an average of 14.32% points in multi-class query scenarios. Further analysis and ablation studies reveal that QKT effectively balances the learning of new and existing knowledge, showing strong potential for its application in decentralized learning.
bibtex
```
@inproceedings{alballa2025qkt,
title={Query-based Knowledge Transfer for Heterogeneous Learning Environments},
author={Norah Alballa and Wenxuan Zhang and Ziquan Liu and Ahmed M. Abdelmoniem and Mohamed Elhoseiny and Marco Canini},
booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
year={2025},
url={https://openreview.net/forum?id=XKv29sMyjF}
}
```
Towards a Unified Framework for Split Learning
B. Radovič, M. Canini, S. Horváth, V. Pejović, P. Vepakomma.
In Proceedings of EuroMLSys’25, Mar 2025.

abstract
Split Learning (SL) is a principled approach for training models on data distributed across multiple devices without sharing training data. While SL emerged as an alternative to federated learning to reduce the compute burden on devices, it also enables a more fair redistribution of work between edge devices and the server. Despite its potential, there is no unified framework for implementing and deploying SL algorithms, leaving several research questions underexplored. To address this gap, we introduce SplitBud, a versatile framework to implement virtually any SL algorithm. By supporting various variants of SL, SplitBud facilitates research and development in the field. In this paper, we demonstrate its flexibility by implementing and evaluating multiple SL algorithms, and we discuss future directions for the field.
bibtex
```
@inproceedings{radovic2025splitbud,
  author = {Boris Radovič and Marco Canini and Samuel Horváth and Veljko Pejović and Praneeth Vepakomma},
  title = {{Towards a Unified Framework for Split Learning}},
  booktitle = {Proceedings of EuroMLSys'25},
  year = {2025},
  month = {Mar} 
}
```
OptimusNIC: Offloading Optimizer State to SmartNICs for Efficient Large-Scale AI Training
A. Rebai, M. Canini.
In Proceedings of EuroMLSys’25, Mar 2025.

abstract
LLM training is a demanding workload that requires careful coordination among hardware components, ensuring high GPU utilization, rapid data transfer, and minimal memory overhead. This synergy leads to an efficient AI training system. Scaling LLMs encounters memory constraint challenges, particularly the optimizer state, whose size in bytes scales with a factor of 12× the number of model parameters. In this paper, we explore the impact of offloading the optimizer state and parameter update operation to a SmartNIC. OptimusNIC reduces communication overhead between GPUs, minimizes GPU computation (by handling the optimizer step externally), and significantly decreases memory requirements. These improvements allow GPUs to accommodate additional model layers and efficiently train and fine-tune models with minimal resources. In addition to examining the impact of OptimusNIC, we evaluate DeepSpeed’s ZeRO-Infinity framework, identifying key limitations associated with using the CPU as the offloading target and we demonstrate why OptimusNIC offers a more efficient alternative. Furthermore, we analyze the usability of one of the target platforms for OptimusNIC: the NVIDIA BlueField-2 DPU.
bibtex
```
@inproceedings{rebai2025optimusnic,
  author = {Achref Rebai and Marco Canini},
  title = {{OptimusNIC: Offloading Optimizer State to SmartNICs for Efficient Large-Scale AI Training}},
  booktitle = {Proceedings of EuroMLSys'25},
  year = {2025},
  month = {Mar} 
}
```
Global-QSGD: Allreduce-Compatible Quantization for Distributed Learning with Theoretical Guarantees (Technical Report)
J. Xin, M. Canini, P. Richtarik, S. Horvath.
In Proceedings of EuroMLSys’25, Mar 2025.

abstract
Distributed training enables large-scale deep learning, but suffers from high communication overhead, especially as models and datasets grow. Gradient compression, particularly quantization, is a promising approach to mitigate this bottleneck. However, existing quantization schemes are often incompatible with Allreduce, the dominant communication primitive in distributed deep learning, and many prior solutions rely on heuristics without theoretical guarantees. We introduce Global-QSGD, an Allreduce-compatible gradient quantization method that leverages global norm scaling to reduce communication overhead while preserving accuracy. Global-QSGD is backed by rigorous theoretical analysis, extending standard unbiased compressor frameworks to establish formal convergence guarantees. Additionally, we develop a performance model to evaluate its impact across different hardware configurations. Extensive experiments on NVLink, PCIe, and large-scale cloud environments show that Global-QSGD accelerates distributed training by up to 3.51× over baseline quantization methods, making it a practical and efficient solution for large-scale deep learning workloads.
bibtex
```
@inproceedings{xin2025gqsgd,
  author = {Xin, Jihao and Canini, Marco and Richt{\'a}rik, Peter and Horv{\'a}th, Samuel},
  title = {{Global-QSGD: Practical Floatless Quantization for Distributed Learning with Theoretical Guarantees}},
  booktitle = {Proceedings of EuroMLSys'25},
  year = {2025},
  month = {Mar} 
}
```
Rethinking cloud abstractions for tenant-provider cooperative optimization of AI workloads
M. Canini, R. Bianchini, Í. Goiri, D. Kostić, P. Pietzuch.
arXiv preprint arXiv:2501.09562

abstract
AI workloads, often hosted in multi-tenant cloud environments, require vast computational resources but suffer inefficiencies due to limited tenant-provider coordination. Tenants lack infrastructure insights, while providers lack workload details to optimize tasks like partitioning, scheduling, and fault tolerance. We propose the HarmonAIze project to redefine cloud abstractions, enabling cooperative optimization for improved performance, efficiency, resiliency, and sustainability. This paper outlines key opportunities, challenges, and a research agenda to realize this vision
bibtex
```
@misc{canini2025aiworkloads,
    title={Rethinking cloud abstractions for tenant-provider cooperative optimization of AI workloads}, 
    author={Marco Canini and Ricardo Bianchini and Íñigo Goiri and Dejan Kostić and Peter Pietzuch},
    year={2025},
    eprint={2501.09562},
    archivePrefix={arXiv},
    primaryClass={cs.DC},
    url={https://arxiv.org/abs/2501.09562}, 
}
```
Protecting Confidentiality, Privacy and Integrity in Collaborative Learning
D. Chen, A. Dethise, I. E. Akkus, I. Rimac, K. Satzke, A. Koskela, M. Canini, W. Wang, R. Chen.
arXiv preprint arXiv:2412.08534

abstract
A collaboration between dataset owners and model owners is needed to facilitate effective machine learning (ML) training. During this collaboration, however, dataset owners and model owners want to protect the confidentiality of their respective assets (i.e., datasets, models and training code), with the dataset owners also caring about the privacy of individual users whose data is in their datasets. Existing solutions either provide limited confidentiality for models and training code, or suffer from privacy issues due to collusion.
We present Citadel++, a scalable collaborative ML training system designed to simultaneously protect the confidentiality of datasets, models and training code, as well as the privacy of individual users. Citadel++ enhances differential privacy techniques to safeguard the privacy of individual user data while maintaining model utility. By employing Virtual Machine-level Trusted Execution Environments (TEEs) and improved integrity protection techniques through various OS-level mechanisms, Citadel++ effectively preserves the confidentiality of datasets, models and training code, and enforces our privacy mechanisms even when the models and training code have been maliciously designed. Our experiments show that Citadel++ provides privacy, model utility and performance while adhering to confidentiality and privacy requirements of dataset owners and model owners, outperforming the state-of-the-art privacy-preserving training systems by up to 543x on CPU and 113x on GPU TEEs.
bibtex
```
@misc{chen2024citadelpp,
    title={Protecting Confidentiality, Privacy and Integrity in Collaborative Learning},
    author={Dong Chen and Alice Dethise and Istemi Ekin Akkus and Ivica Rimac and Klaus Satzke and Antti Koskela and Marco Canini and Wei Wang and Ruichuan Chen},
    year={2024},
    eprint={2412.08534},
    archivePrefix={arXiv},
    primaryClass={cs.DC},
    url={https://arxiv.org/abs/2412.08534},
}
```
Review and Comparative Evaluation of Resource-Adaptive Collaborative Training for Heterogeneous Edge Devices
B. Radovič, M. Canini, V. Pejović.
In ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Dec 2024.

abstract
Growing concerns about centralized mining of personal data threatens to stifle further proliferation of machine learning (ML) applications. Consequently, a recent trend in ML training advocates for a paradigm shift – moving the computation of ML models from a centralized server to a federation of edge devices owned by the users whose data is to be mined. Though such decentralization aims to alleviate concerns related to raw data sharing, it introduces a set of challenges due to the hardware heterogeneity among the devices possessing the data. The heterogeneity may, in the most extreme cases, impede the participation of low-end devices in the training or even prevent the deployment of the ML model to such devices.
Recent research in distributed collaborative machine learning (DCML) promises to address the issue of ML model training over heterogeneous devices. However, the actual extent to which the issue is solved remains unclear, especially as an independent investigation of the proposed methods’ performance in realistic settings is missing. In this paper, we present a detailed survey and an evaluation of algorithms that aim to enable collaborative model training across diverse devices. We explore approaches that harness three major strategies for DCML, namely Knowledge Distillation, Split Learning, and Partial Training, and we conduct a thorough experimental evaluation of these approaches on a real-world testbed of 14 heterogeneous devices. Our analysis compares algorithms based on the resulting model accuracy, memory consumption, CPU utilization, network activity, and other relevant metrics, and provides guidelines for practitioners as well as pointers for future research in DCML.
bibtex
```
@article{radovič2024dcml,
title = {{Review and Comparative Evaluation of Resource-Adaptive Collaborative Training for Heterogeneous Edge Devices}},
author = {Boris Radovič and Marco Canini and Veljko Pejović},
journal = {ACM Trans. Model. Perform. Eval. Comput. Syst.},
year = {2024},
doi = {10.1145/3708983},
url = {https://doi.org/10.1145/3708983}
}
```
Where is the Testbed for my Federated Learning Research?
J. Božič, A. R. Faustino, B. Radovič, M. Canini, V. Pejović.
In Proceedings of ACM/IEEE Symposium on Edge Computing (SEC’24), Dec 2024.

abstract
Progressing beyond centralized AI is of paramount importance, yet, distributed AI solutions, in particular various federated learning (FL) algorithms, are often not comprehensively assessed, which prevents the research community from identifying the most promising approaches and practitioners from being convinced that a certain solution is deployment-ready. The largest hurdle towards FL algorithm evaluation is the difficulty of conducting real-world experiments over a variety of FL client devices and different platforms, with different datasets and data distribution, all while assessing various dimensions of algorithm performance, such as inference accuracy, energy consumption, and time to convergence, to name a few. In this paper, we present CoLExT, a real-world testbed for FL research. CoLExT is designed to streamline experimentation with custom FL algorithms in a rich testbed configuration space, with a large number of heterogeneous edge devices, ranging from single-board computers to smartphones, and provides real-time collection and visualization of a variety of metrics through automatic instrumentation. According to our evaluation, porting FL algorithms to CoLExT requires minimal involvement from the developer, and the instrumentation introduces minimal resource usage overhead. Furthermore, through an initial investigation involving popular FL algorithms running on CoLExT, we reveal previously unknown trade-offs, inefficiencies, and programming bugs.
bibtex
```
@INPROCEEDINGS{božič2024colext,
title = {{Where is the Testbed for my Federated Learning Research?}},
author = {Janez Božič and Amândio R. Faustino and Boris Radovič and Marco Canini and Veljko Pejović},
booktitle = {ACM/IEEE Symposium on Edge Computing (SEC)},
year = {2024},
}
```
Continuous Exact Explanations of Neural Networks
A. Dethise, M. Canini.
In Proceedings of ICDM’24, Dec 2024.

abstract
Accurate explanations of how a trained neural network (NN) behaves are desirable for a wide range of labor-intensive activities, including troubleshooting, validation, and understanding performance issues or identifying biases.
We address the problem of explaining feedforward piecewise NNs (such as CNNs with ReLU) by breaking them down into their linear components. Automatic encoding of the NN structure into linear program constraints is used to extract continuous exact explanations, which help interpret model behavior over continuous regions, provide model descriptions that convey the importance of input features and answer model queries that analyze the feature contributions under user-defined constraints.
Our examples show that we can extract explanations for NNs with a moderate number of layers without relying on approximations. We demonstrate that the high-level explanations can help understand the outputs of NNs and the comparative importance of features by observing the linear models. We also show how explaining continuous inputs prevents certain attacks that have been proposed against existing explainers.
bibtex
```
@INPROCEEDINGS{dethise2024explainer,
title = {{Continuous Exact Explanations of Neural Networks}},
author = {Dethise, Alice and Canini, Marco},
booktitle = {ICDM},
year = {2024}
}
```
ACING: Actor-Critic for Instruction Learning in Black-Box Large Language Models
S. Kharrat, F. Fourati, M. Canini.
arXiv preprint arXiv:2411.12736

abstract
The effectiveness of Large Language Models (LLMs) in solving tasks vastly depends on the quality of the instructions, which often require fine-tuning through extensive human effort. This highlights the need for automated instruction optimization; however, this optimization is particularly challenging when dealing with black-box LLMs, where model parameters and gradients remain inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING leverages an actor-critic-based method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Furthermore, ACING not only recovers but also surpasses human-crafted expert instructions, achieving up to a 39 percentage point improvement against human benchmarks.
bibtex
```
@misc{kharrat2024acingactorcriticinstructionlearning,
    title={ACING: Actor-Critic for Instruction Learning in Black-Box Large Language Models},
    author={Salma Kharrat and Fares Fourati and Marco Canini},
    year={2024},
    eprint={2411.12736},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2411.12736},
}
```
Immediate Communication for Distributed AI Tasks
J. Xin, S. Bae, K. Park, M. Canini, C. Hwang.
The 2nd Workshop on Hot Topics in System Infrastructure (HotInfra), Nov 2024.

abstract
Large AI models have necessitated efficient communication strategies across multi-GPU and multi-node infrastructures due to their increasing complexity. Current methods focusing on inter-operator overlaps fail when dependencies exist, leading to underutilized hardware. DistFuse addresses this by enabling fine-grained overlapping of computation and communication, triggering communication as soon as data is ready, and reducing latency. Initial experiments show up to 44.3% reduction in communication latency of Llama3-70B inference on a single node, demonstrating its potential to accelerate diverse AI workloads.
bibtex
```
@INPROCEEDINGS{xin2024distfuse,
title = {{Immediate Communication for Distributed AI Tasks}},
author = {Jihao Xin and Seongjong Bae and KyoungSoo Park and Marco Canini and Changho Hwang},
booktitle = {Workshop on Hot Topics in System Infrastructure (HotInfra)},
year = {2024},
}
```
FilFL: Client Filtering for Optimized Client Participation in Federated Learning (Technical Report)
F. Fourati, S. Kharrat, V. Aggarwal, M.-S. Alouini, M. Canini.
In Proceedings of ECAI’24, Oct 2024.

abstract
Federated learning, an emerging machine learning paradigm, enables clients to collaboratively train a model without exchanging local data. Clients participating in the training process significantly impact the convergence rate, learning efficiency, and model generalization. We propose a novel approach, client filtering, to improve model generalization and optimize client participation and training. The proposed method periodically filters available clients to identify a subset that maximizes a combinatorial objective function with an efficient greedy filtering algorithm. Thus, the clients are assessed as a combination rather than individually. We theoretically analyze the convergence of federated learning with client filtering in heterogeneous settings and evaluate its performance across diverse vision and language tasks, including realistic scenarios with time-varying client availability. Our empirical results demonstrate several benefits of our approach, including improved learning efficiency, faster convergence, and up to 10% higher test accuracy than training without client filtering.
bibtex
```
@INPROCEEDINGS{fourati2024filfl,
title = {{FilFL: Client Filtering for Optimized Client Participation in Federated Learning}},
author = {Fourati, Fares and Kharrat, Salma and Aggarwal, Vaneet and Alouini, Mohamed-Slim and Canini, Marco},
booktitle = {ECAI},
year = {2024}
}
```
Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
B. Liu, M. A. Ojewale, Y. Ding, M. Canini.
In Proceedings of APSys’24, Sep 2024.

abstract
We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment along with the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
bibtex
```
@INPROCEEDINGS{liu2024neuronabox,
title = {{Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation}},
author = {Banruo Liu and Mubarak Adetunji Ojewale and Yuhan Ding and Marco Canini},
booktitle = {APSys},
year = {2024}
}
```
OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs
T. Gu, J. Fei, M. Canini.
In Proceedings of NAIC’24, Aug 2024.

abstract
AllReduce is a collective communication pattern commonly used in Distributed Deep Learning (DDL) and High Performance Comput- ing (HPC). Sparse AllReduce, which compresses the data transmit- ted, achieves significant acceleration on specific workloads. How- ever, compression introduces a non-negligible performance over- head. Therefore, we propose the OmNICreduce algorithm, an effi- cient inter-node sparse AllReduce method, as well as its implemen- tation, OmNICCL. It utilizes Direct Cache Access (DCA) to achieve zero-overhead lossless compression and employs SmartNICs for aggregation on the data plane. We demonstrate that our method can provide up to a 7.24× speedup over conventional dense AllRe- duce methods under a 100Gbps RoCEv2 network and 1.76-17.37× performance improvement over state-of-the-art implementations when performing sparse AllReduce.
bibtex
```
@INPROCEEDINGS{gu2024omnicreduce,
title = {{OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs}},
author = {Tongzhou Gu and Jiawei Fei and Marco Canini},
booktitle = {NAIC},
year = {2024}
}
```
SqueezeNIC: Low-Latency In-NIC Compression for Distributed Deep Learning
A. Rebai, M. A. Ojewale, A. Ullah, M. Canini, S. A. Fahmy.
In Proceedings of NAIC’24, Aug 2024.

abstract
To alleviate the communication bottleneck of distributed deep learning training, several data compression algorithms have been proposed. However, these algorithms introduce computational overhead and resource allocation concerns on CPUs and GPUs. In this paper, we introduce SqueezeNIC, an FPGA-based Network Interface Card (NIC) that offloads communication compression from CPUs/GPUs, bridging a high bandwidth intra-node network with a high bandwidth inter-node network. It enables better overlap of gradient communication and computation to further reduce training time per iteration in distributed training. Our evaluations shows that SqueezeNIC achieves line rate compression and can speed up training by up to a factor of 1.21×, compared to baseline approaches.
bibtex
```
@INPROCEEDINGS{rebai2024squeezenic,
title = {{SqueezeNIC: Low-Latency In-NIC Compression for Distributed Deep Learning}},
author = {Achref Rebai and Mubarak Adetunji Ojewale and Anees Ullah and Marco Canini and Suhaib A. Fahmy},
booktitle = {NAIC},
year = {2024}
}
```
Train your cake and eat it too! Repurposing collaborative training to tailor LLMs to private data without sharing
B. Radovič, M. Aljahdali, M. Canini, V. Pejović, Z. Khayyat.
Workshop on Efficient Systems for Foundation Models II @ ICML2024, Jul 2024.

abstract
In the emerging field of large language models (LLMs), a significant challenge arises when organizations with vast datasets lack the computational resources to independently train and fine-tune models. This issue stems from privacy, compliance, and resource constraints: organizations cannot share their sensitive data but still need external computational assistance for model training. In this paper, we implement, enhance, and empirically compare several methods, including Split Learning (SL) and select Federated Learning (FL) methods, which enable data-rich yet compute-poor clients to offload LLM training without sharing raw data. Our study evaluates these methods across multiple dimensions, including model quality and training time.
bibtex
```
@inproceedings{radovic2024collabLLM,
title = {{Train your cake and eat it too! Repurposing collaborative training to tailor LLMs to private data without sharing}},
author = {Boris Radovič, Mohammed Aljahdali, Marco Canini, Veljko Pejović, Zuhair Khayyat},
booktitle = {Workshop on Efficient Systems for Foundation Models II @ ICML2024},
year = {2024},
url = {https://openreview.net/forum?id=FGupKd365r}
}
```
Towards LLM-Assisted System Testing for Microservices
M. Almutawa, Q. Ghabrah, M. Canini.
In Proceedings of AI-DCS’24, Jul 2024.

abstract
As modern applications are being designed in a distributed, Microservices Architecture (MSA), it becomes increasingly difficult to debug and test those systems. Typically, it is the role of software testing engineers or Quality Assurance (QA) engineers to write software tests to ensure the reliability of applications, but such a task can be labor-intensive and time-consuming. In this paper, we explore the potential of Large Language Models (LLMs) in assisting software engineers in generating test cases for software systems, with a particular focus on performing end-to-end (black-box) system testing on web-based MSA applications. We present our experience building Kashef, a software testing tool that utilizes the advanced capabilities of current LLMs in code generation and reasoning, and builds on top of the concept of communicative agents.
bibtex
```
@INPROCEEDINGS{Almutawa.AI-DCS24,
  author = {Almutawa, Mustafa and Ghabrah, Qusai Canini, Marco},
  title = {{Towards LLM-Assisted System Testing for Microservices}},
  booktitle = {Proceedings of AI-DCS'24},
  year = {2024},
  month = {Jul}
}
```
Flashback: Understanding and Mitigating Forgetting in Federated Learning
M. Aljahdali, A. M. Abdelmoniem, M. Canini, S. Horvath.
arXiv preprint arXiv:2402.05558

abstract
In Federated Learning (FL), forgetting, or the loss of knowledge across rounds, hampers algorithm convergence, particularly in the presence of severe data heterogeneity among clients. This study explores the nuances of this issue, emphasizing the critical role of forgetting in FL’s inefficient learning within heterogeneous data contexts. Knowledge loss occurs in both client-local updates and server-side aggregation steps; addressing one without the other fails to mitigate forgetting. We introduce a metric to measure forgetting granularly, ensuring distinct recognition amid new knowledge acquisition. Leveraging these insights, we propose Flashback, an FL algorithm with a dynamic distillation approach that is used to regularize the local models, and effectively aggregate their knowledge. Across different benchmarks, Flashback outperforms other methods, mitigates forgetting, and achieves faster round-to-target-accuracy, by converging in 6 to 16 rounds.
bibtex
```
@MISC{flashback,
title = {{Flashback: Understanding and Mitigating Forgetting in Federated Learning}},
author = {Aljahdali, Mohammed and Abdelmoniem, Ahmed M. and Canini, Marco and Horv{\'a}th, Samuel},
publisher = {arXiv},
doi = {10.48550/ARXIV.2402.05558},
url = {https://arxiv.org/abs/2402.05558},
year = {2024},
}
```
Kimad: Adaptive Gradient Compression with Bandwidth Awareness
J. Xin, I. Ilin, S. Zhang, M. Canini, P. Richtarik.
In Proceedings of DistributedML’23, Dec 2023.

abstract
In distributed training, communication often emerges as a bottleneck. In response, we introduce Kimad, a solution that offers adaptive gradient compression. By consistently monitoring bandwidth, Kimad refines compression ratios to match specific neural network layer requirements. Our exhaustive tests and proofs confirm Kimad’s outstanding performance, establishing it as a benchmark in adaptive compression for distributed deep learning.
bibtex
```
@INPROCEEDINGS{Xin.DistributedML23,
  author = {Xin, Jihao and Ilin, Ivan and Zhang, Shunkang and Canini, Marco and Richt{\'a}rik, Peter},
  title = {{Kimad: Adaptive Gradient Compression with Bandwidth Awareness}},
  booktitle = {Proceedings of DistributedML'23},
  year = {2023},
  month = {Dec}
}
```
On Detecting Biased Predictions with Post-hoc Explanation Methods
M. Ruggeri, A. Dethise, M. Canini.
In Proceedings of SAFE’23, Dec 2023.

abstract
We develop a methodology for the analysis of machine learning (ML) models to detect and understand biased decisions and apply it to two specific scenarios. In particular, we show how analyzing model predictions across the dataset, comparing models trained on different subsets of the original data, and applying model-agnostic post-hoc explanation tools can help identify bias in a model in general as well as in specific instances. Further, we consider several definitions of bias and fairness, and show how each provides a different interpretation of the model decisions.
Our results show that the analysis of models through the lens of statistical analysis and post-hoc explanations helps to detect and understand bias. We also observe that post-hoc explanations often fail to detect individual biased instances, and caution against using this category of tools to guarantee model fairness. Finally, we provide insights on how this analysis can help understand the origin and shape of bias.
bibtex
```
@INPROCEEDINGS{Ruggeri.SAFE23,
  author = {Ruggeri, Matteo and Dethise, Alice and Canini, Marco},
  title = {{On Detecting Biased Predictions with Post-hoc Explanation Methods}},
  booktitle = {Proceedings of SAFE'23},
  year = {2023},
  month = {Dec}
}
```
A First Look at the Impact of Distillation Hyper-Parameters in Federated Knowledge Distillation
N. Alballa, M. Canini.
In Proceedings of EuroMLSys’23, May 2023.

abstract
Knowledge distillation has been known as a useful way for model compression. It has been recently adopted in the distributed training domain, such as federated learning, as a way to transfer knowledge between already pre-trained models. Knowledge distillation in distributed settings promises advantages, including significantly reducing the communication overhead and allowing heterogeneous model architectures. However, distillation is still not well studied and understood in such settings, which hinders the possible gains. We bridge this gap by performing an experimental analysis of the distillation process in the distributed training setting, mainly with non-IID data. We highlight some elements that require special considerations when transferring knowledge between already pre-trained models: the transfer set, the temperature, the weight, and the positioning. Appropriately tuning these hyper-parameters can remarkably boost learning outcomes. In our experiments, around two-thirds of the participants require settings other than commonly used default settings in literature, and appropriate tuning can reach more than five times improvement on average.
bibtex
```
@INPROCEEDINGS{Alballa.EuroMLSys23,
  author = {Norah Alballa and Marco Canini},
  title = {{A First Look at the Impact of Distillation Hyper-Parameters in Federated Knowledge Distillation}},
  booktitle = {Proceedings of EuroMLSys'23},
  year = {2023},
  month = {May} 
}
```
REFL: Resource-Efficient Federated Learning
A. M. Abdelmoniem, A. N. Sahu, M. Canini, S. A. Fahmy.
In Proceedings of EuroSys’23, May 2023.

abstract
Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve the fairness of the selection process; however, this can result in inefficient use of resources and lower quality training. In this work, we systematically address the question of resource efficiency in FL, showing the benefits of intelligent participant selection, and incorporation of updates from straggling participants. We demonstrate how these factors enable resource efficiency while also improving trained model quality.
bibtex
```
@INPROCEEDINGS{Abdelmoniem.EuroSys23,
  author = {Abdelmoniem, Ahmed M. and Sahu, Atal Narayan and Canini, Marco and Fahmy, Suhaib A.},
  title = {{REFL: Resource-Efficient Federated Learning}},
  booktitle = {Proceedings of EuroSys'23},
  year = {2023},
  month = {May} 
}
```
In-Network Aggregation with Transport Layer Transparency for Distributed Training
S. Liu, Q. Wang, J. Zhang, W. Wu, Q. Lin, Y. Liu, M. Xu, M. Canini, R. Cheung, J. He.
In Proceedings of ASPLOS’23, Mar 2023.

abstract
Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks to replace the transport layer. The INA-oriented network stack cannot take advantage of the state-of-the-art performant transport layer implementation, and also causes complexity in system development and operation.
We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer. The switch performs aggregation operations but preserves data transmission connections. The host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce achieves performance gains from both INA and RoCE: linear scalability, traffic reduction, and bandwidth freeing-up from INA — high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise parallel all-reduce based on NetReduce to make use of intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch. We compare NetReduce with existing programmable switch-based solutions and justify the FPGA-based design choice. We evaluate NetReduce’s performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce inter-operates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., 70% for AlexNet), reduces CPU overheads (e.g., only one core for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption making 12.3-57.9% more performance acceleration).
bibtex
```
@INPROCEEDINGS{Liu.ASPLOS23,
  author = {Liu, Shuo and Wang, Qiaoling and Zhang, Junyi and Wu, Wenfei and Lin, Qinliang and Liu, Yao and Xu, Meng and Canini, Marco and Cheung, Ray Chak Chung and He, Jianfei},
  title = {{In-Network Aggregation with Transport Layer Transparency for Distributed Training}},
  booktitle = {Proceedings of ASPLOS'23},
  year = {2023},
  month = {Mar} 
}
```
A Comprehensive Empirical Study of Heterogeneity in Federated Learning
A. M. Abdelmoniem, C.-Y. Ho, P. Papageorgiou, M. Canini.
IEEE Internet of Things Journal, 2023.

abstract
Federated learning (FL) is becoming a popular paradigm for collaborative learning over distributed, private datasets owned by non-trusting entities. FL has seen successful deployment in production environments, and it has been adopted in services such as virtual keyboards, auto-completion, item recommendation, and several IoT applications. However, FL comes with the challenge of performing training over largely heterogeneous datasets, devices, and networks that are out of the control of the centralized FL server. Motivated by this inherent challenge, we aim to empirically characterize the impact of device and behavioral heterogeneity on the trained model. We conduct an extensive empirical study spanning nearly 1.5K unique configurations on five popular FL benchmarks. Our analysis shows that these sources of heterogeneity have a major impact on both model quality and fairness, causing up to 4.6× and 2.2× degradation in the quality and fairness, respectively, thus shedding light on the importance of considering heterogeneity in FL system design.
bibtex
```
@ARTICLE{Abdelmoniem.IoTJ23,
author = {Abdelmoniem, Ahmed M. and Ho, Chen-Yu and Papageorgiou, Pantelis and Canini, Marco},
title = {{A Comprehensive Empirical Study of Heterogeneity in Federated Learning}},
journal = {IEEE Internet of Things Journal},
year = {2023},
doi = {10.1109/JIOT.2023.3250275}
}
```
Natural Compression for Distributed Deep Learning
S. Horvath, C.-Y. Ho, L. Horvath, A. N. Sahu, M. Canini, P. Richtarik
In Proceedings of MSML’22, Aug 2022.

abstract
Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: compression (C_nat). Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a “natural” way by ignoring the mantissa. We show that compared to no compression, C_nat increases the second moment of the compressed vector by not more than the tiny factor ⁹⁄₈, which means that the effect of C_nat on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communications savings enabled by C_nat are substantial, leading to 3-4× improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize C_nat to natural dithering, which we prove is exponentially better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer new state-of-the-art both in theory and practice.
bibtex
```
@INPROCEEDINGS{Horvath.MSML22,
  author = {Horv{\'a}th, Samuel and Ho, Chen-Yu and Horv{\'a}th, \v{L}udov{\'i}t and Sahu, Atal Narayan and Canini, Marco and Richt{\'a}rik, Peter},
  title = {{Natural Compression for Distributed Deep Learning}},
  booktitle = {Proceedings of MSML'22},
  year = {2022},
  month = {Aug}
}
```
Empirical Analysis of Federated Learning in Heterogeneous Environments
A. M. Abdelmoniem, C.-Y. Ho, P. Papageorgiou, M. Canini.
In Proceedings of EuroMLSys’22, Apr 2022.

abstract
Federated learning (FL) is becoming a popular paradigm for collaborative learning over distributed, private datasets owned by non-trusting entities. FL has seen successful deployment in production environments, and it has been adopted in services such as virtual keyboards, auto-completion, item recommendation, and several IoT applications. However, FL comes with the challenge of performing training over largely heterogeneous datasets, devices, and networks that are out of the control of the centralized FL server. Motivated by this inherent setting, we make a first step towards characterizing the impact of device and behavioral heterogeneity on the trained model. We conduct an extensive empirical study spanning close to 1.5K unique configurations on five popular FL benchmarks. Our analysis shows that these sources of heterogeneity have a major impact on both model performance and fairness, thus shedding light on the importance of considering heterogeneity in FL system design.
bibtex
```
@INPROCEEDINGS{Abdelmoniem.EuroMLSys22,
  author = {Ahmed M. Abdelmoniem and Chen-Yu Ho and Pantelis Papageorgiou and Marco Canini},
  title = {{Empirical Analysis of Federated Learning in Heterogeneous Environments}},
  booktitle = {Proceedings of EuroMLSys'22},
  year = {2022},
  month = {Apr} 
}
```
Unlocking the Power of Inline Floating-Point Operations on Programmable Switches
Y. Yuan, O. Alama, J. Fei, J. Nelson, D. R. K. Ports, A. Sapio, M. Canini, N. S. Kim.
In Proceedings of NSDI’22, Apr 2022.

abstract
The advent of switches with programmable dataplanes has enabled the rapid development of new network functionality, as well as providing a platform for acceleration of a broad range of application-level functionality. However, existing switch hardware was not designed with application acceleration in mind, and thus applications requiring operations or datatypes not used in traditional network protocols must resort to expensive workarounds. Applications involving floating point data, including distributed training for machine learning and distributed query processing, are key examples.
In this paper, we propose FPISA, a floating point representation designed to work efficiently in programmable switches. We first implement FPISA on an Intel Tofino switch, but find that it has limitations that impact throughput and accuracy. We then propose hardware changes to address these limitations based on the open-source Banzai switch architecture, and synthesize them in a 15-nm standard-cell library to demonstrate their feasibility. Finally, we use FPISA to implement accelerators for training for machine learning as an example application, and evaluate its performance on a switch implementing our changes using emulation. We find that FPISA allows distributed training to use 1-3 fewer CPU cores and provide up to 85.9% better throughput than SwitchML in a CPU-constrained environment.
bibtex
```
@INPROCEEDINGS{Yuan.NSDI22,
  title= {{Unlocking the Power of Inline Floating-Point Operations on Programmable Switches}},
  author={Yuan, Yifan and Alama, Omar and Fei, Jiawei and Nelson, Jacob and Ports, Dan R. K. and Sapio, Amedeo and Canini, Marco and Kim, Nam Sung},
  booktitle = {Proceedings of NSDI'22},
  year = {2022},
  month = {Apr}
}
```
Direct Nonlinear Acceleration
A. Dutta, E. H. Bergou, Y. Xiao, M. Canini, P. Richtarik
EURO Journal on Computational Optimization, 10, 2022.

abstract
Optimization acceleration techniques such as momentum play a key role in state-of-the-art machine learning algorithms. Recently, generic vector sequence extrapolation techniques, such as regularized nonlinear acceleration (RNA) of Scieur et al. [22], were proposed and shown to accelerate fixed point iterations. In contrast to RNA which computes extrapolation coefficients by (approximately) setting the gradient of the objective function to zero at the extrapolated point, we propose a more direct approach, which we call direct nonlinear acceleration (DNA). In DNA, we aim to minimize (an approximation of) the function value at the extrapolated point instead. We adopt a regularized approach with regularizers designed to prevent the model from entering a region in which the functional approximation is less precise. While the computational cost of DNA is comparable to that of RNA, our direct approach significantly outperforms RNA on both synthetic and real-world datasets. While the focus of this paper is on convex problems, we obtain very encouraging results in accelerating the training of neural networks.
bibtex
```
@ARTICLE{Dutta2022dna,
  author = {Dutta, Aritra and Bergou, El Houcine and Xiao, Yunming and Canini, Marco and Richt{\'a}rik, Peter},
  title = {{Direct Nonlinear Acceleration}},
  journal = {EURO Journal on Computational Optimization},
  volume = {10},
  year = {2022},
  doi = {10.1016/j.ejco.2022.100047}
}
```
Rethinking gradient sparsification as total error minimization
A. N. Sahu, A. Dutta, A. M. Abdelmoniem, T. Banerjee, M. Canini, P. Kalnis.
In Proceedings of NeurIPS’21, Dec 2021.

abstract
Gradient compression is a widely-established remedy to tackle the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-k sparsification, sometimes with k as little as 0.1% of the gradient size, enables training to the same model quality as the uncompressed case for a similar iteration count. From the optimization perspective, we find that Top-k is the communication-optimal sparsifier given a per-iteration k element budget. We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary — one that moves from per-iteration optimality to consider optimality for the entire training.
We identify that the total error — the sum of the compression errors for all iterations — encapsulates sparsification throughout training. Then, we propose a communication complexity model that minimizes the total error under a communication budget for the entire training. We find that the hard-threshold sparsifier, a variant of the Top-k sparsifier with k determined by a constant hard-threshold, is the optimal sparsifier for this model. Motivated by this, we provide convex and non-convex convergence analyses for the hard-threshold sparsifier with error-feedback. We show that hard-threshold has the same asymptotic convergence and linear speedup property as SGD in both the case, and unlike with Top-k sparsifier, has no impact due to data-heterogeneity. Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-k.
bibtex
```
@INPROCEEDINGS{Sahu.NeurIPS21,
  title= {{Rethinking gradient sparsification as total error minimization}},
  author={Atal Narayan Sahu and Aritra Dutta and Ahmed M. Abdelmoniem and Trambak Banerjee and Marco Canini and Panos Kalnis},
  year={2021},
  booktitle = {Proceedings of NeurIPS'21},
}
```
Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning (Technical Report)
J. Fei, C.-Y. Ho, A. N. Sahu, M. Canini, A. Sapio.
In Proceedings of SIGCOMM’21, Aug 2021.

abstract
Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models.We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2×. Even at 100 Gbps, OmniReduce delivers 1.4–2.9× better performance for network-bottlenecked DNNs.
bibtex
```
@INPROCEEDINGS{omnireduce,
  author = {Fei, Jiawei and Ho, Chen-Yu and Sahu, Atal Narayan and Canini, Marco and Sapio, Amedeo},
  title = {{Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning}},
  booktitle = {Proceedings of SIGCOMM'21},
  year = {2021},
  month = {Aug} 
}
```
Towards Mitigating Device Heterogeneity in Federated Learning via Adaptive Model Quantization
A. M. Abdelmoniem, M. Canini.
In Proceedings of EuroMLSys’21, Apr 2021.

abstract
Federated learning (FL) is increasingly becoming the norm for training models over distributed and private datasets. Major service providers rely on FL to improve services such as text auto-completion, virtual keyboards, and item recommendations. Nonetheless, training models with FL in practice requires significant amount of time (days or even weeks) because FL tasks execute in highly heterogeneous environments where devices only have widespread yet limited computing capabilities and network connectivity conditions.
In this paper, we focus on mitigating the extent of device heterogeneity, which is a main contributing factor to training time in FL. We propose AQFL, a simple and practical approach leveraging adaptive model quantization to homogenize the computing resources of the clients. We evaluate AQFL on five common FL benchmarks. The results show that, in heterogeneous settings, AQFL obtains nearly the same quality and fairness of the model trained in homogeneous settings.
bibtex
```
@INPROCEEDINGS{Abdelmoniem.EuroMLSys21,
  author = {Abdelmoniem, Ahmed M. and Canini, Marco},
  title = {{Towards Mitigating Device Heterogeneity in Federated Learning via Adaptive Model Quantization}},
  booktitle = {Proceedings of EuroMLSys'21},
  year = {2021},
  month = {Apr} 
}
```
GRACE: A Compressed Communication Framework for Distributed Machine Learning (Technical Report)
H. Xu, C.-Y. Ho, A. M. Abdelmoniem, A. Dutta, E. H. Bergou, K. Karatsenidis, M. Canini, P. Kalnis.
In Proceedings of ICDCS’21, Jul 2021.

abstract
Powerful computer clusters are used nowadays to train complex deep neural networks (DNN) on large datasets. Distributed training increasingly becomes communication bound. For this reason, many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to argue about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid and low-rank). Next, we propose GRACE, a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate GRACE on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, depending on the underlying communication library and computational cost of compression / decompression, we demonstrate that some methods may be impractical. GRACE and the entire benchmarking suite are available as open-source.
bibtex
```
@INPROCEEDINGS{grace,
  author = {Xu, Hang and Ho, Chen-Yu and Abdelmoniem, Ahmed M. and Dutta, Aritra and Bergou, El Houcine and Karatsenidis, Konstantinos and Canini, Marco and Kalnis, Panos},
  title = {{GRACE: A Compressed Communication Framework for Distributed Machine Learning}},
  booktitle = {Proceedings of ICDCS'21},
  year = {2021},
  month = {Jul}
}
```
AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly
Y. Jin, T. Zhou, L. Zhao, Y. Zhu, C. Guo, M. Canini, A. Krishnamurthy.
In Proceedings of ICLR’21, May 2021.

abstract
The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computing. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: Can we automatically tune the LR over the course of training without human involvement? We propose an efficient method, AutoLRS, which automatically optimizes the LR for each training stage by modeling training dynamics. AutoLRS aims to find an LR applied to every τ steps that minimizes the resulted validation loss. We solve this black-box optimization on the fly by Bayesian optimization (BO). However, collecting training instances for BO requires a system to evaluate each LR queried by BO’s acquisition function for τ steps, which is prohibitively expensive in practice. Instead, we apply each candidate LR for only τ′ ≪ τ steps and train an exponential model to predict the validation loss after τ steps. This mutual-training process between BO and the loss-prediction model allows us to limit the training steps invested in the BO search. We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs for tasks from diverse domains using different optimizers. The LR schedules auto-generated by AutoLRS lead to a speedup of 1.22×, 1.43×, and 1.5× when training ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in their original papers, and an average speedup of 1.31× over state-of-the-art heavily-tuned LR schedules.
bibtex
```
@INPROCEEDINGS{AutoLRS,
  author = {Yuchen Jin and Tianyi Zhou and Liangyu Zhao and Yibo Zhu and Chuanxiong Guo and Marco Canini and Arvind Krishnamurthy},
  title = {{AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly}},
  booktitle = {Proceedings of ICLR'21},
  year = {2021},
  month = {May} 
}
```
Analyzing Learning-Based Networked Systems with Formal Verification
A. Dethise, M. Canini, N. Narodytska.
In Proceedings of INFOCOM’21, May 2021.

abstract
As more applications of (deep) neural networks emerge in the computer networking domain, the correctness and predictability of a neural agent’s behavior for corner case inputs are becoming crucial. Enabling the formal analysis of agents with nontrivial properties, we bridge between specifying intended high-level behavior and expressing low-level statements directly encoded into an efficient verification framework. Our results support that within minutes, one can establish the resilience of a neural network to adversarial attacks on its inputs, as well as formally prove properties that were previously relying on educated guesses. Finally, we also show how formal verification can help create an accurate visual representation of an agent behavior to perform visual inspection and improve its trustworthiness.
bibtex
```
@INPROCEEDINGS{Dethise.INFOCOM21,
  author = {Dethise, Arnaud and Canini, Marco and Narodytska, Nina},
  title = {{Analyzing Learning-Based Networked Systems with Formal Verification}},
  booktitle = {Proceedings of INFOCOM'21},
  year = {2021},
  month = {May} 
}
```
DC2: Delay-aware Compression Control for Distributed Machine Learning
A. M. Abdelmoniem, M. Canini.
In Proceedings of INFOCOM’21, May 2021.

abstract
Distributed training performs data-parallel training of DNN models which is a necessity for increasingly complex models and large datasets. Recent works are identifying major communication bottlenecks in distributed training. These works seek possible opportunities to speed-up the training in systems supporting distributed ML workloads. As communication reduction, compression techniques are proposed to speed up this communication phase. However, compression comes at the cost of reduced model accuracy, especially when compression is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control and network delays in applying compression adaptively. DC2 not only compensates for network variations but can also strike a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module to the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. Our evaluation of different popular CNN models and datasets shows that DC2 improves training speed-ups of up to 41× and 5.3× over baselines with no-compression and uniform compression, respectively.
bibtex
```
@INPROCEEDINGS{Abdelmoniem.INFOCOM21,
  author = {Abdelmoniem, Ahmed M. and Canini, Marco},
  title = {{DC2: Delay-aware Compression Control for Distributed Machine Learning}},
  booktitle = {Proceedings of INFOCOM'21},
  year = {2021},
  month = {May} 
}
```
Scaling Distributed Machine Learning with In-Network Aggregation (Technical Report)
A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, P. Richtarik.
In Proceedings of NSDI’21, Apr 2021.

abstract
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.
bibtex
```
@INPROCEEDINGS{SwitchML,
  author = {Sapio, Amedeo and Canini, Marco and Ho, Chen-Yu and Nelson, Jacob and Kalnis, Panos and Kim, Changhoon and Krishnamurthy, Arvind and Moshref, Masoud and Ports, Dan R. K. and Richt{\'a}rik, Peter},
  title = {{Scaling Distributed Machine Learning with In-Network Aggregation}},
  booktitle = {Proceedings of NSDI'21},
  year = {2021},
  month = {Apr} 
}
```
An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems
A. M. Abdelmoniem, A. Elzanaty, M.-S. Alouini, M. Canini.
In Proceedings of MLSys’21, Apr 2021.

abstract
The recent many-fold increase in the size of deep neural networks makes efficient distributed training challenging. Many proposals exploit the compressibility of the gradients and propose lossy compression techniques to speed up the communication stage of distributed training. Nevertheless, compression comes at the cost of reduced model quality and extra computation overhead. In this work, we design an efficient compressor with minimal overhead. Noting the sparsity of the gradients, we propose to model the gradients as random variables distributed according to some sparsity-inducing distributions (SIDs). We empirically validate our assumption by studying the statistical characteristics of the evolution of gradient vectors over the training process. We then propose Sparsity-Inducing Distribution-based Compression (SIDCo), a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC) while being faster by imposing lower compression overhead. Our extensive evaluation of popular machine learning benchmarks involving both recurrent neural network (RNN) and convolution neural network (CNN) models shows that SIDCo speeds up training by up to ~41.7X, 7.6X, and 1.9X compared to the no-compression baseline, Topk, and DGC compressors, respectively.
bibtex
```
@INPROCEEDINGS{SIDCo,
  author = {Ahmed M. Abdelmoniem and Ahmed Elzanaty and Mohamed-Slim Alouini and Marco Canini},
  title = {{An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems}},
  booktitle = {Proceedings of MLSys'21},
  year = {2021},
  month = {Apr}
}
```
Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning
R. R. Gajjala, S. Banchhor, A. M. Abdelmoniem, A. Dutta, M. Canini, P. Kalnis.
In Proceedings of DistributedML’20, Dec 2020.

abstract
Distributed stochastic algorithms, equipped with gradient compression techniques, such as codebook quantization, are becoming increasingly popular and considered state-of-the-art in training large deep neural network (DNN) models. However, communicating the quantized gradients in a network requires efficient encoding techniques. For this, practitioners generally use Elias encoding-based techniques without considering their computational overhead or data-volume. In this paper, based on Huffman coding, we propose several lossless encoding techniques that exploit different characteristics of the quantized gradients during distributed DNN training. Then, we show their effectiveness on 5 different DNN models across three different data-sets, and compare them with classic state-of-the-art Elias-based encoding techniques. Our results show that the proposed Huffman-based encoders (i.e., RLH, SH, and SHS) can reduce the encoded data-volume by up to 5.1×, 4.32×, and 3.8×, respectively, compared to the Elias-based encoders.
bibtex
```
@INPROCEEDINGS{Gajjala.DistributedML20,
  author = {Gajjala, Rishikesh R. and Banchhor, Shashwat and Abdelmoniem, Ahmed M. and Dutta, Aritra and Canini, Marco and Kalnis, Panos},
  title = {{Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning}},
  booktitle = {Proceedings of DistributedML'20},
  year = {2020},
  month = {Dec} 
}
```
On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning (Technical Report)
A. Dutta, E. H. Bergou, A. M. Abdelmoniem, C.-Y. Ho, A. N. Sahu, M. Canini, P. Kalnis
In Proceedings of AAAI’20, Feb 2020.

abstract
Compressed communication, in the form of sparsification or quantization of stochastic gradients, is employed to reduce communication costs in distributed data-parallel training of deep neural networks. However, there exists a discrepancy between theory and practice: while theoretical analysis of most existing compression methods assumes compression is applied to the gradients of the entire model, many practical implementations operate individually on the gradients of each layer of the model.
In this paper, we prove that layer-wise compression is, in theory, better, because the convergence rate is upper bounded by that of entire-model compression for a wide range of biased and unbiased compression methods. However, despite the theoretical bound, our experimental study of six well-known methods shows that convergence, in practice, may or may not be better, depending on the actual trained model and compression ratio. Our findings suggest that it would be advantageous for deep learning frameworks to include support for both layer-wise and entire-model compression.
bibtex
```
@INPROCEEDINGS{Dutta.AAAI20,
  author = {Dutta, Aritra and Bergou, El Houcine and Abdelmoniem, Ahmed M. and Ho, Chen-Yu and Sahu, Atal Narayan and Canini, Marco and Kalnis, Panos},
  title = {{On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning}},
  booktitle = {Proceedings of AAAI'20},
  year = {2020},
  month = {Feb},
}
```
Cracking Open the Black Box: What Observations Can Tell Us About Reinforcement Learning Agents
A. Dethise, M. Canini, S. Kandula
In Proceedings of NetAI’19, Aug 2019.

abstract
Machine learning (ML) solutions to challenging networking problems, while promising, are hard to interpret; the uncertainty about how they would behave in untested scenarios has hindered adoption. Using a case study of an ML-based video rate adaptation model, we show that carefully applying interpretability tools and systematically exploring the model inputs can identify unwanted or anomalous behaviors of the model; hinting at a potential path towards increasing trust in ML-based solutions.
bibtex
```
@INPROCEEDINGS{Dethise.COTBB19,
  author = {Arnaud Dethise and Marco Canini and Srikanth Kandula},
  title = {{Cracking Open the Black Box: What Observations Can Tell Us About Reinforcement Learning Agents}},
  booktitle = {Proceedings of NetAI'19},
  year = {2019},
  month = {Aug} 
}
```
In-Network Computation is a Dumb Idea Whose Time Has Come
A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, P. Kalnis.
In Proceedings of HotNets’17, Nov 2017.

abstract
Programmable data plane hardware creates new opportunities for infusing intelligence into the network. This raises a fundamental question: what kinds of computation should be delegated to the network?
In this paper, we discuss the opportunities and challenges for co-designing data center distributed systems with their network layer. We believe that the time has finally come for offloading part of their computation to execute in-network. However, in-network computation tasks must be judiciously crafted to match the limitations of the network machine architecture of programmable devices. With the help of our experiments on machine learning and graph analytics workloads, we identify that aggregation functions raise opportunities to exploit the limited computation power of networking hardware to lessen network congestion and improve the overall application performance. Moreover, as a proof-of-concept, we propose DAIET, a system that performs in-network data aggregation. Experimental results with an initial prototype show a large data reduction ratio (86.9%-89.3%) and a similar decrease in the workers’ computation time.
bibtex
```
@INPROCEEDINGS{Sapio.DAIET17,
  author = {Amedeo Sapio and Ibrahim Abdelaziz and Abdulla Aldilaijan and Marco Canini and Panos Kalnis},
  title = {{In-Network Computation is a Dumb Idea Whose Time Has Come}},
  booktitle = {Proceedings of HotNets'17},
  year = {2017},
  month = {Nov} 
}
```

Programming the Cloud: Performance, Predictability, Security

Assassyn: A Unified Abstraction for Architectural Simulation and Implementation
J. Weng, B. Han, D. Gao, R. Gao, W. Zhang, A. Zhong, C. Xu, J. Xin, Y. Luo, L. W. Wills, M. Canini.
In Proceedings of ISCA’25, Jun 2025.

abstract
The continuous growth of on-chip transistors driven by technology scaling urges architecture developers to design and implement novel architectures to effectively utilize the excessive on-chip resources. Due to the challenges of programming in register-transfer level (RTL) languages, performance modeling based on simulation is typically developed alongside hardware implementation, allowing the exploration of high-level design decisions before dealing with the error-prone, low-level RTL details. However, this approach also introduces new challenges in coordinating across multiple teams to align implementation details separate codebases.
In this paper, we address this issue by presenting Assassyn, a unified, high-level, and general-purpose programming framework for architectural simulation and implementation. By taking advantage of the concept of asynchronous event handling, a widely existing behavior in both hardware design and implementation and software engineering, a general-purpose, and high-level programming abstraction is proposed to mitigate the difficulties of RTL programming. Moreover, the unified programming interface naturally enables an accurate and faithful alignment between the simulation-based performance modeling and RTL implementation.
Our evaluation demonstrates that Assassyn’s high-level programming interface is sufficiently expressive to implement a wide range of architectures, from architectural components, and application-specific accelerators, to designs as complicated as out-of-order CPUs. All the generated simulators perfectly align with the generated RTL behavior, while achieving 2.2-8.1× simulation speedup, and requiring 70% lines of code. The generated RTL achieves comparable perf/area compared to handcrafted RTL, and 6× perf/area compared to high-level synthesis generated RTL code by introducing by mean 1.26× lines of code overhead.
bibtex
```
@inproceedings{weng2025assassyn,
title = {{Assassyn: A Unified Abstraction for Architectural Simulation and Implementation}},
author = {Jian Weng, Boyang Han, Derui Gao, Ruijie Gao, Wanning Zhang, An Zhong, Ceyu Xu, Jihao Xin, Yangzhixin Luo, Lisa Wu Wills, Marco Canini},
booktitle={The 2025 International Symposium on Computer Architecture (ISCA)},
year={2025},
}
```
OffRAC: Offloading Through Remote Accelerator Calls
Z. Yang, K. B. Iyer, Y. Chen, R. Shu, Z. István, M. Canini, S. A. Fahmy.
arXiv preprint arxiv:2504.04404

abstract
Modern applications increasingly demand ultra-low latency for data processing, often facilitated by host-controlled accelerators like GPUs and FPGAs. However, significant delays result from host involvement in accessing accelerators. To address this limitation, we introduce a novel paradigm we call Offloading through Remote Accelerator Calls (OffRAC), which elevates accelerators to first-class compute resources. OffRAC enables direct calls to FPGA-based accelerators without host involvement. Utilizing the stateless function abstraction of serverless computing, with applications decomposed into simpler stateless functions, offloading promotes efficient acceleration and distribution of computational loads across the network. To realize this proposal, we present a prototype design and implementation of an OffRAC platform for FPGAs that assembles diverse requests from multiple clients into complete accelerator calls with multi-tenancy performance isolation. This design minimizes the implementation complexity for accelerator users while ensuring isolation and programmability. Results show that the OffRAC approach reduces the latency of network calls to accelerators down to approximately 10.5 us, as well as sustaining high application throughput up to 85Gbps, demonstrating scalability and efficiency, making it compelling for the next generation of low-latency applications.
bibtex
```
@misc{yang2025offrac,
    title={OffRAC: Offloading Through Remote Accelerator Calls}, 
    author={Ziyi Yang and Krishnan B. Iyer and Yixi Chen and Ran Shu and Zsolt István and Marco Canini and Suhaib A. Fahmy},
    year={2025},
    eprint={2504.04404},
    archivePrefix={arXiv},
    primaryClass={cs.NI},
    url={https://arxiv.org/abs/2504.04404}, 
}
```
SAGE: Software-based Attestation for GPU Execution
A. Ivanov, B. Rothenberger, A. Dethise, M. Canini, T. Hoefler, A. Perrig.
In Proceedings of USENIX ATC’23, Jul 2023.

abstract
With the application of machine learning to security-critical and sensitive domains, there is a growing need for integrity and privacy in computation using accelerators, such as GPUs. Unfortunately, the support for trusted execution on GPUs is currently very limited – trusted execution on accelerators is particularly challenging since the attestation mechanism should not reduce performance.
Although hardware support for trusted execution on GPUs is emerging, we study purely software-based approaches for trusted GPU execution. A software-only approach offers distinct advantages: (1) complement hardware-based approaches, enhancing security especially when vulnerabilities in the hardware implementation degrade security, (2) operate on GPUs without hardware support for trusted execution, and (3) achieve security without reliance on secrets embedded in the hardware, which can be extracted as history has shown.
In this work, we present SAGE, a software-based attestation mechanism for GPU execution. SAGE enables secure code execution on NVIDIA GPUs of the Ampere architecture (A100), providing properties of code integrity and secrecy, computation integrity, as well as data integrity and secrecy – all in the presence of malicious code running on the GPU and CPU. Our evaluation demonstrates that SAGE is already practical today for executing code in a trustworthy way on GPUs without specific hardware support.
bibtex
```
@INPROCEEDINGS{Ivanov.ATC23,
  author = {Andrei Ivanov and Benjamin Rothenberger and Arnaud Dethise and Marco Canini and Torsten Hoefler and Adrian Perrig},
  title = {{SAGE: Software-based Attestation for GPU Execution}},
  booktitle = {Proceedings of USENIX ATC'23},
  year = {2023},
  month = {Jul}
}
```
With Great Freedom Comes Great Opportunity: Rethinking Resource Allocation for Serverless Functions
M. Bilal, M. Canini, R. Fonseca, R. Rodrigues.
In Proceedings of EuroSys’23, May 2023.

abstract
Current serverless offerings give users limited flexibility for configuring the resources allocated to their function invocations. This simplifies the interface for users to deploy serverless computations but creates deployments that are resource inefficient. In this paper, we take a principled approach to the problem of resource allocation for serverless functions, analyzing the effects of automating this choice in a way that leads to the best combination of performance and cost. In particular, we systematically explore the opportunities that come with decoupling memory and CPU resource allocations and also enabling the use of different VM types, and we find a rich trade-off space between performance and cost. The provider can use this in a number of ways, e.g., exposing all these parameters to the user; eliding preferences for performance and cost from users and simply offer the same performance with lower cost; or exposing a small number of choices for users to trade performance for cost.
Our results show that, by decoupling memory and CPU allocation, there is the potential to have up to 40% lower execution cost than the preset coupled configurations that are the norm in current serverless offerings. Similarly, making the correct choice of VM instance type can provide up to 50% better execution time. Furthermore, we demonstrate that providers have the flexibility to choose different instance types for the same functions to maximize resource utilization while providing performance within 10-20% of the best resource configuration for each respective function.
bibtex
```
@INPROCEEDINGS{Bilal.EuroSys23,
  author={Muhammad Bilal and Marco Canini and Rodrigo Fonseca and Rodrigo Rodrigues},
  title= {{With Great Freedom Comes Great Opportunity: Rethinking Resource Allocation for Serverless Functions}},
  booktitle = {Proceedings of EuroSys'23},
  year={2023},
  month = {May}
}
```
RDMA is Turing complete, we just did not know it yet!
W. Reda, M. Canini, D. Kostic, S. Peter.
In Proceedings of NSDI’22, Apr 2022.

abstract
It is becoming increasingly popular for distributed systems to exploit offload to reduce load on the CPU. Remote Direct Memory Access (RDMA) offload, in particular, has become popular. However, RDMA still requires CPU intervention for complex offloads that go beyond simple remote memory access. As such, the offload potential is limited and RDMA-based systems usually have to work around such limitations.
We present RedN, a principled, practical approach to implementing complex RDMA offloads, without requiring any hardware modifications. Using self-modifying RDMA chains, we lift the existing RDMA verbs interface to a Turing complete set of programming abstractions. We explore what is possible in terms of offload complexity and performance with a commodity RDMA NIC. We show how to integrate these RDMA chains into applications, such as the Memcached key-value store, allowing us to offload complex tasks such as key lookups. RedN can reduce the latency of key-value get operations by up to 2.6× compared to state-of-the-art KV designs that use one-sided RDMA primitives (e.g., FaRM-KV), as well as traditional RPC-over-RDMA approaches. Moreover, compared to these baselines, RedN provides performance isolation and, in the presence of contention, can reduce latency by up to 35× while providing applications with failure resiliency to OS and process crashes.
bibtex
```
@INPROCEEDINGS{Reda.NSDI22,
  title= {{RDMA is Turing complete, we just did not know it yet!}},
  author={Waleed Reda and Marco Canini and Dejan Kosti\'c and Simon Peter},
  booktitle = {Proceedings of NSDI'22},
  year = {2022},
  month = {Apr}
}
```
LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism
J. Kim, I. Jang, W. Reda, J. Im, M. Canini, D. Kostic, Y. Kwon, S. Peter, E. Witchel.
In Proceedings of SOSP’21, Oct 2021.

abstract
In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe.
We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel data-path execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a SmartNIC.
We implement LineFS on the Mellanox BlueField SmartNIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80% and throughput in Filebench up to 79%, while providing extended DFS availability during host system failures.
bibtex
```
@INPROCEEDINGS{Kim.SOSP21,
  author = {Jongyul Kim and Insu Jang and Waleed Reda and Jaeseong Im and Marco Canini and Dejan Kosti\'c and Youngjin Kwon and Simon Peter and Emmett Witchel},
  title = {{LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism}},
  booktitle = {Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP'21)},
  year = {2021},
  month = {Oct}
}
```
Assise: Performance and Availability via Client-local NVM in a Distributed File System
T. E. Anderson, M. Canini, J. Kim, D. Kostic, Y. Kwon, S. Peter, W. Reda, H. N. Schuh, E. Witchel.
In Proceedings of OSDI’20, Nov 2020.

abstract
The adoption of low latency persistent memory modules (PMMs) upends the long-established model of remote storage for distributed file systems. Instead, by colocating computation with PMM storage, we can provide applications with much higher IO performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol that manages client-local PMM as a linearizable and crash-recoverable cache between applications and slower (and possibly remote) storage. Assise maximizes locality for all file IO by carrying out IO on process-local, socket-local, and client-local PMM whenever possible. Assise minimizes coherence overhead by maintaining consistency at IO operation granularity, rather than at fixed block sizes.
We compare Assise to Ceph/BlueStore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs for common cloud applications and benchmarks, such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency up to 22×, throughput up to 56×, fail-over time up to 103×, and scales up to 6× better than its counterparts, while providing stronger consistency semantics.
bibtex
```
@INPROCEEDINGS{Reda.OSDI20,
  author = {Anderson, Thomas E. and Canini, Marco and Kim, Jongyul and Kosti\'c, Dejan and Kwon, Youngjin and Peter, Simon and Reda, Waleed and Schuh, Henry N. and Witchel, Emmett},
  title = {{Assise: Performance and Availability via Client-local NVM in a Distributed File System}},
  booktitle = {Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation 2020 (OSDI'20)},
  year = {2020},
  month = {Nov}
}
```
Finding the Right Cloud Configuration for Analytics Clusters
M. Bilal, M. Canini, R. Rodrigues.
In Proceedings of SoCC’20, Oct 2020.

abstract
Finding good cloud configurations for deploying a single distributed system is already a challenging task, and it becomes substantially harder when a data analytics cluster is formed by multiple distributed systems since the search space becomes exponentially larger. In particular, recent proposals for single system deployments rely on benchmarking runs that become prohibitively expensive as we shift to joint optimization of multiple systems, as users have to wait until the end of a long optimization run to start the production run of their job.
We propose Vanir, an optimization framework designed to operate in an ecosystem of multiple distributed systems forming an analytics cluster. To deal with this large search space, Vanir takes the approach of quickly finding a good enough configuration and then attempts to further optimize the configuration during production runs. This is achieved by combining a series of techniques in a novel way, namely a metrics-based optimizer for the benchmarking runs, and a Mondrian forest-based performance model and transfer learning during production runs. Our results show that Vanir can find deployments that perform comparably to the ones found by state-of-the-art single-system cloud configuration optimizers while spending 2× fewer benchmarking runs. This leads to an overall search cost that is 1.3-24× lower compared to the state-of-the-art. Additionally, when transfer learning can be used, Vanir can minimize the benchmarking runs even further, and use online optimization to achieve a performance comparable to the deployments found by today’s single-system frameworks.
bibtex
```
@INPROCEEDINGS{Bilal.SoCC20,
  author = {Bilal, Muhammad and Canini, Marco and Rodrigues, Rodrigo},
  title = {{Finding the Right Cloud Configuration for Analytics Clusters}},
  booktitle = {Proceedings of the 11th ACM Symposium on Cloud Computing 2020 (SoCC'20)},
  year = {2020},
  month = {Oct} 
}
```
Do the Best Cloud Configurations Grow on Trees? An Experimental Evaluation of Black Box Algorithms for Optimizing Cloud Workloads
M. Bilal, M. Serafini, M. Canini, R. Rodrigues.
In Proc. PVLDB Endow. 13(11), Aug 2020.

abstract
Cloud configuration optimization is the procedure to determine the number and the type of instances to use when deploying an application in cloud environments, given a cost or performance objective. In the absence of a performance model for the distributed application, black-box optimization can be used to perform automatic cloud configuration. Numerous black-box optimization algorithms have been developed; however, their comparative evaluation has so far been limited to the hyper-parameter optimization setting, which differs significantly from the cloud configuration problem. In this paper, we evaluate 8 commonly used black-box optimization algorithms to determine their applicability for the cloud configuration problem. Our evaluation, using 23 different workloads, shows that in several cases Bayesian optimization with Gradient boosted regression trees performs better than methods chosen by prior work.
bibtex
```
@ARTICLE{Bilal.VLDB20,
  author = {Bilal, Muhammad and Serafini, Marco and Canini, Marco and Rodrigues, Rodrigo},
  title = {{Do the Best Cloud Configurations Grow on Trees? An Experimental Evaluation of Black Box Algorithms for Optimizing Cloud Workloads}},
  journal={Proc. VLDB Endow.},
  year = {2020},
  volume = {13},
  number = {11},
  doi = {10.14778/3407790.3407845}
}
```
Fast and Accurate Load Balancing for Geo-Distributed Storage Systems
K. L. Bogdanov, W. Reda, G. Q. Maguire Jr., D. Kostic, M. Canini.
In Proceedings of SoCC’18, Oct 2018.

abstract
The increasing density of globally distributed datacenters reduces the network latency between neighboring datacenters and allows replicated services deployed across neighboring locations to share workload when necessary, without violating strict Service Level Objectives (SLOs).
We present Kurma, a practical implementation of a fast and accurate load balancer for geo-distributed storage systems. At runtime, Kurma integrates network latency and service time distributions to accurately estimate the rate of SLO violations for requests redirected across geo-distributed datacenters. Using these estimates, Kurma solves a decentralized rate-based performance model enabling fast load balancing (in the order of seconds) while taming global SLO violations. We integrate Kurma with Cassandra, a popular storage system. Using real-world traces along with a geo-distributed deployment across Amazon EC2, we demonstrate Kurma’s ability to effectively share load among datacenters while reducing SLO violations by up to a factor of 3 in high load settings or reducing the cost of running the service by up to 17%.
bibtex
```
@INPROCEEDINGS{Bogdanov.SoCC18,
  author = {Bogdanov, Kirill L. and Reda, Waleed and Maguire Jr., Gerald Q. and Kosti\'c, Dejan and Canini, Marco},
  title = {{Fast and Accurate Load Balancing for Geo-Distributed Storage Systems}},
  booktitle = {Proceedings of the 9th ACM Symposium on Cloud Computing 2018 (SoCC'18)},
  year = {2018},
  month = {Oct} 
}
```
Mitigating Network Side Channel Leakage for Stream Processing Systems in Trusted Execution Environments
M. Bilal, H. Alsibyani, M. Canini.
In Proceedings of DEBS’18, Jun 2018.

abstract
A crucial concern regarding cloud computing is the confidentiality of sensitive data being processed in the cloud. Trusted Execution Environments (TEEs), such as Intel Software Guard eXtensions (SGX), allow applications to run securely on an untrusted platform. However, using TEEs alone for stream processing is not enough to ensure privacy as network communication patterns may leak information about the data.
This paper introduces two techniques – anycast and multicast – for mitigating leakage at inter-stage communications in streaming applications according to a user-selected mitigation level. These techniques aim to achieve network data obliviousness, i.e., communication patterns should not depend on the data. We implement these techniques in an SGX-based stream processing system. We evaluate the latency and throughput overheads, and the data obliviousness using three benchmark applications. The results show that anycast scales better with input load and mitigation level, and provides better data obliviousness than multicast.
bibtex
```
@INPROCEEDINGS{Bilal.DEBS18,
  author = {Bilal, Muhammad and Alsibyani, Hassan and Canini, Marco},
  title = {{Towards Automatic Parameter Tuning of Stream Processing Systems}},
  booktitle = {Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems 2018 (DEBS'18)},
  year = {2018},
  month = {Jun} 
}
```
Towards Automatic Parameter Tuning of Stream Processing Systems
M. Bilal, M. Canini.
In Proceedings of SoCC’17, Sep 2017.

abstract
Optimizing the performance of big-data streaming applications has become a daunting and time-consuming task: parameters may be tuned from a space of hundreds or even thousands of possible configurations. In this paper, we present a framework for automating parameter tuning for stream-processing systems. Our framework supports standard black-box optimization algorithms as well as a novel gray-box optimization algorithm. We demonstrate the multiple benefits of automated parameter tuning in optimizing three benchmark applications in Apache Storm. Our results show that a hill-climbing algorithm that uses a new heuristic sampling approach based on Latin Hypercube provides the best results. Our gray-box algorithm provides comparable results while being two to five times faster.
bibtex
```
@INPROCEEDINGS{Bilal.SoCC17,
  author = {Bilal, Muhammad and Canini, Marco},
  title = {{Towards Automatic Parameter Tuning of Stream Processing Systems}},
  booktitle = {Proceedings of the 8th ACM Symposium on Cloud Computing 2017 (SoCC'17)},
  year = {2017},
  month = {Sep} 
}
```
Distributed Resource Management Across Process Boundaries
L. Suresh, P. Bodik, I. Menache, M. Canini, F. Ciucu.
In Proceedings of SoCC’17, Sep 2017.

abstract
Multi-tenant distributed systems composed of small services, such as Service-oriented Architectures (SOAs) and Micro-services, raise new challenges in attaining high performance and efficient resource utilization. In these systems, a request execution spans tens to thousands of processes, and the execution paths and resource demands on different services are generally not known when a request first enters the system. In this paper, we highlight the fundamental challenges of regulating load and scheduling in SOAs while meeting end-to-end performance objectives on metrics of concern to both tenants and operators. We design Wisp, a framework for building SOAs that transparently adapts rate limiters and request schedulers system-wide according to operator policies to satisfy end-to-end goals while responding to changing system conditions. In evaluations against production as well as synthetic workloads, Wisp successfully enforces a range of end-to-end performance objectives, such as reducing average latencies, meeting deadlines, providing fairness and isolation, and avoiding system overload.
bibtex
```
@INPROCEEDINGS{Suresh.SoCC17,
  author = {Suresh, Lalith and Bodik, Peter and Menache, Ishai and Canini, Marco and Ciucu, Florin},
  title = {{Distributed Resource Management Across Process Boundaries}},
  booktitle = {Proceedings of the 8th ACM Symposium on Cloud Computing 2017 (SoCC'17)},
  year = {2017},
  month = {Sep} 
}
```
Rein: Taming Tail Latency in Key-Value Stores via Multiget Scheduling
W. Reda, M. Canini, L. Suresh, D. Kostic, S. Braithwaite.
In Proceedings of EuroSys’17, Apr 2017.

abstract
We tackle the problem of reducing tail latencies in distributed key-value stores, such as the popular Cassandra database. We focus on workloads of multiget requests, which batch together access to several data elements and parallelize read operations across the data store machines. We first analyze a production trace of a real system and quantify the skew due to multiget sizes, key popularity, and other factors. We then proceed to identify opportunities for reduction of tail latencies by recognizing the composition of aggregate requests and by carefully scheduling bottleneck operations that can otherwise create excessive queues. We design and implement a system called Rein, which reduces latency via inter-multiget scheduling using low overhead techniques. We extensively evaluate Rein via experiments in Amazon Web Services (AWS) and simulations. Our scheduling algorithms reduce the median, 95th, and 99th percentile latencies by factors of 1.5, 1.5, and 1.9, respectively.
bibtex
```
@INPROCEEDINGS{Reda.EuroSys17,
  author = {Reda, Waleed and Canini, Marco and Suresh, Lalith and Kosti\'c, Dejan and Braithwaite, Sean},
  title = {{Rein: Taming Tail Latency in Key-Value Stores via Multiget Scheduling}},
  booktitle = {Proceedings of the 7th ACM european conference on Computer Systems (EuroSys'17)},
  year = {2017},
  month = {Apr} 
}
```
C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection
L. Suresh, M. Canini, S. Schmid, A. Feldmann.
In Proceedings of NSDI’15, May 2015.

abstract
Achieving predictable performance is critical for many distributed applications, yet difficult to achieve due to many factors that skew the tail of the latency distribution even in well-provisioned systems. In this paper, we present the fundamental challenges involved in designing a replica selection scheme that is robust in the face of performance fluctuations across servers. We illustrate these challenges through performance evaluations of the Cassandra distributed database on Amazon EC2. We then present the design and implementation of an adaptive replica selection mechanism, C3, that is robust to performance variability in the environment. We demonstrate C3’s effectiveness in reducing the latency tail and improving throughput through extensive evaluations on Amazon EC2 and through simulations. Our results show that C3 significantly improves the latencies along the mean, median, and tail (up to 3 times improvement at the 99.9^th percentile) and provides higher system throughput.
bibtex
```
@INPROCEEDINGS{Suresh.NSDI15,
  author = {Suresh, Lalith and Canini, Marco and Schmid, Stefan and Feldmann, Anja},
  title = {{C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection}},
  booktitle = {Proceedings of 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI'15)},
  year = {2015},
  month = {May} 
}
```
Application Aware Placement and Scheduling for Multi-tenant Clouds
L. Suresh, M. Canini.
HPI Technical Report for Spring 2013 Future SOC Lab.

abstract
In an Infrastructure-as-a-Service (IaaS) environment, it is paramount to perform intelligent allocation of shared resources. Placement is the problem of choosing which virtual machine (VM) should run on which physical machine (PM), whereas Scheduling is the problem of sharing resources between multiple co-located VMs. An efficient placement and scheduling is one, that in addition to satisfying all constraints, increases the overall utilization of physical resources such as CPU, storage, or network. Determining an efficient placement and scheduling is a very challenging problem, especially in face of conflicting goals and partially available information about workloads.
In order to reason about placement, we first tackle the problem of performance interference that may affect co-located VMs—when there is more demand by multiple VMs for a resource than is available at a given instant of time. We thus characterize the performance of Hadoop in a shared and virtualized setting.
bibtex
```
@TECHREPORT{AbsintheHPITR,
author = {Suresh, Lalith and Canini, Marco},
title = {{Application Aware Placement and Scheduling for Multi-tenant Clouds}},
institution = {Technical Report for Spring 2013 Future SOC Lab},
year = {2013}
}
```

Software-Defined Networking and Network Architecture

Scaling SCIERA: A Journey Through the Deployment of a Next-Generation Network
F. Wirz, M. Gartner, J. van Bommel, E. Ehsani Moghadam, G. H. Cimaszewski, A. He, Y. Zhang, H. Birge-Lee, F. Kottmann, C. Krähenbühl, J. Kwon, K. Mavromati, L. Wang, D. Bertolo, M. Canini, B. Cho, R. A. Ferreira, S. P. Green, D. Hausheer, J. Hur, X. Jia, H. Lee, P. Mittal, O. Oaiya, C. Park, A. Perrig, J. Sobieski, Y. Sun, C. Wang, K. Wierenga.
In Proceedings of SIGCOMM’25, Sep 2025.

abstract
The SCION Next-Generation Network (NGN) architecture has expanded steadily since 2017, with today 20+ ISPs offering SCION connectivity. In production, IP-to-SCION-to-IP translation by SCION-IP-Gateways (SIGs) is used, such that applications are unaware of the NGN communication. To accelerate innovation and deployments, our aim is to increase the number of native SCION use cases, where the application is fully SCION-aware and optimizes communication across all path choices offered by the network. We set out to achieve two core objectives: (1) facilitating simple native connectivity for applications, and (2) enhancing the scalability of SCION deployment at academic sites.
With these goals in mind, we built the SCION Education, Research, and Academic (SCIERA) network infrastructure. This paper presents key lessons learned from the SCIERA deployment, which we anticipate will offer actionable insights to researchers, network operators, and system builders seeking to overcome practical challenges also for other NGN deployments. We report on establishing native SCION connectivity at research and education institutions that can reach 250,000 people across five continents, without relying on BGP. Our evaluation demonstrates that our core objectives were reached. Today, the SCIERA deployment offers tangible real-world benefits to users by providing rich global connectivity through a multitude of inter-domain paths.
bibtex
```
@INPROCEEDINGS{sciera,
  author = {Wirz, François and Gartner, Marten and van Bommel, Jelte and Ehsani Moghadam, Elham and Cimaszewski, Grace H. and He, Anxiao and Zhang, Yizhe and Birge-Lee, Henry and Kottmann, Felix and Krähenbühl, Cyrill and Kwon, Jonghoon and Mavromati, Kyveli and Wang, Liang and Bertolo, Daniel and Canini, Marco and Cho, Buseung and Ferreira, Ronaldo A. and Green, Simon Peter and Hausheer, David and Hur, Junbeom and Jia, Xiaohua and Lee, Heejo and Mittal, Prateek and Oaiya, Omo and Park, Chanjin and Perrig, Adrian and Sobieski, Jerry and Sun, Yixin and Wang, Cong and Wierenga, Klaas},
  title = {{Scaling SCIERA: A Journey Through the Deployment of a Next-Generation Network}},
  booktitle = {Proceedings of SIGCOMM'25},
  year = {2025},
  month = {Sep}
}
```
Renaissance: A self-stabilizing distributed SDN control plane using in-band communications
M. Canini, I. Salem, L. Schiff, E. M. Schiller, S. Schmid.
Journal of Computer and System Sciences, 127, Feb 2022.

abstract
By introducing programmability, automated verification, and innovative debugging tools, Software-Defined Networks (SDNs) are poised to meet the increasingly stringent dependability requirements of today’s communication networks. However, the design of fault-tolerant SDNs remains an open challenge. This paper considers the design of dependable SDNs through the lenses of self-stabilization—a very strong notion of fault-tolerance. In particular, we develop algorithms for an in-band and distributed control plane for SDNs, called Renaissance, which tolerate a wide range of failures. Our self-stabilizing algorithms ensure that after the occurrence of arbitrary failures, (i) every non-faulty SDN controller can reach any switch (or another controller) within a bounded communication delay (in the presence of a bounded number of failures) and (ii) every switch is managed by a controller. We evaluate Renaissance through a rigorous worst-case analysis as well as a prototype implementation (based on OVS and Floodlight, and Mininet).
bibtex
```
@ARTICLE{Canini.Renaissance,
  author = {Marco Canini and Iosif Salem and Liron Schiff and Elad M. Schiller and Stefan Schmid},
  title = {{Renaissance: A self-stabilizing distributed SDN control plane using in-band communications}},
  volume = {127},
  year = {2022},
  doi = {10.1016/j.jcss.2022.02.001},
}
```
Tardis: A Fault-Tolerant Design for Network Control Planes
Z. Zhou, T. A. Benson, M. Canini, B. Chandrasekaran.
In Proceedings of SOSR’21, Oct 2021.

abstract
Guaranteeing high availability of networks virtually hinges on the ability to handle and recover from bugs and failures. Yet, despite the advances in verification, testing, and debugging, production networks remain susceptible to large-scale failures — often due to deterministic bugs.
This paper explores the use of input transformations as a viable method for recovering from such deterministic bugs. In particular, we introduce an online system, Tardis, for overcoming deterministic faults by using a blend of program analysis and runtime program data to systematically determine the fault-triggering input events and using domain-specific models to automatically generate transformations of the fault-triggering inputs that are both safe and semantically equivalent. We evaluated Tardis on several production network control plane applications (CPAs), including six SDN CPAs and several popular BGP CPAs using 71 realistic bugs. We observe that Tardis improves recovery time by 7.44%, introduces a 25% CPU and 0.5% memory overhead, and recovers from 77.26% of the injected realistic and representative bugs, more than twice that of existing solutions.
bibtex
```
@INPROCEEDINGS{Zhou.Tardis,
  author = {Zhenyu Zhou and Theophilus A. Benson and Marco Canini and Balakrishnan Chandrasekaran},
  title = {{Tardis: A Fault-Tolerant Design for Network Control Planes}},
  booktitle = {Proceedings of SOSR'21},
  year = {2021},
  month = {Oct},
}
```
P4xos: Consensus as a Network Service
H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, R. Soule.
IEEE/ACM Transactions on Networking, 28(4), Aug 2020.

abstract
In this paper, we explore how a programmable forwarding plane offered by a new breed of network switches might naturally accelerate consensus protocols, specifically focusing on Paxos. The performance of consensus protocols has long been a concern. By implementing Paxos in the forwarding plane, we are able to significantly increase throughput and reduce latency. Our P4-based implementation running on an ASIC in isolation can process over 2.5 billion consensus messages per second, a four orders of magnitude improvement in throughput over a widely-used software implementation. This effectively removes consensus as a bottleneck for distributed applications in data centers. Beyond sheer performance, our approach offers several other important benefits: it readily lends itself to formal verification; it does not rely on any additional network hardware; and as a full Paxos implementation, it makes only very weak assumptions about the network.
bibtex
```
@ARTICLE{Dang.P4xos,
  author = {Dang, Huynh Tu and Bressana, Pietro and Wang, Han, and Lee, Ki Suh and Zilberman, Noa and Weatherspoon, Hakim and Canini, Marco and Pedone, Fernando and Soul{\'{e}}, Robert},
  title = {{P4xos: Consensus as a Network Service}},
  journal = {IEEE/ACM Transactions on Networking},
  year = {2020},
  month = {Aug},
  volume = {28},
  number = {4},
  doi = {10.1109/TNET.2020.2992106}
}
```
A Survey on the Current Internet Interconnection Practices
P. Marcos, M. Chiesa, C. Dietzel, M. Canini, M. Barcellos.
SIGCOMM Comput. Commun. Rev., 50(1), Mar 2020.

abstract
The Internet topology has significantly changed in the past years. Today, it is richly connected and flattened. Such a change has been driven mostly by the fast growth of peering infrastructures and the expansion of Content Delivery Networks as alternatives to reduce interconnection costs and improve traffic delivery performance. While the topology evolution is perceptible, it is unclear whether or not the interconnection process has evolved or if it continues to be an ad-hoc and lengthy process. To shed light on the current practices of the Internet interconnection ecosystem and how these could impact the Internet, we surveyed more than 100 network operators and peering coordinators. We divide our results into two parts: (i) the current interconnection practices, including the steps of the process and the reasons to establish new interconnection agreements or to renegotiate existing ones, and the parameters discussed by network operators. In part (ii), we report the existing limitations and how the interconnection ecosystem can evolve in the future. We show that despite the changes in the topology, interconnecting continues to be a cumbersome process that usually takes days, weeks, or even months to complete, which is in stark contrast with the desire of most operators in reducing the interconnection setup time. We also identify that even being primary candidates to evolve the interconnection process, emerging on-demand connectivity companies are only fulfilling part of the existing gap between the current interconnection practices and the network operators’ desires.
bibtex
```
@ARTICLE{Marcos.CCR,
author = {Marcos, Pedro and Chiesa, Marco and Dietzel, Christoph and Canini, Marco and Barcellos, Marinho},
title = {{A Survey on the Current Internet Interconnection Practices}},
journal = {SIGCOMM Comput. Commun. Rev.},
year = {2020},
month = {Mar},
volume = {50},
number = {1},
doi = {10.1145/3390251.3390254}
}
```
Towards Consistent SDNs: A Case for Network State Fuzzing
A. Shukla, S. J. Saidi, S. Schmid, M. Canini, T. Zinner, A. Feldmann.
IEEE Transactions on Network and Service Management, 17(2), Jun 2020.

abstract
The conventional wisdom is that a software-defined network (SDN) operates under the premise that the logically centralized control plane has an accurate representation of the actual data plane state. Unfortunately, bugs, misconfigurations, faults or attacks can introduce inconsistencies that undermine correct operation. Previous work in this area, however, lacks a holistic methodology to tackle this problem and thus, addresses only certain parts of the problem. Yet, the consistency of the overall system is only as good as its least consistent part.
Motivated by an analogy of network consistency checking with program testing, we propose to add active probe-based network state fuzzing to our consistency check repertoire. Hereby, our system, PAZZ, combines production traffic with active probes to periodically test if the actual forwarding path and decision elements (on the data plane) correspond to the expected ones (on the control plane). Our insight is that active traffic covers the inconsistency cases beyond the ones identified by passive traffic. PAZZ prototype was built and evaluated on topologies of varying scale and complexity. Our results show that PAZZ requires minimal network resources to detect persistent data plane faults through fuzzing and localize them quickly while outperforming baseline approaches.
bibtex
```
@ARTICLE{Shukla.TNSM20,
author = {Shukla, Apoorv and Saidi, S. Jawad and Schmid, Stefan and Canini, Marco and Zinner, Thomas and Feldmann, Anja},
title = {{Towards Consistent SDNs: A Case for Network State Fuzzing}},
journal = {IEEE Transactions on Network and Service Management},
year = {2020},
month = {Jun},
volume = {17},
number = {2},
doi = {10.1109/TNSM.2019.2955790}
}
```
Measurements as First-class Artifacts
P. Laffranchini, L. Rodrigues, M. Canini, B. Krishnamurthy.
In Proceedings of INFOCOM’19, Apr 2019.

abstract
The emergence of programmable switches has sparked a significant amount of work on new techniques to perform more powerful measurement tasks, for instance, to obtain fine-grained traffic and performance statistics. Previous work has focused on the efficiency of these measurements alone and has neglected flexibility, resulting in solutions that are hard to reuse or repurpose and that often overlap in functionality or goals.
In this paper, we propose the use of a set of reusable primitive building blocks that can be composed to express measurement tasks in a concise and simple way. We describe the rationale for the design of our primitives, that we have named MAFIA (Measurements As FIrst-class Artifacts), and using several examples we illustrate how they can be combined to realize a comprehensive range of network measurement tasks. Writing MAFIA code does not require expert knowledge of low-level switch architecture details. Using a prototype implementation of MAFIA, we demonstrate the applicability of our approach and show that the use of our primitives results in compiled code that is comparable in size and resource usage with manually written specialized P4 code, and can be run in current hardware.
bibtex
```
@INPROCEEDINGS{Laffranchini.INFOCOM19,
  author={Paolo Laffranchini and Luis Rodrigues and Marco Canini and Balachander Krishnamurthy},
  title = {{Measurements as First-class Artifacts}},
  booktitle = {Proceedings of INFOCOM'19},
  year = {2019},
  month = {Apr}
}
```
Dynam-IX: a Dynamic Interconnection eXchange
P. Marcos, M. Chiesa, L. Muller, P. Kathiravelu, C. Dietzel, M. Canini, M. Barcellos.
In Proceedings of CoNEXT’18, Dec 2018.

abstract
Autonomous Systems (ASes) can reach hundreds of networks via Internet eXchange Points (IXPs), allowing improvements in traffic delivery performance and competitiveness. Despite the benefits, any pair of ASes needs first to agree on exchanging traffic. By surveying 100+ network operators, we discovered that most interconnection agreements are established through ad-hoc and lengthy processes heavily influenced by personal relationships and brand image. As such, ASes prefer long-term agreements at the expense of a potential mismatch between actual delivery performance and current traffic dynamics. ASes also miss interconnection opportunities due to trust reasons. To improve wide-area traffic delivery performance, we propose Dynam-IX, a framework that allows operators to build trust cooperatively and implement traffic engineering policies to exploit the rich interconnection opportunities at IXPs quickly. Dynam-IX offers a protocol to automate the interconnection process, an intent abstraction to express interconnection policies, a legal framework to digitally handle contracts, and a distributed tamper-proof ledger to create trust among ASes. We build and evaluate a Dynam-IX prototype and show that an AS can establish tens of agreements per minute with negligible overhead for ASes and IXPs.
bibtex
```
@INPROCEEDINGS{Marcos.CoNEXT18,
  author={Pedro Marcos and Marco Chiesa and Lucas Muller and Pradeeban Kathiravelu and Christoph Dietzel and Marco Canini and Marinho Barcellos},
  title = {{Dynam-IX: a Dynamic Interconnection eXchange}},
  booktitle = {Proceedings of CoNEXT'18},
  year = {2018},
  month = {Dec}
}
```
Sonata: Query-Driven Streaming Network Telemetry
A. Gupta, R. Harrison, M. Canini, N. Feamster, J. Rexford, W. Willinger.
In Proceedings of SIGCOMM’18, Aug 2018.

abstract
Managing and securing networks requires collecting and analyzing network traffic data in real time. Existing telemetry systems do not allow operators to express the range of queries needed to perform management or scale to large traffic volumes and rates. We present Sonata, an expressive and scalable telemetry system that coordinates joint collection and analysis of network traffic. Sonata provides a declarative interface to express queries for a wide range of common telemetry tasks; to enable real-time execution, Sonata partitions each query across the stream processor and the data plane, running as much of the query as it can on the network switch, at line rate. To optimize the use of limited switch memory, Sonata dynamically refines each query to ensure that available resources focus only on traffic that satisfies the query. Our evaluation shows that Sonata can support a wide range of telemetry tasks while reducing the workload for the stream processor by as much as seven orders of magnitude compared to existing telemetry systems.
bibtex
```
@INPROCEEDINGS{Gupta.SIGCOMM18,
  author={Arpit Gupta and Rob Harrison and Marco Canini and Nick Feamster and Jennifer Rexford and Walter Willinger},
  title = {{Sonata: Query-Driven Streaming Network Telemetry}},
  booktitle = {Proceedings of SIGCOMM'18},
  year = {2018},
  month = {Aug}
}
```
Prelude: Ensuring Inter-Domain Loop-Freedom in SDN-Enabled Networks
A. Dethise, M. Chiesa, M. Canini.
In Proceedings of APNet’18, Aug 2018.

abstract
Software-Defined eXchanges (SDXes) promise to improve the inter-domain routing ecosystem through SDN deployment. Yet, the naïve deployment of SDN on the Internet raises concerns about the correctness of the inter-domain data-plane. By allowing operators to deflect traffic from default BGP routes, SDN policies can create permanent forwarding loops that are not visible to the control-plane.
We propose Prelude, a system for detecting SDN-induced forwarding loops between SDXes with high accuracy without leaking private routing information of network operators. To achieve this, we leverage Secure Multi-Party Computation (SMPC) techniques to build a novel and general privacy-preserving primitive that detects whether any subset of SDN rules might affect the same portion of traffic without learning anything about those rules. We then leverage this primitive as the main building block of a distributed system tailored to detect forwarding loops among any set of SDXes. We leverage the particular nature of SDXes to further improve the efficiency of our SMPC solution.
The number of valid SDN rules rejected by our solution is 100x lower than previous privacy-preserving solutions, and provides better privacy guarantees. Furthermore, our solution naturally provides network operators with some insights on the cost of the deflected paths.
bibtex
```
@INPROCEEDINGS{Dethise.APNet18,
  author={Arnaud Dethise and Marco Chiesa and Marco Canini},
  title = {{Prelude: Ensuring Inter-Domain Loop-Freedom in SDN-Enabled Networks}},
  booktitle = {Proceedings of APNet'18},
  year = {2018},
  month = {Aug}
}
```
Picking a Partner: A Fair Blockchain Based Scoring Protocol for Autonomous Systems
Y. Alowayed, M. Canini, P. Marcos, M. Chiesa, M. Barcellos.
In Proceedings of ANRW’18, Jul 2018.

abstract
We tackle the problem of enabling Autonomous Systems to evaluate network providers on the basis of their adherence to Service Level Agreements (SLAs) regarding interconnection agreements. In current Internet practices, choices of interconnection partners are driven by factors such as word of mouth, personal relationships, brand recognition and market intelligence, and not by proofs of previous performance. Given that Internet eXchange Points provide increasingly more peering choices, rudimentary schemes for picking interconnection partners are not adequate anymore.
Although the current interconnection ecosystem is shrouded in confidentiality, our key observation is that recently-emerged blockchain technology and advances in cryptography enable a privacy-preserving decentralized solution based on actual performance measurements. We propose the concept of SLA score to evaluate network providers and introduce a privacy-preserving protocol that allows networks to compute and verify SLA scores.
bibtex
```
@INPROCEEDINGS{Alowayed.ANRW18,
  author={Yousef Alowayed and Marco Canini and Pedro Marcos and Marco Chiesa and Marinho Barcellos},
  title = {{Picking a Partner: A Fair Blockchain Based Scoring Protocol for Autonomous Systems}},
  booktitle = {Proceedings of ANRW'18},
  year = {2018},
  month = {Jul}
}
```
Renaissance: Self-Stabilizing Distributed SDN Control Plane
M. Canini, I. Salem, L. Schiff, E. M. Schiller, S. Schmid.
In Proceedings of ICDCS’18, Jul 2018.

abstract
By introducing programmability, automated verification, and innovative debugging tools, Software-Defined Networks (SDNs) are poised to meet the increasingly stringent dependability requirements of today’s communication networks. However, the design of fault-tolerant SDNs remains an open challenge. This paper considers the design of dependable SDNs through the lenses of self-stabilization—a very strong notion of fault-tolerance. In particular, we develop algorithms for an in-band and distributed control plane for SDNs, called Renaissance, which tolerate a wide range of (concurrent) controller, link, and communication failures. Our self-stabilizing algorithms ensure that after the occurrence of an arbitrary combination of failures, (i) every non-faulty SDN controller can eventually reach any switch in the network within a bounded communication delay (in the presence of a bounded number of concurrent failures) and (ii) every switch is managed by at least one non-faulty controller. We evaluate Renaissance through a rigorous worst-case analysis as well as a prototype implementation (based on OVS and Floodlight), and we report on our experiments using Mininet.
bibtex
```
@INPROCEEDINGS{Canini.ICDCS18,
  author={Canini, Marco and Salem, Iosif and Schiff, Liron and Schiller, Elad M. and Schmid, Stefan},
  title = {{Renaissance: Self-Stabilizing Distributed SDN Control Plane}},
  booktitle = {Proceedings of ICDCS'18},
  year = {2018},
  month = {Jul}
}
```
Methodology, Measurement and Analysis of Flow Table Update Characteristics in Hardware OpenFlow Switches
M. Kuzniar, P. Peresini, D. Kostic, M. Canini.
Computer Networks, 136, May 2018.

abstract
Software-Defined Networking (SDN) and OpenFlow are actively being standardized and deployed. These deployments rely on switches that come from various vendors and differ in terms of performance and available features. Understanding these differences and performance characteristics is essential for ensuring successful and safe deployments.
We propose a systematic methodology for SDN switch performance analysis and devise a series of experiments based on this methodology. The methodology relies on sending a stream of rule updates, while relying on both observing the control plane view as reported by the switch and probing the data plane state to determine switch characteristics by comparing these views. We measure, report and explain the performance characteristics of flow table updates in six hardware OpenFlow switches. Our results describing rule update rates can help SDN designers make their controllers efficient. Further, we also highlight differences between the OpenFlow specification and its implementations, that if ignored, pose a serious threat to network security and correctness.
bibtex
```
@ARTICLE{Kuzniar.COMNET18,
author = {Ku\'zniar, Maciej and Pere\v{s}\'ini, Peter and Kosti\'c, Dejan and Canini, Marco},
title = {{Methodology, Measurement and Analysis of Flow Table Update Characteristics in Hardware OpenFlow Switches}},
journal = {Computer Networks},
volume = {136},
year = {2018},
month = {May},
doi = {10.1016/j.comnet.2018.02.014}
}
```

Moving Bits with a Fleet of Shared Virtual Routers
P. Kathiravelu, M. Chiesa, P. de B Marcos, M. Canini, L. Veiga.
In Proceedings of IFIP Networking’18, May 2018.

abstract

bibtex

@INPROCEEDINGS{Kathiravelu.Networking18,
  author={Pradeeban Kathiravelu and Marco Chiesa and Pedro de B Marcos and Marco Canini and Luis Veiga}, 
  title = {{Moving Bits with a Fleet of Shared Virtual Routers}},
  booktitle = {Proceedings of IFIP Networking'18},
  year = {2018},
  month = {May}
}

SIXPACK: Securing Internet eXchange Points Against Curious onlooKers
M. Chiesa, D. Demmler, M. Canini, M. Schapira, T. Schneider.
In Proceedings of ACM CoNEXT’17, Dec 2017.

abstract
Internet eXchange Points (IXPs) play an ever-growing role in Internet inter-connection. To facilitate the exchange of routes amongst their members, IXPs provide Route Server (RS) services to dispatch the routes according to each member’s peering policies. Nowadays, to make use of RSes, these policies must be disclosed to the IXP. This poses fundamental questions regarding the privacy guarantees of route-computation on confidential business information. Indeed, as evidenced by interaction with IXP administrators and a survey of network operators, this state of affairs raises privacy concerns among network administrators and even deters some networks from subscribing to RS services. We design SIXPACK, an RS service that leverages Secure Multi-Party Computation (SMPC) to keep peering policies confidential, while extending, the functionalities of today’s RSes. As SMPC is notoriously heavy in terms of communication and computation, our design and implementation of SIXPACK aims at moving computation outside of the SMPC without compromising the privacy guarantees.
We assess the effectiveness and scalability of our system by evaluating a prototype implementation using traces of data from one of the largest IXPs in the world. Our evaluation results indicate that SIXPACK can scale to support privacy-preserving route-computation, even at IXPs with many hundreds of member networks.
bibtex
```
@INPROCEEDINGS{Chiesa.CoNEXT17,
  author={Marco Chiesa and Daniel Demmler and Marco Canini and Michael Schapira and Thomas Schneider}, 
  title = {{SIXPACK: Securing Internet eXchange Points Against Curious onlooKers}},
  booktitle = {Proceedings of ACM CoNEXT'17},
  year = {2017},
  month = {Dec}
}
```
ENDEAVOUR: A Scalable SDN Architecture for Real-World IXPs
G. Antichi, I. Castro, M. Chiesa, E. L. Fernandes, R. Lapeyrade, D. Kopp, J. H. Han, M. Bruyere, C. Dietzel, M. Gusat, A. W. Moore, P. Owezarski, S. Uhlig, M. Canini.
IEEE Journal on Selected Areas in Communications, 35(11), Nov 2017.

abstract
Innovation in interdomain routing has remained stagnant for over a decade. Recently, IXPs have emerged as economically-advantageous interconnection points for reducing path latencies and exchanging ever increasing traffic volumes among, possibly, hundreds of networks. Given their far-reaching implications on interdomain routing, IXPs are the ideal place to foster network innovation and extend the benefits of SDN to the interdomain level.
In this paper, we present, evaluate, and demonstrate ENDEAVOUR, an SDN platform for IXPs. ENDEAVOUR can be deployed on a multi-hop IXP fabric, supports a large number of use cases, and is highly-scalable while avoiding broadcast storms. Our evaluation with real data from one of the largest IXPs, demonstrates the benefits and scalability of our solution: ENDEAVOUR requires around 70% fewer rules than alternative SDN solutions thanks to our rule partitioning mechanism. In addition, by providing an open source solution, we invite everyone from the community to experiment (and improve) our implementation as well as adapt it to new use cases.
bibtex
```
@ARTICLE{Antichi.JSAC,
  author = {Gianni Antichi and Ignacio Castro and Marco Chiesa and Eder L. Fernandes and Remy Lapeyrade and Daniel Kopp and Jong Hun Han and Marc Bruyere and Christoph Dietzel and Mitchell Gusat and Andrew W. Moore and Philippe Owezarski and Steve Uhlig and Marco Canini}, 
  title = {{ENDEAVOUR: A Scalable SDN Architecture for Real-World IXPs}},
  journal = {IEEE Journal on Selected Areas in Communications},
  volume = {35}, 
  number = {11}, 
  year = {2017}, 
  month = {Nov},
  doi = {10.1109/JSAC.2017.2760398}
}
```
Decentralized Consistent Updates in SDN (Technical Report)
T. D. Nguyen, M. Chiesa, M. Canini.
In Proceedings of SOSR’17, Apr 2017.

abstract
We present ez-Segway, a decentralized mechanism to consistently and quickly update the network state while preventing forwarding anomalies (loops and black-holes) and avoiding link congestion. In our design, the centralized SDN controller only pre-computes information needed by the switches during the update execution. This information is distributed to the switches, which use partial knowledge and direct message passing to efficiently realize the update. This separation of concerns has the key benefit of improving update performance as the communication and computation bottlenecks at the controller are removed. Our evaluations via network emulations and large-scale simulations demonstrate the efficiency of ez-Segway, which compared to a centralized approach, improves network update times by up to 45% and 57% at the median and the 99th percentile, respectively. A deployment of a system prototype in a real OpenFlow switch and an implementation in P4 demonstrate the feasibility and low overhead of implementing simple network update functionality within switches.
bibtex
```
@INPROCEEDINGS{Nguyen.SOSR17,
  author = {Thanh Dang Nguyen and Marco Chiesa and Marco Canini},
  title = {{Decentralized Consistent Updates in SDN}},
  booktitle = {Proceedings of SOSR'17},
  year = {2017},
  month = {Apr} 
}
```
Correct by Construction Networks using Stepwise Refinement
L. Ryzhyk, N. Bjorner, M. Canini, J.-B. Jeannin, C. Schlesinger, D. B. Terry, G. Varghese.
In Proceedings of NSDI’17, Mar 2017.

abstract
Building software-defined network controllers is an exercise in software development and, as such, likely to introduce bugs. We present Cocoon, a framework for SDN development that facilitates both the design and verification of complex networks using stepwise refinement to move from a high-level specification to the final network implementation.
A Cocoon user specifies intermediate design levels in a hierarchical design process that delineates the modularity in complicated network forwarding and makes verification extremely efficient. For example, an enterprise network, equipped with VLANs, ACLs, and Level 2 and Level 3 Routing, can be decomposed cleanly into abstractions for each mechanism, and the resulting stepwise verification is over 200x faster than verifying the final implementation. Cocoon further separates static network design from its dynamically changing configuration. The former is verified at design time, while the latter is checked at run time using statically defined invariants. We present six different SDN use cases including B4 and F10. Our performance evaluation demonstrates that Cocoon is not only faster than existing verification tools but can also find many bugs statically before the network design has been fully specified.
bibtex
```
@INPROCEEDINGS{Ryzhyk.NSDI17,
  author = {Leonid Ryzhyk and Nikolaj Bj\"orner and Marco Canini and Jean-Baptiste Jeannin and Cole Schlesinger and Douglas B. Terry and George Varghese},
  title = {{Correct by Construction Networks using Stepwise Refinement}},
  booktitle = {Proceedings of 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI'17)},
  year = {2017},
  month = {Mar} 
}
```
Network Monitoring as a Streaming Analytics Problem
A. Gupta, R. Birkner, M. Canini, N. Feamster, C. Mac-Stoker, W. Willinger.
In Proceedings of HotNets’16, Nov 2016.

abstract
Programmable switches make it easier to perform flexible network monitoring queries at line rate, and scalable stream processors make it possible to fuse data streams to answer more sophisticated queries about the network in real-time. Unfortunately, processing such network monitoring queries at high traffic rates requires both the switches and the stream processors to filter the traffic iteratively and adaptively so as to extract only that traffic that is of interest to the query at hand. Others have network monitoring in the context of streaming; yet, previous work has not closed the loop in a way that allows network operators to perform streaming analytics for network monitoring applications at scale. To achieve this objective, Sonata allows operators to express a network monitoring query by considering each packet as a tuple and efficiently partitioning each query between the switches and the stream processor through iterative refinement. Sonata extracts only the traffic that pertains to each query, ensuring that the stream processor can scale traffic rates of several terabits per second. We show with a simple example query involving DNS reflection attacks and traffic traces from one of the world’s largest IXPs that Sonata can capture 95% of all traffic pertaining to the query, while reducing the overall data rate by a factor of about 400 and the number of required counters by four orders of magnitude.
bibtex
```
@INPROCEEDINGS{Gupta.HotNets16,
  author = {Arpit Gupta and R\"udiger Birkner and Marco Canini and Nick Feamster and Chris Mac-Stoker and Walter Willinger},
  title = {{Network Monitoring as a Streaming Analytics Problem}},
  booktitle = {Proceedings of the 15th ACM Workshop on Hot Topics in Networks (HotNets'16)},
  year = {2016},
  month = {Nov} 
}
```
Inter-domain Networking Innovation on Steroids: Empowering IXPs with SDN Capabilities
M. Chiesa, C. Dietzel, G. Antichi, M. Bruyere, I. Castro, M. Gusat, T. King, A. W. Moore, T. D. Nguyen, P. Owezarski, S. Uhlig, M. Canini.
IEEE Communications Magazine, 54(10), Oct 2016.

abstract
While innovation in inter-domain routing has remained stagnant for over a decade, Internet Exchange Points (IXPs) are consolidating their role as economically advantageous interconnection points for reducing path latencies and exchanging ever increasing amounts of traffic. As such, IXPs appear as a natural place to foster network innovation and assess the benefits of Software-Defined Networking (SDN), a recent technological trend that has already boosted innovation within data-center networks.
In this paper, we give a comprehensive overview of use cases for SDN at IXPs, which leverage the superior vantage point of an IXP to introduce advanced features like load-balancing and DDoS mitigation. We discuss the benefits of SDN solutions by analyzing real-world data from one of the largest IXPs. We also leverage insights into IXP operations to not only shape benefits for members but also for operators.
bibtex
```
@ARTICLE{Chiesa.Communications,
  author = {Marco Chiesa and Christoph Dietzel and Gianni Antichi and Marc Bruyere and Ignacio Castro and Mitch Gusat and Thomas King and Andrew W. Moore and Thanh Dang Nguyen and Philippe Owezarski and Steve Uhlig and Marco Canini}, 
  title = {{Inter-domain Networking Innovation on Steroids: Empowering IXPs with SDN Capabilities}},
  journal = {IEEE Communications Magazine}, 
  volume = {54}, 
  number = {10}, 
  year = {2016}, 
  month = {Oct},
  doi = {10.1109/MCOM.2016.7588277}
}
```
Towards Decentralized Fast Consistent Updates
T. D. Nguyen, M. Chiesa, M. Canini.
In Proceedings of ANRW’16, Jul 2016.

abstract
Updating data plane state to adapt to dynamic conditions is a fundamental network control operation. Software-Defined Networking (SDN) offers abstractions for updating network state while preserving consistency properties. However, realizing these abstractions in a purely centralized fashion is inefficient, due to the inherent delays between switches and the SDN controller, we argue for delegating the responsibility of coordinated updates to the switches. To make our case, we propose ez-Segway, a mechanism that enables decentralized network updates while preventing forwarding anomalies and avoiding link congestion. In our architecture, the controller is only responsible for computing the intended network configuration. This information is distributed to the switches, which use partial knowledge and direct message passing to efficiently schedule and implement the update. This separation of concerns has the key benefit of improving update performance as the communication and computation bottlenecks at the controller are removed. Our extensive simulations show update speedups up to a factor of 2.
bibtex
```
@INPROCEEDINGS{Nguyen.ANRW16,
  author = {Thanh Dang Nguyen and Marco Chiesa and Marco Canini},
  title = {{Towards Decentralized Fast Consistent Updates}},
  booktitle = {Proceedings of the Applied Networking Research Workshop 2016 (ANRW'16)},
  year = {2016},
  month = {Jul} 
}
```
Network Hardware-Accelerated Consensus
H. T. Dang, P. Bressana, H. Wang, K. S. Lee, H. Weatherspoon, M. Canini, F. Pedone, R. Soule.
arXiv preprint arXiv:1605.05619

abstract
Consensus protocols are the foundation for building many fault-tolerant distributed systems and services. This paper posits that there are significant performance benefits to be gained by offering consensus as a network service (CAANS). CAANS leverages recent advances in commodity networking hardware design and programmability to implement consensus protocol logic in network devices. CAANS provides a complete Paxos protocol, is a drop-in replacement for software-based implementations of Paxos, makes no restrictions on network topologies, and is implemented in a higher-level, data-plane programming language, allowing for portability across a range of target devices. At the same time, CAANS significantly increases throughput and reduces latency for consensus operations. Consensus logic executing in hardware can transmit consensus messages at line speed, with latency only slightly higher than simply forwarding packets.
bibtex
```
@ARTICLE{2016arXiv:1605.05619,
  author = {Dang, Huynh Tu and Bressana, Pietro and Wang, Han, and Lee, Ki Suh and Weatherspoon, Hakim and Canini, Marco and Pedone, Fernando and Soul{\'{e}}, Robert},
  title = {{Network Hardware-Accelerated Consensus}},
  journal = {arXiv preprint arXiv:1605.05619},
  year = {2016},
  moth = {May}
}
```
Ground Control to Major Faults: Towards a Fault Tolerant and Adaptive SDN Control Network
L. Schiff, S. Schmid, M. Canini.
In Proceedings of DISN’16, Jun 2016.

abstract
To provide high availability and fault-tolerance, SDN control planes should be distributed. However, distributed control planes are challenging to design and bootstrap, especially if to be done in-band, without dedicated control network, and without relying on legacy protocols. This paper promotes a distributed systems approach to build and maintain connectivity between a distributed control plane and the data plane. In particular, we make the case for a self-stabilizing distributed control plane, where from any initial configuration, controllers self-organize, and quickly establish a communication channel among themselves. Given the resulting managed control plane, arbitrary network services can be implemented on top.
This paper presents a model for the design of such self-stabilizing control planes, and identifies fundamental challenges. Subsequently, we present techniques which can be used to solve these challenges, and implement a plug & play distributed control plane which supports automatic topology discovery and management, as well as flexible controller membership: controllers can be added and removed dynamically. Interestingly, we argue that our approach can readily be implemented in today’s OpenFlow protocol. Moreover, our approach comes with interesting security features.
bibtex
```
@INPROCEEDINGS{Schiff.DISN16,
  author = {Liron Schiff and Stefan Schmid and Marco Canini},
  title = {{Ground Control to Major Faults: Towards a Fault Tolerant and Adaptive SDN Control Network}},
  booktitle = {Proceedings of the 2nd Workshop on Dependability Issues on SDN and NFV (DISN'16)},
  year = {2016},
  month = {Jun} 
}
```
Scheduling Multi-flow Network Updates in Software-Defined NFV Systems
Y. Liu, Y. Li, M. Canini, Y. Wang, J. Yuan.
In Proceedings of SWFAN’16, Apr 2016.

abstract
Combining Network Functions Virtualization (NFV) with Software-Defined Networking (SDN) is an emerging solution to provide fine-grained control over scalable and elastic packet processing functions. Due to changes in network policy, traffic characteristics, or physical topology in Software-Defined NFV (SDNFV) systems, the controller needs to carry out network updates frequently, i.e., change the data plane configuration from one state to another. In order to adapt to a newly desired network state quickly, the network update process is expected to be completed in the shortest time possible. However, the update scheduling schemes need to address resource constraints including flow table sizes, CPU capacities of Virtualized Network Functions (VNFs) and link bandwidths, which are closely coupled. Thus, the problem is difficult to solve, especially when multiple flows are involved in the network update. In this work we investigate the multi-flow update problem in SDNFV systems, and formulate it as a mixed integer programming problem, which is NP-complete. We propose an approximation algorithm via linear relaxation. By extensive simulations, we demonstrate that our algorithm approaches the optimal solution, while requiring 10x-100x less computing time.
bibtex
```
@INPROCEEDINGS{Liu.SWFAN16,
  author = {Yujie Liu and Yong Li and Marco Canini and Yue Wang and Jian Yuan},
  title = {{Scheduling Multi-flow Network Updates in Software-Defined NFV Systems}},
  booktitle = {Proceedings of the 1st International INFOCOM Workshop on Software-Driven Flexible and Agile Networking (SWFAN'16)},
  year = {2016},
  month = {Apr} 
}
```
An Industrial-Scale Software Defined Internet Exchange Point
A. Gupta, R. MacDavid, R. Birkner, M. Canini, N. Feamster, J. Rexford, L. Vanbever.
In Proceedings of NSDI’16, Mar 2016.

abstract
Software-Defined Internet Exchange Points (SDXes) promise to significantly increase the flexibility and function of interdomain traffic delivery on the Internet. Unfortunately, current SDX designs cannot yet achieve the scale required for large Internet exchange points (IXPs), which can host hundreds of participants exchanging traffic for hundreds of thousands of prefixes. Existing platforms are indeed too slow and inefficient to operate at this scale, typically requiring minutes to compile policies and millions of forwarding rules in the data plane. We motivate, design, and implement iSDX, the first SDX architecture that can operate at the scale of the largest IXPs. We show that iSDX reduces both policy compilation time and forwarding table size by two orders of magnitude compared to current state-of-the-art SDX controllers. Our evaluation against a trace from one of the largest IXPs in the world found that iSDX can compile a realistic set of policies for 500 IXP participants in less than three seconds. Our public release of iSDX, complete with tutorials and documentation, is already spurring early adoption in operational networks.
bibtex
```
@INPROCEEDINGS{Gupta.NSDI16,
  author = {Arpit Gupta and Robert MacDavid and Rudiger Birkner and Marco Canini and Nick Feamster and Jennifer Rexford and Laurent Vanbever},
  title = {{An Industrial-Scale Software Defined Internet Exchange Point}},
  booktitle = {Proceedings of 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI'16)},
  year = {2016},
  month = {Mar} 
}
```
Paxos Made Switch-y (Technical Report)
H. T. Dang, M. Canini, F. Pedone, R. Soule.
SIGCOMM Comput. Commun. Rev., 46(2), Apr 2016.

abstract
The Paxos protocol is the foundation for building many fault-tolerant distributed systems and services. This paper posits that there are significant performance benefits to be gained by implementing Paxos logic in network devices. Until recently, the notion of a switch-based implementation of Paxos would be a daydream. However, new flexible hardware is on the horizon that will provide customizable packet processing pipelines needed to implement Paxos. While this new hardware is still not readily available, several vendors and consortia have made the programming languages that target these devices public. This paper describes an implementation of Paxos in one of those languages, P4. Implementing Paxos provides a critical use case for P4, and will help drive the requirements for data plane languages in general. In the long term, we imagine that consensus could someday be offered as a network service, just as point-to-point communication is provided today.
bibtex
```
@ARTICLE{Dang.CCR16,
  author = {Dang, Huynh Tu and Canini, Marco and Pedone, Fernando and Soul{\'{e}}, Robert},
  title = {{Paxos Made Switch-y}},
  journal = {SIGCOMM Comput. Commun. Rev.},
  year = {2016},
  month = {Apr},
  volume = {46},
  number = {2},
  doi = {10.1145/2935634.2935638}
}
```
NetPaxos: Consensus at Network Speed
H. T. Dang, D. Sciascia, M. Canini, F. Pedone, R. Soule.
In Proceedings of SOSR’15, Jun 2015.

abstract
This paper explores the possibility of implementing the widely deployed Paxos consensus protocol in network devices. We present two different approaches: (i) a detailed design description for implementing the full Paxos logic in SDN switches, which identifies a sufficient set of required OpenFlow extensions; and (ii) an alternative, optimistic protocol which can be implemented without changes to the OpenFlow API, but relies on assumptions about how the network orders messages. Although neither of these protocols can be fully implemented without changes to the underlying switch firmware, we argue that such changes are feasible in existing hardware. Moreover, we present an evaluation that suggests that moving Paxos logic into the network would yield significant performance benefits for distributed applications.
bibtex
```
@INPROCEEDINGS{Dang.SOSR15,
  author = {Dang, Huynh Tu and Sciascia, Daniele and Canini, Marco and Pedone, Fernando and Soul\'e, Robert},
  title = {{NetPaxos: Consensus at Network Speed}},
  booktitle = {Proceedings of SOSR'15)},
  year = {2015},
  month = {Jun}
}
```
A Distributed and Robust SDN Control Plane for Transactional Network Updates (Technical Report)
M. Canini, P. Kuznetsov, D. Levin, S. Schmid.
In Proceedings of INFOCOM’15, Apr 2015.

abstract
Software-defined networking (SDN) is a novel paradigm that outsources the control of programmable network switches to a set of software controllers. The most fundamental task of these controllers is the correct implementation of the network policy, i.e., the intended network behavior. In essence, such a policy specifies the rules by which packets must be forwarded across the network. This paper studies a distributed SDN control plane that enables concurrent and robust policy implementation. We introduce a formal model describing the interaction between the data plane and a distributed control plane (consisting of a collection of fault-prone controllers). Then we formulate the problem of consistent composition of concurrent network policy updates (termed the CPC Problem). To anticipate scenarios in which some conflicting policy updates must be rejected, we enable the composition via a natural transactional interface with all-or-nothing semantics.
We show that the ability of an f-resilient distributed control plane to process concurrent policy updates depends on the tag complexity, i.e., the number of policy labels (a.k.a. tags) available to the controllers, and describe a CPC protocol with optimal tag complexity f+2.
bibtex
```
@INPROCEEDINGS{Canini.INFOCOM15,
  author = {Canini, Marco and Kuznetsov, Petr and Levin, Dan and Schmid, Stefan},
  title = {{A Distributed and Robust SDN Control Plane for Transactional Network Updates}},
  booktitle = {Proceedings of INFOCOM'15)},
  year = {2015},
  month = {Apr} 
}
```
Software-Defined Networks: Incremental Deployment with Panopticon
M. Canini, A. Feldmann, D. Levin, F. Schaffert, S. Schmid.
Computer, 47(11), Nov 2014.

abstract
Practically speaking, most enterprises migrating to a software-defined network (SDN) must do so incrementally. Panopticon offers an approach for designing and operating an interim hybrid network that combines both traditional and SDN switches by exposing a logical SDN abstraction.
bibtex
```
@ARTICLE{Canini.PanopticonComputer, 
  author = {Canini, Marco and Feldmann, Anja and Levin, Dan and Schaffert, Fabian and Schmid, Stefan},
  journal = {Computer},
  title = {{Software-Defined Networks: Incremental Deployment with Panopticon}},
  year = {2014},
  month = {Nov},
  volume = {47},
  number = {11},
  doi = {10.1109/MC.2014.330}, 
}
```
ESPRES: Transparent SDN Update Scheduling
P. Peresini, M. Kuzniar, M. Canini, D. Kostic.
In Proceedings of HotSDN’14, Aug 2014.

abstract
Network forwarding state undergoes frequent changes, in batches of forwarding rule modifications at multiple switches. Installing or modifying a large number of rules is time consuming given the performance limits of current programmable switches, which are also due to economical factors in addition to technological ones.
In this paper, we observe that a large network-state update typically consists of a set of sub-updates that are independent of one another w.r.t. the traffic they affect, and hence sub-updates can be installed in parallel, in any order. Leveraging this observation, we treat update installation as a scheduling problem and design ESPRES, a runtime mechanism that rate-limits and reorders updates to fully utilize processing capacities of switches without overloading them. Our early results show that compared to using no scheduler, our schemes yield 2.17-3.88 times quicker sub-update completion time for 20th percentile of sub-updates and 1.27-1.57 times quicker for 50th percentile.
bibtex
```
@INPROCEEDINGS{Peresini.HOTSDN14,
  author = {Pere\v{s}\'ini, Peter and Ku\'zniar, Maciej and Canini, Marco and Kosti\'c, Dejan},
  title = {{ESPRES: Transparent SDN Update Scheduling}},
  booktitle = {Proceedings of HotSDN'14},
  year = {2014},
  month = {Aug} 
}
```
Panopticon: Reaping the Benefits of Incremental SDN Deployment in Enterprise Networks (Technical Report)
D. Levin, M. Canini, S. Schmid, F. Schaffert, A. Feldmann.
In Proceedings of USENIX Annual Technical Conference (ATC’14), Jun 2014.

abstract
The operational challenges posed in enterprise networks present an appealing opportunity for automated orchestration by way of Software-Defined Networking (SDN). The primary challenge to SDN adoption in the enterprise is the deployment problem: How to deploy and operate a network consisting of both legacy and SDN switches, while benefiting from simplified management and enhanced flexibility of SDN.
This paper presents the design and implementation of Panopticon, an architecture for operating networks that combine legacy and SDN switches. Panopticon exposes an abstraction of a logical SDN in a partially upgraded legacy network, where SDN benefits can extend over the entire network. We demonstrate the feasibility and evaluate the efficiency of our approach through both testbed experiments with hardware switches and through simulation on real enterprise campus network topologies entailing over 1500 devices. Our results suggest that when as few as 10% of distribution switches support SDN, most of an enterprise network can be operated as a single SDN while meeting key resource constraints.
bibtex
```
@INPROCEEDINGS{Levin.ATC14,
author = {Levin, Dan and Canini, Marco and Schmid, Stefan and Schaffert, Fabian and Feldmann, Anja},
title = {{Panopticon: Reaping the Benefits of Incremental SDN Deployment in Enterprise Networks}},
booktitle = {Proceedings of the 2014 USENIX Annual Technical Conference (ATC'14)},
year = {2014},
month = {Jun} 
}
```
ESPRES: Easy Scheduling and Prioritization for SDN
P. Peresini, M. Kuzniar, M. Canini, D. Kostic.
In Proceedings of Open Networking Summit (ONS’14), Mar 2014.

abstract
Due to traffic engineering, topology changes, and VM migrations, today’s networks undergo a variety of large updates that concurrently affect many switches. Big updates are time consuming because the available switches have low rule installation rates. In this paper, we observe that a large network update may consist of a set of sub-updates that are independent and can be installed in parallel in any order. We treat update installation as a scheduling problem and design ESPRES, a runtime mechanism that rate-limits and reorders updates to fully utilize switches without overloading them. Our early results show that compared to using no scheduler, our simple scheduler yields 4 times quicker sub-update completion time for 20th percentile of sub-updates and 40% quicker for 50th percentile.
bibtex
```
@INPROCEEDINGS{Peresini.ONS14,
  author = {Pere\v{s}\'ini, Peter and Ku\'zniar, Maciej and Canini, Marco and Kosti\'c, Dejan},
  title = {{ESPRES: Easy Scheduling and Prioritization for SDN}},
  booktitle = {Proceedings of Open Networking Summit (ONS'14)},
  year = {2014},
  month = {Mar} 
}
```
STN: A Robust and Distributed SDN Control Plane
M. Canini, D. De Cicco, P. Kuznetsov, D. Levin, S. Schmid, S. Vissicchio.
In Proceedings of Open Networking Summit (ONS’14), Mar 2014.

abstract
This paper presents the STN, a distributed SDN control plane based on Software Transactional Memory principles that supports concurrent policy updates while ensuring consistent policy composition and high availability. In particular, STN gives network programmers an abstraction that guarantees (1) linearizability: any execution can be transformed into an equivalent sequential history and (2) wait-freeness: the control plane tolerates up to f node failures, for some parameter f. We describe an STN design using a number of tags which is asymptotically minimal to ensure the above properties. Also, we report our progresses in building an STN prototype based on the Pyretic modular programming language.
bibtex
```
@INPROCEEDINGS{Canini.ONS14,
  author = {Canini, Marco and De Cicco, Daniele and Kuznetsov, Petr and Levin, Dan and Schmid, Stefan and Vissicchio, Stefano},
  title = {{STN: A Robust and Distributed SDN Control Plane}},
  booktitle = {Proceedings of Open Networking Summit (ONS'14)},
  year = {2014},
  month = {Mar} 
}
```
Brief Announcement: Towards Distributed and Reliable Software Defined Networking
M. Canini, P. Kuznetsov, D. Levin, S. Schmid.
In Proceedings of International Symposium on Distributed Computing (DISC’13), Oct 2013.

abstract
Software-defined networking (SDN) is a novel paradigm that out-sources the control of packet-forwarding switches to a set of software controllers. The most fundamental task of these controllers is the correct implementation of the network policy, i.e., the intended network behavior. In essence, such a policy specifies the rules by which packets must be forwarded across the network. This paper initiates the study of the SDN control plane as a distributed system.
bibtex
```
@INPROCEEDINGS{Canini.DISC13,
  author = {Canini, Marco and Kuznetsov, Petr and Levin, Dan and Schmid, Stefan},
  title = {{Brief Announcement: Towards Distributed and Reliable Software Defined Networking}},
  booktitle = {Proceedings of International Symposium on Distributed Computing (DISC'13)},
  year = {2013},
  month = {Oct} 
}
```
The Case for Reliable Software Transactional Networking
M. Canini, P. Kuznetsov, D. Levin, S. Schmid.
arXiv preprint arXiv:1305.7429

abstract
Software-defined networking (SDN) is a novel paradigm that out-sources the control of packet-forwarding switches to a set of software controllers. The most fundamental task of these controllers is the correct implementation of the network policy, i.e., the intended network behavior. In essence, such a policy specifies the rules by which packets must be forwarded across the network. This paper initiates the study of the SDN control plane as a distributed system.
We introduce a formal model describing the interaction between the data plane and a distributed control plane (consisting of a collection of fault-prone controllers). Then we formulate the problem of consistent composition of concurrent network policy updates. The composition is enabled via a transactional interface with all-or-nothing semantics. The system behaves as though committed updates are installed atomically and every data packet traverses the network instantaneously, respecting a sequential composition of previously installed committed updates. Updates that cannot be composed are aborted and do not affect the data plane.
We show that in the asynchronous environment, it is impossible to achieve consistent policy composition that tolerates a single controller crash. We then discuss stronger variants of the model that allow for solving the problem and study algorithmic complexities of such solutions.
bibtex
```
@ARTICLE{2013arXiv1305.7429C,
  author = {Canini, Marco and Kuznetsov, Petr and Levin, Dan and Schmid, Stefan},
  title = {{The Case for Reliable Software Transactional Networking}},
  journal = {arXiv preprint arXiv:1305.7429},
  year = {2013},
  month = {May}
}
```
Software Transactional Networking: Concurrent and Consistent Policy Composition
M. Canini, P. Kuznetsov, D. Levin, S. Schmid.
In Proceedings of HotSDN’13, Aug 2013.

abstract
It seems natural to imagine that SDN policy specification and control is distributed, and this paper focuses on the resulting concurrency issues. Indeed, conflicts among concurrent policy updates may result in serious inconsistencies on the data plane, even when each update is installed with per-packet consistent update semantics. This paper introduces the problem of consistent composition of concurrent policy updates. Intuitively, consistent concurrent policy composition must appear as though there is no concurrency neither between any policy updates, nor between a policy update and in-flight packets on the data plane.
We propose an elegant policy composition abstraction based on a transactional interface with all-or-nothing semantics: a policy update is either committed, in which case the policy is guaranteed to compose consistently over the entire network and the update is installed in its entirety, or aborted, in which case, no packet is affected by it. Consequently, the control application logic is relieved from the cumbersome and potentially error-prone synchronization and locking tasks, and control applications are kept light-weight. In this paper, we also sketch a simple implementation of the transactional synchronization: our approach is based on fine-grained locking on network components and avoids complex state machine replication.
bibtex
```
@INPROCEEDINGS{Canini.HOTSDN13,
author = {Canini, Marco and Kuznetsov, Petr and Levin, Dan and Schmid, Stefan},
title = {{Software Transactional Networking: Concurrent and Consistent Policy Composition}},
booktitle = {Proceedings of HotSDN'13},
year = {2013},
month = {Aug}
}
```
FatTire: Declarative Fault Tolerance for Software-Defined Networks
M. Reitblatt, M. Canini, A. Guha, N. Foster.
In Proceedings of HotSDN’13, Aug 2013.

abstract
This paper presents FatTire, a new language for writing fault-tolerant network programs. The central feature of this language is a new programming construct based on regular expressions that allows developers to specify the set of paths that packets may take through the network as well as the degree of fault tolerance required. This construct is implemented by a compiler that targets the in-network fast-failover mechanisms provided in recent versions of the OpenFlow standard, and facilitates simple reasoning about network programs even in the presence of failures. We describe the design of FatTire, present algorithms for compiling FatTire programs to OpenFlow switch configurations, describe our prototype FatTire implementation, and demonstrate its use on simple examples.
bibtex
```
@INPROCEEDINGS{Reitblatt.HOTSDN13,
author = {Reitblatt, Mark and Canini, Marco and Guha, Arjun and Foster, Nate},
title = {{FatTire: Declarative Fault Tolerance for Software-Defined Networks}},
booktitle = {Proceedings of HotSDN'13},
year = {2013},
month = {Aug}
}
```
OF.CPP: Consistent Packet Processing for OpenFlow
P. Peresini, M. Kuzniar, N. Vasic, M. Canini, D. Kostic.
In Proceedings of HotSDN’13, Aug 2013.

abstract
This paper demonstrates a new class of bugs that is likely to occur in enterprise OpenFlow deployments. In particular, step-by-step, reactive establishment of paths can cause network-wide inconsistencies or performance- and space-related inefficiencies. The cause for this behavior is inconsistent packet processing: as the packets travel through the network they do not encounter consistent state at the OpenFlow controller. To mitigate this problem, we propose to use transactional semantics at the controller to achieve consistent packet processing. We detail the challenges in achieving this goal (including the inability to directly apply database techniques), as well as a potentially promising approach. In particular, we envision the use of multi-commit transactions that could provide the necessary serialization and isolation properties without excessively reducing network performance.
bibtex
```
@INPROCEEDINGS{Peresini.HOTSDN13,
author = {Pere\v{s}\'ini, Peter and Ku\'zniar, Maciej and Vasi\'c, Nedeljko and Canini, Marco and Kosti\'c, Dejan},
title = {{OF.CPP: Consistent Packet Processing for OpenFlow}},
booktitle = {Proceedings of HotSDN'13},
year = {2013},
month = {Aug}
}
```
Toward Transitional SDN Deployment in Enterprise Networks
D. Levin, M. Canini, S. Schmid, A. Feldmann.
Open Networking Summit, Research Track (ONS’13), Apr 2013.

abstract
The operational challenges posed in enterprise networks, present an appealing opportunity for the software-defined orchestration of the network (SDN). However, the primary challenge to realizing solutions built on SDN in the enterprise is the deployment problem. Unlike in the datacenter, network upgrades in the enterprise start with the existing deployment and are particularly budget and resource-constrained.
In this work, we investigate the prospect for partial Software Defined Network deployment. We present Panopticon, an architecture and methodology for planning and operating networks that combine legacy and upgraded SDN switches. Panopticon exposes an abstraction of a fully-deployed SDN in a partially upgraded legacy network, where the SDN benefits extend potentially over the entire network.
We evaluate the feasibility of our approach through simulation on real enterprise campus network topologies entailing over 1500 switches and routers. Our results suggest that with only a handful of upgraded switches, it becomes possible to operate most of an enterprise network as a single SDN while meeting key resource constraints.
bibtex
```
@INPROCEEDINGS{Levin.ONS13,
author = {Levin, Dan and Canini, Marco and Schmid, Stefan and Feldmann, Anja},
title = {{Toward Transitional SDN Deployment in Enterprise Networks}},
booktitle = {Open Networking Summit (ONS'13)},
year = {2013},
month = {Apr}
}
```

Education in Reproducible Research

Learning Reproducibility with a Yearly Networking Contest
M. Canini, J. Crowcroft.
In Proceedings of Reproducibility’17, Aug 2017.

abstract
Better reproducibility of networking research results is currently a major goal that the academic community is striving towards. This position paper makes the case that improving the extent and pervasiveness of reproducible research can be greatly fostered by organizing a yearly international contest.
bibtex
```
@INPROCEEDINGS{Canini.Reproducibility17,
  author = {Canini, Marco and Crowcroft, Jon},
  title = {{Learning Reproducibility with a Yearly Networking Contest}},
  booktitle = {Proceedings of the Reproducibility Workshop},
  year = {2017},
  month = {Aug} 
}
```

Reliable Networked Systems

TENSOR: Lightweight BGP Non-Stop Routing
C. Miao, Y. Xiao, M. Canini, R. Dai, S. Zheng, J. Wang, J. Bu, A. Kuzmanovic, Y. Wang.
In Proceedings of SIGCOMM’23, Sep 2023.

abstract
As the solitary inter-domain protocol, BGP plays an important role in today’s Internet. Its failures threaten network stability and will usually result in large-scale packet losses. Thus, the non-stop routing (NSR) capability that protects inter-domain connectivity from being disrupted by various failures, is critical to any Autonomous System (AS) operator. Replicating the BGP and underlying TCP connection status is key to realizing NSR. But existing NSR solutions, which heavily rely on OS kernel modifications, have become impractical due to providers’ adoption of virtualized network gateways for better scalability and manageability.
In this paper, we tackle this problem by proposing TENSOR, which incorporates a novel kernel-modification-free replication design and lightweight architecture. More concretely, the kernel-modification-free replication design mitigates the reliance on OS kernel modification and hence allows the virtualization of the network gateway. Meanwhile, lightweight virtualization provides strong performance guarantees and improves system reliability. Moreover, TENSOR provides a solution to the split-brain problem that affects NSR solutions. Through extensive experiments, we show that TENSOR realizes NSR while bearing little overhead compared to open-source BGP implementations. Further, our two-year operational experience on a fleet of 400 servers controlling over 31,000 BGP peering connections demonstrates that TENSOR reduces the development, deployment, and maintenance costs significantly – at least by factors of 20, 5, and 10, respectively, while retaining the same SLA with the NSR-enabled routers.
bibtex
```
@INPROCEEDINGS{tensor,
  author = {Miao, Congcong and Xiao, Yunming and Canini, Marco and Dai, Ruiqiang and Zheng, Shengli and Wang, Jilong and Bu, Jiwu and Kuzmanovic, Aleksandar and Wang, Yachen},
  title = {{TENSOR: Lightweight BGP Non-Stop Routing}},
  booktitle = {Proceedings of SIGCOMM'23},
  year = {2023},
  month = {Sep} 
}
```
Systematically Testing OpenFlow Controller Applications
P. Peresini, M. Kuzniar, M. Canini, D. Venzano, D. Kostic, J. Rexford.
Computer Networks, 92(2), Dec 2015.

abstract
The emergence of OpenFlow-capable switches enables exciting new network functionality, at the risk of programming errors that make communication less reliable. The centralized programming model, where a single controller program manages the network, seems to reduce the likelihood of bugs. However, the system is inherently distributed and asynchronous, with events happening at different switches and end hosts, and inevitable delays affecting communication with the controller. In this paper, we present efficient, systematic techniques for testing unmodified controller programs. Our NICE tool applies model checking to explore the state space of the entire system—the controller, the switches, and the hosts. Scalability is the main challenge, given the diversity of data packets, the large system state, and the many possible event orderings. To address this, we propose a novel way to augment model checking with symbolic execution of event handlers (to identify representative packets that exercise code paths on the controller). We also present a simplified OpenFlow switch model (to reduce the state space), and effective strategies for generating event interleavings likely to uncover bugs. Our prototype tests Python applications on the popular NOX platform. In testing three real applications—a MAC-learning switch, in-network server load balancing, and energy-efficient traffic engineering—we uncover 13 bugs.
bibtex
```
@ARTICLE{Peresini.15.NICE,
author = {Pere\v{s}\'ini, Peter and Ku\'zniar, Maciej and Canini, Marco and Venzano, Daniele and Kosti\'c, Dejan and Rexford, Jennifer},
title = {{Systematically Testing OpenFlow Controller Applications}},
journal = {Computer Networks},
volume = {92},
number = {2},
year = {2015},
month = {Dec},
doi = {10.1016/j.comnet.2015.03.019}
}
```
A SOFT Way for OpenFlow Switch Interoperability Testing
M. Kuzniar, P. Peresini, M. Canini, D. Venzano, D. Kostic.
In Proceedings of ACM CoNEXT’12, Dec 2012.

abstract
The increasing adoption of Software Defined Networking, and OpenFlow in particular, brings great hope for increasing extensibility and lowering costs of deploying new network functionality. A key component in these networks is the OpenFlow agent, a piece of software that a switch runs to enable remote programmatic access to its forwarding tables. While testing high-level network functionality, the correct behavior and interoperability of any OpenFlow agent are taken for granted. However, existing tools for testing agents are not exhaustive nor systematic, and only check that the agent’s basic functionality works. In addition, the rapidly changing and sometimes vague OpenFlow specifications can result in multiple implementations that behave differently.
This paper presents SOFT, an approach for testing the interoperability of OpenFlow switches. Our key insight is in automatically identifying the testing inputs that cause different OpenFlow agent implementations to behave inconsistently. To this end, we first symbolically execute each agent under test in isolation to derive which set of inputs causes which behavior. We then crosscheck all distinct behaviors across different agent implementations and evaluate whether a common input subset causes inconsistent behaviors. Our evaluation shows that our tool identified several inconsistencies between the publicly available Reference OpenFlow switch and Open vSwitch implementations.
bibtex
```
@INPROCEEDINGS{Kuzniar.CONEXT12,
author = {Ku\'zniar, Maciej and Pere\v{s}\'ini, Peter and Canini, Marco and Venzano, Daniele and Kosti\'c, Dejan},
title = {{A SOFT Way for OpenFlow Switch Interoperability Testing}},
booktitle = {Proceedings of ACM CoNEXT'12},
year = {2012},
month = {Dec}
}
```
OFTEN Testing OpenFlow Networks
M. Kuzniar, M. Canini, D. Kostic.
In Proceedings of the 1st European Workshop on Software Defined Networks (EWSDN), Oct 2012.

abstract
Software-defined networking and OpenFlow in particular enable independent development of network devices and software that controls them. Such separation of concerns eases the introduction of new network functionality; however, it leads to distributed responsibility for bugs. Despite the common interface, separate development entails the need to test an integrated network before deployment. In this work-in-progress paper, we identify the challenges of creating an environment that simplifies and systematically conducts such tests. We discuss optimizations required for efficient and reliable OpenFlow switch black-box testing and present a possible approach to address other challenges. In our preliminary prototype, we combine systematic state-space exploration techniques with real switches execution to explore an integrated network behavior. Our initial results show that such methods help detect previously unrevealed inconsistencies in the network.
bibtex
```
@INPROCEEDINGS{Kuzniar.EWSDN12,
author = {Ku\'zniar, Maciej and Canini, Marco and Kosti\'c, Dejan},
title = {{OFTEN Testing OpenFlow Networks}},
booktitle = {Proceedings of the 1st European Workshop on Software Defined Networks (EWSDN'12)},
year = {2012},
month = {Oct}
}
```
A NICE Way to Test OpenFlow Applications (Technical Report)
M. Canini, D. Venzano, P. Peresini, D. Kostic, J. Rexford.
In Proceedings of NSDI’12, Apr 2012.

abstract
The emergence of OpenFlow-capable switches enables exciting new network functionality, at the risk of programming errors that make communication less reliable. The centralized programming model, where a single controller program manages the network, seems to reduce the likelihood of bugs. However, the system is inherently distributed and asynchronous, with events happening at different switches and end hosts, and inevitable delays affecting communication with the controller. In this paper, we present efficient, systematic techniques for testing unmodified controller programs. Our NICE tool applies model checking to explore the state space of the entire system—the controller, the switches, and the hosts. Scalability is the main challenge, given the diversity of data packets, the large system state, and the many possible event orderings. To address this, we propose a novel way to augment model checking with symbolic execution of event handlers (to identify representative packets that exercise code paths on the controller). We also present a simplified OpenFlow switch model (to reduce the state space), and effective strategies for generating event interleavings likely to uncover bugs. Our prototype tests Python applications on the popular NOX platform. In testing three real applications—a MAC-learning switch, in-network server load balancing, and energy-efficient traffic engineering—we uncover eleven bugs.
bibtex
```
@INPROCEEDINGS{Canini.NSDI12,
author = {Canini, Marco and Venzano, Daniele and Pere\v{s}\'ini, Peter and Kosti\'c, Dejan and Rexford, Jennifer},
title = {{A NICE Way to Test OpenFlow Applications}},
booktitle = {Proceedings of 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12)},
year = {2012},
month = {Apr}
}
```
Automating the Testing of OpenFlow Applications
M. Canini, D. Kostic, J. Rexford, D. Venzano.
In Proceedings of the 1st International Workshop on Rigorous Protocol Engineering (WRiPE 2011), Oct 2011.

abstract
Software-defined networking, and the emergence of OpenFlow-capable switches, enables a wide range of new network functionality. However, enhanced programmability inevitably leads to more software faults (or bugs). We believe that tools for testing OpenFlow programs are critical to the success of the new technology. However, the way OpenFlow applications interact with the data plane raises several challenges. First, the space of possible inputs (e.g., packet headers and inter-packet timings) is huge. Second, the centralized controller has a indirect view of the traffic and experiences unavoidable delays in installing rules in the switches. Third, external factors like user behavior (e.g., mobility) and higher-layer protocols (e.g., the TCP state machine) affect the correctness of OpenFlow programs.
In this work-in-progress paper, we extend techniques for symbolic execution to generate inputs that systematically explore the space of system executions. First, we analyze controller applications to identify equivalence classes of packets that exercise different parts of the code. Second, we propose several network models with increasing precision, ranging from simple traffic models to live testing on the target network. Initial experiences with our prototype, which symbolically executes OpenFlow applications written in Python, suggest that our techniques can help programmers identify bugs in their OpenFlow programs.
bibtex
```
@INPROCEEDINGS{Canini.WRIPE,
author = {Canini, Marco and Kosti\'c, Dejan and Rexford, Jennifer and Venzano, Daniele},
title = {{Automating the Testing of OpenFlow Applications}},
booktitle = {Proceedings of the 1st International Workshop on Rigorous Protocol Engineering (WRiPE'11)},
year = {2011},
month = {Oct}
}
```
Finding Almost-Invariants in Distributed Systems
M. Yabandeh, A. Anand, M. Canini, D. Kostic.
In Proceedings of the 30th IEEE Symposium on Reliable Distributed Systems (SRDS’11), Oct 2011.

abstract
It is notoriously hard to develop dependable distributed systems. This is partly due to the difficulties in foreseeing various corner cases and failure scenarios while implementing a system that will be deployed over an asynchronous network. In contrast, reasoning about the desired distributed system behavior and the corresponding invariants is easier than reasoning about the code itself. Further, the invariants can be used for testing, theorem proving, and runtime enforcement.
In this paper, we propose an approach to observe the system behavior and automatically infer invariants which reveal implementation bugs. Using our tool, Avenger, we automatically generate a large number of potentially relevant properties, check them within the time and spatial domains using traces of system executions, and filter out all but a few properties before reporting them to the developer. Our key insight in filtering is that a good candidate for an invariant is the one that holds in all but a few cases, i.e., an “almost-invariant”. Our experimental results with the XORP BGP implementation demonstrate Avenger’s ability to identify the almost-invariants that lead the developer to programming errors.
bibtex
```
@INPROCEEDINGS{Yabandeh.SRDS.Avenger,
author = {Yabandeh, Maysam and Abhishek, Anand and Canini, Marco and Kosti\'c, Dejan},
title = {{Finding Almost-Invariants in Distributed Systems}},
booktitle = {Proceedings of the 30th IEEE Symposium on Reliable Distributed Systems (SRDS'11)},
year = {2011},
month = {Oct}
}
```
Toward Online Testing of Federated and Heterogeneous Distributed Systems (Technical Report)
M. Canini, V. Jovanovic, D. Venzano, B. Spasojevic, O. Crameri, D. Kostic.
In Proceedings of USENIX Annual Technical Conference (ATC’11), Jun 2011.

abstract
Making distributed systems reliable is notoriously difficult. It is even more difficult to achieve high reliability for federated and heterogeneous systems, i.e., those that are operated by multiple administrative entities and have numerous inter-operable implementations. A prime example of such a system is the Internet’s inter-domain routing, today based on BGP.
We argue that system reliability should be improved by proactively identifying potential faults using an online testing functionality. We propose DiCE, an approach that continuously and automatically explores the system behavior, to check whether the system deviates from its desired behavior. DiCE orchestrates the exploration of relevant system behaviors by subjecting system nodes to many possible inputs that exercise node actions. DiCE starts exploring from current, live system state, and operates in isolation from the deployed system. We describe our experience in integrating DiCE with an open-source BGP router. We evaluate the prototype’s ability to quickly detect origin misconfiguration, a recurring operator mistake that causes Internet-wide outages. We also quantify DiCE’s overhead and find it to have marginal impact on system performance.
bibtex
```
@INPROCEEDINGS{Canini.ATC.TowardOnlineTesting,
author = {Canini, Marco and Jovanovi\'c, Vojin and Venzano, Daniele and Spasojevi\'c, Boris and Crameri, Olivier and Kosti\'c, Dejan},
title = {{Toward Online Testing of Federated and Heterogeneous Distributed Systems}},
booktitle = {Proceedings of the 2011 USENIX Annual Technical Conference (ATC'11)},
year = {2011},
month = {Jun}
}
```
Fault Prediction in Distributed Systems Gone Wild
M. Canini, D. Novakovic, V. Jovanovic, D. Kostic.
In Proceedings of the 4th ACM SIGOPS/SIGACT Workshop on Large Scale Distributed Systems and Middleware (LADIS’10), Jul 2010.

abstract
We consider the problem of predicting faults in deployed, large-scale distributed systems that are heterogeneous and federated. Motivated by the importance of ensuring reliability of the services these systems provide, we argue that the key step in making these systems reliable is the need to automatically predict faults. For example, doing so is vital for avoiding Internet-wide outages that occur due to programming errors or misconfigurations.
bibtex
```
@INPROCEEDINGS{Canini.LADIS.FaultPred,
author = {Canini, Marco and Novakovi\'c, Dejan and Jovanovi\'c, Vojin and Kosti\'c, Dejan},
title = {{Fault Prediction in Distributed Systems Gone Wild}},
booktitle = {Proceedings of the 4th ACM SIGOPS/SIGACT Workshop on Large Scale Distributed Systems and Middleware (LADIS'10)},
year = {2010},
month = {Jul}
}
```

Energy-Efficient Networks

Identifying and Using Energy-Critical Paths
N. Vasic, D. Novakovic, S. Shekhar, P. Bhurat, M. Canini, D. Kostic.
In Proceedings of ACM CoNEXT’11, Dec 2011.

abstract
The power consumption of the Internet and datacenter networks is already significant, and threatens to shortly hit the power delivery limits while the hardware is trying to sustain ever-increasing traffic requirements. Existing energy-reduction approaches in this domain advocate recomputing network configuration with each substantial change in demand. Unfortunately, computing the minimum network subset is computationally hard and does not scale. Thus, the network is forced to operate with diminished performance during the recomputation periods. In this paper, we propose REsPoNse, a framework which overcomes the optimality-scalability trade-off. The insight in REsPoNse is to identify a few energy-critical paths off-line, install them into network elements, and use a simple online element to redirect the traffic in a way that enables large parts of the network to enter a low-power state. We evaluate REsPoNse with real network data and demonstrate that it achieves the same energy savings as the existing approaches, with marginal impact on network scalability and application performance.
bibtex
```
@INPROCEEDINGS{Vasic.CONEXT11,
author = {Vasi\'c, Nedeljko and Novakovi\'c, Dejan and Shekhar, Satyam and Bhurat, Prateek and Canini, Marco and Kosti\'c, Dejan},
title = {{Identifying and Using Energy-Critical Paths}},
booktitle = {Proceedings of ACM CoNEXT'11},
year = {2011},
month = {Dec}
}
```
Insomnia in the Access or How to Curb Access Network Related Energy Consumption
E. Goma, M. Canini, A. Lopez Toledo, N. Laoutaris, D. Kostic, P. Rodriguez, R. Stanojevic, P. Yague Valentin.
In Proceedings of ACM SIGCOMM’11, Aug 2011.

abstract
Access networks include modems, home gateways, and DSL Access Multiplexers (DSLAMs), and are responsible for 70-80% of total network-based energy consumption. In this paper, we take an in-depth look at the problem of greening access networks, identify root problems, and propose practical solutions for their user- and ISP-parts. On the user side, the combination of continuous light traffic and lack of alternative paths condemns gateways to being powered most of the time despite having Sleep-on-Idle (SoI) capabilities. To address this, we introduce Broadband Hitch-Hiking (BH²), that takes advantage of the overlap of wireless networks to aggregate user traffic in as few gateways as possible. In current urban settings BH² can power off 65-90% of gateways. Powering off gateways permits the remaining ones to synchronize at higher speeds due to reduced crosstalk from having fewer active lines. Our tests reveal speedup up to 25%. On the ISP side, we propose introducing simple inexpensive switches at the distribution frame for batching active lines to a subset of cards letting the remaining ones sleep. Overall, our results show an 80% energy savings margin in access networks. The combination of BH² and switching gets close to this margin, saving 66% on average.
bibtex
```
@INPROCEEDINGS{Goma.SIGCOMM.BH2,
author = {Eduard Goma and Marco Canini and Alberto Lopez Toledo and Nikolaos Laoutaris and Dejan Kosti\'c and Pablo Rodriguez and Rade Stanojevi\'c and Pablo Yag\"ue Valent\'in},
title = {{Insomnia in the Access or How to Curb Access Network Related Energy Consumption}},
booktitle = {Proceedings of ACM SIGCOMM'11},
year = {2011},
month = {Aug}
}
```

Network Monitoring and Application Identification

Evaluation and Design of Cache Replacement Policies under Flooding Attacks
M. Zadnik, M. Canini.
In Proceedings of the 2nd International Workshop on TRaffic Analysis and Classification (TRAC’11), Jul 2011.

abstract
A flow cache is a fundamental building block for flow-based traffic processing. Its efficiency is critical for the overall performance of a number of networked devices and systems. However, if not properly managed, the flow cache can be easily filled up and rendered ineffective by traffic patterns such as flooding attacks and scanning activities which, unfortunately, commonly occur in the Internet.In this paper, we show that popular cache replacement policies such as LRU cause the flow caches to evict the so called heavy-hitter flows during flooding attacks. To address this shortcoming, we build upon our recent work [1] and construct a replacement policy that is more resilient to floods and yet performs similarly to other policies under common network traffic conditions.
bibtex
```
@INPROCEEDINGS{Zadnik.TRAC.CacheEvo,
author = {Zadnik, Martin and Canini, Marco},
title = {{Evaluation and Design of Cache Replacement Policies under Flooding Attacks}},
booktitle = {Proceedings of the 2nd International Workshop on TRaffic Analysis and Classification (TRAC'11)},
year = {2011},
month = {Jul}
}
```
Evolution of Cache Replacement Policies to Track Heavy-hitter Flows
M. Zadnik, M. Canini.
In Proceedings of Passive Active Measurement conference (PAM’11), Mar 2011.

abstract
Several important network applications cannot easily scale to higher data rates without requiring focusing just on the large traffic flows. Recent works have discussed algorithmic solutions that trade-off accuracy to gain efficiency for filtering and tracking the so-called “heavy-hitters”. However, a major limit is that flows must initially go through a filtering process, making it impossible to track state associated with the first few packets of the flow.
In this paper, we propose a different paradigm in tracking the large flows which overcomes this limit. We view the problem as that of managing a small flow cache with a finely tuned replacement policy that strives to avoid evicting the heavy-hitters. Our scheme starts from recorded traffic traces and uses Genetic Algorithms to evolve a replacement policy tailored for supporting seamless, stateful traffic-processing. We evaluate our scheme in terms of missed heavy-hitters: it performs close to the optimal, oracle-based policy, and when compared to other standard policies, it consistently outperforms them, even by a factor of two in most cases.
bibtex
```
@INPROCEEDINGS{Zadnik.PAM.CacheEvo,
author = {Zadnik, Martin and Canini, Marco},
title = {{Evolution of Cache Replacement Policies to Track Heavy-hitter Flows}},
booktitle = {Proceedings of Passive Active Measurement conference (PAM'11)},
year = {2011},
month = {Mar}
}
```
Experience with High-Speed Automated Application-Identification for Network-Management
M. Canini, W. Li, M. Zadnik, A. W. Moore.
In Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS’09), Oct 2009.

abstract
AtoZ, an automatic traffic organizer, provides control of how network-resources are used by applications. It does this by combining the high-speed packet processing of the NetFPGA with an efficient method for application-behavior labeling. AtoZ can control network resources by prohibiting certain applications and controlling the resources available to others. We discuss deployment experience and use real traffic to illustrate how such an architecture enables several distinct features: high accuracy, high throughput, minimal delay, and efficient packet labeling - all in a low cost, robust configuration that works alongside the enterprise access-router.
bibtex
```
@INPROCEEDINGS{Canini.ANCS.AtoZ,
author = {Marco Canini and Wei Li and Martin Zadnik and Andrew W. Moore},
title = {{Experience with High-Speed Automated Application-Identification for Network-Management}},
booktitle = {Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS'09)},
year = {2009},
month = {Oct}
}
```
Tracking Elephant Flows in Internet Backbone Traffic with an FPGA-based Cache
M. Zadnik, M. Canini, A. W. Moore, D. J. Miller, W. Li.
In Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL’09), Aug 2009.

abstract
This paper presents an FPGA-friendly approach to tracking elephant flows in network traffic. Our approach, Single Step Segmented Least Recently Used (S³-LRU) policy, is a network traffic-friendly replacement policy for maintaining flow states in a Naïve Hash Table (NHT). We demonstrate that our S³-LRU approach preserves elephant flows: conservatively promoting potential elephants and evicting low rate flows in LRU manner. Our approach keeps flow-state of any elephant since start-of-day and provides a significant improvement over filtering approaches proposed in previous work. Our FPGA-based implementation of the S³-LRU in combination with an NHT suites well the parallel access to block memories while capitalising on the retuning of parameters through dynamic-reprogramming.
bibtex
```
@INPROCEEDINGS{Zadnik.FPL09.Elephant,
author = {Martin Zadnik and Marco Canini and Andrew W. Moore and David J. Miller and Wei Li},
title = {{Tracking Elephant Flows in Internet Backbone Traffic with an FPGA-based Cache}},
booktitle = {Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL'09)},
year = {2009},
month = {Aug}
}
```
GTVS: Boosting the Collection of Application Traffic Ground Truth
M. Canini, W. Li, A. W. Moore, R. Bolla.
In Proceedings of the First International Workshop on Traffic Monitoring and Analysis (TMA’09), May 2009.

abstract
Interesting research in the areas of traffic classification, network monitoring, and application-oriented analysis can not proceed without real traffic traces, labeled with actual application information. However, hand-labeled traces are an extremely valuable but scarce resource in the traffic monitoring and analysis community, as a result of both privacy concerns and technical difficulties. Hardly any possibility exists for payloaded data to be released, while the impossibility of obtaining certain ground-truth application information from non-payloaded data has severely constrained the value of anonymized public traces.
The usual way to obtain the ground truth is fragile, inefficient and not directly comparable from one’s work to another. This paper proposes a methodology and details the design of a technical framework that significantly boosts the efficiency in compiling the application traffic ground truth. Further, a case study on a 30 minute real data trace is presented. In contrast with past work, this is an easy hands-on tool suite dedicated to save user’s time and labor and is freely available to the public.
bibtex
```
@INPROCEEDINGS{Canini.TMA09.GTVS,
author = {Marco Canini and Wei Li and Andrew W. Moore and Raffaele Bolla},
title = {{GTVS: Boosting the Collection of Application Traffic Ground Truth}},
booktitle = {Proceedings of the First International Workshop on Traffic Monitoring and Analysis (TMA'09)},
year = {2009},
month = {May}
}
```
Efficient Application Identification and the Temporal Stability of Classification Schema
W. Li, M. Canini, A. W. Moore, R. Bolla.
Computer Networks, 53(6), Apr 2009.

abstract
Motivated by the importance of accurate identification for a range of applications, this paper compares and contrasts the effective and efficient classification of network-based applications using behavioral observations of network-traffic and those using deep-packet inspection.
Importantly, throughout our work we are able to make comparison with data possessing an accurate, independently determined ground-truth that describes the actual applications causing the network-traffic observed.
In a unique study in both the spatial-domain: comparing across different network-locations and in the temporal-domain: comparing across a number of years of data, we illustrate the decay in classification accuracy across a range of application-classification mechanisms. Further, we document the accuracy of spatial classification without training data possessing spatial diversity.
Finally, we illustrate the classification of UDP traffic. We use the same classification approach for both stateful flows (TCP) and stateless flows based upon UDP. Importantly, we demonstrate high levels of accuracy: greater than 92% for the worst circumstance regardless of the application.
bibtex
```
@ARTICLE{Li.09.EAI,
author = {Wei Li and Marco Canini and Andrew W. Moore and Raffaele Bolla},
title = {{Efficient application identification and the temporal and spatial stability of classification schema}},
journal = {Computer Networks},
volume = {53},
number = {6},
year = {2009},
month = {Apr},
doi = {10.1016/j.comnet.2008.11.016}
}
```
Per Flow Packet Sampling for High-Speed Network Monitoring
M. Canini, D. Fay, D. J. Miller, A. W. Moore, R. Bolla.
In Proceedings of the First International Conference on Communication Systems and Networks (COMSNETS’09), Jan 2009.

abstract
We present a per-flow packet sampling method that enables the real-time classification of high-speed network traffic. Our method, based upon the partial sampling of each flow (i.e., performing sampling at only early stages in each flow’s lifetime), provides a sufficient reduction in total traffic (e.g., a factor of five in packets, a factor of ten in bytes) as to allow practical implementations at one Gigabit/s, and, using limited hardware assistance, ten Gigabit/s.
bibtex
```
@INPROCEEDINGS{Canini.COMSNETS09.PFPS,
author = {Marco Canini and Damien Fay and David J. Miller and Andrew W. Moore and Raffaele Bolla},
title = {{Per Flow Packet Sampling for High-Speed Network Monitoring}},
booktitle = {Proceedings of the First International Conference on Communication Systems and Networks (COMSNETS'09)},
year = {2009},
month = {Jan}
}
```
On the Double-Faced Nature of P2P Traffic
R. Bolla, M. Canini, R. Rapuzzi, M. Sciuto.
In Proceedings of the Sixteenth Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP’08), Feb 2008.

abstract
Over the last few years, peer-to-peer (P2P) file sharing applications have evolved to become a major traffic source in the Internet. The ability to quantify their impact on the network, as a consequence of both signaling and download traffic, is fundamental to a number of network operations, including traffic engineering, capacity planning, quality of service, forecasting for long-term provisioning, etc.
We present here a measurement study on the characteristics of the traffic associated with different P2P applications. Our aim is to offer useful insight into the nature of P2P traffic, which we consider a step toward building P2P traffic aggregates generators in simulative environments. We show that P2P traffic can be divided into two distinguished behavioral profiles, which, independently of the application protocol, present significant differences in the average and standard deviation of four measurements: arrival times, durations, volumes and average packet sizes of P2P conversations. These profiles well represent the typical behavior of signaling and download traffic. Based on our findings, we argue that, if such distinction is not taken into account, the statistical measurements needed to model P2P traffic aggregates would result biased, and potentially bring to misleading results.
bibtex
```
@INPROCEEDINGS{Bolla.PDP08.P2P,
author = {Raffaele Bolla and Marco Canini and Riccardo Rapuzzi and Michele Sciuto},
title = {{On the Double-Faced Nature of P2P Traffic}},
booktitle = {Proceedings of the Sixteenth Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'08)},
year = {2008},
pages = {524--530},
month = {Feb}
}
```
Characterizing the network behavior of P2P traffic
R. Bolla, M. Canini, R. Rapuzzi, M. Sciuto.
In Proceedings of the Fourth International Telecommunication Networking Workshop on QoS in Multiservice IP Networks (IT-NEWS’08), Feb 2008.

abstract
Nowadays the majority of Internet traffic is generated by peer-to-peer (P2P) file sharing applications. As the popularity of these applications has been increasing dramatically over the past few years, it becomes increasingly important to analyze their behavior and to understand their effects on the network. The ability to quantify their impact on the network is fundamental to a number of network operations, including traffic engineering, capacity planning, quality of service, forecasting for long-term provisioning, etc.
We present here a measurement study on the characteristics of the traffic associated with two different P2P applications. Our aim is to provide useful insight into the nature of P2P traffic from the point of view of the network. To achieve this, we introduce a novel measurement, Content Transfer Index (CTI), to distinguish two classes of behavior associated with P2P traffic: the download and the signaling traffic profile. Next we apply the CTI to our data sets and show that it effectively offers a general characterization of P2P traffic. Finally, we present a number of statistical measurements that are significantly unbiased due to having considered the distinction between the two classes. To the best of our knowledge, this is the first study to follow this approach.
We believe such a study will help researchers better understand the impact of P2P applications on the network and how to improve their performance.
bibtex
```
@INPROCEEDINGS{Bolla.ITNEWS08.P2P,
author = {Raffaele Bolla and Marco Canini and Riccardo Rapuzzi and Michele Sciuto},
title = {{Characterizing the network behavior of P2P traffic}},
booktitle = {Proceedings of the Fourth International Telecommunication Networking Workshop on QoS in Multiservice IP Networks (IT-NEWS'08)},
year = {2008},
pages = {14--19},
month = {Feb}
}
```
A High Performance IP Traffic Generation Tool Based on the Intel IXP2400 Network Processor
R. Bolla, R. Bruschi, M. Canini, M. Repetto.
In Proceedings of the 2005 Tyrrhenian International Workshop on Digital Communications (TIWDC’05), Sorrento, Italy, Jun 2005, and in F. Davoli, S. Palazzo, S. Zappatore, Eds. Distributed Cooperative Laboratories: Networking, Instrumentation, and Measurements. Springer, Norwell, MA, pages 127-142, 2006.

abstract
Traffic generation is essential in the processes of testing and developing new network elements, such as equipment, protocols and applications, regarding both the production and research area. Traditionally, two approaches have been followed for this purpose: the first is based on software applications that can be executed on inexpensive Personal Computers, while the second relies on dedicated hardware. Obviously, performance in the latter case (in terms of sustainable rates, precision in delays and jitters) outclasses that in the former, but also the costs usually grow of some order of magnitude. In this paper we describe a software IP traffic generator based on a hardware architecture specialized for packet processing (known as Network Processor), which we have developed and tested. Our approach is positioned between the two different philosophies listed above: it has a software (and then flexible) implementation running on a specific hardware only slightly more expensive than PCs.
bibtex
```
@INCOLLECTION{Bolla.06.PKTGEN,
author={Bolla, Raffaele and Bruschi, Roberto and Canini, Marco and Repetto, Matteo},
title= {{A High Performance IP Traffic Generation Tool Based On The Intel IXP2400 Network Processor}},
booktitle={Distributed Cooperative Laboratories: Networking, Instrumentation, and Measurements},
series={Signals and Communication Technology},
editor={Davoli, Franco and Palazzo, Sergio and Zappatore, Sandro},
publisher={Springer US},
pages={127--142},
year={2006}
}
```

Extended abstracts

Collaborative Learning Experimentation Testbed.
J. Božič, A. R. Faustino, B. Radovič, M. Canini, V. Pejović
APSys’24, Sep 2024.
Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation.
B. Liu, M. A. Ojewale, Y. Ding, M. Canini
NAIC’24, Aug 2024.
MicroView: Cloud-Native Observability with Temporal Precision.
A. Cornacchia, T. A. Benson, M. Bilal, M. Canini
CoNEXT-SW’23, Dec 2023.
IntML: Natural Compression for Distributed Deep Learning.
S. Horvath, C.-Y. Ho, L. Horvath, A. N. Sahu, M. Canini, P. Richtarik.
Workshop on AI Systems at SOSP’19, Oct 2019.
Dynam-IX: a Dynamic Interconnection eXchange.
P. Marcos, M. Chiesa, L. Müller, P. Kathiravelu, C. Dietzel, M. Canini, M. Barcellos.
SIGCOMM’18, Aug 2018.
Kurma: Geo-Distributed Load Balancer for Back-End Storage Systems.
K. L. Bogdanov, W. Reda, G. Q. Maguire Jr., D. Kostic, M. Canini.
SOSP’17, Oct 2017.
DAIET: A System for Data Aggregation Inside the Network.
A. Sapio, I. Abdelaziz, M. Canini, P. Kalnis.
SoCC’17, Sep 2017.
Privacy-Preserving Detection of Inter-Domain SDN Rules Overlaps.
A. Dethise, M. Chiesa, M. Canini.
SIGCOMM’17, Aug 2017.
A Self-Organizing Distributed and In-Band SDN Control Plane.
M. Canini, I. Salem, L. Schiff, E. M. Schiller, S. Schmid.
ICDCS’17, Jun 2017.
Towards meticulous data plane monitoring.
A. Shukla, S. J. Saidi, M. Canini, S. Schmid, A. Feldmann.
EuroDW’17, Apr 2017.
Towards Securing Internet eXchange Points Against Curious onlooKers.
M. Chiesa, D. Demmler, M. Canini, M. Schapira, T. Schneider.
ANRW’16, Jul 2016.
Automagic Configuration Optimization for Big Data Processing Systems.
M. Bilal, M. Canini.
EuroDW’16, Apr 2016.
BRB: BetteR Batch Scheduling to Reduce Tail Latencies in Cloud Data Stores.
W. Reda, L. Suresh, M. Canini, S. Braithwaite.
ACM SIGCOMM’15, Aug 2015.
Medieval: Towards A Self-Stabilizing, Plug & Play, In-Band SDN Control Network.
L. Schiff, S. Schmid, M. Canini.
SOSR’15, Jun 2015.
SDLoad: An Extensible Framework for SDN Workload Generation.
N. Laurent, S. Vissicchio, M. Canini.
HotSDN’14, Aug 2014.
Incremental SDN Deployment in Enterprise Networks.
D. Levin, M. Canini, S. Schmid, A. Feldmann.
ACM SIGCOMM’13, Aug 2013.
Automatic Failure Recovery for Software-Defined Networks.
M. Kuzniar, P. Peresini, N. Vasic, M. Canini, D. Kostic.
HotSDN’13, Aug 2013.
Systematic Software Testing Meets Networking.
M. Canini and D. Kostic.
Open Networking Summit (ONS’13), Apr 2013.
Consistent Packet Processing — Because Consistent Updates Are Not Enough.
P. Peresini, M. Kuzniar, N. Vasic, M. Canini, D. Kostic.
Open Networking Summit (ONS’13), Apr 2013.
Is Your OpenFlow Application Correct?.
P. Peresini and M. Canini.
ACM CoNEXT Student Workshop, Dec 2011.
Online Testing of Federated and Heterogeneous Distributed Systems.
M. Canini, V. Jovanovic, D. Venzano, D. Novakovic, D. Kostic
ACM SIGCOMM ‘11, Aug 2011.
Evolution of Cache Replacement Policies to Track Heavy-hitter Flows.
M. Zadnik and M. Canini.
ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS ‘10), Oct 2010.
DiCE: Predicting Faults in Heterogeneous, Federated Distributed Systems.
V. Jovanovic, M. Canini, G. Kumar, B. Spasojevic, O. Crameri, D. Kostic
USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘10), Oct 2010.
Toward the Identification of Anonymous Web Proxies.
M. Canini, W. Li, A. W. Moore.
PAM Student Workshop, Apr 2009.
Revisiting HTTP Traffic in the New Age.
W. Li, A. W. Moore, M. Canini.
ACM SIGCOMM ‘08, Aug 2008.

Book chapters

Experiences with the Collection of Application Ground Truth Data.
M. Canini, R. Rapuzzi, R. Bolla.
In Antonio Pescape and Carlo Sansone (Editor), RECIPE. Robust and Efficient traffic Classification of IP nEtworks, Fridericiana Editrice Universitaria, Napoli, ISBN: 978-88-833-8081-5, July 2009.

Copyright notice

Certain contents is © IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Certain contents is © ACM. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution.