Abstract
Deep learning (DL) toolkits such as TensorFlow and PyTorch provide distributed training strategies, but their support for fault tolerance is limited. Both offer built-in checkpointing so that training can resume after a restart; however, when a partial failure is detected, all ongoing training jobs are aborted and restarted, repeating work and wasting resources. In response, we explore the use of the Syndicated Actor Model (SAM) [5, 10], a recent model of distributed computation, to coordinate activities in a distributed training environment. The SAM encourages expressing distributed systems in terms of the joint state of a shared activity rather than in terms of point-to-point message exchange. This change in perspective has the potential to improve task distribution, resilience to failure, and efficient use of resources in distributed training scenarios.
| Original language | English |
|---|---|
| Title of host publication | MIND 2025 - Proceedings of the 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 1-6 |
| Number of pages | 6 |
| ISBN (Electronic) | 9798400723056 |
| Publication status | Published - 14 Dec 2025 |
| Event | 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems, MIND 2025, Nashville, USA. Duration: 15 Dec 2025 → 19 Dec 2025. https://mindmlops.github.io/ |
Conference
| Conference | 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems, MIND 2025 |
|---|---|
| Abbreviated title | MIND 2025 |
| City | Nashville |
| Period | 15/12/25 → 19/12/25 |
| Other | Welcome to the 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems (MIND). MIND will be hosted in conjunction with the 26th ACM/IFIP International Middleware Conference, which will be held at Vanderbilt University, Nashville, TN, USA from 15th to 19th December 2025. This workshop aims to bring together researchers, practitioners, and industry stakeholders to explore middleware innovations that support MLOps in distributed systems. It will focus on practical solutions to real-world challenges in orchestrating end-to-end ML pipelines, from data collection to model deployment and continuous monitoring, in dynamic, heterogeneous, and resource-constrained environments. |
| Internet address | https://mindmlops.github.io/ |
Keywords
- checkpointing
- deep learning
- fault tolerance
- partial failure
- PyTorch
- syndicated actor model
- TensorFlow
Fingerprint
Dive into the research topics of 'Fine-grained Fault Tolerance in Distributed Training Toolkits using the Syndicated Actor Model'. Together they form a unique fingerprint.