Fine-grained Fault Tolerance in Distributed Training Toolkits using the Syndicated Actor Model

Research output: Chapter in Book/Report/Conference proceeding; Conference article in proceeding; Academic; peer-review

Abstract

Deep learning (DL) toolkits like TensorFlow and PyTorch provide distributed training strategies, but they have limited support for fault tolerance. Both provide the user with built-in checkpoint functionality to continue training after a restart; however, when a partial failure is detected, all ongoing training jobs are aborted and restarted, repeating work and wasting resources. In response, we explore the use of the Syndicated Actor Model (SAM) [5, 10], a recent model of distributed computation, to coordinate activities in a distributed training environment. The SAM encourages expression of distributed systems in terms of the joint state of a shared activity instead of in terms of point-to-point message exchange. This change in perspective has the potential to improve task distribution, resilience to failure, and efficient use of resources in distributed training scenarios.
Original language: English
Title of host publication: MIND 2025 - Proceedings of the 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems
Publisher: Association for Computing Machinery, Inc
Pages: 1-6
Number of pages: 6
ISBN (Electronic): 9798400723056
Publication status: Published - 14 Dec 2025
Event: 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems, MIND 2025 - Nashville, USA
Duration: 15 Dec 2025 to 19 Dec 2025
https://mindmlops.github.io/

Conference

Conference: 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems, MIND 2025
Abbreviated title: MIND 2025
City: Nashville
Period: 15/12/25 to 19/12/25
Other: Welcome to the 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems (MIND). MIND will be hosted in conjunction with the 26th ACM/IFIP International Middleware Conference, which will be held at Vanderbilt University, Nashville, TN, USA from 15th – 19th December 2025.

This workshop aims to bring together researchers, practitioners, and industry stakeholders to explore middleware innovations that support MLOps in distributed systems. It will focus on practical solutions to real-world challenges in orchestrating end-to-end ML pipelines, from data collection to model deployment and continuous monitoring, in dynamic, heterogeneous, and resource-constrained environments.

Keywords

  • checkpointing
  • deep learning
  • fault tolerance
  • partial failure
  • PyTorch
  • syndicated actor model
  • TensorFlow
