End-to-end Adaptive Distributed Training

Distributed training is a popular way to train large neural networks efficiently on large amounts of data. However, adapting to different neural network models, different computing resources, and their dynamic changes over the course of a training job is a significant challenge, and it becomes even more pressing in industrial applications and production environments.

The End-to-End Adaptive Distributed Training Framework

In this study, the distributed training framework is designed systematically to adapt to different scenarios, particularly industrial applications and production environments. The framework is equipped with a global cost model and a global planner. Built on a unified distributed graph and a unified cluster object, it enables the following features:

  • Arbitrary parallelism
  • Resource-aware placement
  • Multi-mode execution
  • Fault tolerance
  • Elastic distributed training
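To make the last two features concrete, a minimal sketch of the elastic idea follows. All names here are hypothetical illustrations, not the framework's actual API: when the worker count changes (a worker fails or resources are added), data shards are simply redistributed across the surviving workers so training continues without a fixed resource assumption.

```python
# Hypothetical illustration of elastic data sharding: each worker keeps the
# sample indices assigned to its rank, and reshards when the pool resizes.

def shard(indices, num_workers, rank):
    # round-robin assignment of sample indices to one worker
    return [i for i in indices if i % num_workers == rank]

data = list(range(10))
# 4 workers; worker with rank 1 holds its shard
print(shard(data, 4, 1))  # [1, 5, 9]
# one worker fails, the job elastically rescales to 3 workers
print(shard(data, 3, 1))  # [1, 4, 7]
```

A real system would also restore optimizer and model state from a checkpoint on resize; the resharding step above is only the data-distribution half of the picture.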

Unified Distributed Graph and Cluster Object

The framework is designed from a systematic, end-to-end view that considers resource allocation, model partitioning, task placement, and distributed execution together. The unified distributed graph and the unified cluster object serve as the basis of the framework. Together, these two components provide a comprehensive and intuitive view of a distributed training job as a graph: each node in the graph represents an operation, and each edge represents the input/output dependency between operations.
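The graph abstraction described above can be sketched in a few lines. This is an illustrative toy (the class and method names are assumptions, not the framework's real interface): operations are nodes, input/output dependencies are edges, and a topological traversal yields an execution order that respects those dependencies.

```python
# Toy model of a distributed training job as an operation graph.
from collections import defaultdict, deque

class Op:
    def __init__(self, name):
        self.name = name

class DistributedGraph:
    def __init__(self):
        self.ops = []
        self.deps = defaultdict(list)     # producer -> consumers
        self.in_degree = defaultdict(int) # unmet input dependencies per op

    def add_op(self, op):
        self.ops.append(op)
        return op

    def add_edge(self, producer, consumer):
        # consumer reads an output of producer
        self.deps[producer].append(consumer)
        self.in_degree[consumer] += 1

    def topo_order(self):
        # schedule operations so every op runs after its inputs are ready
        degree = {op: self.in_degree[op] for op in self.ops}
        ready = deque(op for op in self.ops if degree[op] == 0)
        order = []
        while ready:
            op = ready.popleft()
            order.append(op)
            for nxt in self.deps[op]:
                degree[nxt] -= 1
                if degree[nxt] == 0:
                    ready.append(nxt)
        return order

g = DistributedGraph()
fwd = g.add_op(Op("forward"))
loss = g.add_op(Op("loss"))
bwd = g.add_op(Op("backward"))
g.add_edge(fwd, loss)
g.add_edge(loss, bwd)
print([op.name for op in g.topo_order()])  # ['forward', 'loss', 'backward']
```

Because the same graph describes the job regardless of how it is later partitioned or placed, it gives the planner a single object to reason over.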

Global Cost Model and Global Planner

The global cost model and global planner are essential components of the framework, and they enable the framework's adaptability. The global cost model estimates the cost of executing an operation in different scenarios, considering the computation and communication overhead. The global planner uses the cost model to select the optimal placement and scheduling of operations on resources in different scenarios, taking into account the resource availability, network topology, and inter-operation dependencies.
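The interplay of cost model and planner can be illustrated with a deliberately simplified sketch. Everything here is an assumption for exposition (device names, the linear cost formula, the greedy strategy); the actual framework's model and search are more sophisticated. The cost of an operation on a device combines compute time and the communication time for inputs produced on other devices, and a greedy planner places each operation where its estimated completion is cheapest.

```python
# Hypothetical cost-model-driven placement: compute time plus cross-device
# communication time, minimized greedily per operation.

def op_cost(flops, input_bytes, device, producer_devices, speeds, bandwidth):
    compute = flops / speeds[device]
    # communication is only paid for inputs that live on a different device
    comm = sum(b / bandwidth
               for dev, b in zip(producer_devices, input_bytes)
               if dev != device)
    return compute + comm

def plan(ops, devices, speeds, bandwidth):
    """ops: list of (name, flops, [(producer_name, bytes), ...]) in topo order."""
    placement = {}
    load = {d: 0.0 for d in devices}  # accumulated work per device
    for name, flops, inputs in ops:
        producer_devices = [placement[p] for p, _ in inputs]
        input_bytes = [b for _, b in inputs]
        best = min(devices, key=lambda d: load[d] + op_cost(
            flops, input_bytes, d, producer_devices, speeds, bandwidth))
        placement[name] = best
        load[best] += op_cost(flops, input_bytes, best, producer_devices,
                              speeds, bandwidth)
    return placement

devices = ["gpu0", "gpu1"]
speeds = {"gpu0": 2.0, "gpu1": 1.0}    # gpu0 is twice as fast
ops = [("matmul", 4.0, []),
       ("softmax", 4.0, [("matmul", 8.0)])]
placement = plan(ops, devices, speeds, bandwidth=4.0)
print(placement)  # {'matmul': 'gpu0', 'softmax': 'gpu0'}
```

Note how the communication term keeps the second operation co-located with its producer: moving it to the slower device would add transfer time on top of slower compute.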

Experiments

In experiments, the framework satisfied a wide range of requirements arising from the diversity of applications and the heterogeneity of resources, with highly competitive performance. Its adaptive design delivers flexible parallelism, effective resource utilization, efficient communication, fault tolerance, and elasticity.

In summary, the End-to-End Adaptive Distributed Training Framework takes a systematic approach to adapting distributed training to different scenarios, particularly industrial applications and production environments. It enables arbitrary parallelism, resource-aware placement, multi-mode execution, fault tolerance, and elastic distributed training, and has demonstrated strong performance across a range of scenarios.
