CGX: Adaptive System Support for Communication-Efficient Deep Learning
In: Middleware 2022; (2021)
Online
report
The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular by hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, although individual server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training and larger-scale multi-node training. CGX is based on two technical advances. At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end-users do not have to modify training recipes or make significant changes to training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
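To make the idea of layer-wise compressed communication concrete, the following is a minimal sketch in plain Python. It is not CGX's implementation or API: the function names, the size threshold, and the bit-widths are all hypothetical, and the layer-size heuristic in `pick_bits` is only a stand-in assumption for the adaptive scheme the abstract mentions. It shows unbiased stochastic quantization of each layer's gradient onto a uniform grid, with a per-layer bit-width.

```python
import random


def quantize(values, bits, rng):
    """Stochastically round each value onto a uniform grid of 2**bits levels."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        # Constant tensor: every code is 0 and the scale is irrelevant.
        return [0] * len(values), lo, 0.0
    scale = (hi - lo) / levels
    codes = []
    for v in values:
        x = (v - lo) / scale
        floor = int(x)
        # Round up with probability equal to the fractional part,
        # which keeps the quantizer unbiased in expectation.
        code = floor + (1 if rng.random() < x - floor else 0)
        codes.append(min(code, levels))
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


def pick_bits(num_params, small_layer=1024, low_bits=4, high_bits=8):
    # Hypothetical heuristic (an assumption, not the paper's policy):
    # large layers dominate traffic, so they get fewer bits, while
    # small layers keep more precision.
    return high_bits if num_params < small_layer else low_bits


def compress_model(grads, rng):
    """Quantize every layer's gradient with its own bit-width."""
    packed = {}
    for name, g in grads.items():
        bits = pick_bits(len(g))
        packed[name] = (bits, *quantize(g, bits, rng))
    return packed
```

In a data-parallel setting, each worker would send the integer codes plus the two floats `lo` and `scale` per layer instead of the full-precision gradient, and the receiver would call `dequantize` before averaging. The reconstruction error per element is bounded by one grid step (`scale`), so the choice of bit-width per layer is the knob that trades bandwidth against accuracy.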
Title: | CGX: Adaptive System Support for Communication-Efficient Deep Learning |
---|---|
Author / Contributor: | Markov, Ilia ; Ramezanikebrya, Hamidreza ; Alistarh, Dan |
Link: | |
Source: | Middleware 2022; (2021) |
Published: | 2021 |
Media type: | report |
DOI: | 10.1145/3528535.3565248 |
Keywords: | |
Other: | |