Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers each of which equips multiple GPUs can significantly accelerate the deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced for further shortening the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on NetFPGA-SUME board. Their resource utilizations are increased by PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
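As a rough illustration of the in-network processing described in the abstract, the sketch below shows how a switch-side reducer could sum per-element gradient fragments arriving from several GPU ports and then apply a plain SGD step to the corresponding parameter slice. This is only a hedged software sketch; the function name, fragment format, and averaging choice are assumptions for illustration and do not describe the paper's actual FPGA datapath.

import numpy as np

def aggregate_and_update(fragments, params, lr=0.01):
    # Hypothetical software model of the switch-side processing:
    # element-wise sum of the gradient fragments received from each GPU port
    # (the "gradient aggregation" step), followed by a plain SGD update.
    grad_sum = np.sum(np.stack(fragments), axis=0)
    grad_avg = grad_sum / len(fragments)   # average over the participating GPUs
    return params - lr * grad_avg

# Example: four GPUs each contribute a fragment for the same parameter slice.
params = np.zeros(8, dtype=np.float32)
fragments = [np.random.randn(8).astype(np.float32) for _ in range(4)]
params = aggregate_and_update(fragments, params)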
Tomoya ITSUBO
Keio University
Michihiro KOIBUCHI
National Institute of Informatics
Hideharu AMANO
Keio University
Hiroki MATSUTANI
Keio University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Tomoya ITSUBO, Michihiro KOIBUCHI, Hideharu AMANO, and Hiroki MATSUTANI, "An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs" in IEICE TRANSACTIONS on Information and Systems,
vol. E104-D, no. 12, pp. 2057-2067, December 2021, doi: 10.1587/transinf.2021PAP0008.
Abstract: Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers each of which equips multiple GPUs can significantly accelerate the deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced for further shortening the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on NetFPGA-SUME board. Their resource utilizations are increased by PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021PAP0008/_p
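For reference, the four optimizers named in the abstract (SGD, AdaGrad, Adam, and SMORMS3) follow well-known element-wise update rules. The snippet below restates them as a plain NumPy reference model under standard textbook definitions; the variable names and state layout are assumptions, it is not the paper's PE implementation, and the SMORMS3 variant in particular may differ in detail from the hardware version.

import numpy as np

def sgd(p, g, lr=0.01):
    # Plain stochastic gradient descent step.
    return p - lr * g

def adagrad(p, g, state, lr=0.01, eps=1e-8):
    state["h"] += g * g                                 # accumulated squared gradients
    return p - lr * g / (np.sqrt(state["h"]) + eps)

def adam(p, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g         # first-moment estimate
    state["v"] = b2 * state["v"] + (1 - b2) * g * g     # second-moment estimate
    m_hat = state["m"] / (1 - b1 ** state["t"])         # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return p - lr * m_hat / (np.sqrt(v_hat) + eps)

def smorms3(p, g, state, lr=0.001, eps=1e-16):
    # SMORMS3: leaky moving averages whose decay adapts per element.
    r = 1.0 / (state["mem"] + 1.0)
    state["g"] = (1 - r) * state["g"] + r * g
    state["g2"] = (1 - r) * state["g2"] + r * g * g
    x = state["g"] ** 2 / (state["g2"] + eps)
    state["mem"] = 1 + state["mem"] * (1 - x)
    return p - g * np.minimum(lr, x) / (np.sqrt(state["g2"]) + eps)

# Example state initialisation for a parameter vector of length n.
n = 8
adagrad_state = {"h": np.zeros(n)}
adam_state = {"t": 0, "m": np.zeros(n), "v": np.zeros(n)}
smorms3_state = {"mem": np.ones(n), "g": np.zeros(n), "g2": np.zeros(n)}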
@ARTICLE{e104-d_12_2057,
author={Tomoya ITSUBO and Michihiro KOIBUCHI and Hideharu AMANO and Hiroki MATSUTANI},
journal={IEICE TRANSACTIONS on Information and Systems},
title={An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs},
year={2021},
volume={E104-D},
number={12},
pages={2057-2067},
abstract={Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers each of which equips multiple GPUs can significantly accelerate the deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced for further shortening the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on NetFPGA-SUME board. Their resource utilizations are increased by PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.},
keywords={},
doi={10.1587/transinf.2021PAP0008},
ISSN={1745-1361},
month={December},}
TY - JOUR
TI - An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 2057
EP - 2067
AU - Tomoya ITSUBO
AU - Michihiro KOIBUCHI
AU - Hideharu AMANO
AU - Hiroki MATSUTANI
PY - 2021
DO - 10.1587/transinf.2021PAP0008
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E104-D
IS - 12
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - December 2021
AB - Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers each of which equips multiple GPUs can significantly accelerate the deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced for further shortening the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on NetFPGA-SUME board. Their resource utilizations are increased by PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
ER -