The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations; for example, some numerals may appear as "XNUMX".
Copyright notice
Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of the original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.
Jingcheng SHEN
Osaka University
Fumihiko INO
Osaka University
Albert FARRÉS
Barcelona Supercomputing Center
Mauricio HANZICH
Barcelona Supercomputing Center
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Jingcheng SHEN, Fumihiko INO, Albert FARRÉS, Mauricio HANZICH, "A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU," IEICE TRANSACTIONS on Information and Systems,
vol. E103-D, no. 12, pp. 2421-2434, December 2020, doi: 10.1587/transinf.2020PAP0014.
Abstract: Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.
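The core idea behind out-of-core stencil computation, as the abstract describes, is to split data that exceeds device memory into chunks and process them one at a time, transferring each chunk (plus the boundary "halo" cells its stencil needs) between host and device. The following is a minimal illustrative sketch of that chunking pattern for a 1D 3-point stencil; it is not the paper's PACC framework, and the function name, chunk size, and list-based "transfers" are hypothetical stand-ins chosen for clarity.

```python
# Illustrative sketch (not the paper's PACC implementation) of out-of-core
# stencil processing: the full array is split into chunks small enough to
# fit in a hypothetical device memory, and each chunk is "transferred"
# together with a one-cell halo on each side so the 3-point stencil can be
# evaluated at chunk boundaries.

def stencil_out_of_core(data, chunk_size):
    n = len(data)
    out = [0.0] * n
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Halo of one cell on each side, clamped at the array edges.
        lo = max(start - 1, 0)
        hi = min(end + 1, n)
        tile = data[lo:hi]  # stands in for a host-to-device transfer
        for i in range(start, end):
            j = i - lo
            left = tile[j - 1] if i > 0 else tile[j]       # clamp at edge
            right = tile[j + 1] if i < n - 1 else tile[j]  # clamp at edge
            out[i] = (left + tile[j] + right) / 3.0
        # writing into `out` stands in for a device-to-host transfer
    return out
```

Because each chunk carries its halo, the chunked result matches an in-core run (i.e., `chunk_size >= len(data)`); optimizations such as the paper's region-sharing scheme reduce how much of this halo/transfer traffic is actually needed.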
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2020PAP0014/_p
@ARTICLE{e103-d_12_2421,
author={Jingcheng SHEN and Fumihiko INO and Albert FARRÉS and Mauricio HANZICH},
journal={IEICE TRANSACTIONS on Information and Systems},
title={A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU},
year={2020},
volume={E103-D},
number={12},
pages={2421-2434},
abstract={Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.},
keywords={},
doi={10.1587/transinf.2020PAP0014},
ISSN={1745-1361},
month={December},}
TY - JOUR
TI - A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 2421
EP - 2434
AU - Jingcheng SHEN
AU - Fumihiko INO
AU - Albert FARRÉS
AU - Mauricio HANZICH
PY - 2020
DO - 10.1587/transinf.2020PAP0014
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E103-D
IS - 12
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - December 2020
AB - Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.
ER -