The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
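Advances in multicore technology have made processors with hundreds or even thousands of cores on a single chip possible. However, on a larger-scale multicore, a hardware-based cache coherence mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and it is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, and then solves those problems through simple program restructuring and data synchronization. Using the proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is available only for up to 4 cores, and it can also be turned off for non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism; therefore, running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still producing the same correct results as the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent multicore embedded CPU, to be programmed easily. For example, on the RP2 processor, the proposed software-controlled non-coherent cache (NCC) method gave a 2.6 times speedup over sequential execution for SPEC2000 "equake" with 4 cores, whereas MESI hardware coherence control gave only a 2.5 times speedup with 4 cores. In addition, the software coherence control gave a 4.4 times speedup with 8 cores, for which no hardware coherence mechanism is available.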
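The stale-data problem named in the abstract can be illustrated with a toy model. The sketch below is not the OSCAR compiler's implementation: the abstract does not say which cache operations the compiler emits, so the `Core` class and its `writeback` and `self_invalidate` methods are hypothetical names, chosen here on the assumption that the scheme resembles the usual software-coherence pattern of writing back on the producer side and invalidating on the consumer side.

```python
class Core:
    """Toy core with a private, non-coherent write-back cache (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory: address -> value
        self.cache = {}        # private cached copies: address -> value
        self.dirty = set()     # addresses modified locally but not yet written back

    def load(self, addr):
        # A cache hit returns the private copy, even if main memory has changed.
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def store(self, addr, value):
        # Write-back policy: the store stays in the private cache for now.
        self.cache[addr] = value
        self.dirty.add(addr)

    def writeback(self, addr):
        # Software-issued operation: push a dirty copy out to main memory.
        if addr in self.dirty:
            self.memory[addr] = self.cache[addr]
            self.dirty.discard(addr)

    def self_invalidate(self, addr):
        # Software-issued operation: drop the private copy so the next load
        # fetches from main memory. (A dirty copy would be lost in this toy model.)
        self.cache.pop(addr, None)
        self.dirty.discard(addr)


def run(with_software_coherence):
    """Producer updates x; consumer reads x after having cached the old value."""
    memory = {"x": 0}
    producer, consumer = Core(memory), Core(memory)

    consumer.load("x")        # consumer caches the old value 0
    producer.store("x", 42)   # producer updates only its own cache

    if with_software_coherence:
        # Assumed compiler-inserted steps at the synchronization point.
        producer.writeback("x")
        consumer.self_invalidate("x")

    return consumer.load("x")


if __name__ == "__main__":
    print("without software coherence:", run(False))  # stale value 0
    print("with software coherence:   ", run(True))   # fresh value 42
```

Without the two software-issued operations the consumer keeps reading its stale private copy. With them, the update becomes visible at the synchronization point, which is the behavior a hardware coherence protocol would otherwise provide automatically.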
Boma A. ADHI
Waseda University
Tomoya KASHIMATA
Waseda University
Ken TAKAHASHI
Waseda University
Keiji KIMURA
Waseda University
Hironori KASAHARA
Waseda University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Boma A. ADHI, Tomoya KASHIMATA, Ken TAKAHASHI, Keiji KIMURA, Hironori KASAHARA, "Compiler Software Coherent Control for Embedded High Performance Multicore" in IEICE TRANSACTIONS on Electronics,
vol. E103-C, no. 3, pp. 85-97, March 2020, doi: 10.1587/transele.2019LHP0008.
Abstract: Advances in multicore technology have made processors with hundreds or even thousands of cores on a single chip possible. However, on a larger-scale multicore, a hardware-based cache coherence mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and it is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, and then solves those problems through simple program restructuring and data synchronization. Using the proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is available only for up to 4 cores, and it can also be turned off for non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism; therefore, running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still producing the same correct results as the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent multicore embedded CPU, to be programmed easily. For example, on the RP2 processor, the proposed software-controlled non-coherent cache (NCC) method gave a 2.6 times speedup over sequential execution for SPEC2000 “equake” with 4 cores, whereas MESI hardware coherence control gave only a 2.5 times speedup with 4 cores. In addition, the software coherence control gave a 4.4 times speedup with 8 cores, for which no hardware coherence mechanism is available.
URL: https://global.ieice.org/en_transactions/electronics/10.1587/transele.2019LHP0008/_p
@ARTICLE{e103-c_3_85,
author={Boma A. ADHI and Tomoya KASHIMATA and Ken TAKAHASHI and Keiji KIMURA and Hironori KASAHARA},
journal={IEICE TRANSACTIONS on Electronics},
title={Compiler Software Coherent Control for Embedded High Performance Multicore},
year={2020},
volume={E103-C},
number={3},
pages={85-97},
abstract={Advances in multicore technology have made processors with hundreds or even thousands of cores on a single chip possible. However, on a larger-scale multicore, a hardware-based cache coherence mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and it is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, and then solves those problems through simple program restructuring and data synchronization. Using the proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is available only for up to 4 cores, and it can also be turned off for non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism; therefore, running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still producing the same correct results as the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent multicore embedded CPU, to be programmed easily. For example, on the RP2 processor, the proposed software-controlled non-coherent cache (NCC) method gave a 2.6 times speedup over sequential execution for SPEC2000 “equake” with 4 cores, whereas MESI hardware coherence control gave only a 2.5 times speedup with 4 cores. In addition, the software coherence control gave a 4.4 times speedup with 8 cores, for which no hardware coherence mechanism is available.},
keywords={},
doi={10.1587/transele.2019LHP0008},
ISSN={1745-1353},
month={March},}
TY - JOUR
TI - Compiler Software Coherent Control for Embedded High Performance Multicore
T2 - IEICE TRANSACTIONS on Electronics
SP - 85
EP - 97
AU - Boma A. ADHI
AU - Tomoya KASHIMATA
AU - Ken TAKAHASHI
AU - Keiji KIMURA
AU - Hironori KASAHARA
PY - 2020
DO - 10.1587/transele.2019LHP0008
JO - IEICE TRANSACTIONS on Electronics
SN - 1745-1353
VL - E103-C
IS - 3
JA - IEICE TRANSACTIONS on Electronics
Y1 - March 2020
AB - Advances in multicore technology have made processors with hundreds or even thousands of cores on a single chip possible. However, on a larger-scale multicore, a hardware-based cache coherence mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and it is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, and then solves those problems through simple program restructuring and data synchronization. Using the proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is available only for up to 4 cores, and it can also be turned off for non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism; therefore, running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still producing the same correct results as the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent multicore embedded CPU, to be programmed easily. For example, on the RP2 processor, the proposed software-controlled non-coherent cache (NCC) method gave a 2.6 times speedup over sequential execution for SPEC2000 “equake” with 4 cores, whereas MESI hardware coherence control gave only a 2.5 times speedup with 4 cores. In addition, the software coherence control gave a 4.4 times speedup with 8 cores, for which no hardware coherence mechanism is available.
ER -