We propose “Temporal Ensemble SSDLite,” a new video object detection method that boosts accuracy while maintaining detection speed and energy consumption. Object detection in video is becoming increasingly important as a core part of applications in robotics, autonomous driving, and other promising fields. Many of these applications require high accuracy and speed to be viable, yet run in compute- and energy-restricted environments. Therefore, new methods that improve the overall performance of video object detection, i.e., both accuracy and speed, must be developed. To increase accuracy, we use ensembling, the machine-learning technique of combining the predictions of multiple models. The drawback of ensembling is that its computational cost grows in proportion to the number of models used. We overcome this drawback by deploying the ensemble temporally: at each frame we run inference with only a single model, cycling through the ensemble from frame to frame, and then combine the predictions of the last N frames, where N is the number of models in the ensemble, via non-max-suppression. This works because nearby frames in a video are extremely similar due to temporal correlation. As a result, we gain the accuracy of an ensemble while inferencing only a single model per frame, thus maintaining detection speed. To evaluate the proposal, we measure accuracy, detection speed, and energy consumption on the Google Edge TPU, a machine-learning inference accelerator, using the ImageNet VID dataset. Our results show an accuracy improvement of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181 mJ per image.
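The abstract describes the temporal-ensemble mechanism only in prose. The following is a minimal, class-agnostic Python sketch of that idea under stated assumptions: the Detector callables, the (x1, y1, x2, y2, score) box format, and the greedy non-max-suppression used here are illustrative placeholders, not the authors' SSDLite models or their Edge TPU pipeline.

# Minimal sketch of the temporal-ensemble idea from the abstract (assumptions noted above).
from collections import deque
from typing import Callable, List, Sequence, Tuple

# A detection is (x1, y1, x2, y2, score); a Detector maps a frame to a list of detections.
Detection = Tuple[float, float, float, float, float]
Detector = Callable[[object], List[Detection]]


def iou(a: Detection, b: Detection) -> float:
    """Intersection-over-union of two boxes (the score field is ignored)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(dets: Sequence[Detection], iou_thresh: float = 0.5) -> List[Detection]:
    """Greedy non-max-suppression: keep the highest-scoring box, drop overlapping ones."""
    kept: List[Detection] = []
    for det in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) < iou_thresh for k in kept):
            kept.append(det)
    return kept


def temporal_ensemble(frames, models: Sequence[Detector], iou_thresh: float = 0.5):
    """Run one model per frame (round-robin) and fuse the last N frames' predictions.

    N equals the number of models, so each fused output reflects the whole ensemble
    even though only a single model is inferenced per frame.
    """
    buffer: deque = deque(maxlen=len(models))  # predictions of the last N frames
    for i, frame in enumerate(frames):
        model = models[i % len(models)]        # cycle through the ensemble
        buffer.append(model(frame))
        fused = [d for preds in buffer for d in preds]
        yield nms(fused, iou_thresh)

In use, temporal_ensemble would be fed a frame iterator and a list of single-shot detectors; the per-frame cost stays that of one model, while the NMS over the buffered predictions provides the ensemble effect exploited by the paper.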
Lukas NAKAMURA
Osaka University
Hiromitsu AWANO
Kyoto University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Lukas NAKAMURA, Hiromitsu AWANO, "Temporal Ensemble SSDLite: Exploiting Temporal Correlation in Video for Accurate Object Detection" in IEICE TRANSACTIONS on Fundamentals,
vol. E105-A, no. 7, pp. 1082-1090, July 2022, doi: 10.1587/transfun.2021EAP1068.
Abstract: We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.2021EAP1068/_p
@ARTICLE{e105-a_7_1082,
author={Lukas NAKAMURA and Hiromitsu AWANO},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Temporal Ensemble SSDLite: Exploiting Temporal Correlation in Video for Accurate Object Detection},
year={2022},
volume={E105-A},
number={7},
pages={1082-1090},
abstract={We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.},
keywords={},
doi={10.1587/transfun.2021EAP1068},
ISSN={1745-1337},
month={July},}
TY - JOUR
TI - Temporal Ensemble SSDLite: Exploiting Temporal Correlation in Video for Accurate Object Detection
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1082
EP - 1090
AU - Lukas NAKAMURA
AU - Hiromitsu AWANO
PY - 2022
DO - 10.1587/transfun.2021EAP1068
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E105-A
IS - 7
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - July 2022
AB - We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.
ER -