Visual Question Answering (VQA) is multi-task research that requires the simultaneous processing of vision and text. Recent research on VQA models employs a co-attention mechanism to build a model between the context and the image. However, the modeling of question features and image regions forces irrelevant information to be calculated into the model, which degrades performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to prevent irrelevant information from being calculated into the model when modeling the internal dependencies of both the question and the image, while also overcoming the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder computes the features with the highest weights: from the self-attention learning of question words, only the question features with larger weights are retained. Second, the sparse question features are used to guide attention over the image features, yielding fine-grained image features and preventing irrelevant information from entering the model; a dual self-guided attention (DSGA) unit is designed to improve the modal interaction between questions and images. Third, the parameter δ of the sparse question self-attention is optimized to select the question-related object regions. Experiments on the VQA 2.0 benchmark dataset demonstrate that DSSQN outperforms state-of-the-art methods; for example, the accuracy of the proposed model is 71.03% on test-dev and 71.37% on test-std. In addition, visualization results show that our model pays more attention to important features than other advanced models. We also hope that this work can promote the development of VQA in the field of artificial intelligence (AI).
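The abstract describes two mechanisms: a sparse question self-attention (SQSA) that retains only the highest-weighted question features, controlled by a parameter δ, and a guided attention step in which the sparse question features attend over image region features. The following is a minimal PyTorch sketch of these two ideas, not the authors' implementation: the thresholding rule (keeping attention weights above δ times the per-row maximum), the function names, and the toy dimensions are assumptions for illustration only.

import torch
import torch.nn.functional as F


def sparse_question_self_attention(q_feats, delta=0.1):
    """Self-attention over question words that zeroes out attention weights
    below delta times the per-row maximum (hypothetical sparsification rule;
    the paper's exact use of delta may differ)."""
    d = q_feats.size(-1)
    scores = q_feats @ q_feats.transpose(-2, -1) / d ** 0.5      # (n_words, n_words)
    attn = F.softmax(scores, dim=-1)
    # Keep only the larger weights, then renormalize the remaining ones.
    keep = attn >= delta * attn.max(dim=-1, keepdim=True).values
    sparse_attn = attn * keep
    sparse_attn = sparse_attn / sparse_attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return sparse_attn @ q_feats                                  # sparse question features


def question_guided_image_attention(sparse_q, img_feats):
    """Sparse question features act as queries over image region features,
    so regions unrelated to the question receive little weight."""
    d = img_feats.size(-1)
    scores = sparse_q @ img_feats.transpose(-2, -1) / d ** 0.5   # (n_words, n_regions)
    attn = F.softmax(scores, dim=-1)
    return attn @ img_feats                                       # question-attended image features


# Toy usage: 14 question words and 36 detected image regions, 512-d features.
q = torch.randn(14, 512)
v = torch.randn(36, 512)
attended_v = question_guided_image_attention(sparse_question_self_attention(q), v)
print(attended_v.shape)  # torch.Size([14, 512])

The full DSGA unit in the paper handles richer modal interaction than this single guided-attention pass; the sketch only illustrates the basic question-to-image guidance described in the abstract.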
Xiang SHEN
Shanghai Maritime University
Dezhi HAN
Shanghai Maritime University
Chin-Chen CHANG
Feng Chia University
Liang ZONG
Shaoyang University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Xiang SHEN, Dezhi HAN, Chin-Chen CHANG, Liang ZONG, "Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering" in IEICE TRANSACTIONS on Information,
vol. E105-D, no. 4, pp. 785-796, April 2022, doi: 10.1587/transinf.2021EDP7189.
Abstract: Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7189/_p
@ARTICLE{e105-d_4_785,
author={Xiang SHEN and Dezhi HAN and Chin-Chen CHANG and Liang ZONG},
journal={IEICE TRANSACTIONS on Information},
title={Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering},
year={2022},
volume={E105-D},
number={4},
pages={785-796},
abstract={Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).},
keywords={},
doi={10.1587/transinf.2021EDP7189},
ISSN={1745-1361},
month={April},}
TY - JOUR
TI - Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering
T2 - IEICE TRANSACTIONS on Information
SP - 785
EP - 796
AU - Xiang SHEN
AU - Dezhi HAN
AU - Chin-Chen CHANG
AU - Liang ZONG
PY - 2022
DO - 10.1587/transinf.2021EDP7189
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2022
AB - Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).
ER -