The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations; e.g., some numerals may be rendered as "XNUMX".
Copyrights notice
Rizal Setya PERDANA
Toyohashi University of Technology, Universitas Brawijaya
Yoshiteru ISHIDA
Toyohashi University of Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Rizal Setya PERDANA, Yoshiteru ISHIDA, "Vision-Text Time Series Correlation for Visual-to-Language Story Generation" in IEICE TRANSACTIONS on Information,
vol. E104-D, no. 6, pp. 828-839, June 2021, doi: 10.1587/transinf.2020EDP7131.
Abstract: Automatic generation of textual stories from visual data representation, known as visual storytelling, is a recent advancement in the problem of images-to-text. Instead of using a single image as input, visual storytelling processes a sequential array of images into coherent sentences. A story contains non-visual concepts as well as descriptions of literal objects. While previous approaches have applied external knowledge, our approach was to regard the non-visual concept as the semantic correlation between the visual modality and the textual modality. This paper, therefore, presents a new feature representation based on a canonical correlation analysis between the two modalities. An attention mechanism is adopted as the underlying architecture of the image-to-text problem, rather than standard encoder-decoder models. Canonical Correlation Attention Mechanism (CAAM), the proposed end-to-end architecture, extracts time series correlation by maximizing the cross-modal correlation. Extensive experiments on the VIST dataset ( http://visionandlanguage.net/VIST/dataset.html ) were conducted to demonstrate the effectiveness of the architecture in terms of automatic metrics, with additional experiments showing the impact of the modality fusion strategy.
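The central idea in the abstract, measuring how strongly visual and textual feature spaces share common structure by maximizing cross-modal correlation, can be illustrated with classical canonical correlation analysis. The sketch below is a minimal NumPy illustration on synthetic features; the `canonical_correlations` helper, the feature dimensions, and the toy data are illustrative assumptions, not the paper's end-to-end CAAM architecture.

```python
import numpy as np

def canonical_correlations(x, y, reg=1e-6):
    """Classical CCA: SVD of the whitened cross-covariance matrix.
    Returns the canonical correlation coefficients in descending order."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    sxx = x.T @ x / n + reg * np.eye(x.shape[1])  # regularized covariances
    syy = y.T @ y / n + reg * np.eye(y.shape[1])
    sxy = x.T @ y / n

    def inv_sqrt(m):
        # Inverse matrix square root via eigendecomposition
        w, v = np.linalg.eigh(m)
        return v @ np.diag(w ** -0.5) @ v.T

    k = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(k, compute_uv=False)

rng = np.random.default_rng(0)
# Toy stand-ins: 500 samples of 32-d "visual" features and 8-d "text"
# features that partly depend on the visual ones (dimensions are made up).
visual = rng.normal(size=(500, 32))
text = 0.5 * visual[:, :8] + 0.1 * rng.normal(size=(500, 8))

rho = canonical_correlations(visual, text)
print(np.round(rho[:3], 3))  # leading correlations are close to 1
```

Coefficients near 1 indicate directions along which the two modalities are strongly correlated; in CAAM this correlation objective is learned end-to-end rather than computed in closed form as here.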
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2020EDP7131/_p
@ARTICLE{e104-d_6_828,
author={Rizal Setya PERDANA and Yoshiteru ISHIDA},
journal={IEICE TRANSACTIONS on Information},
title={Vision-Text Time Series Correlation for Visual-to-Language Story Generation},
year={2021},
volume={E104-D},
number={6},
pages={828-839},
abstract={Automatic generation of textual stories from visual data representation, known as visual storytelling, is a recent advancement in the problem of images-to-text. Instead of using a single image as input, visual storytelling processes a sequential array of images into coherent sentences. A story contains non-visual concepts as well as descriptions of literal objects. While previous approaches have applied external knowledge, our approach was to regard the non-visual concept as the semantic correlation between the visual modality and the textual modality. This paper, therefore, presents a new feature representation based on a canonical correlation analysis between the two modalities. An attention mechanism is adopted as the underlying architecture of the image-to-text problem, rather than standard encoder-decoder models. Canonical Correlation Attention Mechanism (CAAM), the proposed end-to-end architecture, extracts time series correlation by maximizing the cross-modal correlation. Extensive experiments on the VIST dataset ( http://visionandlanguage.net/VIST/dataset.html ) were conducted to demonstrate the effectiveness of the architecture in terms of automatic metrics, with additional experiments showing the impact of the modality fusion strategy.},
keywords={},
doi={10.1587/transinf.2020EDP7131},
ISSN={1745-1361},
month={June},}
TY - JOUR
TI - Vision-Text Time Series Correlation for Visual-to-Language Story Generation
T2 - IEICE TRANSACTIONS on Information
SP - 828
EP - 839
AU - Rizal Setya PERDANA
AU - Yoshiteru ISHIDA
PY - 2021
DO - 10.1587/transinf.2020EDP7131
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E104-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2021
AB - Automatic generation of textual stories from visual data representation, known as visual storytelling, is a recent advancement in the problem of images-to-text. Instead of using a single image as input, visual storytelling processes a sequential array of images into coherent sentences. A story contains non-visual concepts as well as descriptions of literal objects. While previous approaches have applied external knowledge, our approach was to regard the non-visual concept as the semantic correlation between the visual modality and the textual modality. This paper, therefore, presents a new feature representation based on a canonical correlation analysis between the two modalities. An attention mechanism is adopted as the underlying architecture of the image-to-text problem, rather than standard encoder-decoder models. Canonical Correlation Attention Mechanism (CAAM), the proposed end-to-end architecture, extracts time series correlation by maximizing the cross-modal correlation. Extensive experiments on the VIST dataset ( http://visionandlanguage.net/VIST/dataset.html ) were conducted to demonstrate the effectiveness of the architecture in terms of automatic metrics, with additional experiments showing the impact of the modality fusion strategy.
ER -