Rizal Setya PERDANA
Toyohashi University of Technology, Universitas Brawijaya
Yoshiteru ISHIDA
Toyohashi University of Technology
Rizal Setya PERDANA and Yoshiteru ISHIDA, "Contextualized Language Generation on Visual-to-Language Storytelling" in IEICE TRANSACTIONS on Information, vol. E105-D, no. 5, pp. 873-886, May 2022, doi: 10.1587/transinf.2021KBP0002.
Abstract: This study presents a formulation for generating context-aware natural language by machine from visual representation. Given an image sequence input, the visual storytelling task (VST) aims to generate a coherent, object-focused, and contextualized sentence story. Previous works in this domain faced a problem in modeling an architecture that works in temporal multi-modal data, which led to a low-quality output, such as low lexical diversity, monotonous sentences, and inaccurate context. This study introduces a further improvement, that is, an end-to-end architecture, called cross-modal contextualize attention, optimized to extract visual-temporal features and generate a plausible story. Visual object and non-visual concept features are encoded from the convolutional feature map, and object detection features are joined with language features. Three scenarios are defined in decoding language generation by incorporating weights from a pre-trained language generation model. Extensive experiments are conducted to confirm that the proposed model outperforms other models in terms of automatic metrics and manual human evaluation.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021KBP0002/_p
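The abstract describes a fusion scheme in which features from a convolutional feature map and object-detection features are joined with language features, and decoding relies on weights from a pre-trained language generation model. The following is a minimal illustrative sketch of that kind of cross-modal attention fusion; the module layout, dimensions, and names here are assumptions for illustration only and do not reproduce the authors' implementation.

# Minimal sketch of cross-modal attention fusion as summarized in the abstract.
# All dimensions, module names, and the fusion layout are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalContextualizedAttention(nn.Module):
    def __init__(self, visual_dim=2048, detection_dim=1024, lang_dim=768, num_heads=8):
        super().__init__()
        # Project conv-feature-map features and object-detection features into the
        # language-model embedding space so they can be attended over jointly.
        self.visual_proj = nn.Linear(visual_dim, lang_dim)
        self.detection_proj = nn.Linear(detection_dim, lang_dim)
        # Language tokens attend over the joined visual context (cross-attention).
        self.cross_attn = nn.MultiheadAttention(lang_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(lang_dim)

    def forward(self, lang_feats, conv_map_feats, detection_feats):
        # lang_feats:      (batch, num_tokens, lang_dim)      from a pre-trained LM
        # conv_map_feats:  (batch, num_regions, visual_dim)   pooled conv feature map
        # detection_feats: (batch, num_objects, detection_dim) object-detection features
        visual_ctx = torch.cat(
            [self.visual_proj(conv_map_feats), self.detection_proj(detection_feats)],
            dim=1,
        )
        attended, _ = self.cross_attn(query=lang_feats, key=visual_ctx, value=visual_ctx)
        # Residual connection keeps the language features grounded in the visual context.
        return self.norm(lang_feats + attended)

# Example usage with toy tensors (an image sequence flattened into visual regions).
fusion = CrossModalContextualizedAttention()
lang = torch.randn(2, 20, 768)      # token representations for the story so far
conv = torch.randn(2, 5 * 49, 2048) # conv feature map regions from 5 images
dets = torch.randn(2, 36, 1024)     # detected-object features
out = fusion(lang, conv, dets)      # (2, 20, 768), fed to the language decoder

In such a setup, the fused token representations would then be decoded by the pre-trained language generation model under whichever weighting scenario is chosen, which is the role of the three decoding scenarios mentioned in the abstract.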
@ARTICLE{e105-d_5_873,
author={Rizal Setya PERDANA and Yoshiteru ISHIDA},
journal={IEICE TRANSACTIONS on Information},
title={Contextualized Language Generation on Visual-to-Language Storytelling},
year={2022},
volume={E105-D},
number={5},
pages={873-886},
abstract={This study presents a formulation for generating context-aware natural language by machine from visual representation. Given an image sequence input, the visual storytelling task (VST) aims to generate a coherent, object-focused, and contextualized sentence story. Previous works in this domain faced a problem in modeling an architecture that works in temporal multi-modal data, which led to a low-quality output, such as low lexical diversity, monotonous sentences, and inaccurate context. This study introduces a further improvement, that is, an end-to-end architecture, called cross-modal contextualize attention, optimized to extract visual-temporal features and generate a plausible story. Visual object and non-visual concept features are encoded from the convolutional feature map, and object detection features are joined with language features. Three scenarios are defined in decoding language generation by incorporating weights from a pre-trained language generation model. Extensive experiments are conducted to confirm that the proposed model outperforms other models in terms of automatic metrics and manual human evaluation.},
keywords={},
doi={10.1587/transinf.2021KBP0002},
ISSN={1745-1361},
month={May},}
TY - JOUR
TI - Contextualized Language Generation on Visual-to-Language Storytelling
T2 - IEICE TRANSACTIONS on Information
SP - 873
EP - 886
AU - Rizal Setya PERDANA
AU - Yoshiteru ISHIDA
PY - 2022
DO - 10.1587/transinf.2021KBP0002
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 2022
AB - This study presents a formulation for generating context-aware natural language by machine from visual representation. Given an image sequence input, the visual storytelling task (VST) aims to generate a coherent, object-focused, and contextualized sentence story. Previous works in this domain faced a problem in modeling an architecture that works in temporal multi-modal data, which led to a low-quality output, such as low lexical diversity, monotonous sentences, and inaccurate context. This study introduces a further improvement, that is, an end-to-end architecture, called cross-modal contextualize attention, optimized to extract visual-temporal features and generate a plausible story. Visual object and non-visual concept features are encoded from the convolutional feature map, and object detection features are joined with language features. Three scenarios are defined in decoding language generation by incorporating weights from a pre-trained language generation model. Extensive experiments are conducted to confirm that the proposed model outperforms other models in terms of automatic metrics and manual human evaluation.
ER -