The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
As one of the popular topics in the field of human-computer interaction, Speech Emotion Recognition (SER) aims to classify the emotional tendency of a speaker's utterances. Using existing deep learning methods with a large amount of training data, highly accurate results can be achieved. Unfortunately, building such a huge, universally applicable emotional speech database is a time-consuming and difficult task. However, the Siamese Neural Network (SNN) discussed in this paper can yield highly accurate results with only a limited amount of training data, through pairwise training that mitigates the impact of sample deficiency and provides enough iterations. To obtain sufficient SER training, this study proposes a novel method using Siamese Attention-based Long Short-Term Memory networks. In this framework, we designed two attention-based LSTM networks that share the same weights, and we input frame-level acoustic emotional features, rather than utterance-level emotional features, to the Siamese network. The proposed solution was evaluated on the EMODB, ABC, and UYGSEDB corpora, and showed significant improvement in SER results compared to conventional deep learning methods.
Tashpolat NIZAMIDIN
Southeast University
Li ZHAO
Southeast University
Ruiyu LIANG
Nanjing Institute of Technology
Yue XIE
Southeast University
Askar HAMDULLA
Xinjiang University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Tashpolat NIZAMIDIN, Li ZHAO, Ruiyu LIANG, Yue XIE, Askar HAMDULLA, "Siamese Attention-Based LSTM for Speech Emotion Recognition" in IEICE TRANSACTIONS on Fundamentals,
vol. E103-A, no. 7, pp. 937-941, July 2020, doi: 10.1587/transfun.2019EAL2156.
Abstract: As one of the popular topics in the field of human-computer interaction, Speech Emotion Recognition (SER) aims to classify the emotional tendency from speakers' utterances. Using existing deep learning methods and a large amount of training data, a highly accurate performance can be achieved. Unfortunately, it is a time-consuming and difficult job to build such a huge emotional speech database that is universally applicable. However, the Siamese Neural Network (SNN), which we discuss in this paper, can yield extremely precise results with just a limited amount of training data through pairwise training, which mitigates the impact of sample deficiency and provides enough iterations. To obtain enough SER training, this study proposes a novel method which uses Siamese Attention-based Long Short-Term Memory networks. In this framework, we designed two attention-based Long Short-Term Memory networks which share the same weights, and we input frame-level acoustic emotional features to the Siamese network rather than utterance-level emotional features. The proposed solution has been evaluated on the EMODB, ABC, and UYGSEDB corpora, and showed significant improvement on SER results compared to conventional deep learning methods.
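The paper itself provides no code, but the pairwise-training idea in the abstract — that n labeled utterances yield n·(n−1)/2 training pairs, stretching a small emotional-speech corpus — can be sketched as follows. This is a minimal illustration with assumed names (`make_pairs`, `contrastive_loss`) and toy random features, not the authors' implementation; the attention-based LSTM branches that produce the embeddings are omitted.

```python
import itertools
import numpy as np

def make_pairs(features, labels):
    """Build all pairwise combinations; target 1 = same emotion, 0 = different.

    Pairing n utterances yields n*(n-1)/2 training examples, which is how
    a Siamese setup mitigates sample deficiency in a small corpus.
    """
    pairs, targets = [], []
    for i, j in itertools.combinations(range(len(features)), 2):
        pairs.append((features[i], features[j]))
        targets.append(1.0 if labels[i] == labels[j] else 0.0)
    return pairs, np.array(targets)

def contrastive_loss(emb_a, emb_b, target, margin=1.0):
    """Contrastive loss on the Euclidean distance between the two
    (shared-weight) branch outputs: pull same-emotion pairs together,
    push different-emotion pairs apart up to the margin."""
    d = np.linalg.norm(emb_a - emb_b)
    return target * d**2 + (1.0 - target) * max(0.0, margin - d)**2

# Toy frame-level features: 4 utterances, 2 emotion classes.
rng = np.random.default_rng(0)
feats = [rng.standard_normal(8) for _ in range(4)]
labels = [0, 0, 1, 1]
pairs, targets = make_pairs(feats, labels)
print(len(pairs))  # 6 pairs from only 4 utterances
```

In the actual model, the two embeddings would come from the twin attention-based LSTM branches applied to frame-level acoustic features; here the pairing and loss are shown in isolation.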
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.2019EAL2156/_p
@ARTICLE{e103-a_7_937,
author={Tashpolat NIZAMIDIN and Li ZHAO and Ruiyu LIANG and Yue XIE and Askar HAMDULLA},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Siamese Attention-Based LSTM for Speech Emotion Recognition},
year={2020},
volume={E103-A},
number={7},
pages={937-941},
abstract={As one of the popular topics in the field of human-computer interaction, Speech Emotion Recognition (SER) aims to classify the emotional tendency from speakers' utterances. Using existing deep learning methods and a large amount of training data, a highly accurate performance can be achieved. Unfortunately, it is a time-consuming and difficult job to build such a huge emotional speech database that is universally applicable. However, the Siamese Neural Network (SNN), which we discuss in this paper, can yield extremely precise results with just a limited amount of training data through pairwise training, which mitigates the impact of sample deficiency and provides enough iterations. To obtain enough SER training, this study proposes a novel method which uses Siamese Attention-based Long Short-Term Memory networks. In this framework, we designed two attention-based Long Short-Term Memory networks which share the same weights, and we input frame-level acoustic emotional features to the Siamese network rather than utterance-level emotional features. The proposed solution has been evaluated on the EMODB, ABC, and UYGSEDB corpora, and showed significant improvement on SER results compared to conventional deep learning methods.},
keywords={},
doi={10.1587/transfun.2019EAL2156},
ISSN={1745-1337},
month={July},}
TY - JOUR
TI - Siamese Attention-Based LSTM for Speech Emotion Recognition
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 937
EP - 941
AU - Tashpolat NIZAMIDIN
AU - Li ZHAO
AU - Ruiyu LIANG
AU - Yue XIE
AU - Askar HAMDULLA
PY - 2020
DO - 10.1587/transfun.2019EAL2156
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E103-A
IS - 7
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - July 2020
AB - As one of the popular topics in the field of human-computer interaction, Speech Emotion Recognition (SER) aims to classify the emotional tendency from speakers' utterances. Using existing deep learning methods and a large amount of training data, a highly accurate performance can be achieved. Unfortunately, it is a time-consuming and difficult job to build such a huge emotional speech database that is universally applicable. However, the Siamese Neural Network (SNN), which we discuss in this paper, can yield extremely precise results with just a limited amount of training data through pairwise training, which mitigates the impact of sample deficiency and provides enough iterations. To obtain enough SER training, this study proposes a novel method which uses Siamese Attention-based Long Short-Term Memory networks. In this framework, we designed two attention-based Long Short-Term Memory networks which share the same weights, and we input frame-level acoustic emotional features to the Siamese network rather than utterance-level emotional features. The proposed solution has been evaluated on the EMODB, ABC, and UYGSEDB corpora, and showed significant improvement on SER results compared to conventional deep learning methods.
ER -