The phenomenon where a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech may be found in social media, datasets of CS speech with corresponding CS transcriptions are hard to obtain even though they are required for supervised training. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other through semisupervised learning. After supervised learning with monolingual data, the machine speech chain is carried out with unsupervised learning on either CS text or CS speech. The results show that the machine speech chain trains ASR and TTS together and improves their performance without requiring paired CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain to handle CS better by providing language information. We demonstrate that the proposed approach improves performance on both a single CS language pair and multiple CS language pairs, including unknown CS combinations excluded from the training data.
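The abstract describes a two-stage recipe: supervised training of ASR and TTS on paired monolingual data, followed by unsupervised speech-chain training on unpaired CS text or CS speech, with language embeddings supplying language information. Below is a minimal, hypothetical PyTorch sketch of that loop. The module names, feature shapes, and the single-vector treatment of speech and text are simplifying assumptions for illustration only; they do not reproduce the authors' sequence-to-sequence implementation.

# Hypothetical sketch of semi-supervised machine speech chain training;
# all names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, VOCAB, N_LANG, LANG_DIM = 80, 100, 2, 8

class TinyASR(nn.Module):
    """Maps a speech feature vector (+ language embedding) to token logits."""
    def __init__(self):
        super().__init__()
        self.lang_emb = nn.Embedding(N_LANG, LANG_DIM)
        self.net = nn.Linear(FEAT_DIM + LANG_DIM, VOCAB)
    def forward(self, speech, lang_id):
        return self.net(torch.cat([speech, self.lang_emb(lang_id)], dim=-1))

class TinyTTS(nn.Module):
    """Maps a token id (+ language embedding) back to a speech feature vector."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, 32)
        self.lang_emb = nn.Embedding(N_LANG, LANG_DIM)
        self.net = nn.Linear(32 + LANG_DIM, FEAT_DIM)
    def forward(self, tokens, lang_id):
        return self.net(torch.cat([self.tok_emb(tokens), self.lang_emb(lang_id)], dim=-1))

asr, tts = TinyASR(), TinyTTS()
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)

def supervised_step(speech, text, lang_id):
    """Stage 1: supervised training on paired monolingual speech/text."""
    loss = F.cross_entropy(asr(speech, lang_id), text) + \
           F.l1_loss(tts(text, lang_id), speech)
    opt.zero_grad(); loss.backward(); opt.step()

def unsupervised_text_step(cs_text, lang_id):
    """Stage 2a: CS text only. TTS synthesizes speech; ASR must recover the text."""
    synth = tts(cs_text, lang_id).detach()        # pseudo speech, no gradient into TTS
    loss = F.cross_entropy(asr(synth, lang_id), cs_text)
    opt.zero_grad(); loss.backward(); opt.step()

def unsupervised_speech_step(cs_speech, lang_id):
    """Stage 2b: CS speech only. ASR transcribes it; TTS must reconstruct the speech."""
    pseudo_text = asr(cs_speech, lang_id).argmax(dim=-1)   # greedy pseudo transcription
    loss = F.l1_loss(tts(pseudo_text, lang_id), cs_speech)
    opt.zero_grad(); loss.backward(); opt.step()

# Toy usage: a paired monolingual batch, then unpaired CS text and CS speech.
speech = torch.randn(4, FEAT_DIM); text = torch.randint(0, VOCAB, (4,))
lang = torch.randint(0, N_LANG, (4,))
supervised_step(speech, text, lang)
unsupervised_text_step(torch.randint(0, VOCAB, (4,)), lang)
unsupervised_speech_step(torch.randn(4, FEAT_DIM), lang)

In the text-only step the synthesized speech is detached so only ASR is updated, and in the speech-only step the greedy transcription is non-differentiable so only TTS is updated; this mirrors how each model generates pseudo-pairs for the other in speech-chain training.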
Sahoko NAKAYAMA
Nara Institute of Science and Technology, RIKEN Center for Advanced Intelligence Project AIP
Andros TJANDRA
Nara Institute of Science and Technology
Sakriani SAKTI
Nara Institute of Science and Technology, RIKEN Center for Advanced Intelligence Project AIP
Satoshi NAKAMURA
Nara Institute of Science and Technology, RIKEN Center for Advanced Intelligence Project AIP
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Sahoko NAKAYAMA, Andros TJANDRA, Sakriani SAKTI, Satoshi NAKAMURA, "Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain" in IEICE TRANSACTIONS on Information and Systems,
vol. E104-D, no. 10, pp. 1661-1677, October 2021, doi: 10.1587/transinf.2021EDP7005.
Abstract: The phenomenon where a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech may be found in social media, the datasets of CS speech and corresponding CS transcriptions are hard to obtain even though they are required for supervised training. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other with semisupervised learning. After supervised learning with monolingual data, the machine speech chain is then carried out with unsupervised learning of either the CS text or speech. The results show that the machine speech chain trains ASR and TTS together and improves performance without requiring the pair of CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain in order to handle CS better by giving language information. We demonstrate that our proposed approach can improve the performance on both a single CS language pair and multiple CS language pairs, including the unknown CS excluded from training data.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7005/_p
@ARTICLE{e104-d_10_1661,
author={Sahoko NAKAYAMA and Andros TJANDRA and Sakriani SAKTI and Satoshi NAKAMURA},
journal={IEICE TRANSACTIONS on Information and Systems},
title={Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain},
year={2021},
volume={E104-D},
number={10},
pages={1661-1677},
abstract={The phenomenon where a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech may be found in social media, the datasets of CS speech and corresponding CS transcriptions are hard to obtain even though they are required for supervised training. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other with semisupervised learning. After supervised learning with monolingual data, the machine speech chain is then carried out with unsupervised learning of either the CS text or speech. The results show that the machine speech chain trains ASR and TTS together and improves performance without requiring the pair of CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain in order to handle CS better by giving language information. We demonstrate that our proposed approach can improve the performance on both a single CS language pair and multiple CS language pairs, including the unknown CS excluded from training data.},
keywords={},
doi={10.1587/transinf.2021EDP7005},
ISSN={1745-1361},
month={October},}
TY - JOUR
TI - Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 1661
EP - 1677
AU - Sahoko NAKAYAMA
AU - Andros TJANDRA
AU - Sakriani SAKTI
AU - Satoshi NAKAMURA
PY - 2021
DO - 10.1587/transinf.2021EDP7005
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E104-D
IS - 10
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - October 2021
AB - The phenomenon where a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech may be found in social media, the datasets of CS speech and corresponding CS transcriptions are hard to obtain even though they are required for supervised training. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other with semisupervised learning. After supervised learning with monolingual data, the machine speech chain is then carried out with unsupervised learning of either the CS text or speech. The results show that the machine speech chain trains ASR and TTS together and improves performance without requiring the pair of CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain in order to handle CS better by giving language information. We demonstrate that our proposed approach can improve the performance on both a single CS language pair and multiple CS language pairs, including the unknown CS excluded from training data.
ER -