The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Copyrights notice
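This paper proposes the INmfCA algorithm, a new training framework for nonparallel voice conversion (VC) systems. Traditional VC frameworks train their conversion models on parallel corpora, in which the source and target speakers utter the same linguistic content. These frameworks achieve high-quality VC, but they cannot be applied when parallel corpora are unavailable. Nonparallel methods, which obtain conversion models without parallel corpora, have therefore been widely studied. They achieve VC under nonparallel conditions, but they tend to require either extensive background knowledge or many training utterances, because linguistic information and speaker information are difficult to disentangle without a large amount of data. This work tackles the problem with non-negative matrix factorization (NMF), which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The method acquires an alignment between the acoustic features of the source speaker's utterances and a target dictionary, then uses that alignment as the NMF activation to train the source speaker's dictionary without parallel corpora. The alignment step is based on the INCA algorithm, which obtains alignments for nonparallel corpora. Unlike in the INCA algorithm, the alignment is not restricted to observed samples, so the proposed method can use small nonparallel corpora efficiently. Subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed both an INCA-based nonparallel framework and CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which does not need to train on source speakers, can be built on the proposed method.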
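The dictionary-training idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' INmfCA procedure: it assumes a Euclidean NMF objective with standard multiplicative updates, and the function names (`estimate_activations`, `update_dictionary`, `train_source_dictionary`, `convert`), the outer loop, and the initialization from the target dictionary are all hypothetical choices made for illustration. It shows only the general pattern of holding the activations fixed while fitting a source dictionary whose atoms share indices with the target dictionary.

```python
import numpy as np


def estimate_activations(X, W, n_iter=100, eps=1e-12):
    """Fit non-negative activations H (K x T) with the dictionary W (D x K) fixed,
    so that X ~ W @ H. Uses multiplicative updates for the Euclidean objective."""
    rng = np.random.default_rng(0)  # fixed seed keeps the sketch deterministic
    H = rng.random((W.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return H


def update_dictionary(X, H, W_init, n_iter=100, eps=1e-12):
    """Fit a non-negative dictionary W (D x K) with the activations H fixed,
    starting from W_init, so that X ~ W @ H."""
    W = W_init.copy()
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W


def train_source_dictionary(X_src, W_tgt, n_outer=5):
    """Assumed alternating scheme (illustrative only): start from the target
    dictionary, then alternate between estimating activations and refitting
    the source dictionary. Atom k of W_src stays paired with atom k of W_tgt."""
    W_src = W_tgt.copy()
    for _ in range(n_outer):
        H = estimate_activations(X_src, W_src)
        W_src = update_dictionary(X_src, H, W_src)
    return W_src


def convert(X_src, W_src, W_tgt):
    """Convert source features by reusing their activations with the target dictionary."""
    H = estimate_activations(X_src, W_src)
    return W_tgt @ H


# Toy usage with synthetic non-negative "spectra" (D bins, K atoms, T frames).
rng = np.random.default_rng(1)
D, K, T = 8, 5, 40
W_tgt = rng.random((D, K)) + 0.1
X_src = (rng.random((D, K)) + 0.1) @ rng.random((K, T))
W_src = train_source_dictionary(X_src, W_tgt)
Y = convert(X_src, W_src, W_tgt)  # same shape as X_src
```

Because the source and target dictionaries share atom indices, swapping the dictionary while keeping the activations is what performs the conversion in this sketch. The real method's alignment step, divergence, and features may differ.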
Hitoshi SUDA
the University of Tokyo
Gaku KOTANI
the University of Tokyo
Daisuke SAITO
the University of Tokyo
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Hitoshi SUDA, Gaku KOTANI, Daisuke SAITO, "INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization" in IEICE TRANSACTIONS on Information,
vol. E105-D, no. 6, pp. 1196-1210, June 2022, doi: 10.1587/transinf.2021EDP7234.
Abstract: In this paper, we propose a new training framework named the INmfCA algorithm for nonparallel voice conversion (VC) systems. To train conversion models, traditional VC frameworks require parallel corpora, in which source and target speakers utter the same linguistic contents. Although the frameworks have achieved high-quality VC, they are not applicable in situations where parallel corpora are unavailable. To acquire conversion models without parallel corpora, nonparallel methods are widely studied. Although the frameworks achieve VC under nonparallel conditions, they tend to require huge background knowledge or many training utterances. This is because of difficulty in disentangling linguistic and speaker information without a large amount of data. In this work, we tackle this problem by exploiting NMF, which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The method acquires alignment between the acoustic features of a source speaker's utterances and a target dictionary and uses the obtained alignment as activation of NMF to train the source speaker's dictionary without parallel corpora. The acquisition method is based on the INCA algorithm, which obtains the alignment of nonparallel corpora. In contrast to the INCA algorithm, the alignment is not restricted to observed samples, and thus the proposed method can efficiently utilize small nonparallel corpora. The results of subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed not only an INCA-based nonparallel framework but also CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which does not need to train source speakers, can be constructed on the basis of the proposed method.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7234/_p
@ARTICLE{e105-d_6_1196,
author={Hitoshi SUDA and Gaku KOTANI and Daisuke SAITO},
journal={IEICE TRANSACTIONS on Information},
title={INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization},
year={2022},
volume={E105-D},
number={6},
pages={1196--1210},
abstract={In this paper, we propose a new training framework named the INmfCA algorithm for nonparallel voice conversion (VC) systems. To train conversion models, traditional VC frameworks require parallel corpora, in which source and target speakers utter the same linguistic contents. Although the frameworks have achieved high-quality VC, they are not applicable in situations where parallel corpora are unavailable. To acquire conversion models without parallel corpora, nonparallel methods are widely studied. Although the frameworks achieve VC under nonparallel conditions, they tend to require huge background knowledge or many training utterances. This is because of difficulty in disentangling linguistic and speaker information without a large amount of data. In this work, we tackle this problem by exploiting NMF, which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The method acquires alignment between the acoustic features of a source speaker's utterances and a target dictionary and uses the obtained alignment as activation of NMF to train the source speaker's dictionary without parallel corpora. The acquisition method is based on the INCA algorithm, which obtains the alignment of nonparallel corpora. In contrast to the INCA algorithm, the alignment is not restricted to observed samples, and thus the proposed method can efficiently utilize small nonparallel corpora. The results of subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed not only an INCA-based nonparallel framework but also CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which does not need to train source speakers, can be constructed on the basis of the proposed method.},
keywords={},
doi={10.1587/transinf.2021EDP7234},
ISSN={1745-1361},
month={June},}
TY - JOUR
TI - INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization
T2 - IEICE TRANSACTIONS on Information
SP - 1196
EP - 1210
AU - Hitoshi SUDA
AU - Gaku KOTANI
AU - Daisuke SAITO
PY - 2022
DO - 10.1587/transinf.2021EDP7234
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2022
AB - In this paper, we propose a new training framework named the INmfCA algorithm for nonparallel voice conversion (VC) systems. To train conversion models, traditional VC frameworks require parallel corpora, in which source and target speakers utter the same linguistic contents. Although the frameworks have achieved high-quality VC, they are not applicable in situations where parallel corpora are unavailable. To acquire conversion models without parallel corpora, nonparallel methods are widely studied. Although the frameworks achieve VC under nonparallel conditions, they tend to require huge background knowledge or many training utterances. This is because of difficulty in disentangling linguistic and speaker information without a large amount of data. In this work, we tackle this problem by exploiting NMF, which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The method acquires alignment between the acoustic features of a source speaker's utterances and a target dictionary and uses the obtained alignment as activation of NMF to train the source speaker's dictionary without parallel corpora. The acquisition method is based on the INCA algorithm, which obtains the alignment of nonparallel corpora. In contrast to the INCA algorithm, the alignment is not restricted to observed samples, and thus the proposed method can efficiently utilize small nonparallel corpora. The results of subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed not only an INCA-based nonparallel framework but also CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which does not need to train source speakers, can be constructed on the basis of the proposed method.
ER -