Kenichi FUJITA
NTT Corporation
Atsushi ANDO
NTT Corporation
Yusuke IJIMA
NTT Corporation
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Citation:
Kenichi FUJITA, Atsushi ANDO, Yusuke IJIMA, "Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis" in IEICE TRANSACTIONS on Information and Systems,
vol. E107-D, no. 1, pp. 93-104, January 2024, doi: 10.1587/transinf.2023EDP7039.
Abstract: This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2023EDP7039/_p
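The abstract describes extracting utterance-level speaker embeddings from only phoneme sequences and their durations, using a speaker identification model analogous to conventional spectral feature-based ones, and scoring embedding closeness for verification-style evaluation (the reported 15.2% EER). Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the encoder architecture, layer sizes, pooling strategy, and all names are illustrative assumptions, since the page gives no architectural details.

# Hypothetical sketch of a rhythm-based speaker embedding extractor.
# Inputs are phoneme IDs and per-phoneme durations only, as in the abstract;
# a speaker-ID classification head drives training, and the pooled encoder
# output serves as the speaker embedding. All sizes are assumptions.
import torch
import torch.nn as nn

class RhythmSpeakerEncoder(nn.Module):
    def __init__(self, n_phonemes: int, n_speakers: int, emb_dim: int = 256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, 64)
        # Each step is a phoneme embedding concatenated with its log-duration.
        self.rnn = nn.GRU(64 + 1, emb_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * emb_dim, emb_dim)
        # Speaker-ID head used only during training; embeddings are taken
        # from embed() at inference time.
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def embed(self, phonemes: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # phonemes: (batch, seq) int64; durations: (batch, seq) float seconds
        x = torch.cat(
            [self.phoneme_emb(phonemes), torch.log1p(durations).unsqueeze(-1)],
            dim=-1,
        )
        out, _ = self.rnn(x)
        # Mean-pool over the phoneme sequence for an utterance-level vector.
        return self.proj(out.mean(dim=1))

    def forward(self, phonemes, durations):
        return self.classifier(self.embed(phonemes, durations))

if __name__ == "__main__":
    model = RhythmSpeakerEncoder(n_phonemes=50, n_speakers=100)
    phonemes = torch.randint(0, 50, (2, 20))   # two utterances, 20 phonemes each
    durations = torch.rand(2, 20) * 0.3        # durations in seconds
    emb = model.embed(phonemes, durations)     # (2, 256) rhythm embeddings
    # Cosine similarity between embeddings, as would feed a
    # verification-style evaluation such as EER.
    sim = torch.cosine_similarity(emb[0], emb[1], dim=0)
    print(emb.shape, float(sim))

In a multi-speaker synthesis pipeline, an embedding produced this way would condition the duration model so that synthesized phoneme durations follow the target speaker's rhythm; the abstract reports that this yields rhythm closer to the target speaker than the conventional spectral-embedding approach.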
BibTeX:
@ARTICLE{e107-d_1_93,
author={Kenichi FUJITA and Atsushi ANDO and Yusuke IJIMA},
journal={IEICE TRANSACTIONS on Information and Systems},
title={Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis},
year={2024},
volume={E107-D},
number={1},
pages={93-104},
abstract={This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.},
keywords={},
doi={10.1587/transinf.2023EDP7039},
ISSN={1745-1361},
month={January},}
RIS:
TY - JOUR
TI - Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 93
EP - 104
AU - Kenichi FUJITA
AU - Atsushi ANDO
AU - Yusuke IJIMA
PY - 2024
DO - 10.1587/transinf.2023EDP7039
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E107-D
IS - 1
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - January 2024
AB - This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
ER -