The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations (e.g., some numerals may appear as "XNUMX").
Copyright notice
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Citation:
Longbiao WANG, Kazue MINAMI, Kazumasa YAMAMOTO, Seiichi NAKAGAWA, "Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions" in IEICE TRANSACTIONS on Information,
vol. E93-D, no. 9, pp. 2397-2406, September 2010, doi: 10.1587/transinf.E93.D.2397.
Abstract: In this paper, we investigate the effectiveness of phase for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. For MFCCs, which dominantly capture vocal tract information, only the magnitude of the Fourier transform of time-domain speech frames is used, and phase information has been ignored. Because the phase information includes rich voice source information, high complementarity between the phase information and MFCCs is expected. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase depending on the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy or signal-to-noise (SN) ratio, and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database, with stationary/non-stationary noise added, were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean-speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames having low energy/SN), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.
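The abstract names three processing steps: spectral subtraction, phase normalization against the frame's clipping position, and skipping low-energy/SN frames. The sketch below illustrates one plausible shape for each step; it is a minimal illustration, not the authors' exact formulation, and all function names, parameters, and defaults (e.g. `base_bin`, `keep_ratio`) are assumptions made for the example.

```python
import numpy as np

def spectral_subtraction(frames_mag, noise_mag, floor=0.1):
    """Subtract an estimated noise magnitude spectrum from each frame's
    magnitude spectrum, flooring the result to avoid negative values."""
    cleaned = frames_mag - noise_mag
    return np.maximum(cleaned, floor * frames_mag)

def normalized_phase_features(frame, base_bin, target_phase=0.0):
    """Normalize the phase spectrum so the phase at a chosen base bin is a
    constant, removing the linear phase offset caused by the frame's
    clipping (cutting) position; return {cos, sin} of the normalized phase
    so the feature values are continuous across the +/- pi wrap-around."""
    spec = np.fft.rfft(frame)
    phase = np.angle(spec)
    bins = np.arange(len(phase))
    # A time shift adds phase linearly in frequency, so compensate with a
    # linear-in-frequency shift that pins phase[base_bin] to target_phase.
    shift = (target_phase - phase[base_bin]) * bins / base_bin
    psi = phase + shift
    return np.concatenate([np.cos(psi), np.sin(psi)])

def drop_low_energy_frames(frames, keep_ratio=0.8):
    """Keep only the highest-energy frames -- a simple stand-in for the
    paper's low-energy/SN frame-skipping step."""
    energy = np.sum(frames ** 2, axis=1)
    threshold = np.quantile(energy, 1.0 - keep_ratio)
    return frames[energy >= threshold]
```

In a recognizer along the lines the abstract describes, features like these would be scored by separate models (e.g. GMMs) for the phase stream and the MFCC stream, and the two log-likelihoods combined with a weighting factor; those modeling details are not shown here.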
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.2397/_p
BibTeX:
@ARTICLE{e93-d_9_2397,
author={Longbiao WANG and Kazue MINAMI and Kazumasa YAMAMOTO and Seiichi NAKAGAWA},
journal={IEICE TRANSACTIONS on Information},
title={Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions},
year={2010},
volume={E93-D},
number={9},
pages={2397-2406},
abstract={In this paper, we investigate the effectiveness of phase for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. For MFCCs, which dominantly capture vocal tract information, only the magnitude of the Fourier transform of time-domain speech frames is used, and phase information has been ignored. Because the phase information includes rich voice source information, high complementarity between the phase information and MFCCs is expected. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase depending on the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy or signal-to-noise (SN) ratio, and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database, with stationary/non-stationary noise added, were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean-speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames having low energy/SN), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.},
keywords={},
doi={10.1587/transinf.E93.D.2397},
ISSN={1745-1361},
month={September},}
RIS:
TY - JOUR
TI - Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions
T2 - IEICE TRANSACTIONS on Information
SP - 2397
EP - 2406
AU - Longbiao WANG
AU - Kazue MINAMI
AU - Kazumasa YAMAMOTO
AU - Seiichi NAKAGAWA
PY - 2010
DO - 10.1587/transinf.E93.D.2397
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E93-D
IS - 9
JA - IEICE TRANSACTIONS on Information
Y1 - September 2010
AB - In this paper, we investigate the effectiveness of phase for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. For MFCCs, which dominantly capture vocal tract information, only the magnitude of the Fourier transform of time-domain speech frames is used, and phase information has been ignored. Because the phase information includes rich voice source information, high complementarity between the phase information and MFCCs is expected. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase depending on the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy or signal-to-noise (SN) ratio, and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database, with stationary/non-stationary noise added, were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean-speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames having low energy/SN), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.
ER -