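One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, several phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained on a known language, on the assumption that recognizers for different languages extract complementary information for effective fusion. However, this method is limited by the large amount of training data that requires word- or phone-level transcription. In addition, score fusion is not optimal, because fusion at the feature or model level retains more information than fusion at the score level. This paper presents a new strategy for building and fusing parallel phone recognizers (PPR): training multiple acoustically diversified phone recognizers and fusing them at the feature level. The phone recognizers are trained on the same speech data but use different acoustic features and model training techniques. For the acoustic features, both Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For model training, the maximum likelihood and feature minimum phone error methods are examined as ways to train complementary acoustic models. The phonotactic features of the acoustically diversified phone recognizers are fused with a simple linear fusion method to build the PPRVSM system, and a novel logistic regression optimized weighting (LROW) approach is introduced to optimize the fusion factors. The experimental results show that fusion at the feature level is more effective than fusion at the score level, and that the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best-performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.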
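The feature-level linear fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following are assumptions made only for illustration: each recognizer decodes the utterance into a phone sequence over a shared phone set, the phonotactic feature is a vector of phone-bigram relative frequencies, and the fusion weights are fixed by hand. In the paper the weights are optimized by LROW, which is not reproduced here. The function names `bigram_features` and `fuse_features` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Bigram = Tuple[str, str]


def bigram_features(phones: Sequence[str]) -> Dict[Bigram, float]:
    """Relative frequency of each phone bigram in one decoded sequence.

    Assumption: bigram relative frequencies stand in for the paper's
    phonotactic features, whose exact definition is not in the abstract.
    """
    counts = Counter(zip(phones, phones[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: count / total for bigram, count in counts.items()}


def fuse_features(
    features: List[Dict[Bigram, float]], weights: List[float]
) -> Dict[Bigram, float]:
    """Weighted linear combination of per-recognizer feature vectors.

    Assumption: all recognizers share one phone set, so the same bigram
    key refers to the same dimension in every vector.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per recognizer")
    fused: Dict[Bigram, float] = {}
    for feature, weight in zip(features, weights):
        for bigram, value in feature.items():
            fused[bigram] = fused.get(bigram, 0.0) + weight * value
    return fused


# Hypothetical outputs of two recognizers (e.g. MFCC-based and PLP-based)
# for the same utterance.
mfcc_phones = ["a", "b", "a", "b"]
plp_phones = ["a", "b", "b", "a"]

# Hand-picked placeholder weights; the paper tunes these with LROW.
fused = fuse_features(
    [bigram_features(mfcc_phones), bigram_features(plp_phones)],
    [0.5, 0.5],
)
```

The fused vector would then be passed to a single vector space model. In score-level fusion, by contrast, each recognizer's features go through their own model and only the resulting scores are combined.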
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yan DENG, Wei-Qiang ZHANG, Yan-Min QIAN, Jia LIU, "Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion" in IEICE TRANSACTIONS on Information, vol. E94-D, no. 3, pp. 679-689, March 2011, doi: 10.1587/transinf.E94.D.679.
Abstract: One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, which is assumed to extract complementary information for effective fusion. But this method is limited by the large amount of training samples for which word or phone level transcription is required. Also, score fusion is not the optimal method as fusion at the feature or model level will retain more information than at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR). This is achieved by training multiple acoustic diversified phone recognizers and fusing at the feature level. The phone recognizers are trained on the same speech data but using different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse phonotactic features of the acoustic diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than at the score level. And the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. 
The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E94.D.679/_p
@ARTICLE{e94-d_3_679,
author={Yan DENG and Wei-Qiang ZHANG and Yan-Min QIAN and Jia LIU},
journal={IEICE TRANSACTIONS on Information},
title={Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion},
year={2011},
volume={E94-D},
number={3},
pages={679-689},
abstract={One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, which is assumed to extract complementary information for effective fusion. But this method is limited by the large amount of training samples for which word or phone level transcription is required. Also, score fusion is not the optimal method as fusion at the feature or model level will retain more information than at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR). This is achieved by training multiple acoustic diversified phone recognizers and fusing at the feature level. The phone recognizers are trained on the same speech data but using different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse phonotactic features of the acoustic diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than at the score level. And the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. 
The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.},
keywords={},
doi={10.1587/transinf.E94.D.679},
ISSN={1745-1361},
month={March},
}
TY  - JOUR
TI  - Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion
T2  - IEICE TRANSACTIONS on Information
SP  - 679
EP  - 689
AU  - Yan DENG
AU  - Wei-Qiang ZHANG
AU  - Yan-Min QIAN
AU  - Jia LIU
PY  - 2011
DO  - 10.1587/transinf.E94.D.679
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E94-D
IS  - 3
JA  - IEICE TRANSACTIONS on Information
Y1  - March 2011
AB  - One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, which is assumed to extract complementary information for effective fusion. But this method is limited by the large amount of training samples for which word or phone level transcription is required. Also, score fusion is not the optimal method as fusion at the feature or model level will retain more information than at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR). This is achieved by training multiple acoustic diversified phone recognizers and fusing at the feature level. The phone recognizers are trained on the same speech data but using different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse phonotactic features of the acoustic diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than at the score level. And the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
ER  - 