Research


Speaker recognition (identification) is a behavioral biometric technology that automatically identifies a person from the characteristics of his or her voice. It has wide applications in information security, forensic identification, human-computer interaction, and so on. Although state-of-the-art speaker verification systems achieve low EERs (equal error rates) on several academic databases, such as the NIST SREs and RSR2015, their performance in real applications is rarely satisfactory due to mismatched content, communication channels, languages, background noise, vocal effort, health conditions, and so on. For higher recognition accuracy and more robust applications, we attack this challenging task with discriminant analysis, local learning, deep neural networks and joint modeling.
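
As a concrete reference point, the EER is the operating threshold at which the false acceptance rate equals the false rejection rate. Below is a minimal sketch in Python of how the EER can be estimated from verification trial scores; the function and variable names are illustrative, not part of any published system of ours.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER: the point where false acceptance = false rejection."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    # False acceptance rate: impostor trials scoring at or above the threshold.
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # threshold where the two curves cross
    return (far[idx] + frr[idx]) / 2

# Synthetic scores only, to exercise the function: genuine trials score higher.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(0.0, 1.0, 10000)
print(f"EER = {equal_error_rate(genuine, impostor):.3f}")
```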

Spoken language (or dialect, accent) recognition is a branch of speech signal processing whose goal is to automatically identify the language of a segment of speech. It can be used in many fields, such as multilingual speech recognition, audio indexing, audio retrieval, information security and so on. Apart from difficulties similar to those in speaker recognition, some languages (or dialects, accents) are easily confused, for example, Hindi vs. Indian English, or American English vs. British English. We try to tell these apart by acoustic, phonetic or prosodic characteristics.

Undoubtedly, automatic speech recognition, speaker recognition and other speech-based recognition technologies are currently hot topics in the field of audio signal processing. As a matter of fact, apart from speech, audio contains more information than you might think. For example, we can diagnose the running state of a machine from its sound, identify whether a baby is crying, or recognize a scene from its characteristic sounds. We call an audio segment carrying a concept of interest an audio event, and audio event detection is the task of detecting such an event in an audio stream. It is a rather challenging task for several reasons. First, events are user-defined, and different audio events vary considerably. Second, it is a detection task, not a classification task: the timestamp of an audio event is required in most cases. Third, overlap between different audio events is very common. Fourth, some audio events are rare; in extreme cases there is only one example. Finally, we often lack sufficient annotation. So far there is no unified, well-suited theory (algorithm or method) for this problem, and we handle it case by case. Given the importance of audio event detection, we believe more and more researchers from academia and industry will turn their attention to this interesting field.
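
To make the detection (rather than classification) aspect concrete, the sketch below shows a common post-processing step: turning per-frame event posteriors produced by some classifier into timestamped events by thresholding and discarding very short detections. All names, parameters and values here are illustrative assumptions, not our published method.

```python
import numpy as np

def posteriors_to_events(posteriors, frame_shift=0.01, threshold=0.5, min_dur=0.1):
    """Convert per-frame posteriors of one event class into (onset, offset) pairs in seconds."""
    active = posteriors >= threshold
    events, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                       # event onset frame
        elif not is_active and start is not None:
            onset, offset = start * frame_shift, i * frame_shift
            if offset - onset >= min_dur:   # drop spurious short detections
                events.append((onset, offset))
            start = None
    if start is not None:                   # event still active at stream end
        events.append((start * frame_shift, len(active) * frame_shift))
    return events

# Synthetic posterior track with one burst of activity from 1.2 s to 1.8 s.
post = np.zeros(500)
post[120:180] = 0.9
print(posteriors_to_events(post))  # approximately [(1.2, 1.8)]
```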

We are also interested in, and do research on, voice activity detection, speech enhancement, audio indexing, duplicate audio detection, continuous speech recognition and keyword spotting.

People


Principal Investigator

Liang He

Associate Professor
Department of Electronic Engineering
Tsinghua University

Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Communication Engineering, Civil Aviation University of China, Tianjin, China

M.S., Information & Communication Engineering, Zhejiang University, Hangzhou, China

Ph.D., Electronic Engineering, Tsinghua University, Beijing, China

WORKING EXPERIENCE

2011-2013, Postdoctoral fellow, Electronic Engineering, Tsinghua University, Beijing, China

2013-2018, Assistant Professor, Electronic Engineering, Tsinghua University, Beijing, China

2018-, Associate Professor, Electronic Engineering, Tsinghua University, Beijing, China

RESEARCH INTERESTS

Speaker recognition, speaker diarization, speech recognition, audio event detection, language recognition, voice activity detection, speech enhancement and audio indexing.


Tianyu Liang

Research and development engineer
Department of Electronic Engineering
Tsinghua University

Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., School of Mathematical Sciences, Beijing Normal University, Beijing, China

BIOGRAPHY

My work focuses on both text-dependent and text-independent speaker recognition. I am currently interested in end-to-end systems and other algorithms related to neural networks. I also do research on voice activity detection based on convolutional neural networks and duplicate audio detection based on audio fingerprinting.

RESEARCH INTERESTS

Speaker recognition, voice activity detection, audio indexing, duplicate audio detection.


Xianwei Zhang

Research and development engineer
Department of Electronic Engineering
Tsinghua University

Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Mathematics and Applied Mathematics, Ningbo University, Ningbo, China

M.S., Probability and Mathematical Statistics, Beijing Normal University, Beijing, China

BIOGRAPHY

Xianwei Zhang studied stochastic processes and eigenvalue problems for Markov chains during his master's studies. His research in speaker recognition and machine learning has focused on compression algorithms, deep neural networks and support vector machines. He is now engaged in the study of acoustic event detection and semi-supervised audio classification at Tsinghua University.

RESEARCH INTERESTS

Deep learning, speaker recognition, acoustic event detection, semi-supervised audio classification.


Yaoguang Wang

M.S. candidate
Department of Electronic Engineering
Tsinghua University

Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Communication Engineering, Harbin Institute of Technology, Harbin, China

BIOGRAPHY

Mr. Wang focuses on analyzing and processing audio data using machine learning, especially deep learning. He mainly studies weakly supervised sound event detection and classification (e.g., via multi-instance learning) and semi-supervised sound event detection and classification using teacher-student models. He also does research on rare sound event detection, sound source separation and related problems, and is now committed to real-time detection of abnormal sound events.

RESEARCH INTERESTS

Weakly supervised and semi-supervised sound event detection and classification, rare sound event detection, sound source separation, abnormal sound events.


Yuting Wang

M.S. candidate
Department of Electronic Engineering
Tsinghua University

Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Communication Engineering, Zhengzhou University, Henan, China

BIOGRAPHY

My work focuses on audio topic classification, using Transformers and deep neural networks to build an end-to-end system. I am also interested in text classification and the Transformer architecture: the former helps me implement audio topic classification, and the latter parallelizes efficiently.

RESEARCH INTERESTS

Audio topic classification, end-to-end, speech recognition, deep learning.


Xinyue Ma

M.S. candidate
Department of Electronic Engineering
Tsinghua University

Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Optoelectronics Engineering, Tianjin University, Tianjin, China

BIOGRAPHY

My work mainly focuses on speaker recognition. I strive to build a comprehensive understanding of this field by studying related research. Recently, I have paid more attention to the ASVspoof challenge and am studying this direction in depth. As a postgraduate student, I hope to make full use of the available learning resources to improve myself and embrace the future in better shape.

RESEARCH INTERESTS

Speaker recognition, spoof speech detection (antispoof), deep neural network.


Ruida Ye

Combined postgraduate training
Department of Astronautical Engineering
Space Engineering University

Contact

Address: Rohm Building 8101, Beijing

EDUCATION

M.S. candidate, Astronautical Engineering, Space Engineering University, Beijing, China

BIOGRAPHY

My research mainly concerns automatic speech recognition and audio event detection. I am especially interested in safety systems for the aerospace field: I want to detect faults in rockets and spacecraft before launch. Because faulty spacecraft audio data is difficult to collect, my main work is unsupervised abnormal audio event detection.

RESEARCH INTERESTS

Automatic speech recognition, audio event detection, voice conversion, data augmentation.

Publications


Wenhao Ding, Liang He, “Adaptive Multi-Scale Detection of Acoustic Events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 294–306, 2020, doi: 10.1109/TASLP.2019.2953350.

Yi Liu, Liang He, Jia Liu and Michael T. Johnson, “Introducing phonetic information to speaker embedding for speaker verification,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, no. 1, 2019.

Liang He, Xianhong Chen, Can Xu, Yi Liu, Jia Liu and Michael T. Johnson, “Latent class model with application to speaker diarization,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, no. 1, p. 12, Jul. 2019.

Xianhong Chen, Liang He, Can Xu and Jia Liu, “Distance-Dependent Metric Learning,” IEEE Signal Processing Letters, Feb. 2019, 26(2), 357-361.

Liang He, Xianhong Chen, Can Xu and Jia Liu, “Multi-objective Optimization Training of PLDA for Speaker Verification,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 6026-6030. [code].

Yi Liu, Liang He and Jia Liu, “Large Margin Softmax Loss for Speaker Verification,” INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria, 2019, vol. 2019-September, pp. 2873–2877. [code].

Zhixuan Li, Liang He, Jingyang Li, Li Wang and Weiqiang Zhang, “Towards Discriminative Representations and Unbiased Predictions: Class-specific Angular Softmax for Speech Emotion Recognition,” INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria, 2019, vol. 2019-September, pp. 1696–1700.

Jingyang Zhang, Wenhao Ding, Jintao Kang and Liang He, “Multi-Scale Time-Frequency Attention for Rare Sound Event Detection,” INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria, 2019, vol. 2019-September, pp. 3855–3859.

Can Xu, Xianhong Chen, Liang He and Jia Liu, “Geometric Discriminant Analysis for I-vector Based Speaker Verification,” 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Accepted.

Liang He, Xianhong Chen, Can Xu and Jia Liu, “Subtraction-Positive Similarity Learning,” 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Accepted.

Liang He, Xianhong Chen, Can Xu, Jia Liu and Michael T. Johnson, “Local Pairwise Linear Discriminant Analysis for Speaker Verification,” IEEE Signal Processing Letters, Oct. 2018, 25(10), 1575-1579. [code].

Wenhao Ding and Liang He, “MTGAN: Speaker Verification through Multitasking Triplet Generative Adversarial Networks,” INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association, 2-6 September 2018, Hyderabad, 3633-3637.

Yi Liu, Liang He, Jia Liu and Michael T. Johnson, “Speaker Embedding Extraction with Phonetic Information,” INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association, 2-6 September 2018, Hyderabad, 2247-2251. [code].

Liang He, Xianhong Chen, Can Xu and Jia Liu, “Latent Class Model for Single Channel Speaker Diarization,” Odyssey 2018 The Speaker and Language Recognition Workshop, 26-29 June 2018, Les Sables d'Olonne, France, 128-133.

Xianhong Chen, Liang He, Can Xu, Yi Liu, Tianyu Liang and Jia Liu, “VB-HMM Speaker Diarization with Enhanced and Refined Segment Representation,” Odyssey 2018 The Speaker and Language Recognition Workshop, 26-29 June 2018, Les Sables d'Olonne, France, 134-139.

Xukui Yang, Liang He, Dan Qu and Weiqiang Zhang, “Semi-supervised minimum redundancy maximum relevance feature selection for audio classification,” Multimedia Tools and Applications 77(1), 713-739.

Tianyu Liang, Xianhong Chen, Can Xu and Liang He, “Parallel Double Audio Fingerprinting,” 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei City, Taiwan, 2018, pp. 344-348.

Yi Liu, Liang He, Weiwei Liu and Jia Liu, “Exploring a Unified Attention-Based Pooling Framework for Speaker Verification,” 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei City, Taiwan, 2018, pp. 200-204.

Yi Liu, Liang He, Weiqiang Zhang, Jia Liu and Michael T. Johnson, “Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification,” 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 2018, pp. 1467-1472.

Liang He, Xianhong Chen, Can Xu, Tianyu Liang and Jia Liu, “Ivec-PLDA-AHC priors for VB-HMM speaker diarization system”, 2017 IEEE International Workshop on Signal Processing Systems (SiPS), Lorient, 2017, 1-6.

Yao Tian, Liang He, Meng Cai, Weiqiang Zhang and Jia Liu, “Deep neural networks based speaker modeling at different levels of phonetic granularity”, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, 5440-5444.

Junbiao Liu, Xinyu Jin, Fang Dong, Liang He and Hong Liu, “Fading channel modelling using single-hidden layer feedforward neural networks”, Multidimensional Systems and Signal Processing 28(3), 885-903.

Yi Liu, Liang He, Yao Tian, Zhuzi Chen, Jia Liu and Michael T. Johnson, “Comparison of multiple features and modeling methods for text-dependent speaker verification”, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, 2017, 629-636.

Yao Tian, Meng Cai, Liang He, Weiqiang Zhang and Jia Liu, “Improving deep neural networks based speaker verification using unlabeled data”, INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association, 1863-1867.

Liang He, Yao Tian, Yi Liu, Jiaming Xu, Weiwei Liu, Meng Cai and Jia Liu, “THU-EE system description for NIST LRE 2015”, INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association, 3294-3298.

Yi Liu, Yao Tian, Liang He and Jia Liu, “Investigating various diarization algorithms for speaker in the wild (SITW) speaker recognition challenge”, INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association, 853-857.

Liang He, Yao Tian, Yi Liu, Fang Dong, Weiqiang Zhang and Jia Liu, “A study of variational method for text-independent speaker recognition”, 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, 2016, 1-5.

Xukui Yang, Liang He, Dan Qu, Weiqiang Zhang and Michael T. Johnson, “Semi-supervised feature selection for audio classification based on constraint compensated Laplacian score”, EURASIP Journal on Audio, Speech, and Music Processing.

Xukui Yang, Liang He, Dan Qu and Weiqiang Zhang, “Voice activity detection algorithm based on long-term pitch information”, EURASIP Journal on Audio, Speech, and Music Processing.

Yao Tian, Meng Cai, Liang He and Jia Liu, “Speaker recognition system based on deep neural networks and bottleneck features”, Journal of Tsinghua University 56(11), 1143-1148.

Fang Dong, Junbiao Liu, Liang He, Xiaohui Hu and Hong Liu, “Channel Estimation Based on Extreme Learning Machine for High Speed Environments,” Proceedings Of ELM-2015, Vol 1: Theory, Algorithms And Applications (I) 6, 159-167.

Liang He, Weiqiang Zhang and Mengnan Shi, “Channel Non-negative Tensor Factorization for Speech Enhancement”, Proceedings of the 2016 International Conference on Artificial Intelligence: Technologies and Applications.

Liang He and Jia Liu, “PRISM: A statistical modeling framework for text-independent speaker verification”, 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, 2015, 529-533.

Like Hui, Meng Cai, Cong Guo, Liang He, Weiqiang Zhang and Jia Liu, “Convolutional maxout neural networks for speech separation”, 2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Abu Dhabi, 2015, 24-27.

Yao Tian, Meng Cai, Liang He and Jia Liu, “Investigation of bottleneck features and multilingual deep neural networks for speaker verification”, INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, 1151-1155.

Yao Tian, Liang He and Jia Liu, “Stacked bottleneck features for speaker verification”, 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, 2015, 514-518.

Yi Liu, Yao Tian, Liang He, Jia Liu and Michael T. Johnson, “Simultaneous utilization of spectral magnitude and phase information to extract supervectors for speaker verification anti-spoofing,” INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, 2082-2086.

Weiqiang Zhang, Cong Guo, Qian Zhang, Jian Kang, Liang He, Jia Liu and Michael T. Johnson, "A speech enhancement algorithm based on computational auditory scene analysis," Journal of Tsinghua University 48(8), 663-669.

Yao Tian, Liang He, Zhiyi Li, Weilan Wu, Weiqiang Zhang and Jia Liu, “Speaker verification using Fisher vector,” The 9th International Symposium on Chinese Spoken Language Processing, Singapore, 2014, 419-422.

Zhiyi Li, Liang He, Weiqiang Zhang and Jia Liu, “Total variability subspace adaptation based speaker recognition,” Acta Automatica Sinica 40(8), 1836-1840.

Yi Liu, Liang He and Jia Liu, “Improved multitaper PNCC feature for robust speaker verification,”, The 9th International Symposium on Chinese Spoken Language Processing, Singapore, 2014, 168-172.

Liang He and Jia Liu, "I-matrix for text-independent speaker recognition," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, 7194-7198.

Weiwei Liu, Weiqiang Zhang, Liang He, Jiaming Xu and Jia Liu, "THUEE system for the Albayzin 2012 language recognition evaluation," 2013 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, 2013, 109-112.

Liang He and Jia Liu, "Orthogonal subspace combination based on the joint factor analysis for text-independent speaker recognition," Lecture Notes in Computer Science, vol 7701, Springer, Berlin, Heidelberg.

Liang He and Jia Liu, "Discriminant local information distance preserving projection for text-independent speaker recognition," 2012 8th International Symposium on Chinese Spoken Language Processing (ISCSLP 2012), 349-352.

Zhiyi Li, Liang He, Weiqiang Zhang and Jia Liu, "Speaker recognition based on discriminant i-vector local distance preserving projection," Journal of Tsinghua University 52(5), 598-601.

Liang He, Yi Yang and Jia Liu, "TLS-NAP algorithm for text-independent speaker recognition," Pattern Recognition and Artificial Intelligence 25(6), 916-921.

Liang He, Yongzhe Shi and Jia Liu, "Eigenchannel space combination method of joint factor analysis," Acta Automatica Sinica 37(7), 849-856.