• Speaker recognition
• Knowledge graph
• Audio event detection
• Pre-training
Speaker recognition

Speaker recognition (identification) is a behavioral biometric technology that automatically identifies a person from the characteristics of his or her voice. It is widely applied in information security, forensic identification, human-computer interaction, and other areas. In addition to speaker recognition itself, our laboratory studies related topics such as speaker diarization, speech anti-spoofing, and multi-modal identity verification. Advances in text-to-speech synthesis (TTS) and voice conversion (VC), together with replay and impersonation attacks, seriously threaten the security of speaker recognition systems. An anti-spoofing system is therefore needed alongside the speaker recognition system to distinguish spoofed speech from bona fide speech.
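As a rough illustration of the identification step, the sketch below compares a test utterance with enrolled speakers by cosine similarity. It assumes that each utterance has already been mapped to a fixed-length embedding by some front-end model, which is not shown. The function names, the toy 3-dimensional vectors, and the 0.7 threshold are hypothetical and chosen only for the example; this is not a description of the laboratory's actual system.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identify_speaker(test_embedding, enrolled, threshold=0.7):
    """Return (speaker, score) for the closest enrolled speaker.

    If the best score is below the (hypothetical) threshold, the
    utterance is rejected and (None, score) is returned instead.
    """
    best_name, best_score = None, -1.0
    for name, embedding in enrolled.items():
        score = cosine_similarity(test_embedding, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy, hand-made embeddings; real embeddings come from a trained model.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify_speaker([0.9, 0.1, 0.0], enrolled))  # closest to "alice"
print(identify_speaker([0.5, 0.5, 0.7], enrolled))  # rejected: below threshold
```

The rejection threshold is what separates open-set identification (the speaker may be unknown) from simply picking the nearest enrolled speaker.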

Knowledge graph

A knowledge graph is a knowledge base that uses a graph-structured data model or topology to integrate data. It is often used to store entities (objects, events, situations, or abstract concepts) with free-form semantics, together with the relationships between them.
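One common way to picture this is as a set of (head, relation, tail) triples. The sketch below is a minimal in-memory triple store, written only for illustration: the class name `TripleStore`, its methods, and the relation names are all hypothetical and do not correspond to any particular knowledge graph library.

```python
from collections import defaultdict


class TripleStore:
    """A tiny in-memory store of (head, relation, tail) triples."""

    def __init__(self):
        # Maps each head entity to the set of its (relation, tail) pairs.
        self._by_head = defaultdict(set)

    def add(self, head, relation, tail):
        self._by_head[head].add((relation, tail))

    def tails(self, head, relation):
        """All tail entities linked to `head` by `relation`, sorted."""
        pairs = self._by_head.get(head, ())
        return sorted(t for r, t in pairs if r == relation)

    def two_hop(self, head, first_relation, second_relation):
        """Follow two relations in sequence, a very simple form of reasoning."""
        result = set()
        for middle in self.tails(head, first_relation):
            result.update(self.tails(middle, second_relation))
        return sorted(result)


kg = TripleStore()
kg.add("Marie Curie", "born_in", "Warsaw")
kg.add("Marie Curie", "field", "Physics")
kg.add("Warsaw", "capital_of", "Poland")

print(kg.tails("Marie Curie", "born_in"))                  # ['Warsaw']
print(kg.two_hop("Marie Curie", "born_in", "capital_of"))  # ['Poland']
```

The two-hop query shows how a fact that is not stored directly can be derived by chaining stored relations.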

From the perspective of cognitive intelligence, knowledge graphs give machines abilities such as explanation, reasoning, and induction, and they are valuable in applications such as intelligent question answering, big data analysis, and recommendation.

Research on knowledge graphs covers knowledge acquisition, the fusion of knowledge graphs from different sources, and knowledge reasoning and application.

We have carried out research on entity recognition, relation extraction, and entity alignment, and we plan to extend our work to other areas.

Audio event detection

Audio event detection (AED) is one of the main tasks of audio content analysis and processing. The goal is to determine the type of each event that occurs in an audio segment and to mark its start and end times. In recent years, AED has become an important research topic in the field of auditory perception, with broad applications in security monitoring, medicine, multimedia retrieval, smart homes, and other areas. However, practical applications still face many challenges. First, in abnormal sound detection (ASD), abnormal data is scarce and difficult to obtain. Second, real environments contain noise that is difficult to eliminate as well as overlapping sound sources, both of which degrade the performance of an audio event detection system. Third, because large amounts of strongly labeled data are difficult to obtain, audio event detection on weakly labeled datasets (without timestamps) is particularly important in practice.

We address these challenges in the following ways. First, for weakly labeled data, that is, data whose labels are incomplete, ambiguous, or wrong, we apply weakly supervised learning and detection, using methods such as active learning, semi-supervised learning, multi-instance learning, and learning with noisy labels. Second, for large amounts of unlabeled data, where manual labeling is too costly, we use unsupervised methods such as clustering for detection. Third, when very few training examples, or even a single one, are available, we explore one-shot and zero-shot learning methods.
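The detection goal (an event type plus start and end times) and the multi-instance view of weakly labeled clips can be sketched as follows. The sketch assumes that some model, not shown, already outputs one probability per frame for a single event class. The function names, the 0.5 threshold, and the frame hop are hypothetical values chosen for the example, and max pooling is only one of several possible pooling choices in multi-instance learning.

```python
def frames_to_events(frame_probs, label, threshold=0.5, hop_seconds=0.5):
    """Turn per-frame probabilities into (label, start, end) events.

    Consecutive frames at or above the threshold are merged into one
    event; times are in seconds, computed from the frame index.
    """
    events = []
    start = None  # index of the first frame of the current event, if any
    for i, prob in enumerate(frame_probs):
        active = prob >= threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append((label, start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:  # event still running at the end of the clip
        events.append((label, start * hop_seconds, len(frame_probs) * hop_seconds))
    return events


def clip_probability(frame_probs):
    """Clip-level score under the multi-instance assumption.

    A weakly labeled clip is positive if at least one frame is positive,
    so the clip score is taken as the maximum frame score.
    """
    return max(frame_probs, default=0.0)


probs = [0.1, 0.8, 0.9, 0.2, 0.7, 0.7]
print(frames_to_events(probs, "dog_bark"))
# [('dog_bark', 0.5, 1.5), ('dog_bark', 2.0, 3.0)]
print(clip_probability(probs))  # 0.9
```

With weak labels, only the clip-level score can be compared against the annotation during training, while the frame-level scores are still available at test time to recover timestamps.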


Pre-training

The idea behind pre-training is to "use as much training data as possible to extract as many common features as possible, so as to reduce the learning burden of the model for specific tasks". In recent years, large-scale pre-trained models (PTMs) such as BERT and GPT have achieved great success and have become a milestone in the field of artificial intelligence. Thanks to their sophisticated pre-training objectives and large numbers of parameters, large-scale PTMs can effectively capture knowledge from large amounts of labeled and unlabeled data. This knowledge is stored implicitly in the model parameters, and after fine-tuning on a specific task it can benefit a variety of downstream tasks, as has been widely demonstrated through experimental and empirical analysis. We therefore use pre-training to better solve various voice-related tasks, such as subject classification, keyword spotting, fake iris detection, and speaker recognition, with the aim of improving on existing results.

Recent Publications