Context and Motivations of this Thesis

People like to talk

Speech is still one of the most common ways humans communicate their ideas and convey information to the world around them. The amount of spoken information being stored (telephone, radio, television, meetings, lectures, internet, etc.) is already very large and is growing rapidly as storage becomes ever cheaper. Following the two maxims ``time is money'' and ``information is power'', it is clearly desirable to have access to all this information. However, as we have only two ears and limited time, we would like someone else to access it for us and tell us only what is important, so that we do not waste time listening to hours of contentless recordings. At other times we might want to retrieve a particular piece of information without knowing where it is, lost inside our ``Alexandria audio library.'' This is one area where speech technology can make a big contribution by means of techniques like audio indexing, where information is automatically extracted from the audio, making the processing, search and retrieval of the desired content much easier. To draw a parallel, acoustic indexing is to an audio-based library what a good librarian is to a paper-based library.

People do not usually like to talk alone

Most of the time when a person speaks, his or her speech is directed to someone or something else with which he or she expects to communicate. In fact, even when we talk to an animal, a machine or a baby, we adapt our speech so that the message is conveyed to this outer entity. When extracting information from a recording, it therefore becomes very important to answer not only the question ``What was said?'', which conveys the message, but also ``Who said it?'', as the information varies depending on who utters the spoken words.

Within speech technologies, the broad topic of acoustic indexing studies the classification of sounds into different classes/sources. Such classes can be as broad as [cats, dogs, humans] or as specific as [pit bull, pug, German shepherd]. Algorithms used for acoustic indexing are concerned with the correct classification of the sounds, but not necessarily with their correct separation when more than one exists in the same audio segment. These purely classification-oriented techniques have sometimes been called audio clustering, and they benefit from the broad topic of clustering, which is well studied in many areas.

When multiple sounds appear in the same audio signal, one must turn to techniques called audio diarization to process them. As described in Reynolds and Torres-Carrasquillo (2004), audio diarization is the process of annotating an input audio signal with information that attributes (possibly overlapping) temporal regions of signal to their specific sources/classes (i.e. creating a ``diary'' of events in the audio document). These can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Which particular classes are defined depends heavily on the application, and they can be as broad or as narrow as required. In the simplest case, audio diarization reduces to the task of speech versus non-speech detection.

When the possible classes correspond to the different speakers in a recording, these techniques are called speaker diarization. They aim at answering the question ``Who spoke when?'' given an audio signal. Algorithms performing speaker diarization need to locate each speaker turn and assign it to the appropriate speaker cluster. The output of the system is a set of segments, each labeled with an ID that is unique to one of the people who intervene in the recording; it is left to speaker identification systems to determine the person's identity given each ID. To date, the domains that have received the most research attention within the speaker diarization community have been:

Telephone speech: Speaker diarization systems started being evaluated by NIST (National Institute for Standards and Technology, 2006) using single channel telephone speech signals, within the speaker recognition evaluations in the late 1990s.
Broadcast News (radio and TV broadcasts): Driven mainly by DARPA's EARS program (DARPA Effective, Affordable, Reusable Speech-to-Text (EARS), 2004), rich transcription of broadcast news content became the primary research domain for speaker diarization roughly from 2002 to 2004. Rich transcription consists of adding extra information (generally called metadata, and including speaker diarization information) to the speech-to-text transcriptions.
Meetings (lectures and conferences): Driven mainly by the European CHIL and AMI projects (Computers in the Human Interaction Loop (CHIL) website (2006), Augmented Multiparty Interaction (AMI) website (2006)), the focus of research shifted from broadcast news to meetings around 2004. Despite the current prominence of this domain, many smaller projects had already studied and recorded meetings in the 1990s.

Speaker diarization is equivalent to speaker segmentation and clustering of an audio document, as these two techniques are normally used together in diarization. On the one hand, speaker segmentation (also called speaker change detection) aims at finding the changes of speaker in an audio recording. It differs from acoustic change detection in that it does not treat changes in the background sounds within a single speaker segment as change points. On the other hand, speaker clustering agglomerates audio segments into homogeneous groups, each coming from the same (or a similar) source. In its general definition it is not constrained to a single file, as all it requires is that each segment contain only a single speaker. When used in conjunction with speaker segmentation for speaker diarization, it clusters the segments created by the segmentation of a single recording.
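
As an illustration of how the two steps fit together, the following Python sketch combines the change points produced by a segmentation step with the cluster labels produced by a clustering step into a ``who spoke when'' output. The names Segment and build_diarization, the speaker labels and the example times are all hypothetical and chosen only for illustration; they do not describe the system developed in this thesis.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # anonymous cluster ID, not a real identity


def build_diarization(
    change_points: Sequence[float],
    cluster_ids: Sequence[str],
    total_duration: float,
) -> List[Segment]:
    """Turn speaker change points plus one cluster ID per segment into a
    'who spoke when' list, merging adjacent segments of the same speaker."""
    bounds = [0.0] + sorted(change_points) + [total_duration]
    if len(cluster_ids) != len(bounds) - 1:
        raise ValueError("need exactly one cluster ID per segment")
    diary: List[Segment] = []
    for start, end, speaker in zip(bounds[:-1], bounds[1:], cluster_ids):
        if diary and diary[-1].speaker == speaker:
            diary[-1].end = end          # same speaker continues: extend the turn
        else:
            diary.append(Segment(start, end, speaker))
    return diary


# Hypothetical output of a segmentation step (change points, in seconds)
# and of a clustering step (one label per resulting segment).
diary = build_diarization([4.2, 9.0, 12.5], ["spk0", "spk1", "spk1", "spk0"], 20.0)
for seg in diary:
    print(f"{seg.start:6.2f} {seg.end:6.2f} {seg.speaker}")
```

Note that the labels are only cluster IDs: mapping ``spk0'' to an actual person is a separate speaker identification problem.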

Finally, also related to speaker diarization are speaker tracking techniques, where the identity of one or more speakers is known a priori and the aim is to locate their interventions within the audio document.

From a general point of view, speaker diarization algorithms are a very useful part of many speech technology systems, for example:

Speaker indexing and rich transcription: By indexing the audio according to the speakers and adding extra information to speech transcripts it becomes easier for humans to locate information and for machines to process it. Typical automatic uses of such output might be speech summarization and translation.
Speaker segmentation and clustering helping Automatic Speech Recognition (ASR) systems: Segmentation algorithms are used to split the audio into small segments (maintaining all acoustic units intact) for the ASR systems to process. Also, speaker diarization algorithms are used to cluster all the input data into speakers for speaker dependent model adaptation. Sometimes the clustering is performed into broader speaker clusters (fewer than the actual number of speakers) to maximize the amount of adaptation data.
Preprocessing modules for speaker-based algorithms: Speaker diarization can be used before speaker tracking, speaker identification, speaker verification and other single speaker-based algorithms, to split the data into individual speakers.

This thesis addresses speaker diarization in the meetings environment. Following the guidelines proposed by NIST in the Rich Transcription (RT) evaluations, the system processes the data without any prior information on the number of speakers present in the meeting or their identities. Furthermore, the system is intended for use without any assumption about the meeting room layout; such rooms usually contain multiple microphones recording synchronously. These microphones are of different kinds, and their exact locations are assumed to be unknown to the system.

This thesis is presented in partial fulfillment of the requirements for the PhD in Theory of Signal and Communications at the UPC, where I have taken the necessary doctorate courses and previously prepared the thesis proposal. The proposed system was implemented at the International Computer Science Institute (ICSI) during a two-year research stay, with funding from the AMI project in the first year and the Spanish visitors program in the second. The implementation of speaker diarization for meetings builds on the knowledge of speaker diarization for the broadcast news environment that was available at ICSI at the start of this project. It is based on a modified version of the Bayesian Information Criterion (BIC) which does not require the tuning of any penalty term. It performs an agglomerative clustering of the initial acoustic data after the data has been filtered with a speech/non-speech detector to eliminate all non-speech information. The clustering finishes when no cluster pair is available for merging.

Improvements are proposed for the system in three main areas. First, to extend the applicability of the system to multiple-microphone recordings, it implements filter&sum beamforming and adds several algorithms to improve the output signal when the microphones are very dissimilar. The beamforming algorithm was also adopted by the ASR system at ICSI in the meetings Rich Transcription evaluations, with great results directly attributed to this module. The second area is speech/non-speech detection, where a new train-free system was implemented to allow accurate filtering of the silence segments in a meeting. Third, within the inherited broadcast news system, several algorithms are either added or improved to increase the robustness of the system and to let it extract as much information as possible from each recording, allowing for fast adaptation to new domains. These include algorithms for automatically selecting the initial number of clusters and the model complexity, two purification algorithms that allow better comparisons between clusters, and a more robust training procedure. In addition, the time delay of arrival between microphones, estimated in the beamforming module, is successfully used as an additional source of information for the diarization.
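
To make the role of the time delay of arrival more concrete, the following Python sketch shows a much simplified relative of filter&sum beamforming: plain delay-and-sum with equal channel weights, where each delay is taken as the peak of an unnormalized cross-correlation against a reference channel. These choices, together with the function names and the toy signals, are assumptions made for illustration only; the delay estimation and channel weighting actually used in the system are not described in this section.

```python
from typing import List, Sequence, Tuple


def estimate_delay(ref: Sequence[float], sig: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) by which `sig` trails `ref`, taken as
    the peak of the plain cross-correlation within +/- max_lag."""
    def xcorr(lag: int) -> float:
        return sum(
            ref[n] * sig[n + lag]
            for n in range(len(ref))
            if 0 <= n + lag < len(sig)
        )
    return max(range(-max_lag, max_lag + 1), key=xcorr)


def delay_and_sum(
    channels: Sequence[Sequence[float]], max_lag: int, ref_index: int = 0
) -> Tuple[List[float], List[int]]:
    """Align every channel to the reference channel and average them.
    Returns the enhanced signal and the per-channel delays."""
    ref = channels[ref_index]
    delays = [
        0 if i == ref_index else estimate_delay(ref, ch, max_lag)
        for i, ch in enumerate(channels)
    ]
    output = []
    for n in range(len(ref)):
        # Average only the channels that have a sample at the aligned position.
        samples = [ch[n + d] for ch, d in zip(channels, delays) if 0 <= n + d < len(ch)]
        output.append(sum(samples) / len(samples))
    return output, delays


# Toy example: the same pulse reaches three microphones at different times.
pulse = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
mic0 = pulse
mic1 = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]   # arrives 2 samples later
mic2 = [0, 1, 3, 1, 0, 0, 0, 0, 0, 0]   # arrives 1 sample earlier
enhanced, delays = delay_and_sum([mic0, mic1, mic2], max_lag=3)
print(delays)
```

The delays are a by-product of the alignment. Because they depend on where the active speaker sits relative to the microphones, they carry speaker-related information, which is why they can complement the acoustic features in the diarization.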
