The ICSI Broadcast News System

The broadcast news (BN) system currently used at ICSI, which has served as the base for the meetings system, was originally created by Jitendra Ajmera circa 2003. He built the system while he was a PhD student at EPFL (Lausanne, Switzerland) and IDIAP (Martigny, Switzerland), and implemented it at ICSI during a six-month visit. During Ajmera's stay, ICSI participated with this system in the NIST 2003 Rich Transcription spring evaluation of broadcast news (RT03s) and, soon afterwards, in the RT03f "who spoke the words" evaluation. The diarization system was then improved, and ICSI participated again in the RT04f evaluation (Wooters et al., 2004), also on broadcast news.

The system follows a bottom-up agglomerative clustering approach. It uses a modified version of the BIC distance (Ajmera et al., 2003) to iteratively merge the closest pair of clusters, and the same BIC distance determines when the system should stop merging. Speaker segmentation is not performed explicitly before clustering; instead, it is obtained at every iteration by Viterbi decoding of the data given the current speaker models. For a thorough description of the system, refer to Ajmera (2004).
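
The merge-until-stop logic described above can be illustrated with a minimal sketch. It is not the ICSI implementation: it swaps in the standard penalized BIC comparison with a single one-dimensional Gaussian per cluster, rather than the modified BIC with Gaussian mixture models, and it omits the Viterbi re-segmentation performed at every iteration. The names `gaussian_loglik`, `merge_score`, `agglomerative_cluster` and `penalty_weight` are hypothetical and chosen only for this illustration.

```python
import math


def gaussian_loglik(xs):
    """Log-likelihood of 1-D samples under their maximum-likelihood Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    var = max(sum_sq / n, 1e-6)  # variance floor to avoid log(0)
    return -0.5 * n * math.log(2.0 * math.pi * var) - sum_sq / (2.0 * var)


def merge_score(a, b, penalty_weight=1.0):
    """BIC of one shared Gaussian minus BIC of two separate Gaussians.

    Positive values favour merging. Assumption: standard BIC with
    2 parameters (mean, variance) per model, not the modified BIC.
    """
    n_params = 2
    n_total = len(a) + len(b)
    penalty = penalty_weight * 0.5 * n_params * math.log(n_total)
    return gaussian_loglik(a + b) - gaussian_loglik(a) - gaussian_loglik(b) + penalty


def agglomerative_cluster(segments, penalty_weight=1.0):
    """Greedily merge the best-scoring pair until no pair has a positive score."""
    clusters = [list(s) for s in segments]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = merge_score(clusters[i], clusters[j], penalty_weight)
                if best is None or score > best[0]:
                    best = (score, i, j)
        if best[0] <= 0:
            break  # the same criterion that picks merges also stops them
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data: two initial segments per "speaker", with well-separated means.
segments = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.8, -0.2, 0.1, 0.4, 0.9],
    [9.0, 9.5, 10.0, 10.5, 11.0],
    [9.2, 9.7, 10.1, 10.6, 10.9],
]
result = agglomerative_cluster(segments)
print(len(result))
```

In this sketch the number of clusters is not fixed in advance: merging stops as soon as the best available merge no longer improves the criterion, which mirrors the role the BIC distance plays as both merging and stopping criterion in the text.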

The philosophy behind the system, and behind all the research done towards implementing the meetings system, is based on these key concepts:

  1. Make the system as robust as possible to unseen data within the same domain, that is, data to which the system has not been adapted.

  2. Allow for fast adaptation of the system to new domains (e.g. broadcast news, meetings, telephone speech, and others).

These key concepts were put into practice by imposing the following guidelines:

The implementation of the broadcast news system used as a baseline for the meetings domain was submitted to the RT04f broadcast news evaluation (Wooters et al., 2004), which is the latest broadcast news evaluation conducted by NIST within the EARS (Effective Affordable Reusable Speech-to-Text) program. It differs from the original diarization system created by Ajmera (Ajmera and Wooters, 2003) in four main points. First, a speech/non-speech detector was included to filter out non-speech segments before any further processing of the data. Second, the speech/music classifier used in the RT03s evaluation was discontinued. Third, the parameterization was changed to MFCC, instead of the PLP features used until then. Finally, an iterative segmentation-training loop was included in the algorithm to allow the models to converge to the cluster data.

Figure 3.1 shows the main blocks constituting the system. The following sections describe each block in detail.

Figure 3.1: ICSI Speaker Diarization for Broadcast News block diagram


