Speech/non-Speech Detection and Parameters Extraction

The use of a speech/non-speech detector in speaker diarization is important to ensure that the acoustic model for each cluster correctly represents the speech data and is not ``contaminated'' by non-speech information. In the ICSI system each cluster is initially modeled using a small number of Gaussian mixture components (usually 5), given that the models are trained using maximum likelihood (ML) over a small amount of data. As a consequence, including non-speech data in the training makes the clusters resemble each other much more and makes the system prone to clustering errors, in addition to non-speech errors.

The speech/non-speech detector for the BN system is a two-class detector in which each class is modeled by a three-state HMM, with a minimum duration of 30 msec. The non-speech model includes both music and silence. The features used in the speech/non-speech detector (MFCC12) are different from the features used for clustering. This detector was initially used for ICSI-SRI’s BN STT system on RT04f. It was trained on 80 hours of 1996 HUB4 BN acoustic data. No tuning was done to adapt the detector to speaker diarization for the RT04f evaluation.

In order to illustrate the advantages of using a speech/non-speech (spnsp) detector (also sometimes referred to as Speech Activity Detection, SAD), table 3.1 (taken from Wooters et al. (2004)) shows diarization error rates on the RT04f data set using different kinds of spnsp detectors. The Diarization Error Rate (DER) is the percentage of time that the system misattributes speaker/non-speech segments. It can be broken down into speaker errors, which account for misattributed speaker segments, and false alarm (FA) and missed speech (MISS) errors, which account for non-speech labelled as speech and speech labelled as non-speech, respectively. For an exhaustive definition of each one of these types of error refer to section 6.1.3.
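As a rough illustration of this breakdown, the following minimal sketch scores a hypothesis against a reference at the frame level. It rests on several assumptions that are not stated in the text: both sequences hold one label per frame (None for non-speech), hypothesis speaker labels have already been mapped onto reference speakers, there is no overlapped speech, and the error counts are normalized by the amount of reference speech. The function name and label format are hypothetical; the authoritative definitions are those of section 6.1.3.

```python
from typing import Dict, Optional, Sequence

Label = Optional[str]  # speaker name, or None for a non-speech frame


def frame_level_der(reference: Sequence[Label],
                    hypothesis: Sequence[Label]) -> Dict[str, float]:
    """Return %MISS, %FA, %SPKR and %DER for two aligned label sequences.

    Assumption: hypothesis labels are already mapped to reference speakers
    and errors are normalized by the number of reference speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    miss = fa = spkr = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is None:
            if hyp is not None:
                fa += 1          # non-speech labelled as speech
        else:
            speech += 1
            if hyp is None:
                miss += 1        # speech labelled as non-speech
            elif hyp != ref:
                spkr += 1        # speech attributed to the wrong speaker

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    scale = 100.0 / speech
    return {
        "MISS": miss * scale,
        "FA": fa * scale,
        "SPKR": spkr * scale,
        "DER": (miss + fa + spkr) * scale,
    }


if __name__ == "__main__":
    ref = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
    hyp = ["A", "A", None, "B", "A", None, "B", "B", "A", "B"]
    print(frame_level_der(ref, hyp))
```

In this toy example the three error types are counted separately and DER is simply their sum, which is the same decomposition used in the tables of this section.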

The first row shows the baseline system, composed of the RT03f system. It has an overall non-speech error of 5.1% and a speaker error of 17.8%. Adding the speech/non-speech detector proposed for broadcast news not only reduces the non-speech errors but also reduces the speaker error, due to the reduction in clustering errors noted above. Finally, it is interesting to see how much could be achieved in terms of DER if a perfect spnsp detector were built. Such a detector is obtained by extracting the speaker segments from the reference segmentation and running the diarization with those as spnsp input. It can be seen that the proposed spnsp detector is still about 1.2% absolute worse in DER than the perfect detector. The speaker error is lower with the proposed spnsp detector than with the ideal one. This could indicate that some non-speech data can still be beneficial for training discriminative speaker models. In this implementation the system obtained MISS errors of 0.2% and 0.1% in the perfect spnsp and baseline systems, respectively, which were later reduced to 0%.

Table 3.1: DER improvement by using a speech/non-speech detector

System used	%MISS	%FA	%SPKR	%DER
RT03f system	0.1	5.0	17.8	22.95
+SRI/spnsp	1.5	1.2	15.4	18.17
+ideal spnsp	0.2	0.0	16.8	16.98
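The figures in Table 3.1 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and because the component errors are reported to one decimal place, their sum is expected to match the reported DER only to within rounding (assumed here to be 0.1).

```python
# Rows of Table 3.1: (system, %MISS, %FA, %SPKR, %DER)
TABLE_3_1 = [
    ("RT03f system", 0.1, 5.0, 17.8, 22.95),
    ("+SRI/spnsp", 1.5, 1.2, 15.4, 18.17),
    ("+ideal spnsp", 0.2, 0.0, 16.8, 16.98),
]


def non_speech_error(miss: float, fa: float) -> float:
    """Non-speech error is the sum of missed speech and false alarms."""
    return miss + fa


def component_sum(miss: float, fa: float, spkr: float) -> float:
    """DER rebuilt from its three (rounded) components."""
    return miss + fa + spkr


def gap_to_ideal(table) -> float:
    """Absolute DER difference between the SRI detector and the ideal one."""
    der_by_system = {row[0]: row[4] for row in table}
    return der_by_system["+SRI/spnsp"] - der_by_system["+ideal spnsp"]


if __name__ == "__main__":
    for name, miss, fa, spkr, der in TABLE_3_1:
        print(name,
              "non-speech:", round(non_speech_error(miss, fa), 2),
              "sum:", round(component_sum(miss, fa, spkr), 2),
              "reported DER:", der)
    print("gap to ideal:", round(gap_to_ideal(TABLE_3_1), 2))
```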

With respect to the parameters used in the system, as in other speech processing areas, acoustic modeling for speaker diarization is based on acoustic features extracted from the input signal. For the broadcast news system at ICSI the features have been modified over the years, finally settling on MFCC features with 19 coefficients, without any deltas or double deltas and without the zeroth cepstral coefficient, which is linked to the energy of the signal. For broadcast news these features were computed over a 60 millisecond analysis window at 20 millisecond intervals. Multiple tests led to the selection of these features. On the one hand, the increase in computation involved in using the delta and double delta coefficients was considered unacceptable given that the system gave mixed results when using them. On the other hand, MFCC19 were chosen over PLP12, which were used in RT03f, due to slightly better performance when used together with the spnsp detector.
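To make the framing concrete, the following sketch computes the sample ranges covered by a 60 millisecond window advanced in 20 millisecond steps, and shows how the zeroth cepstral coefficient could be dropped from a cepstral vector. Several details are assumptions rather than facts from the text: the 16 kHz sampling rate, the choice to discard a trailing partial frame, the reading of MFCC19 as coefficients c1 to c19, and the function names. The cepstral analysis itself is not implemented here.

```python
from typing import List, Sequence, Tuple


def frame_bounds(num_samples: int,
                 sample_rate: int,
                 window_ms: float = 60.0,
                 step_ms: float = 20.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of each full analysis window."""
    window = int(round(sample_rate * window_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    if window <= 0 or step <= 0:
        raise ValueError("window and step must cover at least one sample")

    bounds = []
    start = 0
    # Assumption: a trailing window that does not fit entirely is discarded.
    while start + window <= num_samples:
        bounds.append((start, start + window))
        start += step
    return bounds


def drop_c0(cepstra: Sequence[float]) -> List[float]:
    """Remove the zeroth (energy-related) coefficient from one frame."""
    return list(cepstra[1:])


if __name__ == "__main__":
    # One second of audio at an assumed 16 kHz sampling rate.
    frames = frame_bounds(num_samples=16000, sample_rate=16000)
    print(len(frames), frames[0], frames[-1])

    # A placeholder 20-dimensional vector (c0..c19) reduced to 19 values.
    print(len(drop_c0([0.0] * 20)))
```

Because the window is three times longer than the step, consecutive frames overlap by 40 milliseconds under these settings.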

As can be seen in table 3.2, also from Wooters et al. (2004), the baseline system using PLP and no spnsp detector produces better overall results than its MFCC counterpart, but the MFCC system is better once the spnsp detector is added. In the diarization system for meetings, a way of combining delays with the acoustic features is proposed, which is also applicable to all other kinds of feature vectors.

Table 3.2: Comparison of PLP12 and MFCC19 parameterizations on RT04f

System used	%MISS	%FA	%SPKR	%DER
RT03f PLP	0.1	5.0	15.8	20.93
+SRI spnsp	1.6	1.2	15.5	18.36
RT03f MFCC	0.1	5.0	17.8	22.95
+SRI spnsp	1.5	1.2	15.4	18.17
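The effect of the spnsp detector on each parameterization can be summarized as an absolute and a relative DER reduction computed from Table 3.2. This is a minimal sketch; the dictionary layout and the function name are hypothetical, and the relative reduction is taken with respect to the corresponding baseline without a detector.

```python
from typing import Tuple

# %DER from Table 3.2: (baseline without spnsp, with the SRI spnsp detector)
TABLE_3_2_DER = {
    "PLP12": (20.93, 18.36),
    "MFCC19": (22.95, 18.17),
}


def der_reduction(baseline: float, with_detector: float) -> Tuple[float, float]:
    """Return (absolute reduction, relative reduction in percent)."""
    absolute = baseline - with_detector
    relative = 100.0 * absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    for features, (baseline, with_detector) in TABLE_3_2_DER.items():
        absolute, relative = der_reduction(baseline, with_detector)
        print(f"{features}: -{absolute:.2f} absolute, -{relative:.1f}% relative")
```

With these figures the detector gives the MFCC19 system a larger reduction (roughly 21% relative) than the PLP12 system (roughly 12% relative), which is why the ranking of the two parameterizations is reversed once the detector is used.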