## Clusters Comparison, Pruning and Clusters Merging

Given M clusters with their corresponding models, the matrix of distances between every cluster pair is created and the closest pair is merged if it is determined that both clusters contain data from the same speaker. In order to obtain a measure of similarity between two clusters modeled by a GMM a modified version of the Bayesian Information Criterion (BIC) is used, as introduced by Ajmera in Ajmera, McCowan and Bourlard (2004) and Ajmera and Wooters (2003). As explained in the state of the art chapter, the BIC value quantifies the appropriateness of a model given the data. It is a likelihood-based metric that introduces a penalty term (in the standard formulation) to penalize models by their complexity.

Given two clusters and composed of and acoustic frames respectively, they are modeled with two GMM models and . The data used to train each of the models is the union of the data belonging to each one of the segments labelled as belonging to cluster . In the same manner a third model is trained with .

In the standard BIC implementation to compare two clusters (Shaobing Chen and Gopalakrishnan, 1998) the penalty term adds a factor that is used to determine the effect of the penalty on the likelihood. The equation of the standard BIC for GMM models is   (3.1)

where is the number of free parameters to be estimated for each of the models, i.e. relative to the topology and complexity of the model.

It is considered that two clusters belong to the same speaker if they have a positive BIC value. Such value is affected by the penalty term (including the , which acts as a threshold determining which clusters to merge and which not to). The penalty term also modifies the order in which the cluster pairs are merged in an agglomerative clustering system, as each pair will have a different number of total frames and/or models complexities, which will cause the penalty term to be different. In systems based on BIC this penalty term ( ) is usually tuned based on development data and it always takes values greater than 0. In some cases two different values are defined, one for the merging criterion and the other one for the stopping criterion.

When training models to be used in BIC, these need to be trained using an ML approach, but there is no constraint in the kind of model to use. Ajmera's modification to the traditional BIC formula comes with the inclusion of a constraint to the combined model : (3.2)

This is an easy to follow rule as model is normally built exclusively for the comparison.

By Applying this rule to the BIC formula one avoids having to decide on a particular parameter and therefore the real threshold which is applied to consider if two clusters are from the same speaker becomes 0. The formula of the modified-BIC becomes equivalent to the GLR, but with the condition that 3.2 applies.

The lack of an extra tuning parameter makes the system more robust to changes in the data to be processed, although, as the BIC formula is just an approximation of the Bayesian Factor (BF) formulation, sometimes the robustness increase comes with a small detriment on performance.

In the broadcast news system that has been described, the model is generated directly from the two individual models and by pooling all the Gaussian mixtures together. Then the data belonging to both parent models is used to train the new model via ML. Training is always performed using an Expectation Maximization (EM-ML) algorithm and performing 5 iterations on the data.

Once the BIC metric between all possible cluster pairs has been computed, it searches for the biggest value and if the two clusters are merged into a single cluster. In this case, the merged cluster is created in the same way as is created, although it is not a requirement. The total complexity of the system remains intact through the merge as the number of Gaussian mixtures representing the data is the same, clustered in M-1 clusters.

The computation of BIC for all possible combinations of clusters is by far the most computationally intensive step in the agglomerative clustering system. Given that the models are retrained and the data is resegmented after each merge it obtains models at each iteration that are quite dissimilar to the models in the previous iteration, therefore it is recommended to compute all values again. Some techniques were tested to speedup this process by looking at the behavior of the BIC values. Some are:

• In every iteration merge more than one cluster pair, selecting them to be the pairs with highest BIC value and positive. This generates mixed results as modifies the way that the models are grouped.

• Compute the BIC value only for clusters which have changed considerably between iterations, maintaining the same value as previous iterations for those with almost the same segments assigned. This also resulted in mixed results, depending on the shows evaluated.

• Do not compute the BIC value for clusters pairs that obtain a negative value at any given iteration. This reduces the computation as only positive BIC values will be considered in subsequent iterations. It was implemented with success in the broadcast news system.

user 2008-12-08