Frame-Based Cluster Purification Metrics

To detect and filter out non-speech frames by exploiting the likelihood property observed for non-speech data, two variants of a likelihood-based metric are proposed.

$\displaystyle \bar{\mathcal{L}}(x[i] \vert \Theta_{A}) = \frac{1}{Q} \sum_{j=-\dots}^{\dots-1} \sum_{m=1}^{\widetilde{M}} \log\Big(W_{A}[m] \mathcal{N}_{A,m}(x[i+j])\Big)$ (4.15)

Both metrics are based on Equation 4.15, where $ Q$ is the length of the averaging window, used to smooth the measure around each frame and so avoid noisy values; $ \widetilde{M}$ is the number of Gaussian mixtures used to compute the likelihood (where $ \widetilde{M} \leq M$, the number of mixtures in the model); $ W_{A}[m]$ is the weight of mixture $ m$; and $ \mathcal{N}_{A,m}(x[\cdot])$ is the result of evaluating $ x[\cdot]$ on Gaussian mixture $ m$ of model $ \Theta_{A}$. The two variants are:
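As an illustration of Equation 4.15, the following is a minimal Python sketch of the smoothed measure. It is not the original implementation. It assumes diagonal-covariance Gaussians, and because the exact summation limits are not legible in the equation as reproduced here, it assumes a window of $ Q$ frames starting $ Q/2$ (integer division) frames before frame $ i$, shortened at the edges of the signal. All function and parameter names are hypothetical.

```python
import math


def log_weighted_gaussian(x, weight, mean, var):
    """Return log(W[m] * N_m(x)) for one diagonal-covariance Gaussian.

    ASSUMPTION: diagonal covariances; the source does not state the
    covariance type.
    """
    log_density = 0.0
    for xd, mu, v in zip(x, mean, var):
        log_density += -0.5 * (math.log(2.0 * math.pi * v) + (xd - mu) ** 2 / v)
    return math.log(weight) + log_density


def frame_score(x, weights, means, variances, selected):
    """Inner sum of Eq. 4.15: sum over the selected mixtures m."""
    return sum(
        log_weighted_gaussian(x, weights[m], means[m], variances[m])
        for m in selected
    )


def smoothed_likelihood(frames, weights, means, variances, selected, Q):
    """Average the per-frame score over a window of Q frames around each frame.

    ASSUMPTION: the window covers frames i - Q//2 .. i - Q//2 + Q - 1 and is
    clipped at the signal edges, where the average is taken over the frames
    actually available instead of over Q.
    """
    scores = [frame_score(x, weights, means, variances, selected) for x in frames]
    half = Q // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i - half + Q)
        window = scores[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy usage: 1-D frames scored against a single unit-variance Gaussian.
toy_frames = [[0.0], [2.0], [0.0], [2.0]]
toy_scores = smoothed_likelihood(toy_frames, [1.0], [[0.0]], [[1.0]], [0], Q=2)
```

Passing every mixture index in `selected` uses the full model, while passing a subset restricts the inner sum to $ \widetilde{M}$ mixtures.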

Metric 1
A standard smoothed likelihood over 100ms of data ($ Q=5$ with 10ms acoustic frames) around each acoustic frame, with $ \widetilde{M} = M$ (all mixtures in model $ \Theta_{A}$).

Metric 2
The same smoothed likelihood (over 100ms), computed with a model formed by a subset of the Gaussian mixtures in the speaker model that includes the mixtures assigned to non-speech. The mixtures are selected by summing each mixture's variance over all dimensions and keeping those with the smallest accumulated variance, so that $ \widetilde{M}=M_{non-speech}$. This second metric is equivalent to Metric 1 when 100% of the Gaussian mixtures are selected.
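The mixture selection used by Metric 2 can be sketched as follows. This is a hedged illustration, not the original code: it assumes each mixture has a diagonal variance vector, and the `fraction` parameter (the share of mixtures to keep) is a hypothetical name introduced here, since the text does not specify how the number of selected mixtures is set.

```python
def select_low_variance_mixtures(variances, fraction):
    """Return the indices of the mixtures with the smallest summed variance.

    variances: one variance vector per mixture (ASSUMPTION: diagonal
               covariances, so each mixture has one variance per dimension).
    fraction:  hypothetical parameter, share of mixtures to keep, in (0, 1].
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Accumulate the variance of each mixture over all dimensions.
    totals = [sum(v) for v in variances]
    # Order mixture indices from smallest to largest accumulated variance.
    order = sorted(range(len(totals)), key=lambda m: totals[m])
    keep = max(1, round(fraction * len(order)))
    return sorted(order[:keep])


# Toy model with four 2-D mixtures.
toy_variances = [[1.0, 2.0], [0.1, 0.2], [5.0, 5.0], [0.5, 0.4]]
subset = select_low_variance_mixtures(toy_variances, 0.5)
full_model = select_low_variance_mixtures(toy_variances, 1.0)
```

With `fraction=1.0` every mixture is kept, which corresponds to the case where Metric 2 reduces to Metric 1.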

user 2008-12-08