BIC Formulation for Gaussian Mixture Models

The purpose of this appendix is to show the equivalence between two different representations of the Bayesian Information Criterion (BIC), one based on the likelihood of the data given the models, which allows the models to be arbitrary and as complex as necessary given the task at hand, and another representation only dependent on the sufficient statistics of the data, which considers the case of a single Gaussian modeling the data. These two representations are used alternatively in the bibliography with various modifications, which sometimes cause the results not to be comparable between each other.

Given an acoustic segment $X_{i}$ with $N_{i}$ acoustic frames, modeled by $M_{i}$ which is an arbitrary model with a certain number of free parameters to estimate from the data, given by $\char93 (\mathcal{M}_{i})$ , which accounts for the complexity of such model. The general BIC expression of such model using the likelihood of the data is given by

$\displaystyle BIC(\mathcal{M}_{i}) = \log \mathcal{L}(\mathcal{X}_{i},\mathcal{M}_{i}) - \lambda \frac{1}{2} \char93 (\mathcal{M}_{i}) log(N_{i})$

(A.1)

Being $\log \mathcal{L}(\mathcal{X}_{i},\mathcal{M}_{i})$ the log-likelihood of the data given the considered model. The parameter $\lambda$ is a design parameter which is not part of the original BIC formulation but which is used to change the effect of the penalty term in the formula. Such formula allows the model $\mathcal{M}_{i}$ to be of any kind.

If instead it is considered that the model is created by a single Gaussian, eq. A.1 can be rewritten as

$\displaystyle BIC(\mathcal{M}_{i}) = -\frac{1}{2} N_{i} log(\vert S_{i}\vert) -... ...2} d(1 + log(2\pi)) - \lambda \frac{1}{2} \char93 (\mathcal{M}_{i}) log(N_{i})$

(A.2)

where is the covariance matrix representing the data and is its dimension. Such formulation only depends on the sufficient statistics of the data, and therefore its computation is very fast.

Let us progress from equation 2.1 into obtaining equation A.2. Considering that the used model is a single Gaussian with full covariance, one can rewrite eq. 2.1 as

$\displaystyle BIC(\mathcal{M}_{i}) = \sum_{n=1}^{N_{i}} \log p(x_{i}[n]\vert\m... ...\vert^{1/2}} \exp^{-\frac{1}{2}(x_{i}[n]-\bar{x})' S^{-1} (x_{i}[n]-\bar{x})}]$

(A.3)

by doing the $N_{i}$ products one obtains a sum of terms in the exponential, where each terms is a scalar value. One can use mathematical properties of the trace in order to obtain a closer form for it. As the trace(scalar) = scalar, it does not change the result.

Let us then consider only the trace of the numerator in the exponent

$\displaystyle Tr[(x_{i}[1]-\bar{x})' S^{-1} (x_{i}[1]-\bar{x}) + \dots + (x_{i}[N_{i}]-\bar{x})' S^{-1} (x_{i}[N_{i}]-\bar{x})]$

Applying the property that $Tr[\sum] = \sum Tr[]$

$\displaystyle \dots = Tr[(x_{i}[1]-\bar{x})' S^{-1} (x_{i}[1]-\bar{x})] + \dots + tr[(x_{i}[N_{i}]-\bar{x})' S^{-1} (x_{i}[N_{i}]-\bar{x})]$
For each trace element applying the circularity of the trace property

$\displaystyle \dots = Tr[(x_{i}[1]-\bar{x})' (x_{i}[1]-\bar{x}) S^{-1}] + \dots + tr[(x_{i}[N_{i}]-\bar{x})' (x_{i}[N_{i}]-\bar{x})S^{-1}]$
Applying the property of matrix algebra AB+CB = (A+C)B one can isolate the inverse covariance matrix. At this point, given the definition of covariance matrix one can identify it in the equation and therefore we obtain

$\displaystyle \dots = Tr[N_{i} S S^{-1}] = N_{i} Tr[\mathbb{I}] = Nd$

Given this result and going back to the BIC formulation in eq. A.3 and using the log properties

$\displaystyle BIC(\mathcal{M}_{i}) = \log [\frac{1}{(2\pi)^{dN_{i}/2}\vert S\vert^{N/2}} e^{-N_{i}d/2}]$

$\displaystyle = -\log(2\pi)^{dN_{i}/2} - \log(\vert S\vert)^{N_{i}/2} - \frac{N_{i}d}{2} - \lambda \frac{1}{2} \char93 (\mathcal{M}_{i}) log(N_{i})$

Obtaining finally

$\displaystyle BIC(\mathcal{M}_{i}) = -\frac{1}{2} N_{i} log(\vert S_{i}\vert) -... ...2} d(1 + log(2\pi)) - \lambda \frac{1}{2} \char93 (\mathcal{M}_{i}) log(N_{i})$

(A.4)

Which is in fact equation A.2. Note that a factor $\frac{1}{2}$ applies to each term in the expression. Such factor is sometimes omitted, causing the optimum $\lambda$ factor to differ in the different implementations.

user 2008-12-08