# Expectation-Maximization Algorithm for Bernoulli Mixture Models (Tutorial)

Even though the title is quite a mouthful, this post is about two really cool ideas:

1. A solution to the "chicken-and-egg" problem (known as the Expectation-Maximization method, described by A. Dempster, N. Laird and D. Rubin in 1977), and
2. An application of this solution to automatic image clustering by similarity, using Bernoulli Mixture Models.

For the curious, an implementation of the automatic image clustering is shown in the video below. The source code (C#, Windows x86/x64) is also available for download!

Automatic clustering of handwritten digits from MNIST database using Expectation-Maximization algorithm

While automatic image clustering nicely illustrates the E-M algorithm, E-M has been successfully applied in a number of other areas: I have seen it being used for word alignment in automated machine translation, valuation of derivatives in financial models, and gene expression clustering/motif finding in bioinformatics.

As a side note, the notation used in this tutorial closely matches the one used in Christopher M. Bishop's "Pattern Recognition and Machine Learning". This should hopefully encourage you to check out his great book for a broader understanding of E-M, mixture models or machine learning in general.

Alright, let's dive in!

#### 1. Expectation-Maximization Algorithm

Imagine the following situation. You observe some data set $\mathbf{X}$ (e.g. a bunch of images). You hypothesize that these images are of $K$ different objects... but you don't know which images represent which objects.

Let $\mathbf{Z}$ be a set of latent (hidden) variables, which tell precisely that: which images represent which objects.

Clearly, if you knew $\mathbf{Z}$, you could group images into the clusters (where each cluster represents an object), and vice versa, if you knew the groupings you could deduce $\mathbf{Z}$. A classical "chicken-and-egg" problem, and a perfect target for an Expectation-Maximization algorithm.

Here's a general idea of how E-M algorithm tackles it. First of all, all images are assigned to clusters arbitrarily. Then we use this assignment to modify the parameters of the clusters (e.g. we change what object is represented by that cluster) to maximize the clusters' ability to explain the data; after which we re-assign all images to the expected most-likely clusters. Wash, rinse, repeat, until the assignment explains the data well-enough (i.e. images from the same clusters are similar enough).

(Notice the words in bold in the previous paragraph: this is where the expectation and maximization stages in the E-M algorithm come from.)

To formalize (and generalize) this a bit further, say that you have a set of model parameters $\mathbf{\theta}$ (in the example above, some sort of cluster descriptions).

To solve the problem of cluster assignments we effectively need to find model parameters $\mathbf{\theta'}$ that maximize the likelihood of the observed data $\mathbf{X}$, or, equivalently, the model parameters that maximize the log likelihod

Using some simple algebra we can show that for any latent variable distribution $q(\mathbf{Z})$, the log likelihood of the data can be decomposed as
\begin{align}
\ln \,\text{Pr}(\mathbf{X} | \theta) = \mathcal{L}(q, \theta) + \text{KL}(q || p), \label{eq:logLikelihoodDecomp}
\end{align}
where $\text{KL}(q || p)$ is the Kullback-Leibler divergence between $q(\mathbf{Z})$ and the posterior distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$, and
\begin{align}
\mathcal{L}(q, \theta) := \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right)
\end{align}
with $\mathcal{L}(\theta) := \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\theta})$ being the "complete-data" log likelihood (i.e. log likelihood of both observed and latent data).

To understand what the E-M algorithm does in the expectation (E) step, observe that $\text{KL}(q || p) \geq 0$ for any $q(\mathbf{Z})$ and hence $\mathcal{L}(q, \theta)$ is a lower bound on $\ln \,\text{Pr}(\mathbf{X} | \theta)$.

Then, in the E step, the gap between the $\mathcal{L}(q, \theta)$ and $\ln \,\text{Pr}(\mathbf{X} | \theta)$ is minimized by minimizing the Kullback-Leibler divergence $\text{KL}(q || p)$ with respect to $q(\mathbf{Z})$ (while keeping the parameters $\theta$ fixed).

Since $\text{KL}(q || p)$ is minimized at $\text{KL}(q || p) = 0$ when $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$, at the E step $q(\mathbf{Z})$ is set to the conditional distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$.

To maximize the model parameters in the M step, the lower bound $\mathcal{L}(q, \theta)$ is maximized with respect to the parameters $\theta$ (while keeping $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ fixed; notice that $\theta$ in this equation corresponds to the old set of parameters, hence to avoid confusion let $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})$).

The function $\mathcal{L}(q, \theta)$ that is being maximized w.r.t. $\theta$ at the M step can be re-written as
\begin{align*}
\theta^\text{new} &= \underset{\mathbf{\theta}}{\text{arg max }} \left. \mathcal{L}(q, \theta) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \left. \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \left( \mathcal{L}(\theta) - \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \right) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - (C \in \mathbb{R}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right],
\end{align*}

i.e. in the M step the expectation of the joint log likelihood of the complete data is maximized with respect to the parameters $\theta$.

So, just to summarize,

• Expectation step: $q^{t + 1}(\mathbf{Z}) \leftarrow \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta}^t)$
• Maximization step: $\mathbf{\theta}^{t + 1} \leftarrow \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{t}} \left[ \mathcal{L}(\theta) \right]$ (where superscript $\mathbf{\theta}^t$ indicates the value of parameter $\mathbf{\theta}$ at time $t$).

Phew. Let's go to the image clustering example, and see how all of this actually works.

#### 2. Bernoulli Mixture Models for Image Clustering

First of all, let's represent the image clustering problem in a more formal way.

##### 2.1. Formal description

Say that we are given $N$ same-sized training images $\mathbf{x_n} = (x_{n,1}, ..., x_{n,D})^T$ for $n \in \{1, ..., N \}$, each image containing $D$ binary pixels (i.e. $x_{n,i} \in \{ 0, 1 \}$).

Assuming that the pixels are conditionally independent from each other (i.e. that $x_{n, i}$ is conditionally independent from $x_{n, j \neq i}$ for each $i, j \in \{ 1, ..., D \}$), the probability distribution of the pixel $i$ over all images belonging to a component $k$ can be modelled using Bernoulli distribution with a parameter $0 \leq \mu_{k, i} \leq 1$.

To incorporate some prior knowledge about the image assignment to $K$ clusters (e.g. the proportions of images in each cluster), the assignments can be treated as being sampled from the multivariate distribution with the parameters $\pi_1, ..., \pi_K$ (where $0 \leq \pi_i \leq 1$, $\sum_{i = 1}^K \pi_i = 1$). Each $\pi_i$ for $i \in \{1, ..., K\}$ is called a mixing coefficient of cluster $i$.

Let say that the model parameters include the pixel distributions of each cluster and the prior knowledge about the image assignments, i.e. $\theta = (\mathbf{\mu}, \mathbf{\pi})$, where $\mathbf{\mu} := (\mathbf{\mu_1} \; \mathbf{\mu_2} \;... \;\mathbf{\mu_K} ) = \left( \begin{array}{cccc} \mu_{1, 1} & \mu_{2, 1} & ... & \mu_{K, 1} \\ \mu_{1, 2} & \mu_{2, 2} & ... & \mu_{K, 2} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{1, D} & \mu_{2, D} & ... & \mu_{K, D} \\ \end{array} \right)$ and $\mathbf{\pi} := ( \pi_1, ..., \pi_K )^T$.

Then, the likelihood of a single training image $\mathbf{x}$ is
\begin{align}
\,\text{Pr}(\mathbf{x} | \theta) = \,\text{Pr}(\mathbf{x} | \mathbf{\mu}, \mathbf{\pi}) = \sum_{k = 1}^K \pi_k \,\text{Pr}(\mathbf{x}|\mathbf{\mu_k})
\end{align}
where the probability that $\mathbf{x}$ is generated by cluster $k$ can be written as
\begin{align}
\,\text{Pr}(\mathbf{x}|\mathbf{\mu_k}) = \prod_{i = 1}^D \mu_{k, i}^{x_i} (1 - \mu_{k, i})^{1 - x_i}.
\end{align}

To model the assignment of images to clusters, associate a latent $K$-dimensional binary random variable $\mathbf{z_i}$ with each of the training examples $\mathbf{x_i}$. Say that $\mathbf{z_i}$ has a 1-of-$K$ representation, i.e. for $\mathbf{z_i} := (z_{i, 1}, ..., z_{i, K})^T$ it must be the case that $z_{i, j} \in \{0, 1\}$ for $i \in \{ 1, ..., N \}, j \in \{ 1, ..., K \}$ and $\sum_{j = 1}^{K} z_{i, j} = 1$.

Furthermore, let the marginal distribution over $\mathbf{z_i}$ be specified in terms of mixing coefficients $\mathbf{\pi}$ s.t. $\,\text{Pr}(z_{i, j} = 1) = \pi_j$, then
\begin{align}
\,\text{Pr}(\mathbf{z_n} | \mathbf{\pi}) = \prod_{i = 1}^K \pi_i^{z_{n, i}}.
\end{align}

Similarly, let $\,\text{Pr}(\mathbf{x_n} | z_{n, k} = 1) = \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})$, then
\begin{align}
\,\text{Pr}(\mathbf{x_n} | \mathbf{z_n}, \mathbf{\mu}, \mathbf{\pi}) = \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}}.
\end{align}

By combining all latent variables $\mathbf{z_i}$ into a set $\mathbf{Z} := \{ \mathbf{z_1}, ..., \mathbf{z_N} \}$, we can write
\begin{equation} \label{eq:probZ}
\begin{split}
\,\text{Pr}(\mathbf{Z}|\mathbf{\pi}) &= \prod_{n = 1}^N \,\text{Pr}(\mathbf{z_n}|\mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \pi_k^{z_{n, k}},
\end{split}
\end{equation}
and, similarly, combining all training images $\mathbf{x_i}$ into a set $\mathbf{X} := \{ \mathbf{x_1}, ..., \mathbf{x_N} \}$, we can express the marginal training data distribution as
\begin{equation} \label{eq:probXgivZ}
\begin{split}
\,\text{Pr}(\mathbf{X}|\mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) &= \prod_{n = 1}^N \,\text{Pr}(\mathbf{x_n}|\mathbf{z_n},\mathbf{\mu},\mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}} \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \left( \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}}.
\end{split}
\end{equation}

From the last two equations and the probability chain rule, the complete data likelihood can be written as:
\begin{equation} \label{eq:probXandZ}
\begin{split}
\,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) &= \,\text{Pr}(\mathbf{X} | \mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) \,\text{Pr}(\mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \left( \pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}},
\end{split}
\end{equation}

and thus the complete data log likelihood $\mathcal{L}(\theta)$ can be obtained by taking a log of the equation above:
\begin{equation}
\begin{split}
\mathcal{L}(\theta) &= \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) \\
&= \sum_{n = 1}^N \sum_{k = 1}^K z_{n, k} \left( \ln \pi_k + \sum_{i = 1}^D x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right).
\end{split}
\end{equation}

(Still following? Great. Take five, and below we will derive the E and M step update equations.)

##### 2.2. E-M update equations for BMMs

In order to update the latent variable distribution (i.e. image assignment to clusters) at the expectation step, we need to set the probability distribution of $\textbf{Z}$ to $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta})$.

However, we cannot calculate this distribution exactly, hence we will have to approximate this assignment. A simple way of doing it is to replace the current values of $z_{n, k}$ with the expected ones:

\begin{equation} \label{eq:z}
\begin{split}
z_{n, k}^\text{new} \leftarrow \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}, \mathbf{\pi}}[z_{n, k}] &= \sum_{z_{n, k}} \,\text{Pr}(z_{n,k} | \mathbf{x_n}, \mathbf{\mu}, \mathbf{\pi}) \, z_{n,k}\\
&= \frac{\pi_k \,\text{Pr}(\mathbf{x_n} |\mathbf{\mu_k})}{\sum_{m = 1}^K \pi_m \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_m})} \\
&= \frac{\pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} }{\sum_{m = 1}^K \pi_m \prod_{i = 1}^D \mu_{m, i}^{x_{n, i}} (1 - \mu_{m, i})^{1 - x_{n, i}}}.
\end{split}
\end{equation}

(Notice that after this update $\mathbf{z_{n}}^\text{new}$ is no longer represented as 1-of-$K$ vector, i.e. the same image can be "partially" assigned to multiple clusters.)

In the maximization step we need to maximize the model parameters (i.e. the mixing coefficients and the pixel distributions) using the update equation from earlier

Observe that
\begin{align}
\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \sum_{k = 1}^K \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, k} \right] \left( \ln \pi_k + \sum_{i = 1}^D x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right).
\end{align}
The equation above can be maximized w.r.t. $\mathbf{\mu_k}$ by simply setting its derivative to zero:
\begin{align}
\frac{\partial}{\partial \mu_{m, j}} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, m} \right] \left( \frac{x_{n, j}}{\mu_{m, j}} - \frac{1 - x_{n, j}}{1 - \mu_{m, j}} \right) \\
&= \sum_{n = 1}^N z_{n, m}^\text{new} \frac{x_{n, j} - \mu_{m, j}}{\mu_{m, j} (1 - \mu_{m, j})} = 0 \Leftrightarrow \\
\mu_{m, j} &= \frac{1}{N_m} \sum_{n = 1}^N x_{n, j} z_{n, m}^\text{new},
\end{align}
where $N_m = \sum_{n = 1}^N z_{n, m}^\text{new}$ is the effective number of images assigned to cluster $m$.

Then the full cluster $m$ pixel distribution vector $\mathbf{\mu_m}$ can be written as

where $\mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m}^\text{new} \mathbf{x_n}$ is the weighted mean of the images associated with cluster $m$.

To maximize $\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right]$ w.r.t. the mixing coefficients $\mathbf{\pi}$ (subject to the constraint $\sum_{k = 1}^K \pi_k = 1$) we can use the Lagrange multipliers, yielding the following optimization problem:
\begin{equation*}
\Lambda(\theta, \lambda) := \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] + \lambda \left( \sum_{k = 1}^K \pi_k - 1 \right).
\end{equation*}
The optimizing solution can then be found again with simple partial derivatives:
\begin{align}
\frac{\partial}{\partial \pi_{m}} \Lambda(\theta, \lambda) &= \frac{1}{\pi_m} \sum_{n = 1}^N z_{n,m}^\text{new} + \lambda = 0 \Leftrightarrow \\
\pi_m &= -\frac{N_m}{\lambda},
\end{align}
\begin{align}
\frac{\partial}{\partial \lambda} \Lambda(\theta, \lambda) &= \sum_{k = 1}^K \pi_k - 1 = 0 \Leftrightarrow \\
\sum_{k = 1}^K \pi_k &= 1.
\end{align}
By combining these two results $\lambda = - \sum_{k = 1}^K N_k = - N$, and thus

Done!

##### 2.3. Summary

In summary, the update equations for Bernoulli Mixture Models using E-M are:

• Expectation step:

• Maximization step:

where $\mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m} \mathbf{x_n}$ and $N_m = \sum_{n = 1}^N z_{n, m}$.

#### 3. References

[Dempster et al, 1977] A. P. Dempster, N. M. Laird, D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1–38.

[Bishop, 2006] C. M. Bishop. "Pattern Recognition and Machine Learning". Springer, 2006. ISBN 9780387310732.