tutorial – Manfred Zabarauskas' Blog

Expectation-Maximization Algorithm for Bernoulli Mixture Models (Tutorial)

Manfredas Zabarauskas — Tue, 12 Feb 2013 03:05:53 +0000

Even though the title is quite a mouthful, this post is about two really cool ideas:

A solution to the "chicken-and-egg" problem (known as the Expectation-Maximization method, described by A. Dempster, N. Laird and D. Rubin in 1977), and
An application of this solution to automatic image clustering by similarity, using Bernoulli Mixture Models.

For the curious, an implementation of the automatic image clustering is shown in the video below. The source code (C#, Windows x86/x64) is also available for download!

Automatic clustering of handwritten digits from MNIST database using Expectation-Maximization algorithm

While automatic image clustering nicely illustrates the E-M algorithm, E-M has been successfully applied in a number of other areas: I have seen it being used for word alignment in automated machine translation, valuation of derivatives in financial models, and gene expression clustering/motif finding in bioinformatics.

As a side note, the notation used in this tutorial closely matches the one used in Christopher M. Bishop's "Pattern Recognition and Machine Learning". This should hopefully encourage you to check out his great book for a broader understanding of E-M, mixture models or machine learning in general.

Alright, let's dive in!

1. Expectation-Maximization Algorithm

Imagine the following situation. You observe some data set $\mathbf{X}$ (e.g. a bunch of images). You hypothesize that these images are of $K$ different objects... but you don't know which images represent which objects.

Let $\mathbf{Z}$ be a set of latent (hidden) variables, which tell precisely that: which images represent which objects.

Clearly, if you knew $\mathbf{Z}$ , you could group images into the clusters (where each cluster represents an object), and vice versa, if you knew the groupings you could deduce $\mathbf{Z}$ . A classical "chicken-and-egg" problem, and a perfect target for an Expectation-Maximization algorithm.

Here's a general idea of how E-M algorithm tackles it. First of all, all images are assigned to clusters arbitrarily. Then we use this assignment to modify the parameters of the clusters (e.g. we change what object is represented by that cluster) to maximize the clusters' ability to explain the data; after which we re-assign all images to the expected most-likely clusters. Wash, rinse, repeat, until the assignment explains the data well-enough (i.e. images from the same clusters are similar enough).

(Notice the words "maximize" and "expected" in the previous paragraph: this is where the maximization and expectation stages in the E-M algorithm come from.)

To formalize (and generalize) this a bit further, say that you have a set of model parameters $\mathbf{\theta}$ (in the example above, some sort of cluster descriptions).

To solve the problem of cluster assignments we effectively need to find model parameters $\mathbf{\theta'}$ that maximize the likelihood of the observed data $\mathbf{X}$ , or, equivalently, the model parameters that maximize the log likelihood

$\mathbf{\theta'} = \underset{\mathbf{\theta}}{\text{arg max }} \ln \,\text{Pr} (\mathbf{X} | \mathbf{\theta}).$

Using some simple algebra we can show that for any latent variable distribution $q(\mathbf{Z})$ , the log likelihood of the data can be decomposed as
\begin{align}
\ln \,\text{Pr}(\mathbf{X} | \theta) = \mathcal{L}(q, \theta) + \text{KL}(q || p), \label{eq:logLikelihoodDecomp}
\end{align}
where $\text{KL}(q || p)$ is the Kullback-Leibler divergence between $q(\mathbf{Z})$ and the posterior distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ , and
\begin{align}
\mathcal{L}(q, \theta) := \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right)
\end{align}
with $\mathcal{L}(\theta) := \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\theta})$ being the "complete-data" log likelihood (i.e. log likelihood of both observed and latent data).

To understand what the E-M algorithm does in the expectation (E) step, observe that $\text{KL}(q || p) \geq 0$ for any $q(\mathbf{Z})$ and hence $\mathcal{L}(q, \theta)$ is a lower bound on $\ln \,\text{Pr}(\mathbf{X} | \theta)$ .

Then, in the E step, the gap between the $\mathcal{L}(q, \theta)$ and $\ln \,\text{Pr}(\mathbf{X} | \theta)$ is minimized by minimizing the Kullback-Leibler divergence $\text{KL}(q || p)$ with respect to $q(\mathbf{Z})$ (while keeping the parameters $\theta$ fixed).

Since $\text{KL}(q || p)$ is minimized at $\text{KL}(q || p) = 0$ when $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ , at the E step $q(\mathbf{Z})$ is set to the conditional distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ .
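To make the decomposition a bit more tangible, here is a minimal numeric sketch. It assumes a made-up toy model with a single observation and a binary latent variable; the joint probabilities in `joint` are invented purely for illustration. It checks that the lower bound plus the KL divergence equals the log likelihood for an arbitrary $q$, and that the gap vanishes when $q$ is set to the posterior.

```python
import math

# Hypothetical toy model: one observation x and a latent z in {0, 1}.
# joint[z] = Pr(x, z | theta) for the observed x (made-up numbers).
joint = [0.12, 0.28]

evidence = sum(joint)                      # Pr(x | theta)
posterior = [p / evidence for p in joint]  # Pr(z | x, theta)
log_evidence = math.log(evidence)          # ln Pr(x | theta)

def lower_bound(q):
    """L(q, theta) = sum_z q(z) * (ln Pr(x, z | theta) - ln q(z))."""
    return sum(qz * (math.log(pz) - math.log(qz))
               for qz, pz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(q || p) = sum_z q(z) * ln(q(z) / p(z))."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q_arbitrary = [0.5, 0.5]

# The gap between the log likelihood and the lower bound is the KL divergence.
gap_arbitrary = log_evidence - lower_bound(q_arbitrary)
gap_posterior = log_evidence - lower_bound(posterior)

print(gap_arbitrary, kl(q_arbitrary, posterior))  # the two values agree
print(gap_posterior)                              # zero up to rounding
```

Setting $q$ to the posterior is exactly what closes the gap, which is why the E step picks it.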

To maximize the model parameters in the M step, the lower bound $\mathcal{L}(q, \theta)$ is maximized with respect to the parameters $\theta$ (while keeping $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ fixed; notice that $\theta$ in this equation corresponds to the old set of parameters, hence to avoid confusion let $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})$ ).

The function $\mathcal{L}(q, \theta)$ that is being maximized w.r.t. $\theta$ at the M step can be re-written as
\begin{align*}
\theta^\text{new} &= \underset{\mathbf{\theta}}{\text{arg max }} \left. \mathcal{L}(q, \theta) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \left. \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \left( \mathcal{L}(\theta) - \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \right) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - (C \in \mathbb{R}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right],
\end{align*}

i.e. in the M step the expectation of the joint log likelihood of the complete data is maximized with respect to the parameters $\theta$ .

So, just to summarize,

Expectation step: $q^{t + 1}(\mathbf{Z}) \leftarrow \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta}^t)$
Maximization step: $\mathbf{\theta}^{t + 1} \leftarrow \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^t} \left[ \mathcal{L}(\theta) \right]$ (where the superscript in $\mathbf{\theta}^t$ indicates the value of the parameters $\mathbf{\theta}$ at iteration $t$ ).

Phew. Let's go to the image clustering example, and see how all of this actually works.

2. Bernoulli Mixture Models for Image Clustering

First of all, let's represent the image clustering problem in a more formal way.

2.1. Formal description

Say that we are given $N$ same-sized training images $\mathbf{x_n} = (x_{n,1}, ..., x_{n,D})^T$ for $n \in \{1, ..., N \}$ , each image containing $D$ binary pixels (i.e. $x_{n,i} \in \{ 0, 1 \}$ ).

Assuming that the pixels are conditionally independent from each other (i.e. that $x_{n, i}$ is conditionally independent from $x_{n, j \neq i}$ for each $i, j \in \{ 1, ..., D \}$ ), the probability distribution of the pixel $i$ over all images belonging to a component $k$ can be modelled using Bernoulli distribution with a parameter $0 \leq \mu_{k, i} \leq 1$ .

To incorporate some prior knowledge about the image assignment to $K$ clusters (e.g. the proportions of images in each cluster), the assignments can be treated as being sampled from a multinomial (categorical) distribution with the parameters $\pi_1, ..., \pi_K$ (where $0 \leq \pi_i \leq 1$ , $\sum_{i = 1}^K \pi_i = 1$ ). Each $\pi_i$ for $i \in \{1, ..., K\}$ is called a mixing coefficient of cluster $i$ .

Let's say that the model parameters include the pixel distributions of each cluster and the prior knowledge about the image assignments, i.e. $\theta = (\mathbf{\mu}, \mathbf{\pi})$ , where $\mathbf{\mu} := (\mathbf{\mu_1} \; \mathbf{\mu_2} \;... \;\mathbf{\mu_K} ) = \left( \begin{array}{cccc} \mu_{1, 1} & \mu_{2, 1} & ... & \mu_{K, 1} \\ \mu_{1, 2} & \mu_{2, 2} & ... & \mu_{K, 2} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{1, D} & \mu_{2, D} & ... & \mu_{K, D} \\ \end{array} \right)$ and $\mathbf{\pi} := ( \pi_1, ..., \pi_K )^T$ .

Then, the likelihood of a single training image $\mathbf{x}$ is
\begin{align}
\,\text{Pr}(\mathbf{x} | \theta) = \,\text{Pr}(\mathbf{x} | \mathbf{\mu}, \mathbf{\pi}) = \sum_{k = 1}^K \pi_k \,\text{Pr}(\mathbf{x}|\mathbf{\mu_k})
\end{align}
where the probability that $\mathbf{x}$ is generated by cluster $k$ can be written as
\begin{align}
\,\text{Pr}(\mathbf{x}|\mathbf{\mu_k}) = \prod_{i = 1}^D \mu_{k, i}^{x_i} (1 - \mu_{k, i})^{1 - x_i}.
\end{align}
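As a quick sanity check of the two formulas above, here is a small sketch in plain Python. The numbers are made up ($D = 3$ pixels, $K = 2$ clusters), and, as a storage assumption of this sketch, `mu[k]` holds the vector $\mathbf{\mu_k}$ as a list (i.e. the transpose of the matrix $\mathbf{\mu}$ defined earlier).

```python
def bernoulli_prob(x, mu_k):
    """Pr(x | mu_k) = prod_i mu_k[i]^x[i] * (1 - mu_k[i])^(1 - x[i])."""
    p = 1.0
    for x_i, m in zip(x, mu_k):
        p *= m if x_i == 1 else (1.0 - m)
    return p

def mixture_likelihood(x, mu, pi):
    """Pr(x | mu, pi) = sum_k pi[k] * Pr(x | mu[k])."""
    return sum(pi_k * bernoulli_prob(x, mu_k) for pi_k, mu_k in zip(pi, mu))

# Made-up parameters: mu[k][i] is the probability that pixel i is on in cluster k.
mu = [[0.9, 0.8, 0.1],
      [0.2, 0.1, 0.7]]
pi = [0.6, 0.4]
x = [1, 1, 0]

# 0.6 * (0.9 * 0.8 * 0.9) + 0.4 * (0.2 * 0.1 * 0.3), approximately 0.3912
print(mixture_likelihood(x, mu, pi))
```

Since the mixture is a proper distribution over binary images, summing `mixture_likelihood` over all $2^D$ possible images should give one.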

To model the assignment of images to clusters, associate a latent $K$ -dimensional binary random variable $\mathbf{z_i}$ with each of the training examples $\mathbf{x_i}$ . Say that $\mathbf{z_i}$ has a 1-of- $K$ representation, i.e. for $\mathbf{z_i} := (z_{i, 1}, ..., z_{i, K})^T$ it must be the case that $z_{i, j} \in \{0, 1\}$ for $i \in \{ 1, ..., N \}, j \in \{ 1, ..., K \}$ and $\sum_{j = 1}^{K} z_{i, j} = 1$ .

Furthermore, let the marginal distribution over $\mathbf{z_i}$ be specified in terms of mixing coefficients $\mathbf{\pi}$ s.t. $\,\text{Pr}(z_{i, j} = 1) = \pi_j$ , then
\begin{align}
\,\text{Pr}(\mathbf{z_n} | \mathbf{\pi}) = \prod_{i = 1}^K \pi_i^{z_{n, i}}.
\end{align}

Similarly, let $\,\text{Pr}(\mathbf{x_n} | z_{n, k} = 1) = \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})$ , then
\begin{align}
\,\text{Pr}(\mathbf{x_n} | \mathbf{z_n}, \mathbf{\mu}, \mathbf{\pi}) = \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}}.
\end{align}

By combining all latent variables $\mathbf{z_i}$ into a set $\mathbf{Z} := \{ \mathbf{z_1}, ..., \mathbf{z_N} \}$ , we can write
\begin{equation} \label{eq:probZ}
\begin{split}
\,\text{Pr}(\mathbf{Z}|\mathbf{\pi}) &= \prod_{n = 1}^N \,\text{Pr}(\mathbf{z_n}|\mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \pi_k^{z_{n, k}},
\end{split}
\end{equation}
and, similarly, combining all training images $\mathbf{x_i}$ into a set $\mathbf{X} := \{ \mathbf{x_1}, ..., \mathbf{x_N} \}$ , we can express the conditional distribution of the training data given the latent variables as
\begin{equation} \label{eq:probXgivZ}
\begin{split}
\,\text{Pr}(\mathbf{X}|\mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) &= \prod_{n = 1}^N \,\text{Pr}(\mathbf{x_n}|\mathbf{z_n},\mathbf{\mu},\mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}} \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \left( \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}}.
\end{split}
\end{equation}

From the last two equations and the probability chain rule, the complete data likelihood can be written as:
\begin{equation} \label{eq:probXandZ}
\begin{split}
\,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) &= \,\text{Pr}(\mathbf{X} | \mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) \,\text{Pr}(\mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \left( \pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}},
\end{split}
\end{equation}

and thus the complete data log likelihood $\mathcal{L}(\theta)$ can be obtained by taking a log of the equation above:
\begin{equation}
\begin{split}
\mathcal{L}(\theta) &= \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) \\
&= \sum_{n = 1}^N \sum_{k = 1}^K z_{n, k} \left( \ln \pi_k + \sum_{i = 1}^D \left[ x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right] \right).
\end{split}
\end{equation}
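The complete-data log likelihood translates almost literally into code. The following is only a sketch: it assumes that `Z` contains hard 1-of-$K$ assignments, that every $\mu_{k, i}$ lies strictly between 0 and 1 (so the logarithms are defined), and it reuses made-up toy numbers.

```python
import math

def complete_data_log_likelihood(X, Z, mu, pi):
    """ln Pr(X, Z | mu, pi) for binary images X and 1-of-K assignments Z."""
    total = 0.0
    for x_n, z_n in zip(X, Z):
        for k, z_nk in enumerate(z_n):
            if z_nk == 0:
                continue  # clusters the image is not assigned to contribute nothing
            log_term = math.log(pi[k])
            for x_ni, m in zip(x_n, mu[k]):
                log_term += x_ni * math.log(m) + (1 - x_ni) * math.log(1.0 - m)
            total += z_nk * log_term
    return total

# Made-up example: two images, D = 3 pixels, K = 2 clusters.
mu = [[0.9, 0.8, 0.1],
      [0.2, 0.1, 0.7]]
pi = [0.6, 0.4]
X = [[1, 1, 0], [0, 0, 1]]
Z = [[1, 0], [0, 1]]  # first image in cluster 1, second image in cluster 2

value = complete_data_log_likelihood(X, Z, mu, pi)
print(value)
```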

(Still following? Great. Take five, and below we will derive the E and M step update equations.)

2.2. E-M update equations for BMMs

In order to update the latent variable distribution (i.e. image assignment to clusters) at the expectation step, we need to set the probability distribution of $\textbf{Z}$ to $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta})$ .

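Because the images are assigned to clusters independently of each other, this posterior factorizes over the images, and because the complete-data log likelihood is linear in $z_{n, k}$ , the M step only needs the posterior expectation of each $z_{n, k}$ . A simple way of doing the update is therefore to replace the current values of $z_{n, k}$ with the expected ones: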

\begin{equation} \label{eq:z}
\begin{split}
z_{n, k}^\text{new} \leftarrow \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}, \mathbf{\pi}}[z_{n, k}] &= \sum_{z_{n, k}} \,\text{Pr}(z_{n,k} | \mathbf{x_n}, \mathbf{\mu}, \mathbf{\pi}) \, z_{n,k}\\
&= \frac{\pi_k \,\text{Pr}(\mathbf{x_n} |\mathbf{\mu_k})}{\sum_{m = 1}^K \pi_m \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_m})} \\
&= \frac{\pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} }{\sum_{m = 1}^K \pi_m \prod_{i = 1}^D \mu_{m, i}^{x_{n, i}} (1 - \mu_{m, i})^{1 - x_{n, i}}}.
\end{split}
\end{equation}

(Notice that after this update $\mathbf{z_{n}}^\text{new}$ is no longer represented as 1-of- $K$ vector, i.e. the same image can be "partially" assigned to multiple clusters.)
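Here is how the expectation step might look in code. This is a sketch rather than the implementation from the video: the products are evaluated in log space and shifted by their maximum before exponentiating, which is a numerical-stability choice added here (with hundreds of pixels the raw products underflow quickly). It assumes $0 < \mu_{k, i} < 1$ and $\pi_k > 0$ .

```python
import math

def e_step(X, mu, pi):
    """Return z[n][k] = Pr(z_nk = 1 | x_n, mu, pi) for every image and cluster."""
    responsibilities = []
    for x_n in X:
        # log(pi_k * Pr(x_n | mu_k)) for each cluster k
        log_w = []
        for pi_k, mu_k in zip(pi, mu):
            lw = math.log(pi_k)
            for x_ni, m in zip(x_n, mu_k):
                lw += math.log(m) if x_ni == 1 else math.log(1.0 - m)
            log_w.append(lw)
        # Shift by the maximum before exponentiating; the shift cancels
        # in the ratio, so the result is unchanged.
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]
        norm = sum(w)
        responsibilities.append([w_k / norm for w_k in w])
    return responsibilities

# Made-up parameters: D = 3 pixels, K = 2 clusters.
mu = [[0.9, 0.8, 0.1],
      [0.2, 0.1, 0.7]]
pi = [0.6, 0.4]
Z_new = e_step([[1, 1, 0], [0, 0, 1]], mu, pi)
print(Z_new)  # each row sums to one
```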

In the maximization step we need to update the model parameters (i.e. the mixing coefficients and the pixel distributions) using the update equation from earlier

$\mathbf{\theta}^\text{new} \leftarrow \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right].$

Observe that
\begin{align}
\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \sum_{k = 1}^K \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, k} \right] \left( \ln \pi_k + \sum_{i = 1}^D \left[ x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right] \right).
\end{align}
The equation above can be maximized w.r.t. $\mathbf{\mu_k}$ by simply setting its derivative to zero:
\begin{align}
\frac{\partial}{\partial \mu_{m, j}} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, m} \right] \left( \frac{x_{n, j}}{\mu_{m, j}} - \frac{1 - x_{n, j}}{1 - \mu_{m, j}} \right) \\
&= \sum_{n = 1}^N z_{n, m}^\text{new} \frac{x_{n, j} - \mu_{m, j}}{\mu_{m, j} (1 - \mu_{m, j})} = 0 \Leftrightarrow \\
\mu_{m, j} &= \frac{1}{N_m} \sum_{n = 1}^N x_{n, j} z_{n, m}^\text{new},
\end{align}
where $N_m = \sum_{n = 1}^N z_{n, m}^\text{new}$ is the effective number of images assigned to cluster $m$ .

Then the full cluster $m$ pixel distribution vector $\mathbf{\mu_m}$ can be written as

$\mathbf{\mu_m} = \mathbf{\bar{x}_m},$

where $\mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m}^\text{new} \mathbf{x_n}$ is the weighted mean of the images associated with cluster $m$ .

To maximize $\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right]$ w.r.t. the mixing coefficients $\mathbf{\pi}$ (subject to the constraint $\sum_{k = 1}^K \pi_k = 1$ ) we can use the Lagrange multipliers, yielding the following optimization problem:
\begin{equation*}
\Lambda(\theta, \lambda) := \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] + \lambda \left( \sum_{k = 1}^K \pi_k - 1 \right).
\end{equation*}
The optimizing solution can then be found again with simple partial derivatives:
\begin{align}
\frac{\partial}{\partial \pi_{m}} \Lambda(\theta, \lambda) &= \frac{1}{\pi_m} \sum_{n = 1}^N z_{n,m}^\text{new} + \lambda = 0 \Leftrightarrow \\
\pi_m &= -\frac{N_m}{\lambda},
\end{align}
\begin{align}
\frac{\partial}{\partial \lambda} \Lambda(\theta, \lambda) &= \sum_{k = 1}^K \pi_k - 1 = 0 \Leftrightarrow \\
\sum_{k = 1}^K \pi_k &= 1.
\end{align}
By combining these two results we get $\lambda = - \sum_{k = 1}^K N_k = - N$ , and thus

$\pi_m = -\frac{N_m}{\lambda} = \frac{N_m}{N}.$

Done!
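Both maximization-step results fit in a few lines of code. The sketch below assumes that every cluster has a non-zero effective size $N_m$ (otherwise the division fails), and the data and responsibilities are made up.

```python
def m_step(X, Z):
    """Return (mu, pi) given binary images X and responsibilities Z."""
    N, D, K = len(X), len(X[0]), len(Z[0])
    mu, pi = [], []
    for m in range(K):
        # Effective number of images assigned to cluster m.
        N_m = sum(Z[n][m] for n in range(N))
        # Weighted mean image of cluster m.
        mu.append([sum(Z[n][m] * X[n][j] for n in range(N)) / N_m
                   for j in range(D)])
        # Mixing coefficient of cluster m.
        pi.append(N_m / N)
    return mu, pi

# Made-up example: three images, the middle one split between both clusters.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
Z = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
mu, pi = m_step(X, Z)
print(mu)  # approximately [[1.0, 0.667, 0.0], [0.333, 0.0, 0.667]]
print(pi)  # [0.5, 0.5]
```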

2.3. Summary

In summary, the update equations for Bernoulli Mixture Models using E-M are:

Expectation step:

$z_{n, k} \leftarrow \frac{\pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} }{\sum_{m = 1}^K \pi_m \prod_{i = 1}^D \mu_{m, i}^{x_{n, i}} (1 - \mu_{m, i})^{1 - x_{n, i}}}.$
Maximization step:

$\mathbf{\mu_m} \leftarrow \mathbf{\bar{x}_m},$

$\pi_m \leftarrow \frac{N_m}{N},$

where $\mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m} \mathbf{x_n}$ and $N_m = \sum_{n = 1}^N z_{n, m}$ .
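Putting the update equations together gives a complete (if tiny) E-M loop. Treat the following as a sketch under several assumptions that are not part of the derivation above: the initial parameters are hand-picked rather than random, the loop runs for a fixed number of iterations instead of testing convergence, and the values of $\mu$ and $\pi$ are clamped away from zero (via the hypothetical constant `EPS`) so that the logarithms stay finite.

```python
import math

EPS = 1e-6  # clamp added in this sketch to avoid log(0)

def e_step(X, mu, pi):
    Z = []
    for x in X:
        log_w = [math.log(max(p, EPS))
                 + sum(math.log(m if x_i else 1.0 - m) for x_i, m in zip(x, mu_k))
                 for p, mu_k in zip(pi, mu)]
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        norm = sum(w)
        Z.append([v / norm for v in w])
    return Z

def m_step(X, Z):
    N, D, K = len(X), len(X[0]), len(Z[0])
    mu, pi = [], []
    for k in range(K):
        N_k = sum(z[k] for z in Z)  # assumed to stay non-zero
        mean = [sum(z[k] * x[j] for x, z in zip(X, Z)) / N_k for j in range(D)]
        mu.append([min(max(v, EPS), 1.0 - EPS) for v in mean])
        pi.append(N_k / N)
    return mu, pi

def fit(X, mu, pi, iterations=20):
    for _ in range(iterations):
        Z = e_step(X, mu, pi)   # expectation step
        mu, pi = m_step(X, Z)   # maximization step
    return mu, pi, e_step(X, mu, pi)

# Made-up data: two noisy "objects" over D = 4 pixels.
X = [[1, 1, 0, 0]] * 3 + [[1, 0, 0, 0]] + [[0, 0, 1, 1]] * 3 + [[0, 0, 0, 1]]
mu_0 = [[0.6, 0.6, 0.4, 0.4], [0.4, 0.4, 0.6, 0.6]]
pi_0 = [0.5, 0.5]

mu, pi, Z = fit(X, mu_0, pi_0)
clusters = [max(range(len(z)), key=lambda k: z[k]) for z in Z]
print(clusters)  # the first four images share one cluster, the last four the other
```

On real images one would typically start from random parameters and stop once the log likelihood stops improving; the fixed initialization here just keeps the example reproducible.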

3. References

[Dempster et al, 1977] A. P. Dempster, N. M. Laird, D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1–38.

[Bishop, 2006] C. M. Bishop. "Pattern Recognition and Machine Learning". Springer, 2006. ISBN 9780387310732.

Backpropagation Tutorial

Manfredas Zabarauskas — Sun, 17 Apr 2011 23:16:25 +0000

The PhD thesis of Paul J. Werbos at Harvard in 1974 described backpropagation as a method of teaching feed-forward artificial neural networks (ANNs). In the words of Wikipedia, it led to a "renaissance" in the ANN research in 1980s.

As we will see later, it is an extremely straightforward technique, yet most of the tutorials online seem to skip a fair amount of details. Here's a simple (yet still thorough and mathematical) tutorial of how backpropagation works from the ground-up; together with a couple of example applets. Feel free to play with them (and watch the videos) to get a better understanding of the methods described below!

Training a single perceptron

Training a multilayer neural network

1. Background

To start with, imagine that you have gathered some empirical data relevant to the situation that you are trying to predict - be it fluctuations in the stock market, chances that a tumour is benign, likelihood that the picture that you are seeing is a face or (like in the applets above) the coordinates of red and blue points.

We will call this data training examples and we will describe $i$ ^th training example as a tuple $(\vec{x_i}, y_i)$ , where $\vec{x_i} \in \mathbb{R}^n$ is a vector of inputs and $y_i \in \mathbb{R}$ is the observed output.

Ideally, our neural network should output $y_i$ when given $\vec{x_i}$ as an input. In case that does not always happen, let's define the error measure as a simple squared distance between the actual observed output and the prediction of the neural network: $E := \sum_i (h(\vec{x_i}) - y_i)^2$ , where $h(\vec{x_i})$ is the output of the network.

2. Perceptrons (building-blocks)

The simplest classifiers out of which we will build our neural network are perceptrons (fancy name thanks to Frank Rosenblatt). In reality, a perceptron is a plain-vanilla linear classifier which takes a number of inputs $a_1, ..., a_n$ , scales them using some weights $w_1, ..., w_n$ , adds them all up (together with some bias $b$ ) and feeds everything through an activation function $\sigma : \mathbb{R} \rightarrow \mathbb{R}$ .

A picture is worth a thousand equations:

Perceptron (linear classifier)

To slightly simplify the equations, define $w_0 := b$ and $a_0 := 1$ . Then the behaviour of the perceptron can be described as $\sigma(\vec{a} \cdot \vec{w})$ , where $\vec{a} := (a_0, a_1, ..., a_n)$ and $\vec{w} := (w_0, w_1, ..., w_n)$ .

To complete our definition, here are a few examples of typical activation functions:

sigmoid: $\sigma(x) = \frac{1}{1 + \exp(-x)}$ ,
hyperbolic tangent: $\sigma(x) = \tanh(x)$ ,
plain linear $\sigma(x) = x$ and so on.
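In code, a perceptron with the $w_0 := b$ , $a_0 := 1$ convention is just a dot product followed by the activation function. The sketch below uses made-up weights; the helper names (`perceptron`, `linear`) are chosen for this example only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    return x

def perceptron(inputs, weights, activation):
    """weights[0] is the bias b = w_0; the inputs a_1..a_n are extended with a_0 = 1."""
    a = [1.0] + list(inputs)
    return activation(sum(a_i * w_i for a_i, w_i in zip(a, weights)))

# Made-up weights: bias 0.5, then one weight per input.
weights = [0.5, 1.0, 3.0]
inputs = [2.0, -1.0]

print(perceptron(inputs, weights, linear))     # 0.5 + 2.0 - 3.0 = -0.5
print(perceptron(inputs, weights, sigmoid))    # a value between 0 and 1
print(perceptron(inputs, weights, math.tanh))  # a value between -1 and 1
```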

Now we can finally start building neural networks. The simplest kind of network that we can build is... exactly, one perceptron! Here's how we can train it to classify things!

3. Single-layer neural network

We defined the error earlier as $E := \sum_i (h(\vec{x_i}) - y_i)^2$ . Obviously, since we are using a single perceptron both our error and the output of the network ( $h_{\vec{w}}(\vec{x_i}) = \sigma(\vec{w} \cdot \vec{x_i})$ ) depend on the weights vector $\vec{w}$ .

Incorporating those observations into the updated error measure we obtain $E(\vec{w}) := \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2$ .

Our goal is to find such a vector of weights $\vec{w}$ that $E(\vec{w})$ is minimised - that way our perceptron will correctly predict the output for all inputs of our training examples!

We will do that by applying the gradient descent algorithm: in essence we will treat the error as a surface in n-dimensional space, then we will find a greatest downwards slope at the current point $\vec{w_t}$ and will go in that direction to obtain $\vec{w}_{t+1}$ . This way hopefully we will find a minimum point on the error surface and we will use the coordinates of that point as the final weight vector.

By skipping a great deal of maths on whether the minimum point exists, is it unique and global, can we "overjump" it by accident, what are the conditions for the following partial derivatives to exist, etc, etc; we will dive straight in hoping for the best and will calculate the gradient of the error surface at $\vec{w_t}$ . Then we will take a step in the opposite direction of the gradient (i.e. in the direction of the fastest decreasing slope on the error surface) to obtain $\vec{w}_{t + 1}$ .

To express it in a slightly more mathematical way, we will start with some randomized (!) weight vector $\vec{w_0}$ and will train our perceptron by updating the weights

\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}, \end{align}

where $\eta$ is known as a learning rate (a simple scaling factor that typically ranges between zero and one).

Observe that

\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = \left( \frac{\partial E(\vec{w})}{\partial w_0},\frac{\partial E(\vec{w})}{\partial w_1}, ... ,\frac{\partial E(\vec{w})}{\partial w_n} \right), \end{align}

and we can calculate

\begin{align} \frac{\partial E(\vec{w})}{\partial w_j} &= \frac{\partial}{\partial w_j} \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2 \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial}{\partial w_j} (h_{\vec{w}}(\vec{x_i}) - y_i) \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial}{\partial w_j} \sigma(\vec{x_i} \cdot \vec{w}) \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \frac{\partial}{\partial w_j} \vec{x_i} \cdot \vec{w} \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \frac{\partial}{\partial w_j} \sum_{k=0}^n x_{i,k} w_k \\ &= 2 \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) x_{i,j} \end{align}

for each $0 \leq j \leq n$ .
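A handy way to convince yourself that a derivative like this is right is to compare it against central finite differences. The sketch below does that for a sigmoid perceptron; the training examples and weights are made up, and each input vector is assumed to already start with the constant $x_{i,0} = 1$ .

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def error(w, examples):
    """E(w) = sum_i (h_w(x_i) - y_i)^2 with h_w(x) = sigmoid(w . x)."""
    return sum((sigmoid(dot(w, x)) - y) ** 2 for x, y in examples)

def gradient(w, examples):
    """dE/dw_j = 2 * sum_i (h_w(x_i) - y_i) * sigma'(x_i . w) * x_ij."""
    grad = [0.0] * len(w)
    for x, y in examples:
        s = dot(w, x)
        common = 2.0 * (sigmoid(s) - y) * sigmoid_prime(s)
        for j, x_j in enumerate(x):
            grad[j] += common * x_j
    return grad

# Made-up data; the leading 1.0 in each input is the constant x_0.
examples = [([1.0, 0.2, 0.7], 1.0), ([1.0, 0.9, 0.1], 0.0)]
w = [0.1, -0.3, 0.5]

analytic = gradient(w, examples)

# Central finite differences as an independent check.
step = 1e-6
numeric = []
for j in range(len(w)):
    w_plus, w_minus = w[:], w[:]
    w_plus[j] += step
    w_minus[j] -= step
    numeric.append((error(w_plus, examples) - error(w_minus, examples)) / (2 * step))

print(analytic)
print(numeric)  # agrees with the analytic gradient up to rounding
```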

3.1. Example single-layer neural network

In this applet, a perceptron takes two inputs (normalized x and y coordinates $in_x$ and $in_y$ , i.e. $a_1 = in_x$ , $a_2 = in_y$ ) and uses sigmoid as an activation function with the learning rate $\eta = 0.1$ .

Then, using a previous general result

\begin{align} \frac{\partial E(\vec{w})}{\partial w_j} &= 2 \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) x_{i,j} \\ &= 2 \sum_i (\sigma(\vec{w} \cdot \vec{x_i}) - y_i) \sigma(\vec{x_i} \cdot \vec{w}) (1 - \sigma(\vec{x_i} \cdot \vec{w})) x_{i,j}, \end{align}

(since for the sigmoid activation function $\sigma ' (x) = \sigma(x) (1 - \sigma(x))$ ); and thus

\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = 2 \sum_i (\sigma(\vec{w} \cdot \vec{x_i}) - y_i) \sigma(\vec{x_i} \cdot \vec{w}) (1 - \sigma(\vec{x_i} \cdot \vec{w})) \vec{x_i}. \end{align}

The final algorithm to update the weight vector $\vec{w} = (w_0, w_1, w_2)$ (which is initially randomized) then is

\begin{align} \vec{w}_{t+1} := \vec{w_t} - 0.2 \sum_i (h_{\vec{w}_t}(\vec{x_i}) - y_i) h_{\vec{w}_t}(\vec{x_i}) (1 - h_{\vec{w}_t}(\vec{x_i})) \vec{x_i}, \end{align}

where $h_{\vec{w}_t}(\vec{x_i}) = \sigma(\vec{w}_t \cdot \vec{x_i})$ .
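The update rule above can be sketched in a few lines. This is not the applet's source code: the four training points, the number of steps and the seeded random initialization are assumptions made for the sake of a reproducible example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def h(w, x):
    """Output of the perceptron, sigmoid(w . x)."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)))

def error(w, examples):
    return sum((h(w, x) - y) ** 2 for x, y in examples)

def train(w, examples, steps=1000):
    """Batch updates w <- w - 0.2 * sum_i (h - y_i) * h * (1 - h) * x_i."""
    for _ in range(steps):
        total = [0.0] * len(w)
        for x, y in examples:
            out = h(w, x)
            scale = (out - y) * out * (1.0 - out)
            for j, x_j in enumerate(x):
                total[j] += scale * x_j
        w = [w_j - 0.2 * t_j for w_j, t_j in zip(w, total)]
    return w

# Made-up points (1, in_x, in_y) with labels: 1.0 for "red", 0.0 for "blue".
examples = [([1.0, 0.9, 0.8], 1.0), ([1.0, 0.8, 0.9], 1.0),
            ([1.0, 0.1, 0.2], 0.0), ([1.0, 0.2, 0.1], 0.0)]

rng = random.Random(0)  # seeded so the run is reproducible
w_0 = [rng.uniform(-0.5, 0.5) for _ in range(3)]
w_final = train(w_0, examples)

print(error(w_0, examples), error(w_final, examples))  # the error goes down
```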

However, a single perceptron is extremely limited in the sense that different classes of examples must be separable with a hyperplane (hence the name, linear classifier), which is usually not the case in real-life applications.

Time to bump things up a notch: let's connect a few of them together to obtain a multilayer feed-forward neural network!

4. Multilayer neural network

Let's consider a general case first: a completely unrestricted feed-forward structure (with the only condition being that there are no loops between the perceptrons to avoid general madness and chaos).

Since it is structurally more complex than just a single perceptron, take a look at the following figure that explains some more notation:

Multilayer neural network

Here the weight $w_{i \rightarrow j}$ connects perceptrons $i$ and $j$ , the sum of the weighted inputs of perceptron $j$ is denoted by $s_j := \sum_k z_k w_{k \rightarrow j}$ where $k$ iterates over all perceptrons connected to $j$ , and the output of $j$ is written as $z_j := \sigma(s_j)$ , where $\sigma$ is $j$ 's activation function.

We will use the same error measure $E(\vec{w}) := \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2$ , except now the weights vector $\vec{w}$ will contain all the weights in the network, i.e. $\vec{w} = (\;\;w_{i \rightarrow j}\;\;)$ for all $i, j$ .

To find $\vec{w}$ that minimizes $E(\vec{w})$ using gradient descent we have to calculate $\frac{\partial E(\vec{w})}{\partial \vec{w}}$ (again). However, this time it is (very slightly) more involved.

First of all let's separate the contributions of individual training examples to the overall error using the following observation:
\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = \sum_i \frac{\partial E_i(\vec{w})}{\partial \vec{w}}, \end{align}
where $E_i(\vec{w}) = (h_{\vec{w}}(\vec{x_i}) - y_i)^2$ .

Then

\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= \frac{\partial}{\partial w_{j \rightarrow k}} (h_{\vec{w}}(\vec{x_i}) - y_i)^2 \\ &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial w_{j \rightarrow k}} \\ &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} \frac{\partial s_k}{\partial w_{j \rightarrow k}} \\ &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} z_j. \end{align}

If $k$ is an output node, then
\begin{align} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} = \frac{d \sigma(s_k)}{d s_k} = \sigma' (s_k)\end{align}
and thus
\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j. \end{align}

However, if $k$ is not an output node, then a change in $s_k$ can affect all the nodes which are connected to $k$ 's output, i.e.
\begin{align} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} &= \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial z_k} \frac{\partial z_k}{\partial s_k} \\ &= \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial z_k} \sigma ' (s_k) \\ &= \sum_{o \in \{ v \; | \; v \text{ is connected to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} \frac{\partial s_o}{\partial z_k} \sigma ' (s_k) \\ &= \sum_{o \in \{ v \; | \; v \text{ is connected to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} w_{k \rightarrow o} \; \sigma ' (s_k), \end{align}
... and we are almost done! All that is left to do is to place the $i$ ^th example at the inputs of our neural network, calculate $s_k$ and $z_k$ for all the nodes (the forward-propagation step) and work our way backwards from the output node calculating $\frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k}$ (hence the name, backpropagation).

To summarize, if $k$ is an output node, then

\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j, \end{align}

otherwise

\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j \sum_{o \in \{ v \; | \; v \text{ conn. to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} w_{k \rightarrow o}. \end{align}

Then after the following is obtained
\begin{align} \frac{\partial E_i(\vec{w})}{\partial \vec{w}} = \left( \; \; \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} \; \; \right), \forall j, k \end{align}
the weight vector can either be updated in one go (batch update)
\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}} = \vec{w_t} - \eta \sum_i \frac{\partial E_i(\vec{w})}{\partial \vec{w}}\bigg|_{\vec{w_t}}, \end{align}
or it can be updated sequentially using one training example at a time:
\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E_i(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}.\end{align}

4.1. Example multilayer network

If you launch and play with the applet above, you will see that it is able to separate classes non-linearly (indicating that it's using more than one perceptron). It is built using this two-layer neural network:

Two-layer neural network example

The weights vector $\vec{w}$ contains all the weights in the network, i.e.
\begin{align} \vec{w} = ( w_{in_1 \rightarrow 1}, w_{in_x \rightarrow 1}, w_{in_y \rightarrow 1}, w_{in_1 \rightarrow 2}, ..., w_{in_y \rightarrow 5}, w_{in_1 \rightarrow 6}, w_{1 \rightarrow 6}, w_{2 \rightarrow 6}, ..., w_{5 \rightarrow 6}). \end{align}

Each perceptron is using sigmoid as its activation function and the output of the perceptron $6$ is the output for the whole network, i.e. $h_{\vec{w}}(\vec{x_i}) = z_6$ .

Then an individual point i (with x and y coordinates normalized) is considered as an $i$ ^th training example and fed through the network. While it's being propagated, each $s_i$ and $z_i$ for $i = 1, ..., 6$ are stored.

Then the gradient of an $i$ ^th error surface is calculated as follows:
\begin{align}
\frac{\partial E_i(\vec{w})}{\partial \vec{w}} &= \left( \frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 1}},\frac{\partial E_i(\vec{w})}{\partial w_{in_x \rightarrow 1}}, ..., \frac{\partial E_i(\vec{w})}{\partial w_{in_y \rightarrow 5}},\frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 6}},\frac{\partial E_i(\vec{w})}{\partial w_{1 \rightarrow 6}},\frac{\partial E_i(\vec{w})}{\partial w_{2 \rightarrow 6}}, ..., \frac{\partial E_i(\vec{w})}{\partial w_{5 \rightarrow 6}} \right) , \end{align}
where
\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 1}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_1)\; \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_6} w_{1 \rightarrow 6} \\
&= 2 (z_6 - y_i) \; \sigma (s_1) \; (1 - \sigma (s_1)) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{1 \rightarrow 6}, \\
\frac{\partial E_i(\vec{w})}{\partial w_{in_x \rightarrow 1}} &= 2 (z_6 - y_i) \; \sigma (s_1) \; (1 - \sigma (s_1)) \; {in}_x \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{1 \rightarrow 6}, \\
& \vdots \\
\frac{\partial E_i(\vec{w})}{\partial w_{in_y \rightarrow 5}} &= 2 (z_6 - y_i) \; \sigma (s_5) \; (1 - \sigma (s_5)) \; {in}_y \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{5 \rightarrow 6}, \\
\frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 6}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_6) \\
&= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) , \\
\frac{\partial E_i(\vec{w})}{\partial w_{1 \rightarrow 6}} &= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; z_1, \\
\frac{\partial E_i(\vec{w})}{\partial w_{2 \rightarrow 6}} &= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; z_2, \\
& \vdots \\
\frac{\partial E_i(\vec{w})}{\partial w_{5 \rightarrow 6}} &= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; z_5.
\end{align}

Finally, the network is trained sequentially with the learning rate $\eta = 0.5$ (starting with a random initial weight vector $\vec{w}_0$ ):
\begin{align} \vec{w}_{t+1} := \vec{w_t} - 0.5 \frac{\partial E_i(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}.\end{align}
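To make the gradient and the update rule concrete, here is a small Python sketch of one sequential training step. It assumes the squared error $E_i(\vec{w}) = (h_{\vec{w}}(\vec{x_i}) - y_i)^2$ that the factor of 2 in the formulas suggests, and it stores the weights in a flat list of 21 numbers; that layout and the function names are assumptions made for this example only.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def output(w, x, y):
    """Return (hidden activations z_1..z_5, network output z_6).

    Assumed layout of w: [bias, in_x, in_y] for perceptrons 1..5 (indices 0..14),
    then [bias, w_{1->6}, ..., w_{5->6}] for perceptron 6 (indices 15..20).
    """
    z = [sigmoid(w[3 * j] + w[3 * j + 1] * x + w[3 * j + 2] * y) for j in range(5)]
    s6 = w[15] + sum(w[16 + j] * z[j] for j in range(5))
    return z, sigmoid(s6)

def gradient(w, x, y, target):
    """Gradient of E_i(w) = (h_w(x_i) - y_i)^2 for a single training point."""
    z, z6 = output(w, x, y)
    delta6 = 2.0 * (z6 - target) * z6 * (1.0 - z6)   # 2 (z_6 - y_i) sigma'(s_6)
    grad = [0.0] * 21
    grad[15] = delta6                                 # bias weight of perceptron 6
    for j in range(5):
        grad[16 + j] = delta6 * z[j]                  # w_{j->6}
        # Go back through w_{j->6} and sigma'(s_j) = z_j (1 - z_j).
        delta_j = delta6 * w[16 + j] * z[j] * (1.0 - z[j])
        grad[3 * j] = delta_j                         # bias weight (in_1 = 1)
        grad[3 * j + 1] = delta_j * x                 # weight from in_x
        grad[3 * j + 2] = delta_j * y                 # weight from in_y
    return grad

def sgd_step(w, x, y, target, eta=0.5):
    """One sequential update: w_{t+1} = w_t - eta * dE_i/dw evaluated at w_t."""
    return [wk - eta * gk for wk, gk in zip(w, gradient(w, x, y, target))]
```

A quick way to gain confidence in a hand-derived gradient like this one is to compare it against a finite-difference approximation of $E_i$ for a few random weight vectors.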

That's it, I hope it sheds some light on the backpropagation!

Eigenfaces Tutorial

Manfredas Zabarauskas — Fri, 02 Oct 2009 16:43:22 +0000

The main purpose behind writing this tutorial was to provide a more detailed set of instructions for someone who is trying to implement an eigenface-based face detection or recognition system. It is assumed that the reader is familiar (at least to some extent) with the eigenface technique as described in the original M. Turk and A. Pentland papers (see "References" for more details).

1. Introduction

The idea behind eigenfaces is similar (to a certain extent) to the one behind a Fourier decomposition, where a periodic signal is represented as a sum of simple oscillating functions. The technique described in this tutorial, as well as in the original papers, likewise aims to represent a face as a linear combination of base images (called the eigenfaces).

The recognition/detection process consists of two stages: initialization, during which the eigenface basis is established, and face classification, during which a new image is projected onto the "face space" and then categorized by its weight pattern as a known-face, an unknown-face or a non-face image.

2. Demonstration

The software shown in the video is available for download for the 32-bit x86 platform. It was compiled using Microsoft Visual C++ 2008 and uses GSL for Windows.

3. Establishing the Eigenface Basis

First of all, we have to obtain a training set of $M$ grayscale face images $I_1, I_2, ..., I_M$ . They should be:

face-wise aligned, with eyes at the same level and faces of the same scale,
normalized so that every pixel has a value between 0 and 255 (i.e. one byte per pixel encoding), and
of the same $N \times N$ size.

So, to capture everything formally, we want to obtain a set $\{ I_1, I_2, ..., I_M \}$ , where \begin{align} I_k = \begin{bmatrix} p_{1,1}^k & p_{1,2}^k & ... & p_{1,N}^k \\ p_{2,1}^k & p_{2,2}^k & ... & p_{2,N}^k \\ \vdots & \vdots & \ddots & \vdots \\ p_{N,1}^k & p_{N,2}^k & ... & p_{N,N}^k \end{bmatrix}_{N \times N} \end{align} and $0 \leq p_{i,j}^k \leq 255.$

Once we have that, we should change the representation of a face image $I_k$ from an $N \times N$ matrix to a point $\Gamma_k$ in $N^2$ -dimensional space. Here is how we do it: we concatenate all the rows of the matrix $I_k$ into one big vector of dimension $N^2$ . Can it get any simpler than that?

This is how it looks formally:

$\Gamma_k = \begin{bmatrix} p_{1,1}^k \\ p_{1,2}^k \\ \vdots \\ p_{1,N}^k \\ p_{2,1}^k \\ p_{2,2}^k \\ \vdots \\ p_{2,N}^k \\ \vdots \\ p_{N,1}^k \\ p_{N,2}^k \\ \vdots \\ p_{N,N}^k \end{bmatrix}_{N^2 \times 1}$ , where

$k = 1, ..., M$ and

$p_{i,j}^k \in I_k$
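In code, the row concatenation is a one-liner. The sketch below uses NumPy (my choice for these examples; the original software was written in C++ with GSL), and `image_to_vector` is a name made up for illustration.

```python
import numpy as np

def image_to_vector(image):
    """Turn an N x N grayscale image I_k into the N^2-dimensional vector Gamma_k."""
    image = np.asarray(image, dtype=np.float64)
    if image.ndim != 2 or image.shape[0] != image.shape[1]:
        raise ValueError("expected a square N x N image")
    # reshape(-1) reads the array in row-major order, i.e. it concatenates the rows.
    return image.reshape(-1)

gamma = image_to_vector([[1, 2],
                         [3, 4]])
print(gamma)  # [1. 2. 3. 4.]
```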

Since we are much more interested in the characteristic features of those faces, let's subtract everything that is common between them, i.e. the average face.
The average face of the training images can be defined as $\Psi = {{1}\over{M}} \sum_{i=1}^{M} \Gamma_i$ ; each face then differs from the average by the vector $\Phi_i = \Gamma_i - \Psi$ .
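A minimal NumPy sketch of this step, assuming the flattened faces are stacked as the rows of an $M \times N^2$ array (the storage layout and the function name are choices made for this example):

```python
import numpy as np

def mean_adjust(gammas):
    """gammas: array of shape (M, N*N), one flattened face Gamma_i per row.

    Returns the average face Psi and the differences Phi_i = Gamma_i - Psi
    (again one per row).
    """
    gammas = np.asarray(gammas, dtype=np.float64)
    psi = gammas.mean(axis=0)      # average over the M faces, pixel by pixel
    phis = gammas - psi            # broadcasting subtracts Psi from every row
    return psi, phis

psi, phis = mean_adjust([[0, 10],
                         [4, 20]])
print(psi)   # [ 2. 15.]
print(phis)  # [[-2. -5.]
             #  [ 2.  5.]]
```

With this layout, the matrix whose columns are the $\Phi_i$ is simply `phis.T`. Note that the differences always sum to the zero vector, which is why they can span at most $M - 1$ dimensions.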

Now we should attempt to find a set of orthonormal vectors which best describe the distribution of our data. At first glance, the necessary steps in this daunting task would seem to be:

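Obtain the covariance matrix
$C = {{1}\over{M}} \sum_{i=1}^{M} \Phi_i \Phi_i^T = {{1}\over{M}} AA^T$ , where $A = \left[ \Phi_1 \Phi_2 ... \Phi_M \right]$ . The constant factor ${{1}\over{M}}$ only rescales the eigenvalues and leaves the eigenvectors unchanged, so from here on we drop it and simply take $C = AA^T$ .
Find the eigenvectors $u_k$ and eigenvalues $\lambda_k$ of $C$ .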

However, note two things here: $A$ is of the size $N^2 \times M$ and hence the matrix $C$ is of the size $N^2 \times N^2$ . To put things into perspective - if your image size is $128 \times 128$ , then the size of the matrix $C$ would be $16384 \times 16384$ . Determining the eigenvectors and eigenvalues of a matrix this size is very expensive, and it is also wasteful, since with $M$ training images $C$ has at most $M$ non-zero eigenvalues.

So how do we go about it? A simple mathematical trick: first let's calculate the inner product matrix $L = A^T A$ , of the size $M \times M$ . Then let's find its eigenvectors $v_i, i = 1, ..., M$ (each of dimension $M$ ). Now observe that if $L v_i = \lambda_i v_i$ , then

\begin{array} {rcl} A L v_i &=& \lambda_i A v_i \Rightarrow \\ A A^T A v_i &=& \lambda_i A v_i \Rightarrow \\ C A v_i &=& \lambda_i A v_i, \end{array}

and hence $u_i = A v_i$ and $\lambda_i$ are, respectively, eigenvectors (each of dimension $N^2$ ) and eigenvalues of $C$ , provided that $A v_i \neq 0$ . Make sure to normalize each $u_i$ , so that $\left\| u_i \right\| = 1$ .
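Here is a hedged NumPy sketch of the trick, taking $C = AA^T$ as in the derivation above. The function name and the tolerance used to discard directions with $A v_i \approx 0$ are assumptions of this example; such a direction always exists when the $\Phi_i$ were obtained by subtracting the average face, because they then sum to zero.

```python
import numpy as np

def eigenfaces_from_differences(A):
    """A: (N*N, M) matrix whose columns are Phi_1, ..., Phi_M.

    Returns (eigenvalues, U), sorted by decreasing eigenvalue, where the
    columns of U are the unit-length eigenvectors u_i = A v_i of A A^T.
    """
    L = A.T @ A                          # small M x M inner product matrix
    eigvals, V = np.linalg.eigh(L)       # L is symmetric; eigh returns ascending order
    order = np.argsort(eigvals)[::-1]    # reorder to descending
    eigvals, V = eigvals[order], V[:, order]

    U = A @ V                            # u_i = A v_i, one eigenvector per column
    norms = np.linalg.norm(U, axis=0)
    keep = norms > 1e-8 * norms.max()    # drop directions where A v_i is (numerically) zero
    return eigvals[keep], U[:, keep] / norms[keep]
```

For $M$ training faces this only ever decomposes an $M \times M$ matrix, and the $N^2 \times N^2$ matrix $C$ is never formed.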

We will call these eigenvectors $u_i$ the eigenfaces. Scale their values to the 0 - 255 range and render them on the screen as $N \times N$ images to see why.
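One possible way to do the scaling is shown below. The tutorial does not specify how the values should be mapped, so the min-max stretch used here is an assumption, as is the function name.

```python
import numpy as np

def eigenface_to_image(u, n):
    """Map a unit-length eigenface u (length n*n) to an n x n image with values 0..255.

    Uses a min-max stretch: the smallest component becomes 0, the largest 255.
    """
    u = np.asarray(u, dtype=np.float64)
    lo, hi = u.min(), u.max()
    if hi == lo:                               # constant vector, nothing to stretch
        return np.zeros((n, n), dtype=np.uint8)
    scaled = (u - lo) / (hi - lo) * 255.0
    return np.rint(scaled).astype(np.uint8).reshape(n, n)
```

The resulting array can be handed to any image viewer or plotting library that accepts 8-bit grayscale data.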

It turns out that quite a few eigenfaces with the smallest eigenvalues can be discarded, so leave only the $R \leq M$ ones with the largest eigenvalues (i.e. only the ones making the greatest contribution to the variance of the original image set) and chuck them into the matrix $U = \left[ u_1 u_2 ... u_R \right]_{N^2 \times R}$ .
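Selecting the $R$ strongest eigenfaces could look like the following sketch. How to pick $R$ itself is left open in this tutorial, so here it is simply passed in as a parameter; the function name is hypothetical.

```python
import numpy as np

def keep_largest(eigvals, U, R):
    """Keep the R eigenfaces (columns of U) with the largest eigenvalues."""
    eigvals = np.asarray(eigvals, dtype=np.float64)
    order = np.argsort(eigvals)[::-1][:R]   # indices of the R largest eigenvalues
    return eigvals[order], U[:, order]

vals, U_R = keep_largest([0.5, 3.0, 1.0], np.eye(3), R=2)
print(vals)  # [3. 1.]
```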

After you have done that - congratulations! Apart from the average face $\Psi$ , we won't need anything but the matrix $U$ for the next steps - face detection and classification.

4. Face Classification Using Eigenfaces

Once the eigenfaces are created, a new face image $\Gamma$ can be transformed into its eigenface components by a simple operation:

$\Omega = U^T (\Gamma - \Psi) = \begin{bmatrix} \omega_1 \\ \omega_2 \\ \vdots \\ \omega_R \end{bmatrix}_{R \times 1}$ .
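In NumPy this projection is a single matrix-vector product. The sketch assumes `U` has shape $N^2 \times R$ with orthonormal columns and that `psi` and `gamma` are flattened vectors of length $N^2$; the function name is made up for the example.

```python
import numpy as np

def project(U, psi, gamma):
    """Return the weight vector Omega = U^T (Gamma - Psi) of length R."""
    return U.T @ (np.asarray(gamma, dtype=np.float64) - psi)

# Toy example with N^2 = 3 and R = 2.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
psi = np.array([1.0, 1.0, 1.0])
print(project(U, psi, [3.0, 0.0, 5.0]))  # [ 2. -1.]
```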

The weights $\omega_i \in \Omega$ describe the contribution of each eigenface in representing the input face image. We can use this vector for face recognition by finding the smallest Euclidean distance $\epsilon_{rec}$ between the weight vector of the input face and the weight vectors $\Omega_i = U^T (\Gamma_i - \Psi)$ of the training faces, i.e. by calculating $\epsilon_{rec} = \min_i \left\| \Omega - \Omega_i \right\|$ . If $\epsilon_{rec} < \Theta_{rec}$ , where $\Theta_{rec}$ is a threshold chosen heuristically, then we can say that the input image is recognized as the training image with which it gives the lowest score.

The weight vector can also be used for face detection, exploiting the fact that images of faces do not change radically when projected onto the face space, while the projections of non-face images appear quite different. To do so, we can calculate the distance $\epsilon_{det}$ between the mean-adjusted input image $\Phi = \Gamma - \Psi$ and its projection onto the face space $\Phi_f = \sum_{i=1}^R \omega_i u_i$ , i.e. $\epsilon_{det} = \left\| \Phi - \Phi_f \right\|$ . Again, if $\epsilon_{det} < \Theta_{det}$ for some threshold $\Theta_{det}$ (also obtained heuristically, for example, by observing $\epsilon_{det}$ for an input set consisting only of face images and a set consisting only of non-face images), we can conclude that the input image is a face.
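Both tests can be combined into one small classifier, sketched below with NumPy. Running the detection test first and the recognition test second is a design choice of this example rather than something prescribed here, and the function name, the return format and the layout of `train_omegas` (one training weight vector $\Omega_i$ per row) are assumptions.

```python
import numpy as np

def classify(U, psi, train_omegas, gamma, theta_rec, theta_det):
    """Label an input image as 'non-face', 'known' or 'unknown'.

    U: (N*N, R) eigenface matrix, psi: average face,
    train_omegas: (M, R) weight vectors of the training faces.
    Returns (label, index of the matched training face or None).
    """
    phi = np.asarray(gamma, dtype=np.float64) - psi
    omega = U.T @ phi                        # weights of the input image
    phi_f = U @ omega                        # projection onto the face space
    eps_det = np.linalg.norm(phi - phi_f)    # distance from the face space
    if eps_det >= theta_det:
        return "non-face", None

    dists = np.linalg.norm(train_omegas - omega, axis=1)
    k = int(np.argmin(dists))                # closest training face
    if dists[k] < theta_rec:                 # eps_rec = dists[k]
        return "known", k
    return "unknown", None
```

In practice both thresholds would have to be tuned on images that were not used to build the eigenface basis.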

5. References

1. Face Recognition Using Eigenfaces, Matthew A. Turk and Alex P. Pentland, MIT Vision and Modeling Lab, CVPR '91.
2. Eigenfaces for Recognition, Matthew A. Turk and Alex P. Pentland, Journal of Cognitive Neuroscience '91.
3. Eigenfaces, Sheng Zhang and Matthew Turk, Scholarpedia, 3(9):4244, 2008.