Archive for category Development
Expectation-Maximization Algorithm for Bernoulli Mixture Models (Tutorial)
Posted by Manfredas Zabarauskas in Development, Education on February 12th, 2013
Even though the title is quite a mouthful, this post is about two really cool ideas:
- A solution to the "chicken-and-egg" problem (known as the Expectation-Maximization method, described by A. Dempster, N. Laird and D. Rubin in 1977), and
- An application of this solution to automatic image clustering by similarity, using Bernoulli Mixture Models.
For the curious, an implementation of the automatic image clustering is shown in the video below. The source code (C#, Windows x86/x64) is also available for download!
Automatic clustering of handwritten digits from MNIST database using Expectation-Maximization algorithm
While automatic image clustering nicely illustrates the E-M algorithm, E-M has been successfully applied in a number of other areas: I have seen it being used for word alignment in automated machine translation, valuation of derivatives in financial models, and gene expression clustering/motif finding in bioinformatics.
As a side note, the notation used in this tutorial closely matches the one used in Christopher M. Bishop's "Pattern Recognition and Machine Learning". This should hopefully encourage you to check out his great book for a broader understanding of E-M, mixture models or machine learning in general.
Alright, let's dive in!
1. Expectation-Maximization Algorithm
Imagine the following situation. You observe some data set
(e.g. a bunch of images). You hypothesize that these images are of
different objects... but you don't know which images represent which objects.
Let
be a set of latent (hidden) variables, which tell precisely that: which images represent which objects.
Clearly, if you knew
, you could group images into the clusters (where each cluster represents an object), and vice versa, if you knew the groupings you could deduce
. A classical "chicken-and-egg" problem, and a perfect target for an Expectation-Maximization algorithm.
Here's a general idea of how E-M algorithm tackles it. First of all, all images are assigned to clusters arbitrarily. Then we use this assignment to modify the parameters of the clusters (e.g. we change what object is represented by that cluster) to maximize the clusters' ability to explain the data; after which we re-assign all images to the expected most-likely clusters. Wash, rinse, repeat, until the assignment explains the data well-enough (i.e. images from the same clusters are similar enough).
(Notice the words in bold in the previous paragraph: this is where the expectation and maximization stages in the E-M algorithm come from.)
To formalize (and generalize) this a bit further, say that you have a set of model parameters
(in the example above, some sort of cluster descriptions).
To solve the problem of cluster assignments we effectively need to find model parameters
that maximize the likelihood of the observed data
, or, equivalently, the model parameters that maximize the log likelihod

Using some simple algebra we can show that for any latent variable distribution
, the log likelihood of the data can be decomposed as
\begin{align}
\ln \,\text{Pr}(\mathbf{X} | \theta) = \mathcal{L}(q, \theta) + \text{KL}(q || p), \label{eq:logLikelihoodDecomp}
\end{align}
where
is the Kullback-Leibler divergence between
and the posterior distribution
, and
\begin{align}
\mathcal{L}(q, \theta) := \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right)
\end{align}
with
being the "complete-data" log likelihood (i.e. log likelihood of both observed and latent data).
To understand what the E-M algorithm does in the expectation (E) step, observe that
for any
and hence
is a lower bound on
.
Then, in the E step, the gap between the
and
is minimized by minimizing the Kullback-Leibler divergence
with respect to
(while keeping the parameters
fixed).
Since
is minimized at
when
, at the E step
is set to the conditional distribution
.
To maximize the model parameters in the M step, the lower bound
is maximized with respect to the parameters
(while keeping
fixed; notice that
in this equation corresponds to the old set of parameters, hence to avoid confusion let
).
The function
that is being maximized w.r.t.
at the M step can be re-written as
\begin{align*}
\theta^\text{new} &= \underset{\mathbf{\theta}}{\text{arg max }} \left. \mathcal{L}(q, \theta) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \left. \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \left( \mathcal{L}(\theta) - \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \right) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - (C \in \mathbb{R}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right],
\end{align*}
i.e. in the M step the expectation of the joint log likelihood of the complete data is maximized with respect to the parameters
.
So, just to summarize,
- Expectation step:

- Maximization step:
(where superscript
indicates the value of parameter
at time
).
Phew. Let's go to the image clustering example, and see how all of this actually works. Read the rest of this entry »
3D Display Simulation using Head-Tracking with Kinect
Posted by Manfredas Zabarauskas in Development, Education on October 31st, 2012
During my final year in Cambridge I had the opportunity to work on the project that I wanted to implement for the last three years.
It all started when I saw Johnny Lee's "Head Tracking for Desktop VR Displays using the Wii Remote" project in early 2008 (see below). He cunningly used the infrared camera in the Nintendo Wii's remote and a head mounted sensor bar to track the location of the viewer's head and render view dependent images on the screen. He called it a "portal to the virtual environment".
Johnny Lee's project "Head Tracking for Desktop VR Displays using the Wii Remote".
I always thought that it would be really cool to have this behaviour without having to wear anything on your head (and it was - see the video below!).
My "portal to the virtual environment" which does not require head gear. And it has 3D Tetris!
I am a firm believer in three-dimensional displays, and I am certain that we do not see the widespread adoption of 3D displays simply because of a classic network effect (also know as "chicken-and-egg" problem). The creation and distribution of a three-dimensional content is inevitably much more expensive than a regular, old-school 2D content. If there is no demand (i.e. no one has a 3D display at home/work), then the content providers do not have much of an incentive to bother creating the 3D content. Vice versa, if there is no content then consumers do not see much incentive to invest in (inevitably more expensive) 3D displays.
A "portal to the virtual environment", or as I like to call it, a 2.5D display could effectively solve this. If we could enhance every 2D display to get what you see in Johnny's and my videos (and I mean every: LCD, CRT, you-name-it), then suddenly everyone can consume the 3D content even without having the "fully" 3D display. At that point it starts making sense to mass-create 3D content.
The terms "fully" and 2.5D, however, require a bit of explanation.
Backpropagation Tutorial
Posted by Manfredas Zabarauskas in Development, Education on April 18th, 2011
The PhD thesis of Paul J. Werbos at Harvard in 1974 described backpropagation as a method of teaching feed-forward artificial neural networks (ANNs). In the words of Wikipedia, it lead to a "rennaisance" in the ANN research in 1980s.
As we will see later, it is an extremely straightforward technique, yet most of the tutorials online seem to skip a fair amount of details. Here's a simple (yet still thorough and mathematical) tutorial of how backpropagation works from the ground-up; together with a couple of example applets. Feel free to play with them (and watch the videos) to get a better understanding of the methods described below!
Training a single perceptron (linear classifier) Training a multilayer neural network
1. Background
To start with, imagine that you have gathered some empirical data relevant to the situation that you are trying to predict - be it fluctuations in the stock market, chances that a tumour is benign, likelihood that the picture that you are seeing is a face or (like in the applets above) the coordinates of red and blue points.
We will call this data training examples and we will describe
th training example as a tuple
, where
is a vector of inputs and
is the observed output.
Ideally, our neural network should output
when given
as an input. In case that does not always happen, let's define the error measure as a simple squared distance between the actual observed output and the prediction of the neural network:
, where
is the output of the network.
2. Perceptrons (building-blocks)
The simplest classifiers out of which we will build our neural network are perceptrons (fancy name thanks to Frank Rosenblatt). In reality, a perceptron is a plain-vanilla linear classifier which takes a number of inputs
, scales them using some weights
, adds them all up (together with some bias
) and feeds everything through an activation function
.
A picture is worth a thousand equations:
Perceptron (linear classifier)
To slightly simplify the equations, define
and
. Then the behaviour of the perceptron can be described as
, where
and
.
To complete our definition, here are a few examples of typical activation functions:
- sigmoid:
, - hyperbolic tangent:
, - plain linear
and so on.
Now we can finally start building neural networks. Read the rest of this entry »
Halfway There
Posted by Manfredas Zabarauskas in Development, Education, Life on April 12th, 2011
Another term in Cambridge has gone by - four out of nine to go. In the meantime, here's a quick update of what I've been up to in the past few months.
1. Microsoft internship
In January I had the opportunity to visit Microsoft's headquarters in Redmond, WA, to interview for the Software Development Engineer in Test intern position in the Office team. In short - a great trip, in every aspect.
I left London Heathrow on January 11th, 2:20 PM and landed in Seattle Tacoma at 4:10 PM (I suspect that there might have been a few time zones in between those two points). I arrived in Mariott Redmond roughly an hour later, which meant that because of my anti-jetlag technique ("do not go to bed until 10-11 PM in the new timezone no matter what") I had a few hours to kill. Ample time to unpack, grab a dinner in Mariott's restaurant and go for a short stroll around Redmond before going to sleep.
On the next day I had four interviews arranged. The interviews themselves were absolutely stress-free, it felt more like a chance to meet and have a chat with some properly smart (and down-to-earth) folks. Top of the Space Needle. Seattle, WA, 2011
My trip ended with a quick visit to Seattle on the next day: a few pictures of the Space Needle, a cup of Seattle's Best Coffee and there I was on my flight back to London, having spent $0.00 (yeap, Microsoft paid for everything - flights, hotel, meals, taxis, etc). Even so, the best thing about Microsoft definitely seemed to be the people working there; since I have received and accepted the offer, we'll see if my opinion remains unchanged after this summer!
2. Lent term v2.0
TrueMobileCoverage group project
Well, things are still picking up the speed. Seven courses with twenty-eight supervisions in under two months, plus managing a group project (crowd-sourcing mobile network signal strength, the link is on the left), a few basketball practices each week on top of that and you'll see a reason why this blog has not been updated for a couple of months.
It's not all doom and gloom, of course. Courses themselves are great, lecturers make some decently convoluted material understandable in minutes and an occasional formal hall (e.g. below) also helps.
All in all, my opinion, that Cambridge provides a great opportunity to learn a huge amount of material in a very short timeframe, remains unchanged.
There will be more to come about some cool things that I've learnt in separate posts, but now speaking of learning - it's revision time... :-)
Me and Ada at the CompSci formal. Cambridge, England, 2011
Conway's Game of Life (cont.)
Posted by Manfredas Zabarauskas in Development on November 12th, 2010
"Beauty in things exists in the mind which contemplates them."
- David Hume (1711-1776)
Conway's Game of Life theme continues. Here is a short video with the Game of Life, this time running on Altera DE2 FPGA board with custom soft MIPS CPU.
Game of Life running on Altera DE2 FPGA board.
Morgan Stanley Internship
Posted by Manfredas Zabarauskas in Development, Life on October 23rd, 2010
Last week I received a short e-mail from my former manager at Morgan Stanley:
"Hi Manfred,
Just to let you know that GlobalAxe all went live last week and so far no issues at all."
Since the people on the trading floor started using my system and it seems to be standing on its feet so far, it probably is a good time to recap on what had happened over my ten week internship at Morgan Stanley.
I was working as technology analyst in repo trading team (in institutional securities group). My task was to develop and integrate a new screen into trading software, to create an associated e-mail subsystem generating daily/weekly reports for senior executives and to code a website which would provide access to the data for executives/sales people without the trading software on their machines.
Development-wise it involved working with quite a wide range of technologies, such as C# and CAB for UI development, Java/Spring for e-mail report generation/server backend, MVC under ASP.NET for the website, Transact-SQL for Sybase DB backend; everything interconnected with SOAP/XML and distributed locally over in-house pubsub systems or through IBM's MQ for inter-continental data transactions.
Even though working and learning about all these technologies was fun on it's own right, the best thing I would say about my experience was the people. There is no better feeling than having a quick call with traders in New York demoing them the stuff that you just wrote, then dropping an e-mail to Tokyo checking if your recent changes made it through to their database, discussing the architecture of your system with the guys in your team and then going to the global team video-meeting; all in the same day.
And sometimes you feel the need to pinch yourself, because the level of responsibility that you get as an intern is staggering. You have the same rights and responsibilities as any other team member: a screw up in your code can block sixty people from submitting their code before the end of the iteration, a failure to convince the head of traders in NY that what you are doing is going to help them will affect the name of the whole team, and so on.
But then, you own your project: you make the final design decisions, you implement it and you give it to the end-users, who often appear to be bigshots. And that more than makes up for a few late nights in the office. Plus, Canary Wharf is absolutely beautiful at night.
Without expanding too much (and breaching too many non-disclosure agreements) - it was definitely the best experience so far: in terms of team, project, technology, skill, involvement and everything else. And it seems like I will have a chance to repeat it again: I have already received an unconditional offer for the internship at MS next summer!
Oh, and regarding the summer days spent in glass, steel and stone towers... well, Majorca more than made up for it!
Conway's Game of Life
Posted by Manfredas Zabarauskas in Development, Education on April 4th, 2010
Description
In 1970s John Horton Conway (British mathematician and University of Cambridge graduate) opened a whole new field of mathematical research by publishing a revolutionary paper on the cellular automaton called the Game of Life. Suffice it to say that the game which he has described with four simple rules has the power of a universal Turing machine, i.e. anything that can be computed algorithmically can be computed within Conway's Game of Life (outlines of a proof for given by Berlekamp et al; implemented by Chapman as a universal register machine within the Game of Life in 2002).
The Game of Life is a zero-player game, i.e. the player interacts only by creating an initial configuration on a two-dimensional grid of square cells and then observing how it evolves. Every new generation of cells (which can be either live or dead) is a pure function of the previous generation and is described by this set of rules:
- Any live cell with fewer than two live neighbours dies, as if caused by underpopulation.
- Any live cell with more than three live neighbours dies, as if by overcrowding.
- Any live cell with two or three live neighbours lives on to the next generation.
- Any dead cell with exactly three live neighbours becomes a live cell.
For more information, patterns and current news about the research involving Game of Life check out the brilliant LifeWiki at conwaylife.com.
Implementation
The following applet visualising the Game of Life has been developed as part of the coursework for Object-Oriented Programming at the University of Cambridge, all code was written and compiled in Sun's Java SE 1.6.
Click on any of the screenshots or the button below to launch the Game of Life (and if nothing shows up, make sure that you have the Java Runtime Environment (JRE) installed).
It Literally Pays Off to Do Homework in Cambridge
Posted by Manfredas Zabarauskas in Development, Education, Life on November 24th, 2009
"Well done! - joint shortest traditional (and practical :-) sort for the OS1a prize tick". Cambridge, 2009.
45 minutes, 28 MIPS instructions and £25.
Computer Science FTW.
# Copyright Manfredas Zabarauskas, 2009.
# MIPS routine that reads an array of ten integers
# and prints the sorted array to console.
.text
main: sub $t7, $sp, 40
l_read: li $v0, 5
syscall
sw $v0, 0($t7)
add $t7, $t7, 4
bne $t7, $sp, l_read
l_out: sub $t8, $sp, 36
sub $t7, 40
l_inn: add $t8, $t8, 4
lw $t2, -8($t8)
lw $t3, -4($t8)
ble $t2, $t3, no_swp
sw $t2, -4($t8)
sw $t3, -8($t8)
move $t7, $sp
no_swp: bne $t8, $sp, l_inn
beq $t7, $sp, l_out
l_prnt: li $v0, 11
li $a0, 10
syscall
li $v0, 1
lw $a0, 0($t7)
syscall
add $t7, $t7, 4
bne $t7, $sp, l_prnt
li $v0, 10
syscall
Eigenfaces Tutorial
Posted by Manfredas Zabarauskas in Development, Education on October 2nd, 2009
The main purpose behind writing this tutorial was to provide a more detailed set of instructions for someone who is trying to implement an eigenface based face detection or recognition systems. It is assumed that the reader is familiar (at least to some extent) with the eigenface technique as described in the original M. Turk and A. Pentland papers (see "References" for more details).
Introduction
The idea behind eigenfaces is similar (to a certain extent) to the one behind the periodic signal representation as a sum of simple oscillating functions in a Fourier decomposition. The technique described in this tutorial, as well as in the original papers, also aims to represent a face as a linear composition of the base images (called the eigenfaces).
The recognition/detection process consists of initialization, during which the eigenface basis is established and face classification, during which a new image is projected onto the "face space" and the resulting image is categorized by the weight patterns as a known-face, an unknown-face or a non-face image.
Demonstration
To download the software shown in video for 32-bit x86 platform, click here. It was compiled using Microsoft Visual C++ 2008 and uses GSL for Windows.
Establishing the Eigenface Basis
First of all, we have to obtain a training set of
grayscale face images
. They should be:
- face-wise aligned, with eyes in the same level and faces of the same scale,
- normalized so that every pixel has a value between 0 and 255 (i.e. one byte per pixel encoding), and
- of the same
size.
So just capturing everything formally, we want to obtain a set
, where \begin{align} I_k = \begin{bmatrix} p_{1,1}^k & p_{1,2}^k & ... & p_{1,N}^k \\ p_{2,1}^k & p_{2,2}^k & ... & p_{2,N}^k \\ \vdots \\ p_{N,1}^k & p_{N,2}^k & ... & p_{N,N}^k \end{bmatrix}_{N \times N} \end{align} and 
Once we have that, we should change the representation of a face image
from a
matrix, to a
point in
-dimensional space. Now here is how we do it: Read the rest of this entry »
SMRP6400/SMDK64X0 IIC synchronization problems
Posted by Manfredas Zabarauskas in Development on July 7th, 2009
Debugging Samsung SMRP6400
Since last week we have spent quite some time debugging Samsung SMRP6400/SMDK64X0 IIC drivers, I thought I might share with one particular example here. It is both a good showcase of the hardware/software synchronization issues and since the bugs are in the latest version of the drivers shipped together with the development platforms, it might save someone from additional headaches.
So, while working on the drivers for IIC one of our automated tests to make sure that IIC still works was to write some data on the bus, do an immediate read-back and verify that both data written and read back matches; something similar to this:
UCHAR outData[3] = { REGISTER, DATA_BYTE_1, DATA_BYTE2 };
UCHAR inData[2] = { 0, 0 };
IIC_Write(SLAVE_ADDRESS, outData, 3);
IIC_Read(SLAVE_ADDRESS, REGISTER, inData, 2);
if ((inData[0] != DATA_BYTE_1) || (inData[1] != DATA_BYTE_2))
{
DEBUGMSG(ERROR_MSG,
(TEXT("Immediate write/read data mismatch: data sent [0x%X " \
"0x%X] differs from the data received [0x%X, 0x%X]."),
DATA_BYTE_1, DATA_BYTE_2, inData[0], inData[1]));
}
However, after we added some IIC read calls from another hardware driver it started spitting fire throwing the following error message:
Immediate write/read data mismatch: data sent [0xCA, 0xFE] differs from the data received [0xCA, 0xFE].
Now this clearly meant we had a serious problem: the error message was saying that the data does not match, which obviously was not the case as shown by the message text!
Since we were pretty confident about our side of things, as well as the readings from the scope, it seemed like a good time to start looking at the Samsung IIC drivers (and especially s3c64X0_iic_lib.cpp). The driver structure there is pretty straightforward: the first read/write byte on the bus triggers the hardware IRQ, which is mapped to SysIntr triggering a transfer event; then any subsequent call to a read/write function blocks waiting for a transfer-done event, which is triggered when the last byte is read/written in the IST.
Everything looks sane up to the point where a transfer-done event is signalled from the IST (pseudocode, not the original code below, due to the legal issues):
static HANDLE ghTransferEvent;
static HANDLE ghTransferDone;
static DWORD IICSysIntr;
...
static DWORD IST(LPVOID lpContext)
{
BOOL bTransferDone = FALSE;
while (TRUE) {
WaitForSingleObject(ghTransferEvent, INFINITE);
switch (IIC_BUS_STATUS) {
case MASTER_RECEIVE:
// receive bytes and store them in the buffer
break;
case MASTER_TRANSMIT:
// transmit bytes from the buffer in memory
if (LAST_BYTE) bTransferDone = TRUE;
break;
InterruptDone(IICSysIntr);
if (bTransferDone) {
SetEvent(ghTransferDone);
}
}
}
}
Two major problems with this code are:
- The
bTransferDonevariable is never set from the IIC read, and hence thetransfer-doneevent for bus reads is never triggered, - After the
bTransferDoneis set from the IIC write, it is never reset, hence thetransfer-doneevent is triggered after reading/writing single byte on the bus in all subsequent transactions.
That explains the initial test case failure: during the write/immediate read data comparison the data arrives exactly between the if statement and the following printout, thus triggering the error message, but printing the correct data to the output due to the early signal of the event.
The way to solve this is also straightforward: make sure that the bTransferDone variable is cleared after the transfer-done event is triggered, and make sure that master-receive mode sets the bTransferDone variable after reading the last byte of the transaction from the IIC bus.
In pseudocode it would look similar to:
static DWORD IST(LPVOID context)
{
BOOL bTransferDone = FALSE;
while (TRUE) {
WaitForSingleObject(ghTransferEvent, INFINITE);
switch (IIC_BUS_STATUS) {
case MASTER_RECEIVE:
// receive bytes and store them in the buffer
if (LAST_BYTE) bTransferDone = TRUE;
break;
case MASTER_TRANSMIT:
// transmit bytes from the buffer in memory
if (LAST_BYTE) bTransferDone = TRUE;
break;
InterruptDone(IICSysIntr);
if (bTransferDone) {
bTransferDone = FALSE;
SetEvent(ghTransferDone);
}
}
}
}
The lesson of the day (quoting my colleague) is: "The first rule about multithreading - you're wrong".

At work. Wolfson Microelectronics PLC, 2009

