# Robot Photographer (Part IV, Final)

The story of Luke, an autonomous robot photographer, is officially complete: the work has been submitted to ICRA 2014.

Luke: an autonomous robot photographer.

The source code of the implemented robot photographer can be found at https://github.com/manfredzab/robot-photographer, together with a short technical description. If you need more technical information, take a look at my master's thesis, which contains all of the gory theoretical and implementation details, spread out over 140 pages.

All in all, developing Luke has been amazingly fun, but now it's time to move on to other exciting things. Silicon Valley awaits!

# Robot Photographer (Part III)

The autonomous robot photographer is officially done, and it has already successfully completed its first real-world assignment!

The robot was deployed during the open days at Oxford's Computer Science Department. It ran for nearly three hours, taking 103 photos (approximately one photo every two minutes). A short video from this deployment is shown below.

Now, for some good news (at least for some of you, anyway). I will be producing a detailed write-up of the robot's implementation as part of my Master's dissertation. As soon as that is completed (in a month or so), I will release both the dissertation and the source code for all the software powering the robot! Given that the robot is built from low-cost, off-the-shelf hardware parts, I hope that it will enable some of you to create even more awesome autonomous robots!

Robot photographer development: real-world deployment

# Robot Photographer (Part II)

A new milestone for the robot photographer! Here's a list of new things that the robot is capable of:

• It has been taught to detect and track humans in the Kinect's depth image!
• The depth and photographic cameras have been aligned (the camera intrinsics and extrinsics have been calibrated). This effectively means that, given a point on one image plane (e.g. in the depth image), the robot can tell where the corresponding point lies on the other image plane (e.g. in a photograph).
• The composition and framing modules have been implemented, i.e. the robot now knows what a "good" picture looks like.
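To make the alignment point concrete, here's a minimal sketch of mapping a depth-image pixel into the photograph using a pinhole camera model. The intrinsic matrices and the extrinsic transform below are made-up placeholder values (the real ones come out of the calibration procedure), not the robot's actual calibration results:

```python
import numpy as np

# Hypothetical intrinsics/extrinsics -- placeholders, not calibration output.
K_depth = np.array([[580.0,   0.0, 320.0],
                    [  0.0, 580.0, 240.0],
                    [  0.0,   0.0,   1.0]])
K_rgb = np.array([[530.0,   0.0, 320.0],
                  [  0.0, 530.0, 240.0],
                  [  0.0,   0.0,   1.0]])
R = np.eye(3)                      # rotation: depth camera frame -> RGB camera frame
t = np.array([0.025, 0.0, 0.0])    # translation between the two cameras, in metres

def depth_to_rgb_pixel(u, v, depth_m):
    """Map a depth-image pixel (u, v) at depth_m metres to RGB-image coordinates."""
    # Back-project the pixel to a 3D point in the depth camera frame.
    p_depth = depth_m * np.linalg.inv(K_depth) @ np.array([u, v, 1.0])
    # Transform the point into the RGB camera frame using the extrinsics.
    p_rgb = R @ p_depth + t
    # Project onto the RGB image plane and dehomogenize.
    uvw = K_rgb @ p_rgb
    return uvw[0] / uvw[2], uvw[1] / uvw[2]
```

Given a calibrated pair, a head detected at a depth pixel with a known range can thus be located in the photograph before framing the shot.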

All in all, this means that the "beta" version of the robot is now fully functional. Here's a new video of it in action!

Robot photographer development: human head tracking and picture framing

Next step: real world deployment!

# Robot Photographer (Part I)

For a few weeks now I have been working on a new idea: an autonomous robot photographer.

My ultimate goal is to build a fully autonomous robot which could take well-composed pictures in various social events: conferences, open days, weddings and so on.

In the process I am also hoping to build a scalable system that could be extended to more complicated tasks and, potentially, used as a teaching tool at an undergraduate level.

I am planning to open-source both the software and the construction details, so I am using only widely available, inexpensive parts and only open-source software.

My preliminary "ingredients" list for the robot includes:

• a Microsoft Kinect sensor to detect humans and avoid obstacles in the environment,
• a cheap point-and-shoot camera for taking the actual pictures, and
• a Turtlebot spec compliant Kobuki robot base for the actual locomotion.

I will update the blog with more development details, but in the meantime, here's a video of the first development milestone: autonomous navigation and obstacle avoidance.

Robot photographer development: obstacle avoidance

# Expectation-Maximization Algorithm for Bernoulli Mixture Models (Tutorial)

Even though the title is quite a mouthful, this post is about two really cool ideas:

1. A solution to the "chicken-and-egg" problem (known as the Expectation-Maximization method, described by A. Dempster, N. Laird and D. Rubin in 1977), and
2. An application of this solution to automatic image clustering by similarity, using Bernoulli Mixture Models.

For the curious, an implementation of the automatic image clustering is shown in the video below. The source code (C#, Windows x86/x64) is also available for download!

Automatic clustering of handwritten digits from MNIST database using Expectation-Maximization algorithm

While automatic image clustering nicely illustrates the E-M algorithm, E-M has been successfully applied in a number of other areas: I have seen it being used for word alignment in automated machine translation, valuation of derivatives in financial models, and gene expression clustering/motif finding in bioinformatics.

As a side note, the notation used in this tutorial closely matches the one used in Christopher M. Bishop's "Pattern Recognition and Machine Learning". This should hopefully encourage you to check out his great book for a broader understanding of E-M, mixture models or machine learning in general.

Alright, let's dive in!

#### 1. Expectation-Maximization Algorithm

Imagine the following situation. You observe some data set $\mathbf{X}$ (e.g. a bunch of images). You hypothesize that these images are of $K$ different objects... but you don't know which images represent which objects.

Let $\mathbf{Z}$ be a set of latent (hidden) variables, which tell precisely that: which images represent which objects.

Clearly, if you knew $\mathbf{Z}$, you could group images into clusters (where each cluster represents an object), and vice versa, if you knew the groupings you could deduce $\mathbf{Z}$. A classic "chicken-and-egg" problem, and a perfect target for the Expectation-Maximization algorithm.

Here's the general idea of how the E-M algorithm tackles it. First, all images are assigned to clusters arbitrarily. Then we use this assignment to modify the parameters of the clusters (e.g. we change what object is represented by that cluster) to **maximize** the clusters' ability to explain the data; after which we re-assign all images to their **expected** most-likely clusters. Wash, rinse, repeat, until the assignment explains the data well enough (i.e. images from the same cluster are similar enough).

(Notice the words in bold in the previous paragraph: this is where the expectation and maximization stages in the E-M algorithm come from.)

To formalize (and generalize) this a bit further, say that you have a set of model parameters $\mathbf{\theta}$ (in the example above, some sort of cluster descriptions).

To solve the problem of cluster assignments we effectively need to find the model parameters $\mathbf{\theta}$ that maximize the likelihood of the observed data $\mathbf{X}$, or, equivalently, the model parameters that maximize the log likelihood $\ln \,\text{Pr}(\mathbf{X} | \mathbf{\theta})$.

Using some simple algebra we can show that for any latent variable distribution $q(\mathbf{Z})$, the log likelihood of the data can be decomposed as
\begin{align}
\ln \,\text{Pr}(\mathbf{X} | \theta) = \mathcal{L}(q, \theta) + \text{KL}(q || p), \label{eq:logLikelihoodDecomp}
\end{align}
where $\text{KL}(q || p)$ is the Kullback-Leibler divergence between $q(\mathbf{Z})$ and the posterior distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$, and
\begin{align}
\mathcal{L}(q, \theta) := \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right)
\end{align}
with $\mathcal{L}(\theta) := \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\theta})$ being the "complete-data" log likelihood (i.e. log likelihood of both observed and latent data).
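The "simple algebra" can be spelled out. Using the product rule $\text{Pr}(\mathbf{X}, \mathbf{Z} | \theta) = \text{Pr}(\mathbf{Z} | \mathbf{X}, \theta) \,\text{Pr}(\mathbf{X} | \theta)$ and the fact that $\sum_{\mathbf{Z}} q(\mathbf{Z}) = 1$:
\begin{align*}
\ln \,\text{Pr}(\mathbf{X} | \theta) &= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \,\text{Pr}(\mathbf{X} | \theta) \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z} | \theta) - \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta) \right) \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z} | \theta) - \ln q(\mathbf{Z}) \right) + \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{q(\mathbf{Z})}{\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)} \\
&= \mathcal{L}(q, \theta) + \text{KL}(q \,||\, p).
\end{align*}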

To understand what the E-M algorithm does in the expectation (E) step, observe that $\text{KL}(q || p) \geq 0$ for any $q(\mathbf{Z})$ and hence $\mathcal{L}(q, \theta)$ is a lower bound on $\ln \,\text{Pr}(\mathbf{X} | \theta)$.

Then, in the E step, the gap between the $\mathcal{L}(q, \theta)$ and $\ln \,\text{Pr}(\mathbf{X} | \theta)$ is minimized by minimizing the Kullback-Leibler divergence $\text{KL}(q || p)$ with respect to $q(\mathbf{Z})$ (while keeping the parameters $\theta$ fixed).

Since $\text{KL}(q || p)$ is minimized at $\text{KL}(q || p) = 0$ when $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$, at the E step $q(\mathbf{Z})$ is set to the conditional distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$.

To maximize the model parameters in the M step, the lower bound $\mathcal{L}(q, \theta)$ is maximized with respect to the parameters $\theta$ (while keeping $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ fixed; notice that $\theta$ in this equation corresponds to the old set of parameters, hence to avoid confusion let $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})$).

The maximization of $\mathcal{L}(q, \theta)$ w.r.t. $\theta$ performed at the M step can be re-written as
\begin{align*}
\theta^\text{new} &= \underset{\mathbf{\theta}}{\text{arg max }} \left. \mathcal{L}(q, \theta) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \left. \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \left( \mathcal{L}(\theta) - \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \right) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - (C \in \mathbb{R}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right],
\end{align*}

i.e. in the M step the expectation of the joint log likelihood of the complete data is maximized with respect to the parameters $\theta$.

So, just to summarize,

• Expectation step: $q^{t + 1}(\mathbf{Z}) \leftarrow \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta}^t)$
• Maximization step: $\mathbf{\theta}^{t + 1} \leftarrow \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{t}} \left[ \mathcal{L}(\theta) \right]$ (where superscript $\mathbf{\theta}^t$ indicates the value of parameter $\mathbf{\theta}$ at time $t$).
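To make the two steps above concrete, here is a minimal NumPy sketch of E-M for a Bernoulli mixture (the model used in the clustering demo). The variable names and the random initialization are my own, not taken from the original C# implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def em_bernoulli_mixture(X, K, n_iters=50):
    """E-M for a K-component Bernoulli mixture on binary data X (N x D)."""
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                   # mixing coefficients
    mu = rng.uniform(0.25, 0.75, size=(K, D))  # per-cluster "pixel on" probabilities
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] = Pr(z_n = k | x_n, theta^old),
        # computed in log space for numerical stability.
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log likelihood.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = np.clip((gamma.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
    return pi, mu, gamma
```

Each row of `mu` ends up describing the "average image" of one cluster, and `gamma.argmax(axis=1)` gives the most likely cluster assignment for each image.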

Phew. Let's go to the image clustering example, and see how all of this actually works.

# 3D Display Simulation using Head-Tracking with Kinect

During my final year in Cambridge I had the opportunity to work on a project that I had wanted to implement for the last three years: a glasses-free 3D display.

#### 1. Introduction

It all started when I saw Johnny Lee's "Head Tracking for Desktop VR Displays using the Wii Remote" project in early 2008 (see below). He cunningly used the infrared camera in the Nintendo Wii's remote and a head mounted sensor bar to track the location of the viewer's head and render view dependent images on the screen. He called it a "portal to the virtual environment".

I always thought that it would be really cool to have this behaviour without having to wear anything on your head (and it was - see the video below!).

My "portal to the virtual environment" which does not require head gear. And it has 3D Tetris!

I am a firm believer in three-dimensional displays, and I am certain that we do not see widespread adoption of 3D displays simply because of a classic network effect (also known as the "chicken-and-egg" problem). Creating and distributing three-dimensional content is inevitably much more expensive than regular, old-school 2D content. If there is no demand (i.e. no one has a 3D display at home or at work), then content providers do not have much of an incentive to create 3D content. Vice versa, if there is no content, then consumers do not see much incentive to invest in (inevitably more expensive) 3D displays.

A "portal to the virtual environment", or as I like to call it, a 2.5D display could effectively solve this. If we could enhance every 2D display to get what you see in Johnny's videos and mine (and I mean every display: LCD, CRT, you-name-it), then suddenly everyone could consume 3D content even without a "fully" 3D display. At that point it would start making sense to mass-create 3D content.
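On the rendering side, the standard way to turn a tracked head position into a view-dependent image is an off-axis (asymmetric) perspective frustum, in the style of OpenGL's `glFrustum`. The sketch below is my own minimal version; the screen dimensions are placeholder values, and the eye position is assumed to be in metres relative to the screen centre:

```python
import numpy as np

def off_axis_projection(eye, screen_w=0.4, screen_h=0.3, near=0.01, far=100.0):
    """Asymmetric view frustum for an eye at (x, y, z), in metres,
    relative to the centre of a screen_w x screen_h screen at z = 0."""
    ex, ey, ez = eye  # ez > 0: viewer's distance from the screen plane
    # Frustum extents on the near plane, scaled down from the screen rectangle.
    left   = (-screen_w / 2 - ex) * near / ez
    right  = ( screen_w / 2 - ex) * near / ez
    bottom = (-screen_h / 2 - ey) * near / ez
    top    = ( screen_h / 2 - ey) * near / ez
    # Standard glFrustum-style projection matrix.
    return np.array([
        [2 * near / (right - left), 0, (right + left) / (right - left), 0],
        [0, 2 * near / (top - bottom), (top + bottom) / (top - bottom), 0],
        [0, 0, -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0, 0, -1, 0]])
```

When the eye sits on the screen's axis the frustum is symmetric (the usual perspective projection); as the head moves sideways the frustum skews, which is exactly what produces the "portal" illusion.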

The terms "fully" and 2.5D, however, require a bit of explanation.

# Backpropagation Tutorial

The PhD thesis of Paul J. Werbos at Harvard in 1974 described backpropagation as a method of training feed-forward artificial neural networks (ANNs). In the words of Wikipedia, it led to a "renaissance" in ANN research in the 1980s.

As we will see later, it is an extremely straightforward technique, yet most of the online tutorials seem to skip a fair amount of detail. Here's a simple (yet still thorough and mathematical) tutorial of how backpropagation works from the ground up, together with a couple of example applets. Feel free to play with them (and watch the videos) to get a better understanding of the methods described below!

Training a single perceptron

Training a multilayer neural network

#### 1. Background

To start with, imagine that you have gathered some empirical data relevant to the situation that you are trying to predict - be it fluctuations in the stock market, chances that a tumour is benign, likelihood that the picture that you are seeing is a face or (like in the applets above) the coordinates of red and blue points.

We will call this data training examples and we will describe $i$th training example as a tuple $(\vec{x_i}, y_i)$, where $\vec{x_i} \in \mathbb{R}^n$ is a vector of inputs and $y_i \in \mathbb{R}$ is the observed output.

Ideally, our neural network should output $y_i$ when given $\vec{x_i}$ as an input. Since that will not always happen, let's define the error measure as a simple squared distance between the actual observed output and the prediction of the neural network: $E := \sum_i (h(\vec{x_i}) - y_i)^2$, where $h(\vec{x_i})$ is the output of the network.

#### 2. Perceptrons (building-blocks)

The simplest classifiers out of which we will build our neural network are perceptrons (a fancy name thanks to Frank Rosenblatt). In reality, a perceptron is a plain-vanilla linear classifier which takes a number of inputs $a_1, ..., a_n$, scales them using some weights $w_1, ..., w_n$, adds them all up (together with some bias $b$) and feeds everything through an activation function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$.

A picture is worth a thousand equations:

Perceptron (linear classifier)

To slightly simplify the equations, define $w_0 := b$ and $a_0 := 1$. Then the behaviour of the perceptron can be described as $\sigma(\vec{a} \cdot \vec{w})$, where $\vec{a} := (a_0, a_1, ..., a_n)$ and $\vec{w} := (w_0, w_1, ..., w_n)$.

To complete our definition, here are a few examples of typical activation functions:

• sigmoid: $\sigma(x) = \frac{1}{1 + \exp(-x)}$,
• hyperbolic tangent: $\sigma(x) = \tanh(x)$,
• plain linear $\sigma(x) = x$ and so on.
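As a concrete illustration (a sketch of my own, not the applet's source code), here is the perceptron's forward pass $\sigma(\vec{a} \cdot \vec{w})$ with a sigmoid activation, plus one gradient-descent step on the squared-error measure $E$ defined above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def perceptron(a, w):
    """Perceptron output sigma(a . w); a[0] = 1 carries the bias w[0] = b."""
    return sigmoid(sum(ai * wi for ai, wi in zip(a, w)))

def train_step(examples, w, lr=0.5):
    """One gradient-descent step on E = sum_i (h(x_i) - y_i)^2."""
    grad = [0.0] * len(w)
    for x, y in examples:
        a = [1.0] + x                      # prepend a_0 = 1 for the bias
        h = perceptron(a, w)
        # dE/dw_j = 2 (h - y) * sigma'(a . w) * a_j, with sigma' = h (1 - h)
        for j, aj in enumerate(a):
            grad[j] += 2 * (h - y) * h * (1 - h) * aj
    return [wj - lr * gj for wj, gj in zip(w, grad)]
```

Repeatedly applying `train_step` drives $E$ down; backpropagation is essentially the generalization of this chain-rule computation to multiple layers of perceptrons.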

Now we can finally start building neural networks.

# Halfway There

Another term in Cambridge has gone by - four out of nine to go. In the meantime, here's a quick update of what I've been up to in the past few months.

#### 1. Microsoft internship

Redmond, WA, 2011

In January I had the opportunity to visit Microsoft's headquarters in Redmond, WA, to interview for the Software Development Engineer in Test intern position in the Office team. In short - a great trip, in every aspect.

I left London Heathrow on January 11th at 2:20 PM and landed in Seattle-Tacoma at 4:10 PM (I suspect that there might have been a few time zones in between those two points). I arrived at the Marriott Redmond roughly an hour later, which meant that, because of my anti-jetlag technique ("do not go to bed until 10-11 PM in the new timezone no matter what"), I had a few hours to kill. Ample time to unpack, grab dinner in the Marriott's restaurant and go for a short stroll around Redmond before going to sleep.

I had four interviews arranged for the next day. The interviews themselves were absolutely stress-free; it felt more like a chance to meet and have a chat with some properly smart (and down-to-earth) folks.

Top of the Space Needle. Seattle, WA, 2011

The structure of the interviews seemed fairly typical: each interview consisted of some algorithm/data structure problems, a short discussion about past experience and an opportunity to ask questions (obviously a great chance to learn more about the team/company/company culture, etc). Since this was my third round of summer internship applications (I worked as a software engineer for Wolfson Microelectronics in '09 and Morgan Stanley in '10), everything made sense and was pretty much what I expected.

My trip ended with a quick visit to Seattle the next day: a few pictures of the Space Needle, a cup of Seattle's Best Coffee, and there I was on my flight back to London, having spent \$0.00 (yep, Microsoft paid for everything: flights, hotel, meals, taxis, etc). Even so, the best thing about Microsoft definitely seemed to be the people working there; since I have received and accepted the offer, we'll see if my opinion remains unchanged after this summer!

#### 2. Lent term v2.0

Well, things are still picking up speed. Seven courses with twenty-eight supervisions in under two months, plus managing a group project (crowdsourcing mobile network signal strength) and a few basketball practices each week on top of that, and you'll see why this blog has not been updated for a couple of months.

It's not all doom and gloom, of course. The courses themselves are great, the lecturers make some decently convoluted material understandable in minutes, and an occasional formal hall (e.g. below) also helps.

All in all, my opinion that Cambridge provides a great opportunity to learn a huge amount of material in a very short timeframe remains unchanged.

There will be more to come about some cool things that I've learnt in separate posts, but now speaking of learning - it's revision time... :-)

Me and Ada at the CompSci formal hall. Cambridge, England, 2011

# Morgan Stanley Internship

After work (Canary Wharf, 2010)

Last week I received a short e-mail from my former manager at Morgan Stanley:

"Hi Manfred,

Just to let you know that GlobalAxe all went live last week and so far no issues at all."

Since the people on the trading floor have started using my system and it seems to be standing on its own feet so far, it is probably a good time to recap what happened over my ten-week internship at Morgan Stanley.

I was working as a technology analyst in the repo trading team (in the Institutional Securities group). My task was to develop and integrate a new screen into the trading software, to create an associated e-mail subsystem generating daily/weekly reports for senior executives, and to build a website which would provide access to the data for executives and salespeople without the trading software on their machines.

Development-wise it involved working with quite a wide range of technologies, such as C# and CAB for UI development, Java/Spring for e-mail report generation/server backend, MVC under ASP.NET for the website, Transact-SQL for Sybase DB backend; everything interconnected with SOAP/XML and distributed locally over in-house pubsub systems or through IBM's MQ for inter-continental data transactions.

Even though working with and learning about all these technologies was fun in its own right, the best thing about my experience, I would say, was the people.

Night at Canary Wharf, 2010

There is no better feeling than having a quick call with traders in New York to demo the stuff that you just wrote, then dropping an e-mail to Tokyo to check whether your recent changes made it through to their database, discussing the architecture of your system with the guys in your team, and then going to the global team video meeting; all in the same day.

And sometimes you feel the need to pinch yourself, because the level of responsibility that you get as an intern is staggering. You have the same rights and responsibilities as any other team member: a screw-up in your code can block sixty people from submitting their code before the end of the iteration, a failure to convince the head of trading in NY that what you are doing is going to help them will affect the reputation of the whole team, and so on.

But then, you own your project: you make the final design decisions, you implement it and you give it to the end-users, who often appear to be bigshots. And that more than makes up for a few late nights in the office. Plus, Canary Wharf is absolutely beautiful at night.

Without expanding too much (and breaching too many non-disclosure agreements) - it was definitely the best experience so far: in terms of team, project, technology, skill, involvement and everything else. And it seems like I will have a chance to repeat it again: I have already received an unconditional offer for the internship at MS next summer!

Oh, and regarding the summer days spent in glass, steel and stone towers... well, Majorca more than made up for it!