Development – Manfred Zabarauskas' Blog

"gcloud": Google Cloud Platform CLI for Power-Users

Manfredas Zabarauskas — Mon, 23 Mar 2015 20:10:21 +0000

I tend to spend a lot of time in command-line tools when working with public cloud providers. They are fast, scriptable, and... strangely empowering.

Over the last year as a Product Manager on Google Cloud Platform, I have been thinking a lot about improving the developer experience. Obviously, one of the crucial parts of this experience is the command-line access to the cloud. To improve this experience, we have launched gcloud: a command-line tool for Google Cloud Platform, which enables developers to manage their Google Cloud-based resources in a consistent, uniform and efficient way.

I have captured the following video which showcases some of the power-use cases of gcloud that I am particularly excited about, and I think it may be worthwhile sharing. In particular, the video below features:

A single-line installation
Unified help
Command autocompletion
Remote resource autocompletion
Output formatting
Command chaining
Component management and updating
Access to the bleeding-edge commands, and so on.

The video is rather fast (apologies, but I wanted to keep it around 5 minutes), so you might want to pause it if you find something interesting.

Power use of gcloud: the CLI for Google Cloud Platform.

I hope you'll find it useful!

With that, I'm signing off from building next-generation cloud computing to building cheap, scrappy and accessible virtual reality.

Onwards and upwards, my friends!

Google I/O 2014: Cloud and Android

Manfredas Zabarauskas — Sun, 27 Jul 2014 22:13:30 +0000

It's obvious that cloud and mobile worlds are converging. The majority of interesting mobile applications nowadays are powered by cloud, and new startups are springing up every day to fight for their niche in this transition. Check out HN in your spare time; chances are you'll notice some new cloud-based data storage, a new way to test your apps in the cloud, or a new way to develop mobile apps using cloud-based IDEs. The importance of this convergence has even started changing the direction of old timers like Microsoft.

About a month ago, I had the opportunity to talk about some of the work I've been doing as a Product Manager on Google Cloud Platform to bring together the mobile and cloud worlds at Google I/O 2014. If you're curious to see how you could make your apps more engaging and more interactive using cloud without having to write hundreds of lines of code, check out the video of our session below.

Finally, as an added bonus, here are a few inside pictures from I/O 14! We live in the times of fundamental changes in computing and moonshot ideas, and it was a truly humbling experience to get a glimpse of that together with 6,000 in-person attendees at Moscone Center and over a million people who were watching the I/O keynote online.

Robot Photographer (Part IV, Final)

Manfredas Zabarauskas — Tue, 17 Sep 2013 23:54:01 +0000

The story of Luke, an autonomous robot photographer, is officially completed and accepted to ICRA 2014!

The source code of the implemented robot photographer can be found at https://github.com/manfredzab/robot-photographer and the short description is below. If you need more technical information, take a look at my master's thesis, which contains all of the gory theoretical and implementation details, spread out over 140 pages.

All in all, developing Luke has been amazingly fun, but now it's time to move on to other exciting things. Silicon Valley awaits!

1. Background

Within the field of autonomous robotics and the variety of its application areas, robot photographers serve as excellent low-cost research platforms. They encompass a number of challenges common in robotics research, like task and path planning, locomotion and navigation (including obstacle avoidance), and human subject detection/tracking.

Robot photographers also include multidisciplinary challenges, like the automatic photograph composition (which requires computational understanding of the aesthetics of photography) or Human-Robot Interaction (HRI). As pointed out by Byers et al. [9], robotic photographer platforms are particularly well suited for HRI research, since the general public can easily grasp the overall concept ("it's a robot whose job is to take pictures"), and thus tend to interact with the robot naturally.

Finally, robot photographers show potential in commercial applications (e.g. event photography), since the service costs of an autonomous robot photographer are significantly smaller than those of a professional photographer.

Below a system is described that uses a single Microsoft Kinect sensor for obstacle avoidance and people detection, coupled with a consumer "point-and-shoot" camera for taking the final images, and which outperforms earlier robot photographer approaches.

2. Robot Photographer's Hardware

The developed autonomous robot photographer, Luke (shown below), is built using iClebo Kobuki's base, which has a 4kg payload, an operating time of around 3 hours, and maximum velocity of 65cm/s. Furthermore, Kobuki's base contains three bumpers (left, center and right) which can be used to provide alternatives to vision-based obstacle avoidance. This base is integrated into the Turtlebot 2 open robotics platform.

Luke: An Autonomous Robot Photographer

For its vision, Luke uses a Microsoft Kinect RGB-D sensor, which provides both a 10-bit depth value and VGA resolution color at 30 FPS. The sensor has a combined 57° horizontal and 43° vertical field-of-view.
The Kinect was attached to the Turtlebot's base at an inclination of 10° in order to be able to track upright standing humans at a distance of 1.5-2m. Since this limits low obstacle detection abilities, the linear velocity of the robot is limited to 10cm/s and the bumpers on Kobuki's base are used to provide graceful recovery in the case of collision with a low-lying obstacle.

To take the photographs, Luke uses a simple point-and-shoot Nikon COOLPIX S3100 camera, which has a maximum resolution of 14 megapixels, a built-in flash, and supports automatic exposure/ISO sensitivity/white balance settings. This camera is mounted on a lightweight aluminium König KN-TRIPOD21 tripod (weighing 645 grams), which is attached to the top mounting plate of the robot. The overall size of the robot is approximately 34cm x 135cm x 35cm (W x H x D).

For Luke's state externalization, an HTC HD7 smartphone with a 4.3 inch LCD display was mounted onto the robot. The display has a resolution of 480 x 800 pixels, and is used to display Luke's state messages and to show the QR (Quick Response) codes containing the URLs of the pictures that Luke takes and uploads to Flickr. The smartphone also serves as a wireless hotspot, providing a wireless network connection between Luke's on-board computer and a monitoring/debugging station. Furthermore, it provides the internet connection to the on-board computer (for photo uploading to Flickr) by tethering the phone's 3G/EDGE connection over Wi-Fi.

The on-board ASUS Eee PC 1025C netbook has an Intel Atom N2800 1.6 GHz CPU and 1 GB RAM, provides a battery life of around 3 hours, and weighs just under 1.25kg. It runs the Groovy Galapagos version of the Robot Operating System framework (ROS, [29]) on an Ubuntu 12.04 LTS operating system. All processing (including obstacle avoidance, human subject detection, photographic composition evaluation and so on) is done on this machine.

The robot contains two major power sources: a 2200mAh lithium-ion battery which is enclosed in the Kobuki's base, and a 5200mAh lithium-ion battery installed in the on-board netbook. The table below summarizes how these sources are used to power individual hardware components of the robot.

Hardware component	Power source
On-board netbook	Netbook's battery
Wheel motors	Base's battery
Phone (display and Wi-Fi hotspot)	Netbook's battery
Photographic camera	Netbook's battery
Kinect RGB-D sensor	Combined base and netbook's battery

During the empirical tests of the fully-powered robot, the average discharge time was 3 hours and 6 minutes for the netbook's battery, and 3 hours and 20 minutes for the Kobuki base's battery.

3. Robot Photographer's Software

3.1. Architectural Design

Luke's software uses Brooks' [5] hierarchical levels-of-competence approach. Each of the layers in Luke's software hierarchy is based on the behaviors that Luke can perform:

The base layer allows Luke to aimlessly wander around the environment, while avoiding collisions.
The second layer suppresses the random wandering behavior at certain time intervals (adhering to what [5] called a subsumption architecture), and enables Luke to compose, take and upload photographs.
The final layer enables Luke to externalize his state i) visually, by showing text messages/QR codes on the attached display, and ii) vocally, by reading state messages out loud using text-to-speech software.

This architecture is illustrated in the following figure, and key implementation details are summarized below.

A simplified diagram of Luke's architectural design, showing ROS nodes (red, green, blue and yellow) together with I/O devices (gray rectangles), and the data that is being passed between them (text on the arrows). All nodes with the rp_ prefix (green, blue and yellow) are the results of the work described below; the red nodes are parts of the Kobuki/ROS/GFreenect/Kinect AUX libraries. Yellow, green and blue nodes represent Luke's first, second and third competence layers (corresponding to obstacle avoidance, human tracking/photograph taking and state externalization behaviors).

3.2. Layer I: Random Walking with Collision Avoidance

Luke's capability to randomly wander in the environment without bumping into any static or moving obstacles is implemented in three ROS nodes: rp_obstacle_avoidance, rp_locomotion and rp_navigation, as briefly described below.

Obstacle detection and avoidance is based on [3], chosen due to its computational efficiency and suitability for the random navigation mode which Luke uses to wander around in the environment. It consists of three main steps:

The input point cloud is obtained using the GFreenect library [27], and subsampled using a voxel grid filter.
The Kinect's tilt angle is provided by the Kinect AUX library [13], which returns the readings from the Kinect's accelerometer at 20Hz. This angle, together with a flat-floor assumption, is used to transform the subsampled point cloud.
The region of interest (ROI) in front of the robot (defined by the user) is isolated from the transformed point cloud and a moving average of the ROI's size is calculated. A positive average size generates a turn direction; otherwise the robot moves forward.

To prevent the robot from getting stuck in an oscillating loop when facing a large obstacle, it is prohibited from changing the direction of the turn once it has started turning, as suggested in [3]. Also, an improvement from [28] is used whereby if the unfiltered point cloud is small then it is assumed that the robot is facing a nearby large obstacle, and a turn directive is issued.
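The decision logic described above can be sketched in a few lines of Python. This is only an illustration, not the code running on Luke: the class name, the window length, the "small cloud" threshold and the rule for picking the turn side (turn away from the ROI centroid) are all assumptions made for the example.

```python
from collections import deque


class TurnDecider:
    """Hypothetical helper: chooses between moving forward and turning."""

    def __init__(self, window=5, min_cloud_size=100):
        self.sizes = deque(maxlen=window)     # recent ROI point counts
        self.min_cloud_size = min_cloud_size  # assumed "small unfiltered cloud" threshold
        self.locked_turn = None               # turn direction, once committed

    def step(self, roi_points, unfiltered_size):
        """roi_points: (x, y, z) points inside the ROI, with x > 0 to the robot's right."""
        self.sizes.append(len(roi_points))
        average = sum(self.sizes) / len(self.sizes)

        # Obstacle ahead if the averaged ROI is non-empty, or if the raw cloud
        # is so small that a large obstacle is probably right in front.
        blocked = average > 0 or unfiltered_size < self.min_cloud_size
        if not blocked:
            self.locked_turn = None
            return "forward"

        # Once a turn has started, keep its direction to avoid oscillating.
        if self.locked_turn is None:
            mean_x = sum(p[0] for p in roi_points) / len(roi_points) if roi_points else 0.0
            self.locked_turn = "left" if mean_x > 0 else "right"  # assumed rule
        return self.locked_turn
```

Calling step() once per point cloud frame returns "forward", "left" or "right"; the lock is released only when the path is clear again.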

3.3. Layer II: Taking Well-Composed Photographs of Humans

Luke's second major behavioral competence involves his ability to i) track humans in an unstructured environment, ii) take well-composed pictures of them, and iii) upload these pictures to an on-line picture gallery. This competence layer is implemented by five ROS nodes: rp_head_tracking, rp_framing, rp_camera, rp_uploading and rp_autonomous_photography.

The head detection and tracking node (rp_head_tracking) is the most sophisticated node in Luke's ROS graph. For subject head detection it uses the knowledge-based method of Garstka and Peters [18], extended to cope with multiple people in the image. To improve the head detection results, it employs one of two skin detectors: a Bayesian skin detector by Jones and Rehg [22], or an adaptive skin detector based on a logistic regression classifier with a Gaussian kernel, trained on an on-line skin model obtained from the face regions detected using the Viola and Jones [32] detector. Finally, to exploit the spatial locality of human heads over a sequence of frames, this node uses a depth-based extension of the continuously-adaptive mean-shift algorithm by Bradski [4].

The multiple person extension still scans through a blurred and depth-shadow-filtered depth image one horizontal line at a time, from top to bottom. However, instead of keeping a single potential vertical head axis, a set of vertical head axes $\{ H_1, ..., H_k \}$ is constructed. Each head axis is represented by $H_i = \left( \overline{d}_i, (X, Y)_i \right)$ , where $\overline{d}_i$ is the average candidate head distance from the sensor, and $(x, y) \in (X, Y)_i$ are the image points on the vertical axis.

When a new arithmetic mean of left and right lateral gradients $\overline{x}(y)$ is calculated in the original algorithm, the extended method searches for the head axis $H_j$ such that the last added point $(x', y') \in (X, Y)_j$ is within 5cm distance from the point $(\overline{x}(y), y)$ .

If the pixel $(\overline{x}(y), y)$ satisfies the above constraint then $(X, Y)_j$ is updated by adding the point $(\overline{x}(y), y)$ , and the average head distance $\overline{d}_j$ is recalculated.

A vertical head axis $H_i$ is classified as a detected candidate head if it is closer than 5m, is between 20-30cm tall, and is rotated by less than 35°. A few examples of multiple head detection using this method are shown in figure below.

Multiple head detection examples
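To make the multiple-axis bookkeeping described above more concrete, here is a minimal Python sketch. It assumes that the axis points have already been converted to metres (so the 5cm test is a plain Euclidean distance), that a point matching no existing axis starts a new one, and that head rotation is measured between the first and last point of the axis. None of these details are spelled out above, and the function names are made up for the example.

```python
import math


def assign_to_axis(axes, point, depth, max_gap=0.05):
    """Attach `point` (x, y in metres) to an axis whose last point is within max_gap."""
    for axis in axes:
        if math.dist(axis["points"][-1], point) <= max_gap:
            axis["points"].append(point)
            n = len(axis["points"])
            # Incrementally update the average head distance from the sensor.
            axis["mean_depth"] += (depth - axis["mean_depth"]) / n
            return axis
    # Assumption: an unmatched point starts a new candidate head axis.
    axis = {"points": [point], "mean_depth": depth}
    axes.append(axis)
    return axis


def is_candidate_head(axis):
    """Closer than 5m, 20-30cm tall, rotated by less than 35 degrees."""
    (x0, y0), (x1, y1) = axis["points"][0], axis["points"][-1]
    height = abs(y1 - y0)
    tilt = math.degrees(math.atan2(abs(x1 - x0), height)) if height > 0 else 90.0
    return axis["mean_depth"] < 5.0 and 0.20 <= height <= 0.30 and tilt < 35.0
```

Feeding the per-line points into assign_to_axis() from top to bottom builds the set of axes; is_candidate_head() then filters them using the thresholds quoted above.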

The detected candidate heads are verified using one of two skin detection methods: a Bayesian classifier trained off-line on a very large scale skin/non-skin image dataset [22], or an on-line skin detector trained using skin histograms obtained from a small set of face detections using the Viola and Jones [32] detector. In the first case a histogram-based Bayesian classifier similar to [22] is implemented and trained off-line on the large-scale Compaq skin dataset containing nearly a billion pixels, each hand-labelled as skin or non-skin.

If this skin classification method is used, then a given candidate head detection is accepted/rejected based on the proportion of skin-color pixels in the corresponding RGB image region.

In the second case, many faces are detected in the initialization stage using the frontal and profile face detectors from [32]. For each of the detected face rectangles, a binary mask is applied to segment the image into face oval/background regions, and the pixel hue histograms are assembled in each of the regions. Then these histograms are used as feature vectors in kernel logistic regression (KLR) classifier training. Once this classifier is trained, the depth-based head detections can be verified by applying the same oval binary mask to the detected head rectangle, constructing a hue histogram from the face region, applying the KLR classifier, and thresholding.

To further reduce the computational complexity requirements of head/skin detection methods described above, a depth-data based extension of the continuously adaptive mean-shift tracking algorithm (CAMShift, [4]) is employed to exploit the spatial locality of humans over a sequence of frames. While the original CAMShift algorithm uses the probability distribution obtained from the color hue distribution in the image, in this project it is adapted to use the depth information. In particular, the constraints that [18] use to reject local horizontal minima which could not possibly lie on the vertical head axis, are used to define the following degenerate head probability:

$\text{Pr}(\textit{head}\,|(x, y)) = \left\lbrace \begin{array}{cl} 1, & \text{if pixel }(x, y)\text{ is a local minimum} \\ & \text{in depth image and it satisfies the} \\ & \text{inner/outer head bound constraints}, \\ 0, & \text{otherwise}, \end{array} \right.$

which is then tracked using CAMShift.

The second most important node in Luke's "picture taking" behavioral capability layer is the photograph composition and framing (rp_framing) node. This node works as follows. First of all, it subscribes to the locations of detected/tracked human subject heads in Kinect's image plane, published by the rp_head_tracking node. Then this node maps the head locations from Kinect's image plane to the photographic camera's image plane and calculates the ideal framing based on the framing rules described by Dixon et al. [12]. If the calculated ideal frame lies outside the current photographic camera's image plane, a turn direction is proposed; otherwise, the ideal frame location is published over /rp/framing/frame topic.

In order to map the locations of detected heads from Kinect's to the photographic camera's image plane, the rigid body translation vector is first established between the Kinect sensor and the photographic camera. Then, the photographic camera is undistorted using a "plumb bob" model proposed by [6]. This model simulates (and hence can be used to correct) both the radial distortion caused by the spherical shape of the lens, and the tangential distortion arising from the inaccuracies of the assembly process. Finally, the approach of [34], [35] is used to estimate the camera intrinsics (viz. camera's focal length and principal point offsets). Then, any point in Kinect's world space can be projected into the photographic camera's image space.
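The last step above boils down to a standard pinhole projection. Here is a minimal sketch, assuming that the image has already been undistorted and that the two sensors differ only by a translation; the function name and the numbers in the usage note are made up for illustration and are not Luke's calibration values.

```python
def project_to_camera(point, translation, fx, fy, cx, cy):
    """Project a 3D point from Kinect space onto the camera's image plane.

    point, translation: (x, y, z) in metres; fx, fy: focal lengths in pixels;
    cx, cy: principal point offsets in pixels.
    """
    # Move the point into the photographic camera's coordinate frame.
    x, y, z = (p + t for p, t in zip(point, translation))
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Pinhole model: perspective divide, then scale and shift by the intrinsics.
    return (fx * x / z + cx, fy * y / z + cy)
```

For example, with hypothetical intrinsics fx = fy = 1000 and a principal point at (640, 480), a head located 2m straight ahead of the camera lands exactly on the principal point.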

Using this approach, the 3D locations of the detected heads (provided by the rp_head_tracking node) are projected onto the photographic camera's image plane. Then, based on these locations the ideal framing for the photographs is calculated using the photograph composition heuristics proposed by [12]. These heuristics are based on the following four photographic composition rules [21]:

Rule of thirds, which suggests that the points of interest in the scene should be placed at the intersections of (or along) the lines which break the image into horizontal and vertical thirds.
No middle rule, which states that a single subject should not be placed at a vertical middle line of the photograph.
No edge rule, which states that the edges of an ideal frame should not be crossing through the human subjects.
Occupancy ("empty space") rule, which suggests that approximately a third of the image should be occupied by the subject of the photograph.

Given these rules, [12] define three different heuristics for single person and wide/narrow group picture composition. In order to choose which heuristic will be used they employ an iterative procedure. It starts by identifying a human subject closest to the center of the current image and calculating the ideal framing for this person using the single person composition heuristic. If this frame includes other candidate subjects, the group framing rules are applied iteratively on the expanded candidate set, until no new candidates are added.

After an ideal frame F is calculated, the rp_framing node calculates the overlap coefficient between the part of the frame visible in the current image and the whole frame.

If this coefficient exceeds a given threshold and the visible part of the frame exceeds the minimal width/height thresholds $\theta_w \times \theta_h$ , the node considers that a satisfactory composition has been achieved and publishes the position/size of the ideal frame over the /rp/framing/frame topic.

Otherwise, the framing node determines the direction of where the robot should turn in order to improve the quality of the composition (based on the position of the ideal frame's center w.r.t. the image's center) and publishes these driving directions over the /rp/framing/driving_direction topic. In order to prevent the robot from getting stuck indefinitely while trying to achieve an ideal framing, a decaying temporal threshold for the minimum required overlap is also used. In the current robot photographer's implementation, the framing time limit is set to 60 seconds, the maximum deviation from the ideal overlap is set to 50% and the minimum visible frame size thresholds $\theta_w \times \theta_h$ are set to 2160 x 1620 pixels.
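The acceptance test described above can be sketched as follows. This is a rough illustration under a few assumptions: frames are axis-aligned (x, y, width, height) rectangles in pixels, the overlap coefficient is the visible area divided by the whole frame's area, and the required overlap decays linearly from 100% to 50% over the 60 second limit (the exact shape of the decay is not stated above). All function names are hypothetical.

```python
def visible_size(frame, image_w, image_h):
    """Width and height of the part of `frame` that lies inside the image."""
    x, y, w, h = frame
    vis_w = min(x + w, image_w) - max(x, 0)
    vis_h = min(y + h, image_h) - max(y, 0)
    return max(vis_w, 0), max(vis_h, 0)


def overlap_coefficient(frame, image_w, image_h):
    """Visible area of the ideal frame divided by its total area."""
    vis_w, vis_h = visible_size(frame, image_w, image_h)
    return (vis_w * vis_h) / (frame[2] * frame[3])


def required_overlap(elapsed, time_limit=60.0, max_deviation=0.5):
    """Assumed linear decay of the minimum overlap over the framing time limit."""
    progress = min(max(elapsed / time_limit, 0.0), 1.0)
    return 1.0 - max_deviation * progress


def composition_ok(frame, image_w, image_h, elapsed, min_w=2160, min_h=1620):
    """True if the frame is visible enough to take the picture now."""
    vis_w, vis_h = visible_size(frame, image_w, image_h)
    enough_overlap = overlap_coefficient(frame, image_w, image_h) >= required_overlap(elapsed)
    return enough_overlap and vis_w >= min_w and vis_h >= min_h
```

With an assumed 4320 x 3240 pixel image, a full-size frame shifted half-way out of view is rejected at first, but becomes acceptable once the time limit has relaxed the required overlap to 50%.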

The rp_autonomous_photography node coordinates the actual photograph taking/uploading process, and divides the robot's control time between the obstacle avoidance (rp_obstacle_avoidance) and framing (rp_framing) nodes.

The photograph taking node (rp_camera) acts as an interface between other ROS nodes and the physical Nikon COOLPIX S3100 camera that Luke uses to take pictures. It talks to the camera through the libgphoto2 API of the open-source gPhoto² library [2], which in turn connects to the camera using the Picture Transfer Protocol (PTP). This node provides access to the camera for the rest of Luke's ROS graph by exposing a ROS service at /rp/camera/photo. Any other ROS node can send an empty request to this service, which the rp_camera node transforms into a photo capture request for the libgphoto2 API. This request triggers a physical camera capture, storing the taken picture in the camera's built-in memory. After the picture is taken, the rp_camera node moves the picture from the camera's memory to the on-board computer and returns the file name of the downloaded picture (as a string) via the service response.

The photograph uploading node (rp_uploader) uses the Python Flickr API [31] to upload image files to an online Flickr photo gallery. It exposes the Flickr API to the rest of the ROS graph by providing the /rp/uploader/upload ROS service. The internet connection required for the picture uploading is provided by the on-board HTC HD7 phone (which also acts as the robot's state display) by tethering the phone's 3G/EDGE data connection over Wi-Fi to the on-board netbook, which runs Luke's entire ROS graph.

3.4. Layer III: Externalization of the Current State via Vocal and Visual Messages

Luke's third and final behavioral competence involves its ability to externalize its current state via synthesized voice messages (played over the on-board computer's speakers), and text messages/QR codes (shown on the display of the attached HTC HD7 phone). The state externalization node (rp_state_externalization) subscribes to the status outputs from all major nodes in Luke's ROS graph, in particular, the locomotion (rp_locomotion), head tracking (rp_head_tracking), framing (rp_framing) and photography process control (rp_autonomous_photography) nodes.

In order to produce the robot's state messages (which are later vocalized/displayed by rp_speech and rp_display nodes) the state externalization node uses a table of pre-defined text messages, indexed by the states of four major nodes listed above. If the table contains more than one message for a given collection of states, then the message to be produced is chosen uniformly at random from the matching messages.
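The message lookup described above is essentially a dictionary keyed by a tuple of node states. Here is a minimal sketch; the state names and message texts are invented for illustration and are not the ones used on the robot.

```python
import random

# Hypothetical table: (locomotion, head tracking, framing, photography) -> messages
MESSAGES = {
    ("wandering", "no_heads", "idle", "exploring"): [
        "Looking for someone to photograph.",
        "Is anybody around?",
    ],
    ("stopped", "tracking", "framed", "taking_photo"): [
        "Smile!",
    ],
}


def pick_message(states, table=MESSAGES, rng=random):
    """Return a uniformly random message matching `states`, or None if none match."""
    candidates = table.get(tuple(states), [])
    return rng.choice(candidates) if candidates else None
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which is handy when testing.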

In the current implementation, an HTC HD7 phone is used to show the received messages. This phone has a 4.3 inch, 480 x 800 pixel LCD display, and runs the Windows Phone (WP) 7.8 operating system. To show the messages generated by the rp_state_externalization node, a WP OS app connects to the rp_display node over TCP and renders received text messages in full-screen mode. If a hyperlink is present within the received text message then this app also generates and renders a QR (Quick Response) code. This makes it easier for the humans in the robot's vicinity to follow this link, since any modern phone can use its camera to automatically read QR codes.

To vocalize the text messages sent by the rp_state_externalization node, the rp_speech node uses the open-source eSpeak [14] speech synthesis engine, which in turn is configured to use a formant synthesis based approach as described by Klatt [24]. Since this method does not need a database of speech samples and uses computationally cheap digital signal filters, the resulting text-to-speech engine is both memory and CPU efficient, making it highly appropriate for use in a mobile robot.

A few examples of pictures taken by this robot together with the average ratings assigned by independent judges are shown in the figure below.

Luke's photograph examples

4. References

[2] Christophe Barbe, Hubert Figuiere, Hans Ulrich Niedermann, Marcus
Meissner, and Scott Fritzinger. gPhoto2: Digital Camera Software,
2002.

[3] Sol Boucher. Obstacle Detection and Avoidance using TurtleBot
Platform and XBox Kinect. Technical report, Rochester Institute of
Technology, 2012.

[4] Gary R Bradski. Computer Vision Face Tracking For Use in a
Perceptual User Interface. Intel Technology Journal, (Q2):214–219,
1998.

[5] R Brooks. A Robust Layered Control System for a Mobile Robot.
IEEE Journal of Robotics and Automation, 2(1):14–23, 1986.

[6] Duane C Brown. Decentering Distortion of Lenses. Photogrammetric
Engineering, 32(3):444–462, 1966.

[9] Zachary Byers, Michael Dixon, Kevin Goodier, Cindy M Grimm, and
William D Smart. An Autonomous Robot Photographer. Intelligent
Robots and Systems, 3:2636–2641, 2003.

[12] Michael Dixon, C Grimm, and W Smart. Picture Composition for a
Robot Photographer. Technical report, Washington University in St.
Louis, 2003.

[13] Ivan Dryanovski, William Morris, Stéphane Magnenat, Radu Bogdan
Rusu, and Patrick Mihelich. Kinect AUX Driver for ROS, 2011.

[14] Jonathan Duddington. eSpeak: Speech Synthesizer, 2006.

[18] Jens Garstka and Gabriele Peters. View-dependent 3D Projection using
Depth-Image-based Head Tracking. In Proceedings of the 8th IEEE
International Workshop on Projector-Camera Systems, pages 52–58,
2011.

[21] Tom Grill and Mark Scanlon. Photographic Composition. American
Photographic Book Publishing, Orlando, FL, 1990.

[22] Michael Jones and James M Rehg. Statistical Color Models with
Application to Skin Detection. International Journal of Computer
Vision, 46(1):81–96, 2002.

[24] D H Klatt. Software for a Cascade/Parallel Formant Synthesizer.
Journal of the Acoustical Society of America, 67(3):971–995, 1980.

[27] OpenKinect. GFreenect Library, 2012.

[28] Brian Peasley and Stan Birchfield. Real-Time Obstacle Detection and
Avoidance in the Presence of Specular Surfaces using an Active 3D
Sensor. In 2013 IEEE Workshop on Robot Vision (WORV), pages
197–202, January 2013.

[29] Morgan Quigley, Brian Gerkey, Ken Conley, Josh Faust, Tully Foote,
Jeremy Leibs, Eric Berger, Rob Wheeler, and Andrew Ng. ROS: an
Open-Source Robot Operating System. In Proceedings of the 2009
IEEE International Conference on Robotics and Automation, 2009.

[31] Sybren A. Stuvel. Python Flickr API, 2007.

[32] Paul Viola and Michael Jones. Rapid Object Detection Using a
Boosted Cascade of Simple Features. In Proceedings of the 2001
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pages 511–518, 2001.

[34] Zhengyou Zhang. Flexible Camera Calibration by Viewing a Plane
from Unknown Orientations. In Proceedings of the 7th IEEE International
Conference on Computer Vision, pages 666–673, 1999.

[35] Zhengyou Zhang. A Flexible New Technique for Camera Calibration.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(11):1330–1334, 2000.

Robot Photographer (Part III)

Manfredas Zabarauskas — Mon, 29 Jul 2013 21:43:38 +0000

The autonomous robot photographer is officially done, and it has already successfully completed its first real-world assignment!

The robot has been deployed during the open-days in Oxford's Computer Science Department. It ran for nearly three hours, taking 103 photos (approximately one photo every two minutes). A short video from this deployment is shown below.

Now, for some good news (at least for some of you, anyway). I will be producing a detailed write-up of the robot's implementation as part of my Master's dissertation. As soon as that is completed (in a month or so), I will release both the dissertation and the source code for all software that is powering the robot! Given that the robot is built using low-cost off-the-shelf hardware parts, I hope that it will enable some of you to create even more awesome autonomous robots!

Robot photographer development: real-world deployment

Robot Photographer (Part II)

Manfredas Zabarauskas — Thu, 27 Jun 2013 21:14:33 +0000

A new milestone for the robot photographer! Here's a list of new things that the robot is capable of:

It has been taught to detect and track humans in Kinect's depth image!
Depth and photographic cameras have been aligned (camera intrinsics and extrinsics have been calibrated). This effectively means that given a point on one image plane (e.g. in the depth image), a robot is able to tell where the corresponding point would be on the other image plane (e.g. in a photograph).
Composition and framing modules have been implemented, i.e. now the robot knows what a "good" picture looks like.

All in all, this means that the "beta" version of the robot is now fully functional. Here's a new video of it in action!

Robot photographer development: human head tracking and picture framing

Next step: real world deployment!

Robot Photographer (Part I)

Manfredas Zabarauskas — Wed, 19 Jun 2013 17:43:13 +0000

For a few weeks now I have been working on a new idea: an autonomous robot photographer.

My ultimate goal is to build a fully autonomous robot which could take well-composed pictures in various social events: conferences, open days, weddings and so on.

In the process I am also hoping to build a scalable system that could be extended to more complicated tasks and, potentially, used as a teaching tool at an undergraduate level.

I am planning to open-source both the software and the construction details, therefore I am using only widely accessible and inexpensive parts, and only open-source software.

My preliminary "ingredients" list for the robot includes:

a Microsoft Kinect sensor to detect humans and avoid obstacles in the environment,
a cheap point-and-shoot camera for taking the actual pictures, and
a Turtlebot spec compliant Kobuki robot base for the actual locomotion.

I will update the blog with more development details, but in the meantime, here's a video of the first development milestone: autonomous navigation and obstacle avoidance.

Robot photographer development: obstacle avoidance

Expectation-Maximization Algorithm for Bernoulli Mixture Models (Tutorial)

Manfredas Zabarauskas — Tue, 12 Feb 2013 03:05:53 +0000

Even though the title is quite a mouthful, this post is about two really cool ideas:

A solution to the "chicken-and-egg" problem (known as the Expectation-Maximization method, described by A. Dempster, N. Laird and D. Rubin in 1977), and
An application of this solution to automatic image clustering by similarity, using Bernoulli Mixture Models.

For the curious, an implementation of the automatic image clustering is shown in the video below. The source code (C#, Windows x86/x64) is also available for download!

Automatic clustering of handwritten digits from MNIST database using Expectation-Maximization algorithm

While automatic image clustering nicely illustrates the E-M algorithm, E-M has been successfully applied in a number of other areas: I have seen it being used for word alignment in automated machine translation, valuation of derivatives in financial models, and gene expression clustering/motif finding in bioinformatics.

As a side note, the notation used in this tutorial closely matches the one used in Christopher M. Bishop's "Pattern Recognition and Machine Learning". This should hopefully encourage you to check out his great book for a broader understanding of E-M, mixture models or machine learning in general.

Alright, let's dive in!

1. Expectation-Maximization Algorithm

Imagine the following situation. You observe some data set $\mathbf{X}$ (e.g. a bunch of images). You hypothesize that these images are of $K$ different objects... but you don't know which images represent which objects.

Let $\mathbf{Z}$ be a set of latent (hidden) variables, which tell precisely that: which images represent which objects.

Clearly, if you knew $\mathbf{Z}$ , you could group images into the clusters (where each cluster represents an object), and vice versa, if you knew the groupings you could deduce $\mathbf{Z}$ . A classical "chicken-and-egg" problem, and a perfect target for an Expectation-Maximization algorithm.

Here's a general idea of how the E-M algorithm tackles it. First of all, all images are assigned to clusters arbitrarily. Then we use this assignment to modify the parameters of the clusters (e.g. we change what object is represented by that cluster) to maximize the clusters' ability to explain the data; after which we re-assign all images to the expected most-likely clusters. Wash, rinse, repeat, until the assignment explains the data well enough (i.e. images from the same clusters are similar enough).

(Notice the words "maximize" and "expected" in the previous paragraph: this is where the maximization and expectation stages in the E-M algorithm come from.)

To formalize (and generalize) this a bit further, say that you have a set of model parameters $\mathbf{\theta}$ (in the example above, some sort of cluster descriptions).

To solve the problem of cluster assignments we effectively need to find model parameters $\mathbf{\theta'}$ that maximize the likelihood of the observed data $\mathbf{X}$ , or, equivalently, the model parameters that maximize the log likelihood:

$\mathbf{\theta'} = \underset{\mathbf{\theta}}{\text{arg max }} \ln \,\text{Pr} (\mathbf{X} | \mathbf{\theta}).$

Using some simple algebra we can show that for any latent variable distribution $q(\mathbf{Z})$ , the log likelihood of the data can be decomposed as
\begin{align}
\ln \,\text{Pr}(\mathbf{X} | \theta) = \mathcal{L}(q, \theta) + \text{KL}(q || p), \label{eq:logLikelihoodDecomp}
\end{align}
where $\text{KL}(q || p)$ is the Kullback-Leibler divergence between $q(\mathbf{Z})$ and the posterior distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ , and
\begin{align}
\mathcal{L}(q, \theta) := \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right)
\end{align}
with $\mathcal{L}(\theta) := \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\theta})$ being the "complete-data" log likelihood (i.e. log likelihood of both observed and latent data).

To understand what the E-M algorithm does in the expectation (E) step, observe that $\text{KL}(q || p) \geq 0$ for any $q(\mathbf{Z})$ and hence $\mathcal{L}(q, \theta)$ is a lower bound on $\ln \,\text{Pr}(\mathbf{X} | \theta)$ .

Then, in the E step, the gap between the $\mathcal{L}(q, \theta)$ and $\ln \,\text{Pr}(\mathbf{X} | \theta)$ is minimized by minimizing the Kullback-Leibler divergence $\text{KL}(q || p)$ with respect to $q(\mathbf{Z})$ (while keeping the parameters $\theta$ fixed).

Since $\text{KL}(q || p)$ is minimized at $\text{KL}(q || p) = 0$ when $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ , at the E step $q(\mathbf{Z})$ is set to the conditional distribution $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ .
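The decomposition above is easy to check numerically on a toy problem. The following is a minimal Python sketch, assuming a single observation and a latent variable with three possible values; the joint probabilities are made-up numbers and the function names are mine. It shows that $\mathcal{L}(q, \theta) + \text{KL}(q || p)$ equals $\ln \,\text{Pr}(\mathbf{X} | \theta)$ for an arbitrary $q$ , and that the gap closes when $q$ is set to the posterior.

```python
import math

def lower_bound(q, joint):
    """L(q, theta) = sum_Z q(Z) * (ln Pr(X, Z | theta) - ln q(Z))."""
    return sum(qz * (math.log(pz) - math.log(qz))
               for qz, pz in zip(q, joint) if qz > 0)

def kl_divergence(q, p):
    """KL(q || p) = sum_Z q(Z) * ln(q(Z) / p(Z))."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

# Hypothetical joint probabilities Pr(X = x, Z = z | theta) for one fixed
# observation x and three possible values of the latent variable z.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # Pr(X | theta)
posterior = [p / evidence for p in joint]    # Pr(Z | X, theta)

q_arbitrary = [0.5, 0.3, 0.2]
for q in (q_arbitrary, posterior):
    bound = lower_bound(q, joint)
    gap = kl_divergence(q, posterior)
    print(f"L(q, theta) = {bound:.4f}, KL = {gap:.4f}, sum = {bound + gap:.4f}")
print(f"ln Pr(X | theta) = {math.log(evidence):.4f}")
```

For the arbitrary $q$ the lower bound sits strictly below the log likelihood, while for the posterior the KL term vanishes and the bound is tight.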

To update the model parameters in the M step, the lower bound $\mathcal{L}(q, \theta)$ is maximized with respect to the parameters $\theta$ (while keeping $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta)$ fixed; notice that $\theta$ in this equation corresponds to the old set of parameters, hence to avoid confusion let $q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})$ ).

The maximization of $\mathcal{L}(q, \theta)$ w.r.t. $\theta$ at the M step can be rewritten as
\begin{align*}
\theta^\text{new} &= \underset{\mathbf{\theta}}{\text{arg max }} \left. \mathcal{L}(q, \theta) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \left. \sum_{\mathbf{Z}} q(\mathbf{Z}) \left( \mathcal{L}(\theta) - \ln q(\mathbf{Z}) \right) \right|_{q(\mathbf{Z}) = \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old})} \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \left( \mathcal{L}(\theta) - \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \right) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - \sum_{\mathbf{Z}} \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \ln \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \theta^\text{old}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] - (C \in \mathbb{R}) \\
&= \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right],
\end{align*}

i.e. in the M step the expectation of the joint log likelihood of the complete data is maximized with respect to the parameters $\theta$ .

So, just to summarize,

Expectation step: $q^{t + 1}(\mathbf{Z}) \leftarrow \,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta}^t)$
Maximization step: $\mathbf{\theta}^{t + 1} \leftarrow \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\theta}^t} \left[ \mathcal{L}(\theta) \right]$ (where the superscript in $\mathbf{\theta}^t$ indicates the value of parameter $\mathbf{\theta}$ at iteration $t$ ).

Phew. Let's go to the image clustering example, and see how all of this actually works.

2. Bernoulli Mixture Models for Image Clustering

First of all, let's represent the image clustering problem in a more formal way.

2.1. Formal description

Say that we are given $N$ same-sized training images $\mathbf{x_n} = (x_{n,1}, ..., x_{n,D})^T$ for $n \in \{1, ..., N \}$ , each image containing $D$ binary pixels (i.e. $x_{n,i} \in \{ 0, 1 \}$ ).

Assuming that the pixels are conditionally independent of each other given the component (i.e. that $x_{n, i}$ is conditionally independent from $x_{n, j \neq i}$ for each $i, j \in \{ 1, ..., D \}$ ), the probability distribution of the pixel $i$ over all images belonging to a component $k$ can be modelled using a Bernoulli distribution with a parameter $0 \leq \mu_{k, i} \leq 1$ .

To incorporate some prior knowledge about the image assignment to $K$ clusters (e.g. the proportions of images in each cluster), the assignments can be treated as being sampled from a multinomial (categorical) distribution with the parameters $\pi_1, ..., \pi_K$ (where $0 \leq \pi_i \leq 1$ , $\sum_{i = 1}^K \pi_i = 1$ ). Each $\pi_i$ for $i \in \{1, ..., K\}$ is called a mixing coefficient of cluster $i$ .

Let's say that the model parameters include the pixel distributions of each cluster and the prior knowledge about the image assignments, i.e. $\theta = (\mathbf{\mu}, \mathbf{\pi})$ , where $\mathbf{\mu} := (\mathbf{\mu_1} \; \mathbf{\mu_2} \;... \;\mathbf{\mu_K} ) = \left( \begin{array}{cccc} \mu_{1, 1} & \mu_{2, 1} & ... & \mu_{K, 1} \\ \mu_{1, 2} & \mu_{2, 2} & ... & \mu_{K, 2} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{1, D} & \mu_{2, D} & ... & \mu_{K, D} \\ \end{array} \right)$ and $\mathbf{\pi} := ( \pi_1, ..., \pi_K )^T$ .

Then, the likelihood of a single training image $\mathbf{x}$ is
\begin{align}
\,\text{Pr}(\mathbf{x} | \theta) = \,\text{Pr}(\mathbf{x} | \mathbf{\mu}, \mathbf{\pi}) = \sum_{k = 1}^K \pi_k \,\text{Pr}(\mathbf{x}|\mathbf{\mu_k})
\end{align}
where the probability that $\mathbf{x}$ is generated by cluster $k$ can be written as
\begin{align}
\,\text{Pr}(\mathbf{x}|\mathbf{\mu_k}) = \prod_{i = 1}^D \mu_{k, i}^{x_i} (1 - \mu_{k, i})^{1 - x_i}.
\end{align}
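To make the two formulas above concrete, here is a minimal Python sketch that evaluates them for a toy model. The parameter values, the image and the function names are hypothetical; `mu[k]` holds the vector $\mathbf{\mu_k}$ (i.e. one column of the matrix $\mathbf{\mu}$ defined earlier).

```python
def bernoulli_likelihood(x, mu_k):
    """Pr(x | mu_k) = prod_i mu_k[i]^x[i] * (1 - mu_k[i])^(1 - x[i]) for binary x."""
    p = 1.0
    for x_i, m in zip(x, mu_k):
        p *= m if x_i == 1 else 1.0 - m
    return p

def mixture_likelihood(x, mu, pi):
    """Pr(x | mu, pi) = sum_k pi[k] * Pr(x | mu[k])."""
    return sum(pi_k * bernoulli_likelihood(x, mu_k) for pi_k, mu_k in zip(pi, mu))

# Hypothetical toy model: K = 2 clusters over D = 3 binary pixels.
mu = [[0.9, 0.8, 0.1],   # cluster 1: the first two pixels are usually on
      [0.2, 0.1, 0.7]]   # cluster 2: the last pixel is usually on
pi = [0.6, 0.4]
x = [1, 1, 0]

print(bernoulli_likelihood(x, mu[0]))   # 0.9 * 0.8 * (1 - 0.1)
print(mixture_likelihood(x, mu, pi))
```

Since the mixture is a proper distribution over binary images, summing `mixture_likelihood` over all $2^D$ possible images gives one, which is a handy sanity check.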

To model the assignment of images to clusters, associate a latent $K$ -dimensional binary random variable $\mathbf{z_i}$ with each of the training examples $\mathbf{x_i}$ . Say that $\mathbf{z_i}$ has a 1-of- $K$ representation, i.e. for $\mathbf{z_i} := (z_{i, 1}, ..., z_{i, K})^T$ it must be the case that $z_{i, j} \in \{0, 1\}$ for $i \in \{ 1, ..., N \}, j \in \{ 1, ..., K \}$ and $\sum_{j = 1}^{K} z_{i, j} = 1$ .

Furthermore, let the marginal distribution over $\mathbf{z_i}$ be specified in terms of mixing coefficients $\mathbf{\pi}$ s.t. $\,\text{Pr}(z_{i, j} = 1) = \pi_j$ , then
\begin{align}
\,\text{Pr}(\mathbf{z_n} | \mathbf{\pi}) = \prod_{i = 1}^K \pi_i^{z_{n, i}}.
\end{align}

Similarly, let $\,\text{Pr}(\mathbf{x_n} | z_{n, k} = 1) = \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})$ , then
\begin{align}
\,\text{Pr}(\mathbf{x_n} | \mathbf{z_n}, \mathbf{\mu}, \mathbf{\pi}) = \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}}.
\end{align}

By combining all latent variables $\mathbf{z_i}$ into a set $\mathbf{Z} := \{ \mathbf{z_1}, ..., \mathbf{z_N} \}$ , we can write
\begin{equation} \label{eq:probZ}
\begin{split}
\,\text{Pr}(\mathbf{Z}|\mathbf{\pi}) &= \prod_{n = 1}^N \,\text{Pr}(\mathbf{z_n}|\mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \pi_k^{z_{n, k}},
\end{split}
\end{equation}
and, similarly, combining all training images $\mathbf{x_i}$ into a set $\mathbf{X} := \{ \mathbf{x_1}, ..., \mathbf{x_N} \}$ , we can express the conditional distribution of the training data given the latent variables as
\begin{equation} \label{eq:probXgivZ}
\begin{split}
\,\text{Pr}(\mathbf{X}|\mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) &= \prod_{n = 1}^N \,\text{Pr}(\mathbf{x_n}|\mathbf{z_n},\mathbf{\mu},\mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}} \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \left( \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}}.
\end{split}
\end{equation}

From the last two equations and the probability chain rule, the complete data likelihood can be written as:
\begin{equation} \label{eq:probXandZ}
\begin{split}
\,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) &= \,\text{Pr}(\mathbf{X} | \mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) \,\text{Pr}(\mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) \\
&= \prod_{n = 1}^N \prod_{k = 1}^K \left( \pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}},
\end{split}
\end{equation}

and thus the complete data log likelihood $\mathcal{L}(\theta)$ can be obtained by taking a log of the equation above:
\begin{equation}
\begin{split}
\mathcal{L}(\theta) &= \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) \\
&= \sum_{n = 1}^N \sum_{k = 1}^K z_{n, k} \left( \ln \pi_k + \sum_{i = 1}^D \left[ x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right] \right).
\end{split}
\end{equation}

(Still following? Great. Take five, and below we will derive the E and M step update equations.)

2.2. E-M update equations for BMMs

In order to update the latent variable distribution (i.e. image assignment to clusters) at the expectation step, we need to set the probability distribution of $\textbf{Z}$ to $\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta})$ .

Because the complete-data log likelihood $\mathcal{L}(\theta)$ is linear in the latent variables $z_{n, k}$ , the M step only needs their expected values under this distribution. Hence at the E step it is enough to replace the current values of $z_{n, k}$ with the expected ones:

\begin{equation} \label{eq:z}
\begin{split}
z_{n, k}^\text{new} \leftarrow \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}, \mathbf{\pi}}[z_{n, k}] &= \sum_{z_{n, k}} \,\text{Pr}(z_{n,k} | \mathbf{x_n}, \mathbf{\mu}, \mathbf{\pi}) \, z_{n,k}\\
&= \frac{\pi_k \,\text{Pr}(\mathbf{x_n} |\mathbf{\mu_k})}{\sum_{m = 1}^K \pi_m \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_m})} \\
&= \frac{\pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} }{\sum_{m = 1}^K \pi_m \prod_{i = 1}^D \mu_{m, i}^{x_{n, i}} (1 - \mu_{m, i})^{1 - x_{n, i}}}.
\end{split}
\end{equation}

(Notice that after this update $\mathbf{z_{n}}^\text{new}$ is no longer represented as 1-of- $K$ vector, i.e. the same image can be "partially" assigned to multiple clusters.)
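The E step update can be sketched in a few lines of Python. This is an illustration rather than the C# implementation mentioned at the top of the post; the function name and the toy values are hypothetical. One addition that is not part of the derivation: the products are evaluated in log space and normalized with the log-sum-exp trick, because multiplying hundreds of pixel probabilities directly tends to underflow. The sketch assumes $0 < \mu_{k, i} < 1$ and $\pi_k > 0$ so that all logarithms are finite.

```python
import math

def responsibilities(X, mu, pi):
    """E step: z[n][k] = pi_k Pr(x_n | mu_k) / sum_m pi_m Pr(x_n | mu_m)."""
    Z = []
    for x in X:
        # Unnormalized log posterior of each cluster k for this image.
        log_w = []
        for pi_k, mu_k in zip(pi, mu):
            s = math.log(pi_k)
            for x_i, m in zip(x, mu_k):
                s += math.log(m) if x_i == 1 else math.log(1.0 - m)
            log_w.append(s)
        # Log-sum-exp trick: subtract the maximum before exponentiating.
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        Z.append([v / total for v in w])
    return Z

# Hypothetical toy model: K = 2 clusters, D = 3 pixels, N = 2 images.
mu = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]
pi = [0.6, 0.4]
X = [[1, 1, 0], [0, 0, 1]]
Z = responsibilities(X, mu, pi)
for x, z in zip(X, Z):
    print(x, z)
```

Each row of `Z` sums to one, and the first image is assigned almost entirely to the first cluster while the second goes to the second cluster.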

In the maximization step we need to update the model parameters (i.e. the mixing coefficients and the pixel distributions) by maximizing the expected complete-data log likelihood, using the update equation from earlier

$\mathbf{\theta}^\text{new} \leftarrow \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right].$

Observe that
\begin{align}
\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \sum_{k = 1}^K \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, k} \right] \left( \ln \pi_k + \sum_{i = 1}^D \left[ x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right] \right).
\end{align}
The equation above can be maximized w.r.t. $\mathbf{\mu_k}$ by simply setting its derivative to zero:
\begin{align}
\frac{\partial}{\partial \mu_{m, j}} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, m} \right] \left( \frac{x_{n, j}}{\mu_{m, j}} - \frac{1 - x_{n, j}}{1 - \mu_{m, j}} \right) \\
&= \sum_{n = 1}^N z_{n, m}^\text{new} \frac{x_{n, j} - \mu_{m, j}}{\mu_{m, j} (1 - \mu_{m, j})} = 0 \Leftrightarrow \\
\mu_{m, j} &= \frac{1}{N_m} \sum_{n = 1}^N x_{n, j} z_{n, m}^\text{new},
\end{align}
where $N_m = \sum_{n = 1}^N z_{n, m}^\text{new}$ is the effective number of images assigned to cluster $m$ .

Then the full cluster $m$ pixel distribution vector $\mathbf{\mu_m}$ can be written as

$\mathbf{\mu_m} = \mathbf{\bar{x}_m},$

where $\mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m}^\text{new} \mathbf{x_n}$ is the weighted mean of the images associated with cluster $m$ .

To maximize $\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right]$ w.r.t. the mixing coefficients $\mathbf{\pi}$ (subject to the constraint $\sum_{k = 1}^K \pi_k = 1$ ) we can use the Lagrange multipliers, yielding the following optimization problem:
\begin{equation*}
\Lambda(\theta, \lambda) := \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] + \lambda \left( \sum_{k = 1}^K \pi_k - 1 \right).
\end{equation*}
The optimizing solution can then be found again with simple partial derivatives:
\begin{align}
\frac{\partial}{\partial \pi_{m}} \Lambda(\theta, \lambda) &= \frac{1}{\pi_m} \sum_{n = 1}^N z_{n,m}^\text{new} + \lambda = 0 \Leftrightarrow \\
\pi_m &= -\frac{N_m}{\lambda},
\end{align}
\begin{align}
\frac{\partial}{\partial \lambda} \Lambda(\theta, \lambda) &= \sum_{k = 1}^K \pi_k - 1 = 0 \Leftrightarrow \\
\sum_{k = 1}^K \pi_k &= 1.
\end{align}
Combining these two results (and using $\sum_{k = 1}^K z_{n, k}^\text{new} = 1$ for every image $n$ ) gives $\lambda = - \sum_{k = 1}^K N_k = - N$ , and thus

$\pi_m = -\frac{N_m}{\lambda} = \frac{N_m}{N}.$

Done!
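Both M step updates are just weighted averages, as the following minimal Python sketch shows. The images, the soft assignments and the function name are hypothetical, and the sketch assumes that every cluster has a non-zero effective size $N_m$ .

```python
def m_step(X, Z):
    """M step: mu_m = weighted mean of the images, pi_m = N_m / N."""
    N, D, K = len(X), len(X[0]), len(Z[0])
    mu, pi = [], []
    for m in range(K):
        N_m = sum(Z[n][m] for n in range(N))   # effective cluster size
        mu.append([sum(Z[n][m] * X[n][j] for n in range(N)) / N_m
                   for j in range(D)])
        pi.append(N_m / N)
    return mu, pi

# Hypothetical data: four 3-pixel images with soft assignments to K = 2 clusters.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
Z = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
mu, pi = m_step(X, Z)
print(mu)
print(pi)
```

Here both clusters end up with an effective size of two images, so the mixing coefficients are equal, and the first pixel of the first cluster gets $\mu_{1, 1} = (0.9 + 0.8) / 2 = 0.85$ .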

2.3. Summary

In summary, the update equations for Bernoulli Mixture Models using E-M are:

Expectation step:

$z_{n, k} \leftarrow \frac{\pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} }{\sum_{m = 1}^K \pi_m \prod_{i = 1}^D \mu_{m, i}^{x_{n, i}} (1 - \mu_{m, i})^{1 - x_{n, i}}}.$

Maximization step:

$\mathbf{\mu_m} \leftarrow \mathbf{\bar{x}_m},$

$\pi_m \leftarrow \frac{N_m}{N},$

where $\mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m} \mathbf{x_n}$ and $N_m = \sum_{n = 1}^N z_{n, m}$ .
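Putting the summary together, a complete E-M loop for a Bernoulli mixture could look roughly like the Python sketch below. It is a toy illustration, not the C# implementation from the video. Several details are my own assumptions rather than part of the derivation: the random initialization of $\mathbf{\mu}$ in $[0.25, 0.75]$ , the fixed number of iterations, the log-space E step, and the small `eps` clamp that keeps $\mu_{k, i}$ away from 0 and 1 (and $N_m$ away from 0) so that the logarithms stay finite.

```python
import math
import random

def e_step(X, mu, pi):
    """Responsibilities z[n][k], computed in log space for numerical stability."""
    Z = []
    for x in X:
        log_w = [math.log(pi_k) + sum(math.log(m if x_i else 1.0 - m)
                                      for x_i, m in zip(x, mu_k))
                 for pi_k, mu_k in zip(pi, mu)]
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        Z.append([v / total for v in w])
    return Z

def m_step(X, Z, eps=1e-9):
    """mu_m <- weighted mean image, pi_m <- N_m / N (with a small clamp)."""
    N, D, K = len(X), len(X[0]), len(Z[0])
    mu, pi = [], []
    for m in range(K):
        N_m = max(sum(z[m] for z in Z), eps)        # guard against empty clusters
        mean = [sum(z[m] * x[j] for z, x in zip(Z, X)) / N_m for j in range(D)]
        mu.append([min(max(v, eps), 1.0 - eps) for v in mean])  # keep logs finite
        pi.append(N_m / N)
    return mu, pi

def fit_bmm(X, K, iterations=50, seed=0):
    """Alternate E and M steps for a fixed number of iterations."""
    rng = random.Random(seed)
    D = len(X[0])
    mu = [[rng.uniform(0.25, 0.75) for _ in range(D)] for _ in range(K)]
    pi = [1.0 / K] * K
    for _ in range(iterations):
        Z = e_step(X, mu, pi)
        mu, pi = m_step(X, Z)
    return mu, pi, e_step(X, mu, pi)

# Hypothetical data: noisy copies of two 6-pixel "prototype" images.
X = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0],
     [0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1]]
mu, pi, Z = fit_bmm(X, K=2)
for x, z in zip(X, Z):
    print(x, [round(v, 3) for v in z])
```

A real implementation would normally stop when the log likelihood stops improving instead of running a fixed number of iterations, and would try several random initializations, since E-M is only guaranteed to find a local optimum.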

3. References

[Dempster et al, 1977] A. P. Dempster, N. M. Laird, D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1–38.

[Bishop, 2006] C. M. Bishop. "Pattern Recognition and Machine Learning". Springer, 2006. ISBN 9780387310732.

3D Display Simulation using Head-Tracking with Kinect

Manfredas Zabarauskas — Wed, 31 Oct 2012 20:15:18 +0000

During my final year in Cambridge I had the opportunity to work on the project that I wanted to implement for the last three years: a glasses-free 3D display.

1. Introduction

It all started when I saw Johnny Lee's "Head Tracking for Desktop VR Displays using the Wii Remote" project in early 2008 (see below). He cunningly used the infrared camera in the Nintendo Wii's remote and a head mounted sensor bar to track the location of the viewer's head and render view dependent images on the screen. He called it a "portal to the virtual environment".

Johnny Lee's project "Head Tracking for Desktop VR Displays using the Wii Remote".

I always thought that it would be really cool to have this behaviour without having to wear anything on your head (and it was - see the video below!).

My "portal to the virtual environment" which does not require head gear. And it has 3D Tetris!

I am a firm believer in three-dimensional displays, and I am certain that we do not see the widespread adoption of 3D displays simply because of a classic network effect (also known as the "chicken-and-egg" problem). The creation and distribution of three-dimensional content is inevitably much more expensive than that of regular, old-school 2D content. If there is no demand (i.e. no one has a 3D display at home/work), then the content providers do not have much of an incentive to bother creating the 3D content. Vice versa, if there is no content then consumers do not see much incentive to invest in (inevitably more expensive) 3D displays.

A "portal to the virtual environment", or as I like to call it, a 2.5D display could effectively solve this. If we could enhance every 2D display to get what you see in Johnny's and my videos (and I mean every: LCD, CRT, you-name-it), then suddenly everyone can consume the 3D content even without having the "fully" 3D display. At that point it starts making sense to mass-create 3D content.

The terms "fully" and 2.5D, however, require a bit of explanation.

1.1. Human depth perception

Eye "convergence" depth cue

You see, human depth perception comes from a variety of sensory cues. Some of them come from our ability to sense the position of our eyes and the tension in our eye muscles ("oculomotor" cues). For example, when the object of focus moves closer to the eye, we can feel the eye moving inwards (i.e. we feel the extraocular muscles stretching). This is the so-called "convergence" depth cue.

Another kinesthetic sensation arises from the change in the shape of the eye lens that occurs when the sight is focused on the objects at different distances (called "accommodation"). In this case, ciliary muscles stretch the lens making it thinner and changing the eye's focal length. These kinesthetic sensations (processed in the visual cortex) serve as the basic cues for distance interpretation.

Eye "accommodation" depth cue

Johnny's and my displays are not able to simulate these oculomotor cues. In fact, the increased depth perception seen in the videos above comes from a monocular (read: single eye) motion cue, called "motion parallax". Fancy name aside, motion parallax simply means that the objects closer to the moving observer seem to move faster (and in opposite direction to the movement of the observer), whereas the objects farther away move slower (and in the same direction). However, motion parallax alone is not enough to create a full 3D impression.
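The geometry behind motion parallax on a head-tracked display can be illustrated with similar triangles. The Python sketch below is a simplified pinhole model of my own for illustration (it is not the rendering code used in either project): the eye sits at a distance `viewer_distance` in front of the screen, a scene point sits at `point_depth` behind the screen plane (negative values mean in front of it), and we compute where the line from the eye to the point crosses the screen. All numbers are hypothetical.

```python
def screen_x(point_x, point_depth, viewer_distance, viewer_x):
    """Horizontal position where the eye-to-point line crosses the screen plane.

    point_depth > 0 means behind the screen, < 0 means in front of it
    (it must stay greater than -viewer_distance).
    """
    return ((point_x * viewer_distance + viewer_x * point_depth)
            / (viewer_distance + point_depth))

viewer_distance = 0.6   # metres from the eye to the screen (hypothetical)
for depth in (-0.3, 0.0, 0.6):
    before = screen_x(0.0, depth, viewer_distance, viewer_x=0.0)
    after = screen_x(0.0, depth, viewer_distance, viewer_x=0.1)
    print(f"depth {depth:+.1f} m: on-screen shift {after - before:+.3f} m")
```

When the viewer moves 10 cm sideways, a point on the screen plane stays put, a point behind the screen shifts in the same direction as the viewer, and a point in front of the screen shifts in the opposite direction (and by a larger amount in this example), which is the behaviour described above.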

In the average adult human, the eyes are horizontally separated by about 6 cm, hence even when looking at the same scene, the images formed on the retinae are different. The difference in the images between the left and the right eyes (called "binocular disparity") is actually translated into depth perception in the brain (in striate cortex and higher up in the visual system), creating a "stereopsis" depth cue.

As you will see in a figure below (by Cutting and Vishton, 1995), stereopsis and motion parallax are the two most important depth cues in the near distance (< 2 meters).

Ranking of Depth Cues in the Observer's Space. Based on Cutting and Vishton, 1995.

The fact that we are not able to simulate the stereopsis depth cue on a standard LCD/CRT/etc display is one of the main reasons why I am calling such displays 2.5-dimensional (nevertheless, they are still really exciting!).

So, how did I create my 2.5D display?

2. 2.5D display implementation

Well, initially I thought of using just a standard webcam to try to infer the viewer's distance from the camera using a whole bunch of cunning calibration and computer vision techniques. However, my supervisor-to-be, Neil Dodgson (as it turns out, a chair of international Stereoscopic Displays & Applications conferences in 2006, 2007, 2010 and 2011, and a pretty cool dude in general; you can check out his blog here) suggested using Microsoft Kinect.

Kinect's projected IR dot pattern (taken from here)

This suggestion proved to be tremendously useful.

The neat piece of hardware that actually makes Kinect exciting is an IR depth-finding camera. It means that for (almost) each pixel in the video stream, Kinect can determine its distance from the camera essentially by looking at the distortions of the projected IR dot pattern. Combined with some clever machine learning, this feature enables Kinect to track the positions of twenty major joints of the user's body in real-time (called skeletal tracking).

However, for skeletal tracking Kinect requires the whole person to be visible in the sensor's field of view - not a very realistic requirement, especially in desktop PC environments.

The final idea was simple:

Use Viola and Jones face detector to detect the viewer's face in the colour (RGB) image.
Use the enhanced CAMShift face tracker to track it until the first loss, after which use V-J face detector again to re-detect the face.
Use ViBe background subtractor to get rid of the nearly-static background to help with the tracking.
In parallel, to exploit the depth data coming from Kinect, use Garstka and Peters depth-based head detector and a modified CAMShift tracker to track the head.
Merge the colour- and depth-based tracker predictions, filter the noise using some impulse/high-pass filters and... Bob's your uncle!

(Seriously, though, if you are interested in the actual technology behind it, drop me an e-mail at manfredas@zabarauskas.com and I might be able to provide you with my actual 167-page thesis, containing all the nitty-gritty details.)

Because I had decided to write everything from scratch, I also had to implement the distributed training framework for Viola-Jones using AsymBoost (in the image below you can see the Cambridge Computer Lab machines piling through more than 32 million non-face images in order to "learn" the differences between a human face and, say, a chair).

University of Cambridge Computer Laboratory running distributed Viola-Jones training framework.

Misclassified non-faces

After the training, there were 42 misclassifications out of 32.9 million non-face images in total (three examples of non-face images misclassified as faces are shown on the left).

Besides this, a whole bunch of evaluation software had to be implemented: recording and replaying Kinect depth and video streams, tools to help with the ground-truth tagging of depth and colour evaluation videos, Viola-Jones framework evaluators, and so on.

3. Final outcome

So, what was the result?

Well, during 10 minutes of evaluation recordings (containing unconstrained viewer’s head movement in six degrees-of-freedom, in presence of occlusions, changing facial expressions, different backgrounds and varying lighting conditions) the combined head-tracker was able to predict the viewer’s head center location within less than 1/3 of head's size from the actual head center on average!

Computer Lab's News

It was running at 28.24 FPS (limited only by Kinect's frame rate of 30 FPS) using 56.8% of a single Intel i5-2410M CPU @ 2.30 GHz core (with hyperthreading enabled).

The project has been highly commended by the University of Cambridge Faculty of Computer Science and Technology (a fancy name for the Computer Lab), and the international Undergraduate Awards 2012 (an achievement which has received a couple of mentions on my college's and the Computer Lab's websites).

Wolfson College's News

All in all, I managed to accomplish something that I wanted to do for a long time. Eventually, I might publish the code and the details of the technology, but there is still work to be done, so don't hold your breath.

I firmly believe that under the right circumstances the capabilities of devices like Kinect could be world-changing. And to be honest, there is a good chance that I might have a small part in that effort in the nearest future. But that is a story for another blog post.

Backpropagation Tutorial

Manfredas Zabarauskas — Sun, 17 Apr 2011 23:16:25 +0000

The PhD thesis of Paul J. Werbos at Harvard in 1974 described backpropagation as a method of teaching feed-forward artificial neural networks (ANNs). In the words of Wikipedia, it led to a "renaissance" in ANN research in the 1980s.

As we will see later, it is an extremely straightforward technique, yet most of the tutorials online seem to skip a fair amount of details. Here's a simple (yet still thorough and mathematical) tutorial of how backpropagation works from the ground-up; together with a couple of example applets. Feel free to play with them (and watch the videos) to get a better understanding of the methods described below!

Training a single perceptron

Training a multilayer neural network

1. Background

To start with, imagine that you have gathered some empirical data relevant to the situation that you are trying to predict - be it fluctuations in the stock market, chances that a tumour is benign, likelihood that the picture that you are seeing is a face or (like in the applets above) the coordinates of red and blue points.

We will call this data training examples and we will describe the $i^\text{th}$ training example as a tuple $(\vec{x_i}, y_i)$ , where $\vec{x_i} \in \mathbb{R}^n$ is a vector of inputs and $y_i \in \mathbb{R}$ is the observed output.

Ideally, our neural network should output $y_i$ when given $\vec{x_i}$ as an input. In case that does not always happen, let's define the error measure as a simple squared distance between the actual observed output and the prediction of the neural network: $E := \sum_i (h(\vec{x_i}) - y_i)^2$ , where $h(\vec{x_i})$ is the output of the network.

2. Perceptrons (building-blocks)

The simplest classifiers out of which we will build our neural network are perceptrons (fancy name thanks to Frank Rosenblatt). In reality, a perceptron is a plain-vanilla linear classifier which takes a number of inputs $a_1, ..., a_n$ , scales them using some weights $w_1, ..., w_n$ , adds them all up (together with some bias $b$ ) and feeds everything through an activation function $\sigma : \mathbb{R} \rightarrow \mathbb{R}$ .

A picture is worth a thousand equations:

Perceptron (linear classifier)

To slightly simplify the equations, define $w_0 := b$ and $a_0 := 1$ . Then the behaviour of the perceptron can be described as $\sigma(\vec{a} \cdot \vec{w})$ , where $\vec{a} := (a_0, a_1, ..., a_n)$ and $\vec{w} := (w_0, w_1, ..., w_n)$ .

To complete our definition, here are a few examples of typical activation functions:

sigmoid: $\sigma(x) = \frac{1}{1 + \exp(-x)}$ ,
hyperbolic tangent: $\sigma(x) = \tanh(x)$ ,
plain linear: $\sigma(x) = x$ , and so on.
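In code, a perceptron with the three activation functions listed above is only a few lines. This is a minimal Python sketch; the function names and the weights are hypothetical, and following the convention above `weights[0]` plays the role of the bias $b$ (i.e. $a_0 := 1$ ).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

ACTIVATIONS = {"sigmoid": sigmoid, "tanh": math.tanh, "linear": lambda x: x}

def perceptron(inputs, weights, activation=sigmoid):
    """Compute sigma(a . w) with a_0 := 1, so weights[0] is the bias b."""
    a = [1.0] + list(inputs)
    return activation(sum(a_i * w_i for a_i, w_i in zip(a, weights)))

weights = [-1.0, 2.0, 0.5]   # hypothetical (b, w_1, w_2)
for name, sigma in ACTIVATIONS.items():
    print(name, perceptron([0.5, 1.0], weights, sigma))
```

For these inputs the weighted sum is $-1 + 2 \cdot 0.5 + 0.5 \cdot 1 = 0.5$ , so the linear activation simply returns 0.5 while the other two squash it.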

Now we can finally start building neural networks. The simplest kind of network that we can build is... exactly, one perceptron! Here's how we can train it to classify things!

3. Single-layer neural network

We defined the error earlier as $E := \sum_i (h(\vec{x_i}) - y_i)^2$ . Obviously, since we are using a single perceptron both our error and the output of the network ( $h_{\vec{w}}(\vec{x_i}) = \sigma(\vec{w} \cdot \vec{x_i})$ ) depend on the weights vector $\vec{w}$ .

Incorporating those observations into the updated error measure we obtain $E(\vec{w}) := \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2$ .

Our goal is to find such a vector of weights $\vec{w}$ that $E(\vec{w})$ is minimised - that way our perceptron will correctly predict the output for all inputs of our training examples!

We will do that by applying the gradient descent algorithm: in essence we will treat the error as a surface in n-dimensional space, then we will find the steepest downward slope at the current point $\vec{w_t}$ and will go in that direction to obtain $\vec{w}_{t+1}$ . This way hopefully we will find a minimum point on the error surface and we will use the coordinates of that point as the final weight vector.

Skipping a great deal of maths on whether the minimum point exists, whether it is unique and global, whether we can "overjump" it by accident, what the conditions are for the following partial derivatives to exist, etc., we will dive straight in hoping for the best and will calculate the gradient of the error surface at $\vec{w_t}$ . Then we will take a step in the opposite direction of the gradient (i.e. in the direction of the fastest decreasing slope on the error surface) to obtain $\vec{w}_{t + 1}$ .

To express it in a slightly more mathematical way, we will start with some randomized (!) weight vector $\vec{w_0}$ and will train our perceptron by updating the weights

\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}, \end{align}

where $\eta$ is known as a learning rate (a simple scaling factor that typically ranges between zero and one).

Observe that

\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = \left( \frac{\partial E(\vec{w})}{\partial w_0},\frac{\partial E(\vec{w})}{\partial w_1}, ... ,\frac{\partial E(\vec{w})}{\partial w_n} \right), \end{align}

and we can calculate

\begin{align} \frac{\partial E(\vec{w})}{\partial w_j} &= \frac{\partial}{\partial w_j} \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2 \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial}{\partial w_j} (h_{\vec{w}}(\vec{x_i}) - y_i) \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial}{\partial w_j} \sigma(\vec{x_i} \cdot \vec{w}) \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \frac{\partial}{\partial w_j} \vec{x_i} \cdot \vec{w} \\ &= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \frac{\partial}{\partial w_j} \sum_{k=0}^n x_{i,k} w_k \\ &= 2 \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) x_{i,j} \end{align}

for each $0 \leq j \leq n$ .
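A derivation like this is easy to get subtly wrong, so it is worth checking it against finite differences. The Python sketch below does that for a perceptron with a hyperbolic tangent activation (chosen here only as an example; its derivative is $1 - \tanh^2(x)$ ). The training examples, the weights and the function names are hypothetical, and each input vector already includes the constant component $x_{i,0} = 1$ that multiplies the bias.

```python
import math

def dot(x, w):
    return sum(x_k * w_k for x_k, w_k in zip(x, w))

def error(w, examples, sigma):
    """E(w) = sum_i (sigma(x_i . w) - y_i)^2."""
    return sum((sigma(dot(x, w)) - y) ** 2 for x, y in examples)

def gradient(w, examples, sigma, sigma_prime):
    """dE/dw_j = 2 * sum_i (h(x_i) - y_i) * sigma'(x_i . w) * x_ij."""
    grad = [0.0] * len(w)
    for x, y in examples:
        s = dot(x, w)
        common = 2.0 * (sigma(s) - y) * sigma_prime(s)
        for j, x_j in enumerate(x):
            grad[j] += common * x_j
    return grad

def numerical_gradient(w, examples, sigma, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(w)):
        up = [w_k + (h if k == j else 0.0) for k, w_k in enumerate(w)]
        down = [w_k - (h if k == j else 0.0) for k, w_k in enumerate(w)]
        grad.append((error(up, examples, sigma) - error(down, examples, sigma)) / (2 * h))
    return grad

def tanh_prime(s):
    return 1.0 - math.tanh(s) ** 2

# Hypothetical training examples ((x_0 = 1, x_1, x_2), y) and weights.
examples = [([1.0, 0.2, 0.7], 1.0), ([1.0, 0.9, 0.1], -1.0), ([1.0, 0.5, 0.5], 1.0)]
w = [0.1, -0.3, 0.4]
analytic = gradient(w, examples, math.tanh, tanh_prime)
numeric = numerical_gradient(w, examples, math.tanh)
print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places; a larger mismatch usually points to a mistake in the analytic gradient.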

3.1. Example single-layer neural network

In this applet, a perceptron takes two inputs (normalized x and y coordinates $in_x$ and $in_y$ , i.e. $a_1 = in_x$ , $a_2 = in_y$ ) and uses sigmoid as an activation function with the learning rate $\eta = 0.1$ .

Then, using a previous general result

\begin{align} \frac{\partial E(\vec{w})}{\partial w_j} &= 2 \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) x_{i,j} \\ &= 2 \sum_i (\sigma(\vec{w} \cdot \vec{x_i}) - y_i) \sigma(\vec{x_i} \cdot \vec{w}) (1 - \sigma(\vec{x_i} \cdot \vec{w})) x_{i,j}, \end{align}

(since for the sigmoid activation function $\sigma ' (x) = \sigma(x) (1 - \sigma(x))$ ); and thus

\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = 2 \sum_i (\sigma(\vec{w} \cdot \vec{x_i}) - y_i) \sigma(\vec{x_i} \cdot \vec{w}) (1 - \sigma(\vec{x_i} \cdot \vec{w})) \vec{x_i}. \end{align}

With $\eta = 0.1$ (so that $2 \eta = 0.2$ ), the final algorithm to update the weight vector $\vec{w} = (w_0, w_1, w_2)$ (which is initially randomized) then is

\begin{align} \vec{w}_{t+1} := \vec{w_t} - 0.2 \sum_i (h_{\vec{w}_t}(\vec{x_i}) - y_i) h_{\vec{w}_t}(\vec{x_i}) (1 - h_{\vec{w}_t}(\vec{x_i})) \vec{x_i}, \end{align}

where $h_{\vec{w}_t}(\vec{x_i}) = \sigma(\vec{w}_t \cdot \vec{x_i})$ .
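The update rule above translates almost literally into code. This is a minimal Python sketch rather than the applet's source: the six labelled points, the number of epochs, the range of the random initial weights and the function names are all hypothetical. Each example is stored as $((1, in_x, in_y), y)$ so that $w_0$ acts as the bias.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(w, x):
    """h_w(x) = sigma(w . x)."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)))

def total_error(w, examples):
    return sum((predict(w, x) - y) ** 2 for x, y in examples)

def train(examples, w, eta=0.1, epochs=2000):
    """Batch gradient descent: w <- w - 2 * eta * sum_i (h - y) h (1 - h) x_i."""
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in examples:
            h = predict(w, x)
            common = 2.0 * (h - y) * h * (1.0 - h)
            for j, x_j in enumerate(x):
                grad[j] += common * x_j
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w

# Hypothetical "blue" (y = 0) and "red" (y = 1) points with normalized coordinates.
examples = [([1.0, 0.1, 0.2], 0.0), ([1.0, 0.3, 0.1], 0.0), ([1.0, 0.2, 0.4], 0.0),
            ([1.0, 0.8, 0.9], 1.0), ([1.0, 0.9, 0.6], 1.0), ([1.0, 0.7, 0.8], 1.0)]

rng = random.Random(0)
w_initial = [rng.uniform(-0.5, 0.5) for _ in range(3)]   # randomized start
w_final = train(examples, w_initial)
print(total_error(w_initial, examples), total_error(w_final, examples))
```

The second printed number (the error after training) should be noticeably smaller than the first, since with this small data set and learning rate each gradient step moves downhill on the error surface.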

However, a single perceptron is extremely limited in the sense that different classes of examples must be separable with a hyperplane (hence the name, linear classifier), which is usually not the case in real-life applications.

Time to bump things up a notch: let's connect a few of them together to obtain a multilayer feed-forward neural network!

4. Multilayer neural network

Let's consider a general case first: a completely unrestricted feed-forward structure (with the only condition being that there are no loops between the perceptrons to avoid general madness and chaos).

Since it is structurally more complex than just a single perceptron, take a look at the following figure that explains some more notation:

Multilayer neural network

Here the weight $w_{i \rightarrow j}$ connects perceptrons $i$ and $j$, the sum of the weighted inputs of perceptron $j$ is denoted by $s_j := \sum_k z_k w_{k \rightarrow j}$ where $k$ iterates over all perceptrons connected to $j$, and the output of $j$ is written as $z_j := \sigma(s_j)$, where $\sigma$ is $j$'s activation function.

We will use the same error measure $E(\vec{w}) := \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2$, except now the weight vector $\vec{w}$ will contain all the weights in the network, i.e. $\vec{w} = (\;\;w_{i \rightarrow j}\;\;)$ for all $i, j$.

To find $\vec{w}$ that minimizes $E(\vec{w})$ using gradient descent we have to calculate $\frac{\partial E(\vec{w})}{\partial \vec{w}}$ (again). However, this time it is (very slightly) more involved.

First of all let's separate the contributions of individual training examples to the overall error using the following observation:
\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = \sum_i \frac{\partial E_i(\vec{w})}{\partial \vec{w}}, \end{align}
where $E_i(\vec{w}) = (h_{\vec{w}}(\vec{x_i}) - y_i)^2$ .

Then

\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= \frac{\partial}{\partial w_{j \rightarrow k}} (h_{\vec{w}}(\vec{x_i}) - y_i)^2 \\ &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial w_{j \rightarrow k}} \\ &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} \frac{\partial s_k}{\partial w_{j \rightarrow k}} \\ &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} z_j. \end{align}

If $k$ is an output node, then
\begin{align} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} = \frac{d \sigma(s_k)}{d s_k} = \sigma' (s_k)\end{align}
and thus
\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j. \end{align}

However, if $k$ is not an output node, then a change in $s_k$ can affect all the nodes which are connected to $k$ 's output, i.e.
\begin{align} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} &= \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial z_k} \frac{\partial z_k}{\partial s_k} \\ &= \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial z_k} \sigma ' (s_k) \\ &= \sum_{o \in \{ v \; | \; v \text{ is connected to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} \frac{\partial s_o}{\partial z_k} \sigma ' (s_k) \\ &= \sum_{o \in \{ v \; | \; v \text{ is connected to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} w_{k \rightarrow o} \; \sigma ' (s_k), \end{align}
... and we are almost done! All that is left to do is to place the $i$-th example at the inputs of our neural network, calculate $s_k$ and $z_k$ for all the nodes (the forward-propagation step) and work our way backwards from the output node calculating $\frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k}$ (hence the name, backpropagation).

To summarize, if $k$ is an output node, then

\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j, \end{align}

otherwise

\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j \sum_{o \in \{ v \; | \; v \text{ conn. to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} w_{k \rightarrow o}. \end{align}
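The two cases above translate fairly directly into code. The following is a rough sketch for an arbitrary loop-free network with a single output node; the dictionary-based graph representation, the function names and the tiny example network (including its skip connection from `x1` to `out`) are assumptions made for illustration, and every non-input node is assumed to use the sigmoid.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(weights, inputs, order):
    """Forward propagation: s_k = sum_j z_j * w_{j->k}, z_k = sigmoid(s_k).

    weights: {(j, k): w_{j->k}}; inputs: {input node: value};
    order: the non-input nodes in topological order.
    """
    z = dict(inputs)
    s = {}
    for k in order:
        s[k] = sum(z[j] * w for (j, target), w in weights.items() if target == k)
        z[k] = sigmoid(s[k])
    return s, z

def backprop(weights, inputs, order, output, y):
    """Return {(j, k): dE_i/dw_{j->k}} for one training example with target y."""
    s, z = forward(weights, inputs, order)
    h = z[output]

    # Backward pass: dh_ds[k] = dh/ds_k, computed from the output node backwards.
    dh_ds = {}
    for k in reversed(order):
        d_sigma = z[k] * (1.0 - z[k])  # sigmoid'(s_k)
        if k == output:
            dh_ds[k] = d_sigma
        else:
            # Sum over every node o that k feeds into.
            dh_ds[k] = d_sigma * sum(
                dh_ds[o] * w for (source, o), w in weights.items() if source == k
            )

    # dE_i/dw_{j->k} = 2 * (h - y) * dh/ds_k * z_j
    return {(j, k): 2.0 * (h - y) * dh_ds[k] * z[j] for (j, k) in weights}

# Hypothetical network: bias "b" and inputs "x1", "x2" feed hidden node "a";
# "b", "a" and "x1" feed the output node "out".
weights = {
    ("b", "a"): 0.1, ("x1", "a"): 0.4, ("x2", "a"): -0.2,
    ("b", "out"): -0.3, ("a", "out"): 0.7, ("x1", "out"): 0.5,
}
inputs = {"b": 1.0, "x1": 0.6, "x2": 0.9}
grads = backprop(weights, inputs, order=["a", "out"], output="out", y=1.0)
```

Processing the nodes in reverse topological order guarantees that $\frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o}$ is already known for every node $o$ that $k$ feeds into by the time node $k$ is reached.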

Then, once the following is obtained
\begin{align} \frac{\partial E_i(\vec{w})}{\partial \vec{w}} = \left( \; \; \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} \; \; \right), \forall j, k \end{align}
the weight vector can either be updated in one go (batch update)
\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}} = \vec{w_t} - \eta \sum_i \frac{\partial E_i(\vec{w})}{\partial \vec{w}}\bigg|_{\vec{w_t}}, \end{align}
or it can be updated sequentially using one training example at a time:
\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E_i(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}.\end{align}
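The difference between the two update schemes is easiest to see side by side. The sketch below uses a single sigmoid unit as a stand-in for the whole network, purely to keep the per-example gradient short; the names `grad_i`, `batch_step` and `sequential_pass`, as well as the two training examples, are assumptions for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_i(w, example):
    """Gradient of E_i(w) = (sigmoid(w . x) - y)^2 for a single sigmoid unit."""
    x, y = example
    h = sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)))
    common = 2.0 * (h - y) * h * (1.0 - h)
    return [common * x_k for x_k in x]

def batch_step(w, examples, eta):
    """One batch update: w_{t+1} = w_t - eta * sum_i dE_i/dw, all evaluated at w_t."""
    total = [0.0] * len(w)
    for example in examples:
        total = [t + g for t, g in zip(total, grad_i(w, example))]
    return [w_k - eta * t for w_k, t in zip(w, total)]

def sequential_pass(w, examples, eta):
    """One pass of sequential updates: w changes after every training example."""
    for example in examples:
        w = [w_k - eta * g for w_k, g in zip(w, grad_i(w, example))]
    return w

# Hypothetical examples of the form ((bias input, x), y).
examples = [((1.0, 0.2), 0.0), ((1.0, 0.8), 1.0)]
w0 = [0.0, 0.0]

w_batch = batch_step(w0, examples, eta=0.5)
w_seq = sequential_pass(w0, examples, eta=0.5)
```

Starting from the same weights, the two schemes generally end up at different points after one pass, because the sequential version evaluates each gradient at the already-updated weights.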

4.1. Example multilayer network

If you launch and play with the applet above, you will see that it is able to separate classes non-linearly (indicating that it's using more than one perceptron). It is built using this two-layer neural network:

Two-layer neural network example

The weight vector $\vec{w}$ contains all the weights in the network, i.e.
\begin{align} \vec{w} = ( w_{in_1 \rightarrow 1}, w_{in_x \rightarrow 1}, w_{in_y \rightarrow 1}, w_{in_1 \rightarrow 2}, ..., w_{in_y \rightarrow 5}, w_{in_1 \rightarrow 6}, w_{1 \rightarrow 6}, w_{2 \rightarrow 6}, ..., w_{5 \rightarrow 6}). \end{align}

Each perceptron uses the sigmoid as its activation function, and the output of perceptron $6$ is the output of the whole network, i.e. $h_{\vec{w}}(\vec{x_i}) = z_6$.

Then an individual point $i$ (with its x and y coordinates normalized) is treated as the $i$-th training example and fed through the network. While it is being propagated, $s_k$ and $z_k$ are stored for each $k = 1, ..., 6$.

Then the gradient of the $i$-th error surface is calculated as follows:
\begin{align}
\frac{\partial E_i(\vec{w})}{\partial \vec{w}} &= \left( \frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 1}},\frac{\partial E_i(\vec{w})}{\partial w_{in_x \rightarrow 1}}, ..., \frac{\partial E_i(\vec{w})}{\partial w_{in_y \rightarrow 5}},\frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 6}},\frac{\partial E_i(\vec{w})}{\partial w_{1 \rightarrow 6}},\frac{\partial E_i(\vec{w})}{\partial w_{2 \rightarrow 6}}, ..., \frac{\partial E_i(\vec{w})}{\partial w_{5 \rightarrow 6}} \right) , \end{align}
where
\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 1}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_1)\; \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_6} w_{1 \rightarrow 6} \\
&= 2 (z_6 - y_i) \; \sigma (s_1) \; (1 - \sigma (s_1)) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{1 \rightarrow 6}, \\
\frac{\partial E_i(\vec{w})}{\partial w_{in_x \rightarrow 1}} &= 2 (z_6 - y_i) \; \sigma (s_1) \; (1 - \sigma (s_1)) \; {in}_x \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{1 \rightarrow 6}, \\
& \vdots \\
\frac{\partial E_i(\vec{w})}{\partial w_{in_y \rightarrow 5}} &= 2 (z_6 - y_i) \; \sigma (s_5) \; (1 - \sigma (s_5)) \; {in}_y \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{5 \rightarrow 6}, \\
\frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 6}} &= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_6) \\
&= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) , \\
\frac{\partial E_i(\vec{w})}{\partial w_{1 \rightarrow 6}} &= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; z_1, \\
\frac{\partial E_i(\vec{w})}{\partial w_{2 \rightarrow 6}} &= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; z_2, \\
& \vdots \\
\frac{\partial E_i(\vec{w})}{\partial w_{5 \rightarrow 6}} &= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; z_5.
\end{align}
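These formulas can be checked numerically. The sketch below implements them for the network with five hidden perceptrons and one output perceptron. It is not the applet's code: the weight layout (`hidden[k]` holds the three weights into perceptron $k + 1$, `out` holds the six weights into perceptron $6$), the function names, the sample point and the assumption that $in_1$ is a constant bias input equal to 1 are mine.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(hidden, out, in_x, in_y):
    """Return ([z_1, ..., z_5], z_6) for one input point."""
    z = [sigmoid(w[0] + w[1] * in_x + w[2] * in_y) for w in hidden]
    z6 = sigmoid(out[0] + sum(w * z_k for w, z_k in zip(out[1:], z)))
    return z, z6

def gradient(hidden, out, in_x, in_y, y):
    """Gradient of E_i = (z_6 - y)^2, in the same layout as (hidden, out)."""
    z, z6 = forward(hidden, out, in_x, in_y)
    # 2 (z_6 - y_i) sigma(s_6) (1 - sigma(s_6)), shared by every component
    delta6 = 2.0 * (z6 - y) * z6 * (1.0 - z6)
    # Components for w_{in_1 -> 6}, w_{1 -> 6}, ..., w_{5 -> 6}
    grad_out = [delta6] + [delta6 * z_k for z_k in z]
    grad_hidden = []
    for k, z_k in enumerate(z):
        # Extra factors sigma(s_k) (1 - sigma(s_k)) w_{k -> 6} for hidden perceptron k
        delta_k = delta6 * out[k + 1] * z_k * (1.0 - z_k)
        # Components for w_{in_1 -> k}, w_{in_x -> k}, w_{in_y -> k}
        grad_hidden.append([delta_k, delta_k * in_x, delta_k * in_y])
    return grad_hidden, grad_out

# Hypothetical random weights and a single made-up training point.
rng = random.Random(1)
hidden = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
out = [rng.uniform(-1.0, 1.0) for _ in range(6)]
grad_hidden, grad_out = gradient(hidden, out, in_x=0.3, in_y=0.8, y=1.0)
```

Comparing each component against a finite-difference estimate of $E_i$ is a cheap way to catch a misplaced factor.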

Finally, the network is sequentially trained with the learning rate $\eta = 0.5$ (starting with a random initial weight vector $\vec{w}_0$)
\begin{align} \vec{w}_{t+1} := \vec{w_t} - 0.5 \frac{\partial E_i(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}.\end{align}
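Put together, sequential training might look roughly like this. Again, this is a sketch under assumptions rather than the applet's implementation: the XOR-style data set, the number of passes, the random seed, the weight layout and all function names are made up for illustration, and $in_1$ is taken to be a constant 1.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(hidden, out, point):
    """Return ([z_1, ..., z_5], z_6); hidden[k] = weights into perceptron k + 1."""
    in_x, in_y = point
    z = [sigmoid(w[0] + w[1] * in_x + w[2] * in_y) for w in hidden]
    return z, sigmoid(out[0] + sum(w * z_k for w, z_k in zip(out[1:], z)))

def sequential_update(hidden, out, point, y, eta=0.5):
    """Apply w_{t+1} := w_t - eta * dE_i/dw in place for one training example."""
    in_x, in_y = point
    z, z6 = forward(hidden, out, point)
    delta6 = 2.0 * (z6 - y) * z6 * (1.0 - z6)
    # Hidden-layer terms must use the output weights from w_t, so compute them first.
    deltas = [delta6 * out[k + 1] * z[k] * (1.0 - z[k]) for k in range(len(hidden))]
    out[0] -= eta * delta6
    for k in range(len(hidden)):
        out[k + 1] -= eta * delta6 * z[k]
        hidden[k][0] -= eta * deltas[k]
        hidden[k][1] -= eta * deltas[k] * in_x
        hidden[k][2] -= eta * deltas[k] * in_y

def total_error(hidden, out, data):
    """E(w) = sum_i (h_w(x_i) - y_i)^2 over the whole data set."""
    return sum((forward(hidden, out, p)[1] - y) ** 2 for p, y in data)

# Random initial weights and a hypothetical, non-linearly-separable data set.
rng = random.Random(0)
hidden = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
out = [rng.uniform(-1.0, 1.0) for _ in range(6)]
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

error_before = total_error(hidden, out, data)
for _ in range(5000):
    for point, y in data:
        sequential_update(hidden, out, point, y)
error_after = total_error(hidden, out, data)
```

Comparing `error_before` with `error_after` shows how much the total error has dropped. How quickly it drops (and whether it gets close to zero) depends on the initial weights.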

That's it, I hope it sheds some light on backpropagation!

Halfway There

Manfredas Zabarauskas — Tue, 12 Apr 2011 12:57:39 +0000

Another term in Cambridge has gone by - four out of nine to go. In the meantime, here's a quick update of what I've been up to in the past few months.

1. Microsoft internship

Redmond, WA, 2011

In January I had the opportunity to visit Microsoft's headquarters in Redmond, WA, to interview for the Software Development Engineer in Test intern position in the Office team. In short - a great trip, in every aspect.

I left London Heathrow on January 11th at 2:20 PM and landed at Seattle-Tacoma at 4:10 PM (I suspect that there might have been a few time zones in between those two points). I arrived at the Marriott Redmond roughly an hour later, which meant that because of my anti-jetlag technique ("do not go to bed until 10-11 PM in the new timezone no matter what") I had a few hours to kill. Ample time to unpack, grab dinner in the Marriott's restaurant and go for a short stroll around Redmond before going to sleep.

On the next day I had four interviews arranged. The interviews themselves were absolutely stress-free; they felt more like a chance to meet and have a chat with some properly smart (and down-to-earth) folks.

Top of the Space Needle. Seattle, WA, 2011

The structure of the interviews seemed fairly typical: each interview consisted of some algorithm/data structure problems, a short discussion about past experience and the opportunity to ask questions (obviously a great chance to learn more about the team/company/company culture, etc). Since this was my third round of summer internship applications (I worked as a software engineer for Wolfson Microelectronics in '09 and Morgan Stanley in '10), everything made sense and was pretty much what I expected.

My trip ended with a quick visit to Seattle on the next day: a few pictures of the Space Needle, a cup of Seattle's Best Coffee and there I was on my flight back to London, having spent $0.00 (yep, Microsoft paid for everything - flights, hotel, meals, taxis, etc). Even so, the best thing about Microsoft definitely seemed to be the people working there; since I have received and accepted the offer, we'll see if my opinion remains unchanged after this summer!

2. Lent term v2.0

Well, things are still picking up speed. Seven courses with twenty-eight supervisions in under two months, plus managing a group project (crowd-sourcing mobile network signal strength) and a few basketball practices each week on top of that, and you'll see why this blog has not been updated for a couple of months.

It's not all doom and gloom, of course. Courses themselves are great, lecturers make some decently convoluted material understandable in minutes and an occasional formal hall (e.g. below) also helps.

All in all, my opinion that Cambridge provides a great opportunity to learn a huge amount of material in a very short timeframe remains unchanged.

There will be more to come about some cool things that I've learnt in separate posts, but now speaking of learning - it's revision time... :-)

Me and Ada at the CompSci formal hall. Cambridge, England, 2011