<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Manfredas Zabarauskas&#039; Blog &#187; tutorial</title>
	<atom:link href="http://blog.zabarauskas.com/tag/tutorial/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.zabarauskas.com</link>
	<description>We are what we repeatedly do; excellence, then, is not an act but a habit. -- Aristotle</description>
	<lastBuildDate>Tue, 29 Nov 2011 03:23:54 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Backpropagation Tutorial</title>
		<link>http://blog.zabarauskas.com/backpropagation-tutorial/</link>
		<comments>http://blog.zabarauskas.com/backpropagation-tutorial/#comments</comments>
		<pubDate>Sun, 17 Apr 2011 23:16:25 +0000</pubDate>
		<dc:creator>Manfredas Zabarauskas</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Education]]></category>
		<category><![CDATA[applet]]></category>
		<category><![CDATA[backpropagation]]></category>
		<category><![CDATA[derivation]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[linear classifier]]></category>
		<category><![CDATA[multiple layer]]></category>
		<category><![CDATA[neural network]]></category>
		<category><![CDATA[perceptron]]></category>
		<category><![CDATA[single layer]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://blog.zabarauskas.com/?p=848</guid>
		<description><![CDATA[The PhD thesis of Paul J. Werbos at Harvard in 1974 described backpropagation as a method of teaching feed-forward artificial neural networks (ANNs). In the words of Wikipedia, it led to a "renaissance" in ANN research in the 1980s. As we will see later, it is an extremely straightforward technique, yet most of [...]]]></description>
			<content:encoded><![CDATA[<p><script type="text/javascript">// <![CDATA[
 function show_multiplelayer_applet() { var html_element, body_element, p_element, text_node; html_element = document.documentElement; body_element = html_element.lastChild; applet_element = document.createElement("applet"); text_node = document.createTextNode("Cannot start the applet! Please install the Java Runtime Environment."); applet_element.appendChild(text_node); applet_element.setAttribute("code", "com.zabarauskas.ai1.MultipleLayerApplet"); applet_element.setAttribute("archive", "http://www.zabarauskas.com/downloads/ANNs/multilayer.jar"); applet_element.setAttribute("height", "0"); applet_element.setAttribute("width", "0"); body_element.appendChild(applet_element); }
// ]]&gt;</script><script type="text/javascript">// <![CDATA[
 function show_singlelayer_applet() { var html_element, body_element, p_element, text_node; html_element = document.documentElement; body_element = html_element.lastChild; applet_element = document.createElement("applet"); text_node = document.createTextNode("Cannot start the applet! Please install the Java Runtime Environment."); applet_element.appendChild(text_node); applet_element.setAttribute("code", "com.zabarauskas.ai1.SingleLayerApplet"); applet_element.setAttribute("archive", "http://www.zabarauskas.com/downloads/ANNs/singlelayer.jar"); applet_element.setAttribute("height", "0"); applet_element.setAttribute("width", "0"); body_element.appendChild(applet_element); }
// ]]&gt;</script>The PhD thesis of <a href="http://en.wikipedia.org/wiki/Paul_Werbos" target="_blank">Paul J. Werbos</a> at Harvard in 1974 described backpropagation as a method of teaching <a href="http://en.wikipedia.org/wiki/Feedforward_neural_network" target="_blank">feed-forward artificial neural networks</a> (ANNs). In the words of Wikipedia, it led to a "renaissance" in ANN research in the 1980s.</p>
<p>As we will see later, it is an extremely straightforward technique, yet most of the tutorials online seem to skip a fair amount of detail. Here's a simple (yet still thorough and mathematical) tutorial on how backpropagation works from the ground up, together with a couple of example applets. Feel free to play with them (and watch the videos) to get a better understanding of the methods described below!</p>
<input type="submit" name="sub_button" onclick="javascript:show_singlelayer_applet()" style="width: 305px; float: left;" value="Launch the single-layer neural network applet!" width="305">
<input type="submit" name="sub_button" style="width: 305px; float: right;" onclick="javascript:show_multiplelayer_applet()" value="Launch the multilayer neural network applet!" width="305">
<p><small><div class="wp-caption alignleft" style="width: 304px"><iframe title="YouTube video player" width="293" height="336" src="http://www.youtube.com/embed/D8iMDH5va9M" frameborder="0" allowfullscreen></iframe><p class="wp-caption-text">Training a single perceptron (linear classifier)</p></div> <div class="wp-caption alignright" style="width: 304px"><iframe title="YouTube video player" width="293" height="336" src="http://www.youtube.com/embed/fAKwocta2wM" frameborder="0" allowfullscreen></iframe><p class="wp-caption-text">Training a multilayer neural network</p></div></small><br />
&nbsp;<br />
&nbsp; </p>
<p><strong>1. Background</strong></p>
<p>To start with, imagine that you have gathered some empirical data relevant to the situation that you are trying to predict - be it fluctuations in the stock market, chances that a tumour is benign, likelihood that the picture that you are seeing is a face or (like in the applets above) the coordinates of red and blue points.</p>
<p>We will call this data <em>training examples</em> and we will describe the <script type='math/tex'>i</script><sup>th</sup> training example as a tuple <script type='math/tex'>(\vec{x_i}, y_i)</script>, where <script type='math/tex'>\vec{x_i} \in \mathbb{R}^n</script> is a vector of inputs and <script type='math/tex'>y_i \in \mathbb{R}</script> is the observed output.</p>
<p>Ideally, our neural network should output <script type='math/tex'>y_i</script> when given <script type='math/tex'>\vec{x_i}</script> as an input. In case that does not always happen, let's define the <em>error </em>measure as a simple squared distance between the actual observed output and the prediction of the neural network: <script type='math/tex'>E := \sum_i (h(\vec{x_i}) - y_i)^2</script>, where <script type='math/tex'>h(\vec{x_i})</script> is the output of the network.</p>
<p><strong>2. Perceptrons (building-blocks)</strong></p>
<p>The simplest classifiers out of which we will build our neural network are <a href="http://en.wikipedia.org/wiki/Perceptron" target="_blank"><em>perceptrons</em></a> (fancy name thanks to <a href="http://en.wikipedia.org/wiki/Frank_Rosenblatt" target="_blank">Frank Rosenblatt</a>). In reality, a perceptron is a plain-vanilla linear classifier which takes a number of inputs <script type='math/tex'>a_1, ..., a_n</script>, scales them using some weights <script type='math/tex'>w_1, ..., w_n</script>, adds them all up (together with some bias <script type='math/tex'>b</script>) and feeds everything through an <em>activation function</em> <script type='math/tex'>\sigma : \mathbb{R} \rightarrow \mathbb{R}</script>.</p>
<p>A picture is worth a thousand equations:</p>
<p><small><div class="wp-caption aligncenter" style="width: 244px"><img title="Perceptron (linear classifier)" src="http://blog.zabarauskas.com/img/perceptron.gif" alt="Perceptron (linear classifier)" width="234" height="140" /><p class="wp-caption-text">Perceptron (linear classifier)</p></div></small></p>
<p>To slightly simplify the equations, define <script type='math/tex'>w_0 := b</script> and <script type='math/tex'>a_0 := 1</script>. Then the behaviour of the perceptron can be described as <script type='math/tex'>\sigma(\vec{a} \cdot \vec{w})</script>, where <script type='math/tex'>\vec{a} := (a_0, a_1, ..., a_n)</script> and <script type='math/tex'>\vec{w} := (w_0, w_1, ..., w_n)</script>.</p>
<p>To complete our definition, here are a few examples of typical activation functions:</p>
<ul>
<li><em>sigmoid:</em> <script type='math/tex'>\sigma(x) = \frac{1}{1 + \exp(-x)}</script>,</li>
<li><em>hyperbolic tangent:</em> <script type='math/tex'>\sigma(x) = \tanh(x)</script>,</li>
<li>plain <em>linear</em> <script type='math/tex'>\sigma(x) = x</script> and so on.</li>
</ul>
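<p>The definition above can be sketched in a few lines of Python. This is only an illustration, not the code behind the applets: the function names and the example numbers are made up here, and the bias is folded in as <script type='math/tex'>w_0</script> exactly as described above.</p>

```python
import math


def sigmoid(x):
    """Sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x):
    """Plain linear activation."""
    return x


def perceptron(inputs, weights, bias, activation):
    """Return activation(a . w), with a_0 = 1 and w_0 = bias prepended."""
    a = [1.0] + list(inputs)
    w = [bias] + list(weights)
    s = sum(a_k * w_k for a_k, w_k in zip(a, w))
    return activation(s)


# Hypothetical inputs and weights, chosen only for illustration.
out_sigmoid = perceptron([0.5, -1.0], [2.0, 1.0], 0.0, sigmoid)
out_tanh = perceptron([0.5, -1.0], [2.0, 1.0], 0.5, math.tanh)
out_linear = perceptron([0.5, -1.0], [2.0, 1.0], 0.5, linear)
```

<p>Swapping the activation function is the only thing that changes between the three calls; the weighted sum is computed in exactly the same way each time.</p>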
<p>Now we can finally start building neural networks.<span id="more-848"></span> The simplest kind of network that we can build is... exactly, one perceptron! Here's how we can train it to classify things!</p>
<p><strong>3. Single-layer neural network</strong></p>
<p>We defined the <em>error</em> earlier as <script type='math/tex'>E := \sum_i (h(\vec{x_i}) - y_i)^2</script>. Obviously, since we are using a single perceptron both our error and the output of the network (<script type='math/tex'>h_{\vec{w}}(\vec{x_i}) = \sigma(\vec{w} \cdot \vec{x_i})</script>) depend on the weights vector <script type='math/tex'>\vec{w}</script>.</p>
<p>Incorporating those observations into the updated error measure we obtain <script type='math/tex'>E(\vec{w}) := \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2</script>.</p>
<p>Our goal is to find such a vector of weights <script type='math/tex'>\vec{w}</script> that <script type='math/tex'>E(\vec{w})</script> is minimised - that way our perceptron will correctly predict the output for all inputs of our training examples!</p>
<p>We will do that by applying the <em>gradient descent</em> algorithm: in essence, we will treat the error as a surface over the <script type='math/tex'>(n+1)</script>-dimensional space of weights, find the steepest downward slope at the current point <script type='math/tex'>\vec{w_t}</script> and move in that direction to obtain <script type='math/tex'>\vec{w}_{t+1}</script>. This way we will hopefully reach a minimum point on the error surface, and we will use the coordinates of that point as the final weight vector.</p>
<p>We will skip a great deal of maths on whether the minimum point exists, whether it is unique and global, whether we can "overjump" it by accident, what the conditions are for the following partial derivatives to exist, etc. Instead, we will dive straight in hoping for the best and calculate the <em><a href="http://en.wikipedia.org/wiki/Gradient" target="_blank">gradient</a></em> of the error surface at <script type='math/tex'>\vec{w_t}</script>. Then we will take a step in the direction opposite to the gradient (i.e. in the direction of the steepest descent on the error surface) to obtain <script type='math/tex'>\vec{w}_{t + 1}</script>.</p>
<p>To express it in a slightly more mathematical way, we will start with some <em>randomized (!) </em>weight vector <script type='math/tex'>\vec{w_0}</script> and will train our perceptron by updating the weights</p>
<p>\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}, \end{align}</p>
<p>where <script type='math/tex'>\eta</script> is known as a <em>learning rate</em> (a simple scaling factor that typically ranges between zero and one).</p>
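<p>To see the update rule in action before we plug in the perceptron error, here is a minimal sketch on a made-up error surface <script type='math/tex'>E(\vec{w}) = (w_0 - 3)^2 + (w_1 + 1)^2</script>, whose minimum is at <script type='math/tex'>(3, -1)</script>. The surface, the function names and the fixed (rather than randomized) starting point are all assumptions made for the sake of a reproducible example.</p>

```python
def gradient(w):
    """Gradient of the toy error E(w) = (w_0 - 3)^2 + (w_1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


def gradient_descent(w, eta, steps):
    """Repeatedly apply w_{t+1} := w_t - eta * dE/dw evaluated at w_t."""
    for _ in range(steps):
        g = gradient(w)
        w = [w_j - eta * g_j for w_j, g_j in zip(w, g)]
    return w


# Fixed start instead of a random one, so that the run is reproducible.
w_final = gradient_descent([0.0, 0.0], eta=0.1, steps=200)
```

<p>With <script type='math/tex'>\eta = 0.1</script> every step shrinks the distance to the minimum by a constant factor, so <code>w_final</code> ends up very close to <script type='math/tex'>(3, -1)</script>.</p>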
<p>Observe that</p>
<p>\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = \left( \frac{\partial E(\vec{w})}{\partial w_0},\frac{\partial E(\vec{w})}{\partial w_1}, ... ,\frac{\partial E(\vec{w})}{\partial w_n} \right), \end{align}</p>
<p>and we can calculate</p>
<p>\begin{align} \frac{\partial E(\vec{w})}{\partial w_j} &#038;= \frac{\partial}{\partial w_j} \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2 \\ &#038;= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial}{\partial w_j} (h_{\vec{w}}(\vec{x_i}) - y_i) \\ &#038;= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial}{\partial w_j} \sigma(\vec{x_i} \cdot \vec{w}) \\ &#038;= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \frac{\partial}{\partial w_j} \vec{x_i} \cdot \vec{w} \\ &#038;= \sum_i 2(h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \frac{\partial}{\partial w_j} \sum_{k=0}^n x_{i,k} w_k \\ &#038;= 2 \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \; x_{i,j} \end{align}</p>
<p>for each <script type='math/tex'>0 \leq j \leq n</script>, where <script type='math/tex'>x_{i,j}</script> denotes the <script type='math/tex'>j</script><sup>th</sup> component of <script type='math/tex'>\vec{x_i}</script> (with <script type='math/tex'>x_{i,0} := 1</script> playing the role of <script type='math/tex'>a_0</script>).</p>
<p><strong>3.1. <em>Example single-layer neural network</em></strong></p>
<input type="submit" name="sub_button" onclick="javascript:show_singlelayer_applet()" style="width: 600px;" value="Launch the example single-layer neural network applet" width="600">
<p>In this applet, a perceptron takes two inputs (normalized <em>x</em> and <em>y</em> coordinates, i.e. <script type='math/tex'>a_1 = in_x</script>, <script type='math/tex'>a_2 = in_y</script>) and uses sigmoid as an activation function with the learning rate <script type='math/tex'>\eta = 0.1</script>.</p>
<p>Then, using the previous general result</p>
<p>\begin{align} \frac{\partial E(\vec{w})}{\partial w_j} &#038;= 2 \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (\vec{x_i} \cdot \vec{w}) \; x_{i,j} \\ &#038;= 2 \sum_i (\sigma(\vec{w} \cdot \vec{x_i}) - y_i) \; \sigma(\vec{x_i} \cdot \vec{w}) (1 - \sigma(\vec{x_i} \cdot \vec{w})) \; x_{i,j}, \end{align}</p>
<p>(since for the sigmoid activation function <script type='math/tex'>\sigma ' (x) = \sigma(x) (1 - \sigma(x))</script>); and thus</p>
<p>\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = 2 \sum_i (\sigma(\vec{w} \cdot \vec{x_i}) - y_i) \; \sigma(\vec{x_i} \cdot \vec{w}) (1 - \sigma(\vec{x_i} \cdot \vec{w})) \; \vec{x_i}, \end{align}</p>
<p>where <script type='math/tex'>\vec{x_i} = (1, in_x, in_y)</script> is built from the coordinates of the <script type='math/tex'>i</script><sup>th</sup> point.</p>
<p>The final algorithm to update the weight vector <script type='math/tex'>\vec{w} = (w_0, w_1, w_2)</script> (which is initially randomized) then is</p>
<p>\begin{align} \vec{w}_{t+1} := \vec{w_t} - 0.2 \sum_i (h_{\vec{w}_t}(\vec{x_i}) - y_i) \; h_{\vec{w}_t}(\vec{x_i}) (1 - h_{\vec{w}_t}(\vec{x_i})) \; \vec{x_i}, \end{align}</p>
<p>where <script type='math/tex'>h_{\vec{w}_t}(\vec{x_i}) = \sigma(\vec{w}_t \cdot \vec{x_i})</script> and the factor <script type='math/tex'>0.2</script> is simply <script type='math/tex'>2 \eta</script>.</p>
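<p>Here is how such a training loop might look in Python. It is a sketch under a few assumptions rather than the applet's actual code: the four training points are invented, the weights start at zero instead of being randomized (to keep the run reproducible), and each example's contribution to the gradient is scaled by that example's own input vector <script type='math/tex'>(1, in_x, in_y)</script>.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(w, x):
    """Output of the perceptron, h_w(x) = sigmoid(w . x)."""
    return sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)))


def error(w, examples):
    """Squared error summed over all training examples."""
    return sum((predict(w, x) - y) ** 2 for x, y in examples)


def error_gradient(w, examples):
    """dE/dw for the sigmoid perceptron, accumulated example by example."""
    grad = [0.0] * len(w)
    for x, y in examples:
        h = predict(w, x)
        common = 2.0 * (h - y) * h * (1.0 - h)
        for j in range(len(w)):
            grad[j] += common * x[j]
    return grad


def train(w, examples, eta, epochs):
    """Batch gradient descent: w := w - eta * dE/dw."""
    for _ in range(epochs):
        grad = error_gradient(w, examples)
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w


# Hypothetical, linearly separable points written as (1, in_x, in_y) -> class.
examples = [((1.0, 0.0, 0.0), 0.0), ((1.0, 0.2, 0.1), 0.0),
            ((1.0, 0.9, 1.0), 1.0), ((1.0, 1.0, 0.8), 1.0)]
w_start = [0.0, 0.0, 0.0]
w_trained = train(w_start, examples, eta=0.1, epochs=10000)
```

<p>After training, <code>predict(w_trained, x)</code> should land above 0.5 for the points labelled 1 and below 0.5 for the points labelled 0.</p>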
<p>However, a single perceptron is extremely limited in the sense that different classes of examples must be separable with a hyperplane (hence the name, <em>linear </em>classifier), which is usually not the case in real-life applications.</p>
<p>Time to bump things up a notch: let's connect a few of them together to obtain a multilayer feed-forward neural network!</p>
<p><strong>4. Multilayer neural network</strong></p>
<p>Let's consider a general case first: a completely unrestricted feed-forward structure (with the only condition being that there are no loops between the perceptrons to avoid general madness and chaos). </p>
<p>Since it is structurally more complex than just a single perceptron, take a look at the following figure that explains some more notation:</p>
<p><small><div class="wp-caption aligncenter" style="width: 625px"><img title="Multilayer neural network" src="http://blog.zabarauskas.com/img/multilayer.gif" alt="Multilayer neural network" width="615" height="291" /><p class="wp-caption-text">Multilayer neural network</p></div></small></p>
<p>Here the weight <script type='math/tex'>w_{i \rightarrow j}</script> connects perceptrons <script type='math/tex'>i</script> and <script type='math/tex'>j</script>, the sum of the weighted inputs of perceptron <script type='math/tex'>j</script> is denoted by <script type='math/tex'>s_j := \sum_k z_k w_{k \rightarrow j}</script> where <script type='math/tex'>k</script> iterates over all perceptrons whose outputs feed into <script type='math/tex'>j</script>, and the output of <script type='math/tex'>j</script> is written as <script type='math/tex'>z_j := \sigma(s_j)</script>, where <script type='math/tex'>\sigma</script> is <script type='math/tex'>j</script>'s activation function.</p>
<p>We will use the same error measure <script type='math/tex'>E(\vec{w}) := \sum_i (h_{\vec{w}}(\vec{x_i}) - y_i)^2</script>, except now the weights vector <script type='math/tex'>\vec{w}</script> will contain all the weights in the network, i.e. <script type='math/tex'>\vec{w} = (\;\;w_{i \rightarrow j}\;\;)</script> for all <script type='math/tex'>i, j</script>.</p>
<p>To find <script type='math/tex'>\vec{w}</script> that minimizes <script type='math/tex'>E(\vec{w})</script> using gradient descent we have to calculate <script type='math/tex'>\frac{\partial E(\vec{w})}{\partial \vec{w}}</script> (again). However, this time it is (very slightly) more involved.</p>
<p>First of all let's separate the contributions of individual training examples to the overall error using the following observation:<br />
\begin{align} \frac{\partial E(\vec{w})}{\partial \vec{w}} = \sum_i \frac{\partial E_i(\vec{w})}{\partial \vec{w}}, \end{align}<br />
where <script type='math/tex'>E_i(\vec{w}) = (h_{\vec{w}}(\vec{x_i}) - y_i)^2</script>.</p>
<p>Then</p>
<p>\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &#038;= \frac{\partial}{\partial w_{j \rightarrow k}} (h_{\vec{w}}(\vec{x_i}) - y_i)^2 \\ &#038;= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial w_{j \rightarrow k}} \\ &#038;=  2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} \frac{\partial s_k}{\partial w_{j \rightarrow k}} \\ &#038;= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} z_j. \end{align}</p>
<p>If <script type='math/tex'>k</script> is an output node, then<br />
\begin{align} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} = \frac{d \;\; \sigma(s_k)}{d \; s_k}  = \sigma' (s_k)\end{align}<br />
and thus<br />
\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &#038;= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j. \end{align}</p>
<p>However, if <script type='math/tex'>k</script> is not an output node, then a change in <script type='math/tex'>s_k</script> can affect all the nodes which are connected to <script type='math/tex'>k</script>'s output, i.e.<br />
\begin{align} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k} &#038;= \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial z_k} \frac{\partial z_k}{\partial s_k} \\ &#038;= \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial z_k} \sigma ' (s_k) \\ &#038;= \sum_{o \in \{ v \; | \; v \text{ is connected to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} \frac{\partial s_o}{\partial z_k} \sigma ' (s_k) \\ &#038;= \sum_{o \in \{ v \; | \; v \text{ is connected to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} w_{k \rightarrow o} \; \sigma ' (s_k), \end{align}<br />
... and we are almost done! All that is left to do is to place the <script type='math/tex'>i</script><sup>th</sup> example at the inputs of our neural network, calculate <script type='math/tex'>s_k</script> and <script type='math/tex'>z_k</script> for all the nodes (the <em>forward-propagation</em> step) and to work our way backwards from the output node calculating <script type='math/tex'>\frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_k}</script> (hence the name, <em>backpropagation</em>).</p>
<p>To summarize, if <script type='math/tex'>k</script> is an output node, then</p>
<p>\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &#038;= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j, \end{align}</p>
<p>otherwise</p>
<p>\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}} &#038;= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_k)\; z_j \sum_{o \in \{ v \; | \; v \text{ conn. to } k \}} \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_o} w_{k \rightarrow o}. \end{align}</p>
<p>Then after the following is obtained<br />
\begin{align} \frac{\partial E_i(\vec{w})}{\partial \vec{w}} = \left( \; \; \frac{\partial E_i(\vec{w})}{\partial w_{j \rightarrow k}}   \; \; \right), \forall j, k \end{align}<br />
the weight vector can either be updated in one go (<em>batch</em> update)<br />
\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}} =  \vec{w_t} - \eta \sum_i \frac{\partial E_i(\vec{w})}{\partial \vec{w}}\bigg|_{\vec{w_t}}, \end{align}<br />
or it can be updated <em>sequentially</em> using one training example at a time:<br />
\begin{align} \vec{w}_{t+1} := \vec{w_t} - \eta \frac{\partial E_i(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}.\end{align}</p>
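<p>The forward and backward passes for a general loop-free network can be sketched as follows. Everything about the representation is an assumption made for this example: weights live in a dictionary keyed by <code>(j, k)</code> pairs, the non-input nodes are listed in an order in which every node comes after the nodes feeding it, the last node in that list is the single output, and every node uses the sigmoid. The tiny network at the bottom (including its skip connection from the input straight to the output) is made up.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, order, inputs):
    """Forward-propagation: compute and store z for every node."""
    z = dict(inputs)
    for k in order:
        s_k = sum(w * z[j] for (j, kk), w in weights.items() if kk == k)
        z[k] = sigmoid(s_k)
    return z


def backprop(weights, order, inputs, y):
    """Return dE_i/dw for every weight, keyed by (j, k)."""
    z = forward(weights, order, inputs)
    out = order[-1]
    dh_ds = {}
    # Work backwards from the output node, so dh/ds_o is always ready.
    for k in reversed(order):
        sigma_prime = z[k] * (1.0 - z[k])
        if k == out:
            dh_ds[k] = sigma_prime
        else:
            dh_ds[k] = sigma_prime * sum(
                dh_ds[o] * w for (kk, o), w in weights.items() if kk == k)
    return {(j, k): 2.0 * (z[out] - y) * dh_ds[k] * z[j]
            for (j, k) in weights}


# Hypothetical network: a bias input and one real input, hidden nodes 1 and 2,
# output node 3.
weights = {("bias", 1): 0.1, ("x", 1): -0.4, ("bias", 2): 0.3, ("x", 2): 0.8,
           ("bias", 3): -0.2, ("x", 3): 0.5, (1, 3): 0.7, (2, 3): -0.6}
order = [1, 2, 3]
inputs = {"bias": 1.0, "x": 0.25}
y = 1.0
grads = backprop(weights, order, inputs, y)
```

<p>A sequential update then amounts to subtracting <code>eta * grads[key]</code> from each <code>weights[key]</code>, while a batch update sums the gradients over all training examples first.</p>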
<p><strong>4.1. <em>Example multilayer network</em></strong></p>
<input type="submit" name="sub_button" onclick="javascript:show_multiplelayer_applet()" style="width: 600px;" value="Launch the example multilayer neural network applet" width="600">
<p>If you launch and play with the applet above, you will see that it is able to separate classes non-linearly (indicating that it's using more than one perceptron). It is built using this two-layer neural network:</p>
<p><small><div class="wp-caption aligncenter" style="width: 440px"><img title="Two-layer neural network example" src="http://blog.zabarauskas.com/img/multilayer_example.gif" alt="Two-layer neural network example" width="430" height="297" /><p class="wp-caption-text">Two-layer neural network example</p></div></small></p>
<p>The weights vector <script type='math/tex'>\vec{w}</script> contains all the weights in the network, i.e.<br />
\begin{align} \vec{w} = ( w_{in_1 \rightarrow 1}, w_{in_x \rightarrow 1}, w_{in_y \rightarrow 1}, w_{in_1 \rightarrow 2}, ..., w_{in_y \rightarrow 5}, w_{in_1 \rightarrow 6}, w_{1 \rightarrow 6}, w_{2 \rightarrow 6}, ..., w_{5 \rightarrow 6}). \end{align}</p>
<p>Each perceptron is using <i>sigmoid</i> as its activation function and the output of the perceptron <script type='math/tex'>6</script> is the output for the whole network, i.e. <script type='math/tex'>h_{\vec{w}}(\vec{x_i}) = z_6</script>.</p>
<p>Then an individual point <i>i</i> (with <i>x</i> and <i>y</i> coordinates normalized) is treated as the <script type='math/tex'>i</script><sup>th</sup> training example and fed through the network. While it is being propagated, each <script type='math/tex'>s_k</script> and <script type='math/tex'>z_k</script> for <script type='math/tex'>k = 1, ..., 6</script> is stored.</p>
<p>Then the gradient of the <script type='math/tex'>i</script><sup>th</sup> error surface is calculated as follows:<br />
\begin{align}<br />
\frac{\partial E_i(\vec{w})}{\partial \vec{w}} &#038;= \left( \frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 1}},\frac{\partial E_i(\vec{w})}{\partial w_{in_x \rightarrow 1}}, ..., \frac{\partial E_i(\vec{w})}{\partial w_{in_y \rightarrow 5}},\frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 6}},\frac{\partial E_i(\vec{w})}{\partial w_{1 \rightarrow 6}},\frac{\partial E_i(\vec{w})}{\partial w_{2 \rightarrow 6}}, ..., \frac{\partial E_i(\vec{w})}{\partial w_{5 \rightarrow 6}} \right) , \end{align}<br />
where<br />
\begin{align} \frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 1}} &#038;= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_1)\; \frac{\partial h_{\vec{w}}(\vec{x_i})}{\partial s_6} w_{1 \rightarrow 6} \\<br />
&#038;= 2 (z_6 - y_i) \; \sigma (s_1) \; (1 -  \sigma (s_1)) \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{1 \rightarrow 6}, \\<br />
\frac{\partial E_i(\vec{w})}{\partial w_{in_x \rightarrow 1}} &#038;= 2 (z_6 - y_i) \; \sigma (s_1) \; (1 -  \sigma (s_1)) \; {in}_x \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{1 \rightarrow 6}, \\<br />
&#038; \vdots \\<br />
\frac{\partial E_i(\vec{w})}{\partial w_{in_y \rightarrow 5}} &#038;= 2 (z_6 - y_i) \; \sigma (s_5) \; (1 -  \sigma (s_5)) \; {in}_y \; \sigma (s_6) \; (1 - \sigma (s_6)) \; w_{5 \rightarrow 6}, \\<br />
\frac{\partial E_i(\vec{w})}{\partial w_{in_1 \rightarrow 6}} &#038;= 2 (h_{\vec{w}}(\vec{x_i}) - y_i) \; \sigma ' (s_6) \\<br />
&#038;= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 -  \sigma (s_6)) , \\<br />
\frac{\partial E_i(\vec{w})}{\partial w_{1 \rightarrow 6}} &#038;= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 -  \sigma (s_6)) \; z_1, \\<br />
\frac{\partial E_i(\vec{w})}{\partial w_{2 \rightarrow 6}} &#038;= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 -  \sigma (s_6)) \; z_2, \\<br />
&#038; \vdots \\<br />
\frac{\partial E_i(\vec{w})}{\partial w_{5 \rightarrow 6}} &#038;= 2 (z_6 - y_i) \; \sigma (s_6) \; (1 -  \sigma (s_6)) \; z_5.<br />
\end{align}</p>
<p>Finally, the network is sequentially trained with the learning rate <script type='math/tex'>\eta = 0.5</script> (starting with a random initial weight vector <script type='math/tex'>\vec{w_0}</script>)<br />
\begin{align} \vec{w}_{t+1} := \vec{w_t} - 0.5 \frac{\partial E_i(\vec{w})}{\partial \vec{w}} \bigg|_{\vec{w_t}}.\end{align}</p>
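<p>For the curious, the formulas of this example translate into Python roughly like this. It is a sketch of the same two-layer structure (five hidden perceptrons, one output perceptron, sigmoid everywhere, <script type='math/tex'>\eta = 0.5</script>), not the source of the applet; the data layout, the random seed, the number of passes and the four XOR-like training points are assumptions.</p>

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, in_x, in_y):
    """Return the outputs z_1..z_5 of the hidden layer and the output z_6."""
    z_hidden = [sigmoid(w[0] + w[1] * in_x + w[2] * in_y) for w in w_hidden]
    s_6 = w_out[0] + sum(w_out[k + 1] * z_hidden[k] for k in range(5))
    return z_hidden, sigmoid(s_6)


def gradient(w_hidden, w_out, in_x, in_y, y):
    """Gradient of E_i with respect to all 21 weights."""
    z_hidden, z_6 = forward(w_hidden, w_out, in_x, in_y)
    delta_6 = 2.0 * (z_6 - y) * z_6 * (1.0 - z_6)
    g_out = [delta_6] + [delta_6 * z_k for z_k in z_hidden]
    g_hidden = []
    for k in range(5):
        delta_k = delta_6 * w_out[k + 1] * z_hidden[k] * (1.0 - z_hidden[k])
        g_hidden.append([delta_k, delta_k * in_x, delta_k * in_y])
    return g_hidden, g_out


def train_step(w_hidden, w_out, example, eta=0.5):
    """One sequential update using a single training example."""
    (in_x, in_y), y = example
    g_hidden, g_out = gradient(w_hidden, w_out, in_x, in_y, y)
    new_hidden = [[w - eta * g for w, g in zip(ws, gs)]
                  for ws, gs in zip(w_hidden, g_hidden)]
    new_out = [w - eta * g for w, g in zip(w_out, g_out)]
    return new_hidden, new_out


random.seed(0)  # arbitrary seed, only to make the run repeatable
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
w_out = [random.uniform(-1, 1) for _ in range(6)]
# Hypothetical points that no single straight line can separate.
points = [((0.0, 0.0), 0.0), ((1.0, 1.0), 0.0),
          ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0)]
for _ in range(2000):
    for example in points:
        w_hidden, w_out = train_step(w_hidden, w_out, example)
```

<p>Each entry of <code>w_hidden</code> holds the three weights going into one hidden perceptron (from <script type='math/tex'>in_1</script>, <script type='math/tex'>in_x</script> and <script type='math/tex'>in_y</script>), and <code>w_out</code> holds the six weights going into perceptron 6.</p>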
<p>That's it, I hope it sheds some light on the backpropagation!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zabarauskas.com/backpropagation-tutorial/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Eigenfaces Tutorial</title>
		<link>http://blog.zabarauskas.com/eigenfaces-tutorial/</link>
		<comments>http://blog.zabarauskas.com/eigenfaces-tutorial/#comments</comments>
		<pubDate>Fri, 02 Oct 2009 16:43:22 +0000</pubDate>
		<dc:creator>Manfredas Zabarauskas</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[eigenface]]></category>
		<category><![CDATA[eigenfaces]]></category>
		<category><![CDATA[face detection]]></category>
		<category><![CDATA[face recognition]]></category>
		<category><![CDATA[pca]]></category>
		<category><![CDATA[pentland]]></category>
		<category><![CDATA[turk]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://blog.zabarauskas.com/?p=286</guid>
		<description><![CDATA[The main purpose behind writing this tutorial was to provide a more detailed set of instructions for someone who is trying to implement an eigenface-based face detection or recognition system. It is assumed that the reader is familiar (at least to some extent) with the eigenface technique as described in the original M. Turk [...]]]></description>
			<content:encoded><![CDATA[<p><i>The main purpose behind writing this tutorial was to provide a more detailed set of instructions for someone who is trying to implement an eigenface-based face detection or recognition system. It is assumed that the reader is familiar (at least to some extent) with the eigenface technique as described in the original M. Turk and A. Pentland papers (see "References" for more details). </i></p>
<h3>Introduction</h3>
<p>The idea behind eigenfaces is similar (to a certain extent) to the one behind representing a periodic signal as a sum of simple oscillating functions in a <a href="http://en.wikipedia.org/wiki/Fourier_series" target="_blank">Fourier decomposition</a>. The technique described in this tutorial, as well as in the original papers, also aims to represent a face as a linear combination of the base images (called the eigenfaces).</p>
<p>The recognition/detection process consists of initialization, during which the eigenface basis is established, and face classification, during which a new image is projected onto the "face space" and the resulting image is categorized by the weight patterns as a known-face, an unknown-face or a non-face image.</p>
<h3>Demonstration</h3>
<p>To <a href="http://www.zabarauskas.com/downloads/Eigenfaces.zip">download</a> the software shown in the video for the 32-bit x86 platform, click <a href="http://www.zabarauskas.com/downloads/Eigenfaces.zip">here</a>. It was compiled using Microsoft Visual C++ 2008 and uses <a href="http://www.gnu.org/software/gsl/" target="_blank">GSL</a> for Windows.</p>
<p><object width="480" height="385"><param name="movie" value="http://www.youtube.com/v/YWRiF7FAuKE&#038;hl=en&#038;fs=1&#038;"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/YWRiF7FAuKE&#038;hl=en&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"></embed></object></p>
<h3>Establishing the Eigenface Basis</h3>
<p>First of all, we have to obtain a training set of <script type='math/tex'>M</script> grayscale face images  <script type='math/tex'>I_1, I_2, ..., I_M</script>. They should be:</p>
<ol>
<li> face-wise aligned, with eyes at the same level and faces of the same scale,</li>
<li> normalized so that every pixel has a value between 0 and 255 (i.e. one byte per pixel encoding), and</li>
<li>of the same <script type='math/tex'>N \times N</script> size.</li>
</ol>
<p>So just capturing everything formally, we want to obtain a set <script type='math/tex'>\{ I_1, I_2, ..., I_M \}</script>, where \begin{align} I_k = \begin{bmatrix} p_{1,1}^k &#038; p_{1,2}^k &#038; ... &#038; p_{1,N}^k \\ p_{2,1}^k &#038; p_{2,2}^k &#038; ... &#038; p_{2,N}^k \\ \vdots \\ p_{N,1}^k &#038; p_{N,2}^k &#038; ... &#038; p_{N,N}^k \end{bmatrix}_{N \times N} \end{align} and <script type='math/tex'>0 \leq p_{i,j}^k \leq 255.</script></p>
<p>Once we have that, we should change the representation of a face image <script type='math/tex'>I_k</script> from an <script type='math/tex'>N \times N</script> matrix to a point <script type='math/tex'>\Gamma_k</script> in <script type='math/tex'>N^2</script>-dimensional space. Now here is how we do it: <span id="more-286"></span>we concatenate all the rows of the matrix <script type='math/tex'>I_k</script> into one big vector of dimension <script type='math/tex'>N^2</script>. Can it get any simpler than that?</p>
<p>This is how it looks formally:</p>
<p><center><script type='math/tex'>\Gamma_k = \begin{bmatrix} p_{1,1}^k \\ p_{1,2}^k \\ \vdots \\ p_{1,N}^k \\ p_{2,1}^k \\ p_{2,2}^k \\ \vdots \\ p_{2,N}^k \\ \vdots \\ p_{N,1}^k \\ p_{N,2}^k \\ \vdots \\ p_{N,N}^k \end{bmatrix}_{N \times 1}</script>, where  <script type='math/tex'>k = 1, ..., M</script> and <script type='math/tex'>p_{i,j}^k \in I_k</script></center></p>
<p>Since we are much more interested in the characteristic features of those faces, let's subtract everything that is common between them, i.e. the <strong>average face</strong>.<br />
The average face of the training images can be defined as <script type='math/tex'>\Psi = {{1}\over{M}} \sum_{i=1}^{M} \Gamma_i</script>, then each face differs from the average by the vector <script type='math/tex'>\Phi_i = \Gamma_i - \Psi</script>.</p>
<p>Now we should attempt to find a set of orthonormal vectors which best describe the distribution of our data. At first glance the task looks daunting, and the necessary steps would seem to be:</p>
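<p>The steps so far (flattening each image row by row, averaging, and subtracting the average) might look like this in plain Python. The three tiny <script type='math/tex'>2 \times 2</script> "images" are invented purely for illustration, and so are the function names.</p>

```python
def flatten(image):
    """Concatenate the rows of an N x N image into one vector of length N^2."""
    return [pixel for row in image for pixel in row]


def average_face(gammas):
    """Component-wise mean of the flattened images (Psi)."""
    m = len(gammas)
    return [sum(column) / m for column in zip(*gammas)]


def subtract(gamma, psi):
    """Difference vector Phi = Gamma - Psi."""
    return [g - p for g, p in zip(gamma, psi)]


# Three made-up 2 x 2 grayscale images with pixel values between 0 and 255.
images = [[[0, 50], [100, 150]],
          [[20, 70], [120, 170]],
          [[40, 30], [200, 130]]]
gammas = [flatten(image) for image in images]
psi = average_face(gammas)
phis = [subtract(gamma, psi) for gamma in gammas]
```

<p>By construction the difference vectors in <code>phis</code> add up to the zero vector, which is a handy sanity check for a real implementation as well.</p>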
<ol>
<li>Obtain a <a href="http://en.wikipedia.org/wiki/Covariance_matrix" target="_blank">covariance matrix</a><br />
<script type='math/tex'>C = {{1}\over{M}} \sum_{i=1}^{M} \Phi_i \Phi_i^T = AA^T</script>, where <script type='math/tex'>A = {{1}\over{\sqrt{M}}} \left[ \Phi_1 \Phi_2 ... \Phi_M \right]</script>.</li>
<li>Find the eigenvectors <script type='math/tex'>u_k</script> and eigenvalues <script type='math/tex'>\lambda_k</script> of <script type='math/tex'>C</script>.</li>
</ol>
<p>However, note two things here: <script type='math/tex'>A</script> is of the size <script type='math/tex'>N^2 \times M</script> and hence the matrix <script type='math/tex'>C</script> is of the size <script type='math/tex'>N^2 \times N^2</script>. To put things into perspective - if your image size is <script type='math/tex'>128 \times 128</script>, then the size of the matrix <script type='math/tex'>C</script> would be <script type='math/tex'>16384 \times 16384</script>. Determining eigenvectors and eigenvalues for a matrix this size would be a prohibitively expensive task!</p>
<p>So how do we go about it? A simple mathematical trick: first let's calculate the inner product matrix <script type='math/tex'>L = A^T A</script>, of the size <script type='math/tex'>M \times M</script>. Then let's find the eigenvectors <script type='math/tex'>v_i, i = 1, ..., M</script> of <script type='math/tex'>L</script> (each of dimension <script type='math/tex'>M</script>). Now observe that if <script type='math/tex'>L v_i = \lambda_i v_i</script>, then</p>
<p><center>\begin{array} {rcl} A L v_i &#038;=&#038; \lambda_i A v_i \Rightarrow \\ A A^T A v_i &#038;=&#038; \lambda_i A v_i \Rightarrow \\ C A v_i &#038;=&#038; \lambda_i A v_i, \end{array}</center></p>
<p>and hence <script type='math/tex'>u_i = A v_i</script> and <script type='math/tex'>\lambda_i</script> are, respectively, <script type='math/tex'>M</script> eigenvectors (each of dimension <script type='math/tex'>N^2</script>) and the corresponding eigenvalues of <script type='math/tex'>C</script>. Make sure to normalize <script type='math/tex'>u_i</script>, such that <script type='math/tex'>\left\| u_i \right\| = 1</script>.</p>
<p>We will call these eigenvectors <script type='math/tex'>u_i</script> the <strong>eigenfaces</strong>. Rescale their values to the range 0 to 255 and render them on the screen to see why.</p>
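<p>To make the trick concrete, here is a small sketch in plain Python that checks it on toy data. It is only an illustration under a few assumptions: the three 4-pixel vectors are made up, the helper names are hypothetical, and power iteration stands in for a proper eigen-solver (it finds only the largest eigenpair and assumes the start vector is not orthogonal to it).</p>
<pre>
```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def top_eigenpair(sym, iterations=200):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    v = [1.0] + [0.5] * (len(sym) - 1)  # arbitrary start vector
    for _ in range(iterations):
        w = [dot(row, v) for row in sym]
        norm = math.sqrt(dot(w, w))
        v = [c / norm for c in w]
    lam = dot(v, [dot(row, v) for row in sym])  # Rayleigh quotient
    return lam, v


# Columns of A: three hypothetical mean-adjusted 4-pixel faces (M = 3, N^2 = 4).
phis = [[-1.0, -1.0, 1.0, 2.0],
        [1.0, -1.0, -1.0, -2.0],
        [0.0, 2.0, 0.0, 0.0]]

# L = A^T A is only M x M: entry (i, j) is the inner product of faces i and j.
L = [[dot(pi, pj) for pj in phis] for pi in phis]
lam, v = top_eigenpair(L)

# Map back: u = A v is a weighted sum of the face vectors, then normalise.
u = [sum(v[i] * phis[i][k] for i in range(len(phis)))
     for k in range(len(phis[0]))]
length = math.sqrt(dot(u, u))
u = [c / length for c in u]

# The big matrix C = A A^T (N^2 x N^2) is built here only to verify the claim.
n = len(u)
C = [[sum(p[r] * p[c] for p in phis) for c in range(n)] for r in range(n)]
residual = [dot(C[r], u) - lam * u[r] for r in range(n)]
```
</pre>
<p>Every entry of <code>residual</code> should be zero up to rounding error, confirming that <code>u</code> is an eigenvector of <script type='math/tex'>C</script> with the same eigenvalue that was found for <script type='math/tex'>L</script>.</p>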
<p>It turns out that quite a few eigenfaces with the smallest eigenvalues can be discarded, so keep only the <script type='math/tex'>R \leq M</script> ones with the largest eigenvalues (i.e. only the ones making the greatest contribution to the variance of the original image set) and chuck them into the matrix <script type='math/tex'>U = \left[ u_1 u_2 \ldots u_R \right]_{N^2 \times R}</script>.</p>
<p>After you have done that, congratulations! Apart from the average face <script type='math/tex'>\Psi</script>, we won't need anything but the matrix <script type='math/tex'>U</script> for the next steps: face detection and classification.</p>
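<p>How many eigenfaces to keep is a judgment call. One common heuristic, which is an assumption of this sketch and not something prescribed above, is to keep the smallest number of eigenfaces whose eigenvalues cover a chosen fraction of the total variance. The function name and the eigenvalues below are made up for illustration.</p>
<pre>
```python
def choose_rank(eigenvalues, keep_fraction=0.95):
    """Smallest R whose largest eigenvalues cover keep_fraction of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for r, lam in enumerate(ordered, start=1):
        running += lam
        if running >= keep_fraction * total:
            return r
    return len(ordered)


# Hypothetical eigenvalues of L, in no particular order.
R = choose_rank([6.0, 0.5, 12.0, 1.5], keep_fraction=0.95)
```
</pre>
<p>Here the three largest eigenvalues sum to 19.5 out of 20, so <code>R</code> comes out as 3 and the eigenface with the eigenvalue 0.5 is dropped.</p>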
<h3>Face Classification Using Eigenfaces</h3>
<p>Once the eigenfaces are created, a new face image <script type='math/tex'>\Gamma</script> can be transformed into its eigenface components by a simple operation:</p>
<p><center><script type='math/tex'>\Omega = U^T (\Gamma - \Psi) =  \begin{bmatrix} \omega_1 \\ \omega_2 \\ \vdots \\ \omega_R \end{bmatrix}_{R \times 1}</script>.</center></p>
<p>The weights <script type='math/tex'>\omega_i \in \Omega</script> describe the contribution of each eigenface in representing the input face image. We can use this vector for <strong>face recognition</strong> by finding the smallest <a href="http://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> <script type='math/tex'>\epsilon_{rec}</script> between the weight vector of the input face and the weight vectors <script type='math/tex'>\Omega_i</script> of the training faces, i.e. by calculating <script type='math/tex'>\epsilon_{rec} = \min_i \left\| \Omega - \Omega_i \right\|</script>. If <script type='math/tex'>\epsilon_{rec} < \Theta_{rec}</script>, where <script type='math/tex'>\Theta_{rec}</script> is a threshold chosen heuristically, then we can say that the input image is recognized as the training face that gives this smallest distance.</p>
<p>The weight vector can also be used for <strong>face detection</strong> in an unknown image, exploiting the fact that images of faces do not change radically when projected into the face space, while projections of non-face images appear quite different. To do so, we can calculate the distance <script type='math/tex'>\epsilon_{det}</script> between the mean-adjusted input image <script type='math/tex'>\Phi = \Gamma - \Psi</script> and its projection onto face space <script type='math/tex'>\Phi_f = \sum_{i=1}^R \omega_i u_i </script>, i.e. <script type='math/tex'>\epsilon_{det} = \left\| \Phi - \Phi_f \right\|</script>. Again, if <script type='math/tex'>\epsilon_{det} < \Theta_{det}</script> for some threshold <script type='math/tex'>\Theta_{det}</script> (also obtained heuristically, for example, by observing <script type='math/tex'>\epsilon_{det}</script> for an input set consisting only of face images and a set of non-face images) we can conclude that the input image is a face.</p>
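<p>The recognition step can be sketched in a few lines of plain Python. This is a toy illustration, not a tuned implementation: the names <code>project</code> and <code>recognise</code>, the 3-pixel data and the threshold value are all assumptions made for the example, and <code>U</code> is stored as a list of <script type='math/tex'>R</script> eigenfaces (i.e. the rows of <script type='math/tex'>U^T</script>).</p>
<pre>
```python
import math


def project(U, psi, gamma):
    """Omega = U^T (Gamma - Psi), with U given as a list of R eigenfaces."""
    phi = [g - p for g, p in zip(gamma, psi)]
    return [sum(a * b for a, b in zip(u, phi)) for u in U]


def recognise(omega, training_omegas, threshold):
    """Index of the closest training face, or None if it is too far away."""
    distances = [math.dist(omega, t) for t in training_omegas]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best if threshold > distances[best] else None


# Hypothetical toy data: R = 2 orthonormal eigenfaces over 3 pixels.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
psi = [1.0, 1.0, 1.0]
training = [project(U, psi, g) for g in ([2.0, 1.0, 1.0], [1.0, 3.0, 1.0])]

match = recognise(project(U, psi, [1.9, 1.2, 1.0]), training, threshold=0.5)
stranger = recognise(project(U, psi, [5.0, 5.0, 5.0]), training, threshold=0.5)
```
</pre>
<p>The first query lies close to training face 0 and is matched to it, while the second is farther than the threshold from every training face and is rejected with <code>None</code>.</p>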
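<p>The detection measure can be sketched the same way. Again this is only an illustration: the function name and the toy eigenfaces are hypothetical, and the code assumes the eigenfaces in <code>U</code> are orthonormal, as arranged earlier.</p>
<pre>
```python
import math


def distance_from_face_space(U, psi, gamma):
    """epsilon_det = || Phi - Phi_f || for eigenfaces U and average face psi."""
    phi = [g - p for g, p in zip(gamma, psi)]
    # Weights omega_i = u_i . Phi
    omega = [sum(a * b for a, b in zip(u, phi)) for u in U]
    # Projection onto face space: Phi_f = sum_i omega_i u_i
    phi_f = [sum(w * u[k] for w, u in zip(omega, U)) for k in range(len(phi))]
    return math.dist(phi, phi_f)


# Hypothetical toy data: R = 2 orthonormal eigenfaces over 3 pixels.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
psi = [1.0, 1.0, 1.0]

eps_inside = distance_from_face_space(U, psi, [3.0, 0.0, 1.0])
eps_outside = distance_from_face_space(U, psi, [1.0, 1.0, 4.0])
```
</pre>
<p>The first input lies entirely within the span of the eigenfaces, so its distance is zero. The second differs from the average face only along a direction the eigenfaces cannot represent, so its distance is 3 and it would be rejected by any threshold below that.</p>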
<h3>References</h3>
<p>1. Face Recognition Using Eigenfaces, Matthew A. Turk and Alex P. Pentland, MIT Vision and Modeling Lab, CVPR ’91.<br />
2. Eigenfaces for Recognition, Matthew A. Turk and Alex P. Pentland, Journal of Cognitive Neuroscience ’91.<br />
3. <a href="http://www.scholarpedia.org/article/Eigenfaces" target="_blank">Eigenfaces</a>. Sheng Zhang and Matthew Turk (2008), Scholarpedia, 3(9):4244. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zabarauskas.com/eigenfaces-tutorial/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>


