## Perceptron Convergence Theorem

I am enrolled in an online machine learning course through Georgia Tech. It’s not as mathematically rigorous as I would like, but the material is nevertheless quite interesting. Today, I want to share a brief result with you about the convergence of a learning method for perceptrons.

In the context of artificial neural networks, a perceptron is an “artificial neuron”, the basic constituent of such networks. An artificial neuron takes a feature vector of inputs, computes a weighted sum of the inputs, passes the sum to a function called the activation (or transfer) function, and outputs the result. For a perceptron, this activation function is binary, returning a class in $\left\{\pm 1\right\}$.

Formally, we have a feature vector $\vec{x}=(x_{1},\ldots,x_{n})\in\mathbb{R}^{n}$ and a classification function

$\displaystyle f:\mathbb{R}^{n}\rightarrow\left\{\pm 1\right\},\qquad f(\vec{x})=\begin{cases}+1 & \text{if }\vec{x}\cdot\vec{w}\geq\theta\\ -1 & \text{if }\vec{x}\cdot\vec{w}<\theta\end{cases}$

where $\vec{w}=(w_{1},\ldots,w_{n})\in\mathbb{R}^{n}$ is a fixed weight vector, and $\theta\in\mathbb{R}$ is the “threshold” constant. Some authors use the class set $\left\{0,1\right\}$, but I have been reading “Machine Learning” by T. Mitchell (I don’t recommend it), which does not. It would be nice if we could just take $\theta=0$, so we embed each feature vector into $\mathbb{R}^{n+1}$ by introducing $x_{0}:=1$ and $w_{0}:=-\theta$; the condition $\vec{x}\cdot\vec{w}\geq\theta$ then becomes $\vec{x}\cdot\vec{w}\geq 0$ in the embedded coordinates. For additional convenience, we introduce the “nonstandard” signum function, taking $+1$ on nonnegatives and $-1$ on negatives.
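As a concrete sketch of the definitions so far (the helper names below are mine, not from any particular library), the decision rule and the $\mathbb{R}^{n+1}$ embedding might look like:

```python
import numpy as np

def embed(x):
    """Prepend the constant feature x_0 = 1, folding the threshold into w_0 = -theta."""
    return np.concatenate(([1.0], x))

def predict(w, x):
    """Perceptron output: the "nonstandard" signum of the weighted sum (+1 on zero)."""
    return 1 if np.dot(w, x) >= 0 else -1
```

With $w_{0}=-\theta$ stored in the first slot of the weight vector, testing $\vec{x}\cdot\vec{w}\geq\theta$ in the original coordinates is the same as testing $\vec{x}\cdot\vec{w}\geq 0$ in the embedded ones.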

So how does a perceptron learn how to “perceive”? Specifically, how do we find a “good” choice of weight vector $\vec{w}$? Well, if we have some data, it would be nice if the perceptron classified all of our training examples correctly. So let’s try to find a $\vec{w}$ which does this.

Consider a finite set $\left\{(\vec{x}_{i},y_{i})\right\}_{i=1}^{N}$ of training examples consisting of a feature vector $\vec{x}_{i}\in\mathbb{R}^{n+1}$ ($x_{0}=1$) and a class label $y_{i}\in\left\{\pm1\right\}$. We start with an initial weight $\vec{w}\gets\vec{w}_{0}$, which is typically zero or a small random vector, and a small positive constant $\eta$ called the learning rate, which is used to moderate the magnitude of the weight updates. We then iterate through the training examples, testing if the perceptron correctly classifies each $\vec{x}_{i}$. If it does, we do nothing and move on to the next training example. Otherwise, we update $\vec{w}$ by adding the term $\eta y_{i}\vec{x}_{i}$. The intuition behind the update term is that it pushes the quantity $\vec{w}\cdot\vec{x}_{i}$ in the direction of the class $y_{i}$. Indeed,

$\displaystyle\vec{x}_{i}\cdot\left(\eta y_{i}\vec{x}_{i}\right)=\eta y_{i}\left\|\vec{x}_{i}\right\|^{2}$

which has the sign of the correct class $y_{i}$, and hence the opposite sign of the incorrect prediction. We keep performing rounds of iteration through the training examples until we complete a round with no misclassifications.
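The training procedure just described can be sketched as follows (a minimal implementation, assuming labels in $\{\pm 1\}$; the function and parameter names are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=1000):
    """Perceptron learning rule.

    X: (N, n) feature matrix (a constant x_0 = 1 column is prepended here),
    y: length-N array of labels in {+1, -1}.
    Returns the learned weight vector, or None if no clean pass over the
    data is achieved within max_epochs rounds.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # embed: x_0 = 1
    w = np.zeros(X.shape[1])                      # initial weight w_0 <- 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            pred = 1 if np.dot(w, x_i) >= 0 else -1
            if pred != y_i:
                w += eta * y_i * x_i              # push w . x_i toward y_i
                mistakes += 1
        if mistakes == 0:                         # a round with no errors: done
            return w
    return None
```

Note that the loop only terminates on its own when a full round produces no misclassifications, which is exactly the stopping criterion above; the `max_epochs` cap is a practical guard for data that is not linearly separable.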

Does the algorithm converge in finite time? Well, it depends on the geometry of our data. If the training examples $\vec{x}_{1},\ldots,\vec{x}_{N}$ can be separated by a hyperplane such that each class lies entirely in one of the two half-spaces, then the algorithm ends after finitely many iterations. In fact, we can bound the number of weight updates the algorithm makes.

Theorem 1. Suppose that $\left\|\vec{x}_{i}\right\|\leq R$ for all $i$, and suppose that there exist $\gamma>0$ and $\vec{v}\in\mathbb{R}^{n+1}$, with $\left\|\vec{v}\right\|=1$, such that

$\displaystyle\gamma<\eta y_{i}\vec{v}\cdot\vec{x}_{i}\qquad\forall\,1\leq i\leq N$

Then the perceptron perfectly classifies all the training examples after $O(\eta^{2}R^{2}/\gamma^{2})$ weight updates.

Proof. Let $(\vec{w}_{k})$ denote the sequence of weights obtained by the perceptron rule, so that the weight after the $k^{\text{th}}$ update is $\vec{w}_{k}$. Let $(\vec{x}_{(k)},y_{(k)})$ denote the training example that generated the error which resulted in the $k^{\text{th}}$ update. Observe that we can write

$\displaystyle\vec{w}_{k}=\vec{w}_{k-1}+\eta y_{(k)}\vec{x}_{(k)}\Rightarrow\eta y_{(k)}\vec{x}_{(k)}=\vec{w}_{k}-\vec{w}_{k-1}$

Suppose we have made $k$ updates. Then using the linearity of the dot product, we obtain

$\displaystyle k\gamma<\sum_{i=1}^{k}\eta y_{(i)}\vec{v}\cdot\vec{x}_{(i)}=\vec{v}\cdot\sum_{i=1}^{k}\left(\vec{w}_{i}-\vec{w}_{i-1}\right)=\vec{v}\cdot\left(\vec{w}_{k}-\vec{w}_{0}\right)$

We now seek an upper bound for the right-hand side of the above expression. By Cauchy-Schwarz, we have

$\displaystyle\vec{v}\cdot\left(\vec{w}_{k}-\vec{w}_{0}\right)\leq\left\|\vec{v}\right\|\left\|\vec{w}_{k}-\vec{w}_{0}\right\|=\left\|\vec{w}_{k}-\vec{w}_{0}\right\|$

By the triangle inequality, $\left\|\vec{w}_{k}-\vec{w}_{0}\right\|\leq\left\|\vec{w}_{k}\right\|+\left\|\vec{w}_{0}\right\|$, and since $\vec{w}_{0}$ is small (typically zero), it suffices to bound $\left\|\vec{w}_{k}\right\|$. Expanding $\left\|\vec{w}_{k}\right\|^{2}$ with the update rule, we obtain

$\displaystyle\left\|\vec{w}_{k}\right\|^{2}=\left\|\vec{w}_{k-1}+\eta y_{(k)}\vec{x}_{(k)}\right\|^{2}=\left\|\vec{w}_{k-1}\right\|^{2}+\left\|\eta y_{(k)}\vec{x}_{(k)}\right\|^{2}+2\eta y_{(k)}\vec{w}_{k-1}\cdot\vec{x}_{(k)}$

The last term is nonpositive, since $\operatorname{sgn}(y_{(k)})\neq\operatorname{sgn}(\vec{w}_{k-1}\cdot\vec{x}_{(k)})$ by the definition of a weight update. Hence, we obtain the upper bound

$\displaystyle\left\|\vec{w}_{k}\right\|^{2}\leq\left\|\vec{w}_{k-1}\right\|^{2}+\left\|\eta y_{(k)}\vec{x}_{(k)}\right\|^{2}\leq\left\|\vec{w}_{k-1}\right\|^{2}+\eta^{2}R^{2}$

By induction on $k$, we find $\left\|\vec{w}_{k}\right\|^{2}\leq k\eta^{2}R^{2}+\left\|\vec{w}_{0}\right\|^{2}$. Combining this with the lower bound above, we conclude that

$\displaystyle k\gamma<\vec{v}\cdot\left(\vec{w}_{k}-\vec{w}_{0}\right)\leq\left(k\eta^{2}R^{2}+\left\|\vec{w}_{0}\right\|^{2}\right)^{\frac{1}{2}}+\left\|\vec{w}_{0}\right\|\lesssim\left(k\eta^{2}R^{2}\right)^{\frac{1}{2}}$

Squaring both sides and dividing through by $k\gamma^{2}$, we obtain

$\displaystyle k\lesssim\left(\dfrac{\eta R}{\gamma}\right)^{2}$

$\Box$
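As an empirical sanity check of the bound, here is a small numerical experiment. The data is entirely synthetic, and the separating direction $\vec{v}$ below is chosen by hand; it need not be the optimal one, but the theorem holds for any valid $\vec{v}$, so the resulting bound should still dominate the observed update count.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1

# Synthetic separable data in R^2, already embedded with x_0 = 1.
# v is a unit vector defining the separating hyperplane (bias slot first).
v = np.array([0.0, 0.6, 0.8])
X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, size=(50, 2))])
y = np.where(X @ v >= 0, 1, -1)
keep = np.abs(X @ v) >= 0.2        # enforce a margin by dropping close points
X, y = X[keep], y[keep]

# Run the perceptron rule, counting weight updates, starting from w_0 = 0.
w = np.zeros(3)
updates = 0
converged = False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if (1 if w @ x_i >= 0 else -1) != y_i:
            w += eta * y_i * x_i
            updates += 1
            converged = False

R = np.linalg.norm(X, axis=1).max()
gamma = (eta * y * (X @ v)).min()  # gamma <= eta y_i v . x_i for all i
bound = (eta * R / gamma) ** 2     # the theorem's eta^2 R^2 / gamma^2
```

Since $\vec{w}_{0}=0$ here, the proof gives the clean bound $k\leq(\eta R/\gamma)^{2}$ with no lower-order terms, so `updates` should never exceed `bound`.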

The perceptron convergence theorem isn’t very useful in practice because a priori we don’t know that the data is linearly separable. If it is linearly separable, then the algorithm will work. If it isn’t, then the algorithm will keep running forever, and at any given moment we cannot tell whether it will never terminate or is merely in the middle of a long finite computation. In such a case, we would appeal to a learning algorithm which does not make an assumption about the geometry of the data, such as gradient descent.

It is my understanding that the perceptron convergence theorem was first proved by ABJ Novikoff in 1962 in the paper “On Convergence Proofs on Perceptrons”; but I was unable to find a copy online, and I currently don’t have access to a research library. I’ve ordered a copy of “Perceptrons: An Introduction to Computational Geometry” by M. Minsky and S. Papert, which may shed some light on attribution.