I am enrolled in an online machine learning course through Georgia Tech. It’s not as mathematically rigorous as I would like, but the material is nevertheless quite interesting. Today, I want to share a brief result with you about the convergence of a learning method for perceptrons.
In the context of artificial neural networks, a perceptron is an “artificial neuron”, not surprisingly a constituent of artificial neural networks, which you should read about. Basically, an artificial neuron takes a feature vector of inputs, calculates a weighted sum of the inputs, passes the sum to a function, called the activation or transfer function, and outputs the result. For a perceptron, this activation function is binary, returning a class in .
Formally, we have a feature vector
where is fixed weight vector, and is the “threshold” constant. Some authors use the class set , but I have been reading “Machine Learning” by T. Mitchell (I don’t recommend it) which does not. It would be nice if we could just take , so we embed a feature vector into by introducing and . For additional convenience, we introduce the “nonstandard” signum function taking on nonnegatives and on negatives.
So how does a perceptron learn how to “perceive”? Specifically, how do we find a “good” choice of weight vector ? Well, if we have some data, it would be nice if the perceptron classified all of our training examples correctly. So let’s try to find a which does this.
Consider a finite set of training examples consisting of a feature vector () and a class label . We start an initial weight , which is typically zero or randomly taken very small, and a small positive constant called the learning rate, which is used to moderate the magnitude of the weight updates. We then iterate through the training examples, testing if the perceptron correctly classifies each . If it does, we do nothing and move on to the next training example. Otherwise, we update by adding the term . The intuition behind the update term is that it pushes the quantity in the direction of the class . Indeed,
which has the opposite sign of the incorrect class. We keep performing rounds of iteration through the training examples until we complete a round with no misclassifications.
Does the algorithm converge in finite time? Well, it depends on the geometry of our data. If the training examples can be separated by a hyperplane in such that each class lies entirely in separate half-spaces, then the algorithm ends after finitely many iterations. In fact, we can bound the number of weight updates the algorithm makes.
Theorem 1. Suppose that for all and suppose that there exists and , with , such that
Then the perceptron perfectly classifies all the training examples in .
Proof. Let denote the sequence of weights obtained by the perceptron rule so that the weight after the update is . Let denote the training example that generated the error which resulted in the update. Observe that we can write
Suppose we have made updates. Then using the linearity of the dot product, we obtain
We now seek an upper bound for the right-hand side of the above expression. By Cauchy-Schwarz, we have
Using the triangle inequality and the hypothesis that is small, we can just consider . Observe that we can reduce the index by substituting our result above to obtain
The last expression is negative as by definition of a weight update. Hence, we obtain the upper bound
By induction on , we conclude that
Squaring both sides and dividing through by , we obtain
The perceptron convergence theorem isn’t very useful because a priori we don’t know that the data is linearly separable. If it is linearly separable then the algorithm will work. If it isn’t, then the algorithm will keep running. How do we know that it hasn’t terminated because of a long finite computation time? In such a case, we would appeal to a learning algorithm which does not make a hypothesis about the geometry of the data, such as gradient descent.
It is my understanding that the perceptron convergence theorem was first proved by ABJ Novikoff in 1962 in the paper “On Convergence Proofs on Perceptrons”; but I was unable to find a copy online, and I currently don’t have access to research library. I’ve ordered a copy of “Perceptrons: An Introduction to Computational Geometry” by M. Minsky and S. Papert, which may shed some light on attribution.