So now we have a hypothesis function and we’ve got a way to measure how accurate it is. All we need now is a way to improve our hypothesis function, eventually finding the best possible \(\theta_0, \theta_1\) values (that minimize our cost).
That’s where gradient descent comes in.
Imagine plotting our \(\theta_0, \theta_1\) parameters on the \(x, y\) axes .. and a 3 dimensional \(y\) axis showing the corresponding cost function \(J(\theta_0, \theta_1)\) for those \(\theta_1, \theta_1\) values.
Here’s what that would look like:
What we’re looking for is the point on the graph would the lowest cost, which would represent the pit in the center. That’s the sweet spot.. and the values of \(\theta_0\) and \(\theta_1\) at that point are what we are looking for (to optimize our holy hypothesis function \(h\)).
Since plotting these pretty graphs is not always possible (imagine what it would look like if you had just 20 features instead of 1).
Instead then, what we need to do is find a mathematical way to get the answers we need.
At the core of the gradient descent technique is a very simple idea. If we can take a starting point - any starting point - and then take a tiny step “downhill” (towards a point with a lower cost), then we’ll eventually get to a pit .. somewhere where we can’t descend any further without going back up. This point is known as the local minima and that’s what we’re aiming for.
The only question is: “how do we know which way is downhill?”.
Well, that’s where your old from high school - derivatives - come in. If you remember one thing about derivatives in school, it’s that a derivative is the slope at a given point (if you’re completely lost, check out this excellent introduction to derivates).
So all we need to do then is follow the derivative (the slope at a particular point), and it will give us a direction to move towards.
Assuming we’ve only got one \(x\) and one \(y\) (carrying on the example from the last post), we can start substituting equations in to really drive this point home.
Let’s start by substituting for the value of \(J\) which we derived in the previous post as:
.. this results in:
Since we have two variables that we’re trying to optimize (namely \(\theta_0)\) and \(\theta_1\)), we have to use what’s known in the geeky world of mathematics as partial derivatives. While it may sound like I just turned all Math professor on you, it’s actually a simple concept.
It’s like a normal derivative, where you treat everything as a constant except the variable you are deriving by. If that’s not enough by way of introduction, check this great explanation on the Math StackExchange site.
Ok, so here are the final equations we’re looking for (in their simplified form, after rearranging and stuff):
She’s a real beauty, ain’t she?
And this is just one step. We’ll be running this calculation many, many times .. as we journey down the hill of cost, to the local minima (sometimes thousands of times even).
Nothing drives a point home like seeing some code, and I really struggled to get my head around this whole vector business until I actually saw it implemented in traditional arrays.
You’ll find the data.csv
file here.
Here’s what the output looks like:
It took me a while to wrap my head around this, and to get the code to a working state. Matt Nedrich’s post on the topic was very helpful.
Note: I’ve used Python in the example above, but going forward most examples will be written using Octave.