March 25, 2020 • 5 min read • IN tech
Here are my solutions to exercise 3.
Prove the assertion of the last paragraph. Hint: If you’re not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.
Here is the paragraph he is referring to.
Indeed, there’s even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let’s suppose that we’re trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We’ll constrain the size of the move so that $\|\Delta v\| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we’re trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
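The goal here is to minimize $C$, which means we want $\Delta C$ (which can be rewritten as $\nabla C \cdot \Delta v$) to be as negative as we can make it. We want $\Delta C < 0$, meaning we want $\nabla C \cdot \Delta v < 0$, since $\Delta C \approx \nabla C \cdot \Delta v$.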
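We know that $\|\Delta v\| = \epsilon$ for some small fixed $\epsilon > 0$. To make $\nabla C \cdot \Delta v$ as negative as possible, $\Delta v$ must point in the direction opposite to $\nabla C$. This follows from the definition of the dot product, $\nabla C \cdot \Delta v = \|\nabla C\| \|\Delta v\| \cos\theta$, which measures how parallel two vectors are. For fixed lengths, the dot product is largest when the two vectors point in exactly the same direction (parallel) and most negative when they point in exactly opposite directions (antiparallel). The Cauchy-Schwarz inequality makes this precise: $|\nabla C \cdot \Delta v| \leq \|\nabla C\| \|\Delta v\| = \epsilon \|\nabla C\|$, so $\nabla C \cdot \Delta v \geq -\epsilon \|\nabla C\|$, with equality only when $\Delta v$ is a negative multiple of $\nabla C$ (assuming $\nabla C \neq 0$). So to produce the most negative value we must choose $\Delta v = -\eta \nabla C$ with $\eta > 0$. The size constraint $\|\Delta v\| = \epsilon$ then fixes $\eta = \epsilon / \|\nabla C\|$, so the step is only a small multiple of $\nabla C$ and we do not overshoot the minimum.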
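As a sanity check on the argument, here is a small numerical sketch. It is not a proof, just an experiment: the gradient vector, the step size `eps`, and the helper names (`best_step`, `random_step`) are all made up for illustration. It compares $\nabla C \cdot \Delta v$ for the step $\Delta v = -\eta \nabla C$ against many random steps of the same length.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def best_step(grad, eps):
    """Step of length eps pointing exactly opposite the gradient."""
    eta = eps / norm(grad)  # eta is fixed by the size constraint
    return [-eta * g for g in grad]


def random_step(dim, eps, rng):
    """Step of length eps in a random direction."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = eps / norm(d)
    return [scale * x for x in d]


rng = random.Random(0)
grad = [3.0, -4.0, 12.0]  # hypothetical gradient, chosen so its norm is 13
eps = 0.01                # hypothetical step size

# Change in C (to first order) for the antiparallel step: -eps * ||grad||
optimal = dot(grad, best_step(grad, eps))

# Change in C for 10,000 random steps of the same length
trials = [dot(grad, random_step(len(grad), eps, rng)) for _ in range(10_000)]

print("antiparallel step:", optimal)
print("best random step: ", min(trials))
```

If the Cauchy-Schwarz argument is right, no random step should ever beat the antiparallel one, and the best random step should only get close to it when its direction happens to be nearly opposite the gradient.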
I explained gradient descent when $C$ is a function of two variables, and when it’s a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?
Please note the particular values for the graph do not matter; it is the shape of the curve that matters.
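Minimizing $C$ with a gradient descent algorithm is quite simple in a one-dimensional example. This is because $\nabla C = dC/dv$, or in English, the gradient is just the ordinary derivative of the function, which is the slope of the tangent line. So, looking at the image above, at the point shown the tangent line has a negative slope and thus $dC/dv < 0$. The update $\Delta v = -\eta \, dC/dv$ is therefore positive, so we increase $v$ and move down the slope toward the right. Where the slope is positive the opposite happens, and we move down the slope toward the left.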
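Here is a minimal sketch of the one-dimensional case in Python. The cost function, the learning rate `eta`, and the starting points are my own picks for illustration (a simple parabola with its minimum at $v = 2$), not anything from the book.

```python
def cost(v):
    """Toy one-dimensional cost: a parabola with its minimum at v = 2."""
    return (v - 2.0) ** 2 + 1.0


def dcost(v):
    """Derivative of the toy cost, i.e. the slope of the tangent line at v."""
    return 2.0 * (v - 2.0)


def descend(v, eta=0.1, steps=50):
    """Repeatedly step against the slope; return every position visited."""
    path = [v]
    for _ in range(steps):
        v = v - eta * dcost(v)  # negative slope -> move right, positive -> move left
        path.append(v)
    return path


left = descend(-1.0)   # starts where the slope is negative
right = descend(5.0)   # starts where the slope is positive

print("from the left: ", left[0], "->", left[1], "-> ... ->", left[-1])
print("from the right:", right[0], "->", right[1], "-> ... ->", right[-1])
```

Starting on the left, where the slope is negative, the first step increases $v$. Starting on the right, where the slope is positive, the first step decreases $v$. Either way the point slides downhill toward the bottom of the curve.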
An extreme version of gradient descent is to use a mini-batch size of just $1$. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \, \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \, \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, $20$.
The advantage online learning has over stochastic gradient descent with a mini-batch size of $20$ is that it updates the weights and biases after every single training input, so the network starts reducing the cost right away and is tuned to each input as it arrives. Each update uses the exact gradient of the per-example cost $C_x$ rather than an average over a mini-batch, although that single-example gradient is a noisier estimate of the gradient of the overall cost $C$. So one must do the cost-benefit analysis to see whether many small, noisy updates produce better results faster than fewer, smoother mini-batch updates.
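The major disadvantage of online learning is that it is much slower than stochastic gradient descent for each pass over the data, since you have to update the network after every single training input $x$. So if you have a large data set, like the MNIST data set with 60,000 training images, then online learning has to compute the partial derivatives for every weight and bias ($\partial C_x / \partial w_k$ and $\partial C_x / \partial b_l$ for a given $x$) and apply a separate update for each image. That is 60,000 updates per epoch, each one based on a noisy single-example gradient.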
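Stochastic gradient descent with a mini-batch size of $20$ only needs the partial derivatives for a small fraction of the data set (in this case, $20$ images) before it adjusts all the weights and biases simultaneously, using the average gradient over that mini-batch. That is 3,000 updates per epoch instead of 60,000, and the $20$ gradients in a mini-batch can be computed together with matrix operations. This is a heck of a lot faster than doing it one by one for 60,000 images!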
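To make the trade-off concrete, here is a rough sketch on a toy problem. Everything in it is an assumption for illustration: a one-weight linear model $y = wx$ with per-example cost $C_x = (wx - y)^2 / 2$, 600 synthetic data points instead of MNIST, and a learning rate of $0.1$. It runs one epoch with a mini-batch size of $1$ (online learning) and one epoch with a mini-batch size of $20$, and counts the weight updates.

```python
import random


def grad_single(w, x, y):
    """dC_x/dw for the per-example cost C_x = (w*x - y)^2 / 2."""
    return (w * x - y) * x


def run_epoch(w, data, batch_size, eta):
    """One pass over the data; returns (final weight, number of updates)."""
    updates = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average the per-example gradients over the mini-batch
        g = sum(grad_single(w, x, y) for x, y in batch) / len(batch)
        w -= eta * g
        updates += 1
    return w, updates


rng = random.Random(0)

# Hypothetical toy data: y = 3x plus a little Gaussian noise
data = []
for _ in range(600):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

w_online, n_online = run_epoch(0.0, data, batch_size=1, eta=0.1)
w_batch, n_batch = run_epoch(0.0, data, batch_size=20, eta=0.1)

print("online:     ", n_online, "updates, w =", w_online)
print("mini-batch: ", n_batch, "updates, w =", w_batch)
```

Both runs evaluate the same 600 per-example gradients, but the online run applies 600 separate updates while the mini-batch run applies only 30. In a real network each mini-batch can be pushed through as one matrix operation, which is where the speed-up comes from. On the other hand, the online run gets to adjust the weight far more often in a single epoch.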