Back to overview Neural Net Problems - Exercise 3
March 25, 2020

Here are my solutions to exercise 3.

### Part 1

#### Question

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

Here is the paragraph he is referring to.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move in position so as to decrease as much as possible. This is equivalent to minimizing . We'll constrain the size of the move so that for some small fixed . In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases as much as possible. It can be proved that the choice of which minimizes is , where is determined by the size constraint . So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease .

#### Solution

The goal here is to minimize which implies that we are trying to minimize (can be rewritten as ) as best as we can. We want , meaning we want since .

We know that for small fixed . To make we must must be in a direction opposite of . This is given by the definition of a dot product. The dot product is a value that represents how much two vectors are parallel. The maximum value of a dot product is produced when two vectors are in the exact same direction (parallel) and the most negative value is produced when the two vectors are in exactly opposite directions (antiparallel). So to produce the most negative value we much choose a such that it is in the opposite direction of . We can rewrite . is defined as such because we want the vector to be a fraction of the that way we do not overshoot the minimum.

### Part 2

#### Question

I explained gradient descent when is a function of two variables, and when it's a function of more than two variables. What happens when is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

#### Solution

Please note the particular values for the graph do not matter; it is the shape of the curve that matters.

To minimize with a gradient descent algorithm is quite simple in a one-dimensional example. This is because or in English the gradient is equal to the derivative of the function. So, looking at the image above, at point the tangent line seems to have a negative slope and thus , therefore when we move in the direction we increase and move down the slope toward the right.

### Part 3

#### Question

An extreme version of gradient descent is to use a mini-batch size of just . That is, given a training input, , we update our weights and biases according to the rules and . Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, .