

Chapter 13: Multi-layer Perceptrons

13.3 Normalizing gradient descent


13.3.1 The gradient descent direction

We saw in Chapter 6 that with standard gradient descent we take steps of the form

\begin{equation} \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g(\mathbf{w}^{\,k-1}) \end{equation}

where the negative gradient at each step determines the direction we travel in next. How far we actually travel in this direction is determined by two factors: (i) the steplength parameter $\alpha$, and (ii) the magnitude of the gradient itself, denoted by $\Vert \nabla g(\mathbf{w}^{\,k-1}) \Vert_2$. This means that we cannot fully control how far we travel at each step by tuning $\alpha$ alone. Luckily this issue has a quick fix: normalize the gradient by dividing off its magnitude to get the unit-length descent vector $-\frac{\nabla g(\mathbf{w}^{\,k-1})}{\Vert \nabla g(\mathbf{w}^{\,k-1}) \Vert_2 }$. Notice that dividing the negative gradient by its magnitude does not change its direction, and it allows us to write the normalized gradient descent step as

\begin{equation} \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 } \end{equation}

where now the distance we travel at each step

\begin{equation} \left\Vert \mathbf{w}^{\,k} - \mathbf{w}^{\,k-1} \right\Vert_2 = \left\Vert -\alpha \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }\right\Vert_2 = \alpha \end{equation}

is solely determined by $\alpha$, which itself can either be fixed for all steps of the gradient descent run or changed at each step, e.g., set to a diminishing value like $\alpha = \frac{1}{k}$ at step $k$.
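In code, this choice can be handled with a few lines at the top of each descent step. Below is a minimal sketch, mirroring the `alpha_choice` convention used by our library later in this Section, where `alpha_choice` is either a fixed number or the string 'diminishing':

In [ ]:
# sketch: choosing a fixed or diminishing steplength at each step k
alpha_choice = 'diminishing'       # or a fixed value, e.g., 0.1
for k in range(1,21):
    if alpha_choice == 'diminishing':
        alpha = 1/float(k)         # steplength shrinks as the run progresses
    else:
        alpha = alpha_choice       # fixed steplength for every step
    # ... take the step of length alpha here ...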

The pseudo-code for the normalized gradient descent algorithm is given below.

Normalized gradient descent


1:   input: function $g$, steplength $\alpha$, maximum number of steps $K$, and initial point $\mathbf{w}^0$
2:   $\mathbf{w}_{\text{min}}=\mathbf{w}^0$
3:   $g_{\text{min}}=g\left(\mathbf{w}^0\right)$
4:   for $\,\,k = 1...K$
5:           compute $\nabla g\left(\mathbf{w}^{k-1}\right)$
6:           if $\left\Vert \nabla g\left(\mathbf{w}^{k-1}\right)\right\Vert _{2}=0$
7:                     set $\nabla g\left(\mathbf{w}^{k-1}\right)$ to a small nonzero random vector
8:           end if
9:           $\mathbf{d}^{k-1} =\frac{\nabla g\left(\mathbf{w}^{k-1}\right)}{\left\Vert \nabla g\left(\mathbf{w}^{k-1}\right)\right\Vert _{2}}$
10:         $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \mathbf{d}^{k-1}$
11:         if $g\left(\mathbf{w}^k\right) < g_{\text{min}}$
12:                  $\mathbf{w}_{\text{min}}=\mathbf{w}^k$
13:                  $g_{\text{min}}=g\left(\mathbf{w}^k\right)$
14:         end if
15:  end for
16:  output: $\mathbf{w}_{\text{min}}$ and $g_{\text{min}}$


To implement normalized gradient descent all we need to do is adjust a few lines in the descent loop of (unnormalized) gradient descent. First we evaluate the gradient and compute its magnitude (lines 5, 6, and 9 of the pseudo-code):

In [ ]:
# evaluate the gradient function at the current point
grad_eval = gradient(w)

# compute the norm of the gradient
grad_norm = np.linalg.norm(grad_eval)

Next, since we will be dividing by this quantity, we need to make sure it is not so small in magnitude (or exactly $0$) that it creates a 'division by zero' error (lines 6-8 of the pseudo-code). If the gradient does vanish we pick a direction at random, normalize it, and move along it according to the steplength parameter $\alpha$.

In [ ]:
# check that the magnitude of the gradient is not zero; if it is, pick a small random direction to move in
if grad_norm == 0:
    # pick random direction and normalize to have unit length
    grad_eval = 10**-6*np.sign(2*np.random.rand(len(w)) - 1)
    grad_norm = np.linalg.norm(grad_eval)

# normalize the gradient
grad_eval /= grad_norm

With these two simple adjustments we have the Python implementation of normalized gradient descent in the code cell below.

In [24]:
# import autograd's automatic differentiator, and its thinly wrapped version of numpy
from autograd import grad
import autograd.numpy as np

# gradient descent function - inputs: g (input function), alpha (steplength parameter), max_its (maximum number of iterations), w (initialization)
def normalized_gradient_descent(g,alpha,max_its,w):
    # compute the gradient of our input function - note this is a function too!
    gradient = grad(g)

    # run the gradient descent loop
    best_w = w        # weight we return, should be the one providing lowest evaluation
    best_eval = g(w)       # lowest evaluation yet
    for k in range(max_its):
        # evaluate the gradient, compute its length
        grad_eval = gradient(w)
        grad_norm = np.linalg.norm(grad_eval)
        
        # check that the magnitude of the gradient is not zero; if it is, pick a small random direction to move in
        if grad_norm == 0:
            # pick random direction and normalize to have unit length
            grad_eval = 10**-6*np.sign(2*np.random.rand(len(w)) - 1)
            grad_norm = np.linalg.norm(grad_eval)
        
        grad_eval /= grad_norm
    
        # take gradient descent step
        w = w - alpha*grad_eval
        
        # return only the weight providing the lowest evaluation
        test_eval = g(w)
        if test_eval < best_eval:
            best_eval = test_eval
            best_w = w
            
    return best_w
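
As a quick sanity check, the next cell shows a hypothetical run of this implementation on the simple quadratic $g(\mathbf{w}) = w_1^2 + w_2^2$ (the function, initialization, and steplength here are chosen purely for illustration).

In [ ]:
# a simple two-input quadratic to test on
g = lambda w: np.dot(w,w)

# a hypothetical run: 20 normalized steps of length 0.1 starting from the point (-3,-2)
w0 = np.array([-3.0,-2.0])
best_w = normalized_gradient_descent(g,alpha = 0.1,max_its = 20,w = w0)

# since every step has length exactly 0.1, the run can travel at most 2 units toward the origin
print(best_w)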

13.3.2 Comparing normalized and unnormalized gradient descent schemes

From a mathematical point of view the difference between unnormalized and normalized gradient descent is trivial: they describe the same idea, and either one can quickly be seen to be mathematically equivalent to the other by the right choice of steplength. In particular, setting $\alpha \longleftarrow \frac{\alpha}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }$ turns the unnormalized gradient descent step $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g\left(\mathbf{w}^{\,k-1}\right)$ into a normalized one, since
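
\begin{equation} \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \frac{\alpha}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 } \nabla g\left(\mathbf{w}^{\,k-1}\right) = \mathbf{w}^{\,k-1} - \alpha \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }. \end{equation}

So why go to the trouble of mentioning both normalized and unnormalized gradient descent methods if they are mathematically the same?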

Because, practically speaking, depending on the function being minimized, one version of gradient descent can be much easier to employ effectively than the other. In this Subsection we explore the practical advantages and disadvantages of each version. In particular we will see the following via the examples below.

1. Behavior on generic convex functions: Generally speaking the unnormalized method is more convenient for use with convex functions (e.g., linear and logistic regression). With many smooth convex cost functions the gradient vanishes as we approach a minimum, so the length of each unnormalized step - tied directly to the magnitude of the gradient - naturally diminishes there, which is advantageous.

2. Behavior on generic non-convex functions: Generally speaking the normalized version is more convenient for use with non-convex functions (e.g., deep neural networks), because here the fact that the length of each unnormalized step is tied directly to the magnitude of the gradient is disadvantageous. By completely controlling the length of each step via the steplength parameter, the normalized version can more easily pass over saddle points and through flat regions of a function.

3. Steplength parameter selection: For a general function, gradient descent reaches points infinitesimally close to a local minimum only when the length of each step (at least eventually) diminishes as the algorithm progresses. In the case of normalized gradient descent this implies that the steplength parameter $\alpha$ itself should diminish in size if such a solution is needed. The same is true of unnormalized gradient descent applied to certain convex functions that are not completely smooth (e.g., the absolute value function). However for smooth convex functions, since the length of each unnormalized step depends on the magnitude of the gradient, this behavior is possible even with a steplength parameter $\alpha$ fixed for all iterations.

Example 1. Normalized vs. unnormalized gradient descent for convex functions

In this Example we compare normalized and unnormalized gradient descent (left and right panels respectively) on the simple quadratic function

\begin{equation} g(w)=w^2 \end{equation}

Both algorithms use the same initial point ($w^0 = -3$), steplength parameter ($\alpha = 0.1$), and maximum number of iterations (20 each). Steps are colored from green to red to indicate the starting and ending points of each run, with circles denoting the actual steps in the input space and 'x' marks denoting their respective function evaluations.

Notice how - due to the re-scaling of each step via the derivative's magnitude - the unnormalized version races to the global minimum of the function. Meanwhile the normalized version - taking steps of constant length - gets only a fraction of the way there. This behavior is indicative of how each method behaves in general when applied to minimizing smooth convex functions.

[figure: normalized (left panel) vs. unnormalized (right panel) gradient descent runs on $g(w) = w^2$]
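
The comparison itself can be reproduced (without the plotting) with a short sketch like the one below, where `unnormalized_gradient_descent` is a hypothetical counterpart to our implementation that simply skips the normalization step.

In [ ]:
# a hypothetical unnormalized counterpart to our implementation, for comparison
def unnormalized_gradient_descent(g,alpha,max_its,w):
    gradient = grad(g)
    for k in range(max_its):
        w = w - alpha*gradient(w)    # step length scales with the gradient magnitude
    return w

g = lambda w: w**2
print(unnormalized_gradient_descent(g,alpha = 0.1,max_its = 20,w = -3.0))  # should race toward the minimum at 0
print(normalized_gradient_descent(g,alpha = 0.1,max_its = 20,w = -3.0))    # 20 steps of length 0.1: should end near -1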

As another example, consider the same quadratic with $N = 2$ inputs, which we illustrate - along with a run of each version of the algorithm - in the next Python cell. As can be seen in the plot, when using a constant steplength parameter $\alpha = 0.7$ the length of each step of the unnormalized algorithm naturally diminishes as it approaches the stationary point (the global minimum) due to the shrinking gradient, and this helps it 'zero in' on the minimum and get infinitesimally close to it. Conversely, the normalized gradient descent algorithm - which takes steps of length $\alpha$ ad infinitum - can only get so close to the global minimum. It cannot 'zero in' on the solution at the origin as well as the unnormalized version can.

This is indicative of a more general behavior of the normalized gradient descent algorithm on functions such as this - if we want a point infinitesimally close to a global minimum we must use a diminishing steplength, which gives the normalized algorithm a chance to 'zero in' on the minimum. Thus for a general function we see that gradient descent 'zeros in' on local minima only when the length of each step (eventually) diminishes as the algorithm progresses. In the case of normalized gradient descent this implies that the steplength parameter $\alpha$ should diminish in size if such a solution is needed.

[figure: normalized vs. unnormalized gradient descent runs on the quadratic with $N = 2$ inputs]

Note that it is not always the case with a convex function that the magnitude of the gradient diminishes as we approach global minima. Take, for example, the absolute value function

\begin{equation} g(w) = |w| \end{equation}

which has a single global minimum at $w = 0$, and whose derivative is defined everywhere except at $w = 0$ as

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w) = \begin{cases} +1 \,\,\,\,\,\text{if} \,\, w > 0 \\ -1 \,\,\,\,\,\text{if} \,\, w < 0 \end{cases} \end{equation}

Since the length of each step of unnormalized gradient descent depends on the magnitude of this derivative, the fact that this magnitude is constant means we are essentially taking normalized steps already.

Hence - for instance - if we initialize at $w = 1.7$ with a steplength value of $\alpha = 0.5$ and run for $10$ steps we oscillate around the global minimum, as with normalized gradient descent. We illustrate this in the next cell.

[figure: gradient descent on $g(w) = |w|$ with fixed steplength $\alpha = 0.5$, oscillating around the global minimum]
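
A minimal sketch of this run (plotting omitted) is given below; since the derivative of $|w|$ is simply $\text{sign}(w)$ away from $w = 0$, we can code it directly.

In [ ]:
# sketch: gradient descent on g(w) = |w| from w = 1.7 with fixed alpha = 0.5
w = 1.7
for k in range(10):
    w = w - 0.5*np.sign(w)    # the derivative of |w| is sign(w) away from w = 0
    print(w)                  # should oscillate: 1.2, 0.7, 0.2, -0.3, 0.2, -0.3, ...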

Here - as with normalized descent - because the length of each step does not diminish naturally as we approach the global minimum, we must force it to by using a diminishing steplength value $\alpha$ if we wish to reach a point infinitesimally close to the minimum. We rerun the example above using a diminishing steplength in the Python cell below, and indeed we get much closer to the global minimum.

[figure: the same run on $g(w) = |w|$ with a diminishing steplength, ending much closer to the global minimum]
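
A corresponding sketch with one possible diminishing schedule, $\alpha = \frac{1}{k}$:

In [ ]:
# sketch: the same run with a diminishing steplength alpha = 1/k
w = 1.7
for k in range(1,11):
    w = w - (1/float(k))*np.sign(w)
print(w)    # should end much closer to the global minimum at w = 0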

A close cousin to the absolute value is the sum of ReLU functions, an example of which is given below

\begin{equation} g(w) = \text{max}\,\left(0,\,w - 0.5\right) + \text{max}\,\left(0,\, -w - 0.5\right) \end{equation}

This is like an absolute value function with a long flat region of global minima at the bottom. Here we can use a diminishing steplength, or pick a steplength value small enough that we reach the flat region of global minima (where we will stay, since it is perfectly flat). To illustrate, we run the same gradient descent setup on the sum of ReLU functions above using a fixed steplength value $\alpha = 0.5$. Since there is a large region of global minima we can indeed reach a global minimum using this fixed steplength.

[figure: gradient descent on the sum of ReLU functions with fixed steplength $\alpha = 0.5$]
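
A sketch of such a run taking unnormalized steps, using the fact that the derivative of this ReLU sum is $+1$ for $w > 0.5$, $-1$ for $w < -0.5$, and $0$ on the flat region in between:

In [ ]:
# sketch: unnormalized gradient descent on the ReLU sum with fixed alpha = 0.5
relu_sum_grad = lambda w: 1.0*(w > 0.5) - 1.0*(w < -0.5)    # zero on the flat region of global minima
w = 1.7
for k in range(10):
    w = w - 0.5*relu_sum_grad(w)
print(w)    # should land at 0.2, inside the flat region, and stay there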

Example 2. Normalized vs. unnormalized gradient descent for non-convex functions with saddle points and flat regions

In this example we illustrate a very useful property of the normalized gradient scheme: the avoidance of saddle points that comes from controlling the exact length of each descent step via the steplength parameter. The first function we use to illustrate this advantageous behavior is the following

\begin{equation} g(w) = \text{max}\,\left(0,\, \left(3w - 2.3\right)^3 + 1\right)^2 + \text{max}\,\left(0,\, \left(-3w + 0.7\right)^3 + 1\right)^2 \end{equation}

This function has a minimum at $w= \frac{1}{2}$ and saddle points at $w = \frac{7}{30}$ and $w = \frac{23}{30}$.

To illustrate how normalized gradient descent can pass over the saddle point of this function we use normalized and unnormalized gradient descent (left and right panels respectively) for its minimization. Both algorithms use the same initial point ($w^0 = 0$), steplength parameter ($\alpha = 0.01$), and maximum number of iterations (55 each). Steps are colored from green to red to indicate the starting and ending points of each run, with circles denoting the actual steps in the input space and 'x' marks denoting their respective function evaluations.

Notice how keeping the size of each step fixed has the practical advantage of allowing the normalized version of gradient descent to naturally pass over saddle points of non-convex functions as it did in this example.

[figure: normalized (left panel) vs. unnormalized (right panel) gradient descent runs on the function above]
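
A sketch reproducing this comparison (plotting omitted), reusing our implementation along with the hypothetical unnormalized counterpart from Example 1:

In [ ]:
# sketch of the saddle point comparison
g = lambda w: np.maximum(0,(3*w - 2.3)**3 + 1)**2 + np.maximum(0,(-3*w + 0.7)**3 + 1)**2
print(normalized_gradient_descent(g,alpha = 0.01,max_its = 55,w = 0.0))    # should pass the saddle and end near w = 1/2
print(unnormalized_gradient_descent(g,alpha = 0.01,max_its = 55,w = 0.0))  # should slow to a crawl near the saddle at w = 7/30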

Building on this example, let us now look at a function with large flat areas

\begin{equation} g(w_1,w_2) = -e^{-\left(w_1^2 + w_2^2\right)} + 1 \end{equation}

to compare how normalized and unnormalized gradient descent deal with large regions of an input space where the gradient is almost zero. This function has a single global minimum at the origin and, comparing the two as we do in the next Python cell, we can see that even when beginning on an almost completely flat area of the input space the normalized method can quickly find this minimum, whereas the unnormalized version can barely start moving in the right direction. More specifically, we initialize both algorithms at the point $\mathbf{w}^0 = \begin{bmatrix} -3 \\ -2 \end{bmatrix}$, use the diminishing steplength parameter $\alpha = \frac{1}{k}$, and take 20 steps with each version. The top two panels display the function from the side (left panel) and from above (right panel) for the normalized version, and likewise the bottom two panels display the results of the unnormalized version.

[figure: normalized (top panels) vs. unnormalized (bottom panels) gradient descent runs, each shown from the side and from above]
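
A sketch of the normalized run (the unnormalized run is identical except without the division by the gradient's magnitude):

In [ ]:
# sketch: normalized gradient descent with diminishing alpha = 1/k on the flat function
g = lambda w: -np.exp(-np.dot(w,w)) + 1
gradient = grad(g)

w = np.array([-3.0,-2.0])
for k in range(1,21):
    grad_eval = gradient(w)
    w = w - (1/float(k))*grad_eval/np.linalg.norm(grad_eval)
print(w)    # should end very near the global minimum at the origin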

Example 3. Normalized gradient descent for Least Squares logistic regression

Recall that in discussing logistic regression in Chapter 9 we derived the non-convex Least Squares cost function

\begin{equation} g\left(w_0,w_1\right) = \sum_{p=1}^P \left(\text{tanh}\left(w_0 + x_pw_1 \right) - y_p \right)^2 \end{equation}

which was immediately discarded in favor of the convex softmax cost, because our original unnormalized version of gradient descent would easily get stuck in the flat regions of the former cost.

In this example we show how normalized gradient descent can be used to minimize the logistic Least Squares cost, translated into Python in the next cell.

In [10]:
# tanh non-convex logistic Least Squares cost function
# note: the input data x and labels y are assumed already loaded as global variables
def tanh_least_squares(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p,:]
        y_p = y[p]
        cost += (np.tanh(w[0] + w[1]*x_p) - y_p)**2
    return np.sum(cost)    # reduce to a scalar so autograd's grad can be applied

First, we load in a simulated one-dimensional dataset (left panel) and plot its corresponding tanh Least Squares cost (right panel).

[figure: the simulated dataset (left panel) and its tanh Least Squares cost (right panel)]

We now run normalized gradient descent for $900$ iterations, initialized at $w_0=-w_1=20$, with steplength parameter fixed at $\alpha=1$.

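A minimal sketch of this run using our implementation from above (assuming the dataset and cost function from the previous cells are loaded):

In [ ]:
# sketch: 900 normalized steps on the tanh Least Squares cost, initialized at w0 = 20, w1 = -20
w_init = np.array([20.0,-20.0])
best_w = normalized_gradient_descent(tanh_least_squares,alpha = 1,max_its = 900,w = w_init)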

The cell below plots the Least Squares logistic regression fit to the data (left panel) along with the gradient descent path towards the minimum on the contour plot of the cost function (right panel). The normalized gradient descent steps are colored green to red as the run progresses.

[figure: the logistic fit to the data (left panel) and the descent path on the cost's contour plot (right panel)]

Example 4. Normalized gradient descent for MLPs

In this Example we compare $100$ iterations of normalized and unnormalized gradient descent on a real two-class classification dataset collected for the task of face detection, consisting of $P=10,000$ face and non-face images.

[code cell: load in the face detection dataset]

Once the data is loaded, we initiate a supervised learning instance using our deep learning library, create a four-layer neural network basis consisting of $10$ units in each layer, and choose an activation function (here, ReLU), a data normalization scheme (here, standard normalization), and a cost function (here, softmax).

In [6]:
# initiate an instance
demo = superlearn_setup.Setup(x,y)

# choose a neural network architecture
demo.choose_features(name='multilayer_perceptron', layer_sizes=[784,10,10,10,10,1], activation='relu')

# choose a data normalization scheme
demo.choose_normalizer(name = 'standard')

# choose a cost function
demo.choose_cost(name = 'softmax')

We now run both normalized and unnormalized gradient descent for $100$ iterations, using the same initialization and steplength parameter.

In [23]:
# run unnormalized gradient descent
demo.fit(max_its = 100, alpha_choice = 10**(-1), version='unnormalized')

# run normalized gradient descent
demo.fit(max_its = 100, alpha_choice = 10**(-1), version='normalized')

# plot the results
demo.show_histories(start = 0, labels=['unnormalized','normalized'])