Chapter 8: Linear regression

8.4 Feature scaling via standard normalization

Press the 'Toggle code' button below to toggle code on and off for this entire presentation.

In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)

8.4.1 Feature scaling for single input datasets

In [2]:
# load data
csvname = datapath + 'unnorm_linregress_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:,:-1].T
y = data[:,-1:] 

# plot dataset
demo = regress_plotter.Visualizer(data)
demo.plot_data()
  • only two parameters to learn (the bias and slope of a best fit line)

  • let us take a look at its associated Least Squares cost function

In [4]:
# show the contours of an input function over a desired viewing range
static_plotter.two_input_contour_plot(least_squares,[],xmin = -12,xmax = 15,ymin = -5,ymax = 25,num_contours = 7,show_original = False)
  • contours very elliptical
  • these create a long narrow valley along the long axis of the ellipses
  • as discussed in Section 6.4, gradient descent progresses quite slowly when applied to minimize a cost function like this
  • make a run of $100$ steps of gradient descent (a sketch of the setup for this run is given below)
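The cell that produced this run is hidden in the presentation. A minimal sketch of what it might look like is given below, re-using the backend calls cost_lib.choose_cost and optimizers.gradient_descent that appear explicitly later in this section; the steplength value here is an assumption, not necessarily the one actually used.

# sketch of the hidden setup: build the Least Squares cost on the raw input
# and take 100 gradient descent steps from a zero initialization
least_squares = cost_lib.choose_cost(x,y,'least_squares')

# the steplength value below is an assumed placeholder
g = least_squares; w = np.array([0.0,0.0])[:,np.newaxis]; max_its = 100; alpha_choice = 10**(-1);
weight_history,cost_history = optimizers.gradient_descent(g,alpha_choice,max_its,w)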
In [6]:
# show run on contour plot
static_plotter.two_input_contour_plot(g,weight_history,xmin = -3,xmax = 7,ymin = -1,ymax = 12,num_contours = 7,show_original = False)
  • little progress made in the long narrow valley of the function!
  • the resulting fit is therefore quite poor
In [7]:
# the original data and best fit line learned from our gradient descent run
ind = np.argmin(cost_history)
least_weights = weight_history[ind]
demo.plot_fit(plotting_weights = [least_weights],colors = ['r'])
  • an extremely simple adjustment of the data, called standard normalization, ameliorates this issue significantly
  • do the following to each input dimension / feature: subtract off its mean and divide by its standard deviation
  • for the single-input case, we replace each input point $x_p$ with its mean-centered, unit-deviation analog as
\begin{equation} x_p \longleftarrow \frac{x_p - \mu}{\sigma} \end{equation}

where the sample mean of the inputs $\mu$ is defined as

\begin{equation} \mu = \frac{1}{P}\sum_{p=1}^{P}x_p \end{equation}

and the sample standard deviation of the inputs $\sigma$ is defined as

\begin{equation} \sigma = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_p - \mu \right)^2}. \end{equation}
  • super simple, invertible, and has a profound impact on the shape of our cost function
  • let's see what this does to the contours of the cost function we just saw (a sketch of the normalization code is given below)
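The helper standard_normalizer used in this section is pulled in from a backend file; a minimal sketch of what such a helper could look like (an assumption about its implementation, not the backend's actual code), along with how the normalized-input cost used in the next cell is formed:

import numpy as np

def standard_normalizer(x):
    # x is assumed to have shape (N, P): one row per input dimension, one column per point
    x_means = np.mean(x,axis = 1)[:,np.newaxis]   # mean of each input dimension
    x_stds = np.std(x,axis = 1)[:,np.newaxis]     # standard deviation of each input dimension

    # the normalizer subtracts off each dimension's mean and divides by its standard deviation
    normalizer = lambda data: (data - x_means)/x_stds
    return normalizer

# normalize the input and form the Least Squares cost on the normalized data
normalizer = standard_normalizer(x)
x_normalized = normalizer(x)
least_squares_2 = cost_lib.choose_cost(x_normalized,y,'least_squares')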
In [11]:
# show the contours of the normalized-input cost function over a desired viewing range
static_plotter.two_input_contour_plot(least_squares_2,[],xmin = -12,xmax = 15,ymin = -5,ymax = 25,num_contours = 7,show_original = False)
  • how did the cost function change? let's look at an animation where we slide between the original and standard normalized data
  • we will flip through versions of the cost function, inputting
\begin{equation} \left(1 - \lambda\right)x_p + \lambda \left( \frac{x_p - \mu}{\sigma} \right) \end{equation}

where $\lambda$ ranges from $\lambda = 0$ (i.e., we use the original input) to $\lambda = 1$ (i.e., we use the standard normalized input)

In [27]:
# animation showing the cost function's transformation from original to standard normalized input
scaling_tool = feature_scaling_tools.Visualizer(x,x_normalized,y,'least_squares')
scaling_tool.animate_transition(num_frames=50,xmin = -12,xmax = 15,ymin = -10,ymax = 30,num_contours = 7)
  • now let's make the same sort of gradient descent run on the normalized-input cost
In [13]:
# show run on contour plot
static_plotter.two_input_contour_plot(g,weight_history,xmin = -3,xmax = 10,ymin = -2,ymax = 6,num_contours = 7,show_original = False)
  • much much better!
  • how about the resulting fit?
In [14]:
# the original data and best fit line learned from our gradient descent run
ind = np.argmin(cost_history)
least_weights = weight_history[ind]
demo.plot_fit(plotting_weights = [least_weights],colors = ['r'],transformer = normalizer)
  • note: to produce this fit we plug new test points into the learned model
  • new test points need to be treated precisely as the training data was - subtract the same mean and divide by the same standard deviation
\begin{equation} \text{normalized_predictor}\left(x\right) = w_0^{\star} + w_1^{\star}\left(\frac{x - \mu}{\sigma}\right). \end{equation}
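In code, evaluating this normalized predictor at new test inputs might look like the following sketch; the name x_test is illustrative, while least_weights and normalizer are the quantities computed above.

# evaluate the learned model at new test inputs: normalize them with the *same*
# normalizer built from the training data, then apply the learned linear model
x_test = np.linspace(x.min(),x.max(),100)[np.newaxis,:]   # illustrative test inputs
w0,w1 = least_weights.flatten()                           # learned bias and slope
y_test = w0 + w1*normalizer(x_test)                       # normalized_predictor evaluated at x_test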

Example 1. Normalizing the input of a student debt dataset

  • does standard normalization work for real data?
  • absolutely - e.g., United States student debt dataset
In [3]:
# load data
csvname = datapath + 'student_debt.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:,:-1].T
y = data[:,-1:] 

# plot dataset
demo = regress_plotter.Visualizer(data)
demo.plot_data()
  • the Least Squares cost before standard normalization of the input, with a run of $25$ gradient descent steps
In [4]:
# an implementation of the least squares cost function for linear regression, precisely
# what was shown in Section 8.1 but here pulled in from a backend file
least_squares = cost_lib.choose_cost(x,y,'least_squares')

# run gradient descent to minimize the Least Squares cost for linear regression
g = least_squares; w = np.array([0.0,0.0])[:,np.newaxis]; max_its = 25; alpha_choice = 10**(-7);
weight_history,cost_history = optimizers.gradient_descent(g,alpha_choice,max_its,w)

# show run on contour plot
static_plotter.two_input_contour_plot(g,weight_history,xmin = -0.25,xmax = 0.25,ymin = -0.25,ymax = 0.25,num_contours = 7,show_original = False)
  • hugely elliptical! basically impossible to minimize via gradient descent
  • this is because the original input values are enormous, so even slight deviations from the very best slope / vertical intercept produce huge errors
  • the corresponding fit will therefore be terrible
In [5]:
# the original data and best fit line learned from our gradient descent run
ind = np.argmin(cost_history)
least_weights = weight_history[ind]
demo.plot_fit(plotting_weights = [least_weights],colors = ['r'])
  • after standard normalization, another run of only $25$ gradient descent steps - now with a much larger steplength!
In [9]:
# return a normalization function based on the statistics of input x
normalizer = standard_normalizer(x)

# normalize input by subtracting off mean and dividing by standard deviation
x_normalized = normalizer(x)

# an implementation of the least squares cost function for linear regression, precisely
# what was shown in Section 8.1 but here pulled in from a backend file
least_squares_2 = cost_lib.choose_cost(x_normalized,y,'least_squares')

# run gradient descent to minimize the Least Squares cost for linear regression
g = least_squares_2; w = np.array([0.0,0.0])[:,np.newaxis]; max_its = 25; alpha_choice = 10**(-1);
weight_history,cost_history = optimizers.gradient_descent(g,alpha_choice,max_its,w)

# show run on contour plot
static_plotter.two_input_contour_plot(g,weight_history,xmin = -1,xmax = 1,ymin = -1,ymax = 1,num_contours = 7,show_original = False,arrows = False)
  • way better fit possible
In [10]:
# the original data and best fit line learned from our gradient descent run
ind = np.argmin(cost_history)
least_weights = weight_history[ind]
demo.plot_fit(plotting_weights = [least_weights],colors = ['r'],transformer = normalizer)

8.4.2 Feature scaling for multi-input datasets

  • the same standard normalization scheme aids in tuning models for multi-input regression datasets as well
  • simply do the same thing - mean center and scale by standard deviation - for each input dimension
  • normalize the $n^{th}$ input dimension of an $N$-input dataset $\left\{\mathbf{x}_p,y_p\right\}_{p=1}^{P}$ as
\begin{equation} x_{p,n} \longleftarrow \frac{x_{p,n} - \mu_n}{\sigma_n} \end{equation}

where

\begin{equation} \mu_n = \frac{1}{P}\sum_{p=1}^{P}x_{p,n} \qquad \sigma_n = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_{p,n} - \mu_n \right)^2} \end{equation}
  • see the Section notes for examples, and the sketch below for one way to implement this in code
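As a quick illustration (the array values below are made up purely for this example), the per-dimension normalization amounts to a couple of lines of numpy when the input is stored as an $\left(N, P\right)$ array with one row per input dimension, matching the cells above.

import numpy as np

# tiny made-up 2-input dataset with P = 4 points, stored with shape (N, P) = (2, 4)
x_demo = np.array([[1.0, 2.0, 3.0, 4.0],
                   [100.0, 300.0, 200.0, 400.0]])

# normalize each input dimension separately
mus = np.mean(x_demo,axis = 1)[:,np.newaxis]      # mu_n for each dimension n
sigmas = np.std(x_demo,axis = 1)[:,np.newaxis]    # sigma_n for each dimension n
x_demo_normalized = (x_demo - mus)/sigmas

# each row of x_demo_normalized now has (approximately) zero mean and unit standard deviation
print(np.mean(x_demo_normalized,axis = 1))   # ~[0. 0.]
print(np.std(x_demo_normalized,axis = 1))    # ~[1. 1.]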

8.4.3 Summary and discussion

  • making the contours of a cost function more 'circular' makes it easier to minimize - why?
  • e.g., for gradient descent this is due to the nature of the descent direction itself!
  • the negative gradient direction is orthogonal to the contours - in a long narrow valley it points largely across the valley rather than towards the minimum, while with nearly circular contours it points much more directly at the minimum
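As a small worked illustration (not drawn from the notebook's code), consider the simple quadratic cost

\begin{equation} g\left(w_1,w_2\right) = a\,w_1^2 + b\,w_2^2, \qquad a \gg b > 0, \end{equation}

whose contours are ellipses forming a long narrow valley along the $w_2$ axis, with minimum at the origin. The gradient descent direction at a point $\left(w_1, w_2\right)$ is

\begin{equation} -\nabla g\left(w_1,w_2\right) = \begin{bmatrix} -2a\,w_1 \\ -2b\,w_2 \end{bmatrix} \end{equation}

so at a typical point where $\left|w_1\right|$ and $\left|w_2\right|$ are comparable, the first component dominates by a factor of $a/b$ and each step points mostly across the valley rather than along it toward the minimum. When $a = b$ (perfectly circular contours) the descent direction equals $-2a\left(w_1, w_2\right)$, which points exactly at the minimum from every point.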