Coming from an Economics/Econometrics background, I have always been a bit puzzled by the way several Machine Learning (ML) textbooks I have read solve the ordinary least squares model (OLS). When considering a linear model of the form:
$$ y = X \beta + e $$
where $e$ is a zero-mean noise term, the closed-form estimate obtained by minimizing the mean square error is:
$$ \hat{\beta} = (X'X)^{-1}X'y $$
Several ML textbooks explain that an iterative algorithm (see below) may be used to solve for $\beta$. Why not use the analytical formula directly to calculate an estimate of $\beta$? While feasible with “small” datasets (not too many explanatory variables and/or observations), direct inversion of $X'X$ is not recommended when working with thousands of explanatory variables and/or billions of observations. The alternative is to use gradient descent or, better, stochastic gradient descent.
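To make the scaling issue concrete (a standard back-of-the-envelope estimate, not a claim from any particular textbook): with $N$ observations and $p$ explanatory variables, the direct approach costs roughly
$$ \underbrace{O(Np^2)}_{\text{forming } X'X} \; + \; \underbrace{O(p^3)}_{\text{inverting } X'X} $$
operations, and $X$ itself may not even fit in memory when $N$ is in the billions.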
In this short post, I solve OLS the “machine-learning way”, that is, using (stochastic) gradient descent. The idea behind gradient descent (GD) is quite intuitive. The gradient of $f$ at a given point tells us the direction of greatest increase of $f$ at that point. Hence, moving in the opposite direction (minus the gradient) is probably a good idea to find a local minimum. And indeed it is. The GD algorithm repeatedly applies this procedure until a minimum (hopefully global) is found. Starting from an initial guess for $\beta$, one updates the guess using the following formula:
$$ \beta_{n} = \beta_{n-1} - \alpha \, \mathrm{grad}(\beta_{n-1}, X, y) $$
where $\alpha$ is a small positive value (the “learning rate”) and $\mathrm{grad}(\beta_{n-1}, X, y)$ is the gradient of the mean square error (another loss function could be used) evaluated at $\beta_{n-1}$ using the observations $X$ and $y$. Using the mean square error as the loss function yields a closed-form expression for the gradient (up to a constant factor, which can be absorbed into the learning rate $\alpha$):
$$ \mathrm{grad}(\beta_{n-1}, X, y) = (X'X)\beta_{n-1} - X'y $$
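For completeness, this expression comes from differentiating the (half) sum of squared errors, which has the same minimizer as the mean square error:
$$ L(\beta) = \tfrac{1}{2}\,(y - X\beta)'(y - X\beta), \qquad \nabla_{\beta} L(\beta) = -X'(y - X\beta) = (X'X)\beta - X'y $$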
A refinement of GD, especially handy when dealing with a large dataset, is to use only a subset of the full sample when calculating the gradient:
$$ \beta_{n} = \beta_{n-1} - \alpha \, \mathrm{grad}(\beta_{n-1}, X_n, y_n) $$
where $X_n$ and $y_n$ form a randomly selected sub-sample of $X$ and $y$. Stochastic Gradient Descent (SGD) reduces the computational burden associated with computing the gradient, while still having good convergence properties, as illustrated in the application below.
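Intuitively, this still points in the right direction on average: if the $m$ rows of the sub-sample are drawn uniformly from the $N$ observations, the mini-batch gradient is, up to a scaling factor that can be absorbed into the learning rate, an unbiased estimate of the full-sample gradient (the symbols $m$ and $N$ are only used in this added remark):
$$ \mathbb{E}\big[(X_n'X_n)\beta - X_n'y_n\big] = \frac{m}{N}\,\big[(X'X)\beta - X'y\big] $$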
Implementation in Julia
Let’s first load the required packages and define the parameters of the simulation:
using LinearAlgebra
using Distributions
using Plots
using Random

n_points = 10000
dim_input = 100 # dimension of the input, without the intercept
dim_output = 1
# Standard normal distribution (used for the parameters, the noise and the inputs)
d = Normal()
# True parameters
beta = rand(d, dim_input + 1);
# Noise
e = rand(d, n_points);
# Input data
X = rand(d, (n_points, dim_input));
# Add the intercept
X = hcat(ones(n_points), X);
# Linear model
y = X*beta .+ e;
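As a quick sanity check on the simulated objects (added here for convenience), the sizes should follow directly from the parameters above:
size(X), length(y), length(beta)  # expected: ((10000, 101), 10000, 101)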
The following function calculates an estimate of $\beta$ using the analytical OLS formula:
# OLS the direct way: explicit inversion of X'X
function OLS_direct(X::Array, y::Vector)
    inv(transpose(X)*X)*transpose(X)*y
end
OLS_direct (generic function with 1 method)
@time beta_hat = OLS_direct(X, y);
0.611144 seconds (1.85 M allocations: 96.576 MiB, 13.64% gc time)
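As an aside (not part of the original comparison), Julia’s backslash operator solves the least-squares problem via a QR factorization, which avoids forming and inverting $X'X$ explicitly and is numerically more stable; the name beta_hat_qr below is introduced only for this illustration:
# Least squares via QR factorization, without explicitly inverting X'X
beta_hat_qr = X \ y;
# Should agree with beta_hat up to floating-point error
maximum(abs.(beta_hat_qr .- beta_hat))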
Unsurprisingly, using the analytical solution works perfectly well, as illustrated in the following plot:
plot(beta, beta_hat, seriestype=:scatter, label="OLS (Direct)")
plot!(beta, beta, seriestype=:line, label="true")
xlabel!("true value")
ylabel!("estimated value (Direct)")
Gradient Descent
Now it’s time to solve OLS the machine learning way. I first define a function that calculates the gradient of the loss function, evaluated at the current guess using the full sample. Then, a second function applies the GD updating rule.
# Calculate the gradient in place, using the full sample
function grad_OLS!(G, beta_hat, X, y)
    G[:] = transpose(X)*X*beta_hat - transpose(X)*y
end
grad_OLS! (generic function with 1 method)
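As a quick sanity check (added here, relying on the first-order condition of the least-squares problem): evaluated at the closed-form OLS estimate, this gradient should be numerically close to zero; the name G below is introduced only for this check:
G = zeros(size(X, 2))
grad_OLS!(G, OLS_direct(X, y), X, y)
norm(G)  # should be tiny relative to norm(transpose(X)*y)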
# Gradient descent way
function OLS_gd(X::Array, y::Vector; epochs::Int64=1000, r::Float64=1e-5, verbose::Bool=false)
    # Initial guess
    beta_hat = zeros(size(X,2))
    grad_n = zeros(size(X,2))
    for epoch = 1:epochs
        grad_OLS!(grad_n, beta_hat, X, y)
        # Gradient descent update
        beta_hat -= r*grad_n
        if verbose && mod(epoch, round(Int, epochs/10)) == 1
            # mse is defined in the Appendix
            println("MSE: $(mse(beta_hat, X, y))")
        end
    end
    return beta_hat
end
OLS_gd (generic function with 1 method)
As illustrated below, after 20 iterations we are already quite close to the true values. After 100 iterations, the values obtained by GD are indistinguishable from the true values.
@time beta_hat_gd_20 = OLS_gd(X,y, epochs=20);
@time beta_hat_gd = OLS_gd(X,y, epochs=100);
plot(beta, beta_hat_gd_20, seriestype=:scatter, label="GD (20 iter.)")
plot!(beta, beta_hat_gd, seriestype=:scatter, label="GD (100 iter.)")
plot!(beta, beta, seriestype=:line, label="true")
xlabel!("true value")
ylabel!("estimated value (GD)")
0.137474 seconds (81.55 k allocations: 5.814 MiB)
0.466605 seconds (714 allocations: 8.225 MiB)
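To quantify the visual impression (a check added here rather than taken from the plot), one can compare the GD iterates with the closed-form estimate beta_hat computed earlier:
# Distance to the closed-form OLS estimate
norm(beta_hat_gd_20 - beta_hat), norm(beta_hat_gd - beta_hat)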
Stochastic Gradient Descent
One issue with plain-vanilla GD is that computing the gradient on the full sample at every iteration might be slow. Let’s instead randomly select only a fraction of the full sample each time we iterate. By default, the function below uses 10 percent of the full sample (the comparison further down uses a 20 percent batch).
# Stochastic gradient descent way
function OLS_sgd(X::Array, y::Vector; epochs::Int64=1000, r::Float64=1e-5, verbose::Bool=false, batchsizePer::Int64=10)
    # Initial guess
    beta_hat = zeros(size(X,2))
    grad_n = zeros(size(X,2))
    # How many draws from the dataset? batchsizePer is a percentage of the full sample.
    batchsize = round(Int, size(X,1)*(batchsizePer/100))
    for epoch = 1:epochs
        # Randomly select a sub-sample of rows
        indices = shuffle(Vector(1:size(X,1)))
        Xn = X[indices[1:batchsize], :]
        yn = y[indices[1:batchsize]]
        grad_OLS!(grad_n, beta_hat, Xn, yn)
        # Gradient descent update
        beta_hat -= r*grad_n
        if verbose && mod(epoch, round(Int, epochs/10)) == 1
            # mse is defined in the Appendix
            println("MSE: $(mse(beta_hat, X, y))")
        end
    end
    return beta_hat
end
OLS_sgd (generic function with 1 method)
The following block of code shows that SGD achieves the same degree of accuracy, while being noticeably faster than GD.
@time beta_hat_gd = OLS_gd(X,y, epochs=200);
@time beta_hat_sgd = OLS_sgd(X,y, epochs=200, batchsizePer=20);
plot(beta, beta_hat_sgd, seriestype=:scatter, label="SGD")
plot!(beta, beta_hat_gd, seriestype=:scatter, label="GD")
plot!(beta, beta, seriestype=:line, label="true")
xlabel!("true value")
ylabel!("estimated value (SGD)")
0.894465 seconds (1.41 k allocations: 16.448 MiB, 0.42% gc time)
0.513217 seconds (338.19 k allocations: 382.550 MiB, 4.60% gc time)
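To make the accuracy claim concrete (again, a check added here), one can compare both estimates with the true parameter vector beta used to simulate the data:
# Distance to the true parameters
norm(beta_hat_gd - beta), norm(beta_hat_sgd - beta)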
Conclusion
The OLS analytical formula is the gold standard for deriving theoretical properties and is perfectly fine when working with reasonably sized data. In a big-data context, (stochastic) gradient descent is the way to go. SGD can be applied to a wide range of minimization problems. In a machine-learning context, SGD is used to estimate (“train”) much more complicated models than the simple linear model presented here. In the Appendix below, I show how gradient descent can be used when no analytical formula for the gradient is available.
Appendix
GD without analytical solution for the gradient
Let’s assume we don’t have a closed-form solution for the gradient. In this context, Julia’s automatic differentiation package ForwardDiff is a good choice for computing the gradient. Below, I define the loss function (the MSE, up to a constant factor), obtain its gradient using ForwardDiff, and apply the GD algorithm.
using ForwardDiff
function mse(beta::Vector, X::Array, y::Vector)
    result = zero(eltype(y))
    for i in 1:length(y)
        # Accumulate squared errors (the 1/n factor is dropped; it only rescales the gradient)
        result += (y[i] - dot(X[i,:], beta))^2
    end
    return result
end
mse (generic function with 1 method)
# Gradient of the loss via automatic differentiation
function grad!(G, beta_hat, X, y)
    G[:] = ForwardDiff.gradient(x -> mse(x, X, y), beta_hat)
end
grad! (generic function with 1 method)
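As a quick consistency check (added here): since mse drops the 1/n factor, its automatic-differentiation gradient should equal twice the analytical gradient used earlier; the names G_ad and G_an are introduced only for this check:
G_ad = zeros(size(X, 2));
G_an = zeros(size(X, 2));
grad!(G_ad, zeros(size(X, 2)), X, y)
grad_OLS!(G_an, zeros(size(X, 2)), X, y)
maximum(abs.(G_ad .- 2 .* G_an))  # should be at floating-point-error level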
# Gradient descent way, now using the ForwardDiff-based gradient
# (this redefines OLS_gd, replacing the earlier method that used the analytical gradient)
function OLS_gd(X::Array, y::Vector; epochs::Int64=1000, r::Float64=1e-5, verbose::Bool=false)
    # Initial guess
    beta_hat = zeros(size(X,2))
    grad_n = zeros(size(X,2))
    for epoch = 1:epochs
        grad!(grad_n, beta_hat, X, y)
        # Gradient descent update
        beta_hat -= r*grad_n
        if verbose && mod(epoch, round(Int, epochs/10)) == 1
            println("MSE: $(mse(beta_hat, X, y))")
        end
    end
    return beta_hat
end
OLS_gd (generic function with 1 method)
@time beta_hat_gd = OLS_gd(X,y, epochs=100);
plot(beta, beta_hat_gd, seriestype=:scatter, label="GD (100 iter.)")
plot!(beta, beta, seriestype=:line, label="true")
xlabel!("true value")
ylabel!("estimated value (GD)")
3.180767 seconds (9.71 M allocations: 7.547 GiB, 12.66% gc time)