The Curse of Dimensionality and Adaptive Sparse Grids

Introduction

The curse of dimensionality is at the heart of dynamic programming. And dynamic programming is the cornerstone of modern economic theory. But what is the curse of dimensionality exactly? The aim of this blog post is to answer this question and to show one method that alleviates this curse: adaptive sparse grids.

TL;DR: adaptive sparse grids help you limit the bite of the curse of dimensionality.

You can download the notebook for this post here.


I. Theory: the curse of dimensionality

I.A. Problem: the curse of dimensionality

The behavior of an optimizing agent can be summarized by a Bellman equation of the form:

$$ V(x,z) = \max_{c \in \Gamma(x,z)}\left[ F(x,c,z) + \beta E_{z'|z}[ V(T(x,c), z') ] \right] $$

where $x$ is a non-random state variable and $z$ is a random variable following a Markov process. Both $x$ and $z$ are to be understood as vectors. The function $\Gamma(.)$ represents the set of feasible actions given the current state and the function $T(.)$ captures the transition from one state to the next when an action is taken. The parameter $\beta$ is a discount factor strictly smaller than one and $E$ is the expectation operator. For instance, in a consumption-savings problem, $x$ is wealth, $z$ is stochastic labor income, $c$ is consumption, $\Gamma(x,z)$ is the budget set and $T(x,c)$ is next-period wealth.

Let’s now assume that we know the policy function that dictates the optimal behavior of the agent $c^{\star} = g(x,z)$, so we can ignore the max operator and rewrite the above as:

$$ V(x,z) = F(x,c^{\star},z) + \beta E_{z'|z}[ V(T(x,c^{\star}), z') ] $$


Now let's leave the theoretical world. In practice, one needs to discretize $(x,z)$ on a grid with a finite number of points. Let's assume that the deterministic part of the state has $n$ dimensions and that the stochastic part has $z$ dimensions. For the sake of simplicity, let's also assume that we use $k$ points along each dimension. Evaluating $F(x,c^{\star},z)$ at every point of the grid then requires $k^{n+z}$ function evaluations.

Things get even worse when we look at the second element on the right-hand side of the above equation: the conditional expectation. In the present context, a conditional expectation is "simply" a multidimensional integral. From a numerical standpoint, it can be approximated using Gaussian quadrature. Once again, for the sake of simplicity, let's assume we use $w$ nodes along each dimension of $z$. Gaussian quadrature then requires $w^z$ evaluations of $V(.)$ for each point on the grid, for a grand total of $k^{n+z}w^{z}$ function evaluations, which is equal to $\exp[(n+z)\log(k) + z\log(w)]$.
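To make the quadrature step concrete, here is a minimal sketch (my own addition, not from the original notebook) of a one-dimensional conditional expectation computed with Gauss-Hermite quadrature. The package FastGaussQuadrature and the AR(1) parameters rho and sigma are assumptions made purely for this illustration:

using FastGaussQuadrature

# E[g(z')|z] when log(z') = rho*log(z) + sigma*eps, with eps ~ N(0,1).
# After the change of variables eps = sqrt(2)*u, Gauss-Hermite gives
# E[g(z')|z] ≈ (1/sqrt(pi)) * sum_i w_i * g(z^rho * exp(sqrt(2)*sigma*u_i))
nodes, weights = gausshermite(5)  # 5 quadrature nodes
E_given_z(g, z; rho = 0.9, sigma = 0.1) =
    sum(weights .* g.(z^rho .* exp.(sqrt(2.0) .* sigma .* nodes))) / sqrt(pi)

# Sanity check: with g = identity, E[z'|z = 1] = exp(sigma^2/2) ≈ 1.005
E_given_z(identity, 1.0)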


From the formula $\exp[(n+z)\log(k) + z\log(w)]$, I see two main takeaways:

  1. Get rid of useless dimensions. Yes, I know it's obvious, but the formula makes it clear. Dimensions enter linearly within the exponential function, while the number of discretization points enters only logarithmically. We can save computation time by being smart about the choice of grid points, but the smartest move is to get rid of useless dimensions.

  2. Stochastic variables are more costly. Once again it's obvious, but I like the formalism that the formula brings. Stochastic variables matter today ($z\log(k)$), but also tomorrow ($z\log(w)$). The choice of quadrature nodes is important ($\log(w)$), but a reduction in $z$ would be even better.

Now let's plot the number of function evaluations necessary for the conditional expectation as a function of the number of dimensions of $x$ and $z$ (holding the number of grid points and quadrature nodes constant). As illustrated below, things get out of hand quite rapidly.

using Plots

# Total number of function evaluations: k^(n+z) * w^z = exp[(n+z)*log(k) + z*log(w)]
function curse(n,z; k=10, w=5)
    exp((n+z)*log(k) + z*log(w))
end

p1 = plot(1:3, 1:3, (n,z) -> curse(n,z), st=:surface, xlabel="n", ylabel="z")
p2 = plot(1:3, n -> curse(n,1), label="curse(n,1)", xlabel="n")
p3 = plot(1:3, z -> curse(1,z), label="curse(1,z)", xlabel="z")

p = plot(p1, p2, p3)
display(p)
println("Millions of function evaluations for curse(3,3): $(round(curse(3,3)/(10^6), digits = 2))")


Millions of function evaluations for curse(3,3): 125.0

I.B. One solution: Adaptive Sparse Grids

One solution brought to the field of Economics by Brumm and Scheidegger (2017) is to use adaptive sparse grids. Adaptive sparse grids combine two ideas:

  1. Sparse grids are based on the idea that, when it comes to approximation power, not all grid points are born equal. Some grid points are to some extent redundant and can be avoided (see the point-count sketch after this list).

  2. Adaptive sparse grids use a sparse grid as a starting point. Additional points are then added to the (sparse) grid in regions of high curvature.
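To get a feel for point 1, here is a small back-of-the-envelope count (my own addition, not from the original post) comparing a full tensor-product grid with a classical sparse grid of level $L$. The 1-D level convention below (one midpoint at level 0, the two boundary points at level 1, $2^{l-1}$ new points at level $l \ge 2$) is a common one, but conventions differ across implementations:

# Number of new 1-D points added at hierarchical level l (one common convention)
m(l) = l == 0 ? 1 : (l == 1 ? 2 : 2^(l - 1))

# Full tensor grid: (2^L + 1)^d points
full_size(L, d) = (2^L + 1)^d

# Sparse grid: keep only level combinations with l_1 + ... + l_d <= L
sparse_size(L, d) = sum(prod(m, lv) for lv in Iterators.product(fill(0:L, d)...) if sum(lv) <= L)

println("d = 3, L = 5: full = $(full_size(5, 3)) points, sparse = $(sparse_size(5, 3)) points")

With this convention, $d = 3$ and $L = 5$ give 441 sparse-grid points against 35,937 full-grid points, while keeping the finest 1-D resolution along each axis.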

Now let’s look at how a sparse grid adapts to the curvature of the following non-smooth function:

$$ f(x,y) = \frac{1}{|0.5 - x^4 - y^4| + 0.1} $$

# Packages and helper functions
using AdaptiveSparseGrids
using Random
using Distributions
using IterTools
using DataFrames
using NBInclude
@nbinclude("utils.ipynb")

# Example taken from here: https://github.com/jacobadenbaum/AdaptiveSparseGrids.jl
# I made small modifications along the way
# Bounds
lb  = zeros(2)
ub  = ones(2)

# True function to approximate:
f(x) = 1.0/(abs(0.5 - x[1]^4 - x[2]^4) + 0.1)

# Construct approximation:
fun = AdaptiveSparseGrid(f, lb, ub,
                         max_depth = 10,    # The maximum depth of basis elements in each dimension
                         tol = 0.01)        # Add nodes when min(abs(alpha/f(x)), abs(alpha)) < tol
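# (Here alpha is the hierarchical surplus: the gap between f at a candidate node and
#  the interpolant built from the coarser levels. A small surplus signals that the
#  region is already well resolved, so it is not refined further.)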

xy_grid = extract_grid(fun)
p1=scatter(xy_grid[:,1],xy_grid[:,2], m=(:black,2),title="Sparse grid", xlabel=L"x", ylabel=L"y")
p2 = plot(xy_grid[:,1],xy_grid[:,2], [fun(xy_grid[i,:]) for i in 1:size(xy_grid,1)], m=(:black,2),title="Sparse grid", xlabel=L"x", ylabel=L"y", zlabel=L"f(x,y)", st=:scatter3d)
plot(p1, p2, legend=false)
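As a quick sanity check (my own addition, not in the original post), we can compare the interpolant fun against the true function f on random draws; length(fun.nodes) gives the number of sparse-grid nodes, as used later in this post:

# Evaluate the sparse-grid interpolant against the true function on random draws
xs = [rand(2) for _ in 1:1000]                     # uniform draws on [0,1]^2
max_err = maximum(abs(fun(x) - f(x)) for x in xs)  # worst-case absolute error
println("Nodes: $(length(fun.nodes)), max abs error: $(round(max_err, sigdigits = 3))")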


II. Application

Alright, back to Economics. We will focus on the model of the labour market by Lise and Robin (2017). You can take a look at the original paper, or at my previous post here. "Solving the model" involves finding the surplus function $S(.)$ that solves the following equation:

$$ S(x,y,z) = s(x,y,z) + \frac{1 - \delta}{1 + r} E_{z'|z}[ \max(S(x,y,z'), 0) ] $$
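A short remark (my own addition, not in the original post): iterating on this equation converges because the right-hand side defines a contraction. Writing the map as $T$, for any two candidate surplus functions $S_1$ and $S_2$:

$$ \|T S_1 - T S_2\|_{\infty} \le \frac{1 - \delta}{1 + r} \|\max(S_1, 0) - \max(S_2, 0)\|_{\infty} \le \frac{1 - \delta}{1 + r} \|S_1 - S_2\|_{\infty} $$

since $|\max(a,0) - \max(b,0)| \le |a - b|$ and $(1-\delta)/(1+r) < 1$. This is what guarantees that the value function iterations below converge, whether the grid is dense or sparse (up to approximation error).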

II.A. Value function iteration with a dense grid

For more details on value function iteration with a dense grid, you can look at my former post here. The two blocks of code below solve for the unknown function $S(.)$ and plot the results for a dense grid. Since solve_VFI itself is defined in utils.ipynb and not shown here, I first give a rough sketch of the kind of iteration it performs.
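To be clear, the sketch below is my own reconstruction, not the actual solve_VFI: the field names (s_xyz, nodes_E, weigths_E, rho, delta, r and the grids) mirror the Params fields used later in this post, and I use a crude nearest-neighbor lookup for $z'$ where the real code presumably interpolates.

function solve_VFI_sketch(p; tol = 1e-8, max_iter = 5000)
    nx, ny, nz = length(p.x_grid), length(p.y_grid), length(p.z_grid)
    S = zeros(nx, ny, nz)
    for iter in 1:max_iter
        S_new = similar(S)
        for ix in 1:nx, iy in 1:ny, iz in 1:nz
            z = p.z_grid[iz]
            # E_{z'|z}[max(S,0)]: map each quadrature node to z' = z^rho * exp(innovation),
            # clamp z' to the grid and read S at the nearest z-grid point
            EmaxS = 0.0
            for (k, innovation) in enumerate(p.nodes_E)
                zp  = clamp(z^p.rho * exp(innovation), p.z_grid[1], p.z_grid[end])
                izp = argmin(abs.(p.z_grid .- zp))
                EmaxS += p.weigths_E[k] * max(0.0, S[ix, iy, izp])
            end
            S_new[ix, iy, iz] = p.s_xyz(p.x_grid[ix], p.y_grid[iy], z) +
                                ((1.0 - p.delta)/(1.0 + p.r)) * EmaxS
        end
        # Sup-norm convergence check
        if maximum(abs.(S_new .- S)) < tol
            println("Iter $(iter) Convergence reached")
            return S_new
        end
        S = S_new
    end
    return S
end

Note that the real solve_VFI returns an object that can be evaluated off the grid (the plotting code below calls S_dense([x; y; 1.0])), while this sketch returns a raw array.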

# Type defined in "utils.ipynb"
p = Params(nx=21, ny=21, nz=10);
# Solve by dense value function iteration
# function defined in "utils.ipynb"
@time S_dense = solve_VFI(p, max_iter = 5000)
Iter 1312 Convergence reached
 64.575871 seconds (581.92 M allocations: 25.137 GiB, 7.04% gc time, 1.06% compilation time)
# Reshape the grid
vals_grid = zeros(size(p.x_grid,1)*size(p.y_grid,1)*size(p.z_grid,1), 3)
for (i, (x, y, z)) in enumerate(product(p.x_grid, p.y_grid, p.z_grid))
    vals_grid[i, 1] = x
    vals_grid[i, 2] = y
    vals_grid[i, 3] = z
end

# Plot dense VFI
p1=scatter(vals_grid[:,1], vals_grid[:,2],m=(2),title="Dense grid", xlabel=L"x", ylabel=L"y")
p2 = plot(p.x_grid, p.y_grid, (x, y) -> S_dense([x; y; 1.0])[1], label = "f(x)", st=:surface)
plot!(vals_grid[:,1], vals_grid[:,2], [S_dense([vals_grid[i,1]; vals_grid[i,2]; 1.0])[1] for i in 1:size(vals_grid,1)], label = "f(x)", st=:scatter3d)
title!(L"S(x,y,z=1) Dense")
xlabel!(L"x")
ylabel!(L"y")

plot(p1, p2, legend=false)


II.B. Adaptive sparse grid

Now let's solve the same problem using the methodology described in Brumm and Scheidegger (2017) (without the fancy parallelism that they use). The idea is to start with a sparse grid and to add points to the grid where the function is non-smooth. The following block of code does that and compares the results with those obtained previously on the dense grid. Both methods yield the same results, despite the sparse grid having fewer points.

# A. Initialization of the RHS
function init_guess(x, p)
    p.s_xyz(x[1], x[2], x[3]) + ((1.0 - p.delta)/(1.0 + p.r))*max(0.0, p.s_xyz(x[1], x[2], x[3]))
end

# Construct our approximation (this will evaluate f at the needed points, using
# all available threads)
fun = AdaptiveSparseGrid(x -> init_guess(x, p), p.lower_bound, p.upper_bound,
                         max_depth = 5,    # The maximum depth of basis elements in
                                            # each dimension
                         tol = 0.015)        # Add nodes when
                                            # min(abs(alpha/f(x)), abs(alpha)) < tol
fun_init = deepcopy(fun)
xyz_grid = extract_grid(fun)
pred_old = [fun(xyz_grid[i,:]) for i in 1:size(xyz_grid,1)]

# B. Function that calculates the RHS of the Bellman equation, given the guess from the previous iteration
function RHS(x, p, fun_old)

    vals = zeros(p.nb_nodes, 3)
    for (k, innovation) in enumerate(p.nodes_E)
        vals[k,1] = x[1]
        vals[k,2] = x[2]
        # Use min(max(...)) to stay within the boundaries (no extrapolation allowed)
        # Note: bounds are read from fun_old (the argument), not from the global fun
        vals[k,3] = min(max(fun_old.bounds[3,1], (x[3].^p.rho).*exp.(innovation)), fun_old.bounds[3,2])
    end

    p.s_xyz(x[1], x[2], x[3]) + ((1.0 - p.delta)/(1.0 + p.r)).*sum(p.weigths_E.*max.(0.0, [fun_old(vals[i,:]) for i in 1:size(vals, 1)]))

end

# C. Solve the problem with adaptive sparse grid
# parameters for VFI
tol_VF = 1e-8 #tolerance for VFI
max_iter_VF = 5000 #max iterations for VFI
show_every = max_iter_VF #display difference
# initialize values
fun_old = deepcopy(fun_init)
pred_old = [fun_old(xyz_grid[i,:]) for i in 1:size(xyz_grid,1)]
max_depth = 5 #max depth for sparse grid
tol = 0.015 #tolerance on sparse grid

@time begin
    # Initialize
    for i = 1:max_iter_VF

        fun = AdaptiveSparseGrid(x -> RHS(x, p, fun_old), p.lower_bound, p.upper_bound,
                         max_depth = max_depth,    # The maximum depth of basis elements in
                                            # each dimension
                         tol = tol)        # Add nodes when
                                            # min(abs(alpha/f(x)), abs(alpha)) < tol

        # Prediction on fixed grid
        pred_new = [fun(xyz_grid[i,:]) for i in 1:size(xyz_grid,1)]

        # DISTANCE
        # Check convergence on the fixed reference grid xyz_grid, since the set of
        # sparse-grid nodes can change from one iteration to the next
        diff = maximum(abs.(pred_new .- pred_old))
        if mod(i, show_every) == 0
            println("Iter $(i) Diff : $(diff)")
            println("Max nb points on sparse grid: $(length(fun.nodes))")
        end

        if diff < tol_VF
            println("Iter $(i) Convergence reached")
            break
        end

        #UPDATE
        pred_old = copy(pred_new)
        fun_old = deepcopy(fun)

    end
end
Iter 1297 Convergence reached
 61.046396 seconds (435.68 M allocations: 17.780 GiB, 3.44% gc time, 0.96% compilation time)
# Plot output
p1 = plot(p.x_grid, p.y_grid, (x, y) -> S_dense([x; y; 1.0])[1], label = "f(x)", st=:contour)
title!(L"S(x,y,z=1) Dense")
xlabel!(L"x")
ylabel!(L"y")

# Initialization function
xyz_grid = extract_grid(fun)
p2 = plot(p.x_grid, p.y_grid, (x, y) -> fun([x;y;1.0]), label = "f(x)", st=:contour)
title!(L"S(x,y,z=1) Sparse")
xlabel!(L"x")
ylabel!(L"y")

ratio = 9/16
width = 800
pp = plot(p1, p2, size = (width, width*ratio))


II.C. Sparse grid versus dense grid

As illustrated below, the sparse grid puts more points where the function $S(.)$ changes sign (the top-left and bottom-right parts of the $(x,y)$ space). This matters because the sign of $S(.)$ determines whether or not a job is feasible for a worker of type $x$ and a firm of productivity $y$.

p1=scatter(xyz_grid[:,1], xyz_grid[:,2], m=(:black,2),title="Sparse grid", xlabel=L"x", ylabel=L"y")
p2=scatter(xyz_grid[:,1], xyz_grid[:,3], m=(:black,2),title="Sparse grid", xlabel=L"x", ylabel=L"z")
p3=scatter(xyz_grid[:,2], xyz_grid[:,3], m=(:black,2),title="Sparse grid", xlabel=L"y", ylabel=L"z")
p4 = plot(xyz_grid[:,1], xyz_grid[:,2], [fun(xyz_grid[i,:]) for i in 1:size(xyz_grid,1)], m=(:black,2),title="Sparse grid", st=:scatter3d, xlabel=L"x", ylabel=L"y", zlabel=L"S(x,y,z)")
plot(p1, p2, p3, p4, legend=false, title="Number of points: $(size(xyz_grid, 1))")


II.D. Accuracy

Looking at the squared mean error and the squared median error reported in the table below, the adaptive sparse grid method is more accurate. This holds despite the sparse grid using far fewer points: 1,644 against 4,410 for the dense grid.

#Random draws from space
nb_draws = 1000
x_grid_rand = rand(Uniform(p.lower_bound[1], p.upper_bound[1]), nb_draws)
y_grid_rand = rand(Uniform(p.lower_bound[2], p.upper_bound[2]), nb_draws)
z_grid_rand = rand(Uniform(p.lower_bound[3], p.upper_bound[3]), nb_draws);

#LHS
LHS_dense = zeros(nb_draws)
LHS_sparse = zeros(nb_draws)
for (index, (xValue, yValue, zValue)) in enumerate(zip(x_grid_rand, y_grid_rand, z_grid_rand))
    LHS_dense[index] = S_dense([xValue; yValue; zValue])
    LHS_sparse[index] = fun([xValue; yValue; zValue])
end

#RHS
RHS_dense = zeros(nb_draws)
RHS_sparse = zeros(nb_draws)
for (index, (xValue, yValue, zValue)) in enumerate(zip(x_grid_rand, y_grid_rand, z_grid_rand))
    RHS_dense[index] = p.s_xyz(xValue, yValue, zValue) + ((1.0 - p.delta)/(1.0 + p.r)).*sum(p.weigths_E.*[max.(0.0, S_dense([xValue; yValue; (zValue.^p.rho).*exp.(innovation)])) for innovation in p.nodes_E])
    RHS_sparse[index] = RHS([xValue, yValue, zValue], p, fun)
end

err_dense = LHS_dense - RHS_dense
err_sparse = LHS_sparse - RHS_sparse

p1 = histogram(err_dense, label="Error VFI dense")
p2 = histogram(err_sparse, label="Error VFI sparse")
plot(p1, p2)


# Summary statistics of the errors
# (note: as computed, these are squared summary statistics of the signed errors,
# e.g. MEAN_SE = (mean error)^2, rather than means of squared errors)
df = DataFrame(Method = ["VFI Dense", "VFI Sparse"],
                MEAN_SE = [mean(err_dense).^2, mean(err_sparse).^2],
                MEDIAN_SE = [median(err_dense).^2, median(err_sparse).^2],
                MAX_SE = [maximum(err_dense).^2, maximum(err_sparse).^2],
                MIN_SE = [minimum(err_dense).^2, minimum(err_sparse).^2],
                NB_Points = [size(vals_grid,1), size(xyz_grid,1)])

Method       MEAN_SE      MEDIAN_SE    MAX_SE       MIN_SE       NB_Points
VFI Dense    3.5566e-7    1.8067e-8    0.00135398   6.68973e-8   4410
VFI Sparse   7.1713e-8    1.63528e-8   0.0010258    0.000577696  1644

Conclusion

When no closed-form solution is available, the curse of dimensionality makes "brute-force" methods (e.g. value function iteration on a dense grid) impractical, if not infeasible. Adaptive sparse grids are one promising way to work with high-dimensional models.



Appendix

References
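Brumm, J. and Scheidegger, S. (2017). "Using Adaptive Sparse Grids to Solve High-Dimensional Dynamic Models." Econometrica, 85(5), 1575-1612.

Lise, J. and Robin, J.-M. (2017). "The Macrodynamics of Sorting between Workers and Firms." American Economic Review, 107(4), 1104-1135.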

Version

versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
