Numerical Approximation of Expectations Squared

Introduction

Recently, I came across a simple question that led me down quite an interesting rabbit hole. The question is very simple (and very niche), but the analysis involves key results from statistical theory and numerical analysis: the Central Limit Theorem, the Delta method and its second-order variant, and Monte Carlo integration.

Here is the seemingly innocent (and slightly bizarre) question:

  • Context: One is interested in evaluating the square of an expectation $E\Big[g(X) \Big]^2 = \Big(\int_{a}^{b} g(x) f_X(x) dx \Big)^2 = \mu^2$, where $\mu$ is an unknown quantity “close to zero”. Note that the squared integral can be rewritten as a double integral: $\Big(\int_{a}^{b} g(x) f_X(x) dx \Big)^2 = \int_{a}^{b} \int_{a}^{b} g(x) g(y) f_X(x) f_X(y) dx dy$. Transforming the square of an integral into a double integral is a standard trick, used for instance when calculating the Gaussian integral. Let’s say one wants to numerically approximate $\mu^2$ using Monte Carlo integration.
  • Question: Is it better to approximate $\mu^2$ using the square of the Monte Carlo estimator, $\Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2$, or the Monte Carlo estimator based on the double integral, $\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i)$? Here the samples $\{x_i\}$ and $\{y_i\}$ are i.i.d. draws from the density of $X$: they share the same distribution, but they are drawn independently of each other. (With a slight abuse of notation, from now on I write $f$ for the integrand, so that $\mu = E[f(X)]$.)
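To fix ideas, here is a minimal sketch of the two competing estimators in Julia (the function names are mine, for illustration only):

using Statistics

# Estimator 1: square of the plain Monte Carlo average
estimator_square(f, x) = mean(f.(x))^2

# Estimator 2: Monte Carlo average for the double integral,
# based on two independent samples x and y from the same density
estimator_product(f, x, y) = mean(f.(x) .* f.(y))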

I. Consistency

Notice that both estimators are consistent. That is, as we increase $N$, each estimator converges in probability to the true value of the parameter $\mu^2$.

In what follows, I use the notation $\mu$ to refer to $E\big[f(X) \big]$ and $\sigma^2$ to refer to $Var\big[f(X) \big]$.

Estimator $\big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2$

We know that Monte Carlo estimators are consistent, so $\frac{1}{N} \sum_{i=1}^{N} f(x_i) \overset{p}{\to} \mu$. Because the function $x \mapsto x^2$ is continuous on $\mathbb{R}$, the continuous mapping theorem implies that $\Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i)\Big)^2 \overset{p}{\to} \mu^2$.

Estimator $\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i) $

Once again, we know that Monte Carlo estimators are consistent, so $\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i) \overset{p}{\to} \int_{a}^{b} \int_{a}^{b} g(x) g(y) f_X(x) f_X(y) dx dy = \Big(\int_{a}^{b} g(x) f_X(x) dx \Big)^2 = \mu^2$.

Hence, both estimators get closer to the true value as the number of draws $N$ increases.
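A quick numerical sanity check of consistency (a minimal sketch; taking $f(x) = x$ and Normal draws with $\mu = 0.1$ is an arbitrary choice of mine):

using Distributions, Statistics, Random

Random.seed!(1234)
mu = 0.1
d = Normal(mu, 1.0)

for N in [10^2, 10^4, 10^6]
    x, y = rand(d, N), rand(d, N)
    println("N = $N: square = $(mean(x)^2), product = $(mean(x .* y)), true value = $(mu^2)")
end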

II. Bias

Interestingly, $\Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2$ has a small-sample bias, while $\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i)$ does not.

Estimator $\big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2$

$$ E\Big[ \Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2\Big] = \Big(E \Big[\frac{1}{N} \sum_{i=1}^{N} f(x_i)\Big] \Big)^2 + Var\Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i)\Big)$$

$$ E\Big[ \Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2\Big] = \mu^2 + \frac{\sigma^2}{N} $$

$$ Bias\Big(\Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2\Big) = \frac{\sigma^2}{N} $$

Question for the careful reader: How can you remove the bias? (See the answer below).

Estimator $\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i) $

$$ E [\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i)] = \frac{1}{N} \sum_{i=1}^{N} E[f(x_i) f(y_i)] $$

$$ = \frac{1}{N} \sum_{i=1}^{N} E[f(x_i)] E[ f(y_i)] $$

$$ = \mu^2 $$

$$ Bias(\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i) ) = 0 $$

where going from the first line to the second uses the fact that $X$ and $Y$ are independent random variables. However, the bias is just one part of the whole story: to see which estimator is better, we also need to investigate their variances.
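Before turning to variances, these bias formulas are easy to verify by simulation. Below is a minimal sketch (the distribution, sample size, and replication count are arbitrary choices of mine):

using Distributions, Statistics, Random

Random.seed!(1234)
mu, sigma, N, R = 0.1, 1.0, 100, 100_000
d = Normal(mu, sigma)

# Average each estimator over R independent replications
square_vals  = [mean(rand(d, N))^2 for _ in 1:R]
product_vals = [mean(rand(d, N) .* rand(d, N)) for _ in 1:R]

println("mean of square estimator:  $(mean(square_vals)) vs mu^2 + sigma^2/N = $(mu^2 + sigma^2/N)")
println("mean of product estimator: $(mean(product_vals)) vs mu^2 = $(mu^2)")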

III. Variance. Case $\mu > 0$

Estimator $\big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2$

Note that $\bar{X_n} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$ is a sample mean. The Central Limit Theorem implies that:

$$\sqrt{N} \frac{(\bar{X_n} - \mu)}{\sigma} \overset{d}{\to} N(0,1)$$

To get the (asymptotic) variance of $\bar{X_n}^2$, one may use the delta method. The delta method states that if a sequence of random variables $\bar{X_n}$ satisfies:

$$\sqrt{N} (\bar{X_n} - \theta) \overset{d}{\to} N(0,\sigma^2)$$

with $\theta$ and $\sigma^2$ finite constants, and if $g$ is a function with a continuous first derivative such that $g'(\theta)$ is non-zero, then:

$$\sqrt{N} (g(\bar{X_n}) - g(\theta)) \overset{d}{\to} N(0,\sigma^2 [g'(\theta)]^2)$$

Here $g(x) = x^2$ and $g'(x) = 2x$, so the delta method implies that:

$$\sqrt{N} (\bar{X_n}^2 - \mu^2) \overset{d}{\to} N(0, 4\mu^2 \sigma^2)$$
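As a quick check of this asymptotic result (a sketch with arbitrary parameter values), the variance of $\sqrt{N}(\bar{X_n}^2 - \mu^2)$ across replications should be close to $4\mu^2\sigma^2$:

using Distributions, Statistics, Random

Random.seed!(1234)
mu, sigma, N, R = 0.5, 1.0, 10_000, 5_000
d = Normal(mu, sigma)

z = [sqrt(N) * (mean(rand(d, N))^2 - mu^2) for _ in 1:R]
println("empirical variance: $(var(z)) vs delta method 4*mu^2*sigma^2 = $(4 * mu^2 * sigma^2)")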

Estimator $\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i) $

Let’s use the notation $\bar{XY}_n = \frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i) $. Once again, the Central Limit Theorem implies that:

$$\sqrt{N} (\bar{XY}_n - \mu^2) \overset{d}{\to} N(0, \sigma_{XY}^2)$$

with $\sigma_{XY}^2 = Var\big(f(X) f(Y)\big)$, the variance of a single term $f(x_i) f(y_i)$ (so that $Var(\bar{XY}_n) = \sigma_{XY}^2/N$).

If $X$ and $Y$ are independent random variables, then $Var(XY) = Var(X)Var(Y) + Var(X)(E[Y]^2) + Var(Y)(E[X]^2)$.

Here $Var(X) = Var(Y) = \sigma^2$ and $E[X] = E[Y] = \mu$ (writing $X$ for $f(X)$ and $Y$ for $f(Y)$), so the above equation simplifies to:

$$ Var(XY) = \sigma^4 + 2 \mu^2 \sigma^2 $$

Comparison $Var(\bar{XY}_n)$ vs $Var(\bar{X}_n^2)$

Comparing the asymptotic variances $4\mu^2\sigma^2$ and $\sigma^4 + 2\mu^2\sigma^2$, the crossover occurs at $\mu = \sigma/\sqrt{2}$: for $\mu > \sigma/\sqrt{2}$ the estimator $\bar{XY}_n$ has the smaller variance, while for $\mu < \sigma/\sqrt{2}$ the estimator $\bar{X_n}^2$ does. This property is illustrated below by plotting the function $\mu \mapsto (\sigma^4 + 2\mu^2\sigma^2) - 4\mu^2\sigma^2$ for different choices of $\sigma$.

using Plots
using LaTeXStrings
using Distributions
gr()

N = 100
grid_sigma = collect(range(0, 1.0, length=N)) #only used by the commented-out contour plot below
grid_mu = collect(range(0, 1.0, length=N))
grid_mu_smaller = collect(range(0, 0.25, length=N))

# Difference between the two asymptotic variances: Var(f(X)f(Y)) - 4 mu^2 sigma^2
f(mu, sigma) = (sigma^4 + 2.0 * mu^2 * sigma^2) - 4.0 * mu^2 * sigma^2

#p = plot(grid_mu, grid_sigma, (x,y) -> f(x,y), st = :contourf, xlabel=L"\mu", ylabel=L"\sigma", fill=true)
p0 = plot(grid_mu, x -> 0.0, label=L"0", linestyle=:dash, ylabel=L"(\sigma^4 + 2\mu^2\sigma^2) - 4\mu^2\sigma^2")
plot!(p0, grid_mu, x -> f(x,1.0), xlabel=L"\mu", label=L"f(\mu, \sigma=1.0)")
plot!(p0, grid_mu, x -> f(x,0.75), xlabel=L"\mu", label=L"f(\mu, \sigma=0.75)")
plot!(p0, grid_mu, x -> f(x,0.5), xlabel=L"\mu", label=L"f(\mu, \sigma=0.5)")

p1 = plot(grid_mu_smaller, x -> 0.0, label=L"0", linestyle=:dash, ylabel=L"(\sigma^4 + 2\mu^2\sigma^2) - 4\mu^2\sigma^2")
plot!(p1, grid_mu_smaller, x -> f(x,0.25), xlabel=L"\mu", label=L"f(\mu, \sigma=0.25)")
plot!(p1, grid_mu_smaller, x -> f(x,0.20), xlabel=L"\mu", label=L"f(\mu, \sigma=0.20)")
plot!(p1, grid_mu_smaller, x -> f(x,0.15), xlabel=L"\mu", label=L"f(\mu, \sigma=0.15)")

plot(p0, p1)

[Figure: the variance difference $(\sigma^4 + 2\mu^2\sigma^2) - 4\mu^2\sigma^2$ plotted against $\mu$ for several values of $\sigma$]

In the block of code below, I simulate the distributions of $\bar{XY}_n$ and $\bar{X}_n^2$ in two cases. First, I assume that $X$ is normally distributed with mean $\mu$ and unit variance. In this case, we do not even need the Central Limit Theorem, because a sum of i.i.d. Normal variables is itself normally distributed. In the second case, I assume that $X$ is uniformly distributed, $Uni(\mu-1, \mu+1)$, in which case the mean of $X$ is equal to $\mu$.

As illustrated below, in both instances $\bar{X}_n^2$ has a smaller variance than $\bar{XY}_n$. The figure also underlines that even when $X$ is not normally distributed, the CLT kicks in and the above analysis provides a good approximation of the small-sample behavior of our estimators.

N = 5000 #number of draws
N_replicate = 5000 #number of replications for the distribution of the estimator
list_plots = [] #to store plots
sigma = 1.0 #standard deviation when simulating from Normal

for (index_distribution, distribution) in enumerate(["Normal", "Uniform"])
    for (index_mu, mu) in enumerate([0.20, 0.10, 0.05])

        #println(mu)

        if distribution == "Normal"
            d_x = Normal(mu, sigma)
            d_x1 = Normal(mu, sigma)
            d_x2 = Normal(mu, sigma)
        else
            d_x = Uniform(-1,1) + mu
            d_x1 = Uniform(-1,1) + mu
            d_x2 = Uniform(-1,1) + mu
        end

        # here f(x) = x (identity)
        distristribution_square = zeros(N_replicate) # (mean(x .- mu))^2
        distristribution_square_centered = zeros(N_replicate) # sqrt(N)*(mean(x)^2 - mu^2)
        distristribution_product_centered = zeros(N_replicate) # sqrt(N)*(mean(x1 .* x2) - mu^2)

        # Stats on f(X)
        distristribution_std_f = zeros(N_replicate) #std f(X)
        distristribution_mean_f = zeros(N_replicate) #mean f(X)
        distribution_var_f = zeros(N_replicate) #var f(X)

        distristribution_product_std = zeros(N_replicate)
        distristribution_product_mean = zeros(N_replicate)

        for i=1:N_replicate

            draws_x = rand(d_x, N)
            draws_x1 = rand(d_x1, N)
            draws_x2 = rand(d_x2, N)

            distristribution_square[i] = mean(draws_x .- mu)^2
            distristribution_square_centered[i] = sqrt(N)*(mean(draws_x)^2 - mu^2)
            distristribution_product_centered[i] = sqrt(N)*(mean(draws_x1.*draws_x2) - mu^2)

            # Stats on f(X)
            distristribution_std_f[i] = std(draws_x)
            distristribution_mean_f[i] = mean(draws_x)
            distribution_var_f[i] = var(draws_x)

            distristribution_product_std[i] = std(draws_x1.*draws_x2)
            distristribution_product_mean[i] = mean(draws_x1.*draws_x2)

        end

        # Mean of certain statistics
        mu_fX = mean(distristribution_mean_f)
        sigma_fX = mean(distristribution_std_f)
        variance_fX = mean(distribution_var_f)

        # Empirical sigmas
        # Var(XY)
        var_n0_empirical = variance_fX^2 + variance_fX*(2*mu_fX^2)

        # Var(X^2)
        var_n1_empirical = (4*mu_fX^2*sigma_fX^2)        

        # Theoretical sigmas when using Normal draws
        var_n0 = sigma^2 * sigma^2 + sigma^2*(2*mu^2)
        var_n1 = (4*mu^2*sigma^2)

        pdf_normal_0 = Normal(0, sqrt(var_n0_empirical))
        pdf_normal_1 = Normal(0, sqrt(var_n1_empirical))

        # Plotting
        show_legend = false
        if index_mu == 2 && index_distribution == 1
            show_legend = true
        end
        title_name = ""
        if index_distribution == 1
            title_name = "N($(mu), $(sigma))"
        else
            title_name = "U(-1, 1) + $(mu)"
        end

        p1 = histogram(distristribution_product_centered, label=L"$\sqrt{N}(\bar{X_1 X_2}_{n} - \mu^2 ) $", normalize=true, alpha=0.5)
        histogram!(p1, distristribution_square_centered, label=L"$\sqrt{N}(\bar{X}_{n}^2 - \mu^2 )$", normalize=true, alpha=0.5, title=title_name)

        plot!(p1, minimum(distristribution_product_centered):0.1:maximum(distristribution_product_centered), x-> pdf(pdf_normal_0, x), label=L"$N(0, \sigma^4 + 2\mu^2\sigma^2)$")
        plot!(p1, minimum(distristribution_square_centered):0.1:maximum(distristribution_square_centered), x-> pdf(pdf_normal_1, x), label=L"$N(0, 4 \mu^2 \sigma^2)$", legend = show_legend)
        push!(list_plots, p1)

    end

end

plot(list_plots[1], list_plots[2], list_plots[3], list_plots[4], list_plots[5], list_plots[6])
plot!(size=(1000,600))

[Figure: histograms of the centered and scaled estimators for Normal and Uniform draws, $\mu \in \{0.20, 0.10, 0.05\}$, together with their Normal limit densities]

IV. Variance. Case $\mu = 0$

The limit case $\mu = 0$ deserves special treatment. Indeed, for the delta method to apply, the condition that $g'(\theta)$ is non-zero must hold. However, with $g(\theta) = \theta^2$, $g'(0) = 2 \times 0 = 0$.

In that context, the second-order delta method applies:

Second-order delta method

Consider a sequence of random variables $\bar{X_n}$ satisfying:

$$\sqrt{N} (\bar{X_n} - \theta) \overset{d}{\to} N(0,\sigma^2)$$

with $\theta$ and $\sigma^2$ finite constants. Assume that $g$ is a function with continuous first and second derivatives, satisfying $g'(\theta) = 0$ and $g''(\theta) \neq 0$. Then:

$$N (g(\bar{X_n}) - g(\theta)) \overset{d}{\to} \sigma^2 \frac{g''(\theta)}{2} \chi^2(1)$$

So applying the second-order delta method with $g(\theta) = \theta^2$ and $\mu = 0$ gives us:

$$N (\bar{X_n}^2) \overset{d}{\to} \sigma^2 \chi^2(1)$$

Using the fact that $Var(\chi^2(1)) = 2$ and rearranging terms gives:

$$ Var(\bar{X_n}^2) \approx \frac{2 \sigma^4}{N^2}$$

This approximation should get increasingly better as $N$ increases.
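The $1/N^2$ rate is easy to check by simulation (a sketch, drawing from a standard Normal so that $\sigma^2 = 1$; the replication count is arbitrary):

using Distributions, Statistics, Random

Random.seed!(1234)
sigma, R = 1.0, 20_000
d = Normal(0.0, sigma)

for N in [100, 1_000]
    v = var([mean(rand(d, N))^2 for _ in 1:R])
    println("N = $N: empirical Var = $v vs 2*sigma^4/N^2 = $(2 * sigma^4 / N^2)")
end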

Comparison $Var(\bar{XY}_n)$ vs $Var(\bar{X}_n^2)$

When $\mu=0$:

$$Var(\bar{XY}_n) = \frac{\sigma^4}{N}$$

and

$$Var(\bar{X_n}^2) = \frac{2\sigma^4}{N^2}$$

Because of the $N^2$ in the denominator, we see that $\bar{X_n}^2$ converges much faster than $\bar{XY}_n$. This property is illustrated in the plot below, in which I repeat the same exercise as before, but with $\mu = 0$. Note that I show the histogram of $\sqrt{N}(\bar{X_1 X_2}_{n} - \mu^2 )$ versus $N(\bar{X}^2_{n} - \mu^2 )$ (scaling by $\sqrt{N}$ versus $N$).

N = 1000 #number of draws
N_replicate = 5000 #number of replications for the distribution of the estimator
list_plots = []
sigma = 1.0 #standard deviation when simulating from Normal

for (index_distribution, distribution) in enumerate(["Normal", "Uniform"])
    for (index_mu, mu) in enumerate([0.0])

        if distribution == "Normal"
            d_x = Normal(mu, sigma)
            d_x1 = Normal(mu, sigma)
            d_x2 = Normal(mu, sigma)
        else
            d_x = Uniform(-1,1) + mu
            d_x1 = Uniform(-1,1) + mu
            d_x2 = Uniform(-1,1) + mu
        end

        # here f(x) = x (identity)
        distristribution_square = zeros(N_replicate) # (mean(x .- mu))^2
        distristribution_square_centered = zeros(N_replicate) # N*(mean(x)^2 - mu^2)/Var(f(X))
        distristribution_product_centered = zeros(N_replicate) # sqrt(N)*(mean(x1 .* x2) - mu^2)

        # Stats on f(X)
        distristribution_std_f = zeros(N_replicate) #std f(X)
        distristribution_mean_f = zeros(N_replicate) #mean f(X)
        distribution_var_f = zeros(N_replicate) #var f(X)

        distristribution_product_std = zeros(N_replicate)
        distristribution_product_mean = zeros(N_replicate)

        for i=1:N_replicate

            draws_x = rand(d_x, N)
            draws_x1 = rand(d_x1, N)
            draws_x2 = rand(d_x2, N)

            distristribution_square[i] = mean(draws_x .- mu)^2
            distristribution_square_centered[i] = N*(mean(draws_x)^2 - mu^2)/var(draws_x) #scale by Var(f(X)) so the chi^2(1) limit applies to both distributions
            distristribution_product_centered[i] = sqrt(N)*(mean(draws_x1.*draws_x2) - mu^2)

            # Stats on f(X)
            distristribution_std_f[i] = std(draws_x)
            distristribution_mean_f[i] = mean(draws_x)
            distribution_var_f[i] = var(draws_x)

            distristribution_product_std[i] = std(draws_x1.*draws_x2)
            distristribution_product_mean[i] = mean(draws_x1.*draws_x2)

        end

        # Mean of certain statistics
        mu_fX = mean(distristribution_mean_f)
        sigma_fX = mean(distristribution_std_f)
        variance_fX = mean(distribution_var_f)

        # Empirical sigmas
        # Var(XY)
        var_n0_empirical = variance_fX^2 + variance_fX*(2*mu_fX^2)

        pdf_normal_0 = Normal(0, sqrt(var_n0_empirical))
        pdf_chisq_1 = Chisq(1)

        # Plotting
        show_legend = false
        if index_mu == 1 && index_distribution == 2
            show_legend = true
        end
        title_name = ""
        if index_distribution == 1
            title_name = "N($(mu), $(sigma))"
        else
            title_name = "U(-1, 1) + $(mu)"
        end

        p1 = histogram(distristribution_product_centered, label=L"$\sqrt{N}(\bar{X_1 X_2}_{n} - \mu^2 ) $", normalize=true, alpha=0.5)
        histogram!(p1,distristribution_square_centered, label=L"$N(\bar{X}_{n}^2 - \mu^2)/\sigma^2$", normalize=true, alpha=0.5, title=title_name)

        plot!(p1, minimum(distristribution_product_centered):0.1:maximum(distristribution_product_centered), x-> pdf(pdf_normal_0, x), label=L"$N(0, \sigma^4 + 2\mu^2\sigma^2)$")
        plot!(p1, 0.001:0.01:maximum(distristribution_square_centered), x-> pdf(pdf_chisq_1, x), label=L"$\chi^2(1)$", legend = show_legend)
        push!(list_plots, p1)

    end

end

plot(list_plots[1], list_plots[2])
plot!(size=(1000,300))

[Figure: histograms of the centered and scaled estimators for $\mu = 0$, with the $N(0, \sigma^4)$ and $\chi^2(1)$ limits]

V. Mean Squared Error

To measure the quality of an estimator, it is common to use the mean squared error (MSE). In plain English, the mean squared error measures the average squared difference between the estimated values and the actual value:

$$ MSE(\hat{\theta}) = E_{\theta} [ (\hat{\theta} - \theta)^2] $$

The MSE has the appealing property that it is equal to the variance of the estimator, plus the bias squared:

$$ MSE(\hat{\theta}) = V [ \hat{\theta} ] + Bias(\hat{\theta})^2$$
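This decomposition follows from adding and subtracting $E[\hat{\theta}]$ inside the square; the cross term vanishes because $E\big[\hat{\theta} - E[\hat{\theta}]\big] = 0$:

$$ E_{\theta} [ (\hat{\theta} - \theta)^2] = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] + \big(E[\hat{\theta}] - \theta\big)^2 = V[\hat{\theta}] + Bias(\hat{\theta})^2 $$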

Fortunately, we have already done the heavy lifting by calculating the biases and the variances of the estimators. Collecting the previous results, we have:

Case $\mu > 0$

$$ MSE(\bar{X}^2_n) = \frac{\sigma^4 + 4N \mu^2 \sigma^2}{N^2} $$

$$ MSE(\bar{XY}_n) = \frac{\sigma^4 + 2 \mu^2 \sigma^2}{N} $$

Case $\mu = 0$

$$ MSE(\bar{X}^2_n) = \frac{3 \sigma^4}{N^2}$$

$$ MSE(\bar{XY}_n) = \frac{\sigma^4}{N} $$

It is easy to see that when $\mu$ is close to or equal to 0, it is better to use $\bar{X}^2_n$ instead of $\bar{XY}_n$. In the next section, I check this numerically with some examples.

VI. Numerical illustration

Case $\mu = 0$

The takeaway from the previous sections is that if we suspect that $\mu$ lies in a neighborhood of 0, we are better off using $\bar{X_n}^2$ instead of $\bar{XY}_n$.

Now I compare the accuracy of both approaches in a numerical integration exercise. For instance, let’s approximate the integral:

$$ \Big(\int_{-\infty}^{+\infty} x^3 \frac{1}{\sqrt{2\pi}} \exp\big(-\frac{1}{2}x^2\big) dx \Big)^2 $$

The true value of the integral is known to be zero, since the inner integral is an odd central moment of a normally distributed random variable. The left panel in the next graph shows the mean squared error for $\bar{X}^2_n$ and $\bar{XY}_n$. As expected given the previous sections, $\bar{X}^2_n$ provides a much better approximation of $\mu^2$ than $\bar{XY}_n$.

grid_N = [100, 1000, 10000, 100000, 1000000]
N_replicate = 1000 #number of replications for the distribution of the estimator

# To store stats values
distristribution_mean = zeros(length(grid_N), N_replicate)
distristribution_var = zeros(length(grid_N), N_replicate)
distristribution_var_product = zeros(length(grid_N), N_replicate)
distristribution_square_mean = zeros(length(grid_N), N_replicate)
distristribution_product_mean = zeros(length(grid_N), N_replicate)

# Store stats on distribution of values for each replication

distristribution_mean_var = zeros(length(grid_N))
distristribution_mean_var_product = zeros(length(grid_N))
distristribution_square_average = zeros(length(grid_N))
distristribution_square_var = zeros(length(grid_N))

distristribution_product_average = zeros(length(grid_N))
distristribution_product_var = zeros(length(grid_N))

# To store stats on Mean Squared Error (MSE)
distristribution_square_MSE_average = zeros(length(grid_N))
distristribution_square_MSE_var = zeros(length(grid_N))

distristribution_product_MSE_average = zeros(length(grid_N))
distristribution_product_MSE_var = zeros(length(grid_N))


mu = 0
sigma = 1.0 #standard deviation when simulating from Normal
f(x) = x.^3

for (index_N, N) in enumerate(grid_N)

    d_x = Normal(mu, sigma)
    d_x1 = Normal(mu, sigma)
    d_x2 = Normal(mu, sigma)

    for i=1:N_replicate

        draws_x = rand(d_x, N)
        draws_x1 = rand(d_x1, N)
        draws_x2 = rand(d_x2, N)

        distristribution_mean[index_N, i] = mean(f(draws_x)) #mean f(x)
        distristribution_var[index_N, i] = var(f(draws_x)) #variance of f(x)
        distristribution_var_product[index_N, i] = var(f(draws_x1).*f(draws_x2)) #variance of f(x1)*f(x2)
        distristribution_square_mean[index_N, i] = mean(f(draws_x))^2
        distristribution_product_mean[index_N, i] = mean(f(draws_x1).*f(draws_x2))

    end

    #Stats across different replications
    # Variance estimator
    # E(x^2)
    distristribution_mean_var[index_N] = mean(distristribution_var[index_N, :]) #average of variances
    distristribution_mean_var_product[index_N] = mean(distristribution_var_product[index_N, :]) #average of variances of products
    distristribution_square_var[index_N] = var(distristribution_square_mean[index_N, :])
    distristribution_square_average[index_N] = mean(distristribution_square_mean[index_N, :])

    # E(xy)
    distristribution_product_var[index_N] = var(distristribution_product_mean[index_N, :])
    distristribution_product_average[index_N] = mean(distristribution_product_mean[index_N, :])

    # Mean squared error
    # Note that here the true value is 0
    # E(x^2)
    distristribution_square_MSE_var[index_N] = var((distristribution_square_mean[index_N, :]).^2)
    distristribution_square_MSE_average[index_N] = mean((distristribution_square_mean[index_N, :]).^2)

    # E(xy)
    distristribution_product_MSE_var[index_N] = var(distristribution_product_mean[index_N, :].^2)
    distristribution_product_MSE_average[index_N] = mean(distristribution_product_mean[index_N, :].^2)


end

# Plot for the expected value
p0 = plot(log10.(grid_N), log10.(distristribution_square_MSE_average), label=L"Empirical $\bar{X}^2_n$", xlabel=L"$log_{10}(N)$")
plot!(log10.(grid_N), log10.(distristribution_product_MSE_average),  label =L"Empirical $\bar{XY}_n$", xlabel=L"$log_{10}(N)$", ylabel=L"$log_{10}(MSE)$")
plot!(log10.(grid_N), log10.((distristribution_mean_var.^2 )./(grid_N)), label=L"$\sigma^4/N$", linestyle=:dash, linewidth=2.0)
plot!(log10.(grid_N), log10.((3.0 .* distristribution_mean_var.^2 )./(grid_N.^2)), label=L"$3\sigma^4/N^2$", title="Mean Squared Error", linestyle=:dash, linewidth=2.0)

# Plot for the variance
p1 = plot(log10.(grid_N), log10.(distristribution_square_var), label=L"Empirical $\bar{X}^2_n$", xlabel=L"$log_{10}(N)$")
plot!(p1, log10.(grid_N), log10.(distristribution_product_var),  label =L"Empirical $\bar{XY}_n$", xlabel=L"$log_{10}(N)$")
# Variance absolute value of $\bar{XY}_n$
# Using Jensen's inequality, it is slightly below the variance of $\bar{XY}_n$
# plot!(p1, log10.(grid_N), log10.(var(abs.(distristribution_product_mean), dims=2)),  label =L"$|\bar{XY}_n|$", xlabel=L"$log_{10}(N)$")
plot!(p1, log10.(grid_N), log10.((2.0 .* distristribution_mean_var.^2 )./(grid_N.^2)), label=L"$2\sigma^4/N^2$", linestyle=:dash, linewidth=2.0)
plot!(p1, log10.(grid_N), log10.((distristribution_mean_var.^2 )./(grid_N)), label=L"$\sigma^4/N$", title="Variance Estimator", ylabel=L"$log_{10}(Variance)$", linestyle=:dash, linewidth=2.0)


plot(p0, p1)
plot!(size=(800,400))

[Figure: log-log plots of the MSE (left) and variance (right) of $\bar{X}^2_n$ and $\bar{XY}_n$ against $N$, $\mu = 0$]

The log scale of the previous graph does not do justice to the difference between the two estimators. So in the next graph, I keep a linear scale. The blue line is the average value for a given $N$. This graph makes it obvious that $\bar{X}^2_n$ converges much faster to $\mu^2$ than $\bar{XY}_n$.

grid_N = collect(range(100, 100000, step=1000))
N_replicate = 100 #number of replications for the distribution of the estimator

distristribution_square_mean = zeros(length(grid_N), N_replicate)
distristribution_product_mean = zeros(length(grid_N), N_replicate)

# Distribution of values for each replication
distristribution_square_average = zeros(length(grid_N))
distristribution_square_var = zeros(length(grid_N))

distristribution_product_average = zeros(length(grid_N))
distristribution_product_var = zeros(length(grid_N))


mu = 0
sigma = 1.0 #standard deviation when simulating from Normal
f(x) = x.^3

for (index_N, N) in enumerate(grid_N)

    d_x = Normal(mu, sigma)
    d_x1 = Normal(mu, sigma)
    d_x2 = Normal(mu, sigma)

    for i=1:N_replicate

        draws_x = rand(d_x, N)
        draws_x1 = rand(d_x1, N)
        draws_x2 = rand(d_x2, N)

        distristribution_square_mean[index_N, i] = mean(f(draws_x))^2
        distristribution_product_mean[index_N, i] = mean(f(draws_x1).*f(draws_x2))

    end

    #Stats across different replications
    # E(x^2)
    distristribution_square_average[index_N] = mean(distristribution_square_mean[index_N, :])

    # E(xy)
    distristribution_product_average[index_N] = mean(distristribution_product_mean[index_N, :])

end

min_val = minimum(distristribution_product_mean)
max_val = maximum(distristribution_product_mean)
p0 = scatter(grid_N, distristribution_square_mean, legend=false,title=L"$\bar{X}^2_n$", xlabel=L"$N$", alpha=0.5, ylims=(min_val, max_val ))
plot!(p0, grid_N, distristribution_square_average, label="mean", ylims=(min_val, max_val ), linewidth=2.0, color="blue", xrotation = 15)
p1 = scatter(grid_N, distristribution_product_mean, legend=false,title=L"$\bar{XY}_n$", xlabel=L"$N$", ylims=(min_val, max_val ))
plot!(p1, grid_N, distristribution_product_average, label="mean", ylims=(min_val, max_val ), linewidth=2.0, color="blue", xrotation = 15)
plot(p0, p1)
plot!(size=(800,600))

[Figure: draws of $\bar{X}^2_n$ (left) and $\bar{XY}_n$ (right) against $N$ on a linear scale, $\mu = 0$]

Case $\mu$ small

In practice, we do not know for sure that $\mu$ is exactly equal to zero. However, we know that $\mu$ is “close to zero”. Are we still better off using $\bar{X}^2_n$ instead of $\bar{XY}_n$?

grid_N = [10, 100, 1000, 10000, 100000, 1000000]
N_replicate = 1000 #number of replications for the distribution of the estimator

# To store stats values
distristribution_mean = zeros(length(grid_N), N_replicate)
distristribution_var = zeros(length(grid_N), N_replicate)
distristribution_var_product = zeros(length(grid_N), N_replicate)
distristribution_square_mean = zeros(length(grid_N), N_replicate)
distristribution_product_mean = zeros(length(grid_N), N_replicate)

# Store stats on distribution of values for each replication
distristribution_mean_average = zeros(length(grid_N))
distristribution_mean_var = zeros(length(grid_N))
distristribution_mean_var_product = zeros(length(grid_N))
distristribution_square_average = zeros(length(grid_N))
distristribution_square_var = zeros(length(grid_N))

distristribution_product_average = zeros(length(grid_N))
distristribution_product_var = zeros(length(grid_N))

# To store stats on Mean Squared Error (MSE)
distristribution_square_MSE_average = zeros(length(grid_N))
distristribution_square_MSE_var = zeros(length(grid_N))

distristribution_product_MSE_average = zeros(length(grid_N))
distristribution_product_MSE_var = zeros(length(grid_N))


true_value = sqrt(0.01)
mu = true_value  #mean normal
sigma = 1.0 #standard deviation when simulating from Normal
f(x) = x

distribution = "Uniform" # Choice between Normal(mu, sigma) or Uniform(-1,1) + mu

for (index_N, N) in enumerate(grid_N)

    if distribution == "Normal"
        d_x = Normal(mu, sigma)
        d_x1 = Normal(mu, sigma)
        d_x2 = Normal(mu, sigma)
    else
        d_x = Uniform(-1,1) + mu
        d_x1 = Uniform(-1,1) + mu
        d_x2 = Uniform(-1,1) + mu
    end

    for i=1:N_replicate

        draws_x = rand(d_x, N)
        draws_x1 = rand(d_x1, N)
        draws_x2 = rand(d_x2, N)

        distristribution_mean[index_N, i] = mean(f(draws_x)) #mean f(x)
        distristribution_var[index_N, i] = var(f(draws_x)) #variance of f(x)

        distristribution_var_product[index_N, i] = var(f(draws_x1).*f(draws_x2)) #variance of f(x1)*f(x2)
        distristribution_square_mean[index_N, i] = mean(f(draws_x))^2
        distristribution_product_mean[index_N, i] = mean(f(draws_x1).*f(draws_x2))

    end

    #Stats across different replications
    # Average of mean
    distristribution_mean_average[index_N] = mean(mean(distristribution_mean[index_N, :]))

    # Variance estimator
    # E(x^2)
    distristribution_mean_var[index_N] = mean(distristribution_var[index_N, :]) #average of variances
    distristribution_mean_var_product[index_N] = mean(distristribution_var_product[index_N, :]) #average of variances of products

    distristribution_square_var[index_N] = var(distristribution_square_mean[index_N, :])
    distristribution_square_average[index_N] = mean(distristribution_square_mean[index_N, :])

    # E(xy)
    distristribution_product_var[index_N] = var(distristribution_product_mean[index_N, :])
    distristribution_product_average[index_N] = mean(distristribution_product_mean[index_N, :])

    # Mean squared error
    # Note that here the true value is true_value^2 = 0.01
    # E(x^2)
    distristribution_square_MSE_var[index_N] = var((distristribution_square_mean[index_N, :] .- true_value^2).^2)
    distristribution_square_MSE_average[index_N] = mean((distristribution_square_mean[index_N, :] .- true_value^2).^2)

    # E(xy)
    distristribution_product_MSE_var[index_N] = var((distristribution_product_mean[index_N, :] .- true_value^2).^2)
    distristribution_product_MSE_average[index_N] = mean((distristribution_product_mean[index_N, :] .- true_value^2).^2)


end


# Plot for the expected value
p0 = plot(log10.(grid_N), log10.(distristribution_square_MSE_average), label=L"Empirical $\bar{X}^2_n$", xlabel=L"$log_{10}(N)$")
plot!(log10.(grid_N), log10.(distristribution_product_MSE_average),  label =L"Empirical $\bar{XY}_n$", xlabel=L"$log_{10}(N)$", ylabel=L"$log_{10}(MSE)$")
plot!(log10.(grid_N), log10.((distristribution_mean_var.^2 + 2 .* distristribution_mean_average.^2 .* distristribution_mean_var)./(grid_N)), label=L"$(\sigma^4  + 2 \mu^2 \sigma^2)/N$", linestyle=:dash, linewidth=2.0)
plot!(log10.(grid_N), log10.((3.0 .* distristribution_mean_var.^2 )./(grid_N.^2)), label=L"$3\sigma^4/N^2$", title="Mean Squared Error", linestyle=:dash, linewidth=2.0)
plot!(log10.(grid_N), log10.((distristribution_mean_var.^2 + 4 .* grid_N .* distristribution_square_average .* distristribution_mean_var)./(grid_N.^2)), label=L"$\frac{\sigma^4 + 4N \mu^2 \sigma^2}{N^2}$", title="Mean Squared Error", linestyle=:dash, linewidth=2.0)


# Plot for the variance
p1 = plot(log10.(grid_N), log10.(distristribution_square_var), label=L"Empirical $\bar{X}^2_n$", xlabel=L"$log_{10}(N)$")
plot!(p1, log10.(grid_N), log10.(distristribution_product_var),  label =L"Empirical $\bar{XY}_n$", xlabel=L"$log_{10}(N)$")
plot!(p1, log10.(grid_N), log10.((2.0 .* distristribution_mean_var.^2 )./(grid_N.^2)), label=L"$2\sigma^4/N^2$", linestyle=:dash, linewidth=2.0)
plot!(p1, log10.(grid_N), log10.((distristribution_mean_var.^2 + 2 .* distristribution_mean_average.^2 .* distristribution_mean_var)./(grid_N)), label=L"$(\sigma^4  + 2 \mu^2 \sigma^2)/N$", title="Variance Estimator", ylabel=L"$log_{10}(Variance)$", linestyle=:dash, linewidth=2.0)
plot!(log10.(grid_N), log10.((4 .* distristribution_square_average .* distristribution_mean_var)./(grid_N)), label=L"$\frac{4 \mu^2 \sigma^2}{N}$", linestyle=:dash, linewidth=2.0)


plot(p0, p1)
plot!(size=(800,400))

[Figure: log-log plots of the MSE (left) and variance (right) of $\bar{X}^2_n$ and $\bar{XY}_n$ against $N$, $\mu = 0.1$]

Below is another example with a linear scale:

grid_N = collect(range(10, 1000, step=10))
N_replicate = 100 #number of replications for the distribution of the estimator

distristribution_square_mean = zeros(length(grid_N), N_replicate)
distristribution_product_mean = zeros(length(grid_N), N_replicate)

# Distribution of values for each replication
distristribution_square_average = zeros(length(grid_N))
distristribution_square_var = zeros(length(grid_N))

distristribution_product_average = zeros(length(grid_N))
distristribution_product_var = zeros(length(grid_N))


mu = true_value
sigma=1.0 #variance when simulating from Normal
f(x) = x

distribution = "Uniform" # Choice between Normal(mu, sigma) or Uniform(-1,1) + mu

for (index_N, N) in enumerate(grid_N)

    if distribution == "Normal"
        d_x = Normal(mu, sigma)
        d_x1 = Normal(mu, sigma)
        d_x2 = Normal(mu, sigma)
    else
        d_x = Uniform(-1,1) + mu
        d_x1 = Uniform(-1,1) + mu
        d_x2 = Uniform(-1,1) + mu
    end

    for i=1:N_replicate

        draws_x = rand(d_x, N)
        draws_x1 = rand(d_x1, N)
        draws_x2 = rand(d_x2, N)

        distristribution_square_mean[index_N, i] = mean(f(draws_x))^2
        distristribution_product_mean[index_N, i] = mean(f(draws_x1).*f(draws_x2))

    end

    #Stats across different replications
    # E(x^2)
    distristribution_square_average[index_N] = mean(distristribution_square_mean[index_N, :])

    # E(xy)
    distristribution_product_average[index_N] = mean(distristribution_product_mean[index_N, :])

end

min_val = minimum(distristribution_product_mean)
max_val = maximum(distristribution_product_mean)
p0 = scatter(grid_N, distristribution_square_mean, legend=false,title=L"$\bar{X}^2_n$", xlabel=L"$N$", alpha=0.5, ylims=(min_val, max_val ))
plot!(p0, grid_N, distristribution_square_average, label="mean", ylims=(min_val, max_val ), linewidth=3.0, color="blue", xrotation = 15)
plot!(grid_N, x -> mu^2, label=L"\mu^2", linestyle = :dash, linewidth=3.0)
p1 = scatter(grid_N, distristribution_product_mean, legend=false,title=L"$\bar{XY}_n$", xlabel=L"$N$", ylims=(min_val, max_val ))
plot!(p1, grid_N, distristribution_product_average, label="mean", ylims=(min_val, max_val ), linewidth=3.0, color="blue", xrotation = 15)
plot!(p1, grid_N, x -> mu^2, label=L"\mu^2", linestyle = :dash, linewidth=3.0)
plot(p0, p1)
plot!(size=(800,600))

[Figure: draws of $\bar{X}^2_n$ (left) and $\bar{XY}_n$ (right) against $N$ on a linear scale, with the $\mu^2$ line, $\mu = 0.1$]

Conclusion

Based on a comparison of the mean squared error, $\big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2$ is a better estimator than $\frac{1}{N} \sum_{i=1}^{N} f(x_i) f(y_i) $ when $\mu^2$ is small.

Why should we care? In some applications, we aim to find a function $f$ that minimizes the square of an expectation, $\big(E[f(X)]\big)^2$: numerical methods search for $\hat{f}$ such that $\big(E[\hat{f}(X)]\big)^2 \approx 0$. In general, no closed-form solution is available for the expectation, so it must itself be approximated, for instance by Monte Carlo integration. Hence, the question of $\bar{X}_n^2$ versus $\bar{XY}_n$ appears quite naturally in that context.

Appendix

versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4


Extra

We can also use the Minimum Variance Unbiased Estimator (MVUE) of $\mu^2$, which is

$$ T_2 = \Big(\frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big)^2 - \frac{S^2}{N}$$

where $S^2$ is the sample variance. $T_2$ is unbiased because $S^2$ is an unbiased estimator of $\sigma^2$, and we saw above that the bias of $\bar{X}^2_n$ is exactly $\frac{\sigma^2}{N}$.
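Unbiasedness of $T_2$ is also easy to check by simulation. Below is a minimal sketch (the helper name T2 and the parameter values are mine); note that Julia's var computes the unbiased sample variance by default:

using Distributions, Statistics, Random

Random.seed!(1234)
mu, sigma, N, R = 0.1, 1.0, 100, 200_000
d = Normal(mu, sigma)

# T2 = (sample mean)^2 - S^2/N, with S^2 the unbiased sample variance
T2(x) = mean(x)^2 - var(x) / length(x)

t2_vals = [T2(rand(d, N)) for _ in 1:R]
println("mean of T2: $(mean(t2_vals)) vs mu^2 = $(mu^2)")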

$$ V(T_2) = Var(\bar{X_n}^2) + Var(\frac{S^2}{N}) - \frac{2}{N} Cov(\bar{X_n}^2, S^2)$$

Using the Cauchy-Schwarz inequality:

$$ V(T_2) \leq Var(\bar{X_n}^2) + Var(\frac{S^2}{N}) + \frac{2}{N} \sqrt{Var(\bar{X_n}^2)Var(S^2)}$$

We have seen above that for N sufficiently large, $Var(\bar{X_n}^2) \approx \frac{4\mu^2\sigma^2}{N}$. The variance of the sample variance is $Var(S^2) = \frac{\sigma^4}{N}(\kappa - 1 + \frac{2}{N-1})$, with $\kappa$ the kurtosis. Hence,

$$ V(T_2) \leq \frac{4\mu^2\sigma^2}{N} + \frac{\sigma^4}{N^3}(\kappa - 1 + \frac{2}{N-1}) + \frac{4\mu\sigma^3}{N^2}\sqrt{\kappa - 1 + \frac{2}{N-1}} $$


Things get simpler if one assumes normality for $f(X)$. First, normality implies that $\bar{X_n}$ and $S^2$ are independent (see Basu's theorem). Thus $\bar{X_n}^2$ and $S^2$ are also independent, because $x \mapsto x^2$ is a measurable function. In the normal case, $Var(S^2) = \frac{2\sigma^4}{N-1}$. We also have a formula for $Var(\bar{X_n}^2)$ in the normal case:

$$ Var(\bar{X_n}^2) = \frac{4\mu^2\sigma^2}{N} \Big( 1 + \frac{\sigma^2}{2N \mu^2}\Big)$$

source: https://journals.sagepub.com/doi/abs/10.1177/0008068319750115

Hence, with the normality assumption we have:

$$ Var(T_2) = \frac{4\mu^2\sigma^2}{N} + \frac{2\sigma^4}{N (N-1)}$$


Now let’s do a final comparison of the MSEs in the Normal case. Also, let’s use only $N/2$ draws for $\bar{XY}_n$, to keep constant the number of times we call the function $f(\cdot)$ in the code ($\bar{XY}_n$ uses twice as many function calls, because it relies on two independent series of shocks).

$$ MSE(\bar{XY}_n) = 2 \frac{\sigma^4 + 2 \mu^2 \sigma^2}{N} = \frac{4\mu^2\sigma^2}{N} + \frac{2\sigma^4}{N}$$

$$ MSE(\bar{X}^2_n) = \frac{4\mu^2\sigma^2}{N} + \frac{3\sigma^4}{N^2} $$

(this uses the exact normal-case variance of $\bar{X_n}^2$ derived above, which adds $\frac{2\sigma^4}{N^2}$ to the squared bias $\frac{\sigma^4}{N^2}$)

$$ MSE(T_2) = \frac{4\mu^2\sigma^2}{N} + \frac{2\sigma^4}{N (N-1)} $$

  • When $N=2$, $MSE(\bar{XY}_n) = MSE(T_2)$. For $N>2$, $MSE(\bar{XY}_n) > MSE(T_2)$
  • When $N=2$, $MSE(\bar{X}^2_n) < MSE(T_2)$. When $N \geq 3$, $MSE(\bar{X}^2_n) \geq MSE(T_2)$

So if one only looks at the MSE, the suggestion is to use:

  • $\bar{X}^2_n$ when $N=2$
  • $T_2$ when $N\geq 3$


grid_N = [2, 4, 10, 100, 1000, 10000, 100000]
N_replicate = 10000 #number of replications for the distribution of the estimator

# To store stats values
distristribution_mean = zeros(length(grid_N), N_replicate)
distristribution_var = zeros(length(grid_N), N_replicate)
distristribution_var_product = zeros(length(grid_N), N_replicate)
distristribution_square_mean = zeros(length(grid_N), N_replicate)
distribution_T2 = zeros(length(grid_N), N_replicate)
distristribution_product_mean = zeros(length(grid_N), N_replicate)

# Store stats on distribution of values for each replication
distristribution_mean_average = zeros(length(grid_N))
distristribution_mean_var = zeros(length(grid_N))
distristribution_mean_var_product = zeros(length(grid_N))
distristribution_square_average = zeros(length(grid_N))
distristribution_square_var = zeros(length(grid_N))

distristribution_product_average = zeros(length(grid_N))
distristribution_product_var = zeros(length(grid_N))

# To store stats on Mean Squared Error (MSE)
distristribution_square_MSE_average = zeros(length(grid_N))
distristribution_square_MSE_var = zeros(length(grid_N))

distristribution_product_MSE_average = zeros(length(grid_N))
distristribution_product_MSE_var = zeros(length(grid_N))

distristribution_T2_MSE_var = zeros(length(grid_N))
distristribution_T2_MSE_average = zeros(length(grid_N))

true_value = sqrt(0.1)
mu = true_value  #mean normal
sigma = 1.0 #standard deviation when simulating from Normal
f(x) = x

distribution = "Normal" #"Uniform" # Choice between Normal(mu, sigma) or Uniform(-1,1) + mu

for (index_N, N) in enumerate(grid_N)

    N2 = Int(N/2) #to keep the number of function evaluations constant

    if distribution == "Normal"
        d_x = Normal(mu, sigma)
        d_x1 = Normal(mu, sigma)
        d_x2 = Normal(mu, sigma)
    else
        d_x = Uniform(-1,1) + mu
        d_x1 = Uniform(-1,1) + mu
        d_x2 = Uniform(-1,1) + mu
    end

    for i=1:N_replicate

        draws_x = rand(d_x, N)
        draws_x1 = rand(d_x1, N2)
        draws_x2 = rand(d_x2, N2)

        distristribution_mean[index_N, i] = mean(f(draws_x)) #mean f(x)
        distristribution_var[index_N, i] = var(f(draws_x)) #variance of f(x)

        distristribution_var_product[index_N, i] = var(f(draws_x1).*f(draws_x2)) #variance of f(x1)*f(x2)
        distristribution_square_mean[index_N, i] = mean(f(draws_x))^2
        distribution_T2[index_N, i] = mean(f(draws_x))^2 - (var(f(draws_x))/N)
        distristribution_product_mean[index_N, i] = mean(f(draws_x1).*f(draws_x2))

    end

    #Stats across different replications
    # Average of mean
    distristribution_mean_average[index_N] = mean(mean(distristribution_mean[index_N, :]))

    # Variance estimator
    # E(x^2)
    distristribution_mean_var[index_N] = mean(distristribution_var[index_N, :]) #average of variances
    distristribution_mean_var_product[index_N] = mean(distristribution_var_product[index_N, :]) #average of variances of products

    distristribution_square_var[index_N] = var(distristribution_square_mean[index_N, :])
    distristribution_square_average[index_N] = mean(distristribution_square_mean[index_N, :])

    # E(xy)
    distristribution_product_var[index_N] = var(distristribution_product_mean[index_N, :])
    distristribution_product_average[index_N] = mean(distristribution_product_mean[index_N, :])

    # Mean squared error
    # Note that here the true value is true_value^2 = 0.1
    # E(x^2)
    distristribution_square_MSE_var[index_N] = var((distristribution_square_mean[index_N, :] .- true_value^2).^2)
    distristribution_square_MSE_average[index_N] = mean((distristribution_square_mean[index_N, :] .- true_value^2).^2)

    # E(xy)
    distristribution_product_MSE_var[index_N] = var((distristribution_product_mean[index_N, :] .- true_value^2).^2)
    distristribution_product_MSE_average[index_N] = mean((distristribution_product_mean[index_N, :] .- true_value^2).^2)

    # E(x^2) - S^2/N
    distristribution_T2_MSE_var[index_N] = var((distribution_T2[index_N, :] .- true_value^2).^2)
    distristribution_T2_MSE_average[index_N] = mean((distribution_T2[index_N, :] .- true_value^2).^2)


end

# Plot for the expected value
p0 = plot(log10.(grid_N), log10.(distristribution_square_MSE_average), label=L"Empirical $\bar{X}^2_n$", xlabel=L"$log_{10}(N)$")
plot!(log10.(grid_N), log10.(distristribution_product_MSE_average),  label =L"Empirical $\bar{XY}_n$", xlabel=L"$log_{10}(N)$", ylabel=L"$log_{10}(MSE)$")
plot!(log10.(grid_N), log10.(distristribution_T2_MSE_average),  label =L"Empirical $T_{2,n}$", xlabel=L"$log_{10}(N)$", ylabel=L"$log_{10}(MSE)$")
vline!([log10(3)], label=L"$log_{10}(3)$") #N = 3 is the MSE crossover between T2 and the squared-mean estimator
plot!(log10.(grid_N), log10.(2*(distristribution_mean_var.^2 + 2 .* distristribution_mean_average.^2 .* distristribution_mean_var)./(grid_N)), label=L"$2(\sigma^4  + 2 \mu^2 \sigma^2)/N$", linestyle=:dash, linewidth=2.0)
#plot!(log10.(grid_N), log10.((3.0 .* distristribution_mean_var.^2 )./(grid_N.^2)), label=L"$3\sigma^4/N^2$", title="Mean Squared Error", linestyle=:dash, linewidth=2.0)
plot!(log10.(grid_N), log10.((4 .* distristribution_mean_average.^2 .* distristribution_mean_var)./(grid_N) + (2.0 .* distristribution_mean_var.^2 )./(grid_N.*(grid_N .- 1))), label=L"$\frac{4 \mu^2 \sigma^2}{N} + \frac{2 \sigma^4}{N(N-1)}$", title="Mean Squared Error", linestyle=:dash, linewidth=2.0)
plot!(log10.(grid_N), log10.((4 .* distristribution_mean_average.^2 .* distristribution_mean_var)./(grid_N) + (3.0 .* distristribution_mean_var.^2 )./(grid_N.^2)), label=L"$\frac{4 \mu^2 \sigma^2}{N} + \frac{3 \sigma^4}{N^2}$", title="Mean Squared Error", linestyle=:dash, linewidth=2.0)

plot!(size=(800,400))