Least Squares Curve Fitting¶

This notebook demonstrates polynomial curve fitting via least squares, highlighting the symbolic-numeric bridge: we build design matrices numerically with ndarray operations, solve the least-squares problem with np_lstsq (wrapping LAPACK's SVD), and then reconstruct the fitted polynomial as a symbolic Maxima expression for plotting and analysis.

Key functions: np_linspace, np_lstsq, np_hstack, np_reshape, np_matmul, np_pow.

In [127]:
load("numerics")$
load("ax-plots")$

Generating Noisy Data¶

We start from a known quadratic function $f(x) = 0.5 x^2 - 2x + 1$ and add Gaussian noise. This gives us a ground truth to compare against.

In [128]:
/* True function: f(x) = 0.5*x^2 - 2*x + 1 */
n : 20$
x : np_linspace(0, 5, n)$

/* Evaluate f at each x and add noise */
y : np_add(
  np_add(
    np_scale(0.5, np_pow(x, 2)),
    np_scale(-2.0, x)
  ),
  np_add(
    np_ones([n]),
    np_scale(0.3, np_randn([n]))
  )
)$

print("x (first 5):", np_to_list(np_slice(x, [0, 5])))$
print("y (first 5):", np_to_list(np_slice(y, [0, 5])))$
x (first 5):
            [0.0,0.2631578947368421,0.5263157894736842,0.7894736842105263,
             1.0526315789473684]
y (first 5):
            [1.144367961890701,0.3465674007798698,-0.1013859777412156,
             -0.2302246196332074,-0.6000505351780027]
In [129]:
/* Plot the noisy data */
ax_draw2d(
  color=red, marker_size=6,
  name="Noisy data",
  points(np_to_list(x), np_to_list(y)),
  color=black, dash="dash", name="True: 0.5x^2 - 2x + 1",
  explicit(0.5*t^2 - 2*t + 1, t, 0, 5),
  title="Noisy Observations from a Quadratic",
  xlabel="x", ylabel="y",
  grid=true
)$
[Figure: "Noisy Observations from a Quadratic" -- scatter of the noisy data with the dashed true curve 0.5x^2 - 2x + 1]

The Vandermonde Matrix [S+N]¶

To fit a degree-$d$ polynomial $p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_d x^d$ to $m$ data points $(x_i, y_i)$, we form the Vandermonde matrix (or design matrix) $V$ with entries $V_{ij} = x_i^{\,j}$ for $i = 1, \ldots, m$ and $j = 0, \ldots, d$:

$$ V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^d \end{pmatrix} $$

The least-squares fit then solves $V \mathbf{a} \approx \mathbf{y}$ for the coefficient vector $\mathbf{a} = (a_0, a_1, \ldots, a_d)^T$.

We can inspect the symbolic structure first, then build it numerically.

In [130]:
/* Symbolic Vandermonde structure for degree 2 */
/* Use a fresh symbol xs: x is already bound to an ndarray above */
print("Symbolic Vandermonde (degree 2) for points xs[1]..xs[4]:")$
genmatrix(lambda([i, j], xs[i]^(j-1)), 4, 3);
Symbolic Vandermonde (degree 2) for points xs[1]..xs[4]:
Out[130]:
matrix([1, xs[1], xs[1]^2], [1, xs[2], xs[2]^2], [1, xs[3], xs[3]^2], [1, xs[4], xs[4]^2])
In [131]:
/* Build the Vandermonde matrix numerically for degree 2 */
/* Each column is x^j for j = 0, 1, 2 */
vander(xv, deg) := block([cols],
  cols : np_reshape(np_ones([np_size(xv)]), [np_size(xv), 1]),
  for j : 1 thru deg do
    cols : np_hstack(cols, np_reshape(np_pow(xv, j), [np_size(xv), 1])),
  cols
)$

V : vander(x, 2)$
print("Vandermonde shape:", np_shape(V))$
print("First 3 rows:")$
for i : 0 thru 2 do
  print("  ", np_to_list(np_row(V, i)))$
Vandermonde shape: [20,3]
First 3 rows:
   [1.0,0.0,0.0]
   [1.0,0.2631578947368421,0.06925207756232686]
   [1.0,0.5263157894736842,0.27700831024930744]

Solving with np_lstsq¶

The function np_lstsq(A, b) minimises $\|A x - b\|_2$ using SVD. It returns [x, residuals, rank, singular_values].

In [132]:
/* Solve the least-squares problem V * a = y */
y_col : np_reshape(y, [n, 1])$
[coeffs_nd, residuals, rank_V, sv] : np_lstsq(V, y_col)$

coeffs_list : np_to_list(coeffs_nd)$
a0 : coeffs_list[1]$
a1 : coeffs_list[2]$
a2 : coeffs_list[3]$

print("Fitted coefficients:")$
print("  a0 =", a0, "  (true: 1.0)")$
print("  a1 =", a1, "  (true: -2.0)")$
print("  a2 =", a2, "  (true: 0.5)")$
print("  rank =", rank_V)$

/* Reconstruct the fitted polynomial symbolically */
p_fit : a0 + a1*t + a2*t^2;
print("Fitted polynomial: p(t) =", p_fit)$
Fitted coefficients:
  a0 = 1.0837053490879573   (true: 1.0)
  a1 = -2.1517272135665984   (true: -2.0)
  a2 = 0.5273594310358118   (true: 0.5)
  rank = 3
0.5273594310358118*t^2-2.1517272135665984*t+1.0837053490879573
Fitted polynomial: p(t) =
                         0.5273594310358118*t^2-2.1517272135665984*t
                                               +1.0837053490879573
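
The cell above destructured all four np_lstsq outputs but only printed the rank. It can be worth peeking at the rest; a minimal sketch, assuming residuals and sv follow the usual LAPACK/NumPy lstsq convention (a length-1 residual-sum-of-squares array and singular values in descending order):

```maxima
/* Inspect the remaining np_lstsq outputs from the fit above */
print("Singular values of V (descending):", np_to_list(sv))$
print("Sum of squared residuals from np_lstsq:", np_to_list(residuals))$
```

The residual sum of squares reported here should agree with the ss_res we compute by hand in the residual-analysis section below.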

Plotting the Fit¶

We overlay the fitted curve (a symbolic expression) onto the noisy data.

In [133]:
ax_draw2d(
  color=red, marker_size=6,
  name="Data",
  points(np_to_list(x), np_to_list(y)),
  color=blue, line_width=2, name="Least-squares fit (deg 2)",
  explicit(p_fit, t, 0, 5),
  color=black, dash="dash", name="True curve",
  explicit(0.5*t^2 - 2*t + 1, t, 0, 5),
  title="Quadratic Least-Squares Fit",
  xlabel="x", ylabel="y",
  grid=true
)$
[Figure: "Quadratic Least-Squares Fit" -- data points, the blue degree-2 fit, and the dashed true curve]

Residual Analysis¶

The residual vector $r = y - V\hat{a}$ shows the unexplained variation. We compute the coefficient of determination $R^2$:

$$R^2 = 1 - \frac{\sum r_i^2}{\sum (y_i - \bar{y})^2}$$

An $R^2$ close to 1 indicates the model captures most of the variance.

In [134]:
/* Predicted values and residuals */
y_pred : np_matmul(V, coeffs_nd)$
resid : np_sub(y_col, y_pred)$

/* Sum of squared residuals */
ss_res : np_dot(np_flatten(resid), np_flatten(resid))$

/* Total sum of squares */
y_mean : np_mean(y)$
y_centered : np_sub(y, y_mean)$
ss_tot : np_dot(y_centered, y_centered)$

R2 : 1 - ss_res / ss_tot;
print("R^2 =", R2)$
print("SS_res =", ss_res)$
print("SS_tot =", ss_tot)$
0.9726930639043225
R^2 = 0.9726930639043225
SS_res = 0.9615585185138484
SS_tot = 35.2129772137291
In [135]:
/* Plot residuals vs x */
ax_draw2d(
  color=blue, marker_size=6,
  name="Residuals",
  points(np_to_list(x), np_to_list(np_flatten(resid))),
  color=gray, dash="dash",
  explicit(0, t, 0, 5),
  title="Residuals vs x",
  xlabel="x", ylabel="Residual",
  grid=true
)$
[Figure: "Residuals vs x" -- residual scatter with a dashed zero reference line]

Underfitting and Overfitting¶

The choice of polynomial degree $d$ controls the bias-variance tradeoff:

  • Degree 1 (linear): too simple to capture the curvature -- underfit
  • Degree 5: captures the shape with some flexibility -- good fit
  • Degree 15: has enough freedom to interpolate noise -- overfit

We fit all three and compare.

In [136]:
/* Helper: fit a polynomial of given degree, return symbolic expression */
polyfit(xv, yv, deg) := block(
  [Vm, yc, result, c, poly],
  Vm : vander(xv, deg),
  yc : np_reshape(yv, [np_size(yv), 1]),
  result : np_lstsq(Vm, yc),
  c : np_to_list(first(result)),
  poly : sum(c[k+1] * t^k, k, 0, deg),
  poly
)$

p1  : polyfit(x, y, 1)$
p5  : polyfit(x, y, 5)$
p15 : polyfit(x, y, 15)$

print("Degree 1:", p1)$
print("Degree 5:", expand(p5))$
Degree 1: 0.4850699416124584*t-0.997976615527083
Degree 5:
         -(0.01778366101067462*t^5)+0.22458881666238428*t^4
                                   -0.9729345742000228*t^3
                                   +2.167521279704431*t^2-2.980417057972085*t
                                   +1.0793971228343548
In [137]:
/* Plot all fits */
ax_draw2d(
  color=red, marker_size=6,
  name="Data",
  points(np_to_list(x), np_to_list(y)),
  color="#e41a1c", dash="dash", name="Degree 1 (underfit)",
  explicit(p1, t, 0, 5),
  color="#377eb8", line_width=2, name="Degree 5",
  explicit(p5, t, 0, 5),
  color="#4daf4a", name="Degree 15 (overfit)",
  explicit(p15, t, 0, 5),
  color=black, dash="dash", name="True curve",
  explicit(0.5*t^2 - 2*t + 1, t, 0, 5),
  title="Underfitting vs Overfitting",
  xlabel="x", ylabel="y",
  yrange=[-3, 6],
  grid=true
)$
[Figure: "Underfitting vs Overfitting" -- data with the degree-1, degree-5, and degree-15 fits and the dashed true curve]

The linear fit misses the curvature entirely. The degree-5 fit tracks the true curve closely. The degree-15 fit oscillates wildly between data points -- a classic sign of overfitting, where the model fits the noise rather than the underlying relationship.
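
The same tradeoff shows up numerically: the training $R^2$ can only improve as the degree grows, which is exactly why it cannot be used on its own to choose the degree. A sketch reusing vander and the data from the cells above (assumes the helpers behave as in the earlier cells; the hypothetical r2_for name is ours):

```maxima
/* Training R^2 per degree: monotone in degree, so a high value */
/* alone does not indicate a good model                          */
r2_for(deg) := block([Vm, yc, fit, pred, r, ssr, yd, sst],
  Vm   : vander(x, deg),
  yc   : np_reshape(y, [n, 1]),
  fit  : first(np_lstsq(Vm, yc)),
  pred : np_matmul(Vm, fit),
  r    : np_flatten(np_sub(yc, pred)),
  ssr  : np_dot(r, r),
  yd   : np_sub(y, np_mean(y)),
  sst  : np_dot(yd, yd),
  1 - ssr/sst
)$
for d in [1, 5, 15] do
  print("degree", d, ": training R^2 =", r2_for(d))$
```

Expect the degree-15 value to be the highest of the three even though its fit is the worst between the data points: held-out data, not training $R^2$, is what exposes overfitting.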

Comparing with the Symbolic Solution [S+N]¶

For the linear case (degree 1), the least-squares solution satisfies the normal equations:

$$A^T A \, \hat{x} = A^T b$$

We can derive these symbolically in Maxima, evaluate the resulting closed-form solution on our data, and verify that it matches the numerical np_lstsq result.

In [138]:
/* Build the normal equations symbolically for a linear fit */
/* Design matrix for degree 1: columns are [1, x_i] */
V1 : vander(x, 1)$
y1 : np_reshape(y, [n, 1])$

/* Normal equations: (V^T V) c = V^T y */
VtV : np_matmul(np_transpose(V1), V1)$
Vty : np_matmul(np_transpose(V1), y1)$

print("V^T V (2x2):")$
np_to_matrix(VtV);
print("V^T y (2x1):")$
np_to_matrix(Vty);
V^T V (2x2):
matrix([20.0,50.0],[50.0,171.05263157894737])
V^T y (2x1):
Out[138]:
In [139]:
/* Solve the normal equations numerically */
coeffs_normal : np_solve(VtV, Vty)$

/* Compare with np_lstsq */
[coeffs_lstsq, res1, rank1, sv1] : np_lstsq(V1, y1)$

print("Normal equations solution:", np_to_list(coeffs_normal))$
print("np_lstsq solution:        ", np_to_list(coeffs_lstsq))$

diff_norm : np_norm(np_sub(coeffs_normal, coeffs_lstsq))$
print("Difference norm:", diff_norm)$
Normal equations solution: [-0.9979766155270834,0.4850699416124584]
np_lstsq solution:         [-0.997976615527083,0.4850699416124584]
Difference norm: 3.3306690738754696e-16
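
The two solutions agree here, but one reason np_lstsq prefers the SVD over the normal equations is conditioning: forming $V^T V$ squares the condition number of $V$. A sketch using the singular values already returned above (assumes sv1 is in the descending order LAPACK produces):

```maxima
/* cond(V1) = largest / smallest singular value;          */
/* for full-rank V1, cond(V1^T V1) = cond(V1)^2           */
svl   : np_to_list(sv1)$
condV : first(svl) / last(svl)$
print("cond(V1)      =", condV)$
print("cond(V1^T V1) =", condV^2)$
```

For this well-scaled degree-1 problem the squaring is harmless, but for high-degree Vandermonde matrices like the degree-15 fit above it can cost roughly half the available floating-point digits.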
In [140]:
/* Derive the normal equations fully symbolically */
/* For a linear model y = c0 + c1*x with m data points: */
/*   V^T V = [[m, sum(x)], [sum(x), sum(x^2)]]          */
/*   V^T y = [sum(y), sum(x*y)]                          */

print("Symbolic normal equations for linear regression:")$
print("  [  m       sum(x_i)  ] [c0]   [ sum(y_i)     ]")$
print("  [ sum(x_i) sum(x_i^2)] [c1] = [ sum(x_i*y_i) ]")$

/* Solve symbolically */
sx  : np_sum(x)$
sx2 : np_dot(x, x)$
sy  : np_sum(y)$
sxy : np_dot(x, y)$

/* Cramer's rule */
det_sys : n * sx2 - sx^2$
c0_sym  : (sy * sx2 - sx * sxy) / det_sys$
c1_sym  : (n * sxy - sx * sy) / det_sys$

print("Symbolic solution:")$
print("  c0 =", c0_sym)$
print("  c1 =", c1_sym)$
print("Matches np_lstsq:",
  is(abs(c0_sym - np_to_list(coeffs_lstsq)[1]) < 1e-10 and
     abs(c1_sym - np_to_list(coeffs_lstsq)[2]) < 1e-10))$
Symbolic normal equations for linear regression:
  [  m       sum(x_i)  ] [c0]   [ sum(y_i)     ]
  [ sum(x_i) sum(x_i^2)] [c1] = [ sum(x_i*y_i) ]
Symbolic solution:
  c0 = -0.9979766155270838
  c1 = 0.48506994161245853
Matches np_lstsq: true
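
As a final symbolic cross-check (a sketch with fresh symbols m, Sx, Sx2, Sy, Sxy standing for the sums above), Maxima's own solve recovers the Cramer's-rule formulas used in the previous cell:

```maxima
/* Solve the 2x2 normal equations with abstract coefficients */
solve([m*c0  + Sx*c1  = Sy,
       Sx*c0 + Sx2*c1 = Sxy],
      [c0, c1]);
```

The denominators in the result match det_sys = n*sx2 - sx^2 from the Cramer's-rule code above.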