Machine Learning Fundamentals¶

This notebook implements three classical ML algorithms from scratch using Maxima's numerics package. The symbolic-numeric bridge is the distinctive angle: we derive gradients symbolically with diff(), then evaluate them numerically in the same environment — something plain NumPy cannot do without pulling in a separate symbolic library.

  1. Linear Regression — symbolic MSE gradient → numeric gradient descent
  2. Logistic Regression — symbolic cross-entropy gradient → binary classification
  3. K-Means Clustering — iterative centroid assignment
In [137]:
load("numerics")$
load("numerics-optimize")$
load("ax-plots")$

1. Linear Regression¶

Symbolic Phase: Deriving the Gradient¶

The mean squared error loss for a linear model $\hat{y} = a + b x$ is:

$$L(a, b) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - a - b x_i)^2$$

We use Maxima's symbolic diff() to derive the gradient analytically.

In [138]:
/* Define the loss for a single data point */
L_i(a, b, x_i, y_i) := (1/2) * (y_i - a - b * x_i)^2$

/* Symbolic gradients */
grad_a_symbolic : diff(L_i(a, b, x[i], y[i]), a);
grad_b_symbolic : diff(L_i(a, b, x[i], y[i]), b);
-y[i]+b*x[i]+a
Out[138]:

The gradient per data point is:

  • $\frac{\partial L_i}{\partial a} = -(y_i - a - b x_i)$
  • $\frac{\partial L_i}{\partial b} = -x_i (y_i - a - b x_i)$

Summing over all data and dividing by $N$ gives:

  • $\nabla_a = -\text{mean}(y - a - bx)$
  • $\nabla_b = -\text{mean}(x \cdot (y - a - bx))$

The residual $r = \hat{y} - y = (a + bx) - y$ appears naturally. In matrix form with $X = [\mathbf{1}, \mathbf{x}]$ and $w = [a, b]^T$: $\nabla = X^T (Xw - y) / N$.

In [139]:
/* Generate synthetic data: y = 2x + 1 + noise */
n : 80$
x_data : np_scale(6.0, np_rand([n]))$
noise : np_scale(1.5, np_randn([n]))$
y_data : np_add(np_add(np_scale(2.0, x_data), 1.0), noise)$

print("Generated", n, "data points")$
print("True parameters: a = 1, b = 2")$
Generated 80 data points
True parameters: a = 1, b = 2
In [140]:
/* Build design matrix X = [1, x] of shape [n, 2] */
ones_col : np_reshape(np_ones([n]), [n, 1])$
x_col : np_reshape(x_data, [n, 1])$
X : np_hstack(ones_col, x_col)$
y_col : np_reshape(y_data, [n, 1])$

print("Design matrix X shape:", np_shape(X))$
Design matrix X shape: [80,2]
In [141]:
/* Gradient descent using the symbolic gradient (now in matrix form) */
w : ndarray([0.0, 0.0], [2, 1])$   /* initial weights [a, b]^T */
lr : 0.01$
max_iter : 300$
losses : []$

for iter : 1 thru max_iter do block(
  pred : np_matmul(X, w),
  residual : np_sub(pred, y_col),
  loss : np_mean(np_pow(residual, 2)) / 2.0,
  losses : append(losses, [loss]),
  grad : np_scale(1.0 / n, np_matmul(np_transpose(X), residual)),
  w : np_sub(w, np_scale(lr, grad))
)$

a_gd : np_ref(w, 0, 0)$
b_gd : np_ref(w, 1, 0)$
print("Gradient descent: a =", a_gd, ", b =", b_gd)$
print("Final loss:", last(losses))$
Gradient descent: a = 0.6219872915418031 , b = 2.069644398527832
Final loss: 1.1755429181238
In [142]:
/* Compare with closed-form least-squares solution */
[w_ls, residuals, rank_X, sv] : np_lstsq(X, y_col)$
a_ls : np_ref(w_ls, 0, 0)$
b_ls : np_ref(w_ls, 1, 0)$
print("Least squares:     a =", a_ls, ", b =", b_ls)$
print("Difference: |a_gd - a_ls| =", abs(a_gd - a_ls),
      ", |b_gd - b_ls| =", abs(b_gd - b_ls))$
Least squares:     a = 0.7445072224814612 , b = 2.038975333576846
Difference: |a_gd - a_ls| = 0.12251993093965807 , |b_gd - b_ls| = 0.03066906495098598
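The closed form that np_lstsq computes can be spelled out for this two-parameter model: the normal equations $(X^T X) w = X^T y$ reduce to a 2×2 linear system in $a$ and $b$. A hedged pure-Python sketch, separate from the notebook's code (names are illustrative):

```python
def fit_line(xs, ys):
    """Solve the 2x2 normal equations for y ~ a + b*x by hand."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # (X^T X) w = X^T y  <=>  [[n, sx], [sx, sxx]] [a, b]^T = [sy, sxy]^T
    det = n * sxx - sx * sx
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

# Noise-free data from y = 1 + 2x is recovered exactly.
a_hat, b_hat = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Gradient descent approaches this same minimizer asymptotically; the small gap seen above is simply because 300 steps at lr = 0.01 have not fully converged.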

One-liner with np_minimize¶

The manual gradient descent loop above can be replaced by a single np_minimize call, which uses the L-BFGS algorithm internally. L-BFGS converges in far fewer iterations and matches the closed-form solution:

In [143]:
/* Same objective and gradient, but now as named functions */
mse_cost(w) := block([pred, res],
  pred : np_matmul(X, w),
  res : np_sub(pred, y_col),
  np_sum(np_pow(res, 2)) / (2 * n))$

mse_grad(w) := np_scale(1.0 / n,
  np_matmul(np_transpose(X), np_sub(np_matmul(X, w), y_col)))$

[w_opt, loss_opt, converged] : np_minimize(mse_cost, mse_grad, np_zeros([2, 1]))$
a_opt : np_ref(w_opt, 0, 0)$
b_opt : np_ref(w_opt, 1, 0)$
print("np_minimize:       a =", a_opt, ", b =", b_opt)$
print("Loss:", loss_opt, "  Converged:", converged)$
print("vs lstsq:  |a_diff| =", abs(a_opt - a_ls), ", |b_diff| =", abs(b_opt - b_ls))$
*************************************************
  N=    2   NUMBER OF CORRECTIONS= 2
       INITIAL VALUES
 F=  3.342926062133934D+01   GNORM=  2.987887577706911D+01
*************************************************
   I  NFN     FUNC                    GNORM                   STEPLENGTH
   1    2     1.047082338422501D+01   1.603801150125860D+01   3.346846137924177D-02
   5    9     1.173978374720999D+00   5.541949001678261D-10   1.423482544907285D-02
 THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
 IFLAG = 0
np_minimize:       a = 0.7445072197310822 , b = 2.038975334266307
Loss: 1.1739783747209986   Converged: true
vs lstsq:  |a_diff| = 2.75037892460972e-9 , |b_diff| = 6.89460932790098e-10
In [144]:
/* Plot data and fitted line */
x_list : np_to_list(x_data)$
y_list : np_to_list(y_data)$

ax_draw2d(
  color = "steelblue", marker_size = 5,
  points(x_list, y_list),
  color = "red", line_width = 2, name = "GD fit",
  explicit(a_gd + b_gd * t, t, 0, 6),
  color = "green", line_width = 2, dash = "dash", name = "lstsq",
  explicit(a_ls + b_ls * t, t, 0, 6),
  color = "orange", line_width = 2, dash = "dot", name = "L-BFGS",
  explicit(a_opt + b_opt * t, t, 0, 6),
  title = "Linear Regression: GD vs Least Squares vs L-BFGS",
  xlabel = "x", ylabel = "y", grid = true, showlegend = true
)$
[Figure: scatter of the data with the GD, least-squares, and L-BFGS fit lines]
In [145]:
/* Loss convergence */
iter_list : makelist(i, i, 1, max_iter)$

ax_draw2d(
  color = "darkorange", line_width = 2,
  lines(iter_list, losses),
  title = "Training Loss (MSE/2)",
  xlabel = "Iteration", ylabel = "Loss", grid = true
)$
[Figure: training loss (MSE/2) vs iteration]

2. Logistic Regression¶

Symbolic Phase: Sigmoid and Cross-Entropy Gradient¶

The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the key building block. Its derivative has a remarkably clean form:

In [146]:
/* Symbolic sigmoid and its derivative */
sigma(z) := 1 / (1 + exp(-z))$

sigma_deriv : diff(sigma(z), z);
sigma_deriv_simplified : ratsimp(sigma_deriv);
%e^-z/(%e^-z+1)^2
Out[146]:
In [147]:
/* Verify: sigma'(z) = sigma(z) * (1 - sigma(z)) */
is(ratsimp(sigma_deriv - sigma(z) * (1 - sigma(z))) = 0);
Out[147]:
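The same identity can be spot-checked numerically with a finite difference. A small pure-Python sketch, independent of the Maxima cells (illustrative names):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv_fd(z, h=1e-6):
    """Centered finite-difference estimate of sigma'(z)."""
    return (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

# sigma'(z) = sigma(z) * (1 - sigma(z)) at several test points
for z in (-2.0, 0.0, 1.5):
    s = sigmoid(z)
    assert abs(sigmoid_deriv_fd(z) - s * (1 - s)) < 1e-8
```

At $z = 0$ both sides equal $0.25$, the maximum of the derivative.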

The binary cross-entropy loss for one sample is:

$$\ell_i = -\left[ y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i)) \right]$$

where $z_i = w_0 + w_1 x_{i1} + w_2 x_{i2}$. The gradient of the loss with respect to $w$ turns out to be:

$$\nabla_w L = \frac{1}{N} X^T (\sigma(Xw) - y)$$

which has the same form as linear regression but with $\sigma(Xw)$ replacing $Xw$.

In [148]:
/* Symbolic verification: gradient of cross-entropy */
ell(z, yi) := -(yi * log(sigma(z)) + (1 - yi) * log(1 - sigma(z)))$

/* d(ell)/dz simplifies to sigma(z) - y */
dell_dz : diff(ell(z, yi), z)$
ratsimp(dell_dz);
Out[148]:

The symbolic computation confirms: $\frac{\partial \ell}{\partial z} = \sigma(z) - y$, so the full gradient is $\nabla_w L = \frac{1}{N} X^T (\hat{y} - y)$.
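This clean form can also be confirmed numerically: a centered finite difference of the cross-entropy in $z$ should match $\sigma(z) - y$. A pure-Python sketch (separate from the notebook; names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(z, y):
    """Binary cross-entropy of sigma(z) against label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_dz_fd(z, y, h=1e-6):
    """Centered finite-difference estimate of d(ell)/dz."""
    return (bce(z + h, y) - bce(z - h, y)) / (2 * h)

# d(ell)/dz = sigma(z) - y for both label values
for z, y in ((0.7, 1.0), (-1.2, 0.0)):
    assert abs(bce_dz_fd(z, y) - (sigmoid(z) - y)) < 1e-6
```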

Numeric Phase: Binary Classification¶

In [149]:
/* Generate 2D binary classification data */
n_class : 60$
n_total : 2 * n_class$

/* Class 0: centred at (-1, -1) */
x0_1 : np_add(np_randn([n_class]), -1.0)$
x0_2 : np_add(np_randn([n_class]), -1.0)$

/* Class 1: centred at (1.5, 1.5) */
x1_1 : np_add(np_randn([n_class]), 1.5)$
x1_2 : np_add(np_randn([n_class]), 1.5)$

/* Stack features into design matrix [n_total, 3] with bias column */
X0 : np_hstack(np_ones([n_class, 1]),
               np_reshape(x0_1, [n_class, 1]),
               np_reshape(x0_2, [n_class, 1]))$
X1 : np_hstack(np_ones([n_class, 1]),
               np_reshape(x1_1, [n_class, 1]),
               np_reshape(x1_2, [n_class, 1]))$
X_lr : np_vstack(X0, X1)$

/* Labels */
y_lr : np_reshape(
  ndarray(append(makelist(0.0, i, 1, n_class), makelist(1.0, i, 1, n_class))),
  [n_total, 1]
)$

print("Data shape:", np_shape(X_lr), "Labels:", np_shape(y_lr))$
Data shape: [120,3] Labels: [120,1]
In [150]:
/* Numeric sigmoid function on ndarrays */
np_sigmoid(z) := np_div(1.0, np_add(1.0, np_exp(np_neg(z))))$

/* Logistic regression via gradient descent */
w_lr : np_zeros([3, 1])$
lr_rate : 0.1$
lr_iters : 200$
lr_losses : []$

for iter : 1 thru lr_iters do block(
  z : np_matmul(X_lr, w_lr),
  p : np_sigmoid(z),
  /* Cross-entropy loss (with clipping for stability) */
  p_safe : np_clip(p, 1.0e-10, 1.0 - 1.0e-10),
  loss : -np_mean(np_add(
    np_mul(y_lr, np_log(p_safe)),
    np_mul(np_sub(1.0, y_lr), np_log(np_sub(1.0, p_safe)))
  )),
  lr_losses : append(lr_losses, [loss]),
  /* Gradient: X^T (p - y) / N */
  grad : np_scale(1.0 / n_total, np_matmul(np_transpose(X_lr), np_sub(p, y_lr))),
  w_lr : np_sub(w_lr, np_scale(lr_rate, grad))
)$

w0 : np_ref(w_lr, 0, 0)$
w1 : np_ref(w_lr, 1, 0)$
w2 : np_ref(w_lr, 2, 0)$
print("Learned weights: w0 =", w0, ", w1 =", w1, ", w2 =", w2)$
print("Final loss:", last(lr_losses))$
define: warning: redefining the built-in function np_sigmoid
Learned weights: w0 = -0.3651447723085847 , w1 = 1.8412830744943471 , w2 = 1.2950707405668165
Final loss: 0.10732718187195761

One-liner with np_minimize¶

Again, the manual loop can be replaced by np_minimize. In roughly 16 L-BFGS iterations the optimizer reaches a lower cross-entropy loss than the 200 gradient-descent steps above (the weights differ because plain gradient descent has not yet converged):

In [151]:
/* Cross-entropy cost and gradient as named functions */
lr_cost(w) := block([z, p, p_safe],
  z : np_matmul(X_lr, w),
  p : np_sigmoid(z),
  p_safe : np_clip(p, 1.0e-10, 1.0 - 1.0e-10),
  -np_mean(np_add(
    np_mul(y_lr, np_log(p_safe)),
    np_mul(np_sub(1.0, y_lr), np_log(np_sub(1.0, p_safe)))
  )))$

lr_grad(w) := block([z, p],
  z : np_matmul(X_lr, w),
  p : np_sigmoid(z),
  np_scale(1.0 / n_total, np_matmul(np_transpose(X_lr), np_sub(p, y_lr))))$

[w_lr_opt, lr_loss_opt, lr_conv] : np_minimize(lr_cost, lr_grad, np_zeros([3, 1]))$
print("np_minimize weights:",
      np_ref(w_lr_opt, 0, 0), np_ref(w_lr_opt, 1, 0), np_ref(w_lr_opt, 2, 0))$
print("Loss:", lr_loss_opt, "  Converged:", lr_conv)$

/* Use L-BFGS solution for subsequent analysis */
w_lr : w_lr_opt$
w0 : np_ref(w_lr, 0, 0)$
w1 : np_ref(w_lr, 1, 0)$
w2 : np_ref(w_lr, 2, 0)$
*************************************************
  N=    3   NUMBER OF CORRECTIONS= 3
       INITIAL VALUES
 F=  6.931471805599461D-01   GNORM=  8.667001743836422D-01
*************************************************
   I  NFN     FUNC                    GNORM                   STEPLENGTH
   1    2     2.285525476211290D-01   1.976789758836933D-01   1.153801544705070D+00
  16   17     7.144739219198905D-02   2.617714819387524D-08   1.000000000000000D+00
 THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
 IFLAG = 0
np_minimize weights: -1.259299409169616 4.7469088283286265 2.794095765219403
Loss: 0.07144739219198905   Converged: true
In [152]:
/* Compute accuracy */
pred_probs : np_sigmoid(np_matmul(X_lr, w_lr))$
pred_labels : np_where(np_greater(pred_probs, 0.5), 1.0, 0.0)$
correct : np_sum(np_equal(pred_labels, y_lr))$
accuracy : correct / n_total * 100$
print("Accuracy:", accuracy, "%")$
Accuracy: 97.5 %
In [153]:
/* Plot data with decision boundary */
x0_1_list : np_to_list(x0_1)$
x0_2_list : np_to_list(x0_2)$
x1_1_list : np_to_list(x1_1)$
x1_2_list : np_to_list(x1_2)$

ax_draw2d(
  color = "steelblue", marker_size = 6, name = "Class 0",
  points(x0_1_list, x0_2_list),
  color = "darkorange", marker_size = 6, name = "Class 1",
  points(x1_1_list, x1_2_list),
  color = "red", line_width = 2, name = "Decision boundary",
  implicit(w0 + w1 * u + w2 * v = 0, u, -4, 5, v, -4, 5),
  title = "Logistic Regression: Decision Boundary",
  xlabel = "x1", ylabel = "x2", grid = true, showlegend = true
)$
[Figure: class scatter with the learned logistic decision boundary]
In [154]:
/* Loss convergence */
lr_iter_list : makelist(i, i, 1, lr_iters)$

ax_draw2d(
  color = "darkred", line_width = 2,
  lines(lr_iter_list, lr_losses),
  title = "Logistic Regression: Cross-Entropy Loss",
  xlabel = "Iteration", ylabel = "Loss", grid = true
)$
[Figure: cross-entropy loss vs iteration]

3. K-Means Clustering¶

K-Means is an iterative algorithm that alternates between:

  1. Assignment: assign each point to the nearest centroid
  2. Update: recompute each centroid as the mean of its assigned points

No symbolic gradient derivation is needed — this is a purely numeric, distance-based algorithm.
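The two alternating steps can be sketched compactly in pure Python (Lloyd's algorithm on 2-D points; this is a standalone illustration, not the notebook's ndarray implementation, and all names are illustrative):

```python
def kmeans(points, centroids, iters=10):
    """Alternate assignment and update steps on 2-D points."""
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid per point
        labels = [min(range(len(centroids)),
                      key=lambda k: (p[0] - centroids[k][0]) ** 2
                                  + (p[1] - centroids[k][1]) ** 2)
                  for p in points]
        # Update step: mean of each cluster's members
        # (an empty cluster keeps its previous centroid)
        new = []
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new.append((sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members)))
            else:
                new.append(centroids[k])
        centroids = new
    return centroids, labels

# Two tight groups near (0, 0) and (5, 0) separate in one pass
pts = [(0.0, 0.1), (0.2, -0.1), (5.0, 0.0), (5.1, 0.2)]
cents, labels = kmeans(pts, [(0.0, 0.0), (5.0, 0.0)], iters=5)
```

The ndarray version below follows the same structure, but vectorizes the distance computation with broadcasting instead of looping over points.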

In [155]:
/* Generate 3 clusters */
pts_per_cluster : 50$
n_pts : 3 * pts_per_cluster$
K : 3$

/* True cluster centres */
c1 : [0.0, 0.0]$
c2 : [5.0, 0.0]$
c3 : [2.5, 4.5]$

/* Generate points around each centre */
clust1_x : np_add(np_randn([pts_per_cluster]), c1[1])$
clust1_y : np_add(np_randn([pts_per_cluster]), c1[2])$
clust2_x : np_add(np_randn([pts_per_cluster]), c2[1])$
clust2_y : np_add(np_randn([pts_per_cluster]), c2[2])$
clust3_x : np_add(np_randn([pts_per_cluster]), c3[1])$
clust3_y : np_add(np_randn([pts_per_cluster]), c3[2])$

/* Stack into [n_pts, 2] data matrix */
all_x : ndarray(append(np_to_list(clust1_x), np_to_list(clust2_x), np_to_list(clust3_x)))$
all_y : ndarray(append(np_to_list(clust1_y), np_to_list(clust2_y), np_to_list(clust3_y)))$
X_km : np_hstack(np_reshape(all_x, [n_pts, 1]), np_reshape(all_y, [n_pts, 1]))$

print("K-Means data shape:", np_shape(X_km))$
K-Means data shape: [150,2]
In [156]:
/* Initialise centroids by picking one data point from each generated cluster */
idx1 : 0$
idx2 : pts_per_cluster$
idx3 : 2 * pts_per_cluster$

centroids : ndarray(
  [np_ref(X_km, idx1, 0), np_ref(X_km, idx1, 1),
   np_ref(X_km, idx2, 0), np_ref(X_km, idx2, 1),
   np_ref(X_km, idx3, 0), np_ref(X_km, idx3, 1)],
  [3, 2]
)$

print("Initial centroids:")$
for k : 0 thru K - 1 do
  print("  Centroid", k, ":", np_ref(centroids, k, 0), np_ref(centroids, k, 1))$
Initial centroids:
  Centroid 0 : 1.206096036425563 0.10421902599727507
  Centroid 1 : 5.383949954704471 0.2553747612440088
  Centroid 2 : 2.680158244565475 5.520265973700813
In [157]:
/* K-Means iteration */
km_iters : 15$
assignments : np_zeros([n_pts])$

for iter : 1 thru km_iters do block(
  /* Assignment step: compute distance from each point to each centroid */
  dist_cols : [],
  for k : 0 thru K - 1 do block(
    /* centroid k as a row [1, 2] — broadcasts against [n_pts, 2] */
    ck : np_slice(centroids, k, all),
    diff_sq : np_pow(np_sub(X_km, ck), 2),
    dist_k : np_sum(diff_sq, 1),  /* squared distance per point */
    dist_cols : append(dist_cols, [np_reshape(dist_k, [n_pts, 1])])
  ),
  /* Stack into [n_pts, K] distance matrix */
  dist_matrix : apply(np_hstack, dist_cols),
  /* Assign each point to nearest centroid */
  assignments : np_argmin(dist_matrix, 1),

  /* Update step: recompute centroids */
  new_cents : [],
  for k : 0 thru K - 1 do block(
    mask : np_equal(assignments, float(k)),
    count : np_sum(mask),
    /* Broadcast mask [n,1] against X [n,2] to zero out non-members */
    masked : np_mul(X_km, np_reshape(mask, [n_pts, 1])),
    cent_k : np_scale(1.0 / count, np_sum(masked, 0)),
    new_cents : append(new_cents, np_to_list(cent_k))
  ),
  centroids : ndarray(new_cents, [K, 2])
)$

print("Final centroids after", km_iters, "iterations:")$
for k : 0 thru K - 1 do
  print("  Centroid", k, ":",
        np_ref(centroids, k, 0), np_ref(centroids, k, 1))$
print("True centres: (0,0), (5,0), (2.5,4.5)")$
Final centroids after 15 iterations:
  Centroid 0 : -0.13777241092582407 -0.04814781397867725
  Centroid 1 : 5.25132547441417 0.17762751647208794
  Centroid 2 : 2.4247996482346177 4.599856495025368
True centres: (0,0), (5,0), (2.5,4.5)
In [158]:
/* Plot clusters with assigned colours */
assign_list : np_to_list(assignments)$
all_x_list : np_to_list(all_x)$
all_y_list : np_to_list(all_y)$

/* Separate points by cluster assignment */
c0x : []$  c0y : []$
c1x : []$  c1y : []$
c2x : []$  c2y : []$
for i : 1 thru n_pts do block(
  if assign_list[i] = 0.0 then (c0x : append(c0x, [all_x_list[i]]),
                                 c0y : append(c0y, [all_y_list[i]]))
  elseif assign_list[i] = 1.0 then (c1x : append(c1x, [all_x_list[i]]),
                                     c1y : append(c1y, [all_y_list[i]]))
  else (c2x : append(c2x, [all_x_list[i]]),
        c2y : append(c2y, [all_y_list[i]]))
)$

/* Centroid positions */
cent_x : np_to_list(np_col(centroids, 0))$
cent_y : np_to_list(np_col(centroids, 1))$

ax_draw2d(
  color = "steelblue", marker_size = 5, name = "Cluster 0",
  points(c0x, c0y),
  color = "darkorange", marker_size = 5, name = "Cluster 1",
  points(c1x, c1y),
  color = "seagreen", marker_size = 5, name = "Cluster 2",
  points(c2x, c2y),
  color = "red", marker_size = 12, marker_symbol = "x", name = "Centroids",
  points(cent_x, cent_y),
  title = "K-Means Clustering (K=3)",
  xlabel = "x1", ylabel = "x2", grid = true, showlegend = true,
  aspect_ratio = true
)$
[Figure: K-Means clusters coloured by assignment, with final centroids marked]

Summary¶

This notebook demonstrated three classical ML algorithms implemented entirely with np_* functions:

| Algorithm | Symbolic Phase | Key Functions |
| --- | --- | --- |
| Linear Regression | diff() derives MSE gradient | np_matmul, np_lstsq, np_transpose |
| Logistic Regression | diff() derives cross-entropy gradient, confirms $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ | np_exp, np_div, np_log, np_clip, np_where |
| K-Means | — | np_argmin, np_sum(A, axis), np_equal, broadcasting |

Symbolic-numeric bridge highlights:

  • Maxima's diff() derives the gradient analytically — no manual calculus needed
  • The symbolic result ($\nabla = X^T(\hat{y} - y)/N$) directly translates to numeric code
  • This workflow is unique to CAS environments: derive once, evaluate many times