Maximum Likelihood Estimation¶

Fit probability distribution parameters by maximizing likelihood (equivalently, minimizing negative log-likelihood). The symbolic-numeric bridge shines: we write the log-likelihood symbolically, derive the score function (gradient) with diff(), then optimize numerically with np_minimize.

  1. Normal distribution [S+N] — closed-form MLE as benchmark
  2. Gamma distribution [S+N] — no closed form, requires optimization
  3. Fisher information [S+N] — confidence intervals from the Hessian
  4. Logistic regression [S+N] — GLM with symbolic link function
In [27]:
load("numerics")$
load("numerics-optimize")$
load("ax-plots")$

1. Normal Distribution MLE [S+N]¶

Symbolic Phase: Log-Likelihood and Score¶

For i.i.d. $X_i \sim \mathcal{N}(\mu, \sigma^2)$, the per-observation log-likelihood is $$\ell_i(\mu, \sigma) = -\ln\sigma - \frac{(x_i - \mu)^2}{2\sigma^2}$$

Setting the score function (gradient of $\ell$) to zero gives the MLE in closed form.

In [28]:
/* Per-observation log-likelihood */
ll_i(mu, sigma, x_i) := -log(sigma) - (x_i - mu)^2 / (2 * sigma^2)$

/* Score function: derivatives of ll_i */
score_mu : diff(ll_i(mu, sigma, x[i]), mu);
score_sigma : diff(ll_i(mu, sigma, x[i]), sigma);
Out[28]:
(x[i]-mu)/sigma^2
(x[i]-mu)^2/sigma^3-1/sigma

Setting the score to zero:

  • $\hat{\mu} = \bar{x}$ (sample mean)
  • $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ (MLE uses $n$, not $n-1$)
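These follow directly from summing the per-observation scores and setting them to zero:

$$\sum_{i=1}^n \frac{x_i - \hat\mu}{\hat\sigma^2} = 0 \;\Rightarrow\; \hat\mu = \bar{x}, \qquad \sum_{i=1}^n \left( -\frac{1}{\hat\sigma} + \frac{(x_i - \hat\mu)^2}{\hat\sigma^3} \right) = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2$$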

We verify that np_minimize finds the same solution.

Numeric Phase: Closed-Form vs Optimization¶

In [29]:
/* Generate data from N(3, 2^2) */
n_norm : 200$
data_norm : np_add(3.0, np_scale(2.0, np_randn([n_norm])))$

/* Closed-form MLE */
mu_cf : np_mean(data_norm)$
sigma_cf : sqrt(np_mean(np_pow(np_sub(data_norm, mu_cf), 2)))$
print("Closed-form: mu =", mu_cf, ", sigma =", sigma_cf)$

/* Numeric MLE via np_minimize (minimize negative log-likelihood) */
/* We optimize [mu, log(sigma)] to keep sigma > 0 */
nll_norm(p) := block([mu_v, sig_v],
  mu_v : np_ref(p, 0),
  sig_v : exp(np_ref(p, 1)),
  n_norm * log(sig_v)
    + np_sum(np_pow(np_sub(data_norm, mu_v), 2)) / (2 * sig_v^2))$

nll_norm_grad(p) := block([mu_v, log_sig, sig_v, res],
  mu_v : np_ref(p, 0),
  log_sig : np_ref(p, 1),
  sig_v : exp(log_sig),
  res : np_sub(data_norm, mu_v),
  ndarray([
    -np_sum(res) / sig_v^2,
    n_norm - np_sum(np_pow(res, 2)) / sig_v^2  /* chain rule: d/d(log_sig) */
  ]))$

[p_norm, _, _] : np_minimize(nll_norm, nll_norm_grad, ndarray([0.0, 0.0]))$
print("np_minimize: mu =", np_ref(p_norm, 0),
      ", sigma =", exp(np_ref(p_norm, 1)))$
Closed-form: mu = 2.780627013714442 , sigma = 2.0905371415544867
*************************************************
  N=    2   NUMBER OF CORRECTIONS= 2
       INITIAL VALUES
 F=  1.210223212961730D+03   GNORM=  2.289029880532844D+03
*************************************************
   I  NFN     FUNC                    GNORM                   STEPLENGTH
   1    2     3.493417214207001D+02   1.325372038628683D+02   4.368662936663886D-04
  14   18     2.474842076969837D+02   2.021309684523029D-08   1.000000000000000D+00
 THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
 IFLAG = 0
np_minimize: mu = 2.780627013857002 , sigma = 2.090537141654473
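The second entry of nll_norm_grad comes from the chain rule under the $\log\sigma$ reparameterization. With $S = \sum_i (x_i - \mu)^2$ and $\mathrm{NLL} = n\log\sigma + S/(2\sigma^2)$:

$$\frac{\partial\,\mathrm{NLL}}{\partial \log\sigma} = \sigma \frac{\partial\,\mathrm{NLL}}{\partial \sigma} = \sigma\left(\frac{n}{\sigma} - \frac{S}{\sigma^3}\right) = n - \frac{S}{\sigma^2}$$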
In [30]:
mu_fit : np_ref(p_norm, 0)$
sig_fit : exp(np_ref(p_norm, 1))$
ax_draw2d(
  color="steelblue",
  ax_histogram(np_to_list(data_norm)),
  title="Normal MLE: Histogram of Data",
  xlabel="x", ylabel="Count", grid=true
)$

2. Gamma Distribution MLE [S+N]¶

For the Gamma distribution with shape $\alpha$ and rate $\beta$: $$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$$

The log-likelihood per observation is $$\ell_i(\alpha, \beta) = \alpha \ln\beta - \ln\Gamma(\alpha) + (\alpha-1)\ln x_i - \beta x_i$$

There is no closed-form MLE for $\alpha$. The score involves the digamma function $\psi(\alpha)$, which is not available as an array operation — so we use finite-difference gradient approximation.

In [31]:
/* Symbolic log-likelihood for Gamma */
ll_gamma_i(alpha, beta, x_i) := alpha*log(beta) - log(gamma(alpha))
  + (alpha - 1)*log(x_i) - beta*x_i$

/* Score function */
score_gamma_alpha : diff(ll_gamma_i(alpha, beta, x[i]), alpha);
score_gamma_beta : diff(ll_gamma_i(alpha, beta, x[i]), beta);
Out[31]:
log(beta)+log(x[i])-psi[0](alpha)
alpha/beta-x[i]

The $\alpha$-score involves $\psi(\alpha) = \frac{d}{d\alpha}\ln\Gamma(\alpha)$ (digamma function). Since we lack an array-level digamma, we use central finite differences for the gradient.
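Each partial derivative is approximated by a central difference, which carries $O(\varepsilon^2)$ truncation error:

$$\frac{\partial f}{\partial \theta_j} \approx \frac{f(\theta + \varepsilon e_j) - f(\theta - \varepsilon e_j)}{2\varepsilon}$$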

In [32]:
/* Finite-difference gradient helper */
fd_gradient(f, x, eps) := block([nd, g],
  nd : np_size(x), g : np_zeros([nd]),
  for j : 0 thru nd - 1 do block([xp, xm],
    xp : np_copy(x), np_set(xp, j, np_ref(x, j) + eps),
    xm : np_copy(x), np_set(xm, j, np_ref(x, j) - eps),
    np_set(g, j, (f(xp) - f(xm)) / (2 * eps))),
  g)$

/* Generate Gamma(3, 2) data via sum-of-exponentials */
n_gam : 300$
data_gam : np_zeros([n_gam])$
for k : 1 thru 3 do
  data_gam : np_add(data_gam,
    np_scale(0.5, np_neg(np_log(np_rand([n_gam])))))$

/* NLL with log-transforms for positivity */
nll_gam(p) := block([a, b, sum_log, sum_x],
  a : exp(np_ref(p, 0)),
  b : exp(np_ref(p, 1)),
  sum_log : np_sum(np_log(data_gam)),
  sum_x : np_sum(data_gam),
  -(n_gam * (a*log(b) - log(gamma(a))) + (a - 1)*sum_log - b*sum_x))$

nll_gam_grad(p) := fd_gradient(nll_gam, p, 1e-5)$

[p_gam, _, _] : np_minimize(nll_gam, nll_gam_grad,
  ndarray([0.5, 0.5]), 1e-8, 200)$
print("Fitted: alpha =", exp(np_ref(p_gam, 0)),
      ", beta =", exp(np_ref(p_gam, 1)))$
print("True:   alpha = 3, beta = 2")$
*************************************************
  N=    2   NUMBER OF CORRECTIONS= 2
       INITIAL VALUES
 F=  4.039720999263950D+02   GNORM=  3.694536978404478D+02
*************************************************
   I  NFN     FUNC                    GNORM                   STEPLENGTH
   1    3     3.426570499213558D+02   5.520214687481796D+01   8.051426060712722D-04
  12   17     3.218344324907745D+02   1.136868377216160D-08   1.000000000000000D+00
 THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
 IFLAG = 0
Fitted: alpha = 3.5628218441647106 , beta = 2.4136056528521386
True:   alpha = 3, beta = 2
In [33]:
ax_draw2d(
  color="steelblue",
  ax_histogram(np_to_list(data_gam)),
  title="Gamma MLE: Histogram of Data",
  xlabel="x", ylabel="Count", grid=true
)$

3. Fisher Information and Confidence Intervals [S+N]¶

The Fisher information per observation is the negative expected Hessian of the log-likelihood: $$\mathcal{I}(\theta)_{jk} = -E\!\left[\frac{\partial^2 \ell_i}{\partial \theta_j \partial \theta_k}\right]$$

For the Normal case, we derive it symbolically from the log-likelihood in Section 1.

In [34]:
/* Hessian of per-observation log-likelihood */
H_11 : diff(ll_i(mu, sigma, x[i]), mu, 2);
H_12 : diff(diff(ll_i(mu, sigma, x[i]), mu), sigma);
H_22 : diff(ll_i(mu, sigma, x[i]), sigma, 2);
Out[34]:
-(1/sigma^2)
-((2*(x[i]-mu))/sigma^3)
1/sigma^2-(3*(x[i]-mu)^2)/sigma^4
In [35]:
/* Fisher information: -E[H]
   Using E[(x-mu)^2] = sigma^2, E[x-mu] = 0 */
I_11 : ratsimp(-subst([(x[i] - mu)^2 = sigma^2, x[i] - mu = 0], H_11));
I_22 : ratsimp(-subst([(x[i] - mu)^2 = sigma^2, x[i] - mu = 0], H_22));
print("Fisher information per observation:")$
print("  I_11 =", I_11, ", I_22 =", I_22)$

/* Standard errors: se = 1/sqrt(n * I_jj) */
se_mu : 1 / sqrt(n_norm * float(subst(sigma = sigma_cf, I_11)))$
se_sigma : 1 / sqrt(n_norm * float(subst(sigma = sigma_cf, ratsimp(I_22))))$
print("95% CI for mu:", mu_cf, "+/-", 1.96*se_mu)$
print("95% CI for sigma:", sigma_cf, "+/-", 1.96*se_sigma)$
1/sigma^2
2/sigma^2
Fisher information per observation:
  I_11 = 1/sigma^2 , I_22 = 2/sigma^2
95% CI for mu: 2.780627013714442 +/- 0.28973366586664173
95% CI for sigma: 2.0905371415544867 +/- 0.2048726398723397

4. Logistic Regression as MLE [S+N]¶

Logistic regression is a GLM where the likelihood for binary $y_i \in \{0,1\}$ is $$\ell(\mathbf{w}) = \sum_{i=1}^n \bigl[ y_i \ln\sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i)\ln(1 - \sigma(\mathbf{w}^T\mathbf{x}_i)) \bigr]$$

where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid. The gradient simplifies beautifully:

In [36]:
/* Sigmoid and its derivative */
sigmoid_sym(z) := 1 / (1 + exp(-z))$
ratsimp(diff(sigmoid_sym(z), z));
Out[36]:
%e^z/(%e^(2*z)+2*%e^z+1)
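Because $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, the terms telescope when differentiating the log-likelihood:

$$\frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^n \bigl(y_i - \sigma(\mathbf{w}^T\mathbf{x}_i)\bigr)\mathbf{x}_i, \qquad \nabla\,\mathrm{NLL}(\mathbf{w}) = \frac{1}{n} X^T(\hat{\mathbf{p}} - \mathbf{y})$$

This is the form implemented in lr_grad below.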
In [37]:
/* Generate 2D classification data */
n_lr : 100$
x1_lr : np_randn([n_lr])$
x2_lr : np_randn([n_lr])$

/* True boundary: 1 + 2*x1 - 1.5*x2 > 0 */
z_true : np_add(1.0, np_add(np_scale(2.0, x1_lr), np_scale(-1.5, x2_lr)))$
y_lr : np_map(lambda([v], if v > 0 then 1.0 else 0.0), z_true)$

/* Design matrix: [1, x1, x2] */
X_lr : np_hstack(np_ones([n_lr, 1]),
                 np_reshape(x1_lr, [n_lr, 1]),
                 np_reshape(x2_lr, [n_lr, 1]))$
y_lr_col : np_reshape(y_lr, [n_lr, 1])$

/* Numerically stable sigmoid: clip input to avoid exp overflow */
np_sigmoid(z) := np_div(1.0, np_add(1.0, np_exp(np_neg(np_clip(z, -500.0, 500.0)))))$

/* Negative log-likelihood */
lr_nll(w) := block([z, p, p_safe],
  z : np_matmul(X_lr, w),
  p : np_sigmoid(z),
  p_safe : np_clip(p, 1.0e-10, 1.0 - 1.0e-10),
  -np_mean(np_add(
    np_mul(y_lr_col, np_log(p_safe)),
    np_mul(np_sub(1.0, y_lr_col),
           np_log(np_sub(1.0, p_safe))))))$

lr_grad(w) := block([z, p],
  z : np_matmul(X_lr, w),
  p : np_sigmoid(z),
  np_scale(1.0 / n_lr,
    np_matmul(np_transpose(X_lr), np_sub(p, y_lr_col))))$

[w_lr, nll_lr, conv_lr] : np_minimize(lr_nll, lr_grad, np_zeros([3, 1]))$
print("Fitted w =", np_to_list(np_flatten(w_lr)))$
print("True:  w = [1, 2, -1.5]")$
define: warning: redefining the built-in function np_sigmoid
*************************************************
  N=    3   NUMBER OF CORRECTIONS= 3
       INITIAL VALUES
 F=  6.931471805599458D-01   GNORM=  3.761164029443863D-01
*************************************************
   I  NFN     FUNC                    GNORM                   STEPLENGTH
   1    2     4.210458639576212D-01   1.863708524500784D-01   2.658751365725103D+00
  28   32     6.102377849729093D-05   1.351944840873941D-05   1.000000000000000D+00
 THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
 IFLAG = 0
Fitted w = [994.9408636632331,2000.2701547726429,-1504.0625752993801]
True:  w = [1, 2, -1.5]
The fitted weights are huge because the labels were generated deterministically from the sign of the linear predictor: the classes are perfectly separable, so the likelihood has no finite maximizer and the optimizer grows $\|\mathbf{w}\|$ until the NLL underflows. The direction is still recovered: the ratios $\hat{w}_1/\hat{w}_0 \approx 2$ and $\hat{w}_2/\hat{w}_0 \approx -1.5$ match the true boundary.
In [38]:
/* Decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1)/w2 */
w0 : np_ref(w_lr, 0, 0)$  w1 : np_ref(w_lr, 1, 0)$  w2 : np_ref(w_lr, 2, 0)$

/* Separate classes using boolean mask + extract */
mask_pos : np_greater(y_lr, 0.5)$
mask_neg : np_logical_not(mask_pos)$

ax_draw2d(
  color="blue", marker_size=5, name="Class 1",
  points(np_to_list(np_extract(mask_pos, x1_lr)),
         np_to_list(np_extract(mask_pos, x2_lr))),
  color="red", marker_size=5, name="Class 0",
  points(np_to_list(np_extract(mask_neg, x1_lr)),
         np_to_list(np_extract(mask_neg, x2_lr))),
  color="black", line_width=2, name="Decision boundary",
  explicit(-(w0 + w1*u) / w2, u, -3, 3),
  title="Logistic Regression MLE",
  xlabel="x1", ylabel="x2", grid=true, showlegend=true
)$

Observed Information for Coefficient Standard Errors¶

The observed Fisher information (Hessian of NLL at the MLE) for logistic regression is $$\mathcal{I}(\hat{\mathbf{w}}) = X^T \text{diag}\bigl(\hat{p}_i(1-\hat{p}_i)\bigr) X$$

Its inverse gives the approximate covariance of the coefficient estimates.
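Differentiating the gradient $X^T(\hat{\mathbf{p}} - \mathbf{y})$ once more, and using $\partial \hat{p}_i / \partial \mathbf{w} = \hat{p}_i(1-\hat{p}_i)\,\mathbf{x}_i$, gives

$$\nabla^2\,\mathrm{NLL}_{\text{sum}}(\hat{\mathbf{w}}) = \sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)\,\mathbf{x}_i\mathbf{x}_i^T = X^T\,\mathrm{diag}\bigl(\hat{p}_i(1-\hat{p}_i)\bigr)\,X$$

Note this is the Hessian of the summed (not mean) NLL, which is the right scaling for standard errors.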

In [39]:
/* Observed information: X^T diag(p*(1-p)) X */
p_hat : np_sigmoid(np_matmul(X_lr, w_lr))$
W_vec : np_mul(p_hat, np_sub(1.0, p_hat))$

/* X_weighted = diag(W) * X, implemented as element-wise scaling */
X_w : np_hstack(
  np_mul(W_vec, np_reshape(np_col(X_lr, 0), [n_lr, 1])),
  np_mul(W_vec, np_reshape(np_col(X_lr, 1), [n_lr, 1])),
  np_mul(W_vec, np_reshape(np_col(X_lr, 2), [n_lr, 1])))$
info_mat : np_matmul(np_transpose(X_lr), X_w)$
cov_w : np_inv(info_mat)$

print("Coefficient standard errors:")$
for j : 0 thru 2 do
  print("  se(w", j, ") =", sqrt(np_ref(cov_w, j, j)))$
Coefficient standard errors:
  se(w 0 ) = 2093.372112086626
  se(w 1 ) = 4210.886253601187
  se(w 2 ) = 3165.4461572091195

The enormous standard errors are another symptom of separability: with every $\hat{p}_i$ pushed to 0 or 1, the weights $\hat{p}_i(1-\hat{p}_i)$ vanish and the information matrix is nearly singular.

Summary¶

| Distribution | Parameters | Gradient | S+N Bridge |
|---|---|---|---|
| Normal | $\mu$, $\sigma$ | Analytic score via diff() | Closed-form MLE as benchmark |
| Gamma | $\alpha$, $\beta$ | Finite differences | Digamma in symbolic score |
| Logistic regression | $\mathbf{w}$ | Analytic: $X^T(\hat{p} - y)/n$ | Sigmoid derivative via diff() |

The symbolic-numeric bridge lets us derive score functions and Fisher information matrices symbolically, then evaluate them numerically on real data.