Maximum Likelihood Estimation¶
Fit probability distribution parameters by maximizing likelihood (equivalently, minimizing negative log-likelihood). The symbolic-numeric bridge shines: we write the log-likelihood symbolically, derive the score function (gradient) with diff(), then optimize numerically with np_minimize.
- Normal distribution [S+N] — closed-form MLE as benchmark
- Gamma distribution [S+N] — no closed form, requires optimization
- Fisher information [S+N] — confidence intervals from the Hessian
- Logistic regression [S+N] — GLM with symbolic link function
load("numerics")$
load("numerics-optimize")$
load("ax-plots")$
1. Normal Distribution MLE [S+N]¶
Symbolic Phase: Log-Likelihood and Score¶
For i.i.d. $X_i \sim \mathcal{N}(\mu, \sigma^2)$, the per-observation log-likelihood (dropping the additive constant $-\tfrac{1}{2}\ln 2\pi$, which does not affect the maximizer) is $$\ell_i(\mu, \sigma) = -\ln\sigma - \frac{(x_i - \mu)^2}{2\sigma^2}$$
Differentiating gives the score function (the gradient of $\ell$); setting it to zero yields the MLE in closed form.
/* Per-observation log-likelihood */
ll_i(mu, sigma, x_i) := -log(sigma) - (x_i - mu)^2 / (2 * sigma^2)$
/* Score function: derivatives of ll_i */
score_mu : diff(ll_i(mu, sigma, x[i]), mu);
score_sigma : diff(ll_i(mu, sigma, x[i]), sigma);
(x[i]-mu)/sigma^2
(x[i]-mu)^2/sigma^3-1/sigma
Setting the score to zero:
- $\hat{\mu} = \bar{x}$ (sample mean)
- $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ (MLE uses $n$, not $n-1$)
We verify that np_minimize finds the same solution.
Numeric Phase: Closed-Form vs Optimization¶
/* Generate data from N(3, 2^2) */
n_norm : 200$
data_norm : np_add(3.0, np_scale(2.0, np_randn([n_norm])))$
/* Closed-form MLE */
mu_cf : np_mean(data_norm)$
sigma_cf : sqrt(np_mean(np_pow(np_sub(data_norm, mu_cf), 2)))$
print("Closed-form: mu =", mu_cf, ", sigma =", sigma_cf)$
/* Numeric MLE via np_minimize (minimize negative log-likelihood) */
/* We optimize [mu, log(sigma)] to keep sigma > 0 */
nll_norm(p) := block([mu_v, sig_v],
mu_v : np_ref(p, 0),
sig_v : exp(np_ref(p, 1)),
n_norm * log(sig_v)
+ np_sum(np_pow(np_sub(data_norm, mu_v), 2)) / (2 * sig_v^2))$
nll_norm_grad(p) := block([mu_v, log_sig, sig_v, res],
mu_v : np_ref(p, 0),
log_sig : np_ref(p, 1),
sig_v : exp(log_sig),
res : np_sub(data_norm, mu_v),
ndarray([
-np_sum(res) / sig_v^2,
n_norm - np_sum(np_pow(res, 2)) / sig_v^2 /* chain rule: d/d(log_sig) */
]))$
[p_norm, _, _] : np_minimize(nll_norm, nll_norm_grad, ndarray([0.0, 0.0]))$
print("np_minimize: mu =", np_ref(p_norm, 0),
", sigma =", exp(np_ref(p_norm, 1)))$
Closed-form: mu = 2.780627013714442 , sigma = 2.0905371415544867
*************************************************
N= 2 NUMBER OF CORRECTIONS= 2
INITIAL VALUES
F= 1.210223212961730D+03 GNORM= 2.289029880532844D+03
*************************************************
I NFN FUNC GNORM STEPLENGTH
1 2 3.493417214207001D+02 1.325372038628683D+02 4.368662936663886D-04
14 18 2.474842076969837D+02 2.021309684523029D-08 1.000000000000000D+00
THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
IFLAG = 0
np_minimize: mu = 2.780627013857002 , sigma = 2.090537141654473
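The same agreement between the closed-form estimates and the optimizer can be reproduced outside the Maxima session; a minimal NumPy/SciPy sketch (its own simulated sample, not the notebook's `data_norm`):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = 3.0 + 2.0 * rng.standard_normal(200)  # stand-in for data_norm

# Closed-form MLE: sample mean, and sigma with the 1/n denominator
mu_cf = data.mean()
sigma_cf = np.sqrt(np.mean((data - mu_cf) ** 2))

# Numeric MLE over [mu, log(sigma)], mirroring nll_norm above
def nll(p):
    mu, sig = p[0], np.exp(p[1])
    return len(data) * np.log(sig) + np.sum((data - mu) ** 2) / (2 * sig**2)

res = minimize(nll, [0.0, 0.0], method="L-BFGS-B")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

Both routes land on the same $(\hat\mu, \hat\sigma)$ to optimizer tolerance, as in the Maxima run above.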
mu_fit : np_ref(p_norm, 0)$
sig_fit : exp(np_ref(p_norm, 1))$
ax_draw2d(
color="steelblue",
ax_histogram(np_to_list(data_norm)),
title="Normal MLE: Histogram of Data",
xlabel="x", ylabel="Count", grid=true
)$
2. Gamma Distribution MLE [S+N]¶
For the Gamma distribution with shape $\alpha$ and rate $\beta$: $$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$$
The log-likelihood per observation is $$\ell_i(\alpha, \beta) = \alpha \ln\beta - \ln\Gamma(\alpha) + (\alpha-1)\ln x_i - \beta x_i$$
There is no closed-form MLE for $\alpha$. The score involves the digamma function $\psi(\alpha)$, which is not available as an array operation — so we use finite-difference gradient approximation.
/* Symbolic log-likelihood for Gamma */
ll_gamma_i(alpha, beta, x_i) := alpha*log(beta) - log(gamma(alpha))
+ (alpha - 1)*log(x_i) - beta*x_i$
/* Score function */
score_gamma_alpha : diff(ll_gamma_i(alpha, beta, x[i]), alpha);
score_gamma_beta : diff(ll_gamma_i(alpha, beta, x[i]), beta);
log(beta)+log(x[i])-psi[0](alpha)
alpha/beta-x[i]
The $\alpha$-score involves $\psi(\alpha) = \frac{d}{d\alpha}\ln\Gamma(\alpha)$ (digamma function). Since we lack an array-level digamma, we use central finite differences for the gradient.
/* Finite-difference gradient helper */
fd_gradient(f, x, eps) := block([nd, g],
nd : np_size(x), g : np_zeros([nd]),
for j : 0 thru nd - 1 do block([xp, xm],
xp : np_copy(x), np_set(xp, j, np_ref(x, j) + eps),
xm : np_copy(x), np_set(xm, j, np_ref(x, j) - eps),
np_set(g, j, (f(xp) - f(xm)) / (2 * eps))),
g)$
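The same central-difference scheme transcribed to NumPy for reference (a hypothetical helper mirroring `fd_gradient` above, with $O(\varepsilon^2)$ accuracy):

```python
import numpy as np

def fd_gradient(f, x, eps=1e-5):
    """Central differences: g_j = (f(x + eps*e_j) - f(x - eps*e_j)) / (2*eps)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps
        xm[j] -= eps
        g[j] = (f(xp) - f(xm)) / (2 * eps)
    return g

# Exact (up to rounding) on a quadratic: gradient of v0^2 + 3*v1 at (2, 1) is (4, 3)
g = fd_gradient(lambda v: v[0] ** 2 + 3 * v[1], np.array([2.0, 1.0]))
```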
/* Generate Gamma(3, 2) data via sum-of-exponentials */
n_gam : 300$
data_gam : np_zeros([n_gam])$
for k : 1 thru 3 do
data_gam : np_add(data_gam,
np_scale(0.5, np_neg(np_log(np_rand([n_gam])))))$
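Why this works: $-\ln U/\beta \sim \mathrm{Exp}(\beta)$ for $U \sim \mathrm{Uniform}(0,1)$ by inverse-CDF sampling, and a sum of $k$ i.i.d. $\mathrm{Exp}(\beta)$ draws is $\mathrm{Gamma}(k, \beta)$. A quick NumPy sanity check against the Gamma(3, 2) moments (mean $k/\beta = 1.5$, variance $k/\beta^2 = 0.75$), on its own simulated sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, rate = 100_000, 3, 2.0
# -log(U)/rate ~ Exp(rate); summing k independent draws gives Gamma(k, rate)
data = sum(-np.log(rng.random(n)) / rate for _ in range(k))
mean, var = data.mean(), data.var()  # should be near 1.5 and 0.75
```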
/* NLL with log-transforms for positivity */
nll_gam(p) := block([a, b, sum_log, sum_x],
a : exp(np_ref(p, 0)),
b : exp(np_ref(p, 1)),
sum_log : np_sum(np_log(data_gam)),
sum_x : np_sum(data_gam),
-(n_gam * (a*log(b) - log(gamma(a))) + (a - 1)*sum_log - b*sum_x))$
nll_gam_grad(p) := fd_gradient(nll_gam, p, 1e-5)$
[p_gam, _, _] : np_minimize(nll_gam, nll_gam_grad,
ndarray([0.5, 0.5]), 1e-8, 200)$
print("Fitted: alpha =", exp(np_ref(p_gam, 0)),
", beta =", exp(np_ref(p_gam, 1)))$
print("True: alpha = 3, beta = 2")$
*************************************************
N= 2 NUMBER OF CORRECTIONS= 2
INITIAL VALUES
F= 4.039720999263950D+02 GNORM= 3.694536978404478D+02
*************************************************
I NFN FUNC GNORM STEPLENGTH
1 3 3.426570499213558D+02 5.520214687481796D+01 8.051426060712722D-04
12 17 3.218344324907745D+02 1.136868377216160D-08 1.000000000000000D+00
THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
IFLAG = 0
Fitted: alpha = 3.5628218441647106 , beta = 2.4136056528521386
True: alpha = 3, beta = 2
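Where a numeric digamma is available, the analytic score from the symbolic phase can replace the finite differences; a SciPy sketch (its own simulated data, so the fitted values differ from the run above):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import digamma, gammaln

rng = np.random.default_rng(1)
data = rng.gamma(shape=3.0, scale=1 / 2.0, size=300)  # Gamma(alpha=3, rate beta=2)
n, slog, sx = len(data), np.log(data).sum(), data.sum()

def nll(p):                      # p = [log alpha, log beta]
    a, b = np.exp(p)
    return -(n * (a * np.log(b) - gammaln(a)) + (a - 1) * slog - b * sx)

def grad(p):                     # analytic score; chain rule for the log-parameters
    a, b = np.exp(p)
    da = -(n * (np.log(b) - digamma(a)) + slog) * a
    db = -(n * a / b - sx) * b
    return np.array([da, db])

res = minimize(nll, [0.0, 0.0], jac=grad, method="L-BFGS-B")
alpha_hat, beta_hat = np.exp(res.x)
```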
ax_draw2d(
color="steelblue",
ax_histogram(np_to_list(data_gam)),
title="Gamma MLE: Histogram of Data",
xlabel="x", ylabel="Count", grid=true
)$
3. Fisher Information and Confidence Intervals [S+N]¶
The Fisher information per observation is the negative expected Hessian of the log-likelihood: $$\mathcal{I}(\theta)_{jk} = -E\!\left[\frac{\partial^2 \ell_i}{\partial \theta_j \partial \theta_k}\right]$$
For the Normal case, we derive it symbolically from the log-likelihood in Section 1.
/* Hessian of per-observation log-likelihood */
H_11 : diff(ll_i(mu, sigma, x[i]), mu, 2);
H_12 : diff(diff(ll_i(mu, sigma, x[i]), mu), sigma);
H_22 : diff(ll_i(mu, sigma, x[i]), sigma, 2);
-(1/sigma^2)
-((2*(x[i]-mu))/sigma^3)
1/sigma^2-(3*(x[i]-mu)^2)/sigma^4
/* Fisher information: -E[H]
Using E[(x-mu)^2] = sigma^2, E[x-mu] = 0 */
I_11 : ratsimp(-subst([(x[i] - mu)^2 = sigma^2, x[i] - mu = 0], H_11));
I_22 : ratsimp(-subst([(x[i] - mu)^2 = sigma^2, x[i] - mu = 0], H_22));
print("Fisher information per observation:")$
print(" I_11 =", I_11, ", I_22 =", I_22)$
/* Standard errors: se = 1/sqrt(n * I_jj); valid here because
   I_12 = -E[H_12] = 0, so the information matrix is diagonal */
se_mu : 1 / sqrt(n_norm * float(subst(sigma = sigma_cf, I_11)))$
se_sigma : 1 / sqrt(n_norm * float(subst(sigma = sigma_cf, ratsimp(I_22))))$
print("95% CI for mu:", mu_cf, "+/-", 1.96*se_mu)$
print("95% CI for sigma:", sigma_cf, "+/-", 1.96*se_sigma)$
1/sigma^2
2/sigma^2
Fisher information per observation:
 I_11 = 1/sigma^2 , I_22 = 2/sigma^2
95% CI for mu: 2.780627013714442 +/- 0.28973366586664173
95% CI for sigma: 2.0905371415544867 +/- 0.2048726398723397
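Plugging the MLEs into these formulas gives $se(\hat\mu) = \hat\sigma/\sqrt{n}$ and $se(\hat\sigma) = \hat\sigma/\sqrt{2n}$; a NumPy check against the printed intervals (the numbers are copied from the output above):

```python
import numpy as np

# se(mu) = sigma/sqrt(n), se(sigma) = sigma/sqrt(2n), from I_11 = 1/sigma^2, I_22 = 2/sigma^2
n = 200
mu_hat = 2.780627013714442       # copied from the notebook output
sigma_hat = 2.0905371415544867
half_mu = 1.96 * sigma_hat / np.sqrt(n)
half_sigma = 1.96 * sigma_hat / np.sqrt(2 * n)
```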
4. Logistic Regression as MLE [S+N]¶
Logistic regression is a GLM where the likelihood for binary $y_i \in \{0,1\}$ is $$\ell(\mathbf{w}) = \sum_{i=1}^n \bigl[ y_i \ln\sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i)\ln(1 - \sigma(\mathbf{w}^T\mathbf{x}_i)) \bigr]$$
where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid. The gradient simplifies beautifully:
/* Sigmoid and its derivative */
sigmoid_sym(z) := 1 / (1 + exp(-z))$
ratsimp(diff(sigmoid_sym(z), z));
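The simplification the cell above demonstrates is $\sigma'(z) = \sigma(z)(1-\sigma(z))$; chaining it through the log-likelihood collapses the gradient of the mean NLL to a single matrix product:

```latex
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr), \qquad
\nabla_{\mathbf{w}} \Bigl[ -\tfrac{1}{n}\,\ell(\mathbf{w}) \Bigr]
  = \tfrac{1}{n}\, X^{T} \bigl( \sigma(X\mathbf{w}) - \mathbf{y} \bigr)
```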
/* Generate 2D classification data */
n_lr : 100$
x1_lr : np_randn([n_lr])$
x2_lr : np_randn([n_lr])$
/* True boundary: 1 + 2*x1 - 1.5*x2 > 0 */
z_true : np_add(1.0, np_add(np_scale(2.0, x1_lr), np_scale(-1.5, x2_lr)))$
y_lr : np_map(lambda([v], if v > 0 then 1.0 else 0.0), z_true)$
/* Design matrix: [1, x1, x2] */
X_lr : np_hstack(np_ones([n_lr, 1]),
np_reshape(x1_lr, [n_lr, 1]),
np_reshape(x2_lr, [n_lr, 1]))$
y_lr_col : np_reshape(y_lr, [n_lr, 1])$
/* Numerically stable sigmoid: clip input to avoid exp overflow */
np_sigmoid(z) := np_div(1.0, np_add(1.0, np_exp(np_neg(np_clip(z, -500.0, 500.0)))))$
/* Negative log-likelihood */
lr_nll(w) := block([z, p, p_safe],
z : np_matmul(X_lr, w),
p : np_sigmoid(z),
p_safe : np_clip(p, 1.0e-10, 1.0 - 1.0e-10),
-np_mean(np_add(
np_mul(y_lr_col, np_log(p_safe)),
np_mul(np_sub(1.0, y_lr_col),
np_log(np_sub(1.0, p_safe))))))$
lr_grad(w) := block([z, p],
z : np_matmul(X_lr, w),
p : np_sigmoid(z),
np_scale(1.0 / n_lr,
np_matmul(np_transpose(X_lr), np_sub(p, y_lr_col))))$
[w_lr, nll_lr, conv_lr] : np_minimize(lr_nll, lr_grad, np_zeros([3, 1]))$
print("Fitted w =", np_to_list(np_flatten(w_lr)))$
print("True: w = [1, 2, -1.5]")$
define: warning: redefining the built-in function np_sigmoid
*************************************************
N= 3 NUMBER OF CORRECTIONS= 3
INITIAL VALUES
F= 6.931471805599458D-01 GNORM= 3.761164029443863D-01
*************************************************
I NFN FUNC GNORM STEPLENGTH
1 2 4.210458639576212D-01 1.863708524500784D-01 2.658751365725103D+00
28 32 6.102377849729093D-05 1.351944840873941D-05 1.000000000000000D+00
THE MINIMIZATION TERMINATED WITHOUT DETECTING ERRORS.
IFLAG = 0
Fitted w = [994.9408636632331,2000.2701547726429,-1504.0625752993801]
True: w = [1, 2, -1.5]

Because the labels were generated by a deterministic threshold, the two classes are perfectly separable and the logistic MLE does not exist: the optimizer drives the weights toward infinity along the separating direction (note the fitted w is roughly $1000 \times [1, 2, -1.5]$). The decision boundary is still recovered correctly, but the coefficient magnitudes are arbitrary, depending only on where the optimizer stopped.
/* Decision boundary: w0 + w1*x1 + w2*x2 = 0 => x2 = -(w0 + w1*x1)/w2 */
w0 : np_ref(w_lr, 0, 0)$ w1 : np_ref(w_lr, 1, 0)$ w2 : np_ref(w_lr, 2, 0)$
/* Separate classes using boolean mask + extract */
mask_pos : np_greater(y_lr, 0.5)$
mask_neg : np_logical_not(mask_pos)$
ax_draw2d(
color="blue", marker_size=5, name="Class 1",
points(np_to_list(np_extract(mask_pos, x1_lr)),
np_to_list(np_extract(mask_pos, x2_lr))),
color="red", marker_size=5, name="Class 0",
points(np_to_list(np_extract(mask_neg, x1_lr)),
np_to_list(np_extract(mask_neg, x2_lr))),
color="black", line_width=2, name="Decision boundary",
explicit(-(w0 + w1*u) / w2, u, -3, 3),
title="Logistic Regression MLE",
xlabel="x1", ylabel="x2", grid=true, showlegend=true
)$
Observed Information for Coefficient Standard Errors¶
The observed Fisher information (Hessian of NLL at the MLE) for logistic regression is $$\mathcal{I}(\hat{\mathbf{w}}) = X^T \text{diag}\bigl(\hat{p}_i(1-\hat{p}_i)\bigr) X$$
Its inverse gives the approximate covariance of the coefficient estimates.
/* Observed information: X^T diag(p*(1-p)) X */
p_hat : np_sigmoid(np_matmul(X_lr, w_lr))$
W_vec : np_mul(p_hat, np_sub(1.0, p_hat))$
/* X_weighted = diag(W) * X, implemented as element-wise scaling */
X_w : np_hstack(
np_mul(W_vec, np_reshape(np_col(X_lr, 0), [n_lr, 1])),
np_mul(W_vec, np_reshape(np_col(X_lr, 1), [n_lr, 1])),
np_mul(W_vec, np_reshape(np_col(X_lr, 2), [n_lr, 1])))$
info_mat : np_matmul(np_transpose(X_lr), X_w)$
cov_w : np_inv(info_mat)$
print("Coefficient standard errors:")$
for j : 0 thru 2 do
print(" se(w", j, ") =", sqrt(np_ref(cov_w, j, j)))$
Coefficient standard errors:
 se(w 0 ) = 2093.372112086626
 se(w 1 ) = 4210.886253601187
 se(w 2 ) = 3165.4461572091195
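The enormous standard errors are another symptom of perfect separation: $\hat{p}_i(1-\hat{p}_i) \to 0$, so the information matrix is nearly singular. With noisy (non-separable) labels the same computation gives finite, interpretable errors; a NumPy/SciPy sketch on its own simulated data (illustrative, outside the Maxima session):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
w_true = np.array([1.0, 2.0, -1.5])
# Bernoulli labels: the classes overlap, so the MLE is finite
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)

def nll(w):
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)  # stable negative log-likelihood

w_hat = minimize(nll, np.zeros(3), method="BFGS").x
p = 1 / (1 + np.exp(-(X @ w_hat)))
info = X.T @ (X * (p * (1 - p))[:, None])        # X^T diag(p(1-p)) X
se = np.sqrt(np.diag(np.linalg.inv(info)))
```

Here the standard errors come out on the order of 0.1, and the fitted coefficients sit within a few standard errors of the truth.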
Summary¶
| Distribution | Parameters | Gradient | S+N Bridge |
|---|---|---|---|
| Normal | $\mu$, $\sigma$ | Analytic score via diff() | Closed-form MLE as benchmark |
| Gamma | $\alpha$, $\beta$ | Finite differences | Digamma in symbolic score |
| Logistic regression | $\mathbf{w}$ | Analytic: $X^T(\hat{p} - y)/n$ | Sigmoid derivative via diff() |
The symbolic-numeric bridge lets us derive score functions and Fisher information matrices symbolically, then evaluate them numerically on real data.