REINFORCE: Policy Gradient with Symbolic Differentiation¶
The REINFORCE algorithm (Williams, 1992) learns a stochastic policy by gradient ascent on expected return:
$$\nabla_w J(w) = \mathbb{E}\bigg[\sum_t G_t \,\nabla_w \log \pi(a_t \mid s_t; w)\bigg]$$
where $G_t$ is the discounted return from step $t$.
Maxima's unique advantage: we define the policy symbolically and use diff() to compute the exact score function $\nabla_w \log \pi$ — no backpropagation framework needed.
load("numerics")$
load("ax-plots")$
1. Inverted Pendulum Stabilization¶
Simple pendulum with torque input $u$, Euler-discretized: $$\omega_{k+1} = \omega_k + \Delta t\left(-\frac{g}{l}\sin\theta_k - b\omega_k + \frac{u}{ml^2}\right), \qquad \theta_{k+1} = \theta_k + \Delta t\,\omega_{k+1}$$
Goal: stabilize near upright ($\theta = 0$) from a random initial tilt.
Reward: $r_k = -(\theta_k^2 + 0.1\,\omega_k^2 + 0.01\,u_k^2)$ — dense quadratic cost encouraging the pendulum to stay upright with minimal effort.
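For readers following along outside Maxima, the Euler step and reward above are short enough to sketch in plain Python (a minimal sketch; parameter values match those set in the next cell, and the function names are illustrative, not part of the notebook):

```python
import math

# Pendulum parameters (matching the notebook's values)
g, l, m, b = 9.81, 1.0, 1.0, 0.1
dt = 0.05

def step(theta, omega, u):
    """Semi-implicit Euler step: omega is updated first,
    then theta uses the *new* omega, as in the equations above."""
    omega_new = omega + dt * (-g / l * math.sin(theta) - b * omega + u / (m * l**2))
    theta_new = theta + dt * omega_new
    return theta_new, omega_new

def reward(theta, omega, u):
    """Dense quadratic cost: penalize tilt, angular speed, and effort."""
    return -(theta**2 + 0.1 * omega**2 + 0.01 * u**2)
```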
/* Pendulum parameters */
gv : 9.81$ lp : 1.0$ mp : 1.0$ bp : 0.1$
dt_sim : 0.05$
horizon : 80$ /* Steps per episode (4 seconds) */
u_max : 5.0$ /* Torque clamp */
print("Pendulum: g =", gv, " l =", lp, " b =", bp)$
print("Episode: ", horizon, "steps x", dt_sim, "s =", horizon * dt_sim, "s")$
Pendulum: g = 9.81 l = 1.0 b = 0.1 Episode: 80 steps x 0.05 s = 4.0 s
2. Symbolic Policy and Score Function¶
Gaussian policy with linear mean: $$\pi(u \mid \theta, \omega;\, w, \sigma) = \mathcal{N}\!\left(\mu(\theta,\omega;w),\, \sigma^2\right), \qquad \mu = w_1 \sin\theta + w_2 \cos\theta + w_3\,\omega$$
The score function $\nabla_w \log\pi$ is computed symbolically using diff():
$$\frac{\partial \log \pi}{\partial w_i} = \frac{u - \mu}{\sigma^2} \cdot \frac{\partial \mu}{\partial w_i}$$
/* Symbolic policy mean */
mu_expr : w1 * sin(th) + w2 * cos(th) + w3 * om$
/* Symbolic log-probability (up to constant) */
log_pi : -(u_sym - mu_expr)^2 / (2 * sig^2) - log(sig)$
/* Exact score functions via Maxima's symbolic diff() */
score_exprs : [diff(log_pi, w1), diff(log_pi, w2), diff(log_pi, w3)]$
print("Policy mean: μ =", mu_expr)$
print("Score ∂log π/∂w₁ =", score_exprs[1])$
print("Score ∂log π/∂w₂ =", score_exprs[2])$
print("Score ∂log π/∂w₃ =", score_exprs[3])$
Policy mean: μ = om*w3+cos(th)*w2+sin(th)*w1 Score ∂log π/∂w₁ = (sin(th)*(-(om*w3)-cos(th)*w2-sin(th)*w1+u_sym))/sig^2 Score ∂log π/∂w₂ = (cos(th)*(-(om*w3)-cos(th)*w2-sin(th)*w1+u_sym))/sig^2 Score ∂log π/∂w₃ = (om*(-(om*w3)-cos(th)*w2-sin(th)*w1+u_sym))/sig^2
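As a sanity check outside Maxima, the closed-form score $(u - \mu)/\sigma^2 \cdot \partial\mu/\partial w_i$ can be verified against finite differences of the log-density. A minimal Python sketch (illustrative names, not notebook code):

```python
import math

def mu(w, th, om):
    # Linear-in-features policy mean: w1*sin(th) + w2*cos(th) + w3*om
    return w[0] * math.sin(th) + w[1] * math.cos(th) + w[2] * om

def log_pi(w, th, om, u, sig):
    # Gaussian log-density, up to an additive constant
    return -(u - mu(w, th, om)) ** 2 / (2 * sig ** 2) - math.log(sig)

def score(w, th, om, u, sig):
    # Closed-form score from the symbolic derivative above
    d = (u - mu(w, th, om)) / sig ** 2
    return [d * math.sin(th), d * math.cos(th), d * om]
```

Because log π is quadratic in each $w_i$, a central finite difference of `log_pi` agrees with `score` to rounding error, confirming the symbolic result.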
3. REINFORCE Training¶
For each batch of episodes:
- Roll out trajectories with the stochastic policy
- Compute discounted returns $G_t$ using np_discount
- Evaluate the symbolic score function at each $(s_t, a_t)$
- Update: $w \leftarrow w + \alpha \cdot \frac{1}{N}\sum_{\text{episodes}} \sum_t (G_t - b)\,\nabla_w \log\pi(a_t \mid s_t)$
A baseline $b$ (running average of returns) reduces gradient variance.
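The backward recurrence behind discounted returns (assumed here to be what np_discount implements) is short enough to sketch directly in Python:

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed in one reverse pass."""
    G = 0.0
    out = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

# discounted_returns([1.0, 1.0, 1.0], 0.5) -> [1.75, 1.5, 1.0]
```

The reverse pass makes the computation O(n); the naive forward definition would be O(n²).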
np_seed(42)$
/* Policy parameters */
ww : [0.0, 0.0, 0.0]$
sigma_pol : 1.5$
/* Hyperparameters */
alpha_lr : 0.01$
disc : 0.97$
n_epochs : 40$
n_batch : 5$
sigma_decay : 0.98$
sigma_min : 0.3$
/* Tracking */
return_history : []$
baseline : 0.0$
beta_bl : 0.9$ /* Baseline smoothing */
for epoch : 1 thru n_epochs do (
batch_grad : [0.0, 0.0, 0.0],
batch_ret : 0.0,
for b : 1 thru n_batch do (
/* Random initial state near upright */
theta_v : 0.4 * float(random(2.0) - 1.0),
omega_v : 0.2 * float(random(2.0) - 1.0),
/* Pre-generate noise for episode */
noise_arr : np_randn(horizon),
/* Collect trajectory: rewards and score vectors */
rew_arr : np_zeros(horizon),
score_list : [],
for k : 0 thru horizon - 1 do (
/* Evaluate policy mean numerically */
mu_val : float(subst([w1=ww[1], w2=ww[2], w3=ww[3],
th=theta_v, om=omega_v], mu_expr)),
/* Sample action from N(mu, sigma^2) */
u_raw : mu_val + sigma_pol * np_ref(noise_arr, k),
u_act : max(-u_max, min(u_max, float(u_raw))),
/* Record reward */
rwd : -(theta_v^2 + 0.1*omega_v^2 + 0.01*u_act^2),
np_set(rew_arr, k, float(rwd)),
/* Evaluate symbolic score function at current (s, a, w) */
sv : float(subst([w1=ww[1], w2=ww[2], w3=ww[3],
th=theta_v, om=omega_v, u_sym=u_act, sig=sigma_pol],
score_exprs)),
score_list : endcons(sv, score_list),
/* Euler step */
omega_new : omega_v + dt_sim * (-gv/lp * sin(theta_v) - bp*omega_v
+ u_act/(mp*lp^2)),
theta_v : theta_v + dt_sim * omega_new,
omega_v : omega_new
),
/* Discounted returns via np_discount */
returns_arr : np_discount(rew_arr, disc),
ep_return : np_ref(returns_arr, 0),
/* Accumulate policy gradient with baseline */
for k : 1 thru horizon do (
gt : np_ref(returns_arr, k - 1) - baseline,
batch_grad : batch_grad + gt * score_list[k]
),
batch_ret : batch_ret + ep_return
),
/* Update policy weights */
ww : float(ww + (alpha_lr / n_batch) * batch_grad),
/* Update baseline and exploration */
avg_ret : batch_ret / n_batch,
baseline : beta_bl * baseline + (1 - beta_bl) * avg_ret,
sigma_pol : max(sigma_min, sigma_pol * sigma_decay),
return_history : endcons(avg_ret, return_history),
if mod(epoch, 10) = 0 then
print("Epoch", epoch, ": avg return =", avg_ret, " w =", ww, " σ =", sigma_pol)
)$
print("\nFinal policy weights:", ww)$
print("Final σ:", sigma_pol)$
Epoch 10 : avg return = -1.1228499119029394  w = [-0.06962357795332622, -0.03471576530016185, -0.24400252473946557]  σ = 1.22560921033132
Epoch 20 : avg return = -1.9153370442159179  w = [-0.0551354254757563, 0.07488989076414751, -0.47091682914399796]  σ = 1.0014119576326417
Epoch 30 : avg return = -0.9594710608092523  w = [-0.06638375109170443, 0.10603589317257232, -0.6315931486653269]  σ = 0.8182264790736555
Epoch 40 : avg return = -0.5878155531978437  w = [-0.07025932327851005, 0.06975946554922369, -0.7628291674318786]  σ = 0.6685506059264261
Final policy weights: [-0.07025932327851005, 0.06975946554922369, -0.7628291674318786]
Final σ: 0.6685506059264261
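Stripped of Maxima bookkeeping, the per-batch update performed in the loop above reduces to a few lines. A Python sketch (illustrative names, not notebook code), taking per-episode lists of returns $G_t$ and score vectors:

```python
def reinforce_step(w, episodes, alpha, baseline):
    """One REINFORCE update: w <- w + alpha/N * sum_ep sum_t (G_t - b) * score_t.

    episodes: list of (returns, scores) pairs, one per episode, where
    returns[t] is G_t and scores[t] is the score vector at step t.
    """
    grad = [0.0] * len(w)
    for returns, scores in episodes:
        for G, s in zip(returns, scores):
            adv = G - baseline  # baseline shifts, but does not bias, the gradient
            for i in range(len(w)):
                grad[i] += adv * s[i]
    n = len(episodes)
    return [wi + alpha / n * gi for wi, gi in zip(w, grad)]
```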
4. Learning Curve¶
/* Learning curve */
epoch_list : makelist(i, i, 1, n_epochs)$
ax_draw2d(
line_width = 2,
color = royalblue,
name = "Avg return",
lines(epoch_list, return_history),
title = "REINFORCE Learning Curve",
xlabel = "Epoch",
ylabel = "Average Return",
grid = true
)$
5. Evaluate the Learned Policy¶
Simulate the trained deterministic policy ($\sigma = 0$, use the mean action) from a challenging initial angle.
/* Simulate with learned deterministic policy */
test_theta0 : 0.4$ /* ~23 degrees */
test_omega0 : 0.0$
theta_v : test_theta0$
omega_v : test_omega0$
theta_hist : [theta_v]$
omega_hist : [omega_v]$
u_hist : []$
time_hist : [0.0]$
for k : 1 thru horizon do (
mu_val : float(subst([w1=ww[1], w2=ww[2], w3=ww[3],
th=theta_v, om=omega_v], mu_expr)),
u_act : max(-u_max, min(u_max, float(mu_val))),
u_hist : endcons(u_act, u_hist),
omega_new : omega_v + dt_sim * (-gv/lp * sin(theta_v) - bp*omega_v
+ u_act/(mp*lp^2)),
theta_v : theta_v + dt_sim * omega_new,
omega_v : omega_new,
theta_hist : endcons(theta_v, theta_hist),
omega_hist : endcons(omega_v, omega_hist),
time_hist : endcons(k * dt_sim, time_hist)
)$
/* Convert to degrees for readability */
theta_deg : map(lambda([x], float(x * 180 / %pi)), theta_hist)$
ax_multi(2, 1,
ax_draw2d(
line_width = 2,
color = royalblue,
name = "θ (deg)",
lines(time_hist, theta_deg),
color = tomato,
dash = "dash",
name = "ω (rad/s)",
lines(time_hist, omega_hist),
title = sconcat("Learned Policy (θ₀ = ", test_theta0, " rad)"),
xlabel = "Time (s)",
ylabel = "State",
grid = true
),
ax_draw2d(
line_width = 2,
color = forestgreen,
lines(rest(time_hist), u_hist),
title = "Control Effort",
xlabel = "Time (s)",
ylabel = "Torque u",
grid = true
)
)$
6. Phase Portrait: Learned vs Uncontrolled¶
Compare the learned policy's trajectory with free (uncontrolled) swinging from the same initial condition.
/* Uncontrolled trajectory for comparison */
theta_v : test_theta0$
omega_v : test_omega0$
free_theta : [theta_v]$
free_omega : [omega_v]$
for k : 1 thru horizon do (
omega_new : omega_v + dt_sim * (-gv/lp * sin(theta_v) - bp*omega_v),
theta_v : theta_v + dt_sim * omega_new,
omega_v : omega_new,
free_theta : endcons(theta_v, free_theta),
free_omega : endcons(omega_v, free_omega)
)$
ax_draw2d(
line_width = 2,
color = royalblue,
name = "Learned policy",
lines(theta_hist, omega_hist),
color = tomato,
dash = "dash",
name = "Uncontrolled",
lines(free_theta, free_omega),
marker_symbol = "triangle-up-open",
marker_size = 10,
color = black,
name = "Start",
points([[test_theta0, test_omega0]]),
marker_symbol = "circle",
color = forestgreen,
name = "Goal",
points([[0, 0]]),
title = "Phase Portrait: θ vs ω",
xlabel = "θ (rad)",
ylabel = "ω (rad/s)",
grid = true
)$
Key Takeaways¶
- Symbolic score function: Maxima's diff() computes $\nabla_w \log\pi$ exactly — no backpropagation, no finite differences, no autodiff framework
- REINFORCE: the simplest policy gradient method — sample trajectories, weight gradients by returns, update
- Baseline reduces variance: subtracting a running-average baseline from returns stabilizes learning without introducing bias
- np_discount: computes discounted returns $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ in a single reverse pass
- Maxima's sweet spot: small parametric policies where exact symbolic gradients outweigh the overhead of a full deep learning stack
For harder problems (swing-up from arbitrary angles), combine with the Cross-Entropy Method for initialization or use np_minimize with np_compile_gradient for deterministic policy optimization.