REINFORCE: Policy Gradient with Symbolic Differentiation¶

The REINFORCE algorithm (Williams, 1992) learns a stochastic policy by gradient ascent on expected return:

$$\nabla_w J(w) = \mathbb{E}\bigg[\sum_t G_t \,\nabla_w \log \pi(a_t \mid s_t; w)\bigg]$$

where $G_t$ is the discounted return from step $t$.

Maxima's unique advantage: we define the policy symbolically and use diff() to compute the exact score function $\nabla_w \log \pi$ — no backpropagation framework needed.

In [25]:
load("numerics")$
load("ax-plots")$

1. Inverted Pendulum Stabilization¶

Simple pendulum with torque input $u$, discretized with semi-implicit Euler (the angle update uses the freshly computed velocity $\omega_{k+1}$): $$\omega_{k+1} = \omega_k + \Delta t\left(-\frac{g}{l}\sin\theta_k - b\omega_k + \frac{u}{ml^2}\right), \qquad \theta_{k+1} = \theta_k + \Delta t\,\omega_{k+1}$$
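The update above can be wrapped in a one-step helper (a sketch — the training loop later inlines these same two lines; `gv`, `lp`, `mp`, `bp`, and `dt_sim` are the parameters set in the next cell):

```maxima
/* One semi-implicit Euler step of the pendulum (sketch).
   Returns [theta_next, omega_next]. */
pend_step(th_k, om_k, u_k) := block(
  [om_n],
  om_n : om_k + dt_sim * (-gv/lp * sin(th_k) - bp*om_k + u_k/(mp*lp^2)),
  [th_k + dt_sim * om_n, om_n]
)$
```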

Goal: stabilize near upright ($\theta = 0$) from a random initial tilt.

Reward: $r_k = -(\theta_k^2 + 0.1\,\omega_k^2 + 0.01\,u_k^2)$ — dense quadratic cost encouraging the pendulum to stay upright with minimal effort.

In [26]:
/* Pendulum parameters */
gv : 9.81$  lp : 1.0$  mp : 1.0$  bp : 0.1$
dt_sim : 0.05$
horizon : 80$  /* Steps per episode (4 seconds) */
u_max : 5.0$  /* Torque clamp */

print("Pendulum: g =", gv, " l =", lp, " b =", bp)$
print("Episode: ", horizon, "steps x", dt_sim, "s =", horizon * dt_sim, "s")$
Pendulum: g = 9.81  l = 1.0  b = 0.1
Episode:  80 steps x 0.05 s = 4.0 s

2. Symbolic Policy and Score Function¶

Gaussian policy with linear mean: $$\pi(u \mid \theta, \omega;\, w, \sigma) = \mathcal{N}\!\left(\mu(\theta,\omega;w),\, \sigma^2\right), \qquad \mu = w_1 \sin\theta + w_2 \cos\theta + w_3\,\omega$$

The score function $\nabla_w \log\pi$ is computed symbolically using diff(): $$\frac{\partial \log \pi}{\partial w_i} = \frac{u - \mu}{\sigma^2} \cdot \frac{\partial \mu}{\partial w_i}$$
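Before wiring this into the training loop, the identity can be spot-checked symbolically (a throwaway sketch; `mu_chk` and `lp_chk` are scratch names not used elsewhere):

```maxima
/* Sketch: verify diff() reproduces the closed-form score for w1 */
mu_chk : w1*sin(th) + w2*cos(th) + w3*om$
lp_chk : -(u_sym - mu_chk)^2 / (2*sig^2) - log(sig)$
is(equal(diff(lp_chk, w1),
         (u_sym - mu_chk)/sig^2 * diff(mu_chk, w1)));
/* => true */
```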

In [27]:
/* Symbolic policy mean */
mu_expr : w1 * sin(th) + w2 * cos(th) + w3 * om$

/* Symbolic log-probability (up to constant) */
log_pi : -(u_sym - mu_expr)^2 / (2 * sig^2) - log(sig)$

/* Exact score functions via Maxima's symbolic diff() */
score_exprs : [diff(log_pi, w1), diff(log_pi, w2), diff(log_pi, w3)]$

print("Policy mean: μ =", mu_expr)$
print("Score ∂log π/∂w₁ =", score_exprs[1])$
print("Score ∂log π/∂w₂ =", score_exprs[2])$
print("Score ∂log π/∂w₃ =", score_exprs[3])$
Policy mean: μ = om*w3+cos(th)*w2+sin(th)*w1
Score ∂log π/∂w₁ = (sin(th)*(-(om*w3)-cos(th)*w2-sin(th)*w1+u_sym))/sig^2
Score ∂log π/∂w₂ = (cos(th)*(-(om*w3)-cos(th)*w2-sin(th)*w1+u_sym))/sig^2
Score ∂log π/∂w₃ = (om*(-(om*w3)-cos(th)*w2-sin(th)*w1+u_sym))/sig^2

3. REINFORCE Training¶

For each batch of episodes:

  1. Roll out trajectories with the stochastic policy
  2. Compute discounted returns $G_t$ using np_discount
  3. Evaluate the symbolic score function at each $(s_t, a_t)$
  4. Update: $w \leftarrow w + \alpha \cdot \frac{1}{N}\sum_{\text{episodes}} \sum_t (G_t - b)\,\nabla_w \log\pi(a_t \mid s_t)$

A baseline $b$ (running average of returns) reduces gradient variance.
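`np_discount` (from the `numerics` package loaded above) computes the returns in one reverse pass. As a sketch over a plain Maxima list, the recursion $G_t = r_t + \gamma\,G_{t+1}$ looks like:

```maxima
/* Sketch of what np_discount does, over a plain list:
   walk the rewards backwards, carrying the running return. */
discount_list(rs, gamma) := block(
  [g : 0.0, out : []],
  for r in reverse(rs) do (
    g : r + gamma * g,
    out : cons(g, out)
  ),
  out
)$

discount_list([1, 1, 1], 0.5);
/* => [1.75, 1.5, 1.0] */
```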

In [28]:
np_seed(42)$

/* Policy parameters */
ww : [0.0, 0.0, 0.0]$
sigma_pol : 1.5$

/* Hyperparameters */
alpha_lr : 0.01$
disc : 0.97$
n_epochs : 40$
n_batch : 5$
sigma_decay : 0.98$
sigma_min : 0.3$

/* Tracking */
return_history : []$
baseline : 0.0$
beta_bl : 0.9$  /* Baseline smoothing */

for epoch : 1 thru n_epochs do (
  batch_grad : [0.0, 0.0, 0.0],
  batch_ret : 0.0,

  for b : 1 thru n_batch do (
    /* Random initial state near upright */
    theta_v : 0.4 * float(random(2.0) - 1.0),
    omega_v : 0.2 * float(random(2.0) - 1.0),

    /* Pre-generate noise for episode */
    noise_arr : np_randn(horizon),

    /* Collect trajectory: rewards and score vectors */
    rew_arr : np_zeros(horizon),
    score_list : [],

    for k : 0 thru horizon - 1 do (
      /* Evaluate policy mean numerically */
      mu_val : float(subst([w1=ww[1], w2=ww[2], w3=ww[3],
                            th=theta_v, om=omega_v], mu_expr)),

      /* Sample action from N(mu, sigma^2) */
      u_raw : mu_val + sigma_pol * np_ref(noise_arr, k),
      u_act : max(-u_max, min(u_max, float(u_raw))),

      /* Record reward */
      rwd : -(theta_v^2 + 0.1*omega_v^2 + 0.01*u_act^2),
      np_set(rew_arr, k, float(rwd)),

      /* Evaluate symbolic score function at current (s, a, w) */
      sv : float(subst([w1=ww[1], w2=ww[2], w3=ww[3],
                        th=theta_v, om=omega_v, u_sym=u_act, sig=sigma_pol],
                       score_exprs)),
      score_list : endcons(sv, score_list),

      /* Euler step */
      omega_new : omega_v + dt_sim * (-gv/lp * sin(theta_v) - bp*omega_v
                                       + u_act/(mp*lp^2)),
      theta_v : theta_v + dt_sim * omega_new,
      omega_v : omega_new
    ),

    /* Discounted returns via np_discount */
    returns_arr : np_discount(rew_arr, disc),
    ep_return : np_ref(returns_arr, 0),

    /* Accumulate policy gradient with baseline */
    for k : 1 thru horizon do (
      gt : np_ref(returns_arr, k - 1) - baseline,
      batch_grad : batch_grad + gt * score_list[k]
    ),

    batch_ret : batch_ret + ep_return
  ),

  /* Update policy weights */
  ww : float(ww + (alpha_lr / n_batch) * batch_grad),

  /* Update baseline and exploration */
  avg_ret : batch_ret / n_batch,
  baseline : beta_bl * baseline + (1 - beta_bl) * avg_ret,
  sigma_pol : max(sigma_min, sigma_pol * sigma_decay),

  return_history : endcons(avg_ret, return_history),

  if mod(epoch, 10) = 0 then
    print("Epoch", epoch, ": avg return =", avg_ret, " w =", ww, " σ =", sigma_pol)
)$

print("Final policy weights:", ww)$  /* Maxima's print() does not interpret "\n" */
print("Final σ:", sigma_pol)$
Epoch 10 : avg return = -1.1228499119029394  w =
     [-0.06962357795332622,-0.03471576530016185,-0.24400252473946557]  σ =
     1.22560921033132
Epoch 20 : avg return = -1.9153370442159179  w =
     [-0.0551354254757563,0.07488989076414751,-0.47091682914399796]  σ =
     1.0014119576326417
Epoch 30 : avg return = -0.9594710608092523  w =
     [-0.06638375109170443,0.10603589317257232,-0.6315931486653269]  σ =
     0.8182264790736555
Epoch 40 : avg return = -0.5878155531978437  w =
     [-0.07025932327851005,0.06975946554922369,-0.7628291674318786]  σ =
     0.6685506059264261
Final policy weights:
                      [-0.07025932327851005,0.06975946554922369,
                       -0.7628291674318786]
Final σ: 0.6685506059264261

4. Learning Curve¶

In [29]:
/* Learning curve */
epoch_list : makelist(i, i, 1, n_epochs)$
ax_draw2d(
  line_width = 2,
  color = royalblue,
  name = "Avg return",
  lines(epoch_list, return_history),
  title = "REINFORCE Learning Curve",
  xlabel = "Epoch",
  ylabel = "Average Return",
  grid = true
)$
[Figure: REINFORCE learning curve — average return vs. epoch]

5. Evaluate the Learned Policy¶

Simulate the trained deterministic policy ($\sigma = 0$, use the mean action) from a challenging initial angle.

In [30]:
/* Simulate with learned deterministic policy */
test_theta0 : 0.4$  /* ~23 degrees */
test_omega0 : 0.0$

theta_v : test_theta0$
omega_v : test_omega0$

theta_hist : [theta_v]$
omega_hist : [omega_v]$
u_hist : []$
time_hist : [0.0]$

for k : 1 thru horizon do (
  mu_val : float(subst([w1=ww[1], w2=ww[2], w3=ww[3],
                        th=theta_v, om=omega_v], mu_expr)),
  u_act : max(-u_max, min(u_max, float(mu_val))),
  u_hist : endcons(u_act, u_hist),

  omega_new : omega_v + dt_sim * (-gv/lp * sin(theta_v) - bp*omega_v
                                   + u_act/(mp*lp^2)),
  theta_v : theta_v + dt_sim * omega_new,
  omega_v : omega_new,

  theta_hist : endcons(theta_v, theta_hist),
  omega_hist : endcons(omega_v, omega_hist),
  time_hist : endcons(k * dt_sim, time_hist)
)$

/* Convert to degrees for readability */
theta_deg : map(lambda([x], float(x * 180 / %pi)), theta_hist)$

ax_multi(2, 1,
  ax_draw2d(
    line_width = 2,
    color = royalblue,
    name = "θ (deg)",
    lines(time_hist, theta_deg),
    color = tomato,
    dash = "dash",
    name = "ω (rad/s)",
    lines(time_hist, omega_hist),
    title = sconcat("Learned Policy (θ₀ = ", test_theta0, " rad)"),
    xlabel = "Time (s)",
    ylabel = "State",
    grid = true
  ),
  ax_draw2d(
    line_width = 2,
    color = forestgreen,
    lines(rest(time_hist), u_hist),
    title = "Control Effort",
    xlabel = "Time (s)",
    ylabel = "Torque u",
    grid = true
  )
)$
[Figure: learned-policy rollout — θ and ω traces, and control torque vs. time]

6. Phase Portrait: Learned vs Uncontrolled¶

Compare the learned policy's trajectory with free (uncontrolled) swinging from the same initial condition.

In [31]:
/* Uncontrolled trajectory for comparison */
theta_v : test_theta0$
omega_v : test_omega0$
free_theta : [theta_v]$
free_omega : [omega_v]$

for k : 1 thru horizon do (
  omega_new : omega_v + dt_sim * (-gv/lp * sin(theta_v) - bp*omega_v),
  theta_v : theta_v + dt_sim * omega_new,
  omega_v : omega_new,
  free_theta : endcons(theta_v, free_theta),
  free_omega : endcons(omega_v, free_omega)
)$

ax_draw2d(
  line_width = 2,
  color = royalblue,
  name = "Learned policy",
  lines(theta_hist, omega_hist),
  color = tomato,
  dash = "dash",
  name = "Uncontrolled",
  lines(free_theta, free_omega),
  marker_symbol = "triangle-up-open",
  marker_size = 10,
  color = black,
  name = "Start",
  points([[test_theta0, test_omega0]]),
  marker_symbol = "circle",
  color = forestgreen,
  name = "Goal",
  points([[0, 0]]),
  title = "Phase Portrait: θ vs ω",
  xlabel = "θ (rad)",
  ylabel = "ω (rad/s)",
  grid = true
)$
[Figure: phase portrait — learned vs. uncontrolled trajectories in the (θ, ω) plane]

Key Takeaways¶

  • Symbolic score function: Maxima's diff() computes $\nabla_w \log\pi$ exactly — no backpropagation, no finite differences, no autodiff framework
  • REINFORCE: the simplest policy gradient method — sample trajectories, weight gradients by returns, update
  • Baseline reduces variance: subtracting a running-average baseline from returns stabilizes learning without introducing bias
  • np_discount: computes discounted returns $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ in a single reverse pass
  • Maxima's sweet spot: small parametric policies where exact symbolic gradients outweigh the overhead of a full deep learning stack

For harder problems (swing-up from arbitrary angles), combine with the Cross-Entropy Method for initialization or use np_minimize with np_compile_gradient for deterministic policy optimization.