Tabular Q-Learning: Grid World¶

Q-learning is one of the simplest model-free RL algorithms. The agent learns the value of each state-action pair by exploring an environment and updating a table:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$

No neural networks, no gradients — just a table, some random exploration, and the Bellman equation.
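The update is one line of arithmetic. As a language-neutral sketch (plain Python rather than the Maxima used below; `q_update` and its table layout are illustrative, not part of the notebook):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step; Q is a list of per-state action-value rows."""
    td_target = r + gamma * max(Q[s_next])     # bootstrap from the best next action
    Q[s][a] += alpha * (td_target - Q[s][a])   # move Q(s,a) toward the target

# A zero-initialized 2-state, 2-action table; one rewarding transition
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, 0, 1, 10.0, 1)
print(Q[0][1])   # 0.1 * (10.0 + 0.95 * 0.0 - 0.0) = 1.0
```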

In [8]:
load("numerics")$
load("ax-plots")$

1. Grid World Environment¶

A 5×5 grid with:

  • Start: top-left (0,0)
  • Goal: bottom-right (4,4) — reward +10
  • Walls: 3 blocked cells the agent cannot enter
  • Step cost: -0.1 per move (encourages short paths)
  • Actions: up (0), right (1), down (2), left (3)

States are numbered 0–24 (row-major: state = 5*row + col).
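Row-major numbering makes both conversions one-liners; in Python terms (an illustrative sketch mirroring the Maxima helpers defined in the next cell):

```python
def state_to_rc(s, cols=5):
    return divmod(s, cols)       # (row, col) from a row-major state index

def rc_to_state(r, c, cols=5):
    return r * cols + c

print(state_to_rc(13))     # state 13 -> (2, 3)
print(rc_to_state(4, 4))   # goal cell -> state 24
```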

In [9]:
/* Grid dimensions */
grid_rows : 5$  grid_cols : 5$
n_states : grid_rows * grid_cols$
n_actions : 4$

/* State <-> (row, col) conversion (0-indexed) */
state_to_rc(s) := [floor(s / grid_cols), mod(s, grid_cols)]$
rc_to_state(r, c) := r * grid_cols + c$

/* Walls: set of blocked states */
wall_states : setify([rc_to_state(1,1), rc_to_state(2,3), rc_to_state(3,1)])$

/* Goal state */
goal_state : rc_to_state(4, 4)$

/* Action deltas: up, right, down, left */
dr : [-1, 0, 1, 0]$
dc : [0, 1, 0, -1]$
action_names : ["up", "right", "down", "left"]$

/* Step function: (state, action) -> (next_state, reward, finished) */
env_step(s, a) := block([rc, r, c, nr, nc, ns, reward, finished],
  rc : state_to_rc(s),
  r : rc[1],  c : rc[2],
  nr : r + dr[a+1],  nc : c + dc[a+1],
  /* Clip to grid boundaries */
  nr : max(0, min(grid_rows - 1, nr)),
  nc : max(0, min(grid_cols - 1, nc)),
  ns : rc_to_state(nr, nc),
  /* Bounce off walls */
  if elementp(ns, wall_states) then ns : s,
  /* Reward */
  finished : is(ns = goal_state),
  reward : if finished then 10.0 else -0.1,
  [ns, reward, finished]
)$

print("Grid world: 5x5, goal at (4,4), 3 walls")$
print("Walls at:", listify(wall_states))$
Grid world: 5x5, goal at (4,4), 3 walls
Walls at: [6,13,16]
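The same step logic in Python, for reference (an illustrative re-implementation using the wall states [6, 13, 16] printed above): clip the move to the grid, bounce off walls by staying put, then pay the step cost unless the goal is reached.

```python
WALLS = {6, 13, 16}                        # blocked states, as printed above
GOAL = 24                                  # (4, 4) in row-major numbering
DR = [-1, 0, 1, 0]                         # action deltas: up, right, down, left
DC = [0, 1, 0, -1]

def env_step(s, a, rows=5, cols=5):
    r, c = divmod(s, cols)
    nr = min(rows - 1, max(0, r + DR[a]))  # clip to grid boundaries
    nc = min(cols - 1, max(0, c + DC[a]))
    ns = nr * cols + nc
    if ns in WALLS:                        # bounce: the move is wasted
        ns = s
    done = ns == GOAL
    return ns, (10.0 if done else -0.1), done

print(env_step(0, 0))    # up from (0,0) clips to the edge: (0, -0.1, False)
print(env_step(5, 1))    # right from (1,0) hits wall 6:   (5, -0.1, False)
print(env_step(23, 1))   # right from (4,3) reaches goal:  (24, 10.0, True)
```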

2. Q-Learning¶

Epsilon-greedy exploration: with probability $\epsilon$ take a random action, otherwise take the greedy action $\arg\max_a Q(s,a)$.
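In Python terms (an illustrative sketch; `q_row` stands for one state's row of the Q-table):

```python
import random

def select_action(q_row, eps):
    """Epsilon-greedy: explore with probability eps, else take the argmax."""
    if random.random() < eps:
        return random.randrange(len(q_row))               # uniform random action
    return max(range(len(q_row)), key=q_row.__getitem__)  # greedy argmax

print(select_action([0.1, 0.9, 0.2, 0.0], 0.0))  # eps = 0 always exploits -> 1
```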

In [10]:
np_seed(42)$

/* Hyperparameters */
alpha : 0.1$        /* learning rate */
gamma_ql : 0.95$    /* discount factor */
epsilon0 : 1.0$     /* initial exploration rate */
epsilon_min : 0.05$ /* minimum exploration */
epsilon_decay : 0.995$  /* decay per episode */
n_episodes : 500$
max_steps : 100$

/* Q-table: n_states x n_actions, initialized to zeros */
Q : np_zeros([n_states, n_actions])$

/* Epsilon-greedy action selection */
select_action(s, eps) := block([r],
  r : np_ref(np_rand([1]), 0),
  if r < eps then
    /* Random action */
    floor(np_ref(np_rand([1]), 0) * n_actions)
  else
    /* Greedy: argmax over actions for state s */
    block([best_a, best_q, q_val],
      best_a : 0, best_q : np_ref(Q, s, 0),
      for a : 1 thru n_actions - 1 do (
        q_val : np_ref(Q, s, a),
        if q_val > best_q then (best_a : a, best_q : q_val)
      ),
      best_a)
)$

/* Training loop */
episode_rewards : []$
episode_lengths : []$
epsilon : epsilon0$

for ep : 1 thru n_episodes do block(
  [s, total_reward, steps, finished, a, result, s_next, reward,
   max_q_next, q_current],
  s : 0,  /* start state */
  total_reward : 0.0,
  steps : 0,
  finished : false,

  while not finished and steps < max_steps do (
    a : select_action(s, epsilon),
    result : env_step(s, a),
    s_next : result[1],
    reward : result[2],
    finished : result[3],

    /* Q-learning update */
    max_q_next : np_ref(Q, s_next, 0),
    for a2 : 1 thru n_actions - 1 do
      max_q_next : max(max_q_next, np_ref(Q, s_next, a2)),

    q_current : np_ref(Q, s, a),
    np_set(Q, s, a, q_current + alpha * (reward + gamma_ql * max_q_next - q_current)),

    s : s_next,
    total_reward : total_reward + reward,
    steps : steps + 1
  ),

  episode_rewards : endcons(total_reward, episode_rewards),
  episode_lengths : endcons(steps, episode_lengths),
  epsilon : max(epsilon_min, epsilon * epsilon_decay),

  if mod(ep, 100) = 0 then
    print("Episode", ep, ": reward =", total_reward,
          "steps =", steps, "eps =",
          float(round(epsilon * 1000) / 1000))
)$
Episode 100 : reward = 8.0 steps = 21 eps = 0.606
Episode 200 : reward = 8.7 steps = 14 eps = 0.367
Episode 300 : reward = 8.5 steps = 16 eps = 0.222
Episode 400 : reward = 9.2 steps = 9 eps = 0.135
Episode 500 : reward = 9.1 steps = 10 eps = 0.082
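A side calculation explains the epsilon column: with a multiplicative decay of 0.995 per episode, epsilon after 500 episodes is 0.995^500 ≈ 0.082, and the 0.05 floor is not reached until roughly episode 598 (plain arithmetic on the hyperparameters above):

```python
import math

eps0, decay, eps_min = 1.0, 0.995, 0.05
print(eps0 * decay ** 500)                                    # ~0.0816, matching the log
print(math.ceil(math.log(eps_min / eps0) / math.log(decay)))  # first episode at the floor: 598
```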
In [11]:
/* Smooth the reward curve with a running average */
window : 20$
smoothed : makelist(
  lsum(x, x, makelist(episode_rewards[j],
    j, max(1, i - window + 1), i)) / min(i, window),
  i, 1, n_episodes)$

ax_draw2d(
  color = "#ccc",
  name = "Raw",
  points(makelist(i, i, 1, n_episodes), episode_rewards),
  line_width = 2,
  color = royalblue,
  name = "Smoothed (20-ep)",
  lines(makelist(i, i, 1, n_episodes), smoothed),
  title = "Q-Learning: Episode Reward",
  xlabel = "Episode",
  ylabel = "Total Reward",
  grid = true
)$
[Figure: Q-learning episode reward, raw points and 20-episode running average]
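The trailing-window average used above, as a standalone sketch (Python; the window is clipped at the start so early episodes average over fewer points, matching the `min(i, window)` divisor in the Maxima cell):

```python
def running_mean(xs, window=20):
    out = []
    for i in range(1, len(xs) + 1):
        lo = max(0, i - window)           # trailing window, clipped at the start
        out.append(sum(xs[lo:i]) / (i - lo))
    return out

print(running_mean([1, 2, 3, 4], window=2))   # [1.0, 1.5, 2.5, 3.5]
```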

3. Learned Policy Visualization¶

Extract the greedy policy and value function from the Q-table.
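Extracting the greedy policy is an argmax per row of the Q-table. A compact sketch (Python; the two-state table is made up for illustration):

```python
def greedy_policy(Q, arrows="^>v<"):
    """One arrow per state: the argmax action of each Q-row."""
    return [arrows[max(range(len(row)), key=row.__getitem__)] for row in Q]

# State 0 prefers 'right', state 1 prefers 'down'
print(greedy_policy([[0.0, 1.0, 0.5, 0.2], [0.1, 0.3, 0.9, 0.0]]))  # ['>', 'v']
```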

In [12]:
/* Extract greedy policy and state values */
arrow_symbols : ["^", ">", "v", "<"]$

print("=== Learned Policy ===")$
for r : 0 thru grid_rows - 1 do block([row_str],
  row_str : "",
  for c : 0 thru grid_cols - 1 do block([s, best_a, best_q, q_val],
    s : rc_to_state(r, c),
    if elementp(s, wall_states) then
      row_str : sconcat(row_str, "  # ")
    elseif s = goal_state then
      row_str : sconcat(row_str, "  G ")
    else (
      best_a : 0,  best_q : np_ref(Q, s, 0),
      for a : 1 thru n_actions - 1 do (
        q_val : np_ref(Q, s, a),
        if q_val > best_q then (best_a : a, best_q : q_val)
      ),
      row_str : sconcat(row_str, "  ", arrow_symbols[best_a + 1], " ")
    )
  ),
  print(row_str)
)$

print("")$
print("=== State Values V(s) = max_a Q(s,a) ===")$
for r : 0 thru grid_rows - 1 do block([row_str],
  row_str : "",
  for c : 0 thru grid_cols - 1 do block([s, v],
    s : rc_to_state(r, c),
    if elementp(s, wall_states) then
      row_str : sconcat(row_str, "  ###  ")
    else (
      v : np_ref(Q, s, 0),
      for a : 1 thru n_actions - 1 do v : max(v, np_ref(Q, s, a)),
      row_str : sconcat(row_str, printf(false, "~6,2f ", float(v)))
    )
  ),
  print(row_str)
)$
=== Learned Policy ===
  v   <   >   >   v
  v   #   v   >   v
  v   >   v   #   v
  v   #   v   >   v
  >   >   >   >   G
=== State Values V(s) = max_a Q(s,a) ===
  6.38   5.34   2.00   3.89   5.84
  6.82   ###    4.22   4.39   7.73
  7.29   7.47   8.21   ###    9.19
  7.77   ###    8.83   9.37  10.00
  8.29   8.83   9.40  10.00   0.00
In [13]:
/* Visualize value function as a bar chart per row */
v_list : []$
state_labels : []$
bar_colors : []$
for r : 0 thru grid_rows - 1 do
  for c : 0 thru grid_cols - 1 do block([s, v],
    s : rc_to_state(r, c),
    if elementp(s, wall_states) then (
      v_list : endcons(0, v_list),
      bar_colors : endcons("#333", bar_colors)
    ) else (
      v : np_ref(Q, s, 0),
      for a : 1 thru n_actions - 1 do v : max(v, np_ref(Q, s, a)),
      v_list : endcons(float(v), v_list),
      if s = goal_state then bar_colors : endcons("forestgreen", bar_colors)
      else bar_colors : endcons("royalblue", bar_colors)
    ),
    state_labels : endcons(sconcat("(", r, ",", c, ")"), state_labels)
  )$

ax_bar(state_labels, v_list,
  title = "State Value Function V(s)",
  ylabel = "V(s)",
  bar_color = bar_colors
)$
In [14]:
/* Run the greedy policy and show the path */
path : [0]$
s : 0$
/* 'unless' ends the loop once the goal is reached; a return inside
   block would only exit the block body, not the for loop */
for step_i : 1 thru 20 unless s = goal_state do block([best_a, best_q, q_val, result],
  best_a : 0,  best_q : np_ref(Q, s, 0),
  for a : 1 thru n_actions - 1 do (
    q_val : np_ref(Q, s, a),
    if q_val > best_q then (best_a : a, best_q : q_val)
  ),
  result : env_step(s, best_a),
  s : result[1],
  path : endcons(s, path)
)$

print("Greedy path (states):", path)$
print("Path (coords):", map(state_to_rc, path))$
print("Path length:", length(path) - 1, "steps")$
Greedy path (states): [0,5,10,15,20,21,22,23,24]
Path (coords): [[0,0],[1,0],[2,0],[3,0],[4,0],[4,1],[4,2],[4,3],[4,4]]
Path length: 8 steps

Key Points¶

  • No gradients: Q-learning uses only table lookups and the Bellman equation
  • Epsilon-greedy: balances exploration (random) vs exploitation (greedy)
  • Convergence: provided every state-action pair is tried infinitely often and the learning rate decays suitably, tabular Q converges to the optimal Q*
  • Limitation: state space must be discrete and small — for continuous states, use function approximation (linear or neural)