Tabular Q-Learning: Grid World¶
Q-learning is one of the simplest model-free RL algorithms. An agent learns the value of state-action pairs by exploring an environment and updating a table with the temporal-difference rule (α is the learning rate, γ the discount factor):
$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$
No neural networks, no gradients — just a table, some random exploration, and the Bellman equation.
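As a sketch in Python/NumPy (the notebook itself uses Maxima with numpy-style bindings), a single application of this update rule looks like:

```python
import numpy as np

# One tabular Q-learning update, assuming a Q-table of shape (n_states, n_actions).
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    td_target = r + gamma * Q[s_next].max()   # bootstrap from the best next action
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((25, 4))
q_update(Q, 0, 1, -0.1, 1)   # Q[0,1] moves from 0 toward the -0.1 step cost
```

Starting from an all-zero table, this step sets Q[0,1] to 0.1 · (−0.1 + 0.95 · 0 − 0) = −0.01.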
In [8]:
load("numerics")$
load("ax-plots")$
1. Grid World Environment¶
A 5×5 grid with:
- Start: top-left (0,0)
- Goal: bottom-right (4,4) — reward +10
- Walls: 3 blocked cells the agent cannot enter
- Step cost: -0.1 per move (encourages short paths)
- Actions: up (0), right (1), down (2), left (3)
States are numbered 0–24 (row-major: state = 5*row + col).
In [9]:
/* Grid dimensions */
grid_rows : 5$ grid_cols : 5$
n_states : grid_rows * grid_cols$
n_actions : 4$
/* State <-> (row, col) conversion (0-indexed) */
state_to_rc(s) := [floor(s / grid_cols), mod(s, grid_cols)]$
rc_to_state(r, c) := r * grid_cols + c$
/* Walls: set of blocked states */
wall_states : setify([rc_to_state(1,1), rc_to_state(2,3), rc_to_state(3,1)])$
/* Goal state */
goal_state : rc_to_state(4, 4)$
/* Action deltas: up, right, down, left */
dr : [-1, 0, 1, 0]$
dc : [0, 1, 0, -1]$
action_names : ["up", "right", "down", "left"]$
/* Step function: (state, action) -> (next_state, reward, finished) */
env_step(s, a) := block([rc, r, c, nr, nc, ns, reward, finished],
rc : state_to_rc(s),
r : rc[1], c : rc[2],
nr : r + dr[a+1], nc : c + dc[a+1],
/* Clip to grid boundaries */
nr : max(0, min(grid_rows - 1, nr)),
nc : max(0, min(grid_cols - 1, nc)),
ns : rc_to_state(nr, nc),
/* Bounce off walls */
if elementp(ns, wall_states) then ns : s,
/* Reward */
finished : is(ns = goal_state),
reward : if finished then 10.0 else -0.1,
[ns, reward, finished]
)$
print("Grid world: 5x5, goal at (4,4), 3 walls")$
print("Walls at:", listify(wall_states))$
Grid world: 5x5, goal at (4,4), 3 walls
Walls at: [6,13,16]
2. Q-Learning¶
Epsilon-greedy exploration: with probability $\epsilon$ take a random action, otherwise take the greedy action $\arg\max_a Q(s,a)$.
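A hedged Python/NumPy equivalent of this selection rule (the Maxima cell below does the same with explicit loops over actions):

```python
import numpy as np

rng = np.random.default_rng(42)

def select_action(Q, s, eps):
    """Epsilon-greedy: random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # uniform random action
    return int(np.argmax(Q[s]))                # greedy action argmax_a Q[s, a]

Q = np.zeros((25, 4))
Q[0, 2] = 1.0
select_action(Q, 0, 0.0)   # eps = 0: always greedy -> action 2 ("down")
```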
In [10]:
np_seed(42)$
/* Hyperparameters */
alpha : 0.1$ /* learning rate */
gamma_ql : 0.95$ /* discount factor */
epsilon0 : 1.0$ /* initial exploration rate */
epsilon_min : 0.05$ /* minimum exploration */
epsilon_decay : 0.995$ /* decay per episode */
n_episodes : 500$
max_steps : 100$
/* Q-table: n_states x n_actions, initialized to zeros */
Q : np_zeros([n_states, n_actions])$
/* Epsilon-greedy action selection */
select_action(s, eps) := block([r],
r : np_ref(np_rand([1]), 0),
if r < eps then
/* Random action */
floor(np_ref(np_rand([1]), 0) * n_actions)
else
/* Greedy: argmax over actions for state s */
block([best_a, best_q, q_val],
best_a : 0, best_q : np_ref(Q, s, 0),
for a : 1 thru n_actions - 1 do (
q_val : np_ref(Q, s, a),
if q_val > best_q then (best_a : a, best_q : q_val)
),
best_a)
)$
/* Training loop */
episode_rewards : []$
episode_lengths : []$
epsilon : epsilon0$
for ep : 1 thru n_episodes do block(
[s, total_reward, steps, finished, a, result, s_next, reward,
max_q_next, q_current],
s : 0, /* start state */
total_reward : 0.0,
steps : 0,
finished : false,
while not finished and steps < max_steps do (
a : select_action(s, epsilon),
result : env_step(s, a),
s_next : result[1],
reward : result[2],
finished : result[3],
/* Q-learning update */
max_q_next : np_ref(Q, s_next, 0),
for a2 : 1 thru n_actions - 1 do
max_q_next : max(max_q_next, np_ref(Q, s_next, a2)),
q_current : np_ref(Q, s, a),
np_set(Q, s, a, q_current + alpha * (reward + gamma_ql * max_q_next - q_current)),
s : s_next,
total_reward : total_reward + reward,
steps : steps + 1
),
episode_rewards : endcons(total_reward, episode_rewards),
episode_lengths : endcons(steps, episode_lengths),
epsilon : max(epsilon_min, epsilon * epsilon_decay),
if mod(ep, 100) = 0 then
print("Episode", ep, ": reward =", total_reward,
"steps =", steps, "eps =",
float(round(epsilon * 1000) / 1000))
)$
Episode 100 : reward = 8.0 steps = 21 eps = 0.606
Episode 200 : reward = 8.7 steps = 14 eps = 0.367
Episode 300 : reward = 8.5 steps = 16 eps = 0.222
Episode 400 : reward = 9.2 steps = 9 eps = 0.135
Episode 500 : reward = 9.1 steps = 10 eps = 0.082
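The printed exploration rates follow directly from the decay schedule: epsilon is multiplied by 0.995 once per episode and floored at 0.05, so after n episodes it equals max(0.05, 0.995^n). A quick Python check reproduces the printout:

```python
# Epsilon decay schedule: eps_n = max(eps_min, eps0 * decay**n).
def eps_after(n, eps0=1.0, decay=0.995, eps_min=0.05):
    return max(eps_min, eps0 * decay**n)

[round(eps_after(n), 3) for n in (100, 200, 300, 400, 500)]
# -> [0.606, 0.367, 0.222, 0.135, 0.082], matching the training log
```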
In [11]:
/* Smooth the reward curve with a running average */
window : 20$
smoothed : makelist(
lsum(x, x, makelist(episode_rewards[j],
j, max(1, i - window + 1), i)) / min(i, window),
i, 1, n_episodes)$
ax_draw2d(
color = "#ccc",
name = "Raw",
points(makelist(i, i, 1, n_episodes), episode_rewards),
line_width = 2,
color = royalblue,
name = "Smoothed (20-ep)",
lines(makelist(i, i, 1, n_episodes), smoothed),
title = "Q-Learning: Episode Reward",
xlabel = "Episode",
ylabel = "Total Reward",
grid = true
)$
3. Learned Policy Visualization¶
Extract the greedy policy and value function from the Q-table.
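In NumPy-style code the per-state argmax loops used below collapse to a single vectorized call (a Python sketch; the Q-table here is a random stand-in for the learned one):

```python
import numpy as np

Q = np.random.default_rng(0).random((25, 4))   # stand-in for a learned Q-table

policy = Q.argmax(axis=1)    # greedy action per state, shape (25,)
values = Q.max(axis=1)       # V(s) = max_a Q(s, a), shape (25,)

# Map actions to arrows and lay the policy out on the 5x5 grid.
arrows = np.array(["^", ">", "v", "<"])[policy].reshape(5, 5)
```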
In [12]:
/* Extract greedy policy and state values */
arrow_symbols : ["^", ">", "v", "<"]$
print("=== Learned Policy ===")$
for r : 0 thru grid_rows - 1 do block([row_str],
row_str : "",
for c : 0 thru grid_cols - 1 do block([s, best_a, best_q, q_val],
s : rc_to_state(r, c),
if elementp(s, wall_states) then
row_str : sconcat(row_str, " # ")
elseif s = goal_state then
row_str : sconcat(row_str, " G ")
else (
best_a : 0, best_q : np_ref(Q, s, 0),
for a : 1 thru n_actions - 1 do (
q_val : np_ref(Q, s, a),
if q_val > best_q then (best_a : a, best_q : q_val)
),
row_str : sconcat(row_str, " ", arrow_symbols[best_a + 1], " ")
)
),
print(row_str)
)$
print("")$
print("=== State Values V(s) = max_a Q(s,a) ===")$
for r : 0 thru grid_rows - 1 do block([row_str],
row_str : "",
for c : 0 thru grid_cols - 1 do block([s, v],
s : rc_to_state(r, c),
if elementp(s, wall_states) then
row_str : sconcat(row_str, " ### ")
else (
v : np_ref(Q, s, 0),
for a : 1 thru n_actions - 1 do v : max(v, np_ref(Q, s, a)),
row_str : sconcat(row_str, printf(false, "~6,2f ", float(v)))
)
),
print(row_str)
)$
=== Learned Policy ===
 v  <  >  >  v
 v  #  v  >  v
 v  >  v  #  v
 v  #  v  >  v
 >  >  >  >  G

=== State Values V(s) = max_a Q(s,a) ===
  6.38   5.34   2.00   3.89   5.84
  6.82    ###   4.22   4.39   7.73
  7.29   7.47   8.21    ###   9.19
  7.77    ###   8.83   9.37  10.00
  8.29   8.83   9.40  10.00   0.00
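The learned value of the start state, V(0) ≈ 6.38, can be cross-checked by hand: the shortest route to the goal takes 8 moves (seven −0.1 step costs, then the +10 goal reward), so the discounted return is

```python
# Discounted return of an optimal 8-step path from (0,0):
# seven -0.1 step costs at t = 0..6, then +10 at t = 7.
gamma = 0.95
G = sum(gamma**t * (-0.1) for t in range(7)) + gamma**7 * 10
round(G, 2)   # -> 6.38, matching the learned V(0) above
```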
In [13]:
/* Visualize value function as a bar chart per row */
v_list : []$
state_labels : []$
bar_colors : []$
for r : 0 thru grid_rows - 1 do
for c : 0 thru grid_cols - 1 do block([s, v],
s : rc_to_state(r, c),
if elementp(s, wall_states) then (
v_list : endcons(0, v_list),
bar_colors : endcons("#333", bar_colors)
) else (
v : np_ref(Q, s, 0),
for a : 1 thru n_actions - 1 do v : max(v, np_ref(Q, s, a)),
v_list : endcons(float(v), v_list),
if s = goal_state then bar_colors : endcons("forestgreen", bar_colors)
else bar_colors : endcons("royalblue", bar_colors)
),
state_labels : endcons(sconcat("(", r, ",", c, ")"), state_labels)
)$
ax_bar(state_labels, v_list,
title = "State Value Function V(s)",
ylabel = "V(s)",
bar_color = bar_colors
)$
In [14]:
/* Run the greedy policy and show the path */
path : [0]$
s : 0$
/* Note: return() inside block() only exits the block, not the for loop,
   so we stop via a while clause instead */
for step_i : 1 thru 20 while s # goal_state do block([best_a, best_q, q_val, result],
best_a : 0, best_q : np_ref(Q, s, 0),
for a : 1 thru n_actions - 1 do (
q_val : np_ref(Q, s, a),
if q_val > best_q then (best_a : a, best_q : q_val)
),
result : env_step(s, best_a),
s : result[1],
path : endcons(s, path)
)$
print("Greedy path (states):", path)$
print("Path (coords):", map(state_to_rc, path))$
print("Path length:", length(path) - 1, "steps")$
Greedy path (states): [0,5,10,15,20,21,22,23,24]
Path (coords): [[0,0],[1,0],[2,0],[3,0],[4,0],[4,1],[4,2],[4,3],[4,4]]
Path length: 8 steps
Key Points¶
- No gradients: Q-learning uses only table lookups and the Bellman equation
- Epsilon-greedy: balances exploration (random) vs exploitation (greedy)
- Convergence: with sufficient exploration of every state-action pair and appropriately decaying learning rates, Q converges to the optimal Q* for tabular MDPs
- Limitation: state space must be discrete and small — for continuous states, use function approximation (linear or neural)
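As a sketch of the function-approximation variant mentioned in the last point (the feature map `phi` and all names here are illustrative, not from the notebook), the tabular update generalizes to a semi-gradient step on a weight vector:

```python
import numpy as np

def linear_q_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Semi-gradient Q-learning with a linear approximator Q(s,a) = w . phi(s,a)."""
    q_sa = w @ phi(s, a)
    q_next = max(w @ phi(s_next, b) for b in actions)
    td_error = r + gamma * q_next - q_sa
    w += alpha * td_error * phi(s, a)    # gradient of w.phi(s,a) w.r.t. w is phi(s,a)
    return w

# Toy one-hot features recover the tabular case exactly.
n_states, n_actions = 25, 4
def phi(s, a):
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

w = np.zeros(n_states * n_actions)
linear_q_update(w, phi, 0, 1, -0.1, 1, range(n_actions))
```

With one-hot features this reproduces the tabular update (w[1] becomes −0.01, as before); replacing `phi` with a richer feature map handles large or continuous state spaces.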