Blocking Protocol
This document describes the blocking/wakeup protocol used throughout the
kernel, the lost-wakeup race it currently suffers from, and a proposed fix
based on an idle thread and a WaitCondition primitive.
Formal PlusCal/TLA+ models of the protocol live in specs/. See
specs/PLUSCAL.md for authoring instructions and
specs/README.md for per-spec details.
Current protocol (buggy)
Every blocking site in the kernel follows this pattern:
#![allow(unused)]
fn main() {
{
let mut guard = shared_state.lock(); // 1. acquire lock
guard.waiter = Some(thread_idx); // 2. register waiter
} // 3. release lock
scheduler::block_current_thread(); // 4. mark Blocked + spin
}
The waker (a producer, timer, or signal) does:
#![allow(unused)]
fn main() {
{
let mut guard = shared_state.lock();
if let Some(t) = guard.waiter.take() { // clear waiter slot
scheduler::unblock(t); // wake up the blocked thread
}
}
}
And unblock() is conditional:
#![allow(unused)]
fn main() {
pub fn unblock(thread_idx: usize) {
let mut sched = SCHEDULER.lock();
if let Some(t) = sched.threads.get_mut(thread_idx) {
if t.state == ThreadState::Blocked { // <-- only acts if Blocked
t.state = ThreadState::Ready;
sched.ready_queue.push_back(thread_idx);
}
}
}
}
The race
A waker can execute between steps 3 (unlock) and 4 (mark Blocked):
- Waiter: acquires lock, sets
waiter = Some(self), releases lock. Thread state is stillRunning. - Waker: acquires lock, calls
waiter.take(), callsunblock(waiter).unblockchecksstate == Blocked— it’sRunning— no-op. Waiter slot is nowNone. - Waiter: calls
block_current_thread(), sets state toBlocked, spins forever. No future waker will callunblockbecause the waiter slot was already consumed. Deadlock.
This race is confirmed by the PlusCal model in specs/completion_port/completion_port.tla
(TLC finds a deadlock trace) and by code inspection of scheduler.rs lines
640-676.
Affected sites
The race affects every blocking site, not just the completion port:
| Site | File | Lock type | Status |
|---|---|---|---|
sys_io_wait | osl/src/io_port.rs | IrqMutex | Fixed: WaitCondition |
sys_io_ring_enter phase 3 | osl/src/io_port.rs | IrqMutex | Fixed: WaitCondition |
PipeReader::read | libkernel/src/file.rs | SpinMutex | Fixed: WaitCondition |
read_input (console) | libkernel/src/console.rs | SpinMutex | Fixed: WaitCondition |
sys_ipc_send | osl/src/ipc.rs | IrqMutex | mark_blocked under lock (action enum) |
sys_ipc_recv | osl/src/ipc.rs | IrqMutex | mark_blocked under lock (action enum) |
sys_wait4 | osl/src/syscalls/process.rs | SpinMutex | Fixed: WaitCondition |
sys_clone (vfork parent) | osl/src/clone.rs | SpinMutex | Fixed: WaitCondition |
blocking() (async bridge) | osl/src/blocking.rs | SpinMutex | Fixed: WaitCondition |
The IPC channel sites (sys_ipc_send, sys_ipc_recv) use the split
mark_blocked() / yield_now() pair. The mark_blocked is called inside
ChannelInner::try_send/try_recv (under the IrqMutex), and the caller
does yield_now() after the lock drops via the SendAction/RecvAction
enum. These are already race-free; WaitCondition doesn’t fit the
action-enum pattern without a larger refactor.
The sys_io_ring_enter variant previously had an additional bug: check and
set_waiter were under separate lock acquisitions, so a completion could
arrive between the check and the registration. WaitCondition fixes this by
construction.
Why Blocked threads spin today
Both preempt_tick and yield_tick handle an empty ready queue by returning
current_rsp — i.e. they keep running the current thread even if it’s
Blocked. This forces block_current_thread() to include a HLT spin loop:
the Blocked thread keeps running on the CPU, calling enable_and_hlt() in a
loop, waiting for the next timer interrupt to check if unblock() has
changed its state. This wastes up to one full quantum (10ms) per blocking
event and prevents the CPU from doing useful work while the thread is
Blocked.
Proposed fix
The fix has three parts: an idle thread that eliminates the need for blocked
threads to spin, a split of block_current_thread that fixes the race, and
a WaitCondition wrapper that makes the correct pattern easy and the buggy
pattern impossible.
Step 1: Add an idle thread
Create a per-CPU idle thread that the scheduler falls back to when the ready queue is empty. The idle thread does nothing but HLT in a loop, yielding the CPU until the next interrupt:
#![allow(unused)]
fn main() {
fn idle_thread() -> ! {
loop {
x86_64::instructions::interrupts::enable_and_hlt();
}
}
}
The idle thread is created during scheduler init and stored in Scheduler:
#![allow(unused)]
fn main() {
struct Scheduler {
// ...
idle_thread_idx: usize, // always present, never on the ready queue
}
}
Then preempt_tick and yield_tick switch to the idle thread instead of
staying on a Blocked/Dead thread:
#![allow(unused)]
fn main() {
let next_idx = match sched.ready_queue.pop_front() {
Some(idx) => idx,
None => sched.idle_thread_idx, // was: return current_rsp
};
}
The idle thread is never pushed onto the ready queue. The scheduler only runs
it as a fallback when nothing else is Ready. The first unblock() call
pushes a real thread onto the ready queue, and the next timer tick preempts
idle and switches to it.
With this change, a Blocked thread no longer needs to spin — the scheduler
context-switches away from it immediately and never schedules it again until
unblock() makes it Ready.
Step 2: Split block_current_thread into mark + yield
#![allow(unused)]
fn main() {
/// Mark the current thread Blocked and yield to the scheduler.
///
/// Safe to call while holding any lock (acquires SCHEDULER briefly).
/// The scheduler will context-switch away and never schedule this thread
/// again until unblock() is called. Execution resumes at the instruction
/// after this call.
// [spec: completion_port/completion_port.tla CheckAndAct — "thread_state := blocked"
// + WaitUnblocked — "await thread_state = running"]
pub fn mark_blocked_and_yield() {
x86_64::instructions::interrupts::without_interrupts(|| {
let mut sched = SCHEDULER.lock();
let idx = sched.current_idx;
sched.threads[idx].state = ThreadState::Blocked;
});
yield_now(); // context-switch away; resume here after unblock + reschedule
}
}
There is no spin loop. yield_now() triggers int 0x50, which enters
yield_tick. The scheduler sees the thread is Blocked, does not re-queue it,
and switches to the next ready thread (or idle). When unblock() is called
later, the thread is pushed onto the ready queue with state Ready. The
scheduler eventually picks it and context-switches back, resuming execution
right after the yield_now() call.
A separate mark_blocked() (without yield) is still useful for callers that
need to mark Blocked under a lock and yield after dropping it:
#![allow(unused)]
fn main() {
/// Mark the current thread Blocked. Does NOT yield.
/// Caller must call yield_now() after releasing their lock.
pub fn mark_blocked() {
x86_64::instructions::interrupts::without_interrupts(|| {
let mut sched = SCHEDULER.lock();
let idx = sched.current_idx;
sched.threads[idx].state = ThreadState::Blocked;
});
}
}
Step 3: Migrate call sites
Each site becomes:
#![allow(unused)]
fn main() {
{
let mut guard = shared_state.lock(); // 1. acquire lock
guard.waiter = Some(thread_idx); // 2. register waiter
scheduler::mark_blocked(); // 3. mark Blocked UNDER LOCK
} // 4. release lock
scheduler::yield_now(); // 5. context-switch away
// execution resumes here after unblock + reschedule
}
No loop, no spin, no HLT. The thread is off the CPU until explicitly woken.
Step 4: Introduce WaitCondition to enforce the pattern (DONE)
Implemented in libkernel/src/wait_condition.rs. Seven sites now use
WaitCondition::wait_while(): sys_io_wait, sys_io_ring_enter,
PipeReader::read, read_input (console), sys_wait4, sys_clone
(vfork parent), and blocking(). The two IPC channel sites use the split
mark_blocked()/yield_now() pair via an action enum.
A condvar-like wrapper that makes the ordering impossible to get wrong:
#![allow(unused)]
fn main() {
/// Single-waiter condvar for kernel blocking.
///
/// Encapsulates the check → register → mark_blocked → unlock → yield cycle.
/// The type system ensures mark_blocked happens before the guard drops.
pub struct WaitCondition;
impl WaitCondition {
/// If `predicate(guard)` returns true (i.e. "should block"), register
/// the waiter, mark the thread Blocked, release the lock, and yield.
/// Returns when unblocked and rescheduled.
///
// [spec: completion_port/completion_port.tla
// CheckAndAct (check + set_waiter + mark_blocked) = one label
// WaitUnblocked (await running) = next label]
pub fn wait_while<T, L: Lock<T>>(
mut guard: L::Guard<'_>,
predicate: impl Fn(&T) -> bool,
register: impl FnOnce(&mut T, usize),
) {
if !predicate(&*guard) {
return; // condition already satisfied, no need to block
}
let thread_idx = scheduler::current_thread_idx();
register(&mut *guard, thread_idx);
scheduler::mark_blocked(); // mark Blocked while lock held
drop(guard); // release lock
scheduler::yield_now(); // context-switch away; resume after unblock
}
}
}
Each blocking site reduces to a single call:
#![allow(unused)]
fn main() {
// Completion port
WaitCondition::wait_while(
port.lock(),
|p| p.pending() < min,
|p, idx| p.set_waiter(idx),
);
// Pipe reader
WaitCondition::wait_while(
self.inner.lock(),
|inner| inner.buffer.is_empty() && !inner.write_closed,
|inner, idx| { inner.reader_thread = Some(idx); },
);
// Console input
WaitCondition::wait_while(
CONSOLE_INPUT.lock(),
|c| c.buf.is_empty(),
|c, idx| { c.blocked_reader = Some(idx); },
);
// sys_wait4
WaitCondition::wait_while(
PROCESS_TABLE.lock(),
|table| find_zombie_child(table, pid).is_none(),
|table, idx| {
table.get_mut(&pid).unwrap().wait_thread = Some(idx);
},
);
}
Step 5: Deprecate block_current_thread
All three remaining sites (sys_wait4, sys_clone, blocking()) have been
migrated. block_current_thread is now unused and can be deprecated. New
blocking code must use WaitCondition or the mark_blocked() /
yield_now() pair.
Lock ordering note
mark_blocked() acquires the scheduler’s SCHEDULER SpinMutex internally.
This means any lock held by the caller must come before SCHEDULER in
the lock ordering. The current codebase already satisfies this: all IrqMutex
and SpinMutex locks protecting shared state are acquired before (and never
after) the scheduler lock.
If a future lock needs to be acquired after the scheduler lock, that lock
cannot be held when calling mark_blocked() — use the manual split instead,
and mark blocked before acquiring the inner lock.
PlusCal correspondence
| Rust construct | PlusCal label | Atomicity |
|---|---|---|
guard = state.lock() | Start of label | Lock acquired |
register(guard, idx) | Same label | Under same lock |
mark_blocked() | Same label | Under same lock (acquires SCHEDULER briefly) |
drop(guard) | End of label | Lock released |
yield_now() | Next label | await thread_state = "running" |
unblock(idx) in waker | Waker’s label | if thread_state = "blocked" then running |
Each WaitCondition::wait_while call maps to exactly two PlusCal labels,
making formal verification straightforward.
Relation to the io_ring_enter double-lock bug
sys_io_ring_enter phase 3 currently does:
#![allow(unused)]
fn main() {
{ let p = port.lock(); /* check cq_available */ } // lock 1
{ let mut p = port.lock(); p.set_waiter(idx); } // lock 2
scheduler::block_current_thread(); // lock 3
}
The check and set_waiter are under separate locks, so a CQE posted between
them is missed. WaitCondition fixes this by construction: the predicate
check and waiter registration happen under a single lock acquisition.