Project Status
ostoo is a hobby x86-64 kernel written in Rust, following the
Writing an OS in Rust blog series by Philipp Oppermann.
All twelve tutorial chapters have been completed and the project has gone
significantly beyond the tutorial.
Workspace Layout
| Crate | Purpose |
|---|---|
| kernel/ | Top-level kernel binary: entry point, ties everything together |
| libkernel/ | Core kernel library: all subsystems including APIC live here |
| osl/ | “OS Subsystem for Linux”: syscall dispatch + VFS bridge |
| devices/ | Driver framework: DriverTask trait, actor macro, built-in drivers, VFS |
| devices-macros/ | Proc-macro crate: #[actor], #[on_message], #[on_info], #[on_tick], #[on_stream], #[on_start] |
Target triple: x86_64-os (custom JSON target, bare-metal, no std).
Build tooling: cargo-xbuild + bootimage (BIOS bootloader).
Toolchain: current nightly (floating, rust-toolchain.toml).
Completed Tutorial Chapters
1–2. Freestanding Binary / Minimal Kernel
- #![no_std], #![no_main], custom panic handler.
- bootloader crate provides the BIOS boot stage and passes a BootInfo struct.
- Entry point via entry_point! macro (libkernel_main in kernel/src/main.rs).
3. VGA Text Mode
- libkernel/src/vga_buffer/mod.rs: a Writer behind an IrqMutex.
- print! / println! macros available globally.
- Volatile writes to avoid compiler optimisation of MMIO.
- Hardware cursor (CRTC registers 0x3D4/0x3D5) kept in sync on every write.
- redraw_line(start_col, buf, len, cursor) for in-place line editing.
- Fixed status bar at row 0 (status_bar! macro, white-on-blue); updated by status_task every 250 ms with thread index, context-switch count, task queue depths, and uptime.
- Timeline strip at row 1: scrolling coloured blocks, one per context switch, colour-coded by thread index.
4. Testing
- Custom test framework (custom_test_frameworks feature).
- Integration tests in kernel/tests/: basic_boot, heap_allocation, should_panic, stack_overflow.
- QEMU isa-debug-exit device used to signal pass/fail to the host.
- Serial port (libkernel/src/serial.rs) used for test output.
5–6. CPU Exceptions / Double Faults
- IDT set up in libkernel/src/interrupts.rs via lazy_static.
- Handlers: breakpoint, page fault (panics), double fault (panics).
- Double fault uses a dedicated IST stack (GDT TSS entry).
- GDT + TSS initialised in libkernel/src/gdt.rs.
7. Hardware Interrupts
- 8259 PIC (chained) initialised via pic8259; remapped to IRQ vectors 32–47.
- PIC is later disabled once the APIC is configured.
- Timer interrupt handler (IRQ 0): increments tick counter, wakes timer futures.
- Keyboard interrupt handler (IRQ 1): reads scancode from port 0x60, pushes it into the async scancode queue.
8–9. Paging / Paging Implementation
- libkernel/src/memory/mod.rs: RecursivePageTable (PML4 slot 511 self-referential); MMIO bump allocator at 0xFFFF_8002_0000_0000 with BTreeMap cache for idempotency; physical memory identity map kept for DMA address translation only (phys_mem_offset from bootloader).
- libkernel/src/memory/frame_allocator.rs: BootInfoFrameAllocator walks the bootloader memory map to hand out usable physical frames.
- libkernel/src/memory/vmem_allocator.rs: DumbVmemAllocator hands out a sequential range of virtual addresses (no reclamation); currently unused in production, since the MMIO bump allocator in MemoryServices handles all virtual address allocation at runtime.
10. Heap Allocation
- Kernel heap mapped at 0xFFFF_8000_0000_0000, size 1 MiB (libkernel/src/allocator/mod.rs).
- Global allocator: linked_list_allocator::LockedHeap.
- extern crate alloc available; Box, Vec, Rc, BTreeMap, etc. all work.
11. Allocator Designs
- Bump allocator implemented in libkernel/src/allocator/bump.rs (O(1) alloc, no free).
- linked_list_allocator is the active global allocator (can be swapped by changing the static ALLOCATOR line in libkernel/src/lib.rs).
12. Async/Await
- Task abstraction in libkernel/src/task/mod.rs: pinned boxed futures with atomic TaskId.
- Simple round-robin executor in task/simple_executor.rs.
- Full waker-based executor in task/executor.rs:
  - Ready tasks in a VecDeque, waiting tasks in a BTreeMap.
  - Wake queue (crossbeam_queue::ArrayQueue) for interrupt-safe wakeups.
  - sleep_if_idle uses sti; hlt to avoid busy-waiting.
Beyond the Tutorial
Timer
- libkernel/src/task/timer.rs: LAPIC tick counter; TICKS_PER_SECOND = 1000.
- Delay future: resolves after a given number of ticks.
- Mailbox::recv_timeout(ticks) races inbox against a Delay.
Preemptive Multi-threaded Scheduler
- libkernel/src/task/scheduler.rs: round-robin preemptive scheduler driven by the LAPIC timer at 1000 Hz; 10 ms quantum (QUANTUM_TICKS = 10).
- Assembly stub lapic_timer_stub saves all 15 GPRs + iret frame on the current stack, then calls preempt_tick(current_rsp) -> new_rsp in Rust.
- preempt_tick advances the tick counter, acknowledges the LAPIC interrupt, decrements the quantum, and when it expires saves the old RSP, selects the next ready thread, and returns its saved_rsp.
- scheduler::migrate_to_heap_stack(run_kernel) allocates a 64 KiB heap stack and switches thread 0 off the bootloader’s lower-half stack onto PML4 entry 256 (high canonical half), so it survives CR3 switches into user page tables.
- scheduler::init() registers the boot context as thread 0.
- scheduler::spawn_thread(entry) allocates a 64 KiB stack, synthesises an iret frame, and enqueues the new thread.
- The kernel boots two executor threads (threads 0 and 1) that share the same async task queue; tasks are transparently dispatched across both.
- Shell command threads shows the current thread index and total context switches since boot.
Actor System (devices/, devices-macros/)
- DriverTask trait: name(), run(inbox, handle).
- Mailbox<M> / Inbox<M> MPSC queue; ActorMsg<M,I> envelope wraps inner messages, info queries, and erased-type info queries.
- Process registry (libkernel/src/task/registry.rs): actors register by name; registry::get::<M,I>(name) returns a typed sender handle.
- ErasedInfo registry: actors register a Box<dyn Fn() -> ...> so the shell can query any actor’s info without knowing its concrete type.
Proc-macro attributes (used inside #[actor] blocks)
| Attribute | Effect |
|---|---|
| #[on_start] | Called once before the run loop |
| #[on_message(Variant)] | Handles one inner message enum variant |
| #[on_info] | Returns the actor’s typed info struct |
| #[on_tick] | Called periodically; actor provides tick_interval_ticks() |
| #[on_stream(factory)] | Polls a Stream + Unpin in the unified event loop |
The macro generates a unified poll_fn loop when #[on_tick] or #[on_stream]
are present, racing all event sources in a single future.
User Space and Process Isolation
- Full ring-3 process support with per-process page tables, SYSCALL/SYSRET, and preemptive scheduling. Process exit and execve properly free user-half page tables and data frames (with refcount-aware shared frame handling).
- 35+ Linux-compatible syscalls in osl/src/syscalls/.
- Per-process FD table, CWD tracking, parent/child relationships, zombie lifecycle with wait4/reap.
- ELF loader for static x86-64 binaries; initial stack with argc/argv/auxv.
- IPC channels with fd-passing (capability transfer): syscalls 505–507. See docs/ipc-channels.md.
- Shared memory via shmem_create (syscall 508) + mmap(MAP_SHARED): anonymous shared memory backed by reference-counted physical frames. See docs/mmap-design.md Phase 5b.
- Notification fds via notify_create (509) + notify (510): general-purpose inter-process signaling through completion ports (OP_RING_WAIT). See docs/completion-port-design.md Phase 4.
- Console input buffer with foreground PID routing and blocking read(0).
- Async-to-sync bridge (osl/src/blocking.rs) for VFS calls from syscall context.
- See docs/userspace-plan.md for the full roadmap (Phases 0–6 complete; Phase 7 signals not yet started).
Userspace Libraries (user/include/ostoo.h, user-rs/rt/)
- C library (libostoo.a): shared header user/include/ostoo.h with struct definitions, syscall numbers, opcodes, and flags. Static library user/lib/libostoo.a provides typed syscall wrappers for all 12 custom syscalls (501–512), output helpers (puts_stdout, put_num, put_hex), conversion helpers (itoa_buf, simple_atoi), and ring buffer access helpers (sq_entry, cq_entry). All 21 demo programs have been migrated to use the shared library, eliminating per-file boilerplate.
- Rust library (ostoo-rt crate): two modules added to the existing user-rs/rt/ runtime crate. The sys module provides raw syscall wrappers and repr(C) struct definitions matching the kernel ABI. The ostoo module provides safe RAII types (CompletionPort, IpcSend/IpcRecv, SharedMem, NotifyFd, IrqFd, IoRing) with automatic fd cleanup on drop, plus builder methods on IoSubmission for each opcode.
Userspace Shell (user/src/shell.c)
- Primary user interface: musl-linked C binary, auto-launched on boot from /bin/shell via kernel/src/main.rs.
- Line editing: read char-by-char, echo, backspace, Ctrl+C (cancel), Ctrl+D (exit on empty line).
- Built-in commands: echo, pwd, cd, ls, cat, pid, export, env, unset, exit, help.
- Environment variables: shell maintains an env table, passes it to children. Kernel provides defaults: PATH=/host/bin, HOME=/, TERM=dumb, SHELL=/bin/shell.
- External programs: posix_spawn(path) + waitpid.
- Built with Docker-based musl cross-compiler (scripts/user-build.sh).
- Sources in user/src/, binaries output to user/bin/.
- See docs/userspace-shell.md for full design.
Kernel Shell (kernel/src/shell.rs) — fallback
- #[actor]-based shell actor, active when no userspace shell is running.
- Prompt includes CWD: ostoo:/path>.
- Commands: help, echo, driver <start|stop|info>, blk <info|read|ls|cat>, ls, cat, pwd, cd, mount, exec, test.
- Info commands (cpuinfo, meminfo, etc.) migrated to /proc; accessible via cat /proc/<file>.
Keyboard Actor (kernel/src/keyboard_actor.rs)
- #[actor] + #[on_stream(key_stream)]; registered as "keyboard".
- Foreground routing: when a user process is foreground, raw keypresses are delivered to console::push_input() for userspace read(0).
- When kernel is foreground: full readline-style line editing:
  - Cursor movement: ← → / Ctrl+B/F, Home/End / Ctrl+A/E
  - Editing: Backspace, Delete, Ctrl+K (kill to end), Ctrl+U (kill to start), Ctrl+W (delete word)
  - History: ↑↓ / Ctrl+P/N, 50-entry VecDeque, live-buffer save/restore
  - Ctrl+C clears the line; Ctrl+L clears the screen
- Dispatches complete lines to the kernel shell via ShellMsg::KeyLine.
virtio-blk Block Device (devices/src/virtio/)
- virtio-drivers 0.13 crate provides the virtio protocol; the kernel supplies KernelHal implementing Hal for DMA allocation, MMIO mapping, and virtual→physical address translation.
- QEMU Q35 machine; PCIe ECAM at physical 0xB000_0000 mapped at boot via MemoryServices::map_mmio_region. PciRoot is generic over MmioCam<'static>.
- VirtioBlkActor actor: handles Read and Write messages using the non-blocking virtio-drivers API (read_blocks_nb / complete_read_blocks) with a busy-poll CompletionFuture for MVP.
- KernelHal::share performs a full page-table walk (translate_virt) so that heap-allocated BlkReq / BlkResp / data buffers produce correct physical addresses for the device.
- Shell commands: blk info, blk read <sector>.
- See docs/virtio-blk.md for full details.
VirtIO 9P Host Directory Sharing (devices/src/virtio/p9*.rs)
- VirtIO 9P (9P2000.L) driver for sharing a host directory into the guest, providing a Docker-volume-like workflow: edit files on the host, they appear instantly in the guest.
- p9_proto.rs: minimal 9P2000.L wire protocol with 8 message pairs (version, attach, walk, lopen, read, readdir, getattr, clunk).
- p9.rs: P9Client high-level client wrapping VirtIO9p<KernelHal, PciTransport>. Synchronous API behind SpinMutex; performs version handshake + attach on construction. Public methods: list_dir, read_file, stat.
- QEMU shares ./user directory via -fsdev local,...,security_model=none -device virtio-9p-pci,...,mount_tag=hostfs.
- Mounted at /host (always) and at / as fallback when no virtio-blk disk is present, so /bin/shell auto-launch works without a disk image.
- PCI device IDs: 0x1AF4:0x1049 (modern), 0x1AF4:0x1009 (legacy).
- Read-only for MVP; no write/create/delete support.
- See docs/virtio-9p.md for full details.
exFAT Filesystem (devices/src/virtio/exfat.rs)
- Read-only exFAT driver with no external dependencies.
- Auto-detects bare exFAT, MBR-partitioned, and GPT-partitioned disk images.
- Implements: boot sector parsing, FAT chain traversal, directory entry set parsing (File / Stream Extension / File Name entries), and recursive path walking with case-insensitive ASCII matching.
- File reads capped at 16 KiB; peak heap usage during ls ≈ 5 KiB.
- See docs/exfat.md for full details.
VFS Layer (devices/src/vfs/)
- Uniform path namespace over multiple filesystems; shell no longer calls filesystem drivers directly.
- Enum dispatch (AnyVfs) avoids Pin<Box<dyn Future>> trait objects.
- Mount table (MOUNTS: SpinMutex<Vec<(String, Arc<AnyVfs>)>>) sorted longest-mountpoint-first; the Arc is cloned out before any .await so the lock is never held across a suspension point.
- ExfatVfs: wraps a BlkInbox and delegates to the exFAT driver.
- Plan9Vfs: wraps an Arc<P9Client> and delegates to the 9P client. Maps P9Error to VfsError (ENOENT→NotFound, ENOTDIR→NotADirectory, etc.).
- ProcVfs: synthetic filesystem; no block I/O. All system info commands have been migrated from the shell to /proc virtual files:
  - /proc/tasks: ready / waiting task counts from the executor.
  - /proc/uptime: seconds since boot from the LAPIC tick counter.
  - /proc/drivers: name and state of every registered driver.
  - /proc/threads: current thread index and context-switch count.
  - /proc/meminfo: heap usage, frame allocator stats, known virtual regions.
  - /proc/memmap: physical memory regions from the bootloader memory map.
  - /proc/cpuinfo: CPU vendor, family/model/stepping, CR0/CR4/EFER/RFLAGS.
  - /proc/pmap: page table walk with coalesced contiguous regions.
  - /proc/idt: IDT vector assignments (exceptions, PIC, LAPIC, dynamic).
  - /proc/pci: enumerated PCI devices.
  - /proc/lapic: Local APIC state and timer configuration.
  - /proc/ioapic: I/O APIC redirection table entries.
  - /proc/irq_stats: per-slot IRQ counters (total, delivered, buffered, spurious).
- Shell commands: ls, cat, cd use the VFS API; mount manages the mount table at runtime (mount, mount proc <mp>, mount blk <mp>).
- /proc is always mounted at boot; exFAT / is mounted if virtio-blk is present; 9p /host is mounted if virtio-9p is present (and 9p falls back to / when no disk image exists).
- See docs/vfs.md for full design notes.
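The longest-mountpoint-first lookup can be sketched on the host as follows. This is a minimal model, not the kernel's code: MountTable, mount, resolve and strip_mount are hypothetical names, and the type parameter F stands in for Arc<AnyVfs>. It assumes absolute, already-normalised paths and matches only on path-component boundaries.

```rust
/// Hypothetical stand-in for the kernel's mount table; `F` plays the role of
/// `Arc<AnyVfs>` so the sketch stays host-runnable.
pub struct MountTable<F> {
    mounts: Vec<(String, F)>,
}

impl<F> MountTable<F> {
    pub fn new() -> Self {
        MountTable { mounts: Vec::new() }
    }

    /// Adds a mount and keeps the table sorted longest-mountpoint-first,
    /// so the first match found by `resolve` is the most specific one.
    pub fn mount(&mut self, mountpoint: &str, fs: F) {
        self.mounts.push((mountpoint.to_string(), fs));
        self.mounts.sort_by(|a, b| b.0.len().cmp(&a.0.len()));
    }

    /// Returns the filesystem that owns `path` plus the path relative to
    /// that filesystem's root.
    pub fn resolve<'a, 'p>(&'a self, path: &'p str) -> Option<(&'a F, &'p str)> {
        self.mounts
            .iter()
            .find_map(|(mp, fs)| strip_mount(mp, path).map(|rest| (fs, rest)))
    }
}

/// Strips `mountpoint` from `path`, matching only on a component boundary
/// ("/host" owns "/host/bin" but not "/hostile").
fn strip_mount<'p>(mountpoint: &str, path: &'p str) -> Option<&'p str> {
    if mountpoint == "/" {
        return if path.starts_with('/') { Some(path) } else { None };
    }
    let rest = path.strip_prefix(mountpoint)?;
    if rest.is_empty() {
        Some("/")
    } else if rest.starts_with('/') {
        Some(rest)
    } else {
        None
    }
}
```

With /, /proc and /host mounted, resolving /host/bin/shell selects the /host filesystem with relative path /bin/shell, while /hostile falls through to the root mount.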
Completion Port Async I/O (osl/src/io_port.rs)
- io_uring-style completion-based async I/O subsystem.
- Kernel object: CompletionPort (libkernel/src/completion_port.rs), a bounded queue of completions with single-waiter blocking semantics.
- FdObject enum in libkernel/src/file.rs provides type-safe polymorphism for the fd table (File | Port), replacing the previous trait-object downcast approach.
- IrqMutex protects the CompletionPort for ISR-safe post() from interrupt context.
- Syscalls: io_create (501), io_submit (502), io_wait (503), io_setup_rings (511), io_ring_enter (512).
- Supported operations: OP_NOP (immediate), OP_TIMEOUT (async timer via executor), OP_READ / OP_WRITE (async: user buffers are copied to/from kernel memory during io_submit / io_wait; the actual I/O runs on executor tasks so io_submit returns immediately), OP_IRQ_WAIT (hardware interrupt delivery: ISR masks GSI and posts completion; rearm via another submit unmasks).
- Shared-memory SQ/CQ rings (Phase 5): io_setup_rings allocates ring pages as shmem fds; userspace writes SQEs to the SQ ring and reads CQEs from the CQ ring. io_ring_enter kicks the kernel and/or blocks for completions.
- FileHandle trait has poll_read / poll_write methods (default impls delegate to sync read / write). PipeReader and ConsoleHandle override poll_read with waker-based async semantics so completion port reads never block executor threads.
- Userspace demo programs: io_demo.c (smoke test), io_pingpong.c / io_pong.c (parent-child IPC via completion port).
- See docs/completion-port-design.md for the full phased roadmap (all phases complete).
IRQ File Descriptors (libkernel/src/irq_handle.rs, osl/src/irq.rs)
- Userspace interrupt delivery via irq_create(gsi) syscall (504).
- IrqInner tracks GSI, vector, slot, and saved IO APIC redirection entry.
- ISR handler (irq_fd_dispatch) masks the GSI via libkernel::apic::mask_gsi and posts a completion to the associated CompletionPort. For keyboard (GSI 1) and mouse (GSI 12), the ISR reads port 0x60 and drains all available bytes per interrupt (up to 16 per ISR invocation).
- 64-entry scancode ring buffer per slot prevents lost scancodes between rearms. arm_irq bulk-drains the entire buffer into completions.
- Per-slot atomic IRQ counters (total, delivered, buffered, spurious, wrong_source) visible via /proc/irq_stats.
- On close, the original IO APIC entry is restored.
- Demo: user/irq_demo.c, keyboard scancode display via OP_IRQ_WAIT.
IPC Channels (libkernel/src/channel.rs, osl/src/ipc.rs)
- Capability-based IPC channels for structured message passing between processes.
- Unidirectional with configurable buffer capacity: capacity=0 for synchronous rendezvous (seL4-style), capacity>0 for async buffered.
- Fixed 48-byte messages: tag (u64) + data[3] (u64) + fds[4] (i32).
- fd-passing (capability transfer): sender’s fds are extracted at send time, kernel objects are stored in the channel, and new fds are allocated in the receiver’s fd table at recv time. Cleanup on drop for undelivered messages.
- Completion port integration: OP_IPC_SEND (5) and OP_IPC_RECV (6) for multiplexing IPC with timers, IRQs, and file I/O.
- Syscalls: ipc_create (505), ipc_send (506), ipc_recv (507).
- Demos: ipc_sync.c, ipc_async.c, ipc_port.c, ipc_fdpass.c.
- See docs/ipc-channels.md for full design.
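The 48-byte figure follows from the field sizes (8 + 3 × 8 + 4 × 4). A minimal sketch of a matching repr(C) layout is shown below. The struct name IpcMessage, the helper methods, and the use of -1 as the "no fd" marker are assumptions made for illustration; the authoritative definition is the one in user/include/ostoo.h.

```rust
use std::mem::size_of;

/// Assumed marker for an unused fd slot (hypothetical; check ostoo.h).
pub const NO_FD: i32 = -1;

/// Fixed-size IPC message: tag + three data words + four fd slots.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct IpcMessage {
    pub tag: u64,
    pub data: [u64; 3],
    pub fds: [i32; 4],
}

// Compile-time check of the documented size: 8 + 24 + 16 = 48 bytes, no padding.
const _: () = assert!(size_of::<IpcMessage>() == 48);

impl IpcMessage {
    /// Builds a message that carries no file descriptors.
    pub fn new(tag: u64, data: [u64; 3]) -> Self {
        IpcMessage { tag, data, fds: [NO_FD; 4] }
    }

    /// Places `fd` in one of the four capability slots.
    pub fn with_fd(mut self, slot: usize, fd: i32) -> Self {
        self.fds[slot] = fd;
        self
    }

    /// Number of slots that carry a descriptor.
    pub fn fd_count(&self) -> usize {
        self.fds.iter().filter(|&&fd| fd != NO_FD).count()
    }
}
```

Because every field is naturally aligned, repr(C) introduces no padding, so the same layout can be shared between a C header and a Rust definition.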
Deadlock Detection (libkernel/src/spin_mutex.rs)
- All spin::Mutex locks replaced with SpinMutex, a drop-in wrapper that counts spin iterations and panics after a threshold, turning silent hangs into actionable diagnostics with serial output.
- SpinMutex: 100M iteration limit (~100 ms); allows for legitimate preemption contention on a single-core scheduler.
- IrqMutex: 10M iteration limit (~10 ms); interrupts disabled means no preemption, so any contention indicates a true deadlock.
- deadlock_panic() writes directly to COM1 (0x3F8) bypassing SERIAL1’s lock, then panics.
POSIX Signals (libkernel/src/signal.rs, osl/src/signal.rs)
- Phases 1–2: signal infrastructure, delivery on SYSCALL return, Ctrl+C/SIGINT, signal-interrupted syscalls (EINTR).
- rt_sigaction (13): install/query signal handlers (SA_SIGINFO, SA_RESTORER).
- rt_sigprocmask (14): SIG_BLOCK/UNBLOCK/SETMASK for the signal mask.
- kill (62): send a signal to a specific pid; wakes interruptible blocks.
- rt_sigreturn (15): restore context from rt_sigframe after handler returns.
- Signal delivery via check_pending_signals in the SYSCALL return path: constructs a Linux-ABI-compatible rt_sigframe on the user stack, rewrites the saved register frame so sysretq “returns” into the handler.
- Ctrl+C: keyboard actor queues SIGINT on foreground_pid(), wakes blocked console reader.
- EINTR: blocking syscalls (sys_wait4, PipeReader::read) set a per-process signal_thread field; sys_kill unblocks it so the syscall returns EINTR. The shell forwards SIGINT to child processes on EINTR from waitpid.
- Default actions: SIG_DFL terminate (SIGKILL, SIGTERM, etc.) or ignore (SIGCHLD).
- Demos: user/sig_demo.c (SIGUSR1 self-signal), user/sig_int.c (Ctrl+C interrupt test), userspace shell handles SIGINT via sigaction.
- See docs/signals.md for full design.
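The mask arithmetic behind rt_sigprocmask and pending-signal selection can be sketched as below. The how values and signal numbers are the Linux x86-64 ones; the function names are hypothetical, and stripping SIGKILL/SIGSTOP from the mask mirrors Linux behaviour, which is an assumption about ostoo rather than something this document states.

```rust
pub const SIG_BLOCK: i32 = 0;
pub const SIG_UNBLOCK: i32 = 1;
pub const SIG_SETMASK: i32 = 2;

pub const SIGINT: u32 = 2;
pub const SIGKILL: u32 = 9;
pub const SIGSTOP: u32 = 19;
pub const EINVAL: i32 = 22;

/// Bit for signal number `sig` (signals are numbered from 1).
pub fn sig_bit(sig: u32) -> u64 {
    1u64 << (sig - 1)
}

/// Computes the new blocked mask for an rt_sigprocmask-style call.
/// Assumption: as on Linux, SIGKILL and SIGSTOP can never be blocked.
pub fn apply_sigprocmask(old: u64, how: i32, set: u64) -> Result<u64, i32> {
    let new = match how {
        SIG_BLOCK => old | set,
        SIG_UNBLOCK => old & !set,
        SIG_SETMASK => set,
        _ => return Err(EINVAL),
    };
    Ok(new & !(sig_bit(SIGKILL) | sig_bit(SIGSTOP)))
}

/// Lowest-numbered signal that is pending and not blocked, if any.
pub fn next_deliverable(pending: u64, blocked: u64) -> Option<u32> {
    let ready = pending & !blocked;
    if ready == 0 {
        None
    } else {
        Some(ready.trailing_zeros() + 1)
    }
}
```

A check like next_deliverable is the kind of test a SYSCALL return path would run before building a signal frame; picking the lowest-numbered signal first is a simplification of this sketch.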
Dummy Driver (devices/src/dummy.rs)
- Example actor with #[on_tick] heartbeat, #[on_message(SetInterval)], and #[on_info].
- Demonstrates the full actor feature set.
ACPI Parsing
- kernel/src/kernel_acpi.rs implements an AcpiHandler that accesses physical ACPI regions via the bootloader’s identity map (phys + physical_memory_offset); no dynamic page mapping is required since all ACPI tables live in physical RAM.
- Calls acpi::search_for_rsdp_bios to locate and parse ACPI tables.
- On boot the interrupt model is printed; APIC vs legacy PIC is detected.
APIC Module (libkernel/src/apic/)
- APIC code lives in libkernel::apic, mapped at 0xFFFF_8001_0000_0000.
- libkernel/src/apic/local_apic/: Local APIC register access via MMIO and MSR.
- libkernel/src/apic/io_apic/: I/O APIC register access via MMIO.
- libkernel::apic::init() maps the Local APIC and all I/O APICs from the ACPI table, routes ISA IRQs 0 (timer) and 1 (keyboard) through the I/O APIC to IDT vectors 0x20 and 0x21, then disables the 8259 PIC.
- libkernel::apic::calibrate_and_start_lapic_timer() uses the PIT as a reference to measure the LAPIC bus frequency, starts the LAPIC timer in periodic mode at 1000 Hz, then masks the PIT’s I/O APIC entry so it no longer fires.
Logging
- libkernel/src/logger.rs wraps the VGA println! macro as a log::Log implementation.
- log::{debug, info, warn, error} macros usable throughout the kernel.
- Initialised early in libkernel_main.
CPUID
- libkernel/src/cpuid.rs: thin wrapper around raw-cpuid; init() called during kernel init.
Known Issues / Technical Debt
Heap Size
The heap is a fixed 1 MiB at 0xFFFF_8000_0000_0000. Kernel thread stacks are
allocated from a separate stack arena (libkernel/src/stack_arena.rs) at
0xFFFF_8000_0010_0000 (16 × 64 KiB = 1 MiB), keeping large stack allocations
off the general-purpose heap and eliminating fragmentation. The arena uses a
bitmap for O(1) alloc/free with RAII slot handles. The heap is now used only for
small driver/task allocations. The DumbVmemAllocator has no reclamation path,
so virtual address space for MMIO/ACPI mappings is consumed monotonically.
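A bitmap arena of this shape can be sketched on the host as follows. The base address, slot size and slot count come from the paragraph above; the type and method names, and the use of a compare-and-swap on an AtomicU16, are assumptions for illustration rather than the contents of stack_arena.rs.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

pub const ARENA_BASE: u64 = 0xFFFF_8000_0010_0000;
pub const SLOT_BYTES: u64 = 64 * 1024;
pub const SLOT_COUNT: u32 = 16;

/// One bit per 64 KiB slot; a set bit means the slot is in use.
pub struct StackArena {
    bitmap: AtomicU16,
}

/// RAII handle: the slot is returned to the arena when the handle is dropped.
pub struct StackSlot<'a> {
    arena: &'a StackArena,
    index: u32,
}

impl StackArena {
    pub fn new() -> Self {
        StackArena { bitmap: AtomicU16::new(0) }
    }

    /// Claims the lowest free slot in O(1) via a trailing-ones count.
    pub fn alloc(&self) -> Option<StackSlot<'_>> {
        let mut current = self.bitmap.load(Ordering::Relaxed);
        loop {
            let index = current.trailing_ones();
            if index >= SLOT_COUNT {
                return None; // all 16 slots taken
            }
            let claimed = current | (1u16 << index);
            match self.bitmap.compare_exchange_weak(
                current,
                claimed,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(StackSlot { arena: self, index }),
                Err(actual) => current = actual, // raced or spurious failure: retry
            }
        }
    }
}

impl<'a> StackSlot<'a> {
    pub fn index(&self) -> u32 {
        self.index
    }

    /// Lowest address of the slot.
    pub fn base(&self) -> u64 {
        ARENA_BASE + self.index as u64 * SLOT_BYTES
    }

    /// One past the highest address; stacks grow down from here.
    pub fn top(&self) -> u64 {
        self.base() + SLOT_BYTES
    }
}

impl<'a> Drop for StackSlot<'a> {
    fn drop(&mut self) {
        self.arena.bitmap.fetch_and(!(1u16 << self.index), Ordering::AcqRel);
    }
}
```

With 16 slots the top of the last slot is ARENA_BASE + 1 MiB, matching the 16 × 64 KiB figure above.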
virtio-blk Single-sector I/O
Block I/O uses IRQ-driven completion via AtomicWaker, but is still limited
to one 512-byte sector per request.
exFAT Write Support
The exFAT driver is read-only. All filesystem state changes (create, write, delete) are unsupported.
ProcVfs File Sizes Reported as Zero
VfsDirEntry::size is 0 for all /proc entries because the content length
is not known until the data is serialised. This is cosmetically wrong in ls
output but functionally harmless.
Possible Next Steps
Completion Port — All Phases Complete
- Phases 1–4 (core, read/write, OP_IRQ_WAIT, OP_RING_WAIT) — see sections above.
- Phase 5: Shared-memory SQ/CQ rings, implemented. io_setup_rings (511) allocates shared SQ/CQ ring pages exposed as shmem fds. io_ring_enter (512) processes SQ entries and waits for CQ completions. Dual-mode post() writes simple CQEs directly to the shared CQ ring; deferred completions (OP_READ, OP_IPC_RECV) are flushed in syscall context. Test: ring_sq_test.
Memory Management
- Larger / growable heap: demand-paged heap that grows on fault, or a larger static allocation. 1 MiB is tight with concurrent processes.
- Reclaiming virtual address space: replace DumbVmemAllocator with a proper free-list allocator so MMIO mappings can be released.
- File-backed MAP_SHARED: anonymous shared memory (via shmem_create) is complete; file-backed MAP_SHARED with inode page cache remains future work. See docs/mmap-design.md Phase 5c.
Process Model
- Signals Phase 3+: Phases 1–2 (basic signal delivery + Ctrl+C/SIGINT + EINTR) are complete: rt_sigaction, rt_sigprocmask, kill, signal delivery on SYSCALL return, rt_sigreturn, Ctrl+C → SIGINT to foreground process, signal-interrupted blocking syscalls (EINTR). Remaining: exception-generated signals (SIGSEGV, SIGILL), SIGCHLD on child exit. See docs/signals.md.
- fork + CoW page faults: standard POSIX fork. clone(CLONE_VM|CLONE_VFORK) and execve are now implemented, enabling unpatched musl posix_spawn and Rust std::process::Command. Full fork with CoW still requires a page fault handler and frame reference counting.
Drivers & I/O
- Multi-sector DMA: batch multiple sectors per virtio request to reduce queue round-trips for directory scans and file reads.
- exFAT write support: directory entry creation, FAT chain allocation, and sector writes to enable touch, mkdir, cp, rm.
Compositor & Window Management
The userspace compositor (/bin/compositor) is a Wayland-style display server
with full input routing and window management.
- Display: Takes exclusive ownership of the BGA framebuffer via framebuffer_open (515). Double-buffered compositing with painter’s algorithm. Cursor-only rendering optimization patches small rectangles for mouse movement instead of full recomposite.
- Input: Connects to /bin/kbd (keyboard) service via the service registry. Mouse input is integrated directly: the compositor claims IRQ 12 and decodes PS/2 packets inline (no separate mouse driver process). Key events forwarded to focused window. Mouse events drive cursor, focus, drag, and resize.
- CDE-style decorations: Server-side window decorations inspired by CDE/Motif: 3D beveled borders (BORDER_W=4, BEVEL=2), 24px title bar with centered title, CDE-style close button, sunken inner bevel around client area. Blue-grey color palette.
- Window management: Click-to-focus with Z-order raise. Title bar drag to move. Edge/corner drag to resize with context-sensitive cursor icons (diagonal, horizontal, vertical double-arrows). Close button removes window.
- Resize protocol: On resize completion, compositor allocates a new shared buffer and sends MSG_WINDOW_RESIZED (tag 7) with the new buffer fd. Terminal emulator remaps buffer, recalculates dimensions, and redraws.
- Terminal emulator (/bin/term): Compositor client that spawns /bin/shell with pipe-connected stdin/stdout. VT100 parser with color support. Character-level screen buffer (Cell array + per-row wrapped flags) enables text reflow on resize: logical lines are extracted, re-wrapped to the new width, and pixels regenerated from the cell buffer. Cursor position is preserved across resize.
- See docs/compositor-design.md and docs/display-input-ownership.md.
Microkernel Path
- Microkernel Phase B: kernel primitives for userspace drivers (device MMIO mapping, DMA syscalls). IRQ fd (syscall 504 + OP_IRQ_WAIT) and MAP_SHARED (via shmem_create 508) are complete. Remaining items unblock userspace NIC driver. See docs/microkernel-design.md.
- Networking: virtio-net driver + smoltcp TCP/IP stack. The completion port is ready to back it once the NIC driver lands. See docs/networking-design.md.
Preemptive Scheduler & Multi-threaded Async Executor
Overview
The kernel uses a round-robin preemptive scheduler built on top of the
LAPIC timer (1000 Hz). Every 10 ms (configurable via QUANTUM_TICKS) the
timer ISR saves the current CPU state and switches to the next ready thread,
regardless of what that thread was doing. This prevents any single async task
— even one that busy-loops — from starving all others.
The async executor’s state lives in global statics, so multiple kernel threads can pull and poll tasks from the same shared queue concurrently.
Thread Lifecycle
spawn_thread()
│
▼
[ Ready ] ◄──────────────────────────────┐
│ (selected by scheduler) │
▼ │
[ Running ] ──── quantum expired ─────────┘
Threads cycle between Ready and Running in strict round-robin order.
There is no blocked/sleeping state for threads — a thread that has nothing to
do (idle executor loop) calls HLT until an interrupt wakes it.
Thread 0 is the initial kernel thread. Early in boot, libkernel_main calls
scheduler::migrate_to_heap_stack(run_kernel) which allocates a 64 KiB heap
stack and switches RSP to it before continuing. This moves thread 0 off the
bootloader’s lower-half stack onto PML4 entry 256 (high canonical half), so
its stack survives CR3 switches into user page tables.
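The "PML4 entry 256" figure follows directly from x86-64 4-level paging, where bits 39–47 of a virtual address select the top-level entry. A small sketch (function names are illustrative only):

```rust
/// Index into the top-level page table (PML4): bits 39..=47 of the address.
pub const fn pml4_index(virt: u64) -> usize {
    ((virt >> 39) & 0x1FF) as usize
}

/// True if bits 48..=63 are a sign extension of bit 47 (canonical address).
pub const fn is_canonical(virt: u64) -> bool {
    let upper = virt >> 47; // bit 47 plus the 16 bits above it
    upper == 0 || upper == 0x1_FFFF
}
```

0xFFFF_8000_0000_0000 is the first address of the high canonical half, so it lands in entry 256; the highest user-half address, 0x0000_7FFF_FFFF_FFFF, lands in entry 255. Kernel mappings in entries 256 and above can therefore be shared across per-process page tables.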
Additional threads are created with scheduler::spawn_thread(entry: fn() -> !).
The entry function must never return; in practice it calls
executor::run_worker().
Context Switch Mechanism
LAPIC timer IDT entry
The IDT entry for LAPIC_TIMER_VECTOR (0x30) is set with set_handler_addr
pointing directly at lapic_timer_stub. This bypasses the
extern "x86-interrupt" wrapper so the stub can manipulate RSP freely.
Assembly stub (lapic_timer_stub)
lapic_timer_stub:
push rax; push rbx; push rcx; push rdx
push rsi; push rdi; push rbp
push r8; push r9; push r10; push r11
push r12; push r13; push r14; push r15
sub rsp, 512 // allocate FXSAVE area
fxsave [rsp] // save x87/MMX/SSE state
mov rdi, rsp // current_rsp → first argument
call preempt_tick // returns new rsp in rax
mov rsp, rax // switch to (possibly new) thread's stack
fxrstor [rsp] // restore x87/MMX/SSE state
add rsp, 512 // deallocate FXSAVE area
pop r15; pop r14; pop r13; pop r12
pop r11; pop r10; pop r9; pop r8
pop rbp; pop rdi; pop rsi
pop rdx; pop rcx; pop rbx; pop rax
iretq
The CPU pushes an interrupt frame (SS/RSP/RFLAGS/CS/RIP, 40 bytes) before
the stub runs. The stub pushes 15 GPRs (120 bytes) and then allocates a
512-byte FXSAVE area for x87/MMX/SSE register state. Together that is
672 bytes = 42 × 16, so RSP is 16-byte aligned for both fxsave [rsp]
(requires 16-byte alignment) and the call instruction (SysV ABI:
RSP + 8 aligned at function entry).
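The alignment argument can be checked mechanically. The sketch below relies on one fact not spelled out above: in 64-bit mode the CPU rounds RSP down to a 16-byte boundary before pushing the interrupt frame. Constant and function names are illustrative.

```rust
pub const IRET_FRAME_BYTES: u64 = 5 * 8; // SS, RSP, RFLAGS, CS, RIP
pub const GPR_BYTES: u64 = 15 * 8; // rax..r15 without rsp
pub const FXSAVE_BYTES: u64 = 512;
pub const SAVED_CONTEXT_BYTES: u64 = IRET_FRAME_BYTES + GPR_BYTES + FXSAVE_BYTES;

// 40 + 120 + 512 = 672 = 42 * 16, checked at compile time.
const _: () = assert!(SAVED_CONTEXT_BYTES == 672 && SAVED_CONTEXT_BYTES % 16 == 0);

/// RSP as seen by `fxsave [rsp]`, given the RSP of the interrupted code.
/// Models the long-mode behaviour of aligning RSP to 16 bytes before the
/// interrupt frame is pushed.
pub fn rsp_at_fxsave(interrupted_rsp: u64) -> u64 {
    let aligned = interrupted_rsp & !0xF;
    aligned - SAVED_CONTEXT_BYTES
}
```

Whatever the interrupted RSP was, the result is a multiple of 16, which is what both fxsave and the call into preempt_tick need.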
preempt_tick(current_rsp: u64) -> u64
Runs entirely on the current thread’s stack (inside the call/ret
pair), then returns the next thread’s saved_rsp in RAX.
- Increments the global tick counter and wakes sleeping async tasks.
- Sends LAPIC EOI.
- Locks SCHEDULER (interrupts already off, so no deadlock risk).
- If not yet initialised, returns current_rsp unchanged.
- Decrements the current thread’s ticks_remaining; if still > 0, returns unchanged.
- Saves current_rsp in current_thread.saved_rsp.
- Pushes the current thread index onto ready_queue (marks it Ready).
- Pops the front of ready_queue as next_idx. Because we just pushed current, the queue is always non-empty; unwrap_or(current_idx) is only a safety fallback. If current was the only thread it gets re-scheduled.
- Resets ticks_remaining = QUANTUM_TICKS, marks thread as Running.
- Returns next_thread.saved_rsp.
The stub then sets RSP = returned value and executes the symmetric pops +
iretq, which resumes execution on the new thread.
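The scheduling decision inside preempt_tick can be modelled on the host as below. This is a simplified sketch, not the kernel's implementation: it leaves out the tick counter, the LAPIC EOI, the SCHEDULER lock and the initialised check, and the struct layout is an assumption. Stack pointers are plain integers here.

```rust
use std::collections::VecDeque;

pub const QUANTUM_TICKS: u32 = 10;

pub struct Thread {
    pub saved_rsp: u64,
    pub ticks_remaining: u32,
}

pub struct Scheduler {
    pub threads: Vec<Thread>,
    pub ready_queue: VecDeque<usize>,
    pub current_idx: usize,
    pub context_switches: u64,
}

impl Scheduler {
    /// Registers the boot context as thread 0 (already Running).
    pub fn new() -> Self {
        Scheduler {
            threads: vec![Thread { saved_rsp: 0, ticks_remaining: QUANTUM_TICKS }],
            ready_queue: VecDeque::new(),
            current_idx: 0,
            context_switches: 0,
        }
    }

    /// Adds a Ready thread whose stack already holds a synthetic frame.
    pub fn spawn_thread(&mut self, saved_rsp: u64) -> usize {
        self.threads.push(Thread { saved_rsp, ticks_remaining: QUANTUM_TICKS });
        let idx = self.threads.len() - 1;
        self.ready_queue.push_back(idx);
        idx
    }

    /// Returns the RSP the stub should load: unchanged while the quantum
    /// lasts, otherwise the saved RSP of the next thread in round-robin order.
    pub fn preempt_tick(&mut self, current_rsp: u64) -> u64 {
        let current_idx = self.current_idx;
        {
            let current = &mut self.threads[current_idx];
            current.ticks_remaining = current.ticks_remaining.saturating_sub(1);
            if current.ticks_remaining > 0 {
                return current_rsp;
            }
            current.saved_rsp = current_rsp;
        }
        self.ready_queue.push_back(current_idx);
        let next_idx = self.ready_queue.pop_front().unwrap_or(current_idx);
        self.threads[next_idx].ticks_remaining = QUANTUM_TICKS;
        self.current_idx = next_idx;
        if next_idx != current_idx {
            self.context_switches += 1;
        }
        self.threads[next_idx].saved_rsp
    }
}
```

With two threads, nine ticks return the caller's RSP unchanged and the tenth returns the other thread's saved RSP; with a single thread the tenth tick simply re-schedules it.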
Initial Stack Layout for New Threads
spawn_thread(entry) allocates a 64 KiB Vec<u8> and writes a fake
interrupt frame at the top. The frame is exactly what a preempted thread’s
stack looks like, so the same assembly stub can start a new thread as if it
were resuming a preempted one.
high address ┌──────────────────────────┐
│ SS = 0 │ ← null selector, valid for ring-0
│ RSP = stack_top−8 │ ← thread's initial stack pointer
│ RFLAGS = 0x202 │ ← bit 9 (IF) + bit 1 (reserved)
│ CS = 0x08 │ ← kernel code segment
│ RIP = entry │ ← thread entry point
├──────────────────────────┤
│ rax = 0 │ 15 GPRs (120 bytes)
│ rbx = 0 │
│ … │
│ r15 = 0 │
├──────────────────────────┤
│ FXSAVE area │ 512 bytes (16-byte aligned)
│ (x87/MMX/SSE state) │ MXCSR = 0x1F80 at offset +24
│ │ XMM0-15 at offset +160
│ │ ← saved_rsp points here
low address └──────────────────────────┘
saved_rsp = base of the 512-byte FXSAVE area. The SwitchFrame (GPRs +
iretq frame) sits at saved_rsp + 512. Total region is 672 bytes,
guaranteed 16-byte aligned by rounding stack_top down.
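The layout above can be reproduced in a host-side sketch that writes the synthetic frame into a byte buffer. The function name, its signature, and the explicit base parameter (the virtual address the buffer would have in the kernel) are assumptions made so the example runs outside the kernel.

```rust
pub const IRET_FRAME_BYTES: u64 = 5 * 8;
pub const GPR_BYTES: u64 = 15 * 8;
pub const FXSAVE_BYTES: u64 = 512;
pub const KERNEL_CS: u64 = 0x08;
pub const RFLAGS_IF: u64 = 0x202; // IF (bit 9) + reserved bit 1
pub const MXCSR_DEFAULT: u32 = 0x1F80;

/// Reads a little-endian u64 at `offset` (helper for inspecting the frame).
pub fn read_u64(stack: &[u8], offset: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&stack[offset..offset + 8]);
    u64::from_le_bytes(bytes)
}

fn write_u64(stack: &mut [u8], offset: usize, value: u64) {
    stack[offset..offset + 8].copy_from_slice(&value.to_le_bytes());
}

/// Writes the initial frame at the top of `stack`, where `base` is the
/// address of `stack[0]`, and returns the thread's initial saved_rsp.
pub fn build_initial_frame(stack: &mut [u8], base: u64, entry: u64) -> u64 {
    let stack_top = (base + stack.len() as u64) & !0xF; // round down to 16
    let total = IRET_FRAME_BYTES + GPR_BYTES + FXSAVE_BYTES; // 672 bytes
    assert!(stack_top >= base + total, "stack too small for the initial frame");
    let saved_rsp = stack_top - total;
    let off = |addr: u64| (addr - base) as usize;

    // Zero the whole region: all 15 GPRs and the FXSAVE area start as 0.
    for byte in &mut stack[off(saved_rsp)..off(stack_top)] {
        *byte = 0;
    }

    // iretq frame, highest address first.
    write_u64(stack, off(stack_top - 8), 0); // SS: null selector
    write_u64(stack, off(stack_top - 16), stack_top - 8); // RSP
    write_u64(stack, off(stack_top - 24), RFLAGS_IF); // RFLAGS
    write_u64(stack, off(stack_top - 32), KERNEL_CS); // CS
    write_u64(stack, off(stack_top - 40), entry); // RIP

    // MXCSR lives at byte offset 24 of the FXSAVE area.
    let mxcsr = off(saved_rsp) + 24;
    stack[mxcsr..mxcsr + 4].copy_from_slice(&MXCSR_DEFAULT.to_le_bytes());
    saved_rsp
}
```

For a 4 KiB buffer at base 0x1000 the returned saved_rsp is 0x2000 − 672 = 0x1D60, and the entry point sits 40 bytes below the stack top, where iretq expects RIP.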
Timer Quantum
QUANTUM_TICKS in task/scheduler.rs controls how many LAPIC ticks
(1 tick = 1 ms at 1000 Hz) each thread runs before being preempted. The
default is 10 (10 ms per thread).
To increase to 50 ms:
pub const QUANTUM_TICKS: u32 = 50;
Thread-safe Async Executor
Global state
| Static | Type | Purpose |
|---|---|---|
| TASK_QUEUE | Mutex<VecDeque<Task>> | Tasks ready to be polled |
| WAIT_MAP | Mutex<BTreeMap<TaskId, Task>> | Tasks waiting for a waker |
| WAKE_QUEUE | Arc<ArrayQueue<TaskId>> | Lock-free waker notifications (ISR-safe) |
| WAKER_CACHE | Mutex<BTreeMap<TaskId, Waker>> | One Waker per live task; keeps Arc count ≥ 2 to prevent ISR deallocation |
TASK_QUEUE and WAIT_MAP use SpinMutex (a spin::Mutex wrapper with
deadlock detection). On a single CPU with preemption, a thread can be
preempted while holding a spinlock; the new thread spinning on the same lock
will waste its quantum and yield back, at which point the original thread
releases the lock. If this doesn’t resolve within ~100 ms (SPIN_LIMIT
iterations), SpinMutex panics with a serial diagnostic rather than hanging
silently.
ISR-safe waker deallocation (WAKER_CACHE)
Both the timer ISR (timer::tick) and the keyboard ISR call Waker::wake(),
which consumes the stored Waker. If that were the last Arc<TaskWaker>
reference, the Drop impl would call into linked_list_allocator, whose
spinlock may already be held by the preempted thread → deadlock.
WAKER_CACHE (Mutex<BTreeMap<TaskId, Waker>>) holds one cached Waker per
live task, keeping the Arc strong count ≥ 2 whenever an ISR-accessible copy
exists. The ISR’s drop reduces the count from 2 → 1; the cache’s copy is only
freed from executor context when a task completes (Poll::Ready).
Task: Send requirement
Task::new requires Future<Output = ()> + Send + 'static. All built-in
tasks (timer, keyboard, example) satisfy this because they only hold values
that are Send (atomics, Mutex-guarded globals, simple scalars).
spawn(task) and run_worker()
executor::spawn pushes a Task into TASK_QUEUE. executor::run_worker
loops:
- Move tasks whose wakers fired from WAIT_MAP → TASK_QUEUE.
- Poll every task in TASK_QUEUE. If Pending, move to WAIT_MAP.
- sleep_if_idle: disable interrupts, check WAKE_QUEUE, then atomically re-enable + HLT (prevents missed-wakeup race).
Locking Rules
| Lock | Where held | Rule |
|---|---|---|
| SCHEDULER | ISR and non-ISR | Non-ISR callers must use `without_interrupts( |
| TASK_QUEUE | Non-ISR only | Released before polling to allow spawn() inside poll |
| WAIT_MAP | Non-ISR only | Released before locking TASK_QUEUE to avoid ordering inversion |
| WAKER_CACHE | Non-ISR only | Released before polling |
| timer::WAKERS | ISR (tick) + non-ISR (Delay::poll) | Non-ISR uses without_interrupts |
The ISR already runs with IF = 0, so it never needs to call
without_interrupts.
Deadlock Detection
All spin::Mutex locks have been replaced with SpinMutex
(libkernel/src/spin_mutex.rs), which counts spin iterations and panics
after a threshold:
| Lock type | Threshold | Rationale |
|---|---|---|
| SpinMutex | 100,000,000 (~100 ms) | Well beyond the 10 ms quantum; allows for legitimate preemption contention |
| IrqMutex | 10,000,000 (~10 ms) | Interrupts are disabled, so there is no preemption and any contention is a true deadlock |
On timeout, deadlock_panic() writes directly to serial port 0x3F8
(bypassing SERIAL1’s lock) and then panics. This turns silent hangs into
actionable diagnostics.
Demonstrating Preemption
Add a spinning task to confirm no starvation:
executor::spawn(Task::new(async {
    loop { core::hint::spin_loop(); }
}));
Without preemption this would freeze the kernel. With the scheduler, the
LAPIC timer preempts the spinning thread once its 10 ms quantum expires and
rotates to the next thread, so [timer] tick: Ns elapsed still appears every
second.
Paging Design
Virtual Address Layout
x86-64 canonical addresses split into two halves:
0x0000_0000_0000_0000 ┐
... │ lower canonical half — user process address space
0x0000_7FFF_FFFF_FFFF ┘
(non-canonical gap — any access faults)
0xFFFF_8000_0000_0000 kernel heap (HEAP_START, 256 KiB)
0xFFFF_8001_0000_0000 Local APIC MMIO (APIC_BASE, 4 KiB)
0xFFFF_8001_0001_0000 IO APIC(s) (4 KiB × n, relative to APIC_BASE)
0xFFFF_8002_0000_0000 MMIO window (MMIO_VIRT_BASE, 512 GiB)
↑ PCIe ECAM, virtio BARs, future driver MMIO allocated here
0xFFFF_FF80_0000_0000 recursive PT window (for index 511, see below)
0xFFFF_FFFF_FFFF_F000 PML4 self-mapping (recursive index 511)
phys_mem_offset bootloader physical identity map (stays put)
+ all physical RAM
All three kernel allocation regions (heap, APIC, MMIO) share PML4 index 256
(0xFFFF_8000_0000_0000 through 0xFFFF_807F_FFFF_FFFF), keeping the kernel
footprint in a single top-level page-table entry that is easy to share across
per-process page tables without marking it USER_ACCESSIBLE.
Page Table Implementation: RecursivePageTable
Why recursive instead of OffsetPageTable
OffsetPageTable walks page-table frames by computing
phys_mem_offset + frame_phys_address. This creates a permanent dependency on
the bootloader’s physical-identity map (which lives in the lower canonical half).
For user-space isolation we want the lower half to be entirely process-owned.
RecursivePageTable eliminates this dependency: the CPU’s own hardware page
walker is used to reach PT frames, so no identity map is needed for page-table
operations.
How recursive mapping works
One PML4 slot (index 511) is pointed at the PML4’s own physical frame. When the
CPU walks this entry it re-enters the same PML4 as if it were a PDPT. Repeating
four times (P4→511, P3→511, P2→511, P1→511) exposes the PML4’s own 4 KiB page
at virtual address 0xFFFF_FFFF_FFFF_F000.
The full recursive window for index R (R=511) maps every page-table frame at a computable virtual address:
| Recursive steps | Virtual base (R=511) | What is mapped there |
|---|---|---|
| 4 | 0xFFFF_FFFF_FFFF_F000 | the PML4 itself |
| 3 | 0xFFFF_FFFF_FFE0_0000+ | all 512 PDPTs |
| 2 | 0xFFFF_FFFF_C000_0000+ | all 512 × 512 PDs |
| 1 | 0xFFFF_FF80_0000_0000+ | all PT frames |
The x86_64 crate’s RecursivePageTable type uses these computable addresses to
implement Mapper and Translate without any identity-map knowledge.
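The window addresses above can be derived from the four 9-bit table indices. The sketch below is independent of the x86_64 crate; the function names are hypothetical and only illustrate how the bases in the table fall out of placing the recursive index in the upper slots.

```rust
/// Recursive PML4 slot used by this kernel.
pub const R: u64 = 511;

/// Build a canonical virtual address from four 9-bit page-table indices.
pub fn virt_from_indices(p4: u64, p3: u64, p2: u64, p1: u64) -> u64 {
    let raw = (p4 << 39) | (p3 << 30) | (p2 << 21) | (p1 << 12);
    // Sign-extend bit 47 into bits 48..63 to keep the address canonical.
    if raw & (1 << 47) != 0 {
        raw | 0xFFFF_0000_0000_0000
    } else {
        raw
    }
}

/// Address at which the PML4 itself is visible (four recursive steps).
pub fn pml4_addr() -> u64 {
    virt_from_indices(R, R, R, R)
}

/// Address of the PDPT referenced by PML4 slot `p4` (three recursive steps).
pub fn pdpt_addr(p4: u64) -> u64 {
    virt_from_indices(R, R, R, p4)
}

/// Address of the PD for (`p4`, `p3`) (two recursive steps).
pub fn pd_addr(p4: u64, p3: u64) -> u64 {
    virt_from_indices(R, R, p4, p3)
}

/// Address of the PT for (`p4`, `p3`, `p2`) (one recursive step).
pub fn pt_addr(p4: u64, p3: u64, p2: u64) -> u64 {
    virt_from_indices(R, p4, p3, p2)
}
```

Each recursive step consumes one level of the hardware walk, so the fewer recursive indices an address carries, the deeper the table it exposes.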
Setup sequence (in libkernel::memory::init)
1. Read CR3 → PML4 physical frame
2. Access PML4 via bootloader identity map: virt = phys_mem_offset + pml4_phys
3. Write PML4[511] = (pml4_phys_frame, PRESENT | WRITABLE)
4. flush_all() ← new mapping is now active
5. Compute recursive PML4 address: 0xFFFF_FFFF_FFFF_F000
6. Obtain &'static mut PageTable at that address
7. RecursivePageTable::new(pml4_at_recursive_addr)
After step 7 the identity map is still live (bootloader mapping is never
removed), but RecursivePageTable does not use it for page-table walks.
MMIO Virtual Address Allocator
Problem with the old approach
The old map_mmio_region mapped MMIO at phys_mem_offset + phys_addr — the
same virtual address the identity map uses for regular RAM. This worked but:
- It placed MMIO in the lower canonical half (future user space).
- It gave MMIO a fixed virtual address tied to
phys_mem_offset, which varies per boot and can change if the bootloader is swapped.
New design: bump allocator + cache
A bump pointer starts at MMIO_VIRT_BASE = 0xFFFF_8002_0000_0000 and advances
one region at a time. A BTreeMap<phys_base, virt_base> cache ensures that
mapping the same physical address twice returns the same virtual address.
MMIO_VIRT_BASE 0xFFFF_8002_0000_0000
+ PCIe ECAM 1 MiB
+ virtio BAR0 varies
+ ...
(grows upward; 512 GiB window — exhaustion is practically impossible)
Flags: PRESENT | WRITABLE | NO_CACHE (same as before).
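A minimal sketch of the bump pointer plus cache, assuming a hypothetical MmioAllocator type and reserve method. It only performs the address bookkeeping; the actual page mapping is left as a comment, and std's BTreeMap stands in for the alloc version the kernel would use.

```rust
use std::collections::BTreeMap;

pub const MMIO_VIRT_BASE: u64 = 0xFFFF_8002_0000_0000;
const PAGE_SIZE: u64 = 4096;

/// Hypothetical bookkeeping type; not the kernel's actual structure.
pub struct MmioAllocator {
    next: u64,                 // bump pointer, always page-aligned
    cache: BTreeMap<u64, u64>, // page-aligned phys base -> virt base
}

impl MmioAllocator {
    pub fn new() -> Self {
        Self { next: MMIO_VIRT_BASE, cache: BTreeMap::new() }
    }

    /// Return the virtual address for `phys_addr`, reserving a fresh
    /// region on first use and reusing the cached one afterwards.
    pub fn reserve(&mut self, phys_addr: u64, size: u64) -> u64 {
        let phys_base = phys_addr & !(PAGE_SIZE - 1);
        let offset = phys_addr - phys_base;
        if let Some(&virt_base) = self.cache.get(&phys_base) {
            return virt_base + offset;
        }
        // Round the region up to whole pages before advancing the pointer.
        let len = (offset + size + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);
        let virt_base = self.next;
        self.next += len;
        self.cache.insert(phys_base, virt_base);
        // A real implementation would map each page here with
        // PRESENT | WRITABLE | NO_CACHE.
        virt_base + offset
    }
}
```

Keying the cache on the page-aligned base means an unaligned request and an aligned request for the same device share one mapping.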
Cache key
The cache key is the page-aligned physical base address. If the same physical base is mapped twice with different sizes the second call returns the cached mapping (the first mapping covers at least as many pages as were originally requested; in practice PCI BAR sizes are fixed per device).
Heap dependency
BTreeMap::insert allocates from the kernel heap. map_mmio_region must not
be called before init_heap completes or from interrupt context. All current
call sites (boot path in main.rs, KernelHal::mmio_phys_to_virt) satisfy
this constraint.
Remaining Identity Map Dependency
After the switch to RecursivePageTable, the bootloader identity map is still
used in exactly two places:
| Use | Location | Notes |
|---|---|---|
| DMA address translation | KernelHal::dma_alloc in devices/src/virtio/mod.rs | DMA frames are physical RAM; phys_mem_offset + paddr gives the kernel virtual address for CPU access |
| ACPI table access | KernelAcpiHandler::phys_to_virt in kernel/src/kernel_acpi.rs | ACPI tables are in physical RAM; same formula |
Both are kernel-private and never exposed to user space. The bootloader identity
map entries do not have the USER_ACCESSIBLE flag, so they are invisible to
ring-3 processes regardless.
The restriction: every page table the kernel uses to walk page structures must
keep the bootloader’s PML4 entries for the identity-map region. For per-process
page tables this is easily satisfied by copying PML4 entries 0–255 (lower half)
from the kernel PML4 — without USER_ACCESSIBLE — at process creation time.
Per-Process Page Tables
Each process gets its own PML4, created by MemoryServices::create_user_page_table:
- Slot 511: self-referential entry pointing to the process’s own PML4 physical frame (required for RecursivePageTable to work per-process).
- Slots 256–510: shared kernel mappings (heap, APIC, MMIO window, physical memory direct map), copied verbatim from the active PML4. These are high-half addresses, never accessible from ring-3. Because the PML4 entries point to the same PDPT/PD/PT frames, changes to kernel page tables at levels below PML4 are automatically visible in all address spaces.
- Slots 0–255: process-private user-space mappings. The process’s code, stack, heap, and memory-mapped files live here.
Switching between processes requires only a mov cr3, new_pml4_phys — the
kernel’s high-half mappings are identical in every page table so no TLB flush is
needed for kernel entries (on CPUs with PCID support).
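The slot layout can be illustrated with plain arrays of raw entries. This is a sketch under stated assumptions: build_user_pml4 is a hypothetical stand-in for what MemoryServices::create_user_page_table does, and it models a PML4 as [u64; 512] rather than using the real page-table types.

```rust
pub const ENTRY_PRESENT: u64 = 1 << 0;
pub const ENTRY_WRITABLE: u64 = 1 << 1;
pub const RECURSIVE_INDEX: usize = 511;

/// Build the raw entries of a new user PML4.
/// `new_pml4_phys` is the page-aligned physical address of the new table.
pub fn build_user_pml4(kernel_pml4: &[u64; 512], new_pml4_phys: u64) -> [u64; 512] {
    // Slots 0-255 start empty: they are process-private.
    let mut pml4 = [0u64; 512];
    // Slots 256-510: share the kernel's lower-level tables verbatim.
    pml4[256..RECURSIVE_INDEX].copy_from_slice(&kernel_pml4[256..RECURSIVE_INDEX]);
    // Slot 511: point the table at itself so recursive mapping keeps working.
    pml4[RECURSIVE_INDEX] = new_pml4_phys | ENTRY_PRESENT | ENTRY_WRITABLE;
    pml4
}
```

Slot 511 is deliberately not copied from the kernel table, since a copied entry would point back at the kernel's PML4 instead of the new one.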
PML4 lifecycle
User PML4s and their lower-half page table frames are freed when a process
exits (terminate_process) or replaces its address space (execve). The
kernel boot PML4 physical address is stored in KERNEL_PML4_PHYS (set during
init_services). Before freeing a user PML4, the dying/exec’ing code
switches CR3 and the scheduler’s thread record to the kernel PML4. This is
critical because the frame allocator uses an intrusive free-list that
overwrites freed frames immediately — leaving CR3 pointing at a freed PML4
would cause a triple fault on the next TLB refill.
Files Changed
| File | Change |
|---|---|
| libkernel/src/allocator/mod.rs | HEAP_START = 0xFFFF_8000_0000_0000 |
| kernel/src/main.rs | APIC_BASE = 0xFFFF_8001_0000_0000 |
| libkernel/src/memory/mod.rs | RecursivePageTable; MMIO_VIRT_BASE bump allocator; mmio_cache: BTreeMap |
| libkernel/src/memory/vmem_allocator.rs | Test BASE constant updated (cosmetic) |
mmap Phased Design
Overview
This document describes a phased plan for improving the virtual memory
management subsystem, starting from the current minimal mmap implementation
and building towards file-backed, shared mappings.
Each phase is self-contained and independently testable.
Current State
mmap (syscall 9)
- Anonymous (MAP_ANONYMOUS) and file-backed MAP_PRIVATE (eager copy).
- MAP_FIXED supported: implicit munmap of overlapping VMAs (Linux semantics).
- Non-fixed allocations use a top-down gap finder over the VMA tree ([MMAP_FLOOR, MMAP_CEILING) = [0x10_0000_0000, 0x4000_0000_0000)). Freed regions are automatically reused.
- Pages are eagerly allocated, zeroed, and mapped.
- prot argument is honoured: page table flags are derived from PROT_READ, PROT_WRITE, PROT_EXEC via Vma::page_table_flags().
- Regions are tracked as BTreeMap<u64, Vma> (vma_map in Process).
- /proc/maps displays actual rwxp flags from VMA metadata.
munmap (syscall 11)
Implemented — unmaps pages, frees frames to the free list, and splits/removes VMAs. Supports partial unmaps (front, tail, middle split).
mprotect (syscall 10)
Implemented — updates page table flags and splits/updates VMAs. Supports partial mprotect across VMA boundaries (front, tail, middle split).
Process cleanup on exit
sys_exit frees all user-space frames (ELF segments, brk heap, user stack,
mmap regions) and intermediate page table frames before marking zombie.
Process cleanup on execve
sys_execve creates a fresh PML4, switches CR3, then frees the old address
space (all user pages and page tables).
Phase 1: VMA Tracking + PROT Flags ✓ (implemented)
Goal: Replace the bare Vec<(u64, u64)> region list with a proper VMA
(Virtual Memory Area) structure, and honour the prot argument in mmap.
VMA struct
Add to libkernel/src/process.rs (or a new libkernel/src/vma.rs):
#[derive(Debug, Clone)]
pub struct Vma {
    pub start: u64,        // page-aligned
    pub len: u64,          // page-aligned
    pub prot: u32,         // PROT_READ | PROT_WRITE | PROT_EXEC
    pub flags: u32,        // MAP_PRIVATE | MAP_ANONYMOUS | MAP_SHARED | ...
    pub fd: Option<usize>, // file descriptor (Phase 5)
    pub offset: u64,       // file offset (Phase 5)
}
Store VMAs in a BTreeMap<u64, Vma> keyed by start address, replacing
mmap_regions: Vec<(u64, u64)>.
PROT flag translation
Map Linux PROT_* to x86-64 page table flags:
| Linux | x86-64 PTF | Notes |
|---|---|---|
| PROT_READ | PRESENT \| USER_ACCESSIBLE | Any present page is readable; x86 has no separate read bit |
| PROT_WRITE | + WRITABLE | |
| PROT_EXEC | clear NO_EXECUTE | |
| PROT_NONE | clear PRESENT | |
Apply these flags in alloc_and_map_user_pages instead of the current
hardcoded USER_DATA_FLAGS.
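The translation table can be sketched as a small pure function. The bit positions are the architectural x86-64 PTE bits and the PROT_* values are the Linux ones; the function itself is a hypothetical stand-in for Vma::page_table_flags(), and its handling of write-only or exec-only requests (treated as readable, since x86 cannot express anything else) is an assumption.

```rust
// Linux mmap protection bits.
pub const PROT_NONE: u32 = 0x0;
pub const PROT_READ: u32 = 0x1;
pub const PROT_WRITE: u32 = 0x2;
pub const PROT_EXEC: u32 = 0x4;

// x86-64 page-table entry bits.
pub const PRESENT: u64 = 1 << 0;
pub const WRITABLE: u64 = 1 << 1;
pub const USER_ACCESSIBLE: u64 = 1 << 2;
pub const NO_EXECUTE: u64 = 1 << 63;

/// Translate a PROT_* mask into raw page-table flags.
pub fn page_table_flags(prot: u32) -> u64 {
    if prot & (PROT_READ | PROT_WRITE | PROT_EXEC) == PROT_NONE {
        // PROT_NONE: leave PRESENT clear so every access faults.
        return NO_EXECUTE;
    }
    // Any other request yields a present, user-visible, non-executable page.
    let mut flags = PRESENT | USER_ACCESSIBLE | NO_EXECUTE;
    if prot & PROT_WRITE != 0 {
        flags |= WRITABLE;
    }
    if prot & PROT_EXEC != 0 {
        flags &= !NO_EXECUTE;
    }
    flags
}
```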
Changes
| File | Change |
|---|---|
| libkernel/src/process.rs | Add Vma struct, replace mmap_regions with BTreeMap<u64, Vma> |
| osl/src/syscalls/mem.rs (sys_mmap) | Parse prot, compute PTF, store VMA |
| osl/src/clone.rs | Clone the VMA map instead of Vec<(u64, u64)> |
| osl/src/exec.rs | Clear VMA map on execve |
Test
Allocate an mmap region with PROT_READ only, attempt a write from
userspace — should page-fault.
Phase 2: Frame Free List + munmap ✓ (implemented)
Goal: Actually free physical frames when munmap is called.
Frame allocator changes
The current frame allocator (BootInfoFrameAllocator wrapping an iterator of
usable frames) is allocate-only. Two options:
- Bitmap allocator — replace the iterator with a bitmap over all usable RAM. Deallocation sets a bit. Simple, O(1) free, but O(n) alloc in the worst case.
- Free-list overlay — keep the bitmap for the initial boot-time pool, but maintain a singly-linked free list of returned frames (write the next pointer into the first 8 bytes of the freed page via the physical memory map). O(1) alloc and free.
Decision: free-list overlay. The bitmap is needed anyway to know which frames are in use, but a free list on top gives O(1) alloc from returned frames.
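A toy model of the free-list overlay, assuming a Vec<u8> in place of the physical memory map and a simple cursor in place of the boot-time pool. All names are hypothetical; the point is only to show the next pointer living in the first 8 bytes of each freed frame.

```rust
const FRAME_SIZE: usize = 4096;
/// Sentinel for "no next frame" (physical address 0 is a valid frame here).
const NIL: u64 = u64::MAX;

pub struct FrameAllocator {
    ram: Vec<u8>,    // stands in for the physical memory map
    next_fresh: u64, // boot-time pool cursor, as a frame number
    total: u64,      // number of frames in the pool
    free_head: u64,  // physical address of the first freed frame, or NIL
}

impl FrameAllocator {
    pub fn new(frames: u64) -> Self {
        Self {
            ram: vec![0; frames as usize * FRAME_SIZE],
            next_fresh: 0,
            total: frames,
            free_head: NIL,
        }
    }

    /// O(1): pop a recycled frame if one exists, else take a fresh one.
    pub fn alloc(&mut self) -> Option<u64> {
        if self.free_head != NIL {
            let frame = self.free_head;
            let at = frame as usize;
            let mut next = [0u8; 8];
            next.copy_from_slice(&self.ram[at..at + 8]);
            self.free_head = u64::from_le_bytes(next);
            return Some(frame);
        }
        if self.next_fresh < self.total {
            let frame = self.next_fresh * FRAME_SIZE as u64;
            self.next_fresh += 1;
            Some(frame)
        } else {
            None
        }
    }

    /// O(1): write the old head into the freed frame and make it the head.
    pub fn free(&mut self, frame: u64) {
        let at = frame as usize;
        self.ram[at..at + 8].copy_from_slice(&self.free_head.to_le_bytes());
        self.free_head = frame;
    }
}
```

Because free overwrites the start of the frame immediately, callers must not free a frame that is still referenced by a live mapping.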
Unmap primitive
Add unmap_user_page(pml4_phys, vaddr) -> Option<PhysAddr> to the memory
subsystem. This walks the page table, clears the PTE, invokes invlpg, and
returns the physical frame address so the caller can free it.
sys_munmap implementation
fn sys_munmap(addr: u64, length: u64) -> i64
- Page-align addr and length.
- Look up overlapping VMAs.
- For each page in the range: call unmap_user_page, push the returned frame onto the free list.
- Split/remove VMAs as needed (a munmap in the middle of a VMA creates two smaller VMAs).
- TLB flush (per-page invlpg is fine for now; batch flush can come later).
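The VMA bookkeeping for a partial unmap (front, tail, or middle split) can be sketched on its own, without touching page tables. The trimmed-down Vma struct and the remove_range helper below are hypothetical and assume addr and len are already page-aligned.

```rust
use std::collections::BTreeMap;

/// Reduced VMA for illustration; the real struct carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Vma {
    pub start: u64,
    pub len: u64,
    pub prot: u32,
}

/// Remove [addr, addr + len) from the map, splitting VMAs that straddle it.
pub fn remove_range(vmas: &mut BTreeMap<u64, Vma>, addr: u64, len: u64) {
    let end = addr + len;
    // Collect the keys of every VMA that overlaps the range.
    let hits: Vec<u64> = vmas
        .values()
        .filter(|v| v.start < end && v.start + v.len > addr)
        .map(|v| v.start)
        .collect();
    for key in hits {
        let vma = vmas.remove(&key).unwrap();
        let vma_end = vma.start + vma.len;
        if vma.start < addr {
            // Keep the part in front of the hole.
            vmas.insert(vma.start, Vma { len: addr - vma.start, ..vma.clone() });
        }
        if vma_end > end {
            // Keep the part behind the hole under its new start address.
            vmas.insert(end, Vma { start: end, len: vma_end - end, ..vma });
        }
    }
}
```

A middle unmap hits both branches and therefore turns one VMA into two.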
Changes
| File | Change |
|---|---|
| libkernel/src/memory/ | Add unmap_user_page, frame free list |
| osl/src/syscalls/mem.rs | Implement sys_munmap |
| libkernel/src/process.rs | VMA split/remove helpers |
Contiguous DMA allocations
alloc_dma_pages(pages) with pages > 1 bypasses the free list and uses
allocate_frame_sequential to guarantee physical contiguity. The sequential
allocator walks the boot-time memory map and can be exhausted — once next
exceeds the total usable frames, it returns None even if the free list has
recycled frames available.
In practice this is fine because multi-page contiguous allocations only happen during early boot (VirtIO descriptor rings). If this becomes a problem in the future, options include:
- Fall back to the free list for single-frame DMA when sequential is exhausted.
- Replace the sequential allocator with a buddy allocator that can satisfy contiguous requests from recycled frames.
Test
mmap a region, write a pattern, munmap it, mmap a new region — should
get the same (or nearby) frames back, zero-filled.
Phase 3: mprotect + Process Cleanup ✓ (implemented)
Goal: Change page permissions on existing mappings, and free all process memory on exit/execve.
sys_mprotect
fn sys_mprotect(addr: u64, length: u64, prot: u64) -> i64
- Validate addr is page-aligned.
- Walk VMAs in the range, update vma.prot.
- For each page: rewrite the PTE flags to match the new prot (reuse the PROT→PTF translation from Phase 1).
- invlpg each modified page.
- May need to split VMAs if the prot change covers only part of a VMA.
Process cleanup on exit
When a process exits (sys_exit / sys_exit_group), before marking zombie:
- Iterate all VMAs.
- For each page in each VMA: unmap and free the frame (reuse Phase 2 primitives).
- Free the user page tables themselves (PML4, PDPT, PD, PT pages).
- Free the brk region (iterate from brk_base to brk_current).
- Free the user stack pages.
Process cleanup on execve
sys_execve already creates a fresh PML4. After the new PML4 is set up,
free the old page tables and all frames from the old VMA map (same cleanup
logic as exit, but targeting the old PML4).
Changes
| File | Change |
|---|---|
| osl/src/syscalls/mem.rs | Implement sys_mprotect; osl/src/syscalls/process.rs calls cleanup in sys_exit |
| osl/src/exec.rs | Call cleanup for old address space before jump |
| libkernel/src/memory/ | PTE flag update helper, page table walker for cleanup |
| libkernel/src/process.rs | VMA split for partial mprotect |
Test
mmap RW, write data, mprotect to read-only, attempt write — should
fault. Run a long-lived process that repeatedly spawns children — memory
usage should stay bounded.
Phase 4: MAP_FIXED + Gap Finding ✓ (implemented)
Goal: Support MAP_FIXED placement and smarter allocation that avoids
fragmenting the address space.
MAP_FIXED
MAP_FIXED performs implicit munmap of overlapping VMAs before mapping at
the requested address (Linux semantics). Addr must be page-aligned and
non-zero.
Gap-finding allocator
Replaced the bump-down pointer (mmap_next) with a generic top-down gap
finder (libkernel/src/gap.rs). The OccupiedRanges trait abstracts
iteration over occupied intervals so the algorithm can be reused.
Search range: [MMAP_FLOOR, MMAP_CEILING) = [0x10_0000_0000, 0x4000_0000_0000).
The VMA BTreeMap is the sole source of truth — no bump pointer.
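A simplified version of the top-down search, assuming the occupied intervals arrive as a sorted, non-overlapping slice of (start, len) pairs. The real find_gap_topdown in libkernel/src/gap.rs works through the OccupiedRanges trait, so the signature below is an illustration rather than the actual API.

```rust
pub const MMAP_FLOOR: u64 = 0x10_0000_0000;
pub const MMAP_CEILING: u64 = 0x4000_0000_0000;

/// Find the highest gap of at least `len` bytes inside [floor, ceiling).
/// `occupied` must be sorted by start address and non-overlapping.
pub fn find_gap_topdown(
    occupied: &[(u64, u64)],
    floor: u64,
    ceiling: u64,
    len: u64,
) -> Option<u64> {
    let mut top = ceiling; // upper edge of the gap currently being examined
    for &(start, size) in occupied.iter().rev() {
        if start >= top {
            continue; // range lies entirely above the search window
        }
        let end = start + size;
        if end < top {
            // Candidate gap is [max(end, floor), top).
            let gap_lo = end.max(floor);
            if top > gap_lo && top - gap_lo >= len {
                return Some(top - len);
            }
        }
        top = start; // continue searching below this range
        if top <= floor {
            return None;
        }
    }
    // Whatever remains between the floor and the lowest range examined.
    if top > floor && top - floor >= len {
        Some(top - len)
    } else {
        None
    }
}
```

Placing the allocation at the top of the first fitting gap is what lets freed regions be reused without a separate bump pointer.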
Changes
| File | Change |
|---|---|
| libkernel/src/gap.rs | New: OccupiedRanges trait, find_gap_topdown |
| libkernel/src/lib.rs | Add pub mod gap |
| libkernel/src/process.rs | Remove mmap_next, add MMAP_FLOOR/MMAP_CEILING, find_mmap_gap |
| osl/src/syscalls/mem.rs | Rewrite sys_mmap with gap finder + MAP_FIXED |
| osl/src/clone.rs | Remove mmap_next from clone state |
| osl/src/exec.rs | Remove mmap_next reset and local MMAP_BASE constant |
Phase 5a: File-Backed MAP_PRIVATE (eager copy) ✓ (implemented)
Goal: Support mmap(fd, offset, ...) for MAP_PRIVATE file-backed
mappings with eager data copy. No sharing, no refcounting, no writeback.
6th syscall argument
The Linux mmap signature is:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
fd and offset are the 5th and 6th arguments. The assembly stub saves
user R9 to per_cpu.user_r9 (offset 32). sys_mmap reads the offset via
libkernel::syscall::get_user_r9() — no ABI change needed.
Design: read from the fd’s buffer
Two approaches were considered:
- Read from VFS by path: incorrect because a file’s path can change after open (rename, unlink). An open fd refers to an inode, not a path.
- Read from the fd’s existing in-memory buffer: VfsHandle holds the full file content in a Vec<u8>, exposed via FileHandle::content_bytes(). Semantically correct: the fd holds a reference to the file content.
Decision: option 2. When lazy/partial sys_open or inode-based VFS
arrives later, content_bytes() can trigger a full load or we switch to an
inode-keyed page cache. The mmap code doesn’t need to change.
Implementation
- FileHandle::content_bytes(): default returns None.
- VfsHandle::content_bytes(): returns Some(&self.content).
- sys_mmap file-backed path: extracts fd/offset, calls content_bytes(), allocates per-page (clear + copy file data, clamped to file length; bytes past EOF stay zero, matching Linux), maps with prot flags.
- Both MAP_FIXED and non-fixed variants work for file-backed mappings; the address selection logic from Phase 4 is reused.
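The per-page copy with EOF clamping can be sketched as a pure function. fill_page is a hypothetical helper (the document only names mmap_alloc_pages); it returns the page contents instead of writing into a freshly allocated frame.

```rust
pub const PAGE_SIZE: usize = 4096;

/// Produce the contents of page `page_index` of a mapping that starts at
/// `file_offset` within `content`. Bytes past the end of the file stay zero.
pub fn fill_page(content: &[u8], file_offset: usize, page_index: usize) -> [u8; PAGE_SIZE] {
    let mut page = [0u8; PAGE_SIZE]; // cleared first, as for anonymous pages
    let start = file_offset + page_index * PAGE_SIZE;
    if start < content.len() {
        // Copy at most one page, clamped to the remaining file length.
        let n = (content.len() - start).min(PAGE_SIZE);
        page[..n].copy_from_slice(&content[start..start + n]);
    }
    page
}
```

For a 5000-byte file, page 0 is fully populated, page 1 holds the remaining 904 bytes followed by zeros, and any further page is entirely zero.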
Changes
| File | Change |
|---|---|
| libkernel/src/file.rs | Added content_bytes() default method to FileHandle |
| osl/src/file.rs | Implemented content_bytes() on VfsHandle |
| osl/src/errno.rs | Added ENODEV for non-mmap-able handles |
| osl/src/syscalls/mem.rs | Extended sys_mmap with file-backed MAP_PRIVATE, added mmap_alloc_pages helper |
| user/mmap_file.c | New demo: open file, mmap, compare with read(), munmap |
Test
mmap_file: opens /shell, reads first 64 bytes via read(), mmaps same
file with MAP_PRIVATE/PROT_READ, compares mapped bytes with read() output,
munmaps, exits cleanly.
Phase 5b: MAP_SHARED + Refcounted Frames ✓ (anonymous shared memory)
Goal: Support shared anonymous mappings with reference-counted frames.
Shared memory objects (shmem_create)
A custom syscall shmem_create(size) (nr 508) creates a shared memory
object backed by eagerly-allocated, zeroed physical frames and returns
a file descriptor. The fd can be inherited by child processes or passed
via IPC. Both sides call mmap(MAP_SHARED, fd) to map the same physical
frames into their address spaces.
Frame refcount table
A BTreeMap<u64, u16> in MemoryServices tracks frames with refcount ≥ 2.
Frames not in the table have an implicit refcount of 1 (single owner).
Each shared frame has owners:
- The SharedMemInner object itself (1 ref, released on Arc drop)
- Each process mapping (1 ref per mmap, released on munmap or exit)
Methods:
- ref_share(phys): increment (insert with 2 if new to table)
- ref_release(phys) -> bool: decrement, return true if frame should be freed
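A sketch of the sparse refcount table described above. The RefcountTable wrapper is hypothetical (the document places the map directly in MemoryServices), and removing an entry once its count drops back to 1 is an assumption that follows from "frames not in the table have an implicit refcount of 1".

```rust
use std::collections::BTreeMap;

/// Only frames with refcount >= 2 are stored; absence means refcount 1.
#[derive(Default)]
pub struct RefcountTable {
    counts: BTreeMap<u64, u16>,
}

impl RefcountTable {
    /// Add one owner: a frame new to the table goes from implicit 1 to 2.
    pub fn ref_share(&mut self, phys: u64) {
        self.counts
            .entry(phys)
            .and_modify(|count| *count += 1)
            .or_insert(2);
    }

    /// Drop one owner. Returns true when the caller should free the frame.
    pub fn ref_release(&mut self, phys: u64) -> bool {
        let remaining = match self.counts.get_mut(&phys) {
            // Not in the table: sole owner releasing, so 1 -> 0.
            None => return true,
            Some(count) => {
                *count -= 1;
                *count
            }
        };
        if remaining < 2 {
            // Back to a single owner: return to the implicit representation.
            self.counts.remove(&phys);
        }
        false
    }
}
```

Keeping single-owner frames out of the map means ordinary private pages cost nothing in the table and release in one lookup.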
Refcount-aware cleanup
- unmap_and_release_user_page(): unmaps PTE, calls ref_release, only frees when refcount reaches 0.
- cleanup_user_address_space(): uses ref_release for all leaf frames. Backwards-compatible: non-shared frames return true immediately.
- SharedMemInner::drop(): calls release_shared_frame() for each backing frame. Safe because Drop only fires from fd close (outside with_memory).
MAP_SHARED in sys_mmap
- Validates MAP_SHARED and MAP_PRIVATE are mutually exclusive
- MAP_SHARED | MAP_ANONYMOUS returns -EINVAL (no fork)
- MAP_SHARED with fd: extracts SharedMemInner from FdObject::SharedMem, maps its physical frames, increments refcounts via ref_share
Changes
| File | Change |
|---|---|
| libkernel/src/memory/mod.rs | refcounts: BTreeMap, ref_share, ref_release, unmap_and_release_user_page, refcount-aware cleanup_user_address_space |
| libkernel/src/shmem.rs | New: SharedMemInner struct with Drop |
| libkernel/src/file.rs | FdObject::SharedMem variant, as_shmem() |
| libkernel/src/process.rs | MAP_SHARED constant |
| osl/src/syscalls/shmem.rs | New: sys_shmem_create |
| osl/src/syscalls/mod.rs | Wire syscall 508 |
| osl/src/syscalls/mem.rs | MAP_SHARED path in sys_mmap, refcount-aware sys_munmap |
| osl/src/fd_helpers.rs | get_fd_shmem helper |
Test
user/src/shmem_test.c: Parent creates shmem, writes magic pattern, spawns
child. Child inherits fd, mmaps it, verifies pattern, writes response.
Parent waits, verifies response.
Phase 5c: File-Backed MAP_SHARED (future)
Goal: Multiple processes mapping the same file share physical frames via an inode-keyed page cache.
This requires:
- VFS inode identifiers: unique per file across mounts. The 9P protocol carries qid.path which serves as an inode, but it is currently discarded when converting to VfsDirEntry.
- Shared page cache: a global BTreeMap<(InodeId, page_offset) → PhysAddr> so multiple processes mapping the same file page get the same frame.
- Dirty tracking: msync or process exit writes dirty shared pages back to the file.
The frame refcount table from Phase 5b provides the foundation.
Dependency Graph
Phase 1 ─── VMA tracking + PROT flags
│
├──▶ Phase 2 ─── Frame free list + munmap
│ │
│ └──▶ Phase 3 ─── mprotect + process cleanup
│ │
│ └──▶ Phase 4 ─── MAP_FIXED + gap finding
│ │
│ ├──▶ Phase 5a ─── File-backed MAP_PRIVATE (eager copy)
│ │
│ └──▶ Phase 5b ─── MAP_SHARED (anonymous, shmem_create)
│ │
│ └──▶ Phase 5c ─── File-backed MAP_SHARED
│ (requires inode-based VFS + page cache)
Phase 5b (MAP_SHARED anonymous) uses frame refcounting and shmem_create
to share physical frames between processes. No VFS changes needed.
Phase 5c (file-backed MAP_SHARED) requires inode identifiers from the VFS and a global page cache, building on Phase 5b’s refcount infrastructure.
Key Decisions
Eager vs demand paging
All phases use eager paging — frames are allocated and mapped immediately
in sys_mmap. Demand paging (lazy fault-in) is a future optimisation that
does not affect the syscall interface.
6th syscall argument for mmap
The offset parameter (6th arg, user r9) is read from PerCpuData
rather than changing the dispatch function signature. This avoids adding
overhead to every syscall for a parameter only mmap uses.
Frame allocator: free-list overlay
Freed frames go onto a singly-linked free list stored in the pages themselves (using the physical memory map for access). The existing boot-time allocator remains for initial allocation; the free list is consulted first.
VMA storage: BTreeMap
A BTreeMap<u64, Vma> keyed by start address provides O(log n) lookup,
ordered iteration for gap-finding, and natural support for range queries.
Adequate for the expected number of VMAs per process (tens to low hundreds).
Graphics Subsystem Design
Overview
The kernel migrates from VGA text mode (80x25) to a pixel framebuffer during boot. Two hardware paths are covered: the Bochs Graphics Adapter (BGA) for the initial implementation, and virtio-gpu as future work.
After the switch, all existing output (println!, status_bar!, timeline
strip, boot progress bar) renders via an 8x16 bitmap font onto the
framebuffer. The text grid expands from 80x25 to 128x48 characters at
1024x768 resolution.
Architecture
Early boot (text mode) After PCI scan (graphical mode)
======================== ================================
print!/status_bar! print!/status_bar!
| |
Writer Writer
| |
DisplayBackend::TextMode DisplayBackend::Graphical
| |
VgaBuffer (0xB8000 MMIO) Framebuffer (BGA LFB MMIO)
|
font::draw_char() -> pixels
The Writer struct contains a DisplayBackend enum that dispatches all
cell reads/writes to either the legacy VGA text buffer or the pixel
framebuffer. The switch happens once during boot after PCI enumeration
detects the BGA device.
BGA (Bochs Graphics Adapter) – Implemented
Hardware Interface
The BGA device is QEMU’s default VGA adapter on Q35 machines (-vga std).
It is controlled via two I/O ports:
| Port | Direction | Description |
|---|---|---|
| 0x01CE | Write | Register index |
| 0x01CF | R/W | Register data |
Register Map
| Index | Name | Description |
|---|---|---|
| 0 | ID | Version ID (0xB0C0..0xB0C5) |
| 1 | XRES | Horizontal resolution |
| 2 | YRES | Vertical resolution |
| 3 | BPP | Bits per pixel (8/15/16/24/32) |
| 4 | ENABLE | Display enable + LFB enable |
| 5 | BANK | VGA bank (legacy, not used) |
| 6 | VIRT_WIDTH | Virtual width (scrolling) |
| 7 | VIRT_HEIGHT | Virtual height (scrolling) |
| 8 | X_OFFSET | Display X offset |
| 9 | Y_OFFSET | Display Y offset |
Mode Switch Sequence
- Write ENABLE = 0 (disable display)
- Write XRES = 1024, YRES = 768, BPP = 32
- Write ENABLE = 0x01 | 0x40 (enabled + LFB enabled)
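The three-step sequence can be sketched against a small port-I/O trait so it runs without hardware. The PortIo trait, the RecordingPorts mock, and the constant names are hypothetical; the register indices and port numbers come from the tables above, and the LFB-enable bit is taken as 0x40, which is the value of VBE_DISPI_LFB_ENABLED in the Bochs VBE headers (0x20 there is the 8-bit DAC flag).

```rust
/// Abstraction over 16-bit port writes so the sequence can be tested.
pub trait PortIo {
    fn outw(&mut self, port: u16, value: u16);
}

pub const BGA_INDEX_PORT: u16 = 0x01CE;
pub const BGA_DATA_PORT: u16 = 0x01CF;

pub const REG_XRES: u16 = 1;
pub const REG_YRES: u16 = 2;
pub const REG_BPP: u16 = 3;
pub const REG_ENABLE: u16 = 4;

pub const ENABLED: u16 = 0x01;
pub const LFB_ENABLED: u16 = 0x40; // VBE_DISPI_LFB_ENABLED

/// Select a register through the index port, then write its value.
fn write_reg<P: PortIo>(io: &mut P, index: u16, value: u16) {
    io.outw(BGA_INDEX_PORT, index);
    io.outw(BGA_DATA_PORT, value);
}

/// Disable the display, program the mode, then re-enable with the LFB.
pub fn bga_set_mode<P: PortIo>(io: &mut P, width: u16, height: u16, bpp: u16) {
    write_reg(io, REG_ENABLE, 0);
    write_reg(io, REG_XRES, width);
    write_reg(io, REG_YRES, height);
    write_reg(io, REG_BPP, bpp);
    write_reg(io, REG_ENABLE, ENABLED | LFB_ENABLED);
}

/// Test double that records every (port, value) pair written.
#[derive(Default)]
pub struct RecordingPorts {
    pub writes: Vec<(u16, u16)>,
}

impl PortIo for RecordingPorts {
    fn outw(&mut self, port: u16, value: u16) {
        self.writes.push((port, value));
    }
}
```

In the kernel the trait would be backed by real out instructions; the ordering (disable first, enable last) is the part that matters.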
Linear Framebuffer (LFB)
The LFB is located at PCI BAR0 of the BGA device:
- PCI Vendor: 0x1234
- PCI Device: 0x1111
- BAR0: Physical base address of the LFB (typically 0xFD000000 on Q35)
- Size: width * height * (bpp/8) = 1024 * 768 * 4 = 3,145,728 bytes
- Pixel format: BGRX (blue in byte 0, green in byte 1, red in byte 2, byte 3 unused)
The kernel maps the LFB into the kernel MMIO virtual window (0xFFFF_8002_…)
using map_mmio_region(). This region is present in all user page tables
(via shared PML4 entries 256-510).
Software Text Rendering
Characters are rendered using an embedded 8x16 bitmap font (standard IBM VGA ROM font, CP437 character set, 256 glyphs, 4096 bytes).
- Text grid: 128 columns x 48 rows (1024/8 x 768/16)
- Font: libkernel/src/font.rs – FONT_8X16 static array + draw_char()
- Shadow buffer: [[ScreenChar; 128]; 48] inside DisplayBackend::Graphical enables scrolling without reading back from MMIO
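Glyph rendering for one text cell can be sketched as follows. The signature is hypothetical and differs from the kernel's draw_char(): it writes into a plain u32 slice, takes the glyph as a 16-byte array, and assumes the usual VGA font convention that the most significant bit of each row is the leftmost pixel.

```rust
pub const GLYPH_W: usize = 8;
pub const GLYPH_H: usize = 16;

/// Render one 8x16 glyph at text cell (`col`, `row`).
/// `stride` is the framebuffer width in pixels; one u32 per pixel.
pub fn draw_char(
    fb: &mut [u32],
    stride: usize,
    col: usize,
    row: usize,
    glyph: &[u8; GLYPH_H],
    fg: u32,
    bg: u32,
) {
    let x0 = col * GLYPH_W;
    let y0 = row * GLYPH_H;
    for (dy, bits) in glyph.iter().enumerate() {
        for dx in 0..GLYPH_W {
            // Assumed bit order: bit 7 is the leftmost pixel of the row.
            let on = (*bits & (0x80u8 >> dx)) != 0;
            fb[(y0 + dy) * stride + x0 + dx] = if on { fg } else { bg };
        }
    }
}
```

Every pixel of the cell is written, foreground or background, so redrawing a cell needs no separate clear.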
Color Mapping
The VGA 16-color palette is mapped to 32-bit BGRA values:
| VGA Color | BGRA Value |
|---|---|
| Black | 0x00000000 |
| Blue | 0x000000AA |
| Green | 0x0000AA00 |
| … | … |
| White | 0x00FFFFFF |
Row Layout (preserved from text mode)
| Row(s) | Purpose |
|---|---|
| 0 | Status bar (white on blue) |
| 1 | Timeline strip (colored blocks) |
| 2 | Boot progress bar (during init) |
| 3-47 | Scrolling text output |
Scrolling
The graphical backend uses a fast scroll path:
- Framebuffer::scroll_up() uses core::ptr::copy() to shift pixel data up by one character row (16 scanlines) in a single overlapping copy (memmove semantics)
- The shadow cells array is shifted correspondingly
- Only the new blank bottom row is cleared with fill_rect()
This avoids the naive approach of redrawing every character cell on scroll.
Boot Sequence
- Kernel boots in VGA text mode – early println! works before PCI scan
- After PCI scan: detect BGA via I/O port ID register
- Find BGA PCI device, read BAR0 for LFB physical address
- Map LFB into kernel virtual space
- Call bga_set_mode(1024, 768, 32) to switch hardware
- Call switch_to_framebuffer() – copies current text content into shadow buffer, repaints entire screen
- All subsequent output renders as pixels
If BGA is not detected (e.g. -vga none), the kernel stays in text mode.
QEMU Configuration
Q35 machine includes stdvga (BGA-compatible) by default. No changes to
run.sh required. To be explicit: -vga std.
Limitations
- No hardware cursor in graphical mode (software cursor is a future enhancement)
- LFB mapped with NO_CACHE (not write-combining); acceptable for text console but suboptimal for heavy graphics. A future optimization would configure PAT for write-combining on the LFB region.
- Only works in QEMU/Bochs (BGA is not present on real hardware)
VirtIO-GPU – Future Work
Motivation
- Standard virtio device, works with the virtio-drivers crate already in use
- Supports hardware-accelerated 2D operations (TRANSFER_TO_HOST_2D)
- Better fit for the existing virtio infrastructure (virtio-blk, virtio-9p)
- Portable across any hypervisor supporting virtio-gpu (not just QEMU)
Hardware Interface
| Field | Value |
|---|---|
| PCI Vendor | 0x1AF4 |
| PCI Device | 0x1050 (modern) / 0x1010 (legacy) |
| Device class | Display controller |
| Virtqueues | controlq (commands), cursorq |
Command Protocol
Unlike BGA’s simple I/O port registers, virtio-gpu uses a request/response protocol over virtqueues:
- RESOURCE_CREATE_2D – allocate a 2D resource (the framebuffer)
- RESOURCE_ATTACH_BACKING – attach DMA pages as backing store
- SET_SCANOUT – assign the resource to a display output
- TRANSFER_TO_HOST_2D – copy dirty rectangles from guest to host
- RESOURCE_FLUSH – tell the host to display the updated region
Design Sketch
- VirtioGpuActor following the existing VirtioBlkActor pattern
- Scanout = framebuffer resource backed by DMA pages from alloc_dma_pages()
- Periodic TRANSFER_TO_HOST_2D + RESOURCE_FLUSH to update display
- Dirty-rect tracking to minimize transfer size
- Could share the same Framebuffer abstraction used by BGA
Why BGA First
- Simpler: I/O port registers + direct MMIO framebuffer writes
- No virtqueue setup, no command protocol
- QEMU Q35 has it by default (stdvga)
- Sufficient for a text console
Key Files
| File | Description |
|---|---|
| libkernel/src/framebuffer.rs | BGA register access, Framebuffer struct |
| libkernel/src/font.rs | Embedded 8x16 bitmap font + draw_char() |
| libkernel/src/vga_buffer/ | DisplayBackend abstraction, Writer refactoring (mod.rs, capture.rs, timeline.rs) |
| kernel/src/main.rs | init_bga_framebuffer() boot integration |
Status
- [x] BGA detection and mode switching
- [x] Linear framebuffer mapping and pixel rendering
- [x] 8x16 bitmap font (full CP437 character set)
- [x] DisplayBackend abstraction with text-mode fallback
- [x] Fast pixel scrolling
- [x] Status bar, timeline, progress bar all work in graphical mode
- [ ] Software cursor (underline/block at cursor position)
- [ ] Write-combining for LFB pages (PAT configuration)
- [ ] Virtio-GPU backend
FPU / SSE State Management
x86-64 Floating-Point & SIMD Instruction Sets
| Family | Registers | Width | Notes |
|---|---|---|---|
| x87 FPU | ST(0)–ST(7) | 80-bit | Legacy; used by some libm implementations |
| MMX | MM0–MM7 | 64-bit | Aliases x87 registers |
| SSE/SSE2 | XMM0–XMM15 | 128-bit | Baseline for x86-64; musl uses SSE2 |
| AVX/AVX2 | YMM0–YMM15 | 256-bit | Extends each XMM register to 256 bits (XMM is the low half of YMM) |
| AVX-512 | ZMM0–ZMM31 | 512-bit | Not relevant for this kernel |
SSE2 is part of the x86-64 baseline: every long-mode CPU supports it, and the System V AMD64 ABI passes floating-point arguments in XMM0–XMM7 and returns floating-point values in XMM0–XMM1. musl libc is compiled with SSE2 and will use XMM registers in user-space code.
Kernel Target Configuration
The kernel’s custom target (x86_64-os.json) specifies:
"features": "-mmx,-sse,+soft-float"
This tells LLVM to never emit SSE/MMX instructions in kernel Rust code. All floating-point operations (if any) use soft-float emulation. This means the kernel never touches XMM registers, so:
- Syscall path: No SSE save/restore needed: the kernel executes entirely with GPRs, and syscall/sysret returns to the same process.
- Interrupt handlers: Safe as long as they don’t use SSE (guaranteed by the target config).
- Timer preemption: The only path that switches between different user processes’ register contexts — requires SSE save/restore.
CR0/CR4 Setup (enable_sse)
SSE instructions will fault unless the CPU’s control registers are configured:
pub fn enable_sse() {
    unsafe {
        // CR0: clear EM (bit 2, x87 emulation), set MP (bit 1, monitor coprocessor)
        let mut cr0 = Cr0::read_raw();
        cr0 &= !(1 << 2); // clear CR0.EM
        cr0 |= 1 << 1;    // set CR0.MP
        Cr0::write_raw(cr0);
        // CR4: set OSFXSR (bit 9) and OSXMMEXCPT (bit 10)
        let mut cr4 = Cr4::read_raw();
        cr4 |= (1 << 9) | (1 << 10);
        Cr4::write_raw(cr4);
    }
}
- CR0.EM = 0: Do not trap x87/SSE instructions.
- CR0.MP = 1: Enable WAIT/FWAIT monitoring.
- CR4.OSFXSR = 1: Enable FXSAVE/FXRSTOR and SSE instructions.
- CR4.OSXMMEXCPT = 1: Enable unmasked SSE exception handling via #XM.
Called once during boot, before any user processes are spawned.
Eager FXSAVE/FXRSTOR Context Switch
We use the eager strategy: save and restore FPU/SSE state on every timer-driven context switch, unconditionally.
Timer stub flow
interrupt fires → CPU pushes iretq frame (40 bytes)
→ stub pushes 15 GPRs (120 bytes)
→ sub rsp, 512; fxsave [rsp] ← save SSE state
→ call preempt_tick ← may switch RSP
→ fxrstor [rsp] ← restore SSE state
→ add rsp, 512
→ pop GPRs; iretq
Stack layout during preemption
high address ┌──────────────────────────┐
│ SS / RSP / RFLAGS │ iretq frame (40 bytes)
│ CS / RIP │
├──────────────────────────┤
│ rax, rbx, ... r15 │ 15 GPRs (120 bytes)
├──────────────────────────┤
│ FXSAVE area │ 512 bytes (16-byte aligned)
│ (x87/MMX/SSE state) │ MXCSR at offset +24
│ │ XMM0-15 at offset +160
low address └──────────────────────────┘ ← saved_rsp points here
Total: 672 bytes = 42 x 16, preserving 16-byte alignment for both fxsave
(requires 16-byte aligned operand) and the SysV ABI call convention.
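The size arithmetic above can be written down as compile-time constants. This is only a sketch of the calculation; the constant and function names are hypothetical, and it assumes the top of the frame is 16-byte aligned, which holds in 64-bit mode because the CPU aligns RSP before pushing the interrupt frame.

```rust
/// SS, RSP, RFLAGS, CS, RIP pushed by the CPU.
pub const IRETQ_FRAME_BYTES: usize = 5 * 8;
/// The 15 general-purpose registers pushed by the stub.
pub const GPR_BYTES: usize = 15 * 8;
/// The FXSAVE area reserved below the GPRs.
pub const FXSAVE_BYTES: usize = 512;
/// Everything between the interrupted stack top and saved_rsp.
pub const PREEMPT_FRAME_BYTES: usize = IRETQ_FRAME_BYTES + GPR_BYTES + FXSAVE_BYTES;

// Checked at compile time: 40 + 120 + 512 = 672 = 42 * 16.
const _: () = assert!(PREEMPT_FRAME_BYTES == 672);
const _: () = assert!(PREEMPT_FRAME_BYTES % 16 == 0);

/// Address that saved_rsp would hold for a frame starting at `frame_top`.
pub const fn saved_rsp(frame_top: usize) -> usize {
    frame_top - PREEMPT_FRAME_BYTES
}
```

Because the frame size is a multiple of 16, a 16-byte-aligned `frame_top` always yields a 16-byte-aligned FXSAVE operand.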
New thread initialization
spawn_thread and spawn_user_thread allocate the FXSAVE area below the
SwitchFrame and initialize MXCSR at offset +24 to 0x1F80 (the Intel
default: all SSE exceptions masked, round-to-nearest mode). XMM registers
start zeroed.
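A minimal sketch of that initial image, assuming the area is represented as a plain 512-byte array: `FxSaveArea` and `initial_fxsave_area` are hypothetical names, and only the MXCSR field described above is filled in.

```rust
/// Byte offset of MXCSR inside the FXSAVE area.
pub const MXCSR_OFFSET: usize = 24;
/// Default MXCSR: all SSE exceptions masked, round-to-nearest.
pub const MXCSR_DEFAULT: u32 = 0x1F80;

/// 512-byte FXSAVE image; FXSAVE/FXRSTOR require 16-byte alignment.
#[repr(C, align(16))]
pub struct FxSaveArea(pub [u8; 512]);

/// Image for a thread that has not yet executed any FPU/SSE instruction.
pub fn initial_fxsave_area() -> FxSaveArea {
    let mut area = FxSaveArea([0u8; 512]);
    // MXCSR is stored little-endian at offset +24; everything else stays zero,
    // so XMM0-XMM15 (offset +160) start out cleared.
    area.0[MXCSR_OFFSET..MXCSR_OFFSET + 4].copy_from_slice(&MXCSR_DEFAULT.to_le_bytes());
    area
}
```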
FXSAVE Memory Layout (512 bytes)
| Offset | Size | Field |
|---|---|---|
| 0 | 2 | FCW (x87 control word) |
| 2 | 2 | FSW (x87 status word) |
| 4 | 1 | FTW (abridged x87 tag word) |
| 5 | 1 | Reserved |
| 6 | 2 | FOP (last x87 opcode) |
| 8 | 8 | FIP (x87 instruction pointer) |
| 16 | 8 | FDP (x87 data pointer) |
| 24 | 4 | MXCSR (SSE control/status) |
| 28 | 4 | MXCSR_MASK |
| 32 | 128 | ST(0)–ST(7) / MM0–MM7 (8 x 16 bytes) |
| 160 | 256 | XMM0–XMM15 (16 x 16 bytes) |
| 416 | 96 | Reserved (must be zero for FXRSTOR) |
The MXCSR default value 0x1F80 means:
- Bits 12:7 = 0b111111: all six SSE exception masks set (no traps)
- Bits 14:13 = 0b00: round-to-nearest-even
- All exception flags (bits 5:0) cleared
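The bit fields listed above can be checked with a small decoder. This is an illustrative sketch; the `Mxcsr` wrapper and `Rounding` enum are hypothetical helpers, not kernel types.

```rust
/// Rounding-control field, MXCSR bits 14:13.
#[derive(Debug, PartialEq, Eq)]
pub enum Rounding {
    NearestEven,
    Down,
    Up,
    TowardZero,
}

/// Thin wrapper around a raw MXCSR value.
pub struct Mxcsr(pub u32);

impl Mxcsr {
    /// Sticky exception flags, bits 5:0.
    pub fn exception_flags(&self) -> u32 {
        self.0 & 0x3F
    }

    /// Exception mask bits, bits 12:7 (1 = exception masked).
    pub fn exception_masks(&self) -> u32 {
        (self.0 >> 7) & 0x3F
    }

    /// Rounding mode, bits 14:13.
    pub fn rounding(&self) -> Rounding {
        match (self.0 >> 13) & 0b11 {
            0 => Rounding::NearestEven,
            1 => Rounding::Down,
            2 => Rounding::Up,
            _ => Rounding::TowardZero,
        }
    }
}
```

Decoding `Mxcsr(0x1F80)` gives no flags set, all six masks set, and round-to-nearest-even.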
Why Syscalls Don’t Need SSE Saves
The SYSCALL instruction does not change the process — it transitions from
ring 3 to ring 0 within the same thread. Since the kernel target has
-sse,+soft-float, no kernel code will modify XMM registers. When the
syscall handler returns via SYSRETQ, XMM registers still hold the user
process’s values.
The timer preemption path is different: it can switch from process A’s context to process B’s context, so process A’s XMM state would be overwritten by process B if not saved.
Future Considerations
Lazy FPU switching (CR0.TS)
Instead of saving/restoring on every context switch, set CR0.TS = 1 after switching away from a thread. The next SSE instruction triggers a #NM (Device Not Available) fault, at which point the handler saves the old thread’s state and loads the new thread’s state, then clears CR0.TS.
Pros: Avoids the 512-byte save/restore overhead when threads don’t use SSE (e.g., kernel threads).
Cons: More complex; adds #NM handler latency; and modern CPUs make FXSAVE fast enough that eager switching is preferred (Linux has since dropped lazy switching in favour of eager).
XSAVE for AVX
FXSAVE/FXRSTOR cover only the x87/MMX state and XMM0–XMM15, so AVX support would need XSAVE/XRSTOR, which can save the full YMM/ZMM state. The XSAVE area size varies by CPU (queried via CPUID leaf 0xD). This would require:
- CPUID.0xD.0:EBX to determine XSAVE area size
- CR4.OSXSAVE = 1 and XCR0 configuration
- Dynamic allocation of per-thread XSAVE areas
- Replace FXSAVE/FXRSTOR with XSAVE/XRSTOR in the timer stub
File Descriptors & Pipes
Design for per-process file descriptor tables, the FileHandle trait,
blocking syscalls, and the pipe implementation.
Motivation
The kernel currently has three syscalls: write (hardcoded to stdout/stderr
via crate::print!()), exit, and arch_prctl. There is no concept of a
file descriptor, no read/close, and no IPC mechanism between user
processes.
Adding a proper file descriptor layer enables:
- pipe for parent→child / sibling IPC
- Redirecting stdout/stderr to pipes (shell pipelines)
- Future open/read/write/close for VFS-backed files
- dup2 for fd redirection
Overview
User process Kernel
───────────── ──────
write(fd, buf, n) ──syscall──► fd_table[fd].write(buf)
│
┌─────────────┼──────────────┐
▼ ▼ ▼
ConsoleHandle PipeWriter VfsHandle SharedMem
→ print!() → PipeInner → VFS → shmem frames
▲
┌─────────────┘
│
read(fd, buf, n) ──► fd_table[fd].read(buf)
│
PipeReader
→ PipeInner
Layer 1: FileHandle trait
/// A kernel object backing an open file descriptor.
///
/// Implementations must be safe to share across threads (the fd table
/// holds `Arc<dyn FileHandle>`).
pub trait FileHandle: Send + Sync {
    /// Read up to `buf.len()` bytes. Returns the number of bytes read,
    /// or 0 for EOF. May block the calling thread (see "Blocking" below).
    fn read(&self, buf: &mut [u8]) -> Result<usize, FileError>;

    /// Write up to `buf.len()` bytes. Returns the number of bytes written.
    /// May block the calling thread.
    fn write(&self, buf: &[u8]) -> Result<usize, FileError>;

    /// Release resources associated with this handle.
    /// Called when the last `Arc` is dropped (i.e. last fd closed).
    fn close(&self) {}

    /// Return a name for downcasting purposes.
    fn kind(&self) -> &'static str;

    /// For directory handles: serialize entries as linux_dirent64 into buf.
    fn getdents64(&self, _buf: &mut [u8]) -> Result<usize, FileError> {
        Err(FileError::NotATty)
    }
}
FileError is a structured enum in libkernel::file (using snafu for Display):
#[derive(Debug, Clone, Copy, Snafu)]
pub enum FileError {
    BadFd,            // bad file descriptor
    IsDirectory,      // is a directory
    NotATty,          // inappropriate ioctl for device
    TooManyOpenFiles, // too many open files
}
Linux errno numeric codes are defined separately in osl::errno and converted
from FileError via errno::file_errno(). This keeps libkernel free of
Linux-specific numeric constants.
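A sketch of what such a conversion could look like. The source names `errno::file_errno()` but does not show it, so the signature, the negative-return convention, and the `syscall_ret` helper below are assumptions; the errno numbers themselves are the standard Linux values. The enum is repeated without the snafu derive so the block stands alone.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileError {
    BadFd,
    IsDirectory,
    NotATty,
    TooManyOpenFiles,
}

// Standard Linux errno values.
pub const EBADF: i64 = 9;
pub const EISDIR: i64 = 21;
pub const EMFILE: i64 = 24;
pub const ENOTTY: i64 = 25;

/// Map a kernel-internal error to a negative errno (assumed signature).
pub fn file_errno(e: FileError) -> i64 {
    let code = match e {
        FileError::BadFd => EBADF,
        FileError::IsDirectory => EISDIR,
        FileError::NotATty => ENOTTY,
        FileError::TooManyOpenFiles => EMFILE,
    };
    -code
}

/// Turn a FileHandle result into a raw syscall return value (hypothetical helper).
pub fn syscall_ret(r: Result<usize, FileError>) -> i64 {
    match r {
        Ok(n) => n as i64,
        Err(e) => file_errno(e),
    }
}
```

Keeping the mapping in one function means adding a `FileError` variant forces the match to be updated, since it has no wildcard arm.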
FileHandle::read/write are synchronous — they return when the
operation completes or an error occurs. Blocking is handled at the
scheduler level (see below), not via async/await.
Layer 2: Per-process fd table
Add to Process:
pub struct Process {
    // ... existing fields ...
    pub fd_table: Vec<Option<Arc<dyn FileHandle>>>,
}
On process creation, pre-populate fds 0–2:
fd_table: vec![
    Some(Arc::new(ConsoleHandle)), // 0: stdin (read returns EBADF for now)
    Some(Arc::new(ConsoleHandle)), // 1: stdout
    Some(Arc::new(ConsoleHandle)), // 2: stderr
],
Fd allocation: scan for the first None slot; if none, push a new entry.
This matches the POSIX “lowest available fd” rule.
impl Process {
    pub fn alloc_fd(&mut self, handle: Arc<dyn FileHandle>) -> Result<usize, FileError> {
        for (i, slot) in self.fd_table.iter().enumerate() {
            if slot.is_none() {
                self.fd_table[i] = Some(handle);
                return Ok(i);
            }
        }
        if self.fd_table.len() < MAX_FDS {
            let fd = self.fd_table.len();
            self.fd_table.push(Some(handle));
            Ok(fd)
        } else {
            Err(FileError::TooManyOpenFiles)
        }
    }

    pub fn close_fd(&mut self, fd: usize) -> Result<(), FileError> {
        if fd >= self.fd_table.len() {
            return Err(FileError::BadFd);
        }
        match self.fd_table[fd].take() {
            Some(handle) => { handle.close(); Ok(()) }
            None => Err(FileError::BadFd),
        }
    }

    pub fn get_fd(&self, fd: usize) -> Result<Arc<dyn FileHandle>, FileError> {
        self.fd_table.get(fd)
            .and_then(|slot| slot.clone())
            .ok_or(FileError::BadFd)
    }
}
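To see the "lowest available fd" rule in action, here is a stripped-down, hosted sketch of the same table logic. `FdTable`, the generic handle type `H`, and the `MAX_FDS` value of 8 are hypothetical stand-ins for `Process`, `Arc<dyn FileHandle>`, and the kernel's real limit.

```rust
/// Hypothetical limit for this sketch; the kernel's MAX_FDS is not shown in the source.
pub const MAX_FDS: usize = 8;

#[derive(Debug, PartialEq, Eq)]
pub struct BadFd;

/// Minimal fd table, generic over the handle type.
pub struct FdTable<H> {
    slots: Vec<Option<H>>,
}

impl<H> FdTable<H> {
    /// Pre-populate fds 0, 1 and 2.
    pub fn with_stdio(stdin: H, stdout: H, stderr: H) -> Self {
        FdTable { slots: vec![Some(stdin), Some(stdout), Some(stderr)] }
    }

    /// Reuse the first free slot, otherwise grow up to MAX_FDS.
    pub fn alloc(&mut self, handle: H) -> Option<usize> {
        if let Some(i) = self.slots.iter().position(Option::is_none) {
            self.slots[i] = Some(handle);
            return Some(i);
        }
        if self.slots.len() < MAX_FDS {
            self.slots.push(Some(handle));
            Some(self.slots.len() - 1)
        } else {
            None
        }
    }

    /// Free a slot and hand back the handle that occupied it.
    pub fn close(&mut self, fd: usize) -> Result<H, BadFd> {
        self.slots.get_mut(fd).and_then(Option::take).ok_or(BadFd)
    }
}
```

Closing fd 1 and then allocating returns 1 again, which is the property that stdout redirection with dup2-style calls relies on.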
Layer 3: ConsoleHandle
The simplest FileHandle — wraps the existing crate::print!() behaviour:
pub struct ConsoleHandle {
    pub readable: bool,
}

impl FileHandle for ConsoleHandle {
    fn read(&self, buf: &mut [u8]) -> Result<usize, FileError> {
        if !self.readable {
            return Err(FileError::BadFd);
        }
        Ok(crate::console::read_input(buf))
    }

    fn write(&self, buf: &[u8]) -> Result<usize, FileError> {
        if let Ok(s) = core::str::from_utf8(buf) {
            crate::print!("{}", s);
        }
        Ok(buf.len())
    }

    fn kind(&self) -> &'static str { "console" }
}
Layer 4: Blocking syscalls (Option C)
Pipe read and write must block when the buffer is empty or full.
Rather than adding async/await to the syscall path, we add a Blocked
state to the scheduler.
New thread state
enum ThreadState {
    Ready,
    Running,
    Blocked, // ← new
    Dead,
}
Blocking API
/// Block the current thread until `unblock` is called for it.
///
/// Saves the current thread's state as `Blocked` and yields to the
/// scheduler. Returns when another thread (or ISR) calls
/// `unblock(thread_idx)`.
///
/// Must be called with interrupts disabled.
pub fn block_current_thread() { ... }

/// Move a blocked thread back onto the ready queue.
///
/// Safe to call from ISR context (e.g. a pipe write that wakes a reader).
pub fn unblock(thread_idx: usize) { ... }
How blocking works
- Syscall handler (e.g. sys_read on an empty pipe) calls block_current_thread().
- The scheduler marks the thread Blocked and context-switches away.
- preempt_tick never re-queues Blocked threads.
- When the condition is met (e.g. a writer pushes data into the pipe), the pipe calls unblock(thread_idx).
- unblock sets the thread to Ready and pushes it onto the ready queue.
- On the next preemption the thread is scheduled, returns from block_current_thread, and the syscall retries the operation.
Avoiding lost wakeups
The pipe must check the condition and register the current thread as a waiter
while holding the pipe’s internal lock; the lock is released only just before
calling block_current_thread. The sequence is:
lock pipe
if buffer_empty:
    register self as waiter (store thread_idx)
    unlock pipe
    block_current_thread()   ← yields here
    goto top                 ← retry after wakeup
else:
    copy data
    wake writer if blocked
    unlock pipe
    return count
The critical property: between checking the condition and registering as a
waiter, no writer can sneak in, because the pipe lock is held. Any writer that
runs afterwards will see the registered waiter and call unblock.
Layer 5: Pipe
Shared state
struct PipeInner {
    buf: VecDeque<u8>,
    capacity: usize,               // default 4096
    reader_closed: bool,
    writer_closed: bool,
    blocked_reader: Option<usize>, // thread_idx waiting for data
    blocked_writer: Option<usize>, // thread_idx waiting for space
}

pub struct Pipe {
    inner: Mutex<PipeInner>,
}
Read end
pub struct PipeReader(Arc<Pipe>);

impl FileHandle for PipeReader {
    fn read(&self, buf: &mut [u8]) -> Result<usize, FileError> {
        loop {
            let mut inner = self.0.inner.lock();
            if !inner.buf.is_empty() {
                let n = inner.drain_to(buf);
                // Wake blocked writer if there's now space.
                if let Some(writer) = inner.blocked_writer.take() {
                    scheduler::unblock(writer);
                }
                return Ok(n);
            }
            if inner.writer_closed {
                return Ok(0); // EOF
            }
            // Buffer empty, writer alive: block.
            inner.blocked_reader = Some(scheduler::current_thread_idx());
            drop(inner);
            scheduler::block_current_thread();
            // Woken up: retry.
        }
    }

    fn write(&self, _buf: &[u8]) -> Result<usize, FileError> {
        Err(FileError::EBADF)
    }

    fn close(&self) {
        let mut inner = self.0.inner.lock();
        inner.reader_closed = true;
        // Wake blocked writer so it sees EPIPE.
        if let Some(writer) = inner.blocked_writer.take() {
            scheduler::unblock(writer);
        }
    }
}
Write end
pub struct PipeWriter(Arc<Pipe>);

impl FileHandle for PipeWriter {
    fn read(&self, _buf: &mut [u8]) -> Result<usize, FileError> {
        Err(FileError::EBADF)
    }

    fn write(&self, buf: &[u8]) -> Result<usize, FileError> {
        let mut offset = 0;
        while offset < buf.len() {
            let mut inner = self.0.inner.lock();
            if inner.reader_closed {
                return Err(FileError::EPIPE);
            }
            let space = inner.capacity - inner.buf.len();
            if space > 0 {
                let n = core::cmp::min(space, buf.len() - offset);
                inner.buf.extend(&buf[offset..offset + n]);
                offset += n;
                // Wake blocked reader.
                if let Some(reader) = inner.blocked_reader.take() {
                    scheduler::unblock(reader);
                }
            } else {
                // Buffer full: block.
                inner.blocked_writer = Some(scheduler::current_thread_idx());
                drop(inner);
                scheduler::block_current_thread();
            }
        }
        Ok(buf.len())
    }

    fn close(&self) {
        let mut inner = self.0.inner.lock();
        inner.writer_closed = true;
        // Wake blocked reader so it sees EOF.
        if let Some(reader) = inner.blocked_reader.take() {
            scheduler::unblock(reader);
        }
    }
}
Creating a pipe
pub fn new_pipe(capacity: usize) -> (PipeReader, PipeWriter) {
    let pipe = Arc::new(Pipe {
        inner: Mutex::new(PipeInner {
            buf: VecDeque::with_capacity(capacity),
            capacity,
            reader_closed: false,
            writer_closed: false,
            blocked_reader: None,
            blocked_writer: None,
        }),
    });
    (PipeReader(pipe.clone()), PipeWriter(pipe))
}
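The blocking behaviour can be tried out on a host machine with a rough analogue of the design. In this sketch `std::sync::Condvar` stands in for `block_current_thread`/`unblock`, so it shows the check-under-lock and retry-after-wakeup structure rather than the kernel scheduler itself. `HostPipe` and `roundtrip` are hypothetical names, the reader-closed case is omitted, and the capacity is assumed to be non-zero.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

struct Inner {
    buf: VecDeque<u8>,
    capacity: usize,
    writer_closed: bool,
}

/// Bounded byte pipe; `changed` is signalled whenever the buffer or flags change.
pub struct HostPipe {
    inner: Mutex<Inner>,
    changed: Condvar,
}

impl HostPipe {
    pub fn new(capacity: usize) -> Arc<Self> {
        Arc::new(HostPipe {
            inner: Mutex::new(Inner {
                buf: VecDeque::with_capacity(capacity),
                capacity,
                writer_closed: false,
            }),
            changed: Condvar::new(),
        })
    }

    /// Blocks while the buffer is empty; returns 0 once the writer has closed.
    pub fn read(&self, out: &mut [u8]) -> usize {
        let mut inner = self.inner.lock().unwrap();
        loop {
            if !inner.buf.is_empty() {
                let n = out.len().min(inner.buf.len());
                for (dst, byte) in out.iter_mut().zip(inner.buf.drain(..n)) {
                    *dst = byte;
                }
                self.changed.notify_all(); // wake a writer waiting for space
                return n;
            }
            if inner.writer_closed {
                return 0; // EOF
            }
            // Condition was checked under the lock; wait() releases it atomically.
            inner = self.changed.wait(inner).unwrap();
        }
    }

    /// Blocks while the buffer is full; returns after all of `data` is queued.
    pub fn write(&self, data: &[u8]) {
        let mut offset = 0;
        let mut inner = self.inner.lock().unwrap();
        while offset < data.len() {
            let space = inner.capacity - inner.buf.len();
            if space == 0 {
                inner = self.changed.wait(inner).unwrap();
                continue;
            }
            let n = space.min(data.len() - offset);
            inner.buf.extend(&data[offset..offset + n]);
            offset += n;
            self.changed.notify_all(); // wake a reader waiting for data
        }
    }

    pub fn close_writer(&self) {
        self.inner.lock().unwrap().writer_closed = true;
        self.changed.notify_all();
    }
}

/// Push `data` through a pipe from a second thread and collect what the reader sees.
pub fn roundtrip(data: Vec<u8>, capacity: usize) -> Vec<u8> {
    let pipe = HostPipe::new(capacity);
    let writer = {
        let pipe = Arc::clone(&pipe);
        std::thread::spawn(move || {
            pipe.write(&data);
            pipe.close_writer();
        })
    };
    let mut out = Vec::new();
    let mut chunk = [0u8; 3];
    loop {
        let n = pipe.read(&mut chunk);
        if n == 0 {
            break;
        }
        out.extend_from_slice(&chunk[..n]);
    }
    writer.join().unwrap();
    out
}
```

With a capacity smaller than the payload, the writer blocks repeatedly and the reader still receives every byte in order.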
Layer 6: Syscall wiring
New syscalls
| Nr | Name | Signature |
|---|---|---|
| 0 | read | read(fd, buf, count) → ssize_t |
| 1 | write | write(fd, buf, count) → ssize_t |
| 3 | close | close(fd) → int |
| 22 | pipe | pipe(fds) → int |
sys_pipe implementation
fn sys_pipe(fds_ptr: u64) -> i64 {
    // Validate user pointer (2 × i32 = 8 bytes).
    const USER_LIMIT: u64 = 0x0000_8000_0000_0000;
    // checked_add avoids wrap-around for pointers near u64::MAX.
    let in_user_range = fds_ptr
        .checked_add(8)
        .map_or(false, |end| end <= USER_LIMIT);
    if fds_ptr == 0 || !in_user_range {
        return FileError::EFAULT.0;
    }
    let (reader, writer) = new_pipe(4096);
    let pid = process::current_pid();
    let (read_fd, write_fd) = process::with_process(pid, |proc| {
        let rfd = proc.alloc_fd(Arc::new(reader))?;
        match proc.alloc_fd(Arc::new(writer)) {
            Ok(wfd) => Ok((rfd, wfd)),
            Err(e) => { proc.close_fd(rfd).ok(); Err(e) }
        }
    }).unwrap_or(Err(FileError::EBADF))?;
    // Write fds to user space.
    let fds = fds_ptr as *mut [i32; 2];
    unsafe { (*fds) = [read_fd as i32, write_fd as i32]; }
    0
}
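The user-pointer range check is worth isolating, because a plain `ptr + len` can wrap around for pointers near `u64::MAX`. A minimal sketch, assuming the same `USER_LIMIT` as above; `user_range_ok` is a hypothetical helper name.

```rust
/// First address above the lower-half (user) canonical range.
pub const USER_LIMIT: u64 = 0x0000_8000_0000_0000;

/// True if `[ptr, ptr + len)` is non-null and lies entirely below USER_LIMIT.
pub fn user_range_ok(ptr: u64, len: u64) -> bool {
    if ptr == 0 {
        return false;
    }
    match ptr.checked_add(len) {
        Some(end) => end <= USER_LIMIT,
        None => false, // ptr + len overflowed u64
    }
}
```

For `sys_pipe` the call would be `user_range_ok(fds_ptr, 8)`, covering the two `i32` slots.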
Refactored sys_write
fn sys_write(fd: u64, buf: u64, count: u64) -> i64 {
    // ... existing user pointer validation ...
    let bytes = validated_user_slice(buf, count)?;
    let pid = process::current_pid();
    let handle = process::with_process_ref(pid, |p| {
        p.fd_table.get(fd as usize).and_then(|s| s.clone())
    }).flatten().ok_or(FileError::EBADF)?;
    match handle.write(bytes) {
        Ok(n) => n as i64,
        Err(e) => e.0,
    }
}
Implementation order
| Phase | What | Files |
|---|---|---|
| 1 | FileHandle trait + FileError | libkernel/src/file.rs (new) |
| 2 | fd_table on Process + alloc_fd/close_fd | libkernel/src/process.rs |
| 3 | ConsoleHandle | libkernel/src/file.rs |
| 4 | Refactor sys_write to use fd table | libkernel/src/syscall.rs |
| 5 | Add sys_read + sys_close | libkernel/src/syscall.rs |
| 6 | Blocked thread state + block_current_thread / unblock | libkernel/src/task/scheduler.rs |
| 7 | PipeInner / PipeReader / PipeWriter | libkernel/src/pipe.rs (new) |
| 8 | sys_pipe syscall | libkernel/src/syscall.rs |
| 9 | dup2 (optional, for shell redirection) | libkernel/src/syscall.rs |
Phases 1–5 are useful independently — they give user processes a real fd abstraction for stdout/stderr. Phase 6 is needed for any future blocking syscall (futex, sleep, waitpid). Phases 7–8 deliver pipes.
Open questions
- Pipe capacity: 4096 bytes matches Linux’s historical default. Should this be page-sized for alignment, or is VecDeque fine?
- Multiple readers/writers: This design supports only one blocked reader and one blocked writer. For a single pipe between two processes this is fine, but dup-ed fds sharing a pipe end would need a wait queue.
- Signal delivery: POSIX SIGPIPE on write to a broken pipe is not modelled; we return EPIPE instead. Signals can be added later.
- O_NONBLOCK: Not yet supported. Would return EAGAIN instead of blocking. Requires fd-level flags.
- VFS integration: A future VfsHandle implementing FileHandle would connect the VFS’s async read_file to the synchronous FileHandle::read by using the same blocking mechanism.
Actor System
Overview
The kernel uses a lightweight actor model for device drivers and long-running
system services. Each actor is an async task that owns its state behind an
Arc, receives typed messages through a Mailbox, and responds to requests
via one-shot Reply channels.
The design avoids shared mutable state and lock contention between drivers: all cross-actor communication is by message passing.
Core Primitives
Mailbox<M> — libkernel::task::mailbox
An async, mutex-backed message queue.
sender receiver (actor run loop)
────── ────────────────────────
mailbox.send(msg) → while let Some(msg) = inbox.recv().await { ... }
(suspends when queue empty; woken on send)
mailbox.close() → recv() drains remaining msgs, then returns None
Key properties:
- send acquires the lock, checks closed, and either enqueues the message or drops it immediately. Dropping a message also drops any embedded Reply, which closes the reply channel and unblocks the sender with None.
- close sets closed = true under the lock and wakes the receiver. Messages already in the queue are not removed: recv delivers them before returning None. Any send arriving after close is silently dropped.
- reopen clears the closed flag, used when restarting a driver.
- The mutex makes send and close atomic with respect to each other, eliminating the race between “is it closed?” and “enqueue”.
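These properties can be modelled with a small synchronous sketch. It leaves out the async `recv` and waker handling entirely and only captures the send/close/reopen semantics; `SketchMailbox`, `TryRecv`, and the `bool` returned by `send` are assumptions made for illustration, not the libkernel API.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct State<M> {
    queue: VecDeque<M>,
    closed: bool,
}

/// Synchronous model of the mailbox semantics (no wakers).
pub struct SketchMailbox<M> {
    state: Mutex<State<M>>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TryRecv<M> {
    Message(M),
    Empty,
    Closed,
}

impl<M> SketchMailbox<M> {
    pub fn new() -> Self {
        SketchMailbox { state: Mutex::new(State { queue: VecDeque::new(), closed: false }) }
    }

    /// Enqueue, or drop the message if the mailbox is closed.
    /// Returns whether the message was accepted (an addition for testing).
    pub fn send(&self, msg: M) -> bool {
        let mut s = self.state.lock().unwrap();
        if s.closed {
            return false; // `msg` is dropped here
        }
        s.queue.push_back(msg);
        true
    }

    /// Mark closed; messages already queued stay in place.
    pub fn close(&self) {
        self.state.lock().unwrap().closed = true;
    }

    pub fn reopen(&self) {
        self.state.lock().unwrap().closed = false;
    }

    /// Drain queued messages first; report Closed only once the queue is empty.
    pub fn try_recv(&self) -> TryRecv<M> {
        let mut s = self.state.lock().unwrap();
        match s.queue.pop_front() {
            Some(m) => TryRecv::Message(m),
            None if s.closed => TryRecv::Closed,
            None => TryRecv::Empty,
        }
    }
}
```

Because the closed check and the enqueue happen under one lock, a message is either accepted before the close or dropped after it, never lost in between.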
recv uses a double-check pattern to avoid missed wakeups:
poll():
    lock → dequeue / check closed → unlock   (fast path)
    register waker
    lock → dequeue / check closed → unlock   (second check)
    → Pending
The lock is always released before registering the waker and before waking it,
so a send or close that arrives between the two checks will either be seen
by the second check or will wake the (now-registered) waker.
Reply<T> — one-shot response channel
Reply<T> is the sending half of a request/response pair.
// Actor receives:
ActorMsg::Info(reply) => reply.send(ActorStatus { name: "dummy", running: true, info: () }),
// Sender awaits:
let status: Option<ActorStatus<()>> = inbox.ask(|r| ActorMsg::Info(r)).await;
Reply::new() returns (Reply<T>, Arc<Mailbox<T>>). The actor calls
reply.send(value) to deliver a response; the Drop impl calls close() on
the inner mailbox regardless, so the receiver always unblocks:
- reply.send(value) → value pushed, then Reply dropped → close() called. close() does not drain the queue, so the value is still there for recv.
- reply dropped without send → close() called on an empty mailbox → recv() returns None.
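The close-on-drop behaviour can be sketched in a few lines. This is a hosted model with a mutex-protected slot in place of the inner `Mailbox<T>`; `SketchReply`, `Slot`, `is_closed` and `take_value` are hypothetical names.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the inner one-slot mailbox.
pub struct Slot<T> {
    value: Option<T>,
    closed: bool,
}

/// Sending half; dropping it always closes the slot.
pub struct SketchReply<T> {
    slot: Arc<Mutex<Slot<T>>>,
}

impl<T> SketchReply<T> {
    pub fn new() -> (Self, Arc<Mutex<Slot<T>>>) {
        let slot = Arc::new(Mutex::new(Slot { value: None, closed: false }));
        (SketchReply { slot: Arc::clone(&slot) }, slot)
    }

    /// Store the value. `self` is consumed, so Drop runs right after and closes the slot.
    pub fn send(self, value: T) {
        self.slot.lock().unwrap().value = Some(value);
    }
}

impl<T> Drop for SketchReply<T> {
    fn drop(&mut self) {
        // Closing does not clear `value`, so a sent reply is still readable.
        self.slot.lock().unwrap().closed = true;
    }
}

/// Receiver side: has the sender finished (either by sending or by dropping)?
pub fn is_closed<T>(slot: &Mutex<Slot<T>>) -> bool {
    slot.lock().unwrap().closed
}

/// Receiver side: take the value if one was sent.
pub fn take_value<T>(slot: &Mutex<Slot<T>>) -> Option<T> {
    slot.lock().unwrap().value.take()
}
```

Either way the receiver observes a closed slot, which is why a request can never hang on an actor that discards the reply.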
ActorMsg<M, I> — the envelope type
Every actor mailbox is Mailbox<ActorMsg<M, I>> where M is the actor-specific
message type and I is the actor-specific info detail type (defaults to ()).
pub enum ActorMsg<M, I: Send = ()> {
    /// Typed info request: reply carries ActorStatus<I> with the full detail.
    Info(Reply<ActorStatus<I>>),
    /// Type-erased info request from the process registry: reply carries
    /// ActorStatus<ErasedInfo> so callers can display detail without knowing I.
    ErasedInfo(Reply<ActorStatus<ErasedInfo>>),
    /// An actor-specific message.
    Inner(M),
}
ActorStatus<I> is the response to both info variants:
pub struct ActorStatus<I = ()> {
    pub name: &'static str,
    pub running: bool, // always true when the actor is responding
    pub info: I,       // actor-specific detail
}
ErasedInfo is a type alias for the boxed detail used in type-erased queries:
pub type ErasedInfo = Box<dyn core::fmt::Debug + Send>;
RecvTimeout<M> — timed receive
recv_timeout races the inbox against a Delay, returning whichever fires first:
pub enum RecvTimeout<M> {
    Message(M), // a message arrived before the deadline
    Closed,     // mailbox was closed (actor should exit)
    Elapsed,    // timer fired before any message
}

// Usage:
match inbox.recv_timeout(ticks).await {
    RecvTimeout::Message(msg) => { /* handle */ }
    RecvTimeout::Closed => break,
    RecvTimeout::Elapsed => { /* periodic work */ }
}
Used internally by the #[on_tick] generated run loop.
ask — the request/response pattern
// Returns Option<R>; None if the actor is stopped or dropped the reply.
let result = inbox.ask(|reply| ActorMsg::Inner(MyMsg::GetThing(reply))).await;
ask creates a Reply, wraps it in a message, sends it, and awaits the
response. Because a closed mailbox drops incoming messages (and their
Replys), ask on a stopped actor returns None immediately rather than
hanging.
Self-query deadlock: an actor must never use ask (or registry::ask_info)
to query its own mailbox from within a message handler — it cannot recv() the
response while blocked executing the current message. Detect self-queries by
comparing names and respond directly instead.
Driver Lifecycle — devices::task_driver
DriverTask trait
pub trait DriverTask: Send + Sync + 'static {
    type Message: Send;
    type Info: Send + 'static;

    fn name(&self) -> &'static str;

    fn run(
        handle: Arc<Self>,
        stop: StopToken,
        inbox: Arc<Mailbox<ActorMsg<Self::Message, Self::Info>>>,
    ) -> impl Future<Output = ()> + Send;
}
type Info is the actor-specific detail returned by #[on_info]. Use ()
if the actor has no custom info.
The run future is 'static because all state is accessed through Arc<Self>.
StopToken can be polled between messages for cooperative stop, though most
actors simply let inbox.recv() return None (which happens when the mailbox
is closed by stop()).
TaskDriver<T> — the lifecycle wrapper
TaskDriver<T> implements Driver (the registry interface) and owns:
| Field | Type | Purpose |
|---|---|---|
| task | Arc<T> | actor state, shared with the run future |
| running | Arc<AtomicBool> | set true on start, false when run exits |
| stop_flag | Arc<AtomicBool> | StopToken reads this |
| inbox | Arc<Mailbox<ActorMsg<T::Message, T::Info>>> | message channel |
Lifecycle:
TaskDriver::new()
    inbox starts CLOSED      → sends before start() are dropped immediately
start()
    inbox.reopen()           opens the mailbox
    running = true
    spawn(async { T::run(handle, stop, inbox).await; running = false; })
stop()
    stop_flag = true         StopToken fires
    inbox.close()            recv() will return None after draining
    (run loop exits)
    running = false
TaskDriver::new returns (TaskDriver<T>, Arc<Mailbox<ActorMsg<T::Message, T::Info>>>).
The caller holds onto the Arc<Mailbox> to send actor-specific messages and
registers it in the process registry (see below).
The #[actor] Macro — devices_macros
The macro generates a complete DriverTask implementation from an annotated
impl block, eliminating the run-loop boilerplate. All attributes are
passthrough no-ops when used outside an #[actor] block.
Basic usage — pure message actor
pub enum DummyMsg { SetInterval(u64) }

#[derive(Debug)]
pub struct DummyInfo { pub interval_secs: u64 }

pub struct Dummy { interval_secs: AtomicU64 }

#[actor("dummy", DummyMsg)]
impl Dummy {
    #[on_info]
    async fn on_info(&self) -> DummyInfo {
        DummyInfo { interval_secs: self.interval_secs.load(Ordering::Relaxed) }
    }

    #[on_message(SetInterval)]
    async fn set_interval(&self, secs: u64) {
        self.interval_secs.store(secs, Ordering::Relaxed);
    }
}
What the macro generates:
// Inherent impl with handler methods (attributes stripped):
impl Dummy {
    async fn on_info(&self) -> DummyInfo { ... }
    async fn set_interval(&self, secs: u64) { ... }
}

// DriverTask impl with the generated run loop:
impl DriverTask for Dummy {
    type Message = DummyMsg;
    type Info = DummyInfo;

    fn name(&self) -> &'static str { "dummy" }

    async fn run(handle: Arc<Self>, _stop: StopToken,
                 inbox: Arc<Mailbox<ActorMsg<DummyMsg, DummyInfo>>>) {
        log::info!("[dummy] started");
        while let Some(msg) = inbox.recv().await {
            match msg {
                ActorMsg::Info(reply) =>
                    reply.send(ActorStatus { name: "dummy", running: true,
                                             info: handle.on_info().await }),
                ActorMsg::ErasedInfo(reply) =>
                    reply.send(ActorStatus { name: "dummy", running: true,
                                             info: Box::new(handle.on_info().await) }),
                ActorMsg::Inner(msg) => match msg {
                    DummyMsg::SetInterval(secs) => handle.set_interval(secs).await,
                }
            }
        }
        log::info!("[dummy] stopped");
    }
}

// Convenience type alias (struct name + "Driver"):
pub type DummyDriver = TaskDriver<Dummy>;
Any methods in the #[actor] block that have no actor attribute are emitted
unchanged in the inherent impl and are callable from handler methods.
#[on_start] — actor startup hook
Called once, after the [actor] started log line and before the message loop:
#[on_start]
async fn on_start(&self) {
    println!();
    print!("myactor> ");
}
Only one #[on_start] method is allowed per actor.
#[on_info] — custom actor info
Without #[on_info], Info and ErasedInfo reply with info: (). Annotate
one method to provide actor-specific detail:
#[on_info]
async fn on_info(&self) -> MyInfo {
    MyInfo { /* fields from self */ }
}
The return type must implement Debug + Send. The macro infers type Info = MyInfo and generates both Info and ErasedInfo arms automatically.
#[on_message(Variant)] — inner message handler
Maps one enum variant of the actor’s message type to an async handler:
#[on_message(DoThing)]
async fn do_thing(&self, n: u32) { ... }
The generated match arm is:
ActorMsg::Inner(MyMsg::DoThing(n)) => handle.do_thing(n).await,
Multiple #[on_message] methods are allowed, one per variant.
#[on_tick] — periodic callback
When present, the macro switches to a unified poll_fn loop (see below)
that races the inbox against a Delay. The actor must also provide a plain
tick_interval_ticks(&self) -> u64 method (no attribute needed):
fn tick_interval_ticks(&self) -> u64 {
    self.interval_secs.load(Ordering::Relaxed) * TICKS_PER_SECOND
}

#[on_tick]
async fn heartbeat(&self) {
    log::info!("[myactor] tick");
}
Only one #[on_tick] method is allowed per actor. The delay is reset after
each tick so tick_interval_ticks can change dynamically.
#[on_stream(factory)] — interrupt/hardware stream source
Actors that need to react to hardware events (interrupts, async streams) use
#[on_stream]. The factory argument names a plain method that returns a
Stream + Unpin; the handler is called for each item:
// Factory: called once when the actor starts.
fn key_stream(&self) -> KeyStream { KeyStream::new() }

// Handler: called for each item from the stream.
#[on_stream(key_stream)]
async fn on_key(&self, key: Key) {
    // process key event
}
Multiple #[on_stream] methods are allowed, one per stream.
The unified poll_fn loop
When one or more #[on_stream] or #[on_tick] attributes are present the
macro generates a loop that races all event sources in a single poll_fn:
// Streams initialised once before the loop:
let mut _stream_0 = handle.key_stream();
// Timer initialised if #[on_tick] is present:
let mut _delay = Delay::new(handle.tick_interval_ticks());

loop {
    enum _Event {
        _Inbox(ActorMsg<KeyboardMsg, KeyboardInfo>),
        _Stream0(Key), // one variant per #[on_stream]
        _Tick,         // present if #[on_tick]
        _Stopped,
    }

    let mut _recv = inbox.recv();
    let _ev = poll_fn(|cx| {
        // Streams polled first (interrupt-driven, lowest latency):
        match poll_stream_next(&mut _stream_0, cx) {
            Poll::Ready(Some(item)) => return Poll::Ready(_Event::_Stream0(item)),
            Poll::Ready(None) => return Poll::Ready(_Event::_Stopped),
            Poll::Pending => {}
        }
        // Inbox (control messages and stop signal):
        match Pin::new(&mut _recv).poll(cx) {
            Poll::Ready(Some(msg)) => return Poll::Ready(_Event::_Inbox(msg)),
            Poll::Ready(None) => return Poll::Ready(_Event::_Stopped),
            Poll::Pending => {}
        }
        // Timer (lowest priority):
        if let Poll::Ready(()) = Pin::new(&mut _delay).poll(cx) {
            return Poll::Ready(_Event::_Tick);
        }
        Poll::Pending
    }).await;

    match _ev {
        _Event::_Stopped => break,
        _Event::_Inbox(msg) => match msg { /* Info, ErasedInfo, Inner arms */ }
        _Event::_Stream0(key) => handle.on_key(key).await,
        _Event::_Tick => {
            handle.heartbeat().await;
            _delay = Delay::new(handle.tick_interval_ticks());
        }
    }
}
All wakers (mailbox AtomicWaker, stream AtomicWaker, timer WAKERS slot)
register the same task waker, so whichever source fires first reschedules
the task. No extra task or thread is needed.
Using #[actor] outside the devices crate
The macro generates impl crate::task_driver::DriverTask for … and
pub type XDriver = crate::task_driver::TaskDriver<X>;. In the devices crate
this resolves naturally. For crates that use devices as a dependency (e.g.
kernel), expose task_driver at the crate root:
// kernel/src/task_driver.rs
pub use devices::task_driver::*;

// kernel/src/main.rs
pub mod task_driver; // makes crate::task_driver resolve for #[actor] expansions
The generated type alias uses the struct name suffixed with Driver:
KeyboardActor → KeyboardActorDriver, Shell → ShellDriver.
Process Registry — libkernel::task::registry
The registry maps actor names to their mailboxes, allowing any code to send messages to a named actor without holding a direct reference.
// Registration (at init time, in main.rs):
registry::register("dummy", dummy_inbox.clone());

// Typed lookup (when the caller knows both message and info types):
let inbox: Arc<Mailbox<ActorMsg<DummyMsg, DummyInfo>>> =
    registry::get::<DummyMsg, DummyInfo>("dummy")?;
inbox.send(ActorMsg::Inner(DummyMsg::SetInterval(5)));

// Type-erased info query (no knowledge of M or I needed):
if let Some(status) = registry::ask_info("dummy").await {
    println!("name: {} running: {} info: {:?}", status.name, status.running, status.info);
}
Each registry entry stores two representations of the same mailbox:
| Field | Type | Used for |
|---|---|---|
| mailbox | Arc<dyn Any + Send + Sync> | typed downcast via get<M, I> |
| informable | Arc<dyn Informable> | type-erased ErasedInfo query via ask_info |
Informable is a simple object-safe trait:
pub trait Informable: Send + Sync {
    fn send_info(&self, reply: Reply<ActorStatus<ErasedInfo>>);
}

// Blanket impl for all actor mailboxes:
impl<M: Send, I: Send + 'static> Informable for Mailbox<ActorMsg<M, I>> {
    fn send_info(&self, reply: Reply<ActorStatus<ErasedInfo>>) {
        self.send(ActorMsg::ErasedInfo(reply));
    }
}
ask_info clones the Arc<dyn Informable> while holding the registry lock,
drops the lock, then sends the request and awaits the reply — the lock is never
held across an await.
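The typed-lookup half of the registry comes down to storing an `Arc<dyn Any + Send + Sync>` and downcasting on retrieval. The hosted sketch below shows only that mechanism, with the lock released before the downcast; `SketchRegistry` and its methods are hypothetical and use a single type parameter rather than the `get<M, I>` pair of the real registry.

```rust
use std::any::Any;
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Name → type-erased handle map (model of the registry's `mailbox` field).
pub struct SketchRegistry {
    entries: Mutex<BTreeMap<&'static str, Arc<dyn Any + Send + Sync>>>,
}

impl SketchRegistry {
    pub fn new() -> Self {
        SketchRegistry { entries: Mutex::new(BTreeMap::new()) }
    }

    /// Store a handle under `name`; the concrete type is erased here.
    pub fn register<T: Any + Send + Sync>(&self, name: &'static str, handle: Arc<T>) {
        self.entries.lock().unwrap().insert(name, handle);
    }

    /// Look up `name` and downcast; None if missing or of a different type.
    pub fn get<T: Any + Send + Sync>(&self, name: &str) -> Option<Arc<T>> {
        // Clone the Arc while the lock is held; the guard is dropped at the
        // end of this statement, before the downcast runs.
        let erased = self.entries.lock().unwrap().get(name).cloned()?;
        erased.downcast::<T>().ok()
    }
}
```

A lookup with the wrong type parameter fails cleanly with `None` instead of handing back a mistyped mailbox.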
Actors in Practice
Shell — pure message actor with startup hook
pub enum ShellMsg { KeyLine(String) }

pub struct Shell;

#[actor("shell", ShellMsg)]
impl Shell {
    #[on_start]
    async fn on_start(&self) {
        println!();
        print!("ostoo> ");
    }

    #[on_message(KeyLine)]
    async fn on_key_line(&self, line: String) {
        self.execute_command(&line).await;
        print!("ostoo> ");
    }

    // Plain helpers; these land in the inherent impl:
    async fn execute_command(&self, line: &str) { ... }
    async fn cmd_driver(&self, rest: &str) { ... }
}
The shell prints its prompt in #[on_start] (once, when the actor starts) and
again after each command in #[on_message(KeyLine)].
Fire-and-forget dispatch: the keyboard actor sends ShellMsg::KeyLine with
mailbox.send() (no reply), so it never blocks waiting for the shell. The
shell processes one command at a time; new lines queue in the mailbox.
Self-query avoidance: driver info shell from within a shell command would
deadlock if it sent ErasedInfo to the shell’s own mailbox (the shell is busy
executing the command and cannot recv). The handler detects the name "shell"
and responds directly without going through the registry.
Keyboard — stream actor
pub struct KeyboardActor {
    keys_processed: AtomicU64,
    lines_dispatched: AtomicU64,
    line: spin::Mutex<LineBuf>,
}
#[actor("keyboard", KeyboardMsg)]
impl KeyboardActor {
    fn key_stream(&self) -> KeyStream { KeyStream::new() }
    #[on_stream(key_stream)]
    async fn on_key(&self, key: Key) {
        // buffer characters; dispatch complete lines to shell via send()
    }
    #[on_info]
    async fn on_info(&self) -> KeyboardInfo { ... }
}
KeyStream is interrupt-driven: every PS/2 scancode IRQ pushes into a lock-free
queue and wakes an AtomicWaker. Because both the stream waker and the inbox
waker register the same task waker, the actor sleeps in a single poll_fn and
wakes on whichever event arrives first.
The line buffer lives in the actor struct behind a spin::Mutex<LineBuf> so it
is accessible from the &self reference in on_key. The mutex is never held
across an .await.
Dummy — tick actor (example / test driver)
#[actor("dummy", DummyMsg)]
impl Dummy {
    fn tick_interval_ticks(&self) -> u64 {
        self.interval_secs.load(Ordering::Relaxed) * TICKS_PER_SECOND
    }
    #[on_tick]
    async fn heartbeat(&self) {
        log::info!("[dummy] heartbeat");
    }
    #[on_info]
    async fn on_info(&self) -> DummyInfo { ... }
    #[on_message(SetInterval)]
    async fn set_interval(&self, secs: u64) { ... }
}
The dummy actor starts stopped. Running driver start dummy from the shell opens
its mailbox and spawns the run loop; driver dummy set-interval 3 sends
SetInterval(3) and changes the heartbeat rate at runtime.
Startup Sequence
// main.rs (abridged)
// Dummy driver: starts stopped, user can start it from the shell
let (dummy_driver, dummy_inbox) = DummyDriver::new(Dummy::new());
devices::driver::register(Box::new(dummy_driver));
registry::register("dummy", dummy_inbox);
// Shell actor: started immediately
let (shell_driver, shell_inbox) = ShellDriver::new(Shell::new());
devices::driver::register(Box::new(shell_driver));
registry::register("shell", shell_inbox.clone());
devices::driver::start_driver("shell").ok(); // reopen + spawn run loop
// Keyboard actor: started immediately, stream-driven by PS/2 IRQs
let (kb_driver, kb_inbox) =
    KeyboardActorDriver::new(KeyboardActor::new());
devices::driver::register(Box::new(kb_driver));
registry::register("keyboard", kb_inbox);
devices::driver::start_driver("keyboard").ok();
File Map
| Path | Role |
|---|---|
| libkernel/src/task/mailbox.rs | Mailbox<M>, Reply<T>, ActorMsg<M,I>, ActorStatus<I>, ErasedInfo, RecvTimeout<M> |
| libkernel/src/task/mod.rs | poll_stream_next helper used by macro-generated code |
| libkernel/src/task/registry.rs | process registry, Informable, ask_info |
| devices/src/task_driver.rs | DriverTask trait, TaskDriver<T>, StopToken |
| devices/src/driver.rs | Driver trait, driver registry (start/stop/list) |
| devices-macros/src/lib.rs | #[actor], #[on_message], #[on_info], #[on_start], #[on_tick], #[on_stream] |
| devices/src/dummy.rs | tick + message actor (#[on_tick], #[on_message], #[on_info]) |
| kernel/src/shell.rs | shell actor (#[on_start], #[on_message]) |
| kernel/src/keyboard_actor.rs | keyboard actor (#[on_stream], #[on_info]) |
| kernel/src/task_driver.rs | pub use devices::task_driver::* shim for crate::task_driver path |
virtio-blk Block Device Driver
Overview
The kernel includes a PCI virtio-blk driver that provides read/write access to
a QEMU virtual disk. The driver is implemented using the virtio-drivers crate
(v0.13) and integrates with the existing actor/driver framework.
The driver is started automatically at boot if a virtio-blk PCI device is
found. It is accessible from the shell via the blk commands.
Architecture
QEMU virtio-blk device (PCIe, Q35 ECAM)
│
│ PciTransport (virtio-drivers)
▼
VirtIOBlk<KernelHal, PciTransport> ← virtio protocol implementation
│
spin::Mutex (actor + ISR safe)
│
VirtioBlkActor ← actor framework wrapper
│
Mailbox<ActorMsg<VirtioBlkMsg, VirtioBlkInfo>>
│
Shell / other actors ← consumers
Components
devices/src/virtio/mod.rs — HAL and transport
KernelHal
Implements the virtio_drivers::Hal unsafe trait, bridging the virtio-drivers
crate into the kernel memory model:
| Method | Implementation |
|---|---|
| dma_alloc(pages) | Allocates contiguous physical frames via MemoryServices::alloc_dma_pages; returns (paddr, virt) where virt is in the linear physical-memory window (phys_mem_offset + paddr). Pages are zeroed. |
| dma_dealloc | No-op. The frame allocator has no free operation; allocations are leaked (acceptable for MVP). |
| mmio_phys_to_virt(paddr, size) | Calls MemoryServices::map_mmio_region to ensure the physical range is mapped, then returns the linear-window virtual address. |
| share(buffer) | Performs a page-table walk via MemoryServices::translate_virt to find the physical address of any buffer (heap or DMA window). A plain vaddr - phys_mem_offset would be wrong for heap buffers. |
| unshare | No-op on x86 (cache-coherent). |
ECAM / PciRoot
The Q35 machine exposes a PCIe Extended Configuration Access Mechanism (ECAM)
region at physical address 0xB000_0000 (1 MiB, covering bus 0).
Physical 0xB000_0000 → Virtual phys_mem_offset + 0xB000_0000
The mapping is created once during libkernel_main by calling
MemoryServices::map_mmio_region. The resulting virtual base is stored in the
ECAM_VIRT_BASE atomic and used by create_pci_root() which constructs a
PciRoot<MmioCam<'static>> for the virtio-drivers transport layer.
(In virtio-drivers 0.13, PciRoot is generic over a ConfigurationAccess
implementation; MmioCam wraps the raw MMIO pointer with a Cam::Ecam mode.)
create_pci_transport (formerly create_blk_transport)
pub fn create_pci_transport(bus: u8, device: u8, function: u8) -> Option<PciTransport>
Wraps PciTransport::new::<KernelHal, _>, isolating virtio-drivers from the
kernel binary — the kernel crate does not depend on virtio-drivers directly.
Works for any virtio-pci device (blk, 9p, etc.), not just block devices.
create_blk_transport is kept as a legacy alias.
register_blk_irq
pub fn register_blk_irq(handler: fn()) -> Option<u8>
Registers a dynamic IDT handler for the virtio-blk interrupt (delegating to
libkernel::interrupts::register_handler). Returns the allocated IDT vector,
which must be programmed into the device’s MSI or IO APIC routing table.
IRQ-driven completion is not yet wired up (see Limitations).
devices/src/virtio/blk.rs — the actor
Messages
pub enum VirtioBlkMsg {
    Read(u64, Reply<Result<Vec<u8>, ()>>),      // sector, reply
    Write(u64, Vec<u8>, Reply<Result<(), ()>>), // sector, data, reply
}
Info
#[derive(Debug)]
pub struct VirtioBlkInfo {
    pub capacity_sectors: u64,
    pub reads: u64,
    pub writes: u64,
}
Returned by driver info virtio-blk and blk info.
VirtioBlkActor
Owns a spin::Mutex<VirtIOBlk<KernelHal, PciTransport>>. The mutex is needed
because both the actor task and (future) interrupt handler may access the device.
Both unsafe impl Send and unsafe impl Sync are required because VirtIOBlk
contains raw DMA buffer pointers, which are not automatically Send or Sync.
Access is always serialised through the spin::Mutex.
Read/write flow
on_read(sector, reply):
1. lock device → read_blocks_nb(sector, &mut req, buf, &mut resp) → token
2. unlock device
3. CompletionFuture.await (busy-polls peek_used until the device signals done)
4. lock device → complete_read_blocks(token, &req, buf, &resp)
5. unlock device
6. reply.send(Ok(buf))
Write is symmetric with write_blocks_nb / complete_write_blocks.
All of read_blocks_nb, write_blocks_nb, complete_read_blocks, and
complete_write_blocks are unsafe fn in virtio-drivers: the safety
contract is that the buffers remain valid and are not moved for the duration
of the I/O. Because buf, req, and resp all live in the async state machine
on the heap, they are not moved or dropped between submit and complete.
CompletionFuture
struct CompletionFuture<'a> {
    device: &'a spin::Mutex<VirtIOBlk<KernelHal, PciTransport>>,
}
impl Future for CompletionFuture<'_> {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.device.lock().peek_used().is_some() {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref(); // reschedule immediately (busy-poll)
            Poll::Pending
        }
    }
}
This is a busy-poll future for MVP. It re-schedules itself every executor turn until the virtqueue returns a used buffer. See Limitations for the planned IRQ-driven replacement.
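The cost of busy-polling can be observed on the host with a small std-only sketch. `FakeDevice`, `BusyPoll`, `CountingWaker`, and `run_to_completion` are hypothetical names invented for this illustration; the fake device simply reports completion after a fixed number of `peek_used` calls, and the loop plays the role of the executor.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical device stand-in: reports a used buffer on the
/// `ready_after`-th call to `peek_used`.
struct FakeDevice {
    peeks: AtomicUsize,
    ready_after: usize,
}

impl FakeDevice {
    fn peek_used(&self) -> Option<u16> {
        let n = self.peeks.fetch_add(1, Ordering::SeqCst) + 1;
        if n >= self.ready_after { Some(0) } else { None }
    }
}

/// Same shape as a busy-poll completion future: if the device is not
/// done, wake ourselves immediately and return Pending.
struct BusyPoll<'a> {
    device: &'a FakeDevice,
}

impl Future for BusyPoll<'_> {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.device.peek_used().is_some() {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that only counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Minimal executor loop: polls until ready and returns
/// (number of polls, number of self-wakes).
fn run_to_completion<F: Future<Output = ()>>(fut: F) -> (usize, usize) {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if fut.as_mut().poll(&mut cx).is_ready() {
            return (polls, counter.0.load(Ordering::SeqCst));
        }
    }
}
```

Every `Pending` result costs one wake and one further poll, which is the CPU overhead an IRQ-driven waker would avoid.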
libkernel/src/memory/mod.rs — supporting APIs
Three methods were added to MemoryServices for virtio support:
map_mmio_region(phys_start, size) -> VirtAddr
Maps a physical MMIO range into the linear physical-memory window
(phys_mem_offset + phys_start) using 4 KiB pages with PRESENT | WRITABLE | NO_CACHE flags.
Pages already mapped as 4 KiB pages are skipped silently (Ok(_)). Pages
inside a 2 MiB or 1 GiB huge-page entry are also skipped
(Err(TranslateError::ParentEntryHugePage)) — they are already accessible
because the bootloader maps all physical RAM using 2 MiB huge pages.
This huge-page check was the fix for the map_to failed: ParentEntryHugePage
panic that occurred when mapping the ECAM region.
alloc_dma_pages(pages) -> Option<PhysAddr>
Allocates pages physically-contiguous 4 KiB frames from the
BootInfoFrameAllocator. Panics if frames are not contiguous (very unlikely
with the sequential allocator).
translate_virt(virt) -> Option<PhysAddr>
Walks the active RecursivePageTable to find the physical address for any virtual
address, regardless of page size (4 KiB, 2 MiB, or 1 GiB).
This is used by KernelHal::share to convert heap buffer addresses to physical
addresses. A simple vaddr - phys_mem_offset subtraction would be wrong for
heap buffers (which live at HEAP_START, not in the linear physical window),
producing garbage physical addresses and causing QEMU to report
virtio: zero sized buffers are not allowed.
Boot Sequence
libkernel_main()
1. memory::init_services(mapper, frame_allocator, phys_mem_offset, map)
2. map_mmio_region(0xB000_0000, 1 MiB) ← ECAM
virtio::set_ecam_base(ecam_virt)
3. devices::pci::init() ← scan CF8/CFC config space
4. find_devices(0x1AF4, 0x1042) ← probe the modern (virtio 1.0+) ID first
find_devices(0x1AF4, 0x1001) ← then the legacy/transitional ID
5. virtio::create_pci_transport(bus, dev, func)
└─ PciRoot::new(MmioCam::new(ECAM_VIRT_BASE, Cam::Ecam))
PciTransport::new::<KernelHal, _>(&mut root, df)
6. VirtioBlkActor::new(transport)
7. VirtioBlkActorDriver::new(actor)
8. driver::register + registry::register("virtio-blk", inbox)
9. driver::start_driver("virtio-blk")
→ "[kernel] virtio-blk registered"
Shell Commands
| Command | Description |
|---|---|
| blk info | Print capacity, read count, and write count |
| blk read <sector> | Read 512 bytes from sector N; hex-dump first 64 bytes |
| blk ls [path] | List exFAT directory (see exfat.md) |
| blk cat <path> | Print exFAT file as text (see exfat.md) |
| ls [path] | Alias for blk ls |
| cat <path> | Alias for blk cat |
| driver info virtio-blk | Same info via the generic driver info command |
| driver stop virtio-blk | Stop the actor (mailbox closed; no further I/O) |
| driver start virtio-blk | Restart the actor |
Running with a Disk
# Create a blank 64 MiB disk image (once):
make disk
# Build and run with the disk attached:
make run
The run target adds:
-drive file=disk.img,format=raw,if=none,id=hd0
-device virtio-blk-pci,drive=hd0
The kernel uses a Q35 machine (-machine q35) which provides native PCIe and
ECAM support.
To run without a disk (e.g. for quick boot tests):
make run-nodisk
PCI Device IDs
| Device ID | Variant |
|---|---|
| 0x1AF4:0x1042 | Modern (virtio 1.0+) virtio-blk (QEMU default) |
| 0x1AF4:0x1001 | Legacy/transitional virtio-blk |
Both are probed at boot; the modern ID is tried first.
Key Files
| File | Role |
|---|---|
| devices/src/virtio/mod.rs | KernelHal, ECAM state, create_pci_transport, register_blk_irq |
| devices/src/virtio/blk.rs | VirtioBlkActor, VirtioBlkMsg, VirtioBlkInfo, CompletionFuture |
| devices/src/virtio/p9_proto.rs | 9P2000.L wire protocol encode/decode |
| devices/src/virtio/p9.rs | P9Client: high-level 9P client wrapping VirtIO9p |
| kernel/src/main.rs | ECAM mapping, PCI probe (blk + 9p), actor registration |
| devices/src/virtio/exfat.rs | exFAT partition detection, filesystem, path walk |
| kernel/src/shell.rs | blk info, blk read, blk ls, blk cat, ls, cat, cd, pwd |
| libkernel/src/memory/mod.rs | map_mmio_region, alloc_dma_pages, translate_virt |
| Makefile | disk, run, run-nodisk targets |
Limitations
Busy-poll completion
CompletionFuture re-schedules itself every executor turn, consuming CPU until
the device completes I/O. The intended replacement is an AtomicWaker-based
future that sleeps until the IRQ handler calls wake():
static IRQ_PENDING: AtomicBool = AtomicBool::new(false);
static IRQ_WAKER: AtomicWaker = AtomicWaker::new();
fn virtio_blk_irq_handler() {
    IRQ_PENDING.store(true, Ordering::Release); // record the completion first
    IRQ_WAKER.wake();                           // then wake the sleeping future
}
This requires programming the device’s MSI capability or IO APIC routing with
the vector returned by register_blk_irq. The infrastructure exists; wiring
is the remaining work.
No DMA free
dma_dealloc is a no-op. Freed DMA pages are leaked. The BootInfoFrameAllocator
has no reclamation path. Acceptable for MVP; a proper frame allocator with free
would be needed for a production kernel.
Single device
The IRQ state (IRQ_PENDING) is a file-level static, supporting only one
virtio-blk device. Multi-device support would require per-device state.
Heap size
The kernel heap is 100 KiB. DMA allocations come from the frame allocator (not
the heap), but Vec<u8> read buffers and BlkReq/BlkResp structs live on the
heap. Sustained I/O workloads should remain well within the limit.
VirtIO 9P Host Directory Sharing
Overview
The kernel includes a VirtIO 9P (9P2000.L) driver that shares a host directory
directly into the guest via QEMU’s -fsdev mechanism. This provides a
Docker-volume-like workflow: edit files on the host, they appear instantly in
the guest — no disk image rebuild needed.
The driver uses the VirtIO9p device from the virtio-drivers crate (v0.13)
and implements a minimal read-only 9P2000.L client on top.
Architecture
Host directory (./user)
│
QEMU virtio-9p-pci device
│
│ PciTransport (virtio-drivers)
▼
VirtIO9p<KernelHal, PciTransport> ← virtio device, raw request/response
│
spin::Mutex (synchronous access)
│
P9Client ← 9P2000.L protocol client
│
Plan9Vfs ← VFS adapter
│
devices::vfs mount table ← /host and optionally /
Components
devices/src/virtio/p9_proto.rs — Wire Protocol
Minimal 9P2000.L message encoding/decoding. All messages use little-endian
wire format with a 7-byte header: size[4] type[1] tag[2].
Message pairs implemented
| T-message | R-message | Type codes | Purpose |
|---|---|---|---|
| Tversion | Rversion | 100 / 101 | Protocol handshake (negotiates msize) |
| Tattach | Rattach | 104 / 105 | Mount filesystem, get root fid |
| Twalk | Rwalk | 110 / 111 | Traverse path components |
| Tlopen | Rlopen | 12 / 13 | Open a fid for reading |
| Tread | Rread | 116 / 117 | Read file data |
| Treaddir | Rreaddir | 40 / 41 | Read directory entries |
| Tgetattr | Rgetattr | 24 / 25 | Get file attributes (mode, size) |
| Tclunk | Rclunk | 120 / 121 | Release a fid |
Error responses use Rlerror (type 7) with a Linux errno code.
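As an illustration of the wire format, the sketch below encodes a Tversion request and decodes the common 7-byte header. The function and type names are hypothetical and are not taken from p9_proto.rs; the `NOTAG` value (0xFFFF) is the tag the 9P protocol reserves for version negotiation, and strings are encoded as a 2-byte length followed by the bytes.

```rust
const TVERSION: u8 = 100;
const NOTAG: u16 = 0xFFFF; // tag reserved for Tversion/Rversion

/// Decoded form of the 7-byte header: size[4] type[1] tag[2].
#[derive(Debug, PartialEq)]
struct Header {
    size: u32,
    msg_type: u8,
    tag: u16,
}

/// Encode Tversion: header, then msize[4], then version as len[2] + bytes.
/// All integers are little-endian; `size` counts the whole message.
fn encode_tversion(msize: u32, version: &str) -> Vec<u8> {
    let mut body = Vec::new();
    body.extend_from_slice(&msize.to_le_bytes());
    body.extend_from_slice(&(version.len() as u16).to_le_bytes());
    body.extend_from_slice(version.as_bytes());

    let size = (7 + body.len()) as u32;
    let mut msg = Vec::with_capacity(size as usize);
    msg.extend_from_slice(&size.to_le_bytes());
    msg.push(TVERSION);
    msg.extend_from_slice(&NOTAG.to_le_bytes());
    msg.extend_from_slice(&body);
    msg
}

/// Decode the header of any 9P message; None if fewer than 7 bytes.
fn decode_header(buf: &[u8]) -> Option<Header> {
    if buf.len() < 7 {
        return None;
    }
    Some(Header {
        size: u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]),
        msg_type: buf[4],
        tag: u16::from_le_bytes([buf[5], buf[6]]),
    })
}
```

A response can be checked the same way: decode the header first, and treat `msg_type == 7` (Rlerror) as an errno-carrying failure before parsing the expected R-message body.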
Key types
pub struct Qid { pub qid_type: u8, pub version: u32, pub path: u64 }
pub struct DirEntry9p { pub qid: Qid, pub offset: u64, pub dtype: u8, pub name: String }
pub struct Stat9p { pub mode: u32, pub size: u64, pub qid: Qid }
devices/src/virtio/p9.rs — P9Client
High-level client wrapping VirtIO9p. The device is accessed synchronously
through a spin::Mutex — no actor pattern needed since 9P access happens from
syscall context via osl::blocking::blocking().
pub struct P9Client {
    device: Mutex<VirtIO9p<KernelHal, PciTransport>>,
    msize: u32, // negotiated max message size (typically 8192)
    next_fid: Mutex<u32>,
}
Construction
P9Client::new(transport) performs the handshake:
- Tversion — negotiates protocol version (“9P2000.L”) and max message size
- Tattach — attaches root fid (fid 0) to the shared directory
Public methods
| Method | Flow |
|---|---|
| list_dir(path) | walk → lopen → readdir (loop) → clunk |
| read_file(path) | walk → getattr (size) → lopen → read (loop) → clunk |
| stat(path) | walk → getattr → clunk |
Each method walks from the root fid, allocating a temporary fid that is clunked
after the operation completes. The readdir and read loops consume data in
chunks of msize - 64 bytes.
list_dir filters out . and .. entries automatically.
devices/src/vfs/plan9_vfs.rs — VFS Adapter
Follows the ExfatVfs pattern. Wraps an Arc<P9Client> and maps:
- DirEntry9p → VfsDirEntry (dtype 4 or qid type 0x80 → is_dir)
- P9Error → VfsError
The P9Client methods are synchronous but the VFS interface is async. Since the virtio-9p device uses polling (no IRQ), blocking in an async context is acceptable for MVP.
QEMU Configuration
In scripts/run.sh:
-fsdev local,id=fsdev0,path=./user,security_model=none \
-device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostfs
This shares the ./user directory (where userspace binaries are built) into
the guest. security_model=none disables host permission mapping, which is
appropriate since the guest is read-only.
Boot Sequence
run_kernel()
1. PCI probe: find_devices(0x1AF4, 0x1049) ← modern virtio-9p
find_devices(0x1AF4, 0x1009) ← legacy
2. create_pci_transport(bus, dev, func)
3. P9Client::new(transport)
└─ Tversion + Tattach handshake
4. Arc::new(client)
5. vfs::mount("/host", Plan9(Plan9Vfs::new(Arc::clone(&client))))
6. If no virtio-blk:
vfs::mount("/", Plan9(Plan9Vfs::new(client)))
PCI Device IDs
| Device ID | Variant |
|---|---|
| 0x1AF4:0x1049 | Modern virtio-9p (probed first) |
| 0x1AF4:0x1009 | Legacy virtio-9p |
Key Files
| File | Role |
|---|---|
| devices/src/virtio/p9_proto.rs | 9P2000.L wire protocol encode/decode |
| devices/src/virtio/p9.rs | P9Client: high-level 9P client |
| devices/src/vfs/plan9_vfs.rs | Plan9Vfs: VFS adapter |
| devices/src/virtio/mod.rs | KernelHal, create_pci_transport (shared with blk) |
| kernel/src/main.rs | 9P probe, mount at /host and fallback / |
| scripts/run.sh | QEMU -fsdev and -device virtio-9p-pci flags |
Limitations
Read-only
The 9P client only implements read operations (walk, lopen, read, readdir, getattr). Write, create, mkdir, remove, and rename are not supported.
No fid recycling
Fid numbers are allocated monotonically and never reused. With 32-bit fids this is unlikely to be a problem in practice, but a long-running system performing many file operations would eventually exhaust the fid space.
Directory entry sizes
list_dir reports size: 0 for all entries because readdir does not return
file sizes. A per-entry getattr could be added but would increase the number
of 9P round-trips.
Single device
Only one virtio-9p device is probed. Multiple shared directories would require iterating over all matching PCI devices and mounting each at a different path.
Synchronous I/O
All 9P operations block the calling scheduler thread. This is acceptable when
called from syscall context via osl::blocking::blocking(), but direct use
from async tasks would stall the executor.
exFAT Read-Only Filesystem
Overview
The kernel includes a read-only exFAT filesystem driver that sits on top of the virtio-blk block device. It auto-detects bare exFAT volumes, MBR-partitioned disks, and GPT-partitioned disks, then exposes simple directory-listing and file-read operations through the shell.
The driver is implemented entirely in devices/src/virtio/exfat.rs with no
external dependencies.
Architecture
Shell (ls / cat / cd / pwd)
    │
    │  open_exfat / list_dir / read_file
    ▼
ExfatVol ──── async sector reads ────▶ BlkInbox
    │                                      │
Partition detection                  VirtioBlkActor
Boot sector parse                    (virtio-blk driver)
FAT traversal
Dir entry parse
Path walk
All filesystem I/O is done one 512-byte sector at a time via the ask pattern
on the virtio-blk actor’s mailbox (VirtioBlkMsg::Read).
Partition Auto-Detection
open_exfat reads sector 0 and applies the following decision tree:
sector0[3..11] == "EXFAT   "
    → bare exFAT (no partition table); volume starts at LBA 0
sector0[510..512] == [0x55, 0xAA]
    read sector 1
    sector1[0..8] == "EFI PART"
        → GPT: scan partition entries starting at the LBA stored in the header
          look for type GUID = EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
          (on-disk mixed-endian: A2 A0 D0 EB E5 B9 33 44 87 C0 68 B6 B7 26 99 C7)
          read StartingLBA of the matching entry, verify "EXFAT   " there
    else
        → MBR: scan partition table at sector0[446..510]
          look for entry with type byte 0x07
          read LBA start (bytes 8–11 LE u32), verify "EXFAT   " there
else
    → ExfatError::UnknownPartitionLayout
Type 0x07 is shared by exFAT and NTFS. The driver always verifies the OEM name at the candidate partition’s first sector before accepting it as exFAT.
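The first level of the decision tree can be sketched as a pure function over the first two sectors. This is an illustration under stated assumptions, not the driver's code: `Layout` and `classify` are hypothetical names, both sectors are passed up front (the driver reads sector 1 only when needed), and the follow-up verification of the OEM name at the candidate partition is left out.

```rust
const EXFAT_OEM: &[u8; 8] = b"EXFAT   "; // "EXFAT" padded with spaces to 8 bytes
const GPT_SIGNATURE: &[u8; 8] = b"EFI PART";

#[derive(Debug, PartialEq)]
enum Layout {
    /// exFAT boot sector directly at LBA 0.
    Bare,
    /// Protective MBR followed by a GPT header in sector 1.
    Gpt,
    /// MBR; carries the start LBA of the first type-0x07 entry, if any.
    Mbr(Option<u32>),
    /// No boot signature at all.
    Unknown,
}

fn classify(sector0: &[u8; 512], sector1: &[u8; 512]) -> Layout {
    // The bare-exFAT check must come first: an exFAT boot sector also
    // ends with the 0x55 0xAA signature.
    if &sector0[3..11] == EXFAT_OEM {
        return Layout::Bare;
    }
    if &sector0[510..512] != &[0x55u8, 0xAA] {
        return Layout::Unknown;
    }
    if &sector1[0..8] == GPT_SIGNATURE {
        return Layout::Gpt;
    }
    // MBR: four 16-byte entries at 446..510; type byte at offset 4,
    // start LBA at offsets 8..12 (little-endian u32).
    let start = sector0[446..510]
        .chunks_exact(16)
        .find(|entry| entry[4] == 0x07)
        .map(|entry| u32::from_le_bytes([entry[8], entry[9], entry[10], entry[11]]));
    Layout::Mbr(start)
}
```

A caller would still read the sector at the returned start LBA and compare bytes 3..11 against the OEM name before treating a type-0x07 partition as exFAT.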
On-Disk Layout
Boot Sector
| Offset | Size | Field | Notes |
|---|---|---|---|
| 3 | 8 | FileSystemName | Must equal "EXFAT   " ("EXFAT" padded with three trailing spaces to 8 bytes) |
| 80 | 4 | FatOffset | Sectors from volume start to FAT |
| 88 | 4 | ClusterHeapOffset | Sectors from volume start to data region |
| 96 | 4 | FirstClusterOfRootDirectory | Cluster number of root dir |
| 109 | 1 | SectorsPerClusterShift | sectors_per_cluster = 1 << shift |
| 510 | 2 | BootSignature | Must equal [0x55, 0xAA] |
FAT (File Allocation Table)
An array of u32 little-endian values. Entry N holds the next cluster in
the chain for cluster N, or 0xFFFFFFFF for end-of-chain.
fat_lba = volume_lba + FatOffset
sector_of_entry = fat_lba + (N * 4) / 512
byte_in_sector = (N * 4) % 512
Cluster Heap
cluster_lba(N) = cluster_heap_lba + (N − 2) * sectors_per_cluster
Cluster numbers start at 2; clusters 0 and 1 are reserved.
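The FAT and cluster-heap formulas above translate directly into a few helper functions. The sketch below assumes 512-byte sectors, as in the formulas; the function names are hypothetical and not taken from exfat.rs.

```rust
const SECTOR_SIZE: u64 = 512;

/// Locate FAT entry `cluster`: returns (absolute sector, byte offset in sector).
fn fat_entry_location(fat_lba: u64, cluster: u32) -> (u64, usize) {
    let byte_offset = cluster as u64 * 4; // each FAT entry is a u32
    (
        fat_lba + byte_offset / SECTOR_SIZE,
        (byte_offset % SECTOR_SIZE) as usize,
    )
}

/// First sector of data cluster `cluster`; None for reserved clusters 0 and 1.
fn cluster_lba(cluster_heap_lba: u64, sectors_per_cluster: u64, cluster: u32) -> Option<u64> {
    if cluster < 2 {
        return None;
    }
    Some(cluster_heap_lba + (cluster as u64 - 2) * sectors_per_cluster)
}

/// Read the next-cluster link from a FAT sector; None at end-of-chain.
/// `byte_in_sector` must be at most 508 so the 4-byte entry fits.
fn next_cluster(fat_sector: &[u8; 512], byte_in_sector: usize) -> Option<u32> {
    let s = &fat_sector[byte_in_sector..byte_in_sector + 4];
    let raw = u32::from_le_bytes([s[0], s[1], s[2], s[3]]);
    if raw == 0xFFFF_FFFF { None } else { Some(raw) }
}
```

Following a chain is then a loop: map the current cluster to its LBA, read it, look up its FAT entry, and stop when `next_cluster` returns `None`.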
Directory Entry Sets
Each entry is 32 bytes. A file or directory is represented by a consecutive set of three or more entries:
| Type byte | Name | Key fields |
|---|---|---|
| 0x85 | File | [1] SecondaryCount; [4..6] FileAttributes (bit 4 = directory) |
| 0xC0 | Stream Extension | [8..16] length (u64 LE; the exFAT specification names this field ValidDataLength and places DataLength at [24..32]); [20..24] FirstCluster (u32 LE) |
| 0xC1+ | File Name | [2..32] up to 15 UTF-16LE code units per entry |
Type 0x00 marks the end of directory; scanning stops immediately.
Any type byte with bit 7 clear (< 0x80) is an unused or deleted entry and
is skipped.
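The scanning rules for the type byte can be sketched as a small classifier over one 32-byte entry. `EntryKind` and `classify_entry` are hypothetical names used only for this illustration; field offsets follow the table above.

```rust
#[derive(Debug, PartialEq)]
enum EntryKind {
    /// Type 0x00: stop scanning the directory.
    EndOfDirectory,
    /// Bit 7 clear: unused or deleted entry, skipped.
    Unused,
    /// Type 0x85: primary entry of a set.
    File { secondary_count: u8, is_dir: bool },
    /// Type 0xC0: carries the first cluster of the data.
    StreamExtension { first_cluster: u32 },
    /// Type 0xC1: up to 15 UTF-16LE code units of the name.
    FileName,
    /// Any other in-use entry type (bitmap, up-case table, label, ...).
    Other(u8),
}

fn classify_entry(e: &[u8; 32]) -> EntryKind {
    match e[0] {
        0x00 => EntryKind::EndOfDirectory,
        t if t & 0x80 == 0 => EntryKind::Unused,
        0x85 => {
            let attrs = u16::from_le_bytes([e[4], e[5]]);
            EntryKind::File {
                secondary_count: e[1],
                is_dir: attrs & (1 << 4) != 0, // bit 4 = directory
            }
        }
        0xC0 => EntryKind::StreamExtension {
            first_cluster: u32::from_le_bytes([e[20], e[21], e[22], e[23]]),
        },
        0xC1 => EntryKind::FileName,
        other => EntryKind::Other(other),
    }
}
```

A directory scan would walk the buffer in 32-byte steps, start a new set on `File`, consume `secondary_count` following entries, and stop at `EndOfDirectory`.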
ExfatVol State
pub struct ExfatVol {
    lba_base: u64,            // absolute LBA of the exFAT boot sector
    sectors_per_cluster: u64,
    fat_lba: u64,             // absolute LBA of the FAT
    cluster_heap_lba: u64,    // absolute LBA of the cluster heap
    root_cluster: u32,
}
This is returned by open_exfat and passed to every subsequent call. The
shell calls open_exfat fresh on each command (stateless).
Public API
/// Auto-detect layout and open the exFAT volume.
pub async fn open_exfat(inbox: &BlkInbox) -> Result<ExfatVol, ExfatError>;
/// List directory at `path` (e.g. "/" or "/docs").
pub async fn list_dir(vol: &ExfatVol, inbox: &BlkInbox, path: &str)
    -> Result<Vec<DirEntry>, ExfatError>;
/// Read a file into memory. Capped at 16 KiB.
pub async fn read_file(vol: &ExfatVol, inbox: &BlkInbox, path: &str)
    -> Result<Vec<u8>, ExfatError>;
pub struct DirEntry {
    pub name: String,
    pub is_dir: bool,
    pub size: u64,
}
pub enum ExfatError {
    NoDevice, IoError, NotExfat, UnknownPartitionLayout,
    PathNotFound, NotAFile, NotADirectory, FileTooLarge,
}
BlkInbox is a type alias for the virtio-blk actor’s mailbox:
pub type BlkInbox = Arc<Mailbox<ActorMsg<VirtioBlkMsg, VirtioBlkInfo>>>;
Path Resolution
The shell maintains a current working directory (CWD) in Shell::cwd
(spin::Mutex<String>, default "/").
resolve_path(cwd, path) in kernel/src/shell.rs handles relative and
absolute paths, then normalize_path collapses . and .. components:
cwd = "/a/b"
resolve("../c") → normalize("/a/b/../c") → "/a/c"
resolve("/foo") → "/foo"
resolve("") → "/a/b" (defaults to CWD)
Path component matching in the driver is case-insensitive ASCII
(str::eq_ignore_ascii_case). Non-ASCII filename characters are replaced
with ? in the decoded string.
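The resolution rules above can be reproduced with a short sketch. The function names match those mentioned in the text, but the bodies are an assumption about how such helpers might be written, not a copy of kernel/src/shell.rs.

```rust
/// Collapse "." and ".." components and duplicate slashes.
/// ".." at the root stays at the root.
fn normalize_path(path: &str) -> String {
    let mut parts: Vec<&str> = Vec::new();
    for component in path.split('/') {
        match component {
            "" | "." => {}
            ".." => {
                parts.pop();
            }
            other => parts.push(other),
        }
    }
    if parts.is_empty() {
        "/".to_string()
    } else {
        format!("/{}", parts.join("/"))
    }
}

/// Resolve `path` against `cwd`: empty means CWD, a leading '/' means
/// absolute, anything else is relative to CWD.
fn resolve_path(cwd: &str, path: &str) -> String {
    if path.is_empty() {
        normalize_path(cwd)
    } else if path.starts_with('/') {
        normalize_path(path)
    } else {
        normalize_path(&format!("{}/{}", cwd, path))
    }
}
```

Component matching inside the driver is a separate step; `str::eq_ignore_ascii_case` gives the case-insensitive ASCII comparison described above.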
Shell Commands
| Command | Description |
|---|---|
| ls [path] | List directory; defaults to CWD |
| cat <path> | Print file as text; non-printable bytes shown as . |
| pwd | Print current working directory |
| cd [path] | Change CWD; verifies the target exists; defaults to / |
| blk ls [path] | Alias for ls |
| blk cat <path> | Alias for cat |
cd calls list_dir on the target path before updating the CWD, so invalid
paths are rejected with an error rather than silently accepted.
Memory Budget
Peak heap usage during ls:
| Item | Size |
|---|---|
| Boot sector | 512 B |
| FAT sector (per entry lookup) | 512 B |
| Cluster data (typical 4 KiB cluster) | 4 KiB |
| Vec<DirEntry> | small |
| Total | ~5 KiB |
read_file caps output at 16 KiB. The kernel heap is 100 KiB; both
operations are well within budget.
Limitations
Read-only
Write support is not implemented. VirtioBlkMsg::Write exists in the block
driver but the exFAT layer has no write path.
Entry sets crossing cluster boundaries
scan_dir_cluster collects all sectors of a cluster into a flat buffer before
parsing entries. An entry set whose 0x85 primary entry is in one cluster and
whose secondary entries start in the next cluster will be silently skipped.
This situation does not arise on normally-formatted volumes where directories
start empty.
ASCII-only filenames
UTF-16LE code units above U+007F are replaced with ?. Files can still be
opened by name if the shell command uses the same replacement, but in
practice, test images should use ASCII filenames.
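The replacement rule can be illustrated with a decoder for a single File Name (0xC1) entry. `decode_name_entry` is a hypothetical helper; `units` is the number of code units still expected for the name, which the caller would take from the stream extension entry.

```rust
/// Decode up to 15 UTF-16LE code units from bytes [2..32] of a File Name
/// entry. Code units above 0x7F are replaced with '?'.
fn decode_name_entry(entry: &[u8; 32], units: usize) -> String {
    entry[2..32]
        .chunks_exact(2)
        .take(units.min(15))
        .map(|pair| {
            let unit = u16::from_le_bytes([pair[0], pair[1]]);
            if unit <= 0x7F { unit as u8 as char } else { '?' }
        })
        .collect()
}
```

Names longer than 15 code units span several 0xC1 entries, so the caller would concatenate the decoded pieces in order.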
Fresh volume open per command
open_exfat reads the boot sector (and up to ~32 GPT entry sectors) on every
shell command. A cached ExfatVol stored in the shell actor would reduce
overhead, but is unnecessary given the current workload.
16 KiB file cap
read_file returns ExfatError::FileTooLarge for files exceeding 16 KiB.
The limit exists to protect the 100 KiB heap; it can be raised if the heap is
grown.
Key Files
| File | Role |
|---|---|
| devices/src/virtio/exfat.rs | Partition detection, boot parse, FAT traversal, dir scan, path walk, public API |
| devices/src/virtio/mod.rs | Re-exports BlkInbox, DirEntry, ExfatError, ExfatVol, public functions |
| kernel/src/shell.rs | cmd_blk_ls, cmd_blk_cat, cmd_cd, cmd_pwd, resolve_path, normalize_path |
Creating Test Images
GPT (macOS default)
hdiutil create -size 32m -fs ExFAT -volname TEST test-gpt.dmg
hdiutil attach test-gpt.dmg
cp hello.txt /Volumes/TEST/
mkdir /Volumes/TEST/subdir
cp nested.txt /Volumes/TEST/subdir/
hdiutil detach /Volumes/TEST
hdiutil convert test-gpt.dmg -format UDRO -o test-gpt-ro.dmg
MBR-partitioned
diskutil eraseDisk ExFAT TEST MBRFormat /dev/diskN
Bare exFAT (no partition table)
diskutil eraseVolume ExFAT TEST /dev/diskN
Running in QEMU
qemu-system-x86_64 ... \
-drive file=test-gpt.img,format=raw,if=none,id=hd0 \
-device virtio-blk-pci,drive=hd0
Then in the shell:
ostoo:/> ls
[DIR] subdir
[FILE 13] hello.txt
ostoo:/> cat /hello.txt
Hello, kernel!
ostoo:/> cd subdir
ostoo:/subdir> ls
[FILE 11] nested.txt
ostoo:/subdir> cat nested.txt
Hello again!
ostoo:/subdir> cd /
ostoo:/> pwd
/
Virtual Filesystem (VFS) Layer
Overview
The VFS layer provides a uniform path namespace over multiple filesystems. Before its introduction, the shell called the exFAT driver directly; adding a second filesystem would have required invasive shell changes. The VFS decouples path resolution and filesystem dispatch so that new drivers slot in without touching the shell.
Key properties:
- Enum dispatch: no heap-allocating Pin<Box<dyn Future>> trait objects.
- Mount table: filesystems are attached at arbitrary absolute paths.
- Lock safety: the mount-table lock is never held across an await point.
- No new Cargo dependencies: everything is already present in the workspace.
Source layout
devices/src/
vfs/
mod.rs — public API, mount table, path resolution
exfat_vfs.rs — ExfatVfs: wraps virtio-blk + exFAT driver
plan9_vfs.rs — Plan9Vfs: wraps virtio-9p P9Client
proc_vfs/ — ProcVfs: synthetic kernel-info filesystem (mod.rs + generator submodules)
Public API (devices::vfs)
// Types
pub struct VfsDirEntry { pub name: String, pub is_dir: bool, pub size: u64 }
pub enum VfsError {
    IoError, NotFound, NotAFile, NotADirectory, FileTooLarge, NoFilesystem,
}
pub enum AnyVfs { Exfat(ExfatVfs), Plan9(Plan9Vfs), Proc(ProcVfs) }
// Functions
pub fn mount(mountpoint: &str, fs: AnyVfs);
pub async fn list_dir(path: &str) -> Result<Vec<VfsDirEntry>, VfsError>;
pub async fn read_file(path: &str) -> Result<Vec<u8>, VfsError>;
pub fn with_mounts<F: FnOnce(&[(String, Arc<AnyVfs>)])>(f: F);
All paths supplied to list_dir and read_file must be absolute (the shell’s
resolve_path runs first and normalises . / ..).
Enum dispatch
Async methods on trait objects require Pin<Box<dyn Future>> — allocating and
verbose in no_std. Instead, AnyVfs is a plain enum:
pub enum AnyVfs {
    Exfat(ExfatVfs),
    Plan9(Plan9Vfs),
    Proc(ProcVfs),
}
impl AnyVfs {
    pub async fn list_dir(&self, path: &str) -> Result<Vec<VfsDirEntry>, VfsError> {
        match self {
            AnyVfs::Exfat(fs) => fs.list_dir(path).await,
            AnyVfs::Plan9(fs) => fs.list_dir(path).await,
            AnyVfs::Proc(fs) => fs.list_dir(path).await,
        }
    }
    // read_file, fs_type likewise
}
Adding a new filesystem = add one variant + three match arms (list_dir,
read_file, fs_type).
Mount table
#![allow(unused)]
fn main() {
lazy_static! {
static ref MOUNTS: spin::Mutex<Vec<(String, Arc<AnyVfs>)>> = ...;
}
}
Entries are kept sorted longest-mountpoint-first so resolution is a simple linear scan — the first match wins without any backtracking.
mount() replaces an existing entry at the same mountpoint, then re-sorts.
Arc<AnyVfs> is cloned out of the lock before any .await; the spinlock is
never held across a suspension point.
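The replace-then-sort behaviour of mount() can be sketched with std types. `MountTable`, `FakeFs`, and `snapshot` are hypothetical names for this illustration, and `std::sync::Mutex` stands in for the kernel's spinlock.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical filesystem stand-in carrying only a label.
struct FakeFs(&'static str);

struct MountTable {
    mounts: Mutex<Vec<(String, Arc<FakeFs>)>>,
}

impl MountTable {
    fn new() -> Self {
        MountTable { mounts: Mutex::new(Vec::new()) }
    }

    /// Replace any entry at the same mountpoint, then keep the table
    /// sorted longest-mountpoint-first so a linear scan finds the most
    /// specific mount before "/".
    fn mount(&self, mountpoint: &str, fs: FakeFs) {
        let mut mounts = self.mounts.lock().unwrap();
        mounts.retain(|(mp, _)| mp != mountpoint);
        mounts.push((mountpoint.to_string(), Arc::new(fs)));
        mounts.sort_by(|a, b| b.0.len().cmp(&a.0.len()));
    }

    /// Copy out (mountpoint, label) pairs in table order.
    fn snapshot(&self) -> Vec<(String, &'static str)> {
        self.mounts
            .lock()
            .unwrap()
            .iter()
            .map(|(mp, fs)| (mp.clone(), fs.0))
            .collect()
    }
}
```

With this ordering the root mount always ends up last, so it acts as the fallback only when no longer mountpoint matches.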
Path resolution rules
| Situation | Mountpoint | Request path | Rel path passed to driver |
|---|---|---|---|
| Exact match | /proc | /proc | / |
| Prefix match | /proc | /proc/tasks | /tasks |
| Root pass-through | / | /docs/foo | /docs/foo |
| No match | — | /missing | VfsError::NoFilesystem |
fn resolve(path: &str) -> Option<(Arc<AnyVfs>, String)> {
    for (mp, fs) in MOUNTS.lock().iter() {
        // "/" sorts last, so it only matches when nothing longer did.
        if mp == "/" { return Some((Arc::clone(fs), path.into())); }
        if path == mp { return Some((Arc::clone(fs), "/".into())); }
        if path.starts_with(mp) && path[mp.len()..].starts_with('/') {
            return Some((Arc::clone(fs), path[mp.len()..].into()));
        }
    }
    None
}
ExfatVfs
ExfatVfs wraps a BlkInbox (the virtio-blk actor’s mailbox) and delegates
to the existing devices::virtio::exfat functions. It calls open_exfat
fresh on every request — identical to the pre-VFS shell behaviour.
ExfatVfs::list_dir / read_file
└─ exfat::open_exfat (detects bare/MBR/GPT layout)
└─ exfat::list_dir / read_file
ExfatError → VfsError mapping:
| ExfatError | VfsError |
|---|---|
| NoDevice / IoError / NotExfat / UnknownPartitionLayout | IoError |
| PathNotFound | NotFound |
| NotAFile | NotAFile |
| NotADirectory | NotADirectory |
| FileTooLarge | FileTooLarge |
Plan9Vfs
Plan9Vfs wraps an Arc<P9Client> and delegates to the 9P2000.L client.
Unlike ExfatVfs (which goes through the actor/mailbox path), the P9 client
performs synchronous virtio-9p device I/O directly under a spin::Mutex.
Plan9Vfs::list_dir / read_file
└─ P9Client::list_dir / read_file
└─ VirtIO9p::request (virtio-drivers)
P9Error → VfsError mapping:
| P9Error | VfsError |
|---|---|
| ServerError(2) (ENOENT) | NotFound |
| ServerError(20) (ENOTDIR) | NotADirectory |
| ServerError(21) (EISDIR) | NotAFile |
| ServerError(_) / DeviceError | IoError |
| BufferTooSmall / InvalidResponse / Utf8Error | IoError |
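One way to express this table is a `From` conversion. The sketch below is an assumption about shape only: the variant names come from the table, but the payload type of `ServerError` (a `u32` errno here) and the trimmed-down `VfsError` enum are guesses made for illustration.

```rust
/// Assumed shape of the 9P client error; the payload type is a guess.
#[derive(Debug, PartialEq)]
pub enum P9Error {
    ServerError(u32),
    DeviceError,
    BufferTooSmall,
    InvalidResponse,
    Utf8Error,
}

/// Only the VfsError variants that the mapping table mentions.
#[derive(Debug, PartialEq)]
pub enum VfsError {
    NotFound,
    NotADirectory,
    NotAFile,
    IoError,
}

// Linux errno values carried in 9P2000.L error replies.
const ENOENT: u32 = 2;
const ENOTDIR: u32 = 20;
const EISDIR: u32 = 21;

impl From<P9Error> for VfsError {
    fn from(e: P9Error) -> Self {
        match e {
            P9Error::ServerError(ENOENT) => VfsError::NotFound,
            P9Error::ServerError(ENOTDIR) => VfsError::NotADirectory,
            P9Error::ServerError(EISDIR) => VfsError::NotAFile,
            // Every other server errno and all transport errors collapse to IoError.
            P9Error::ServerError(_)
            | P9Error::DeviceError
            | P9Error::BufferTooSmall
            | P9Error::InvalidResponse
            | P9Error::Utf8Error => VfsError::IoError,
        }
    }
}
```

With a conversion like this in place, driver code could propagate errors with the `?` operator.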
The list_dir result sets is_dir from the dirent’s dtype field (4 = DT_DIR)
or the qid type bit (0x80 = directory). The size field is 0 since readdir
does not report file sizes — a follow-up stat per entry could be added later.
See docs/virtio-9p.md for the full 9P driver documentation.
ProcVfs
A synthetic filesystem with no block I/O. All content is computed on demand.
| VFS path | Relative path seen by driver | Content |
|---|---|---|
| /proc | / | directory listing |
| /proc/tasks | /tasks | ready: N waiting: M\n |
| /proc/uptime | /uptime | Ns\n |
| /proc/drivers | /drivers | one name State line per driver |
Data sources:
- executor::ready_count() / executor::wait_count(): task queue depths
- timer::ticks() / TICKS_PER_SECOND: seconds since boot
- driver::with_drivers(): registered driver names and states
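As a rough illustration of how the synthetic file contents could be rendered from these sources, the sketch below formats the two simplest files. The function names and the tick rate of 100 are assumptions; the real values come from the kernel's timer and executor.

```rust
/// Assumed tick rate for this sketch; the real constant lives in the kernel timer.
pub const TICKS_PER_SECOND: u64 = 100;

/// Contents of /proc/uptime: whole seconds since boot, in the "Ns\n" form.
pub fn render_uptime(ticks: u64) -> String {
    format!("{}s\n", ticks / TICKS_PER_SECOND)
}

/// Contents of /proc/tasks, in the "ready: N waiting: M\n" form shown in the table.
pub fn render_tasks(ready: usize, waiting: usize) -> String {
    format!("ready: {} waiting: {}\n", ready, waiting)
}
```

Integer division truncates, so the reported uptime only advances on whole seconds.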
Kernel initialisation (kernel/src/main.rs)
// Probe virtio-9p and create a shared P9Client.
let p9_client = probe_9p(); // returns Option<Arc<P9Client>>

// If 9p is available, always mount at /host.
if let Some(ref client) = p9_client {
    devices::vfs::mount("/host", AnyVfs::Plan9(Plan9Vfs::new(Arc::clone(client))));
}

// Always mount /proc: it needs no block device.
devices::vfs::mount("/proc", AnyVfs::Proc(ProcVfs));

// Mount exFAT at / if virtio-blk was probed successfully.
let have_blk = if let Some(inbox) = registry::get::<..>("virtio-blk") {
    devices::vfs::mount("/", AnyVfs::Exfat(ExfatVfs::new(inbox)));
    true
} else { false };

// Fallback: mount 9p at / if no disk image is present.
if !have_blk {
    if let Some(client) = p9_client {
        devices::vfs::mount("/", AnyVfs::Plan9(Plan9Vfs::new(client)));
    }
}
This runs after both the virtio-blk and virtio-9p probe blocks and before task
spawning. When both are present, exFAT owns / and 9p is at /host. When
only 9p is present, it is mounted at both /host and / so that /shell
auto-launch works without a disk image.
Shell integration (kernel/src/shell.rs)
The shell commands ls, cat, and cd now call the VFS API instead of the
exFAT driver directly:
ls [path] → devices::vfs::list_dir(&path).await
cat <path> → devices::vfs::read_file(&path).await
cd [path] → devices::vfs::list_dir(&target).await (directory check)
A new mount command manages the mount table at runtime:
mount — list all mounts
mount proc <mountpoint> — attach a ProcVfs instance
mount blk <mountpoint> — attach an ExfatVfs instance (requires virtio-blk)
Example session
# Boot with 9p only (no disk image)
ostoo:/> mount
/ 9p
/host 9p
/proc proc
ostoo:/> ls /
shell
ostoo:/> ls /host
shell
ostoo:/> cat /proc/uptime
42s
# Boot with both disk image and 9p
ostoo:/> mount
/ exfat
/host 9p
/proc proc
ostoo:/> ls /
[DIR] subdir
[FILE 13] hello.txt
ostoo:/> ls /host
shell
ostoo:/> cat /host/shell | head
(binary ELF data)
Extending the VFS
To add a new filesystem type:
- Create devices/src/vfs/<name>_vfs.rs implementing list_dir and read_file as plain async fn.
- Add a variant to AnyVfs in mod.rs and a match arm in each dispatch method (list_dir, read_file, fs_type).
- Re-export the new type from mod.rs.
- Mount it from main.rs or the shell’s mount command.
No changes to the shell dispatch loop or path-resolution logic are required.
IPC Channels
Overview
Capability-based IPC channels for structured message passing between processes. A channel is a unidirectional message conduit with configurable buffer capacity. Channels come in pairs: a send end and a receive end, each exposed as a file descriptor (capability).
The buffer capacity, set at creation time, determines the communication model:
- capacity = 0 – Synchronous rendezvous. Sender blocks until a receiver calls recv. Direct message transfer with scheduler donate for minimal latency (matching seL4 endpoint characteristics).
- capacity > 0 – Asynchronous buffered. Sender enqueues and returns immediately. Blocks only when the buffer is full.
This gives applications full control: create a sync channel for tight RPC-style communication, or an async channel for decoupled producer-consumer patterns.
Message Format
struct ipc_message {
uint64_t tag; /* user-defined message type */
uint64_t data[3]; /* 24 bytes of inline payload */
int32_t fds[4]; /* file descriptors for capability passing (-1 = unused) */
};
/* Total: 48 bytes */
The tag field is opaque to the kernel – applications use it to identify
message types. The data array carries the payload (pointers, handles,
small structs). The fds array carries file descriptors for capability
passing (set unused slots to -1). For bulk data, use shared memory with
a channel for signaling.
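On the Rust side, the same layout can be mirrored with a `#[repr(C)]` struct. The source names an `IpcMessage` struct in libkernel/src/channel.rs but does not show it, so the field names, the `NO_FD` constant, and the helpers below are assumptions that simply follow the C definition.

```rust
/// Rust mirror of `struct ipc_message`; field names follow the C definition.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct IpcMessage {
    pub tag: u64,       // user-defined message type
    pub data: [u64; 3], // 24 bytes of inline payload
    pub fds: [i32; 4],  // fds to transfer, -1 = unused slot
}

impl IpcMessage {
    /// Marker for an unused fd slot (hypothetical constant name).
    pub const NO_FD: i32 = -1;

    /// Build a message that passes no file descriptors.
    pub fn new(tag: u64, data: [u64; 3]) -> Self {
        IpcMessage { tag, data, fds: [Self::NO_FD; 4] }
    }

    /// Iterate over the fd slots that are actually in use.
    pub fn passed_fds(&self) -> impl Iterator<Item = i32> + '_ {
        self.fds.iter().copied().filter(|&fd| fd != Self::NO_FD)
    }
}
```

Initialising every slot to -1 matters: a zeroed `fds` array would name fd 0 four times rather than "no fds".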
Syscalls
ipc_create (505)
long ipc_create(int fds[2], unsigned capacity, unsigned flags);
Creates a channel pair. Writes the send-end fd to fds[0] and the
receive-end fd to fds[1].
| Parameter | Description |
|---|---|
| fds | User pointer to a 2-element int array |
| capacity | Buffer capacity: 0 = sync, >0 = async buffered |
| flags | IPC_CLOEXEC (0x1): set close-on-exec on both fds |
Returns 0 on success, negative errno on failure.
ipc_send (506)
long ipc_send(int fd, const struct ipc_message *msg, unsigned flags);
Send a message through a send-end fd.
| Parameter | Description |
|---|---|
| fd | Send-end file descriptor |
| msg | Pointer to message in user memory |
| flags | IPC_NONBLOCK (0x1): return -EAGAIN instead of blocking |
Blocking behavior:
- Sync (cap=0): blocks until a receiver calls recv, then transfers directly
- Async (cap>0): blocks only if the buffer is full
Returns 0 on success, -EPIPE if receive end is closed, -EAGAIN if
non-blocking and would block.
ipc_recv (507)
long ipc_recv(int fd, struct ipc_message *msg, unsigned flags);
Receive a message from a receive-end fd.
| Parameter | Description |
|---|---|
| fd | Receive-end file descriptor |
| msg | Pointer to buffer in user memory |
| flags | IPC_NONBLOCK (0x1): return -EAGAIN instead of blocking |
Returns 0 on success, -EPIPE if send end is closed and no messages remain.
Examples
Sync channel (capacity=0)
int fds[2];
ipc_create(fds, 0, 0); /* sync channel */
int send_fd = fds[0], recv_fd = fds[1];
/* In child (after clone+execve, with recv_fd inherited): */
struct ipc_message msg;
ipc_recv(recv_fd, &msg, 0); /* blocks until parent sends */
/* In parent: */
struct ipc_message req = { .tag = 1, .data = {42, 0, 0}, .fds = {-1, -1, -1, -1} };
ipc_send(send_fd, &req, 0); /* blocks until child recvs, then donates */
Async channel (capacity=4)
int fds[2];
ipc_create(fds, 4, 0); /* buffered, 4 messages */
/* Producer can send 4 messages without blocking: */
for (int i = 0; i < 4; i++) {
struct ipc_message m = { .tag = i, .fds = {-1, -1, -1, -1} }; /* no fds passed */
ipc_send(fds[0], &m, 0);
}
/* Consumer drains: */
struct ipc_message m;
while (ipc_recv(fds[1], &m, IPC_NONBLOCK) == 0) {
/* process m */
}
/* returns -EAGAIN when empty */
fd-passing (capability transfer)
/* Create a pipe and an IPC channel */
int pipe_fds[2], ch_fds[2];
pipe(pipe_fds);
ipc_create(ch_fds, 4, 0);
/* Send the pipe write-end through the channel */
struct ipc_message msg = {
.tag = 1,
.data = { 0, 0, 0 },
.fds = { pipe_fds[1], -1, -1, -1 }, /* transfer pipe write-end */
};
ipc_send(ch_fds[0], &msg, 0);
/* Receive — kernel allocates a new fd for the pipe write-end */
struct ipc_message recv_msg;
ipc_recv(ch_fds[1], &recv_msg, 0);
int new_write_fd = recv_msg.fds[0]; /* new fd number in receiver */
write(new_write_fd, "hello", 5); /* writes to the same pipe */
Semantics: When ipc_send is called with non-(-1) values in the fds
array, the kernel looks up each fd in the sender’s fd table, increments
reference counts, and stores the kernel objects inside the channel. When
ipc_recv delivers the message, the kernel allocates new fds in the
receiver’s fd table and rewrites msg.fds with the new fd numbers.
Error handling: If any fd in msg.fds is invalid, the entire send fails
with -EBADF. If the receiver’s fd table is full, recv fails with
-EMFILE.
Cleanup: If a message with transferred fds is never received (e.g., the channel is destroyed with messages in the queue), the kernel closes the transferred fd objects automatically.
Kernel Implementation
Files
| File | Purpose |
|---|---|
| libkernel/src/channel.rs | ChannelInner kernel object, IpcMessage struct, send/recv/close logic |
| libkernel/src/file.rs | FdObject::Channel(ChannelFd) variant, ChannelFd::Send/Recv |
| osl/src/ipc.rs | Syscall implementations (sys_ipc_create/send/recv) |
| osl/src/syscall_nr.rs | SYS_IPC_CREATE=505, SYS_IPC_SEND=506, SYS_IPC_RECV=507 |
Sync rendezvous internals
When capacity=0, sender and receiver rendezvous directly:
- If receiver is already blocked: sender copies message to pending_send, unblocks receiver, donates quantum via set_donate_target + yield_now
- If no receiver: sender stores message in pending_send, records thread index, blocks via block_current_thread()
- Receiver wakes, takes message from pending_send, unblocks sender
This uses the same block_current_thread / unblock / donate primitives
as pipes and waitpid (see docs/scheduler-donate.md).
Async buffered internals
Messages are stored in a VecDeque<IpcMessage> bounded by capacity:
- Send: push to queue, wake blocked receiver if any
- Recv: pop from queue, wake blocked sender if queue was full
- Queue full: sender blocks until receiver drains
- Queue empty: receiver blocks until sender enqueues
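The non-blocking half of these rules can be modelled without a scheduler. The sketch below is a single-threaded approximation: `BufferedChannel`, `TryError`, and the close methods are hypothetical names, and where the kernel would block or wake a thread, this model returns `WouldBlock` instead.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
pub enum TryError {
    WouldBlock, // corresponds to -EAGAIN with IPC_NONBLOCK
    Closed,     // corresponds to -EPIPE
}

/// Hypothetical model of an async (capacity > 0) channel.
pub struct BufferedChannel<T> {
    queue: VecDeque<T>,
    capacity: usize,
    send_closed: bool,
    recv_closed: bool,
}

impl<T> BufferedChannel<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity 0 is the rendezvous case, not modelled here");
        BufferedChannel {
            queue: VecDeque::with_capacity(capacity),
            capacity,
            send_closed: false,
            recv_closed: false,
        }
    }

    /// Enqueue unless the receive end is closed or the queue is full.
    pub fn try_send(&mut self, msg: T) -> Result<(), TryError> {
        if self.recv_closed {
            return Err(TryError::Closed);
        }
        if self.queue.len() >= self.capacity {
            return Err(TryError::WouldBlock);
        }
        self.queue.push_back(msg);
        Ok(())
    }

    /// Dequeue the oldest message; report Closed only once the queue is drained.
    pub fn try_recv(&mut self) -> Result<T, TryError> {
        match self.queue.pop_front() {
            Some(msg) => Ok(msg),
            None if self.send_closed => Err(TryError::Closed),
            None => Err(TryError::WouldBlock),
        }
    }

    pub fn close_send(&mut self) { self.send_closed = true; }
    pub fn close_recv(&mut self) { self.recv_closed = true; }
}
```

Messages queued before the send end closes remain receivable, which matches the "-EPIPE if send end is closed and no messages remain" rule for ipc_recv.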
Design Decisions
Unidirectional: Simpler and more composable than bidirectional. For RPC, use two channels (request + reply). For server fan-in, share the send-end fd via dup/fork.
Fixed-size messages: No heap allocation per message. 48 bytes fits common control-plane payloads plus 4 file descriptors for capability passing. Bulk data should use shared memory.
Capacity determines semantics: The application chooses sync vs async at creation time, not at each send/recv. This makes the channel’s behavior predictable and matches the Go channels model.
IPC_NONBLOCK flag: Adds flexibility for polling patterns and try-send/try-recv without changing the channel’s fundamental semantics.
Channel as fd: Reuses the existing fd_table, close, dup2, CLOEXEC, and cleanup-on-exit infrastructure. No new kernel handle namespace.
Completion Port Integration
IPC channels can be multiplexed with other async I/O sources (IRQs, timers, file reads) via the completion port system.
OP_IPC_SEND (opcode 5)
Submit an IPC send as an async operation via io_submit. The message is
read from user memory at submission time. If the channel can accept it
immediately, a completion is posted right away. Otherwise the message is
stored and the completion fires when a receiver drains space.
Submission fields:
| Field | Value |
|---|---|
| opcode | 5 (OP_IPC_SEND) |
| fd | Channel send-end file descriptor |
| buf_addr | Pointer to user struct ipc_message to send |
| user_data | User-defined tag (returned in completion) |
Completion fields:
| Field | Value |
|---|---|
| opcode | 5 (OP_IPC_SEND) |
| result | 0 on success, -EPIPE if receive end closed |
| user_data | Same as submission |
OP_IPC_RECV (opcode 6)
Submit an IPC receive as an async operation via io_submit. When a message
arrives on the channel, a completion is posted to the port with the message
copied to the user-provided buffer.
Submission fields:
| Field | Value |
|---|---|
| opcode | 6 (OP_IPC_RECV) |
| fd | Channel receive-end file descriptor |
| buf_addr | Pointer to user struct ipc_message buffer |
| user_data | User-defined tag (returned in completion) |
Completion fields:
| Field | Value |
|---|---|
| opcode | 6 (OP_IPC_RECV) |
| result | 0 on success, -EPIPE if send end closed |
| user_data | Same as submission |
The message is copied to buf_addr by io_wait (same mechanism as OP_READ).
Semantics: Both operations are one-shot, like OP_IRQ_WAIT. Each submission handles exactly one message. Re-submit after each completion for continuous send/receive.
Example: event loop with IPC + timer
int port = io_create(0);
int fds[2];
ipc_create(fds, 4, 0);
struct ipc_message recv_buf;
struct io_submission subs[2] = {
{ .opcode = 6 /* OP_IPC_RECV */, .fd = fds[1],
.buf_addr = (uint64_t)&recv_buf, .user_data = 1 },
{ .opcode = 1 /* OP_TIMEOUT */, .timeout_ns = 1000000000, .user_data = 2 },
};
io_submit(port, subs, 2);
struct io_completion comp;
io_wait(port, &comp, 1, 1, 0);
if (comp.user_data == 1) {
/* IPC message received in recv_buf */
} else {
/* timer fired */
}
Kernel internals
When io_submit processes OP_IPC_RECV, it calls arm_recv():
- If a message is already in the queue, posts a completion immediately
- Otherwise, registers the port on the channel (pending_port field)
- When a future ipc_send deposits a message, try_send detects the armed port and returns SendAction::PostToPort; the caller serializes the message and posts it to the port after releasing the channel lock
When io_submit processes OP_IPC_SEND, it calls arm_send():
- If the channel can accept the message (queue not full, or receiver waiting), delivers it and posts a success completion immediately
- Otherwise, stores the port + message in pending_send_port
- When a future ipc_recv drains space, try_recv detects the armed send port and returns RecvAction::MessageAndNotifySendPort; the caller posts a success completion to the send port
Lock ordering: channel lock is always acquired before port lock (never reversed), preventing deadlocks.
Future Extensions
- ipc_call(fd, send_msg, recv_msg) – atomic send+recv for RPC
- Bidirectional channels – two queues in one object
- fd-passing in messages – implemented: fds[4] in IpcMessage, kernel transfers fd objects between processes
- Notification objects – seL4-style bitmask signaling
Completion Port Design
Overview
This document describes a unified completion-based async I/O primitive (CompletionPort) for ostoo. The design supports both io_uring-style and Windows IOCP-style patterns through a single kernel object, accessed as an ordinary file descriptor.
The CompletionPort is motivated by the microkernel migration path (microkernel-design.md, Phases B-E) where userspace drivers need to wait on multiple event sources — IRQs, shared-memory ring wakeups, timers — through a single blocking wait point, without polling or managing multiple threads.
See also: mmap-design.md for the shared memory primitives that enable the zero-syscall ring optimisation (Phase 5 of this design).
Motivation
The Problem
A userspace NIC driver in the microkernel architecture must simultaneously wait for:
- IRQ events — the device raised an interrupt
- Ring wakeups — the TCP/IP server posted new transmit descriptors
- Timers — a retransmit or watchdog timer expired
With the current kernel, each of these is a separate blocking read() on a
separate fd. A driver would need one thread per event source, or a poll()/
select() readiness multiplexer — neither of which exists yet.
Why Completion-Based
A completion-based model inverts the usual readiness pattern:
- Readiness (epoll/poll/select): “tell me when fd X is ready, then I’ll do the I/O myself.” Two syscalls per operation (wait + read/write).
- Completion (io_uring/IOCP): “do this I/O for me and tell me when it’s done.” One syscall to submit, one to reap — or zero with shared-memory rings.
Completion-based I/O is a better fit for ostoo because:
- Simpler driver loops. Submit work, wait for completions. No edge-triggered vs level-triggered subtlety.
- Naturally batched. Multiple operations submitted and reaped per syscall.
- Unifies heterogeneous events. IRQs, timers, and file I/O all produce the same IoCompletion struct.
- Shared-memory fast path. The submission/completion queues can be mapped into userspace for zero-syscall operation under load (Phase 5).
- Matches the microkernel data plane. Drivers post work and reap completions, the same pattern as managing hardware descriptor rings.
How Other Systems Do It
Linux io_uring
Introduced in Linux 5.1. The Submission Queue (SQ) and Completion Queue (CQ)
are shared-memory ring buffers mapped into userspace. The kernel consumes
entries from the SQ and publishes completions in the CQ. io_uring_enter() is
the single syscall (submit + wait).
- Supports 60+ operation types (read, write, accept, timeout, etc.)
- SQEs carry a user_data field returned verbatim in CQEs for demux
- IORING_SETUP_SQPOLL mode: a kernel thread polls the SQ, giving truly zero-syscall submission under load
- Fixed-file and fixed-buffer registration to avoid per-op fd/buffer lookup
Windows IOCP (I/O Completion Ports)
The original completion-based API (NT 3.5, 1994). A completion port is a kernel object that aggregates completions from multiple file handles.
- CreateIoCompletionPort() creates the port and associates handles
- Async operations (ReadFile, WriteFile with OVERLAPPED) post completions to the associated port
- GetQueuedCompletionStatus() dequeues one completion (blocking)
- PostQueuedCompletionStatus() manually posts a completion (for app-level signaling)
- The kernel limits concurrent threads to the port’s concurrency value
Fuchsia zx_port
Zircon ports are the unified event aggregation primitive:
- zx_port_create() creates a port
- zx_object_wait_async() registers interest in an object’s signals (channels, interrupts, timers, processes); when the signal fires, a packet is queued to the port
- zx_port_wait() dequeues a packet (blocking with optional timeout)
- zx_port_queue() manually enqueues a user packet
- Packets carry a key field for demux (equivalent to user_data)
seL4 Notifications
seL4 uses a minimal signaling primitive:
- A notification is a word-sized bitmask of binary semaphores
- seL4_Signal() OR-sets bits; seL4_Wait() atomically reads and clears
- Multiple event sources (IRQs, IPC completions) signal different bits in the same notification
- One seL4_Wait() multiplexes all sources; the returned word tells which bits fired
- Limitation: carries no payload beyond the bitmask. Data transfer requires a separate shared-memory protocol
Comparison
| Aspect | io_uring | IOCP | zx_port | seL4 notify | ostoo (proposed) |
|---|---|---|---|---|---|
| Model | Completion | Completion | Completion | Signal | Completion |
| Queue location | Shared memory | Kernel | Kernel | Kernel (1 word) | Kernel (Phase 5: shared mem) |
| Payload | Full SQE/CQE | Bytes + key | Packet union | Bitmask only | IoCompletion struct |
| Demux field | user_data | CompletionKey | key | Bit position | user_data |
| Event sources | Files, sockets, timers | File handles | Objects + signals | Capabilities | Fds, IRQs, timers, rings |
| Zero-syscall path | SQPOLL mode | No | No | No | Phase 5 (shared rings) |
| Submit + wait | Single syscall | Separate | Separate | Separate | Separate (io_submit, io_wait); single syscall in ring mode (io_ring_enter) |
Core Abstraction
A CompletionPort is a kernel object consisting of a FIFO completion queue and a waiter slot. It is accessed through a file descriptor, like any other ostoo resource.
┌─────────────────────────────┐
│ CompletionPort │
│ │
io_submit ──────▶│ ┌─────────────────────────┐ │
│ │ Completion Queue │ │
IRQ ISR ────────▶│ │ ┌────┬────┬────┬───┐ │ │
│ │ │ C0 │ C1 │ C2 │...│ │ │──────▶ io_wait
Timer expire ───▶│ │ └────┴────┴────┴───┘ │ │ (blocks until
│ └─────────────────────────┘ │ non-empty)
Ring wakeup ────▶│ │
│ waiter: Option<thread_idx> │
└─────────────────────────────┘
Key properties:
- Single consumer. Only one thread may call io_wait on a port at a time. This avoids thundering-herd complexity and matches the single-threaded driver loop model.
- Multiple producers. Any context (syscall path, ISR, timer callback) can post a completion to the queue.
- User_data demux. Every submission carries a u64 user_data field that is returned verbatim in the completion, allowing the caller to identify which operation completed without inspecting the payload.
- Port as fd. The port lives in the process’s fd_table and can be closed, passed across execve (unless FD_CLOEXEC), or used with dup2.
Syscall Interface
Three syscalls using custom numbers in the 500+ range:
| Nr | Name | Signature |
|---|---|---|
| 501 | io_create | io_create(flags: u32) → fd |
| 502 | io_submit | io_submit(port_fd: i32, entries: *const IoSubmission, count: u32) → i64 |
| 503 | io_wait | io_wait(port_fd: i32, completions: *mut IoCompletion, max: u32, min: u32, timeout_ns: u64) → i64 |
io_create (501)
Creates a new CompletionPort and returns its file descriptor.
- flags: reserved, must be 0. Future: IO_CLOEXEC.
- Returns: fd on success, negative errno on failure.
io_submit (502)
Submits one or more I/O operations to the port.
- port_fd: fd returned by io_create.
- entries: pointer to an array of IoSubmission structs in user memory.
- count: number of entries to submit (0 < count ≤ 64).
- Returns: number of entries successfully submitted, or negative errno.
Submissions that reference invalid fds or unsupported operations fail individually — the return value indicates how many of the leading entries were accepted.
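The "leading entries" rule can be illustrated with a small validator over opcodes. This is a sketch under stated assumptions: it treats opcodes 0 through 7 (the operations table) as the only validity check, returns -EINVAL for an out-of-range count, and returns 0 when the very first entry is rejected; the kernel may choose a different errno or return the entry's own error in that case.

```rust
pub const MAX_ENTRIES: usize = 64;
pub const EINVAL: i64 = 22;
/// Highest opcode in the operations table (OP_RING_WAIT).
pub const MAX_OPCODE: u32 = 7;

/// Return how many leading entries would be accepted, or a negative errno
/// when the batch size itself is out of range (assumed to be -EINVAL).
pub fn accepted_prefix(opcodes: &[u32]) -> i64 {
    if opcodes.is_empty() || opcodes.len() > MAX_ENTRIES {
        return -EINVAL;
    }
    // Stop at the first unsupported opcode; later entries are not examined.
    opcodes.iter().take_while(|&&op| op <= MAX_OPCODE).count() as i64
}
```

A caller that receives a count smaller than the batch size knows exactly which entry failed (the one at that index) and can fix or skip it before resubmitting the remainder.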
io_wait (503)
Waits for completions and copies them to user memory.
- port_fd: fd returned by io_create.
- completions: pointer to an array of IoCompletion structs in user memory.
- max: maximum number of completions to return.
- min: minimum number to wait for before returning (0 = non-blocking poll).
- timeout_ns: maximum wait time in nanoseconds. 0 = no timeout (wait indefinitely for min completions). With min=0, returns immediately.
- Returns: number of completions written, or negative errno.
IoSubmission struct
#[repr(C)]
pub struct IoSubmission {
    pub user_data: u64,   // returned in completion, opaque to kernel
    pub opcode: u32,      // OP_NOP, OP_READ, etc.
    pub flags: u32,       // per-op flags, reserved
    pub fd: i32,          // target fd (for OP_READ, OP_WRITE)
    pub _pad: i32,
    pub buf_addr: u64,    // user buffer pointer
    pub buf_len: u32,     // buffer length
    pub offset: u32,      // file offset (low 32 bits, sufficient initially)
    pub timeout_ns: u64,  // for OP_TIMEOUT
}
Total size: 48 bytes.
IoCompletion struct
#[repr(C)]
pub struct IoCompletion {
    pub user_data: u64,   // copied from submission
    pub result: i64,      // bytes transferred, or negative errno
    pub flags: u32,       // completion flags (reserved)
    pub opcode: u32,      // echoed from submission
}
Total size: 24 bytes.
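Both size claims can be pinned down at compile time so that an accidental field change breaks the build instead of the ABI. The struct definitions below repeat the ones above; the `for_submission` helper is a hypothetical addition showing which fields are echoed back.

```rust
use std::mem::size_of;

#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct IoSubmission {
    pub user_data: u64,
    pub opcode: u32,
    pub flags: u32,
    pub fd: i32,
    pub _pad: i32,
    pub buf_addr: u64,
    pub buf_len: u32,
    pub offset: u32,
    pub timeout_ns: u64,
}

#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct IoCompletion {
    pub user_data: u64,
    pub result: i64,
    pub flags: u32,
    pub opcode: u32,
}

// Compile-time layout checks: the build fails if either size drifts.
const _: () = assert!(size_of::<IoSubmission>() == 48);
const _: () = assert!(size_of::<IoCompletion>() == 24);

impl IoCompletion {
    /// Hypothetical helper: build the completion for `sub`, echoing
    /// user_data and opcode verbatim as the struct comments describe.
    pub fn for_submission(sub: &IoSubmission, result: i64) -> Self {
        IoCompletion { user_data: sub.user_data, result, flags: 0, opcode: sub.opcode }
    }
}
```

The explicit `_pad` field keeps `buf_addr` 8-byte aligned without relying on compiler-inserted padding, so the C and Rust views of the struct agree byte for byte.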
Operations
| Opcode | Name | Description | Status |
|---|---|---|---|
| 0 | OP_NOP | No operation. Completes immediately. Useful for testing. | Implemented |
| 1 | OP_TIMEOUT | Completes after timeout_ns nanoseconds. | Implemented |
| 2 | OP_READ | Read from fd into buf_addr. | Implemented |
| 3 | OP_WRITE | Write to fd from buf_addr. | Implemented |
| 4 | OP_IRQ_WAIT | Wait for interrupt on IRQ fd. | Implemented |
| 5 | OP_IPC_SEND | Send a message through an IPC channel. | Implemented |
| 6 | OP_IPC_RECV | Receive a message from an IPC channel. | Implemented |
| 7 | OP_RING_WAIT | Wait for notification fd signal. | Implemented |
OP_NOP (0)
Immediately posts a completion with result = 0. No side effects. Used for
round-trip latency testing and as a wake-up mechanism (submit a NOP from
another thread to unblock io_wait).
OP_TIMEOUT (1)
Registers a one-shot timer. Completes after timeout_ns nanoseconds with
result = 0, or result = -ETIME if cancelled (future).
Implementation: uses the existing libkernel::task::timer (Delay/Sleep)
infrastructure. The submission spawns an async delay task that posts the
completion when the timer fires.
OP_READ (2)
Reads up to buf_len bytes from fd at offset into buf_addr.
- result = number of bytes read, or negative errno.
- For console/pipe fds (no meaningful offset), offset is ignored.
Implementation: see “Sync fallback worker” below.
OP_WRITE (3)
Writes up to buf_len bytes from buf_addr to fd at offset.
- result = number of bytes written, or negative errno.
Implementation: same sync fallback pattern as OP_READ.
OP_IRQ_WAIT (4)
Waits for a hardware interrupt on an IRQ fd (from microkernel-design.md, Phase B).
- fd must be an IRQ fd (IrqHandle).
- result = interrupt count since last wait, or negative errno.
Implementation: the IRQ fd’s ISR-safe notification calls port.post() when
the interrupt fires. No worker thread needed — the ISR posts directly.
OP_IPC_SEND (5)
Sends a message through an IPC channel send-end fd as an async operation.
- fd must be a channel send-end fd.
- buf_addr points to a user-space struct ipc_message (48 bytes).
- result = 0 on success, -EPIPE if receive end closed.
The message (including any fd-passing entries in fds[4]) is read from user
memory at submission time. If the channel can accept the message immediately,
a completion is posted right away. Otherwise the message is stored and the
completion fires when a receiver drains space.
See ipc-channels.md for full details.
OP_IPC_RECV (6)
Receives a message from an IPC channel receive-end fd as an async operation.
- fd must be a channel receive-end fd.
- buf_addr points to a user-space struct ipc_message buffer (48 bytes).
- result = 0 on success, -EPIPE if send end closed and no messages remain.
When a message arrives on the channel, a completion is posted to the port.
The message (including any transferred fds, allocated in the receiver’s fd
table) is copied to buf_addr during io_wait.
See ipc-channels.md for full details.
OP_RING_WAIT (7) — Implemented
Waits for a notification fd to be signaled.
- fd must be a notification fd (from notify_create, syscall 509).
- result = 0 on wakeup, or negative errno.
Implementation: the consumer submits OP_RING_WAIT via io_submit. The
kernel stores the port + user_data on the NotifyInner object. When the
producer calls notify(fd) (syscall 510), the kernel posts a completion
to the port. No worker thread needed — the syscall posts directly.
Edge-triggered, one-shot: one notify() → one completion. Consumer must
re-submit OP_RING_WAIT to rearm. If notify() is called before
OP_RING_WAIT is armed, the notification is buffered (coalesced).
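The one-shot and coalescing rules can be captured in a small state machine. The model below is a sketch, not the kernel's `NotifyInner`: the field names are invented, a `Vec<u64>` of user_data values stands in for the completion port, and it assumes that arming while a notification is buffered completes immediately.

```rust
/// Hypothetical model of a notification object's arm/notify state.
#[derive(Default)]
pub struct NotifyModel {
    /// user_data of the armed OP_RING_WAIT, if any.
    armed: Option<u64>,
    /// True when notify() arrived while nothing was armed.
    pending: bool,
    /// Stand-in for the completion port: user_data of each posted completion.
    pub completions: Vec<u64>,
}

impl NotifyModel {
    /// Models submitting OP_RING_WAIT.
    pub fn arm(&mut self, user_data: u64) {
        if self.pending {
            // A buffered notification satisfies the wait immediately.
            self.pending = false;
            self.completions.push(user_data);
        } else {
            self.armed = Some(user_data);
        }
    }

    /// Models the producer calling notify(fd).
    pub fn notify(&mut self) {
        match self.armed.take() {
            // One notify consumes the armed wait: one-shot.
            Some(user_data) => self.completions.push(user_data),
            // Nothing armed: remember a single pending signal (coalesced).
            None => self.pending = true,
        }
    }
}
```

Because `pending` is a flag rather than a counter, any number of notify() calls made while unarmed produce exactly one completion at the next arm. A consumer therefore has to drain all available work on each wakeup instead of assuming one item per completion.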
The notification fd is a general-purpose signaling primitive, not tied to any specific ring buffer format. The kernel does not inspect ring buffer contents — it simply provides the signal/wait mechanism.
Kernel Implementation Sketch
CompletionPort struct
use alloc::collections::VecDeque;

pub struct CompletionPort {
    queue: VecDeque<IoCompletion>,
    waiter: Option<usize>,       // thread index blocked in io_wait
    max_queued: usize,           // backpressure limit (default 256)
}
The CompletionPort is wrapped in a Mutex and stored inside a
CompletionPortHandle that implements FileHandle.
CompletionPortHandle
pub struct CompletionPortHandle {
    port: Arc<Mutex<CompletionPort>>,
}
Implements FileHandle:
- read() → returns Err(FileError::BadFd) (use io_wait instead)
- write() → returns Err(FileError::BadFd) (use io_submit instead)
- close() → drop the Arc. Pending operations are cancelled (completions with -ECANCELED are discarded).
- kind() → a new FileKind::CompletionPort variant
ISR-safe post()
The post() method must be callable from interrupt context (e.g., an IRQ
handler posting OP_IRQ_WAIT completions).
impl CompletionPort {
    /// Post a completion. Safe to call from ISR context.
    pub fn post(&mut self, completion: IoCompletion) {
        // Backpressure: once max_queued is reached, the completion is dropped.
        if self.queue.len() < self.max_queued {
            self.queue.push_back(completion);
        }
        // Wake the blocked waiter, if any
        if let Some(thread_idx) = self.waiter.take() {
            scheduler::unblock(thread_idx);
        }
    }
}
The CompletionPort is wrapped in an IrqMutex which disables interrupts
while held, making post() safe to call from ISR context.
io_wait blocking pattern
sys_io_wait(port_fd, completions_ptr, max, min, timeout_ns):
port = lookup_fd(port_fd) as CompletionPortHandle
loop:
lock port
n = drain up to max completions from queue
if n >= min:
copy n completions to user memory
return n
register current thread as waiter
unlock port
block_current_thread() // scheduler marks Blocked, yields
// ... woken by post() or timeout ...
if timeout expired:
return completions drained so far (may be 0)
This reuses the existing scheduler::block_current_thread() /
scheduler::unblock() pattern from the pipe and waitpid implementations.
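On a hosted target, the same wait-until-min-then-drain shape can be approximated with a `Mutex` and a `Condvar`. This is an analogy rather than the kernel code: `Condvar::wait` stands in for `block_current_thread()`, `notify_one` for `scheduler::unblock()`, the `Port` and `Completion` types are hypothetical, and the timeout path is omitted.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Completion {
    pub user_data: u64,
    pub result: i64,
}

/// Hosted stand-in for a completion port with a single consumer.
pub struct Port {
    queue: Mutex<VecDeque<Completion>>,
    wakeup: Condvar,
}

impl Port {
    pub fn new() -> Self {
        Port { queue: Mutex::new(VecDeque::new()), wakeup: Condvar::new() }
    }

    /// Producer side: enqueue and wake the waiter, if one is blocked.
    pub fn post(&self, completion: Completion) {
        self.queue.lock().unwrap().push_back(completion);
        self.wakeup.notify_one();
    }

    /// Consumer side: block until at least `min` completions are queued,
    /// then drain up to `max` of them in FIFO order.
    pub fn wait(&self, max: usize, min: usize) -> Vec<Completion> {
        let mut queue = self.queue.lock().unwrap();
        while queue.len() < min {
            // Releases the lock while blocked and re-acquires it on wakeup.
            queue = self.wakeup.wait(queue).unwrap();
        }
        let n = queue.len().min(max);
        let drained: Vec<Completion> = queue.drain(..n).collect();
        drained
    }
}
```

Checking the queue length under the lock before blocking is what prevents a lost wakeup: a completion posted between the check and the block cannot be missed, which is the same reason the pseudocode registers the waiter before unlocking the port.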
Sync fallback worker for OP_READ / OP_WRITE
Existing FileHandle implementations (VfsHandle, ConsoleHandle, PipeReader)
are synchronous and blocking. To integrate them with the CompletionPort
without rewriting every handle:
- io_submit for OP_READ/OP_WRITE spawns an async task (via the existing executor).
- The task calls osl::blocking::blocking() which blocks a scheduler thread on the synchronous FileHandle::read() or FileHandle::write().
- When the blocking call returns, the task posts a completion to the port.
This means each in-flight OP_READ/OP_WRITE consumes one scheduler thread while blocked. Acceptable for the initial implementation; a true async FileHandle path can be added later.
io_submit(OP_READ, fd, buf, len):
spawn async {
let result = blocking(|| {
file_handle.read(buf, len)
});
port.lock().post(IoCompletion {
user_data,
result: result as i64,
opcode: OP_READ,
flags: 0,
});
}
Integration with Existing Infrastructure
FileHandle trait
No changes to the FileHandle trait in Phase 1. The sync fallback worker
bridges existing handles.
In a future phase, an optional submit_async() method could be added to
FileHandle for handles that can natively post completions (e.g., a future
async virtio-blk driver):
pub trait FileHandle: Send + Sync {
    // ... existing methods ...

    /// Submit an async operation. Default: not supported (use sync fallback).
    fn submit_async(&self, _op: &IoSubmission, _port: &Arc<Mutex<CompletionPort>>)
        -> Result<(), FileError>
    {
        Err(FileError::NotSupported)
    }
}
Executor and timer reuse
- Executor: the existing async task executor spawns fallback worker tasks and timeout tasks. No changes needed.
- Timer: OP_TIMEOUT uses the existing libkernel::task::timer::Delay (which builds on the LAPIC timer tick). No new timer infrastructure.
fd_table
The CompletionPort is stored in the process fd_table as a
CompletionPortHandle. This means:
- close(port_fd) cleans up the port.
- dup2 works (two fds alias the same port via Arc).
- FD_CLOEXEC / close_cloexec_fds() works for execve.
- No new kernel data structures outside the existing fd model.
Syscall dispatch wiring
Add to osl/src/syscalls/mod.rs:
501 => sys_io_create(a1 as u32),
502 => sys_io_submit(a1 as i32, a2 as *const IoSubmission, a3 as u32),
503 => sys_io_wait(a1 as i32, a2 as *mut IoCompletion, a3 as u32, a4 as u32, a5 as u64),
The IoSubmission and IoCompletion structs live in osl/src/io_port.rs
(new file). The CompletionPort and CompletionPortHandle implementations live
in libkernel/src/file.rs alongside the existing handle types.
Integration with Microkernel Primitives
The CompletionPort replaces the need for two separate primitives described in microkernel-design.md:
- IRQ fd (Phase B): instead of a standalone IrqHandle where read() blocks, the IRQ fd posts completions to a port via OP_IRQ_WAIT. The driver submits an OP_IRQ_WAIT and reaps it alongside other completions.
- Notification objects (seL4-style): unnecessary. A port with manual post() (exposed as a future io_post syscall or via OP_NOP with user_data tagging) serves the same role.
NIC driver example loop
int port = io_create(0);
int irq_fd = open("/dev/irq/11", O_RDONLY);
int ring_fd = open("/dev/shm/txring", O_RDWR);
// Submit initial waits
IoSubmission subs[2] = {
{ .user_data = TAG_IRQ, .opcode = OP_IRQ_WAIT, .fd = irq_fd },
{ .user_data = TAG_RING, .opcode = OP_RING_WAIT, .fd = ring_fd },
};
io_submit(port, subs, 2);
for (;;) {
IoCompletion comp[8];
int n = io_wait(port, comp, 8, /*min=*/1, /*timeout=*/0);
for (int i = 0; i < n; i++) {
switch (comp[i].user_data) {
case TAG_IRQ:
handle_interrupt();
// Resubmit IRQ wait
io_submit(port, &(IoSubmission){
.user_data = TAG_IRQ, .opcode = OP_IRQ_WAIT, .fd = irq_fd
}, 1);
break;
case TAG_RING:
drain_tx_ring();
// Resubmit ring wait
io_submit(port, &(IoSubmission){
.user_data = TAG_RING, .opcode = OP_RING_WAIT, .fd = ring_fd
}, 1);
break;
}
}
}
This single-threaded loop handles both IRQs and ring wakeups through one blocking wait point — exactly the pattern needed for microkernel drivers.
Shared-Memory Ring Optimisation (Phase 5 — Implemented)
Under high throughput, even one syscall per batch can become a bottleneck. The shared-memory ring optimisation maps the submission and completion queues into userspace as shared-memory ring buffers, eliminating syscalls on the hot path for reading completions.
Syscalls
| Nr | Name | Purpose |
|---|---|---|
| 511 | io_setup_rings | Allocate SQ/CQ shared memory, put port in ring mode |
| 512 | io_ring_enter | Process SQ entries + optionally block for CQ completions |
io_setup_rings (511)
io_setup_rings(port_fd, params: *mut IoRingParams) → 0 or -errno
Allocates SQ and CQ ring pages and returns shmem fds that the process
mmaps with MAP_SHARED:
struct io_ring_params params = { .sq_entries = 64, .cq_entries = 128 };
io_setup_rings(port, &params);
void *sq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.sq_fd, 0);
void *cq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.cq_fd, 0);
io_ring_enter (512)
io_ring_enter(port_fd, to_submit, min_complete, flags) → i64
Processes up to to_submit SQEs from the shared SQ ring, flushes deferred
completions, and optionally blocks until min_complete CQEs are available
in the CQ ring.
Ring layout
Single 4 KiB page per ring:
Offset 0: RingHeader (16 bytes)
AtomicU32 head — consumer advances (SQ: kernel, CQ: user)
AtomicU32 tail — producer advances (SQ: user, CQ: kernel)
u32 mask — capacity - 1
u32 flags — reserved (0)
Offset 64: entries[] (cache-line aligned)
SQ: IoSubmission[capacity] — 48 bytes each, max 64
CQ: IoCompletion[capacity] — 24 bytes each, max 128
Head and tail use atomic load/store with acquire/release ordering.
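The layout arithmetic and the acquire/release discipline can be sketched in ordinary Rust. This is a minimal model, not the ostoo IoRing code: the helper names (ring_bytes, producer_slot, and so on) are hypothetical, and it assumes head and tail are free-running counters that are masked only when indexing, which the layout above does not state.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Header at offset 0 of a ring page, mirroring the 16-byte layout above.
#[repr(C)]
pub struct RingHeader {
    pub head: AtomicU32, // advanced by the consumer
    pub tail: AtomicU32, // advanced by the producer
    pub mask: u32,       // capacity - 1
    pub flags: u32,      // reserved (0)
}

pub const PAGE_SIZE: usize = 4096;
pub const ENTRIES_OFFSET: usize = 64; // entries[] start on a cache line
pub const SQE_SIZE: usize = 48;
pub const CQE_SIZE: usize = 24;

/// Bytes occupied by a ring holding `capacity` entries of `entry_size` bytes.
pub const fn ring_bytes(capacity: usize, entry_size: usize) -> usize {
    ENTRIES_OFFSET + capacity * entry_size
}

/// Producer: slot index to write next, or None if the ring is full.
pub fn producer_slot(h: &RingHeader) -> Option<u32> {
    let tail = h.tail.load(Ordering::Relaxed); // only the producer writes tail
    let head = h.head.load(Ordering::Acquire); // observe slots freed by the consumer
    if tail.wrapping_sub(head) > h.mask {
        return None; // tail - head == capacity
    }
    Some(tail & h.mask)
}

/// Producer: publish the entry written into the slot from `producer_slot`.
pub fn producer_publish(h: &RingHeader) {
    let tail = h.tail.load(Ordering::Relaxed);
    h.tail.store(tail.wrapping_add(1), Ordering::Release);
}

/// Consumer: slot index to read next, or None if the ring is empty.
pub fn consumer_slot(h: &RingHeader) -> Option<u32> {
    let head = h.head.load(Ordering::Relaxed); // only the consumer writes head
    let tail = h.tail.load(Ordering::Acquire); // observe entries published by the producer
    if head == tail {
        return None;
    }
    Some(head & h.mask)
}

/// Consumer: hand the slot back to the producer.
pub fn consumer_release(h: &RingHeader) {
    let head = h.head.load(Ordering::Relaxed);
    h.head.store(head.wrapping_add(1), Ordering::Release);
}
```

With the maximum capacities from the layout, both rings use 3136 bytes (64 + 64 × 48 and 64 + 128 × 24), so each fits in a single 4 KiB page.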
Dual-mode post()
When rings are active, CompletionPort::post() routes completions:
- Simple (no read_buf, no transfer_fds): CQE written directly to the shared CQ ring via IoRing::post_cqe(). Fast path for OP_NOP, OP_TIMEOUT, OP_WRITE, OP_IRQ_WAIT, OP_IPC_SEND, OP_RING_WAIT.
- Deferred (read_buf or transfer_fds present): pushed to the kernel VecDeque. io_ring_enter flushes these in syscall context where page tables are correct for data copy and fd installation.
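The routing decision can be modelled as below. This is a simplified sketch, not the real CompletionPort: the Port and Completion types, the Route enum, and flush_deferred are hypothetical stand-ins, and the shared CQ ring is represented by a plain Vec.

```rust
use std::collections::VecDeque;

/// Hypothetical completion record; the real struct has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Completion {
    pub user_data: u64,
    pub result: i64,
    pub read_buf: Option<Vec<u8>>, // data still to be copied to user memory
    pub transfer_fds: Vec<i32>,    // fds still to be installed in the receiver
}

#[derive(Default)]
pub struct Port {
    pub rings_active: bool,
    pub cq_ring: Vec<(u64, i64)>,       // stand-in for the shared CQ ring
    pub deferred: VecDeque<Completion>, // kernel-side queue
}

#[derive(Debug, PartialEq)]
pub enum Route {
    CqRing,
    KernelQueue,
}

impl Port {
    /// Route a completion: straight to the CQ ring when it carries no
    /// payload that needs the submitter's address space, otherwise queue it.
    pub fn post(&mut self, c: Completion) -> Route {
        let simple = c.read_buf.is_none() && c.transfer_fds.is_empty();
        if self.rings_active && simple {
            self.cq_ring.push((c.user_data, c.result));
            Route::CqRing
        } else {
            self.deferred.push_back(c);
            Route::KernelQueue
        }
    }

    /// Stand-in for the flush done by io_ring_enter in syscall context.
    /// The data copy and fd installation themselves are elided here.
    pub fn flush_deferred(&mut self) -> usize {
        let mut flushed = 0;
        while let Some(c) = self.deferred.pop_front() {
            self.cq_ring.push((c.user_data, c.result));
            flushed += 1;
        }
        flushed
    }
}
```

Posting a payload-free completion lands in cq_ring at once, while one with a read_buf waits in deferred until flush_deferred runs.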
Backward compatibility
- io_submit works in ring mode (completions go to CQ ring)
- io_wait returns -EINVAL in ring mode (use io_ring_enter)
- Ports without rings work exactly as before
Phased Implementation
Phase 1: Core + OP_NOP + OP_TIMEOUT
Goal: Establish the CompletionPort kernel object, syscall interface, and basic operations.
| Item | Detail |
|---|---|
| Files | libkernel/src/file.rs (CompletionPortHandle), osl/src/io_port.rs (new: structs + sys_io_*), osl/src/syscalls/mod.rs (wire 501-503) |
| Dependencies | None — uses existing scheduler, timer, fd_table |
| Delivers | io_create, io_submit, io_wait; OP_NOP and OP_TIMEOUT |
| Test | Userspace program: create port, submit OP_NOP, io_wait returns immediately. Submit OP_TIMEOUT(100ms), io_wait blocks ~100ms then returns. |
Phase 2: OP_READ + OP_WRITE with Sync Fallback
Goal: Bridge existing FileHandle implementations into the completion model.
| Item | Detail |
|---|---|
| Files | osl/src/io_port.rs (add fallback worker logic) |
| Dependencies | Phase 1; existing osl::blocking::blocking() |
| Delivers | OP_READ and OP_WRITE on console, pipe, and VFS file fds |
| Test | Submit OP_WRITE to stdout + OP_READ from a file fd, reap both completions. Verify data matches. |
Phase 3: OP_IRQ_WAIT — Implemented
Goal: Hardware interrupt delivery through the completion port.
| Item | Detail |
|---|---|
| Files | libkernel/src/irq_handle.rs (IrqInner, IRQ slot table, ISR dispatch), libkernel/src/file.rs (FdObject::Irq variant), libkernel/src/completion_port.rs (OP_IRQ_WAIT constant), osl/src/irq.rs (sys_irq_create), osl/src/io_port.rs (OP_IRQ_WAIT handler in io_submit) |
| Dependencies | Phase 1; IO APIC route/mask/unmask (libkernel::apic) |
| Delivers | irq_create(gsi) syscall (504), submit OP_IRQ_WAIT on an IRQ fd, ISR masks line and posts completion to port, rearm via another OP_IRQ_WAIT unmasks |
| Test | user/irq_demo.c: create IRQ fd for keyboard GSI 1, submit OP_IRQ_WAIT, press key, verify completion with scancode in result. |
Phase 3b: OP_IPC_SEND + OP_IPC_RECV — Implemented
Goal: Multiplex IPC channel operations with other async I/O sources.
| Item | Detail |
|---|---|
| Files | libkernel/src/channel.rs (arm_send, arm_recv, PendingPortSend/Recv), libkernel/src/completion_port.rs (OP_IPC_SEND/RECV constants, transfer_fds on Completion), osl/src/io_port.rs (OP_IPC_SEND/RECV handlers) |
| Dependencies | Phase 1; IPC channels (syscalls 505–507) |
| Delivers | Submit OP_IPC_SEND/RECV on channel fds, completions posted when message delivered/received. Supports fd-passing: transferred fds installed in receiver during io_wait. |
| Test | user/ipc_port.c: IPC send/recv multiplexed with timers via completion port. user/ipc_fdpass.c: fd-passing through IPC channels. |
Phase 4: OP_RING_WAIT — Implemented
Goal: Inter-process signaling through the completion port via notification fds.
| Item | Detail |
|---|---|
| Files | libkernel/src/notify.rs (NotifyInner, arm/signal), libkernel/src/file.rs (FdObject::Notify), osl/src/notify.rs (sys_notify_create 509, sys_notify 510), osl/src/io_port.rs (OP_RING_WAIT handler) |
| Dependencies | Phase 1 |
| Delivers | notify_create(flags) syscall (509), notify(fd) syscall (510), submit OP_RING_WAIT on notification fd, producer-side notify() posts completion |
| Test | user/ring_test.c: parent creates shmem + notify fd, spawns child, child writes to shmem and signals, parent reaps OP_RING_WAIT completion and verifies data. |
Phase 5: Shared-Memory SQ/CQ Rings — Implemented
Goal: Zero-syscall submission and completion for high-throughput paths.
| Item | Detail |
|---|---|
| Files | libkernel/src/completion_port.rs (IoRing, IoSubmission, IoCompletion, RingHeader, dual-mode post()), libkernel/src/shmem.rs (from_existing), osl/src/io_port.rs (io_setup_rings 511, io_ring_enter 512, process_submission refactor) |
| Dependencies | Phase 1; MAP_SHARED from mmap-design.md Phase 5 |
| Delivers | Userspace-mapped SQ/CQ rings via shmem fds, io_ring_enter processes SQ + waits for CQ |
| Test | user/ring_sq_test.c: submit OP_NOP + OP_TIMEOUT via shared-memory SQ, reap from CQ ring, verify completions. |
Dependency Graph
┌────────────────────────┐
│ Phase 1 │
│ Core + NOP + TIMEOUT │
│ (no external deps) │
└──┬──────────┬───────┬──┘
│ │ │
┌────────▼──┐ ┌──▼────┐ │
│ Phase 2 │ │ │ │
│ READ/WRITE│ │ │ │
│ sync fbk │ │ │ │
└────────────┘ │ │ │
│ │ │
┌────────────────────┘ │ │
│ │ │
┌───────▼────────┐ ┌───────────┐ │ │
│ Phase 3 ✓ │ │ Phase 3b ✓│ │ │
│ OP_IRQ_WAIT │ │ OP_IPC_* │ │ │
│ │ │ │ │ │
│ requires: │ │ requires: │ │ │
│ IO APIC │ │ IPC chans │ │ │
└────────────────┘ └───────────┘ │ │
│ │
┌────────────────────┐ │ │
│ Phase 4 ✓ │ │ │
│ OP_RING_WAIT │ │ │
│ (notify fds) │ │ │
└────────────────────┘ │ │
│ │
┌─────────────────▼──▼───────────────┐
│ Phase 5 ✓ │
│ Shared-memory SQ/CQ rings │
│ │
│ requires: │
│ mmap Phase 5 (MAP_SHARED) ✓ │
└────────────────────────────────────┘
All phases are complete.
Key Design Decisions
Syscall-first, not ring-first
Phase 1 uses traditional syscalls (io_submit/io_wait). Shared-memory
rings were deferred to Phase 5. This avoided coupling the initial
implementation to MAP_SHARED (which was not implemented when Phase 1 was
designed) and kept the kernel-side logic simple.
Port as fd
The CompletionPort is an fd in the process’s fd_table, not a special kernel
handle type. This reuses existing infrastructure (close, dup2, CLOEXEC,
fd_table cleanup on exit) and avoids inventing a parallel handle namespace.
Eager posting
Completions are pushed to the port’s queue immediately when the operation finishes (or the ISR fires). There is no lazy/deferred completion model. This is simpler and matches the existing block/unblock scheduling model.
Single-threaded wait
Only one thread may block in io_wait per port. This is a deliberate
constraint matching the single-threaded driver loop model. Multi-threaded
consumers can use multiple ports. This avoids thundering-herd wake-up logic
and lock contention on the completion queue.
user_data for demux
Every submission carries a u64 user_data returned verbatim in the
completion. The kernel never inspects this field. The caller uses it to
identify which logical operation completed (e.g., TAG_IRQ, TAG_RING,
a pointer to a request struct). This is the same pattern used by io_uring,
IOCP, and Fuchsia ports.
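A caller-side demultiplexer might look like the following sketch. The tag values, the two-field IoCompletion, and the Event enum are illustrative assumptions; the real IoCompletion is 24 bytes and its full field list is not shown here.

```rust
// Hypothetical tag values chosen by the caller; the kernel never inspects them.
pub const TAG_IRQ: u64 = 1;
pub const TAG_RING: u64 = 2;

/// Reduced completion record: only the fields this sketch needs.
#[derive(Debug, Clone, Copy)]
pub struct IoCompletion {
    pub user_data: u64,
    pub result: i64,
}

#[derive(Debug, PartialEq)]
pub enum Event {
    Irq(i64),
    Ring(i64),
    Unknown(u64),
}

/// Map a reaped completion back to the logical operation that produced it.
pub fn demux(c: &IoCompletion) -> Event {
    match c.user_data {
        TAG_IRQ => Event::Irq(c.result),
        TAG_RING => Event::Ring(c.result),
        other => Event::Unknown(other),
    }
}
```

Because user_data is an opaque u64, the same match could dispatch on a request pointer cast to an integer instead of a small tag.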
Custom syscall numbers (501-503)
The 500+ range is reserved for ostoo-specific syscalls. These are not Linux
syscall numbers. If Linux compatibility is needed later, a shim layer can map
Linux’s io_uring_setup/io_uring_enter numbers to the ostoo equivalents.
Kernel-buffered queue
The completion queue lives in kernel memory (VecDeque<IoCompletion>), not
shared memory. io_wait copies completions to user buffers. Simple,
correct, and sufficient until Phase 5 adds the zero-copy shared-memory path.
Sync fallback for existing FileHandles
Rather than rewriting ConsoleHandle, PipeReader, VfsHandle, etc. to be
async-aware, OP_READ/OP_WRITE spawn a blocking worker that calls the
existing synchronous FileHandle::read()/FileHandle::write() and posts the
completion when done. This trades a scheduler thread per in-flight op for
zero changes to existing handle implementations.
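The trade-off can be illustrated in hosted Rust, with std::thread standing in for a scheduler thread and an mpsc channel standing in for the port's completion queue. The names submit_read and Completion are hypothetical, and the -5 error value is only a placeholder for an -EIO style result.

```rust
use std::io::Read;
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical completion record carrying the data that was read.
pub struct Completion {
    pub user_data: u64,
    pub result: i64,
    pub data: Vec<u8>,
}

/// Submit a read: a worker performs the ordinary blocking read() and
/// posts a completion when it returns. The source type is left untouched.
pub fn submit_read<R: Read + Send + 'static>(
    mut src: R,
    len: usize,
    user_data: u64,
    port: Sender<Completion>,
) {
    thread::spawn(move || {
        let mut buf = vec![0u8; len];
        let result: i64 = match src.read(&mut buf) {
            Ok(n) => {
                buf.truncate(n);
                n as i64
            }
            Err(_) => {
                buf.clear();
                -5 // placeholder errno-style failure
            }
        };
        // The submitter may have gone away; ignore a closed channel.
        let _ = port.send(Completion { user_data, result, data: buf });
    });
}
```

The submitter returns immediately and later reaps the completion from the receiving end of the channel, at the cost of one worker per in-flight operation.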
Timer via Delay
OP_TIMEOUT reuses the existing libkernel::task::timer::Delay rather than
introducing a new timer subsystem. The LAPIC timer already provides 10ms
ticks; Delay builds on this.
ISR-safe posting
CompletionPort::post() must work from interrupt context. The IrqMutex
around the port disables interrupts while held. The scheduler::unblock()
call is already ISR-safe (it just pushes to the ready queue).
No cancellation in Phase 1
Submitted operations cannot be cancelled. This avoids the complexity of
cancellation tokens, in-progress state tracking, and partial-completion
semantics. Cancellation support can be added later as an io_cancel syscall
once the basic model is proven. The number 504 originally pencilled in for
it has since been assigned to irq_create, so io_cancel would need a new
number.
Completion-oriented, not readiness-oriented
The port reports “operation X is done” (completion), not “fd Y is readable” (readiness). This is a deliberate choice:
- Completion avoids the double-syscall problem (wait for ready, then do I/O).
- Completion naturally supports heterogeneous event sources (timers, IRQs) that don’t have a “ready” state.
- Readiness can be emulated on top of completion (submit a zero-length OP_READ as a readiness probe) but not vice versa.
Blocking Protocol
This document describes the blocking/wakeup protocol used throughout the
kernel, the lost-wakeup race it currently suffers from, and a proposed fix
based on an idle thread and a WaitCondition primitive.
Formal PlusCal/TLA+ models of the protocol live in specs/. See
specs/PLUSCAL.md for authoring instructions and
specs/README.md for per-spec details.
Current protocol (buggy)
Every blocking site in the kernel follows this pattern:
{
    let mut guard = shared_state.lock(); // 1. acquire lock
    guard.waiter = Some(thread_idx);     // 2. register waiter
}                                        // 3. release lock
scheduler::block_current_thread();       // 4. mark Blocked + spin
The waker (a producer, timer, or signal) does:
{
    let mut guard = shared_state.lock();
    if let Some(t) = guard.waiter.take() { // clear waiter slot
        scheduler::unblock(t);             // wake up the blocked thread
    }
}
And unblock() is conditional:
pub fn unblock(thread_idx: usize) {
    let mut sched = SCHEDULER.lock();
    if let Some(t) = sched.threads.get_mut(thread_idx) {
        if t.state == ThreadState::Blocked { // <-- only acts if Blocked
            t.state = ThreadState::Ready;
            sched.ready_queue.push_back(thread_idx);
        }
    }
}
The race
A waker can execute between steps 3 (unlock) and 4 (mark Blocked):
- Waiter: acquires lock, sets waiter = Some(self), releases lock. Thread state is still Running.
- Waker: acquires lock, calls waiter.take(), calls unblock(waiter). unblock checks state == Blocked, finds Running, and does nothing. Waiter slot is now None.
- Waiter: calls block_current_thread(), sets state to Blocked, spins forever. No future waker will call unblock because the waiter slot was already consumed. Deadlock.
This race is confirmed by the PlusCal model in specs/completion_port/completion_port.tla
(TLC finds a deadlock trace) and by code inspection of scheduler.rs lines
640-676.
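The interleaving can be replayed deterministically with a small single-threaded model. This is an illustrative sketch, not the kernel code or the PlusCal spec: Model and its methods are hypothetical, and each method call stands for one atomic step of the waiter or waker. For contrast it also replays the ordering in which the waiter marks itself Blocked before the waker can run.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ThreadState {
    Running,
    Blocked,
    Ready,
}

/// One waiter thread (index 0) plus the shared waiter slot.
pub struct Model {
    pub state: ThreadState,
    pub waiter: Option<usize>,
    pub ready_queue: Vec<usize>,
}

impl Model {
    pub fn new() -> Self {
        Model { state: ThreadState::Running, waiter: None, ready_queue: Vec::new() }
    }

    /// Waiter: register in the waiter slot (done under the shared lock).
    pub fn register(&mut self, idx: usize) {
        self.waiter = Some(idx);
    }

    /// Waiter: mark itself Blocked.
    pub fn mark_blocked(&mut self) {
        self.state = ThreadState::Blocked;
    }

    /// Waker: take the waiter slot and call the conditional unblock().
    pub fn wake(&mut self) {
        if let Some(idx) = self.waiter.take() {
            if self.state == ThreadState::Blocked {
                self.state = ThreadState::Ready;
                self.ready_queue.push(idx);
            }
        }
    }
}

/// Waker runs in the gap between unlock and mark Blocked.
pub fn buggy_interleaving() -> Model {
    let mut m = Model::new();
    m.register(0);    // steps 1-3: lock, register, unlock
    m.wake();         // waker: unblock() is a no-op, slot consumed
    m.mark_blocked(); // step 4: Blocked with nobody left to wake it
    m
}

/// Blocked is marked before the waker can observe the waiter slot.
pub fn fixed_interleaving() -> Model {
    let mut m = Model::new();
    m.register(0);
    m.mark_blocked(); // still inside the waiter's critical section
    m.wake();         // waker sees Blocked and makes the thread Ready
    m
}
```

In the buggy run the model ends Blocked with an empty waiter slot and an empty ready queue, which is the deadlock state; in the other run the thread ends Ready and queued.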
Affected sites
The race affects every blocking site, not just the completion port:
| Site | File | Lock type | Status |
|---|---|---|---|
| sys_io_wait | osl/src/io_port.rs | IrqMutex | Fixed: WaitCondition |
| sys_io_ring_enter phase 3 | osl/src/io_port.rs | IrqMutex | Fixed: WaitCondition |
| PipeReader::read | libkernel/src/file.rs | SpinMutex | Fixed: WaitCondition |
| read_input (console) | libkernel/src/console.rs | SpinMutex | Fixed: WaitCondition |
| sys_ipc_send | osl/src/ipc.rs | IrqMutex | mark_blocked under lock (action enum) |
| sys_ipc_recv | osl/src/ipc.rs | IrqMutex | mark_blocked under lock (action enum) |
| sys_wait4 | osl/src/syscalls/process.rs | SpinMutex | Fixed: WaitCondition |
| sys_clone (vfork parent) | osl/src/clone.rs | SpinMutex | Fixed: WaitCondition |
| blocking() (async bridge) | osl/src/blocking.rs | SpinMutex | Fixed: WaitCondition |
The IPC channel sites (sys_ipc_send, sys_ipc_recv) use the split
mark_blocked() / yield_now() pair. The mark_blocked is called inside
ChannelInner::try_send/try_recv (under the IrqMutex), and the caller
does yield_now() after the lock drops via the SendAction/RecvAction
enum. These are already race-free; WaitCondition doesn’t fit the
action-enum pattern without a larger refactor.
The sys_io_ring_enter variant previously had an additional bug: check and
set_waiter were under separate lock acquisitions, so a completion could
arrive between the check and the registration. WaitCondition fixes this by
construction.
Why Blocked threads spin today
Both preempt_tick and yield_tick handle an empty ready queue by returning
current_rsp — i.e. they keep running the current thread even if it’s
Blocked. This forces block_current_thread() to include a HLT spin loop:
the Blocked thread keeps running on the CPU, calling enable_and_hlt() in a
loop, waiting for the next timer interrupt to check if unblock() has
changed its state. This wastes up to one full quantum (10ms) per blocking
event and prevents the CPU from doing useful work while the thread is
Blocked.
Proposed fix
The fix has three parts: an idle thread that eliminates the need for blocked
threads to spin, a split of block_current_thread that fixes the race, and
a WaitCondition wrapper that makes the correct pattern easy and the buggy
pattern impossible.
Step 1: Add an idle thread
Create a per-CPU idle thread that the scheduler falls back to when the ready queue is empty. The idle thread does nothing but HLT in a loop, yielding the CPU until the next interrupt:
fn idle_thread() -> ! {
    loop {
        x86_64::instructions::interrupts::enable_and_hlt();
    }
}
The idle thread is created during scheduler init and stored in Scheduler:
struct Scheduler {
    // ...
    idle_thread_idx: usize, // always present, never on the ready queue
}
Then preempt_tick and yield_tick switch to the idle thread instead of
staying on a Blocked/Dead thread:
let next_idx = match sched.ready_queue.pop_front() {
    Some(idx) => idx,
    None => sched.idle_thread_idx, // was: return current_rsp
};
The idle thread is never pushed onto the ready queue. The scheduler only runs
it as a fallback when nothing else is Ready. The first unblock() call
pushes a real thread onto the ready queue, and the next timer tick preempts
idle and switches to it.
With this change, a Blocked thread no longer needs to spin — the scheduler
context-switches away from it immediately and never schedules it again until
unblock() makes it Ready.
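The fallback behaviour can be exercised in isolation with a reduced scheduler model. This is a sketch only: the Scheduler here holds just the two fields involved, and pick_next and unblock are hypothetical simplifications of the real tick handlers.

```rust
use std::collections::VecDeque;

/// Reduced scheduler: a ready queue plus the always-present idle thread.
pub struct Scheduler {
    pub ready_queue: VecDeque<usize>,
    pub idle_thread_idx: usize,
}

impl Scheduler {
    /// Choose the next thread to run. Idle is never queued; it is only
    /// returned when nothing else is Ready.
    pub fn pick_next(&mut self) -> usize {
        self.ready_queue.pop_front().unwrap_or(self.idle_thread_idx)
    }

    /// Make a real thread runnable again.
    pub fn unblock(&mut self, idx: usize) {
        debug_assert_ne!(idx, self.idle_thread_idx, "idle is never queued");
        self.ready_queue.push_back(idx);
    }
}
```

With an empty queue pick_next returns the idle index; after unblock(3) it returns 3 once and then falls back to idle again.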
Step 2: Split block_current_thread into mark + yield
/// Mark the current thread Blocked and yield to the scheduler.
///
/// Acquires SCHEDULER briefly. Call only after releasing other locks: this
/// function yields, and a waker that needs a lock still held here would
/// deadlock. To mark Blocked under a lock, use mark_blocked() and call
/// yield_now() after releasing it.
/// The scheduler will context-switch away and never schedule this thread
/// again until unblock() is called. Execution resumes at the instruction
/// after this call.
// [spec: completion_port/completion_port.tla CheckAndAct: "thread_state := blocked"
//        + WaitUnblocked: "await thread_state = running"]
pub fn mark_blocked_and_yield() {
    x86_64::instructions::interrupts::without_interrupts(|| {
        let mut sched = SCHEDULER.lock();
        let idx = sched.current_idx;
        sched.threads[idx].state = ThreadState::Blocked;
    });
    yield_now(); // context-switch away; resume here after unblock + reschedule
}
There is no spin loop. yield_now() triggers int 0x50, which enters
yield_tick. The scheduler sees the thread is Blocked, does not re-queue it,
and switches to the next ready thread (or idle). When unblock() is called
later, the thread is pushed onto the ready queue with state Ready. The
scheduler eventually picks it and context-switches back, resuming execution
right after the yield_now() call.
A separate mark_blocked() (without yield) is still useful for callers that
need to mark Blocked under a lock and yield after dropping it:
/// Mark the current thread Blocked. Does NOT yield.
/// Caller must call yield_now() after releasing their lock.
pub fn mark_blocked() {
    x86_64::instructions::interrupts::without_interrupts(|| {
        let mut sched = SCHEDULER.lock();
        let idx = sched.current_idx;
        sched.threads[idx].state = ThreadState::Blocked;
    });
}
Step 3: Migrate call sites
Each site becomes:
{
    let mut guard = shared_state.lock(); // 1. acquire lock
    guard.waiter = Some(thread_idx);     // 2. register waiter
    scheduler::mark_blocked();           // 3. mark Blocked UNDER LOCK
}                                        // 4. release lock
scheduler::yield_now();                  // 5. context-switch away
// execution resumes here after unblock + reschedule
No loop, no spin, no HLT. The thread is off the CPU until explicitly woken.
Step 4: Introduce WaitCondition to enforce the pattern (DONE)
Implemented in libkernel/src/wait_condition.rs. Seven sites now use
WaitCondition::wait_while(): sys_io_wait, sys_io_ring_enter,
PipeReader::read, read_input (console), sys_wait4, sys_clone
(vfork parent), and blocking(). The two IPC channel sites use the split
mark_blocked()/yield_now() pair via an action enum.
A condvar-like wrapper that makes the ordering impossible to get wrong:
/// Single-waiter condvar for kernel blocking.
///
/// Encapsulates the check → register → mark_blocked → unlock → yield cycle.
/// The type system ensures mark_blocked happens before the guard drops.
pub struct WaitCondition;

impl WaitCondition {
    /// If `predicate(guard)` returns true (i.e. "should block"), register
    /// the waiter, mark the thread Blocked, release the lock, and yield.
    /// Returns when unblocked and rescheduled.
    ///
    // [spec: completion_port/completion_port.tla
    //   CheckAndAct (check + set_waiter + mark_blocked) = one label
    //   WaitUnblocked (await running) = next label]
    pub fn wait_while<T, L: Lock<T>>(
        mut guard: L::Guard<'_>,
        predicate: impl Fn(&T) -> bool,
        register: impl FnOnce(&mut T, usize),
    ) {
        if !predicate(&*guard) {
            return; // condition already satisfied, no need to block
        }
        let thread_idx = scheduler::current_thread_idx();
        register(&mut *guard, thread_idx);
        scheduler::mark_blocked(); // mark Blocked while lock held
        drop(guard);               // release lock
        scheduler::yield_now();    // context-switch away; resume after unblock
    }
}
Each blocking site reduces to a single call:
// Completion port
WaitCondition::wait_while(
    port.lock(),
    |p| p.pending() < min,
    |p, idx| p.set_waiter(idx),
);

// Pipe reader
WaitCondition::wait_while(
    self.inner.lock(),
    |inner| inner.buffer.is_empty() && !inner.write_closed,
    |inner, idx| { inner.reader_thread = Some(idx); },
);

// Console input
WaitCondition::wait_while(
    CONSOLE_INPUT.lock(),
    |c| c.buf.is_empty(),
    |c, idx| { c.blocked_reader = Some(idx); },
);

// sys_wait4
WaitCondition::wait_while(
    PROCESS_TABLE.lock(),
    |table| find_zombie_child(table, pid).is_none(),
    |table, idx| {
        table.get_mut(&pid).unwrap().wait_thread = Some(idx);
    },
);
Step 5: Deprecate block_current_thread
All three remaining sites (sys_wait4, sys_clone, blocking()) have been
migrated. block_current_thread is now unused and can be deprecated. New
blocking code must use WaitCondition or the mark_blocked() /
yield_now() pair.
Lock ordering note
mark_blocked() acquires the scheduler’s SCHEDULER SpinMutex internally.
This means any lock held by the caller must come before SCHEDULER in
the lock ordering. The current codebase already satisfies this: all IrqMutex
and SpinMutex locks protecting shared state are acquired before (and never
after) the scheduler lock.
If a future lock needs to be acquired after the scheduler lock, that lock
cannot be held when calling mark_blocked() — use the manual split instead,
and mark blocked before acquiring the inner lock.
PlusCal correspondence
| Rust construct | PlusCal label | Atomicity |
|---|---|---|
| guard = state.lock() | Start of label | Lock acquired |
| register(guard, idx) | Same label | Under same lock |
| mark_blocked() | Same label | Under same lock (acquires SCHEDULER briefly) |
| drop(guard) | End of label | Lock released |
| yield_now() | Next label | await thread_state = "running" |
| unblock(idx) in waker | Waker’s label | if thread_state = "blocked" then running |
Each WaitCondition::wait_while call maps to exactly two PlusCal labels,
making formal verification straightforward.
Relation to the io_ring_enter double-lock bug
Before the migration to WaitCondition, sys_io_ring_enter phase 3 did:
{ let p = port.lock(); /* check cq_available */ }     // lock 1
{ let mut p = port.lock(); p.set_waiter(idx); }       // lock 2
scheduler::block_current_thread();                    // lock 3
The check and set_waiter were under separate locks, so a CQE posted between
them was missed. WaitCondition fixes this by construction: the predicate
check and waiter registration happen under a single lock acquisition.
Scheduler Donate (Direct-Switch) Infrastructure
Overview
All blocking IPC in ostoo (pipes, completion ports, wait4, vfork) uses
unblock(thread_idx) which pushes the woken thread to the back of
the ready queue. The thread then waits for the scheduler’s round-robin
to reach it — up to 10 ms (full quantum).
The scheduler donate mechanism adds a voluntary yield via a dedicated
ISR vector (int 0x50) so the waker can switch to the woken thread
immediately, eliminating the up-to-10 ms latency.
Mechanism
Yield interrupt (vector 0x50)
ipc_yield_stub is an assembly handler identical to the LAPIC timer
stub (lapic_timer_stub) but calls yield_tick instead of
preempt_tick. It provides a software-triggered context switch from
syscall context.
yield_tick differs from preempt_tick:
- No tick(), no lapic_eoi(): not a hardware interrupt
- No quantum decrement: always performs the switch
- Checks DONATE_TARGET: AtomicUsize for a direct-switch target
Public API
| Function | Description |
|---|---|
| yield_now() | Trigger int 0x50 (voluntary preemption) |
| set_donate_target(idx) | Set direct-switch target for next yield |
| unblock_yield(idx) | Unblock + set donate + yield (convenience) |
unblock_yield is the high-level primitive for the pattern: unblock a
thread and immediately switch to it.
Direct-switch flow
- Waker calls unblock(target): target moves to Ready, pushed to ready queue
- Waker calls set_donate_target(target): stores target in atomic
- Waker calls yield_now(): triggers int 0x50
- yield_tick saves waker’s state, sees donate target, switches to target
- Target resumes from its blocked state immediately
- Waker is re-queued as Ready and runs later via normal scheduling
If the donate target is no longer Ready (e.g., the timer already
dispatched it), yield_tick falls back to regular round-robin.
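The target selection, including the round-robin fallback, can be sketched as follows. This is a simplified model rather than the real yield_tick: Sched and pick_on_yield are hypothetical names, the NO_TARGET sentinel is an assumption, and register save/restore is omitted entirely.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical sentinel meaning "no donate target set".
pub const NO_TARGET: usize = usize::MAX;

pub struct Sched {
    pub ready_queue: VecDeque<usize>,
    pub donate_target: AtomicUsize,
}

impl Sched {
    pub fn new() -> Self {
        Sched { ready_queue: VecDeque::new(), donate_target: AtomicUsize::new(NO_TARGET) }
    }

    pub fn set_donate_target(&self, idx: usize) {
        self.donate_target.store(idx, Ordering::Release);
    }

    /// Pick the thread to run when `current` yields voluntarily.
    /// The yielding thread is re-queued as Ready first, so the queue is
    /// never empty when a thread is picked.
    pub fn pick_on_yield(&mut self, current: usize) -> usize {
        self.ready_queue.push_back(current);
        // Consume the target so it applies to this yield only.
        let target = self.donate_target.swap(NO_TARGET, Ordering::AcqRel);
        let next = match self.ready_queue.iter().position(|&t| t == target) {
            Some(pos) => self.ready_queue.remove(pos), // direct switch
            None => self.ready_queue.pop_front(),      // stale or unset: round-robin
        };
        next.expect("queue holds at least the yielding thread")
    }
}
```

A target that is still queued jumps the line; a target that is no longer in the queue, or no target at all, leaves ordinary front-of-queue selection in place.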
Applied to existing primitives
Pipes (libkernel/src/file.rs)
pipe_wake_reader() returns the woken thread index. PipeWriter::write()
drops the pipe lock, then calls set_donate_target + yield_now() if a
reader was woken.
PipeWriter::close() returns the woken thread index (if writer_count
reaches 0 and a reader was woken). It cannot yield itself because it
runs inside with_process() which holds the process table lock.
Instead, sys_close yields after the lock is released.
PipeInner tracks writer_count (incremented by on_dup(),
decremented by close()). write_closed is set only when
writer_count reaches 0, matching Unix pipe semantics where EOF is
delivered only after all writer fds are closed.
FdObject::clone() does NOT call on_dup() — it is a plain Arc clone.
on_dup() is only called via FdObject::notify_dup() at actual
fd-duplication sites (clone/fork fd_table inheritance, dup2).
The pipe lock must be dropped before yielding — otherwise the reader thread would deadlock trying to acquire it.
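The writer_count bookkeeping can be sketched as below. This is a reduced model under stated assumptions: PipeInner here carries only the three fields involved, close_writer is a hypothetical name for the writer-side close path, and a new pipe is assumed to start with one writer.

```rust
/// Reduced pipe state: only the fields involved in EOF delivery.
#[derive(Debug)]
pub struct PipeInner {
    pub writer_count: usize,
    pub write_closed: bool,
    pub reader_thread: Option<usize>,
}

impl PipeInner {
    /// Assumption: a freshly created pipe has exactly one writer fd.
    pub fn new() -> Self {
        PipeInner { writer_count: 1, write_closed: false, reader_thread: None }
    }

    /// Called at real fd-duplication sites (dup2, fd_table inheritance).
    pub fn on_dup(&mut self) {
        self.writer_count += 1;
    }

    /// Close one writer fd. Returns the reader thread to wake, but only
    /// when the last writer goes away and EOF becomes visible.
    pub fn close_writer(&mut self) -> Option<usize> {
        debug_assert!(self.writer_count > 0, "close without a matching open");
        self.writer_count -= 1;
        if self.writer_count == 0 {
            self.write_closed = true;
            self.reader_thread.take()
        } else {
            None
        }
    }
}
```

Returning the woken thread index instead of yielding inside close_writer mirrors the constraint above: the caller can drop its locks first and yield afterwards.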
Completion ports (libkernel/src/completion_port.rs)
CompletionPort::post() returns Option<usize> — the woken waiter
thread index. ISR-context callers ignore the return value.
Syscall-context callers (e.g., OP_NOP in io_port.rs) use it to yield
to the waiter.
Process exit (libkernel/src/process.rs)
terminate_process() calls yield_now() before kill_current_thread().
If the parent has a wait_thread, the donate target is set to the
parent’s thread so it returns from wait4 immediately. The dying
thread’s remaining quantum is donated to the parent.
Safety constraints
- yield_now() must NOT be called from ISR context. The scheduler lock could deadlock (ISR preempts code holding the lock, ISR tries to acquire lock → deadlock).
- ISR paths (e.g., irq_fd_dispatch → CompletionPort::post()) continue using plain unblock(). This is fine because ISRs are short.
- All locks (pipe, completion port) must be dropped before calling yield_now().
Why int 0x50 works from syscall context
During a SYSCALL handler the CPU runs on the kernel stack with GS =
kernel GS (from swapgs in the syscall entry stub). int 0x50 pushes
a ring-0 interrupt frame. The yield stub sees RPL = 0 in the saved CS,
so it skips swapgs, then saves all GPRs and the FXSAVE area. yield_tick
saves RSP and switches to the target’s stack. The target’s frame (from its
own yield or timer preemption) is restored via fxrstor + GPR pops + iretq.
Key files
| File | Change |
|---|---|
| libkernel/src/task/scheduler.rs | ipc_yield_stub asm, yield_tick, DONATE_TARGET, public API |
| libkernel/src/interrupts.rs | Register vector 0x50 in IDT |
| libkernel/src/file.rs | pipe_wake_reader returns thread idx, yield in PipeWriter |
| libkernel/src/completion_port.rs | post() returns Option<usize> |
| osl/src/io_port.rs | Yield after OP_NOP post |
| libkernel/src/process.rs | Yield before kill_current_thread in terminate_process |
Signal Support
Current state
Phase 1 of POSIX signal support: basic signal data structures, rt_sigaction,
rt_sigprocmask, signal delivery on SYSCALL return, rt_sigreturn, and kill.
What works
- rt_sigaction (syscall 13): install/query signal handlers with SA_SIGINFO and SA_RESTORER
- rt_sigprocmask (syscall 14): SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK
- kill (syscall 62): send signals to specific pids
- Signal delivery on SYSCALL return path via check_pending_signals
- rt_sigreturn (syscall 15): restore context after signal handler returns
- Default actions: SIG_DFL (terminate or ignore depending on signal), SIG_IGN
- sigaltstack (syscall 131): stub returning 0
Signal delivery mechanism
The SYSCALL assembly stub saves 8 registers onto the kernel stack and stores the
stack pointer into PerCpuData.saved_frame_ptr (GS offset 40). After syscall_dispatch
returns, check_pending_signals() is called:
- Peek at process’s pending & !blocked; return early if empty
- Dequeue lowest pending signal
- If SIG_DFL: terminate (SIGKILL, SIGTERM, etc.) or ignore (SIGCHLD, SIGCONT)
- If SIG_IGN: return
- If handler installed: construct rt_sigframe on user stack, rewrite saved frame
The rt_sigframe on the user stack contains:
- pretcode (8B): sa_restorer address (musl’s __restore_rt)
- siginfo_t (128B): signal number, errno, code
- ucontext_t (224B): saved registers (sigcontext), fpstate ptr, signal mask
The saved SYSCALL frame is rewritten so sysretq “returns” into the handler:
- RCX (→ RIP) = handler address
- RDI = signal number
- RSI = &siginfo (if SA_SIGINFO)
- RDX = &ucontext (if SA_SIGINFO)
- User RSP = rt_sigframe base
When the handler returns, __restore_rt calls rt_sigreturn (syscall 15),
which reads the saved context from the rt_sigframe and restores the original
registers and signal mask.
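The first two steps of check_pending_signals (peek at pending & !blocked, then dequeue the lowest signal) come down to a few bit operations. The sketch below assumes a 64-bit mask in which bit n-1 represents signal n, as in the Linux sigset layout; the SignalState shown here is a hypothetical reduction, not the struct in libkernel/src/signal.rs.

```rust
/// Reduced signal state: one pending bit and one blocked bit per signal.
pub struct SignalState {
    pub pending: u64,
    pub blocked: u64,
}

impl SignalState {
    /// Queue signal `sig` (1-based). Standard signals do not queue, so
    /// raising an already-pending signal is a no-op.
    pub fn raise(&mut self, sig: u32) {
        self.pending |= 1u64 << (sig - 1);
    }

    /// Remove and return the lowest-numbered deliverable signal, if any.
    pub fn dequeue_lowest(&mut self) -> Option<u32> {
        let deliverable = self.pending & !self.blocked;
        if deliverable == 0 {
            return None; // early return: nothing to deliver
        }
        let bit = deliverable.trailing_zeros();
        self.pending &= !(1u64 << bit);
        Some(bit + 1)
    }
}
```

A blocked signal stays pending and is skipped until the mask changes, which is why the dequeue works on pending & !blocked rather than on pending alone.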
Architecture
Key files
| File | Purpose |
|---|---|
| libkernel/src/signal.rs | Signal constants, SigAction, SignalState |
| libkernel/src/syscall.rs | PerCpuData.saved_frame_ptr, SyscallSavedFrame, check_pending_signals, deliver_signal |
| libkernel/src/process.rs | Process.signal field |
| osl/src/signal.rs | sys_rt_sigreturn, sys_kill |
| osl/src/signal.rs | sys_rt_sigaction, sys_rt_sigprocmask |
PerCpuData layout
| Offset | Field | Purpose |
|---|---|---|
| 0 | kernel_rsp | Loaded on SYSCALL entry |
| 8 | user_rsp | Saved by entry stub |
| 16 | user_rip | RCX saved by entry stub |
| 24 | user_rflags | R11 saved by entry stub |
| 32 | user_r9 | R9 saved (for clone) |
| 40 | saved_frame_ptr | RSP after register pushes (for signal delivery) |
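Because the assembly stub addresses these fields by fixed GS offsets, the table can be pinned down with compile-time checks. The sketch below assumes every field is a u64 and that the struct is #[repr(C)]; the real definition in libkernel/src/syscall.rs may use different field types.

```rust
/// Per-CPU block addressed through GS by the SYSCALL entry stub.
/// Field types are an assumption; only the offsets come from the table.
#[repr(C)]
pub struct PerCpuData {
    pub kernel_rsp: u64,      // loaded on SYSCALL entry
    pub user_rsp: u64,        // saved by entry stub
    pub user_rip: u64,        // RCX saved by entry stub
    pub user_rflags: u64,     // R11 saved by entry stub
    pub user_r9: u64,         // R9 saved (for clone)
    pub saved_frame_ptr: u64, // RSP after register pushes
}

// Fail the build if a field moves away from the offset the asm expects.
const _: () = {
    assert!(std::mem::offset_of!(PerCpuData, kernel_rsp) == 0);
    assert!(std::mem::offset_of!(PerCpuData, user_rsp) == 8);
    assert!(std::mem::offset_of!(PerCpuData, user_rip) == 16);
    assert!(std::mem::offset_of!(PerCpuData, user_rflags) == 24);
    assert!(std::mem::offset_of!(PerCpuData, user_r9) == 32);
    assert!(std::mem::offset_of!(PerCpuData, saved_frame_ptr) == 40);
};
```

Checks of this kind catch a reordered or inserted field at build time instead of as a corrupted frame pointer at run time. Note that std::mem::offset_of! requires Rust 1.77 or later.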
saved_frame_ptr is not saved/restored per-thread
saved_frame_ptr lives in a single per-CPU slot and is not saved/restored
during context switches. This is safe today because it is set and consumed
entirely within the SYSCALL entry/exit path with interrupts disabled:
- The assembly stub pushes registers, writes mov gs:40, rsp, then calls syscall_dispatch followed by check_pending_signals, all before the register pops and sysretq.
- rt_sigreturn is itself a syscall, so the stub sets saved_frame_ptr at the start of the same SYSCALL path before sys_rt_sigreturn reads it.
No preemption can occur between setting and consuming the pointer.
If signal delivery is ever needed from interrupt context (e.g. delivering
SIGSEGV from a page-fault handler or SIGINT from a keyboard ISR), this design
must be revisited — either by saving/restoring saved_frame_ptr per-thread in
the scheduler, or by using a different mechanism to locate the interrupted
frame (e.g. the interrupt stack frame pushed by the CPU).
Signal-interrupted syscalls (EINTR)
Blocking syscalls (sys_wait4, PipeReader::read) can be interrupted by
signals. The mechanism uses a per-process signal_thread field:
- Before blocking, the syscall stores its scheduler thread index in process.signal_thread.
- sys_kill, after queuing a signal, reads signal_thread and calls scheduler::unblock() on it if set.
- When the blocked thread wakes, it checks for pending signals. If any are deliverable (pending & !blocked != 0), it returns EINTR instead of re-blocking.
- The field is cleared on any exit path (data available, EOF, or signal).
Only interruptible blocking sites set signal_thread. Non-interruptible
blocks (vfork parent in sys_clone, blocking() async bridge) never set
it, so they remain unaffected.
The shell’s cmd_run handles EINTR from waitpid by forwarding SIGINT
to the child process and re-waiting, enabling Ctrl+C to reach child
processes running in the terminal.
Future work
- Exception-generated signals (SIGSEGV, SIGILL, SIGFPE from ring-3 faults)
- FPU state save/restore in signal frames
- Signal queuing (currently only one instance per signal — standard signals)
Process Spawning
How user-space processes are created.
Current Implementation
Process creation uses the standard Linux clone(CLONE_VM|CLONE_VFORK) +
execve path. musl’s posix_spawn and Rust’s std::process::Command
work unmodified.
clone (CLONE_VM | CLONE_VFORK | SIGCHLD)
clone creates a child process that shares the parent’s address space.
The parent blocks until the child calls execve or _exit.
See syscalls/clone.md for full details.
execve
execve replaces the current process’s address space with a new ELF binary.
Reads the ELF from the VFS, creates a fresh PML4, maps segments, builds the
initial stack with argc/argv/envp/auxv, closes FD_CLOEXEC fds, unblocks
the vfork parent, and jumps to userspace.
See syscalls/execve.md for full details.
Internal spawning (kernel-side)
For boot-time process creation (e.g. auto-launching the shell), the kernel
uses osl::spawn::spawn_process_full(elf_data, argv, envp, parent_pid)
which combines ELF loading and process creation in a single call.
kernel/src/ring3.rs provides spawn_process and spawn_process_with_env
wrappers that delegate to spawn_process_full.
Process lifecycle
parent: clone(CLONE_VM|CLONE_VFORK)
│
│ ┌─── child created (shares parent PML4) ───┐
│ │ │
│ │ execve("/bin/prog", argv, envp) │
│ │ → fresh PML4, ELF mapped │
│ │ → close CLOEXEC fds │
│ │ → unblock parent │
│ │ → jump to ring 3 │
│ │ │
├──┘ parent unblocked │
│ │
│ waitpid(child, &status, 0) │
│ → blocks until child exits │
│ │
│ child: _exit(code) │
│ → mark zombie, wake parent │
│ │
▼ parent: waitpid returns, reap zombie
Key files
| File | Purpose |
|---|---|
| osl/src/clone.rs | sys_clone: vfork child creation |
| osl/src/exec.rs | sys_execve: replace process image |
| osl/src/spawn.rs | spawn_process_full: kernel-side ELF spawning |
| osl/src/elf_loader.rs | ELF parsing and address space setup |
| libkernel/src/task/scheduler.rs | spawn_clone_thread, clone_trampoline |
| kernel/src/ring3.rs | spawn_process wrapper for boot-time use |
Future work
- fork + CoW page faults: full POSIX fork with copy-on-write. Requires page fault handler and per-frame reference counting.
- fd inheritance across clone: currently the child gets a copy of the parent’s fd table; selective inheritance could be added.
Plan: User Space and Process Isolation
Context
The kernel currently runs everything — drivers, shell, filesystem — in a single ring-0 address space as async Rust tasks. This document outlines the path from that baseline to a system where untrusted programs run in isolated ring-3 processes with their own virtual address spaces, communicating with the kernel through system calls, and eventually linked against a ported musl libc.
Progress Summary
Phases 0–6 are complete. The kernel runs a musl-linked C shell
(user/shell.c) as its primary user interface. The shell auto-launches on
boot, supports line editing, built-in commands (echo, pwd, cd, ls,
cat, exit, help), and spawning external programs. Process creation uses standard Linux clone(CLONE_VM|CLONE_VFORK) + execve,
enabling unpatched musl posix_spawn and Rust std::process::Command.
35+ syscalls are implemented including pipe2, dup2, fcntl, getpid,
getrandom, clone/execve, and custom completion port / IPC syscalls.
| Phase | Status | Milestone |
|---|---|---|
| 0 — Toolchain | Done | Hand-crafted assembly blobs and static ELF binaries load and run |
| 1 — Ring-3 + SYSCALL | Done | GDT has ring-3 segments; SYSCALL/SYSRET works; sys_write, sys_exit, sys_arch_prctl implemented |
| 2 — Per-process page tables | Done | create_user_page_table, map_user_page, CR3 switching on context switch; ring-3 page faults kill the process |
| 3 — Process abstraction | Done | Process struct, process table, ELF loader, exec shell command, zombie reaping |
| 4 — System call layer | Done | 14 syscalls implemented; initial stack with auxv; brk/mmap for heap; writev for musl printf |
| 5 — Cross-compiler + musl | Done | Docker-based musl cross-compiler (scripts/user-build.sh); static musl binaries run on ostoo |
| 6 — Spawn / wait / user shell | Done | clone(CLONE_VM|CLONE_VFORK) + execve for process creation; wait4; pipe2, dup2, fcntl, getpid, getrandom; userspace C shell with line editing, auto-launched on boot |
| 7 — Signals | Not started | Requires signal frame push/pop, rt_sigaction, rt_sigreturn |
What works today
- Userspace shell (user/shell.c): musl-linked C shell compiled via Docker cross-compiler, deployed to disk image at /shell. Auto-launched from kernel/src/main.rs on boot; falls back to kernel shell if not found.
- Line editing in the shell: read char-by-char, echo, backspace, Ctrl+C (cancel line), Ctrl+D (exit on empty line).
- Built-in commands: echo, pwd, cd, ls, cat, exit, help.
- External programs: posix_spawn(path) + waitpid from the shell.
- Raw keypress delivery to userspace via libkernel/src/console.rs: foreground PID routing, blocking read(0), keyboard ISR wakeup.
- Per-process FD table (fds 0–2 = ConsoleHandle); FileHandle trait with ConsoleHandle, VfsHandle, and DirHandle implementations.
- 35+ syscalls implemented (see docs/syscalls/ for per-syscall docs): read, write, open, close, fstat, lseek, mmap, mprotect, munmap, brk, ioctl, writev, exit/exit_group, wait4, getcwd, chdir, arch_prctl, futex, getdents64, set_tid_address, set_robust_list, clone, execve, pipe2, dup2, fcntl, getpid, getrandom, kill, rt_sigaction, rt_sigprocmask, rt_sigreturn, sigaltstack, madvise, sched_getaffinity, clock_gettime, plus custom syscalls for completion ports (501–503), IRQ (504), and IPC channels (505–507).
- open resolves paths relative to process CWD; supports both files (VfsHandle) and directories (DirHandle with O_DIRECTORY).
- getdents64 returns linux_dirent64 structs from DirHandle.
- clone(CLONE_VM|CLONE_VFORK) creates a child sharing the parent’s address space; execve replaces it with a new ELF binary.
- wait4 blocks parent until child exits/zombies.
- writev (used by musl’s printf) writes scatter/gather buffers to VGA.
- brk grows the process heap by allocating and mapping zero-filled pages.
- mmap supports anonymous MAP_PRIVATE allocations via a bump-down allocator starting at 0x4000_0000_0000.
- Process tracks brk_base/brk_current (computed from ELF segment extents), mmap_next/mmap_regions, fd_table, cwd, parent_pid, wait_thread.
- ELF parser extracts phdr_vaddr, phnum, and phentsize for the auxiliary vector (musl reads AT_PHDR/AT_PHNUM/AT_PHENT during startup).
- spawn_process_full (in osl/src/spawn.rs) builds the initial stack with argc, argv strings, envp (NULL), and auxiliary vector.
- Async-to-sync bridge (osl/src/blocking.rs): spawns async VFS operations as kernel tasks, blocks the user thread, unblocks on completion.
- Unhandled syscalls log a warning with the syscall number and first 3 args, then return -ENOSYS.
- Ring-3 page faults, GPFs, and invalid opcodes log the fault, mark the process zombie, wake the parent’s wait thread, restore kernel GS polarity, and kill the thread, without a kernel panic.
- test isolation verifies two independently-created PML4s have genuinely independent user-space mappings at the same virtual address.
- System info commands (cpuinfo, meminfo, memmap, pmap, threads, tasks, idt, pci, lapic, ioapic, drivers, uptime) are exposed as /proc virtual files accessible via cat /proc/<file>.
Key implementation files
| File | Role |
|---|---|
| libkernel/src/gdt.rs | GDT with kernel + user code/data segments, TSS, set_kernel_stack for rsp0 |
| libkernel/src/syscall.rs | SYSCALL MSR init, assembly entry stub, per-CPU data |
| libkernel/src/file.rs | FileHandle trait, FileError enum, ConsoleHandle |
| libkernel/src/console.rs | Console input buffer, foreground PID routing, blocking read |
| libkernel/src/process.rs | Process struct (fd_table, cwd, brk/mmap, parent/wait), ProcessManager, zombie lifecycle |
| libkernel/src/elf.rs | ELF64 parser (static ET_EXEC, x86-64) with phdr metadata for auxv |
| libkernel/src/memory/mod.rs | create_user_page_table, map_user_page, switch_address_space |
| libkernel/src/task/scheduler.rs | spawn_user_thread, process_trampoline, CR3 switching in preempt_tick, block/unblock |
| libkernel/src/interrupts.rs | Ring-3-aware page fault, GPF, and invalid opcode handlers |
| osl/src/syscalls/ | syscall_dispatch + syscall implementations (io.rs, fs.rs, mem.rs, process.rs, misc.rs) |
| osl/src/errno.rs | Linux errno constants, file_errno() / vfs_errno() converters |
| osl/src/file.rs | VfsHandle, DirHandle (VFS-backed file handles) |
| osl/src/blocking.rs | Async-to-sync bridge for VFS calls |
| osl/src/spawn.rs | spawn_process_full (ELF spawning with argv and parent PID) |
| kernel/src/ring3.rs | Legacy spawn_process wrapper, spawn_blob (raw code), test helpers |
| kernel/src/keyboard_actor.rs | Foreground routing: raw bytes to console or kernel line editor |
| kernel/src/main.rs | Auto-launch /shell on boot |
| devices/src/vfs/proc_vfs/mod.rs | ProcVfs with 12+ virtual files (generator submodules) |
| user/shell.c | Userspace shell (musl, static) |
| docs/syscalls/*.md | Per-syscall documentation |
Virtual Address Space Layout
The kernel’s heap, APIC, and MMIO window live in the high canonical half
(≥ 0xFFFF_8000_0000_0000), so the entire lower canonical half is available
for user process address spaces. The kernel/user boundary is enforced at the
PML4 level: entries 0–255 (lower half) are user-private; entries 256–510
(high half) are kernel-shared; entry 511 is the per-PML4 recursive
self-mapping.
0x0000_0000_0000_0000 ← canonical zero (null pointer trap page, unmapped)
0x0000_0000_0040_0000 ← ELF load address (4 MiB, standard x86-64)
↓ text, data, BSS
↓ brk heap (grows up from page-aligned end of highest PT_LOAD segment)
...
0x0000_4000_0000_0000 ← mmap region (bump-down allocator, grows downward)
...
0x0000_7FFF_F000_0000 ← ELF user stack base (8 pages = 32 KiB)
0x0000_7FFF_F000_8000 ← ELF user stack top (RSP starts here minus auxv layout)
0x0000_7FFF_FFFF_FFFF ← top of lower canonical half (entire range = user)
(non-canonical gap)
0xFFFF_8000_0000_0000 ← kernel heap (HEAP_START, 1 MiB)
0xFFFF_8001_0000_0000 ← Local APIC MMIO (APIC_BASE)
0xFFFF_8001_0001_0000 ← IO APIC(s)
0xFFFF_8002_0000_0000 ← MMIO window (MMIO_VIRT_BASE, 512 GiB)
phys_mem_offset ← bootloader physical memory identity map (high half)
0xFFFF_FF80_0000_0000 ← recursive PT window (PML4[511])
0xFFFF_FFFF_FFFF_F000 ← PML4 self-mapping
Kernel entries (PML4 indices 256–510) are copied into every process page table
without USER_ACCESSIBLE; they are invisible to ring-3 code.
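As a quick check of the layout above, the PML4 index of any virtual address is bits 47:39 of that address. The sketch below classifies addresses into the three regions the text describes; the `Pml4Region` enum and function names are hypothetical, introduced only for this example.

```rust
/// Which part of the PML4 an address falls into (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub enum Pml4Region {
    /// Indices 0..=255: user-private lower half.
    User,
    /// Indices 256..=510: kernel-shared high half.
    KernelShared,
    /// Index 511: per-PML4 recursive self-mapping.
    Recursive,
}

/// Bits 47:39 of a virtual address select the PML4 entry.
pub fn pml4_index(vaddr: u64) -> usize {
    ((vaddr >> 39) & 0x1FF) as usize
}

/// Map an address to the region it belongs to under the split above.
pub fn classify(vaddr: u64) -> Pml4Region {
    match pml4_index(vaddr) {
        0..=255 => Pml4Region::User,
        256..=510 => Pml4Region::KernelShared,
        _ => Pml4Region::Recursive,
    }
}
```

For instance, the ELF user stack base at 0x0000_7FFF_F000_0000 lands in index 255, the last user entry, while HEAP_START at 0xFFFF_8000_0000_0000 lands in index 256, the first kernel entry.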
Phase 0 — Toolchain and Build Infrastructure ✅ COMPLETE
Goal: produce user-space ELF binaries that the kernel can load, without needing musl yet.
0a. Custom linker script
Write user/link.ld:
ENTRY(_start)
SECTIONS {
. = 0x400000;
.text : { *(.text*) }
.rodata : { *(.rodata*) }
.data : { *(.data*) }
.bss : { *(.bss*) COMMON }
}
0b. Rust no_std user target
Add a custom target JSON x86_64-ostoo-user.json with:
- "os": "none", "env": "", "vendor": "unknown"
- "pre-link-args": pass the linker script
- "panic-strategy": "abort" (no unwinding in user space initially)
- "disable-redzone": true (same requirement as kernel)
A minimal user/ crate can implement _start in assembly, call a main, then
invoke the exit syscall.
0c. Assembly user programs
Before the ELF loader exists, a hand-crafted binary blob (or raw ELF built from a few lines of NASM) is enough to verify the ring-3 transition and basic syscalls work.
Phase 1 — Ring-3 GDT Segments and SYSCALL Infrastructure ✅ COMPLETE
Goal: the kernel can jump to ring 3 and come back via SYSCALL/SYSRET. No process isolation yet — user code runs in the kernel’s own address space.
What was implemented:
- GDT extended with kernel data, user data, and user code segments in the order required by IA32_STAR (libkernel/src/gdt.rs).
- TSS.rsp0 updated via set_kernel_stack() on every context switch to a user process.
- SYSCALL MSRs (STAR, LSTAR, FMASK, EFER.SCE) configured in libkernel/src/syscall.rs::init().
- Assembly entry stub with swapgs, per-CPU kernel/user RSP swap, and SysV64 argument shuffle before calling syscall_dispatch.
- Three syscalls: write (fd 1/2 to VGA), exit/exit_group (mark zombie + kill thread), arch_prctl(ARCH_SET_FS) (write IA32_FS_BASE MSR).
- Ring-3 test (test ring3): drops to user mode, writes “Hello from ring 3!” via syscall, exits cleanly.
1a. GDT additions (libkernel/src/gdt.rs)
Add three new descriptors (plus one padding slot) in the order required by IA32_STAR:
Index Selector Descriptor
0 0x00 Null
1 0x08 Kernel code (ring 0, already exists)
2 0x10 Kernel data (ring 0) ← new; SYSCALL loads SS from STAR[47:32]+8
3 0x18 (padding / null for SYSRET alignment) ← STAR[63:48] points here
4 0x20 User data (ring 3) ← new; SYSRET loads SS from STAR[63:48]+8
5 0x28 User code (ring 3) ← new; SYSRET loads CS from STAR[63:48]+16
6 0x30+ TSS (2 slots for the 16-byte system descriptor)
IA32_STAR layout: bits 47:32 = kernel CS (SYSCALL), bits 63:48 = user CS − 16
(SYSRET uses this+16 for CS and +8 for SS).
Update the Selectors struct and init() in gdt.rs.
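To make the selector arithmetic concrete, the sketch below packs an IA32_STAR value and derives the selectors that SYSCALL and 64-bit SYSRET load from it. The function names are hypothetical, and the selector values used in the usage note (kernel CS 0x08, user CS 0x2B) are illustrative assumptions rather than values confirmed by the source.

```rust
/// Pack IA32_STAR: kernel CS in bits 47:32, (user CS - 16) in bits 63:48.
pub fn star_value(kernel_cs: u16, user_cs: u16) -> u64 {
    ((kernel_cs as u64) << 32) | ((user_cs.wrapping_sub(16) as u64) << 48)
}

/// Selectors loaded by SYSCALL: CS = STAR[47:32] with RPL cleared,
/// SS = STAR[47:32] + 8.
pub fn syscall_selectors(star: u64) -> (u16, u16) {
    let base = (star >> 32) as u16;
    (base & 0xFFFC, base.wrapping_add(8))
}

/// Selectors loaded by 64-bit SYSRET: CS = STAR[63:48] + 16,
/// SS = STAR[63:48] + 8, both with RPL forced to 3.
pub fn sysret_selectors(star: u64) -> (u16, u16) {
    let base = (star >> 48) as u16;
    (base.wrapping_add(16) | 3, base.wrapping_add(8) | 3)
}
```

With a kernel CS of 0x08 and a user CS of 0x2B, SYSRET yields CS 0x2B and SS 0x23, so the user data descriptor has to sit 8 bytes below the user code descriptor in the GDT.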
1b. TSS kernel-stack field
When an interrupt arrives while the CPU is executing in ring 3, the CPU loads
RSP from TSS.rsp0. This must point to the current process’s kernel stack top.
For now a single global TSS is fine; when processes exist, rsp0 is updated on
every context switch.
1c. SYSCALL MSR setup (libkernel/src/interrupts.rs or new libkernel/src/syscall.rs)
pub fn init_syscall() {
    // IA32_STAR: kernel CS at bits 47:32, user CS-16 at bits 63:48
    let star: u64 = ((KERNEL_CS as u64) << 32) | ((USER_CS as u64 - 16) << 48);
    unsafe { Msr::new(0xC000_0081).write(star); } // STAR
    // IA32_LSTAR: entry point for 64-bit SYSCALL
    unsafe { Msr::new(0xC000_0082).write(syscall_entry as u64); }
    // IA32_FMASK: clear IF, DF on SYSCALL (but keep other flags)
    unsafe { Msr::new(0xC000_0084).write(0x0000_0600); } // IF (bit 9) | DF (bit 10)
    // Enable SCE bit in EFER
    let efer = unsafe { Msr::new(0xC000_0080).read() };
    unsafe { Msr::new(0xC000_0080).write(efer | 1); }
}
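The FMASK value can be double-checked from the RFLAGS bit positions: on SYSCALL the CPU clears every RFLAGS bit that is set in IA32_FMASK. The constants below are the architectural bit positions; the helper name is hypothetical.

```rust
/// RFLAGS bit positions (architectural).
pub const RFLAGS_TF: u64 = 1 << 8; // trap flag
pub const RFLAGS_IF: u64 = 1 << 9; // interrupt enable flag
pub const RFLAGS_DF: u64 = 1 << 10; // direction flag

/// Mask that clears IF and DF on kernel entry.
pub const FMASK_IF_DF: u64 = RFLAGS_IF | RFLAGS_DF;

/// Model of what SYSCALL does to RFLAGS: bits set in FMASK are cleared.
pub fn rflags_after_syscall(user_rflags: u64, fmask: u64) -> u64 {
    user_rflags & !fmask
}
```

`RFLAGS_IF | RFLAGS_DF` evaluates to 0x600, whereas 0x300 would correspond to `RFLAGS_TF | RFLAGS_IF`.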
1d. Assembly syscall entry stub
libkernel/src/syscall_entry.asm (or global_asm! in syscall.rs):
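syscall_entry:
    swapgs                  ; switch to kernel GS (store user GS)
    mov [gs:USER_RSP], rsp  ; save user RSP into per-cpu area
    mov rsp, [gs:KERN_RSP]  ; load kernel RSP
    push rcx                ; user RIP (SYSCALL saves it here)
    push r11                ; user RFLAGS
    ; push all scratch registers
    push rax
    push rdi
    push rsi
    push rdx
    push r10
    push r8
    push r9
    ; rax = syscall number, rdi/rsi/rdx/r10/r8/r9 = arguments.
    ; One possible SysV64 shuffle, assuming syscall_dispatch(nr, a1, a2, a3, a4, a5);
    ; each source register is read before it is overwritten. The sixth
    ; argument (r9) would need a stack slot and is omitted from this sketch.
    mov r9, r8
    mov r8, r10
    mov rcx, rdx
    mov rdx, rsi
    mov rsi, rdi
    mov rdi, rax
    call syscall_dispatch   ; -> rax = return value
    pop r9
    pop r8
    pop r10
    pop rdx
    pop rsi
    pop rdi
    add rsp, 8              ; discard saved rax; rax keeps the return value
    pop r11                 ; restore RFLAGS
    pop rcx                 ; restore user RIP
    mov rsp, [gs:USER_RSP]  ; restore user RSP
    swapgs
    sysretq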
swapgs requires a per-CPU data block holding the kernel stack pointer.
Implement as a small struct at a known virtual address (or via GS_BASE MSR).
1e. Minimal syscall dispatch table
Start with just three numbers (matching Linux x86-64 for musl compatibility):
| Number | Name | Action |
|---|---|---|
| 0 | read | stub → return −ENOSYS |
| 1 | write | write to VGA console if fd==1/2 |
| 60 | exit | terminate current process |
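A minimal dispatch function for the three numbers in the table might look like the sketch below. It is a host-side model only: the `Kernel` struct, its `console` buffer standing in for VGA, the slice-based `buf` argument, and the `-EBADF` result for other descriptors are assumptions made for illustration.

```rust
/// Linux errno values used by this sketch.
pub const ENOSYS: i64 = 38;
pub const EBADF: i64 = 9;

/// Linux x86-64 syscall numbers from the table above.
pub const SYS_READ: u64 = 0;
pub const SYS_WRITE: u64 = 1;
pub const SYS_EXIT: u64 = 60;

/// Stand-in for kernel state: a byte buffer models the VGA console.
#[derive(Default)]
pub struct Kernel {
    pub console: Vec<u8>,
    pub exit_code: Option<i32>,
}

impl Kernel {
    /// Dispatch on the syscall number; unknown numbers return -ENOSYS.
    pub fn syscall_dispatch(&mut self, nr: u64, a1: u64, buf: &[u8]) -> i64 {
        match nr {
            SYS_READ => -ENOSYS, // stub for now
            SYS_WRITE => {
                if a1 == 1 || a1 == 2 {
                    self.console.extend_from_slice(buf);
                    buf.len() as i64
                } else {
                    -EBADF // assumption: other fds are rejected
                }
            }
            SYS_EXIT => {
                self.exit_code = Some(a1 as i32);
                0
            }
            _ => -ENOSYS,
        }
    }
}
```

In the real kernel the buffer would arrive as a user pointer plus length and `exit` would not return; the model keeps only the number-to-action mapping.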
1f. First ring-3 test
Write a tiny inline assembly test in kernel/src/main.rs that:
- Pushes a fake user-mode iret frame (SS, RSP, RFLAGS with IF, CS ring-3, RIP).
- Executes iretq into ring 3.
- User code executes syscall with rax=1 (write), prints one character.
- Kernel writes it to VGA and returns to ring 3.
- User code executes syscall with rax=60 (exit).
This verifies the GDT, SYSCALL, and basic ABI without an ELF loader or address space isolation.
Phase 2 — Per-Process Page Tables and Address Space Isolation ✅ COMPLETE
Goal: each process has its own PML4; kernel mappings are shared; user mappings are private.
What was implemented:
- MemoryServices::create_user_page_table() allocates a fresh PML4, copies kernel entries (indices 256–510) without USER_ACCESSIBLE, and sets the recursive self-mapping at index 511.
- MemoryServices::map_user_page() maps individual 4 KiB pages in a non-active page table given its PML4 physical address.
- unsafe switch_address_space(pml4_phys) writes CR3.
- Page fault handler (libkernel/src/interrupts.rs) checks stack_frame.code_segment.rpl(): ring-3 faults mark the process zombie (exit code -11 / SIGSEGV), restore kernel GS via swapgs, and call kill_current_thread(). Kernel faults still panic.
- test isolation shell command verifies two PML4s map the same user virtual address to different physical frames.
- Scheduler preempt_tick saves/restores CR3 when switching between threads with different page tables.
2a. Page table creation (libkernel/src/memory/)
Add to MemoryServices:
/// Allocate a fresh PML4, copy all kernel PML4 entries (indices where
/// virtual_address >= KERNEL_SPLIT) into it, and return the physical
/// address of the new PML4 frame.
pub fn create_user_page_table(&mut self) -> PhysAddr;

/// Map a single 4 KiB page in a specific (possibly non-active) page table.
pub fn map_user_page(
    &mut self,
    pml4_phys: PhysAddr,
    virt: VirtAddr,
    phys: PhysAddr,
    flags: PageTableFlags, // USER_ACCESSIBLE | PRESENT | WRITABLE | NO_EXECUTE as needed
) -> Result<(), MapToError<Size4KiB>>;

/// Switch the active address space. Must be called with interrupts disabled.
pub unsafe fn switch_address_space(&self, pml4_phys: PhysAddr);
2b. Kernel/user PML4 split
The layout gives a clean hardware-level split:
- PML4 indices 0–255 (lower canonical half, 0x0000_*): user-private. Left empty at process creation; populated by the ELF loader and mmap.
- PML4 indices 256–510 (high canonical half, 0xFFFF_8000_* through 0xFFFF_FF7F_*): kernel-shared. Copied from the kernel PML4 at process creation; marked present but never USER_ACCESSIBLE.
- PML4 index 511: the recursive self-mapping. Each process PML4 must have its own entry here pointing to its own physical PML4 frame (not the kernel’s). create_user_page_table must set this explicitly.
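The three rules above can be modelled on a plain array of 512 entries. This is a sketch under stated assumptions: entries are raw `u64` values using the architectural flag bits, `build_user_pml4` is a hypothetical name, and frame allocation is replaced by a caller-supplied physical address.

```rust
pub const ENTRIES: usize = 512;

/// Architectural page-table entry flag bits.
pub const PRESENT: u64 = 1 << 0;
pub const WRITABLE: u64 = 1 << 1;
pub const USER_ACCESSIBLE: u64 = 1 << 2;

/// Build the contents of a new process PML4 from the kernel's PML4.
/// `new_frame_phys` is the (page-aligned) physical address of the new table.
pub fn build_user_pml4(kernel_pml4: &[u64; ENTRIES], new_frame_phys: u64) -> [u64; ENTRIES] {
    // Indices 0..=255 start empty: the user half is private to the process.
    let mut pml4 = [0u64; ENTRIES];
    // Indices 256..=510 are shared kernel mappings, never user-accessible.
    for i in 256..511 {
        pml4[i] = kernel_pml4[i] & !USER_ACCESSIBLE;
    }
    // Index 511 points back at this table, not at the kernel's PML4.
    pml4[511] = new_frame_phys | PRESENT | WRITABLE;
    pml4
}
```

Copying entry 511 from the kernel table instead would make the recursive window in the new address space resolve to the kernel's page tables, which is why it is set explicitly.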
2c. Page fault handler upgrade
Replace the panic in page_fault_handler with:
extern "x86-interrupt" fn page_fault_handler(frame: InterruptStackFrame, ec: PageFaultErrorCode) {
    let faulting_addr = Cr2::read();
    if frame.code_segment.rpl() == PrivilegeLevel::Ring3 {
        // Fault in user space: kill the process (deliver SIGSEGV later).
        kill_current_process(Signal::Segv);
        schedule_next(); // does not return to faulting instruction
    } else {
        panic!("kernel page fault at {:?}\n{:#?}\n{:?}", faulting_addr, frame, ec);
    }
}
This is the minimum needed to prevent a kernel panic when user code accesses invalid memory; proper CoW / demand paging comes later.
2d. Address space switch on context switch
The scheduler’s preempt_tick function currently saves/restores only kernel
RSP. Extend it to also write CR3 when switching between processes with
different page tables.
Phase 3 — Process Abstraction ✅ COMPLETE
Goal: Process struct, a process table, and a working exec.
What was implemented:
- Process struct (libkernel/src/process.rs) with PID, state (Running/Zombie), PML4 physical address, heap-allocated 64 KiB kernel stack, entry point, user stack top, thread index, and exit code.
- Global PROCESS_TABLE: Mutex<BTreeMap<ProcessId, Process>> and CURRENT_PID: AtomicU64.
- insert(), current_pid(), set_current_pid(), with_process(), mark_zombie(), reap(), reap_zombies().
- Scheduler integration: SchedulableKind::Kernel | UserProcess(ProcessId). spawn_user_thread creates a thread targeting process_trampoline which sets up TSS.rsp0, per-CPU kernel RSP, PID tracking, GS polarity, CR3 switch, and then does iretq into ring-3 user code.
- kill_current_thread() marks the thread Dead and spins; timer preemption skips dead threads.
- ELF loader (libkernel/src/elf.rs): minimal parser for static ET_EXEC x86-64 binaries. Returns ElfInfo { entry, segments, phdr_vaddr, phnum, phentsize }.
- kernel/src/ring3.rs::spawn_process(elf_data): parses ELF, creates user PML4, maps all PT_LOAD segments (with correct R/W/X flags) plus a user stack page, creates a Process, and spawns a user thread. Returns Ok(ProcessId).
- Shell command exec <path> reads an ELF from the VFS and calls spawn_process.
- spawn_blob(code) helper for test commands: maps a raw code blob + stack, creates a Process, spawns a user thread.
- Zombie reaping: reap_zombies() is called at the start of spawn_blob and spawn_process to free kernel stacks of fully-exited processes.
3a. Process struct (libkernel/src/process/mod.rs)
pub struct Process {
    pub pid: ProcessId,
    pub state: ProcessState,     // Running, Ready, Blocked, Zombie
    pub pml4_phys: PhysAddr,     // physical address of PML4
    pub kernel_stack: Vec<u8>,   // 64 KiB kernel stack
    pub saved_rsp: u64,          // kernel RSP when not running
    pub user_rsp: u64,           // user RSP (restored on ring-3 return)
    pub files: FileTable,        // open file descriptors
    pub parent: Option<ProcessId>,
    pub exit_code: Option<i32>,
}
3b. Process table
lazy_static! {
    static ref PROCESSES: Mutex<BTreeMap<ProcessId, Process>> = ...;
}
CURRENT_PID: AtomicU32 — the PID running on each CPU (single-CPU for now).
3c. Scheduler integration
Replace the bare Thread list in scheduler.rs with process-aware scheduling:
- On preempt_tick: save user context (if coming from ring 3), switch CR3, load next process’s user context and kernel RSP.
- TSS.rsp0 updated to point to the new process’s kernel stack top.
3d. ELF loader (libkernel/src/elf.rs)
pub fn load_elf(
    bytes: &[u8],
    process: &mut Process,
    mem: &mut MemoryServices,
) -> Result<VirtAddr, ElfError>; // returns entry point
Steps:
- Validate ELF magic, e_machine == EM_X86_64, e_type == ET_EXEC (static) or ET_DYN (PIE).
- For each PT_LOAD segment: allocate physical frames, map at p_vaddr with USER_ACCESSIBLE and flags derived from p_flags (R/W/X).
- Copy p_filesz bytes from the ELF image; zero-fill to p_memsz.
- Allocate and map a user stack (8–16 pages) just below the stack top.
- Set up the initial stack frame: argc=0, argv=NULL, envp=NULL, auxv entries for AT_ENTRY, AT_PHDR, AT_PAGESZ (required by musl’s _start).
- Return e_entry.
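The first and last steps above (header validation and returning e_entry) can be sketched on a raw byte slice. The field offsets follow the ELF64 header layout; the `ElfError` variants and the `validate_header` name are hypothetical and not taken from libkernel/src/elf.rs.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ElfError {
    TooShort,
    BadMagic,
    NotElf64,
    NotLittleEndian,
    WrongMachine,
    UnsupportedType,
}

pub const EM_X86_64: u16 = 62;
pub const ET_EXEC: u16 = 2;
pub const ET_DYN: u16 = 3;

/// Validate an ELF64 header and return `e_entry` on success.
pub fn validate_header(bytes: &[u8]) -> Result<u64, ElfError> {
    // The ELF64 file header is 64 bytes long.
    if bytes.len() < 64 {
        return Err(ElfError::TooShort);
    }
    if !bytes.starts_with(b"\x7fELF") {
        return Err(ElfError::BadMagic);
    }
    // e_ident[EI_CLASS] == ELFCLASS64, e_ident[EI_DATA] == ELFDATA2LSB.
    if bytes[4] != 2 {
        return Err(ElfError::NotElf64);
    }
    if bytes[5] != 1 {
        return Err(ElfError::NotLittleEndian);
    }
    let e_type = u16::from_le_bytes([bytes[16], bytes[17]]);
    let e_machine = u16::from_le_bytes([bytes[18], bytes[19]]);
    if e_machine != EM_X86_64 {
        return Err(ElfError::WrongMachine);
    }
    if e_type != ET_EXEC && e_type != ET_DYN {
        return Err(ElfError::UnsupportedType);
    }
    // e_entry is a little-endian u64 at offset 24.
    let mut entry = [0u8; 8];
    entry.copy_from_slice(&bytes[24..32]);
    Ok(u64::from_le_bytes(entry))
}
```

Segment mapping, stack allocation, and auxv setup would follow once the header passes these checks.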
3e. sys_execve syscall
fn sys_execve(path: *const u8, argv: *const *const u8, envp: *const *const u8) -> ! {
    // path_str: `path` copied in from user memory and validated (helper not
    // shown); argv/envp handling is likewise omitted from this sketch.
    let bytes = vfs::read_file(path_str).expect("exec: read failed");
    let process = current_process_mut();
    process.reset_address_space(); // drop old page table
    let entry = load_elf(&bytes, process, &mut memory()).expect("exec: bad ELF");
    switch_to_user(entry, process.user_stack_top); // does not return
}
Phase 4 — System Call Layer ✅ COMPLETE
Goal: a syscall table wide enough to run a static musl binary that prints “Hello, world!” and exits.
What was implemented:
4a. ELF parser extensions (libkernel/src/elf.rs)
ElfInfo now includes phdr_vaddr, phnum, and phentsize. The parser
looks for a PT_PHDR program header (type 6) to get the phdr virtual address
directly; fallback computes it from the PT_LOAD segment containing e_phoff.
These values populate the auxiliary vector that musl reads during startup.
4b. Process memory tracking (libkernel/src/process.rs)
Process gained four new fields:
| Field | Type | Purpose |
|---|---|---|
| brk_base | u64 | Page-aligned end of highest PT_LOAD segment (immutable) |
| brk_current | u64 | Current program break (starts == brk_base) |
| mmap_next | u64 | Bump-down pointer for anonymous mmap (starts at 0x4000_0000_0000) |
| mmap_regions | Vec<(u64, u64)> | Tracked (vaddr, len) pairs |
Process::new() now takes a brk_base parameter. spawn_process computes it
from max(seg.vaddr + seg.memsz) page-aligned up.
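The brk_base computation described here is small enough to show in full. The `Segment` struct below is a hypothetical stand-in for the parser's segment record; only the `vaddr` and `memsz` fields named in the text are used.

```rust
pub const PAGE_SIZE: u64 = 4096;

/// Hypothetical minimal view of a PT_LOAD segment.
pub struct Segment {
    pub vaddr: u64,
    pub memsz: u64,
}

/// Round an address up to the next page boundary (no-op if already aligned).
pub fn page_align_up(addr: u64) -> u64 {
    (addr + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

/// brk_base = max(seg.vaddr + seg.memsz) over all segments, page-aligned up.
pub fn compute_brk_base(segments: &[Segment]) -> u64 {
    let end = segments
        .iter()
        .map(|s| s.vaddr + s.memsz)
        .max()
        .unwrap_or(0);
    page_align_up(end)
}
```

For two segments ending at 0x401234 and 0x402800, the highest end is 0x402800 and the resulting brk_base is 0x403000.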
4c. Initial stack layout (kernel/src/ring3.rs)
ELF processes get an 8-page (32 KiB) contiguous stack at 0x7FFF_F000_0000,
allocated via alloc_dma_pages(8) so the auxv layout can be written through the
kernel’s phys_mem_offset window. build_initial_stack() writes:
[stack_top]
16 bytes pseudo-random data (AT_RANDOM target)
alignment padding (8 bytes)
AT_NULL (0, 0)
AT_RANDOM (25, addr)
AT_ENTRY (9, entry_point)
AT_PHNUM (5, phnum)
AT_PHENT (4, phentsize)
AT_PHDR (3, phdr_vaddr)
AT_PAGESZ (6, 4096)
AT_UID (11, 0)
NULL ← envp terminator
NULL ← argv terminator
0 ← argc = 0
[RSP points here, 16-byte aligned]
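Read from RSP upward, the layout above is a flat sequence of 64-bit words. The sketch below produces that sequence for the argc = 0 case; `AuxInput` and `initial_stack_words` are hypothetical names, and the AT_* constants are the standard Linux auxiliary vector types shown in the diagram.

```rust
pub const AT_NULL: u64 = 0;
pub const AT_PHDR: u64 = 3;
pub const AT_PHENT: u64 = 4;
pub const AT_PHNUM: u64 = 5;
pub const AT_PAGESZ: u64 = 6;
pub const AT_ENTRY: u64 = 9;
pub const AT_UID: u64 = 11;
pub const AT_RANDOM: u64 = 25;

/// Hypothetical input record; fields mirror the values named in the text.
pub struct AuxInput {
    pub entry: u64,
    pub phdr_vaddr: u64,
    pub phnum: u64,
    pub phentsize: u64,
    pub random_addr: u64, // address of the 16 random bytes near stack_top
}

/// Words in ascending address order, starting at the final RSP.
pub fn initial_stack_words(a: &AuxInput) -> Vec<u64> {
    let auxv = [
        (AT_UID, 0),
        (AT_PAGESZ, 4096),
        (AT_PHDR, a.phdr_vaddr),
        (AT_PHENT, a.phentsize),
        (AT_PHNUM, a.phnum),
        (AT_ENTRY, a.entry),
        (AT_RANDOM, a.random_addr),
        (AT_NULL, 0),
    ];
    // argc = 0, then the argv terminator, then the envp terminator.
    let mut words = vec![0u64, 0, 0];
    for &(key, value) in auxv.iter() {
        words.push(key);
        words.push(value);
    }
    words
}
```

This yields 19 words (152 bytes). Together with the 16 random bytes that makes 168 bytes, so 8 bytes of padding are needed to keep RSP 16-byte aligned, which is consistent with the padding entry in the diagram.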
4d. Syscall table (osl/src/syscalls/mod.rs)
All syscalls use Linux x86-64 numbers for musl compatibility. Unhandled
numbers log a warning and return -ENOSYS. Errno constants are defined
in osl/src/errno.rs; libkernel uses FileError for structured errors.
| Nr | Name | Implementation |
|---|---|---|
| 0 | read | Via fd_table → FileHandle::read; ConsoleHandle blocks on empty input |
| 1 | write | Via fd_table → FileHandle::write |
| 2 | open | VFS read_file or list_dir → VfsHandle/DirHandle; path resolution relative to CWD |
| 3 | close | Via fd_table |
| 5 | fstat | S_IFCHR for console fds |
| 8 | lseek | Returns -ESPIPE (not seekable) |
| 9 | mmap | Anonymous MAP_PRIVATE only; bump-down allocator |
| 10 | mprotect | Updates page table flags for VMA regions |
| 11 | munmap | Unmaps pages, frees frames, splits/removes VMAs |
| 12 | brk | Query or grow heap; allocates+maps zero-filled pages |
| 16 | ioctl | Returns -ENOTTY |
| 20 | writev | Via fd_table; scatter/gather write |
| 60 | exit | Mark zombie, wake parent wait_thread, kill thread |
| 61 | wait4 | Find zombie child, block if none, reap and return |
| 79 | getcwd | Copy process.cwd to user buffer |
| 80 | chdir | Validate path via VFS list_dir, update process.cwd |
| 158 | arch_prctl | ARCH_SET_FS writes IA32_FS_BASE MSR |
| 202 | futex | No-op stub (single-threaded, lock never contended) |
| 217 | getdents64 | Via DirHandle::getdents64 |
| 218 | set_tid_address | Returns current PID as TID |
| 231 | exit_group | Same as exit (single-threaded) |
| 273 | set_robust_list | No-op, returns 0 |
Lock ordering for brk and mmap: process table lock acquired/released to
read state, then memory lock for frame allocation and page mapping, then process
table lock re-acquired to write updates. This avoids nested lock deadlocks.
See docs/syscalls/ for detailed per-syscall documentation.
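The bump-down allocation used by mmap (syscall 9 in the table above) can be sketched as follows. This models only the address bookkeeping: the `MmapState` type and `alloc` method are hypothetical, and frame allocation, page mapping, and the lock ordering described above are left out.

```rust
pub const PAGE_SIZE: u64 = 4096;
/// Starting point of the mmap region, as given in the text.
pub const MMAP_START: u64 = 0x4000_0000_0000;

/// Hypothetical holder for the two mmap-related Process fields.
pub struct MmapState {
    pub mmap_next: u64,
    pub mmap_regions: Vec<(u64, u64)>, // (vaddr, len) pairs
}

impl MmapState {
    pub fn new() -> Self {
        MmapState { mmap_next: MMAP_START, mmap_regions: Vec::new() }
    }

    /// Reserve `len` bytes (rounded up to whole pages) below `mmap_next`.
    /// Returns the base address, or None for a zero or overflowing length.
    pub fn alloc(&mut self, len: u64) -> Option<u64> {
        if len == 0 {
            return None;
        }
        let size = len.checked_add(PAGE_SIZE - 1)? & !(PAGE_SIZE - 1);
        let addr = self.mmap_next.checked_sub(size)?;
        self.mmap_next = addr;
        self.mmap_regions.push((addr, size));
        Some(addr)
    }
}
```

Each call moves `mmap_next` downward, so successive mappings sit at decreasing addresses and never collide with the brk heap growing up from the ELF image unless the two regions meet.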
4e. What’s still missing (deferred to later phases)
- SMAP enforcement: User pointers in writev, fstat, brk are accessed without stac/clac.
- Page deallocation: ✅ Fixed. munmap frees frames and splits VMAs; brk shrink unmaps and frees pages; process exit cleans up the entire user address space.
- mprotect: ✅ Fixed. Updates page table flags for the target VMA range.
- FS_BASE save/restore: ✅ Fixed. FS_BASE is saved/restored per-thread in preempt_tick via save_current_context/restore_thread_state.
Phase 5 — Cross-Compiler and musl Port ✅ COMPLETE
Goal: compile C programs that run as ostoo user processes.
What was implemented:
- Docker-based build environment (scripts/user-build.sh) using x86_64-linux-musl-cross toolchain.
- user/Makefile compiles *.c files to static musl-linked ELF binaries.
- user/shell.c is the primary musl binary (see Phase 6).
- Binaries are deployed to the exFAT disk image or shared via virtio-9p.
5a. Toolchain strategy
The simplest path: use an existing x86_64-linux-musl sysroot unmodified,
because we implement Linux-compatible syscall numbers (Phase 4). musl does not
inspect the OS name at runtime — it just issues syscalls.
Option A (quickest): install x86_64-linux-musl-gcc from
musl.cc or via brew install x86_64-linux-musl-cross.
Compile with:
x86_64-linux-musl-gcc -static -o hello hello.c
The resulting fully-static ELF should work on ostoo with the Phase 4 syscalls.
Option B (custom triple): build musl from source with a custom --target
configured for ostoo. This is useful once ostoo diverges from Linux’s ABI
(e.g. custom syscall numbers or a different startup convention).
5b. musl build recipe (Option B outline)
# Prerequisites: a bare x86_64-elf-gcc cross-compiler (via crosstool-ng or
# manual binutils + gcc build targeting x86_64-unknown-elf).
git clone https://git.musl-libc.org/cgit/musl
cd musl
./configure \
--target=x86_64 \
--prefix=/opt/ostoo-sysroot \
--syslibdir=/opt/ostoo-sysroot/lib \
CROSS_COMPILE=x86_64-elf-
make -j$(nproc)
make install
Key musl files:
- arch/x86_64/syscall_arch.h: __syscall0…__syscall6 use the syscall instruction; no changes needed if syscall numbers match Linux.
- crt/x86_64/crt1.o: _start sets up argc/argv/envp from the initial stack (ABI defined in the ELF auxiliary vector; match what the ELF loader sets up in Phase 3d).
- src/env/__init_tls.c: calls arch_prctl(ARCH_SET_FS, ...); requires the sys_arch_prctl syscall (Phase 4d).
5c. Rust user programs
For Rust programs targeting ostoo, add a custom target
x86_64-ostoo-user.json (from Phase 0b) and a minimal ostoo-rt crate that:
- Provides _start (sets up a stack frame; calls main; calls sys_exit).
- Provides #[panic_handler] that calls sys_exit(1).
- Wraps the small syscall ABI.
Users can then write:
#![no_std]
#![no_main]
extern crate ostoo_rt;
#[no_mangle]
pub extern "C" fn main() {
ostoo_rt::write(1, b"Hello from Rust!\n");
}
Phase 6 — Spawn, Wait, and a Minimal Shell ✅ COMPLETE
Goal: a user-mode shell that can launch and wait for child programs.
What was implemented:
Process creation uses the standard Linux clone(CLONE_VM|CLONE_VFORK) +
execve path. musl’s posix_spawn and Rust’s std::process::Command
work unmodified.
6a. clone (syscall 56)
clone(CLONE_VM|CLONE_VFORK|SIGCHLD) creates a child sharing the parent’s
address space. The parent blocks until the child calls execve or _exit.
See clone.
6b. execve (syscall 59)
Replaces the current process image with a new ELF binary. Reads from VFS, creates a fresh PML4, maps segments, builds the initial stack, closes CLOEXEC fds, unblocks the vfork parent, and jumps to userspace.
See execve.
6c. wait4 (syscall 61)
- sys_wait4(pid, status_ptr, options): find zombie child, write exit status, reap, return child PID
- If no zombie found: register wait_thread on parent, block, retry on wake
- sys_exit wakes parent’s wait_thread via scheduler::unblock()
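One pass of the wait4 logic above can be modelled on a plain map of processes. This is a sketch, not the kernel's data model: `Proc`, `State`, `WaitStep`, and `wait4_step` are hypothetical names, and the `NoChildren` outcome (which Linux reports as ECHILD) is an assumption about behaviour the text does not spell out.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq)]
pub enum State {
    Running,
    Zombie,
}

/// Hypothetical minimal process record.
pub struct Proc {
    pub parent: u64,
    pub state: State,
    pub exit_code: i32,
}

/// Result of a single attempt; the caller loops on `Block`.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitStep {
    /// A zombie child was found and removed from the table.
    Reaped { pid: u64, exit_code: i32 },
    /// Children exist but none has exited: register wait_thread and block.
    Block,
    /// The caller has no children at all.
    NoChildren,
}

pub fn wait4_step(table: &mut BTreeMap<u64, Proc>, parent: u64) -> WaitStep {
    let zombie = table
        .iter()
        .find(|(_, p)| p.parent == parent && p.state == State::Zombie)
        .map(|(pid, _)| *pid);
    if let Some(pid) = zombie {
        let child = table.remove(&pid).expect("pid was just found");
        return WaitStep::Reaped { pid, exit_code: child.exit_code };
    }
    if table.values().any(|p| p.parent == parent) {
        WaitStep::Block
    } else {
        WaitStep::NoChildren
    }
}
```

On `Block`, the real syscall would record the caller in wait_thread, block in the scheduler, and call the same step again once sys_exit unblocks it.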
6d. Userspace shell (user/shell.c)
- Compiled with musl (static), deployed at /shell
- Line editing: read char-by-char, echo, backspace, Ctrl+C, Ctrl+D
- Built-in commands: echo, pwd, cd, ls, cat, exit, help
- External programs: posix_spawn(path) + waitpid(child, &status, 0)
- Auto-launched from kernel/src/main.rs; falls back to kernel shell if /shell is not found on the filesystem
6e. What’s deferred
- fork + CoW page faults: standard POSIX fork is not implemented. Adding it would require: marking all user pages read-only in both parent and child, a CoW page fault handler that copies on write, and reference counting on physical frames.
Phase 7 — Signals ⬜ NOT STARTED
Signals are the last major piece of POSIX plumbing needed for a realistic user-space environment.
Minimal signal implementation
pub struct SigAction { handler: usize, flags: u32, mask: SigSet }
pub struct SigTable { actions: [SigAction; 32], pending: SigSet, masked: SigSet }
- sys_rt_sigaction installs handlers.
- Before returning to user space after a syscall or interrupt, check pending & ~masked.
- If set: push a signal frame on the user stack (siginfo + ucontext), set RIP to the handler, clear the pending bit.
- sys_rt_sigreturn: the signal handler calls this when done; the kernel pops the ucontext and resumes normal user execution.
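The "check pending & ~masked, then clear the pending bit" step can be sketched with plain bitmasks. Assumptions in this example: signal sets are modelled as `u64`, signal number n is stored at bit n-1 (the usual Linux convention), and the lowest-numbered deliverable signal is taken first; `SigBits` and `take_deliverable` are hypothetical names.

```rust
/// Hypothetical bitmask view of the pending and masked sets.
pub struct SigBits {
    pub pending: u64,
    pub masked: u64,
}

/// Pick the lowest-numbered pending, unmasked signal and clear its
/// pending bit. Returns the 1-based signal number, or None if nothing
/// is deliverable.
pub fn take_deliverable(sigs: &mut SigBits) -> Option<u32> {
    let ready = sigs.pending & !sigs.masked;
    if ready == 0 {
        return None;
    }
    let bit = ready.trailing_zeros();
    sigs.pending &= !(1u64 << bit);
    Some(bit + 1)
}
```

A masked signal stays pending and becomes deliverable only after the mask is changed, which is why the pending bit is cleared only for the signal actually returned.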
Dependency Graph
Phase 0 ✅ ← Phase 1 ✅ ← Phase 2 ✅ ← Phase 3 ✅ ← Phase 4 ✅ ← Phase 5 ✅ ← Phase 6 ✅
(toolchain) (ring-3, (address (Process, (syscall (musl) (spawn/wait/
syscall) spaces) ELF loader) layer) shell)
↓
Phase 7 (signals)
Key Risks and Design Decisions
SYSCALL vs INT 0x80
Use SYSCALL/SYSRET (64-bit, fast path). INT 0x80 is the 32-bit ABI; musl uses SYSCALL on x86-64 exclusively.
Kernel/user split
The kernel lives entirely in the high canonical half (0xFFFF_8000_* and
above): heap at 0xFFFF_8000_*, APIC at 0xFFFF_8001_*, MMIO window at
0xFFFF_8002_*. The entire lower canonical half is free for user processes.
The split is enforced at the PML4 level: user processes have no
user-accessible mappings at indices 256–510, because the kernel entries they
inherit are never USER_ACCESSIBLE. SMEP (CR4.20) and SMAP (CR4.21) provide the
hardware enforcement layer once ring-3 processes exist.
SMEP and SMAP
Once ring-3 processes exist, enable SMEP (CR4.20) to prevent the kernel from
accidentally executing user-mapped code, and SMAP (CR4.21) to prevent the
kernel from silently accessing user memory without an explicit stac/clac
pair. Any kernel code that copies from user buffers must use a checked copy
function that uses stac to temporarily permit access.
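Before any such copy, the kernel would typically confirm that the whole user buffer lies in the lower canonical half. The range check below is a sketch of that precondition only: `is_user_range` is a hypothetical helper, rejecting null pointers is an assumption, and the actual stac/clac-bracketed copy is not modelled.

```rust
/// First address past the lower canonical half (user space ends here).
pub const USER_END: u64 = 0x0000_8000_0000_0000;

/// True if [ptr, ptr + len) is non-null, does not wrap around, and lies
/// entirely below USER_END.
pub fn is_user_range(ptr: u64, len: u64) -> bool {
    match ptr.checked_add(len) {
        Some(end) => ptr != 0 && end <= USER_END,
        None => false,
    }
}
```

Using `checked_add` matters here: a length chosen so that `ptr + len` wraps around would otherwise pass a naive `ptr + len <= USER_END` comparison.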
Static-only ELF initially
Dynamic linking requires an in-kernel or user-space ELF interpreter. Start
with -static binaries and the ELF loader described in Phase 3d. PIE static
binaries (ET_DYN with no INTERP segment) should work with minor adjustments to
the loader.
Single CPU for now
The process table and scheduler assume a single CPU. SMP support would require
per-CPU CURRENT_PID, per-CPU kernel stacks in the TSS, and IPI-based TLB
shootdown when modifying another process’s page table.
Heap size
The kernel heap is 1 MiB. Process control blocks each consume 64 KiB
(kernel stack) plus page table frames, plus Vec storage for mmap_regions.
Zombie processes are reaped via wait4 + reap(), but loading multiple
concurrent processes will still pressure the heap.
Memory management
munmap frees frames and splits/removes VMAs. brk shrink frees pages.
Process exit calls cleanup_user_address_space to walk and free all
user-half page tables and frames. The kernel heap (1 MiB) is the main
remaining pressure point for concurrent processes.
Milestones and Test Checkpoints
| Milestone | Observable result | Status |
|---|---|---|
| Phase 1 complete | iretq drops to ring 3; syscall returns to ring 0; “Hello from ring 3!” appears on VGA | ✅ Done |
| Phase 2 complete | Two user processes have separate address spaces; test isolation passes | ✅ Done |
| Phase 3 complete | exec /path/to/elf reads an ELF from the VFS, loads it into a fresh address space, and runs it | ✅ Done |
| Phase 4 complete | 14 syscalls, initial stack with auxv, brk/mmap heap, writev for printf | ✅ Done |
| Phase 5 complete | hello compiled with x86_64-linux-musl-gcc -static prints and exits cleanly | ✅ Done |
| Phase 6 complete | Userspace shell spawns children and waits for them; auto-launches on boot | ✅ Done |
| Phase 7 complete | SIGINT (Ctrl+C) terminates the foreground process | ⬜ |
read (nr 0)
Linux Signature
ssize_t read(int fd, void *buf, size_t count);
Description
Reads up to count bytes from file descriptor fd into buf.
Current Implementation
Looks up fd in the current process’s per-process file descriptor table and calls FileHandle::read() on the handle.
- fd 0 (stdin), ConsoleHandle: Reads raw bytes from the console input buffer (libkernel/src/console.rs). If the buffer is empty, blocks the current scheduler thread via block_current_thread() until the keyboard ISR delivers input via push_input(). Returns at least 1 byte per call.
- VFS file fds, VfsHandle: Reads from an in-memory buffer loaded at open() time. Maintains a per-handle read position. Returns 0 at EOF.
- Directory fds, DirHandle: Returns -EISDIR (-21).
- Invalid fds: Returns -EBADF (-9).
Validates that buf falls within user address space (< 0x0000_8000_0000_0000). Returns -EFAULT (-14) on invalid pointers. Returns 0 immediately if count is 0.
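A user-buffer check of this kind might look like the following sketch. The function name and the overflow handling are assumptions; only the 0x0000_8000_0000_0000 boundary and the -EFAULT result come from the text.

```rust
/// First address above the lower canonical half (from the text).
pub const USER_END: u64 = 0x0000_8000_0000_0000;
pub const EFAULT: i64 = 14;

/// Checks that the whole range [addr, addr + len) lies below USER_END.
/// `validate_user_range` is a hypothetical name for illustration.
pub fn validate_user_range(addr: u64, len: u64) -> Result<(), i64> {
    match addr.checked_add(len) {
        // The end is exclusive, so it may equal USER_END.
        Some(end) if end <= USER_END => Ok(()),
        // Out of range, or addr + len wrapped around.
        _ => Err(-EFAULT),
    }
}
```

Checking the end of the range, not just the start pointer, rejects buffers that begin in user space but extend into the kernel half.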
Source: osl/src/syscalls/io.rs — sys_read
Future Work
- Support partial reads and proper error handling for VFS files.
- SMAP enforcement for user buffer validation.
write (nr 1)
Linux Signature
ssize_t write(int fd, const void *buf, size_t count);
Description
Writes up to count bytes from buf to file descriptor fd.
Current Implementation
Looks up fd in the current process’s per-process file descriptor table and calls FileHandle::write() on the handle.
- fd 1 (stdout) and fd 2 (stderr), ConsoleHandle: Interprets buf as UTF-8 and prints to the VGA text buffer via print!(). If not valid UTF-8, falls back to printing printable ASCII (0x20..0x7F) plus \n, \r, \t. Returns count on success.
- VFS file fds, VfsHandle: Returns -EBADF (-9) because files are read-only.
- Invalid fds: Returns -EBADF (-9).
Validates that buf falls within user address space (< 0x0000_8000_0000_0000). Returns -EFAULT (-14) on invalid pointers.
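The UTF-8-with-ASCII-fallback rule for console writes can be illustrated as follows. This sketch uses std's String so it can be tested on a host; the render_console_bytes name is hypothetical, and the real handler prints through print!() rather than returning a string.

```rust
/// Renders a user buffer as console text: valid UTF-8 passes through
/// unchanged; otherwise only printable ASCII (0x20..0x7F) and
/// \n, \r, \t are kept and every other byte is dropped.
pub fn render_console_bytes(buf: &[u8]) -> String {
    match std::str::from_utf8(buf) {
        Ok(s) => s.to_string(),
        Err(_) => buf
            .iter()
            .copied()
            .filter(|&b| (0x20..0x7F).contains(&b) || matches!(b, b'\n' | b'\r' | b'\t'))
            .map(char::from)
            .collect(),
    }
}
```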
Source: osl/src/syscalls/io.rs — sys_write
Future Work
- Support writable VFS files.
- Handle partial writes.
open (nr 2)
Linux Signature
int open(const char *pathname, int flags, mode_t mode);
Description
Opens a file or directory at pathname and returns a file descriptor.
Current Implementation
- Reads a null-terminated path string from user space (max 4096 bytes). Returns -EFAULT if the pointer is invalid.
- Resolves the path relative to the process’s current working directory (cwd). Normalises . and .. components.
- Unless O_DIRECTORY (0o200000) is set, first attempts to open as a file via devices::vfs::read_file() (through osl::blocking::blocking()). On success, the entire file content is loaded into a VfsHandle (buffered in kernel memory) and a new fd is allocated.
- If the file open fails with VfsError::NotFound or VfsError::NotAFile, or O_DIRECTORY was requested, falls back to opening as a directory via devices::vfs::list_dir(). On success, creates a DirHandle with the directory listing and allocates a new fd.
- Returns the new fd number on success, or a negative errno.
The VFS operations use osl::blocking::blocking() which spawns the async VFS call as a kernel task and blocks the calling user thread until it completes.
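The path-resolution step (join onto cwd, then normalise . and ..) can be sketched as below. The normalize_path name and the choice to clamp .. at the root are assumptions; the sketch uses std's String and Vec for host testing.

```rust
/// Resolves `path` against `cwd` (when `path` is relative) and removes
/// `.` and `..` components. A `..` at the root stays at the root.
pub fn normalize_path(cwd: &str, path: &str) -> String {
    // Absolute paths ignore cwd entirely.
    let bases: [&str; 2] = if path.starts_with('/') { ["", path] } else { [cwd, path] };
    let mut parts: Vec<&str> = Vec::new();
    for base in bases {
        for comp in base.split('/') {
            match comp {
                "" | "." => {}
                ".." => {
                    parts.pop();
                }
                name => parts.push(name),
            }
        }
    }
    if parts.is_empty() {
        "/".to_string()
    } else {
        format!("/{}", parts.join("/"))
    }
}
```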
Flags supported: O_DIRECTORY (to explicitly request directory). O_RDONLY is implied for all opens. Other flags are accepted but ignored.
Source: osl/src/syscalls/fs.rs — sys_open
Errors
| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid pathname pointer |
| -ENOENT (-2) | File or directory not found |
| -ENOTDIR (-20) | Path is not a directory (when O_DIRECTORY used) |
| -EMFILE (-24) | Per-process fd limit reached (64) |
| -EIO (-5) | VFS I/O error |
Future Work
- Support O_WRONLY, O_CREAT, O_TRUNC for writable files.
- Streaming reads instead of loading entire file into memory at open time.
- Proper mode handling.
close (nr 3)
Linux Signature
int close(int fd);
Description
Closes a file descriptor so that it no longer refers to any file and may be reused.
Current Implementation
Looks up fd in the current process’s file descriptor table. If found, calls FileHandle::close() on the handle and sets the table slot to None, making the fd number available for reuse.
- Returns 0 on success.
- Returns -EBADF (-9) if the fd is not open or out of range.
Source: osl/src/syscalls/fs.rs — sys_close
Future Work
- Flush pending writes for writable file handles before closing.
- Free resources held by the handle (e.g. release VFS locks).
fstat (nr 5)
Linux Signature
int fstat(int fd, struct stat *statbuf);
Description
Returns information about a file referred to by the file descriptor fd, writing it into the stat structure at statbuf.
Current Implementation
- Zero-fills the 144-byte struct stat buffer.
- Sets st_mode at offset 24 to S_IFCHR | 0666 (character device, read/write for all), regardless of which fd is queried.
- Always returns 0 (success).
This is sufficient for musl’s stdio initialisation, which calls fstat on stdout to determine whether it is a terminal.
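The stub's output buffer can be reproduced with a few lines. This is a sketch only: the constant and function names are hypothetical, and the real handler writes into user memory rather than returning an array. The S_IFCHR value (0o020000) is the standard Linux constant.

```rust
pub const STAT_SIZE: usize = 144; // sizeof(struct stat) on x86-64 Linux
pub const ST_MODE_OFFSET: usize = 24; // st_mode follows st_dev, st_ino, st_nlink
pub const S_IFCHR: u32 = 0o020000; // character device file type

/// Builds the zero-filled stat buffer with only st_mode populated.
pub fn fill_stub_stat() -> [u8; STAT_SIZE] {
    let mut buf = [0u8; STAT_SIZE];
    let mode: u32 = S_IFCHR | 0o666;
    // x86-64 is little-endian, so the u32 is stored in LE byte order.
    buf[ST_MODE_OFFSET..ST_MODE_OFFSET + 4].copy_from_slice(&mode.to_le_bytes());
    buf
}
```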
Source: osl/src/syscalls/fs.rs — sys_fstat
Future Work
- Return different st_mode values depending on the fd (e.g., regular file vs. character device).
- Populate other stat fields (st_size, st_ino, st_dev, timestamps, etc.).
- Return -EBADF for invalid file descriptors.
- Validate that statbuf is a writable user-space address.
lseek (nr 8)
Linux Signature
off_t lseek(int fd, off_t offset, int whence);
Description
Repositions the file offset of the open file descriptor fd to the given offset
according to whence (SEEK_SET, SEEK_CUR, SEEK_END).
Current Implementation
Always returns -ESPIPE (illegal seek). The only file descriptors currently in use
are stdin/stdout/stderr, which behave as non-seekable character devices (serial console).
Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch
Future Work
- Implement proper seek for regular file descriptors once the VFS exposes them to user-space via open.
mmap (nr 9)
Linux Signature
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
Description
Maps pages of memory into the calling process’s address space.
Current Implementation
Supports anonymous mappings, file-backed private mappings, and shared
memory mappings via shmem_create fds.
Source: osl/src/syscalls/mem.rs — sys_mmap
Supported modes
| Flags | fd | Behaviour |
|---|---|---|
| MAP_PRIVATE \| MAP_ANONYMOUS | ignored | Allocate fresh zeroed pages (most common) |
| MAP_PRIVATE | file fd | Copy file content into private pages |
| MAP_SHARED | shmem fd | Map the shared memory object’s physical frames |
| MAP_SHARED \| MAP_ANONYMOUS | (any) | Returns -EINVAL (not supported without fork) |
MAP_SHARED and MAP_PRIVATE are mutually exclusive; if both or neither
are set, -EINVAL is returned.
Protection flags (prot)
| Flag | Value | Page table flags |
|---|---|---|
| PROT_READ | 0x1 | PRESENT \| USER_ACCESSIBLE |
| PROT_WRITE | 0x2 | + WRITABLE |
| PROT_EXEC | 0x4 | removes NO_EXECUTE |
If prot is 0 (PROT_NONE), pages are mapped as present but not
accessible from userspace (guard pages).
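One way to read the table and the PROT_NONE rule together is sketched below. The bit positions are the architectural x86-64 page-table bits; the function name, the NO_EXECUTE default, and treating any non-zero prot as readable (x86-64 has no write-only pages) are assumptions of this sketch, not confirmed details of sys_mmap.

```rust
pub const PROT_READ: u32 = 0x1;
pub const PROT_WRITE: u32 = 0x2;
pub const PROT_EXEC: u32 = 0x4;

// Architectural x86-64 page-table entry bits.
pub const PRESENT: u64 = 1 << 0;
pub const WRITABLE: u64 = 1 << 1;
pub const USER_ACCESSIBLE: u64 = 1 << 2;
pub const NO_EXECUTE: u64 = 1 << 63;

/// Hypothetical translation of mmap `prot` bits into page-table flags.
pub fn mmap_prot_to_flags(prot: u32) -> u64 {
    // Every mapping is present; execution is denied unless requested.
    let mut flags = PRESENT | NO_EXECUTE;
    if prot & (PROT_READ | PROT_WRITE | PROT_EXEC) != 0 {
        flags |= USER_ACCESSIBLE;
    }
    if prot & PROT_WRITE != 0 {
        flags |= WRITABLE;
    }
    if prot & PROT_EXEC != 0 {
        flags &= !NO_EXECUTE;
    }
    flags
}
```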
Address selection
- Without MAP_FIXED: a top-down gap finder scans the VMA map for a free gap in the user address range (0x0000_0010_0000–0x0000_4000_0000_0000), starting from the top. The returned address is the start of the gap.
- MAP_FIXED: addr must be page-aligned and non-zero. Any existing mappings in the range are implicitly unmapped before the new mapping is created.
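A top-down gap search over a BTreeMap can be sketched as follows. For brevity the map here stores start -> length rather than full Vma records, and the sketch assumes non-overlapping VMAs and a page-aligned len. Placing the new region at the top of the first sufficiently large gap is one plausible reading of "top-down"; the function name is hypothetical.

```rust
use std::collections::BTreeMap;

pub const USER_MIN: u64 = 0x0000_0010_0000;
pub const USER_MAX: u64 = 0x0000_4000_0000_0000;

/// Walks the VMAs from highest to lowest and returns the start address of
/// a `len`-byte region placed at the top of the first gap that fits.
pub fn find_gap_top_down(vmas: &BTreeMap<u64, u64>, len: u64) -> Option<u64> {
    if len == 0 {
        return None;
    }
    // `ceiling` is the exclusive upper bound of the gap being examined.
    let mut ceiling = USER_MAX;
    for (&start, &vma_len) in vmas.iter().rev() {
        let end = start + vma_len;
        if ceiling.saturating_sub(end) >= len {
            return Some(ceiling - len);
        }
        ceiling = ceiling.min(start);
    }
    // Finally, try the gap between USER_MIN and the lowest VMA.
    if ceiling.saturating_sub(USER_MIN) >= len {
        Some(ceiling - len)
    } else {
        None
    }
}
```

Returning None corresponds to the -ENOMEM case in the error table.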
MAP_SHARED with shmem fd
When MAP_SHARED is specified with a file descriptor from
shmem_create(508), the kernel maps the shared memory object’s existing
physical frames into the caller’s page table. Each frame’s reference
count is incremented so the frame is not freed until all processes have
unmapped it and the last fd is closed.
The offset argument selects the starting frame within the shmem object
(must be page-aligned).
File-backed MAP_PRIVATE
When MAP_PRIVATE is specified with a file fd (from open), the file’s
content is copied into freshly allocated pages. The pages are private to
the calling process — writes do not affect the underlying file or other
mappings.
VMA tracking
Each mapping is recorded as a Vma (virtual memory area) in the
process’s vma_map (BTreeMap<u64, Vma>), tracking start address,
length, protection, flags, fd, and offset. The VMA map is used by
munmap, mprotect, the gap finder, and process cleanup.
Lock ordering
PROCESS_TABLE is acquired first (to read VMA state and pml4_phys),
then released before acquiring MEMORY (to allocate/map pages), then
PROCESS_TABLE is re-acquired to update state. This avoids nested lock
deadlocks.
Errors
| Error | Condition |
|---|---|
| -EINVAL | Length is 0, MAP_SHARED and MAP_PRIVATE both/neither set, MAP_SHARED \| MAP_ANONYMOUS, unaligned MAP_FIXED addr, unaligned offset |
| -ENOMEM | Physical memory exhausted or no virtual address gap found |
| -ENODEV | MAP_SHARED fd is not a shmem object |
| -EBADF | File-backed MAP_PRIVATE with an invalid fd |
See also
- munmap (11) — unmap pages
- mprotect (10) — change page protection
- shmem_create (508) — create shared memory fd
- mmap Design — design document with phase roadmap
mprotect (nr 10)
Linux Signature
int mprotect(void *addr, size_t len, int prot);
Description
Changes the access protections for the calling process’s memory pages in the range [addr, addr+len).
Implementation
- Validates addr is page-aligned and len > 0 (returns -EINVAL otherwise).
- Aligns len up to the next page boundary.
- Splits/updates VMAs in the range via Process::mprotect_vmas():
  - Entire VMA overlap: updates prot in place.
  - Partial overlap (front, tail, middle): splits VMA and sets new prot on the affected portion.
- Converts prot to x86-64 page table flags (prot_to_page_flags()):
  - PROT_NONE → USER_ACCESSIBLE only (no PRESENT, so any access faults).
  - PROT_READ → PRESENT | USER_ACCESSIBLE | NO_EXECUTE.
  - PROT_WRITE → adds WRITABLE.
  - PROT_EXEC → removes NO_EXECUTE.
- Updates page table entries via MemoryServices::update_user_page_flags() with TLB flush.
- Returns 0 on success.
Lock ordering: PROCESS_TABLE first (VMA split), then MEMORY (page table update).
Returns 0 (no-op) if no VMAs overlap the requested range (Linux semantics).
Source: osl/src/syscalls/mem.rs (sys_mprotect), libkernel/src/process.rs (mprotect_vmas), libkernel/src/memory/mod.rs (update_user_page_flags)
munmap (nr 11)
Linux Signature
int munmap(void *addr, size_t length);
Description
Removes mappings for the specified address range, causing further references to addresses within the range to generate page faults.
Current Implementation
Fully implemented. Validates arguments, splits/removes VMAs, unmaps page table entries, frees physical frames to the free list, and flushes the TLB.
Source: osl/src/syscalls/mem.rs — sys_munmap
Behaviour
- addr must be page-aligned; length must be > 0. Returns -EINVAL otherwise.
- length is rounded up to the next page boundary.
- Overlapping VMAs are split or removed:
  - Entire VMA consumed: removed from vma_map.
  - Front consumed: VMA start/len adjusted forward.
  - Tail consumed: VMA len shortened.
  - Middle consumed: VMA split into two fragments.
- Each page in the unmapped range is removed from the page table. Physical frames are released via refcount-aware logic: shared frames (from MAP_SHARED mappings) are only freed when their reference count reaches 0 (i.e. all processes have unmapped the frame and the backing shmem_create fd has been closed). Non-shared frames are freed immediately.
- If no VMAs overlap the range, returns 0 (Linux no-op semantics).
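The four overlap cases reduce to subtracting one half-open range from another. The sketch below shows that core step in isolation; Range and subtract are illustrative names, and the real munmap_vmas also carries protection, flags, fd, and offset through each fragment.

```rust
/// Half-open address range [start, end).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Range {
    pub start: u64,
    pub end: u64,
}

/// Returns the parts of `vma` that survive unmapping `hole`:
/// zero fragments (entire VMA consumed), one (front or tail consumed,
/// or no overlap), or two (middle consumed).
pub fn subtract(vma: Range, hole: Range) -> Vec<Range> {
    let mut out = Vec::new();
    if hole.end <= vma.start || hole.start >= vma.end {
        out.push(vma); // no overlap: VMA is untouched
        return out;
    }
    if hole.start > vma.start {
        out.push(Range { start: vma.start, end: hole.start });
    }
    if hole.end < vma.end {
        out.push(Range { start: hole.end, end: vma.end });
    }
    out
}
```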
Lock ordering
PROCESS_TABLE is acquired first (to call munmap_vmas), then released before acquiring MEMORY (to unmap and free pages). Same ordering as sys_mmap and sys_brk.
Errors
| Error | Condition |
|---|---|
| -EINVAL | addr not page-aligned, length is 0, or caller is kernel |
brk (nr 12)
Linux Signature
int brk(void *addr);
Note: The raw syscall returns the new program break on success (not 0 like the glibc wrapper).
Description
Sets the end of the process’s data segment (the “program break”). Increasing the break allocates memory; decreasing it deallocates.
Current Implementation
- brk(0) or brk(addr < brk_base): Returns the current program break without modification. This is how musl queries the initial break.
- brk(addr <= brk_current): Shrinks the break. Updates brk_current but does not unmap or free any pages.
- brk(addr > brk_current): Grows the break. The requested address is page-aligned up. For each new page:
  - A physical frame is allocated via alloc_dma_pages(1).
  - The frame is zeroed.
  - The frame is mapped into the process’s page table with PRESENT | WRITABLE | USER_ACCESSIBLE | NO_EXECUTE.
  - brk_current is updated to the new page-aligned address.
- On allocation failure, returns the old brk_current (Linux convention: failure = unchanged break).
Initial state: brk_base and brk_current are set to the page-aligned end of the highest PT_LOAD ELF segment when the process is spawned.
Lock ordering: Process table lock is acquired/released to read state, then memory lock for allocation, then process table lock again to write the update.
Source: osl/src/syscalls/mem.rs — sys_brk
Future Work
- Free physical frames and unmap pages when the break is decreased.
- Guard against growing the break into other mapped regions (stack, mmap area).
rt_sigaction (nr 13)
Linux Signature
int rt_sigaction(int signum, const struct sigaction *act,
struct sigaction *oldact, size_t sigsetsize);
Description
Examine and change a signal action.
Current Implementation
Stub: Returns 0 (success) unconditionally. No signal support is implemented. musl’s runtime init calls rt_sigaction to install default signal handlers; the stub allows this to succeed silently.
Source: osl/src/signal.rs — sys_rt_sigaction
rt_sigprocmask (nr 14)
Linux Signature
int rt_sigprocmask(int how, const sigset_t *set, sigset_t *oldset, size_t sigsetsize);
Description
Examine and change blocked signals.
Current Implementation
Stub: Returns 0 (success) unconditionally. No signal support is implemented. musl’s runtime and posix_spawn call rt_sigprocmask to configure the signal mask; the stub allows this to succeed silently.
Source: osl/src/signal.rs — sys_rt_sigprocmask
rt_sigreturn (nr 15)
Restore process context after a signal handler returns.
Signature
rt_sigreturn() → (restores original rax)
Arguments
None. The kernel reads the saved context from the signal frame on the user stack.
Return value
Does not return in the normal sense — restores all registers (including rax) from the signal frame, resuming execution at the point where the signal was delivered.
Description
When the kernel delivers a signal, it pushes a signal frame onto the user
stack containing the interrupted context (all registers, signal mask, RIP,
RSP, RFLAGS) and sets RIP to the user’s signal handler. A trampoline
(__restore_rt) is placed on the stack that calls rt_sigreturn when the
handler returns.
rt_sigreturn reads the saved context from the signal frame, restores the
signal mask, and overwrites the SYSCALL saved registers so that the return
to user space resumes the original interrupted code path.
Signal frame layout (on user stack)
[pretcode] 8 bytes — address of __restore_rt trampoline
[siginfo] 128 bytes — siginfo_t
[ucontext] variable — contains:
uc_flags 8 bytes
uc_link 8 bytes
uc_stack 24 bytes (ss_sp, ss_flags, ss_size)
[sigcontext] 256 bytes (32 × u64: r8–r15, rdi, rsi, rbp, rbx, rdx, rax, rcx, rsp, rip, rflags, ...)
uc_sigmask 8 bytes — saved signal mask
[__restore_rt code] 9 bytes — `mov eax, 15; syscall`
Implementation
osl/src/signal.rs — sys_rt_sigreturn
ioctl (nr 16)
Linux Signature
int ioctl(int fd, unsigned long request, ...);
Description
Manipulates the underlying device parameters of special files. Commonly used to query terminal attributes (TCGETS, TIOCGWINSZ, etc.).
Current Implementation
Always returns -ENOTTY (-25), indicating the file descriptor does not refer to a terminal. All arguments are ignored.
This is sufficient for musl’s stdio, which calls ioctl(fd, TIOCGWINSZ, ...) to check if stdout is a terminal for line buffering decisions. Receiving -ENOTTY causes musl to treat the fd as a non-terminal and use full buffering.
Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch
Future Work
- Return TIOCGWINSZ data for the VGA console (80x25) so musl recognises it as a terminal.
- Implement TCGETS/TCSETS for basic terminal attribute support.
- Dispatch based on fd to different device drivers.
writev (nr 20)
Linux Signature
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
Where struct iovec is:
struct iovec {
void *iov_base; // Starting address
size_t iov_len; // Number of bytes
};
Description
Writes data from multiple buffers (a “scatter/gather” array) to a file descriptor in a single atomic operation. This is what musl’s printf uses internally instead of plain write.
Current Implementation
Looks up fd in the current process’s per-process file descriptor table. Iterates through iovcnt iovec entries (each 16 bytes: iov_base: u64, iov_len: u64). For each non-empty buffer, calls FileHandle::write() on the handle.
- Console fds (stdout/stderr): Each buffer is printed via ConsoleHandle::write() (UTF-8 with ASCII fallback).
- Invalid fds: Returns -EBADF (-9).
- Returns the total number of bytes written across all iovec entries on success.
- Short-circuits on error from any individual write.
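The gather loop can be sketched as below. To stay testable on a host, the sketch takes already-validated slices instead of raw iovec pointers; the Sink trait is a stand-in for the kernel's FileHandle, and its signature is an assumption.

```rust
/// Stand-in for FileHandle::write: returns bytes written or a negative errno.
pub trait Sink {
    fn write(&mut self, buf: &[u8]) -> Result<usize, i64>;
}

/// Test sink that simply appends to a byte vector.
impl Sink for Vec<u8> {
    fn write(&mut self, buf: &[u8]) -> Result<usize, i64> {
        self.extend_from_slice(buf);
        Ok(buf.len())
    }
}

/// Writes each non-empty buffer in order, summing the byte counts and
/// stopping at the first error.
pub fn writev(sink: &mut dyn Sink, iov: &[&[u8]]) -> Result<usize, i64> {
    let mut total = 0;
    for buf in iov {
        if buf.is_empty() {
            continue;
        }
        total += sink.write(buf)?;
    }
    Ok(total)
}
```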
Source: osl/src/syscalls/io.rs — sys_writev
Future Work
- Validate that iov and all iov_base pointers are valid user-space addresses.
- Handle partial writes.
- Cap iovcnt at UIO_MAXIOV (1024) per Linux convention.
pipe (nr 22) / pipe2 (nr 293)
Linux Signature
int pipe(int pipefd[2]);
int pipe2(int pipefd[2], int flags);
Description
Creates a unidirectional data channel (pipe). Returns two file descriptors: pipefd[0] for reading and pipefd[1] for writing.
Both syscalls share the same implementation: pipe(fds) is dispatched as
pipe2(fds, 0) (no flags).
Current Implementation
- Create pipe: Allocates a PipeInner (shared VecDeque<u8> buffer) wrapped in PipeReader and PipeWriter handles.
- Allocate fds: Allocates two file descriptors in the process’s fd table.
- Apply flags: If O_CLOEXEC (0o2000000) is set, both fds get the FD_CLOEXEC flag.
- Write to user buffer: Writes [read_fd, write_fd] as two i32 values to the user buffer.
Pipe Semantics
- Read: If the buffer is empty and the writer is still open, the reader blocks via block_current_thread(). When the writer appends data, it wakes the blocked reader. Returns 0 (EOF) if the writer has been closed and the buffer is empty.
- Write: Appends data to the shared buffer and wakes any blocked reader. Currently unbounded (no backpressure).
- Close: Closing the write end sets write_closed = true and wakes any blocked reader (so it gets EOF). Closing the read end drops the reader’s Arc reference.
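These semantics can be modelled single-threaded by reporting "would block" instead of actually sleeping. The sketch below is only a model: the ReadOutcome enum and the method names are hypothetical, and the real PipeInner is shared between PipeReader and PipeWriter and integrates with the scheduler.

```rust
use std::collections::VecDeque;

/// Simplified model of the shared pipe state.
#[derive(Default)]
pub struct PipeInner {
    buf: VecDeque<u8>,
    write_closed: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReadOutcome {
    Data(usize), // bytes copied into the caller's buffer
    WouldBlock,  // the kernel would call block_current_thread() here
    Eof,         // writer closed and buffer drained
}

impl PipeInner {
    /// Unbounded append, mirroring the current lack of backpressure.
    pub fn write(&mut self, data: &[u8]) -> usize {
        self.buf.extend(data.iter().copied());
        data.len()
    }

    pub fn close_writer(&mut self) {
        self.write_closed = true;
    }

    pub fn read(&mut self, out: &mut [u8]) -> ReadOutcome {
        if self.buf.is_empty() {
            return if self.write_closed { ReadOutcome::Eof } else { ReadOutcome::WouldBlock };
        }
        let n = out.len().min(self.buf.len());
        for (slot, byte) in out.iter_mut().zip(self.buf.drain(..n)) {
            *slot = byte;
        }
        ReadOutcome::Data(n)
    }
}
```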
Source: osl/src/syscalls/fs.rs — sys_pipe2, osl/src/syscalls/mod.rs — pipe(22) dispatch, libkernel/src/file.rs — PipeReader, PipeWriter, make_pipe
Usage from C (musl)
#include <unistd.h>
#include <fcntl.h>
int fds[2];
pipe2(fds, O_CLOEXEC);
write(fds[1], "hello", 5);
close(fds[1]);
char buf[32];
ssize_t n = read(fds[0], buf, sizeof(buf)); // n = 5
close(fds[0]);
Errors
| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid pipefd pointer |
| -EMFILE (-24) | Per-process fd limit reached (64) |
Future Work
- Bounded buffer with write-side blocking (backpressure).
- O_NONBLOCK flag support.
madvise (nr 28)
Linux Signature
int madvise(void *addr, size_t length, int advice);
Description
Give advice about use of memory.
Current Implementation
Stub: Returns 0 (success) unconditionally. All advice is ignored. musl and Rust std may call madvise(MADV_DONTNEED) on freed memory regions.
Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch
dup2 (nr 33)
Linux Signature
int dup2(int oldfd, int newfd);
Description
Duplicates file descriptor oldfd to newfd. If newfd is already open, it is silently closed first.
Current Implementation
- If oldfd == newfd: validates that oldfd is open, returns newfd.
- Reads the FdEntry (handle + flags) from oldfd.
- Clones the Arc<dyn FileHandle> and installs it at newfd.
- The new fd does not inherit FD_CLOEXEC from the old fd (per POSIX).
- If newfd was previously open, its old handle is dropped (refcount decremented).
- The fd table is extended if newfd exceeds the current length.
Source: osl/src/syscalls/fs.rs — sys_dup2
Errors
| Errno | Condition |
|---|---|
| -EBADF (-9) | oldfd is not a valid open fd |
getpid (nr 39)
Linux Signature
pid_t getpid(void);
Description
Returns the process ID of the calling process.
Current Implementation
Returns current_pid().as_u64(). Always succeeds (no error return).
Source: osl/src/syscalls/process.rs — sys_getpid
clone (nr 56)
Linux Signature
long clone(unsigned long flags, void *child_stack, int *ptid, int *ctid, unsigned long tls);
Description
Creates a new process. ostoo supports the specific flag combination used by musl’s posix_spawn: CLONE_VM | CLONE_VFORK | SIGCHLD (0x4111). The child shares the parent’s address space and the parent blocks until the child calls execve or _exit.
Current Implementation
Only the flag combination CLONE_VM | CLONE_VFORK | SIGCHLD is accepted. Other flag combinations return -ENOSYS.
- Validate arguments: child_stack must be non-zero. Unsupported flags return -ENOSYS.
- Read parent state: Copies pml4_phys, cwd, fd_table (Arc clones), brk_*, mmap_* from the parent process.
- Capture user registers: Reads user_rip, user_rflags, and user_r9 from PerCpuData (saved by the SYSCALL entry stub). These are needed so the child can “return from syscall” at the same instruction as the parent.
- Create child process: New PID, same pml4_phys as parent (CLONE_VM), inherited fd table and cwd. Sets vfork_parent_thread to the parent’s scheduler thread index.
- Spawn clone thread: Creates a scheduler thread via spawn_clone_thread that enters clone_trampoline. The trampoline sets up kernel state and drops to ring 3 at user_rip with RAX=0 (child return value) and R9=user_r9 (musl’s __clone fn pointer).
- Block parent: Calls block_current_thread() (CLONE_VFORK semantics). The parent is unblocked when the child calls execve or _exit.
- Return: After unblocking, returns the child’s PID to the parent.
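The argument validation in the first step can be sketched as below. The flag values are the standard Linux constants and sum to 0x4111; the function name and the order of the two checks are assumptions.

```rust
pub const CLONE_VM: u64 = 0x100;
pub const CLONE_VFORK: u64 = 0x4000;
pub const SIGCHLD: u64 = 17; // exit signal, carried in the low byte of flags
pub const SUPPORTED_FLAGS: u64 = CLONE_VM | CLONE_VFORK | SIGCHLD;

pub const ENOSYS: i64 = 38;
pub const EINVAL: i64 = 22;

/// Accepts only the posix_spawn flag combination and a non-NULL child stack.
/// Checking flags before the stack pointer is an assumption of this sketch.
pub fn check_clone_args(flags: u64, child_stack: u64) -> Result<(), i64> {
    if flags != SUPPORTED_FLAGS {
        return Err(-ENOSYS);
    }
    if child_stack == 0 {
        return Err(-EINVAL);
    }
    Ok(())
}
```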
Source: osl/src/clone.rs — sys_clone, libkernel/src/task/scheduler.rs — spawn_clone_thread, clone_trampoline
Usage from C (musl)
Not called directly — musl’s posix_spawn uses it internally:
#include <spawn.h>
#include <sys/wait.h>
pid_t child;
int err = posix_spawn(&child, "/hello", NULL, NULL, argv, envp);
if (err == 0) {
int status;
waitpid(child, &status, 0);
}
Errors
| Errno | Condition |
|---|---|
| -ENOSYS (-38) | Unsupported flag combination |
| -EINVAL (-22) | child_stack is NULL |
Design Notes
- musl’s __clone assembly stores the child function pointer in R9 before syscall. The entry stub saves R9 to PerCpuData.user_r9 (offset 32), and clone_trampoline restores it via jump_to_userspace (rax=0, r9=user_r9).
- The child shares the parent’s PML4 (CLONE_VM). After execve, the child gets a fresh PML4. The old shared PML4 continues to be used by the parent.
execve (nr 59)
Linux Signature
int execve(const char *pathname, char *const argv[], char *const envp[]);
Description
Replaces the current process image with a new ELF binary. On success, the calling process’s address space, stack, and brk are replaced; the process continues execution at the new program’s entry point. On failure, the original process is unchanged.
Current Implementation
1. Copy arguments from userspace: Reads pathname (null-terminated string), argv (NULL-terminated array of string pointers), and envp (NULL-terminated array of string pointers) into kernel buffers before destroying the address space.
2. Resolve path: Resolves relative to the process’s cwd.
3. Read ELF from VFS: Loads the entire ELF binary via devices::vfs::read_file().
4. Parse ELF: Extracts PT_LOAD segments, entry point, and program headers via libkernel::elf::parse.
5. Create fresh PML4: Allocates a new user page table (kernel entries 256–510 are copied from the active PML4). The old PML4 and its user-half page tables are freed after switching CR3 (skipped for CLONE_VM shared PML4s).
6. Map ELF segments: Maps each PT_LOAD segment into the new PML4 with correct permissions (R/W/X).
7. Map user stack: 8 pages (32 KiB) at 0x0000_7FFF_F000_0000.
8. Build initial stack: Writes argc, argv pointers, envp pointers, and auxiliary vector (AT_PHDR, AT_PHENT, AT_PHNUM, AT_PAGESZ, AT_ENTRY, AT_UID, AT_RANDOM) onto the user stack.
9. Update process: Sets new pml4_phys, entry_point, user_stack_top, brk_base/brk_current, resets mmap_next/mmap_regions. Calls close_cloexec_fds() to close all file descriptors with FD_CLOEXEC set. Resets FS_BASE to 0 (new program’s libc will set up TLS).
10. Unblock vfork parent: If this process was created by clone(CLONE_VFORK), unblocks the parent thread.
11. Jump to userspace: Switches CR3 to the new PML4 and does iretq to the new entry point. Never returns.
On any error before step 9, returns a negative errno — the original process is unchanged.
Source: osl/src/exec.rs — sys_execve
Errors
| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid pathname, argv, or envp pointer |
| -ENOENT (-2) | File not found on VFS |
| -ENOEXEC (-8) | Invalid ELF binary or no loadable segments |
| -EINVAL (-22) | Too many arguments (>256) |
Future Work
- Support #! (shebang) script execution.
- Proper AT_RANDOM with real randomness instead of a fixed address.
exit (nr 60) / exit_group (nr 231)
Linux Signature
void _exit(int status); // nr 60
void exit_group(int status); // nr 231
Description
- exit (60): Terminates the calling thread.
- exit_group (231): Terminates all threads in the calling process.
Both are handled identically in ostoo since each process currently has exactly one thread.
Current Implementation
- Looks up the current PID.
- If it’s a user process (not ProcessId::KERNEL), calls terminate_process:
  - Logs pid N exited with code C to serial.
  - Unblocks vfork parent: If this process was created by clone(CLONE_VFORK) and has not yet called execve, unblocks the parent thread so it can resume. Clears vfork_parent_thread.
  - Closes all fds: Releases IRQ handles, completion ports, pipes, channels, etc. while the process’s page tables are still active.
  - Frees user address space: Switches CR3 to the kernel boot PML4 and updates the scheduler’s thread record, then frees all user-half page tables and data frames via cleanup_user_address_space. Skipped for CLONE_VM children (shared PML4 still used by parent).
  - Marks zombie: Sets the process state to Zombie with the exit code.
  - Wakes parent: Queues SIGCHLD and unblocks the parent’s wait_thread if set.
  - Yields + dies: Donates remaining quantum to the parent, calls yield_now(), then kill_current_thread() marks the thread as Dead.
- If it’s a kernel thread: prints a halt message and calls kill_current_thread().
Zombie processes are reaped by waitpid (when a parent collects exit status) or lazily by reap_zombies() at the start of spawn_process.
Source: osl/src/syscalls/process.rs — sys_exit, libkernel/src/process.rs — terminate_process
CR3 safety on exit
The process’s PML4 frame must not be freed while CR3 still references it.
The frame allocator uses an intrusive free-list that overwrites the first 8
bytes of freed frames immediately; if the scheduler later reschedules the
dying thread (before kill_current_thread runs), a TLB refill through the
corrupted PML4 would triple-fault. terminate_process therefore switches
to the kernel boot PML4 (stored in KERNEL_PML4_PHYS during
memory::init_services) and updates the scheduler via set_current_cr3
before calling cleanup_user_address_space.
Future Work
- Properly distinguish exit (single thread) from exit_group (all threads) once multi-threaded processes are supported.
- Service auto-cleanup: remove service registry entries on process exit.
wait4 (nr 61)
Linux Signature
pid_t wait4(pid_t pid, int *wstatus, int options, struct rusage *rusage);
Description
Waits for a child process to change state (typically exit). Returns the PID of the child whose state changed, and optionally writes the exit status.
Current Implementation
Called as syscall number 61 (wait4). The rusage parameter is ignored.
1. Determines the calling process’s PID (parent_pid).
2. Interprets pid argument:
   - -1: Wait for any child process.
   - > 0: Wait for the specific child with that PID.
3. Searches the process table for a zombie child matching the criteria via find_zombie_child(parent_pid, target_pid).
4. If a zombie child is found:
   - Writes the exit status to the user-space wstatus pointer (if non-NULL), encoded as (exit_code << 8) matching Linux’s WEXITSTATUS macro.
   - Reaps the child process (removes from process table, frees kernel stack).
   - Restores the console foreground to the parent process.
   - Returns the child’s PID.
5. If no zombie child exists but living children do:
   - Registers the current scheduler thread index in the parent’s wait_thread field.
   - Calls block_current_thread() to sleep.
   - When woken (by a child calling sys_exit), loops back to step 3.
6. If no children exist at all: Returns -ECHILD (-10).
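The status encoding and its decoding by the WEXITSTATUS and WIFEXITED macros can be shown in a few lines. The function names are illustrative, and masking the exit code to 8 bits before shifting is an assumption of this sketch (the text states only exit_code << 8).

```rust
/// Encodes a normal exit: exit code in bits 8..16, low 7 bits zero.
pub fn encode_exit_status(exit_code: i32) -> i32 {
    (exit_code & 0xFF) << 8
}

/// Equivalent of the C macro WEXITSTATUS(status).
pub fn wexitstatus(status: i32) -> i32 {
    (status >> 8) & 0xFF
}

/// Equivalent of the C macro WIFEXITED(status): true when the low
/// 7 bits (terminating signal number) are zero.
pub fn wifexited(status: i32) -> bool {
    status & 0x7F == 0
}
```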
Source: osl/src/syscalls/process.rs — sys_wait4
Usage from C (musl)
#include <sys/wait.h>
#include <sys/syscall.h>
/* Wait for specific child */
int status;
pid_t child = syscall(SYS_wait4, child_pid, &status, 0, 0);
int exit_code = WEXITSTATUS(status); /* (status >> 8) & 0xFF */
/* Wait for any child */
pid_t any = syscall(SYS_wait4, -1, &status, 0, 0);
Errors
| Errno | Condition |
|---|---|
| -ECHILD (-10) | Calling process has no children |
Future Work
- Support WNOHANG option (return immediately if no child has exited).
- Support WUNTRACED and WCONTINUED for stopped/continued children.
- Populate struct rusage with resource usage statistics.
- Handle the case where multiple children exit simultaneously.
kill (nr 62)
Send a signal to a process.
Signature
kill(pid: pid_t, sig: int) → 0 or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| pid | rdi | Target process ID |
| sig | rsi | Signal number (1–31) |
Return value
Returns 0 on success.
Errors
| Error | Condition |
|---|---|
| EINVAL | Signal number is out of range (< 1 or > 31) |
| ESRCH | No process with the given PID exists |
Description
Queues the specified signal on the target process. The signal is delivered before the process next returns to user space (checked after syscalls and interrupts).
Currently only supports sending to a specific PID. Negative PIDs (process groups) and PID 0 (current process group) are not yet supported.
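A minimal sketch of the range check and queuing step, assuming pending signals are kept as one bit per signal number. The bitmask representation and the name queue_signal are assumptions for illustration; the actual data structure in osl/src/signal.rs may differ, and the PID lookup (ESRCH) is omitted.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical sketch: mark `sig` as pending in a per-process bitmask. */
static int queue_signal(uint32_t *pending, int sig)
{
    if (sig < 1 || sig > 31)
        return -EINVAL;                 /* out of range, as in the Errors table */
    *pending |= (uint32_t)1 << sig;     /* delivered before the next return to user space */
    return 0;
}
```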
Implementation
osl/src/signal.rs — sys_kill
See also
fcntl (nr 72)
Linux Signature
int fcntl(int fd, int cmd, ... /* arg */);
Description
Performs operations on file descriptors. Only fd-level flag operations are supported.
Current Implementation
| Command | Value | Behaviour |
|---|---|---|
| F_GETFD | 1 | Returns the fd flags (currently only FD_CLOEXEC) |
| F_SETFD | 2 | Sets the fd flags to arg |
| F_GETFL | 3 | Returns 0 (no file status flags tracked) |
| Other | — | Returns -EINVAL |
Source: osl/src/syscalls/fs.rs — sys_fcntl
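The command table above amounts to a small switch. The following sketch uses a plain int as a stand-in for the per-descriptor flag word; fcntl_sketch and that stand-in are hypothetical, and the fd lookup that produces -EBADF is left out.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical sketch of the dispatch; fd_flags stands in for the
 * flag word stored with the descriptor. */
static long fcntl_sketch(int *fd_flags, int cmd, long arg)
{
    switch (cmd) {
    case F_GETFD:
        return *fd_flags;            /* currently only FD_CLOEXEC */
    case F_SETFD:
        *fd_flags = (int)arg;        /* replace the fd flags with arg */
        return 0;
    case F_GETFL:
        return 0;                    /* no file status flags tracked */
    default:
        return -EINVAL;              /* every other command */
    }
}
```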
Errors
| Errno | Condition |
|---|---|
| -EBADF (-9) | fd is not a valid open fd |
| -EINVAL (-22) | Unknown cmd |
getcwd (nr 79)
Linux Signature
char *getcwd(char *buf, size_t size);
Description
Copies the absolute pathname of the current working directory into buf. On success, returns buf. On failure, returns NULL and sets errno.
Current Implementation
- Validates that buf is within user address space. Returns -EFAULT (-14) if not.
- Reads the cwd field from the current process’s Process struct.
- Checks that size is large enough to hold the cwd string plus a null terminator. Returns -ERANGE (-34) if too small.
- Copies the cwd string and null terminator into the user buffer.
- Returns buf (the pointer value) on success, matching the libc getcwd() convention where the return value is the buffer address (the raw Linux syscall returns the string length instead).
Each process has its own cwd field (default "/"), updated by chdir.
Source: osl/src/syscalls/fs.rs — sys_getcwd
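The size check and copy can be sketched as follows. copy_cwd is a hypothetical name, and the user-pointer validation (-EFAULT) is omitted because it depends on kernel-internal helpers.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy the cwd string into buf if it fits. */
static long copy_cwd(const char *cwd, char *buf, size_t size)
{
    size_t needed = strlen(cwd) + 1;    /* string plus null terminator */
    if (size < needed)
        return -ERANGE;                 /* buffer too small */
    memcpy(buf, cwd, needed);
    return (long)(intptr_t)buf;         /* pointer value, as described above */
}
```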
Usage from C (musl)
#include <unistd.h>
char buf[256];
if (getcwd(buf, sizeof(buf)) != NULL) {
    /* buf contains the current working directory */
}
Or via raw syscall:
#include <sys/syscall.h>
char buf[256];
long ret = syscall(SYS_getcwd, buf, sizeof(buf));
/* ret > 0 on success (pointer to buf) */
Future Work
- Support getcwd(NULL, 0) which auto-allocates a buffer (musl handles this in userspace).
chdir (nr 80)
Linux Signature
int chdir(const char *path);
Description
Changes the current working directory to path.
Current Implementation
- Reads a null-terminated path string from user space (max 4096 bytes). Returns -EFAULT (-14) if the pointer is invalid.
- Resolves the path relative to the process’s current cwd. Normalises . and .. components.
- Validates that the resolved path is an existing directory by calling devices::vfs::list_dir() (through osl::blocking::blocking()). This blocks the calling thread while the async VFS operation completes.
- On success, updates the process’s cwd field to the resolved path and returns 0.
- On failure, returns the error from the VFS (typically -ENOENT or -ENOTDIR).
Source: osl/src/syscalls/fs.rs — sys_chdir
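The path resolution step (relative to cwd, with . and .. normalised) could look roughly like this. It is a sketch under assumptions: resolve_path and push_components are hypothetical names, .. at the root is assumed to stay at the root, and the VFS directory check is not modelled.

```c
#include <stddef.h>
#include <string.h>

/* Append the components of src to the normalised path held in out
 * (current length *len, no trailing slash; length 0 means root).
 * Returns 0, or -1 if out would overflow. */
static int push_components(char *out, size_t *len, size_t cap, const char *src)
{
    while (*src) {
        while (*src == '/')
            src++;                               /* skip separators */
        size_t n = strcspn(src, "/");            /* length of this component */
        if (n == 0)
            break;
        if (n == 1 && src[0] == '.') {
            /* "." keeps the current position */
        } else if (n == 2 && src[0] == '.' && src[1] == '.') {
            while (*len > 0 && out[*len - 1] != '/')
                (*len)--;                        /* drop the last name */
            if (*len > 0)
                (*len)--;                        /* drop its leading '/' */
        } else {
            if (*len + 1 + n + 1 > cap)
                return -1;
            out[(*len)++] = '/';
            memcpy(out + *len, src, n);
            *len += n;
        }
        src += n;
    }
    return 0;
}

/* Resolve path against cwd into out; absolute paths ignore cwd. */
static int resolve_path(const char *cwd, const char *path, char *out, size_t cap)
{
    size_t len = 0;
    if (cap < 2)
        return -1;
    if (path[0] != '/' && push_components(out, &len, cap, cwd) < 0)
        return -1;
    if (push_components(out, &len, cap, path) < 0)
        return -1;
    if (len == 0)
        out[len++] = '/';                        /* everything cancelled out: root */
    out[len] = '\0';
    return 0;
}
```

With cwd "/a/b", resolving "../c" gives "/a/c", and resolving "/.." from any cwd gives "/".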
Usage from C (musl)
#include <unistd.h>
if (chdir("/some/path") < 0) {
    /* error */
}
Errors
| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid path pointer |
| -ENOENT (-2) | Path does not exist |
| -ENOTDIR (-20) | A component of the path is not a directory |
| -EIO (-5) | VFS I/O error |
Future Work
- Support fchdir(fd) to change directory via an open directory fd.
sigaltstack (nr 131)
Linux Signature
int sigaltstack(const stack_t *ss, stack_t *old_ss);
Description
Set and/or get the alternate signal stack.
Current Implementation
Stub: Returns 0 (success) unconditionally. No signal support is implemented. Rust’s standard library calls sigaltstack during runtime init to set up an alternate stack for signal handlers.
Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch
arch_prctl (nr 158)
Linux Signature
int arch_prctl(int code, unsigned long addr);
Description
Sets or gets architecture-specific thread state. On x86-64, primarily used to set the FS and GS segment base registers for thread-local storage (TLS).
Current Implementation
- ARCH_SET_FS (0x1002): Writes addr to the IA32_FS_BASE MSR (0xC000_0100). This is how musl sets up its TLS pointer during C runtime initialisation. Returns 0 on success.
- All other codes: Returns -EINVAL (-22).
Source: osl/src/syscalls/misc.rs — sys_arch_prctl
Future Work
- Implement ARCH_GET_FS (0x1003) to read back the current FS base.
- Implement ARCH_SET_GS (0x1001) and ARCH_GET_GS (0x1004) for GS-based TLS.
- Save/restore FS_BASE across context switches if multiple user processes use TLS concurrently (currently each process sets it fresh via the trampoline, but preemption during a syscall could lose the value).
futex (nr 202)
Linux Signature
long futex(uint32_t *uaddr, int futex_op, uint32_t val,
const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3);
Description
Provides fast user-space locking primitives. FUTEX_WAIT blocks the calling thread
if the value at uaddr still equals val, until a FUTEX_WAKE on the same address
wakes it; FUTEX_WAKE wakes threads waiting on uaddr.
Current Implementation
Always returns 0 (success). Each process is single-threaded, so musl’s internal locks
(used by stdio, malloc, etc.) are never contended. The FUTEX_WAIT path is never
reached in practice, and FUTEX_WAKE returning 0 (no waiters woken) is correct.
Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch
Future Work
- Implement FUTEX_WAIT and FUTEX_WAKE properly once multi-threaded user processes are supported.
- Support FUTEX_WAIT_BITSET and other operations used by musl’s condition variables.
sched_getaffinity (nr 204)
Linux Signature
int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
Description
Get a thread’s CPU affinity mask.
Current Implementation
Zeroes the user-provided mask buffer, then sets bit 0 (CPU 0 only). Returns cpusetsize (the number of bytes written). ostoo is a single-CPU kernel.
Rust’s standard library calls sched_getaffinity during runtime init to determine available parallelism.
Source: osl/src/syscalls/misc.rs — sys_sched_getaffinity
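The mask handling is simple enough to sketch directly. fill_affinity_mask is a hypothetical name, and the user-pointer validation (-EFAULT) is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: report CPU 0 as the only available CPU. */
static long fill_affinity_mask(unsigned char *mask, size_t cpusetsize)
{
    if (cpusetsize == 0)
        return -EINVAL;
    memset(mask, 0, cpusetsize);    /* clear every CPU bit */
    mask[0] |= 1;                   /* bit 0: CPU 0 */
    return (long)cpusetsize;        /* number of bytes written */
}
```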
Errors
| Errno | Condition |
|---|---|
| -EINVAL (-22) | cpusetsize is 0 |
| -EFAULT (-14) | Invalid mask pointer |
getdents64 (nr 217)
Linux Signature
int getdents64(unsigned int fd, struct linux_dirent64 *dirp, unsigned int count);
Where struct linux_dirent64 is:
struct linux_dirent64 {
ino64_t d_ino; /* 64-bit inode number */
off64_t d_off; /* 64-bit offset to next entry */
unsigned short d_reclen; /* Size of this dirent */
unsigned char d_type; /* File type */
char d_name[]; /* Null-terminated filename */
};
Description
Reads directory entries from a directory file descriptor into a buffer. Returns the number of bytes written, or 0 when all entries have been consumed.
Current Implementation
- Validates that dirp buffer is within user address space. Returns -EFAULT (-14) if not.
- Looks up fd in the process’s file descriptor table.
- Calls FileHandle::getdents64() on the handle. Only DirHandle implements this; other handle types return -ENOTTY (-25).
- DirHandle maintains an internal cursor. On each call, it serializes entries starting from the cursor position into the user buffer:
  - d_ino: Synthetic inode number (cursor index + 1).
  - d_off: Index of the next entry.
  - d_reclen: Record length, 8-byte aligned. Computed as 8 + 8 + 2 + 1 + strlen(name) + 1, rounded up to 8.
  - d_type: DT_DIR (4) for directories, DT_REG (8) for regular files.
  - d_name: Null-terminated filename, with zero-padding to alignment.
- Returns total bytes written, or 0 when all entries have been read.
The directory listing is loaded entirely at open() time and cached in the DirHandle.
Source: osl/src/syscalls/io.rs — sys_getdents64, osl/src/file.rs — DirHandle::getdents64
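The record layout and the d_reclen formula above can be illustrated with a small serialiser. This is a sketch, not the DirHandle code: dirent64_reclen and write_dirent64 are hypothetical names, and fields are written in native (little-endian on x86-64) byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRENT64_HDR 19u   /* d_ino (8) + d_off (8) + d_reclen (2) + d_type (1) */

/* Record length for a given name: header + name + NUL, rounded up to 8. */
static size_t dirent64_reclen(const char *name)
{
    size_t raw = DIRENT64_HDR + strlen(name) + 1;
    return (raw + 7) & ~(size_t)7;
}

/* Serialise one entry at buf; returns bytes written, or 0 if it does not fit. */
static size_t write_dirent64(unsigned char *buf, size_t space, uint64_t ino,
                             uint64_t next_off, unsigned char type,
                             const char *name)
{
    size_t reclen = dirent64_reclen(name);
    uint16_t reclen16 = (uint16_t)reclen;
    if (reclen > space)
        return 0;
    memset(buf, 0, reclen);                  /* zero padding up to alignment */
    memcpy(buf + 0, &ino, 8);                /* d_ino */
    memcpy(buf + 8, &next_off, 8);           /* d_off */
    memcpy(buf + 16, &reclen16, 2);          /* d_reclen */
    buf[18] = type;                          /* d_type */
    memcpy(buf + 19, name, strlen(name));    /* d_name, NUL already in place */
    return reclen;
}
```

The offsets 16, 18 and 19 match the ones a userspace reader uses to pick d_reclen, d_type and d_name out of the buffer.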
Usage from C (musl)
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>
int fd = open("/", O_RDONLY | O_DIRECTORY);
char buf[2048];
long nread;
while ((nread = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
    long pos = 0;
    while (pos < nread) {
        unsigned short reclen = *(unsigned short *)(buf + pos + 16);
        unsigned char d_type = *(unsigned char *)(buf + pos + 18);
        char *name = buf + pos + 19;
        /* process entry... */
        pos += reclen;
    }
}
close(fd);
Future Work
- Return proper inode numbers from the VFS.
- Support lseek/rewinddir on directory handles.
set_tid_address (nr 218)
Linux Signature
pid_t set_tid_address(int *tidptr);
Description
Sets the clear_child_tid pointer for the calling thread. When the thread exits, the kernel writes 0 to *tidptr and wakes any futex waiters. Returns the caller’s TID.
Current Implementation
- Ignores the tidptr argument entirely (no clear_child_tid tracking).
- Returns the current process’s PID (used as TID since each process is single-threaded).
This is sufficient for musl’s early startup, which calls set_tid_address to discover its own TID.
Source: osl/src/syscalls/process.rs — sys_set_tid_address
Future Work
- Store tidptr in the thread/process structure.
- On thread exit, write 0 to *tidptr and perform a futex wake (needed for pthread_join).
- Return a per-thread TID rather than PID once multi-threading is supported.
clock_gettime (nr 228)
Linux Signature
int clock_gettime(clockid_t clk_id, struct timespec *tp);
Description
Retrieves the time of the specified clock.
Current Implementation
Stub: Writes zero for both tv_sec and tv_nsec in the user-provided timespec struct. The clk_id parameter is accepted but ignored. Returns 0 (success).
This satisfies Rust std’s runtime init which calls clock_gettime(CLOCK_MONOTONIC, ...) but doesn’t depend on the actual time value.
Source: osl/src/syscalls/misc.rs — sys_clock_gettime
Errors
| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid tp pointer |
Future Work
- Return real time based on the PIT/HPET/TSC timer.
- Distinguish CLOCK_REALTIME, CLOCK_MONOTONIC, etc.
set_robust_list (nr 273)
Linux Signature
long set_robust_list(struct robust_list_head *head, size_t len);
Description
Registers a list of robust futexes with the kernel. If a thread exits while holding a robust futex, the kernel marks it as dead and wakes waiters, preventing permanent deadlocks.
Current Implementation
Always returns 0 (success) without recording anything. Both arguments are ignored.
This is sufficient for musl’s startup, which registers a robust list as part of thread initialisation.
Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch
Future Work
- Store the robust list head pointer in the thread structure.
- On thread exit, walk the robust list and wake any futex waiters on held locks.
- Implement get_robust_list (nr 274) for completeness.
getrandom (nr 318)
Linux Signature
ssize_t getrandom(void *buf, size_t buflen, unsigned int flags);
Description
Fills a buffer with random bytes. Used by Rust’s HashMap for hash seed randomisation and by musl for stack canary initialisation.
Current Implementation
Uses a simple xorshift64* PRNG seeded from the x86 TSC (Time Stamp Counter via rdtsc). Fills the user buffer byte-by-byte from the PRNG state. The flags parameter is accepted but ignored.
Note: This is not cryptographically secure. It provides enough entropy for HashMap seeds and similar non-security use cases.
Source: osl/src/syscalls/misc.rs — sys_getrandom
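For reference, a xorshift64* generator and a byte-wise fill loop can be sketched as below. The shift constants and multiplier are the commonly published xorshift64* parameters. Two things are assumptions made for the sketch: one generator step is taken per output byte (using the top byte), and the seed is passed in as a parameter instead of being read with rdtsc.

```c
#include <stddef.h>
#include <stdint.h>

/* One xorshift64* step; *state must be non-zero. */
static uint64_t xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * 0x2545F4914F6CDD1DULL;
}

/* Fill buf byte-by-byte. Not cryptographically secure. */
static void fill_random(unsigned char *buf, size_t len, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;   /* an all-zero state would stay zero */
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)(xorshift64star(&state) >> 56);
}
```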
Errors
| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid buffer pointer |
Future Work
- Seed from a proper entropy source (e.g. RDSEED/RDRAND instructions).
- Distinguish GRND_RANDOM vs GRND_NONBLOCK flags.
io_create (nr 501)
Create a completion port for async I/O.
Signature
io_create(flags: u32) → fd or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| flags | rdi | Reserved, must be 0 |
Return value
On success, returns a file descriptor for the new completion port.
Errors
| Error | Condition |
|---|---|
| EINVAL | flags is non-zero |
| EMFILE | Process fd table is full |
Description
Creates a new kernel CompletionPort object and returns a file descriptor
referring to it. The port fd is used as the first argument to io_submit
and io_wait.
Completion ports are the core async I/O primitive in ostoo. Operations
(reads, writes, timeouts, IRQ waits, IPC send/recv) are submitted to a port
via io_submit and their completions are harvested via io_wait.
Implementation
osl/src/io_port.rs — sys_io_create
See also
io_submit (nr 502)
Submit async I/O operations to a completion port.
Signature
io_submit(port_fd: i32, entries_ptr: *const IoSubmission, count: u32) → processed or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd (from io_create) |
| entries_ptr | rsi | Pointer to array of submission entries |
| count | rdx | Number of entries to submit |
Submission entry layout
struct IoSubmission { // 48 bytes, repr(C)
uint64_t user_data; // Opaque value returned in completion
uint32_t opcode; // Operation type (see below)
uint32_t flags; // Reserved, must be 0
int32_t fd; // Target file descriptor (opcode-dependent)
int32_t _pad;
uint64_t buf_addr; // User buffer address
uint32_t buf_len; // User buffer length
uint32_t offset; // Reserved
uint64_t timeout_ns; // Timeout in nanoseconds (OP_TIMEOUT)
};
Opcodes
| Value | Name | Description |
|---|---|---|
| 0 | OP_NOP | Immediate completion (testing/synchronization) |
| 1 | OP_TIMEOUT | Timer that completes after timeout_ns nanoseconds |
| 2 | OP_READ | Async read from fd into buf_addr |
| 3 | OP_WRITE | Async write from buf_addr to fd |
| 4 | OP_IRQ_WAIT | Wait for an interrupt on an IRQ fd |
| 5 | OP_IPC_SEND | Send an IPC message on a channel send-end fd |
| 6 | OP_IPC_RECV | Receive an IPC message on a channel recv-end fd |
| 7 | OP_RING_WAIT | Wait for a notification fd signal |
Return value
On success, returns the number of entries processed.
Errors
| Error | Condition |
|---|---|
| EFAULT | entries_ptr is invalid |
| EBADF | port_fd is not a valid completion port |
Per-entry errors (EBADF, EFAULT, EINVAL) are reported via the completion result field rather than failing the entire submission.
Description
Each submission entry describes an async operation. The kernel processes
entries sequentially, spawning async tasks for operations that cannot
complete immediately. When an operation finishes, a completion entry is
posted to the port and can be harvested via io_wait.
For OP_READ, data is read into a kernel buffer and copied to user space
during io_wait (which runs in the process’s syscall context with the
correct page tables).
For OP_IPC_SEND/RECV, buf_addr points to an IpcMessage struct.
File descriptors in msg.fds are transferred across the channel.
For OP_RING_WAIT, fd must be a notification fd (from notify_create).
The completion fires when another process calls notify(fd). The wait is
edge-triggered and one-shot: re-submit to rearm.
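As a sketch of how userspace might declare and fill a submission entry, the following mirrors the field layout listed earlier and checks the 48-byte size at compile time. The C struct name io_submission and the helper prep_timeout are hypothetical, and leaving fd at 0 for OP_TIMEOUT is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OP_TIMEOUT 1u

struct io_submission {
    uint64_t user_data;   /* opaque value returned in the completion */
    uint32_t opcode;
    uint32_t flags;       /* reserved, must be 0 */
    int32_t  fd;
    int32_t  _pad;
    uint64_t buf_addr;
    uint32_t buf_len;
    uint32_t offset;      /* reserved */
    uint64_t timeout_ns;
};

_Static_assert(sizeof(struct io_submission) == 48, "entry must be 48 bytes");
_Static_assert(offsetof(struct io_submission, buf_addr) == 24, "buf_addr offset");
_Static_assert(offsetof(struct io_submission, timeout_ns) == 40, "timeout_ns offset");

/* Hypothetical helper: build an OP_TIMEOUT entry with all reserved fields zero. */
static struct io_submission prep_timeout(uint64_t user_data, uint64_t timeout_ns)
{
    struct io_submission s;
    memset(&s, 0, sizeof s);
    s.user_data = user_data;
    s.opcode = OP_TIMEOUT;
    s.timeout_ns = timeout_ns;
    return s;
}
```

An entry built this way would be passed to io_submit in an array, with count set to the number of entries.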
Implementation
osl/src/io_port.rs — sys_io_submit
See also
io_wait (nr 503)
Wait for completions on a completion port.
Signature
io_wait(port_fd: i32, completions_ptr: *mut IoCompletion, max: u32, min: u32, timeout_ns: u64) → count or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd |
| completions_ptr | rsi | User buffer for completion entries |
| max | rdx | Maximum completions to return |
| min | r10 | Minimum completions before returning (0 = non-blocking poll) |
| timeout_ns | r8 | Timeout in nanoseconds (0 = wait forever) |
Completion entry layout
struct IoCompletion { // 24 bytes, repr(C)
uint64_t user_data; // Copied from submission
int64_t result; // Bytes transferred, or negative errno
uint32_t flags; // Reserved
uint32_t opcode; // Operation that completed
};
Return value
On success, returns the number of completions written to completions_ptr
(between 0 and max).
Errors
| Error | Condition |
|---|---|
| EFAULT | completions_ptr is invalid |
| EBADF | port_fd is not a valid completion port |
Description
Blocks the calling thread until at least min completions are available on
the port, or the timeout expires. Drains up to max completions and
copies them to user memory.
For OP_READ completions, the kernel buffer containing read data is copied to the user-space destination address that was specified in the original submission.
For OP_IPC_RECV completions, transferred file descriptors are installed
in the receiver’s fd table and the IpcMessage.fds array is rewritten with
the new fd numbers before copying to user memory.
The timeout is implemented as a cancellable async timer task. A timeout of 0 means wait forever (no timeout).
Implementation
osl/src/io_port.rs — sys_io_wait
See also
irq_create (nr 504)
Create a file descriptor for receiving hardware interrupts.
Signature
irq_create(gsi: u32) → fd or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| gsi | rdi | Global System Interrupt number |
Return value
On success, returns a file descriptor for the IRQ object.
Errors
| Error | Condition |
|---|---|
| ENOMEM | No free dynamic interrupt vectors available |
| EINVAL | Failed to program the IO APIC for the given GSI |
| EMFILE | Process fd table is full |
Description
Allocates a dynamic interrupt vector, programs the IO APIC to route the specified GSI to that vector (edge-triggered, active-high, initially masked), and returns an fd referring to the IRQ object.
The IRQ fd is used with io_submit OP_IRQ_WAIT to asynchronously wait
for interrupts via a completion port. When an interrupt fires, the
registered completion port receives a completion with the user_data from the
submission. The GSI is then re-masked until the next OP_IRQ_WAIT is
submitted.
When the fd is closed, the original IO APIC redirection entry is restored and the dynamic vector is freed.
Implementation
osl/src/irq.rs — sys_irq_create
See also
- io_submit (502): OP_IRQ_WAIT opcode
- Completion Port Design
ipc_create (nr 505)
Create an IPC channel with a send end and a receive end.
Signature
ipc_create(fds_ptr: *mut [i32; 2], capacity: u32, flags: u32) → 0 or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| fds_ptr | rdi | Pointer to 2-element i32 array for [send_fd, recv_fd] |
| capacity | rsi | Channel buffer capacity (max queued messages) |
| flags | rdx | IPC_CLOEXEC (0x1) to set FD_CLOEXEC on both fds |
Return value
On success, writes [send_fd, recv_fd] to fds_ptr and returns 0.
Errors
| Error | Condition |
|---|---|
| EFAULT | fds_ptr is invalid |
| EINVAL | Unknown flags |
| EMFILE | Process fd table is full |
Description
Creates an IPC channel and returns two file descriptors: a send end and a
receive end. Messages are fixed-size IpcMessage structs (48 bytes)
containing:
struct IpcMessage { // repr(C)
uint64_t tag; // User-defined message type
uint64_t data[3]; // 24 bytes inline payload
int32_t fds[4]; // File descriptors to transfer (-1 = unused)
};
The channel supports capability-based fd passing: file descriptors listed in
msg.fds are duplicated from the sender’s fd table and installed in the
receiver’s fd table on delivery.
Channels can be used in both blocking mode (via ipc_send/ipc_recv
syscalls) and async mode (via OP_IPC_SEND/OP_IPC_RECV on a completion
port).
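A sender would normally start from a message whose fd slots are all marked unused. The sketch below shows one way to do that; the C struct name ipc_message and the helper ipc_message_init are hypothetical, with the fields taken from the layout above.

```c
#include <stdint.h>
#include <string.h>

struct ipc_message {
    uint64_t tag;        /* user-defined message type */
    uint64_t data[3];    /* inline payload */
    int32_t  fds[4];     /* fds to transfer, -1 = unused */
};

/* Hypothetical helper: zero the payload and mark every fd slot unused. */
static void ipc_message_init(struct ipc_message *m, uint64_t tag)
{
    memset(m, 0, sizeof *m);
    m->tag = tag;
    for (int i = 0; i < 4; i++)
        m->fds[i] = -1;   /* nothing is transferred unless a slot is filled in */
}
```

Forgetting this step would leave the fd slots at 0, which the description above suggests would be treated as a request to transfer fd 0.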
Implementation
osl/src/ipc.rs — sys_ipc_create
See also
ipc_send (nr 506)
Send a message on an IPC channel (blocking).
Signature
ipc_send(fd: i32, msg_ptr: *const IpcMessage, flags: u32) → 0 or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| fd | rdi | Channel send-end fd (from ipc_create) |
| msg_ptr | rsi | Pointer to IpcMessage to send |
| flags | rdx | IPC_NONBLOCK (0x1) for non-blocking mode |
Return value
On success, returns 0.
Errors
| Error | Condition |
|---|---|
| EFAULT | msg_ptr is invalid |
| EBADF | fd is not a valid channel send-end |
| EPIPE | Receive end has been closed |
| EAGAIN | IPC_NONBLOCK set and channel is full |
| EMFILE | (receiver) fd table full during fd transfer |
Description
Sends a message through the channel. If the channel buffer is full and
IPC_NONBLOCK is not set, the calling thread blocks until the receiver
drains space.
If msg.fds contains valid file descriptors (not -1), those fd objects are
extracted from the sender’s fd table and transferred to the receiver. The
sender’s fds remain open — this is a dup, not a move.
When a receiver is blocked waiting via ipc_recv, the send uses scheduler
donate to directly switch to the receiver thread for low-latency delivery.
For async (non-blocking, multiplexed) sending, use OP_IPC_SEND via
io_submit instead.
Implementation
osl/src/ipc.rs — sys_ipc_send
See also
- ipc_create (505)
- ipc_recv (507)
- io_submit (502): OP_IPC_SEND for async mode
ipc_recv (nr 507)
Receive a message from an IPC channel (blocking).
Signature
ipc_recv(fd: i32, msg_ptr: *mut IpcMessage, flags: u32) → 0 or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| fd | rdi | Channel recv-end fd (from ipc_create) |
| msg_ptr | rsi | Pointer to IpcMessage buffer for received message |
| flags | rdx | IPC_NONBLOCK (0x1) for non-blocking mode |
Return value
On success, writes the received message to msg_ptr and returns 0.
Errors
| Error | Condition |
|---|---|
| EFAULT | msg_ptr is invalid |
| EBADF | fd is not a valid channel recv-end |
| EPIPE | Send end has been closed and channel is empty |
| EAGAIN | IPC_NONBLOCK set and no message available |
| EMFILE | fd table full during fd transfer installation |
Description
Receives a message from the channel. If no message is available and
IPC_NONBLOCK is not set, the calling thread blocks until a sender posts a
message.
If the received message carries file descriptors (msg.fds entries != -1),
the transferred fd objects are installed in the receiver’s fd table and the
fds array is rewritten with the new fd numbers before being copied to
user memory.
For async (non-blocking, multiplexed) receiving, use OP_IPC_RECV via
io_submit instead.
Implementation
osl/src/ipc.rs — sys_ipc_recv
See also
- ipc_create (505)
- ipc_send (506)
- io_submit (502): OP_IPC_RECV for async mode
shmem_create (nr 508)
Create a shared memory object and return a file descriptor.
Signature
shmem_create(size: u64, flags: u32) → fd or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| size | rdi | Size of the shared memory object in bytes (must be > 0) |
| flags | rsi | Flags: SHM_CLOEXEC (0x01) sets close-on-exec on the fd |
Return value
On success, returns a file descriptor for the shared memory object.
Errors
| Error | Condition |
|---|---|
| EINVAL | size is 0, or unknown flags are set |
| ENOMEM | Not enough physical memory to allocate the backing frames |
| EMFILE | Process fd table is full |
Description
Allocates a shared memory object backed by eagerly-allocated, zeroed physical frames. Returns a file descriptor referring to it.
The fd can be inherited by child processes (via clone + execve, unless
SHM_CLOEXEC is set) or transferred via IPC fd-passing (ipc_send /
ipc_recv). Both sides can then call mmap(MAP_SHARED, fd) to map the
same physical pages into their address spaces.
Physical frames are reference-counted. A frame is freed only when all mappings are removed and the last fd referring to the shared memory object is closed.
Flags
| Flag | Value | Description |
|---|---|---|
| SHM_CLOEXEC | 0x01 | Set close-on-exec on the returned fd (analogous to Linux’s MFD_CLOEXEC) |
Userspace usage (C)
#define SYS_SHMEM_CREATE 508
#define SHM_CLOEXEC 0x01
static long shmem_create(unsigned long size, unsigned int flags) {
return syscall(SYS_SHMEM_CREATE, size, flags);
}
/* Create 4 KiB shared memory, mmap it */
int fd = shmem_create(4096, 0);
void *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Implementation
osl/src/syscalls/shmem.rs — sys_shmem_create
Backing struct: libkernel/src/shmem.rs — SharedMemInner
See also
- mmap (9): MAP_SHARED with a shmem fd
- ipc_send (506): fd-passing for capability transfer
- mmap Design, Phase 5b: anonymous shared memory
notify_create (nr 509)
Create a notification file descriptor for inter-process signaling.
Signature
notify_create(flags: u32) → fd or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| flags | rdi | Flags: NOTIFY_CLOEXEC (0x01) sets close-on-exec on the fd |
Return value
On success, returns a file descriptor for the notification object.
Errors
| Error | Condition |
|---|---|
| EINVAL | Unknown flags are set |
| EMFILE | Process fd table is full |
Description
Creates a notification fd for signaling between processes. The fd is used with two operations:
- Consumer: submits OP_RING_WAIT (opcode 7) via io_submit on the notification fd. The completion port blocks until the producer signals.
- Producer: calls notify(fd) (syscall 510) to signal. If an OP_RING_WAIT is armed, a completion is posted to the consumer’s port.
The notification fd can be passed to child processes via inheritance
(clone + execve) or via IPC fd-passing (ipc_send / ipc_recv).
Semantics
- Edge-triggered, one-shot: one notify() produces one completion. The consumer must re-submit OP_RING_WAIT to receive the next signal.
- Buffered: if notify() is called before OP_RING_WAIT is armed, the notification is buffered. The next OP_RING_WAIT completes immediately. Multiple pre-arm signals coalesce into one.
- Single waiter: only one OP_RING_WAIT can be pending per fd.
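These three rules can be modelled as a tiny state machine. The sketch below is illustrative only: notify_state, notify_arm and notify_signal are hypothetical names, the completion port is reduced to a counter, and returning -EBUSY for a second waiter is an assumption (the text does not say which error is used).

```c
#include <errno.h>
#include <stdbool.h>

struct notify_state {
    bool armed;        /* an OP_RING_WAIT is pending */
    bool buffered;     /* a signal arrived while no wait was armed */
    int  completions;  /* stand-in for completions posted to the port */
};

/* Model of submitting OP_RING_WAIT on the fd. */
static int notify_arm(struct notify_state *n)
{
    if (n->armed)
        return -EBUSY;           /* single waiter; errno choice is an assumption */
    if (n->buffered) {
        n->buffered = false;     /* consume the buffered signal */
        n->completions++;        /* completes immediately */
        return 0;
    }
    n->armed = true;
    return 0;
}

/* Model of the producer calling notify(fd). */
static void notify_signal(struct notify_state *n)
{
    if (n->armed) {
        n->armed = false;        /* one-shot: consumer must re-arm */
        n->completions++;
    } else {
        n->buffered = true;      /* repeated signals coalesce into one */
    }
}
```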
Flags
| Flag | Value | Description |
|---|---|---|
| NOTIFY_CLOEXEC | 0x01 | Set close-on-exec on the returned fd |
Userspace usage (C)
#define SYS_NOTIFY_CREATE 509
#define NOTIFY_CLOEXEC 0x01
static long notify_create(unsigned int flags) {
return syscall(SYS_NOTIFY_CREATE, flags);
}
int nfd = notify_create(0);
Implementation
osl/src/notify.rs — sys_notify_create
Backing struct: libkernel/src/notify.rs — NotifyInner
See also
- notify (510): signal the notification fd
- io_submit (502): OP_RING_WAIT opcode
- Completion Port Design, Phase 4
notify (nr 510)
Signal a notification file descriptor.
Signature
notify(fd: i32) → 0 or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| fd | rdi | Notification fd (from notify_create) |
Return value
Returns 0 on success.
Errors
| Error | Condition |
|---|---|
| EBADF | fd is invalid or does not refer to a notification object |
Description
Signals a notification fd, waking a consumer waiting via OP_RING_WAIT
on a completion port.
If an OP_RING_WAIT is armed on the fd, a completion is posted to the
consumer’s port with result = 0 and opcode = OP_RING_WAIT.
If no OP_RING_WAIT is armed, the notification is buffered. The next
OP_RING_WAIT submission will complete immediately. Multiple buffered
notifications coalesce into one event.
The caller uses scheduler donate (set_donate_target + yield_now) for
low-latency wakeup of the consumer.
Userspace usage (C)
#define SYS_NOTIFY 510
static long notify(int fd) {
return syscall(SYS_NOTIFY, fd);
}
notify(nfd); /* wake consumer */
Implementation
osl/src/notify.rs — sys_notify
See also
- notify_create (509): create the notification fd
- io_submit (502): OP_RING_WAIT opcode
io_setup_rings (nr 511)
Set up shared-memory submission and completion rings on a completion port.
Signature
io_setup_rings(port_fd: i32, params: *mut IoRingParams) → 0 or -errno
Arguments
| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd (from io_create) |
| params | rsi | Pointer to IoRingParams struct (in/out) |
IoRingParams struct
struct io_ring_params {
uint32_t sq_entries; /* IN: requested SQ size (rounded to pow2, max 64) */
uint32_t cq_entries; /* IN: requested CQ size (rounded to pow2, max 128) */
int32_t sq_fd; /* OUT: shmem fd for SQ ring page */
int32_t cq_fd; /* OUT: shmem fd for CQ ring page */
};
Return value
Returns 0 on success. params->sq_entries and params->cq_entries are
updated to the actual (rounded) sizes. params->sq_fd and params->cq_fd
are set to new shmem fds.
Errors
| Error | Condition |
|---|---|
| EFAULT | params pointer is invalid |
| EBADF | port_fd is invalid or not a completion port |
| EBUSY | Ring already set up on this port |
| ENOMEM | Could not allocate ring pages |
| EMFILE | fd table full |
Description
Transitions a completion port into ring mode. After this call:
- io_submit still works (completions go to the CQ ring)
- io_wait returns -EINVAL (replaced by io_ring_enter)
- Completions are posted to the shared CQ ring, readable by userspace without a syscall
The caller must mmap(MAP_SHARED) both the SQ and CQ fds to access the
ring buffers. Each ring is a single 4 KiB page with the layout:
Offset 0: RingHeader (16 bytes)
u32 head — consumer advances
u32 tail — producer advances
u32 mask — capacity - 1
u32 flags — reserved (0)
Offset 64: entries[] (cache-line aligned)
SQ: IoSubmission[sq_entries] — 48 bytes each
CQ: IoCompletion[cq_entries] — 24 bytes each
Head and tail are accessed with atomic load/store operations with acquire/release ordering.
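Given that layout, userspace needs a way to turn a ring index into an entry address. The helpers below are one possible sketch; ring_header, ring_entry and ring_pending are hypothetical names, and both acquire loads are a conservative choice rather than something the text prescribes.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES_OFFSET 64u   /* entries start one cache line in */

struct ring_header {
    uint32_t head;    /* consumer advances */
    uint32_t tail;    /* producer advances */
    uint32_t mask;    /* capacity - 1 */
    uint32_t flags;   /* reserved (0) */
};

/* Address of the entry for a free-running index; the mask wraps it. */
static void *ring_entry(void *ring, uint32_t index, size_t entry_size)
{
    struct ring_header *h = ring;
    return (unsigned char *)ring + RING_ENTRIES_OFFSET
         + (size_t)(index & h->mask) * entry_size;
}

/* Number of entries produced but not yet consumed. */
static uint32_t ring_pending(struct ring_header *h)
{
    uint32_t head = __atomic_load_n(&h->head, __ATOMIC_ACQUIRE);
    uint32_t tail = __atomic_load_n(&h->tail, __ATOMIC_ACQUIRE);
    return tail - head;           /* unsigned wrap-around is intentional */
}
```

Head and tail are treated as free-running counters here, so the subtraction stays correct across 32-bit wrap-around.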
Capacity limits
| Ring | Entry size | Max entries | Calculation |
|---|---|---|---|
| SQ | 48 bytes | 64 | (4096 - 64) / 48 rounded to pow2 |
| CQ | 24 bytes | 128 | (4096 - 64) / 24 rounded to pow2 |
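The calculation in the table rounds down to a power of two. A short sketch of that arithmetic follows; max_ring_entries is a hypothetical helper, and entry_size is assumed to be non-zero and no larger than the usable page area.

```c
#include <stdint.h>

#define RING_PAGE_SIZE      4096u
#define RING_ENTRIES_OFFSET 64u

/* Largest power-of-two entry count that fits in one ring page. */
static uint32_t max_ring_entries(uint32_t entry_size)
{
    uint32_t fit = (RING_PAGE_SIZE - RING_ENTRIES_OFFSET) / entry_size;
    uint32_t pow2 = 1;
    while (pow2 * 2 <= fit)
        pow2 *= 2;
    return pow2;
}
```

For 48-byte entries, 4032 / 48 = 84 entries fit, which rounds down to 64; for 24-byte entries, 168 fit, which rounds down to 128.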
Userspace usage (C)
#define SYS_IO_SETUP_RINGS 511
static long io_setup_rings(int port_fd, struct io_ring_params *p) {
return syscall(SYS_IO_SETUP_RINGS, port_fd, p);
}
int port = io_create(0);
struct io_ring_params params = { .sq_entries = 64, .cq_entries = 128 };
io_setup_rings(port, &params);
void *sq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.sq_fd, 0);
void *cq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.cq_fd, 0);
Implementation
osl/src/io_port.rs — sys_io_setup_rings
See also
- io_create (501) — create the completion port
- io_ring_enter (512) — process SQ entries and wait for CQ entries
- Completion Port Design — Phase 5
io_ring_enter (nr 512)
Process SQ entries and optionally wait for CQ completions on a ring-mode completion port.
Signature
io_ring_enter(port_fd: i32, to_submit: u32, min_complete: u32, flags: u32) → i64
Arguments
| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd (must be in ring mode) |
| to_submit | rsi | Max number of SQ entries to process |
| min_complete | rdx | Min CQ entries to wait for before returning |
| flags | r10 | Reserved, must be 0 |
Return value
On success, returns the number of CQ entries available (tail - head).
Errors
| Error | Condition |
|---|---|
| EBADF | port_fd is invalid or not a completion port |
| EINVAL | Port is not in ring mode, or flags != 0 |
Description
This is the single syscall for ring-mode operation, replacing both
io_submit and io_wait.
Processing phases
- Drain SQ: reads up to to_submit entries from the shared SQ ring (from the kernel’s SQ head to the userspace-written SQ tail). Each SQE is processed identically to io_submit. The SQ head is advanced.
- Flush deferred completions: drains the kernel queue of completions that need syscall-context processing (OP_READ data copy, OP_IPC_RECV fd installation). These are written to the CQ ring.
- Wait: if min_complete > 0, blocks until the CQ ring has at least min_complete entries available. On each wakeup, deferred completions are flushed again.
Dual-mode completion posting
When rings are active, CompletionPort::post() routes completions:
- Simple (no read_buf, no transfer_fds): CQE written directly to the shared CQ ring. This is the fast path for OP_NOP, OP_TIMEOUT, OP_WRITE, OP_IRQ_WAIT, OP_IPC_SEND, OP_RING_WAIT.
- Deferred: pushed to the kernel queue and flushed by io_ring_enter in syscall context. This handles OP_READ and OP_IPC_RECV which need to copy data to user buffers.
Userspace usage (C)
#define SYS_IO_RING_ENTER 512
static long io_ring_enter(int port_fd, unsigned int to_submit,
unsigned int min_complete, unsigned int flags) {
return syscall(SYS_IO_RING_ENTER, port_fd, to_submit, min_complete, flags);
}
/* Write SQE to SQ ring */
uint32_t tail = __atomic_load_n(&sqh->tail, __ATOMIC_RELAXED);
struct io_submission *sqe = sq_entry(sq, tail, sqh->mask);
sqe->opcode = OP_NOP;
sqe->user_data = 42;
__atomic_store_n(&sqh->tail, tail + 1, __ATOMIC_RELEASE);
/* Process 1 SQE, wait for 1 CQE */
io_ring_enter(port, 1, 1, 0);
/* Read CQE from CQ ring */
uint32_t head = __atomic_load_n(&cqh->head, __ATOMIC_RELAXED);
uint32_t cq_tail = __atomic_load_n(&cqh->tail, __ATOMIC_ACQUIRE);
if (head != cq_tail) {
    struct io_completion *cqe = cq_entry(cq, head, cqh->mask);
    /* process cqe */
    __atomic_store_n(&cqh->head, head + 1, __ATOMIC_RELEASE);
}
Implementation
osl/src/io_port.rs — sys_io_ring_enter
See also
- io_setup_rings (511) — set up the shared rings
- io_create (501) — create the completion port
- io_submit (502) — legacy submission (still works in ring mode)
- Completion Port Design — Phase 5
Userspace Shell Design
Status
All phases are complete. The userspace shell runs as the primary user
interface on boot. The kernel shell remains as a fallback when no /shell
binary is found on the filesystem.
Context
The shell was migrated from a kernel actor (kernel/src/shell.rs) to a
ring-3 process: a C program (user/src/shell.c) compiled with musl that reads
raw keypresses from stdin, does its own line editing, and uses syscalls for
file I/O and process management.
Scope decisions:
- Raw keypresses to userspace (no kernel line editing for foreground user processes)
- Minimal commands: echo, ls, cat, pwd, cd, export, env, unset, pid, exit, help, and running programs by name
- Environment variables: shell maintains an env table, passes it to child processes via posix_spawn
- Kernel provides default environment on boot (PATH=/host/bin, HOME=/, TERM=dumb, SHELL=/bin/shell)
- Kernel shell kept as fallback (dormant when userspace shell is foreground)
- No pipes yet
Phase 1: Scheduler Blocking Support ✅ COMPLETE
Goal: Add Blocked thread state so threads can sleep waiting for I/O.
File: libkernel/src/task/scheduler.rs
- Add Blocked to ThreadState enum (line 51)
- Modify preempt_tick (lines 484, 495): treat Blocked like Dead (skip quantum decrement, don’t re-queue)
- Add pub fn block_current_thread(): marks current thread Blocked, spins on enable_and_hlt until rescheduled with non-Blocked state
- Add pub fn unblock(thread_idx: usize): sets thread to Ready, pushes onto ready queue (safe from any context including ISR)
Key detail: Blocking from within syscall_dispatch works because each user process has its own 64 KiB kernel stack (set via PER_CPU.kernel_rsp during context switch). The timer saves/restores the full register state, so when unblocked, execution resumes mid-syscall.
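The state transitions described in this phase can be modelled in a few lines. The following is a toy, single-core model under stated assumptions: Sched, block_current, and the field names are hypothetical, it uses std collections rather than the kernel's no_std types, and it ignores quanta, interrupts, and the enable_and_hlt spin.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThreadState {
    Ready,
    Running,
    Blocked,
    Dead,
}

/// Toy scheduler: thread 0 starts Running, all others start Ready.
/// `threads` must be at least 1.
pub struct Sched {
    pub states: Vec<ThreadState>,
    pub ready: VecDeque<usize>,
    pub current: usize,
}

impl Sched {
    pub fn new(threads: usize) -> Self {
        let mut states = vec![ThreadState::Ready; threads];
        states[0] = ThreadState::Running;
        Sched { states, ready: (1..threads).collect(), current: 0 }
    }

    /// Mark the running thread Blocked; it leaves the CPU on the next tick.
    pub fn block_current(&mut self) {
        self.states[self.current] = ThreadState::Blocked;
    }

    /// Make a Blocked thread runnable again. Unblocking a thread that is
    /// not Blocked is a no-op, so a duplicate wakeup cannot queue it twice.
    pub fn unblock(&mut self, idx: usize) {
        if self.states[idx] == ThreadState::Blocked {
            self.states[idx] = ThreadState::Ready;
            self.ready.push_back(idx);
        }
    }

    /// Timer tick: only a Running thread is re-queued. Blocked and Dead
    /// threads are skipped, which is the rule this phase adds.
    pub fn preempt_tick(&mut self) {
        let prev = self.current;
        if self.states[prev] == ThreadState::Running {
            self.states[prev] = ThreadState::Ready;
            self.ready.push_back(prev);
        }
        if let Some(next) = self.ready.pop_front() {
            self.states[next] = ThreadState::Running;
            self.current = next;
        }
    }
}
```

In this model a blocked thread stays off the ready queue across any number of ticks and only runs again after unblock is called for it.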
Phase 2: File Descriptor Table ✅ COMPLETE
Goal: Per-process FD table with FileHandle trait, refactor existing syscalls.
2a: FileHandle trait + ConsoleHandle
File: libkernel/src/file.rs
- FileError enum (BadFd, IsDirectory, NotATty, TooManyOpenFiles), using snafu for Display
- FileHandle trait: read(&self, buf) -> Result<usize, FileError>, write(&self, buf) -> Result<usize, FileError>, close(&self), kind(), getdents64()
- ConsoleHandle { readable: bool }: write prints to kernel console; read delegates to console input buffer
- Linux errno numeric constants live in osl/src/errno.rs; libkernel has no knowledge of errno numbers
2b: FD table on Process
File: libkernel/src/process.rs
- Add fd_table: Vec<Option<Arc<dyn FileHandle>>> to Process
- Initialize fds 0-2 as ConsoleHandle in Process::new()
- Add alloc_fd(handle) -> Result<usize, FileError> (scan for first None slot)
- Add close_fd(fd: usize) -> Result<(), FileError>
- Add get_fd(fd: usize) -> Result<Arc<dyn FileHandle>, FileError>
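A minimal sketch of the three helpers, assuming a standalone FdTable wrapper rather than fields on Process. The FileHandle trait here is a one-method stand-in, NullHandle is a hypothetical test handle, and MAX_FDS is an assumed limit that the document does not specify.

```rust
use std::sync::Arc;

/// Stand-in for the real FileHandle trait (which has read/write/close/...).
pub trait FileHandle {
    fn name(&self) -> &'static str;
}

/// Hypothetical handle used only to exercise the table.
pub struct NullHandle;
impl FileHandle for NullHandle {
    fn name(&self) -> &'static str {
        "null"
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum FileError {
    BadFd,
    TooManyOpenFiles,
}

/// Assumed upper bound on open descriptors; not taken from the document.
pub const MAX_FDS: usize = 64;

pub struct FdTable {
    slots: Vec<Option<Arc<dyn FileHandle>>>,
}

impl FdTable {
    pub fn new() -> Self {
        FdTable { slots: Vec::new() }
    }

    /// Reuse the lowest free slot, growing the table only if none is free.
    pub fn alloc_fd(&mut self, handle: Arc<dyn FileHandle>) -> Result<usize, FileError> {
        if let Some(fd) = self.slots.iter().position(|s| s.is_none()) {
            self.slots[fd] = Some(handle);
            return Ok(fd);
        }
        if self.slots.len() >= MAX_FDS {
            return Err(FileError::TooManyOpenFiles);
        }
        self.slots.push(Some(handle));
        Ok(self.slots.len() - 1)
    }

    /// Free a slot; closing an unused or out-of-range fd is BadFd.
    pub fn close_fd(&mut self, fd: usize) -> Result<(), FileError> {
        match self.slots.get_mut(fd) {
            Some(slot) if slot.is_some() => {
                *slot = None;
                Ok(())
            }
            _ => Err(FileError::BadFd),
        }
    }

    /// Clone the Arc so the caller can use the handle without holding the table.
    pub fn get_fd(&self, fd: usize) -> Result<Arc<dyn FileHandle>, FileError> {
        self.slots.get(fd).and_then(|s| s.clone()).ok_or(FileError::BadFd)
    }
}
```

Scanning for the first None slot gives the POSIX-style behaviour of always returning the lowest available descriptor, so a closed fd 1 is handed out again before fd 3.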
2c: Refactor syscalls to use FD table
File: osl/src/syscalls/io.rs and osl/src/syscalls/fs.rs
- sys_write / sys_writev: look up fd in process fd_table, call handle.write() (osl/src/syscalls/io.rs)
- sys_read: look up fd, call handle.read() (osl/src/syscalls/io.rs)
- sys_close: call process.close_fd(fd) (osl/src/syscalls/fs.rs)
Phase 3: Console Input (Raw Keypresses) ✅ COMPLETE
Goal: Route decoded keypresses to a buffer that read(0) consumes, with blocking.
3a: Console input buffer
New file: libkernel/src/console.rs
- CONSOLE_INPUT: Mutex<ConsoleInner> with VecDeque<u8> (256 bytes) and blocked_reader: Option<usize>
- FOREGROUND_PID: AtomicU64, the PID of the process that receives keyboard input (0 = kernel)
- push_input(byte): pushes to buffer, calls scheduler::unblock() if a reader is blocked
- read_input(buf) -> usize: drains buffer into buf; if empty, registers blocked_reader and calls block_current_thread(), retries on wake
- set_foreground(pid) / foreground_pid() -> ProcessId
- flush_input(): clear buffer on foreground change
3b: Wire ConsoleHandle::read to console buffer
File: libkernel/src/file.rs
- ConsoleHandle::read() calls console::read_input(buf) when readable == true
3c: Modify keyboard actor routing
File: kernel/src/keyboard_actor.rs
- At top of on_key handler: check console::foreground_pid()
- If non-kernel PID: convert Key to raw byte(s) and call console::push_input():
  - Key::Unicode(c) → ASCII byte (if c.is_ascii())
  - Enter → \n (0x0A)
  - Backspace → 0x7F (DEL)
  - Ctrl+C → 0x03, Ctrl+D → 0x04, Tab → 0x09
  - Arrow keys → VT100 sequences (ESC [A/B/C/D), optional for later
  - Return early (skip kernel line-editor)
- If kernel PID: existing line-editor behavior unchanged
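The key-to-byte mapping above can be sketched as one match expression. The Key enum below is a hypothetical stand-in (the real keyboard actor's type is only partly shown in this document), and key_to_bytes is an illustrative name.

```rust
/// Hypothetical decoded-key type; only `Key::Unicode` appears in the document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Key {
    Unicode(char),
    Enter,
    Backspace,
    Tab,
    /// Ctrl plus a lowercase letter, e.g. Ctrl('c').
    Ctrl(char),
    ArrowUp,
    ArrowDown,
    ArrowRight,
    ArrowLeft,
}

/// Translate one key into the raw bytes a foreground process reads from fd 0.
/// Keys with no mapping (non-ASCII characters, unknown Ctrl chords) yield
/// an empty vector and are dropped.
pub fn key_to_bytes(key: Key) -> Vec<u8> {
    match key {
        Key::Enter => vec![b'\n'],
        Key::Backspace => vec![0x7F], // DEL
        Key::Tab => vec![0x09],
        Key::Ctrl('c') => vec![0x03],
        Key::Ctrl('d') => vec![0x04],
        // VT100 cursor sequences: ESC [ A/B/C/D
        Key::ArrowUp => vec![0x1B, b'[', b'A'],
        Key::ArrowDown => vec![0x1B, b'[', b'B'],
        Key::ArrowRight => vec![0x1B, b'[', b'C'],
        Key::ArrowLeft => vec![0x1B, b'[', b'D'],
        Key::Unicode(c) if c.is_ascii() => vec![c as u8],
        _ => Vec::new(),
    }
}
```

Each returned byte would then be passed to console::push_input() in order, so a multi-byte arrow sequence arrives at the reader as three consecutive bytes.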
Phase 4: VFS Syscalls ✅ COMPLETE
Goal: open, read (files), close, getdents64 so userspace can read files and list directories.
4a: Async-to-sync bridge
File: osl/src/blocking.rs
pub fn blocking<T: Send + 'static>(future: impl Future<Output = T> + Send + 'static) -> T {
    let result = Arc::new(Mutex::new(None));
    let thread_idx = scheduler::current_thread_idx();
    let r = result.clone();
    executor::spawn(Task::new(async move {
        *r.lock() = Some(future.await);
        scheduler::unblock(thread_idx);
    }));
    scheduler::block_current_thread();
    result.lock().take().unwrap()
}
Spawns the async VFS operation as a kernel task, blocks the user thread, unblocks when complete.
4b: VfsHandle (buffered file)
File: osl/src/file.rs
- VfsHandle: holds Vec<u8> content + read position; entire file loaded at open time
- DirHandle: holds Vec<VfsDirEntry> listing + cursor; loaded at open time
4c: sys_open (syscall 2)
File: osl/src/syscalls/fs.rs
- Read null-terminated path from userspace, validate pointer
- Resolve path relative to process cwd (see Phase 5a)
- Use osl::blocking::blocking() to call devices::vfs::read_file() or devices::vfs::list_dir() (try file first, fall back to dir for O_DIRECTORY)
- Wrap in VfsHandle or DirHandle, allocate fd via process.alloc_fd()
- Return fd or -ENOENT
4d: sys_getdents64 (syscall 217)
File: osl/src/syscalls/io.rs
- Look up fd → must be DirHandle
- Serialize entries as linux_dirent64 structs into user buffer (d_ino, d_off, d_reclen, d_type, d_name)
- Return total bytes written, or 0 at end
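As a sketch of the record format, the function below appends one linux_dirent64 record to a byte buffer. The layout (8-byte d_ino, 8-byte d_off, 2-byte d_reclen, 1-byte d_type, then a NUL-terminated name padded so the record length is a multiple of 8) follows the Linux x86_64 ABI. The function name push_dirent64 and the cap parameter are illustrative assumptions, not the kernel's API.

```rust
pub const DT_DIR: u8 = 4;
pub const DT_REG: u8 = 8;

/// Fixed header size: d_ino (8) + d_off (8) + d_reclen (2) + d_type (1).
const HEADER_LEN: usize = 19;

/// Append one record to `out` unless it would exceed `cap` bytes in total.
/// Returns false (leaving `out` untouched) when the record does not fit, so
/// the caller can stop and resume from the same entry on the next call.
/// Assumes names are short enough that the record length fits in a u16.
pub fn push_dirent64(
    out: &mut Vec<u8>,
    cap: usize,
    ino: u64,
    off: i64,
    d_type: u8,
    name: &[u8],
) -> bool {
    // Header + name + NUL terminator, rounded up to an 8-byte boundary.
    let reclen = (HEADER_LEN + name.len() + 1 + 7) & !7;
    if out.len() + reclen > cap {
        return false;
    }
    let start = out.len();
    out.extend_from_slice(&ino.to_le_bytes());
    out.extend_from_slice(&off.to_le_bytes());
    out.extend_from_slice(&(reclen as u16).to_le_bytes());
    out.push(d_type);
    out.extend_from_slice(name);
    // Zero-fill the rest: this writes the NUL terminator and the padding.
    out.resize(start + reclen, 0);
    true
}
```

For a five-byte name the record occupies 32 bytes (19 + 5 + 1 = 25, rounded up), which is the value musl's readdir() reads from d_reclen to find the next entry.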
4e: Existing sys_read/sys_close already work via FD table (Phase 2c)
Phase 5: Process Management Syscalls ✅ COMPLETE
Goal: chdir/getcwd, process creation (clone+execve), waitpid.
5a: chdir / getcwd
File: libkernel/src/process.rs — add cwd: String to Process, default "/"
File: osl/src/syscalls/fs.rs
- sys_chdir (nr 80): validate path exists via osl::blocking::blocking(devices::vfs::list_dir(path)), update process.cwd
- sys_getcwd (nr 79): copy process.cwd to user buffer
5b: Process spawning (clone + execve)
Process creation uses standard Linux clone(CLONE_VM|CLONE_VFORK) + execve.
musl’s posix_spawn and Rust’s std::process::Command work unmodified.
5c: spawn_process_full (kernel-side ELF spawning)
File: osl/src/spawn.rs
- spawn_process_full takes elf_data, argv: &[&[u8]], envp: &[&[u8]], and parent_pid: ProcessId params
- build_initial_stack writes argv strings + pointer array + argc (Linux x86_64 ABI)
File: libkernel/src/process.rs
- parent_pid: ProcessId on Process
- wait_thread: Option<usize> (thread to wake on child exit)
- vfork_parent_thread: Option<usize> (thread to unblock after execve)
5d: waitpid (syscall 61 / wait4)
File: osl/src/syscalls/process.rs
- sys_waitpid(pid, status_ptr, options) -> pid
- Find zombie child matching requested pid (or any child if pid == -1)
- If found: write exit status to userspace, reap, return child PID
- If not found: register wait_thread on parent, block, retry on wake
File: libkernel/src/process.rs
- find_zombie_child(parent, target_pid) -> Option<(ProcessId, i32)>
- In sys_exit: if exiting process has a parent with wait_thread, call unblock()
- Clear foreground to parent when child exits
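The zombie lookup can be sketched over a flat process list. This is an illustration under assumptions: the Proc struct, the plain u64 ProcessId alias, and the slice-based process table are hypothetical, and the real function takes only the parent and target PID.

```rust
/// Hypothetical alias; the kernel's ProcessId type is not shown here.
pub type ProcessId = u64;

/// Hypothetical process record: a zombie is a process whose exit status
/// has been recorded but which has not been reaped yet.
pub struct Proc {
    pub pid: ProcessId,
    pub parent: ProcessId,
    pub exit_status: Option<i32>,
}

/// Find the first exited child of `parent`. A `target` of -1 matches any
/// child; otherwise only the child with exactly that PID is considered.
pub fn find_zombie_child(
    procs: &[Proc],
    parent: ProcessId,
    target: i64,
) -> Option<(ProcessId, i32)> {
    procs
        .iter()
        .filter(|p| p.parent == parent)
        .filter(|p| target == -1 || p.pid as i64 == target)
        .find_map(|p| p.exit_status.map(|status| (p.pid, status)))
}
```

When this returns None, sys_waitpid would register wait_thread and block; the unblock() call in sys_exit then wakes it to run the lookup again.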
Phase 6: Userspace Shell Binary ✅ COMPLETE
Goal: Write shell.c, compile with musl, deploy.
6a: shell.c
New file: user/src/shell.c
- Line editor: read char by char via read(0, &c, 1), handle backspace (erase \b \b), Enter (dispatch), Ctrl+C (cancel line), Ctrl+D (exit on empty line)
- Echo input: shell echoes each typed character with write(1, &c, 1) since kernel delivers raw keypresses
- Command dispatch:
  - echo <text>: print args
  - pwd: getcwd() + print
  - cd <path>: chdir()
  - ls [path]: open() + getdents64() loop + close()
  - cat <file>: open() + read() loop + close()
  - exit: _exit(0)
  - Anything else: try posix_spawn(cmd) + waitpid(), print error if spawn fails
- Process spawning: uses posix_spawn() (musl’s wrapper around clone + execve)
6b: Build
File: user/Makefile — builds src/*.c → bin/ as static musl binaries.
6c: Deploy to disk image
Compiled shell binary is output to user/bin/shell; available in guest via 9p at /host/bin/shell or /bin/shell (fallback root mount).
6d: Auto-launch on boot
File: kernel/src/main.rs
- After VFS is mounted, spawn an async task that reads /shell from VFS and calls spawn_process()
- Set the spawned shell as the foreground process
- If /shell not found, fall back to kernel shell (log a message)
6e: Kernel shell fallback
Automatic via the keyboard routing in Phase 3c: when foreground PID is 0 (kernel), keys go to the kernel shell actor. When the userspace shell exits or crashes, sys_exit resets foreground to parent (kernel), restoring the old behavior.
File Summary
| File | Changes |
|---|---|
| libkernel/src/task/scheduler.rs | Blocked state, block_current_thread(), unblock() |
| libkernel/src/file.rs | FileHandle trait (returns FileError), FileError enum, ConsoleHandle |
| libkernel/src/console.rs | Console input buffer, foreground PID tracking |
| libkernel/src/process.rs | fd_table, cwd, parent_pid, wait_thread; fd helpers (return FileError) |
| osl/src/errno.rs | Linux errno constants, file_errno() / vfs_errno() converters |
| osl/src/blocking.rs | blocking() async-to-sync bridge |
| osl/src/file.rs | VfsHandle, DirHandle (VFS-backed file handles) |
| osl/src/syscalls/ | Syscall dispatch and implementations: read/write/close/open/getdents64/getcwd/chdir/clone/execve/waitpid |
| osl/src/spawn.rs | spawn_process_full with argv + parent PID |
| libkernel/src/syscall.rs | SYSCALL assembly entry stub, PER_CPU data, init |
| kernel/src/ring3.rs | Legacy spawn_process wrapper, blob spawning tests |
| kernel/src/keyboard_actor.rs | Foreground routing: raw bytes to console buffer |
| kernel/src/main.rs | Auto-launch /shell on boot |
| user/src/shell.c | Userspace shell with line editing and commands |
Verification
- Phase 1: Spawn a kernel thread that blocks itself; have another thread unblock it after a delay. Verify it resumes.
- Phase 2-3: exec /hello still works (write goes through fd table). Boot with no userspace shell; the kernel shell is still functional.
- Phase 4: From kernel shell, exec a test program that does open("/hello") + read() + write(1) to cat a file.
- Phase 5: Test program that spawns /hello and waits for it.
- Phase 6: Boot with /shell on disk. Verify: prompt appears, echo/pwd/cd/ls/cat/exit work, running /hello from shell works, Ctrl+C cancels input, exiting shell returns to kernel shell.
Risks
- Heap pressure: 512 KiB kernel heap is tight with multiple processes. May need to increase to 1 MiB. Monitor with /proc/meminfo.
- VFS bridge correctness: The async task must complete before the blocked thread is woken. Guaranteed by design, but a panic in the async path leaves the thread blocked forever. Consider adding a timeout or panic handler.
- getdents64 format complexity: Must match Linux’s struct linux_dirent64 layout exactly for musl’s readdir() to work. Alternative: shell can use raw syscall(217, ...) with custom parsing.
Cross-Compiling C Programs for ostoo Userspace
This document explains how to compile static musl-linked x86_64 ELF binaries that can run as ostoo user-space processes, using the crosstool-ng toolchain inside Docker.
Prerequisites
- Docker
- The ctng Docker image (built from crosstool/Dockerfile)
- The compiled toolchain at /Volumes/crosstool-ng/x-tools/x86_64-unknown-linux-musl
Toolchain details
| Component | Version |
|---|---|
| GCC | 15.2.0 |
| musl | 1.2.5 |
| binutils | 2.46.0 |
| Linux headers | 6.18.3 |
Target triple: x86_64-unknown-linux-musl
The toolchain produces fully static-linked ELF binaries with no runtime dependencies (no dynamic linker, no shared libraries).
Building the toolchain from scratch
If you need to rebuild the toolchain:
cd crosstool
# Build the Docker image (includes crosstool-ng and the .config)
docker build . -t ctng
# Run the build (output goes to /Volumes/crosstool-ng/x-tools)
./run.sh
The build runs inside Docker’s case-sensitive overlay filesystem to avoid macOS
case-sensitivity issues with the Linux kernel tarball extraction. Only the output
(x-tools) and download cache (src) directories are mounted from the host.
Compiling user programs
Using the build script (recommended)
The scripts/user-build.sh wrapper handles the Docker invocation for you.
Arguments are passed through to make:
./scripts/user-build.sh # build all .c files in user/src/ → user/bin/
./scripts/user-build.sh clean # clean build artifacts
./scripts/user-build.sh bin/hello # build a single target
Manual Docker invocation
If you need to run compiler commands directly:
docker run --rm \
-v /Volumes/crosstool-ng/x-tools:/home/ctng/x-tools \
-v "$(pwd)/user":/home/ctng/user \
ctng bash -c '
export PATH="/home/ctng/x-tools/x86_64-unknown-linux-musl/bin:$PATH"
cd /home/ctng/user
x86_64-unknown-linux-musl-gcc -static -Os -Wall -Wextra -o bin/hello src/hello.c
'
Compiler flags
The recommended flags for ostoo user-space binaries:
| Flag | Purpose |
|---|---|
| -static | Produce a fully static binary (required: ostoo has no dynamic linker) |
| -Os | Optimize for size (keeps binaries small for the FAT filesystem image) |
| -Wall -Wextra | Enable warnings |
| -nostdlib | Skip libc entirely (for minimal programs that use only raw syscalls) |
Verifying the output
You can inspect a compiled binary without Docker using the host file command:
file user/bin/hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, ...
Or use readelf from the toolchain:
docker run --rm \
-v /Volumes/crosstool-ng/x-tools:/home/ctng/x-tools \
-v "$(pwd)/user":/home/ctng/user \
ctng bash -c '
export PATH="/home/ctng/x-tools/x86_64-unknown-linux-musl/bin:$PATH"
x86_64-unknown-linux-musl-readelf -h /home/ctng/user/bin/hello
'
Confirm: Type: EXEC, Machine: Advanced Micro Devices X86-64.
Running on ostoo
- Copy the compiled binary onto the FAT filesystem image
- Boot ostoo in QEMU
- From the shell: exec /hello
The kernel’s ELF loader parses the binary and spawns it as a ring-3 process with
the syscall layer providing write, brk, mmap, exit, and other calls needed
by musl’s startup code.
Available toolchain binaries
All prefixed with x86_64-unknown-linux-musl-:
| Binary | Purpose |
|---|---|
| gcc / cc | C compiler |
| g++ / c++ | C++ compiler |
| as | Assembler |
| ld / ld.bfd | Linker |
| ar | Archive tool |
| objcopy | Binary manipulation |
| objdump | Disassembler |
| readelf | ELF inspector |
| strip | Strip symbols |
| nm | Symbol table viewer |
| gdb | Debugger |
Cross-Compiling for x86_64 on a Non-x86 Host
This project targets x86_64 bare metal but can be built and run on any host architecture (including aarch64-apple-darwin, i.e. Apple Silicon Macs). This document explains how that works.
Overview
The kernel is compiled for a custom x86_64-os target using Rust’s cross-compilation
support. QEMU provides x86_64 emulation at runtime. The host machine never executes the
kernel code directly.
Toolchain (rust-toolchain.toml)
[toolchain]
channel = "nightly"
components = ["rust-src", "llvm-tools"]
- nightly is required for the -Z build-std unstable feature (see below).
- rust-src provides the standard library source, which is needed to compile core, alloc, and compiler_builtins from source for the custom target.
- llvm-tools provides llvm-objcopy and related tools used by bootimage when assembling the final disk image.
Rustup downloads a pre-built nightly toolchain for the host architecture. The host toolchain is only used to drive the build; the kernel itself is compiled to x86_64 object files by rustc’s bundled LLVM backend regardless of host architecture.
Custom Target Spec (x86_64-os.json)
Rust’s built-in targets assume a host OS. For a bare-metal kernel we need a custom target.
The file x86_64-os.json at the workspace root defines it:
{
"llvm-target": "x86_64-unknown-none",
"data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
"arch": "x86_64",
"target-endian": "little",
"target-pointer-width": 64,
"target-c-int-width": 32,
"os": "none",
"executables": true,
"linker-flavor": "ld.lld",
"linker": "rust-lld",
"panic-strategy": "abort",
"disable-redzone": true,
"rustc-abi": "softfloat",
"features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float"
}
Key fields:
| Field | Value | Reason |
|---|---|---|
| llvm-target | x86_64-unknown-none | Bare metal; no OS assumed by LLVM |
| data-layout | LLVM datalayout string | Must exactly match LLVM’s own layout for this triple; confirmed via rustc +nightly --print target-spec-json --target x86_64-unknown-none -Z unstable-options |
| linker-flavor / linker | ld.lld / rust-lld | Uses LLVM’s cross-capable linker bundled with rustc; no host ld or cross-linker needed |
| disable-redzone | true | Required for kernel interrupt handlers; the red zone is an x86_64 ABI optimisation that is unsafe when interrupts can fire at any stack pointer |
| rustc-abi | softfloat | Tells rustc that this target intentionally violates the standard x86_64 ABI’s SSE requirement. Without this, rustc refuses to compile when SSE is disabled |
| features | -mmx,-sse,...,+soft-float | Disables SIMD/SSE in generated code (unsafe in kernel context without SSE state saving) and enables soft-float emulation instead |
Why rustc-abi: softfloat is needed
The standard x86_64 System V ABI mandates SSE2 support. If you disable SSE features in a custom target spec, rustc rejects the build with:
error: target feature 'sse2' is required by the ABI but gets disabled
The rustc-abi: softfloat field is an escape hatch for kernel targets: it tells rustc to
use a different ABI variant (one that does not assume SSE), suppressing the error. This is
the same mechanism used internally by Rust’s x86_64-unknown-none tier-2 target.
Cargo Configuration (.cargo/config.toml)
[build]
target = "x86_64-os.json"
[unstable]
build-std = ["core", "compiler_builtins", "alloc"]
build-std-features = ["compiler-builtins-mem"]
json-target-spec = true
- target: Makes every cargo build in this workspace default to the custom target. No --target flag is required on the command line.
- build-std: Compiles core, compiler_builtins, and alloc from source for the custom target. This is necessary because Cargo ships pre-compiled standard library crates only for known built-in targets; a custom JSON target has no pre-built sysroot.
- build-std-features = ["compiler-builtins-mem"]: Builds the memory intrinsics (memcpy, memset, etc.) into compiler_builtins rather than relying on a C runtime, which does not exist in a bare-metal environment.
- json-target-spec = true: Unlocks support for .json custom target files in current Cargo nightly. Without this flag, Cargo rejects .json target specs.
Bootloader and Bootimage
The kernel ELF is combined with a real-mode x86 bootloader by the bootimage tool:
cargo bootimage --manifest-path kernel/Cargo.toml
This produces target/x86_64-os/debug/bootimage-kernel.bin, a raw x86 disk image.
The bootloader crate (bootloader = "0.9.x") includes its own target spec
(x86_64-bootloader.json) and declares build-std = "core" in its own Cargo metadata.
bootimage picks this up and compiles the bootloader from source using -Z build-std,
just like the kernel — no separate cross-toolchain or cargo-xbuild is needed.
Why bootloader 0.9.x and not 0.8.x
bootloader 0.8.x was released before -Z build-std became stable enough for the
bootloader’s own build. It fell back to cargo xbuild, a now-deprecated wrapper tool.
The 0.9.x line added build-std to its metadata and has been actively maintained for
compatibility with current Rust nightly (data-layout changes, rustc-abi: softfloat,
integer target fields, json-target-spec). The kernel-facing API (entry_point!,
BootInfo) is the same in both series.
Running Under QEMU
cargo bootimage run --manifest-path kernel/Cargo.toml
or directly:
qemu-system-x86_64 -drive format=raw,file=target/x86_64-os/debug/bootimage-kernel.bin -serial stdio
QEMU provides full x86_64 CPU emulation. The run-time arguments are configured in
kernel/Cargo.toml under [package.metadata.bootimage].
Summary
| Concern | Solution |
|---|---|
| Compiling x86_64 code on ARM | rustc’s LLVM backend handles any target regardless of host |
| Linking for bare metal | rust-lld (cross-capable, bundled with rustc) |
| No pre-built sysroot for custom target | -Z build-std compiles core/alloc from source |
| No OS or C runtime | compiler-builtins-mem provides memory intrinsics |
| SSE disabled but ABI expects it | rustc-abi: softfloat in target spec |
| Bootable disk image | bootimage + bootloader 0.9.x (self-contained, no xbuild) |
| Running the kernel | qemu-system-x86_64 on any host |
Rust Cross-Compilation for ostoo Userspace
Build Rust userspace programs natively on macOS, producing static x86_64 ELF binaries that run on the ostoo kernel.
Architecture
user-rs/ # Separate Cargo workspace
├── Cargo.toml # workspace: rt, hello-rs, hello-std
├── .cargo/config.toml # custom target, build-std
├── x86_64-ostoo-user.json # custom target spec (no CRT objects)
├── rt/ # ostoo-rt runtime crate
│ ├── Cargo.toml # features: no_std (default)
│ └── src/
│ ├── lib.rs # _start, panic handler, global allocator
│ ├── syscall.rs # inline-asm SYSCALL wrappers
│ ├── io.rs # print!/println!/eprint!/eprintln! macros
│ └── alloc_impl.rs # brk-based bump allocator
├── hello-rs/ # no_std + alloc example (~5 KiB)
│ ├── Cargo.toml
│ └── src/main.rs
└── hello-std/ # full std example (~54 KiB)
├── Cargo.toml
└── src/main.rs
sysroot/ # musl sysroot (extracted, gitignored)
└── x86_64-ostoo-user/
├── lib/ # libc.a, crt*.o, libunwind.a stub
└── include/ # C headers
Why a separate workspace?
The kernel uses a custom target (x86_64-os.json) that disables SSE and the
red zone. Userspace needs standard x86_64 ABI with SSE and red zone enabled.
A separate workspace with its own .cargo/config.toml avoids target conflicts.
Custom target (x86_64-ostoo-user.json)
Based on x86_64-unknown-linux-musl but with empty pre-link-objects and
post-link-objects — we provide our own _start in ostoo-rt instead of
using musl’s CRT startup files. Has crt-static-default: true so that the
libc crate links libc.a statically when building std.
Building
# One-time: extract musl sysroot from the ostoo-compiler Docker image
scripts/extract-musl-sysroot.sh
# Build and deploy to user/ (visible at /host/ in guest via virtio-9p)
# (automatically calls extract-musl-sysroot.sh if needed)
scripts/user-rs-build.sh
# Or manually:
cd user-rs
cargo build --release
Uses build-std to compile std and panic_abort (and transitively core,
alloc, compiler_builtins, libc, unwind) from source. Links against
libc.a from the musl sysroot. Requires the nightly toolchain with rust-src
component (already in rust-toolchain.toml).
Note: packages must be built separately (-p hello-rs, then -p hello-std)
because Cargo feature unification would otherwise merge ostoo-rt’s no_std
feature across the workspace, causing duplicate #[panic_handler] errors. The
build script handles this automatically.
Release profile
opt-level = "s", lto = true, panic = "abort", strip = true — produces
small binaries (the hello world example is ~4.6 KiB).
Runtime crate (ostoo-rt)
ostoo-rt has a no_std feature (enabled by default). With no_std, it
provides a panic handler, global allocator, and OOM handler. Without it
(for std programs), it provides only _start and syscall wrappers.
Tier 1: no_std + alloc programs
#![no_std]
#![no_main]
extern crate ostoo_rt;
use ostoo_rt::println;
#[no_mangle]
fn main() -> i32 {
println!("Hello from Rust on ostoo!");
0
}
Depend on ostoo-rt with default features (includes no_std).
Tier 2: std programs
#![feature(restricted_std)]
#![no_main]
extern crate ostoo_rt;
use std::collections::HashMap;
#[no_mangle]
fn main() -> i32 {
println!("Hello from Rust std on ostoo!");
let mut map = HashMap::new();
map.insert("key", 42);
println!("HashMap works: {:?}", map);
0
}
Depend on ostoo-rt with default-features = false (disables no_std so
std’s panic handler and allocator are used instead). The
#![feature(restricted_std)] attribute is required for custom JSON targets.
What ostoo-rt provides
- _start entry point (always, but behaviour differs by mode):
  - no_std: reads argc/argv from the stack, calls _start_rust → user’s main() -> i32 directly.
  - std: extracts argc/argv from the stack and calls musl’s __libc_start_main(main, argc, argv, ...) which initializes libc (TLS via arch_prctl, stdio, locale, auxvec parsing) before calling main(argc, argv, envp). This is essential: without libc init, musl’s write() and other functions fault on uninitialised TLS.
- Syscall wrappers (always): syscall0 through syscall4 via inline asm (SYSCALL instruction). Typed wrappers: write, read, open, close, exit, brk, getcwd, chdir, getdents64, wait4.
- print!/println!/eprint!/eprintln! macros (always): write to fd 1/2 via core::fmt::Write. In std mode, prefer std::println! instead.
- Global allocator (no_std only): brk-based bump allocator.
- Panic handler (no_std only): prints panic info to stderr, exits 101.
Adding new programs
- Create user-rs/<name>/Cargo.toml with ostoo-rt dependency
- Add "<name>" to workspace members in user-rs/Cargo.toml
- Add the binary name to the deploy loop in scripts/user-rs-build.sh
Verification
# Binary format check
file user-rs/target/x86_64-ostoo-user/release/hello-rs
# → ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
# Entry point is _start (verify disassembly shows mov rdi,[rsp]; lea rsi,[rsp+8])
llvm-objdump -d --start-address=<entry> target/x86_64-ostoo-user/release/hello-rs
# No .interp section (no dynamic linker)
llvm-readobj -S target/x86_64-ostoo-user/release/hello-rs | grep -c .interp # → 0
Boot ostoo, then:
> spawn /host/hello-rs
Hello from Rust on ostoo!
Heap works: 42
Musl sysroot
The musl sysroot provides libc.a for Rust’s std to link against. It is
extracted from the ostoo-compiler Docker image (which builds musl 1.2.5 via
crosstool-ng).
# Extract sysroot (skips if already present)
scripts/extract-musl-sysroot.sh
The sysroot is placed at sysroot/x86_64-ostoo-user/ and is gitignored. It
contains:
- lib/libc.a: musl static C library
- lib/crt1.o, crti.o, crtn.o: CRT objects (not linked by default; our target spec has empty pre-link-objects)
- lib/libunwind.a: empty stub (satisfies unwind crate’s #[link]; musl’s unwinder is in libc.a, and with panic=abort unwinding is never invoked)
- include/: C headers
The cargo config passes -Lnative=../sysroot/x86_64-ostoo-user/lib to the
linker so it can find libc.a and libunwind.a.
Binary sizes
| Program | Mode | Size |
|---|---|---|
| hello-rs | no_std + alloc | ~5 KiB |
| hello-std | full std | ~55 KiB |
Both use opt-level = "s", LTO, panic = "abort", and stripping.
APIC and IO APIC Initialization
Background
The x86/x86_64 interrupt subsystem has two generations:
- 8259 PIC (Programmable Interrupt Controller) — the legacy two-chip design. Master (IRQs 0–7) and slave (IRQs 8–15) are chained. Vectors are remapped to 0x20–0x2F to avoid conflicts with CPU exceptions.
- APIC (Advanced Programmable Interrupt Controller) — the modern design, required for SMP. Consists of a Local APIC (LAPIC) per CPU core and one or more IO APICs for external devices.
ACPI describes which model the firmware uses via the MADT (Multiple APIC
Description Table). On QEMU with default settings, the MADT reports
InterruptModel::Apic, meaning APIC mode is required.
Architecture
Device
│
▼
IO APIC ──── Redirection Table ───► Local APIC ──► CPU
(external (per core)
IRQs) LAPIC ID
Local APIC (LAPIC)
- One per CPU core, memory-mapped at physical address 0xFEE00000 by default.
- Handles inter-processor interrupts (IPIs) and LAPIC-local sources (timer, thermal, etc.).
- Must be enabled by writing to the Spurious Interrupt Vector Register (SIVR) at offset 0xF0. Setting bit 8 (APIC_ENABLE) activates the LAPIC. Bits 0–7 set the spurious interrupt vector (conventionally 0xFF).
- EOI (End of Interrupt) is signalled by writing 0 to the EOI register at offset 0xB0. Unlike the PIC, no interrupt number is needed; the write itself is the acknowledgement.
IO APIC
- Handles external hardware interrupts (ISA IRQs, PCI interrupts).
- Accessed via two MMIO registers relative to the IO APIC base address: IOREGSEL (write selector, offset 0x00) and IOWIN (read/write data window, offset 0x10).
- Contains a Redirection Table with one 64-bit entry per input pin:

| Bits | Field | Notes |
|---|---|---|
| 0–7 | Vector | IDT vector to deliver |
| 8–10 | Delivery mode | 0 = fixed |
| 11 | Destination mode | 0 = physical (LAPIC ID), 1 = logical |
| 13 | Pin polarity | 0 = active high, 1 = active low |
| 15 | Trigger mode | 0 = edge, 1 = level |
| 16 | Mask | 1 = masked (disabled) |
| 56–63 | Destination | Physical: target LAPIC ID |
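The redirection entry fields listed above can be packed into a 64-bit value with plain shifts. This is a minimal sketch: the Redirection struct and the encode and split helpers are hypothetical names, and only fixed delivery with physical destination mode is modelled.

```rust
/// Hypothetical description of one redirection table entry.
/// Delivery mode is always fixed (0) and destination mode physical (0).
#[derive(Debug, Clone, Copy)]
pub struct Redirection {
    pub vector: u8,
    pub dest_lapic_id: u8,
    pub active_low: bool,
    pub level_triggered: bool,
    pub masked: bool,
}

/// Pack the fields into the 64-bit layout from the table above.
pub fn encode(r: Redirection) -> u64 {
    let mut entry = r.vector as u64; // bits 0-7: IDT vector
    if r.active_low {
        entry |= 1 << 13; // pin polarity
    }
    if r.level_triggered {
        entry |= 1 << 15; // trigger mode
    }
    if r.masked {
        entry |= 1 << 16; // mask
    }
    entry |= (r.dest_lapic_id as u64) << 56; // bits 56-63: destination
    entry
}

/// The IO APIC exposes each entry as two 32-bit registers, so the value
/// is written as a (low, high) pair through IOREGSEL/IOWIN.
pub fn split(entry: u64) -> (u32, u32) {
    (entry as u32, (entry >> 32) as u32)
}
```

For example, an unmasked, edge-triggered, active-high timer entry for vector 0x20 on LAPIC 0 encodes to just 0x20, which is why masking every entry first matters: an all-zero entry is otherwise a live route to vector 0.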
ACPI and Interrupt Source Overrides
ISA IRQs are conventionally edge-triggered, active-high. However, some IRQs are
remapped: QEMU reports that ISA IRQ 0 (the PIT timer) is redirected to GSI 2
with edge/active-high signalling. The ACPI InterruptSourceOverride table
entries describe these remappings:
| ISA IRQ | Default GSI | Override GSI | Override Polarity | Override Trigger |
|---|---|---|---|---|
| 0 | 0 | 2 (QEMU) | Same as bus | Same as bus |
| 1 | 1 | — | — | — |
The init_io function reads these overrides from apic_info.interrupt_source_overrides
and uses the correct GSI, polarity, and trigger mode when programming each
redirection entry.
Initialization Sequence
1. Map Local APIC (libkernel::apic::init_local)
The LAPIC physical address is read from the IA32_APIC_BASE MSR. A virtual
page at APIC_BASE is mapped to this physical frame (with NO_CACHE flag, as
MMIO must not be cached):
Physical 0xFEE00000 → Virtual 0xFFFF_8001_0000_0000
After mapping:
- init() logs the LAPIC ID, version, and LVT register state.
- enable() writes the SIVR: APIC_ENABLE | 0xFF (enable + spurious vector).
- The EOI virtual address (APIC_BASE + 0xB0) is stored in libkernel::interrupts::LAPIC_EOI_ADDR so interrupt handlers can issue EOI without needing a reference to the apic module.
2. Map IO APIC(s) (libkernel::apic::init_io)
Each IO APIC listed in the ACPI MADT is mapped to consecutive virtual pages
starting at APIC_BASE + 4KiB. The global_system_interrupt_base field records
which GSIs this IO APIC handles (typically 0 for the first IO APIC).
After mapping all IO APICs:
- Mask all entries: every redirection table slot is masked before programming, preventing spurious interrupts during setup.
- Route ISA IRQs: IRQ 0 (timer) and IRQ 1 (keyboard) are routed to IDT vectors 0x20 and 0x21 respectively, targeting the BSP’s LAPIC ID. Source overrides are applied (e.g. timer GSI 2 on QEMU).
3. Update IDT and EOI (libkernel::interrupts)
The IDT is extended with a spurious interrupt handler at vector 0xFF.
Spurious LAPIC interrupts must not receive an EOI.
The timer and keyboard handlers are updated to call send_eoi() instead of
PICS.notify_end_of_interrupt(). send_eoi() checks LAPIC_EOI_ADDR: if
non-zero (APIC mode), it writes 0 to the LAPIC EOI register; otherwise it
falls back to the PIC path. This allows the same IDT to work in both PIC and
APIC modes.
4. Disable the 8259 PIC (libkernel::interrupts::disable_pic)
After the IO APIC is programmed, the PIC is disabled by masking all IRQs:
Port::<u8>::new(0x21).write(0xFF); // mask master PIC
Port::<u8>::new(0xA1).write(0xFF); // mask slave PIC
This prevents the PIC from delivering interrupts that would arrive at the wrong vectors or cause double-delivery with the IO APIC.
Key Constants
| Symbol | Value | Description |
|---|---|---|
| APIC_BASE | 0xFFFF_8001_0000_0000 | Virtual base for LAPIC mapping |
| LAPIC_EOI_OFFSET | 0xB0 | Offset of EOI register in LAPIC |
| LAPIC_SIVR_OFFSET | 0xF0 | Offset of SIVR in LAPIC |
| SPURIOUS_VECTOR | 0xFF | IDT vector for LAPIC spurious IRQs |
| TIMER_VECTOR | 0x20 | IDT vector for timer (ISA IRQ 0) |
| KEYBOARD_VECTOR | 0x21 | IDT vector for keyboard (ISA IRQ 1) |
Crate Location
The APIC code lives in libkernel::apic (module libkernel/src/apic/). It was
originally a separate apic crate but was merged into libkernel so that
libkernel::irq_handle can call IO APIC mask/unmask/write functions directly
without duplicating raw MMIO code.
The LAPIC EOI address is communicated via a single AtomicU64 in
libkernel::interrupts: the apic module writes the address after mapping the
LAPIC, and interrupt handlers read it to perform EOI.
References
- Intel SDM Vol. 3A, Chapter 10: Advanced Programmable Interrupt Controller (APIC)
- OSDev Wiki: APIC, IO APIC, MADT
- ACPI Specification, Section 5.2.12: Multiple APIC Description Table
LAPIC Timer
Overview
The kernel uses the Local APIC (LAPIC) per-core timer as the primary tick source at 1000 Hz (1 ms resolution), replacing the legacy 8253 Programmable Interval Timer (PIT).
| Property | PIT | LAPIC timer |
|---|---|---|
| Frequency | 100 Hz (10 ms/tick) | 1000 Hz (1 ms/tick) |
| Scope | System-wide, ISA bus | Per-core, MMIO |
| Configuration | Port I/O | Memory-mapped registers |
| Timer future resolution | 10 ms | 1 ms |
LAPIC Timer Calibration
The LAPIC timer counts down from a programmed initial value at a rate derived from the CPU bus frequency, which varies between machines. To determine the correct initial count for 1000 Hz, the kernel calibrates against the PIT.
Algorithm
1. Start one-shot countdown — write 0xFFFF_FFFF to TimerInitialCount with divide-by-16.
2. Wait 500 ms — busy-wait on TICK_COUNT for 50 PIT ticks (50 × 10 ms = 500 ms).
3. Read elapsed count — elapsed = 0xFFFF_FFFF - TimerCurrentCount.
4. Compute bus frequency: lapic_bus_freq = elapsed × divide × PIT_HZ / PIT_ticks_waited = elapsed × 16 × 100 / 50
5. Compute initial count for 1000 Hz: initial_count = lapic_bus_freq / (divide × target_Hz) = lapic_bus_freq / (16 × 1000)
6. Start periodic timer with the computed initial count.
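The two formulas can be checked with plain integer arithmetic. This is a minimal sketch with hypothetical function names, using u64 so the intermediate product cannot overflow for any 32-bit elapsed count; the sample elapsed value is made up for illustration and is not a measurement.

```rust
pub const PIT_HZ: u64 = 100; // PIT tick rate used during calibration
pub const DIVIDE: u64 = 16; // LAPIC timer divisor

/// Bus frequency in Hz from the LAPIC counts elapsed over `pit_ticks_waited` PIT ticks.
pub fn lapic_bus_freq(elapsed: u64, pit_ticks_waited: u64) -> u64 {
    elapsed * DIVIDE * PIT_HZ / pit_ticks_waited
}

/// Initial count that makes the periodic timer fire at `target_hz`.
pub fn initial_count(bus_freq: u64, target_hz: u64) -> u64 {
    bus_freq / (DIVIDE * target_hz)
}
```

For example, if 31,250,000 counts elapse during the 50-tick wait, the bus frequency works out to 31,250,000 × 16 × 100 / 50 = 1,000,000,000 Hz, and the initial count for 1000 Hz is 1,000,000,000 / 16,000 = 62,500.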
Implementation
libkernel::apic::calibrate_and_start_lapic_timer() in libkernel/src/apic/mod.rs:
- Called from kernel/src/main.rs after libkernel::apic::init() and disable_pic().
- Releases the LOCAL_APIC lock before entering the HLT loop (phase 2) so the PIT ISR can proceed without deadlock.
- The LAPIC EOI address is already registered in libkernel::interrupts::LAPIC_EOI_ADDR by init_local().
PIT Coexistence During Calibration
During the 500 ms calibration window, the PIT ISR (vector 0x20) is still active and increments TICK_COUNT. This is required — wait_ticks() depends on it. After the LAPIC timer starts:
- PIT continues at 100 Hz (vector 0x20 → tick())
- LAPIC fires at 1000 Hz (vector 0x30 → tick())
Both call tick(), giving approximately 1100 increments per second. The Delay future handles the extra wakeups correctly: early wakeups cause re-polls, which re-register the waker. Timing therefore runs roughly 10% fast (1100 increments per second against a nominal 1000) for as long as the PIT remains unmasked, which is tolerable for coarse kernel timers but worth removing.
To eliminate the PIT contribution after calibration, mask GSI 2 in the IO APIC:
// follow-up: IO_APICS.lock()[0].mask_entry(2);
This is not yet implemented.
Multi-Waker Design
Problem with AtomicWaker
futures_util::task::AtomicWaker holds a single waker. With multiple concurrent Delay futures across different tasks, each poll() call overwrites the previous waker. When the ISR fires, only the last registered task is woken; others remain pending indefinitely.
Solution: Fixed Waker Array
libkernel/src/task/timer.rs uses a fixed array of 8 optional wakers behind a spinlock:
// Option<Waker> is not Copy, so the array is initialised from a const item.
const NO_WAKER: Option<Waker> = None;
static WAKERS: spin::Mutex<[Option<Waker>; 8]> = spin::Mutex::new([NO_WAKER; 8]);
On each tick (tick() called from ISR):
- Increment TICK_COUNT.
- Acquire the lock (interrupts already disabled by CPU on IDT dispatch — no deadlock).
- Take and wake every non-empty slot.
In Delay::poll():
1. Check TICK_COUNT >= target — return Ready immediately if done.
2. Clone the waker (may allocate — must be done in task context, before disabling interrupts).
3. Disable interrupts (without_interrupts) and lock WAKERS.
4. Find an empty slot and insert the cloned waker. Panic if all slots are full (bug indicator).
5. Re-check TICK_COUNT >= target — return Ready if the ISR fired between step 1 and step 4.
6. Return Pending.
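The tick and poll sequences can be modelled on a host with std primitives. This is only a sketch under stated assumptions: std::sync::Mutex stands in for the spinlock plus without_interrupts, and the names `TimerModel`, `poll_deadline` and `CountingWaker` are hypothetical, not the types in libkernel/src/task/timer.rs.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

const SLOTS: usize = 8;
const EMPTY: Option<Waker> = None; // const item, because Option<Waker> is not Copy

pub struct TimerModel {
    ticks: AtomicU64,
    wakers: Mutex<[Option<Waker>; SLOTS]>,
}

impl TimerModel {
    pub fn new() -> Self {
        Self { ticks: AtomicU64::new(0), wakers: Mutex::new([EMPTY; SLOTS]) }
    }

    /// ISR side: bump the counter, then take and wake every registered waker.
    pub fn tick(&self) -> usize {
        self.ticks.fetch_add(1, Ordering::SeqCst);
        let mut slots = self.wakers.lock().unwrap();
        let mut woken = 0;
        for slot in slots.iter_mut() {
            if let Some(waker) = slot.take() {
                waker.wake();
                woken += 1;
            }
        }
        woken
    }

    /// Task side: returns true for Ready, false for Pending.
    pub fn poll_deadline(&self, target: u64, waker: &Waker) -> bool {
        if self.ticks.load(Ordering::SeqCst) >= target {
            return true;
        }
        let cloned = waker.clone(); // clone before taking the lock
        let mut slots = self.wakers.lock().unwrap();
        let free = slots
            .iter_mut()
            .find(|slot| slot.is_none())
            .expect("all waker slots are full");
        *free = Some(cloned);
        // Re-check: a tick may have landed between the first check and registration.
        self.ticks.load(Ordering::SeqCst) >= target
    }
}

/// Test helper that counts how often it was woken.
pub struct CountingWaker(pub AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}
```

Unlike a single AtomicWaker, two pending delays registered through `poll_deadline` are both woken by the next `tick()`.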
ISR/Task Locking Contract
| Context | IF flag | Lock acquisition |
|---|---|---|
| ISR (timer handler) | 0 (CPU clears on IDT dispatch) | Always succeeds immediately |
| Task (Delay::poll) | 1 (enabled) | Uses without_interrupts to prevent ISR re-entry while holding lock |
If a task held the lock with interrupts enabled, the ISR could fire and spin forever trying to acquire the same lock — a deadlock. without_interrupts prevents this.
TICKS_PER_SECOND Constant
Defined in libkernel/src/task/timer.rs:
pub const TICKS_PER_SECOND: u64 = 1000;
Use it to convert between ticks and real time:
// Convert ticks to seconds elapsed
let secs = ticks() / TICKS_PER_SECOND;
// Create a 1-second delay
Delay::from_secs(1).await;
// Create a 250ms delay
Delay::from_millis(250).await;
Delay::from_millis(ms) uses ceiling division to avoid returning early:
Self::new((ms * TICKS_PER_SECOND + 999) / 1000)
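With TICKS_PER_SECOND = 1000 the expression reduces to `ms`, so the rounding only becomes visible at lower tick rates. The sketch below parameterises the tick rate to show the effect; `millis_to_ticks` is a hypothetical free function, not the method in timer.rs.

```rust
/// Ceiling division: the smallest tick count whose duration is at least `ms`.
pub fn millis_to_ticks(ms: u64, ticks_per_second: u64) -> u64 {
    (ms * ticks_per_second + 999) / 1000
}
```

At 100 Hz, a 25 ms request becomes (2500 + 999) / 1000 = 3 ticks (30 ms) rather than being truncated to 2 ticks (20 ms), so the delay cannot complete early because of rounding.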
LAPIC Timer Registers
| Register | Offset | Purpose |
|---|---|---|
| LvtTimer | 0x320 | Vector[7:0], mask[16], mode: one-shot[17]=0, periodic[17]=1 |
| TimerInitialCount | 0x380 | Write to start countdown |
| TimerCurrentCount | 0x390 | Read-only; current value |
| TimerDivideConfiguration | 0x3E0 | Bus clock divisor (0x3 = ÷16) |
The kernel uses divide-by-16 (0x3). The formula above accounts for this divisor.
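The LvtTimer bit layout from the table can be expressed as a small helper. The constant and function names here are hypothetical; only the bit positions and the 0x3 divide encoding come from the table above.

```rust
pub const LVT_MASKED: u32 = 1 << 16; // bit 16: timer interrupt masked
pub const LVT_PERIODIC: u32 = 1 << 17; // bit 17: 0 = one-shot, 1 = periodic
pub const DIVIDE_BY_16: u32 = 0x3; // value for TimerDivideConfiguration

/// Composes the value written to the LvtTimer register (offset 0x320).
pub fn lvt_timer(vector: u8, periodic: bool, masked: bool) -> u32 {
    let mut value = vector as u32; // bits 0..=7: IDT vector
    if periodic {
        value |= LVT_PERIODIC;
    }
    if masked {
        value |= LVT_MASKED;
    }
    value
}
```

A periodic, unmasked timer on vector 0x30 yields 0x0002_0030, while the one-shot calibration countdown could use a masked one-shot entry if no interrupt is wanted during measurement (an assumption, not something the text above states).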
Key Files
| File | Role |
|---|---|
| libkernel/src/task/timer.rs | tick(), wait_ticks(), Delay, TICKS_PER_SECOND, waker array |
| libkernel/src/interrupts.rs | LAPIC_TIMER_VECTOR = 0x30, IDT entry, lapic_timer_interrupt_handler |
| apic/src/local_apic/mapped.rs | start_oneshot_timer(), start_periodic_timer(), stop_timer(), read_current_count() |
| libkernel/src/apic/mod.rs | calibrate_and_start_lapic_timer() |
| kernel/src/main.rs | Calls calibration; spawns timer_task() |
Microkernel Design
Overview
This document explores evolving ostoo towards a microkernel architecture where device drivers run as userspace processes rather than in the kernel. It covers the motivation, the kernel primitives required, how other systems solve this, and a migration path from the current monolithic design.
See also: networking-design.md for how networking specifically fits into either a monolithic or microkernel architecture.
Why Consider a Microkernel
- Fault isolation. A buggy NIC or filesystem driver crashes its own process, not the kernel. The system can restart it.
- Reduced TCB. The trusted computing base shrinks to just the kernel primitives. Less kernel code = fewer exploitable bugs.
- Hot-swappable drivers. Replace or upgrade a driver without rebooting.
- Security boundaries. A compromised driver only has access to the specific device it manages, not all of kernel memory.
How Other Systems Do It
Redox OS — Scheme-Based IPC
Redox uses schemes as both a namespace and IPC channel. A scheme is a
named resource (like tcp:, udp:, disk:) backed by a userspace daemon.
Standard file operations (open/read/write/close) become IPC messages
routed through the kernel to the scheme daemon.
- smolnetd daemon implements tcp:/udp: schemes using smoltcp
- NIC drivers (e.g. e1000d) are separate userspace processes
- The kernel provides irq:N and memory: schemes for hardware access
- Recent work adds io_uring-style shared-memory rings for high-throughput driver-to-driver communication, bypassing the kernel data path
Strengths: elegant “everything is a file” model, reuses POSIX-like ops for IPC. Weaknesses: every cross-component operation involves at least one context switch through the kernel (mitigated by io_uring for data paths).
seL4 — Capabilities + Shared Memory Rings
seL4 provides minimal kernel primitives and lets userspace build everything else:
- Synchronous endpoints for RPC (~0.2us round-trip on ARM64). Small messages transfer entirely in CPU registers (zero copy). The kernel has a fastpath for seL4_Call/seL4_ReplyRecv with direct process switch (sender switches directly to receiver without full scheduler invocation).
- Notifications for async signaling (interrupt delivery, ring wakeups). A notification word acts as a bitmask of binary semaphores — different signalers use different bits, so one notification object can multiplex multiple event sources.
- Capability system — every kernel object (endpoints, frames, interrupts, page tables) is accessed through unforgeable capability tokens stored in per-thread CSpaces. Capabilities can be derived with reduced rights, transferred via IPC, or revoked.
- sDDF (seL4 Device Driver Framework) uses SPSC shared-memory ring buffers for zero-copy packet passing between NIC driver → multiplexer → application.
sDDF on an iMX8 ARM board with a 1 Gb/s NIC saturates the link at ~95% CPU while Linux on the same hardware maxes out at ~600 Mb/s. The shared-memory ring design avoids Linux’s sk_buff allocation/copy overhead.
Minix 3 — Classic Microkernel
Each driver is a separate user-mode process. The kernel provides:
- sys_irqsetpolicy() — subscribe to hardware interrupts
- HARD_INT notification messages — delivered on next receive()
- SYS_DEVIO — read/write I/O ports from userspace
- SYS_PRIVCTL — per-driver access control (which ports, IRQs, memory regions are permitted, declared in /etc/system.conf.d/)
- SYS_UMAP/SYS_VUMAP — virtual-to-physical translation for DMA setup
- Fixed-length synchronous IPC: send()/receive()/sendrec() + notify()
Networking uses lwIP in a separate server process. A received packet traverses: NIC driver → lwIP server → VFS → application (~4 IPC hops per direction).
Fuchsia (Zircon) — Driver Hosts + FIDL
Drivers are shared libraries loaded into driver host processes. Multiple drivers can be colocated in the same host for zero-overhead communication.
- FIDL (Fuchsia Interface Definition Language) for typed IPC across process boundaries via Zircon channels
- DriverTransport for in-process communication between colocated drivers (no kernel involvement — can invoke handler directly in the same stack frame)
- VMO (Virtual Memory Objects) for shared memory and device MMIO. Created by the bus driver via zx_vmo_create_physical(), passed as handles to device drivers, mapped into their address space
- BTI (Bus Transaction Initiators) for DMA with IOMMU control. zx_bti_pin() pins a VMO and returns device-physical addresses
- Interrupt objects — created by bus drivers, delivered via zx_interrupt_wait() (sync) or port binding (async)
- Control/data plane split: FIDL messages for setup, pre-allocated shared VMOs for bulk data transfer
Key insight: colocation lets Fuchsia avoid the microkernel IPC tax for tightly-coupled drivers while still getting process isolation for less trusted ones.
Kernel Primitives Needed
To support userspace drivers, ostoo must provide these minimal primitives.
1. Physical Memory Mapping
Map device MMIO BARs into a userspace process’s address space.
ostoo today: mmio_phys_to_virt() maps BARs into kernel space only.
sys_mmap only supports anonymous private pages (returns -ENOSYS for
non-anonymous).
Options:
- Extend sys_mmap with MAP_SHARED + a device fd (Linux-like /dev/mem)
- New sys_mmap_device(phys_addr, size, perms) syscall (simpler)
- Capability-based: kernel creates a “device memory” handle, process maps it via mmap on that handle’s fd
The capability-based approach (device memory as an fd) fits ostoo’s existing fd_table model well and avoids giving processes a raw “map any physical address” primitive.
2. IRQ Delivery to Userspace
Deliver hardware interrupts as events to a userspace driver process.
ostoo today: Interrupts handled entirely in kernel (APIC/IOAPIC routing to kernel ISRs). No mechanism to notify userspace of IRQs.
Options:
| Approach | Description | Complexity |
|---|---|---|
| IRQ fd | open("/dev/irq/N"), read() blocks until IRQ fires | Low — reuses fd/FileHandle |
| eventfd | New eventfd() syscall, kernel writes on IRQ | Medium |
| Signal | Deliver SIGIO to driver process on IRQ | Medium — requires signals |
| Notification object | Dedicated kernel object (seL4-style) | High — new primitive |
Recommendation: IRQ fd. Create an IrqHandle implementing FileHandle
where read() blocks until the interrupt fires and returns a count. The
kernel ISR masks the interrupt and calls unblock() on the waiting thread.
After handling, the driver writes to the fd to re-enable the interrupt. This
fits the existing scheduler block/unblock pattern and the fd_table model.
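The mask-on-fire, re-enable-on-write cycle described in this recommendation can be captured as a tiny state machine. This is a host-side model with hypothetical names (`IrqHandleModel` and its methods); blocking is represented by `read()` returning `None` instead of parking a thread.

```rust
/// Models the state an IRQ fd would track (illustrative only).
#[derive(Default)]
pub struct IrqHandleModel {
    pending: u64, // interrupts seen since the last read
    masked: bool, // whether the kernel ISR has masked the line
}

impl IrqHandleModel {
    /// Kernel ISR: count the interrupt and mask the line until the driver acks.
    pub fn on_interrupt(&mut self) {
        self.pending += 1;
        self.masked = true;
    }

    /// Driver read(): None means "would block"; Some(n) returns and clears the count.
    pub fn read(&mut self) -> Option<u64> {
        if self.pending == 0 {
            None
        } else {
            Some(std::mem::take(&mut self.pending))
        }
    }

    /// Driver write(): acknowledge handling and re-enable the interrupt.
    pub fn write(&mut self) {
        self.masked = false;
    }

    pub fn is_masked(&self) -> bool {
        self.masked
    }
}
```

Keeping the line masked between the ISR and the driver's write is what prevents a level-triggered device from re-raising the interrupt before the driver has serviced it.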
3. Fast IPC
Efficient communication between driver processes and between drivers and applications.
ostoo today: Only pipes (byte streams, no message boundaries, kernel-buffered copy).
Progressive options:
- Pipes (have now) — sufficient for prototyping, ~2 copies per message
- Unix domain sockets — adds message boundaries (SOCK_DGRAM), ancillary data for fd passing (needed to transfer device handles between processes)
- Shared memory regions — mmap(MAP_SHARED) or shmget/shmat for zero-copy ring buffers between cooperating processes
- io_uring-style rings — lock-free SPSC queues in shared memory, kernel only involved for wakeups when rings transition empty → non-empty
For initial microkernel work, pipes + shared memory is sufficient. The performance-critical path (packet data) goes through shared memory rings; the control path (setup, teardown) goes through pipes or sockets.
The long-term goal is a pattern where the control plane uses message-based IPC and the data plane uses shared-memory rings, matching seL4 sDDF and Fuchsia’s architecture.
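To make the "wake only on empty → non-empty" rule concrete, here is a deliberately simplified single-threaded ring. It is a model of the signalling logic only: a real shared-memory SPSC ring would keep head and tail in atomics visible to both processes. The type `Ring` and its methods are hypothetical.

```rust
/// Fixed-capacity FIFO; N is assumed to be a power of two.
pub struct Ring<const N: usize> {
    slots: [u32; N],
    head: usize, // next index to read (consumer side)
    tail: usize, // next index to write (producer side)
}

impl<const N: usize> Ring<N> {
    pub fn new() -> Self {
        Self { slots: [0; N], head: 0, tail: 0 }
    }

    pub fn len(&self) -> usize {
        self.tail.wrapping_sub(self.head)
    }

    /// Ok(true) means the ring was empty, so the consumer needs a wakeup.
    /// Err(value) means the ring is full and the value was not stored.
    pub fn push(&mut self, value: u32) -> Result<bool, u32> {
        if self.len() == N {
            return Err(value);
        }
        let was_empty = self.len() == 0;
        self.slots[self.tail % N] = value;
        self.tail = self.tail.wrapping_add(1);
        Ok(was_empty)
    }

    pub fn pop(&mut self) -> Option<u32> {
        if self.len() == 0 {
            return None;
        }
        let value = self.slots[self.head % N];
        self.head = self.head.wrapping_add(1);
        Some(value)
    }
}
```

Only the first push into an empty ring reports that a wakeup is needed; subsequent pushes are pure memory writes, which is where the batching benefit comes from.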
4. Shared Memory
Map the same physical pages into multiple process address spaces.
ostoo today: Each process gets a private PML4. create_user_page_table
copies kernel entries (256-510) but there is no mechanism for two user
processes to share pages.
What’s needed:
- A shared memory object (named or anonymous) backed by physical frames
- mmap() to map it into each participating process
- Reference counting so pages are freed only when all mappers unmap
- Access control: processes must hold a handle/capability to map the region
Design sketch:
Process A                      Kernel                      Process B
│                                                                  │
├─ shmget(key, size) ──→  allocate frames   ←── shmget(key, size) ─┤
│                         ref_count = 2                            │
├─ shmat(id) ──────────→  map into A's PML4                        │
│                                                                  │
│                         map into B's PML4   ←─── shmat(id) ──────┤
│                                                                  │
│         (A and B now read/write the same physical pages)         │
Alternative: fd-based approach where mmap(fd, MAP_SHARED) on a memfd works
like Linux. This avoids inventing SysV-style APIs and reuses the fd model.
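The reference-counting requirement can be sketched independently of which API (SysV-style or memfd-style) is chosen. `SharedRegion` and its methods are hypothetical names; frame allocation is reduced to a counter so the lifetime rule is easy to see.

```rust
/// A shared memory object backed by `frames` physical frames (illustrative model).
pub struct SharedRegion {
    frames: usize,  // frames currently owned by the object
    mappers: usize, // address spaces that have the region mapped
}

impl SharedRegion {
    pub fn new(frames: usize) -> Self {
        Self { frames, mappers: 0 }
    }

    /// Called when a process maps the region; returns the new mapper count.
    pub fn map(&mut self) -> usize {
        self.mappers += 1;
        self.mappers
    }

    /// Called when a process unmaps the region.
    /// Returns the number of frames released, which is non-zero only
    /// when the last mapper goes away.
    pub fn unmap(&mut self) -> usize {
        assert!(self.mappers > 0, "unmap without a matching map");
        self.mappers -= 1;
        if self.mappers == 0 {
            std::mem::take(&mut self.frames)
        } else {
            0
        }
    }
}
```

With two mappers, the first unmap releases nothing and the second returns all frames to the allocator.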
5. DMA Support
Userspace drivers need physical addresses for device DMA programming.
ostoo today: KernelHal::dma_alloc() allocates physically-contiguous
pages and returns (paddr, NonNull<u8>). share() translates virtual to
physical via page table walk. Both are kernel-internal.
What’s needed:
- A syscall to allocate DMA-capable memory: physically contiguous, pinned, and mapped into the calling process. Returns both the virtual address and the physical address (the driver needs the physical address to program the device’s DMA descriptors).
- Or: a two-step model where the kernel allocates DMA buffers and provides an fd. The driver maps the fd and queries the physical address separately.
The fd-based model is safer (physical addresses are not exposed until the driver proves it holds the right handle) and aligns with Fuchsia’s BTI/PMT pattern.
6. Access Control
Prevent arbitrary processes from mapping device memory or claiming IRQs.
Options (increasing sophistication):
| Approach | Description | Precedent |
|---|---|---|
| Init-time grant | Only the init process can spawn drivers with device access, configured at spawn time | Simple, sufficient for single-user OS |
| Capability-based | Kernel objects (device memory, IRQ handles) are capabilities obtained from a parent or resource manager | seL4, Fuchsia |
| Policy-based | Configuration file declares which programs may access which devices | Minix 3 |
Recommendation: Init-time grant for the first iteration. The kernel shell or init process spawns driver processes and passes them fds for their device MMIO region and IRQ. The driver inherits these fds across exec. No new kernel objects needed — just careful fd management.
Later, this can evolve towards a capability model where device handles are kernel objects with typed permissions.
What Stays in the Kernel
Even in a full microkernel, some things must remain:
- CPU scheduling — timer interrupts, context switching, thread states
- Memory management — page tables, frame allocation, address space setup
- IPC primitives — message passing, shared memory mapping, notifications
- Interrupt routing — top-half ISR that masks IRQ and notifies userspace
- Capability/access control — enforce which processes access which devices
- Boot and early init — PCI enumeration can eventually be delegated, but initial hardware discovery typically starts in kernel
Everything else — device drivers, filesystems, network stacks, even the TCP/IP protocol processing — can live in userspace.
Current ostoo Gaps
Summary of what exists vs what’s needed:
| Primitive | Current State | Gap |
|---|---|---|
| Physical memory mapping | Kernel-only (mmio_phys_to_virt) | Need userspace mapping syscall |
| IRQ delivery | Kernel ISRs only | Need IRQ fd or notification |
| IPC | Pipes only (byte stream, ~2 copies) | Need shared memory + message boundaries |
| Shared memory | None (private PML4 per process) | Need cross-process page sharing |
| DMA | Kernel-only (dma_alloc/share) | Need userspace DMA allocation |
| Access control | None (all processes equal) | Need per-process device permissions |
| mmap | Anonymous private only | Need MAP_SHARED, device mapping |
| ioctl | Not implemented | Need for device control |
Migration Path
A phased approach that starts monolithic and progressively moves towards microkernel:
Phase A — Monolithic Drivers (current)
All drivers (virtio-blk, virtio-9p, and eventually virtio-net) run in kernel
space via the devices crate. This is the working baseline. Networking
is implemented in kernel with smoltcp (see networking-design.md).
Phase B — Add Kernel Primitives
Implement foundational primitives without yet moving drivers out. These are independently useful:
- mmap(MAP_SHARED) — shared memory between processes (needed for efficient multi-process programs even without microkernel goals)
- IRQ fd — irq_create(gsi) syscall (504) returns an fd backed by FdObject::Irq. Used with OP_IRQ_WAIT on a completion port. Implemented (see libkernel/src/irq_handle.rs, osl/src/irq.rs)
- Device MMIO mapping — map physical BAR regions to userspace via an fd
- DMA allocation syscall — allocate pinned, physically-contiguous pages accessible from userspace
Phase C — Userspace NIC Driver
Move the virtio-net driver to a userspace process as a proof-of-concept:
- Process receives device MMIO fd and IRQ fd from init
- Maps the virtio-net PCI BAR into its address space
- Opens IRQ fd and polls/blocks for interrupts
- Allocates DMA buffers for virtqueue descriptors
- Communicates with the in-kernel TCP/IP stack via shared memory ring buffers
The TCP/IP stack stays in kernel at this stage. This tests the driver primitive infrastructure with a single, well-understood device.
Phase D — Userspace TCP/IP Stack
Move smoltcp to a separate userspace server process:
- Receives raw Ethernet frames from NIC driver via shared memory rings
- Processes TCP/IP/ARP/DHCP
- Delivers data to applications via shared memory or kernel-mediated IPC
- The kernel’s socket syscall handlers become thin IPC stubs that route requests to this server (preserving POSIX compatibility for musl)
Phase E — Generalize
Apply the same pattern to other drivers:
- virtio-blk → userspace block driver + userspace filesystem server
- virtio-9p → userspace 9P client
- Console/keyboard → userspace terminal driver
At this point the kernel is a true microkernel: scheduler, memory management,
IPC, and capability enforcement only. The devices and osl crates either
become userspace libraries or are restructured into per-driver binaries.
Performance Considerations
Context Switches Per Packet
| Path | Monolithic | Microkernel |
|---|---|---|
| NIC IRQ → driver | 0 (in kernel) | 1 (kernel → driver) |
| Driver → TCP/IP | 0 (function call) | 1 (shared memory signal) |
| TCP/IP → application | 1 (return to userspace) | 1 (IPC or signal) |
| Total | 1 | 3 (naive) / 1-2 (batched) |
Why This Is Acceptable
- With shared-memory ring buffers and batching, the kernel is only involved for wakeups when rings transition empty → non-empty.
- Under load, driver and TCP/IP server can poll their rings without any kernel involvement (similar to Linux NAPI busy-polling).
- seL4 IPC round-trip: ~0.2us. Network I/O latency: 25-500+us. Even 3 IPC hops are a small fraction of total latency.
- The historical Mach-era penalty (50-100% overhead) is now 5-10% for general workloads and near-zero for I/O-dominated workloads.
Key Optimisations
- Shared memory data plane — kernel only signals, never copies data
- Batching — process N packets per wakeup, not 1
- Direct process switch — IPC sender switches directly to receiver without full scheduler invocation (seL4 fastpath)
- Polling under load — skip notifications entirely when rings are busy
- Pre-allocated buffer pools — no per-packet allocation
- Driver colocation (Fuchsia-style) — run tightly-coupled drivers in the same address space when isolation between them is not needed
Comparison Summary
| Aspect | Monolithic (Phase A) | Full Microkernel (Phase E) |
|---|---|---|
| Kernel code size | Large (drivers + protocols) | Small (primitives only) |
| Driver crash | Kernel panic | Restart driver process |
| Attack surface | Entire kernel | Minimal kernel + IPC |
| Performance | Best (no IPC overhead) | Good (shared memory amortises cost) |
| Implementation effort | Low | High (needs IPC, shared mem, caps) |
| POSIX compat | Direct | Kernel-mediated IPC stubs |
| ostoo readiness | Ready now | Needs phases B-E |
Compositor Design
A Wayland-style userspace compositor for ostoo.
Overview
The compositor takes ownership of the BGA framebuffer, accepts client connections via a service registry, allocates shared-memory pixel buffers for clients, and composites their output to the screen.
MVP scope: window creation, buffer allocation, damage signaling, compositing. No input routing or window management.
Architecture
┌──────────┐  svc_lookup   ┌────────────┐ framebuffer_open  ┌─────┐
│ Client   │──────────────▶│ Compositor │──────────────────▶│ BGA │
│          │ IPC channels  │            │ MAP_SHARED mmap   │ LFB │
│ shmem    │◀─────────────▶│ shmem      │                   └─────┘
│ buffer   │  notify fds   │ event      │
└──────────┘               │ loop       │
                           └────────────┘
Kernel Primitives Used
| Syscall | Nr | Purpose |
|---|---|---|
| svc_register | 513 | Compositor registers itself under "compositor" |
| svc_lookup | 514 | Client finds the compositor’s registration channel |
| framebuffer_open | 515 | Compositor gets an shmem fd wrapping the BGA LFB |
| ipc_create | 505 | Channel pairs for registration and per-client comms |
| ipc_send/ipc_recv | 506/507 | Message passing with fd-passing |
| shmem_create | 508 | Per-window pixel buffer allocation |
| notify_create | 509 | Per-window damage notification fd |
| notify | 510 | Client signals “buffer is ready” |
| io_create | 501 | Compositor’s completion port |
| io_submit | 502 | Arm OP_IPC_RECV and OP_RING_WAIT |
| io_wait | 503 | Block until events arrive |
Service Registry (syscalls 513–514)
A kernel-global BTreeMap<String, FdObject> keyed by null-terminated name.
- svc_register(name, fd): clones the fd object + notify_dup(), inserts under name. Returns -EBUSY if taken.
- svc_lookup(name): clones + notify_dup()s the stored object, allocates fd in caller’s table. Returns -ENOENT if not found.
Max name length: 128 bytes.
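The registry semantics can be sketched with a plain BTreeMap. This is an illustrative model: the stored object is reduced to a `u32` placeholder instead of an FdObject, the errno numbers are the usual Linux values, and returning -EINVAL for an over-long or empty name is an assumption, since the text above does not say which error that case produces.

```rust
use std::collections::BTreeMap;

pub const ENOENT: i64 = 2;
pub const EBUSY: i64 = 16;
pub const EINVAL: i64 = 22; // assumed error for invalid names
pub const MAX_NAME_LEN: usize = 128;

#[derive(Default)]
pub struct ServiceRegistry {
    entries: BTreeMap<String, u32>, // name -> placeholder for the fd object
}

impl ServiceRegistry {
    /// Inserts `object` under `name`; fails with -EBUSY if the name is taken.
    pub fn register(&mut self, name: &str, object: u32) -> Result<(), i64> {
        if name.is_empty() || name.len() > MAX_NAME_LEN {
            return Err(-EINVAL);
        }
        if self.entries.contains_key(name) {
            return Err(-EBUSY);
        }
        self.entries.insert(name.to_string(), object);
        Ok(())
    }

    /// Returns a copy of the stored object, or -ENOENT if the name is unknown.
    pub fn lookup(&self, name: &str) -> Result<u32, i64> {
        self.entries.get(name).copied().ok_or(-ENOENT)
    }
}
```

First registration wins: a second process that tries to register "compositor" gets -EBUSY rather than silently replacing the existing service.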
Framebuffer Access (syscall 515)
framebuffer_open(flags) creates a SharedMemInner::from_existing() wrapping
the BGA LFB physical frames. The frames are non-owning (MMIO memory is never
freed). The caller mmaps with MAP_SHARED to get a user-accessible pointer.
The LFB physical address and size are stored in atomics during BGA init and read by the syscall handler.
Connection Protocol
Uses existing IPC fd-passing — no new kernel primitives needed.
Compositor Setup
- ipc_create() → [reg_send, reg_recv]
- svc_register("compositor\0", reg_send)
- Create CompletionPort, submit OP_IPC_RECV on reg_recv
Client Connects
- svc_lookup("compositor\0") → dup of reg_send
- Create two channel pairs: c2s (client→server) and s2c (server→client)
- ipc_send(reg_send, MSG_CONNECT { w, h, fds=[c2s_recv, s2c_send] })
- ipc_recv(s2c_recv) → MSG_WINDOW_CREATED { id, w, h, fds=[buf_fd, notify_fd] }
- mmap(MAP_SHARED, buf_fd) → pixel buffer
- Draw, then notify_signal(notify_fd)
Compositor Accepts
- Extract c2s_recv, s2c_send from message
- Allocate shmem buffer + notify fd
- ipc_send(s2c_send, MSG_WINDOW_CREATED { fds=[buf_fd, notify_fd] })
- Arm OP_RING_WAIT on notify fd, OP_IPC_RECV on c2s_recv
- Re-arm OP_IPC_RECV on reg_recv for next client
Wire Protocol
| Tag | Name | Direction | data[] | fds[] |
|---|---|---|---|---|
| 1 | MSG_CONNECT | client→compositor | [w, h, 0] | [c2s_recv, s2c_send] |
| 2 | MSG_WINDOW_CREATED | compositor→client | [wid, w, h] | [buf_fd, notify_fd] |
| 3 | MSG_PRESENT | client→compositor | [wid, 0, 0] | — |
| 4 | MSG_CLOSE | client→compositor | [wid, 0, 0] | — |
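The table maps naturally onto a fixed-size message with a tag, three data words and a small fd array. The struct layout below is an assumption made for illustration (three u64 data words, four fd slots with -1 meaning "no fd"); the tag values and the data/fds contents come from the table. The constructor names are hypothetical.

```rust
pub const MSG_CONNECT: u64 = 1;
pub const MSG_WINDOW_CREATED: u64 = 2;
pub const MSG_PRESENT: u64 = 3;
pub const MSG_CLOSE: u64 = 4;
pub const NO_FD: i32 = -1; // marks an unused fd slot

/// Assumed message shape: tag + 3 data words + 4 fd slots.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IpcMessage {
    pub tag: u64,
    pub data: [u64; 3],
    pub fds: [i32; 4],
}

/// Client → compositor: request a window and hand over both channel ends.
pub fn msg_connect(w: u64, h: u64, c2s_recv: i32, s2c_send: i32) -> IpcMessage {
    IpcMessage {
        tag: MSG_CONNECT,
        data: [w, h, 0],
        fds: [c2s_recv, s2c_send, NO_FD, NO_FD],
    }
}

/// Client → compositor: the buffer of window `wid` is ready to show.
pub fn msg_present(wid: u64) -> IpcMessage {
    IpcMessage { tag: MSG_PRESENT, data: [wid, 0, 0], fds: [NO_FD; 4] }
}
```

Passing the channel ends inside MSG_CONNECT is what lets the compositor talk to each client privately without any extra kernel primitive.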
Compositor Event Loop
port.wait(min=1) → completions
  TAG_NEW_CLIENT  → handle_connect(), re-arm OP_IPC_RECV on reg_recv
  TAG_DAMAGE(wid) → mark dirty, re-arm OP_RING_WAIT
  TAG_CMD(wid)    → handle MSG_PRESENT/MSG_CLOSE, re-arm OP_IPC_RECV
if any dirty → composite()
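One way to read the loop is as a pure function from a batch of completions to the operations that must be re-armed plus a "composite needed" flag. The sketch below takes that view; the enum and struct names are hypothetical, and treating only damage completions as dirtying is an assumption, since the pseudocode does not say whether MSG_PRESENT also marks a window dirty.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Completion {
    NewClient,   // TAG_NEW_CLIENT
    Damage(u32), // TAG_DAMAGE(wid)
    Cmd(u32),    // TAG_CMD(wid)
}

#[derive(Debug, PartialEq, Eq)]
pub enum Rearm {
    RegistrationRecv, // OP_IPC_RECV on reg_recv
    RingWait(u32),    // OP_RING_WAIT on the window's notify fd
    ClientRecv(u32),  // OP_IPC_RECV on the client's c2s channel
}

pub struct Outcome {
    pub rearm: Vec<Rearm>,
    pub composite: bool,
}

/// Processes one batch returned by the completion port.
pub fn dispatch(batch: &[Completion]) -> Outcome {
    let mut outcome = Outcome { rearm: Vec::new(), composite: false };
    for completion in batch {
        match *completion {
            Completion::NewClient => outcome.rearm.push(Rearm::RegistrationRecv),
            Completion::Damage(wid) => {
                outcome.composite = true; // at least one window is dirty
                outcome.rearm.push(Rearm::RingWait(wid));
            }
            Completion::Cmd(wid) => outcome.rearm.push(Rearm::ClientRecv(wid)),
        }
    }
    outcome
}
```

Because compositing happens once per batch rather than once per completion, several damage notifications that arrive together cost a single repaint.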
Compositing
- BGRA throughout (matches BGA native format, zero conversion)
- Background: solid dark grey (0x00282828)
- Window placement: auto-tile (2×2 grid)
- Full-screen repaint on any damage (acceptable at 1024×768)
- Painter’s algorithm, back-to-front
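A 2×2 auto-tile placement can be computed from the window index alone. The helper below is a guess at the simplest scheme (row-major order, wrapping after four windows); the function name and the ordering are assumptions, not taken from the compositor source.

```rust
/// Returns (x, y, width, height) of the tile for window `index`
/// on a screen of `screen_w` × `screen_h` pixels.
pub fn tile_rect(index: usize, screen_w: u32, screen_h: u32) -> (u32, u32, u32, u32) {
    let cell_w = screen_w / 2;
    let cell_h = screen_h / 2;
    let slot = (index % 4) as u32; // wrap around after four windows
    let col = slot % 2;
    let row = slot / 2;
    (col * cell_w, row * cell_h, cell_w, cell_h)
}
```

At 1024×768 each tile is 512×384, so the fourth window (index 3) lands at (512, 384).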
Files
| File | Role |
|---|---|
| libkernel/src/service.rs | Service registry (register, lookup) |
| libkernel/src/framebuffer.rs | LFB phys addr globals (set_lfb_phys, get_lfb_phys) |
| osl/src/syscalls/service.rs | sys_svc_register, sys_svc_lookup |
| osl/src/syscalls/fb.rs | sys_framebuffer_open |
| user-rs/rt/src/compositor_proto.rs | Protocol constants |
| user-rs/compositor/ | Compositor binary |
| user-rs/demo-client/ | Demo client binary |
Usage
# Build kernel
cargo bootimage --manifest-path kernel/Cargo.toml
# Build and deploy Rust userspace (compositor + demo-client)
scripts/user-rs-build.sh
# Run
scripts/run.sh
The compositor is auto-launched by the kernel at boot (see launch_compositor
in kernel/src/main.rs). Run demo-client from the shell to display a
test gradient.
Future Work
- Input routing: keyboard events → focused window
- Window management: move, resize, focus, Z-order
- Double buffering: back-buffer swap
- Alpha blending
- Dirty rect optimization
- Write-combining PAT entries for LFB pages
- Service auto-cleanup on process exit
Display & Input Ownership
How the framebuffer and keyboard transition from kernel to compositor using the existing fd-passing and service-registry primitives.
Problem
At boot, three components compete for the display and keyboard:
- Kernel WRITER — println!() renders to the BGA framebuffer via an IrqMutex-protected Framebuffer struct.
- User shell — reads from the console input buffer, writes to stdout (which goes through WRITER).
- Compositor — mmaps the same LFB via framebuffer_open, composites client windows.
Today there is no ownership model. The kernel WRITER and compositor both
hold pointers to the same physical framebuffer memory and write
concurrently. Keyboard input routes to the user shell via
FOREGROUND_PID but the compositor has no way to receive it.
Design: Capability-Based Handoff
Ownership is expressed through who holds which fds, matching the existing IPC model.
Display Ownership
BOOT                        COMPOSITOR RUNNING
────                        ──────────────────
WRITER ──▶ LFB (active)     WRITER ──▶ serial only
                            Compositor ──▶ LFB (exclusive)
When the compositor calls framebuffer_open (syscall 515), two things
happen:
- The compositor gets an shmem fd wrapping the BGA LFB (existing behaviour).
- Side effect: the kernel marks the WRITER backend as suppressed. All subsequent println!()/log::info!() output is redirected to serial only. The kernel no longer touches the LFB. The status bar and timeline strip are also suppressed.
If the compositor exits or crashes, the kernel detects this (via process
exit cleanup in terminate_process) and unsuppresses the WRITER,
calling repaint_all() to restore kernel display output.
Implementation: DISPLAY_SUPPRESSED: AtomicBool and DISPLAY_OWNER_PID: AtomicU64 in libkernel/src/vga_buffer/mod.rs.
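The suppress-and-restore behaviour can be modelled with the same two atomics. The sketch groups them in a struct so it can be exercised on a host; the method names are hypothetical, and the actual repaint is represented only by the boolean return value.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

pub struct DisplayOwnership {
    suppressed: AtomicBool, // mirrors DISPLAY_SUPPRESSED
    owner_pid: AtomicU64,   // mirrors DISPLAY_OWNER_PID
}

impl DisplayOwnership {
    pub const fn new() -> Self {
        Self { suppressed: AtomicBool::new(false), owner_pid: AtomicU64::new(0) }
    }

    /// Called from the framebuffer_open path: record the owner, silence the kernel writer.
    pub fn acquire(&self, pid: u64) {
        self.owner_pid.store(pid, Ordering::SeqCst);
        self.suppressed.store(true, Ordering::SeqCst);
    }

    /// Called during process exit cleanup. Returns true when the exiting
    /// process owned the display, meaning the kernel should repaint.
    pub fn on_process_exit(&self, pid: u64) -> bool {
        let owned = self.suppressed.load(Ordering::SeqCst)
            && self.owner_pid.load(Ordering::SeqCst) == pid;
        if owned {
            self.suppressed.store(false, Ordering::SeqCst);
        }
        owned
    }

    pub fn is_suppressed(&self) -> bool {
        self.suppressed.load(Ordering::SeqCst)
    }
}
```

Checking the owner PID means that an unrelated process exiting while the compositor is running does not accidentally hand the display back to the kernel.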
Input Ownership — Userspace Keyboard Driver
Instead of a kernel-level input_acquire syscall, the keyboard becomes
a userspace service (/bin/kbd). This requires no new kernel
interfaces — only existing primitives.
┌──────────┐   IRQ fd   ┌──────────┐  IPC channel  ┌────────────┐
│ IO APIC  │───────────▶│ /bin/kbd │──────────────▶│ Compositor │
│ (GSI 1)  │  scancode  │          │  key events   │            │
└──────────┘  in result └──────────┘               └────────────┘
How it works:
- /bin/kbd calls irq_create(1) — claims keyboard IRQ via the existing IRQ fd mechanism. This reroutes the keyboard interrupt from the hardwired kernel ISR (vector 33) to a dynamic vector handled by irq_fd_dispatch, which reads port 0x60 and delivers the scancode in completion.result. The GSI is kept unmasked between interrupts so that edge-triggered IRQ edges are never lost; scancodes that arrive between OP_IRQ_WAIT re-arms are buffered in a 64-entry ring.
- Creates a registration channel and calls svc_register("keyboard").
- Event loop on CompletionPort:
  - OP_IRQ_WAIT → receives scancode → decodes via scancode set 1 state machine → produces key events
  - OP_IPC_RECV on registration channel → new client connecting (compositor sends a channel send-end) → stores client
- For each decoded key event, sends an IpcMessage to all connected clients.
When /bin/kbd exits, close_irq restores the original IO APIC entry,
and the kernel keyboard actor resumes automatically — providing fallback.
Safety: if no client connects within 2 seconds, kbd exits to avoid capturing keyboard input with nobody listening.
Keyboard Protocol
MSG_KB_CONNECT (tag=1): client → keyboard service
  data = [0, 0, 0]
  fds = [event_send_fd, -1, -1, -1]
MSG_KB_KEY (tag=1): keyboard service → client (via passed channel)
  data = [byte, modifiers, key_type]
  fds = [-1, -1, -1, -1]
- key_type: 0 = ASCII byte, 1 = special key (arrow, etc.)
- modifiers: bitmask (bit 0 = shift, bit 1 = ctrl, bit 2 = alt)
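The MSG_KB_KEY payload is small enough to pack and unpack with a couple of helpers. The constant and function names below are hypothetical, and the u64 width of the data words is an assumption; the field order and bit assignments follow the protocol description above.

```rust
pub const MOD_SHIFT: u64 = 1 << 0;
pub const MOD_CTRL: u64 = 1 << 1;
pub const MOD_ALT: u64 = 1 << 2;

pub const KEY_ASCII: u64 = 0;   // data[0] is an ASCII byte
pub const KEY_SPECIAL: u64 = 1; // data[0] is a special-key code (arrow, etc.)

/// Builds the data[] words of a MSG_KB_KEY message: [byte, modifiers, key_type].
pub fn key_event_data(byte: u8, modifiers: u64, key_type: u64) -> [u64; 3] {
    [byte as u64, modifiers, key_type]
}

/// Tests whether a modifier bit is set in a received event.
pub fn has_modifier(data: &[u64; 3], modifier: u64) -> bool {
    data[1] & modifier != 0
}
```

A Ctrl+C press would be sent as `key_event_data(b'c', MOD_CTRL, KEY_ASCII)`, leaving it to the receiver to interpret the combination.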
Input Ownership — Mouse (Integrated into Compositor)
The mouse is handled directly by the compositor — no separate mouse driver process. The compositor claims IRQ 12 itself and decodes PS/2 packets inline, eliminating an IPC round-trip per mouse event.
┌──────────┐   IRQ fd   ┌────────────┐
│ IO APIC  │───────────▶│ Compositor │
│ (GSI 12) │    byte    │            │
└──────────┘  in result └────────────┘
How it works:
- The compositor calls irq_create(12) — claims mouse IRQ via the existing IRQ fd mechanism. The kernel automatically initializes the PS/2 auxiliary port (i8042 controller) when GSI 12 is claimed.
- Arms OP_IRQ_WAIT on the IRQ fd in its completion port event loop.
- On each IRQ completion, feeds the raw byte into an inline MouseDecoder that collects 3-byte PS/2 packets (sync on byte 0 bit 3), decodes signed deltas using the OSDev wiki formula, and updates the absolute cursor position (clamped to screen bounds).
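A decoder along these lines fits in a few dozen lines. The sketch below is not the compositor's MouseDecoder; it reuses the name for readability and applies the standard PS/2 packet layout (bit 3 of byte 0 always set, bits 4 and 5 carrying the X and Y sign). The `move_cursor` helper and the choice to invert Y for screen coordinates are assumptions.

```rust
/// Collects 3-byte PS/2 mouse packets from a stream of raw bytes.
#[derive(Default)]
pub struct MouseDecoder {
    buf: [u8; 3],
    len: usize,
}

impl MouseDecoder {
    /// Feeds one byte; returns (dx, dy, buttons) when a packet completes.
    pub fn feed(&mut self, byte: u8) -> Option<(i32, i32, u8)> {
        // Resynchronise: the first byte of a packet always has bit 3 set.
        if self.len == 0 && (byte & 0x08) == 0 {
            return None;
        }
        self.buf[self.len] = byte;
        self.len += 1;
        if self.len < 3 {
            return None;
        }
        self.len = 0;
        let flags = self.buf[0] as i32;
        // Sign-extend the 9-bit deltas using the sign bits in the flags byte.
        let dx = self.buf[1] as i32 - ((flags << 4) & 0x100);
        let dy = self.buf[2] as i32 - ((flags << 3) & 0x100);
        Some((dx, dy, self.buf[0] & 0x07))
    }
}

/// Applies a delta to the cursor, clamped to a `w` × `h` screen.
/// PS/2 reports positive Y as "up", so screen Y decreases.
pub fn move_cursor(pos: (i32, i32), dx: i32, dy: i32, w: i32, h: i32) -> (i32, i32) {
    ((pos.0 + dx).clamp(0, w - 1), (pos.1 - dy).clamp(0, h - 1))
}
```

Dropping bytes that cannot start a packet keeps the decoder from staying misaligned forever after a lost byte, at the cost of occasionally discarding one real packet.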
Compositor Key & Mouse Forwarding
The compositor connects to the keyboard service on startup using
svc_lookup_retry(). Key events are forwarded to the focused
window’s client via MSG_KEY_EVENT (tag 5). Mouse events (decoded
directly from IRQ 12) drive the cursor, focus, window movement,
and resizing.
Window Decorations (Server-Side, CDE Style)
The compositor draws server-side decorations inspired by the Common Desktop Environment (CDE) / Motif toolkit, with 3D beveled borders:
╔═══════════════════════════════╗ ─┐
║ ┌──┐ ║ │
║ │▪▪│ Win 1 (centered) ║ │ TITLE_H = 24px
║ └──┘ ║ │
╠═══════════════════════════════╣ ─┘
║ ┌───────────────────────────┐ ║
║ │ │ ║
║ │ Client Content │ ║ client buffer (w × h)
║ │ │ ║ (sunken inner bevel)
║ └───────────────────────────┘ ║
╚═══════════════════════════════╝
BORDER_W = 4px, BEVEL = 2px
- 3D bevels: draw_bevel() renders light/dark edge pairs on all four sides to create a raised or sunken look (2px bevel width)
- Title bar (24px): raised bevel, blue when focused, grey when unfocused
- Close button: raised square with inner square motif (CDE style), positioned in the top-left of the title bar
- Window title: centered text rendered with 8×16 CP437 font
- Client area: surrounded by a sunken inner bevel
- Color palette: blue-grey CDE theme (slate blue desktop, cool grey-blue window frames, blue active title bars)
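One way to render such a bevel is sketched below, assuming a 32-bit pixel buffer addressed as row * stride + column. The signature, colour handling, and corner ownership are illustrative choices; the compositor's real draw_bevel() may differ.

```rust
/// Draws a `bevel`-pixel-wide 3D edge just inside the rectangle (x, y, w, h).
/// Raised: light on top/left, dark on bottom/right. Sunken: the reverse.
/// The caller must ensure the rectangle lies inside `buf`.
pub fn draw_bevel(
    buf: &mut [u32],
    stride: usize,
    x: usize,
    y: usize,
    w: usize,
    h: usize,
    bevel: usize,
    light: u32,
    dark: u32,
    raised: bool,
) {
    if w < 2 * bevel || h < 2 * bevel {
        return; // too small to hold both edge pairs
    }
    let (top_left, bottom_right) = if raised { (light, dark) } else { (dark, light) };
    for i in 0..bevel {
        // Top row and left column of ring i.
        for px in (x + i)..(x + w - i) {
            buf[(y + i) * stride + px] = top_left;
        }
        for py in (y + i)..(y + h - i) {
            buf[py * stride + x + i] = top_left;
        }
        // Bottom row and right column, painted second so they own the
        // two shared corners of the ring.
        for px in (x + i)..(x + w - i) {
            buf[(y + h - 1 - i) * stride + px] = bottom_right;
        }
        for py in (y + i)..(y + h - i) {
            buf[py * stride + x + w - 1 - i] = bottom_right;
        }
    }
}
```

Calling it with raised = true for the title bar and raised = false around the client area would reproduce the two looks listed above.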
Window Management
Focus: Click anywhere in a window to focus it. The focused window moves to the top of the Z-order and receives keyboard input.
Move: Drag the title bar to move a window.
Resize: Drag the bottom edge, right edge, or bottom-right corner
to resize. The cursor changes to indicate the resize direction:
diagonal double-arrow for corners, horizontal for right edge, vertical
for bottom edge. During drag, the window frame updates live. On
mouse-up, the compositor allocates a new shared buffer and sends
MSG_WINDOW_RESIZED to the client. The terminal emulator remaps the
new buffer, recalculates cols/rows, clears the screen, and nudges the
shell to redraw its prompt.
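The cols/rows recalculation after a resize is integer division by the glyph size. A minimal sketch, assuming the 8×16 font mentioned for window titles is also the terminal font and that partial cells at the right and bottom edges are simply left unused (the function and constant names are hypothetical):

```rust
/// Glyph cell size of the 8x16 CP437 font.
pub const GLYPH_W: u32 = 8;
pub const GLYPH_H: u32 = 16;

/// Returns (cols, rows) that fit in a client buffer of the given pixel size.
/// Leftover pixels that do not fill a whole cell are ignored.
pub fn grid_size(width_px: u32, height_px: u32) -> (u32, u32) {
    (width_px / GLYPH_W, height_px / GLYPH_H)
}
```

For the default 640×384 window this yields the 80×24 grid.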
Close: Click the close button to close a window.
Compositor Double Buffering & Cursor-Only Rendering
The compositor uses an offscreen back buffer (heap-allocated, same
size as the framebuffer) to eliminate flicker. Full composite passes
clear and draw all windows (with decorations) into the back buffer,
then copy the finished frame to the LFB in a single memcpy.
Cursor-only optimization: Mouse movement that doesn’t change the scene (no window drag, no focus change) takes a fast path: restore the old cursor rectangle from the back buffer (~12x8 pixels), draw the cursor at the new position, and patch only those two small rectangles on the LFB. This avoids the full 3 MB recomposite on every mouse event.
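The fast path can be modelled as two small rectangle operations. The sketch below assumes the back buffer holds the scene without the cursor, that both buffers use a stride equal to the screen width in pixels, and that the cursor is a solid block; Rect, patch_rect, fill_rect, and move_cursor are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x: usize,
    pub y: usize,
    pub w: usize,
    pub h: usize,
}

/// Clips a rectangle to the screen; returns (x0, x1, y0, y1) or None if empty.
fn clip(r: Rect, width: usize, height: usize) -> Option<(usize, usize, usize, usize)> {
    let x1 = (r.x + r.w).min(width);
    let y1 = (r.y + r.h).min(height);
    if r.x >= x1 || r.y >= y1 {
        None
    } else {
        Some((r.x, x1, r.y, y1))
    }
}

/// Copies one rectangle from the back buffer to the LFB.
pub fn patch_rect(back: &[u32], lfb: &mut [u32], width: usize, height: usize, r: Rect) {
    if let Some((x0, x1, y0, y1)) = clip(r, width, height) {
        for row in y0..y1 {
            let (start, end) = (row * width + x0, row * width + x1);
            lfb[start..end].copy_from_slice(&back[start..end]);
        }
    }
}

/// Fills one rectangle on the LFB with a solid colour (stand-in for the cursor sprite).
pub fn fill_rect(lfb: &mut [u32], width: usize, height: usize, r: Rect, colour: u32) {
    if let Some((x0, x1, y0, y1)) = clip(r, width, height) {
        for row in y0..y1 {
            lfb[row * width + x0..row * width + x1].fill(colour);
        }
    }
}

/// Cursor-only update: erase the old cursor, draw the new one, touch nothing else.
pub fn move_cursor(
    back: &[u32],
    lfb: &mut [u32],
    width: usize,
    height: usize,
    old: Rect,
    new: Rect,
    colour: u32,
) {
    patch_rect(back, lfb, width, height, old);
    fill_rect(lfb, width, height, new, colour);
}
```

Only the pixels inside the two rectangles are written, which is where the saving over a full recomposite comes from.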
Terminal Emulator and Shell
The terminal emulator (/bin/term) is a compositor client that spawns
the shell with pipe-connected stdin/stdout.
Compositor Terminal Emulator Shell
────────── ───────────────── ─────
MSG_KEY_EVENT stdin pipe
─────────────────▶ translate ──────────────▶ read(0)
s2c channel stdout pipe
◀───────────────── render ◀────────────── write(1)
damage notify (pipe pair)
◀─────────────────
The terminal emulator:
- Connects to compositor via svc_lookup_retry("compositor")
- Gets a 640×384 window (80×24 cells at 8×16 font)
- Creates pipe pairs for shell stdin/stdout
- Spawns /bin/shell via clone(CLONE_VM|CLONE_VFORK) + execve (child_stack=0 shares the parent’s stack; this is safe because the parent is blocked until the child calls execve or _exit)
- Event loop: key events → shell stdin pipe, shell stdout → VT100 parser → glyph rendering → damage signal
Command interpreter (shell)
A plain stdin/stdout program with no knowledge of the compositor:
- Reads lines from stdin, writes output to stdout
- Works identically in both graphical and fallback modes
Fallback Path
If the compositor binary is not present or fails to start:
- framebuffer_open is never called → WRITER stays active
- Keyboard IRQ stays with kernel actor → routes to console buffer
- User shell reads/writes via console as it does today
The system degrades gracefully to the current behaviour.
Startup Sequence
Boot
├─ launch_keyboard_driver() [100ms VFS settle]
│ └─ /bin/kbd: irq_create(1), svc_register("keyboard")
│ keyboard IRQ rerouted → kernel actor dormant
│
├─ launch_compositor() [100ms VFS settle]
│ ├─ /bin/compositor: framebuffer_open → WRITER suppressed
│ ├─ irq_create(12) → PS/2 aux init, inline mouse decoding
│ ├─ svc_lookup_retry("keyboard") → receive key events
│ ├─ svc_register("compositor")
│ └─ kernel spawns /bin/term
│ ├─ svc_lookup_retry("compositor") → get window
│ ├─ pipe2 + clone/execve → spawn /bin/shell
│ └─ event loop (keys → shell stdin, shell stdout → render)
│
└─ launch_userspace_shell() [polls DISPLAY_SUPPRESSED every 50ms, up to 1s]
└─ if DISPLAY_SUPPRESSED → skip (compositor path active)
else → launch /bin/shell directly (fallback)
Service readiness is coordinated via polling and retry loops rather than hardcoded sleep timings:
- Userspace: svc_lookup_retry() retries service lookup with 50ms yields
- Kernel: launch_userspace_shell polls DISPLAY_SUPPRESSED every 50ms (up to 20 iterations / 1 second) before falling back to standalone shell
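Both loops share the same shape: try, yield, try again, give up after a bound. A generic sketch of that pattern follows; retry_with_yield and its closure parameters are hypothetical and do not reflect the real svc_lookup_retry() signature.

```rust
/// Calls `attempt` up to `max_attempts` times, invoking `yield_now` between
/// failed attempts. Returns the first successful value, or None on timeout.
pub fn retry_with_yield<T>(
    max_attempts: u32,
    mut attempt: impl FnMut() -> Option<T>,
    mut yield_now: impl FnMut(),
) -> Option<T> {
    for i in 0..max_attempts {
        if let Some(value) = attempt() {
            return Some(value);
        }
        // No point in sleeping after the final failed attempt.
        if i + 1 < max_attempts {
            yield_now();
        }
    }
    None
}
```

With max_attempts = 20 and a 50ms yield, the worst case is roughly the 1 second bound used by launch_userspace_shell.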
Fallback Matrix
| kbd | compositor | term | Result |
|---|---|---|---|
| yes | yes | yes | Full graphical: kbd→compositor (mouse integrated)→term→shell |
| yes | yes | no | Compositor up, no terminal (display-only) |
| no | no | - | Classic fallback: kernel kbd actor + shell on console |
No New Syscalls
This design uses only existing kernel primitives:
| Primitive | Syscall | Use |
|---|---|---|
| irq_create | 504 | keyboard driver claims IRQ 1, compositor claims IRQ 12 (mouse) |
| svc_register / svc_lookup | 513/514 | keyboard and compositor service discovery |
| ipc_create / ipc_send / ipc_recv | 505-507 | key/mouse event delivery with fd passing |
| shmem_create | 508 | window buffers, resize buffer allocation |
| framebuffer_open | 515 | display ownership (with suppression side effect) |
| pipe2 | 293 | terminal↔shell communication |
| clone / execve / dup2 | 56/59/33 | terminal spawns shell |
The only kernel changes beyond the initial suppression flag are:
- PS/2 auxiliary port initialization on irq_create(12) (libkernel/src/ps2.rs)
- irq_fd_dispatch reads port 0x60 for GSI 12 (mouse) in addition to GSI 1 (keyboard)
Networking Design
Overview
This document describes the planned networking architecture for ostoo. The design adds TCP/IP networking via a VirtIO network device and the smoltcp protocol stack.
The initial implementation runs entirely in kernel space, matching the existing
pattern where VFS and block I/O run in the devices crate. For the
longer-term microkernel path where the NIC driver and TCP/IP stack move to
userspace, see microkernel-design.md.
Architecture
Kernel-Space (Initial)
Userspace programs (socket/connect/bind/listen/accept/send/recv)
│
osl/src/net.rs ← syscall → smoltcp socket mapping
│
smoltcp::iface::Interface ← protocol processing (TCP/IP/ARP/DHCP/DNS)
│
devices/src/virtio/net.rs ← smoltcp Device trait wrapping VirtIONet
│
VirtIONet<KernelHal, PciTransport> ← raw Ethernet frame send/receive
│
QEMU virtio-net-pci ← -device virtio-net-pci,netdev=net0
-netdev user,id=net0
Userspace (Future)
Once the microkernel primitives from microkernel-design.md are in place, networking can be restructured:
Userspace programs (socket syscalls, routed by kernel to TCP/IP server)
│
TCP/IP server process ← smoltcp in a userspace daemon
│
shared memory ring buffers ← zero-copy packet passing
│
NIC driver process ← virtio-net via mapped MMIO + IRQ fd
│
virtio-net-pci hardware
The kernel’s socket syscall handlers become thin IPC stubs that route requests to the TCP/IP server, preserving POSIX compatibility for musl. This corresponds to Phase C/D in the microkernel migration path.
Kernel-Space vs Userspace
Decision: kernel-space first
The initial implementation runs in kernel space. Reasons:
- Simpler. No IPC overhead — smoltcp directly accesses the virtio-net driver. No message-passing for every packet.
- Lower latency. No user/kernel context switches per packet.
- Matches existing patterns. VFS operations already run in kernel via the devices crate; networking follows the same model.
- Proven. Hermit OS and Kerla both use smoltcp + virtio-net in kernel space successfully.
The microkernel path (Phases C-D in microkernel-design.md) moves the NIC driver and TCP/IP stack to userspace once shared memory, IRQ delivery, and device MMIO mapping primitives exist.
Protocols
| Layer | Protocol | Priority | Notes |
|---|---|---|---|
| Link | Ethernet II | Required | virtio-net provides raw frames |
| Link | ARP | Required | Automatic in smoltcp with Ethernet medium |
| Network | IPv4 | Required | Core routing |
| Network | ICMP | Required | Ping, error reporting |
| Transport | TCP | Required | Streams (HTTP, SSH, etc.) |
| Transport | UDP | Required | Datagrams (DNS, NTP, etc.) |
| Application | DHCPv4 | Required | Auto-configure IP/gateway/DNS from QEMU |
| Application | DNS | High | Name resolution |
| Network | IPv6 | Deferred | smoltcp supports it when ready |
Crates
virtio-drivers 0.13 (already in workspace)
Provides device::net::VirtIONet with:
- new(transport, buf_len): initialize with PCI transport
- mac_address(): read hardware MAC
- receive() / send(tx_buf): raw Ethernet frame I/O
- ack_interrupt() / enable_interrupts(): IRQ support
The existing KernelHal and create_pci_transport() work unchanged. Add
device constants for virtio-net PCI IDs (modern: 0x1041, legacy: 0x1000).
smoltcp 0.12
no_std TCP/IP stack. Works with alloc (ostoo already has a heap).
Suggested Cargo features:
smoltcp = { version = "0.12", default-features = false, features = [
"alloc", "log", "medium-ethernet",
"proto-ipv4", "proto-dhcpv4", "proto-dns",
"socket-raw", "socket-udp", "socket-tcp",
"socket-icmp", "socket-dhcpv4", "socket-dns",
] }
Provides:
- Interface: central type that drives all protocol processing
- phy::Device trait: integrate a NIC via RxToken / TxToken (zero-copy, token-based)
- Socket types: raw, ICMP, TCP, UDP, DHCPv4 client, DNS resolver
- ARP / neighbor cache: automatic with Ethernet medium
No additional crates needed. embassy-net wraps smoltcp but is tied to the Embassy async runtime — skip it.
Integration Points
NIC Driver (devices/src/virtio/net.rs)
Wrap VirtIONet<KernelHal, PciTransport, 64> in a struct implementing
smoltcp’s phy::Device trait:
- receive() → RxToken (read raw frame from virtqueue)
- transmit() → TxToken (write raw frame to virtqueue)
- capabilities() → MTU 1514, Medium::Ethernet, no checksum offload
Polling
smoltcp requires periodic Interface::poll() calls. Two options:
- Timer-driven — poll every 10ms from the scheduler tick (simple, higher latency).
- IRQ-driven — virtio-net interrupt triggers poll (responsive, more complex).
Start with timer-driven. Migrate to IRQ-driven once the basics work.
Blocking Bridge
osl::blocking::blocking() already converts async → sync for VFS. Same
pattern for socket operations: spawn an async task that polls smoltcp, block
the calling thread until data arrives or the operation completes.
Socket File Descriptors
Create a SocketHandle struct implementing the FileHandle trait. Store in
the process fd_table like pipes and files:
- read() on a TCP socket → recv from smoltcp TCP socket
- write() on a TCP socket → send to smoltcp TCP socket
- close() → release smoltcp socket handle
UDP sockets need sendto/recvfrom for the address parameter.
DHCP at Boot
After virtio-net init, create a DHCPv4 socket, poll until configured, then set the interface IP/gateway/DNS. QEMU user-mode networking provides DHCP at 10.0.2.2 with default subnet 10.0.2.0/24.
Syscalls
New Linux-compatible syscall numbers to add in osl/src/syscalls/mod.rs:
| Nr | Name | Purpose |
|---|---|---|
| 41 | socket | Create AF_INET SOCK_STREAM/SOCK_DGRAM |
| 42 | connect | TCP connect to remote |
| 43 | accept | Accept incoming TCP connection |
| 44 | sendto | Send datagram with destination address |
| 45 | recvfrom | Receive datagram with source address |
| 46 | sendmsg | Scatter/gather send (needed by musl) |
| 47 | recvmsg | Scatter/gather receive (needed by musl) |
| 49 | bind | Bind to local address/port |
| 50 | listen | Mark socket as listening |
| 51 | getsockname | Get local address of socket |
| 54 | setsockopt | Set socket options (SO_REUSEADDR, etc.) |
| 55 | getsockopt | Get socket options |
Stubs returning ENOSYS for unsupported options are acceptable initially.
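In the style of the existing SYS_READ / SYS_WRITE constants, the table could be captured as named constants plus a placeholder handler. The constant names, the lookup function, and the stub below are hypothetical; the numbers are the Linux x86-64 values from the table, and ENOSYS is 38 on Linux.

```rust
pub const SYS_SOCKET: u64 = 41;
pub const SYS_CONNECT: u64 = 42;
pub const SYS_ACCEPT: u64 = 43;
pub const SYS_SENDTO: u64 = 44;
pub const SYS_RECVFROM: u64 = 45;
pub const SYS_SENDMSG: u64 = 46;
pub const SYS_RECVMSG: u64 = 47;
pub const SYS_BIND: u64 = 49;
pub const SYS_LISTEN: u64 = 50;
pub const SYS_GETSOCKNAME: u64 = 51;
pub const SYS_SETSOCKOPT: u64 = 54;
pub const SYS_GETSOCKOPT: u64 = 55;

/// Linux errno for "function not implemented".
pub const ENOSYS: i64 = 38;

/// Maps a syscall number to its name if it is one of the planned socket calls.
pub fn net_syscall_name(nr: u64) -> Option<&'static str> {
    Some(match nr {
        SYS_SOCKET => "socket",
        SYS_CONNECT => "connect",
        SYS_ACCEPT => "accept",
        SYS_SENDTO => "sendto",
        SYS_RECVFROM => "recvfrom",
        SYS_SENDMSG => "sendmsg",
        SYS_RECVMSG => "recvmsg",
        SYS_BIND => "bind",
        SYS_LISTEN => "listen",
        SYS_GETSOCKNAME => "getsockname",
        SYS_SETSOCKOPT => "setsockopt",
        SYS_GETSOCKOPT => "getsockopt",
        _ => return None,
    })
}

/// Placeholder handler: recognised networking syscalls return -ENOSYS until
/// implemented; other numbers are left to the rest of the dispatcher (None).
pub fn net_syscall_stub(nr: u64) -> Option<i64> {
    net_syscall_name(nr).map(|_| -ENOSYS)
}
```

Returning None for numbers outside the table lets the existing dispatch match fall through to its own handling.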
QEMU Configuration
Add to scripts/run.sh:
-device virtio-net-pci,netdev=net0 \
-netdev user,id=net0
QEMU user-mode networking (SLIRP) provides:
- NAT — guest can reach the internet and the host
- Built-in DHCP server at 10.0.2.2
- Built-in DNS forwarder at 10.0.2.3
- No host-side configuration needed
For host → guest connections, add port forwards:
-netdev user,id=net0,hostfwd=tcp::8080-:80
What This Enables
With TCP/UDP + DNS + DHCP, userspace programs compiled against musl can use the standard POSIX socket API. This opens the door to:
- Ping (ICMP echo)
- Simple network tools (netcat-like, wget-like)
- HTTP client/server
- Eventually SSH (requires more crypto infrastructure)
Code Quality Audit
A review of code smells, magic numbers, duplicated code, and missing
abstractions across the codebase. Companion to unsafe-audit.md which
covers unsafe specifically.
Date: 2026-03-19
1. Magic Numbers
1.1 Syscall numbers — osl/src/syscalls/mod.rs ✅ DONE
Fixed in 95da4c0. Named constants in osl/src/syscall_nr.rs;
dispatch match now uses SYS_READ, SYS_WRITE, etc.
1.2 MSR addresses ✅ DONE
Fixed in 95da4c0. Named constants in libkernel/src/msr.rs
(IA32_FS_BASE, IA32_EFER, etc.); all 12+ inline uses replaced.
1.3 Page size ✅ DONE
Fixed in 95da4c0. PAGE_SIZE and PAGE_MASK in
libkernel/src/consts.rs; all 20+ inline uses replaced.
1.4 Stack sizes ✅ DONE
Fixed in 95da4c0. KERNEL_STACK_SIZE in libkernel/src/consts.rs;
all 4 locations updated.
1.5 I/O port addresses — interrupts.rs
- 0x21, 0xA1: PIC data ports (lines 111-112)
- 0x43, 0x40: PIT command/channel 0 ports (lines 218-220)
- 0x34: PIT mode command byte (line 220)
- 11932: PIT reload for 100 Hz (line 216)
Fix: Named constants:
const PIC_MASTER_DATA: u16 = 0x21;
const PIC_SLAVE_DATA: u16 = 0xA1;
const PIT_COMMAND: u16 = 0x43;
const PIT_CHANNEL0: u16 = 0x40;
const PIT_MODE_RATE_GEN: u8 = 0x34;
const PIT_100HZ_RELOAD: u16 = 11932;
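The reload value 11932 is the PIT input clock (1,193,182 Hz) divided by the target rate and rounded to the nearest integer. A small const fn, hypothetical and not part of the current codebase, makes that derivation explicit:

```rust
/// Input clock of the 8253/8254 PIT in Hz.
pub const PIT_BASE_HZ: u32 = 1_193_182;

/// Reload value for a target tick rate, rounded to the nearest divisor and
/// clamped to 16 bits. `hz` must be non-zero.
pub const fn pit_reload(hz: u32) -> u16 {
    let div = (PIT_BASE_HZ + hz / 2) / hz;
    if div > u16::MAX as u32 {
        u16::MAX
    } else {
        div as u16
    }
}

/// Same value as the literal 11932, derived rather than hard-coded.
pub const PIT_100HZ_RELOAD: u16 = pit_reload(100);
```

Because the function is const, the derived constant costs nothing at runtime and documents where the number comes from.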
1.6 stat struct layout ✅ DONE (partial)
Fixed in 95da4c0. STAT_SIZE and S_IFCHR are now named constants
in sys_fstat. 0o666 (permission mode) remains inline — acceptable as a
well-known octal literal.
1.7 VirtIO vendor/device IDs — kernel/src/main.rs
0x1AF4, 0x1042, 0x1001 used inline in the virtio-blk PCI scan.
Now also 0x1049, 0x1009 for virtio-9p.
Fix:
const VIRTIO_VENDOR_ID: u16 = 0x1AF4;
const VIRTIO_BLK_MODERN_DEVICE_ID: u16 = 0x1042;
const VIRTIO_BLK_LEGACY_DEVICE_ID: u16 = 0x1001;
const VIRTIO_9P_MODERN_DEVICE_ID: u16 = 0x1049;
const VIRTIO_9P_LEGACY_DEVICE_ID: u16 = 0x1009;
1.8 Other notable magic numbers
| Location | Value | Suggested name |
|---|---|---|
| scheduler.rs:138 | 0x202 | RFLAGS_IF_RESERVED |
| scheduler.rs:283,337 | 0x1F80 | MXCSR_DEFAULT |
| vga_buffer/mod.rs:85,306 | 0x20..=0x7e | PRINTABLE_ASCII |
| vga_buffer/mod.rs:308 | 0xfe | NONPRINTABLE_PLACEHOLDER |
| memory/mod.rs:333,335 | 0x1FF | PAGE_TABLE_INDEX_MASK |
| syscalls/io.rs | 16 | IOVEC_SIZE |
| syscalls/fs.rs | 4096 | MAX_PATH_LEN |
| gdt.rs:33 | 4096 * 5 | DOUBLE_FAULT_STACK_SIZE |
2. Duplicated Code
2.1 FD table retrieval ✅ DONE
Fixed in 95da4c0. get_fd_handle() helper (now in osl/src/fd_helpers.rs)
eliminates 4 identical fd-lookup blocks.
2.2 Page alloc + zero + map loop ✅ DONE
Fixed in 95da4c0. MemoryServices::alloc_and_map_user_pages() in
libkernel/src/memory/mod.rs replaces the alloc+zero+map loops in
sys_brk and sys_mmap. (The spawn.rs loop is slightly different —
it writes ELF segment data — so it was not collapsed.)
2.3 Page clearing ✅ DONE
Fixed in 95da4c0. clear_page() in libkernel/src/consts.rs
replaces 6 inline write_bytes calls. (Some calls in spawn.rs that
write non-zero data were not replaced.)
2.4 PageTableFlags construction ✅ DONE
Fixed in 95da4c0. USER_DATA_FLAGS constant in
osl/src/syscalls/mem.rs replaces 3 identical flag expressions.
2.5 Path normalization — duplicated between crates ✅ DONE
Fixed in libkernel/src/path.rs. normalize() and resolve() are
now shared; kernel/src/shell.rs and osl/src/syscalls/ both delegate
to libkernel::path.
2.6 History entry restoration — keyboard_actor.rs ✅ DONE
Fixed alongside item 8. LineState::restore_from_history(&mut self, idx)
eliminates the duplicated buffer-copy logic.
2.7 read_user_string → path error wrapping — 2 copies
let path = match read_user_string(path_ptr, 4096) {
    Some(p) => p,
    None => return -errno::EFAULT,
};
Fix: fn get_user_path(ptr: u64) -> Result<String, i64>
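A minimal sketch of the proposed helper follows. The read_user_string defined here is only a stand-in so the example compiles on its own (it treats a null pointer as unreadable and otherwise returns a fixed path); the real function reads from user memory. MAX_PATH_LEN is the name suggested in section 1.8, and EFAULT is 14 on Linux.

```rust
pub const MAX_PATH_LEN: usize = 4096;
pub const EFAULT: i64 = 14;

/// Stand-in for the kernel's read_user_string (hypothetical behaviour).
fn read_user_string(ptr: u64, _max_len: usize) -> Option<String> {
    if ptr == 0 {
        None
    } else {
        Some(String::from("/bin/shell"))
    }
}

/// Reads a path from user memory, mapping failure to -EFAULT.
pub fn get_user_path(ptr: u64) -> Result<String, i64> {
    read_user_string(ptr, MAX_PATH_LEN).ok_or(-EFAULT)
}

/// Example call site: the four-line match collapses to one early return.
pub fn sys_path_len_sketch(path_ptr: u64) -> i64 {
    let path = match get_user_path(path_ptr) {
        Ok(p) => p,
        Err(errno) => return errno,
    };
    path.len() as i64
}
```

Handlers that already return Result could use the ? operator instead of the match.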
3. Missing Abstractions / Interface Opportunities
3.1 ProcessManager struct
libkernel/src/process.rs has free functions find_zombie_child,
has_children, mark_zombie, reap that all operate on the global
PROCESS_TABLE. These should be methods on a ProcessManager type
that encapsulates the table.
pub struct ProcessManager {
    table: Mutex<BTreeMap<ProcessId, Process>>,
}
// Proposed interface (signatures only; bodies omitted).
impl ProcessManager {
    pub fn find_zombie_child(&self, parent: ProcessId, target: i64) -> Option<(ProcessId, i32)>;
    pub fn mark_zombie(&self, pid: ProcessId, code: i32);
    pub fn reap(&self, pid: ProcessId);
    pub fn has_children(&self, pid: ProcessId) -> bool;
}
3.2 FileHandle trait is monolithic
Every FileHandle implementor must provide read, write, close,
kind, and getdents64, even when nonsensical (e.g. DirHandle::write
returns Err).
Options:
- Split into Readable, Writable, Directory traits
- Or add default impls returning appropriate errors so implementors only override what they support
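The second option might look roughly like this. The method signatures and the FileError type are assumptions for illustration; the real FileHandle trait in the codebase may use different types.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FileError {
    NotSupported,
    NotADirectory,
}

/// Sketch of FileHandle with default impls: only kind() is mandatory.
pub trait FileHandle {
    fn kind(&self) -> &'static str;

    fn read(&mut self, _buf: &mut [u8]) -> Result<usize, FileError> {
        Err(FileError::NotSupported)
    }
    fn write(&mut self, _buf: &[u8]) -> Result<usize, FileError> {
        Err(FileError::NotSupported)
    }
    fn getdents64(&mut self, _buf: &mut [u8]) -> Result<usize, FileError> {
        Err(FileError::NotADirectory)
    }
    fn close(&mut self) {}
}

/// A write-only handle overrides just the operation it supports.
pub struct NullSink {
    pub written: usize,
}

impl FileHandle for NullSink {
    fn kind(&self) -> &'static str {
        "null"
    }
    fn write(&mut self, buf: &[u8]) -> Result<usize, FileError> {
        self.written += buf.len();
        Ok(buf.len())
    }
}
```

Compared with splitting the trait, defaults keep a single trait object type in the fd table, at the cost of errors surfacing at runtime rather than compile time.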
3.3 MemoryServices is a god object
~500 lines mixing physical allocation, MMIO mapping, user page tables, address translation, and statistics.
Fix: Split into focused sub-types:
- PhysicalMemoryManager: frame allocation, phys-to-virt translation
- MmioMapper: MMIO region registration and caching
- UserPageTableManager: create/map/switch user address spaces
3.4 SyscallContext struct
Syscall handlers pass (rdi, rsi, rdx, r10, r8, r9) as 6 separate
u64 parameters. A context struct would be clearer:
pub struct SyscallContext {
    pub arg0: u64,
    pub arg1: u64,
    pub arg2: u64,
    pub arg3: u64,
    pub arg4: u64,
    pub arg5: u64,
}
This would also be the natural home for the fd-table helper method.
3.5 ConsoleInput encapsulation
libkernel/src/console.rs has CONSOLE_INPUT: Mutex<ConsoleInner> plus
FOREGROUND_PID: AtomicU64 as separate globals. These form a single
logical unit that should be one type:
pub struct ConsoleInput {
    inner: Mutex<ConsoleInner>,
    foreground_pid: AtomicU64,
}
3.6 Scattered global atomics
These related atomics are standalone statics when they could be encapsulated in manager types:
| Static | File | Could belong to |
|---|---|---|
| NEXT_PID, CURRENT_PID | process.rs | ProcessManager |
| NEXT_THREAD_ID, CURRENT_THREAD_IDX_ATOMIC | scheduler.rs | Scheduler |
| FOREGROUND_PID | console.rs | ConsoleInput |
| LAPIC_EOI_ADDR | interrupts.rs | Interrupt manager |
| CONTEXT_SWITCHES | scheduler.rs | Scheduler |
3.7 User vs kernel address types
The type system uses u64 for both user and kernel virtual addresses.
Newtype wrappers would prevent accidental misuse:
pub struct UserVirtAddr(u64);
pub struct KernelVirtAddr(u64);
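The newtypes become more useful if construction is the only place a raw u64 is checked. The sketch below assumes the x86-64 canonical split: user addresses below 0x0000_8000_0000_0000 (the same bound sys_write already validates against) and kernel addresses at or above 0xFFFF_8000_0000_0000. The constructor and method names are hypothetical.

```rust
/// First address past the lower (user) canonical half.
pub const USER_SPACE_END: u64 = 0x0000_8000_0000_0000;
/// First address of the upper (kernel) canonical half.
pub const KERNEL_SPACE_START: u64 = 0xFFFF_8000_0000_0000;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UserVirtAddr(u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KernelVirtAddr(u64);

impl UserVirtAddr {
    /// Accepts only addresses in the user half.
    pub fn new(addr: u64) -> Option<Self> {
        if addr < USER_SPACE_END { Some(Self(addr)) } else { None }
    }

    /// Accepts a buffer only if [addr, addr + len) lies entirely in user space.
    pub fn for_buffer(addr: u64, len: u64) -> Option<Self> {
        let end = addr.checked_add(len)?;
        if end <= USER_SPACE_END { Some(Self(addr)) } else { None }
    }

    pub fn as_u64(self) -> u64 {
        self.0
    }
}

impl KernelVirtAddr {
    /// Accepts only addresses in the kernel half.
    pub fn new(addr: u64) -> Option<Self> {
        if addr >= KERNEL_SPACE_START { Some(Self(addr)) } else { None }
    }

    pub fn as_u64(self) -> u64 {
        self.0
    }
}
```

A syscall handler that takes UserVirtAddr in its signature cannot be handed a kernel pointer without an explicit, visible conversion.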
4. Long Functions / Deep Nesting
4.1 keyboard_actor.rs:on_key — 238 lines ✅ DONE
Fixed alongside item 8. Key-handling logic moved to LineState
methods (submit, backspace, delete_forward, move_left/right,
history_up/down, etc.). on_key is now a thin one-liner-per-key
dispatch table.
4.2 scheduler.rs:preempt_tick — 102 lines ✅ DONE
Fixed alongside item 9. Decomposed into save_current_context(),
restore_thread_state() (via SwitchTarget struct), and
debug_check_initial_alignment(). preempt_tick itself has zero direct
unsafe blocks.
4.3 syscalls/mem.rs:sys_mmap — 68 lines
Validation, allocation, and mapping all in one function.
Fix: Break into validate_mmap_request() and the shared
alloc_and_map_user_pages() from section 2.2.
4.4 syscalls/mem.rs:sys_brk — 60 lines
Same issue as sys_mmap — does too many things.
4.5 Deep nesting in keyboard_actor.rs:159-331 ✅ DONE
Fixed alongside item 8. Each match arm is now a one-liner calling
a LineState method; the actual logic lives in those methods at a
single nesting level.
5. Other Code Smells
5.1 Repeated runnable-state check ✅ DONE
Fixed alongside item 9. ThreadState::is_runnable() method replaces
the two identical != Dead && != Blocked checks.
5.2 VFS blocking wrappers
osl/src/syscalls/fs.rs has vfs_read_file and vfs_list_dir
with identical structure (allocate String, call blocking() with async
VFS call).
Fix: Macro or generic wrapper to eliminate the boilerplate.
5.3 Process exit + parent wake pattern
sys_exit (osl/src/syscalls/process.rs) does get-parent → mark_zombie →
unblock-parent as separate steps. This should be a single
ProcessManager::exit_and_notify(pid, code) method.
Summary — Recommended Priority
Tier 1: Easy wins with high readability payoff — ✅ ALL DONE
All Tier 1 items were completed in 95da4c0:
- Named constants for syscall numbers, MSRs, page sizes ✅
- Extract get_fd_handle() helper (eliminates 4 copies) ✅
- Extract alloc_and_map_user_pages() (eliminates 3 copies) ✅
- const USER_DATA_FLAGS for page table flags ✅
- clear_page() utility (eliminates 8 copies) ✅
Tier 2: Structural improvements
- Share path normalization between kernel shell and osl ✅
- ProcessManager struct to encapsulate process table ✅
- Decompose on_key into a LineEditor state machine ✅
- Decompose preempt_tick into smaller functions ✅
- Break sys_brk / sys_mmap into validation + mapping
Tier 3: Architectural refinements
- Split MemoryServices into focused sub-managers
- SyscallContext struct for cleaner parameter passing
- ConsoleInput encapsulation
- UserVirtAddr / KernelVirtAddr newtypes
- FileHandle trait restructuring (split or default impls)
Unsafe Code Audit & Refactoring Opportunities
An audit of unsafe usage across the codebase, prioritised by density and
refactoring payoff.
1. libkernel/src/vga_buffer/mod.rs — Raw pointer to MMIO buffer ✅ DONE
Writer stores a raw *mut Buffer pointer and dereferences it with
unsafe { &mut *self.buffer } in 7 separate places. There is also a
manual unsafe impl Send to paper over the raw pointer.
Completed (commit 75de8c4):
- Introduced a VgaBuffer safe wrapper that encapsulates the raw pointer with unsafe confined to construction only. Safe read_cell / write_cell / set_hw_cursor methods replaced all interior unsafe blocks in Writer methods and free functions.
- unsafe impl Send moved from Writer to VgaBuffer with documented invariant.
- set_hw_cursor is now a safe method on VgaBuffer (was a standalone unsafe fn).
- core::mem::transmute in tests replaced with a new Color::from_u8() constructor.
- timeline_append refactored: ISR now pushes to a lock-free ArrayQueue instead of writing directly to VGA RAM with raw pointers. A new TimelineActor (stream-driven, using #[on_stream]) drains the queue and writes to VGA row 1 through the safe WRITER / VgaBuffer interface. Eliminates the last unsafe block and removes the VGA_BASE atomic.
2. libkernel/src/task/scheduler.rs — Raw stack frame construction & inline asm ✅ DONE
spawn_thread and spawn_user_thread both manually write 20 u64 values
to raw stack pointers to construct fake iretq frames. preempt_tick reads
raw pointers at computed offsets for sanity checks. process_trampoline
contains a large unsafe asm block.
Completed (commit ac60740):
- Introduced #[repr(C)] SwitchFrame with named fields matching the lapic_timer_stub push/pop order. Constructors new_kernel() and new_user_trampoline() replace magic-number frame.add(N).write(...) in both spawn_thread and spawn_user_thread.
- preempt_tick sanity check reads frame.rip / frame.rsp through the typed struct instead of raw pointer arithmetic.
- Extracted drop_to_ring3() unsafe helper from process_trampoline: GS MSR writes + CR3 switch + iretq in one well-documented unsafe fn, making the safety boundary explicit.
3. libkernel/src/syscall.rs — static mut per-CPU data ✅ DONE
PER_CPU and SYSCALL_STACK are static mut, accessed with bare
unsafe throughout. sys_write creates a slice from a raw user-space
pointer without any validation.
Completed (commit 1c28010):
- Replaced static mut PER_CPU with an UnsafeCell wrapper (PerCpuCell) with documented safety invariant (single CPU, interrupts disabled).
- Replaced static mut SYSCALL_STACK with a safe #[repr(align(16))] static. kernel_stack_top() is now fully safe.
- sys_write now validates that the user buffer falls entirely within user address space (< 0x0000_8000_0000_0000), returning EFAULT for invalid pointers.
- init(), set_kernel_rsp(), per_cpu_addr() updated to use new accessors; no more &raw const on static mut.
4. apic/src/local_apic/mapped.rs — Every method is unsafe ✅ DONE
MappedLocalApic has 15 public unsafe methods. The unsafety stems
from MMIO access via raw pointers in read_reg_32 / write_reg_32, but
the actual invariant is in construction (providing a valid base address),
not in each register read/write.
Completed (commit 24a421d):
- MappedLocalApic::new() is now the sole unsafe boundary with documented safety invariants.
- All 15 public methods are now safe; the read_reg_32 / write_reg_32 trait impl uses core::ptr::read_volatile / write_volatile.
- Callers in apic/src/lib.rs and devices/src/vfs/proc_vfs/ updated; dozens of unsafe blocks removed.
5. apic/src/io_apic/mapped.rs — Same pattern as local APIC ✅ DONE
Same issue — every public method is unsafe, and register access helpers
use raw pointer dereferences without read_volatile / write_volatile.
Completed (commit 24a421d):
- MappedIoApic::new() is now the sole unsafe boundary with documented safety invariants.
- base_addr field made private with base_addr() getter.
- All public methods (mask_all, mask_entry, set_irq, max_redirect_entries, read_version_raw, read_redirect_entry) are now safe. Internal calls to the IoApic trait methods remain unsafe blocks.
- IoApic trait impl (read_reg_32 / write_reg_32 / read_reg_64 / write_reg_64) now uses core::ptr::read_volatile / write_volatile instead of raw dereferences, which is correct for MMIO.
- Callers in apic/src/lib.rs and devices/src/vfs/proc_vfs/ updated.
6. kernel/src/kernel_acpi.rs — Repetitive raw pointer reads/writes
The acpi::Handler impl has 8 nearly identical read_uN / write_uN
methods, each doing unsafe { *(addr as *const T) }. No volatile access,
no alignment checks.
Recommendations:
- Create a generic fn mmio_read<T>(addr: usize) -> T / fn mmio_write<T>(addr: usize, val: T) helper using read_volatile / write_volatile, then call it from each trait method. Reduces 16 lines of unsafe to 2.
- Same for the IO port methods: a single port_read::<T>(port) / port_write::<T>(port, val) generic would collapse 6 methods.
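The MMIO half of that recommendation could be sketched as below. The helpers are marked unsafe fn, the T: Copy bound, and the debug alignment check are choices made for this example rather than something the audit prescribes.

```rust
use core::mem::align_of;
use core::ptr;

/// Volatile read of a T from a memory-mapped address.
///
/// # Safety
/// `addr` must be mapped, readable, and suitably aligned for T.
pub unsafe fn mmio_read<T: Copy>(addr: usize) -> T {
    debug_assert!(addr % align_of::<T>() == 0, "unaligned MMIO read");
    unsafe { ptr::read_volatile(addr as *const T) }
}

/// Volatile write of a T to a memory-mapped address.
///
/// # Safety
/// `addr` must be mapped, writable, and suitably aligned for T.
pub unsafe fn mmio_write<T: Copy>(addr: usize, val: T) {
    debug_assert!(addr % align_of::<T>() == 0, "unaligned MMIO write");
    unsafe { ptr::write_volatile(addr as *mut T, val) }
}
```

Each of the eight trait methods then reduces to a one-line call such as unsafe { mmio_read::<u32>(addr) }, and the debug assertion covers the alignment checks the current code lacks.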
7. kernel/src/ring3.rs — Scattered raw pointer copies
spawn_blob and spawn_process manually call core::ptr::write_bytes
and core::ptr::copy_nonoverlapping on physical-memory-mapped addresses.
The pattern phys_off + phys_addr → as_mut_ptr → write_bytes repeats
multiple times.
Recommendations:
- Add zero_frame(phys: PhysAddr) and copy_to_frame(phys: PhysAddr, data: &[u8]) utilities on MemoryServices that encapsulate the offset arithmetic and unsafe ptr operations. This would also clean up similar patterns in libkernel/src/memory/mod.rs.
8. libkernel/src/gdt.rs — Mutable cast of static TSS
set_kernel_stack casts &*TSS through *const → *mut to write rsp0.
This is technically UB (mutating through a shared reference to a
lazy_static).
Recommendations:
- Store the TSS in an UnsafeCell or Mutex so the mutation is sound. Since it is single-CPU and only called with interrupts off, an UnsafeCell wrapper with a documented invariant is sufficient.
9. libkernel/src/interrupts.rs — Crash-dump raw pointer reads
double_fault_handler and invalid_opcode_handler use
core::ptr::read_volatile on raw addresses for crash diagnostics, and the
inline-asm MSR reads are duplicated across fault handlers.
Recommendations:
- Extract a fn dump_cpu_state(frame: &InterruptStackFrame) -> CpuState helper that reads CR2/CR3/CR4/GS MSRs once and returns a struct, eliminating duplicated inline asm across fault handlers.
- A fn dump_bytes_at(addr: u64, len: usize) -> [u8; 16] helper would replace the raw pointer reads in both handlers.
10. devices/src/vfs/proc_vfs/ — Manual page-table walking
gen_pmap() manually walks PML4 / PDPT / PD / PT levels using raw pointer
casts like unsafe { &*((phys_off + addr) as *const PageTable) }.
Recommendations:
- Add a walk_page_tables iterator or visitor on MemoryServices that safely provides (virt_range, phys_base, flags) entries. Replaces 50+ lines of raw pointer walks.
Summary table
| Priority | File | Unsafe count | Refactor |
|---|---|---|---|
| High | scheduler.rs | - | ✅ Done: SwitchFrame struct, drop_to_ring3 |
| High | syscall.rs | - | ✅ Done: UnsafeCell, user pointer validation |
| High | local_apic/mapped.rs | - | ✅ Done: safe methods, unsafe-only construction |
| High | io_apic/mapped.rs | - | ✅ Done: same + read_volatile / write_volatile |
| Medium | vga_buffer/mod.rs | - | ✅ Done: VgaBuffer wrapper |
| Medium | kernel_acpi.rs | ~16 | Generic volatile MMIO helpers |
| Medium | ring3.rs | ~8 | zero_frame / copy_to_frame on MemoryServices |
| Medium | gdt.rs | 2 | UnsafeCell for TSS mutation |
| Low | interrupts.rs | ~10 | dump_cpu_state + dump_bytes_at helpers |
| Low | proc_vfs/ | ~5 | Page-table walk iterator |
SMP Safety Audit
An audit of concurrency issues that would arise when running on multiple CPUs. The kernel currently runs single-core only; this document catalogues what must change before bringing up Application Processors.
Issues are grouped by severity: Critical = data corruption / crash on SMP, High = deadlock or lost wakeup, Medium = ordering bugs or contention, Low = design limitation / hardening.
Critical
1. PerCpuData is a single static
libkernel/src/syscall.rs:44-65
PerCpuData (kernel_rsp, user_rsp, user_rip, user_rflags, user_r9,
saved_frame_ptr) lives at a single address. Every CPU’s GS base points
there. A SYSCALL on CPU 1 overwrites CPU 0’s saved registers mid-flight.
Impact: Stack corruption, wrong return to userspace, privilege escalation.
Fix: Allocate a distinct PerCpuData page per CPU and set
IA32_GS_BASE / IA32_KERNEL_GS_BASE independently during AP bringup.
2. GDT / TSS / IST stacks are shared
libkernel/src/gdt.rs:29-54
A single TSS (with a single double-fault IST stack) and a single GDT are
used by all CPUs. set_kernel_stack() (:77-84) unsafely mutates the shared
TSS’s rsp0 field.
Impact: Two CPUs taking a ring-3 → ring-0 transition simultaneously use the same kernel stack. Two simultaneous double faults corrupt each other’s IST stack.
Fix: Per-CPU GDT, per-CPU TSS, per-CPU IST stacks.
3. Scheduler has a single current_idx
libkernel/src/task/scheduler.rs:130, 809, 836
SCHEDULER is a single SpinMutex<Scheduler> with one current_idx field
that records which thread is currently executing. On SMP each CPU runs a
different thread, but current_idx can only represent one.
Impact: Every use of sched.current_idx — preempt_tick, block, save
context — operates on whichever CPU wrote it last, not the local CPU’s thread.
Fix: Per-CPU current_idx (or per-CPU scheduler instances).
4. block_current_thread() uses stale current_idx
libkernel/src/task/scheduler.rs:640-661
pub fn block_current_thread() {
    without_interrupts(|| {
        let mut sched = SCHEDULER.lock();
        let idx = sched.current_idx; // ← global, not per-CPU
        sched.threads[idx].state = Blocked;
    });
    loop {
        enable_and_hlt();
        let state = without_interrupts(|| {
            let sched = SCHEDULER.lock();
            let idx = sched.current_idx; // ← may now be another CPU's thread
            sched.threads[idx].state
        });
        if state != Blocked { break; }
    }
}
If CPU 0 blocks thread A and CPU 1 runs thread B, the re-check reads
current_idx (now B) and tests the wrong thread. Thread A either never
wakes or wakes with B’s state.
Impact: Hung threads, wrong-thread wakeup.
Fix: Save the thread’s own index before blocking; check that saved index
in the loop, not current_idx.
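The fix can be modelled in hosted Rust as shown below. This is a simplified model rather than kernel code: std::sync::Mutex stands in for the SCHEDULER lock, there is no interrupt masking or hlt, and begin_block / still_blocked are hypothetical names for the two halves of block_current_thread.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThreadState {
    Ready,
    Running,
    Blocked,
}

pub struct Scheduler {
    pub current_idx: usize,
    pub threads: Vec<ThreadState>,
}

/// First half: mark the calling thread Blocked and return ITS index so that
/// later checks do not depend on the shared current_idx.
pub fn begin_block(sched: &Mutex<Scheduler>) -> usize {
    let mut s = sched.lock().unwrap();
    let my_idx = s.current_idx;
    s.threads[my_idx] = ThreadState::Blocked;
    my_idx
}

/// Second half (body of the wait loop): consult the saved index only.
pub fn still_blocked(sched: &Mutex<Scheduler>, my_idx: usize) -> bool {
    sched.lock().unwrap().threads[my_idx] == ThreadState::Blocked
}
```

Even if another CPU overwrites current_idx between the two calls, still_blocked keeps testing the thread that actually went to sleep.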
5. CURRENT_THREAD_IDX_ATOMIC is a single global
libkernel/src/task/scheduler.rs:16-24
static CURRENT_THREAD_IDX_ATOMIC: AtomicUsize = AtomicUsize::new(0);
pub fn current_thread_idx() -> usize {
    CURRENT_THREAD_IDX_ATOMIC.load(Ordering::Relaxed)
}
Called from ISR context (interrupt handlers), syscall context (console, pipes, channels), and the scheduler itself. On SMP the value reflects whichever CPU wrote it last, not the caller’s CPU.
Impact: Signal delivery, pipe wakeup, IPC blocking — all index the wrong thread when the reading CPU differs from the writing CPU.
Fix: Per-CPU current-thread-index (read from a per-CPU variable or from a CPU-local register like GS).
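One possible shape for a per-CPU index is an array with one slot per CPU. This is a minimal sketch: MAX_CPUS, the explicit cpu parameter, and set_current_thread_idx are assumptions made for illustration. In the kernel the CPU id would more likely come from GS-based per-CPU data than from an argument.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed upper bound on CPUs, chosen only for this sketch.
const MAX_CPUS: usize = 4;

// A const item can be repeated in an array initialiser even though
// AtomicUsize is not Copy.
#[allow(clippy::declare_interior_mutable_const)]
const ZERO: AtomicUsize = AtomicUsize::new(0);

// One slot per CPU instead of a single global.
static CURRENT_THREAD_IDX: [AtomicUsize; MAX_CPUS] = [ZERO; MAX_CPUS];

/// Called by the scheduler on `cpu` when it switches threads.
pub fn set_current_thread_idx(cpu: usize, idx: usize) {
    CURRENT_THREAD_IDX[cpu].store(idx, Ordering::Release);
}

/// Returns the thread running on `cpu`, unaffected by writes from other CPUs.
pub fn current_thread_idx(cpu: usize) -> usize {
    CURRENT_THREAD_IDX[cpu].load(Ordering::Acquire)
}
```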
6. IO APIC register select/window interleaving
libkernel/src/apic/io_apic/mapped.rs:120-142
64-bit redirection entries are read/written as two 32-bit MMIO accesses
through a shared IOREGSEL / IOWIN register pair. Although callers hold the
IO_APICS SpinMutex, the lock does not disable interrupts. If a timer ISR
fires between the two halves of a 64-bit access on the same CPU, and the ISR
path touches IO APIC registers, the IOREGSEL is clobbered.
Currently no ISR path touches the IO APIC, so this is latent. On SMP with multiple IO APICs, per-APIC locking would be needed.
Impact: Corrupted redirection entry → interrupt routed to wrong vector or silently masked.
Fix: Use IrqMutex (or at minimum without_interrupts) around all IO
APIC register-pair accesses. Consider per-APIC locks for scalability.
High
7. SCHEDULER lock is a SpinMutex — ISR can spin on it
libkernel/src/task/scheduler.rs:130
SCHEDULER uses SpinMutex (interrupts stay enabled). Syscall-context
callers (block_current_thread, unblock, spawn_thread) wrap acquisitions
in without_interrupts, but the lock itself does not enforce this.
If a code path acquires the lock without disabling interrupts and the timer
ISR fires on the same CPU, preempt_tick (:803) will spin forever waiting for
the syscall to release the lock — which it never can, because it’s preempted.
Impact: Deadlock (single-CPU or SMP).
Fix: Change SCHEDULER to IrqMutex, or ensure every acquisition site
uses without_interrupts. All current sites do, but the type does not
enforce it — a future caller could forget.
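To illustrate what "enforced by the type" means, here is a host-side sketch of a lock whose guard disables interrupts on acquisition and restores the previous state on drop. It is not the project's IrqMutex: the interrupt flag is simulated with a thread-local Cell (each host thread plays one CPU), and all names are assumptions.

```rust
use std::cell::Cell;
use std::ops::{Deref, DerefMut};
use std::sync::{Mutex, MutexGuard};

thread_local! {
    // Simulated interrupt flag (IF) for the "CPU" this thread represents.
    static IF_ENABLED: Cell<bool> = Cell::new(true);
}

pub fn interrupts_enabled() -> bool {
    IF_ENABLED.with(|f| f.get())
}

pub struct IrqMutex<T> {
    inner: Mutex<T>,
}

pub struct IrqGuard<'a, T> {
    guard: Option<MutexGuard<'a, T>>,
    was_enabled: bool,
}

impl<T> IrqMutex<T> {
    pub const fn new(value: T) -> Self {
        Self { inner: Mutex::new(value) }
    }

    /// Disables (simulated) interrupts first, then takes the lock, so an
    /// ISR on this CPU can never find the lock held by the code it preempted.
    pub fn lock(&self) -> IrqGuard<'_, T> {
        let was_enabled = IF_ENABLED.with(|f| f.replace(false));
        IrqGuard { guard: Some(self.inner.lock().unwrap()), was_enabled }
    }
}

impl<T> Deref for IrqGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.guard.as_ref().unwrap()
    }
}

impl<T> DerefMut for IrqGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.guard.as_mut().unwrap()
    }
}

impl<T> Drop for IrqGuard<'_, T> {
    fn drop(&mut self) {
        drop(self.guard.take()); // release the lock first
        IF_ENABLED.with(|f| f.set(self.was_enabled)); // then restore IF
    }
}
```

With this shape a caller cannot forget the without_interrupts wrapper, because acquiring the lock is the only way to reach the data.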
8. MEMORY lock is not ISR-safe
libkernel/src/memory/mod.rs:559
MEMORY uses SpinMutex. The comment warns “must not be called from
interrupt context”, but this is not enforced by the type. Any future ISR
path that triggers frame allocation or page-table manipulation will deadlock
on single-CPU if a syscall holds the lock.
Impact: Deadlock.
Fix: Change to IrqMutex, or add a compile-time / runtime ISR guard.
9. Global heap allocator is not ISR-safe
libkernel/src/lib.rs:20
#[global_allocator]
static ALLOCATOR: LockedHeap = LockedHeap::empty();
LockedHeap uses spin::Mutex internally — no interrupt disabling. Any
heap allocation from ISR context while a syscall holds the heap lock will
deadlock.
The scheduler’s push_back on the ready queue can trigger a Vec
reallocation if the queue grows. Currently the scheduler lock is acquired
with interrupts disabled, so the heap allocation happens with IF=0 — safe on
single-CPU. On SMP, CPU 1’s ISR could try to allocate while CPU 0 holds the
heap lock.
Impact: Deadlock (ISR + heap contention).
Fix: Use an ISR-safe allocator wrapper, or guarantee no heap allocation from ISR context.
10. Console ISR → scheduler lock ordering
libkernel/src/console.rs:35-47
push_input() is called from the keyboard ISR. It acquires
CONSOLE_INPUT (SpinMutex), then calls scheduler::unblock() which acquires
SCHEDULER (SpinMutex, inside without_interrupts).
On SMP:
- CPU 0: syscall holds SCHEDULER lock (IF disabled), tries to read console → tries to acquire CONSOLE_INPUT.
- CPU 1: keyboard ISR holds CONSOLE_INPUT, calls unblock() → spins on SCHEDULER.
- CPU 0: still holds SCHEDULER, spins on CONSOLE_INPUT held by CPU 1.
Impact: Deadlock (lock-order inversion: SCHEDULER → CONSOLE_INPUT vs. CONSOLE_INPUT → SCHEDULER).
Fix: Don’t call unblock() while holding CONSOLE_INPUT. Buffer the
thread index and call unblock() after dropping the console lock.
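A minimal host-side sketch of that fix, using std::sync::Mutex in place of SpinMutex. The ConsoleInput fields and the unblock callback parameter are hypothetical; the point is only that the waiter index is taken under the console lock and acted on after the lock is dropped.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical, simplified console state.
pub struct ConsoleInput {
    pub buf: VecDeque<u8>,
    pub waiter: Option<usize>, // index of a thread blocked on input, if any
}

/// Queues one input byte and wakes the waiting thread, if there is one.
/// `unblock` stands in for scheduler::unblock().
pub fn push_input(console: &Mutex<ConsoleInput>, byte: u8, unblock: impl FnOnce(usize)) {
    let waiter = {
        let mut c = console.lock().unwrap();
        c.buf.push_back(byte);
        c.waiter.take()
    }; // console lock is released here, before touching the scheduler

    if let Some(idx) = waiter {
        unblock(idx);
    }
}
```

Because no code path now holds CONSOLE_INPUT while acquiring SCHEDULER, the CONSOLE_INPUT → SCHEDULER edge disappears from the lock graph.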
11. DONATE_TARGET is a single global
libkernel/src/task/scheduler.rs:682-700
DONATE_TARGET: AtomicUsize stores one target thread index, consumed by
the next yield_tick. On SMP, CPU 0 stores a donate target, but CPU 1’s
yield_tick consumes it first.
Impact: Scheduler donate delivers the wrong thread to the wrong CPU; intended recipient never gets donated to.
Fix: Per-CPU donate target, or pass the target through a different mechanism (e.g. IPI + per-CPU mailbox).
12. Lost wakeup in sys_wait4
osl/src/syscalls/process.rs:26-64
1. find_zombie_child(parent) → None
2. ← child exits on CPU 1, calls unblock(parent_wait_thread)
but wait_thread is not yet set → no-op
3. set wait_thread = current_thread
4. block_current_thread() → sleeps forever
The zombie check and the wait_thread registration are not atomic.
Impact: Parent process hangs forever waiting for an already-exited child.
Fix: Hold the process table lock across the zombie check and the
wait_thread write, so that terminate_process() on another CPU sees the
wait_thread before posting the zombie.
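The check-and-register step can be sketched as a single critical section. ProcessEntry, WaitStep, and both function names are hypothetical, and std::sync::Mutex stands in for the process table lock. The sketch covers only the atomic check-and-register; it assumes the scheduler tolerates an unblock() that arrives before the parent has finished blocking.

```rust
use std::sync::Mutex;

// Hypothetical slice of a process-table entry.
#[derive(Default)]
pub struct ProcessEntry {
    pub zombie_child: Option<u32>,  // pid of an exited, unreaped child
    pub wait_thread: Option<usize>, // thread parked in wait4, if any
}

#[derive(Debug, PartialEq)]
pub enum WaitStep {
    Reaped(u32),
    MustBlock,
}

/// Parent side: the zombie check and the waiter registration happen under
/// one lock acquisition, so a child exit cannot slip in between them.
pub fn wait_check_or_register(entry: &Mutex<ProcessEntry>, me: usize) -> WaitStep {
    let mut e = entry.lock().unwrap();
    match e.zombie_child.take() {
        Some(pid) => WaitStep::Reaped(pid),
        None => {
            e.wait_thread = Some(me);
            WaitStep::MustBlock
        }
    }
}

/// Child side: posts the zombie and returns the waiter to unblock, if any.
pub fn post_zombie(entry: &Mutex<ProcessEntry>, child: u32) -> Option<usize> {
    let mut e = entry.lock().unwrap();
    e.zombie_child = Some(child);
    e.wait_thread.take()
}
```

Whichever side takes the lock first, the other observes its effect: either the parent reaps immediately, or the child finds a registered waiter.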
Medium
13. Relaxed ordering on cross-CPU atomics
Several atomics use Ordering::Relaxed where Acquire/Release would be
more appropriate for cross-CPU visibility:
| Atomic | File | Line | Used by |
|---|---|---|---|
| CURRENT_THREAD_IDX_ATOMIC | scheduler.rs | 16 | ISR + syscall |
| current_pid | process.rs | 488 | syscall context |
| FOREGROUND_PID | console.rs | 31 | keyboard ISR |
| LAPIC_EOI_ADDR | interrupts.rs | 12 | ISR |
On x86-64 all loads/stores are implicitly acquire/release for aligned
naturally-sized values, so this is a correctness concern primarily on
weakly-ordered architectures or under compiler reordering. Using explicit
Acquire/Release is still best practice for documentation and portability.
Impact: Stale reads possible under compiler reordering; wrong-process signal delivery, wrong-process console input routing.
Fix: Release on writes, Acquire on reads.
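As a small illustration of the pairing, using FOREGROUND_PID as the example. The AtomicU32 type and the accessor names are assumptions; the real static in console.rs may differ.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Assumed type; shown only to illustrate the ordering pair.
static FOREGROUND_PID: AtomicU32 = AtomicU32::new(0);

/// Writer side: Release publishes everything written before this store.
pub fn set_foreground_pid(pid: u32) {
    FOREGROUND_PID.store(pid, Ordering::Release);
}

/// Reader side (e.g. the keyboard ISR): Acquire pairs with the Release store.
pub fn foreground_pid() -> u32 {
    FOREGROUND_PID.load(Ordering::Acquire)
}
```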
14. Stack arena contention
libkernel/src/stack_arena.rs:16
A single SpinMutex<ArenaInner> protects a 32-bit free bitmap for all
thread stack allocations / deallocations. On SMP with frequent thread
creation, this becomes a serialisation bottleneck.
Impact: Performance (lock contention), not correctness.
Fix: Per-CPU arenas, or lock-free bitmap (atomic CAS on u32).
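The lock-free option could look roughly like the following compare-and-swap loop over a 32-bit bitmap. StackBitmap and its methods are hypothetical names; mapping a slot number to a stack address is left out.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// 32-slot allocation bitmap; bit i set means slot i is in use.
pub struct StackBitmap {
    used: AtomicU32,
}

impl StackBitmap {
    pub const fn new() -> Self {
        Self { used: AtomicU32::new(0) }
    }

    /// Claims the lowest free slot, or returns None when all 32 are taken.
    pub fn alloc(&self) -> Option<u32> {
        let mut cur = self.used.load(Ordering::Relaxed);
        loop {
            let slot = (!cur).trailing_zeros(); // lowest clear bit; 32 if full
            if slot == 32 {
                return None;
            }
            match self.used.compare_exchange_weak(
                cur,
                cur | (1 << slot),
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(slot),
                Err(actual) => cur = actual, // lost a race; retry with fresh bits
            }
        }
    }

    /// Releases a slot previously returned by `alloc`.
    pub fn free(&self, slot: u32) {
        let prev = self.used.fetch_and(!(1u32 << slot), Ordering::AcqRel);
        debug_assert!(prev & (1 << slot) != 0, "double free of stack slot");
    }
}
```

A failed compare-exchange means another CPU changed the bitmap first, so the loop recomputes the free slot from the value it observed.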
15. Lock ordering not documented or enforced
Multiple subsystems acquire locks in ad-hoc order. Observed nesting:
- CONSOLE_INPUT → SCHEDULER (push_input → unblock)
- IrqInner → CompletionPort (irq_fd_dispatch → post)
- NotifyInner → CompletionPort (signal_notify → post)
- PROCESS_TABLE → SCHEDULER (with_process → spawn_user_thread)
No static or runtime enforcement exists. Adding a second CPU increases the risk of discovering new inversion paths.
Impact: Latent deadlocks as code evolves.
Fix: Document a global lock ordering. Consider runtime lock-order checking in debug builds (e.g. per-CPU lock-stack tracking).
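A debug-build checker could assign each lock a rank and refuse any acquisition that does not increase the rank of the most recently held lock. The sketch below is hypothetical throughout: the rank values are invented (chosen only to agree with the nesting observed above), and a thread-local Vec stands in for a per-CPU lock stack.

```rust
use std::cell::RefCell;

thread_local! {
    // Stand-in for a per-CPU stack of held lock ranks.
    static HELD: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

// Hypothetical ranks: a lock may only be taken while holding lower ranks.
pub const RANK_PROCESS_TABLE: u8 = 10;
pub const RANK_CONSOLE_INPUT: u8 = 20;
pub const RANK_SCHEDULER: u8 = 30;

/// Records an acquisition. Returns Err((held, requested)) on an ordering
/// violation instead of recording it.
pub fn note_acquire(rank: u8) -> Result<(), (u8, u8)> {
    HELD.with(|h| {
        let mut h = h.borrow_mut();
        if let Some(&top) = h.last() {
            if rank <= top {
                return Err((top, rank));
            }
        }
        h.push(rank);
        Ok(())
    })
}

/// Records a release; locks are expected to be released in LIFO order.
pub fn note_release(rank: u8) {
    HELD.with(|h| {
        let popped = h.borrow_mut().pop();
        debug_assert_eq!(popped, Some(rank));
    })
}
```

A lock wrapper would call note_acquire before spinning and note_release in its guard's Drop, panicking on Err in debug builds.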
16. User memory TOCTOU with CLONE_VM
osl/src/user_mem.rs:27-45
user_slice() validates then returns a 'static slice. With CLONE_VM
(vfork), the parent and child share an address space. If the child calls
mmap / munmap while the parent is mid-syscall with a validated slice, the
pages backing the slice may be unmapped.
Currently mitigated because CLONE_VM blocks the parent (vfork
semantics), so only the child runs. If shared-address-space threading is
added, this becomes exploitable.
Impact: Latent use-after-free in shared address spaces.
Fix: Pin pages for the duration of the syscall, or copy user data into a kernel buffer before releasing the process lock.
17. VMA / page-table flag divergence in mprotect
osl/src/syscalls/mem.rs (sys_mprotect)
The process lock is released between mprotect_vmas() (updates VMA metadata)
and update_user_page_flags() (updates hardware page tables). A concurrent
mmap or munmap on the same address range could see inconsistent state.
Currently safe because only one thread per process runs at a time (no kernel threading within a process).
Impact: Latent protection-flag inconsistency if intra-process parallelism is added.
Fix: Hold the process lock (or a per-address-space lock) across both the VMA update and the page-table update.
Low
18. LAPIC timer calibration is BSP-only
libkernel/src/apic/mod.rs:205-254
Calibration uses a global PIT busy-wait and assumes a single LAPIC. Each AP would need its own calibration pass (LAPIC frequencies can differ, especially under virtualisation).
19. Dynamic vector allocation uses without_interrupts
libkernel/src/interrupts.rs:22-75
register_handler() disables local interrupts and acquires
DYNAMIC_HANDLERS (SpinMutex). On SMP, without_interrupts only affects
the local CPU. Two CPUs calling register_handler() concurrently will
correctly serialise via the SpinMutex — no bug, but the without_interrupts
wrapper is unnecessary and misleading.
20. Single ready queue scalability
libkernel/src/task/scheduler.rs:103
The single VecDeque ready queue serialises all scheduling decisions behind
one lock. This is the standard starting point but will need per-CPU run
queues and work-stealing for acceptable SMP throughput.
Summary
| # | Severity | Component | One-line summary |
|---|---|---|---|
| 1 | Critical | syscall.rs | PerCpuData is a single static shared by all CPUs |
| 2 | Critical | gdt.rs | GDT / TSS / IST stacks shared across CPUs |
| 3 | Critical | scheduler.rs | Single current_idx — meaningless on SMP |
| 4 | Critical | scheduler.rs | block_current_thread reads stale current_idx |
| 5 | Critical | scheduler.rs | CURRENT_THREAD_IDX_ATOMIC is one global |
| 6 | Critical | io_apic | Register select/window not ISR-safe |
| 7 | High | scheduler.rs | SCHEDULER SpinMutex not ISR-enforced |
| 8 | High | memory/mod.rs | MEMORY SpinMutex not ISR-safe |
| 9 | High | lib.rs | Global heap allocator not ISR-safe |
| 10 | High | console.rs | ISR lock-order inversion (CONSOLE → SCHEDULER) |
| 11 | High | scheduler.rs | DONATE_TARGET is a single global |
| 12 | High | process.rs | Lost wakeup in sys_wait4 |
| 13 | Medium | various | Relaxed ordering on cross-CPU atomics |
| 14 | Medium | stack_arena.rs | Single-lock bitmap contention |
| 15 | Medium | various | Lock ordering not documented |
| 16 | Medium | user_mem.rs | TOCTOU with CLONE_VM (latent) |
| 17 | Medium | mem.rs | VMA / page-table flag divergence (latent) |
| 18 | Low | apic/mod.rs | LAPIC calibration BSP-only |
| 19 | Low | interrupts.rs | Misleading without_interrupts wrapper |
| 20 | Low | scheduler.rs | Single ready queue scalability |
Recommended SMP bringup order
1. Per-CPU infrastructure: PerCpuData, GDT, TSS, IST stacks, LAPIC init.
2. Per-CPU scheduler state: current_idx, ready queue, donate target.
3. Fix block_current_thread to use saved thread index.
4. Promote SCHEDULER and MEMORY to IrqMutex (or add IF-disable wrappers).
5. Fix lock-ordering inversions (console, notify, channel → scheduler).
6. Fix sys_wait4 lost-wakeup race.
7. Per-CPU LAPIC calibration.
8. Document and enforce global lock ordering.
Testing
Overview
The kernel uses Rust’s custom_test_frameworks feature since the standard test
harness requires std. Tests run inside QEMU in headless mode, communicate
results over the serial port, and signal pass/fail via an ISA debug-exit device.
# Kernel crate tests (kernel unit tests + integration tests)
cargo test --manifest-path kernel/Cargo.toml
# libkernel tests (allocator, VGA, path, timer, interrupts, etc.)
cargo test --manifest-path libkernel/Cargo.toml
Or via the Makefile:
make test
Note: make test currently runs only the kernel crate tests. libkernel has
its own test binary (with heap initialization) that must be run separately.
How It Works
Custom test runner
libkernel/src/lib.rs defines the framework:
#![feature(custom_test_frameworks)]
#![test_runner(crate::test_runner)]
#![reexport_test_harness_main = "test_main"]
test_runner iterates over every #[test_case] function, runs it, and writes
results to serial. On completion it writes to I/O port 0xf4 to exit QEMU:
| Exit code | Meaning |
|---|---|
| 0x10 | All tests passed |
| 0x11 | A test panicked |
QEMU configuration
kernel/Cargo.toml configures bootimage to launch QEMU with:
test-args = [
"-device", "isa-debug-exit,iobase=0xf4,iosize=0x04",
"-serial", "stdio",
"-display", "none"
]
test-success-exit-code = 33
test-timeout = 30
- isa-debug-exit: writing to port 0xf4 terminates QEMU with an exit code
- serial stdio: test output appears on the host terminal
- display none: headless, no VGA window
- timeout 30s: kills stuck tests
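The value 33 in test-success-exit-code follows from how QEMU's isa-debug-exit device turns the written value into a process exit status: (value << 1) | 1. The sketch below shows that arithmetic. QemuExitCode is named in libkernel/src/lib.rs, but the variant names and the helper host_exit_status are assumptions made for illustration.

```rust
/// Values the kernel writes to port 0xf4 (variant names assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum QemuExitCode {
    Success = 0x10,
    Failed = 0x11,
}

/// Exit status the host sees from the QEMU process for a given written value.
pub fn host_exit_status(code: QemuExitCode) -> u32 {
    ((code as u32) << 1) | 1
}
```

So 0x10 becomes 33, which bootimage treats as success, and 0x11 becomes 35.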
Serial output
Tests print to COM1 (0x3F8) via serial_print! / serial_println! from
libkernel/src/serial.rs. Each test prints its name and [ok] on success;
the panic handler prints [failed] and the error before exiting.
Test Types
Unit tests (#[test_case])
Standard tests that run inside the kernel. When built with cargo test, the
kernel initialises GDT, IDT, heap, and memory, then calls test_main() which
invokes test_runner with all collected test cases.
Integration tests (kernel/tests/)
Each file in kernel/tests/ compiles as a separate kernel binary with its own
entry point. bootimage boots each one independently in QEMU.
Two integration tests use harness = false because they need custom control
flow (e.g. verifying that a panic or exception fires correctly):
[[test]]
name = "should_panic"
harness = false
[[test]]
name = "stack_overflow"
harness = false
Test Inventory
Unit tests (37 tests)
All unit tests live in libkernel and run via
cargo test --manifest-path libkernel/Cargo.toml.
| File | Tests | What they cover |
|---|---|---|
| libkernel/src/path.rs | 13 | normalize and resolve: dots, dotdot, root, relative, absolute |
| libkernel/src/vga_buffer/mod.rs | 8 | println output, color encoding, FixedBuf formatting |
| libkernel/src/md5.rs | 7 | MD5 hash (RFC 1321 test vectors) |
| libkernel/src/allocator/mod.rs | 3 | align_up correctness and boundary conditions |
| libkernel/src/memory/vmem_allocator.rs | 3 | Virtual memory allocator state and page tracking |
| libkernel/src/task/timer.rs | 2 | Delay struct millisecond/second calculations |
| libkernel/src/interrupts.rs | 1 | Breakpoint exception (int3) handling |
Integration tests (4 binaries)
| File | Harness | What it tests |
|---|---|---|
| basic_boot.rs | standard | Kernel boots and VGA println works |
| heap_allocation.rs | standard | Box, Vec, and repeated allocation patterns |
| should_panic.rs | custom | Panic handler fires and exits correctly |
| stack_overflow.rs | custom | Double-fault handler catches stack overflow via IST |
Execution Flow
cargo test
|
bootimage compiles test binary (kernel + bootloader)
|
QEMU boots with isa-debug-exit, serial stdio, no display
|
Kernel init: GDT, IDT, heap, memory
|
test_main() calls test_runner(&[...])
|
For each #[test_case]:
run test
serial_println!("test_name... [ok]")
|
exit_qemu(Success) --> write 0x10 to port 0xf4
|
bootimage reads exit code, reports result
For harness = false tests, the binary manages its own flow and exit.
Key Files
| File | Role |
|---|---|
| libkernel/src/lib.rs | test_runner, test_panic_handler, QemuExitCode |
| libkernel/src/serial.rs | COM1 serial output (serial_print!, serial_println!) |
| kernel/Cargo.toml | bootimage test-args, exit codes, timeout |
| .cargo/config.toml | bootimage runner, build target |
| kernel/tests/ | Integration test binaries |
Adding a New Test
Unit test
Add #[test_case] to a function in any file that has #[cfg(test)] access
to the test framework (libkernel modules or kernel crate modules):
#[test_case]
fn test_something() {
    serial_print!("test_something... ");
    assert_eq!(1 + 1, 2);
    serial_println!("[ok]");
}
Integration test
Create kernel/tests/my_test.rs with its own entry point:
#![no_std]
#![no_main]
#![feature(custom_test_frameworks)]
#![test_runner(libkernel::test_runner)]
#![reexport_test_harness_main = "test_main"]

use bootloader::{entry_point, BootInfo};
use core::panic::PanicInfo;

entry_point!(main);

fn main(boot_info: &'static BootInfo) -> ! {
    libkernel::init();
    // ... any additional setup ...
    test_main();
    libkernel::hlt_loop();
}

#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    libkernel::test_panic_handler(info)
}

#[test_case]
fn my_test() {
    // ...
}
For tests that need custom panic/exception handling, add harness = false to
kernel/Cargo.toml and manage the entry point and exit manually.