Project Status

ostoo is a hobby x86-64 kernel written in Rust, following the Writing an OS in Rust blog series by Philipp Oppermann. All twelve tutorial chapters have been completed and the project has gone significantly beyond the tutorial.

Workspace Layout

Crate            Purpose
kernel/          Top-level kernel binary — entry point, ties everything together
libkernel/       Core kernel library — all subsystems including APIC live here
osl/             “OS Subsystem for Linux” — syscall dispatch + VFS bridge
devices/         Driver framework — DriverTask trait, actor macro, built-in drivers, VFS
devices-macros/  Proc-macro crate: #[actor], #[on_message], #[on_info], #[on_tick], #[on_stream], #[on_start]

Target triple: x86_64-os (custom JSON target, bare-metal, no std). Build tooling: cargo-xbuild + bootimage (BIOS bootloader). Toolchain: current nightly (floating, rust-toolchain.toml).


Completed Tutorial Chapters

1–2. Freestanding Binary / Minimal Kernel

  • #![no_std], #![no_main], custom panic handler.
  • bootloader crate provides the BIOS boot stage and passes a BootInfo struct.
  • Entry point via entry_point! macro (libkernel_main in kernel/src/main.rs).

3. VGA Text Mode

  • libkernel/src/vga_buffer/mod.rs — a Writer behind an IrqMutex.
  • print! / println! macros available globally.
  • Volatile writes to avoid compiler optimisation of MMIO.
  • Hardware cursor (CRTC registers 0x3D4/0x3D5) kept in sync on every write.
  • redraw_line(start_col, buf, len, cursor) for in-place line editing.
  • Fixed status bar at row 0 (status_bar! macro, white-on-blue); updated by status_task every 250 ms with thread index, context-switch count, task queue depths, and uptime.
  • Timeline strip at row 1: scrolling coloured blocks, one per context switch, colour-coded by thread index.

4. Testing

  • Custom test framework (custom_test_frameworks feature).
  • Integration tests in kernel/tests/: basic_boot, heap_allocation, should_panic, stack_overflow.
  • QEMU isa-debug-exit device used to signal pass/fail to the host.
  • Serial port (libkernel/src/serial.rs) used for test output.

5–6. CPU Exceptions / Double Faults

  • IDT set up in libkernel/src/interrupts.rs via lazy_static.
  • Handlers: breakpoint, page fault (panics), double fault (panics).
  • Double fault uses a dedicated IST stack (GDT TSS entry).
  • GDT + TSS initialised in libkernel/src/gdt.rs.

7. Hardware Interrupts

  • 8259 PIC (chained) initialised via pic8259; remapped to IRQ vectors 32–47.
  • PIC is later disabled once the APIC is configured.
  • Timer interrupt handler (IRQ 0): increments tick counter, wakes timer futures.
  • Keyboard interrupt handler (IRQ 1): reads scancode from port 0x60, pushes it into the async scancode queue.

8–9. Paging / Paging Implementation

  • libkernel/src/memory/mod.rs — RecursivePageTable (PML4 slot 511 self-referential); MMIO bump allocator at 0xFFFF_8002_0000_0000 with a BTreeMap cache for idempotency; the physical memory identity map is kept for DMA address translation only (phys_mem_offset from the bootloader).
  • libkernel/src/memory/frame_allocator.rs — BootInfoFrameAllocator walks the bootloader memory map to hand out usable physical frames.
  • libkernel/src/memory/vmem_allocator.rs — DumbVmemAllocator hands out a sequential range of virtual addresses (no reclamation); currently unused in production — the MMIO bump allocator in MemoryServices handles all virtual address allocation at runtime.

10. Heap Allocation

  • Kernel heap mapped at 0xFFFF_8000_0000_0000, size 512 KiB (libkernel/src/allocator/mod.rs).
  • Global allocator: linked_list_allocator::LockedHeap.
  • extern crate alloc available; Box, Vec, Rc, BTreeMap, etc. all work.

11. Allocator Designs

  • Bump allocator implemented in libkernel/src/allocator/bump.rs (O(1) alloc, no free).
  • linked_list_allocator is the active global allocator (can be swapped by changing the static ALLOCATOR line in libkernel/src/lib.rs).

12. Async/Await

  • Task abstraction in libkernel/src/task/mod.rs — pinned boxed futures with atomic TaskId.
  • Simple round-robin executor in task/simple_executor.rs.
  • Full waker-based executor in task/executor.rs:
    • Ready tasks in a VecDeque, waiting tasks in a BTreeMap.
    • Wake queue (crossbeam_queue::ArrayQueue) for interrupt-safe wakeups.
    • sleep_if_idle uses sti; hlt to avoid busy-waiting.
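The ready-queue/wait-map/wake-queue split can be sketched on the host. This is a simplified model, not the kernel's code: std::sync::mpsc stands in for crossbeam's lock-free ArrayQueue, and Task is a stub rather than a pinned boxed future.

```rust
use std::collections::{BTreeMap, VecDeque};
use std::sync::mpsc;

type TaskId = u64;

// Stand-in for a spawned future; the real executor stores pinned boxed futures.
#[derive(Debug, PartialEq)]
struct Task(TaskId);

/// Move every woken task from the (ISR-fed) wake queue back onto the ready queue.
fn drain_wake_queue(
    wake_rx: &mpsc::Receiver<TaskId>,
    waiting: &mut BTreeMap<TaskId, Task>,
    ready: &mut VecDeque<Task>,
) {
    while let Ok(id) = wake_rx.try_recv() {
        // A task may be woken twice before it runs; the second wake is a no-op.
        if let Some(task) = waiting.remove(&id) {
            ready.push_back(task);
        }
    }
}

fn main() {
    let (wake_tx, wake_rx) = mpsc::channel();
    let mut waiting = BTreeMap::new();
    let mut ready = VecDeque::new();

    waiting.insert(7, Task(7));
    waiting.insert(9, Task(9));

    wake_tx.send(9).unwrap(); // e.g. a timer ISR woke task 9
    wake_tx.send(9).unwrap(); // duplicate wake: harmless
    drain_wake_queue(&wake_rx, &mut waiting, &mut ready);

    assert_eq!(ready.pop_front(), Some(Task(9)));
    assert!(waiting.contains_key(&7)); // task 7 still parked
    println!("ok");
}
```

The key property the model shows: wakes arriving from interrupt context only touch the lock-free queue; the locks around the ready queue and wait map are taken later, on an executor thread.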

Beyond the Tutorial

Timer

  • libkernel/src/task/timer.rs — LAPIC tick counter; TICKS_PER_SECOND = 1000.
  • Delay future: resolves after a given number of ticks.
  • Mailbox::recv_timeout(ticks) races inbox against a Delay.
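The recv_timeout race can be sketched as two futures polled in lockstep against a tick counter. This is a host-runnable illustration, not the kernel's Mailbox: the global TICKS counter stands in for the LAPIC tick, a no-op waker replaces the executor's real wakers, and recv_timeout here is a hand-rolled polling loop.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

static TICKS: AtomicU64 = AtomicU64::new(0);

/// Resolves once the tick counter reaches `deadline` (sketch of the Delay future).
struct Delay { deadline: u64 }

impl Future for Delay {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if TICKS.load(Ordering::Relaxed) >= self.deadline { Poll::Ready(()) } else { Poll::Pending }
    }
}

fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Race an inbox future against a Delay: Some(msg) on message, None on timeout.
fn recv_timeout<F: Future<Output = u32> + Unpin>(mut inbox: F, ticks: u64) -> Option<u32> {
    let mut delay = Delay { deadline: TICKS.load(Ordering::Relaxed) + ticks };
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(msg) = Pin::new(&mut inbox).poll(&mut cx) { return Some(msg); }
        if let Poll::Ready(()) = Pin::new(&mut delay).poll(&mut cx) { return None; }
        TICKS.fetch_add(1, Ordering::Relaxed); // stand-in for the LAPIC tick ISR
    }
}

fn main() {
    // An inbox that never yields: the delay wins.
    assert_eq!(recv_timeout(std::future::pending::<u32>(), 5), None);
    // An inbox that is already ready: the message wins.
    assert_eq!(recv_timeout(std::future::ready(42u32), 5), Some(42));
    println!("ok");
}
```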

Preemptive Multi-threaded Scheduler

  • libkernel/src/task/scheduler.rs — round-robin preemptive scheduler driven by the LAPIC timer at 1000 Hz; 10 ms quantum (QUANTUM_TICKS = 10).
  • Assembly stub lapic_timer_stub saves all 15 GPRs plus a 512-byte FXSAVE area below the CPU-pushed iret frame on the current stack, then calls preempt_tick(current_rsp) -> new_rsp in Rust.
  • preempt_tick advances the tick counter, acknowledges the LAPIC interrupt, decrements the quantum, and when it expires saves the old RSP, selects the next ready thread, and returns its saved_rsp.
  • scheduler::migrate_to_heap_stack(run_kernel) allocates a 64 KiB heap stack and switches thread 0 off the bootloader’s lower-half stack onto PML4 entry 256 (high canonical half), so it survives CR3 switches into user page tables.
  • scheduler::init() registers the boot context as thread 0.
  • scheduler::spawn_thread(entry) allocates a 64 KiB stack, synthesises an iret frame, and enqueues the new thread.
  • The kernel boots two executor threads (threads 0 and 1) that share the same async task queue; tasks are transparently dispatched across both.
  • Shell command threads shows the current thread index and total context switches since boot.

Actor System (devices/, devices-macros/)

  • DriverTask trait: name(), run(inbox, handle).
  • Mailbox<M> / Inbox<M> MPSC queue; ActorMsg<M,I> envelope wraps inner messages, info queries, and erased-type info queries.
  • Process registry (libkernel/src/task/registry.rs): actors register by name; registry::get::<M,I>(name) returns a typed sender handle.
  • ErasedInfo registry: actors register a Box<dyn Fn() -> ...> so the shell can query any actor’s info without knowing its concrete type.

Proc-macro attributes (used inside #[actor] blocks)

Attribute               Effect
#[on_start]             Called once before the run loop
#[on_message(Variant)]  Handles one inner message enum variant
#[on_info]              Returns the actor’s typed info struct
#[on_tick]              Called periodically; actor provides tick_interval_ticks()
#[on_stream(factory)]   Polls a Stream + Unpin in the unified event loop

The macro generates a unified poll_fn loop when #[on_tick] or #[on_stream] are present, racing all event sources in a single future.

User Space and Process Isolation

  • Full ring-3 process support with per-process page tables, SYSCALL/SYSRET, and preemptive scheduling. Process exit and execve properly free user-half page tables and data frames (with refcount-aware shared frame handling).
  • 35+ Linux-compatible syscalls in osl/src/syscalls/.
  • Per-process FD table, CWD tracking, parent/child relationships, zombie lifecycle with wait4/reap.
  • ELF loader for static x86-64 binaries; initial stack with argc/argv/auxv.
  • IPC channels with fd-passing (capability transfer) — syscalls 505–507. See docs/ipc-channels.md.
  • Shared memory via shmem_create (syscall 508) + mmap(MAP_SHARED) — anonymous shared memory backed by reference-counted physical frames. See docs/mmap-design.md Phase 5b.
  • Notification fds via notify_create (509) + notify (510) — general-purpose inter-process signaling through completion ports (OP_RING_WAIT). See docs/completion-port-design.md Phase 4.
  • Console input buffer with foreground PID routing and blocking read(0).
  • Async-to-sync bridge (osl/src/blocking.rs) for VFS calls from syscall context.
  • See docs/userspace-plan.md for the full roadmap (Phases 0–6 complete; Phase 7 signals not yet started).

Userspace Libraries (user/include/ostoo.h, user-rs/rt/)

  • C library (libostoo.a): shared header user/include/ostoo.h with struct definitions, syscall numbers, opcodes, and flags. Static library user/lib/libostoo.a provides typed syscall wrappers for all 12 custom syscalls (501–512), output helpers (puts_stdout, put_num, put_hex), conversion helpers (itoa_buf, simple_atoi), and ring buffer access helpers (sq_entry, cq_entry). All 21 demo programs have been migrated to use the shared library, eliminating per-file boilerplate.
  • Rust library (ostoo-rt crate): two modules added to the existing user-rs/rt/ runtime crate. sys module provides raw syscall wrappers and repr(C) struct definitions matching the kernel ABI. ostoo module provides safe RAII types (CompletionPort, IpcSend/IpcRecv, SharedMem, NotifyFd, IrqFd, IoRing) with automatic fd cleanup on drop, plus builder methods on IoSubmission for each opcode.
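The "automatic fd cleanup on drop" pattern behind those RAII types can be sketched in a few lines. This is a host-runnable illustration with a mock close function; the real ostoo-rt types issue the close syscall instead, and OwnedFd is a name invented here, not an ostoo-rt type.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Stand-in for the close syscall so the sketch runs on a host.
static CLOSED: AtomicU32 = AtomicU32::new(0);
fn sys_close(_fd: i32) { CLOSED.fetch_add(1, Ordering::SeqCst); }

/// Minimal sketch of the RAII idea used by CompletionPort, SharedMem, NotifyFd, etc.:
/// the wrapped fd is closed exactly once, when the owner goes out of scope.
struct OwnedFd(i32);

impl Drop for OwnedFd {
    fn drop(&mut self) { sys_close(self.0); }
}

fn main() {
    {
        let _port = OwnedFd(3);
        let _shmem = OwnedFd(4);
        assert_eq!(CLOSED.load(Ordering::SeqCst), 0); // still in scope: nothing closed
    }
    assert_eq!(CLOSED.load(Ordering::SeqCst), 2); // both closed on drop
    println!("ok");
}
```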

Userspace Shell (user/src/shell.c)

  • Primary user interface: musl-linked C binary, auto-launched on boot from /bin/shell via kernel/src/main.rs.
  • Line editing: read char-by-char, echo, backspace, Ctrl+C (cancel), Ctrl+D (exit on empty line).
  • Built-in commands: echo, pwd, cd, ls, cat, pid, export, env, unset, exit, help.
  • Environment variables: shell maintains an env table, passes it to children. Kernel provides defaults: PATH=/host/bin, HOME=/, TERM=dumb, SHELL=/bin/shell.
  • External programs: posix_spawn(path) + waitpid.
  • Built with Docker-based musl cross-compiler (scripts/user-build.sh).
  • Sources in user/src/, binaries output to user/bin/.
  • See docs/userspace-shell.md for full design.

Kernel Shell (kernel/src/shell.rs) — fallback

  • #[actor]-based shell actor, active when no userspace shell is running.
  • Prompt includes CWD: ostoo:/path> .
  • Commands: help, echo, driver <start|stop|info>, blk <info|read|ls|cat>, ls, cat, pwd, cd, mount, exec, test.
  • Info commands (cpuinfo, meminfo, etc.) migrated to /proc; accessible via cat /proc/<file>.

Keyboard Actor (kernel/src/keyboard_actor.rs)

  • #[actor] + #[on_stream(key_stream)]; registered as "keyboard".
  • Foreground routing: when a user process is foreground, raw keypresses are delivered to console::push_input() for userspace read(0).
  • When kernel is foreground: full readline-style line editing:
    • Cursor movement: ← → / Ctrl+B/F, Home/End / Ctrl+A/E
    • Editing: Backspace, Delete, Ctrl+K (kill to end), Ctrl+U (kill to start), Ctrl+W (delete word)
    • History: ↑↓ / Ctrl+P/N, 50-entry VecDeque, live-buffer save/restore
    • Ctrl+C clears the line; Ctrl+L clears the screen
  • Dispatches complete lines to the kernel shell via ShellMsg::KeyLine.
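The history mechanics (bounded deque, live-buffer save/restore) can be modelled on the host. A sketch under stated assumptions — field and method names here are invented, not the keyboard actor's API:

```rust
use std::collections::VecDeque;

const HISTORY_CAP: usize = 50;

/// Minimal sketch of readline history: bounded deque plus a saved live buffer.
struct History {
    entries: VecDeque<String>,
    cursor: Option<usize>, // index into entries while browsing
    live: String,          // in-progress line, restored when browsing past the newest entry
}

impl History {
    fn new() -> Self { Self { entries: VecDeque::new(), cursor: None, live: String::new() } }

    fn push(&mut self, line: String) {
        if self.entries.len() == HISTORY_CAP { self.entries.pop_front(); }
        self.entries.push_back(line);
        self.cursor = None;
    }

    /// Up arrow: step toward older entries, saving the live buffer first.
    fn prev(&mut self, current: &str) -> Option<&str> {
        let idx = match self.cursor {
            None => { self.live = current.to_string(); self.entries.len().checked_sub(1)? }
            Some(0) => 0,
            Some(i) => i - 1,
        };
        self.cursor = Some(idx);
        self.entries.get(idx).map(|s| s.as_str())
    }

    /// Down arrow: step toward newer entries, restoring the live buffer at the end.
    fn next(&mut self) -> Option<String> {
        match self.cursor {
            Some(i) if i + 1 < self.entries.len() => {
                self.cursor = Some(i + 1);
                self.entries.get(i + 1).cloned()
            }
            Some(_) => { self.cursor = None; Some(self.live.clone()) }
            None => None,
        }
    }
}

fn main() {
    let mut h = History::new();
    h.push("ls".into());
    h.push("cat /proc/uptime".into());
    assert_eq!(h.prev("mou"), Some("cat /proc/uptime"));
    assert_eq!(h.prev("mou"), Some("ls"));
    assert_eq!(h.next().as_deref(), Some("cat /proc/uptime"));
    assert_eq!(h.next().as_deref(), Some("mou")); // live buffer restored
    println!("ok");
}
```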

virtio-blk Block Device (devices/src/virtio/)

  • virtio-drivers 0.13 crate provides the virtio protocol; the kernel supplies KernelHal implementing Hal for DMA allocation, MMIO mapping, and virtual→physical address translation.
  • QEMU Q35 machine; PCIe ECAM at physical 0xB000_0000 mapped at boot via MemoryServices::map_mmio_region. PciRoot is generic over MmioCam<'static>.
  • VirtioBlkActor actor: handles Read and Write messages using the non-blocking virtio-drivers API (read_blocks_nb / complete_read_blocks) with a busy-poll CompletionFuture for MVP.
  • KernelHal::share performs a full page-table walk (translate_virt) so that heap-allocated BlkReq/BlkResp/data buffers produce correct physical addresses for the device.
  • Shell commands: blk info, blk read <sector>.
  • See docs/virtio-blk.md for full details.

VirtIO 9P Host Directory Sharing (devices/src/virtio/p9*.rs)

  • VirtIO 9P (9P2000.L) driver for sharing a host directory into the guest, providing a Docker-volume-like workflow: edit files on the host, they appear instantly in the guest.
  • p9_proto.rs — minimal 9P2000.L wire protocol: 8 message pairs (version, attach, walk, lopen, read, readdir, getattr, clunk).
  • p9.rs — P9Client, a high-level client wrapping VirtIO9p<KernelHal, PciTransport>. Synchronous API behind a SpinMutex; performs the version handshake + attach on construction. Public methods: list_dir, read_file, stat.
  • QEMU shares the ./user directory via -fsdev local,...,security_model=none and -device virtio-9p-pci,...,mount_tag=hostfs.
  • Mounted at /host (always) and at / as fallback when no virtio-blk disk is present, so /bin/shell auto-launch works without a disk image.
  • PCI device IDs: 0x1AF4:0x1049 (modern), 0x1AF4:0x1009 (legacy).
  • Read-only for MVP; no write/create/delete support.
  • See docs/virtio-9p.md for full details.

exFAT Filesystem (devices/src/virtio/exfat.rs)

  • Read-only exFAT driver with no external dependencies.
  • Auto-detects bare exFAT, MBR-partitioned, and GPT-partitioned disk images.
  • Implements: boot sector parsing, FAT chain traversal, directory entry set parsing (File / Stream Extension / File Name entries), and recursive path walking with case-insensitive ASCII matching.
  • File reads capped at 16 KiB; peak heap usage during ls ≈ 5 KiB.
  • See docs/exfat.md for full details.
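The path-walking comparison described above is straightforward to sketch: split the path into components and match each against directory entry names with ASCII-only case folding. A host-side illustration (function names are invented, not the driver's):

```rust
/// Split a path into non-empty components ("/BIN/Shell.elf" → ["BIN", "Shell.elf"]).
fn components(path: &str) -> impl Iterator<Item = &str> {
    path.split('/').filter(|c| !c.is_empty())
}

/// Case-insensitive ASCII matching; non-ASCII case folds are deliberately not applied.
fn name_matches(entry_name: &str, component: &str) -> bool {
    entry_name.eq_ignore_ascii_case(component)
}

fn main() {
    let comps: Vec<_> = components("/BIN/Shell.elf").collect();
    assert_eq!(comps, ["BIN", "Shell.elf"]);
    assert!(name_matches("bin", comps[0]));
    assert!(name_matches("SHELL.ELF", comps[1]));
    assert!(!name_matches("shell.elf", "shell"));
    println!("ok");
}
```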

VFS Layer (devices/src/vfs/)

  • Uniform path namespace over multiple filesystems; shell no longer calls filesystem drivers directly.
  • Enum dispatch (AnyVfs) avoids Pin<Box<dyn Future>> trait objects.
  • Mount table (MOUNTS: SpinMutex<Vec<(String, Arc<AnyVfs>)>>) sorted longest-mountpoint-first; the Arc is cloned out before any .await so the lock is never held across a suspension point.
  • ExfatVfs — wraps a BlkInbox and delegates to the exFAT driver.
  • Plan9Vfs — wraps an Arc<P9Client> and delegates to the 9P client. Maps P9Error to VfsError (ENOENT→NotFound, ENOTDIR→NotADirectory, etc.).
  • ProcVfs — synthetic filesystem; no block I/O. All system info commands have been migrated from the shell to /proc virtual files:
    • /proc/tasks — ready / waiting task counts from the executor.
    • /proc/uptime — seconds since boot from the LAPIC tick counter.
    • /proc/drivers — name and state of every registered driver.
    • /proc/threads — current thread index and context-switch count.
    • /proc/meminfo — heap usage, frame allocator stats, known virtual regions.
    • /proc/memmap — physical memory regions from the bootloader memory map.
    • /proc/cpuinfo — CPU vendor, family/model/stepping, CR0/CR4/EFER/RFLAGS.
    • /proc/pmap — page table walk with coalesced contiguous regions.
    • /proc/idt — IDT vector assignments (exceptions, PIC, LAPIC, dynamic).
    • /proc/pci — enumerated PCI devices.
    • /proc/lapic — Local APIC state and timer configuration.
    • /proc/ioapic — I/O APIC redirection table entries.
    • /proc/irq_stats — per-slot IRQ counters (total, delivered, buffered, spurious).
  • Shell commands: ls, cat, cd use the VFS API; mount manages the mount table at runtime (mount, mount proc <mp>, mount blk <mp>).
  • /proc is always mounted at boot; exFAT / is mounted if virtio-blk is present; 9p /host is mounted if virtio-9p is present (and 9p falls back to / when no disk image exists).
  • See docs/vfs.md for full design notes.
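The longest-mountpoint-first lookup can be sketched on the host. A simplified model assuming the table is already sorted by descending mountpoint length, as the real MOUNTS vector is; the tuple-of-&str representation stands in for (String, Arc<AnyVfs>):

```rust
/// Resolve a path against a mount table sorted longest-mountpoint-first.
/// Returns (filesystem, path relative to the mountpoint).
fn resolve<'a>(mounts: &'a [(&'a str, &'a str)], path: &'a str) -> Option<(&'a str, &'a str)> {
    for &(mp, fs) in mounts {
        if let Some(rest) = path.strip_prefix(mp) {
            // A mountpoint matches only on a path-component boundary,
            // so "/host" does not capture "/hostile".
            if mp == "/" || rest.is_empty() || rest.starts_with('/') {
                let rel = if mp == "/" { path } else if rest.is_empty() { "/" } else { rest };
                return Some((fs, rel));
            }
        }
    }
    None
}

fn main() {
    // Sorted longest-first, mirroring the MOUNTS ordering.
    let mounts = [("/host", "9p"), ("/proc", "procfs"), ("/", "exfat")];
    assert_eq!(resolve(&mounts, "/host/bin/shell"), Some(("9p", "/bin/shell")));
    assert_eq!(resolve(&mounts, "/proc/uptime"), Some(("procfs", "/uptime")));
    assert_eq!(resolve(&mounts, "/hostile"), Some(("exfat", "/hostile")));
    assert_eq!(resolve(&mounts, "/"), Some(("exfat", "/")));
    println!("ok");
}
```

Sorting longest-first makes the first prefix match the correct one, which is why the real table is re-sorted whenever mount changes it.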

Completion Port Async I/O (osl/src/io_port.rs)

  • io_uring-style completion-based async I/O subsystem.
  • Kernel object: CompletionPort (libkernel/src/completion_port.rs) — bounded queue of completions with single-waiter blocking semantics.
  • FdObject enum in libkernel/src/file.rs provides type-safe polymorphism for the fd table (File | Port), replacing the previous trait-object downcast approach.
  • IrqMutex protects the CompletionPort for ISR-safe post() from interrupt context.
  • Syscalls: io_create (501), io_submit (502), io_wait (503), io_setup_rings (511), io_ring_enter (512).
  • Supported operations: OP_NOP (immediate), OP_TIMEOUT (async timer via executor), OP_READ / OP_WRITE (async — user buffers are copied to/from kernel memory during io_submit/io_wait; the actual I/O runs on executor tasks so io_submit returns immediately), OP_IRQ_WAIT (hardware interrupt delivery — ISR masks GSI and posts completion; rearm via another submit unmasks).
  • Shared-memory SQ/CQ rings (Phase 5): io_setup_rings allocates ring pages as shmem fds; userspace writes SQEs to the SQ ring and reads CQEs from the CQ ring. io_ring_enter kicks the kernel and/or blocks for completions.
  • FileHandle trait has poll_read / poll_write methods (default impls delegate to sync read/write). PipeReader and ConsoleHandle override poll_read with waker-based async semantics so completion port reads never block executor threads.
  • Userspace demo programs: io_demo.c (smoke test), io_pingpong.c / io_pong.c (parent-child IPC via completion port).
  • See docs/completion-port-design.md for the full phased roadmap (all phases complete).

IRQ File Descriptors (libkernel/src/irq_handle.rs, osl/src/irq.rs)

  • Userspace interrupt delivery via irq_create(gsi) syscall (504).
  • IrqInner tracks GSI, vector, slot, and saved IO APIC redirection entry.
  • ISR handler (irq_fd_dispatch) masks the GSI via libkernel::apic::mask_gsi and posts a completion to the associated CompletionPort. For keyboard (GSI 1) and mouse (GSI 12), the ISR reads port 0x60 and drains all available bytes per interrupt (up to 16 per ISR invocation).
  • 64-entry scancode ring buffer per slot prevents lost scancodes between rearms. arm_irq bulk-drains the entire buffer into completions.
  • Per-slot atomic IRQ counters (total, delivered, buffered, spurious, wrong_source) visible via /proc/irq_stats.
  • On close, the original IO APIC entry is restored.
  • Demo: user/irq_demo.c — keyboard scancode display via OP_IRQ_WAIT.
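The per-slot scancode buffer described above is a plain bounded ring. A host-runnable sketch — the in-kernel version uses atomic indices because the producer runs in an ISR, and drop-on-full behaviour here is an assumption of the model:

```rust
const RING: usize = 64;

/// Sketch of the per-slot scancode ring: fixed-size, drop-on-full.
struct ScancodeRing {
    buf: [u8; RING],
    head: usize, // next write position (ISR side)
    tail: usize, // next read position (arm_irq side)
    len: usize,
}

impl ScancodeRing {
    fn new() -> Self { Self { buf: [0; RING], head: 0, tail: 0, len: 0 } }

    /// ISR side: returns false (scancode dropped) when the ring is full.
    fn push(&mut self, sc: u8) -> bool {
        if self.len == RING { return false; }
        self.buf[self.head] = sc;
        self.head = (self.head + 1) % RING;
        self.len += 1;
        true
    }

    /// arm_irq side: bulk-drain everything buffered since the last rearm.
    fn drain(&mut self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.len);
        while self.len > 0 {
            out.push(self.buf[self.tail]);
            self.tail = (self.tail + 1) % RING;
            self.len -= 1;
        }
        out
    }
}

fn main() {
    let mut ring = ScancodeRing::new();
    for sc in [0x1E, 0x9E, 0x30] { assert!(ring.push(sc)); }
    assert_eq!(ring.drain(), vec![0x1E, 0x9E, 0x30]);
    assert!(ring.drain().is_empty());
    println!("ok");
}
```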

IPC Channels (libkernel/src/channel.rs, osl/src/ipc.rs)

  • Capability-based IPC channels for structured message passing between processes.
  • Unidirectional with configurable buffer capacity: capacity=0 for synchronous rendezvous (seL4-style), capacity>0 for async buffered.
  • Fixed 48-byte messages: tag (u64) + data[3] (u64) + fds[4] (i32).
  • fd-passing (capability transfer): sender’s fds are extracted at send time, kernel objects are stored in the channel, and new fds are allocated in the receiver’s fd table at recv time. Cleanup on drop for undelivered messages.
  • Completion port integration: OP_IPC_SEND (5) and OP_IPC_RECV (6) for multiplexing IPC with timers, IRQs, and file I/O.
  • Syscalls: ipc_create (505), ipc_send (506), ipc_recv (507).
  • Demos: ipc_sync.c, ipc_async.c, ipc_port.c, ipc_fdpass.c.
  • See docs/ipc-channels.md for full design.
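The fixed 48-byte layout follows directly from the field list above. A sketch with repr(C) — the struct and field names here are illustrative, not the kernel's actual definitions:

```rust
/// Sketch of the fixed-size IPC message: tag + three data words + four fd slots.
#[repr(C)]
struct IpcMessage {
    tag: u64,
    data: [u64; 3],
    fds: [i32; 4],
}

fn main() {
    // 8 + 3*8 + 4*4 = 48 bytes, with no padding under repr(C)
    // (48 is already a multiple of the 8-byte alignment).
    assert_eq!(std::mem::size_of::<IpcMessage>(), 48);
    assert_eq!(std::mem::align_of::<IpcMessage>(), 8);
    println!("ok");
}
```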

Deadlock Detection (libkernel/src/spin_mutex.rs)

  • All spin::Mutex locks replaced with SpinMutex — a drop-in wrapper that counts spin iterations and panics after a threshold, turning silent hangs into actionable diagnostics with serial output.
  • SpinMutex: 100M iteration limit (~100 ms) — allows for legitimate preemption contention on a single-core scheduler.
  • IrqMutex: 10M iteration limit (~10 ms) — interrupts disabled means no preemption, so any contention indicates a true deadlock.
  • deadlock_panic() writes directly to COM1 (0x3F8) bypassing SERIAL1’s lock, then panics.
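The core idea — count spin iterations and fail loudly past a threshold — fits in a few lines. A host-runnable sketch, not the kernel's SpinMutex: it returns an error instead of panicking, uses a tiny limit, and omits the guard/RAII API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Sketch of a deadlock-detecting spinlock: spin iterations are counted and the
/// acquisition fails loudly past a threshold instead of hanging silently.
struct CountingSpinLock {
    locked: AtomicBool,
    limit: u64,
}

impl CountingSpinLock {
    const fn new(limit: u64) -> Self {
        Self { locked: AtomicBool::new(false), limit }
    }

    /// Ok(spins) on acquisition, Err(limit) if the threshold is exceeded.
    fn try_lock_counting(&self) -> Result<u64, u64> {
        let mut spins = 0;
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            spins += 1;
            if spins >= self.limit {
                // The kernel version writes a diagnostic to COM1 and panics here.
                return Err(self.limit);
            }
            std::hint::spin_loop();
        }
        Ok(spins)
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let lock = CountingSpinLock::new(1_000);
    assert_eq!(lock.try_lock_counting(), Ok(0));      // uncontended: acquired instantly
    assert_eq!(lock.try_lock_counting(), Err(1_000)); // held: threshold trips
    lock.unlock();
    assert_eq!(lock.try_lock_counting(), Ok(0));
    println!("ok");
}
```

The two thresholds in the real kernel encode a policy decision: IrqMutex contention with interrupts off can never resolve, so its limit is an order of magnitude tighter.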

POSIX Signals (libkernel/src/signal.rs, osl/src/signal.rs)

  • Phases 1–2: signal infrastructure, delivery on SYSCALL return, Ctrl+C/SIGINT, signal-interrupted syscalls (EINTR).
  • rt_sigaction (13): install/query signal handlers (SA_SIGINFO, SA_RESTORER).
  • rt_sigprocmask (14): SIG_BLOCK/UNBLOCK/SETMASK for the signal mask.
  • kill (62): send a signal to a specific pid; wakes interruptible blocks.
  • rt_sigreturn (15): restore context from rt_sigframe after handler returns.
  • Signal delivery via check_pending_signals in the SYSCALL return path: constructs a Linux-ABI-compatible rt_sigframe on the user stack, rewrites the saved register frame so sysretq “returns” into the handler.
  • Ctrl+C: keyboard actor queues SIGINT on foreground_pid(), wakes blocked console reader.
  • EINTR: blocking syscalls (sys_wait4, PipeReader::read) set a per-process signal_thread field; sys_kill unblocks it so the syscall returns EINTR. The shell forwards SIGINT to child processes on EINTR from waitpid.
  • Default actions: SIG_DFL terminate (SIGKILL, SIGTERM, etc.) or ignore (SIGCHLD).
  • Demos: user/sig_demo.c (SIGUSR1 self-signal), user/sig_int.c (Ctrl+C interrupt test), userspace shell handles SIGINT via sigaction.
  • See docs/signals.md for full design.

Dummy Driver (devices/src/dummy.rs)

  • Example actor with #[on_tick] heartbeat, #[on_message(SetInterval)], and #[on_info].
  • Demonstrates the full actor feature set.

ACPI Parsing

  • kernel/src/kernel_acpi.rs implements an AcpiHandler that accesses physical ACPI regions via the bootloader’s identity map (phys + physical_memory_offset); no dynamic page mapping is required since all ACPI tables live in physical RAM.
  • Calls acpi::search_for_rsdp_bios to locate and parse ACPI tables.
  • On boot the interrupt model is printed; APIC vs legacy PIC is detected.

APIC Module (libkernel/src/apic/)

  • APIC code lives in libkernel::apic, mapped at 0xFFFF_8001_0000_0000.
  • libkernel/src/apic/local_apic/ — Local APIC register access via MMIO and MSR.
  • libkernel/src/apic/io_apic/ — I/O APIC register access via MMIO.
  • libkernel::apic::init() maps the Local APIC and all I/O APICs from the ACPI table, routes ISA IRQs 0 (timer) and 1 (keyboard) through the I/O APIC to IDT vectors 0x20 and 0x21, then disables the 8259 PIC.
  • libkernel::apic::calibrate_and_start_lapic_timer() uses the PIT as a reference to measure the LAPIC bus frequency, starts the LAPIC timer in periodic mode at 1000 Hz, then masks the PIT’s I/O APIC entry so it no longer fires.

Logging

  • libkernel/src/logger.rs wraps the VGA println! macro as a log::Log implementation.
  • log::{debug, info, warn, error} macros usable throughout the kernel.
  • Initialised early in libkernel_main.

CPUID

  • libkernel/src/cpuid.rs — thin wrapper around raw-cpuid; init() called during kernel init.

Known Issues / Technical Debt

Heap Size

The heap is a fixed 1 MiB at 0xFFFF_8000_0000_0000. Kernel thread stacks are allocated from a separate stack arena (libkernel/src/stack_arena.rs) at 0xFFFF_8000_0010_0000 (16 × 64 KiB = 1 MiB), keeping large stack allocations off the general-purpose heap and eliminating fragmentation. The arena uses a bitmap for O(1) alloc/free with RAII slot handles. The heap is now used only for small driver/task allocations. The DumbVmemAllocator has no reclamation path, so virtual address space for MMIO/ACPI mappings is consumed monotonically.

virtio-blk Single-sector I/O

Block I/O uses IRQ-driven completion via AtomicWaker, but is still limited to one 512-byte sector per request.

exFAT Write Support

The exFAT driver is read-only. All filesystem state changes (create, write, delete) are unsupported.

ProcVfs File Sizes Reported as Zero

VfsDirEntry::size is 0 for all /proc entries because the content length is not known until the data is serialised. This is cosmetically wrong in ls output but functionally harmless.


Possible Next Steps

Completion Port — All Phases Complete

  • Phases 1–4 (core, read/write, OP_IRQ_WAIT, OP_RING_WAIT) — see sections above.
  • Phase 5: Shared-memory SQ/CQ rings — implemented. io_setup_rings (511) allocates shared SQ/CQ ring pages exposed as shmem fds. io_ring_enter (512) processes SQ entries and waits for CQ completions. Dual-mode post() writes simple CQEs directly to the shared CQ ring; deferred completions (OP_READ, OP_IPC_RECV) are flushed in syscall context. Test: ring_sq_test.

Memory Management

  1. Larger / growable heap — demand-paged heap that grows on fault, or a larger static allocation. 1 MiB is tight with concurrent processes.

  2. Reclaiming virtual address space — replace DumbVmemAllocator with a proper free-list allocator so MMIO mappings can be released.

  3. File-backed MAP_SHARED — anonymous shared memory (via shmem_create) is complete; file-backed MAP_SHARED with inode page cache remains future work. See docs/mmap-design.md Phase 5c.

Process Model

  1. Signals Phase 3+ — Phases 1–2 (basic signal delivery + Ctrl+C/SIGINT + EINTR) are complete: rt_sigaction, rt_sigprocmask, kill, signal delivery on SYSCALL return, rt_sigreturn, Ctrl+C → SIGINT to foreground process, signal-interrupted blocking syscalls (EINTR). Remaining: exception-generated signals (SIGSEGV, SIGILL), SIGCHLD on child exit. See docs/signals.md.

  2. fork + CoW page faults — standard POSIX fork. clone(CLONE_VM|CLONE_VFORK) and execve are now implemented, enabling unpatched musl posix_spawn and Rust std::process::Command. Full fork with CoW still requires a page fault handler and frame reference counting.

Drivers & I/O

  1. Multi-sector DMA — batch multiple sectors per virtio request to reduce queue round-trips for directory scans and file reads.

  2. exFAT write support — directory entry creation, FAT chain allocation, and sector writes to enable touch, mkdir, cp, rm.

Compositor & Window Management

The userspace compositor (/bin/compositor) is a Wayland-style display server with full input routing and window management.

  • Display: Takes exclusive ownership of the BGA framebuffer via framebuffer_open (515). Double-buffered compositing with painter’s algorithm. Cursor-only rendering optimization patches small rectangles for mouse movement instead of full recomposite.
  • Input: Connects to /bin/kbd (keyboard) service via the service registry. Mouse input is integrated directly — the compositor claims IRQ 12 and decodes PS/2 packets inline (no separate mouse driver process). Key events forwarded to focused window. Mouse events drive cursor, focus, drag, and resize.
  • CDE-style decorations: Server-side window decorations inspired by CDE/Motif — 3D beveled borders (BORDER_W=4, BEVEL=2), 24px title bar with centered title, CDE-style close button, sunken inner bevel around client area. Blue-grey color palette.
  • Window management: Click-to-focus with Z-order raise. Title bar drag to move. Edge/corner drag to resize with context-sensitive cursor icons (diagonal, horizontal, vertical double-arrows). Close button removes window.
  • Resize protocol: On resize completion, compositor allocates a new shared buffer and sends MSG_WINDOW_RESIZED (tag 7) with the new buffer fd. Terminal emulator remaps buffer, recalculates dimensions, and redraws.
  • Terminal emulator (/bin/term): Compositor client that spawns /bin/shell with pipe-connected stdin/stdout. VT100 parser with color support. Character-level screen buffer (Cell array + per-row wrapped flags) enables text reflow on resize: logical lines are extracted, re-wrapped to the new width, and pixels regenerated from the cell buffer. Cursor position is preserved across resize.
  • See docs/compositor-design.md and docs/display-input-ownership.md.

Microkernel Path

  1. Microkernel Phase B — kernel primitives for userspace drivers: device MMIO mapping, DMA syscalls. IRQ fd (syscall 504 + OP_IRQ_WAIT) and MAP_SHARED (via shmem_create 508) are complete. Remaining items unblock userspace NIC driver. See docs/microkernel-design.md.

  2. Networking — virtio-net driver + smoltcp TCP/IP stack. The completion port is ready to back it once the NIC driver lands. See docs/networking-design.md.

Preemptive Scheduler & Multi-threaded Async Executor

Overview

The kernel uses a round-robin preemptive scheduler built on top of the LAPIC timer (1000 Hz). Every 10 ms (configurable via QUANTUM_TICKS) the timer ISR saves the current CPU state and switches to the next ready thread, regardless of what that thread was doing. This prevents any single async task — even one that busy-loops — from starving all others.

The async executor’s state lives in global statics, so multiple kernel threads can pull and poll tasks from the same shared queue concurrently.


Thread Lifecycle

        spawn_thread()
             │
             ▼
          [ Ready ] ◄──────────────────────────────┐
             │  (selected by scheduler)             │
             ▼                                      │
         [ Running ]  ──── quantum expired ─────────┘

Threads cycle between Ready and Running in strict round-robin order. There is no blocked/sleeping state for threads — a thread that has nothing to do (idle executor loop) calls HLT until an interrupt wakes it.

Thread 0 is the initial kernel thread. Early in boot, libkernel_main calls scheduler::migrate_to_heap_stack(run_kernel) which allocates a 64 KiB heap stack and switches RSP to it before continuing. This moves thread 0 off the bootloader’s lower-half stack onto PML4 entry 256 (high canonical half), so its stack survives CR3 switches into user page tables.

Additional threads are created with scheduler::spawn_thread(entry: fn() -> !). The entry function must never return; in practice it calls executor::run_worker().


Context Switch Mechanism

LAPIC timer IDT entry

The IDT entry for LAPIC_TIMER_VECTOR (0x30) is set with set_handler_addr pointing directly at lapic_timer_stub. This bypasses the extern "x86-interrupt" wrapper so the stub can manipulate RSP freely.

Assembly stub (lapic_timer_stub)

lapic_timer_stub:
    push rax; push rbx; push rcx; push rdx
    push rsi; push rdi; push rbp
    push r8;  push r9;  push r10; push r11
    push r12; push r13; push r14; push r15
    sub  rsp, 512           // allocate FXSAVE area
    fxsave [rsp]            // save x87/MMX/SSE state
    mov  rdi, rsp           // current_rsp → first argument
    call preempt_tick       // returns new rsp in rax
    mov  rsp, rax           // switch to (possibly new) thread's stack
    fxrstor [rsp]           // restore x87/MMX/SSE state
    add  rsp, 512           // deallocate FXSAVE area
    pop r15; pop r14; pop r13; pop r12
    pop r11; pop r10; pop r9;  pop r8
    pop rbp; pop rdi; pop rsi
    pop rdx; pop rcx; pop rbx; pop rax
    iretq

The CPU pushes an interrupt frame (SS/RSP/RFLAGS/CS/RIP, 40 bytes) before the stub runs. The stub pushes 15 GPRs (120 bytes) and then allocates a 512-byte FXSAVE area for x87/MMX/SSE register state. Together that is 672 bytes = 42 × 16, so RSP is 16-byte aligned for both fxsave [rsp] (requires 16-byte alignment) and the call instruction (SysV ABI: RSP + 8 aligned at function entry).
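The alignment arithmetic can be checked mechanically; a trivial host-side restatement of the same sums:

```rust
fn main() {
    const IRET_FRAME: u64 = 40; // SS, RSP, RFLAGS, CS, RIP pushed by the CPU
    const GPRS: u64 = 15 * 8;   // 15 general-purpose registers pushed by the stub
    const FXSAVE: u64 = 512;    // x87/MMX/SSE save area

    let total = IRET_FRAME + GPRS + FXSAVE;
    assert_eq!(total, 672);
    assert_eq!(total % 16, 0); // fxsave requires 16-byte alignment
    assert_eq!(total / 16, 42);
    println!("ok");
}
```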

preempt_tick(current_rsp: u64) -> u64

Runs entirely on the current thread’s stack (inside the call/ret pair), then returns the next thread’s saved_rsp in RAX.

  1. Increments the global tick counter and wakes sleeping async tasks.
  2. Sends LAPIC EOI.
  3. Locks SCHEDULER (interrupts already off — no deadlock risk).
  4. If not yet initialised, returns current_rsp unchanged.
  5. Decrements the current thread’s ticks_remaining; if still > 0, returns unchanged.
  6. Saves current_rsp in current_thread.saved_rsp.
  7. Pushes the current thread index onto ready_queue (marks it Ready).
  8. Pops the front of ready_queue as next_idx. Because we just pushed current, the queue is always non-empty; unwrap_or(current_idx) is only a safety fallback. If current was the only thread it gets re-scheduled.
  9. Resets ticks_remaining = QUANTUM_TICKS, marks thread as Running.
  10. Returns next_thread.saved_rsp.

The stub then sets RSP = returned value and executes the symmetric pops + iretq, which resumes execution on the new thread.
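Steps 5–10 above can be modelled on the host with plain data structures. This is bookkeeping only — no real stacks, EOI, or tick wakeups — and the struct layout is an illustration, not the kernel's:

```rust
use std::collections::VecDeque;

const QUANTUM_TICKS: u32 = 10;

/// Host-side model of the preempt_tick scheduling decision.
struct Scheduler {
    saved_rsp: Vec<u64>,
    ready_queue: VecDeque<usize>,
    current: usize,
    ticks_remaining: u32,
}

impl Scheduler {
    fn preempt_tick(&mut self, current_rsp: u64) -> u64 {
        self.ticks_remaining -= 1;
        if self.ticks_remaining > 0 {
            return current_rsp; // quantum not expired: resume the same thread
        }
        self.saved_rsp[self.current] = current_rsp;
        self.ready_queue.push_back(self.current); // current becomes Ready
        let next = self.ready_queue.pop_front().unwrap_or(self.current);
        self.current = next;
        self.ticks_remaining = QUANTUM_TICKS;
        self.saved_rsp[next] // the stub switches RSP to this value
    }
}

fn main() {
    let mut s = Scheduler {
        saved_rsp: vec![0, 0x2000], // thread 1 was spawned with a fake frame at 0x2000
        ready_queue: VecDeque::from([1]),
        current: 0,
        ticks_remaining: QUANTUM_TICKS,
    };
    // Nine ticks pass with no switch…
    for _ in 0..9 {
        assert_eq!(s.preempt_tick(0x1000), 0x1000);
    }
    // …the tenth expires the quantum and switches to thread 1.
    assert_eq!(s.preempt_tick(0x1000), 0x2000);
    assert_eq!(s.current, 1);
    assert_eq!(s.saved_rsp[0], 0x1000);
    println!("ok");
}
```

Because the current thread is pushed before the pop, the queue is never empty at the pop, matching the note in step 8.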


Initial Stack Layout for New Threads

spawn_thread(entry) allocates a 64 KiB Vec<u8> and writes a fake interrupt frame at the top. The frame is exactly what a preempted thread’s stack looks like, so the same assembly stub can start a new thread as if it were resuming a preempted one.

high address  ┌──────────────────────────┐
              │  SS   = 0                │  ← null selector, valid for ring-0
              │  RSP  = stack_top−8      │  ← thread's initial stack pointer
              │  RFLAGS = 0x202          │  ← bit 9 (IF) + bit 1 (reserved)
              │  CS   = 0x08             │  ← kernel code segment
              │  RIP  = entry            │  ← thread entry point
              ├──────────────────────────┤
              │  rax  = 0                │  15 GPRs (120 bytes)
              │  rbx  = 0                │
              │  …                       │
              │  r15  = 0                │
              ├──────────────────────────┤
              │  FXSAVE area             │  512 bytes (16-byte aligned)
              │  (x87/MMX/SSE state)     │  MXCSR = 0x1F80 at offset +24
              │                          │  XMM0-15 at offset +160
              │                          │  ← saved_rsp points here
low address   └──────────────────────────┘

saved_rsp = base of the 512-byte FXSAVE area. The SwitchFrame (GPRs + iretq frame) sits at saved_rsp + 512. Total region is 672 bytes, guaranteed 16-byte aligned by rounding stack_top down.


Timer Quantum

QUANTUM_TICKS in task/scheduler.rs controls how many LAPIC ticks (1 tick = 1 ms at 1000 Hz) each thread runs before being preempted. The default is 10 (10 ms per thread).

To increase to 50 ms:

pub const QUANTUM_TICKS: u32 = 50;

Thread-safe Async Executor

Global state

| Static | Type | Purpose |
|---|---|---|
| TASK_QUEUE | Mutex<VecDeque<Task>> | Tasks ready to be polled |
| WAIT_MAP | Mutex<BTreeMap<TaskId, Task>> | Tasks waiting for a waker |
| WAKE_QUEUE | Arc<ArrayQueue<TaskId>> | Lock-free waker notifications (ISR-safe) |
| WAKER_CACHE | Mutex<BTreeMap<TaskId, Waker>> | One Waker per live task; keeps Arc count ≥ 2 to prevent ISR deallocation |

TASK_QUEUE and WAIT_MAP use SpinMutex (a spin::Mutex wrapper with deadlock detection). On a single CPU with preemption, a thread can be preempted while holding a spinlock; the new thread spinning on the same lock will waste its quantum and yield back, at which point the original thread releases the lock. If this doesn’t resolve within ~100 ms (SPIN_LIMIT iterations), SpinMutex panics with a serial diagnostic rather than hanging silently.
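The count-then-panic idea can be sketched with a minimal spin flag (std types only; the real SpinMutex wraps spin::Mutex and emits a serial diagnostic before panicking):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const SPIN_LIMIT: u64 = 100_000_000; // ~100 ms at the text's estimate

// Minimal sketch: count spin iterations and panic past a threshold
// instead of hanging silently.
struct SpinFlag { locked: AtomicBool }

impl SpinFlag {
    const fn new() -> Self { SpinFlag { locked: AtomicBool::new(false) } }

    fn lock(&self) {
        let mut spins: u64 = 0;
        while self.locked.compare_exchange_weak(
            false, true, Ordering::Acquire, Ordering::Relaxed).is_err() {
            spins += 1;
            if spins > SPIN_LIMIT {
                panic!("deadlock: spun {} times", spins); // actionable, not silent
            }
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) { self.locked.store(false, Ordering::Release); }
}

fn main() {
    let m = SpinFlag::new();
    m.lock();
    m.unlock();
    m.lock(); // re-acquire works after release
    m.unlock();
}
```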

ISR-safe waker deallocation (WAKER_CACHE)

Both the timer ISR (timer::tick) and the keyboard ISR call Waker::wake(), which consumes the stored Waker. If that were the last Arc<TaskWaker> reference, the Drop impl would call into linked_list_allocator, whose spinlock may already be held by the preempted thread → deadlock.

WAKER_CACHE (Mutex<BTreeMap<TaskId, Waker>>) holds one cached Waker per live task, keeping the Arc strong count ≥ 2 whenever an ISR-accessible copy exists. The ISR’s drop reduces the count from 2 → 1; the cache’s copy is only freed from executor context when a task completes (Poll::Ready).

Task: Send requirement

Task::new requires Future<Output = ()> + Send + 'static. All built-in tasks (timer, keyboard, example) satisfy this because they only hold values that are Send (atomics, Mutex-guarded globals, simple scalars).

spawn(task) and run_worker()

executor::spawn pushes a Task into TASK_QUEUE. executor::run_worker loops:

  1. Move tasks whose wakers fired from WAIT_MAP → TASK_QUEUE.
  2. Poll every task in TASK_QUEUE. If Pending, move to WAIT_MAP.
  3. sleep_if_idle: disable interrupts, check WAKE_QUEUE, then atomically re-enable + HLT (prevents missed-wakeup race).

Locking Rules

| Lock | Where held | Rule |
|---|---|---|
| SCHEDULER | ISR and non-ISR | Non-ISR callers must use without_interrupts(...) |
| TASK_QUEUE | Non-ISR only | Released before polling to allow spawn() inside poll |
| WAIT_MAP | Non-ISR only | Released before locking TASK_QUEUE to avoid ordering inversion |
| WAKER_CACHE | Non-ISR only | Released before polling |
| timer::WAKERS | ISR (tick) + non-ISR (Delay::poll) | Non-ISR uses without_interrupts |

The ISR already runs with IF = 0, so it never needs to call without_interrupts.

Deadlock Detection

All spin::Mutex locks have been replaced with SpinMutex (libkernel/src/spin_mutex.rs), which counts spin iterations and panics after a threshold:

| Lock type | Threshold | Rationale |
|---|---|---|
| SpinMutex | 100,000,000 (~100 ms) | Well beyond the 10 ms quantum; allows for legitimate preemption contention |
| IrqMutex | 10,000,000 (~10 ms) | Interrupts are disabled — no preemption, so any contention is a true deadlock |

On timeout, deadlock_panic() writes directly to serial port 0x3F8 (bypassing SERIAL1’s lock) and then panics. This turns silent hangs into actionable diagnostics.


Demonstrating Preemption

Add a spinning task to confirm no starvation:

executor::spawn(Task::new(async {
    loop { core::hint::spin_loop(); }
}));

Without preemption this would freeze the kernel. With the scheduler, the LAPIC timer fires every 10 ms and rotates to the next thread, so [timer] tick: Ns elapsed still appears every second.

Paging Design

Virtual Address Layout

x86-64 canonical addresses split into two halves:

0x0000_0000_0000_0000 ┐
       ...            │  lower canonical half — user process address space
0x0000_7FFF_FFFF_FFFF ┘
                        (non-canonical gap — any access faults)
0xFFFF_8000_0000_0000   kernel heap        (HEAP_START, 256 KiB)
0xFFFF_8001_0000_0000   Local APIC MMIO   (APIC_BASE, 4 KiB)
0xFFFF_8001_0001_0000   IO APIC(s)        (4 KiB × n, relative to APIC_BASE)
0xFFFF_8002_0000_0000   MMIO window       (MMIO_VIRT_BASE, 512 GiB)
  ↑ PCIe ECAM, virtio BARs, future driver MMIO allocated here
0xFFFF_FF80_0000_0000   recursive PT window (for index 511, see below)
0xFFFF_FFFF_FFFF_F000   PML4 self-mapping   (recursive index 511)
phys_mem_offset         bootloader physical identity map (stays put)
  + all physical RAM

All three kernel allocation regions (heap, APIC, MMIO) share PML4 index 256 (0xFFFF_8000_* through 0xFFFF_80FF_*), keeping the kernel footprint in a single top-level page-table entry — easy to share across per-process page tables without marking it USER_ACCESSIBLE.
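The "single PML4 entry" claim is easy to verify: the PML4 index is bits 39–47 of the canonical virtual address.

```rust
// PML4 index is bits 39..=47 of the canonical virtual address.
fn pml4_index(vaddr: u64) -> u64 { (vaddr >> 39) & 0x1FF }

fn main() {
    let heap_start = 0xFFFF_8000_0000_0000u64; // HEAP_START
    let apic_base  = 0xFFFF_8001_0000_0000u64; // APIC_BASE
    let mmio_base  = 0xFFFF_8002_0000_0000u64; // MMIO_VIRT_BASE
    assert_eq!(pml4_index(heap_start), 256);
    assert_eq!(pml4_index(apic_base), 256);
    assert_eq!(pml4_index(mmio_base), 256);   // all three share entry 256
    // The recursive slot sits at the top of the address space:
    assert_eq!(pml4_index(0xFFFF_FFFF_FFFF_F000), 511);
}
```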


Page Table Implementation: RecursivePageTable

Why recursive instead of OffsetPageTable

OffsetPageTable walks page-table frames by computing phys_mem_offset + frame_phys_address. This creates a permanent dependency on the bootloader’s physical-identity map (which lives in the lower canonical half). For user-space isolation we want the lower half to be entirely process-owned.

RecursivePageTable eliminates this dependency: the CPU’s own hardware page walker is used to reach PT frames, so no identity map is needed for page-table operations.

How recursive mapping works

One PML4 slot (index 511) is pointed at the PML4’s own physical frame. When the CPU walks this entry it re-enters the same PML4 as if it were a PDPT. Repeating four times (P4→511, P3→511, P2→511, P1→511) exposes the PML4’s own 4 KiB page at virtual address 0xFFFF_FFFF_FFFF_F000.

The full recursive window for index R (R=511) maps every page-table frame at a computable virtual address:

| Level mapped | Virtual base (R=511) | What is mapped there |
|---|---|---|
| PML4 | 0xFFFF_FFFF_FFFF_F000 | the PML4 itself |
| PDPTs | 0xFFFF_FFFF_FFE0_0000+ | all 512 PDPTs |
| PDs | 0xFFFF_FFFF_C000_0000+ | all 512 × 512 PDs |
| PTs | 0xFFFF_FF80_0000_0000+ | all PT frames |

The x86_64 crate’s RecursivePageTable type uses these computable addresses to implement Mapper and Translate without any identity-map knowledge.
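These window addresses all fall out of one formula: assemble the four 9-bit table indices plus offset, then sign-extend bit 47. A small check:

```rust
// Build a canonical virtual address from page-table indices (R = 511).
fn va(p4: u64, p3: u64, p2: u64, p1: u64, off: u64) -> u64 {
    let raw = (p4 << 39) | (p3 << 30) | (p2 << 21) | (p1 << 12) | off;
    // Sign-extend bit 47 into bits 48..63 (canonical form).
    if raw & (1 << 47) != 0 { raw | 0xFFFF_0000_0000_0000 } else { raw }
}

fn main() {
    const R: u64 = 511;
    // Four recursive steps land on the PML4's own page:
    assert_eq!(va(R, R, R, R, 0), 0xFFFF_FFFF_FFFF_F000);
    // Three steps expose the 512 PDPT frames:
    assert_eq!(va(R, R, R, 0, 0), 0xFFFF_FFFF_FFE0_0000);
    // Two steps expose the PDs:
    assert_eq!(va(R, R, 0, 0, 0), 0xFFFF_FFFF_C000_0000);
    // One step exposes every PT frame:
    assert_eq!(va(R, 0, 0, 0, 0), 0xFFFF_FF80_0000_0000);
}
```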

Setup sequence (in libkernel::memory::init)

1. Read CR3 → PML4 physical frame
2. Access PML4 via bootloader identity map: virt = phys_mem_offset + pml4_phys
3. Write PML4[511] = (pml4_phys_frame, PRESENT | WRITABLE)
4. flush_all()   ← new mapping is now active
5. Compute recursive PML4 address: 0xFFFF_FFFF_FFFF_F000
6. Obtain &'static mut PageTable at that address
7. RecursivePageTable::new(pml4_at_recursive_addr)

After step 7 the identity map is still live (bootloader mapping is never removed), but RecursivePageTable does not use it for page-table walks.


MMIO Virtual Address Allocator

Problem with the old approach

The old map_mmio_region mapped MMIO at phys_mem_offset + phys_addr — the same virtual address the identity map uses for regular RAM. This worked but:

  • It placed MMIO in the lower canonical half (future user space).
  • It gave MMIO a fixed virtual address tied to phys_mem_offset, which varies per boot and can change if the bootloader is swapped.

New design: bump allocator + cache

A bump pointer starts at MMIO_VIRT_BASE = 0xFFFF_8002_0000_0000 and advances one region at a time. A BTreeMap<phys_base, virt_base> cache ensures that mapping the same physical address twice returns the same virtual address.

MMIO_VIRT_BASE  0xFFFF_8002_0000_0000
  + PCIe ECAM    1 MiB
  + virtio BAR0  varies
  + ...
  (grows upward; 512 GiB window — exhaustion is practically impossible)

Flags: PRESENT | WRITABLE | NO_CACHE (same as before).

Cache key

The cache key is the page-aligned physical base address. If the same physical base is mapped twice with different sizes the second call returns the cached mapping (the first mapping covers at least as many pages as were originally requested; in practice PCI BAR sizes are fixed per device).
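A minimal sketch of the bump-plus-cache behaviour (std types only; the real code additionally installs the page-table mappings with PRESENT | WRITABLE | NO_CACHE):

```rust
use std::collections::BTreeMap;

const MMIO_VIRT_BASE: u64 = 0xFFFF_8002_0000_0000;
const PAGE: u64 = 4096;

// Sketch: bump pointer plus phys→virt cache keyed by page-aligned base.
struct MmioAllocator {
    next: u64,
    cache: BTreeMap<u64, u64>, // page-aligned phys base → virt base
}

impl MmioAllocator {
    fn new() -> Self { Self { next: MMIO_VIRT_BASE, cache: BTreeMap::new() } }

    fn map(&mut self, phys: u64, len: u64) -> u64 {
        let base = phys & !(PAGE - 1);
        if let Some(&virt) = self.cache.get(&base) {
            return virt + (phys - base); // same phys base → same virt
        }
        let virt = self.next;
        let pages = (phys - base + len + PAGE - 1) / PAGE;
        self.next += pages * PAGE; // bump, never reused
        self.cache.insert(base, virt);
        virt + (phys - base)
    }
}

fn main() {
    let mut a = MmioAllocator::new();
    let v1 = a.map(0xFD00_0000, PAGE);
    let v2 = a.map(0xFD00_0000, PAGE); // cache hit: identical address
    assert_eq!(v1, v2);
    assert_eq!(v1, MMIO_VIRT_BASE);
    let v3 = a.map(0xFEC0_0000, PAGE);
    assert!(v3 > v1);                  // bump pointer advanced
}
```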

Heap dependency

BTreeMap::insert allocates from the kernel heap. map_mmio_region must not be called before init_heap completes or from interrupt context. All current call sites (boot path in main.rs, KernelHal::mmio_phys_to_virt) satisfy this constraint.


Remaining Identity Map Dependency

After the switch to RecursivePageTable, the bootloader identity map is still used in exactly two places:

| Use | Location | Notes |
|---|---|---|
| DMA address translation | KernelHal::dma_alloc in devices/src/virtio/mod.rs | DMA frames are physical RAM; phys_mem_offset + paddr gives the kernel virtual address for CPU access |
| ACPI table access | KernelAcpiHandler::phys_to_virt in kernel/src/kernel_acpi.rs | ACPI tables are in physical RAM; same formula |

Both are kernel-private and never exposed to user space. The bootloader identity map entries do not have the USER_ACCESSIBLE flag, so they are invisible to ring-3 processes regardless.

The restriction: every page table the kernel uses to walk page structures must keep the bootloader’s PML4 entries for the identity-map region. For per-process page tables this is easily satisfied by copying PML4 entries 0–255 (lower half) from the kernel PML4 — without USER_ACCESSIBLE — at process creation time.


Per-Process Page Tables

Each process gets its own PML4, created by MemoryServices::create_user_page_table:

  • Slot 511: self-referential entry pointing to the process’s own PML4 physical frame (required for RecursivePageTable to work per-process).
  • Slots 256–510: shared kernel mappings (heap, APIC, MMIO window, physical memory direct map), copied verbatim from the active PML4. These are high-half addresses, never accessible from ring-3. Because the PML4 entries point to the same PDPT/PD/PT frames, changes to kernel page tables at levels below PML4 are automatically visible in all address spaces.
  • Slots 0–255: process-private user-space mappings. The process’s code, stack, heap, and memory-mapped files live here.

Switching between processes requires only a mov cr3, new_pml4_phys. Because the kernel's high-half mappings are identical in every page table, nothing needs to be rebuilt on a switch; kernel TLB entries can additionally survive the CR3 write on CPUs using global pages or PCID.

PML4 lifecycle

User PML4s and their lower-half page table frames are freed when a process exits (terminate_process) or replaces its address space (execve). The kernel boot PML4 physical address is stored in KERNEL_PML4_PHYS (set during init_services). Before freeing a user PML4, the dying/exec’ing code switches CR3 and the scheduler’s thread record to the kernel PML4. This is critical because the frame allocator uses an intrusive free-list that overwrites freed frames immediately — leaving CR3 pointing at a freed PML4 would cause a triple fault on the next TLB refill.


Files Changed

| File | Change |
|---|---|
| libkernel/src/allocator/mod.rs | HEAP_START = 0xFFFF_8000_0000_0000 |
| kernel/src/main.rs | APIC_BASE = 0xFFFF_8001_0000_0000 |
| libkernel/src/memory/mod.rs | RecursivePageTable; MMIO_VIRT_BASE bump allocator; mmio_cache: BTreeMap |
| libkernel/src/memory/vmem_allocator.rs | Test BASE constant updated (cosmetic) |

mmap Phased Design

Overview

This document describes a phased plan for improving the virtual memory management subsystem, starting from the current minimal mmap implementation and building towards file-backed, shared mappings.

Each phase is self-contained and independently testable.


Current State

mmap (syscall 9)

  • Anonymous (MAP_ANONYMOUS) and file-backed MAP_PRIVATE (eager copy).
  • MAP_FIXED supported — implicit munmap of overlapping VMAs (Linux semantics).
  • Non-fixed allocations use a top-down gap finder over the VMA tree ([MMAP_FLOOR, MMAP_CEILING) = [0x10_0000_0000, 0x4000_0000_0000)). Freed regions are automatically reused.
  • Pages are eagerly allocated, zeroed, and mapped.
  • prot argument is honoured — page table flags are derived from PROT_READ, PROT_WRITE, PROT_EXEC via Vma::page_table_flags().
  • Regions are tracked as BTreeMap<u64, Vma> (vma_map in Process).
  • /proc/maps displays actual rwxp flags from VMA metadata.

munmap (syscall 11)

Implemented — unmaps pages, frees frames to the free list, and splits/removes VMAs. Supports partial unmaps (front, tail, middle split).

mprotect (syscall 10)

Implemented — updates page table flags and splits/updates VMAs. Supports partial mprotect across VMA boundaries (front, tail, middle split).

Process cleanup on exit

sys_exit frees all user-space frames (ELF segments, brk heap, user stack, mmap regions) and intermediate page table frames before marking zombie.

Process cleanup on execve

sys_execve creates a fresh PML4, switches CR3, then frees the old address space (all user pages and page tables).


Phase 1: VMA Tracking + PROT Flags ✓ (implemented)

Goal: Replace the bare Vec<(u64, u64)> region list with a proper VMA (Virtual Memory Area) structure, and honour the prot argument in mmap.

VMA struct

Add to libkernel/src/process.rs (or a new libkernel/src/vma.rs):

#[derive(Debug, Clone)]
pub struct Vma {
    pub start: u64,        // page-aligned
    pub len: u64,          // page-aligned
    pub prot: u32,         // PROT_READ | PROT_WRITE | PROT_EXEC
    pub flags: u32,        // MAP_PRIVATE | MAP_ANONYMOUS | MAP_SHARED | ...
    pub fd: Option<usize>, // file descriptor (Phase 5)
    pub offset: u64,       // file offset   (Phase 5)
}

Store VMAs in a BTreeMap<u64, Vma> keyed by start address, replacing mmap_regions: Vec<(u64, u64)>.

PROT flag translation

Map Linux PROT_* to x86-64 page table flags:

| Linux | x86-64 PTF | Notes |
|---|---|---|
| PROT_READ | PRESENT \| USER_ACCESSIBLE | x86 has no read-only without NX |
| PROT_WRITE | + WRITABLE | |
| PROT_EXEC | clear NO_EXECUTE | |
| PROT_NONE | clear PRESENT | |

Apply these flags in alloc_and_map_user_pages instead of the current hardcoded USER_DATA_FLAGS.
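The translation table can be sketched as code (PTE bit values are from the x86-64 architecture; the exact handling in Vma::page_table_flags() may differ):

```rust
// x86-64 PTE bits (architectural values, not project code).
const PRESENT: u64         = 1 << 0;
const WRITABLE: u64        = 1 << 1;
const USER_ACCESSIBLE: u64 = 1 << 2;
const NO_EXECUTE: u64      = 1 << 63;

const PROT_READ: u32 = 1;
const PROT_WRITE: u32 = 2;
const PROT_EXEC: u32 = 4;

// Sketch of the PROT→PTF mapping: PROT_NONE yields a non-present entry,
// and NX stays set unless PROT_EXEC was requested.
fn page_table_flags(prot: u32) -> u64 {
    if prot == 0 { return 0; } // PROT_NONE: clear PRESENT
    let mut f = PRESENT | USER_ACCESSIBLE | NO_EXECUTE;
    if prot & PROT_WRITE != 0 { f |= WRITABLE; }
    if prot & PROT_EXEC != 0 { f &= !NO_EXECUTE; }
    f
}

fn main() {
    assert_eq!(page_table_flags(0), 0);
    assert_eq!(page_table_flags(PROT_READ),
               PRESENT | USER_ACCESSIBLE | NO_EXECUTE);
    assert_eq!(page_table_flags(PROT_READ | PROT_WRITE),
               PRESENT | WRITABLE | USER_ACCESSIBLE | NO_EXECUTE);
    assert_eq!(page_table_flags(PROT_READ | PROT_EXEC),
               PRESENT | USER_ACCESSIBLE); // NX cleared
}
```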

Changes

| File | Change |
|---|---|
| libkernel/src/process.rs | Add Vma struct, replace mmap_regions with BTreeMap<u64, Vma> |
| osl/src/syscalls/mem.rs (sys_mmap) | Parse prot, compute PTF, store VMA |
| osl/src/clone.rs | Clone the VMA map instead of Vec<(u64, u64)> |
| osl/src/exec.rs | Clear VMA map on execve |

Test

Allocate an mmap region with PROT_READ only, attempt a write from userspace — should page-fault.


Phase 2: Frame Free List + munmap ✓ (implemented)

Goal: Actually free physical frames when munmap is called.

Frame allocator changes

The current frame allocator (BootInfoFrameAllocator wrapping an iterator of usable frames) is allocate-only. Two options:

  1. Bitmap allocator — replace the iterator with a bitmap over all usable RAM. Deallocation sets a bit. Simple, O(1) free, but O(n) alloc in the worst case.
  2. Free-list overlay — keep the bitmap for the initial boot-time pool, but maintain a singly-linked free list of returned frames (write the next pointer into the first 8 bytes of the freed page via the physical memory map). O(1) alloc and free.

Decision: free-list overlay. The bitmap is needed anyway to know which frames are in use, but a free list on top gives O(1) alloc from returned frames.
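The intrusive free list can be simulated with an array standing in for physical memory (one word per fake frame; the real code writes the next pointer through the physical memory map):

```rust
// Simulation of the free-list overlay: the "next" pointer lives in the
// first word of each freed frame itself.
struct FrameAllocator {
    mem: Vec<u64>,       // one u64 per fake one-word "frame"
    head: Option<usize>, // index of first free frame
}

impl FrameAllocator {
    fn free(&mut self, frame: usize) {
        // Write the old head into the freed frame, then point head at it.
        self.mem[frame] = self.head.map_or(u64::MAX, |h| h as u64);
        self.head = Some(frame);
    }

    fn alloc(&mut self) -> Option<usize> {
        let frame = self.head?;
        let next = self.mem[frame];
        self.head = if next == u64::MAX { None } else { Some(next as usize) };
        Some(frame) // O(1): pop from the intrusive list
    }
}

fn main() {
    let mut a = FrameAllocator { mem: vec![0; 8], head: None };
    a.free(3);
    a.free(5);
    assert_eq!(a.alloc(), Some(5)); // LIFO order
    assert_eq!(a.alloc(), Some(3));
    assert_eq!(a.alloc(), None);    // list exhausted
}
```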

Unmap primitive

Add unmap_user_page(pml4_phys, vaddr) -> Option<PhysAddr> to the memory subsystem. This walks the page table, clears the PTE, invokes invlpg, and returns the physical frame address so the caller can free it.

sys_munmap implementation

fn sys_munmap(addr: u64, length: u64) -> i64
  1. Page-align addr and length.
  2. Look up overlapping VMAs.
  3. For each page in the range: call unmap_user_page, push the returned frame onto the free list.
  4. Split/remove VMAs as needed (a munmap in the middle of a VMA creates two smaller VMAs).
  5. TLB flush (per-page invlpg is fine for now; batch flush can come later).
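The split logic of step 4 can be sketched in isolation (a simplified Vma with only start/len; the real struct also carries prot/flags that the surviving pieces inherit):

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug)]
struct Vma { start: u64, len: u64 }

// Sketch of step 4: remove [addr, addr+len) from one overlapping VMA,
// producing zero, one, or two surviving pieces (front/tail/middle split).
fn split_vma(map: &mut BTreeMap<u64, Vma>, vma: Vma, addr: u64, len: u64) {
    map.remove(&vma.start);
    let end = vma.start + vma.len;
    let (ua, ue) = (addr, addr + len);
    if ua > vma.start { // surviving front piece
        map.insert(vma.start, Vma { start: vma.start, len: ua - vma.start });
    }
    if ue < end {       // surviving tail piece
        map.insert(ue, Vma { start: ue, len: end - ue });
    }
}

fn main() {
    // Middle unmap splits one VMA into two.
    let mut m = BTreeMap::new();
    m.insert(0x1000u64, Vma { start: 0x1000, len: 0x4000 });
    let v = m[&0x1000].clone();
    split_vma(&mut m, v, 0x2000, 0x1000);
    assert_eq!(m.len(), 2);
    assert_eq!(m[&0x1000].len, 0x1000); // front: [0x1000, 0x2000)
    assert_eq!(m[&0x3000].len, 0x2000); // tail:  [0x3000, 0x5000)
}
```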

Changes

| File | Change |
|---|---|
| libkernel/src/memory/ | Add unmap_user_page, frame free list |
| osl/src/syscalls/mem.rs | Implement sys_munmap |
| libkernel/src/process.rs | VMA split/remove helpers |

Contiguous DMA allocations

alloc_dma_pages(pages) with pages > 1 bypasses the free list and uses allocate_frame_sequential to guarantee physical contiguity. The sequential allocator walks the boot-time memory map and can be exhausted — once next exceeds the total usable frames, it returns None even if the free list has recycled frames available.

In practice this is fine because multi-page contiguous allocations only happen during early boot (VirtIO descriptor rings). If this becomes a problem in the future, options include:

  • Fall back to the free list for single-frame DMA when sequential is exhausted.
  • Replace the sequential allocator with a buddy allocator that can satisfy contiguous requests from recycled frames.

Test

mmap a region, write a pattern, munmap it, mmap a new region — should get the same (or nearby) frames back, zero-filled.


Phase 3: mprotect + Process Cleanup ✓ (implemented)

Goal: Change page permissions on existing mappings, and free all process memory on exit/execve.

sys_mprotect

fn sys_mprotect(addr: u64, length: u64, prot: u64) -> i64
  1. Validate addr is page-aligned.
  2. Walk VMAs in the range, update vma.prot.
  3. For each page: rewrite the PTE flags to match the new prot (reuse the PROT→PTF translation from Phase 1).
  4. invlpg each modified page.
  5. May need to split VMAs if the prot change covers only part of a VMA.

Process cleanup on exit

When a process exits (sys_exit / sys_exit_group), before marking zombie:

  1. Iterate all VMAs.
  2. For each page in each VMA: unmap and free the frame (reuse Phase 2 primitives).
  3. Free the user page tables themselves (PML4, PDPT, PD, PT pages).
  4. Free the brk region (iterate from brk_base to brk_current).
  5. Free the user stack pages.

Process cleanup on execve

sys_execve already creates a fresh PML4. After the new PML4 is set up, free the old page tables and all frames from the old VMA map (same cleanup logic as exit, but targeting the old PML4).

Changes

| File | Change |
|---|---|
| osl/src/syscalls/mem.rs | Implement sys_mprotect; osl/src/syscalls/process.rs calls cleanup in sys_exit |
| osl/src/exec.rs | Call cleanup for old address space before jump |
| libkernel/src/memory/ | PTE flag update helper, page table walker for cleanup |
| libkernel/src/process.rs | VMA split for partial mprotect |

Test

mmap RW, write data, mprotect to read-only, attempt write — should fault. Run a long-lived process that repeatedly spawns children — memory usage should stay bounded.


Phase 4: MAP_FIXED + Gap Finding ✓ (implemented)

Goal: Support MAP_FIXED placement and smarter allocation that avoids fragmenting the address space.

MAP_FIXED

MAP_FIXED performs implicit munmap of overlapping VMAs before mapping at the requested address (Linux semantics). Addr must be page-aligned and non-zero.

Gap-finding allocator

Replaced the bump-down pointer (mmap_next) with a generic top-down gap finder (libkernel/src/gap.rs). The OccupiedRanges trait abstracts iteration over occupied intervals so the algorithm can be reused.

Search range: [MMAP_FLOOR, MMAP_CEILING) = [0x10_0000_0000, 0x4000_0000_0000). The VMA BTreeMap is the sole source of truth — no bump pointer.
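A minimal sketch of a top-down gap search over sorted occupied intervals (the real find_gap_topdown iterates via the OccupiedRanges trait; this version takes a sorted slice):

```rust
// Top-down gap search: walk occupied [start, end) intervals from high
// addresses down, returning the highest gap that fits `len`.
fn find_gap_topdown(occupied: &[(u64, u64)], floor: u64, ceiling: u64,
                    len: u64) -> Option<u64> {
    let mut high = ceiling;
    // Assumes `occupied` is sorted by start; iterate highest-first.
    for &(start, end) in occupied.iter().rev() {
        if end <= high && high - end >= len {
            return Some(high - len); // gap above this interval fits
        }
        high = high.min(start);
    }
    if high >= floor && high - floor >= len { Some(high - len) } else { None }
}

fn main() {
    const FLOOR: u64 = 0x10_0000_0000;
    const CEIL: u64 = 0x4000_0000_0000;
    // Empty address space: allocate at the very top.
    assert_eq!(find_gap_topdown(&[], FLOOR, CEIL, 0x1000),
               Some(CEIL - 0x1000));
    // Topmost page taken: next allocation slots in just below it.
    let occ = [(CEIL - 0x1000, CEIL)];
    assert_eq!(find_gap_topdown(&occ, FLOOR, CEIL, 0x1000),
               Some(CEIL - 0x2000));
}
```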

Changes

| File | Change |
|---|---|
| libkernel/src/gap.rs | New: OccupiedRanges trait, find_gap_topdown |
| libkernel/src/lib.rs | Add pub mod gap |
| libkernel/src/process.rs | Remove mmap_next, add MMAP_FLOOR/MMAP_CEILING, find_mmap_gap |
| osl/src/syscalls/mem.rs | Rewrite sys_mmap with gap finder + MAP_FIXED |
| osl/src/clone.rs | Remove mmap_next from clone state |
| osl/src/exec.rs | Remove mmap_next reset and local MMAP_BASE constant |

Phase 5a: File-Backed MAP_PRIVATE (eager copy) ✓ (implemented)

Goal: Support mmap(fd, offset, ...) for MAP_PRIVATE file-backed mappings with eager data copy. No sharing, no refcounting, no writeback.

6th syscall argument

The Linux mmap signature is:

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

fd and offset are the 5th and 6th arguments. The assembly stub saves user R9 to per_cpu.user_r9 (offset 32). sys_mmap reads the offset via libkernel::syscall::get_user_r9() — no ABI change needed.

Design: read from the fd’s buffer

Two approaches were considered:

  1. Read from VFS by path — incorrect because a file’s path can change after open (rename, unlink). An open fd refers to an inode, not a path.
  2. Read from the fd’s existing in-memory buffer — VfsHandle holds the full file content in a Vec<u8>. Exposed via FileHandle::content_bytes(). Semantically correct: the fd holds a reference to the file content.

Decision: option 2. When lazy/partial sys_open or inode-based VFS arrives later, content_bytes() can trigger a full load or we switch to an inode-keyed page cache. The mmap code doesn’t need to change.

Implementation

  • FileHandle::content_bytes() — default returns None.
  • VfsHandle::content_bytes() — returns Some(&self.content).
  • sys_mmap file-backed path: extracts fd/offset, calls content_bytes(), allocates per-page (clear + copy file data, clamped to file length — bytes past EOF stay zero, matching Linux), maps with prot flags.
  • Both MAP_FIXED and non-fixed variants work for file-backed — the address selection logic from Phase 4 is reused.
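The per-page clear-then-copy with EOF clamping can be sketched as a pure function (a simplification of the file-backed path; the real code writes into freshly mapped frames):

```rust
const PAGE: usize = 4096;

// Each page is zeroed, then filled with file bytes clamped to the file
// length, so bytes past EOF read back as zero (matching Linux).
fn fill_page(file: &[u8], offset: usize) -> [u8; PAGE] {
    let mut page = [0u8; PAGE]; // freshly allocated frames are zeroed
    if offset < file.len() {
        let n = PAGE.min(file.len() - offset);
        page[..n].copy_from_slice(&file[offset..offset + n]);
    }
    page
}

fn main() {
    let file = vec![0xABu8; PAGE + 100]; // file spans one page + 100 bytes
    let p0 = fill_page(&file, 0);
    let p1 = fill_page(&file, PAGE);
    assert!(p0.iter().all(|&b| b == 0xAB));
    assert!(p1[..100].iter().all(|&b| b == 0xAB));
    assert!(p1[100..].iter().all(|&b| b == 0)); // past EOF stays zero
}
```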

Changes

| File | Change |
|---|---|
| libkernel/src/file.rs | Added content_bytes() default method to FileHandle |
| osl/src/file.rs | Implemented content_bytes() on VfsHandle |
| osl/src/errno.rs | Added ENODEV for non-mmap-able handles |
| osl/src/syscalls/mem.rs | Extended sys_mmap with file-backed MAP_PRIVATE, added mmap_alloc_pages helper |
| user/mmap_file.c | New demo: open file, mmap, compare with read(), munmap |

Test

mmap_file: opens /shell, reads first 64 bytes via read(), mmaps same file with MAP_PRIVATE/PROT_READ, compares mapped bytes with read() output, munmaps, exits cleanly.


Phase 5b: MAP_SHARED + Refcounted Frames ✓ (anonymous shared memory)

Goal: Support shared anonymous mappings with reference-counted frames.

Shared memory objects (shmem_create)

A custom syscall shmem_create(size) (nr 508) creates a shared memory object backed by eagerly-allocated, zeroed physical frames and returns a file descriptor. The fd can be inherited by child processes or passed via IPC. Both sides call mmap(MAP_SHARED, fd) to map the same physical frames into their address spaces.

Frame refcount table

A BTreeMap<u64, u16> in MemoryServices tracks frames with refcount ≥ 2. Frames not in the table have an implicit refcount of 1 (single owner).

Each shared frame has owners:

  1. The SharedMemInner object itself (1 ref, released on Arc drop)
  2. Each process mapping (1 ref per mmap, released on munmap or exit)

Methods:

  • ref_share(phys) — increment (insert with 2 if new to table)
  • ref_release(phys) -> bool — decrement, return true if frame should be freed
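The implicit-refcount-of-1 scheme can be sketched over a plain BTreeMap (a simplification of the MemoryServices table; locking is omitted):

```rust
use std::collections::BTreeMap;

// Frames absent from the map have an implicit refcount of 1, so only
// shared frames consume an entry.
struct Refcounts(BTreeMap<u64, u16>);

impl Refcounts {
    fn ref_share(&mut self, phys: u64) {
        *self.0.entry(phys).or_insert(1) += 1; // implicit 1 → 2 on insert
    }

    // Returns true when the last reference is gone and the frame may be freed.
    fn ref_release(&mut self, phys: u64) -> bool {
        match self.0.get_mut(&phys) {
            None => true, // implicit count 1 → 0
            Some(c) => {
                *c -= 1;
                if *c == 1 { self.0.remove(&phys); } // back to implicit 1
                false
            }
        }
    }
}

fn main() {
    let mut rc = Refcounts(BTreeMap::new());
    rc.ref_share(0x5000);             // count: 2
    assert!(!rc.ref_release(0x5000)); // 2 → 1, not freeable yet
    assert!(rc.ref_release(0x5000));  // 1 → 0, free it
    assert!(rc.ref_release(0x6000));  // never shared: free immediately
}
```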

Refcount-aware cleanup

  • unmap_and_release_user_page() — unmaps PTE, calls ref_release, only frees when refcount reaches 0.
  • cleanup_user_address_space() — uses ref_release for all leaf frames. Backwards-compatible: non-shared frames return true immediately.
  • SharedMemInner::drop() — calls release_shared_frame() for each backing frame. Safe because Drop only fires from fd close (outside with_memory).

MAP_SHARED in sys_mmap

  • Validates MAP_SHARED and MAP_PRIVATE are mutually exclusive
  • MAP_SHARED | MAP_ANONYMOUS returns -EINVAL (no fork)
  • MAP_SHARED with fd: extracts SharedMemInner from FdObject::SharedMem, maps its physical frames, increments refcounts via ref_share

Changes

| File | Change |
|---|---|
| libkernel/src/memory/mod.rs | refcounts: BTreeMap, ref_share, ref_release, unmap_and_release_user_page, refcount-aware cleanup_user_address_space |
| libkernel/src/shmem.rs | New: SharedMemInner struct with Drop |
| libkernel/src/file.rs | FdObject::SharedMem variant, as_shmem() |
| libkernel/src/process.rs | MAP_SHARED constant |
| osl/src/syscalls/shmem.rs | New: sys_shmem_create |
| osl/src/syscalls/mod.rs | Wire syscall 508 |
| osl/src/syscalls/mem.rs | MAP_SHARED path in sys_mmap, refcount-aware sys_munmap |
| osl/src/fd_helpers.rs | get_fd_shmem helper |

Test

user/src/shmem_test.c: Parent creates shmem, writes magic pattern, spawns child. Child inherits fd, mmaps it, verifies pattern, writes response. Parent waits, verifies response.


Phase 5c: File-Backed MAP_SHARED (future)

Goal: Multiple processes mapping the same file share physical frames via an inode-keyed page cache.

This requires:

  1. VFS inode identifiers — unique per file across mounts. The 9P protocol carries qid.path which serves as an inode, but it is currently discarded when converting to VfsDirEntry.
  2. Shared page cache — a global BTreeMap<(InodeId, page_offset) → PhysAddr> so multiple processes mapping the same file page get the same frame.
  3. Dirty tracking — msync or process exit writes dirty shared pages back to the file.

The frame refcount table from Phase 5b provides the foundation.


Dependency Graph

Phase 1 ─── VMA tracking + PROT flags
   │
   ├──▶ Phase 2 ─── Frame free list + munmap
   │       │
   │       └──▶ Phase 3 ─── mprotect + process cleanup
   │               │
   │               └──▶ Phase 4 ─── MAP_FIXED + gap finding
   │                       │
   │                       ├──▶ Phase 5a ─── File-backed MAP_PRIVATE (eager copy)
   │                       │
   │                       └──▶ Phase 5b ─── MAP_SHARED (anonymous, shmem_create)
   │                               │
   │                               └──▶ Phase 5c ─── File-backed MAP_SHARED
   │                                       (requires inode-based VFS + page cache)

Phase 5b (MAP_SHARED anonymous) uses frame refcounting and shmem_create to share physical frames between processes. No VFS changes needed.

Phase 5c (file-backed MAP_SHARED) requires inode identifiers from the VFS and a global page cache, building on Phase 5b’s refcount infrastructure.


Key Decisions

Eager vs demand paging

All phases use eager paging — frames are allocated and mapped immediately in sys_mmap. Demand paging (lazy fault-in) is a future optimisation that does not affect the syscall interface.

6th syscall argument for mmap

The offset parameter (6th arg, user r9) will be read from PerCpuData rather than changing the dispatch function signature. This avoids adding overhead to every syscall for a parameter only mmap uses.

Frame allocator: free-list overlay

Freed frames go onto a singly-linked free list stored in the pages themselves (using the physical memory map for access). The existing boot-time allocator remains for initial allocation; the free list is consulted first.

VMA storage: BTreeMap

A BTreeMap<u64, Vma> keyed by start address provides O(log n) lookup, ordered iteration for gap-finding, and natural support for range queries. Adequate for the expected number of VMAs per process (tens to low hundreds).

Graphics Subsystem Design

Overview

The kernel migrates from VGA text mode (80x25) to a pixel framebuffer during boot. Two hardware paths are covered: the Bochs Graphics Adapter (BGA) for the initial implementation, and virtio-gpu as future work.

After the switch, all existing output (println!, status_bar!, timeline strip, boot progress bar) renders via an 8x16 bitmap font onto the framebuffer. The text grid expands from 80x25 to 128x48 characters at 1024x768 resolution.

Architecture

  Early boot (text mode)          After PCI scan (graphical mode)
  ========================        ================================

  print!/status_bar!              print!/status_bar!
       |                               |
    Writer                          Writer
       |                               |
  DisplayBackend::TextMode        DisplayBackend::Graphical
       |                               |
  VgaBuffer (0xB8000 MMIO)       Framebuffer (BGA LFB MMIO)
                                       |
                                  font::draw_char() -> pixels

The Writer struct contains a DisplayBackend enum that dispatches all cell reads/writes to either the legacy VGA text buffer or the pixel framebuffer. The switch happens once during boot after PCI enumeration detects the BGA device.

BGA (Bochs Graphics Adapter) – Implemented

Hardware Interface

The BGA device is QEMU’s default VGA adapter on Q35 machines (-vga std). It is controlled via two I/O ports:

| Port | Direction | Description |
|---|---|---|
| 0x01CE | Write | Register index |
| 0x01CF | R/W | Register data |

Register Map

| Index | Name | Description |
|---|---|---|
| 0 | ID | Version ID (0xB0C0..0xB0C5) |
| 1 | XRES | Horizontal resolution |
| 2 | YRES | Vertical resolution |
| 3 | BPP | Bits per pixel (8/15/16/24/32) |
| 4 | ENABLE | Display enable + LFB enable |
| 5 | BANK | VGA bank (legacy, not used) |
| 6 | VIRT_WIDTH | Virtual width (scrolling) |
| 7 | VIRT_HEIGHT | Virtual height (scrolling) |
| 8 | X_OFFSET | Display X offset |
| 9 | Y_OFFSET | Display Y offset |

Mode Switch Sequence

  1. Write ENABLE = 0 (disable display)
  2. Write XRES = 1024, YRES = 768, BPP = 32
  3. Write ENABLE = 0x01 | 0x20 (enabled + LFB enabled)
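The sequence can be sketched over an abstract port interface so it runs without hardware (the trait and Recorder are illustrative stand-ins; a real implementation issues out instructions to ports 0x01CE/0x01CF):

```rust
// Register indices from the table above; enable bits per the sequence.
const XRES: u16 = 1;
const YRES: u16 = 2;
const BPP: u16 = 3;
const ENABLE: u16 = 4;
const ENABLED: u16 = 0x01;
const LFB_ENABLED: u16 = 0x20;

trait BgaPorts {
    fn write_reg(&mut self, index: u16, value: u16);
}

fn bga_set_mode(ports: &mut impl BgaPorts, w: u16, h: u16, bpp: u16) {
    ports.write_reg(ENABLE, 0);                     // 1. disable display
    ports.write_reg(XRES, w);                       // 2. program the mode
    ports.write_reg(YRES, h);
    ports.write_reg(BPP, bpp);
    ports.write_reg(ENABLE, ENABLED | LFB_ENABLED); // 3. enable + LFB
}

// Test double that records register writes instead of touching I/O ports.
struct Recorder(Vec<(u16, u16)>);
impl BgaPorts for Recorder {
    fn write_reg(&mut self, i: u16, v: u16) { self.0.push((i, v)); }
}

fn main() {
    let mut r = Recorder(Vec::new());
    bga_set_mode(&mut r, 1024, 768, 32);
    assert_eq!(r.0.first(), Some(&(ENABLE, 0)));    // disabled first
    assert_eq!(r.0.last(), Some(&(ENABLE, 0x21)));  // enabled + LFB last
}
```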

Linear Framebuffer (LFB)

The LFB is located at PCI BAR0 of the BGA device:

  • PCI Vendor: 0x1234
  • PCI Device: 0x1111
  • BAR0: Physical base address of the LFB (typically 0xFD000000 on Q35)
  • Size: width * height * (bpp/8) = 1024 * 768 * 4 = 3,145,728 bytes
  • Pixel format: BGRX (blue in byte 0, green in byte 1, red in byte 2, byte 3 unused)

The kernel maps the LFB into the kernel MMIO virtual window (0xFFFF_8002_…) using map_mmio_region(). This region is present in all user page tables (via shared PML4 entries 256-510).

Software Text Rendering

Characters are rendered using an embedded 8x16 bitmap font (standard IBM VGA ROM font, CP437 character set, 256 glyphs, 4096 bytes).

  • Text grid: 128 columns x 48 rows (1024/8 x 768/16)
  • Font: libkernel/src/font.rs — FONT_8X16 static array + draw_char()
  • Shadow buffer: [[ScreenChar; 128]; 48] inside DisplayBackend::Graphical enables scrolling without reading back from MMIO
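Glyph rasterization for an 8x16 font boils down to testing one bit per pixel: each glyph row is one byte, MSB = leftmost pixel (the classic VGA ROM font layout). A sketch with a hypothetical glyph, not the project's font data:

```rust
// Render one glyph cell: each of the 16 rows is one byte, MSB = leftmost.
fn draw_char(glyph: &[u8; 16], fg: u32, bg: u32,
             put: &mut impl FnMut(usize, usize, u32)) {
    for (y, &row) in glyph.iter().enumerate() {
        for x in 0..8 {
            let on = row & (0x80 >> x) != 0;
            put(x, y, if on { fg } else { bg });
        }
    }
}

fn main() {
    let mut glyph = [0u8; 16];
    glyph[0] = 0b1000_0001; // two corner pixels in the top row
    let mut fb = [[0u32; 8]; 16]; // tiny stand-in for one framebuffer cell
    draw_char(&glyph, 0x00FF_FFFF, 0, &mut |x, y, px| fb[y][x] = px);
    assert_eq!(fb[0][0], 0x00FF_FFFF); // MSB → leftmost pixel
    assert_eq!(fb[0][7], 0x00FF_FFFF); // LSB → rightmost pixel
    assert_eq!(fb[0][1], 0);           // background in between
}
```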

Color Mapping

The VGA 16-color palette is mapped to 32-bit BGRA values:

| VGA Color | BGRA Value |
|---|---|
| Black | 0x00000000 |
| Blue | 0x00AA0000 |
| Green | 0x0000AA00 |
| White | 0x00FFFFFF |

Row Layout (preserved from text mode)

| Row(s) | Purpose |
|---|---|
| 0 | Status bar (white on blue) |
| 1 | Timeline strip (colored blocks) |
| 2 | Boot progress bar (during init) |
| 3-47 | Scrolling text output |

Scrolling

The graphical backend uses a fast scroll path:

  1. Framebuffer::scroll_up() uses core::ptr::copy() to shift pixel data up by one character row (16 scanlines) in a single memcpy operation
  2. The shadow cells array is shifted correspondingly
  3. Only the new blank bottom row is cleared with fill_rect()

This avoids the naive approach of redrawing every character cell on scroll.
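The fast path amounts to one overlapping copy plus one fill, sketched here over a Vec standing in for the mapped LFB:

```rust
const WIDTH: usize = 1024;
const HEIGHT: usize = 768;
const ROW_PX: usize = WIDTH * 16; // pixels in one character row (16 scanlines)

// Shift the whole pixel buffer up by one character row with a single
// overlapping copy, then clear only the newly exposed bottom row.
fn scroll_up(fb: &mut [u32]) {
    fb.copy_within(ROW_PX.., 0); // one big memmove
    let tail = fb.len() - ROW_PX;
    fb[tail..].fill(0);          // blank the bottom row
}

fn main() {
    let mut fb = vec![0u32; WIDTH * HEIGHT];
    fb[ROW_PX] = 0xDEAD;           // first pixel of character row 1
    scroll_up(&mut fb);
    assert_eq!(fb[0], 0xDEAD);     // moved up into row 0
    assert!(fb[fb.len() - ROW_PX..].iter().all(|&p| p == 0));
}
```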

Boot Sequence

  1. Kernel boots in VGA text mode – early println! works before PCI scan
  2. After PCI scan: detect BGA via I/O port ID register
  3. Find BGA PCI device, read BAR0 for LFB physical address
  4. Map LFB into kernel virtual space
  5. Call bga_set_mode(1024, 768, 32) to switch hardware
  6. Call switch_to_framebuffer() – copies current text content into shadow buffer, repaints entire screen
  7. All subsequent output renders as pixels

If BGA is not detected (e.g. -vga none), the kernel stays in text mode.

QEMU Configuration

Q35 machine includes stdvga (BGA-compatible) by default. No changes to run.sh required. To be explicit: -vga std.

Limitations

  • No hardware cursor in graphical mode (software cursor is a future enhancement)
  • LFB mapped with NO_CACHE (not write-combining); acceptable for text console but suboptimal for heavy graphics. A future optimization would configure PAT for write-combining on the LFB region.
  • Only works in QEMU/Bochs (BGA is not present on real hardware)

VirtIO-GPU – Future Work

Motivation

  • Standard virtio device, works with the virtio-drivers crate already in use
  • Supports hardware-accelerated 2D operations (TRANSFER_TO_HOST_2D)
  • Better fit for the existing virtio infrastructure (virtio-blk, virtio-9p)
  • Portable across any hypervisor supporting virtio-gpu (not just QEMU)

Hardware Interface

| Field | Value |
|-------|-------|
| PCI Vendor | 0x1AF4 |
| PCI Device | 0x1050 (modern) / 0x1010 (legacy) |
| Device class | Display controller |
| Virtqueues | controlq (commands), cursorq |

Command Protocol

Unlike BGA’s simple I/O port registers, virtio-gpu uses a request/response protocol over virtqueues:

  1. RESOURCE_CREATE_2D – allocate a 2D resource (the framebuffer)
  2. RESOURCE_ATTACH_BACKING – attach DMA pages as backing store
  3. SET_SCANOUT – assign the resource to a display output
  4. TRANSFER_TO_HOST_2D – copy dirty rectangles from guest to host
  5. RESOURCE_FLUSH – tell the host to display the updated region

Design Sketch

  • VirtioGpuActor following the existing VirtioBlkActor pattern
  • Scanout = framebuffer resource backed by DMA pages from alloc_dma_pages()
  • Periodic TRANSFER_TO_HOST_2D + RESOURCE_FLUSH to update display
  • Dirty-rect tracking to minimize transfer size
  • Could share the same Framebuffer abstraction used by BGA

Why BGA First

  • Simpler: I/O port registers + direct MMIO framebuffer writes
  • No virtqueue setup, no command protocol
  • QEMU Q35 has it by default (stdvga)
  • Sufficient for a text console

Key Files

| File | Description |
|------|-------------|
| libkernel/src/framebuffer.rs | BGA register access, Framebuffer struct |
| libkernel/src/font.rs | Embedded 8x16 bitmap font + draw_char() |
| libkernel/src/vga_buffer/ | DisplayBackend abstraction, Writer refactoring (mod.rs, capture.rs, timeline.rs) |
| kernel/src/main.rs | init_bga_framebuffer() boot integration |

Status

Implemented:

  • BGA detection and mode switching
  • Linear framebuffer mapping and pixel rendering
  • 8x16 bitmap font (full CP437 character set)
  • DisplayBackend abstraction with text-mode fallback
  • Fast pixel scrolling
  • Status bar, timeline, progress bar all work in graphical mode

Remaining (see Limitations above):

  • Software cursor (underline/block at cursor position)
  • Write-combining for LFB pages (PAT configuration)
  • Virtio-GPU backend

FPU / SSE State Management

x86-64 Floating-Point & SIMD Instruction Sets

| Family | Registers | Width | Notes |
|--------|-----------|-------|-------|
| x87 FPU | ST(0)–ST(7) | 80-bit | Legacy; used by some libm implementations |
| MMX | MM0–MM7 | 64-bit | Aliases x87 registers |
| SSE/SSE2 | XMM0–XMM15 | 128-bit | Baseline for x86-64; musl uses SSE2 |
| AVX/AVX2 | YMM0–YMM15 | 256-bit | Extends XMM to 256-bit upper halves |
| AVX-512 | ZMM0–ZMM31 | 512-bit | Not relevant for this kernel |

SSE2 is part of the x86-64 baseline — every long-mode CPU supports it, and the System V AMD64 ABI uses XMM0–XMM7 for floating-point arguments/returns. musl libc is compiled with SSE2 and will use XMM registers in user-space code.


Kernel Target Configuration

The kernel’s custom target (x86_64-os.json) specifies:

"features": "-mmx,-sse,+soft-float"

This tells LLVM to never emit SSE/MMX instructions in kernel Rust code. All floating-point operations (if any) use soft-float emulation. This means the kernel never touches XMM registers, so:

  • Syscall path: No SSE save/restore needed — the kernel executes entirely with GPRs, and syscall/sysret returns to the same process.
  • Interrupt handlers: Safe as long as they don’t use SSE (guaranteed by the target config).
  • Timer preemption: The only path that switches between different user processes’ register contexts — requires SSE save/restore.

CR0/CR4 Setup (enable_sse)

SSE instructions will fault unless the CPU’s control registers are configured:

pub fn enable_sse() {
    unsafe {
        // CR0: clear EM (bit 2, x87 emulation), set MP (bit 1, monitor coprocessor)
        let mut cr0 = Cr0::read_raw();
        cr0 &= !(1 << 2); // clear CR0.EM
        cr0 |= 1 << 1;    // set CR0.MP
        Cr0::write_raw(cr0);

        // CR4: set OSFXSR (bit 9) and OSXMMEXCPT (bit 10)
        let mut cr4 = Cr4::read_raw();
        cr4 |= (1 << 9) | (1 << 10);
        Cr4::write_raw(cr4);
    }
}
  • CR0.EM = 0: Do not trap x87/SSE instructions.
  • CR0.MP = 1: Enable WAIT/FWAIT monitoring.
  • CR4.OSFXSR = 1: Enable FXSAVE/FXRSTOR and SSE instructions.
  • CR4.OSXMMEXCPT = 1: Enable unmasked SSE exception handling via #XM.

Called once during boot, before any user processes are spawned.


Eager FXSAVE/FXRSTOR Context Switch

We use the eager strategy: save and restore FPU/SSE state on every timer-driven context switch, unconditionally.

Timer stub flow

interrupt fires → CPU pushes iretq frame (40 bytes)
               → stub pushes 15 GPRs (120 bytes)
               → sub rsp, 512; fxsave [rsp]   ← save SSE state
               → call preempt_tick             ← may switch RSP
               → fxrstor [rsp]                 ← restore SSE state
               → add rsp, 512
               → pop GPRs; iretq

Stack layout during preemption

high address  ┌──────────────────────────┐
              │  SS / RSP / RFLAGS       │  iretq frame (40 bytes)
              │  CS / RIP                │
              ├──────────────────────────┤
              │  rax, rbx, ... r15       │  15 GPRs (120 bytes)
              ├──────────────────────────┤
              │  FXSAVE area             │  512 bytes (16-byte aligned)
              │  (x87/MMX/SSE state)     │  MXCSR at offset +24
              │                          │  XMM0-15 at offset +160
low address   └──────────────────────────┘  ← saved_rsp points here

Total: 672 bytes = 42 x 16, preserving 16-byte alignment for both fxsave (requires 16-byte aligned operand) and the SysV ABI call convention.
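A quick check of the frame arithmetic:

```rust
fn main() {
    let iretq_frame = 40;  // SS, RSP, RFLAGS, CS, RIP — 5 × 8 bytes
    let gprs = 15 * 8;     // rax..r15 pushed by the stub — 120 bytes
    let fxsave_area = 512; // fixed-size FXSAVE region
    let total = iretq_frame + gprs + fxsave_area;
    assert_eq!(total, 672);
    assert_eq!(total / 16, 42);
    assert_eq!(total % 16, 0); // fxsave operand stays 16-byte aligned
}
```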

New thread initialization

spawn_thread and spawn_user_thread allocate the FXSAVE area below the SwitchFrame and initialize MXCSR at offset +24 to 0x1F80 (the Intel default: all SSE exceptions masked, round-to-nearest mode). XMM registers start zeroed.


FXSAVE Memory Layout (512 bytes)

| Offset | Size | Field |
|--------|------|-------|
| 0 | 2 | FCW (x87 control word) |
| 2 | 2 | FSW (x87 status word) |
| 4 | 1 | FTW (abridged x87 tag word) |
| 5 | 1 | Reserved |
| 6 | 2 | FOP (last x87 opcode) |
| 8 | 8 | FIP (x87 instruction pointer) |
| 16 | 8 | FDP (x87 data pointer) |
| 24 | 4 | MXCSR (SSE control/status) |
| 28 | 4 | MXCSR_MASK |
| 32 | 128 | ST(0)–ST(7) / MM0–MM7 (8 × 16 bytes) |
| 160 | 256 | XMM0–XMM15 (16 × 16 bytes) |
| 416 | 96 | Reserved |

The MXCSR default value 0x1F80 means:

  • Bits 12:7 = 0b111111 — all six SSE exception masks set (no traps)
  • Bits 14:13 = 0b00 — round-to-nearest-even
  • All exception flags (bits 5:0) cleared
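Decoding 0x1F80 field by field confirms those claims:

```rust
fn main() {
    let mxcsr: u32 = 0x1F80;
    let exception_flags = mxcsr & 0x3F;        // bits 5:0
    let exception_masks = (mxcsr >> 7) & 0x3F; // bits 12:7
    let rounding = (mxcsr >> 13) & 0b11;       // bits 14:13
    assert_eq!(exception_flags, 0);            // no pending exceptions
    assert_eq!(exception_masks, 0b11_1111);    // all six exceptions masked
    assert_eq!(rounding, 0b00);                // round-to-nearest-even
}
```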

Why Syscalls Don’t Need SSE Saves

The SYSCALL instruction does not change the process — it transitions from ring 3 to ring 0 within the same thread. Since the kernel target has -sse,+soft-float, no kernel code will modify XMM registers. When the syscall handler returns via SYSRETQ, XMM registers still hold the user process’s values.

The timer preemption path is different: it can switch from process A’s context to process B’s context, so process A’s XMM state would be overwritten by process B if not saved.


Future Considerations

Lazy FPU switching (CR0.TS)

Instead of saving/restoring on every context switch, set CR0.TS = 1 after switching away from a thread. The next SSE instruction triggers a #NM (Device Not Available) fault, at which point the handler saves the old thread’s state and loads the new thread’s state, then clears CR0.TS.

Pros: Avoids the 512-byte save/restore overhead when threads don’t use SSE (e.g., kernel threads). Cons: More complex, #NM handler latency, modern CPUs make FXSAVE fast enough that eager switching is preferred (Linux switched to eager in 3.15).

XSAVE for AVX

If AVX support is needed in the future, FXSAVE/FXRSTOR only covers XMM0–XMM15. XSAVE/XRSTOR can save the full YMM/ZMM state, but the save area size varies by CPU (queried via CPUID leaf 0xD). This would require:

  1. CPUID.0xD.0:EBX to determine XSAVE area size
  2. CR4.OSXSAVE = 1 and XCR0 configuration
  3. Dynamic allocation of per-thread XSAVE areas
  4. Replace FXSAVE/FXRSTOR with XSAVE/XRSTOR in the timer stub

File Descriptors & Pipes

Design for per-process file descriptor tables, the FileHandle trait, blocking syscalls, and the pipe implementation.


Motivation

The kernel currently has three syscalls: write (hardcoded to stdout/stderr via crate::print!()), exit, and arch_prctl. There is no concept of a file descriptor, no read/close, and no IPC mechanism between user processes.

Adding a proper file descriptor layer enables:

  • pipe for parent→child / sibling IPC
  • Redirecting stdout/stderr to pipes (shell pipelines)
  • Future open/read/write/close for VFS-backed files
  • dup2 for fd redirection

Overview

  User process                      Kernel
  ─────────────                     ──────
  write(fd, buf, n)  ──syscall──►  fd_table[fd].write(buf)
                                       │
                         ┌─────────────┼──────────────┐
                         ▼             ▼              ▼
                    ConsoleHandle  PipeWriter     VfsHandle  SharedMem
                    → print!()     → PipeInner    → VFS     → shmem frames
                                       ▲
                         ┌─────────────┘
                         │
  read(fd, buf, n)  ──►  fd_table[fd].read(buf)
                              │
                         PipeReader
                         → PipeInner

Layer 1: FileHandle trait

/// A kernel object backing an open file descriptor.
///
/// Implementations must be safe to share across threads (the fd table
/// holds `Arc<dyn FileHandle>`).
pub trait FileHandle: Send + Sync {
    /// Read up to `buf.len()` bytes.  Returns the number of bytes read,
    /// or 0 for EOF.  May block the calling thread (see "Blocking" below).
    fn read(&self, buf: &mut [u8]) -> Result<usize, FileError>;

    /// Write up to `buf.len()` bytes.  Returns the number of bytes written.
    /// May block the calling thread.
    fn write(&self, buf: &[u8]) -> Result<usize, FileError>;

    /// Release resources associated with this handle.
    /// Called when the last `Arc` is dropped (i.e. last fd closed).
    fn close(&self) {}

    /// Return a name for downcasting purposes.
    fn kind(&self) -> &'static str;

    /// For directory handles: serialize entries as linux_dirent64 into buf.
    fn getdents64(&self, _buf: &mut [u8]) -> Result<usize, FileError> {
        Err(FileError::NotATty)
    }
}

FileError is a structured enum in libkernel::file (using snafu for Display):

#[derive(Debug, Clone, Copy, Snafu)]
pub enum FileError {
    BadFd,            // bad file descriptor (EBADF)
    IsDirectory,      // is a directory (EISDIR)
    NotATty,          // inappropriate ioctl for device (ENOTTY)
    TooManyOpenFiles, // too many open files (EMFILE)
    BrokenPipe,       // write to a pipe with no reader (EPIPE)
    BadAddress,       // invalid user pointer (EFAULT)
}

Linux errno numeric codes are defined separately in osl::errno and converted from FileError via errno::file_errno(). This keeps libkernel free of Linux-specific numeric constants.
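A host-side sketch of that split; the errno numbers are the standard Linux values, the snafu derive is omitted so the snippet is self-contained, and the exact shape of `file_errno` is an assumption based on this design:

```rust
#[derive(Debug, Clone, Copy)]
pub enum FileError {
    BadFd,
    IsDirectory,
    NotATty,
    TooManyOpenFiles,
}

/// Stand-in for osl::errno — the Linux numeric codes live here,
/// keeping libkernel free of them.
pub mod errno {
    use super::FileError;

    pub const EBADF: i64 = 9;
    pub const EISDIR: i64 = 21;
    pub const EMFILE: i64 = 24;
    pub const ENOTTY: i64 = 25;

    /// Map a structured FileError to its Linux errno number.
    pub fn file_errno(e: FileError) -> i64 {
        match e {
            FileError::BadFd => EBADF,
            FileError::IsDirectory => EISDIR,
            FileError::NotATty => ENOTTY,
            FileError::TooManyOpenFiles => EMFILE,
        }
    }
}

fn main() {
    // Syscall handlers return the negated errno to user space.
    assert_eq!(-errno::file_errno(FileError::BadFd), -9);
    assert_eq!(-errno::file_errno(FileError::TooManyOpenFiles), -24);
}
```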

FileHandle::read/write are synchronous — they return when the operation completes or an error occurs. Blocking is handled at the scheduler level (see below), not via async/await.


Layer 2: Per-process fd table

Add to Process:

pub struct Process {
    // ... existing fields ...
    pub fd_table: Vec<Option<Arc<dyn FileHandle>>>,
}

On process creation, pre-populate fds 0–2:

fd_table: vec![
    Some(Arc::new(ConsoleHandle { readable: true })),   // 0: stdin (keyboard input)
    Some(Arc::new(ConsoleHandle { readable: false })),  // 1: stdout
    Some(Arc::new(ConsoleHandle { readable: false })),  // 2: stderr
],

Fd allocation: scan for the first None slot; if none, push a new entry. This matches the POSIX “lowest available fd” rule.

impl Process {
    pub fn alloc_fd(&mut self, handle: Arc<dyn FileHandle>) -> Result<usize, FileError> {
        // POSIX "lowest available fd": the first free slot wins.
        if let Some(i) = self.fd_table.iter().position(|slot| slot.is_none()) {
            self.fd_table[i] = Some(handle);
            return Ok(i);
        }
        if self.fd_table.len() < MAX_FDS {
            let fd = self.fd_table.len();
            self.fd_table.push(Some(handle));
            Ok(fd)
        } else {
            Err(FileError::TooManyOpenFiles)
        }
    }

    pub fn close_fd(&mut self, fd: usize) -> Result<(), FileError> {
        if fd >= self.fd_table.len() {
            return Err(FileError::BadFd);
        }
        match self.fd_table[fd].take() {
            Some(handle) => { handle.close(); Ok(()) }
            None => Err(FileError::BadFd),
        }
    }

    pub fn get_fd(&self, fd: usize) -> Result<Arc<dyn FileHandle>, FileError> {
        self.fd_table.get(fd)
            .and_then(|slot| slot.clone())
            .ok_or(FileError::BadFd)
    }
}

Layer 3: ConsoleHandle

The simplest FileHandle — wraps the existing crate::print!() behaviour:

pub struct ConsoleHandle {
    pub readable: bool,
}

impl FileHandle for ConsoleHandle {
    fn read(&self, buf: &mut [u8]) -> Result<usize, FileError> {
        if !self.readable {
            return Err(FileError::BadFd);
        }
        Ok(crate::console::read_input(buf))
    }

    fn write(&self, buf: &[u8]) -> Result<usize, FileError> {
        if let Ok(s) = core::str::from_utf8(buf) {
            crate::print!("{}", s);
        }
        Ok(buf.len())
    }

    fn kind(&self) -> &'static str { "console" }
}

Layer 4: Blocking syscalls (Option C)

Pipe read and write must block when the buffer is empty or full. Rather than adding async/await to the syscall path, we add a Blocked state to the scheduler.

New thread state

enum ThreadState {
    Ready,
    Running,
    Blocked,   // ← new
    Dead,
}

Blocking API

/// Block the current thread until `unblock(thread_idx)` is called.
///
/// Saves the current thread's state as `Blocked` and yields to the
/// scheduler.  Returns when another thread (or ISR) calls
/// `unblock(thread_idx)`.
///
/// Must be called with interrupts disabled.
pub fn block_current_thread() { ... }

/// Move a blocked thread back onto the ready queue.
///
/// Safe to call from ISR context (e.g. a pipe write that wakes a reader).
pub fn unblock(thread_idx: usize) { ... }

How blocking works

  1. Syscall handler (e.g. sys_read on an empty pipe) calls block_current_thread().
  2. The scheduler marks the thread Blocked and context-switches away.
  3. preempt_tick never re-queues Blocked threads.
  4. When the condition is met (e.g. a writer pushes data into the pipe), the pipe calls unblock(thread_idx).
  5. unblock sets the thread to Ready and pushes it onto the ready queue.
  6. On the next preemption the thread is scheduled, returns from block_current_thread, and the syscall retries the operation.
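The state transitions above can be modelled on the host. This toy scheduler (a sketch, not the kernel's actual API) shows the key invariant: a Blocked thread never reaches the ready queue until unblock runs:

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, PartialEq, Debug)]
enum ThreadState { Ready, Blocked }

/// Toy model of the Blocked state and the ready queue.
struct Scheduler {
    states: Vec<ThreadState>,
    ready: VecDeque<usize>,
}

impl Scheduler {
    /// Step 2: mark Blocked and drop the thread from the ready queue.
    fn block(&mut self, idx: usize) {
        self.states[idx] = ThreadState::Blocked;
        self.ready.retain(|&t| t != idx);
    }
    /// Step 5: back to Ready, pushed onto the ready queue.
    fn unblock(&mut self, idx: usize) {
        if self.states[idx] == ThreadState::Blocked {
            self.states[idx] = ThreadState::Ready;
            self.ready.push_back(idx);
        }
    }
    /// Step 3: preempt_tick only ever sees Ready threads here.
    fn pick_next(&mut self) -> Option<usize> {
        self.ready.pop_front()
    }
}

fn main() {
    let mut s = Scheduler { states: vec![ThreadState::Ready; 2], ready: VecDeque::from([0, 1]) };
    s.block(0);                       // thread 0 reads an empty pipe
    assert_eq!(s.pick_next(), Some(1));
    assert_eq!(s.pick_next(), None);  // 0 is Blocked, never re-queued
    s.unblock(0);                     // a writer pushed data
    assert_eq!(s.pick_next(), Some(0));
}
```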

Avoiding lost wakeups

The pipe must check the condition and call block_current_thread while holding the pipe’s internal lock. The sequence is:

lock pipe
if buffer_empty:
    register self as waiter (store thread_idx)
    unlock pipe
    block_current_thread()       ← yields here
    goto top                     ← retry after wakeup
else:
    copy data
    wake writer if blocked
    unlock pipe
    return count

The critical property: between checking the condition and blocking, no writer can sneak in — the pipe lock is held. The writer will see the registered waiter and call unblock after releasing the lock.
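On a hosted target, the same discipline is exactly what `std::sync::Condvar` encodes. This bounded-buffer analogue (illustrative host code, not the kernel's pipe) re-checks the condition under the lock after every wakeup, just like the pseudocode's retry loop:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    let pipe = Arc::new((Mutex::new(VecDeque::<u8>::new()), Condvar::new()));
    let writer = Arc::clone(&pipe);

    let t = thread::spawn(move || {
        let (lock, cvar) = &*writer;
        let mut buf = lock.lock().unwrap();
        buf.extend([1, 2, 3]);
        cvar.notify_one(); // wake a blocked reader, like unblock()
    });

    let (lock, cvar) = &*pipe;
    let mut buf = lock.lock().unwrap();
    // Check and block atomically with respect to the lock: a writer
    // cannot sneak in between "buffer empty?" and "wait".
    while buf.is_empty() {
        buf = cvar.wait(buf).unwrap();
    }
    assert_eq!(buf.pop_front(), Some(1));
    drop(buf);
    t.join().unwrap();
}
```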


Layer 5: Pipe

Shared state

struct PipeInner {
    buf: VecDeque<u8>,
    capacity: usize,               // default 4096
    reader_closed: bool,
    writer_closed: bool,
    blocked_reader: Option<usize>,  // thread_idx waiting for data
    blocked_writer: Option<usize>,  // thread_idx waiting for space
}

pub struct Pipe {
    inner: Mutex<PipeInner>,
}

Read end

pub struct PipeReader(Arc<Pipe>);

impl FileHandle for PipeReader {
    fn read(&self, buf: &mut [u8]) -> Result<usize, FileError> {
        loop {
            let mut inner = self.0.inner.lock();
            if !inner.buf.is_empty() {
                let n = inner.drain_to(buf);
                // Wake blocked writer if there's now space.
                if let Some(writer) = inner.blocked_writer.take() {
                    scheduler::unblock(writer);
                }
                return Ok(n);
            }
            if inner.writer_closed {
                return Ok(0); // EOF
            }
            // Buffer empty, writer alive — block.
            inner.blocked_reader = Some(scheduler::current_thread_idx());
            drop(inner);
            scheduler::block_current_thread();
            // Woken up — retry.
        }
    }

    fn write(&self, _buf: &[u8]) -> Result<usize, FileError> {
        Err(FileError::BadFd)
    }

    fn close(&self) {
        let mut inner = self.0.inner.lock();
        inner.reader_closed = true;
        // Wake blocked writer so it sees the broken pipe.
        if let Some(writer) = inner.blocked_writer.take() {
            scheduler::unblock(writer);
        }
    }

    fn kind(&self) -> &'static str { "pipe_reader" }
}

Write end

pub struct PipeWriter(Arc<Pipe>);

impl FileHandle for PipeWriter {
    fn read(&self, _buf: &mut [u8]) -> Result<usize, FileError> {
        Err(FileError::BadFd)
    }

    fn write(&self, buf: &[u8]) -> Result<usize, FileError> {
        let mut offset = 0;
        while offset < buf.len() {
            let mut inner = self.0.inner.lock();
            if inner.reader_closed {
                return Err(FileError::BrokenPipe);
            }
            let space = inner.capacity - inner.buf.len();
            if space > 0 {
                let n = core::cmp::min(space, buf.len() - offset);
                inner.buf.extend(&buf[offset..offset + n]);
                offset += n;
                // Wake blocked reader.
                if let Some(reader) = inner.blocked_reader.take() {
                    scheduler::unblock(reader);
                }
            } else {
                // Buffer full — block.
                inner.blocked_writer = Some(scheduler::current_thread_idx());
                drop(inner);
                scheduler::block_current_thread();
            }
        }
        Ok(buf.len())
    }

    fn close(&self) {
        let mut inner = self.0.inner.lock();
        inner.writer_closed = true;
        // Wake blocked reader so it sees EOF.
        if let Some(reader) = inner.blocked_reader.take() {
            scheduler::unblock(reader);
        }
    }

    fn kind(&self) -> &'static str { "pipe_writer" }
}

Creating a pipe

pub fn new_pipe(capacity: usize) -> (PipeReader, PipeWriter) {
    let pipe = Arc::new(Pipe {
        inner: Mutex::new(PipeInner {
            buf: VecDeque::with_capacity(capacity),
            capacity,
            reader_closed: false,
            writer_closed: false,
            blocked_reader: None,
            blocked_writer: None,
        }),
    });
    (PipeReader(pipe.clone()), PipeWriter(pipe))
}

Layer 6: Syscall wiring

New syscalls

| Nr | Name | Signature |
|----|------|-----------|
| 0 | read | read(fd, buf, count) → ssize_t |
| 1 | write | write(fd, buf, count) → ssize_t |
| 3 | close | close(fd) → int |
| 22 | pipe | pipe(fds) → int |

sys_pipe implementation

fn sys_pipe(fds_ptr: u64) -> i64 {
    // Validate user pointer (2 × i32 = 8 bytes).
    const USER_LIMIT: u64 = 0x0000_8000_0000_0000;
    if fds_ptr == 0 || fds_ptr + 8 > USER_LIMIT {
        return -(errno::file_errno(FileError::BadAddress) as i64); // EFAULT
    }

    let (reader, writer) = new_pipe(4096);
    let pid = process::current_pid();
    let result = process::with_process(pid, |proc| {
        let rfd = proc.alloc_fd(Arc::new(reader))?;
        match proc.alloc_fd(Arc::new(writer)) {
            Ok(wfd) => Ok((rfd, wfd)),
            Err(e) => { proc.close_fd(rfd).ok(); Err(e) }
        }
    });
    let (read_fd, write_fd) = match result {
        Some(Ok(fds)) => fds,
        Some(Err(e)) => return -(errno::file_errno(e) as i64),
        None => return -(errno::file_errno(FileError::BadFd) as i64),
    };

    // Write fds to user space.
    let fds = fds_ptr as *mut [i32; 2];
    unsafe { (*fds) = [read_fd as i32, write_fd as i32]; }
    0
}

Refactored sys_write

fn sys_write(fd: u64, buf: u64, count: u64) -> i64 {
    // ... existing user pointer validation ...
    let bytes = match validated_user_slice(buf, count) {
        Ok(b) => b,
        Err(e) => return -(errno::file_errno(e) as i64),
    };
    let pid = process::current_pid();
    let handle = process::with_process_ref(pid, |p| {
        p.fd_table.get(fd as usize).and_then(|s| s.clone())
    }).flatten();
    match handle {
        Some(h) => match h.write(bytes) {
            Ok(n) => n as i64,
            Err(e) => -(errno::file_errno(e) as i64),
        },
        None => -(errno::file_errno(FileError::BadFd) as i64),
    }
}

Implementation order

| Phase | What | Files |
|-------|------|-------|
| 1 | FileHandle trait + FileError | libkernel/src/file.rs (new) |
| 2 | fd_table on Process + alloc_fd/close_fd | libkernel/src/process.rs |
| 3 | ConsoleHandle | libkernel/src/file.rs |
| 4 | Refactor sys_write to use fd table | libkernel/src/syscall.rs |
| 5 | Add sys_read + sys_close | libkernel/src/syscall.rs |
| 6 | Blocked thread state + block_current_thread / unblock | libkernel/src/task/scheduler.rs |
| 7 | PipeInner / PipeReader / PipeWriter | libkernel/src/pipe.rs (new) |
| 8 | sys_pipe syscall | libkernel/src/syscall.rs |
| 9 | dup2 (optional, for shell redirection) | libkernel/src/syscall.rs |

Phases 1–5 are useful independently — they give user processes a real fd abstraction for stdout/stderr. Phase 6 is needed for any future blocking syscall (futex, sleep, waitpid). Phases 7–8 deliver pipes.


Open questions

  • Pipe capacity: 4096 bytes matches Linux’s historical default. Should this be page-sized for alignment, or is VecDeque fine?
  • Multiple readers/writers: This design supports only one blocked reader and one blocked writer. For a single pipe between two processes this is fine, but dup-ed fds sharing a pipe end would need a wait queue.
  • Signal delivery: POSIX SIGPIPE on write to a broken pipe is not modelled — we return EPIPE instead. Signals can be added later.
  • O_NONBLOCK: Not yet supported. Would return EAGAIN instead of blocking. Requires fd-level flags.
  • VFS integration: A future VfsHandle implementing FileHandle would connect the VFS’s async read_file to the synchronous FileHandle::read by using the same blocking mechanism.

Actor System

Overview

The kernel uses a lightweight actor model for device drivers and long-running system services. Each actor is an async task that owns its state behind an Arc, receives typed messages through a Mailbox, and responds to requests via one-shot Reply channels.

The design avoids shared mutable state and lock contention between drivers: all cross-actor communication is by message passing.


Core Primitives

Mailbox<M>libkernel::task::mailbox

An async, mutex-backed message queue.

sender                         receiver (actor run loop)
──────                         ────────────────────────
mailbox.send(msg)         →    while let Some(msg) = inbox.recv().await { ... }
                               (suspends when queue empty; woken on send)
mailbox.close()           →    recv() drains remaining msgs, then returns None

Key properties:

  • send acquires the lock, checks closed, and either enqueues the message or drops it immediately. Dropping a message also drops any embedded Reply, which closes the reply channel and unblocks the sender with None.
  • close sets closed = true under the lock and wakes the receiver. Messages already in the queue are not removed — recv delivers them before returning None. Any send arriving after close is silently dropped.
  • reopen clears the closed flag, used when restarting a driver.
  • The mutex makes send and close atomic with respect to each other, eliminating the race between “is it closed?” and “enqueue”.

recv uses a double-check pattern to avoid missed wakeups:

poll():
  lock → dequeue / check closed → unlock   (fast path)
  register waker
  lock → dequeue / check closed → unlock   (second check)
  → Pending

The lock is always released before registering the waker and before waking it, so a send or close that arrives between the two checks will either be seen by the second check or will wake the (now-registered) waker.

Reply<T> — one-shot response channel

Reply<T> is the sending half of a request/response pair.

// Actor receives:
ActorMsg::Info(reply) => reply.send(ActorStatus { name: "dummy", running: true, info: () }),

// Sender awaits:
let status: Option<ActorStatus<()>> = inbox.ask(|r| ActorMsg::Info(r)).await;

Reply::new() returns (Reply<T>, Arc<Mailbox<T>>). The actor calls reply.send(value) to deliver a response; the Drop impl calls close() on the inner mailbox regardless, so the receiver always unblocks:

  • reply.send(value) → value pushed, then Reply dropped → close() called. close() does not drain the queue, so the value is still there for recv.
  • reply dropped without sendclose() called on an empty mailbox → recv() returns None.

ActorMsg<M, I> — the envelope type

Every actor mailbox is Mailbox<ActorMsg<M, I>> where M is the actor-specific message type and I is the actor-specific info detail type (defaults to ()).

pub enum ActorMsg<M, I: Send = ()> {
    /// Typed info request — reply carries ActorStatus<I> with the full detail.
    Info(Reply<ActorStatus<I>>),
    /// Type-erased info request from the process registry — reply carries
    /// ActorStatus<ErasedInfo> so callers can display detail without knowing I.
    ErasedInfo(Reply<ActorStatus<ErasedInfo>>),
    /// An actor-specific message.
    Inner(M),
}

ActorStatus<I> is the response to both info variants:

pub struct ActorStatus<I = ()> {
    pub name:    &'static str,
    pub running: bool,   // always true when the actor is responding
    pub info:    I,      // actor-specific detail
}

ErasedInfo is a type alias for the boxed detail used in type-erased queries:

pub type ErasedInfo = Box<dyn core::fmt::Debug + Send>;

RecvTimeout<M> — timed receive

recv_timeout races the inbox against a Delay, returning whichever fires first:

pub enum RecvTimeout<M> {
    Message(M),  // a message arrived before the deadline
    Closed,      // mailbox was closed (actor should exit)
    Elapsed,     // timer fired before any message
}

// Usage:
match inbox.recv_timeout(ticks).await {
    RecvTimeout::Message(msg) => { /* handle */ }
    RecvTimeout::Closed       => break,
    RecvTimeout::Elapsed      => { /* periodic work */ }
}

Used internally by the #[on_tick] generated run loop.

ask — the request/response pattern

// Returns Option<R>; None if the actor is stopped or dropped the reply.
let result = inbox.ask(|reply| ActorMsg::Inner(MyMsg::GetThing(reply))).await;

ask creates a Reply, wraps it in a message, sends it, and awaits the response. Because a closed mailbox drops incoming messages (and their Replys), ask on a stopped actor returns None immediately rather than hanging.
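The round-trip can be modelled on the host with `std::sync::mpsc`, using a plain channel Sender in place of Reply (an illustrative sketch, not the kernel's Mailbox API):

```rust
use std::sync::mpsc;
use std::thread;

// The embedded Sender plays the role of Reply: a one-shot response path.
enum Msg { GetThing(mpsc::Sender<u32>) }

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // Actor: receive a request, answer on the embedded channel.
    let actor = thread::spawn(move || {
        while let Ok(Msg::GetThing(reply)) = rx.recv() {
            let _ = reply.send(42);
        }
        // rx disconnected: the loop exits. Any undelivered Reply senders
        // are dropped, which unblocks askers with an error, not a hang.
    });

    // ask: create the one-shot reply channel, wrap it, send, await.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Msg::GetThing(reply_tx)).unwrap();
    assert_eq!(reply_rx.recv().ok(), Some(42));

    drop(tx); // close the mailbox
    actor.join().unwrap();
}
```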

Self-query deadlock: an actor must never use ask (or registry::ask_info) to query its own mailbox from within a message handler — it cannot recv() the response while blocked executing the current message. Detect self-queries by comparing names and respond directly instead.


Driver Lifecycle — devices::task_driver

DriverTask trait

pub trait DriverTask: Send + Sync + 'static {
    type Message: Send;
    type Info:    Send + 'static;
    fn name(&self) -> &'static str;
    fn run(
        handle: Arc<Self>,
        stop:   StopToken,
        inbox:  Arc<Mailbox<ActorMsg<Self::Message, Self::Info>>>,
    ) -> impl Future<Output = ()> + Send;
}

type Info is the actor-specific detail returned by #[on_info]. Use () if the actor has no custom info.

The run future is 'static because all state is accessed through Arc<Self>. StopToken can be polled between messages for cooperative stop, though most actors simply let inbox.recv() return None (which happens when the mailbox is closed by stop()).

TaskDriver<T> — the lifecycle wrapper

TaskDriver<T> implements Driver (the registry interface) and owns:

| Field | Type | Purpose |
|-------|------|---------|
| task | Arc<T> | actor state, shared with the run future |
| running | Arc<AtomicBool> | set true on start, false when run exits |
| stop_flag | Arc<AtomicBool> | StopToken reads this |
| inbox | Arc<Mailbox<ActorMsg<T::Message, T::Info>>> | message channel |

Lifecycle:

TaskDriver::new()
  inbox starts CLOSED → sends before start() are dropped immediately

start()
  inbox.reopen()          opens the mailbox
  running = true
  spawn(async { T::run(handle, stop, inbox).await; running = false; })

stop()
  stop_flag = true        StopToken fires
  inbox.close()           recv() will return None after draining

(run loop exits)
  running = false

TaskDriver::new returns (TaskDriver<T>, Arc<Mailbox<ActorMsg<T::Message, T::Info>>>). The caller holds onto the Arc<Mailbox> to send actor-specific messages and registers it in the process registry (see below).


The #[actor] Macro — devices_macros

The macro generates a complete DriverTask implementation from an annotated impl block, eliminating the run-loop boilerplate. All attributes are passthrough no-ops when used outside an #[actor] block.

Basic usage — pure message actor

pub enum DummyMsg { SetInterval(u64) }

#[derive(Debug)]
pub struct DummyInfo { pub interval_secs: u64 }

pub struct Dummy { interval_secs: AtomicU64 }

#[actor("dummy", DummyMsg)]
impl Dummy {
    #[on_info]
    async fn on_info(&self) -> DummyInfo {
        DummyInfo { interval_secs: self.interval_secs.load(Ordering::Relaxed) }
    }

    #[on_message(SetInterval)]
    async fn set_interval(&self, secs: u64) {
        self.interval_secs.store(secs, Ordering::Relaxed);
    }
}

What the macro generates:

// Inherent impl with handler methods (attributes stripped):
impl Dummy {
    async fn on_info(&self) -> DummyInfo { ... }
    async fn set_interval(&self, secs: u64) { ... }
}

// DriverTask impl with the generated run loop:
impl DriverTask for Dummy {
    type Message = DummyMsg;
    type Info    = DummyInfo;
    fn name(&self) -> &'static str { "dummy" }

    async fn run(handle: Arc<Self>, _stop: StopToken,
                 inbox: Arc<Mailbox<ActorMsg<DummyMsg, DummyInfo>>>) {
        log::info!("[dummy] started");
        while let Some(msg) = inbox.recv().await {
            match msg {
                ActorMsg::Info(reply) =>
                    reply.send(ActorStatus { name: "dummy", running: true,
                                            info: handle.on_info().await }),
                ActorMsg::ErasedInfo(reply) =>
                    reply.send(ActorStatus { name: "dummy", running: true,
                                            info: Box::new(handle.on_info().await) }),
                ActorMsg::Inner(msg) => match msg {
                    DummyMsg::SetInterval(secs) => handle.set_interval(secs).await,
                }
            }
        }
        log::info!("[dummy] stopped");
    }
}

// Convenience type alias (struct name + "Driver"):
pub type DummyDriver = TaskDriver<Dummy>;

Any methods in the #[actor] block that have no actor attribute are emitted unchanged in the inherent impl and are callable from handler methods.

#[on_start] — actor startup hook

Called once, after the [actor] started log line and before the message loop:

#[on_start]
async fn on_start(&self) {
    println!();
    print!("myactor> ");
}

Only one #[on_start] method is allowed per actor.

#[on_info] — custom actor info

Without #[on_info], Info and ErasedInfo reply with info: (). Annotate one method to provide actor-specific detail:

#[on_info]
async fn on_info(&self) -> MyInfo {
    MyInfo { /* fields from self */ }
}

The return type must implement Debug + Send. The macro infers type Info = MyInfo and generates both Info and ErasedInfo arms automatically.

#[on_message(Variant)] — inner message handler

Maps one enum variant of the actor’s message type to an async handler:

#[on_message(DoThing)]
async fn do_thing(&self, n: u32) { ... }

The generated match arm is:

ActorMsg::Inner(MyMsg::DoThing(n)) => handle.do_thing(n).await,

Multiple #[on_message] methods are allowed, one per variant.

#[on_tick] — periodic callback

When present, the macro switches to a unified poll_fn loop (see below) that races the inbox against a Delay. The actor must also provide a plain tick_interval_ticks(&self) -> u64 method (no attribute needed):

#![allow(unused)]
fn main() {
fn tick_interval_ticks(&self) -> u64 {
    self.interval_secs.load(Ordering::Relaxed) * TICKS_PER_SECOND
}

#[on_tick]
async fn heartbeat(&self) {
    log::info!("[myactor] tick");
}
}

Only one #[on_tick] method is allowed per actor. The delay is reset after each tick so tick_interval_ticks can change dynamically.

#[on_stream(factory)] — interrupt/hardware stream source

Actors that need to react to hardware events (interrupts, async streams) use #[on_stream]. The factory argument names a plain method that returns a Stream + Unpin; the handler is called for each item:

#![allow(unused)]
fn main() {
// Factory — called once when the actor starts:
fn key_stream(&self) -> KeyStream { KeyStream::new() }

// Handler — called for each item from the stream:
#[on_stream(key_stream)]
async fn on_key(&self, key: Key) {
    // process key event
}
}

Multiple #[on_stream] methods are allowed, one per stream.

The unified poll_fn loop

When one or more #[on_stream] or #[on_tick] attributes are present the macro generates a loop that races all event sources in a single poll_fn:

#![allow(unused)]
fn main() {
// Streams initialised once before the loop:
let mut _stream_0 = handle.key_stream();
// Timer initialised if #[on_tick] is present:
let mut _delay = Delay::new(handle.tick_interval_ticks());

loop {
    enum _Event {
        _Inbox(ActorMsg<KeyboardMsg, KeyboardInfo>),
        _Stream0(Key),   // one variant per #[on_stream]
        _Tick,           // present if #[on_tick]
        _Stopped,
    }
    let mut _recv = inbox.recv();
    let _ev = poll_fn(|cx| {
        // Streams polled first — interrupt-driven, lowest latency:
        match poll_stream_next(&mut _stream_0, cx) {
            Poll::Ready(Some(item)) => return Poll::Ready(_Event::_Stream0(item)),
            Poll::Ready(None)       => return Poll::Ready(_Event::_Stopped),
            Poll::Pending           => {}
        }
        // Inbox — control messages and stop signal:
        match Pin::new(&mut _recv).poll(cx) {
            Poll::Ready(Some(msg)) => return Poll::Ready(_Event::_Inbox(msg)),
            Poll::Ready(None)      => return Poll::Ready(_Event::_Stopped),
            Poll::Pending          => {}
        }
        // Timer (lowest priority):
        if let Poll::Ready(()) = Pin::new(&mut _delay).poll(cx) {
            return Poll::Ready(_Event::_Tick);
        }
        Poll::Pending
    }).await;

    match _ev {
        _Event::_Stopped         => break,
        _Event::_Inbox(msg)      => match msg { /* Info, ErasedInfo, Inner arms */ }
        _Event::_Stream0(key)    => handle.on_key(key).await,
        _Event::_Tick            => {
            handle.heartbeat().await;
            _delay = Delay::new(handle.tick_interval_ticks());
        }
    }
}
}

All wakers (mailbox AtomicWaker, stream AtomicWaker, timer WAKERS slot) register the same task waker, so whichever source fires first reschedules the task. No extra task or thread is needed.

Using #[actor] outside the devices crate

The macro generates impl crate::task_driver::DriverTask for … and pub type XDriver = crate::task_driver::TaskDriver<X>;. In the devices crate this resolves naturally. For crates that use devices as a dependency (e.g. kernel), expose task_driver at the crate root:

#![allow(unused)]
fn main() {
// kernel/src/task_driver.rs
pub use devices::task_driver::*;

// kernel/src/main.rs
pub mod task_driver;   // makes crate::task_driver resolve for #[actor] expansions
}

The generated type alias uses the struct name suffixed with Driver: KeyboardActor → KeyboardActorDriver, Shell → ShellDriver.


Process Registry — libkernel::task::registry

The registry maps actor names to their mailboxes, allowing any code to send messages to a named actor without holding a direct reference.

#![allow(unused)]
fn main() {
// Registration (at init time, in main.rs):
registry::register("dummy", dummy_inbox.clone());

// Typed lookup (when the caller knows both message and info types):
let inbox: Arc<Mailbox<ActorMsg<DummyMsg, DummyInfo>>> =
    registry::get::<DummyMsg, DummyInfo>("dummy")?;
inbox.send(ActorMsg::Inner(DummyMsg::SetInterval(5)));

// Type-erased info query (no knowledge of M or I needed):
if let Some(status) = registry::ask_info("dummy").await {
    println!("name: {}  running: {}  info: {:?}", status.name, status.running, status.info);
}
}

Each registry entry stores two representations of the same mailbox:

Field        Type                          Used for
mailbox      Arc<dyn Any + Send + Sync>    typed downcast via get<M, I>
informable   Arc<dyn Informable>           type-erased ErasedInfo query via ask_info

Informable is a simple object-safe trait:

#![allow(unused)]
fn main() {
pub trait Informable: Send + Sync {
    fn send_info(&self, reply: Reply<ActorStatus<ErasedInfo>>);
}
// Blanket impl for all actor mailboxes:
impl<M: Send, I: Send + 'static> Informable for Mailbox<ActorMsg<M, I>> {
    fn send_info(&self, reply: Reply<ActorStatus<ErasedInfo>>) {
        self.send(ActorMsg::ErasedInfo(reply));
    }
}
}

ask_info clones the Arc<dyn Informable> while holding the registry lock, drops the lock, then sends the request and awaits the reply — the lock is never held across an await.
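
A host-side sketch of that lock discipline, using std types in place of the kernel's registry map: the Arc is cloned inside a short lock scope and the guard is released before anything slow happens. The lookup function and the String payload are illustrative, not the real registry types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Clone the entry's Arc while holding the lock; the guard is dropped
/// at the end of the statement, before the caller sends or awaits.
fn lookup(registry: &Mutex<HashMap<String, Arc<String>>>, name: &str) -> Option<Arc<String>> {
    registry.lock().unwrap().get(name).cloned()
}

fn main() {
    let reg = Mutex::new(HashMap::from([(
        "dummy".to_string(),
        Arc::new("mailbox".to_string()),
    )]));
    // By the time we use the entry, the registry lock is already released.
    let entry = lookup(&reg, "dummy").unwrap();
    assert_eq!(*entry, "mailbox");
    assert!(lookup(&reg, "missing").is_none());
}
```

The same shape carries over to async code: because the guard never crosses an .await point, the registry cannot deadlock against a slow actor.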


Actors in Practice

Shell — pure message actor with startup hook

#![allow(unused)]
fn main() {
pub enum ShellMsg { KeyLine(String) }
pub struct Shell;

#[actor("shell", ShellMsg)]
impl Shell {
    #[on_start]
    async fn on_start(&self) {
        println!();
        print!("ostoo> ");
    }

    #[on_message(KeyLine)]
    async fn on_key_line(&self, line: String) {
        self.execute_command(&line).await;
        print!("ostoo> ");
    }

    // Plain helpers — land in the inherent impl:
    async fn execute_command(&self, line: &str) { ... }
    async fn cmd_driver(&self, rest: &str) { ... }
}
}

The shell prints its prompt in #[on_start] (once, when the actor starts) and again after each command in #[on_message(KeyLine)].

Fire-and-forget dispatch: the keyboard actor sends ShellMsg::KeyLine with mailbox.send() (no reply), so it never blocks waiting for the shell. The shell processes one command at a time; new lines queue in the mailbox.

Self-query avoidance: running driver info shell from within a shell command would deadlock if it sent ErasedInfo to the shell’s own mailbox (the shell is busy executing the command and cannot recv). Instead, the handler detects the name "shell" and responds directly without going through the registry.

Keyboard — stream actor

#![allow(unused)]
fn main() {
pub struct KeyboardActor {
    keys_processed:   AtomicU64,
    lines_dispatched: AtomicU64,
    line:             spin::Mutex<LineBuf>,
}

#[actor("keyboard", KeyboardMsg)]
impl KeyboardActor {
    fn key_stream(&self) -> KeyStream { KeyStream::new() }

    #[on_stream(key_stream)]
    async fn on_key(&self, key: Key) {
        // buffer characters; dispatch complete lines to shell via send()
    }

    #[on_info]
    async fn on_info(&self) -> KeyboardInfo { ... }
}
}

KeyStream is interrupt-driven: every PS/2 scancode IRQ pushes into a lock-free queue and wakes an AtomicWaker. Because both the stream waker and the inbox waker register the same task waker, the actor sleeps in a single poll_fn and wakes on whichever event arrives first.

The line buffer lives in the actor struct behind a spin::Mutex<LineBuf> so it is accessible from the &self reference in on_key. The mutex is never held across an .await.

Dummy — tick actor (example / test driver)

#![allow(unused)]
fn main() {
#[actor("dummy", DummyMsg)]
impl Dummy {
    fn tick_interval_ticks(&self) -> u64 {
        self.interval_secs.load(Ordering::Relaxed) * TICKS_PER_SECOND
    }

    #[on_tick]
    async fn heartbeat(&self) {
        log::info!("[dummy] heartbeat");
    }

    #[on_info]
    async fn on_info(&self) -> DummyInfo { ... }

    #[on_message(SetInterval)]
    async fn set_interval(&self, secs: u64) { ... }
}
}

Starts stopped. driver start dummy from the shell opens its mailbox and spawns the run loop. driver dummy set-interval 3 sends SetInterval(3) and changes the heartbeat rate at runtime.


Startup Sequence

#![allow(unused)]
fn main() {
// main.rs (abridged)

// Dummy driver — starts stopped, user can start it from the shell
let (dummy_driver, dummy_inbox) = DummyDriver::new(Dummy::new());
devices::driver::register(Box::new(dummy_driver));
registry::register("dummy", dummy_inbox);

// Shell actor — started immediately
let (shell_driver, shell_inbox) = ShellDriver::new(Shell::new());
devices::driver::register(Box::new(shell_driver));
registry::register("shell", shell_inbox.clone());
devices::driver::start_driver("shell").ok();   // reopen + spawn run loop

// Keyboard actor — started immediately, stream-driven by PS/2 IRQs
let (kb_driver, kb_inbox) =
    KeyboardActorDriver::new(KeyboardActor::new());
devices::driver::register(Box::new(kb_driver));
registry::register("keyboard", kb_inbox);
devices::driver::start_driver("keyboard").ok();
}

File Map

Path                              Role
libkernel/src/task/mailbox.rs     Mailbox<M>, Reply<T>, ActorMsg<M,I>, ActorStatus<I>, ErasedInfo, RecvTimeout<M>
libkernel/src/task/mod.rs         poll_stream_next helper used by macro-generated code
libkernel/src/task/registry.rs    process registry, Informable, ask_info
devices/src/task_driver.rs        DriverTask trait, TaskDriver<T>, StopToken
devices/src/driver.rs             Driver trait, driver registry (start/stop/list)
devices-macros/src/lib.rs         #[actor], #[on_message], #[on_info], #[on_start], #[on_tick], #[on_stream]
devices/src/dummy.rs              tick + message actor (#[on_tick], #[on_message], #[on_info])
kernel/src/shell.rs               shell actor (#[on_start], #[on_message])
kernel/src/keyboard_actor.rs      keyboard actor (#[on_stream], #[on_info])
kernel/src/task_driver.rs         pub use devices::task_driver::* shim for crate::task_driver path

virtio-blk Block Device Driver

Overview

The kernel includes a PCI virtio-blk driver that provides read/write access to a QEMU virtual disk. The driver is implemented using the virtio-drivers crate (v0.13) and integrates with the existing actor/driver framework.

The driver is started automatically at boot if a virtio-blk PCI device is found. It is accessible from the shell via the blk commands.


Architecture

QEMU virtio-blk device (PCIe, Q35 ECAM)
        │
        │  PciTransport (virtio-drivers)
        ▼
  VirtIOBlk<KernelHal, PciTransport>   ← virtio protocol implementation
        │
  spin::Mutex (actor + ISR safe)
        │
  VirtioBlkActor                        ← actor framework wrapper
        │
  Mailbox<ActorMsg<VirtioBlkMsg, VirtioBlkInfo>>
        │
  Shell / other actors                  ← consumers

Components

devices/src/virtio/mod.rs — HAL and transport

KernelHal

Implements the virtio_drivers::Hal unsafe trait, bridging the virtio-drivers crate into the kernel memory model:

dma_alloc(pages)
    Allocates contiguous physical frames via MemoryServices::alloc_dma_pages; returns (paddr, virt) where virt is in the linear physical-memory window (phys_mem_offset + paddr). Pages are zeroed.
dma_dealloc
    No-op. The frame allocator has no free operation; allocations are leaked (acceptable for MVP).
mmio_phys_to_virt(paddr, size)
    Calls MemoryServices::map_mmio_region to ensure the physical range is mapped, then returns the linear-window virtual address.
share(buffer)
    Performs a page-table walk via MemoryServices::translate_virt to find the physical address of any buffer (heap or DMA window). A plain vaddr - phys_mem_offset would be wrong for heap buffers.
unshare
    No-op on x86 (cache-coherent).

ECAM / PciRoot

The Q35 machine exposes a PCIe Extended Configuration Access Mechanism (ECAM) region at physical address 0xB000_0000 (1 MiB, covering bus 0).

Physical 0xB000_0000  →  Virtual phys_mem_offset + 0xB000_0000

The mapping is created once during libkernel_main by calling MemoryServices::map_mmio_region. The resulting virtual base is stored in the ECAM_VIRT_BASE atomic and used by create_pci_root() which constructs a PciRoot<MmioCam<'static>> for the virtio-drivers transport layer. (In virtio-drivers 0.13, PciRoot is generic over a ConfigurationAccess implementation; MmioCam wraps the raw MMIO pointer with a Cam::Ecam mode.)

create_pci_transport (formerly create_blk_transport)

#![allow(unused)]
fn main() {
pub fn create_pci_transport(bus: u8, device: u8, function: u8) -> Option<PciTransport>
}

Wraps PciTransport::new::<KernelHal, _>, isolating virtio-drivers from the kernel binary — the kernel crate does not depend on virtio-drivers directly. Works for any virtio-pci device (blk, 9p, etc.), not just block devices. create_blk_transport is kept as a legacy alias.

register_blk_irq

#![allow(unused)]
fn main() {
pub fn register_blk_irq(handler: fn()) -> Option<u8>
}

Registers a dynamic IDT handler for the virtio-blk interrupt (delegating to libkernel::interrupts::register_handler). Returns the allocated IDT vector, which must be programmed into the device’s MSI or IO APIC routing table. IRQ-driven completion is not yet wired up (see Limitations).


devices/src/virtio/blk.rs — the actor

Messages

#![allow(unused)]
fn main() {
pub enum VirtioBlkMsg {
    Read(u64, Reply<Result<Vec<u8>, ()>>),   // sector, reply
    Write(u64, Vec<u8>, Reply<Result<(), ()>>), // sector, data, reply
}
}

Info

#![allow(unused)]
fn main() {
#[derive(Debug)]
pub struct VirtioBlkInfo {
    pub capacity_sectors: u64,
    pub reads:  u64,
    pub writes: u64,
}
}

Returned by driver info virtio-blk and blk info.

VirtioBlkActor

Owns a spin::Mutex<VirtIOBlk<KernelHal, PciTransport>>. The mutex is needed because both the actor task and (future) interrupt handler may access the device.

unsafe impl Send + Sync are required because VirtIOBlk contains raw DMA buffer pointers, which are not auto-Send. Access is always serialised through the spin::Mutex.

Read/write flow

on_read(sector, reply):
  1. lock device → read_blocks_nb(sector, &mut req, buf, &mut resp) → token
  2. unlock device
  3. CompletionFuture.await  (busy-polls peek_used until the device signals done)
  4. lock device → complete_read_blocks(token, &req, buf, &resp)
  5. unlock device
  6. reply.send(Ok(buf))

Write is symmetric with write_blocks_nb / complete_write_blocks.

All of read_blocks_nb, write_blocks_nb, complete_read_blocks, and complete_write_blocks are unsafe fn in virtio-drivers — the safety contract is that the buffers remain valid and are not moved for the duration of the I/O. Because buf, req, and resp all live in the async state machine on the heap, they are not moved or dropped between submit and complete.

CompletionFuture

#![allow(unused)]
fn main() {
struct CompletionFuture<'a> {
    device: &'a spin::Mutex<VirtIOBlk<KernelHal, PciTransport>>,
}

impl Future for CompletionFuture<'_> {
    type Output = ();
    fn poll(...) -> Poll<()> {
        if device.lock().peek_used().is_some() {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref();   // reschedule immediately (busy-poll)
            Poll::Pending
        }
    }
}
}

This is a busy-poll future for MVP. It re-schedules itself every executor turn until the virtqueue returns a used buffer. See Limitations for the planned IRQ-driven replacement.


libkernel/src/memory/mod.rs — supporting APIs

Three methods were added to MemoryServices for virtio support:

map_mmio_region(phys_start, size) -> VirtAddr

Maps a physical MMIO range into the linear physical-memory window (phys_mem_offset + phys_start) using 4 KiB pages with PRESENT | WRITABLE | NO_CACHE flags.

Pages already mapped as 4 KiB pages are skipped silently (Ok(_)). Pages inside a 2 MiB or 1 GiB huge-page entry are also skipped (Err(TranslateError::ParentEntryHugePage)) — they are already accessible because the bootloader maps all physical RAM using 2 MiB huge pages.

This huge-page check was the fix for the map_to failed: ParentEntryHugePage panic that occurred when mapping the ECAM region.

alloc_dma_pages(pages) -> Option<PhysAddr>

Allocates the requested number of physically-contiguous 4 KiB frames from the BootInfoFrameAllocator. Panics if frames are not contiguous (very unlikely with the sequential allocator).

translate_virt(virt) -> Option<PhysAddr>

Walks the active RecursivePageTable to find the physical address for any virtual address, regardless of page size (4 KiB, 2 MiB, or 1 GiB).

This is used by KernelHal::share to convert heap buffer addresses to physical addresses. A simple vaddr - phys_mem_offset subtraction would be wrong for heap buffers (which live at HEAP_START, not in the linear physical window), producing garbage physical addresses and causing QEMU to report virtio: zero sized buffers are not allowed.
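
The arithmetic behind this point can be shown with made-up addresses (the constants below are illustrative; only the relationship matters). Subtraction is only valid for addresses inside the linear window — a heap address sits below it, so the subtraction underflows rather than yielding a physical address:

```rust
// Hypothetical layout: all physical RAM is mirrored at PHYS_MEM_OFFSET,
// while the kernel heap lives at an unrelated low virtual address.
const PHYS_MEM_OFFSET: u64 = 0xFFFF_8000_0000_0000;
const HEAP_START: u64 = 0x4444_4444_0000; // illustrative heap base

/// Valid ONLY for addresses inside the linear physical-memory window.
fn linear_window_to_phys(vaddr: u64) -> Option<u64> {
    vaddr.checked_sub(PHYS_MEM_OFFSET)
}

fn main() {
    // A DMA buffer mapped in the window round-trips correctly:
    let paddr = 0x1000;
    assert_eq!(linear_window_to_phys(PHYS_MEM_OFFSET + paddr), Some(paddr));
    // A heap buffer is below the window — subtraction cannot work,
    // which is why KernelHal::share must do a real page-table walk.
    assert_eq!(linear_window_to_phys(HEAP_START), None);
}
```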


Boot Sequence

libkernel_main()
  1. memory::init_services(mapper, frame_allocator, phys_mem_offset, map)
  2. map_mmio_region(0xB000_0000, 1 MiB)   ← ECAM
     virtio::set_ecam_base(ecam_virt)
  3. devices::pci::init()                   ← scan CF8/CFC config space
  4. find_devices(0x1AF4, 0x1042)           ← probe modern-transitional first
     find_devices(0x1AF4, 0x1001)           ← then legacy
  5. virtio::create_pci_transport(bus, dev, func)
       └─ PciRoot::new(MmioCam::new(ECAM_VIRT_BASE, Cam::Ecam))
          PciTransport::new::<KernelHal, _>(&mut root, df)
  6. VirtioBlkActor::new(transport)
  7. VirtioBlkActorDriver::new(actor)
  8. driver::register + registry::register("virtio-blk", inbox)
  9. driver::start_driver("virtio-blk")
     → "[kernel] virtio-blk registered"

Shell Commands

Command                   Description
blk info                  Print capacity, read count, and write count
blk read <sector>         Read 512 bytes from sector N; hex-dump first 64 bytes
blk ls [path]             List exFAT directory (see exfat.md)
blk cat <path>            Print exFAT file as text (see exfat.md)
ls [path]                 Alias for blk ls
cat <path>                Alias for blk cat
driver info virtio-blk    Same info via the generic driver info command
driver stop virtio-blk    Stop the actor (mailbox closed; no further I/O)
driver start virtio-blk   Restart the actor

Running with a Disk

# Create a blank 64 MiB disk image (once):
make disk

# Build and run with the disk attached:
make run

The run target adds:

-drive file=disk.img,format=raw,if=none,id=hd0
-device virtio-blk-pci,drive=hd0

The kernel uses a Q35 machine (-machine q35) which provides native PCIe and ECAM support.

To run without a disk (e.g. for quick boot tests):

make run-nodisk

PCI Device IDs

Device ID        Variant
0x1AF4:0x1042    Modern-transitional virtio-blk (QEMU default)
0x1AF4:0x1001    Legacy virtio-blk

Both are probed at boot; modern-transitional is tried first.


Key Files

File                              Role
devices/src/virtio/mod.rs         KernelHal, ECAM state, create_pci_transport, register_blk_irq
devices/src/virtio/blk.rs         VirtioBlkActor, VirtioBlkMsg, VirtioBlkInfo, CompletionFuture
devices/src/virtio/p9_proto.rs    9P2000.L wire protocol encode/decode
devices/src/virtio/p9.rs          P9Client — high-level 9P client wrapping VirtIO9p
kernel/src/main.rs                ECAM mapping, PCI probe (blk + 9p), actor registration
devices/src/virtio/exfat.rs       exFAT partition detection, filesystem, path walk
kernel/src/shell.rs               blk info, blk read, blk ls, blk cat, ls, cat, cd, pwd
libkernel/src/memory/mod.rs       map_mmio_region, alloc_dma_pages, translate_virt
Makefile                          disk, run, run-nodisk targets

Limitations

Busy-poll completion

CompletionFuture re-schedules itself every executor turn, consuming CPU until the device completes I/O. The intended replacement is an AtomicWaker-based future that sleeps until the IRQ handler calls wake():

#![allow(unused)]
fn main() {
static IRQ_WAKER: AtomicWaker = AtomicWaker::new();

fn virtio_blk_irq_handler() {
    IRQ_PENDING.store(true, Ordering::Release);
    IRQ_WAKER.wake();
}
}

This requires programming the device’s MSI capability or IO APIC routing with the vector returned by register_blk_irq. The infrastructure exists; wiring is the remaining work.

No DMA free

dma_dealloc is a no-op. Freed DMA pages are leaked. The BootInfoFrameAllocator has no reclamation path. Acceptable for MVP; a proper frame allocator with free would be needed for a production kernel.

Single device

The IRQ state (IRQ_PENDING) is a file-level static, supporting only one virtio-blk device. Multi-device support would require per-device state.

Heap size

The kernel heap is 100 KiB. DMA allocations come from the frame allocator (not the heap), but Vec<u8> read buffers and BlkReq/BlkResp structs live on the heap. Sustained I/O workloads should remain well within the limit.

VirtIO 9P Host Directory Sharing

Overview

The kernel includes a VirtIO 9P (9P2000.L) driver that shares a host directory directly into the guest via QEMU’s -fsdev mechanism. This provides a Docker-volume-like workflow: edit files on the host, they appear instantly in the guest — no disk image rebuild needed.

The driver uses the VirtIO9p device from the virtio-drivers crate (v0.13) and implements a minimal read-only 9P2000.L client on top.


Architecture

Host directory (./user)
        │
  QEMU virtio-9p-pci device
        │
        │  PciTransport (virtio-drivers)
        ▼
  VirtIO9p<KernelHal, PciTransport>   ← virtio device, raw request/response
        │
  spin::Mutex (synchronous access)
        │
  P9Client                             ← 9P2000.L protocol client
        │
  Plan9Vfs                             ← VFS adapter
        │
  devices::vfs mount table             ← /host and optionally /

Components

devices/src/virtio/p9_proto.rs — Wire Protocol

Minimal 9P2000.L message encoding/decoding. All messages use little-endian wire format with a 7-byte header: size[4] type[1] tag[2].

Message pairs implemented

T-message   R-message   Type codes   Purpose
Tversion    Rversion    100 / 101    Protocol handshake (negotiates msize)
Tattach     Rattach     104 / 105    Mount filesystem, get root fid
Twalk       Rwalk       110 / 111    Traverse path components
Tlopen      Rlopen      12 / 13      Open a fid for reading
Tread       Rread       116 / 117    Read file data
Treaddir    Rreaddir    40 / 41      Read directory entries
Tgetattr    Rgetattr    24 / 25      Get file attributes (mode, size)
Tclunk      Rclunk      120 / 121    Release a fid

Error responses use Rlerror (type 7) with a Linux errno code.
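
The 7-byte header described above is simple enough to sketch on the host; this is an illustrative encode/decode pair following the size[4] type[1] tag[2] little-endian layout, not the driver's actual p9_proto.rs code:

```rust
/// Encode a 9P header: total message size, message type, tag — all LE.
fn encode_header(total_size: u32, msg_type: u8, tag: u16) -> [u8; 7] {
    let mut h = [0u8; 7];
    h[0..4].copy_from_slice(&total_size.to_le_bytes());
    h[4] = msg_type;
    h[5..7].copy_from_slice(&tag.to_le_bytes());
    h
}

/// Decode the header of a received message; None if the buffer is short.
fn decode_header(buf: &[u8]) -> Option<(u32, u8, u16)> {
    if buf.len() < 7 {
        return None;
    }
    Some((
        u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]),
        buf[4],
        u16::from_le_bytes([buf[5], buf[6]]),
    ))
}

fn main() {
    // Tversion is type 100; the size field counts the header itself.
    let h = encode_header(19, 100, 0xFFFF);
    assert_eq!(decode_header(&h), Some((19, 100, 0xFFFF)));
    assert_eq!(decode_header(&[0u8; 3]), None);
}
```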

Key types

#![allow(unused)]
fn main() {
pub struct Qid { pub qid_type: u8, pub version: u32, pub path: u64 }
pub struct DirEntry9p { pub qid: Qid, pub offset: u64, pub dtype: u8, pub name: String }
pub struct Stat9p { pub mode: u32, pub size: u64, pub qid: Qid }
}

devices/src/virtio/p9.rs — P9Client

High-level client wrapping VirtIO9p. The device is accessed synchronously through a spin::Mutex — no actor pattern needed since 9P access happens from syscall context via osl::blocking::blocking().

#![allow(unused)]
fn main() {
pub struct P9Client {
    device: Mutex<VirtIO9p<KernelHal, PciTransport>>,
    msize:  u32,       // negotiated max message size (typically 8192)
    next_fid: Mutex<u32>,
}
}

Construction

P9Client::new(transport) performs the handshake:

  1. Tversion — negotiates protocol version (“9P2000.L”) and max message size
  2. Tattach — attaches root fid (fid 0) to the shared directory

Public methods

Method            Flow
list_dir(path)    walk → lopen → readdir (loop) → clunk
read_file(path)   walk → getattr (size) → lopen → read (loop) → clunk
stat(path)        walk → getattr → clunk

Each method walks from the root fid, allocating a temporary fid that is clunked after the operation completes. The readdir and read loops consume data in chunks of msize - 64 bytes.

list_dir filters out . and .. entries automatically.
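
The msize - 64 chunking rule determines how many Tread round-trips a file costs; a small sketch of that arithmetic (the 64-byte headroom figure is taken from the text above, the function name is illustrative):

```rust
/// Number of Tread round-trips needed to fetch `file_size` bytes when
/// each request carries at most `msize - 64` bytes of payload.
fn read_round_trips(file_size: u64, msize: u32) -> u64 {
    let chunk = (msize as u64).saturating_sub(64);
    if file_size == 0 {
        return 0;
    }
    // Ceiling division: ceil(file_size / chunk).
    (file_size + chunk - 1) / chunk
}

fn main() {
    // msize 8192 → 8128-byte chunks; a 16 KiB file takes 3 reads.
    assert_eq!(read_round_trips(16 * 1024, 8192), 3);
    assert_eq!(read_round_trips(8128, 8192), 1);
    assert_eq!(read_round_trips(0, 8192), 0);
}
```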


devices/src/vfs/plan9_vfs.rs — VFS Adapter

Follows the ExfatVfs pattern. Wraps an Arc<P9Client> and maps:

  • DirEntry9p → VfsDirEntry (dtype 4 or qid type 0x80 → is_dir)
  • P9Error → VfsError

The P9Client methods are synchronous but the VFS interface is async. Since the virtio-9p device uses polling (no IRQ), blocking in an async context is acceptable for MVP.


QEMU Configuration

In scripts/run.sh:

-fsdev local,id=fsdev0,path=./user,security_model=none \
-device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostfs

This shares the ./user directory (where userspace binaries are built) into the guest. security_model=none disables host permission mapping, which is appropriate since the guest is read-only.


Boot Sequence

run_kernel()
  1. PCI probe: find_devices(0x1AF4, 0x1049)   ← modern virtio-9p
                find_devices(0x1AF4, 0x1009)   ← legacy
  2. create_pci_transport(bus, dev, func)
  3. P9Client::new(transport)
       └─ Tversion + Tattach handshake
  4. Arc::new(client)
  5. vfs::mount("/host", Plan9(Plan9Vfs::new(Arc::clone(&client))))
  6. If no virtio-blk:
       vfs::mount("/", Plan9(Plan9Vfs::new(client)))

PCI Device IDs

Device ID        Variant
0x1AF4:0x1049    Modern virtio-9p (probed first)
0x1AF4:0x1009    Legacy virtio-9p

Key Files

File                             Role
devices/src/virtio/p9_proto.rs   9P2000.L wire protocol encode/decode
devices/src/virtio/p9.rs         P9Client — high-level 9P client
devices/src/vfs/plan9_vfs.rs     Plan9Vfs — VFS adapter
devices/src/virtio/mod.rs        KernelHal, create_pci_transport (shared with blk)
kernel/src/main.rs               9P probe, mount at /host and fallback /
scripts/run.sh                   QEMU -fsdev and -device virtio-9p-pci flags

Limitations

Read-only

The 9P client only implements read operations (walk, lopen, read, readdir, getattr). Write, create, mkdir, remove, and rename are not supported.

No fid recycling

Fid numbers are allocated monotonically and never reused. With 32-bit fids this is unlikely to be a problem in practice, but a long-running system performing many file operations would eventually exhaust the fid space.

Directory entry sizes

list_dir reports size: 0 for all entries because readdir does not return file sizes. A per-entry getattr could be added but would increase the number of 9P round-trips.

Single device

Only one virtio-9p device is probed. Multiple shared directories would require iterating over all matching PCI devices and mounting each at a different path.

Synchronous I/O

All 9P operations block the calling scheduler thread. This is acceptable when called from syscall context via osl::blocking::blocking(), but direct use from async tasks would stall the executor.

exFAT Read-Only Filesystem

Overview

The kernel includes a read-only exFAT filesystem driver that sits on top of the virtio-blk block device. It auto-detects bare exFAT volumes, MBR-partitioned disks, and GPT-partitioned disks, then exposes simple directory-listing and file-read operations through the shell.

The driver is implemented entirely in devices/src/virtio/exfat.rs with no external dependencies.


Architecture

Shell (ls / cat / cd / pwd)
        │
        │  open_exfat / list_dir / read_file
        ▼
  ExfatVol  ──── async sector reads ────▶  BlkInbox
        │                                     │
  Partition detection                   VirtioBlkActor
  Boot sector parse                     (virtio-blk driver)
  FAT traversal
  Dir entry parse
  Path walk

All filesystem I/O is done one 512-byte sector at a time via the ask pattern on the virtio-blk actor’s mailbox (VirtioBlkMsg::Read).


Partition Auto-Detection

open_exfat reads sector 0 and applies the following decision tree:

sector0[3..11] == "EXFAT   "
  → bare exFAT (no partition table); volume starts at LBA 0

sector0[510..512] == [0x55, 0xAA]
  read sector 1
  sector1[0..8] == "EFI PART"
    → GPT: scan partition entries starting at the LBA stored in the header
      look for type GUID = EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
      (on-disk mixed-endian: A2 A0 D0 EB E5 B9 33 44 87 C0 68 B6 B7 26 99 C7)
      read StartingLBA of the matching entry, verify "EXFAT   " there
  else
    → MBR: scan partition table at sector0[446..510]
      look for entry with type byte 0x07
      read LBA start (bytes 8–11 LE u32), verify "EXFAT   " there

else
  → ExfatError::UnknownPartitionLayout

Type 0x07 is shared by exFAT and NTFS. The driver always verifies the OEM name at the candidate partition’s first sector before accepting it as exFAT.
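
A host-testable sketch of the first level of this decision tree, operating on sector 0 only (the GPT-vs-MBR disambiguation reads sector 1 and is elided here; the Layout enum is illustrative, not the driver's actual types):

```rust
#[derive(Debug, PartialEq)]
enum Layout {
    BareExfat,        // OEM name at sector0[3..11] — no partition table
    MaybePartitioned, // 0x55AA boot signature — go on to check GPT, then MBR
    Unknown,          // neither — ExfatError::UnknownPartitionLayout
}

fn classify_sector0(s: &[u8; 512]) -> Layout {
    if &s[3..11] == b"EXFAT   " {
        Layout::BareExfat
    } else if s[510..512] == [0x55, 0xAA] {
        Layout::MaybePartitioned
    } else {
        Layout::Unknown
    }
}

fn main() {
    let mut bare = [0u8; 512];
    bare[3..11].copy_from_slice(b"EXFAT   ");
    assert_eq!(classify_sector0(&bare), Layout::BareExfat);

    let mut mbr = [0u8; 512];
    mbr[510] = 0x55;
    mbr[511] = 0xAA;
    assert_eq!(classify_sector0(&mbr), Layout::MaybePartitioned);

    assert_eq!(classify_sector0(&[0u8; 512]), Layout::Unknown);
}
```

Note the ordering: a bare exFAT volume also carries the 0x55AA signature, so the OEM-name check must come first.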


On-Disk Layout

Boot Sector

Offset   Size   Field                         Notes
3        8      FileSystemName                Must equal "EXFAT   " (padded with spaces to 8 bytes)
80       4      FatOffset                     Sectors from volume start to FAT
88       4      ClusterHeapOffset             Sectors from volume start to data region
96       4      FirstClusterOfRootDirectory   Cluster number of root dir
109      1      SectorsPerClusterShift        sectors_per_cluster = 1 << shift
510      2      BootSignature                 Must equal [0x55, 0xAA]
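
Pulling those fields out of a raw 512-byte sector is mechanical; a sketch using the offsets listed above (the tuple return type is illustrative — the real driver fills an ExfatVol):

```rust
/// Parse (FatOffset, ClusterHeapOffset, root cluster, sectors/cluster)
/// from an exFAT boot sector, or None if the signatures don't match.
fn parse_boot(s: &[u8; 512]) -> Option<(u32, u32, u32, u64)> {
    if &s[3..11] != b"EXFAT   " || s[510..512] != [0x55, 0xAA] {
        return None;
    }
    let fat_offset = u32::from_le_bytes(s[80..84].try_into().ok()?);
    let heap_offset = u32::from_le_bytes(s[88..92].try_into().ok()?);
    let root_cluster = u32::from_le_bytes(s[96..100].try_into().ok()?);
    let sectors_per_cluster = 1u64 << s[109]; // SectorsPerClusterShift
    Some((fat_offset, heap_offset, root_cluster, sectors_per_cluster))
}

fn main() {
    let mut s = [0u8; 512];
    s[3..11].copy_from_slice(b"EXFAT   ");
    s[510] = 0x55;
    s[511] = 0xAA;
    s[80..84].copy_from_slice(&128u32.to_le_bytes());
    s[88..92].copy_from_slice(&256u32.to_le_bytes());
    s[96..100].copy_from_slice(&4u32.to_le_bytes());
    s[109] = 3; // shift 3 → 8 sectors per cluster
    assert_eq!(parse_boot(&s), Some((128, 256, 4, 8)));
    assert_eq!(parse_boot(&[0u8; 512]), None);
}
```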

FAT (File Allocation Table)

An array of u32 little-endian values. Entry N holds the next cluster in the chain for cluster N, or 0xFFFFFFFF for end-of-chain.

fat_lba          = volume_lba + FatOffset
sector_of_entry  = fat_lba + (N * 4) / 512
byte_in_sector   = (N * 4) % 512

Cluster Heap

cluster_lba(N) = cluster_heap_lba + (N − 2) * sectors_per_cluster

Cluster numbers start at 2; clusters 0 and 1 are reserved.
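
The two address formulas above can be sketched as plain functions (the constants in the asserts are made-up example values, not from a real volume):

```rust
const SECTOR_SIZE: u64 = 512;

/// Sector LBA and byte offset of FAT entry N (the FAT is an array of LE u32s).
fn fat_entry_location(fat_lba: u64, n: u64) -> (u64, u64) {
    (fat_lba + (n * 4) / SECTOR_SIZE, (n * 4) % SECTOR_SIZE)
}

/// First LBA of cluster N; clusters 0 and 1 are reserved, so N starts at 2.
fn cluster_lba(cluster_heap_lba: u64, sectors_per_cluster: u64, n: u64) -> u64 {
    cluster_heap_lba + (n - 2) * sectors_per_cluster
}

fn main() {
    // FAT at LBA 128: entry 200 sits at byte 800, i.e. sector 129, offset 288.
    assert_eq!(fat_entry_location(128, 200), (129, 288));
    // Heap at LBA 256, 8 sectors/cluster: cluster 2 is the first heap sector.
    assert_eq!(cluster_lba(256, 8, 2), 256);
    assert_eq!(cluster_lba(256, 8, 5), 280);
}
```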

Directory Entry Sets

Each entry is 32 bytes. A file or directory is represented by a consecutive set of three or more entries:

Type byte   Name               Key fields
0x85        File               [1] SecondaryCount; [4..6] FileAttributes (bit 4 = directory)
0xC0        Stream Extension   [8..16] DataLength (u64 LE); [20..24] FirstCluster (u32 LE)
0xC1+       File Name          [2..32] up to 15 UTF-16LE code units per entry

Type 0x00 marks the end of directory; scanning stops immediately. Any type byte with bit 7 clear (< 0x80) is an unused or deleted entry and is skipped.
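
The type-byte dispatch over a directory buffer looks roughly like this (a sketch that only counts in-use entries; the real driver decodes each 32-byte entry instead):

```rust
/// Scan a directory cluster: stop at the 0x00 end marker, skip entries
/// with bit 7 clear (unused/deleted), count everything in use.
fn count_in_use(dir: &[u8]) -> usize {
    let mut n = 0;
    for entry in dir.chunks_exact(32) {
        match entry[0] {
            0x00 => break,      // end-of-directory — stop immediately
            t if t < 0x80 => {} // unused or deleted entry — skip
            _ => n += 1,        // in-use: 0x85 primary, 0xC0/0xC1 secondaries, …
        }
    }
    n
}

fn main() {
    let mut dir = vec![0u8; 32 * 5];
    dir[0] = 0x85;  // file primary entry
    dir[32] = 0xC0; // stream extension
    dir[64] = 0x05; // deleted entry (bit 7 clear) — skipped
    dir[96] = 0xC1; // file-name entry
    // entry 4 stays 0x00 → end of directory
    assert_eq!(count_in_use(&dir), 3);
}
```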


ExfatVol State

#![allow(unused)]
fn main() {
pub struct ExfatVol {
    lba_base:            u64,  // absolute LBA of the exFAT boot sector
    sectors_per_cluster: u64,
    fat_lba:             u64,  // absolute LBA of the FAT
    cluster_heap_lba:    u64,  // absolute LBA of the cluster heap
    root_cluster:        u32,
}
}

This is returned by open_exfat and passed to every subsequent call. The shell calls open_exfat fresh on each command (stateless).


Public API

#![allow(unused)]
fn main() {
/// Auto-detect layout and open the exFAT volume.
pub async fn open_exfat(inbox: &BlkInbox) -> Result<ExfatVol, ExfatError>;

/// List directory at `path` (e.g. "/" or "/docs").
pub async fn list_dir(vol: &ExfatVol, inbox: &BlkInbox, path: &str)
    -> Result<Vec<DirEntry>, ExfatError>;

/// Read a file into memory.  Capped at 16 KiB.
pub async fn read_file(vol: &ExfatVol, inbox: &BlkInbox, path: &str)
    -> Result<Vec<u8>, ExfatError>;
}
#![allow(unused)]
fn main() {
pub struct DirEntry {
    pub name:   String,
    pub is_dir: bool,
    pub size:   u64,
}

pub enum ExfatError {
    NoDevice, IoError, NotExfat, UnknownPartitionLayout,
    PathNotFound, NotAFile, NotADirectory, FileTooLarge,
}
}

BlkInbox is a type alias for the virtio-blk actor’s mailbox:

#![allow(unused)]
fn main() {
pub type BlkInbox = Arc<Mailbox<ActorMsg<VirtioBlkMsg, VirtioBlkInfo>>>;
}

Path Resolution

The shell maintains a current working directory (CWD) in Shell::cwd (spin::Mutex<String>, default "/").

resolve_path(cwd, path) in kernel/src/shell.rs handles relative and absolute paths, then normalize_path collapses . and .. components:

cwd = "/a/b"
resolve("../c")  →  normalize("/a/b/../c")  →  "/a/c"
resolve("/foo")  →  "/foo"
resolve("")      →  "/a/b"   (defaults to CWD)

Path component matching in the driver is case-insensitive ASCII (str::eq_ignore_ascii_case). Non-ASCII filename characters are replaced with ? in the decoded string.
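
A minimal host-side sketch of resolve + normalize as described above; the real helpers live in kernel/src/shell.rs and may differ in detail:

```rust
/// Join `path` against `cwd` (absolute paths win, empty means CWD),
/// then normalize the result.
fn resolve_path(cwd: &str, path: &str) -> String {
    let joined = if path.is_empty() {
        cwd.to_string()
    } else if path.starts_with('/') {
        path.to_string()
    } else {
        format!("{}/{}", cwd.trim_end_matches('/'), path)
    };
    normalize_path(&joined)
}

/// Collapse `.` and `..` components; `..` at the root is a no-op.
fn normalize_path(path: &str) -> String {
    let mut parts: Vec<&str> = Vec::new();
    for comp in path.split('/') {
        match comp {
            "" | "." => {}
            ".." => {
                parts.pop();
            }
            c => parts.push(c),
        }
    }
    if parts.is_empty() {
        "/".to_string()
    } else {
        format!("/{}", parts.join("/"))
    }
}

fn main() {
    assert_eq!(resolve_path("/a/b", "../c"), "/a/c");
    assert_eq!(resolve_path("/a/b", "/foo"), "/foo");
    assert_eq!(resolve_path("/a/b", ""), "/a/b");
    assert_eq!(resolve_path("/", "../.."), "/");
}
```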


Shell Commands

Command          Description
ls [path]        List directory; defaults to CWD
cat <path>       Print file as text; non-printable bytes shown as .
pwd              Print current working directory
cd [path]        Change CWD; verifies the target exists; defaults to /
blk ls [path]    Alias for ls
blk cat <path>   Alias for cat

cd calls list_dir on the target path before updating the CWD, so invalid paths are rejected with an error rather than silently accepted.


Memory Budget

Peak heap usage during ls:

Item                                   Size
Boot sector                            512 B
FAT sector (per entry lookup)          512 B
Cluster data (typical 4 KiB cluster)   4 KiB
Vec<DirEntry>                          small
Total                                  ~5 KiB

read_file caps output at 16 KiB. The kernel heap is 100 KiB; both operations are well within budget.


Limitations

Read-only

Write support is not implemented. VirtioBlkMsg::Write exists in the block driver but the exFAT layer has no write path.

Entry sets crossing cluster boundaries

scan_dir_cluster collects all sectors of a cluster into a flat buffer before parsing entries. An entry set whose 0x85 primary entry is in one cluster and whose secondary entries start in the next cluster will be silently skipped. This situation does not arise on normally-formatted volumes where directories start empty.

ASCII-only filenames

UTF-16LE code points above U+007F are replaced with ?. Files can still be opened by name if the shell command uses the same replacement — but in practice, test images should use ASCII filenames.

Fresh volume open per command

open_exfat reads the boot sector (and up to ~32 GPT entry sectors) on every shell command. A cached ExfatVol stored in the shell actor would reduce overhead, but is unnecessary given the current workload.

16 KiB file cap

read_file returns ExfatError::FileTooLarge for files exceeding 16 KiB. The limit exists to protect the 100 KiB heap; it can be raised if the heap is grown.


Key Files

| File                        | Role                                                                            |
|-----------------------------|---------------------------------------------------------------------------------|
| devices/src/virtio/exfat.rs | Partition detection, boot parse, FAT traversal, dir scan, path walk, public API |
| devices/src/virtio/mod.rs   | Re-exports BlkInbox, DirEntry, ExfatError, ExfatVol, public functions           |
| kernel/src/shell.rs         | cmd_blk_ls, cmd_blk_cat, cmd_cd, cmd_pwd, resolve_path, normalize_path          |

Creating Test Images

GPT (macOS default)

hdiutil create -size 32m -fs ExFAT -volname TEST test-gpt.dmg
hdiutil attach test-gpt.dmg
cp hello.txt /Volumes/TEST/
mkdir /Volumes/TEST/subdir
cp nested.txt /Volumes/TEST/subdir/
hdiutil detach /Volumes/TEST
hdiutil convert test-gpt.dmg -format UDRO -o test-gpt-ro.dmg

MBR-partitioned

diskutil eraseDisk ExFAT TEST MBRFormat /dev/diskN

Bare exFAT (no partition table)

diskutil eraseVolume ExFAT TEST /dev/diskN

Running in QEMU

qemu-system-x86_64 ... \
  -drive file=test-gpt.img,format=raw,if=none,id=hd0 \
  -device virtio-blk-pci,drive=hd0

Then in the shell:

ostoo:/> ls
  [DIR]        subdir
  [FILE    13] hello.txt
ostoo:/> cat /hello.txt
Hello, kernel!
ostoo:/> cd subdir
ostoo:/subdir> ls
  [FILE    11] nested.txt
ostoo:/subdir> cat nested.txt
Hello again!
ostoo:/subdir> cd /
ostoo:/> pwd
/

Virtual Filesystem (VFS) Layer

Overview

The VFS layer provides a uniform path namespace over multiple filesystems. Before its introduction, the shell called the exFAT driver directly; adding a second filesystem would have required invasive shell changes. The VFS decouples path resolution and filesystem dispatch so that new drivers slot in without touching the shell.

Key properties:

  • Enum dispatch — no heap-allocating Pin<Box<dyn Future>> trait objects.
  • Mount table — filesystems are attached at arbitrary absolute paths.
  • Lock safety — the mount-table lock is never held across an await point.
  • No new Cargo dependencies — everything already present in the workspace.

Source layout

devices/src/
  vfs/
    mod.rs          — public API, mount table, path resolution
    exfat_vfs.rs    — ExfatVfs: wraps virtio-blk + exFAT driver
    plan9_vfs.rs    — Plan9Vfs: wraps virtio-9p P9Client
    proc_vfs/       — ProcVfs: synthetic kernel-info filesystem (mod.rs + generator submodules)

Public API (devices::vfs)

// Types
pub struct VfsDirEntry { pub name: String, pub is_dir: bool, pub size: u64 }

pub enum VfsError {
    IoError, NotFound, NotAFile, NotADirectory, FileTooLarge, NoFilesystem,
}

pub enum AnyVfs { Exfat(ExfatVfs), Plan9(Plan9Vfs), Proc(ProcVfs) }

// Functions
pub fn  mount(mountpoint: &str, fs: AnyVfs);
pub async fn list_dir(path: &str)  -> Result<Vec<VfsDirEntry>, VfsError>;
pub async fn read_file(path: &str) -> Result<Vec<u8>,          VfsError>;
pub fn  with_mounts<F: FnOnce(&[(String, Arc<AnyVfs>)])>(f: F);

All paths supplied to list_dir and read_file must be absolute (the shell’s resolve_path runs first and normalises . / ..).


Enum dispatch

Async methods on trait objects require Pin<Box<dyn Future>> — allocating and verbose in no_std. Instead, AnyVfs is a plain enum:

pub enum AnyVfs {
    Exfat(ExfatVfs),
    Plan9(Plan9Vfs),
    Proc(ProcVfs),
}

impl AnyVfs {
    pub async fn list_dir(&self, path: &str) -> Result<Vec<VfsDirEntry>, VfsError> {
        match self {
            AnyVfs::Exfat(fs) => fs.list_dir(path).await,
            AnyVfs::Plan9(fs) => fs.list_dir(path).await,
            AnyVfs::Proc(fs)  => fs.list_dir(path).await,
        }
    }
    // read_file, fs_type likewise
}

Adding a new filesystem = add one variant + three match arms (list_dir, read_file, fs_type).


Mount table

lazy_static! {
    static ref MOUNTS: spin::Mutex<Vec<(String, Arc<AnyVfs>)>> = ...;
}

Entries are kept sorted longest-mountpoint-first so resolution is a simple linear scan — the first match wins without any backtracking.

mount() replaces an existing entry at the same mountpoint, then re-sorts.

Arc<AnyVfs> is cloned out of the lock before any .await; the spinlock is never held across a suspension point.
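The replace-then-re-sort step can be sketched as follows. This is a hypothetical standalone version using string values; the real table stores `(String, Arc<AnyVfs>)` pairs behind the spinlock:

```rust
/// Insert or replace a mount entry, keeping the table sorted
/// longest-mountpoint-first so the resolver's first match wins.
/// Sketch only; the real table stores Arc<AnyVfs> values.
fn mount(table: &mut Vec<(String, &'static str)>, mountpoint: &str, fs: &'static str) {
    table.retain(|(mp, _)| mp != mountpoint);        // replace an existing entry
    table.push((mountpoint.to_string(), fs));
    table.sort_by(|a, b| b.0.len().cmp(&a.0.len())); // longest mountpoint first
}
```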

Path resolution rules

| Situation         | Mountpoint | Request path | Rel path passed to driver |
|-------------------|------------|--------------|---------------------------|
| Exact match       | /proc      | /proc        | /                         |
| Prefix match      | /proc      | /proc/tasks  | /tasks                    |
| Root pass-through | /          | /docs/foo    | /docs/foo                 |
| No match          |            | /missing     | VfsError::NoFilesystem    |
fn resolve(path: &str) -> Option<(Arc<AnyVfs>, String)> {
    for (mp, fs) in MOUNTS.lock().iter() {
        if mp == "/"          { return Some((Arc::clone(fs), path.into())); }
        if path == mp         { return Some((Arc::clone(fs), "/".into())); }
        if path.starts_with(mp.as_str()) && path[mp.len()..].starts_with('/') {
            return Some((Arc::clone(fs), path[mp.len()..].into()));
        }
    }
    None
}

ExfatVfs

ExfatVfs wraps a BlkInbox (the virtio-blk actor’s mailbox) and delegates to the existing devices::virtio::exfat functions. It calls open_exfat fresh on every request — identical to the pre-VFS shell behaviour.

ExfatVfs::list_dir / read_file
    └─ exfat::open_exfat  (detects bare/MBR/GPT layout)
    └─ exfat::list_dir / read_file

ExfatError → VfsError mapping:

| ExfatError                                             | VfsError      |
|--------------------------------------------------------|---------------|
| NoDevice / IoError / NotExfat / UnknownPartitionLayout | IoError       |
| PathNotFound                                           | NotFound      |
| NotAFile                                               | NotAFile      |
| NotADirectory                                          | NotADirectory |
| FileTooLarge                                           | FileTooLarge  |

Plan9Vfs

Plan9Vfs wraps an Arc<P9Client> and delegates to the 9P2000.L client. Unlike ExfatVfs (which goes through the actor/mailbox path), the P9 client performs synchronous virtio-9p device I/O directly under a spin::Mutex.

Plan9Vfs::list_dir / read_file
    └─ P9Client::list_dir / read_file
        └─ VirtIO9p::request (virtio-drivers)

P9Error → VfsError mapping:

| P9Error                                      | VfsError      |
|----------------------------------------------|---------------|
| ServerError(2) (ENOENT)                      | NotFound      |
| ServerError(20) (ENOTDIR)                    | NotADirectory |
| ServerError(21) (EISDIR)                     | NotAFile      |
| ServerError(_) / DeviceError                 | IoError       |
| BufferTooSmall / InvalidResponse / Utf8Error | IoError       |

The list_dir result sets is_dir from the dirent’s dtype field (4 = DT_DIR) or the qid type bit (0x80 = directory). The size field is 0 since readdir does not report file sizes — a follow-up stat per entry could be added later.
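The dtype/qid fallback can be sketched in a few lines. The constant values follow the conventions cited above (DT_DIR = 4 from the Linux dirent convention, 0x80 as the directory bit in the 9P qid type byte); the function name is hypothetical:

```rust
const DT_DIR: u8 = 4;           // dirent d_type value for directories
const QID_TYPE_DIR: u8 = 0x80;  // QTDIR bit in the 9P qid type byte

/// Sketch of the is_dir decision: trust the dirent dtype when set,
/// otherwise fall back to the qid type bit.
fn entry_is_dir(dtype: u8, qid_type: u8) -> bool {
    dtype == DT_DIR || (qid_type & QID_TYPE_DIR) != 0
}
```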

See docs/virtio-9p.md for the full 9P driver documentation.


ProcVfs

A synthetic filesystem with no block I/O. All content is computed on demand.

| VFS path      | Relative path seen by driver | Content                          |
|---------------|------------------------------|----------------------------------|
| /proc         | /                            | directory listing                |
| /proc/tasks   | /tasks                       | ready: N waiting: M\n            |
| /proc/uptime  | /uptime                      | Ns\n                             |
| /proc/drivers | /drivers                     | one "name State" line per driver |

Data sources:

  • executor::ready_count() / executor::wait_count() — task queue depths
  • timer::ticks() / TICKS_PER_SECOND — seconds since boot
  • driver::with_drivers() — registered driver names and states

Kernel initialisation (kernel/src/main.rs)

// Probe virtio-9p and create a shared P9Client.
let p9_client = probe_9p();  // returns Option<Arc<P9Client>>

// If 9p is available, always mount at /host.
if let Some(ref client) = p9_client {
    devices::vfs::mount("/host", AnyVfs::Plan9(Plan9Vfs::new(Arc::clone(client))));
}

// Always mount /proc — available without a block device.
devices::vfs::mount("/proc", AnyVfs::Proc(ProcVfs));

// Mount exFAT at / if virtio-blk was probed successfully.
let have_blk = if let Some(inbox) = registry::get::<..>("virtio-blk") {
    devices::vfs::mount("/", AnyVfs::Exfat(ExfatVfs::new(inbox)));
    true
} else { false };

// Fallback: mount 9p at / if no disk image is present.
if !have_blk {
    if let Some(client) = p9_client {
        devices::vfs::mount("/", AnyVfs::Plan9(Plan9Vfs::new(client)));
    }
}

This runs after both the virtio-blk and virtio-9p probe blocks and before task spawning. When both are present, exFAT owns / and 9p is at /host. When only 9p is present, it is mounted at both /host and / so that /shell auto-launch works without a disk image.


Shell integration (kernel/src/shell.rs)

The shell commands ls, cat, and cd now call the VFS API instead of the exFAT driver directly:

ls [path]   →  devices::vfs::list_dir(&path).await
cat <path>  →  devices::vfs::read_file(&path).await
cd [path]   →  devices::vfs::list_dir(&target).await  (directory check)

A new mount command manages the mount table at runtime:

mount                   — list all mounts
mount proc <mountpoint> — attach a ProcVfs instance
mount blk  <mountpoint> — attach an ExfatVfs instance (requires virtio-blk)

Example session

# Boot with 9p only (no disk image)
ostoo:/> mount
  /       9p
  /host   9p
  /proc   proc
ostoo:/> ls /
         shell
ostoo:/> ls /host
         shell
ostoo:/> cat /proc/uptime
42s

# Boot with both disk image and 9p
ostoo:/> mount
  /       exfat
  /host   9p
  /proc   proc
ostoo:/> ls /
  [DIR]        subdir
  [FILE    13] hello.txt
ostoo:/> ls /host
         shell
ostoo:/> cat /host/shell | head
(binary ELF data)

Extending the VFS

To add a new filesystem type:

  1. Create devices/src/vfs/<name>_vfs.rs implementing list_dir and read_file as plain async fn.
  2. Add a variant to AnyVfs in mod.rs and matching arms in list_dir, read_file, and fs_type.
  3. Re-export the new type from mod.rs.
  4. Mount it from main.rs or the shell’s mount command.

No changes to the shell dispatch loop or path-resolution logic are required.

IPC Channels

Overview

Capability-based IPC channels for structured message passing between processes. A channel is a unidirectional message conduit with configurable buffer capacity. Channels come in pairs: a send end and a receive end, each exposed as a file descriptor (capability).

The buffer capacity, set at creation time, determines the communication model:

  • capacity = 0 – Synchronous rendezvous. The sender blocks until a receiver calls recv; the message transfers directly, with scheduler quantum donation for minimal latency (matching seL4 endpoint characteristics).
  • capacity > 0 – Asynchronous buffered. Sender enqueues and returns immediately. Blocks only when the buffer is full.

This gives applications full control: create a sync channel for tight RPC-style communication, or an async channel for decoupled producer-consumer patterns.
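The two models map directly onto Rust's std::sync::mpsc::sync_channel, where a bound of 0 is likewise a rendezvous. This is a userspace illustration of the semantics, not kernel code:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn demo() -> (i32, i32) {
    // capacity > 0: the sender can enqueue without a receiver present.
    let (tx, rx) = sync_channel(4);
    for i in 0..4 { tx.send(i).unwrap(); }        // returns immediately
    let buffered = rx.iter().take(4).last().unwrap();

    // capacity = 0: send blocks until the other side calls recv (rendezvous).
    let (tx0, rx0) = sync_channel(0);
    let h = thread::spawn(move || tx0.send(42).unwrap()); // blocks until recv below
    let rendezvous = rx0.recv().unwrap();
    h.join().unwrap();
    (buffered, rendezvous)
}
```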


Message Format

struct ipc_message {
    uint64_t tag;       /* user-defined message type */
    uint64_t data[3];   /* 24 bytes of inline payload */
    int32_t  fds[4];    /* file descriptors for capability passing (-1 = unused) */
};
/* Total: 48 bytes */

The tag field is opaque to the kernel – applications use it to identify message types. The data array carries the payload (pointers, handles, small structs). The fds array carries file descriptors for capability passing (set unused slots to -1). For bulk data, use shared memory with a channel for signaling.
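The kernel-side Rust equivalent mirrors the C layout. The real IpcMessage lives in libkernel/src/channel.rs; this sketch just checks that the fields add up to 48 bytes:

```rust
/// Sketch of the kernel-side message struct; field layout mirrors
/// the C definition above (8 + 24 + 16 = 48 bytes).
#[repr(C)]
pub struct IpcMessage {
    pub tag: u64,        // user-defined message type
    pub data: [u64; 3],  // 24 bytes of inline payload
    pub fds: [i32; 4],   // capability-passing slots (-1 = unused)
}
```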


Syscalls

ipc_create (505)

long ipc_create(int fds[2], unsigned capacity, unsigned flags);

Creates a channel pair. Writes the send-end fd to fds[0] and the receive-end fd to fds[1].

| Parameter | Description                                      |
|-----------|--------------------------------------------------|
| fds       | User pointer to a 2-element int array            |
| capacity  | Buffer capacity: 0 = sync, >0 = async buffered   |
| flags     | IPC_CLOEXEC (0x1): set close-on-exec on both fds |

Returns 0 on success, negative errno on failure.

ipc_send (506)

long ipc_send(int fd, const struct ipc_message *msg, unsigned flags);

Send a message through a send-end fd.

| Parameter | Description                                            |
|-----------|--------------------------------------------------------|
| fd        | Send-end file descriptor                               |
| msg       | Pointer to message in user memory                      |
| flags     | IPC_NONBLOCK (0x1): return -EAGAIN instead of blocking |

Blocking behavior:

  • Sync (cap=0): blocks until a receiver calls recv, then transfers directly
  • Async (cap>0): blocks only if the buffer is full

Returns 0 on success, -EPIPE if receive end is closed, -EAGAIN if non-blocking and would block.

ipc_recv (507)

long ipc_recv(int fd, struct ipc_message *msg, unsigned flags);

Receive a message from a receive-end fd.

| Parameter | Description                                            |
|-----------|--------------------------------------------------------|
| fd        | Receive-end file descriptor                            |
| msg       | Pointer to buffer in user memory                       |
| flags     | IPC_NONBLOCK (0x1): return -EAGAIN instead of blocking |

Returns 0 on success, -EPIPE if send end is closed and no messages remain.


Examples

Sync channel (capacity=0)

int fds[2];
ipc_create(fds, 0, 0);         /* sync channel */
int send_fd = fds[0], recv_fd = fds[1];

/* In child (after clone+execve, with recv_fd inherited): */
struct ipc_message msg;
ipc_recv(recv_fd, &msg, 0);    /* blocks until parent sends */

/* In parent: */
struct ipc_message req = { .tag = 1, .data = {42, 0, 0}, .fds = {-1, -1, -1, -1} };
ipc_send(send_fd, &req, 0);    /* blocks until child recvs, then donates */

Async channel (capacity=4)

int fds[2];
ipc_create(fds, 4, 0);         /* buffered, 4 messages */

/* Producer can send 4 messages without blocking: */
for (int i = 0; i < 4; i++) {
    struct ipc_message m = { .tag = i };
    ipc_send(fds[0], &m, 0);
}

/* Consumer drains: */
struct ipc_message m;
while (ipc_recv(fds[1], &m, IPC_NONBLOCK) == 0) {
    /* process m */
}
/* returns -EAGAIN when empty */

fd-passing (capability transfer)

/* Create a pipe and an IPC channel */
int pipe_fds[2], ch_fds[2];
pipe(pipe_fds);
ipc_create(ch_fds, 4, 0);

/* Send the pipe write-end through the channel */
struct ipc_message msg = {
    .tag = 1,
    .data = { 0, 0, 0 },
    .fds = { pipe_fds[1], -1, -1, -1 },   /* transfer pipe write-end */
};
ipc_send(ch_fds[0], &msg, 0);

/* Receive — kernel allocates a new fd for the pipe write-end */
struct ipc_message recv_msg;
ipc_recv(ch_fds[1], &recv_msg, 0);
int new_write_fd = recv_msg.fds[0];   /* new fd number in receiver */

write(new_write_fd, "hello", 5);      /* writes to the same pipe */

Semantics: When ipc_send is called with non-(-1) values in the fds array, the kernel looks up each fd in the sender’s fd table, increments reference counts, and stores the kernel objects inside the channel. When ipc_recv delivers the message, the kernel allocates new fds in the receiver’s fd table and rewrites msg.fds with the new fd numbers.

Error handling: If any fd in msg.fds is invalid, the entire send fails with -EBADF. If the receiver’s fd table is full, recv fails with -EMFILE.

Cleanup: If a message with transferred fds is never received (e.g., the channel is destroyed with messages in the queue), the kernel closes the transferred fd objects automatically.


Kernel Implementation

Files

| File                     | Purpose                                                              |
|--------------------------|----------------------------------------------------------------------|
| libkernel/src/channel.rs | ChannelInner kernel object, IpcMessage struct, send/recv/close logic |
| libkernel/src/file.rs    | FdObject::Channel(ChannelFd) variant, ChannelFd::Send/Recv           |
| osl/src/ipc.rs           | Syscall implementations (sys_ipc_create/send/recv)                   |
| osl/src/syscall_nr.rs    | SYS_IPC_CREATE=505, SYS_IPC_SEND=506, SYS_IPC_RECV=507               |

Sync rendezvous internals

When capacity=0, sender and receiver rendezvous directly:

  1. If receiver is already blocked: sender copies message to pending_send, unblocks receiver, donates quantum via set_donate_target + yield_now
  2. If no receiver: sender stores message in pending_send, records thread index, blocks via block_current_thread()
  3. Receiver wakes, takes message from pending_send, unblocks sender

This uses the same block_current_thread / unblock / donate primitives as pipes and waitpid (see docs/scheduler-donate.md).

Async buffered internals

Messages are stored in a VecDeque<IpcMessage> bounded by capacity:

  • Send: push to queue, wake blocked receiver if any
  • Recv: pop from queue, wake blocked sender if queue was full
  • Queue full: sender blocks until receiver drains
  • Queue empty: receiver blocks until sender enqueues
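The buffered core can be sketched as a bounded VecDeque with try-style operations. This is an assumed helper type, not the kernel's ChannelInner, and the blocking/wake-up half is reduced to comments:

```rust
use std::collections::VecDeque;

/// Sketch of the buffered channel core: a VecDeque bounded by `capacity`.
/// The real send/recv also block and unblock threads; omitted here.
struct Buffered { queue: VecDeque<u64>, capacity: usize }

impl Buffered {
    fn try_send(&mut self, tag: u64) -> Result<(), ()> {
        if self.queue.len() >= self.capacity {
            return Err(());            // full: the real sender would block here
        }
        self.queue.push_back(tag);     // would also wake a blocked receiver
        Ok(())
    }

    fn try_recv(&mut self) -> Option<u64> {
        self.queue.pop_front()         // would also wake a sender blocked on full
    }
}
```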

Design Decisions

Unidirectional: Simpler and more composable than bidirectional. For RPC, use two channels (request + reply). For server fan-in, share the send-end fd via dup/fork.

Fixed-size messages: No heap allocation per message. 48 bytes fits common control-plane payloads plus 4 file descriptors for capability passing. Bulk data should use shared memory.

Capacity determines semantics: The application chooses sync vs async at creation time, not at each send/recv. This makes the channel’s behavior predictable and matches the Go channels model.

IPC_NONBLOCK flag: Adds flexibility for polling patterns and try-send / try-recv without changing the channel’s fundamental semantics.

Channel as fd: Reuses the existing fd_table, close, dup2, CLOEXEC, and cleanup-on-exit infrastructure. No new kernel handle namespace.


Completion Port Integration

IPC channels can be multiplexed with other async I/O sources (IRQs, timers, file reads) via the completion port system.

OP_IPC_SEND (opcode 5)

Submit an IPC send as an async operation via io_submit. The message is read from user memory at submission time. If the channel can accept it immediately, a completion is posted right away. Otherwise the message is stored and the completion fires when a receiver drains space.

Submission fields:

| Field     | Value                                      |
|-----------|--------------------------------------------|
| opcode    | 5 (OP_IPC_SEND)                            |
| fd        | Channel send-end file descriptor           |
| buf_addr  | Pointer to user struct ipc_message to send |
| user_data | User-defined tag (returned in completion)  |

Completion fields:

| Field     | Value                                      |
|-----------|--------------------------------------------|
| opcode    | 5 (OP_IPC_SEND)                            |
| result    | 0 on success, -EPIPE if receive end closed |
| user_data | Same as submission                         |

OP_IPC_RECV (opcode 6)

Submit an IPC receive as an async operation via io_submit. When a message arrives on the channel, a completion is posted to the port with the message copied to the user-provided buffer.

Submission fields:

| Field     | Value                                     |
|-----------|-------------------------------------------|
| opcode    | 6 (OP_IPC_RECV)                           |
| fd        | Channel receive-end file descriptor       |
| buf_addr  | Pointer to user struct ipc_message buffer |
| user_data | User-defined tag (returned in completion) |

Completion fields:

| Field     | Value                                   |
|-----------|-----------------------------------------|
| opcode    | 6 (OP_IPC_RECV)                         |
| result    | 0 on success, -EPIPE if send end closed |
| user_data | Same as submission                      |

The message is copied to buf_addr by io_wait (same mechanism as OP_READ).

Semantics: Both operations are one-shot, like OP_IRQ_WAIT. Each submission handles exactly one message. Re-submit after each completion for continuous send/receive.

Example: event loop with IPC + timer

int port = io_create(0);
int fds[2];
ipc_create(fds, 4, 0);

struct ipc_message recv_buf;
struct io_submission subs[2] = {
    { .opcode = 6 /* OP_IPC_RECV */, .fd = fds[1],
      .buf_addr = (uint64_t)&recv_buf, .user_data = 1 },
    { .opcode = 1 /* OP_TIMEOUT */, .timeout_ns = 1000000000, .user_data = 2 },
};
io_submit(port, subs, 2);

struct io_completion comp;
io_wait(port, &comp, 1, 1, 0);
if (comp.user_data == 1) {
    /* IPC message received in recv_buf */
} else {
    /* timer fired */
}

Kernel internals

When io_submit processes OP_IPC_RECV, it calls arm_recv():

  1. If a message is already in the queue, posts a completion immediately
  2. Otherwise, registers the port on the channel (pending_port field)
  3. When a future ipc_send deposits a message, try_send detects the armed port and returns SendAction::PostToPort — the caller serializes the message and posts it to the port after releasing the channel lock

When io_submit processes OP_IPC_SEND, it calls arm_send():

  1. If the channel can accept the message (queue not full, or receiver waiting), delivers it and posts a success completion immediately
  2. Otherwise, stores the port + message in pending_send_port
  3. When a future ipc_recv drains space, try_recv detects the armed send port and returns RecvAction::MessageAndNotifySendPort — the caller posts a success completion to the send port

Lock ordering: channel lock is always acquired before port lock (never reversed), preventing deadlocks.


Future Extensions

  • ipc_call(fd, send_msg, recv_msg) – atomic send+recv for RPC
  • Bidirectional channels – two queues in one object
  • fd-passing in messages – implemented: fds[4] in IpcMessage, kernel transfers fd objects between processes
  • Notification objects – seL4-style bitmask signaling

Completion Port Design

Overview

This document describes a unified completion-based async I/O primitive (CompletionPort) for ostoo. The design supports both io_uring-style and Windows IOCP-style patterns through a single kernel object, accessed as an ordinary file descriptor.

The CompletionPort is motivated by the microkernel migration path (microkernel-design.md, Phases B-E) where userspace drivers need to wait on multiple event sources — IRQs, shared-memory ring wakeups, timers — through a single blocking wait point, without polling or managing multiple threads.

See also: mmap-design.md for the shared memory primitives that enable the zero-syscall ring optimisation (Phase 5 of this design).


Motivation

The Problem

A userspace NIC driver in the microkernel architecture must simultaneously wait for:

  1. IRQ events — the device raised an interrupt
  2. Ring wakeups — the TCP/IP server posted new transmit descriptors
  3. Timers — a retransmit or watchdog timer expired

With the current kernel, each of these is a separate blocking read() on a separate fd. A driver would need one thread per event source, or a poll()/select() readiness multiplexer — neither of which exists yet.

Why Completion-Based

A completion-based model inverts the usual readiness pattern:

  • Readiness (epoll/poll/select): “tell me when fd X is ready, then I’ll do the I/O myself.” Two syscalls per operation (wait + read/write).
  • Completion (io_uring/IOCP): “do this I/O for me and tell me when it’s done.” One syscall to submit, one to reap — or zero with shared-memory rings.

Completion-based I/O is a better fit for ostoo because:

  • Simpler driver loops. Submit work, wait for completions. No edge-triggered vs level-triggered subtlety.
  • Naturally batched. Multiple operations submitted and reaped per syscall.
  • Unifies heterogeneous events. IRQs, timers, and file I/O all produce the same IoCompletion struct.
  • Shared-memory fast path. The submission/completion queues can be mapped into userspace for zero-syscall operation under load (Phase 5).
  • Matches the microkernel data plane. Drivers post work and reap completions — the same pattern as managing hardware descriptor rings.

How Other Systems Do It

Linux io_uring

Introduced in 5.1. Submission Queue (SQ) and Completion Queue (CQ) are shared-memory ring buffers mapped into userspace. The kernel polls the SQ for new entries; completions appear in the CQ. io_uring_enter() is the single syscall (submit + wait).

  • Supports 60+ operation types (read, write, accept, timeout, etc.)
  • SQEs carry a user_data field returned verbatim in CQEs for demux
  • IORING_SETUP_SQPOLL mode: kernel thread polls the SQ — truly zero-syscall submission under load
  • Fixed-file and fixed-buffer registration to avoid per-op fd/buffer lookup

Windows IOCP (I/O Completion Ports)

The original completion-based API (NT 3.5, 1994). A completion port is a kernel object that aggregates completions from multiple file handles.

  • CreateIoCompletionPort() creates the port and associates handles
  • Async operations (ReadFile, WriteFile with OVERLAPPED) post completions to the associated port
  • GetQueuedCompletionStatus() dequeues one completion (blocking)
  • PostQueuedCompletionStatus() manually posts a completion (for app-level signaling)
  • The kernel limits concurrent threads to the port’s concurrency value

Fuchsia zx_port

Zircon ports are the unified event aggregation primitive:

  • zx_port_create() creates a port
  • zx_object_wait_async() registers interest in an object’s signals (channels, interrupts, timers, processes) — when the signal fires, a packet is queued to the port
  • zx_port_wait() dequeues a packet (blocking with optional timeout)
  • zx_port_queue() manually enqueues a user packet
  • Packets carry a key field for demux (equivalent to user_data)

seL4 Notifications

seL4 uses a minimal signaling primitive:

  • A notification is a word-sized bitmask of binary semaphores
  • seL4_Signal() OR-sets bits; seL4_Wait() atomically reads and clears
  • Multiple event sources (IRQs, IPC completions) signal different bits in the same notification
  • One seL4_Wait() multiplexes all sources — the returned word tells which bits fired
  • Limitation: carries no payload beyond the bitmask. Data transfer requires a separate shared-memory protocol

Comparison

| Aspect            | io_uring               | IOCP          | zx_port           | seL4 notify     | ostoo (proposed)             |
|-------------------|------------------------|---------------|-------------------|-----------------|------------------------------|
| Model             | Completion             | Completion    | Completion        | Signal          | Completion                   |
| Queue location    | Shared memory          | Kernel        | Kernel            | Kernel (1 word) | Kernel (Phase 5: shared mem) |
| Payload           | Full SQE/CQE           | Bytes + key   | Packet union      | Bitmask only    | IoCompletion struct          |
| Demux field       | user_data              | CompletionKey | key               | Bit position    | user_data                    |
| Event sources     | Files, sockets, timers | File handles  | Objects + signals | Capabilities    | Fds, IRQs, timers, rings     |
| Zero-syscall path | SQPOLL mode            | No            | No                | No              | Phase 5 (shared rings)       |
| Submit + wait     | Single syscall         | Separate      | Separate          | Separate        | Single syscall (io_wait)     |

Core Abstraction

A CompletionPort is a kernel object consisting of a FIFO completion queue and a waiter slot. It is accessed through a file descriptor, like any other ostoo resource.

                   ┌─────────────────────────────┐
                   │       CompletionPort         │
                   │                              │
  io_submit ──────▶│  ┌─────────────────────────┐ │
                   │  │   Completion Queue       │ │
  IRQ ISR ────────▶│  │  ┌────┬────┬────┬───┐   │ │
                   │  │  │ C0 │ C1 │ C2 │...│   │ │──────▶ io_wait
  Timer expire ───▶│  │  └────┴────┴────┴───┘   │ │        (blocks until
                   │  └─────────────────────────┘ │         non-empty)
  Ring wakeup ────▶│                              │
                   │  waiter: Option<thread_idx>  │
                   └─────────────────────────────┘

Key properties:

  • Single consumer. Only one thread may call io_wait on a port at a time. This avoids thundering-herd complexity and matches the single-threaded driver loop model.
  • Multiple producers. Any context — syscall path, ISR, timer callback — can post a completion to the queue.
  • User_data demux. Every submission carries a u64 user_data field that is returned verbatim in the completion, allowing the caller to identify which operation completed without inspecting the payload.
  • Port as fd. The port lives in the process’s fd_table and can be closed, passed across execve (unless FD_CLOEXEC), or used with dup2.

Syscall Interface

Three syscalls using custom numbers in the 500+ range:

| Nr  | Name      | Signature                                                                                        |
|-----|-----------|--------------------------------------------------------------------------------------------------|
| 501 | io_create | io_create(flags: u32) → fd                                                                       |
| 502 | io_submit | io_submit(port_fd: i32, entries: *const IoSubmission, count: u32) → i64                          |
| 503 | io_wait   | io_wait(port_fd: i32, completions: *mut IoCompletion, max: u32, min: u32, timeout_ns: u64) → i64 |

io_create (501)

Creates a new CompletionPort and returns its file descriptor.

  • flags: reserved, must be 0. Future: IO_CLOEXEC.
  • Returns: fd on success, negative errno on failure.

io_submit (502)

Submits one or more I/O operations to the port.

  • port_fd: fd returned by io_create.
  • entries: pointer to an array of IoSubmission structs in user memory.
  • count: number of entries to submit (0 < count ≤ 64).
  • Returns: number of entries successfully submitted, or negative errno.

Submissions that reference invalid fds or unsupported operations fail individually — the return value indicates how many of the leading entries were accepted.
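The leading-entries rule can be sketched as a loop that stops at the first failing entry. `submit_one` is a hypothetical per-entry validator/enqueuer, and the errno-on-first-failure behaviour is an assumption consistent with the return contract above:

```rust
/// Sketch of io_submit's partial-success rule: entries are processed in
/// order; the call reports how many leading entries were accepted, or a
/// negative errno if the very first entry fails.
fn submit_all<E>(entries: &[E], mut submit_one: impl FnMut(&E) -> Result<(), i64>) -> i64 {
    let mut accepted = 0i64;
    for e in entries {
        match submit_one(e) {
            Ok(()) => accepted += 1,
            // First failure stops processing; earlier entries stay submitted.
            Err(errno) => return if accepted > 0 { accepted } else { errno },
        }
    }
    accepted
}
```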

io_wait (503)

Waits for completions and copies them to user memory.

  • port_fd: fd returned by io_create.
  • completions: pointer to an array of IoCompletion structs in user memory.
  • max: maximum number of completions to return.
  • min: minimum number to wait for before returning (0 = non-blocking poll).
  • timeout_ns: maximum wait time in nanoseconds. 0 = no timeout (wait indefinitely for min completions). With min=0, returns immediately.
  • Returns: number of completions written, or negative errno.

IoSubmission struct

#[repr(C)]
pub struct IoSubmission {
    pub user_data: u64,   // returned in completion, opaque to kernel
    pub opcode: u32,      // OP_NOP, OP_READ, etc.
    pub flags: u32,       // per-op flags, reserved
    pub fd: i32,          // target fd (for OP_READ, OP_WRITE)
    pub _pad: i32,
    pub buf_addr: u64,    // user buffer pointer
    pub buf_len: u32,     // buffer length
    pub offset: u32,      // file offset (low 32 bits, sufficient initially)
    pub timeout_ns: u64,  // for OP_TIMEOUT
}

Total size: 48 bytes.

IoCompletion struct

#[repr(C)]
pub struct IoCompletion {
    pub user_data: u64,   // copied from submission
    pub result: i64,      // bytes transferred, or negative errno
    pub flags: u32,       // completion flags (reserved)
    pub opcode: u32,      // echoed from submission
}

Total size: 24 bytes.


Operations

| Opcode | Name         | Description                                              | Status      |
|--------|--------------|----------------------------------------------------------|-------------|
| 0      | OP_NOP       | No operation. Completes immediately. Useful for testing. | Implemented |
| 1      | OP_TIMEOUT   | Completes after timeout_ns nanoseconds.                  | Implemented |
| 2      | OP_READ      | Read from fd into buf_addr.                              | Implemented |
| 3      | OP_WRITE     | Write to fd from buf_addr.                               | Implemented |
| 4      | OP_IRQ_WAIT  | Wait for interrupt on IRQ fd.                            | Implemented |
| 5      | OP_IPC_SEND  | Send a message through an IPC channel.                   | Implemented |
| 6      | OP_IPC_RECV  | Receive a message from an IPC channel.                   | Implemented |
| 7      | OP_RING_WAIT | Wait for notification fd signal.                         | Implemented |

OP_NOP (0)

Immediately posts a completion with result = 0. No side effects. Used for round-trip latency testing and as a wake-up mechanism (submit a NOP from another thread to unblock io_wait).

OP_TIMEOUT (1)

Registers a one-shot timer. Completes after timeout_ns nanoseconds with result = 0, or result = -ETIME if cancelled (future).

Implementation: uses the existing libkernel::task::timer (Delay/Sleep) infrastructure. The submission spawns an async delay task that posts the completion when the timer fires.

OP_READ (2)

Reads up to buf_len bytes from fd at offset into buf_addr.

  • result = number of bytes read, or negative errno.
  • For console/pipe fds (no meaningful offset), offset is ignored.

Implementation: see “Sync fallback worker” below.

OP_WRITE (3)

Writes up to buf_len bytes from buf_addr to fd at offset.

  • result = number of bytes written, or negative errno.

Implementation: same sync fallback pattern as OP_READ.

OP_IRQ_WAIT (4)

Waits for a hardware interrupt on an IRQ fd (from microkernel-design.md, Phase B).

  • fd must be an IRQ fd (IrqHandle).
  • result = interrupt count since last wait, or negative errno.

Implementation: the IRQ fd’s ISR-safe notification calls port.post() when the interrupt fires. No worker thread needed — the ISR posts directly.

OP_IPC_SEND (5)

Sends a message through an IPC channel send-end fd as an async operation.

  • fd must be a channel send-end fd.
  • buf_addr points to a user-space struct ipc_message (48 bytes).
  • result = 0 on success, -EPIPE if receive end closed.

The message (including any fd-passing entries in fds[4]) is read from user memory at submission time. If the channel can accept the message immediately, a completion is posted right away. Otherwise the message is stored and the completion fires when a receiver drains space.

See ipc-channels.md for full details.

OP_IPC_RECV (6)

Receives a message from an IPC channel receive-end fd as an async operation.

  • fd must be a channel receive-end fd.
  • buf_addr points to a user-space struct ipc_message buffer (48 bytes).
  • result = 0 on success, -EPIPE if send end closed and no messages remain.

When a message arrives on the channel, a completion is posted to the port. The message (including any transferred fds, allocated in the receiver’s fd table) is copied to buf_addr during io_wait.

See ipc-channels.md for full details.

OP_RING_WAIT (7) — Implemented

Waits for a notification fd to be signaled.

  • fd must be a notification fd (from notify_create, syscall 509).
  • result = 0 on wakeup, or negative errno.

Implementation: the consumer submits OP_RING_WAIT via io_submit. The kernel stores the port + user_data on the NotifyInner object. When the producer calls notify(fd) (syscall 510), the kernel posts a completion to the port. No worker thread needed — the syscall posts directly.

Edge-triggered, one-shot: one notify() → one completion. Consumer must re-submit OP_RING_WAIT to rearm. If notify() is called before OP_RING_WAIT is armed, the notification is buffered (coalesced).

The notification fd is a general-purpose signaling primitive, not tied to any specific ring buffer format. The kernel does not inspect ring buffer contents — it simply provides the signal/wait mechanism.
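For illustration, the arm/notify coalescing rule above can be modeled as a three-state machine. This is a sketch only — `NotifyState`, `arm`, and `notify` are illustrative names, not the kernel's actual `NotifyInner` API:

```rust
// Toy model of the one-shot, coalescing notification semantics:
//   Idle    — no waiter armed, no pending signal
//   Armed   — OP_RING_WAIT submitted, waiting for notify()
//   Pending — notify() arrived before a waiter armed (coalesced)
#[derive(Debug, PartialEq, Clone, Copy)]
enum NotifyState {
    Idle,
    Armed,
    Pending,
}

struct Notify {
    state: NotifyState,
}

impl Notify {
    fn new() -> Self {
        Notify { state: NotifyState::Idle }
    }

    /// Producer side: returns true if a completion should be posted now.
    fn notify(&mut self) -> bool {
        match self.state {
            NotifyState::Armed => {
                self.state = NotifyState::Idle;
                true // a waiter is armed: post one completion
            }
            // No waiter yet: buffer the signal; repeated notifies coalesce.
            _ => {
                self.state = NotifyState::Pending;
                false
            }
        }
    }

    /// Consumer side (OP_RING_WAIT): returns true if a buffered signal
    /// completes the wait immediately.
    fn arm(&mut self) -> bool {
        match self.state {
            NotifyState::Pending => {
                self.state = NotifyState::Idle;
                true // buffered signal: complete right away
            }
            _ => {
                self.state = NotifyState::Armed;
                false
            }
        }
    }
}

fn main() {
    let mut n = Notify::new();
    // notify before arm: buffered, and two notifies coalesce into one
    assert!(!n.notify());
    assert!(!n.notify());
    assert!(n.arm()); // completes immediately from the buffered signal
    // arm before notify: notify posts the completion directly
    assert!(!n.arm());
    assert!(n.notify());
    println!("ok");
}
```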


Kernel Implementation Sketch

CompletionPort struct

use alloc::collections::VecDeque;

pub struct CompletionPort {
    queue: VecDeque<IoCompletion>,
    waiter: Option<usize>,       // thread index blocked in io_wait
    max_queued: usize,           // backpressure limit (default 256)
}

The CompletionPort is wrapped in a Mutex and stored inside a CompletionPortHandle that implements FileHandle.

CompletionPortHandle

pub struct CompletionPortHandle {
    port: Arc<Mutex<CompletionPort>>,
}

Implements FileHandle:

  • read() → returns Err(FileError::BadFd) (use io_wait instead)
  • write() → returns Err(FileError::BadFd) (use io_submit instead)
  • close() → drop the Arc. Pending operations are cancelled (completions with -ECANCELED are discarded).
  • kind() → a new FileKind::CompletionPort variant

ISR-safe post()

The post() method must be callable from interrupt context (e.g., an IRQ handler posting OP_IRQ_WAIT completions).

impl CompletionPort {
    /// Post a completion.  Safe to call from ISR context.
    pub fn post(&mut self, completion: IoCompletion) {
        if self.queue.len() < self.max_queued {
            self.queue.push_back(completion);
        }
        // (At the max_queued limit the completion is dropped — the
        // backpressure limit caps kernel memory per port.)
        // Wake the blocked waiter, if any
        if let Some(thread_idx) = self.waiter.take() {
            scheduler::unblock(thread_idx);
        }
    }
}

The CompletionPort is wrapped in an IrqMutex which disables interrupts while held, making post() safe to call from ISR context.

io_wait blocking pattern

sys_io_wait(port_fd, completions_ptr, max, min, timeout_ns):
    port = lookup_fd(port_fd) as CompletionPortHandle
    loop:
        lock port
        n = number of queued completions (capped at max)
        if n >= min or timeout expired:
            drain n completions, copy to user memory
            unlock port
            return n                 // may be 0 if the timeout expired
        register current thread as waiter
        unlock port
        block_current_thread()       // scheduler marks Blocked, yields
        // ... woken by post() or timeout; loop re-checks ...

This reuses the existing scheduler::block_current_thread() / scheduler::unblock() pattern from the pipe and waitpid implementations.

Sync fallback worker for OP_READ / OP_WRITE

Existing FileHandle implementations (VfsHandle, ConsoleHandle, PipeReader) are synchronous and blocking. To integrate them with the CompletionPort without rewriting every handle:

  1. io_submit for OP_READ/OP_WRITE spawns an async task (via the existing executor).
  2. The task calls osl::blocking::blocking() which blocks a scheduler thread on the synchronous FileHandle::read() or FileHandle::write().
  3. When the blocking call returns, the task posts a completion to the port.

This means each in-flight OP_READ/OP_WRITE consumes one scheduler thread while blocked. Acceptable for the initial implementation; a true async FileHandle path can be added later.

io_submit(OP_READ, fd, buf, len, user_data):
    spawn async {
        let result = blocking(|| {
            file_handle.read(buf, len)
        });
        port.lock().post(IoCompletion {
            user_data,
            result: result as i64,
            opcode: OP_READ,
            flags: 0,
        });
    }

Integration with Existing Infrastructure

FileHandle trait

No changes to the FileHandle trait in Phase 1. The sync fallback worker bridges existing handles.

In a future phase, an optional submit_async() method could be added to FileHandle for handles that can natively post completions (e.g., a future async virtio-blk driver):

pub trait FileHandle: Send + Sync {
    // ... existing methods ...

    /// Submit an async operation.  Default: not supported (use sync fallback).
    fn submit_async(&self, _op: &IoSubmission, _port: &Arc<Mutex<CompletionPort>>)
        -> Result<(), FileError>
    {
        Err(FileError::NotSupported)
    }
}

Executor and timer reuse

  • Executor: the existing async task executor spawns fallback worker tasks and timeout tasks. No changes needed.
  • Timer: OP_TIMEOUT uses the existing libkernel::task::timer::Delay (which builds on the LAPIC timer tick). No new timer infrastructure.

fd_table

The CompletionPort is stored in the process fd_table as a CompletionPortHandle. This means:

  • close(port_fd) cleans up the port.
  • dup2 works (two fds alias the same port via Arc).
  • FD_CLOEXEC / close_cloexec_fds() works for execve.
  • No new kernel data structures outside the existing fd model.

Syscall dispatch wiring

Add to osl/src/syscalls/mod.rs:

501 => sys_io_create(a1 as u32),
502 => sys_io_submit(a1 as i32, a2 as *const IoSubmission, a3 as u32),
503 => sys_io_wait(a1 as i32, a2 as *mut IoCompletion, a3 as u32, a4 as u32, a5 as u64),

The IoSubmission and IoCompletion structs live in osl/src/io_port.rs (new file). The CompletionPort and CompletionPortHandle implementations live in libkernel/src/file.rs alongside the existing handle types.


Integration with Microkernel Primitives

The CompletionPort replaces the need for two separate primitives described in microkernel-design.md:

  • IRQ fd (Phase B) — instead of a standalone IrqHandle where read() blocks, the IRQ fd posts completions to a port via OP_IRQ_WAIT. The driver submits an OP_IRQ_WAIT and reaps it alongside other completions.

  • Notification objects (seL4-style) — unnecessary. A port with manual post() (exposed as a future io_post syscall or via OP_NOP with user_data tagging) serves the same role.

NIC driver example loop

int port = io_create(0);
int irq_fd = open("/dev/irq/11", O_RDONLY);
int ring_fd = open("/dev/shm/txring", O_RDWR);

// Submit initial waits
IoSubmission subs[2] = {
    { .user_data = TAG_IRQ,  .opcode = OP_IRQ_WAIT,  .fd = irq_fd  },
    { .user_data = TAG_RING, .opcode = OP_RING_WAIT, .fd = ring_fd },
};
io_submit(port, subs, 2);

for (;;) {
    IoCompletion comp[8];
    int n = io_wait(port, comp, 8, /*min=*/1, /*timeout=*/0);

    for (int i = 0; i < n; i++) {
        switch (comp[i].user_data) {
        case TAG_IRQ:
            handle_interrupt();
            // Resubmit IRQ wait
            io_submit(port, &(IoSubmission){
                .user_data = TAG_IRQ, .opcode = OP_IRQ_WAIT, .fd = irq_fd
            }, 1);
            break;

        case TAG_RING:
            drain_tx_ring();
            // Resubmit ring wait
            io_submit(port, &(IoSubmission){
                .user_data = TAG_RING, .opcode = OP_RING_WAIT, .fd = ring_fd
            }, 1);
            break;
        }
    }
}

This single-threaded loop handles both IRQs and ring wakeups through one blocking wait point — exactly the pattern needed for microkernel drivers.


Shared-Memory Ring Optimisation (Phase 5 — Implemented)

Under high throughput, even one syscall per batch can become a bottleneck. The shared-memory ring optimisation maps the submission and completion queues into userspace as shared-memory ring buffers, eliminating syscalls on the hot path for reading completions.

Syscalls

| Nr  | Name           | Purpose                                                  |
|-----|----------------|----------------------------------------------------------|
| 511 | io_setup_rings | Allocate SQ/CQ shared memory, put port in ring mode      |
| 512 | io_ring_enter  | Process SQ entries + optionally block for CQ completions |

io_setup_rings (511)

io_setup_rings(port_fd, params: *mut IoRingParams) → 0 or -errno

Allocates SQ and CQ ring pages and returns shmem fds that the process mmaps with MAP_SHARED:

struct io_ring_params params = { .sq_entries = 64, .cq_entries = 128 };
io_setup_rings(port, &params);
void *sq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.sq_fd, 0);
void *cq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.cq_fd, 0);

io_ring_enter (512)

io_ring_enter(port_fd, to_submit, min_complete, flags) → i64

Processes up to to_submit SQEs from the shared SQ ring, flushes deferred completions, and optionally blocks until min_complete CQEs are available in the CQ ring.

Ring layout

Single 4 KiB page per ring:

Offset 0:  RingHeader (16 bytes)
  AtomicU32 head  — consumer advances (SQ: kernel, CQ: user)
  AtomicU32 tail  — producer advances (SQ: user, CQ: kernel)
  u32 mask        — capacity - 1
  u32 flags       — reserved (0)

Offset 64: entries[] (cache-line aligned)
  SQ: IoSubmission[capacity]  — 48 bytes each, max 64
  CQ: IoCompletion[capacity]  — 24 bytes each, max 128

Head and tail use atomic load/store with acquire/release ordering.
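The head/tail protocol can be sketched as a minimal single-producer/single-consumer ring in ordinary Rust. Illustrative only — the real SQ/CQ entries are IoSubmission/IoCompletion structs in a shared 4 KiB page; `Ring` and its `u32` payload are stand-ins:

```rust
// Minimal SPSC ring mirroring the header layout above: head advanced by the
// consumer, tail by the producer, mask = capacity - 1 (power of two).
use std::sync::atomic::{AtomicU32, Ordering};

struct Ring {
    head: AtomicU32,         // consumer index
    tail: AtomicU32,         // producer index
    mask: u32,               // capacity - 1
    entries: Vec<AtomicU32>, // stand-in for the entry slots
}

impl Ring {
    fn new(capacity: u32) -> Self {
        assert!(capacity.is_power_of_two());
        Ring {
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            mask: capacity - 1,
            entries: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    /// Producer: write the entry first, then advance tail with Release so
    /// the consumer's Acquire load of tail observes the entry write.
    fn push(&self, value: u32) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) > self.mask {
            return false; // full
        }
        self.entries[(tail & self.mask) as usize].store(value, Ordering::Relaxed);
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer: Acquire-load tail, read the entry, then advance head with
    /// Release so the producer may reuse the slot.
    fn pop(&self) -> Option<u32> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let v = self.entries[(head & self.mask) as usize].load(Ordering::Relaxed);
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(v)
    }
}

fn main() {
    let ring = Ring::new(4);
    assert!(ring.push(10) && ring.push(20));
    assert_eq!(ring.pop(), Some(10));
    assert_eq!(ring.pop(), Some(20));
    assert_eq!(ring.pop(), None);
    println!("ok");
}
```

Indices grow monotonically and are wrapped only at access time (`index & mask`), which is why full is detected as `tail - head > mask` rather than by comparing wrapped slots.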

Dual-mode post()

When rings are active, CompletionPort::post() routes completions:

  • Simple (no read_buf, no transfer_fds): CQE written directly to the shared CQ ring via IoRing::post_cqe(). Fast path for OP_NOP, OP_TIMEOUT, OP_WRITE, OP_IRQ_WAIT, OP_IPC_SEND, OP_RING_WAIT.
  • Deferred (read_buf or transfer_fds present): pushed to the kernel VecDeque. io_ring_enter flushes these in syscall context where page tables are correct for data copy and fd installation.

Backward compatibility

  • io_submit works in ring mode (completions go to CQ ring)
  • io_wait returns -EINVAL in ring mode (use io_ring_enter)
  • Ports without rings work exactly as before

Phased Implementation

Phase 1: Core + OP_NOP + OP_TIMEOUT

Goal: Establish the CompletionPort kernel object, syscall interface, and basic operations.

| Item | Detail |
|------|--------|
| Files | libkernel/src/file.rs (CompletionPortHandle), osl/src/io_port.rs (new: structs + sys_io_*), osl/src/syscalls/mod.rs (wire 501-503) |
| Dependencies | None — uses existing scheduler, timer, fd_table |
| Delivers | io_create, io_submit, io_wait; OP_NOP and OP_TIMEOUT |
| Test | Userspace program: create port, submit OP_NOP, io_wait returns immediately. Submit OP_TIMEOUT(100ms), io_wait blocks ~100ms then returns. |

Phase 2: OP_READ + OP_WRITE with Sync Fallback

Goal: Bridge existing FileHandle implementations into the completion model.

| Item | Detail |
|------|--------|
| Files | osl/src/io_port.rs (add fallback worker logic) |
| Dependencies | Phase 1; existing osl::blocking::blocking() |
| Delivers | OP_READ and OP_WRITE on console, pipe, and VFS file fds |
| Test | Submit OP_WRITE to stdout + OP_READ from a file fd, reap both completions. Verify data matches. |

Phase 3: OP_IRQ_WAIT — Implemented

Goal: Hardware interrupt delivery through the completion port.

| Item | Detail |
|------|--------|
| Files | libkernel/src/irq_handle.rs (IrqInner, IRQ slot table, ISR dispatch), libkernel/src/file.rs (FdObject::Irq variant), libkernel/src/completion_port.rs (OP_IRQ_WAIT constant), osl/src/irq.rs (sys_irq_create), osl/src/io_port.rs (OP_IRQ_WAIT handler in io_submit) |
| Dependencies | Phase 1; IO APIC route/mask/unmask (libkernel::apic) |
| Delivers | irq_create(gsi) syscall (504), submit OP_IRQ_WAIT on an IRQ fd, ISR masks line and posts completion to port, rearm via another OP_IRQ_WAIT unmasks |
| Test | user/irq_demo.c: create IRQ fd for keyboard GSI 1, submit OP_IRQ_WAIT, press key, verify completion with scancode in result. |

Phase 3b: OP_IPC_SEND + OP_IPC_RECV — Implemented

Goal: Multiplex IPC channel operations with other async I/O sources.

| Item | Detail |
|------|--------|
| Files | libkernel/src/channel.rs (arm_send, arm_recv, PendingPortSend/Recv), libkernel/src/completion_port.rs (OP_IPC_SEND/RECV constants, transfer_fds on Completion), osl/src/io_port.rs (OP_IPC_SEND/RECV handlers) |
| Dependencies | Phase 1; IPC channels (syscalls 505–507) |
| Delivers | Submit OP_IPC_SEND/RECV on channel fds, completions posted when message delivered/received. Supports fd-passing: transferred fds installed in receiver during io_wait. |
| Test | user/ipc_port.c: IPC send/recv multiplexed with timers via completion port. user/ipc_fdpass.c: fd-passing through IPC channels. |

Phase 4: OP_RING_WAIT — Implemented

Goal: Inter-process signaling through the completion port via notification fds.

| Item | Detail |
|------|--------|
| Files | libkernel/src/notify.rs (NotifyInner, arm/signal), libkernel/src/file.rs (FdObject::Notify), osl/src/notify.rs (sys_notify_create 509, sys_notify 510), osl/src/io_port.rs (OP_RING_WAIT handler) |
| Dependencies | Phase 1 |
| Delivers | notify_create(flags) syscall (509), notify(fd) syscall (510), submit OP_RING_WAIT on notification fd, producer-side notify() posts completion |
| Test | user/ring_test.c: parent creates shmem + notify fd, spawns child, child writes to shmem and signals, parent reaps OP_RING_WAIT completion and verifies data. |

Phase 5: Shared-Memory SQ/CQ Rings — Implemented

Goal: Zero-syscall submission and completion for high-throughput paths.

| Item | Detail |
|------|--------|
| Files | libkernel/src/completion_port.rs (IoRing, IoSubmission, IoCompletion, RingHeader, dual-mode post()), libkernel/src/shmem.rs (from_existing), osl/src/io_port.rs (io_setup_rings 511, io_ring_enter 512, process_submission refactor) |
| Dependencies | Phase 1; MAP_SHARED from mmap-design.md Phase 5 |
| Delivers | Userspace-mapped SQ/CQ rings via shmem fds, io_ring_enter processes SQ + waits for CQ |
| Test | user/ring_sq_test.c: submit OP_NOP + OP_TIMEOUT via shared-memory SQ, reap from CQ ring, verify completions. |

Dependency Graph

                 ┌─────────────────────────┐
                 │  Phase 1                │
                 │  Core + NOP + TIMEOUT   │
                 │  (no external deps)     │
                 └─┬─────┬─────┬─────┬─────┘
                   │     │     │     │
        ┌──────────▼──┐  │     │     │
        │  Phase 2    │  │     │     │
        │  READ/WRITE │  │     │     │
        │  sync fbk   │  │     │     │
        └─────────────┘  │     │     │
                         │     │     │
          ┌──────────────┘     │     │
          │             ┌──────┘     │
  ┌───────▼──────┐  ┌───▼────────┐   │
  │  Phase 3  ✓  │  │ Phase 3b ✓ │   │
  │  OP_IRQ_WAIT │  │ OP_IPC_*   │   │
  │              │  │            │   │
  │  requires:   │  │ requires:  │   │
  │  IO APIC     │  │ IPC chans  │   │
  └──────────────┘  └────────────┘   │
                                     │
          ┌────────────────────┐     │
          │  Phase 4  ✓        │     │
          │  OP_RING_WAIT      │     │
          │  (notify fds)      │     │
          └────────────────────┘     │
                                     │
                    ┌────────────────▼───────────────────┐
                    │  Phase 5  ✓                        │
                    │  Shared-memory SQ/CQ rings         │
                    │                                    │
                    │  requires:                         │
                    │  mmap Phase 5 (MAP_SHARED)  ✓      │
                    └────────────────────────────────────┘

All phases are complete.


Key Design Decisions

Syscall-first, not ring-first

Phase 1 uses traditional syscalls (io_submit/io_wait). Shared-memory rings are deferred to Phase 5. This avoids coupling the initial implementation to MAP_SHARED (which was not yet implemented at the time) and keeps the kernel-side logic simple.

Port as fd

The CompletionPort is an fd in the process’s fd_table, not a special kernel handle type. This reuses existing infrastructure (close, dup2, CLOEXEC, fd_table cleanup on exit) and avoids inventing a parallel handle namespace.

Eager posting

Completions are pushed to the port’s queue immediately when the operation finishes (or the ISR fires). There is no lazy/deferred completion model. This is simpler and matches the existing block/unblock scheduling model.

Single-threaded wait

Only one thread may block in io_wait per port. This is a deliberate constraint matching the single-threaded driver loop model. Multi-threaded consumers can use multiple ports. This avoids thundering-herd wake-up logic and lock contention on the completion queue.

user_data for demux

Every submission carries a u64 user_data returned verbatim in the completion. The kernel never inspects this field. The caller uses it to identify which logical operation completed (e.g., TAG_IRQ, TAG_RING, a pointer to a request struct). This is the same pattern used by io_uring, IOCP, and Fuchsia ports.

Custom syscall numbers (501-503)

The 500+ range is reserved for ostoo-specific syscalls. These are not Linux syscall numbers. If Linux compatibility is needed later, a shim layer can map Linux’s io_uring_setup/io_uring_enter numbers to the ostoo equivalents.

Kernel-buffered queue

The completion queue lives in kernel memory (VecDeque<IoCompletion>), not shared memory. io_wait copies completions to user buffers. Simple, correct, and sufficient until Phase 5 adds the zero-copy shared-memory path.

Sync fallback for existing FileHandles

Rather than rewriting ConsoleHandle, PipeReader, VfsHandle, etc. to be async-aware, OP_READ/OP_WRITE spawn a blocking worker that calls the existing synchronous FileHandle::read()/FileHandle::write() and posts the completion when done. This trades a scheduler thread per in-flight op for zero changes to existing handle implementations.

Timer via Delay

OP_TIMEOUT reuses the existing libkernel::task::timer::Delay rather than introducing a new timer subsystem. The LAPIC timer already provides 10ms ticks; Delay builds on this.

ISR-safe posting

CompletionPort::post() must work from interrupt context. The IrqMutex around the port disables interrupts while held. The scheduler::unblock() call is already ISR-safe (it just pushes to the ready queue).

No cancellation in Phase 1

Submitted operations cannot be cancelled. This avoids the complexity of cancellation tokens, in-progress state tracking, and partial-completion semantics. Cancellation support can be added later as an io_cancel syscall once the basic model is proven.

Completion-oriented, not readiness-oriented

The port reports “operation X is done” (completion), not “fd Y is readable” (readiness). This is a deliberate choice:

  • Completion avoids the double-syscall problem (wait for ready, then do I/O).
  • Completion naturally supports heterogeneous event sources (timers, IRQs) that don’t have a “ready” state.
  • Readiness can be emulated on top of completion (submit a zero-length OP_READ as a readiness probe) but not vice versa.

Blocking Protocol

This document describes the blocking/wakeup protocol used throughout the kernel, the lost-wakeup race it currently suffers from, and a proposed fix based on an idle thread and a WaitCondition primitive.

Formal PlusCal/TLA+ models of the protocol live in specs/. See specs/PLUSCAL.md for authoring instructions and specs/README.md for per-spec details.

Current protocol (buggy)

Every blocking site in the kernel follows this pattern:

{
    let mut guard = shared_state.lock();   // 1. acquire lock
    guard.waiter = Some(thread_idx);       // 2. register waiter
}                                          // 3. release lock
scheduler::block_current_thread();         // 4. mark Blocked + spin

The waker (a producer, timer, or signal) does:

{
    let mut guard = shared_state.lock();
    if let Some(t) = guard.waiter.take() { // clear waiter slot
        scheduler::unblock(t);             // wake up the blocked thread
    }
}

And unblock() is conditional:

pub fn unblock(thread_idx: usize) {
    let mut sched = SCHEDULER.lock();
    if let Some(t) = sched.threads.get_mut(thread_idx) {
        if t.state == ThreadState::Blocked {   // <-- only acts if Blocked
            t.state = ThreadState::Ready;
            sched.ready_queue.push_back(thread_idx);
        }
    }
}

The race

A waker can execute between steps 3 (unlock) and 4 (mark Blocked):

  1. Waiter: acquires lock, sets waiter = Some(self), releases lock. Thread state is still Running.
  2. Waker: acquires lock, calls waiter.take(), calls unblock(waiter). unblock checks state == Blocked — the thread is still Running, so it is a no-op. The waiter slot is now None.
  3. Waiter: calls block_current_thread(), sets state to Blocked, spins forever. No future waker will call unblock because the waiter slot was already consumed. Deadlock.

This race is confirmed by the PlusCal model in specs/completion_port/completion_port.tla (TLC finds a deadlock trace) and by code inspection of scheduler.rs lines 640-676.
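The losing interleaving can also be replayed deterministically in plain Rust. This is a toy model, not kernel code — `replay_race` and the data-only `ThreadState` are illustrative:

```rust
// Deterministic replay of the lost-wakeup interleaving: each numbered step
// executes in the racy order described in the text.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ThreadState {
    Running,
    Blocked,
    Ready,
}

/// Replay the racy interleaving; returns (waiter slot, final thread state).
fn replay_race() -> (Option<usize>, ThreadState) {
    // 1. Waiter: registers itself, releases the lock. Still Running.
    let mut waiter_slot: Option<usize> = Some(0);
    let mut state = ThreadState::Running;

    // 2. Waker: consumes the waiter slot and calls unblock(0), which is a
    //    no-op because the thread is Running, not Blocked.
    if waiter_slot.take().is_some() && state == ThreadState::Blocked {
        state = ThreadState::Ready;
    }

    // 3. Waiter: only now calls block_current_thread() and goes Blocked.
    state = ThreadState::Blocked;

    (waiter_slot, state) // slot empty, thread Blocked: the wakeup is lost
}

fn main() {
    assert_eq!(replay_race(), (None, ThreadState::Blocked));
    println!("deadlock: wakeup lost");
}
```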

Affected sites

The race affects every blocking site, not just the completion port:

| Site | File | Lock type | Status |
|------|------|-----------|--------|
| sys_io_wait | osl/src/io_port.rs | IrqMutex | Fixed: WaitCondition |
| sys_io_ring_enter phase 3 | osl/src/io_port.rs | IrqMutex | Fixed: WaitCondition |
| PipeReader::read | libkernel/src/file.rs | SpinMutex | Fixed: WaitCondition |
| read_input (console) | libkernel/src/console.rs | SpinMutex | Fixed: WaitCondition |
| sys_ipc_send | osl/src/ipc.rs | IrqMutex | mark_blocked under lock (action enum) |
| sys_ipc_recv | osl/src/ipc.rs | IrqMutex | mark_blocked under lock (action enum) |
| sys_wait4 | osl/src/syscalls/process.rs | SpinMutex | Fixed: WaitCondition |
| sys_clone (vfork parent) | osl/src/clone.rs | SpinMutex | Fixed: WaitCondition |
| blocking() (async bridge) | osl/src/blocking.rs | SpinMutex | Fixed: WaitCondition |

The IPC channel sites (sys_ipc_send, sys_ipc_recv) use the split mark_blocked() / yield_now() pair. The mark_blocked is called inside ChannelInner::try_send/try_recv (under the IrqMutex), and the caller does yield_now() after the lock drops via the SendAction/RecvAction enum. These are already race-free; WaitCondition doesn’t fit the action-enum pattern without a larger refactor.

The sys_io_ring_enter variant previously had an additional bug: check and set_waiter were under separate lock acquisitions, so a completion could arrive between the check and the registration. WaitCondition fixes this by construction.

Why Blocked threads spin today

Both preempt_tick and yield_tick handle an empty ready queue by returning current_rsp — i.e. they keep running the current thread even if it’s Blocked. This forces block_current_thread() to include a HLT spin loop: the Blocked thread keeps running on the CPU, calling enable_and_hlt() in a loop, waiting for the next timer interrupt to check if unblock() has changed its state. This wastes up to one full quantum (10ms) per blocking event and prevents the CPU from doing useful work while the thread is Blocked.

Proposed fix

The fix has three parts: an idle thread that eliminates the need for blocked threads to spin, a split of block_current_thread that fixes the race, and a WaitCondition wrapper that makes the correct pattern easy and the buggy pattern impossible.

Step 1: Add an idle thread

Create a per-CPU idle thread that the scheduler falls back to when the ready queue is empty. The idle thread does nothing but HLT in a loop, yielding the CPU until the next interrupt:

fn idle_thread() -> ! {
    loop {
        x86_64::instructions::interrupts::enable_and_hlt();
    }
}

The idle thread is created during scheduler init and stored in Scheduler:

struct Scheduler {
    // ...
    idle_thread_idx: usize,  // always present, never on the ready queue
}

Then preempt_tick and yield_tick switch to the idle thread instead of staying on a Blocked/Dead thread:

let next_idx = match sched.ready_queue.pop_front() {
    Some(idx) => idx,
    None => sched.idle_thread_idx,  // was: return current_rsp
};

The idle thread is never pushed onto the ready queue. The scheduler only runs it as a fallback when nothing else is Ready. The first unblock() call pushes a real thread onto the ready queue, and the next timer tick preempts idle and switches to it.

With this change, a Blocked thread no longer needs to spin — the scheduler context-switches away from it immediately and never schedules it again until unblock() makes it Ready.

Step 2: Split block_current_thread into mark + yield

/// Mark the current thread Blocked and yield to the scheduler.
///
/// Safe to call while holding any lock (acquires SCHEDULER briefly).
/// The scheduler will context-switch away and never schedule this thread
/// again until unblock() is called. Execution resumes at the instruction
/// after this call.
// [spec: completion_port/completion_port.tla CheckAndAct — "thread_state := blocked"
//        + WaitUnblocked — "await thread_state = running"]
pub fn mark_blocked_and_yield() {
    x86_64::instructions::interrupts::without_interrupts(|| {
        let mut sched = SCHEDULER.lock();
        let idx = sched.current_idx;
        sched.threads[idx].state = ThreadState::Blocked;
    });
    yield_now();  // context-switch away; resume here after unblock + reschedule
}

There is no spin loop. yield_now() triggers int 0x50, which enters yield_tick. The scheduler sees the thread is Blocked, does not re-queue it, and switches to the next ready thread (or idle). When unblock() is called later, the thread is pushed onto the ready queue with state Ready. The scheduler eventually picks it and context-switches back, resuming execution right after the yield_now() call.

A separate mark_blocked() (without yield) is still useful for callers that need to mark Blocked under a lock and yield after dropping it:

/// Mark the current thread Blocked. Does NOT yield.
/// Caller must call yield_now() after releasing their lock.
pub fn mark_blocked() {
    x86_64::instructions::interrupts::without_interrupts(|| {
        let mut sched = SCHEDULER.lock();
        let idx = sched.current_idx;
        sched.threads[idx].state = ThreadState::Blocked;
    });
}

Step 3: Migrate call sites

Each site becomes:

{
    let mut guard = shared_state.lock();   // 1. acquire lock
    guard.waiter = Some(thread_idx);       // 2. register waiter
    scheduler::mark_blocked();             // 3. mark Blocked UNDER LOCK
}                                          // 4. release lock
scheduler::yield_now();                    // 5. context-switch away
// execution resumes here after unblock + reschedule

No loop, no spin, no HLT. The thread is off the CPU until explicitly woken.
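A toy replay shows why this ordering is safe: because the thread is marked Blocked under the same lock acquisition as waiter registration, any waker runs strictly before or strictly after both steps and always observes a consistent state. `replay_fixed` and the data-only `ThreadState` are illustrative names, not kernel code:

```rust
// Deterministic replay of the FIXED ordering: register + mark Blocked happen
// atomically (under one lock), so the waker can never see Running with the
// waiter slot filled.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ThreadState {
    Running,
    Blocked,
    Ready,
}

/// Replay with the fixed ordering; returns the waiter's final state.
fn replay_fixed() -> ThreadState {
    // Steps 1-3 (waiter, under one lock): register AND mark Blocked before
    // the lock is released — no waker can interleave between them.
    let mut waiter_slot: Option<usize> = Some(0);
    let mut state = ThreadState::Blocked;

    // Waker (runs only after the lock drops): always sees Blocked.
    if waiter_slot.take().is_some() && state == ThreadState::Blocked {
        state = ThreadState::Ready; // unblock() succeeds
    }
    state
}

fn main() {
    assert_eq!(replay_fixed(), ThreadState::Ready);
    println!("wakeup delivered");
}
```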

Step 4: Introduce WaitCondition to enforce the pattern (DONE)

Implemented in libkernel/src/wait_condition.rs. Seven sites now use WaitCondition::wait_while(): sys_io_wait, sys_io_ring_enter, PipeReader::read, read_input (console), sys_wait4, sys_clone (vfork parent), and blocking(). The two IPC channel sites use the split mark_blocked()/yield_now() pair via an action enum.

A condvar-like wrapper that makes the ordering impossible to get wrong:

/// Single-waiter condvar for kernel blocking.
///
/// Encapsulates the check → register → mark_blocked → unlock → yield cycle.
/// The type system ensures mark_blocked happens before the guard drops.
pub struct WaitCondition;

impl WaitCondition {
    /// If `predicate(guard)` returns true (i.e. "should block"), register
    /// the waiter, mark the thread Blocked, release the lock, and yield.
    /// Returns when unblocked and rescheduled.
    ///
    // [spec: completion_port/completion_port.tla
    //   CheckAndAct (check + set_waiter + mark_blocked) = one label
    //   WaitUnblocked (await running) = next label]
    pub fn wait_while<T, L: Lock<T>>(
        mut guard: L::Guard<'_>,
        predicate: impl Fn(&T) -> bool,
        register: impl FnOnce(&mut T, usize),
    ) {
        if !predicate(&*guard) {
            return; // condition already satisfied, no need to block
        }
        let thread_idx = scheduler::current_thread_idx();
        register(&mut *guard, thread_idx);
        scheduler::mark_blocked();  // mark Blocked while lock held
        drop(guard);                // release lock
        scheduler::yield_now();     // context-switch away; resume after unblock
    }
}

Each blocking site reduces to a single call:

// Completion port
WaitCondition::wait_while(
    port.lock(),
    |p| p.pending() < min,
    |p, idx| p.set_waiter(idx),
);

// Pipe reader
WaitCondition::wait_while(
    self.inner.lock(),
    |inner| inner.buffer.is_empty() && !inner.write_closed,
    |inner, idx| { inner.reader_thread = Some(idx); },
);

// Console input
WaitCondition::wait_while(
    CONSOLE_INPUT.lock(),
    |c| c.buf.is_empty(),
    |c, idx| { c.blocked_reader = Some(idx); },
);

// sys_wait4
WaitCondition::wait_while(
    PROCESS_TABLE.lock(),
    |table| find_zombie_child(table, pid).is_none(),
    |table, idx| {
        table.get_mut(&pid).unwrap().wait_thread = Some(idx);
    },
);

Step 5: Deprecate block_current_thread

All three remaining sites (sys_wait4, sys_clone, blocking()) have been migrated. block_current_thread is now unused and can be deprecated. New blocking code must use WaitCondition or the mark_blocked() / yield_now() pair.

Lock ordering note

mark_blocked() acquires the scheduler’s SCHEDULER SpinMutex internally. This means any lock held by the caller must come before SCHEDULER in the lock ordering. The current codebase already satisfies this: all IrqMutex and SpinMutex locks protecting shared state are acquired before (and never after) the scheduler lock.

If a future lock needs to be acquired after the scheduler lock, that lock cannot be held when calling mark_blocked() — use the manual split instead, and mark blocked before acquiring the inner lock.

PlusCal correspondence

| Rust construct | PlusCal label | Atomicity |
|----------------|---------------|-----------|
| guard = state.lock() | Start of label | Lock acquired |
| register(guard, idx) | Same label | Under same lock |
| mark_blocked() | Same label | Under same lock (acquires SCHEDULER briefly) |
| drop(guard) | End of label | Lock released |
| yield_now() | Next label | await thread_state = "running" |
| unblock(idx) in waker | Waker’s label | if thread_state = "blocked" then running |

Each WaitCondition::wait_while call maps to exactly two PlusCal labels, making formal verification straightforward.

Relation to the io_ring_enter double-lock bug

sys_io_ring_enter phase 3 currently does:

{ let p = port.lock(); /* check cq_available */ }  // lock 1
{ let mut p = port.lock(); p.set_waiter(idx); }    // lock 2
scheduler::block_current_thread();                  // lock 3

The check and set_waiter are under separate locks, so a CQE posted between them is missed. WaitCondition fixes this by construction: the predicate check and waiter registration happen under a single lock acquisition.

Scheduler Donate (Direct-Switch) Infrastructure

Overview

All blocking IPC in ostoo (pipes, completion ports, wait4, vfork) uses unblock(thread_idx) which pushes the woken thread to the back of the ready queue. The thread then waits for the scheduler’s round-robin to reach it — up to 10 ms (full quantum).

The scheduler donate mechanism adds a voluntary yield via a dedicated ISR vector (int 0x50) so the waker can switch to the woken thread immediately, eliminating the up-to-10 ms latency.

Mechanism

Yield interrupt (vector 0x50)

ipc_yield_stub is an assembly handler identical to the LAPIC timer stub (lapic_timer_stub) but calls yield_tick instead of preempt_tick. It provides a software-triggered context switch from syscall context.

yield_tick differs from preempt_tick:

  • No tick(), no lapic_eoi() — not a hardware interrupt
  • No quantum decrement — always performs the switch
  • Checks DONATE_TARGET: AtomicUsize for a direct-switch target

Public API

| Function | Description |
|----------|-------------|
| yield_now() | Trigger int 0x50 — voluntary preemption |
| set_donate_target(idx) | Set direct-switch target for next yield |
| unblock_yield(idx) | Unblock + set donate + yield (convenience) |
unblock_yield is the high-level primitive for the pattern: unblock a thread and immediately switch to it.

Direct-switch flow

  1. Waker calls unblock(target) — target moves to Ready, pushed to ready queue
  2. Waker calls set_donate_target(target) — stores target in atomic
  3. Waker calls yield_now() — triggers int 0x50
  4. yield_tick saves waker’s state, sees donate target, switches to target
  5. Target resumes from its blocked state immediately
  6. Waker is re-queued as Ready and runs later via normal scheduling

If the donate target is no longer Ready (e.g., the timer already dispatched it), yield_tick falls back to regular round-robin.
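The donate-and-fallback selection can be sketched in hosted Rust. DONATE_TARGET is real per the text; pick_next and the Vec-based ready queue are illustrative stand-ins for the scheduler's internals:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const NO_TARGET: usize = usize::MAX;
static DONATE_TARGET: AtomicUsize = AtomicUsize::new(NO_TARGET);

fn set_donate_target(idx: usize) {
    DONATE_TARGET.store(idx, Ordering::SeqCst);
}

// Sketch of yield_tick's selection logic: consume the donate target
// atomically; if it is no longer in the ready queue, fall back to
// round-robin.
fn pick_next(ready: &mut Vec<usize>) -> Option<usize> {
    let t = DONATE_TARGET.swap(NO_TARGET, Ordering::SeqCst);
    if t != NO_TARGET {
        if let Some(pos) = ready.iter().position(|&i| i == t) {
            return Some(ready.remove(pos)); // direct switch to the woken thread
        }
        // target already dispatched elsewhere: fall through to round-robin
    }
    if ready.is_empty() { None } else { Some(ready.remove(0)) }
}
```

The swap both reads and clears the target in one step, so a stale target can never be consumed twice.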

Applied to existing primitives

Pipes (libkernel/src/file.rs)

pipe_wake_reader() returns the woken thread index. PipeWriter::write() drops the pipe lock, then calls set_donate_target + yield_now() if a reader was woken.

PipeWriter::close() returns the woken thread index (if writer_count reaches 0 and a reader was woken). It cannot yield itself because it runs inside with_process() which holds the process table lock. Instead, sys_close yields after the lock is released.

PipeInner tracks writer_count (incremented by on_dup(), decremented by close()). write_closed is set only when writer_count reaches 0, matching Unix pipe semantics where EOF is delivered only after all writer fds are closed.
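The writer-count bookkeeping can be sketched as a tiny state machine (hosted-Rust model of the semantics described above, not the kernel's actual PipeInner):

```rust
// Model of the pipe EOF rule: write_closed is set only when the LAST
// writer fd goes away, matching Unix pipe semantics.
struct PipeInner {
    writer_count: usize,
    write_closed: bool,
}

impl PipeInner {
    fn on_dup(&mut self) {
        self.writer_count += 1; // fd duplicated (fork fd-table copy, dup2)
    }

    fn close_writer(&mut self) {
        self.writer_count -= 1;
        if self.writer_count == 0 {
            self.write_closed = true; // reader now sees EOF
        }
    }
}
```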

FdObject::clone() does NOT call on_dup() — it is a plain Arc clone. on_dup() is only called via FdObject::notify_dup() at actual fd-duplication sites (clone/fork fd_table inheritance, dup2).

The pipe lock must be dropped before yielding — otherwise the reader thread would deadlock trying to acquire it.

Completion ports (libkernel/src/completion_port.rs)

CompletionPort::post() returns Option<usize> — the woken waiter thread index. ISR-context callers ignore the return value. Syscall-context callers (e.g., OP_NOP in io_port.rs) use it to yield to the waiter.

Process exit (libkernel/src/process.rs)

terminate_process() calls yield_now() before kill_current_thread(). If the parent has a wait_thread, the donate target is set to the parent’s thread so it returns from wait4 immediately. The dying thread’s remaining quantum is donated to the parent.

Safety constraints

  • yield_now() must NOT be called from ISR context. The scheduler lock could deadlock (ISR preempts code holding the lock, ISR tries to acquire lock → deadlock).
  • ISR paths (e.g., irq_fd_dispatch → CompletionPort::post()) continue using plain unblock(). This is fine because ISRs are short.
  • All locks (pipe, completion port) must be dropped before calling yield_now().

Why int 0x50 works from syscall context

During a SYSCALL handler the CPU runs on the kernel stack with GS = kernel GS (set by swapgs in the syscall entry stub). int 0x50 pushes a ring-0 interrupt frame; the yield stub sees RPL = 0 in the saved CS, so it skips swapgs, saves all GPRs, and executes fxsave. yield_tick saves RSP and switches to the target’s stack. The target’s frame (saved by its own yield or by timer preemption) is restored via fxrstor, the GPR pops, and iretq.

Key files

| File | Change |
|---|---|
| libkernel/src/task/scheduler.rs | ipc_yield_stub asm, yield_tick, DONATE_TARGET, public API |
| libkernel/src/interrupts.rs | Register vector 0x50 in IDT |
| libkernel/src/file.rs | pipe_wake_reader returns thread idx, yield in PipeWriter |
| libkernel/src/completion_port.rs | post() returns Option<usize> |
| osl/src/io_port.rs | Yield after OP_NOP post |
| libkernel/src/process.rs | Yield before kill_current_thread in terminate_process |

Signal Support

Current state

Phase 1 of POSIX signal support: basic signal data structures, rt_sigaction, rt_sigprocmask, signal delivery on SYSCALL return, rt_sigreturn, and kill.

What works

  • rt_sigaction (syscall 13): install/query signal handlers with SA_SIGINFO and SA_RESTORER
  • rt_sigprocmask (syscall 14): SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK
  • kill (syscall 62): send signals to specific pids
  • Signal delivery on SYSCALL return path via check_pending_signals
  • rt_sigreturn (syscall 15): restore context after signal handler returns
  • Default actions: SIG_DFL (terminate or ignore depending on signal), SIG_IGN
  • sigaltstack (syscall 131): stub returning 0

Signal delivery mechanism

The SYSCALL assembly stub saves 8 registers onto the kernel stack and stores the stack pointer into PerCpuData.saved_frame_ptr (GS offset 40). After syscall_dispatch returns, check_pending_signals() is called:

  1. Peek at process’s pending & !blocked — early return if empty
  2. Dequeue lowest pending signal
  3. If SIG_DFL: terminate (SIGKILL, SIGTERM, etc.) or ignore (SIGCHLD, SIGCONT)
  4. If SIG_IGN: return
  5. If handler installed: construct rt_sigframe on user stack, rewrite saved frame

The rt_sigframe on the user stack contains:

  • pretcode (8B): sa_restorer address (musl’s __restore_rt)
  • siginfo_t (128B): signal number, errno, code
  • ucontext_t (224B): saved registers (sigcontext), fpstate ptr, signal mask
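Assuming the sizes listed above, the frame can be modelled as a #[repr(C)] struct — a layout sketch in which the siginfo_t and ucontext_t contents are left as opaque byte arrays, not the kernel's real definitions:

```rust
// Layout model of the rt_sigframe pushed on the user stack.
// Total: 8 + 128 + 224 = 360 bytes, 8-byte aligned, no padding.
#[repr(C)]
struct RtSigframe {
    pretcode: u64,      // sa_restorer address (musl's __restore_rt)
    siginfo: [u8; 128], // siginfo_t: signal number, errno, code, ...
    ucontext: [u8; 224],// ucontext_t: sigcontext, fpstate ptr, signal mask
}
```

The user RSP is set to the base of this struct, so the handler's eventual ret pops pretcode and lands in __restore_rt.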

The saved SYSCALL frame is rewritten so sysretq “returns” into the handler:

  • RCX (→ RIP) = handler address
  • RDI = signal number
  • RSI = &siginfo (if SA_SIGINFO)
  • RDX = &ucontext (if SA_SIGINFO)
  • User RSP = rt_sigframe base

When the handler returns, __restore_rt calls rt_sigreturn (syscall 15), which reads the saved context from the rt_sigframe and restores the original registers and signal mask.

Architecture

Key files

| File | Purpose |
|---|---|
| libkernel/src/signal.rs | Signal constants, SigAction, SignalState |
| libkernel/src/syscall.rs | PerCpuData.saved_frame_ptr, SyscallSavedFrame, check_pending_signals, deliver_signal |
| libkernel/src/process.rs | Process.signal field |
| osl/src/signal.rs | sys_rt_sigaction, sys_rt_sigprocmask, sys_rt_sigreturn, sys_kill |

PerCpuData layout

| Offset | Field | Purpose |
|---|---|---|
| 0 | kernel_rsp | Loaded on SYSCALL entry |
| 8 | user_rsp | Saved by entry stub |
| 16 | user_rip | RCX saved by entry stub |
| 24 | user_rflags | R11 saved by entry stub |
| 32 | user_r9 | R9 saved (for clone) |
| 40 | saved_frame_ptr | RSP after register pushes (for signal delivery) |

saved_frame_ptr is not saved/restored per-thread

saved_frame_ptr lives in a single per-CPU slot and is not saved/restored during context switches. This is safe today because it is set and consumed entirely within the SYSCALL entry/exit path with interrupts disabled:

  1. The assembly stub pushes registers, writes mov gs:40, rsp, then calls syscall_dispatch followed by check_pending_signals — all before the register pops and sysretq.
  2. rt_sigreturn is itself a syscall, so the stub sets saved_frame_ptr at the start of the same SYSCALL path before sys_rt_sigreturn reads it.

No preemption can occur between setting and consuming the pointer.

If signal delivery is ever needed from interrupt context (e.g. delivering SIGSEGV from a page-fault handler or SIGINT from a keyboard ISR), this design must be revisited — either by saving/restoring saved_frame_ptr per-thread in the scheduler, or by using a different mechanism to locate the interrupted frame (e.g. the interrupt stack frame pushed by the CPU).

Signal-interrupted syscalls (EINTR)

Blocking syscalls (sys_wait4, PipeReader::read) can be interrupted by signals. The mechanism uses a per-process signal_thread field:

  1. Before blocking, the syscall stores its scheduler thread index in process.signal_thread.
  2. sys_kill, after queuing a signal, reads signal_thread and calls scheduler::unblock() on it if set.
  3. When the blocked thread wakes, it checks for pending signals. If any are deliverable (pending & !blocked != 0), it returns EINTR instead of re-blocking.
  4. The field is cleared on any exit path (data available, EOF, or signal).

Only interruptible blocking sites set signal_thread. Non-interruptible blocks (vfork parent in sys_clone, blocking() async bridge) never set it, so they remain unaffected.
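The decision a woken, interruptible syscall makes can be sketched as a pure function (hosted-Rust model; names are illustrative, and signal sets are modelled as u64 bitmasks as in the pending & !blocked check above):

```rust
#[derive(Debug, PartialEq)]
enum WakeAction {
    Proceed,     // data available / EOF: finish the syscall normally
    ReturnEintr, // a deliverable signal is pending: bail out with EINTR
    Reblock,     // spurious wakeup: go back to sleep
}

// Real data wins; otherwise a deliverable signal (pending & !blocked)
// means returning EINTR instead of re-blocking.
fn on_wake(data_ready: bool, pending: u64, blocked: u64) -> WakeAction {
    if data_ready {
        WakeAction::Proceed
    } else if pending & !blocked != 0 {
        WakeAction::ReturnEintr
    } else {
        WakeAction::Reblock
    }
}
```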

The shell’s cmd_run handles EINTR from waitpid by forwarding SIGINT to the child process and re-waiting, enabling Ctrl+C to reach child processes running in the terminal.

Future work

  • Exception-generated signals (SIGSEGV, SIGILL, SIGFPE from ring-3 faults)
  • FPU state save/restore in signal frames
  • Signal queuing (currently only one instance per signal — standard signals)

Process Spawning

How user-space processes are created.


Current Implementation

Process creation uses the standard Linux clone(CLONE_VM|CLONE_VFORK) + execve path. musl’s posix_spawn and Rust’s std::process::Command work unmodified.

clone (CLONE_VM | CLONE_VFORK | SIGCHLD)

clone creates a child process that shares the parent’s address space. The parent blocks until the child calls execve or _exit.

See syscalls/clone.md for full details.

execve

execve replaces the current process’s address space with a new ELF binary. Reads the ELF from the VFS, creates a fresh PML4, maps segments, builds the initial stack with argc/argv/envp/auxv, closes FD_CLOEXEC fds, unblocks the vfork parent, and jumps to userspace.

See syscalls/execve.md for full details.

Internal spawning (kernel-side)

For boot-time process creation (e.g. auto-launching the shell), the kernel uses osl::spawn::spawn_process_full(elf_data, argv, envp, parent_pid) which combines ELF loading and process creation in a single call.

kernel/src/ring3.rs provides spawn_process and spawn_process_with_env wrappers that delegate to spawn_process_full.


Process lifecycle

parent: clone(CLONE_VM|CLONE_VFORK)
  │
  │  ┌─── child created (shares parent PML4) ───┐
  │  │                                            │
  │  │  execve("/bin/prog", argv, envp)           │
  │  │    → fresh PML4, ELF mapped                │
  │  │    → close CLOEXEC fds                     │
  │  │    → unblock parent                        │
  │  │    → jump to ring 3                        │
  │  │                                            │
  ├──┘  parent unblocked                          │
  │                                               │
  │  waitpid(child, &status, 0)                   │
  │    → blocks until child exits                 │
  │                                               │
  │  child: _exit(code)                           │
  │    → mark zombie, wake parent                 │
  │                                               │
  ▼  parent: waitpid returns, reap zombie

Key files

| File | Purpose |
|---|---|
| osl/src/clone.rs | sys_clone — vfork child creation |
| osl/src/exec.rs | sys_execve — replace process image |
| osl/src/spawn.rs | spawn_process_full — kernel-side ELF spawning |
| osl/src/elf_loader.rs | ELF parsing and address space setup |
| libkernel/src/task/scheduler.rs | spawn_clone_thread, clone_trampoline |
| kernel/src/ring3.rs | spawn_process wrapper for boot-time use |

Future work

  • fork + CoW page faults — full POSIX fork with copy-on-write. Requires page fault handler and per-frame reference counting.
  • fd inheritance across clone — currently the child gets a copy of the parent’s fd table; selective inheritance could be added.

Plan: User Space and Process Isolation

Context

The kernel currently runs everything — drivers, shell, filesystem — in a single ring-0 address space as async Rust tasks. This document outlines the path from that baseline to a system where untrusted programs run in isolated ring-3 processes with their own virtual address spaces, communicating with the kernel through system calls, and eventually linked against a ported musl libc.


Progress Summary

Phases 0–6 are complete. The kernel runs a musl-linked C shell (user/shell.c) as its primary user interface. The shell auto-launches on boot, supports line editing, built-in commands (echo, pwd, cd, ls, cat, exit, help), and spawning external programs. Process creation uses standard Linux clone(CLONE_VM|CLONE_VFORK) + execve, enabling unpatched musl posix_spawn and Rust std::process::Command. 35+ syscalls are implemented including pipe2, dup2, fcntl, getpid, getrandom, clone/execve, and custom completion port / IPC syscalls.

| Phase | Status | Milestone |
|---|---|---|
| 0 — Toolchain | Done | Hand-crafted assembly blobs and static ELF binaries load and run |
| 1 — Ring-3 + SYSCALL | Done | GDT has ring-3 segments; SYSCALL/SYSRET works; sys_write, sys_exit, sys_arch_prctl implemented |
| 2 — Per-process page tables | Done | create_user_page_table, map_user_page, CR3 switching on context switch; ring-3 page faults kill the process |
| 3 — Process abstraction | Done | Process struct, process table, ELF loader, exec shell command, zombie reaping |
| 4 — System call layer | Done | 14 syscalls implemented; initial stack with auxv; brk/mmap for heap; writev for musl printf |
| 5 — Cross-compiler + musl | Done | Docker-based musl cross-compiler (scripts/user-build.sh); static musl binaries run on ostoo |
| 6 — Spawn / wait / user shell | Done | clone(CLONE_VM\|CLONE_VFORK) + execve for process creation; wait4; pipe2, dup2, fcntl, getpid, getrandom; userspace C shell with line editing, auto-launched on boot |
| 7 — Signals | In progress | Phase 1 done (see Signal Support): rt_sigaction, rt_sigprocmask, kill, rt_sigreturn, delivery on SYSCALL return |

What works today

  • Userspace shell (user/shell.c): musl-linked C shell compiled via Docker cross-compiler, deployed to disk image at /shell. Auto-launched from kernel/src/main.rs on boot; falls back to kernel shell if not found.
  • Line editing in the shell: read char-by-char, echo, backspace, Ctrl+C (cancel line), Ctrl+D (exit on empty line).
  • Built-in commands: echo, pwd, cd, ls, cat, exit, help.
  • External programs: posix_spawn(path) + waitpid from the shell.
  • Raw keypress delivery to userspace via libkernel/src/console.rs: foreground PID routing, blocking read(0), keyboard ISR wakeup.
  • Per-process FD table (fds 0–2 = ConsoleHandle); FileHandle trait with ConsoleHandle, VfsHandle, and DirHandle implementations.
  • 35+ syscalls implemented (see docs/syscalls/ for per-syscall docs): read, write, open, close, fstat, lseek, mmap, mprotect, munmap, brk, ioctl, writev, exit/exit_group, wait4, getcwd, chdir, arch_prctl, futex, getdents64, set_tid_address, set_robust_list, clone, execve, pipe2, dup2, fcntl, getpid, getrandom, kill, rt_sigaction, rt_sigprocmask, rt_sigreturn, sigaltstack, madvise, sched_getaffinity, clock_gettime, plus custom syscalls for completion ports (501–503), IRQ (504), and IPC channels (505–507).
  • open resolves paths relative to process CWD; supports both files (VfsHandle) and directories (DirHandle with O_DIRECTORY).
  • getdents64 returns linux_dirent64 structs from DirHandle.
  • clone(CLONE_VM|CLONE_VFORK) creates a child sharing the parent’s address space; execve replaces it with a new ELF binary. wait4 blocks parent until child exits/zombies.
  • writev (used by musl’s printf) writes scatter/gather buffers to VGA.
  • brk grows the process heap by allocating and mapping zero-filled pages.
  • mmap supports anonymous MAP_PRIVATE allocations via a bump-down allocator starting at 0x4000_0000_0000.
  • Process tracks brk_base/brk_current (computed from ELF segment extents), mmap_next/mmap_regions, fd_table, cwd, parent_pid, wait_thread.
  • ELF parser extracts phdr_vaddr, phnum, and phentsize for the auxiliary vector (musl reads AT_PHDR/AT_PHNUM/AT_PHENT during startup).
  • spawn_process_full (in osl/src/spawn.rs) builds the initial stack with argc, argv strings, envp (NULL), and auxiliary vector.
  • Async-to-sync bridge (osl/src/blocking.rs): spawns async VFS operations as kernel tasks, blocks the user thread, unblocks on completion.
  • Unhandled syscalls log a warning with the syscall number and first 3 args, then return -ENOSYS.
  • Ring-3 page faults, GPFs, and invalid opcodes log the fault, mark the process zombie, wake the parent’s wait thread, restore kernel GS polarity, and kill the thread — no kernel panic.
  • test isolation verifies two independently-created PML4s have genuinely independent user-space mappings at the same virtual address.
  • System info commands (cpuinfo, meminfo, memmap, pmap, threads, tasks, idt, pci, lapic, ioapic, drivers, uptime) are exposed as /proc virtual files accessible via cat /proc/<file>.

Key implementation files

| File | Role |
|---|---|
| libkernel/src/gdt.rs | GDT with kernel + user code/data segments, TSS, set_kernel_stack for rsp0 |
| libkernel/src/syscall.rs | SYSCALL MSR init, assembly entry stub, per-CPU data |
| libkernel/src/file.rs | FileHandle trait, FileError enum, ConsoleHandle |
| libkernel/src/console.rs | Console input buffer, foreground PID routing, blocking read |
| libkernel/src/process.rs | Process struct (fd_table, cwd, brk/mmap, parent/wait), ProcessManager, zombie lifecycle |
| libkernel/src/elf.rs | ELF64 parser (static ET_EXEC, x86-64) with phdr metadata for auxv |
| libkernel/src/memory/mod.rs | create_user_page_table, map_user_page, switch_address_space |
| libkernel/src/task/scheduler.rs | spawn_user_thread, process_trampoline, CR3 switching in preempt_tick, block/unblock |
| libkernel/src/interrupts.rs | Ring-3-aware page fault, GPF, and invalid opcode handlers |
| osl/src/syscalls/ | syscall_dispatch + syscall implementations (io.rs, fs.rs, mem.rs, process.rs, misc.rs) |
| osl/src/errno.rs | Linux errno constants, file_errno() / vfs_errno() converters |
| osl/src/file.rs | VfsHandle, DirHandle (VFS-backed file handles) |
| osl/src/blocking.rs | Async-to-sync bridge for VFS calls |
| osl/src/spawn.rs | spawn_process_full (ELF spawning with argv and parent PID) |
| kernel/src/ring3.rs | Legacy spawn_process wrapper, spawn_blob (raw code), test helpers |
| kernel/src/keyboard_actor.rs | Foreground routing: raw bytes to console or kernel line editor |
| kernel/src/main.rs | Auto-launch /shell on boot |
| devices/src/vfs/proc_vfs/mod.rs | ProcVfs with 12+ virtual files (generator submodules) |
| user/shell.c | Userspace shell (musl, static) |
| docs/syscalls/*.md | Per-syscall documentation |

Virtual Address Space Layout

The kernel’s heap, APIC, and MMIO window live in the high canonical half (≥ 0xFFFF_8000_0000_0000), so the entire lower canonical half is available for user process address spaces. The kernel/user boundary is enforced at the PML4 level: entries 0–255 (lower half) are user-private; entries 256–510 (high half) are kernel-shared; entry 511 is the per-PML4 recursive self-mapping.

0x0000_0000_0000_0000  ← canonical zero (null pointer trap page, unmapped)
0x0000_0000_0040_0000  ← ELF load address (4 MiB, standard x86-64)
         ↓ text, data, BSS
         ↓ brk heap (grows up from page-aligned end of highest PT_LOAD segment)
         ...
0x0000_4000_0000_0000  ← mmap region (bump-down allocator, grows downward)
         ...
0x0000_7FFF_F000_0000  ← ELF user stack base (8 pages = 32 KiB)
0x0000_7FFF_F000_8000  ← ELF user stack top (RSP starts here minus auxv layout)
0x0000_7FFF_FFFF_FFFF  ← top of lower canonical half (entire range = user)
                         (non-canonical gap)
0xFFFF_8000_0000_0000  ← kernel heap        (HEAP_START, 512 KiB)
0xFFFF_8001_0000_0000  ← Local APIC MMIO    (APIC_BASE)
0xFFFF_8001_0001_0000  ← IO APIC(s)
0xFFFF_8002_0000_0000  ← MMIO window        (MMIO_VIRT_BASE, 512 GiB)
phys_mem_offset         ← bootloader physical memory identity map (high half)
0xFFFF_FF80_0000_0000  ← recursive PT window (PML4[511])
0xFFFF_FFFF_FFFF_F000  ← PML4 self-mapping

Kernel entries (PML4 indices 256–510) are copied into every process page table without USER_ACCESSIBLE; they are invisible to ring-3 code.
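The PML4-level split can be checked numerically — the index is simply bits 47:39 of the canonical virtual address (a small helper sketch, not kernel API):

```rust
// PML4 index = bits 47:39 of the virtual address.
fn pml4_index(virt: u64) -> usize {
    ((virt >> 39) & 0x1FF) as usize
}

// Entries 0–255 (lower canonical half) are user-private;
// 256–510 are kernel-shared; 511 is the recursive self-mapping.
fn is_user_entry(virt: u64) -> bool {
    pml4_index(virt) < 256
}
```

For example, the ELF load address maps to index 0, the user stack to index 255, and the kernel heap at 0xFFFF_8000_0000_0000 to exactly index 256 — the first kernel-shared entry.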


Phase 0 — Toolchain and Build Infrastructure ✅ COMPLETE

Goal: produce user-space ELF binaries that the kernel can load, without needing musl yet.

0a. Custom linker script

Write user/link.ld:

```ld
ENTRY(_start)
SECTIONS {
  . = 0x400000;
  .text   : { *(.text*) }
  .rodata : { *(.rodata*) }
  .data   : { *(.data*) }
  .bss    : { *(.bss*) *(COMMON) }
}
```

0b. Rust no_std user target

Add a custom target JSON x86_64-ostoo-user.json with:

  • "os": "none", "env": "", "vendor": "unknown"
  • "pre-link-args": pass the linker script
  • "panic-strategy": "abort" (no unwinding in user space initially)
  • "disable-redzone": true (same requirement as kernel)

A minimal user/ crate can implement _start in assembly, call a main, then invoke the exit syscall.

0c. Assembly user programs

Before the ELF loader exists, a hand-crafted binary blob (or raw ELF built from a few lines of NASM) is enough to verify the ring-3 transition and basic syscalls work.


Phase 1 — Ring-3 GDT Segments and SYSCALL Infrastructure ✅ COMPLETE

Goal: the kernel can jump to ring 3 and come back via SYSCALL/SYSRET. No process isolation yet — user code runs in the kernel’s own address space.

What was implemented:

  • GDT extended with kernel data, user data, and user code segments in the order required by IA32_STAR (libkernel/src/gdt.rs).
  • TSS.rsp0 updated via set_kernel_stack() on every context switch to a user process.
  • SYSCALL MSRs (STAR, LSTAR, FMASK, EFER.SCE) configured in libkernel/src/syscall.rs::init().
  • Assembly entry stub with swapgs, per-CPU kernel/user RSP swap, and SysV64 argument shuffle before calling syscall_dispatch.
  • Three syscalls: write (fd 1/2 to VGA), exit/exit_group (mark zombie + kill thread), arch_prctl(ARCH_SET_FS) (write IA32_FS_BASE MSR).
  • Ring-3 test (test ring3): drops to user mode, writes “Hello from ring 3!” via syscall, exits cleanly.

1a. GDT additions (libkernel/src/gdt.rs)

Add four new descriptors in the order required by IA32_STAR:

Index  Selector  Descriptor
  0    0x00      Null
  1    0x08      Kernel code (ring 0, already exists)
  2    0x10      Kernel data (ring 0) ← new; SYSCALL loads SS from STAR[47:32]+8
  3    0x18      User   data (ring 3) ← new; SYSRET loads SS from STAR[63:48]+8
  4    0x20      User   code (ring 3) ← new; SYSRET loads CS from STAR[63:48]+16
  5    0x28+     TSS (2 slots for the 16-byte system descriptor)

IA32_STAR layout: bits 47:32 = kernel CS (SYSCALL), bits 63:48 = user CS − 16 (SYSRET uses this+16 for CS and +8 for SS).
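The selector arithmetic can be verified numerically. Assuming KERNEL_CS = 0x08 and USER_CS = 0x20 (the values implied by the MSR setup sketch in 1c), the CPU derives:

```rust
const KERNEL_CS: u16 = 0x08; // assumed selectors matching the GDT layout
const USER_CS: u16 = 0x20;

fn star() -> u64 {
    ((KERNEL_CS as u64) << 32) | (((USER_CS - 16) as u64) << 48)
}

// What the CPU derives from IA32_STAR:
fn syscall_cs(s: u64) -> u16 { ((s >> 32) & 0xFFFF) as u16 }        // SYSCALL CS
fn syscall_ss(s: u64) -> u16 { (((s >> 32) & 0xFFFF) + 8) as u16 }  // SYSCALL SS
fn sysret_cs(s: u64) -> u16 { (((s >> 48) & 0xFFFF) + 16) as u16 }  // SYSRET CS
fn sysret_ss(s: u64) -> u16 { (((s >> 48) & 0xFFFF) + 8) as u16 }   // SYSRET SS
```

This is why user data must sit at 0x18 and user code at 0x20: SYSRET reads STAR[63:48] = 0x10 and adds 8 and 16 respectively.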

Update the Selectors struct and init() in gdt.rs.

1b. TSS kernel-stack field

When the CPU delivers a ring-3 interrupt it loads RSP from TSS.rsp0. This must point to the current process’s kernel stack top. For now a single global TSS is fine; when processes exist, rsp0 is updated on every context switch.

1c. SYSCALL MSR setup (libkernel/src/interrupts.rs or new libkernel/src/syscall.rs)

```rust
pub fn init_syscall() {
    // IA32_STAR: kernel CS at bits 47:32, user CS-16 at bits 63:48
    let star: u64 = ((KERNEL_CS as u64) << 32) | ((USER_CS as u64 - 16) << 48);
    unsafe { Msr::new(0xC000_0081).write(star); }         // STAR

    // IA32_LSTAR: entry point for 64-bit SYSCALL
    unsafe { Msr::new(0xC000_0082).write(syscall_entry as u64); }

    // IA32_FMASK: flags to clear on SYSCALL (IF = bit 9, DF = bit 10)
    unsafe { Msr::new(0xC000_0084).write(0x0000_0600); }  // IF | DF

    // Enable SCE bit in EFER
    let efer = unsafe { Msr::new(0xC000_0080).read() };
    unsafe { Msr::new(0xC000_0080).write(efer | 1); }
}
```

1d. Assembly syscall entry stub

libkernel/src/syscall_entry.asm (or global_asm! in syscall.rs):

```asm
syscall_entry:
    swapgs                  ; switch to kernel GS (store user GS)
    mov  [gs:USER_RSP], rsp ; save user RSP into per-cpu area
    mov  rsp, [gs:KERN_RSP] ; load kernel RSP

    push rcx                ; user RIP (SYSCALL saves it here)
    push r11                ; user RFLAGS

    ; push all scratch registers
    push rax
    push rdi
    push rsi
    push rdx
    push r10
    push r8
    push r9

    ; rax = syscall number, rdi/rsi/rdx/r10/r8/r9 = arguments.
    ; Simplified sketch: the real stub must also shuffle the syscall
    ; arguments into the SysV C ABI (in particular r10 -> rcx) before
    ; the call, since moving rax into rdi clobbers the first argument.
    mov  rdi, rax
    call syscall_dispatch   ; -> rax = return value

    pop  r9
    pop  r8
    pop  r10
    pop  rdx
    pop  rsi
    pop  rdi
    add  rsp, 8             ; discard saved rax (rax holds the return value)

    pop  r11                ; restore RFLAGS
    pop  rcx                ; restore user RIP
    mov  rsp, [gs:USER_RSP] ; restore user RSP
    swapgs
    sysretq
```

swapgs requires a per-CPU data block holding the kernel stack pointer. Implement as a small struct at a known virtual address (or via GS_BASE MSR).

1e. Minimal syscall dispatch table

Start with just three numbers (matching Linux x86-64 for musl compatibility):

| Number | Name | Action |
|---|---|---|
| 0 | read | stub → return −ENOSYS |
| 1 | write | write to VGA console if fd==1/2 |
| 60 | exit | terminate current process |

1f. First ring-3 test

Write a tiny inline assembly test in kernel/src/main.rs that:

  1. Pushes a fake user-mode iret frame (SS, RSP, RFLAGS with IF, CS ring-3, RIP).
  2. iretq into ring 3.
  3. User code executes syscall with rax=1 (write), prints one character.
  4. Kernel writes it to VGA and returns to ring 3.
  5. User code executes syscall with rax=60 (exit).

This verifies the GDT, SYSCALL, and basic ABI without an ELF loader or address space isolation.


Phase 2 — Per-Process Page Tables and Address Space Isolation ✅ COMPLETE

Goal: each process has its own PML4; kernel mappings are shared; user mappings are private.

What was implemented:

  • MemoryServices::create_user_page_table() allocates a fresh PML4, copies kernel entries (indices 256–510) without USER_ACCESSIBLE, and sets the recursive self-mapping at index 511.
  • MemoryServices::map_user_page() maps individual 4 KiB pages in a non-active page table given its PML4 physical address.
  • unsafe switch_address_space(pml4_phys) writes CR3.
  • Page fault handler (libkernel/src/interrupts.rs) checks stack_frame.code_segment.rpl() — ring-3 faults mark the process zombie (exit code -11 / SIGSEGV), restore kernel GS via swapgs, and call kill_current_thread(). Kernel faults still panic.
  • test isolation shell command verifies two PML4s map the same user virtual address to different physical frames.
  • Scheduler preempt_tick saves/restores CR3 when switching between threads with different page tables.

2a. Page table creation (libkernel/src/memory/)

Add to MemoryServices:

```rust
/// Allocate a fresh PML4, copy all kernel PML4 entries (indices where
/// virtual_address >= KERNEL_SPLIT) into it, and return the physical
/// address of the new PML4 frame.
pub fn create_user_page_table(&mut self) -> PhysAddr;

/// Map a single 4 KiB page in a specific (possibly non-active) page table.
pub fn map_user_page(
    &mut self,
    pml4_phys: PhysAddr,
    virt: VirtAddr,
    phys: PhysAddr,
    flags: PageTableFlags,   // USER_ACCESSIBLE | PRESENT | WRITABLE | NO_EXECUTE as needed
) -> Result<(), MapToError<Size4KiB>>;

/// Switch the active address space.  Must be called with interrupts disabled.
pub unsafe fn switch_address_space(&self, pml4_phys: PhysAddr);
```

2b. Kernel/user PML4 split

The layout gives a clean hardware-level split:

  • PML4 indices 0–255 (lower canonical half, 0x0000_*) — user-private. Left empty at process creation; populated by the ELF loader and mmap.
  • PML4 indices 256–510 (high canonical half, 0xFFFF_8000_* through 0xFFFF_FF7F_*) — kernel-shared. Copied from the kernel PML4 at process creation; marked present but never USER_ACCESSIBLE.
  • PML4 index 511 — the recursive self-mapping. Each process PML4 must have its own entry here pointing to its own physical PML4 frame (not the kernel’s). create_user_page_table must set this explicitly.

2c. Page fault handler upgrade

Replace the panic in page_fault_handler with:

```rust
extern "x86-interrupt" fn page_fault_handler(frame: InterruptStackFrame, ec: PageFaultErrorCode) {
    let faulting_addr = Cr2::read();
    if frame.code_segment.rpl() == PrivilegeLevel::Ring3 {
        // Fault in user space — kill the process (deliver SIGSEGV later).
        kill_current_process(Signal::Segv);
        schedule_next();       // does not return to faulting instruction
    } else {
        panic!("kernel page fault at {:?}\n{:#?}\n{:?}", faulting_addr, frame, ec);
    }
}
```

This is the minimum needed to prevent a kernel panic when user code accesses invalid memory; proper CoW / demand paging comes later.

2d. Address space switch on context switch

The scheduler’s preempt_tick function currently saves/restores only kernel RSP. Extend it to also write CR3 when switching between processes with different page tables.


Phase 3 — Process Abstraction ✅ COMPLETE

Goal: Process struct, a process table, and a working exec.

What was implemented:

  • Process struct (libkernel/src/process.rs) with PID, state (Running/Zombie), PML4 physical address, heap-allocated 64 KiB kernel stack, entry point, user stack top, thread index, and exit code.
  • Global PROCESS_TABLE: Mutex<BTreeMap<ProcessId, Process>> and CURRENT_PID: AtomicU64.
  • insert(), current_pid(), set_current_pid(), with_process(), mark_zombie(), reap(), reap_zombies().
  • Scheduler integration: SchedulableKind::Kernel | UserProcess(ProcessId). spawn_user_thread creates a thread targeting process_trampoline which sets up TSS.rsp0, per-CPU kernel RSP, PID tracking, GS polarity, CR3 switch, and then does iretq into ring-3 user code.
  • kill_current_thread() marks the thread Dead and spins; timer preemption skips dead threads.
  • ELF loader (libkernel/src/elf.rs): minimal parser for static ET_EXEC x86-64 binaries. Returns ElfInfo { entry, segments, phdr_vaddr, phnum, phentsize }.
  • kernel/src/ring3.rs::spawn_process(elf_data) — parses ELF, creates user PML4, maps all PT_LOAD segments (with correct R/W/X flags) plus a user stack page, creates a Process, and spawns a user thread. Returns Ok(ProcessId).
  • Shell command exec <path> reads an ELF from the VFS and calls spawn_process.
  • spawn_blob(code) helper for test commands: maps a raw code blob + stack, creates a Process, spawns a user thread.
  • Zombie reaping: reap_zombies() is called at the start of spawn_blob and spawn_process to free kernel stacks of fully-exited processes.

3a. Process struct (libkernel/src/process/mod.rs)

```rust
pub struct Process {
    pub pid:           ProcessId,
    pub state:         ProcessState,          // Running, Ready, Blocked, Zombie
    pub pml4_phys:     PhysAddr,              // physical address of PML4
    pub kernel_stack:  Vec<u8>,               // 64 KiB kernel stack
    pub saved_rsp:     u64,                   // kernel RSP when not running
    pub user_rsp:      u64,                   // user RSP (restored on ring-3 return)
    pub files:         FileTable,             // open file descriptors
    pub parent:        Option<ProcessId>,
    pub exit_code:     Option<i32>,
}
```

3b. Process table

```rust
lazy_static! {
    static ref PROCESSES: Mutex<BTreeMap<ProcessId, Process>> = ...;
}
```

CURRENT_PID: AtomicU32 — the PID running on each CPU (single-CPU for now).

3c. Scheduler integration

Replace the bare Thread list in scheduler.rs with process-aware scheduling:

  • On preempt_tick: save user context (if coming from ring 3), switch CR3, load next process’s user context and kernel RSP.
  • TSS.rsp0 updated to point to the new process’s kernel stack top.

3d. ELF loader (libkernel/src/elf.rs)

```rust
pub fn load_elf(
    bytes: &[u8],
    process: &mut Process,
    mem: &mut MemoryServices,
) -> Result<VirtAddr, ElfError>;  // returns the entry point
```

Steps:

  1. Validate ELF magic, e_machine == EM_X86_64, e_type == ET_EXEC (static) or ET_DYN (PIE).
  2. For each PT_LOAD segment: allocate physical frames, map at p_vaddr with USER_ACCESSIBLE and flags derived from p_flags (R/W/X).
  3. Copy p_filesz bytes from the ELF image; zero-fill to p_memsz.
  4. Allocate and map a user stack (8–16 pages) just below the stack top.
  5. Set up the initial stack frame: argc=0, argv=NULL, envp=NULL, auxv entries for AT_ENTRY, AT_PHDR, AT_PAGESZ (required by musl’s _start).
  6. Return e_entry.
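Step 2's flag derivation can be sketched as a pure function (illustrative helper — the real mapping uses the x86_64 crate's PageTableFlags):

```rust
// ELF p_flags permission bits.
const PF_X: u32 = 0x1;
const PF_W: u32 = 0x2;
const PF_R: u32 = 0x4;

#[derive(Debug, PartialEq)]
struct SegmentFlags { writable: bool, no_execute: bool }

// USER_ACCESSIBLE and PRESENT are always set for user segments;
// W and NX derive from the segment's p_flags.
fn flags_for_segment(p_flags: u32) -> SegmentFlags {
    SegmentFlags {
        writable: p_flags & PF_W != 0,
        no_execute: p_flags & PF_X == 0, // NX unless the segment is executable
    }
}
```

A text segment (R+X) maps read-only and executable; a data segment (R+W) maps writable with NX set.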

3e. sys_execve syscall

```rust
fn sys_execve(path: *const u8, argv: *const *const u8, envp: *const *const u8) -> ! {
    let bytes = vfs::read_file(path_str).expect("exec: read failed");
    let process = current_process_mut();
    process.reset_address_space();          // drop old page table
    let entry = load_elf(&bytes, process, &mut memory());
    switch_to_user(entry, process.user_stack_top);   // does not return
}
```

Phase 4 — System Call Layer ✅ COMPLETE

Goal: a syscall table wide enough to run a static musl binary that prints “Hello, world!” and exits.

What was implemented:

4a. ELF parser extensions (libkernel/src/elf.rs)

ElfInfo now includes phdr_vaddr, phnum, and phentsize. The parser looks for a PT_PHDR program header (type 6) to get the phdr virtual address directly; fallback computes it from the PT_LOAD segment containing e_phoff. These values populate the auxiliary vector that musl reads during startup.

4b. Process memory tracking (libkernel/src/process.rs)

Process gained four new fields:

Field          Type             Purpose
brk_base       u64              Page-aligned end of highest PT_LOAD segment (immutable)
brk_current    u64              Current program break (starts == brk_base)
mmap_next      u64              Bump-down pointer for anonymous mmap (starts at 0x4000_0000_0000)
mmap_regions   Vec<(u64, u64)>  Tracked (vaddr, len) pairs

Process::new() now takes a brk_base parameter. spawn_process computes it from max(seg.vaddr + seg.memsz) page-aligned up.
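The brk_base computation can be sketched as follows (the segment type is a hypothetical stand-in for the loader's records):

```rust
// brk_base = page-aligned end of the highest PT_LOAD segment.
const PAGE_SIZE: u64 = 4096;

pub struct LoadSegment {
    pub vaddr: u64,
    pub memsz: u64,
}

/// Align each segment end up to a page boundary, then take the max.
pub fn compute_brk_base(segments: &[LoadSegment]) -> u64 {
    segments
        .iter()
        .map(|s| (s.vaddr + s.memsz + PAGE_SIZE - 1) & !(PAGE_SIZE - 1))
        .max()
        .unwrap_or(0)
}
```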

4c. Initial stack layout (kernel/src/ring3.rs)

ELF processes get an 8-page (32 KiB) contiguous stack at 0x7FFF_F000_0000, allocated via alloc_dma_pages(8) so the auxv layout can be written through the kernel’s phys_mem_offset window. build_initial_stack() writes:

[stack_top]
  16 bytes pseudo-random data (AT_RANDOM target)
  alignment padding (8 bytes)
  AT_NULL (0, 0)
  AT_RANDOM (25, addr)
  AT_ENTRY (9, entry_point)
  AT_PHNUM (5, phnum)
  AT_PHENT (4, phentsize)
  AT_PHDR (3, phdr_vaddr)
  AT_PAGESZ (6, 4096)
  AT_UID (11, 0)
  NULL                    ← envp terminator
  NULL                    ← argv terminator
  0                       ← argc = 0
[RSP points here, 16-byte aligned]
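A sketch of the qword layout above, built in a plain Vec<u64> from RSP upward rather than in mapped user memory. The auxv tag constants are the Linux values; the "random" data is a fixed placeholder and the function name is illustrative:

```rust
// Qwords stored from RSP upward: argc, argv/envp terminators, (tag, value)
// auxv pairs ending with AT_NULL, one padding qword, then 16 bytes of
// AT_RANDOM data at the very top.
const AT_NULL: u64 = 0;
const AT_PHDR: u64 = 3;
const AT_PHENT: u64 = 4;
const AT_PHNUM: u64 = 5;
const AT_PAGESZ: u64 = 6;
const AT_ENTRY: u64 = 9;
const AT_UID: u64 = 11;
const AT_RANDOM: u64 = 25;

pub fn initial_stack_words(entry: u64, phdr: u64, phent: u64, phnum: u64, random_ptr: u64) -> Vec<u64> {
    let mut w = vec![0u64, 0, 0]; // argc = 0, argv terminator, envp terminator
    for (tag, val) in [
        (AT_UID, 0),
        (AT_PAGESZ, 4096),
        (AT_PHDR, phdr),
        (AT_PHENT, phent),
        (AT_PHNUM, phnum),
        (AT_ENTRY, entry),
        (AT_RANDOM, random_ptr),
        (AT_NULL, 0),
    ] {
        w.push(tag);
        w.push(val);
    }
    w.push(0);                     // 8 bytes of alignment padding
    w.push(0x0123_4567_89ab_cdef); // 16 bytes of placeholder
    w.push(0xfedc_ba98_7654_3210); // "random" AT_RANDOM data
    w
}
```

An even total word count is what keeps RSP 16-byte aligned when the block is placed at an aligned stack top.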

4d. Syscall table (osl/src/syscalls/mod.rs)

All syscalls use Linux x86-64 numbers for musl compatibility. Unhandled numbers log a warning and return -ENOSYS. Errno constants are defined in osl/src/errno.rs; libkernel uses FileError for structured errors.

Nr   Name             Implementation
0    read             Via fd_table → FileHandle::read; ConsoleHandle blocks on empty input
1    write            Via fd_table → FileHandle::write
2    open             VFS read_file or list_dir → VfsHandle/DirHandle; path resolution relative to CWD
3    close            Via fd_table
5    fstat            S_IFCHR for console fds
8    lseek            Returns -ESPIPE (not seekable)
9    mmap             Anonymous MAP_PRIVATE only; bump-down allocator
10   mprotect         Updates page table flags for VMA regions
11   munmap           Unmaps pages, frees frames, splits/removes VMAs
12   brk              Query or grow heap; allocates and maps zero-filled pages
16   ioctl            Returns -ENOTTY
20   writev           Via fd_table; scatter/gather write
60   exit             Mark zombie, wake parent wait_thread, kill thread
61   wait4            Find zombie child, block if none, reap and return
79   getcwd           Copy process.cwd to user buffer
80   chdir            Validate path via VFS list_dir, update process.cwd
158  arch_prctl       ARCH_SET_FS writes IA32_FS_BASE MSR
202  futex            No-op stub (single-threaded, lock never contended)
217  getdents64       Via DirHandle::getdents64
218  set_tid_address  Returns current PID as TID
231  exit_group       Same as exit (single-threaded)
273  set_robust_list  No-op, returns 0
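The dispatch convention — known numbers route to handlers, everything else logs a warning and returns -ENOSYS — can be sketched as follows (the handler wiring is illustrative, not the actual osl code):

```rust
// Minimal sketch of the syscall dispatch convention.
const ENOSYS: i64 = 38; // Linux errno for "function not implemented"

fn syscall_dispatch(nr: u64, _args: &[u64; 6]) -> i64 {
    match nr {
        0 => 0,  // read  -> sys_read(...) in the real kernel
        1 => 0,  // write -> sys_write(...)
        60 => 0, // exit  -> sys_exit(...), which never returns in reality
        other => {
            // The real kernel logs: "unhandled syscall {other}".
            let _ = other;
            -ENOSYS
        }
    }
}
```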

Lock ordering for brk and mmap: process table lock acquired/released to read state, then memory lock for frame allocation and page mapping, then process table lock re-acquired to write updates. This avoids nested lock deadlocks.

See docs/syscalls/ for detailed per-syscall documentation.

4e. What’s still missing (deferred to later phases)

  • SMAP enforcement: User pointers in writev, fstat, brk are accessed without stac/clac.
  • Page deallocation: ✅ Fixed — munmap frees frames and splits VMAs; brk shrink unmaps and frees pages; process exit cleans up the entire user address space.
  • mprotect: ✅ Fixed — updates page table flags for the target VMA range.
  • FS_BASE save/restore: ✅ Fixed — FS_BASE is saved/restored per-thread in preempt_tick via save_current_context / restore_thread_state.

Phase 5 — Cross-Compiler and musl Port ✅ COMPLETE

Goal: compile C programs that run as ostoo user processes.

What was implemented:

  • Docker-based build environment (scripts/user-build.sh) using x86_64-linux-musl-cross toolchain.
  • user/Makefile compiles *.c files to static musl-linked ELF binaries.
  • user/shell.c is the primary musl binary (see Phase 6).
  • Binaries are deployed to the exFAT disk image or shared via virtio-9p.

5a. Toolchain strategy

The simplest path: use an existing x86_64-linux-musl sysroot unmodified, because we implement Linux-compatible syscall numbers (Phase 4). musl does not inspect the OS name at runtime — it just issues syscalls.

Option A (quickest): install x86_64-linux-musl-gcc from musl.cc or via brew install x86_64-linux-musl-cross. Compile with:

x86_64-linux-musl-gcc -static -o hello hello.c

The resulting fully-static ELF should work on ostoo with the Phase 4 syscalls.

Option B (custom triple): build musl from source with a custom --target configured for ostoo. This is useful once ostoo diverges from Linux’s ABI (e.g. custom syscall numbers or a different startup convention).

5b. musl build recipe (Option B outline)

# Prerequisites: a bare x86_64-elf-gcc cross-compiler (via crosstool-ng or
# manual binutils + gcc build targeting x86_64-unknown-elf).

git clone https://git.musl-libc.org/cgit/musl
cd musl
./configure \
  --target=x86_64 \
  --prefix=/opt/ostoo-sysroot \
  --syslibdir=/opt/ostoo-sysroot/lib \
  CROSS_COMPILE=x86_64-elf-
make -j$(nproc)
make install

Key musl files:

  • arch/x86_64/syscall_arch.h — __syscall0 through __syscall6 use the syscall instruction; no changes needed if syscall numbers match Linux.
  • crt/x86_64/crt1.o — _start sets up argc/argv/envp from the initial stack (ABI defined by the ELF auxiliary vector; match what the ELF loader sets up in Phase 3d).
  • src/env/__init_tls.c — calls arch_prctl(ARCH_SET_FS, ...); requires the sys_arch_prctl syscall (Phase 4b).

5c. Rust user programs

For Rust programs targeting ostoo, add a custom target x86_64-ostoo-user.json (from Phase 0b) and a minimal ostoo-rt crate that:

  • Provides _start (sets up a stack frame; calls main; calls sys_exit).
  • Provides #[panic_handler] that calls sys_exit(1).
  • Wraps the small syscall ABI.

Users can then write:

#![no_std]
#![no_main]
extern crate ostoo_rt;

#[no_mangle]
pub extern "C" fn main() {
    ostoo_rt::write(1, b"Hello from Rust!\n");
}

Phase 6 — Spawn, Wait, and a Minimal Shell ✅ COMPLETE

Goal: a user-mode shell that can launch and wait for child programs.

What was implemented:

Process creation uses the standard Linux clone(CLONE_VM|CLONE_VFORK) + execve path. musl’s posix_spawn and Rust’s std::process::Command work unmodified.

6a. clone (syscall 56)

clone(CLONE_VM|CLONE_VFORK|SIGCHLD) creates a child sharing the parent’s address space. The parent blocks until the child calls execve or _exit.

See clone.

6b. execve (syscall 59)

Replaces the current process image with a new ELF binary. Reads from VFS, creates a fresh PML4, maps segments, builds the initial stack, closes CLOEXEC fds, unblocks the vfork parent, and jumps to userspace.

See execve.

6c. wait4 (syscall 61)

  • sys_wait4(pid, status_ptr, options) — find zombie child, write exit status, reap, return child PID
  • If no zombie found: register wait_thread on parent, block, retry on wake
  • sys_exit wakes parent’s wait_thread via scheduler::unblock()

6d. Userspace shell (user/shell.c)

  • Compiled with musl (static), deployed at /shell
  • Line editing: read char-by-char, echo, backspace, Ctrl+C, Ctrl+D
  • Built-in commands: echo, pwd, cd, ls, cat, exit, help
  • External programs: posix_spawn(path) + waitpid(child, &status, 0)
  • Auto-launched from kernel/src/main.rs; falls back to kernel shell if /shell is not found on the filesystem

6e. What’s deferred

  • fork + CoW page faults — standard POSIX fork is not implemented. Adding it would require: marking all user pages read-only in both parent and child, a CoW page fault handler that copies on write, and reference counting on physical frames.

Phase 7 — Signals ⬜ NOT STARTED

Signals are the last major piece of POSIX plumbing needed for a realistic user-space environment.

Minimal signal implementation

pub struct SigAction { handler: usize, flags: u32, mask: SigSet }
pub struct SigTable  { actions: [SigAction; 32], pending: SigSet, masked: SigSet }
  • sys_rt_sigaction installs handlers.
  • Before returning to user space after a syscall or interrupt, check pending & ~masked.
  • If set: push a signal frame on the user stack (siginfo + ucontext), set RIP to the handler, clear the pending bit.
  • sys_rt_sigreturn: the signal handler calls this when done; the kernel pops the ucontext and resumes normal user execution.
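The pending-signal check before returning to ring 3 reduces to bitmask arithmetic. A sketch with SigSet as a plain u64 (bit N = signal N; helper names are illustrative):

```rust
// Signal sets as plain 64-bit masks.
pub type SigSet = u64;

/// Lowest-numbered deliverable signal, if any: pending & !masked.
pub fn next_deliverable(pending: SigSet, masked: SigSet) -> Option<u32> {
    let ready = pending & !masked;
    if ready == 0 {
        None
    } else {
        Some(ready.trailing_zeros())
    }
}

/// Clear the pending bit once the signal frame is pushed and RIP
/// has been redirected to the handler.
pub fn take_signal(pending: &mut SigSet, sig: u32) {
    *pending &= !(1u64 << sig);
}
```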

Dependency Graph

Phase 0 ✅ ← Phase 1 ✅ ← Phase 2 ✅ ← Phase 3 ✅ ← Phase 4 ✅ ← Phase 5 ✅ ← Phase 6 ✅
(toolchain)   (ring-3,       (address     (Process,        (syscall        (musl)       (spawn/wait/
               syscall)       spaces)      ELF loader)      layer)                       shell)
                                                                                  ↓
                                                                           Phase 7 (signals)

Key Risks and Design Decisions

SYSCALL vs INT 0x80

Use SYSCALL/SYSRET (64-bit, fast path). INT 0x80 is the 32-bit ABI; musl uses SYSCALL on x86-64 exclusively.

Kernel/user split

The kernel lives entirely in the high canonical half (0xFFFF_8000_* and above): heap at 0xFFFF_8000_*, APIC at 0xFFFF_8001_*, MMIO window at 0xFFFF_8002_*. The entire lower canonical half is free for user processes. The split is enforced at the PML4 level — user processes simply have no mappings at indices 256–510, and the kernel entries they inherit are never USER_ACCESSIBLE. SMEP (CR4.20) and SMAP (CR4.21) provide the hardware enforcement layer once ring-3 processes exist.

SMEP and SMAP

Once ring-3 processes exist, enable SMEP (CR4.20) to prevent the kernel from accidentally executing user-mapped code, and SMAP (CR4.21) to prevent the kernel from silently accessing user memory without an explicit stac/clac pair. Any kernel code that copies from user buffers must use a checked copy function that uses stac to temporarily permit access.

Static-only ELF initially

Dynamic linking requires an in-kernel or user-space ELF interpreter (a dynamic linker such as ld.so). Start with -static binaries and the ELF loader described in Phase 3d. PIE static binaries (ET_DYN with no INTERP segment) should work with minor adjustments to the loader.

Single CPU for now

The process table and scheduler assume a single CPU. SMP support would require per-CPU CURRENT_PID, per-CPU kernel stacks in the TSS, and IPI-based TLB shootdown when modifying another process’s page table.

Heap size

The kernel heap is 1 MiB. Process control blocks each consume 64 KiB (kernel stack) plus page table frames, plus Vec storage for mmap_regions. Zombie processes are reaped via wait4 + reap(), but loading multiple concurrent processes will still pressure the heap.

Memory management

munmap frees frames and splits/removes VMAs. brk shrink frees pages. Process exit calls cleanup_user_address_space to walk and free all user-half page tables and frames. The kernel heap (1 MiB) is the main remaining pressure point for concurrent processes.


Milestones and Test Checkpoints

Milestone         Observable result                                                                              Status
Phase 1 complete  iretq drops to ring 3; syscall returns to ring 0; “Hello from ring 3!” appears on VGA          ✅ Done
Phase 2 complete  Two user processes have separate address spaces; isolation test passes                         ✅ Done
Phase 3 complete  exec /path/to/elf reads an ELF from the VFS, loads it into a fresh address space, and runs it  ✅ Done
Phase 4 complete  14 syscalls, initial stack with auxv, brk/mmap heap, writev for printf                         ✅ Done
Phase 5 complete  hello compiled with x86_64-linux-musl-gcc -static prints and exits cleanly                     ✅ Done
Phase 6 complete  Userspace shell spawns children and waits for them; auto-launches on boot                      ✅ Done
Phase 7 complete  SIGINT (Ctrl+C) terminates the foreground process                                              ⬜ Not started

read (nr 0)

Linux Signature

ssize_t read(int fd, void *buf, size_t count);

Description

Reads up to count bytes from file descriptor fd into buf.

Current Implementation

Looks up fd in the current process’s per-process file descriptor table and calls FileHandle::read() on the handle.

  • fd 0 (stdin) — ConsoleHandle: Reads raw bytes from the console input buffer (libkernel/src/console.rs). If the buffer is empty, blocks the current scheduler thread via block_current_thread() until the keyboard ISR delivers input via push_input(). Returns at least 1 byte per call.
  • VFS file fds — VfsHandle: Reads from an in-memory buffer loaded at open() time. Maintains a per-handle read position. Returns 0 at EOF.
  • Directory fds — DirHandle: Returns -EISDIR (-21).
  • Invalid fds: Returns -EBADF (-9).

Validates that buf falls within user address space (< 0x0000_8000_0000_0000). Returns -EFAULT (-14) on invalid pointers. Returns 0 immediately if count is 0.
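The range check can be sketched as follows (illustrative helper, not the kernel's actual function): the whole [buf, buf+count) range must sit below the canonical-half boundary, and overflow must be rejected.

```rust
// User space is the lower canonical half: [0, 0x0000_8000_0000_0000).
const USER_SPACE_END: u64 = 0x0000_8000_0000_0000;

/// True if the entire [buf, buf + count) range lies in user space.
pub fn user_range_ok(buf: u64, count: u64) -> bool {
    match buf.checked_add(count) {
        Some(end) => buf < USER_SPACE_END && end <= USER_SPACE_END,
        None => false, // address arithmetic wrapped around
    }
}
```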

Source: osl/src/syscalls/io.rs — sys_read

Future Work

  • Support partial reads and proper error handling for VFS files.
  • SMAP enforcement for user buffer validation.

write (nr 1)

Linux Signature

ssize_t write(int fd, const void *buf, size_t count);

Description

Writes up to count bytes from buf to file descriptor fd.

Current Implementation

Looks up fd in the current process’s per-process file descriptor table and calls FileHandle::write() on the handle.

  • fd 1 (stdout) and fd 2 (stderr) — ConsoleHandle: Interprets buf as UTF-8 and prints to the VGA text buffer via print!(). If not valid UTF-8, falls back to printing printable ASCII (0x20..0x7F) plus \n, \r, \t. Returns count on success.
  • VFS file fds — VfsHandle: Returns -EBADF (-9) — files are read-only.
  • Invalid fds: Returns -EBADF (-9).

Validates that buf falls within user address space (< 0x0000_8000_0000_0000). Returns -EFAULT (-14) on invalid pointers.

Source: osl/src/syscalls/io.rs — sys_write

Future Work

  • Support writable VFS files.
  • Handle partial writes.

open (nr 2)

Linux Signature

int open(const char *pathname, int flags, mode_t mode);

Description

Opens a file or directory at pathname and returns a file descriptor.

Current Implementation

  1. Reads a null-terminated path string from user space (max 4096 bytes). Returns -EFAULT if the pointer is invalid.
  2. Resolves the path relative to the process’s current working directory (cwd). Normalises . and .. components.
  3. Unless O_DIRECTORY (0o200000) is set, first attempts to open as a file via devices::vfs::read_file() (through osl::blocking::blocking()). On success, the entire file content is loaded into a VfsHandle (buffered in kernel memory) and a new fd is allocated.
  4. If the file open fails with VfsError::NotFound or VfsError::NotAFile, or O_DIRECTORY was requested, falls back to opening as a directory via devices::vfs::list_dir(). On success, creates a DirHandle with the directory listing and allocates a new fd.
  5. Returns the new fd number on success, or a negative errno.

The VFS operations use osl::blocking::blocking() which spawns the async VFS call as a kernel task and blocks the calling user thread until it completes.

Flags supported: O_DIRECTORY (to explicitly request directory). O_RDONLY is implied for all opens. Other flags are accepted but ignored.
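The path resolution and normalisation in step 2 can be sketched in plain Rust (illustrative, not the exact kernel routine): absolute paths ignore the CWD, "." and empty components are dropped, and ".." pops one level but never escapes the root.

```rust
/// Resolve `path` against `cwd` and normalise "." and ".." components.
pub fn resolve_path(cwd: &str, path: &str) -> String {
    let mut parts: Vec<&str> = Vec::new();
    // Absolute paths start from the root; relative paths from the CWD.
    let base = if path.starts_with('/') { "" } else { cwd };
    for comp in base.split('/').chain(path.split('/')) {
        match comp {
            "" | "." => {}           // skip empty and current-dir components
            ".." => {
                parts.pop();         // up one level; no-op at the root
            }
            c => parts.push(c),
        }
    }
    format!("/{}", parts.join("/"))
}
```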

Source: osl/src/syscalls/fs.rs — sys_open

Errors

Errno           Condition
-EFAULT (-14)   Invalid pathname pointer
-ENOENT (-2)    File or directory not found
-ENOTDIR (-20)  Path is not a directory (when O_DIRECTORY is used)
-EMFILE (-24)   Per-process fd limit reached (64)
-EIO (-5)       VFS I/O error

Future Work

  • Support O_WRONLY, O_CREAT, O_TRUNC for writable files.
  • Streaming reads instead of loading entire file into memory at open time.
  • Proper mode handling.

close (nr 3)

Linux Signature

int close(int fd);

Description

Closes a file descriptor so that it no longer refers to any file and may be reused.

Current Implementation

Looks up fd in the current process’s file descriptor table. If found, calls FileHandle::close() on the handle and sets the table slot to None, making the fd number available for reuse.

  • Returns 0 on success.
  • Returns -EBADF (-9) if the fd is not open or out of range.

Source: osl/src/syscalls/fs.rs — sys_close

Future Work

  • Flush pending writes for writable file handles before closing.
  • Free resources held by the handle (e.g. release VFS locks).

fstat (nr 5)

Linux Signature

int fstat(int fd, struct stat *statbuf);

Description

Returns information about a file referred to by the file descriptor fd, writing it into the stat structure at statbuf.

Current Implementation

  • Zero-fills the 144-byte struct stat buffer.
  • Sets st_mode at offset 24 to S_IFCHR | 0666 (character device, read/write for all), regardless of which fd is queried.
  • Always returns 0 (success).

This is sufficient for musl’s stdio initialisation, which calls fstat on stdout to determine whether it is a terminal.

Source: osl/src/syscalls/fs.rs — sys_fstat

Future Work

  • Return different st_mode values depending on the fd (e.g., regular file vs. character device).
  • Populate other stat fields (st_size, st_ino, st_dev, timestamps, etc.).
  • Return -EBADF for invalid file descriptors.
  • Validate that statbuf is a writable user-space address.

lseek (nr 8)

Linux Signature

off_t lseek(int fd, off_t offset, int whence);

Description

Repositions the file offset of the open file descriptor fd to the given offset according to whence (SEEK_SET, SEEK_CUR, SEEK_END).

Current Implementation

Always returns -ESPIPE (illegal seek). The only file descriptors currently in use are stdin/stdout/stderr, which behave as non-seekable character devices (serial console).

Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch

Future Work

  • Implement proper seek for regular file descriptors once the VFS exposes them to user-space via open.

mmap (nr 9)

Linux Signature

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

Description

Maps pages of memory into the calling process’s address space.

Current Implementation

Supports anonymous mappings, file-backed private mappings, and shared memory mappings via shmem_create fds.

Source: osl/src/syscalls/mem.rs — sys_mmap

Supported modes

Flags                        fd        Behaviour
MAP_PRIVATE | MAP_ANONYMOUS  ignored   Allocate fresh zeroed pages (most common)
MAP_PRIVATE                  file fd   Copy file content into private pages
MAP_SHARED                   shmem fd  Map the shared memory object’s physical frames
MAP_SHARED | MAP_ANONYMOUS   —         Returns -EINVAL (not supported without fork)

MAP_SHARED and MAP_PRIVATE are mutually exclusive; if both or neither are set, -EINVAL is returned.

Protection flags (prot)

Flag        Value  Page table flags
PROT_READ   0x1    PRESENT | USER_ACCESSIBLE
PROT_WRITE  0x2    adds WRITABLE
PROT_EXEC   0x4    removes NO_EXECUTE

If prot is 0 (PROT_NONE), pages are mapped as present but not accessible from userspace (guard pages).

Address selection

  • Without MAP_FIXED: a top-down gap finder scans the VMA map for a free gap in the user address range (0x0000_0010_0000 to 0x0000_4000_0000_0000), starting from the top. The returned address is the start of the gap.
  • MAP_FIXED: addr must be page-aligned and non-zero. Any existing mappings in the range are implicitly unmapped before the new mapping is created.
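A sketch of a top-down gap finder over a VMA map keyed by start address (this version carves the new mapping from the top of the first gap that fits; the real finder's tie-breaking may differ):

```rust
use std::collections::BTreeMap;

// User mmap range from the text above.
const USER_LOW: u64 = 0x0000_0010_0000;
const USER_HIGH: u64 = 0x0000_4000_0000_0000;

/// Find a `len`-byte region in the highest free gap between VMAs.
/// Keys are VMA start addresses, values are lengths (page-aligned).
pub fn find_gap_topdown(vmas: &BTreeMap<u64, u64>, len: u64) -> Option<u64> {
    let mut ceiling = USER_HIGH;
    // Walk VMAs from the highest start address downwards.
    for (&start, &vma_len) in vmas.iter().rev() {
        let end = start + vma_len;
        if ceiling >= end && ceiling - end >= len {
            return Some(ceiling - len); // gap between this VMA and the ceiling
        }
        ceiling = ceiling.min(start);
    }
    // Final gap between USER_LOW and the lowest VMA (or the whole range).
    if ceiling >= USER_LOW && ceiling - USER_LOW >= len {
        Some(ceiling - len)
    } else {
        None
    }
}
```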

MAP_SHARED with shmem fd

When MAP_SHARED is specified with a file descriptor from shmem_create(508), the kernel maps the shared memory object’s existing physical frames into the caller’s page table. Each frame’s reference count is incremented so the frame is not freed until all processes have unmapped it and the last fd is closed.

The offset argument selects the starting frame within the shmem object (must be page-aligned).

File-backed MAP_PRIVATE

When MAP_PRIVATE is specified with a file fd (from open), the file’s content is copied into freshly allocated pages. The pages are private to the calling process — writes do not affect the underlying file or other mappings.

VMA tracking

Each mapping is recorded as a Vma (virtual memory area) in the process’s vma_map (BTreeMap<u64, Vma>), tracking start address, length, protection, flags, fd, and offset. The VMA map is used by munmap, mprotect, the gap finder, and process cleanup.

Lock ordering

PROCESS_TABLE is acquired first (to read VMA state and pml4_phys), then released before acquiring MEMORY (to allocate/map pages), then PROCESS_TABLE is re-acquired to update state. This avoids nested lock deadlocks.

Errors

Error    Condition
-EINVAL  Length is 0; MAP_SHARED and MAP_PRIVATE both/neither set; MAP_SHARED | MAP_ANONYMOUS; unaligned MAP_FIXED addr; unaligned offset
-ENOMEM  Physical memory exhausted or no virtual address gap found
-ENODEV  MAP_SHARED fd is not a shmem object
-EBADF   File-backed MAP_PRIVATE with an invalid fd

mprotect (nr 10)

Linux Signature

int mprotect(void *addr, size_t len, int prot);

Description

Changes the access protections for the calling process’s memory pages in the range [addr, addr+len).

Implementation

  1. Validates addr is page-aligned and len > 0 (returns -EINVAL otherwise).
  2. Aligns len up to the next page boundary.
  3. Splits/updates VMAs in the range via Process::mprotect_vmas():
    • Entire VMA overlap: updates prot in place.
    • Partial overlap (front, tail, middle): splits VMA and sets new prot on the affected portion.
  4. Converts prot to x86-64 page table flags (prot_to_page_flags()):
    • PROT_NONEUSER_ACCESSIBLE only (no PRESENT — any access faults).
    • PROT_READPRESENT | USER_ACCESSIBLE | NO_EXECUTE.
    • PROT_WRITE → adds WRITABLE.
    • PROT_EXEC → removes NO_EXECUTE.
  5. Updates page table entries via MemoryServices::update_user_page_flags() with TLB flush.
  6. Returns 0 on success.

Lock ordering: PROCESS_TABLE first (VMA split), then MEMORY (page table update).

Returns 0 (no-op) if no VMAs overlap the requested range (Linux semantics).

Source: osl/src/syscalls/mem.rs (sys_mprotect), libkernel/src/process.rs (mprotect_vmas), libkernel/src/memory/mod.rs (update_user_page_flags)

munmap (nr 11)

Linux Signature

int munmap(void *addr, size_t length);

Description

Removes mappings for the specified address range, causing further references to addresses within the range to generate page faults.

Current Implementation

Fully implemented. Validates arguments, splits/removes VMAs, unmaps page table entries, frees physical frames to the free list, and flushes the TLB.

Source: osl/src/syscalls/mem.rs — sys_munmap

Behaviour

  1. addr must be page-aligned; length must be > 0. Returns -EINVAL otherwise.
  2. length is rounded up to the next page boundary.
  3. Overlapping VMAs are split or removed:
    • Entire VMA consumed — removed from vma_map.
    • Front consumed — VMA start/len adjusted forward.
    • Tail consumed — VMA len shortened.
    • Middle consumed — VMA split into two fragments.
  4. Each page in the unmapped range is removed from the page table. Physical frames are released via refcount-aware logic: shared frames (from MAP_SHARED mappings) are only freed when their reference count reaches 0 (i.e. all processes have unmapped the frame and the backing shmem_create fd has been closed). Non-shared frames are freed immediately.
  5. If no VMAs overlap the range, returns 0 (Linux no-op semantics).
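The four overlap cases in step 3 can be sketched as a pure function from one VMA and an unmap range to the surviving fragments (illustrative types; the real code operates on the process's vma_map in place):

```rust
/// Given one VMA (start, len) and an unmap range (start, len), return the
/// fragments that survive: both (no overlap), front, tail, front+tail
/// (middle consumed), or none (entire VMA consumed).
pub fn split_vma(vma: (u64, u64), unmap: (u64, u64)) -> Vec<(u64, u64)> {
    let (vs, vl) = vma;
    let (us, ul) = unmap;
    let (ve, ue) = (vs + vl, us + ul);
    if ue <= vs || us >= ve {
        return vec![vma]; // no overlap: VMA untouched
    }
    let mut out = Vec::new();
    if us > vs {
        out.push((vs, us - vs)); // front fragment survives
    }
    if ue < ve {
        out.push((ue, ve - ue)); // tail fragment survives
    }
    out // empty = entire VMA consumed
}
```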

Lock ordering

PROCESS_TABLE is acquired first (to call munmap_vmas), then released before acquiring MEMORY (to unmap and free pages). Same ordering as sys_mmap and sys_brk.

Errors

Error    Condition
-EINVAL  addr not page-aligned, length is 0, or caller is kernel

brk (nr 12)

Linux Signature

int brk(void *addr);

Note: The raw syscall returns the new program break on success (not 0 like the glibc wrapper).

Description

Sets the end of the process’s data segment (the “program break”). Increasing the break allocates memory; decreasing it deallocates.

Current Implementation

  • brk(0) or brk(addr < brk_base): Returns the current program break without modification. This is how musl queries the initial break.
  • brk(addr <= brk_current): Shrinks the break. Updates brk_current, unmaps the released pages, and frees their frames (see the Phase 4e notes).
  • brk(addr > brk_current): Grows the break. The requested address is page-aligned up. For each new page:
    • A physical frame is allocated via alloc_dma_pages(1).
    • The frame is zeroed.
    • The frame is mapped into the process’s page table with PRESENT | WRITABLE | USER_ACCESSIBLE | NO_EXECUTE.
    • brk_current is updated to the new page-aligned address.
  • On allocation failure, returns the old brk_current (Linux convention: failure = unchanged break).

Initial state: brk_base and brk_current are set to the page-aligned end of the highest PT_LOAD ELF segment when the process is spawned.
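The grow path's page arithmetic can be sketched as follows (illustrative helper names): the requested break is page-aligned up, and one fresh frame is needed per page between the old and new aligned breaks.

```rust
const PAGE: u64 = 4096;

fn align_up(x: u64) -> u64 {
    (x + PAGE - 1) & !(PAGE - 1)
}

/// Number of fresh pages to allocate and map when the break grows
/// from `current` to `requested`; 0 for a query or a shrink.
pub fn pages_to_map(current: u64, requested: u64) -> u64 {
    if requested <= current {
        return 0;
    }
    (align_up(requested) - align_up(current)) / PAGE
}
```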

Lock ordering: Process table lock is acquired/released to read state, then memory lock for allocation, then process table lock again to write the update.

Source: osl/src/syscalls/mem.rs — sys_brk

Future Work

  • Guard against growing the break into other mapped regions (stack, mmap area).

rt_sigaction (nr 13)

Linux Signature

int rt_sigaction(int signum, const struct sigaction *act,
                 struct sigaction *oldact, size_t sigsetsize);

Description

Examine and change a signal action.

Current Implementation

Stub: Returns 0 (success) unconditionally. No signal support is implemented. musl’s runtime init calls rt_sigaction to install default signal handlers; the stub allows this to succeed silently.

Source: osl/src/signal.rs — sys_rt_sigaction

rt_sigprocmask (nr 14)

Linux Signature

int rt_sigprocmask(int how, const sigset_t *set, sigset_t *oldset, size_t sigsetsize);

Description

Examine and change blocked signals.

Current Implementation

Stub: Returns 0 (success) unconditionally. No signal support is implemented. musl’s runtime and posix_spawn call rt_sigprocmask to configure the signal mask; the stub allows this to succeed silently.

Source: osl/src/signal.rs — sys_rt_sigprocmask

rt_sigreturn (nr 15)

Restore process context after a signal handler returns.

Signature

rt_sigreturn() → (restores original rax)

Arguments

None. The kernel reads the saved context from the signal frame on the user stack.

Return value

Does not return in the normal sense — restores all registers (including rax) from the signal frame, resuming execution at the point where the signal was delivered.

Description

When the kernel delivers a signal, it pushes a signal frame onto the user stack containing the interrupted context (all registers, signal mask, RIP, RSP, RFLAGS) and sets RIP to the user’s signal handler. A trampoline (__restore_rt) is placed on the stack that calls rt_sigreturn when the handler returns.

rt_sigreturn reads the saved context from the signal frame, restores the signal mask, and overwrites the SYSCALL saved registers so that the return to user space resumes the original interrupted code path.

Signal frame layout (on user stack)

[pretcode]          8 bytes — address of __restore_rt trampoline
[siginfo]           128 bytes — siginfo_t
[ucontext]          variable — contains:
  uc_flags          8 bytes
  uc_link           8 bytes
  uc_stack          24 bytes (ss_sp, ss_flags, ss_size)
  [sigcontext]      256 bytes (32 × u64: r8–r15, rdi, rsi, rbp, rbx, rdx, rax, rcx, rsp, rip, rflags, ...)
  uc_sigmask        8 bytes — saved signal mask
[__restore_rt code] 9 bytes — `mov rax, 15; syscall`

Implementation

osl/src/signal.rs — sys_rt_sigreturn

ioctl (nr 16)

Linux Signature

int ioctl(int fd, unsigned long request, ...);

Description

Manipulates the underlying device parameters of special files. Commonly used to query terminal attributes (TCGETS, TIOCGWINSZ, etc.).

Current Implementation

Always returns -ENOTTY (-25), indicating the file descriptor does not refer to a terminal. All arguments are ignored.

This is sufficient for musl’s stdio, which calls ioctl(fd, TIOCGWINSZ, ...) to check if stdout is a terminal for line buffering decisions. Receiving -ENOTTY causes musl to treat the fd as a non-terminal and use full buffering.

Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch

Future Work

  • Return TIOCGWINSZ data for the VGA console (80x25) so musl recognises it as a terminal.
  • Implement TCGETS/TCSETS for basic terminal attribute support.
  • Dispatch based on fd to different device drivers.

writev (nr 20)

Linux Signature

ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

Where struct iovec is:

struct iovec {
    void  *iov_base;  // Starting address
    size_t iov_len;   // Number of bytes
};

Description

Writes data from multiple buffers (a “scatter/gather” array) to a file descriptor in a single atomic operation. This is what musl’s printf uses internally instead of plain write.

Current Implementation

Looks up fd in the current process’s per-process file descriptor table. Iterates through iovcnt iovec entries (each 16 bytes: iov_base: u64, iov_len: u64). For each non-empty buffer, calls FileHandle::write() on the handle.

  • Console fds (stdout/stderr): Each buffer is printed via ConsoleHandle::write() (UTF-8 with ASCII fallback).
  • Invalid fds: Returns -EBADF (-9).
  • Returns the total number of bytes written across all iovec entries on success.
  • Short-circuits on error from any individual write.
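The iovec walk can be sketched as parsing 16-byte entries out of a raw byte buffer, mirroring how sys_writev reads the array from user memory (illustrative helpers, not the osl code):

```rust
use std::convert::TryInto;

/// Parse `iovcnt` iovec entries (iov_base: u64, iov_len: u64, little-endian)
/// from a raw buffer; None if the buffer is too short.
pub fn parse_iovecs(raw: &[u8], iovcnt: usize) -> Option<Vec<(u64, u64)>> {
    if raw.len() < iovcnt * 16 {
        return None;
    }
    let mut out = Vec::with_capacity(iovcnt);
    for i in 0..iovcnt {
        let off = i * 16;
        let base = u64::from_le_bytes(raw[off..off + 8].try_into().unwrap());
        let len = u64::from_le_bytes(raw[off + 8..off + 16].try_into().unwrap());
        out.push((base, len));
    }
    Some(out)
}

/// Total byte count a fully successful writev would return.
pub fn total_len(iovs: &[(u64, u64)]) -> u64 {
    iovs.iter().map(|&(_, len)| len).sum()
}
```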

Source: osl/src/syscalls/io.rs — sys_writev

Future Work

  • Validate that iov and all iov_base pointers are valid user-space addresses.
  • Handle partial writes.
  • Cap iovcnt at UIO_MAXIOV (1024) per Linux convention.

pipe (nr 22) / pipe2 (nr 293)

Linux Signature

int pipe(int pipefd[2]);
int pipe2(int pipefd[2], int flags);

Description

Creates a unidirectional data channel (pipe). Returns two file descriptors: pipefd[0] for reading and pipefd[1] for writing.

Both syscalls share the same implementation: pipe(fds) is dispatched as pipe2(fds, 0) (no flags).

Current Implementation

  1. Create pipe: Allocates a PipeInner (shared VecDeque<u8> buffer) wrapped in PipeReader and PipeWriter handles.
  2. Allocate fds: Allocates two file descriptors in the process’s fd table.
  3. Apply flags: If O_CLOEXEC (0o2000000) is set, both fds get FD_CLOEXEC flag.
  4. Write to user buffer: Writes [read_fd, write_fd] as two i32 values to the user buffer.

Pipe Semantics

  • Read: If the buffer is empty and the writer is still open, the reader blocks via block_current_thread(). When the writer appends data, it wakes the blocked reader. Returns 0 (EOF) if the writer has been closed and the buffer is empty.
  • Write: Appends data to the shared buffer and wakes any blocked reader. Currently unbounded (no backpressure).
  • Close: Closing the write end sets write_closed = true and wakes any blocked reader (so it gets EOF). Closing the read end drops the reader’s Arc reference.
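A single-threaded sketch of these semantics, with None standing in for "reader would block" (the real PipeInner lives behind a lock, and blocking/wakeup goes through the scheduler):

```rust
use std::collections::VecDeque;

/// Shared pipe state: byte buffer plus writer-closed flag.
pub struct PipeInner {
    buf: VecDeque<u8>,
    write_closed: bool,
}

impl PipeInner {
    pub fn new() -> Self {
        PipeInner { buf: VecDeque::new(), write_closed: false }
    }

    /// Append data and report bytes written (unbounded: no backpressure yet).
    pub fn write(&mut self, data: &[u8]) -> usize {
        self.buf.extend(data);
        data.len()
    }

    /// Some(0) = EOF; None models "would block" (the kernel parks the reader).
    pub fn read(&mut self, out: &mut [u8]) -> Option<usize> {
        if self.buf.is_empty() {
            return if self.write_closed { Some(0) } else { None };
        }
        let n = out.len().min(self.buf.len());
        for b in out.iter_mut().take(n) {
            *b = self.buf.pop_front().unwrap();
        }
        Some(n)
    }

    /// Closing the write end turns future empty reads into EOF.
    pub fn close_write(&mut self) {
        self.write_closed = true;
    }
}
```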

Source: osl/src/syscalls/fs.rs — sys_pipe2, osl/src/syscalls/mod.rs — pipe(22) dispatch, libkernel/src/file.rs — PipeReader, PipeWriter, make_pipe

Usage from C (musl)

#include <unistd.h>
#include <fcntl.h>

int fds[2];
pipe2(fds, O_CLOEXEC);
write(fds[1], "hello", 5);
close(fds[1]);

char buf[32];
ssize_t n = read(fds[0], buf, sizeof(buf)); // n = 5
close(fds[0]);

Errors

Errno          Condition
-EFAULT (-14)  Invalid pipefd pointer
-EMFILE (-24)  Per-process fd limit reached (64)

Future Work

  • Bounded buffer with write-side blocking (backpressure).
  • O_NONBLOCK flag support.

madvise (nr 28)

Linux Signature

int madvise(void *addr, size_t length, int advice);

Description

Give advice about use of memory.

Current Implementation

Stub: Returns 0 (success) unconditionally. All advice is ignored. musl and Rust std may call madvise(MADV_DONTNEED) on freed memory regions.

Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch

dup2 (nr 33)

Linux Signature

int dup2(int oldfd, int newfd);

Description

Duplicates file descriptor oldfd to newfd. If newfd is already open, it is silently closed first.

Current Implementation

  1. If oldfd == newfd: validates that oldfd is open, returns newfd.
  2. Reads the FdEntry (handle + flags) from oldfd.
  3. Clones the Arc<dyn FileHandle> and installs it at newfd.
  4. The new fd does not inherit FD_CLOEXEC from the old fd (per POSIX).
  5. If newfd was previously open, its old handle is dropped (refcount decremented).
  6. The fd table is extended if newfd exceeds the current length.

Source: osl/src/syscalls/fs.rs — sys_dup2

Errors

| Errno | Condition |
|---|---|
| -EBADF (-9) | oldfd is not a valid open fd |

getpid (nr 39)

Linux Signature

pid_t getpid(void);

Description

Returns the process ID of the calling process.

Current Implementation

Returns current_pid().as_u64(). Always succeeds (no error return).

Source: osl/src/syscalls/process.rs — sys_getpid

clone (nr 56)

Linux Signature

long clone(unsigned long flags, void *child_stack, int *ptid, int *ctid, unsigned long tls);

Description

Creates a new process. ostoo supports the specific flag combination used by musl’s posix_spawn: CLONE_VM | CLONE_VFORK | SIGCHLD (0x4111). The child shares the parent’s address space and the parent blocks until the child calls execve or _exit.

Current Implementation

Only the flag combination CLONE_VM | CLONE_VFORK | SIGCHLD is accepted. Other flag combinations return -ENOSYS.

  1. Validate arguments: child_stack must be non-zero. Unsupported flags return -ENOSYS.
  2. Read parent state: Copies pml4_phys, cwd, fd_table (Arc clones), brk_*, mmap_* from the parent process.
  3. Capture user registers: Reads user_rip, user_rflags, and user_r9 from PerCpuData (saved by the SYSCALL entry stub). These are needed so the child can “return from syscall” at the same instruction as the parent.
  4. Create child process: New PID, same pml4_phys as parent (CLONE_VM), inherited fd table and cwd. Sets vfork_parent_thread to the parent’s scheduler thread index.
  5. Spawn clone thread: Creates a scheduler thread via spawn_clone_thread that enters clone_trampoline. The trampoline sets up kernel state and drops to ring 3 at user_rip with RAX=0 (child return value) and R9=user_r9 (musl’s __clone fn pointer).
  6. Block parent: Calls block_current_thread() (CLONE_VFORK semantics). The parent is unblocked when the child calls execve or _exit.
  7. Return: After unblocking, returns the child’s PID to the parent.

Source: osl/src/clone.rs — sys_clone; libkernel/src/task/scheduler.rs — spawn_clone_thread, clone_trampoline

Usage from C (musl)

Not called directly — musl’s posix_spawn uses it internally:

#include <spawn.h>
#include <sys/wait.h>

pid_t child;
int err = posix_spawn(&child, "/hello", NULL, NULL, argv, envp);
if (err == 0) {
    int status;
    waitpid(child, &status, 0);
}

Errors

| Errno | Condition |
|---|---|
| -ENOSYS (-38) | Unsupported flag combination |
| -EINVAL (-22) | child_stack is NULL |

Design Notes

  • musl’s __clone assembly stores the child function pointer in R9 before syscall. The entry stub saves R9 to PerCpuData.user_r9 (offset 32), and clone_trampoline restores it via jump_to_userspace(rax=0, r9=user_r9).
  • The child shares the parent’s PML4 (CLONE_VM). After execve, the child gets a fresh PML4. The old shared PML4 continues to be used by the parent.

execve (nr 59)

Linux Signature

int execve(const char *pathname, char *const argv[], char *const envp[]);

Description

Replaces the current process image with a new ELF binary. On success, the calling process’s address space, stack, and brk are replaced; the process continues execution at the new program’s entry point. On failure, the original process is unchanged.

Current Implementation

  1. Copy arguments from userspace: Reads pathname (null-terminated string), argv (NULL-terminated array of string pointers), and envp (NULL-terminated array of string pointers) into kernel buffers before destroying the address space.
  2. Resolve path: Resolves relative to the process’s cwd.
  3. Read ELF from VFS: Loads the entire ELF binary via devices::vfs::read_file().
  4. Parse ELF: Extracts PT_LOAD segments, entry point, and program headers via libkernel::elf::parse.
  5. Create fresh PML4: Allocates a new user page table (kernel entries 256–510 are copied from the active PML4). The old PML4 and its user-half page tables are freed after switching CR3 (skipped for CLONE_VM shared PML4s).
  6. Map ELF segments: Maps each PT_LOAD segment into the new PML4 with correct permissions (R/W/X).
  7. Map user stack: 8 pages (32 KiB) at 0x0000_7FFF_F000_0000.
  8. Build initial stack: Writes argc, argv pointers, envp pointers, and auxiliary vector (AT_PHDR, AT_PHENT, AT_PHNUM, AT_PAGESZ, AT_ENTRY, AT_UID, AT_RANDOM) onto the user stack.
  9. Update process: Sets new pml4_phys, entry_point, user_stack_top, brk_base/brk_current, resets mmap_next/mmap_regions. Calls close_cloexec_fds() to close all file descriptors with FD_CLOEXEC set. Resets FS_BASE to 0 (new program’s libc will set up TLS).
  10. Unblock vfork parent: If this process was created by clone(CLONE_VFORK), unblocks the parent thread.
  11. Jump to userspace: Switches CR3 to the new PML4 and does iretq to the new entry point. Never returns.

On any error before step 9, returns a negative errno — the original process is unchanged.

Source: osl/src/exec.rs — sys_execve

Errors

| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid pathname, argv, or envp pointer |
| -ENOENT (-2) | File not found on VFS |
| -ENOEXEC (-8) | Invalid ELF binary or no loadable segments |
| -EINVAL (-22) | Too many arguments (>256) |

Future Work

  • Support #! (shebang) script execution.
  • Proper AT_RANDOM with real randomness instead of a fixed address.

exit (nr 60) / exit_group (nr 231)

Linux Signature

void _exit(int status);          // nr 60
void exit_group(int status);     // nr 231

Description

  • exit (60): Terminates the calling thread.
  • exit_group (231): Terminates all threads in the calling process.

Both are handled identically in ostoo since each process currently has exactly one thread.

Current Implementation

  1. Looks up the current PID.
  2. If it’s a user process (not ProcessId::KERNEL), calls terminate_process:
    • Logs pid N exited with code C to serial.
    • Unblocks vfork parent: If this process was created by clone(CLONE_VFORK) and has not yet called execve, unblocks the parent thread so it can resume. Clears vfork_parent_thread.
    • Closes all fds: Releases IRQ handles, completion ports, pipes, channels, etc. while the process’s page tables are still active.
    • Frees user address space: Switches CR3 to the kernel boot PML4 and updates the scheduler’s thread record, then frees all user-half page tables and data frames via cleanup_user_address_space. Skipped for CLONE_VM children (shared PML4 still used by parent).
    • Marks zombie: Sets the process state to Zombie with the exit code.
    • Wakes parent: Queues SIGCHLD and unblocks the parent’s wait_thread if set.
    • Yields + dies: Donates remaining quantum to the parent, calls yield_now(), then kill_current_thread() marks the thread as Dead.
  3. If it’s a kernel thread: prints a halt message and calls kill_current_thread().

Zombie processes are reaped by waitpid (when a parent collects exit status) or lazily by reap_zombies() at the start of spawn_process.

Source: osl/src/syscalls/process.rs — sys_exit; libkernel/src/process.rs — terminate_process

CR3 safety on exit

The process’s PML4 frame must not be freed while CR3 still references it. The frame allocator uses an intrusive free-list that overwrites the first 8 bytes of freed frames immediately; if the scheduler later reschedules the dying thread (before kill_current_thread runs), a TLB refill through the corrupted PML4 would triple-fault. terminate_process therefore switches to the kernel boot PML4 (stored in KERNEL_PML4_PHYS during memory::init_services) and updates the scheduler via set_current_cr3 before calling cleanup_user_address_space.

Future Work

  • Properly distinguish exit (single thread) from exit_group (all threads) once multi-threaded processes are supported.
  • Service auto-cleanup: remove service registry entries on process exit.

wait4 (nr 61)

Linux Signature

pid_t wait4(pid_t pid, int *wstatus, int options, struct rusage *rusage);

Description

Waits for a child process to change state (typically exit). Returns the PID of the child whose state changed, and optionally writes the exit status.

Current Implementation

Called as syscall number 61 (wait4). The rusage parameter is ignored.

  1. Determines the calling process’s PID (parent_pid).
  2. Interprets pid argument:
    • -1: Wait for any child process.
    • > 0: Wait for the specific child with that PID.
  3. Searches the process table for a zombie child matching the criteria via find_zombie_child(parent_pid, target_pid).
  4. If a zombie child is found:
    • Writes the exit status to the user-space wstatus pointer (if non-NULL), encoded as (exit_code << 8) matching Linux’s WEXITSTATUS macro.
    • Reaps the child process (removes from process table, frees kernel stack).
    • Restores the console foreground to the parent process.
    • Returns the child’s PID.
  5. If no zombie child exists but living children do:
    • Registers the current scheduler thread index in the parent’s wait_thread field.
    • Calls block_current_thread() to sleep.
    • When woken (by a child calling sys_exit), loops back to step 3.
  6. If no children exist at all: Returns -ECHILD (-10).

Source: osl/src/syscalls/process.rs — sys_wait4

Usage from C (musl)

#include <sys/wait.h>
#include <sys/syscall.h>

/* Wait for specific child */
int status;
pid_t child = syscall(SYS_wait4, child_pid, &status, 0, 0);
int exit_code = WEXITSTATUS(status);  /* (status >> 8) & 0xFF */

/* Wait for any child */
pid_t any = syscall(SYS_wait4, -1, &status, 0, 0);

Errors

| Errno | Condition |
|---|---|
| -ECHILD (-10) | Calling process has no children |

Future Work

  • Support WNOHANG option (return immediately if no child has exited).
  • Support WUNTRACED and WCONTINUED for stopped/continued children.
  • Populate struct rusage with resource usage statistics.
  • Handle the case where multiple children exit simultaneously.

kill (nr 62)

Send a signal to a process.

Signature

kill(pid: pid_t, sig: int) → 0 or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| pid | rdi | Target process ID |
| sig | rsi | Signal number (1–31) |

Return value

Returns 0 on success.

Errors

| Error | Condition |
|---|---|
| EINVAL | Signal number is out of range (< 1 or > 31) |
| ESRCH | No process with the given PID exists |

Description

Queues the specified signal on the target process. The signal is delivered before the process next returns to user space (checked after syscalls and interrupts).

Currently only supports sending to a specific PID. Negative PIDs (process groups) and PID 0 (current process group) are not yet supported.

Implementation

osl/src/signal.rs — sys_kill

fcntl (nr 72)

Linux Signature

int fcntl(int fd, int cmd, ... /* arg */);

Description

Performs operations on file descriptors. Only fd-level flag operations are supported.

Current Implementation

| Command | Value | Behaviour |
|---|---|---|
| F_GETFD | 1 | Returns the fd flags (currently only FD_CLOEXEC) |
| F_SETFD | 2 | Sets the fd flags to arg |
| F_GETFL | 3 | Returns 0 (no file status flags tracked) |
| Other | — | Returns -EINVAL |

Source: osl/src/syscalls/fs.rs — sys_fcntl

Errors

| Errno | Condition |
|---|---|
| -EBADF (-9) | fd is not a valid open fd |
| -EINVAL (-22) | Unknown cmd |

getcwd (nr 79)

Linux Signature

char *getcwd(char *buf, size_t size);

Description

Copies the absolute pathname of the current working directory into buf. On success, returns buf. On failure, returns -1 and sets errno.

Current Implementation

  1. Validates that buf is within user address space. Returns -EFAULT (-14) if not.
  2. Reads the cwd field from the current process’s Process struct.
  3. Checks that size is large enough to hold the cwd string plus a null terminator. Returns -ERANGE (-34) if too small.
  4. Copies the cwd string and null terminator into the user buffer.
  5. Returns buf (the pointer value) on success — matching Linux’s behaviour where the return value is the buffer address.

Each process has its own cwd field (default "/"), updated by chdir.

Source: osl/src/syscalls/fs.rs — sys_getcwd

Usage from C (musl)

#include <unistd.h>

char buf[256];
if (getcwd(buf, sizeof(buf)) != NULL) {
    /* buf contains the current working directory */
}

Or via raw syscall:

#include <sys/syscall.h>

char buf[256];
long ret = syscall(SYS_getcwd, buf, sizeof(buf));
/* ret > 0 on success (pointer to buf) */

Future Work

  • Support getcwd(NULL, 0) which auto-allocates a buffer (musl handles this in userspace).

chdir (nr 80)

Linux Signature

int chdir(const char *path);

Description

Changes the current working directory to path.

Current Implementation

  1. Reads a null-terminated path string from user space (max 4096 bytes). Returns -EFAULT (-14) if the pointer is invalid.
  2. Resolves the path relative to the process’s current cwd. Normalises . and .. components.
  3. Validates that the resolved path is an existing directory by calling devices::vfs::list_dir() (through osl::blocking::blocking()). This blocks the calling thread while the async VFS operation completes.
  4. On success, updates the process’s cwd field to the resolved path and returns 0.
  5. On failure, returns the error from the VFS (typically -ENOENT or -ENOTDIR).

Source: osl/src/syscalls/fs.rs — sys_chdir

Usage from C (musl)

#include <unistd.h>

if (chdir("/some/path") < 0) {
    /* error */
}

Errors

| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid path pointer |
| -ENOENT (-2) | Path does not exist |
| -ENOTDIR (-20) | A component of the path is not a directory |
| -EIO (-5) | VFS I/O error |

Future Work

  • Support fchdir(fd) to change directory via an open directory fd.

sigaltstack (nr 131)

Linux Signature

int sigaltstack(const stack_t *ss, stack_t *old_ss);

Description

Set and/or get the alternate signal stack.

Current Implementation

Stub: Returns 0 (success) unconditionally. No signal support is implemented. Rust’s standard library calls sigaltstack during runtime init to set up an alternate stack for signal handlers.

Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch

arch_prctl (nr 158)

Linux Signature

int arch_prctl(int code, unsigned long addr);

Description

Sets or gets architecture-specific thread state. On x86-64, primarily used to set the FS and GS segment base registers for thread-local storage (TLS).

Current Implementation

  • ARCH_SET_FS (0x1002): Writes addr to the IA32_FS_BASE MSR (0xC000_0100). This is how musl sets up its TLS pointer during C runtime initialisation. Returns 0 on success.
  • All other codes: Returns -EINVAL (-22).

Source: osl/src/syscalls/misc.rs — sys_arch_prctl

Future Work

  • Implement ARCH_GET_FS (0x1003) to read back the current FS base.
  • Implement ARCH_SET_GS (0x1001) and ARCH_GET_GS (0x1004) for GS-based TLS.
  • Save/restore FS_BASE across context switches if multiple user processes use TLS concurrently (currently each process sets it fresh via the trampoline, but preemption during a syscall could lose the value).

futex (nr 202)

Linux Signature

long futex(uint32_t *uaddr, int futex_op, uint32_t val,
           const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3);

Description

Provides fast user-space locking primitives. FUTEX_WAIT blocks the calling thread until the value at uaddr changes; FUTEX_WAKE wakes threads waiting on uaddr.

Current Implementation

Always returns 0 (success). Each process is single-threaded, so musl’s internal locks (used by stdio, malloc, etc.) are never contended. The FUTEX_WAIT path is never reached in practice, and FUTEX_WAKE returning 0 (no waiters woken) is correct.

Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch

Future Work

  • Implement FUTEX_WAIT and FUTEX_WAKE properly once multi-threaded user processes are supported.
  • Support FUTEX_WAIT_BITSET and other operations used by musl’s condition variables.

sched_getaffinity (nr 204)

Linux Signature

int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);

Description

Get a thread’s CPU affinity mask.

Current Implementation

Zeroes the user-provided mask buffer, then sets bit 0 (CPU 0 only). Returns cpusetsize (the number of bytes written). ostoo is a single-CPU kernel.

Rust’s standard library calls sched_getaffinity during runtime init to determine available parallelism.

Source: osl/src/syscalls/misc.rs — sys_sched_getaffinity

Errors

| Errno | Condition |
|---|---|
| -EINVAL (-22) | cpusetsize is 0 |
| -EFAULT (-14) | Invalid mask pointer |

getdents64 (nr 217)

Linux Signature

int getdents64(unsigned int fd, struct linux_dirent64 *dirp, unsigned int count);

Where struct linux_dirent64 is:

struct linux_dirent64 {
    ino64_t        d_ino;    /* 64-bit inode number */
    off64_t        d_off;    /* 64-bit offset to next entry */
    unsigned short d_reclen; /* Size of this dirent */
    unsigned char  d_type;   /* File type */
    char           d_name[]; /* Null-terminated filename */
};

Description

Reads directory entries from a directory file descriptor into a buffer. Returns the number of bytes written, or 0 when all entries have been consumed.

Current Implementation

  1. Validates that dirp buffer is within user address space. Returns -EFAULT (-14) if not.
  2. Looks up fd in the process’s file descriptor table.
  3. Calls FileHandle::getdents64() on the handle. Only DirHandle implements this; other handle types return -ENOTTY (-25).
  4. DirHandle maintains an internal cursor. On each call, it serializes entries starting from the cursor position into the user buffer:
    • d_ino: Synthetic inode number (cursor index + 1).
    • d_off: Index of the next entry.
    • d_reclen: Record length, 8-byte aligned. Computed as 8 + 8 + 2 + 1 + strlen(name) + 1, rounded up to 8.
    • d_type: DT_DIR (4) for directories, DT_REG (8) for regular files.
    • d_name: Null-terminated filename, with zero-padding to alignment.
  5. Returns total bytes written, or 0 when all entries have been read.

The directory listing is loaded entirely at open() time and cached in the DirHandle.

Source: osl/src/syscalls/io.rs — sys_getdents64; osl/src/file.rs — DirHandle::getdents64

Usage from C (musl)

#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

int fd = open("/", O_RDONLY | O_DIRECTORY);
char buf[2048];
long nread;
while ((nread = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
    long pos = 0;
    while (pos < nread) {
        unsigned short reclen = *(unsigned short *)(buf + pos + 16);
        unsigned char  d_type = *(unsigned char *)(buf + pos + 18);
        char *name = buf + pos + 19;
        /* process entry... */
        pos += reclen;
    }
}
close(fd);

Future Work

  • Return proper inode numbers from the VFS.
  • Support lseek / rewinddir on directory handles.

set_tid_address (nr 218)

Linux Signature

pid_t set_tid_address(int *tidptr);

Description

Sets the clear_child_tid pointer for the calling thread. When the thread exits, the kernel writes 0 to *tidptr and wakes any futex waiters. Returns the caller’s TID.

Current Implementation

  • Ignores the tidptr argument entirely (no clear_child_tid tracking).
  • Returns the current process’s PID (used as TID since each process is single-threaded).

This is sufficient for musl’s early startup, which calls set_tid_address to discover its own TID.

Source: osl/src/syscalls/process.rs — sys_set_tid_address

Future Work

  • Store tidptr in the thread/process structure.
  • On thread exit, write 0 to *tidptr and perform a futex wake (needed for pthread_join).
  • Return a per-thread TID rather than PID once multi-threading is supported.

clock_gettime (nr 228)

Linux Signature

int clock_gettime(clockid_t clk_id, struct timespec *tp);

Description

Retrieves the time of the specified clock.

Current Implementation

Stub: Writes zero for both tv_sec and tv_nsec in the user-provided timespec struct. The clk_id parameter is accepted but ignored. Returns 0 (success).

This satisfies Rust std’s runtime init which calls clock_gettime(CLOCK_MONOTONIC, ...) but doesn’t depend on the actual time value.

Source: osl/src/syscalls/misc.rs — sys_clock_gettime

Errors

| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid tp pointer |

Future Work

  • Return real time based on the PIT/HPET/TSC timer.
  • Distinguish CLOCK_REALTIME, CLOCK_MONOTONIC, etc.

set_robust_list (nr 273)

Linux Signature

long set_robust_list(struct robust_list_head *head, size_t len);

Description

Registers a list of robust futexes with the kernel. If a thread exits while holding a robust futex, the kernel marks it as dead and wakes waiters, preventing permanent deadlocks.

Current Implementation

Always returns 0 (success) without recording anything. Both arguments are ignored.

This is sufficient for musl’s startup, which registers a robust list as part of thread initialisation.

Source: osl/src/syscalls/mod.rs — inline stub in syscall_dispatch

Future Work

  • Store the robust list head pointer in the thread structure.
  • On thread exit, walk the robust list and wake any futex waiters on held locks.
  • Implement get_robust_list (nr 274) for completeness.

getrandom (nr 318)

Linux Signature

ssize_t getrandom(void *buf, size_t buflen, unsigned int flags);

Description

Fills a buffer with random bytes. Used by Rust’s HashMap for hash seed randomisation and by musl for stack canary initialisation.

Current Implementation

Uses a simple xorshift64* PRNG seeded from the x86 TSC (Time Stamp Counter via rdtsc). Fills the user buffer byte-by-byte from the PRNG state. The flags parameter is accepted but ignored.

Note: This is not cryptographically secure. It provides enough entropy for HashMap seeds and similar non-security use cases.

Source: osl/src/syscalls/misc.rs — sys_getrandom

Errors

| Errno | Condition |
|---|---|
| -EFAULT (-14) | Invalid buffer pointer |

Future Work

  • Seed from a proper entropy source (e.g. RDSEED/RDRAND instructions).
  • Distinguish GRND_RANDOM vs GRND_NONBLOCK flags.

io_create (nr 501)

Create a completion port for async I/O.

Signature

io_create(flags: u32) → fd or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| flags | rdi | Reserved, must be 0 |

Return value

On success, returns a file descriptor for the new completion port.

Errors

| Error | Condition |
|---|---|
| EINVAL | flags is non-zero |
| EMFILE | Process fd table is full |

Description

Creates a new kernel CompletionPort object and returns a file descriptor referring to it. The port fd is used as the first argument to io_submit and io_wait.

Completion ports are the core async I/O primitive in ostoo. Operations (reads, writes, timeouts, IRQ waits, IPC send/recv) are submitted to a port via io_submit and their completions are harvested via io_wait.

Implementation

osl/src/io_port.rs — sys_io_create

io_submit (nr 502)

Submit async I/O operations to a completion port.

Signature

io_submit(port_fd: i32, entries_ptr: *const IoSubmission, count: u32) → processed or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd (from io_create) |
| entries_ptr | rsi | Pointer to array of submission entries |
| count | rdx | Number of entries to submit |

Submission entry layout

struct IoSubmission {       // 48 bytes, repr(C)
    uint64_t user_data;     // Opaque value returned in completion
    uint32_t opcode;        // Operation type (see below)
    uint32_t flags;         // Reserved, must be 0
    int32_t  fd;            // Target file descriptor (opcode-dependent)
    int32_t  _pad;
    uint64_t buf_addr;      // User buffer address
    uint32_t buf_len;       // User buffer length
    uint32_t offset;        // Reserved
    uint64_t timeout_ns;    // Timeout in nanoseconds (OP_TIMEOUT)
};

Opcodes

| Value | Name | Description |
|---|---|---|
| 0 | OP_NOP | Immediate completion (testing/synchronization) |
| 1 | OP_TIMEOUT | Timer that completes after timeout_ns nanoseconds |
| 2 | OP_READ | Async read from fd into buf_addr |
| 3 | OP_WRITE | Async write from buf_addr to fd |
| 4 | OP_IRQ_WAIT | Wait for an interrupt on an IRQ fd |
| 5 | OP_IPC_SEND | Send an IPC message on a channel send-end fd |
| 6 | OP_IPC_RECV | Receive an IPC message on a channel recv-end fd |
| 7 | OP_RING_WAIT | Wait for a notification fd signal |

Return value

On success, returns the number of entries processed.

Errors

| Error | Condition |
|---|---|
| EFAULT | entries_ptr is invalid |
| EBADF | port_fd is not a valid completion port |

Per-entry errors (EBADF, EFAULT, EINVAL) are reported via the completion result field rather than failing the entire submission.

Description

Each submission entry describes an async operation. The kernel processes entries sequentially, spawning async tasks for operations that cannot complete immediately. When an operation finishes, a completion entry is posted to the port and can be harvested via io_wait.

For OP_READ, data is read into a kernel buffer and copied to user space during io_wait (which runs in the process’s syscall context with the correct page tables).

For OP_IPC_SEND/RECV, buf_addr points to an IpcMessage struct. File descriptors in msg.fds are transferred across the channel.

For OP_RING_WAIT, fd must be a notification fd (from notify_create). The completion fires when another process calls notify(fd). Edge-triggered, one-shot: re-submit to rearm.

Implementation

osl/src/io_port.rs — sys_io_submit

io_wait (nr 503)

Wait for completions on a completion port.

Signature

io_wait(port_fd: i32, completions_ptr: *mut IoCompletion, max: u32, min: u32, timeout_ns: u64) → count or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd |
| completions_ptr | rsi | User buffer for completion entries |
| max | rdx | Maximum completions to return |
| min | r10 | Minimum completions before returning (0 = non-blocking poll) |
| timeout_ns | r8 | Timeout in nanoseconds (0 = wait forever) |

Completion entry layout

struct IoCompletion {       // 24 bytes, repr(C)
    uint64_t user_data;     // Copied from submission
    int64_t  result;        // Bytes transferred, or negative errno
    uint32_t flags;         // Reserved
    uint32_t opcode;        // Operation that completed
};

Return value

On success, returns the number of completions written to completions_ptr (between 0 and max).

Errors

| Error | Condition |
|---|---|
| EFAULT | completions_ptr is invalid |
| EBADF | port_fd is not a valid completion port |

Description

Blocks the calling thread until at least min completions are available on the port, or the timeout expires. Drains up to max completions and copies them to user memory.

For OP_READ completions, the kernel buffer containing read data is copied to the user-space destination address that was specified in the original submission.

For OP_IPC_RECV completions, transferred file descriptors are installed in the receiver’s fd table and the IpcMessage.fds array is rewritten with the new fd numbers before copying to user memory.

The timeout is implemented as a cancellable async timer task. A timeout of 0 means wait forever (no timeout).

Implementation

osl/src/io_port.rs — sys_io_wait

irq_create (nr 504)

Create a file descriptor for receiving hardware interrupts.

Signature

irq_create(gsi: u32) → fd or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| gsi | rdi | Global System Interrupt number |

Return value

On success, returns a file descriptor for the IRQ object.

Errors

| Error | Condition |
|---|---|
| ENOMEM | No free dynamic interrupt vectors available |
| EINVAL | Failed to program the IO APIC for the given GSI |
| EMFILE | Process fd table is full |

Description

Allocates a dynamic interrupt vector, programs the IO APIC to route the specified GSI to that vector (edge-triggered, active-high, initially masked), and returns an fd referring to the IRQ object.

The IRQ fd is used with io_submit OP_IRQ_WAIT to asynchronously wait for interrupts via a completion port. When an interrupt fires, the registered completion port receives a completion with the user_data from the submission. The GSI is then re-masked until the next OP_IRQ_WAIT is submitted.

When the fd is closed, the original IO APIC redirection entry is restored and the dynamic vector is freed.

Implementation

osl/src/irq.rs — sys_irq_create

ipc_create (nr 505)

Create a bidirectional IPC channel pair.

Signature

ipc_create(fds_ptr: *mut [i32; 2], capacity: u32, flags: u32) → 0 or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| fds_ptr | rdi | Pointer to 2-element i32 array for [send_fd, recv_fd] |
| capacity | rsi | Channel buffer capacity (max queued messages) |
| flags | rdx | IPC_CLOEXEC (0x1) to set FD_CLOEXEC on both fds |

Return value

On success, writes [send_fd, recv_fd] to fds_ptr and returns 0.

Errors

| Error | Condition |
|---|---|
| EFAULT | fds_ptr is invalid |
| EINVAL | Unknown flags |
| EMFILE | Process fd table is full |

Description

Creates an IPC channel and returns two file descriptors: a send end and a receive end. Messages are fixed-size IpcMessage structs (56 bytes) containing:

struct IpcMessage {         // repr(C)
    uint64_t tag;           // User-defined message type
    uint64_t data[3];       // 24 bytes inline payload
    int32_t  fds[4];        // File descriptors to transfer (-1 = unused)
};

The channel supports capability-based fd passing: file descriptors listed in msg.fds are duplicated from the sender’s fd table and installed in the receiver’s fd table on delivery.

Channels can be used in both blocking mode (via ipc_send/ipc_recv syscalls) and async mode (via OP_IPC_SEND/OP_IPC_RECV on a completion port).

Implementation

osl/src/ipc.rs — sys_ipc_create

ipc_send (nr 506)

Send a message on an IPC channel (blocking).

Signature

ipc_send(fd: i32, msg_ptr: *const IpcMessage, flags: u32) → 0 or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| fd | rdi | Channel send-end fd (from ipc_create) |
| msg_ptr | rsi | Pointer to IpcMessage to send |
| flags | rdx | IPC_NONBLOCK (0x1) for non-blocking mode |

Return value

On success, returns 0.

Errors

| Error | Condition |
|---|---|
| EFAULT | msg_ptr is invalid |
| EBADF | fd is not a valid channel send-end |
| EPIPE | Receive end has been closed |
| EAGAIN | IPC_NONBLOCK set and channel is full |
| EMFILE | (receiver) fd table full during fd transfer |

Description

Sends a message through the channel. If the channel buffer is full and IPC_NONBLOCK is not set, the calling thread blocks until the receiver drains space.

If msg.fds contains valid file descriptors (not -1), those fd objects are extracted from the sender’s fd table and transferred to the receiver. The sender’s fds remain open — this is a dup, not a move.

When a receiver is blocked waiting via ipc_recv, the send uses scheduler donate to directly switch to the receiver thread for low-latency delivery.

For async (non-blocking, multiplexed) sending, use OP_IPC_SEND via io_submit instead.

Implementation

osl/src/ipc.rs — sys_ipc_send

ipc_recv (nr 507)

Receive a message from an IPC channel (blocking).

Signature

ipc_recv(fd: i32, msg_ptr: *mut IpcMessage, flags: u32) → 0 or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| fd | rdi | Channel recv-end fd (from ipc_create) |
| msg_ptr | rsi | Pointer to IpcMessage buffer for received message |
| flags | rdx | IPC_NONBLOCK (0x1) for non-blocking mode |

Return value

On success, writes the received message to msg_ptr and returns 0.

Errors

| Error | Condition |
|---|---|
| EFAULT | msg_ptr is invalid |
| EBADF | fd is not a valid channel recv-end |
| EPIPE | Send end has been closed and channel is empty |
| EAGAIN | IPC_NONBLOCK set and no message available |
| EMFILE | fd table full during fd transfer installation |

Description

Receives a message from the channel. If no message is available and IPC_NONBLOCK is not set, the calling thread blocks until a sender posts a message.

If the received message carries file descriptors (msg.fds entries != -1), the transferred fd objects are installed in the receiver’s fd table and the fds array is rewritten with the new fd numbers before being copied to user memory.

For async (non-blocking, multiplexed) receiving, use OP_IPC_RECV via io_submit instead.

Implementation

osl/src/ipc.rs — sys_ipc_recv

shmem_create (nr 508)

Create a shared memory object and return a file descriptor.

Signature

shmem_create(size: u64, flags: u32) → fd or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| size | rdi | Size of the shared memory object in bytes (must be > 0) |
| flags | rsi | Flags: SHM_CLOEXEC (0x01) sets close-on-exec on the fd |

Return value

On success, returns a file descriptor for the shared memory object.

Errors

| Error | Condition |
|---|---|
| EINVAL | size is 0, or unknown flags are set |
| ENOMEM | Not enough physical memory to allocate the backing frames |
| EMFILE | Process fd table is full |

Description

Allocates a shared memory object backed by eagerly-allocated, zeroed physical frames. Returns a file descriptor referring to it.

The fd can be inherited by child processes (via clone + execve, unless SHM_CLOEXEC is set) or transferred via IPC fd-passing (ipc_send / ipc_recv). Both sides can then call mmap(MAP_SHARED, fd) to map the same physical pages into their address spaces.

Physical frames are reference-counted. A frame is freed only when all mappings are removed and the last fd referring to the shared memory object is closed.

Flags

| Flag | Value | Description |
|---|---|---|
| SHM_CLOEXEC | 0x01 | Set close-on-exec on the returned fd (analogous to Linux’s MFD_CLOEXEC) |

Userspace usage (C)

#define SYS_SHMEM_CREATE 508
#define SHM_CLOEXEC      0x01

static long shmem_create(unsigned long size, unsigned int flags) {
    return syscall(SYS_SHMEM_CREATE, size, flags);
}

/* Create 4 KiB shared memory, mmap it */
int fd = shmem_create(4096, 0);
void *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Implementation

osl/src/syscalls/shmem.rs — sys_shmem_create

Backing struct: libkernel/src/shmem.rs — SharedMemInner

See also

notify_create (nr 509)

Create a notification file descriptor for inter-process signaling.

Signature

notify_create(flags: u32) → fd or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| flags | rdi | Flags: NOTIFY_CLOEXEC (0x01) sets close-on-exec on the fd |

Return value

On success, returns a file descriptor for the notification object.

Errors

| Error | Condition |
|---|---|
| EINVAL | Unknown flags are set |
| EMFILE | Process fd table is full |

Description

Creates a notification fd for signaling between processes. The fd is used with two operations:

  • Consumer: submits OP_RING_WAIT (opcode 7) via io_submit on the notification fd. The completion port blocks until the producer signals.
  • Producer: calls notify(fd) (syscall 510) to signal. If an OP_RING_WAIT is armed, a completion is posted to the consumer’s port.

The notification fd can be passed to child processes via inheritance (clone + execve) or via IPC fd-passing (ipc_send / ipc_recv).

Semantics

  • Edge-triggered, one-shot: one notify() produces one completion. The consumer must re-submit OP_RING_WAIT to receive the next signal.
  • Buffered: if notify() is called before OP_RING_WAIT is armed, the notification is buffered. The next OP_RING_WAIT completes immediately. Multiple pre-arm signals coalesce into one.
  • Single waiter: only one OP_RING_WAIT can be pending per fd.

Flags

| Flag | Value | Description |
|---|---|---|
| NOTIFY_CLOEXEC | 0x01 | Set close-on-exec on the returned fd |

Userspace usage (C)

#define SYS_NOTIFY_CREATE 509
#define NOTIFY_CLOEXEC    0x01

static long notify_create(unsigned int flags) {
    return syscall(SYS_NOTIFY_CREATE, flags);
}

int nfd = notify_create(0);

Implementation

osl/src/notify.rs — sys_notify_create

Backing struct: libkernel/src/notify.rs — NotifyInner

See also

notify (nr 510)

Signal a notification file descriptor.

Signature

notify(fd: i32) → 0 or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| fd | rdi | Notification fd (from notify_create) |

Return value

Returns 0 on success.

Errors

| Error | Condition |
|---|---|
| EBADF | fd is invalid or does not refer to a notification object |

Description

Signals a notification fd, waking a consumer waiting via OP_RING_WAIT on a completion port.

If an OP_RING_WAIT is armed on the fd, a completion is posted to the consumer’s port with result = 0 and opcode = OP_RING_WAIT.

If no OP_RING_WAIT is armed, the notification is buffered. The next OP_RING_WAIT submission will complete immediately. Multiple buffered notifications coalesce into one event.

The caller uses scheduler donate (set_donate_target + yield_now) for low-latency wakeup of the consumer.

Userspace usage (C)

#define SYS_NOTIFY 510

static long notify(int fd) {
    return syscall(SYS_NOTIFY, fd);
}

notify(nfd);  /* wake consumer */

Implementation

osl/src/notify.rs — sys_notify

See also

io_setup_rings (nr 511)

Set up shared-memory submission and completion rings on a completion port.

Signature

io_setup_rings(port_fd: i32, params: *mut IoRingParams) → 0 or -errno

Arguments

| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd (from io_create) |
| params | rsi | Pointer to IoRingParams struct (in/out) |

IoRingParams struct

struct io_ring_params {
    uint32_t sq_entries;   /* IN: requested SQ size (rounded to pow2, max 64) */
    uint32_t cq_entries;   /* IN: requested CQ size (rounded to pow2, max 128) */
    int32_t  sq_fd;        /* OUT: shmem fd for SQ ring page */
    int32_t  cq_fd;        /* OUT: shmem fd for CQ ring page */
};

Return value

Returns 0 on success. params->sq_entries and params->cq_entries are updated to the actual (rounded) sizes. params->sq_fd and params->cq_fd are set to new shmem fds.

Errors

| Error | Condition |
|---|---|
| EFAULT | params pointer is invalid |
| EBADF | port_fd is invalid or not a completion port |
| EBUSY | Ring already set up on this port |
| ENOMEM | Could not allocate ring pages |
| EMFILE | fd table full |

Description

Transitions a completion port into ring mode. After this call:

  • io_submit still works (completions go to the CQ ring)
  • io_wait returns -EINVAL (replaced by io_ring_enter)
  • Completions are posted to the shared CQ ring, readable by userspace without a syscall

The caller must mmap(MAP_SHARED) both the SQ and CQ fds to access the ring buffers. Each ring is a single 4 KiB page with the layout:

Offset 0:  RingHeader (16 bytes)
  u32 head      — consumer advances
  u32 tail      — producer advances
  u32 mask      — capacity - 1
  u32 flags     — reserved (0)

Offset 64: entries[] (cache-line aligned)
  SQ: IoSubmission[sq_entries]   — 48 bytes each
  CQ: IoCompletion[cq_entries]   — 24 bytes each

Head and tail are accessed with atomic load/store operations with acquire/release ordering.
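The header layout above can be mirrored in C. This is a sketch following the field order given in this document (the struct and helper names are illustrative, not a published header):

```c
#include <stddef.h>
#include <stdint.h>

/* RingHeader as described above: 16 bytes at offset 0 of the ring page. */
struct ring_header {
    uint32_t head;   /* consumer advances */
    uint32_t tail;   /* producer advances */
    uint32_t mask;   /* capacity - 1 (capacity is a power of two) */
    uint32_t flags;  /* reserved (0) */
};

#define RING_ENTRIES_OFFSET 64   /* entries[] start, cache-line aligned */

/* Index into the entry array with power-of-two wrap-around.
   entry_size is 48 for the SQ (IoSubmission) or 24 for the CQ
   (IoCompletion). */
static inline void *ring_entry(void *ring_page, uint32_t idx,
                               uint32_t mask, size_t entry_size) {
    return (char *)ring_page + RING_ENTRIES_OFFSET
         + (size_t)(idx & mask) * entry_size;
}
```

The sq_entry / cq_entry helpers used in the io_ring_enter usage example can be expressed in terms of ring_entry with the respective entry sizes.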

Capacity limits

| Ring | Entry size | Max entries | Calculation |
|---|---|---|---|
| SQ | 48 bytes | 64 | (4096 - 64) / 48 = 84, rounded down to a power of two |
| CQ | 24 bytes | 128 | (4096 - 64) / 24 = 168, rounded down to a power of two |

Userspace usage (C)

#define SYS_IO_SETUP_RINGS 511

static long io_setup_rings(int port_fd, struct io_ring_params *p) {
    return syscall(SYS_IO_SETUP_RINGS, port_fd, p);
}

int port = io_create(0);
struct io_ring_params params = { .sq_entries = 64, .cq_entries = 128 };
io_setup_rings(port, &params);

void *sq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.sq_fd, 0);
void *cq = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, params.cq_fd, 0);

Implementation

osl/src/io_port.rs — sys_io_setup_rings

See also

io_ring_enter (nr 512)

Process SQ entries and optionally wait for CQ completions on a ring-mode completion port.

Signature

io_ring_enter(port_fd: i32, to_submit: u32, min_complete: u32, flags: u32) → i64

Arguments

| Arg | Register | Description |
|---|---|---|
| port_fd | rdi | Completion port fd (must be in ring mode) |
| to_submit | rsi | Max number of SQ entries to process |
| min_complete | rdx | Min CQ entries to wait for before returning |
| flags | r10 | Reserved, must be 0 |

Return value

On success, returns the number of CQ entries available (tail - head).

Errors

| Error | Condition |
|---|---|
| EBADF | port_fd is invalid or not a completion port |
| EINVAL | Port is not in ring mode, or flags != 0 |

Description

This is the single syscall for ring-mode operation, replacing both io_submit and io_wait.

Processing phases

  1. Drain SQ: reads up to to_submit entries from the shared SQ ring (from the kernel’s SQ head to the userspace-written SQ tail). Each SQE is processed identically to io_submit. The SQ head is advanced.

  2. Flush deferred completions: drains the kernel queue of completions that need syscall-context processing (OP_READ data copy, OP_IPC_RECV fd installation). These are written to the CQ ring.

  3. Wait: if min_complete > 0, blocks until the CQ ring has at least min_complete entries available. On each wakeup, deferred completions are flushed again.

Dual-mode completion posting

When rings are active, CompletionPort::post() routes completions:

  • Simple (no read_buf, no transfer_fds): CQE written directly to the shared CQ ring. This is the fast path for OP_NOP, OP_TIMEOUT, OP_WRITE, OP_IRQ_WAIT, OP_IPC_SEND, OP_RING_WAIT.
  • Deferred: pushed to the kernel queue and flushed by io_ring_enter in syscall context. This handles OP_READ and OP_IPC_RECV which need to copy data to user buffers.

Userspace usage (C)

#define SYS_IO_RING_ENTER 512

static long io_ring_enter(int port_fd, unsigned int to_submit,
                          unsigned int min_complete, unsigned int flags) {
    return syscall(SYS_IO_RING_ENTER, port_fd, to_submit, min_complete, flags);
}

/* Write SQE to SQ ring */
uint32_t tail = __atomic_load_n(&sqh->tail, __ATOMIC_RELAXED);
struct io_submission *sqe = sq_entry(sq, tail, sqh->mask);
sqe->opcode = OP_NOP;
sqe->user_data = 42;
__atomic_store_n(&sqh->tail, tail + 1, __ATOMIC_RELEASE);

/* Process 1 SQE, wait for 1 CQE */
io_ring_enter(port, 1, 1, 0);

/* Read CQE from CQ ring */
uint32_t head = __atomic_load_n(&cqh->head, __ATOMIC_RELAXED);
uint32_t cq_tail = __atomic_load_n(&cqh->tail, __ATOMIC_ACQUIRE);
if (head != cq_tail) {
    struct io_completion *cqe = cq_entry(cq, head, cqh->mask);
    /* process cqe */
    __atomic_store_n(&cqh->head, head + 1, __ATOMIC_RELEASE);
}

Implementation

osl/src/io_port.rs — sys_io_ring_enter

See also

Userspace Shell Design

Status

All phases are complete. The userspace shell runs as the primary user interface on boot. The kernel shell remains as a fallback when no /shell binary is found on the filesystem.

Context

The shell was migrated from a kernel actor (kernel/src/shell.rs) to a ring-3 process — a C program (user/shell.c) compiled with musl that reads raw keypresses from stdin, does its own line editing, and uses syscalls for file I/O and process management.

Scope decisions:

  • Raw keypresses to userspace (no kernel line editing for foreground user processes)
  • Minimal commands: echo, ls, cat, pwd, cd, export, env, unset, pid, exit, help, and running programs by name
  • Environment variables: shell maintains an env table, passes it to child processes via posix_spawn
  • Kernel provides default environment on boot (PATH=/host/bin, HOME=/, TERM=dumb, SHELL=/bin/shell)
  • Kernel shell kept as fallback (dormant when userspace shell is foreground)
  • No pipes yet

Phase 1: Scheduler Blocking Support ✅ COMPLETE

Goal: Add Blocked thread state so threads can sleep waiting for I/O.

File: libkernel/src/task/scheduler.rs

  1. Add Blocked to ThreadState enum (line 51)
  2. Modify preempt_tick (lines 484, 495) — treat Blocked like Dead: skip quantum decrement, don’t re-queue
  3. Add pub fn block_current_thread() — marks current thread Blocked, spins on enable_and_hlt until rescheduled with non-Blocked state
  4. Add pub fn unblock(thread_idx: usize) — sets thread to Ready, pushes onto ready queue (safe from any context including ISR)

Key detail: Blocking from within syscall_dispatch works because each user process has its own 64 KiB kernel stack (set via PER_CPU.kernel_rsp during context switch). The timer saves/restores the full register state, so when unblocked, execution resumes mid-syscall.


Phase 2: File Descriptor Table ✅ COMPLETE

Goal: Per-process FD table with FileHandle trait, refactor existing syscalls.

2a: FileHandle trait + ConsoleHandle

File: libkernel/src/file.rs

  • FileError enum (BadFd, IsDirectory, NotATty, TooManyOpenFiles) — using snafu for Display
  • FileHandle trait: read(&self, buf) -> Result<usize, FileError>, write(&self, buf) -> Result<usize, FileError>, close(&self), kind(), getdents64()
  • ConsoleHandle { readable: bool } — write prints to kernel console; read delegates to console input buffer
  • Linux errno numeric constants live in osl/src/errno.rs; libkernel has no knowledge of errno numbers

2b: FD table on Process

File: libkernel/src/process.rs

  • Add fd_table: Vec<Option<Arc<dyn FileHandle>>> to Process
  • Initialize fds 0-2 as ConsoleHandle in Process::new()
  • Add alloc_fd(handle) -> Result<usize, FileError> (scan for first None slot)
  • Add close_fd(fd: usize) -> Result<(), FileError>
  • Add get_fd(fd: usize) -> Result<Arc<dyn FileHandle>, FileError>

2c: Refactor syscalls to use FD table

File: osl/src/syscalls/io.rs and osl/src/syscalls/fs.rs

  • sys_write / sys_writev: look up fd in process fd_table, call handle.write() (osl/src/syscalls/io.rs)
  • sys_read: look up fd, call handle.read() (osl/src/syscalls/io.rs)
  • sys_close: call process.close_fd(fd) (osl/src/syscalls/fs.rs)

Phase 3: Console Input (Raw Keypresses) ✅ COMPLETE

Goal: Route decoded keypresses to a buffer that read(0) consumes, with blocking.

3a: Console input buffer

New file: libkernel/src/console.rs

  • CONSOLE_INPUT: Mutex<ConsoleInner> with VecDeque<u8> (256 bytes) and blocked_reader: Option<usize>
  • FOREGROUND_PID: AtomicU64 — PID of the process that receives keyboard input (0 = kernel)
  • push_input(byte) — pushes to buffer, calls scheduler::unblock() if a reader is blocked
  • read_input(buf) -> usize — drains buffer into buf; if empty, registers blocked_reader and calls block_current_thread(), retries on wake
  • set_foreground(pid) / foreground_pid() -> ProcessId
  • flush_input() — clear buffer on foreground change

3b: Wire ConsoleHandle::read to console buffer

File: libkernel/src/file.rs

  • ConsoleHandle::read() calls console::read_input(buf) when readable == true

3c: Modify keyboard actor routing

File: kernel/src/keyboard_actor.rs

  • At top of on_key handler: check console::foreground_pid()
  • If non-kernel PID: convert Key to raw byte(s) and call console::push_input():
    • Key::Unicode(c) → ASCII byte (if c.is_ascii())
    • Enter → \n (0x0A)
    • Backspace → 0x7F (DEL)
    • Ctrl+C → 0x03, Ctrl+D → 0x04, Tab → 0x09
    • Arrow keys → VT100 sequences (ESC [ A/B/C/D) — optional for later
    • Return early (skip kernel line-editor)
  • If kernel PID: existing line-editor behavior unchanged

Phase 4: VFS Syscalls ✅ COMPLETE

Goal: open, read (files), close, getdents64 so userspace can read files and list directories.

4a: Async-to-sync bridge

File: osl/src/blocking.rs

pub fn blocking<T: Send + 'static>(future: impl Future<Output = T> + Send + 'static) -> T {
    let result = Arc::new(Mutex::new(None));
    let thread_idx = scheduler::current_thread_idx();
    let r = result.clone();
    executor::spawn(Task::new(async move {
        *r.lock() = Some(future.await);
        scheduler::unblock(thread_idx);
    }));
    scheduler::block_current_thread();
    result.lock().take().unwrap()
}

Spawns the async VFS operation as a kernel task, blocks the user thread, unblocks when complete.

4b: VfsHandle (buffered file)

File: osl/src/file.rs

  • VfsHandle — holds Vec<u8> content + read position; entire file loaded at open time
  • DirHandle — holds Vec<VfsDirEntry> listing + cursor; loaded at open time

4c: sys_open (syscall 2)

File: osl/src/syscalls/fs.rs

  • Read null-terminated path from userspace, validate pointer
  • Resolve path relative to process cwd (see Phase 5a)
  • Use osl::blocking::blocking() to call devices::vfs::read_file() or devices::vfs::list_dir() (try file first, fall back to dir for O_DIRECTORY)
  • Wrap in VfsHandle or DirHandle, allocate fd via process.alloc_fd()
  • Return fd or -ENOENT

4d: sys_getdents64 (syscall 217)

File: osl/src/syscalls/io.rs

  • Look up fd → must be DirHandle
  • Serialize entries as linux_dirent64 structs into user buffer (d_ino, d_off, d_reclen, d_type, d_name)
  • Return total bytes written, or 0 at end
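The wire format being matched is Linux’s linux_dirent64 record. A consumer-side sketch (count_dents is a hypothetical helper, not part of the shell) that walks a filled buffer by d_reclen:

```c
#include <stdint.h>

/* Linux's linux_dirent64 layout: a fixed 19-byte prefix followed by a
   null-terminated name, with d_reclen giving the padded record length. */
struct linux_dirent64 {
    uint64_t d_ino;     /* inode number */
    int64_t  d_off;     /* offset to next record */
    uint16_t d_reclen;  /* total length of this record */
    uint8_t  d_type;    /* DT_REG, DT_DIR, ... */
    char     d_name[];  /* null-terminated entry name */
};

/* Walk a buffer filled by getdents64 and count the records. */
static int count_dents(const char *buf, long nread) {
    int n = 0;
    for (long pos = 0; pos < nread; n++) {
        const struct linux_dirent64 *d =
            (const struct linux_dirent64 *)(buf + pos);
        pos += d->d_reclen;
    }
    return n;
}
```

musl’s readdir() performs essentially this walk internally, which is why the kernel-side serialization must match the layout exactly.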

4e: Existing sys_read/sys_close already work via FD table (Phase 2c)


Phase 5: Process Management Syscalls ✅ COMPLETE

Goal: chdir/getcwd, process creation (clone+execve), waitpid.

5a: chdir / getcwd

File: libkernel/src/process.rs — add cwd: String to Process, default "/"

File: osl/src/syscalls/fs.rs

  • sys_chdir (nr 80): validate path exists via osl::blocking::blocking(devices::vfs::list_dir(path)), update process.cwd
  • sys_getcwd (nr 79): copy process.cwd to user buffer

5b: Process spawning (clone + execve)

Process creation uses standard Linux clone(CLONE_VM|CLONE_VFORK) + execve. musl’s posix_spawn and Rust’s std::process::Command work unmodified.

See clone and execve.

5c: spawn_process_full (kernel-side ELF spawning)

File: osl/src/spawn.rs

  • spawn_process_full takes elf_data, argv: &[&[u8]], envp: &[&[u8]], and parent_pid: ProcessId params
  • build_initial_stack writes argv strings + pointer array + argc (Linux x86_64 ABI)

File: libkernel/src/process.rs

  • parent_pid: ProcessId on Process
  • wait_thread: Option<usize> (thread to wake on child exit)
  • vfork_parent_thread: Option<usize> (thread to unblock after execve)

5d: waitpid (syscall 61 / wait4)

File: osl/src/syscalls/process.rs

  • sys_waitpid(pid, status_ptr, options) -> pid
  • Find zombie child matching requested pid (or any child if pid == -1)
  • If found: write exit status to userspace, reap, return child PID
  • If not found: register wait_thread on parent, block, retry on wake

File: libkernel/src/process.rs

  • find_zombie_child(parent, target_pid) -> Option<(ProcessId, i32)>
  • In sys_exit: if exiting process has a parent with wait_thread, call unblock()
  • Clear foreground to parent when child exits

Phase 6: Userspace Shell Binary ✅ COMPLETE

Goal: Write shell.c, compile with musl, deploy.

6a: shell.c

New file: user/src/shell.c

  • Line editor: read char by char via read(0, &c, 1), handle backspace (erase \b \b), Enter (dispatch), Ctrl+C (cancel line), Ctrl+D (exit on empty line)
  • Echo input: shell echoes each typed character with write(1, &c, 1) since kernel delivers raw keypresses
  • Command dispatch:
    • echo <text> — print args
    • pwd — getcwd() + print
    • cd <path> — chdir()
    • ls [path] — open() + getdents64() loop + close()
    • cat <file> — open() + read() loop + close()
    • exit — _exit(0)
    • Anything else — try posix_spawn(cmd) + waitpid(), print error if spawn fails
  • Process spawning: uses posix_spawn() (musl’s wrapper around clone + execve)
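The fallback dispatch in the last two bullets can be sketched with standard POSIX calls (run_command is a hypothetical helper name, not a symbol from shell.c):

```c
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

extern char **environ;

/* Spawn a program and wait for it, returning its exit status
   (or -1 if the spawn failed or the child died abnormally). */
static int run_command(const char *path, char *const argv[]) {
    pid_t pid;
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "%s: %s\n", path, strerror(err));
        return -1;
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing environ as the last posix_spawn argument is what hands the shell’s env table to the child, matching the scope decision above.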

6b: Build

File: user/Makefile — builds src/*.c → bin/ as static musl binaries.

6c: Deploy to disk image

Compiled shell binary is output to user/bin/shell; available in guest via 9p at /host/bin/shell or /bin/shell (fallback root mount).

6d: Auto-launch on boot

File: kernel/src/main.rs

  • After VFS is mounted, spawn an async task that reads /shell from VFS and calls spawn_process()
  • Set the spawned shell as the foreground process
  • If /shell not found, fall back to kernel shell (log a message)

6e: Kernel shell fallback

Automatic via the keyboard routing in Phase 3c: when foreground PID is 0 (kernel), keys go to the kernel shell actor. When the userspace shell exits or crashes, sys_exit resets foreground to parent (kernel), restoring the old behavior.


File Summary

| File | Changes |
|---|---|
| libkernel/src/task/scheduler.rs | Blocked state, block_current_thread(), unblock() |
| libkernel/src/file.rs | FileHandle trait (returns FileError), FileError enum, ConsoleHandle |
| libkernel/src/console.rs | Console input buffer, foreground PID tracking |
| libkernel/src/process.rs | fd_table, cwd, parent_pid, wait_thread; fd helpers (return FileError) |
| osl/src/errno.rs | Linux errno constants, file_errno() / vfs_errno() converters |
| osl/src/blocking.rs | blocking() async-to-sync bridge |
| osl/src/file.rs | VfsHandle, DirHandle (VFS-backed file handles) |
| osl/src/syscalls/ | Syscall dispatch and implementations: read/write/close/open/getdents64/getcwd/chdir/clone/execve/waitpid |
| osl/src/spawn.rs | spawn_process_full with argv + parent PID |
| libkernel/src/syscall.rs | SYSCALL assembly entry stub, PER_CPU data, init |
| kernel/src/ring3.rs | Legacy spawn_process wrapper, blob spawning tests |
| kernel/src/keyboard_actor.rs | Foreground routing: raw bytes to console buffer |
| kernel/src/main.rs | Auto-launch /shell on boot |
| user/shell.c | Userspace shell with line editing and commands |

Verification

  1. Phase 1: Spawn a kernel thread that blocks itself; have another thread unblock it after a delay. Verify it resumes.
  2. Phase 2-3: exec /hello still works (write goes through fd table). Boot with no userspace shell — kernel shell still functional.
  3. Phase 4: From kernel shell, exec a test program that does open("/hello") + read() + write(1) to cat a file.
  4. Phase 5: Test program that spawns /hello and waits for it.
  5. Phase 6: Boot with /shell on disk. Verify: prompt appears, echo/pwd/cd/ls/cat/exit work, running /hello from shell works, Ctrl+C cancels input, exiting shell returns to kernel shell.

Risks

  • Heap pressure: 512 KiB kernel heap is tight with multiple processes. May need to increase to 1 MiB. Monitor with /proc/meminfo.
  • VFS bridge correctness: The async task must complete before the blocked thread is woken. Guaranteed by design, but a panic in the async path leaves the thread blocked forever. Consider adding a timeout or panic handler.
  • getdents64 format complexity: Must match Linux’s struct linux_dirent64 layout exactly for musl’s readdir() to work. Alternative: shell can use raw syscall(217, ...) with custom parsing.

Cross-Compiling C Programs for ostoo Userspace

This document explains how to compile static musl-linked x86_64 ELF binaries that can run as ostoo user-space processes, using the crosstool-ng toolchain inside Docker.

Prerequisites

  • Docker
  • The ctng Docker image (built from crosstool/Dockerfile)
  • The compiled toolchain at /Volumes/crosstool-ng/x-tools/x86_64-unknown-linux-musl

Toolchain details

| Component | Version |
|---|---|
| GCC | 15.2.0 |
| musl | 1.2.5 |
| binutils | 2.46.0 |
| Linux headers | 6.18.3 |

Target triple: x86_64-unknown-linux-musl

The toolchain produces fully static-linked ELF binaries with no runtime dependencies (no dynamic linker, no shared libraries).

Building the toolchain from scratch

If you need to rebuild the toolchain:

cd crosstool

# Build the Docker image (includes crosstool-ng and the .config)
docker build . -t ctng

# Run the build (output goes to /Volumes/crosstool-ng/x-tools)
./run.sh

The build runs inside Docker’s case-sensitive overlay filesystem to avoid macOS case-sensitivity issues with the Linux kernel tarball extraction. Only the output (x-tools) and download cache (src) directories are mounted from the host.

Compiling user programs

The scripts/user-build.sh wrapper handles the Docker invocation for you. Arguments are passed through to make:

./scripts/user-build.sh          # build all .c files in user/src/ → user/bin/
./scripts/user-build.sh clean    # clean build artifacts
./scripts/user-build.sh bin/hello  # build a single target

Manual Docker invocation

If you need to run compiler commands directly:

docker run --rm \
  -v /Volumes/crosstool-ng/x-tools:/home/ctng/x-tools \
  -v "$(pwd)/user":/home/ctng/user \
  ctng bash -c '
    export PATH="/home/ctng/x-tools/x86_64-unknown-linux-musl/bin:$PATH"
    cd /home/ctng/user
    x86_64-unknown-linux-musl-gcc -static -Os -Wall -Wextra -o bin/hello src/hello.c
  '

Compiler flags

The recommended flags for ostoo user-space binaries:

| Flag | Purpose |
|---|---|
| -static | Produce a fully static binary (required — ostoo has no dynamic linker) |
| -Os | Optimize for size (keeps binaries small for the FAT filesystem image) |
| -Wall -Wextra | Enable warnings |
| -nostdlib | Skip libc entirely (for minimal programs that use only raw syscalls) |

Verifying the output

You can inspect a compiled binary without Docker using the host file command:

file user/bin/hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, ...

Or use readelf from the toolchain:

docker run --rm \
  -v /Volumes/crosstool-ng/x-tools:/home/ctng/x-tools \
  -v "$(pwd)/user":/home/ctng/user \
  ctng bash -c '
    export PATH="/home/ctng/x-tools/x86_64-unknown-linux-musl/bin:$PATH"
    x86_64-unknown-linux-musl-readelf -h /home/ctng/user/bin/hello
  '

Confirm: Type: EXEC, Machine: Advanced Micro Devices X86-64.

Running on ostoo

  1. Copy the compiled binary onto the FAT filesystem image
  2. Boot ostoo in QEMU
  3. From the shell: exec /hello

The kernel’s ELF loader parses the binary and spawns it as a ring-3 process with the syscall layer providing write, brk, mmap, exit, and other calls needed by musl’s startup code.

Available toolchain binaries

All prefixed with x86_64-unknown-linux-musl-:

| Binary | Purpose |
|---|---|
| gcc / cc | C compiler |
| g++ / c++ | C++ compiler |
| as | Assembler |
| ld / ld.bfd | Linker |
| ar | Archive tool |
| objcopy | Binary manipulation |
| objdump | Disassembler |
| readelf | ELF inspector |
| strip | Strip symbols |
| nm | Symbol table viewer |
| gdb | Debugger |

Cross-Compiling for x86_64 on a Non-x86 Host

This project targets x86_64 bare metal but can be built and run on any host architecture (including aarch64-apple-darwin, i.e. Apple Silicon Macs). This document explains how that works.

Overview

The kernel is compiled for a custom x86_64-os target using Rust’s cross-compilation support. QEMU provides x86_64 emulation at runtime. The host machine never executes the kernel code directly.

Toolchain (rust-toolchain.toml)

[toolchain]
channel = "nightly"
components = ["rust-src", "llvm-tools"]
  • nightly is required for the -Z build-std unstable feature (see below).
  • rust-src provides the standard library source, which is needed to compile core, alloc, and compiler_builtins from source for the custom target.
  • llvm-tools provides llvm-objcopy and related tools used by bootimage when assembling the final disk image.

Rustup downloads a pre-built nightly toolchain for the host architecture. The host toolchain is only used to drive the build; the kernel itself is compiled to x86_64 object files by rustc’s bundled LLVM backend regardless of host architecture.

Custom Target Spec (x86_64-os.json)

Rust’s built-in targets assume a host OS. For a bare-metal kernel we need a custom target. The file x86_64-os.json at the workspace root defines it:

{
    "llvm-target": "x86_64-unknown-none",
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "target-endian": "little",
    "target-pointer-width": 64,
    "target-c-int-width": 32,
    "os": "none",
    "executables": true,
    "linker-flavor": "ld.lld",
    "linker": "rust-lld",
    "panic-strategy": "abort",
    "disable-redzone": true,
    "rustc-abi": "softfloat",
    "features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float"
}

Key fields:

| Field | Value | Reason |
|---|---|---|
| llvm-target | x86_64-unknown-none | Bare metal; no OS assumed by LLVM |
| data-layout | LLVM datalayout string | Must exactly match LLVM’s own layout for this triple; confirmed via rustc +nightly --print target-spec-json --target x86_64-unknown-none -Z unstable-options |
| linker-flavor / linker | ld.lld / rust-lld | Uses LLVM’s cross-capable linker bundled with rustc; no host ld or cross-linker needed |
| disable-redzone | true | Required for kernel interrupt handlers; the red zone is an x86_64 ABI optimisation that is unsafe when interrupts can fire at any stack pointer |
| rustc-abi | softfloat | Tells rustc that this target intentionally violates the standard x86_64 ABI’s SSE requirement. Without this, rustc refuses to compile when SSE is disabled |
| features | -mmx,-sse,...,+soft-float | Disables SIMD/SSE in generated code (unsafe in kernel context without SSE state saving) and enables soft-float emulation instead |

Why rustc-abi: softfloat is needed

The standard x86_64 System V ABI mandates SSE2 support. If you disable SSE features in a custom target spec, rustc rejects the build with:

error: target feature 'sse2' is required by the ABI but gets disabled

The rustc-abi: softfloat field is an escape hatch for kernel targets: it tells rustc to use a different ABI variant (one that does not assume SSE), suppressing the error. This is the same mechanism used internally by Rust’s x86_64-unknown-none tier-2 target.

Cargo Configuration (.cargo/config.toml)

[build]
target = "x86_64-os.json"

[unstable]
build-std = ["core", "compiler_builtins", "alloc"]
build-std-features = ["compiler-builtins-mem"]
json-target-spec = true
  • target: Makes every cargo build in this workspace default to the custom target. No --target flag is required on the command line.
  • build-std: Compiles core, compiler_builtins, and alloc from source for the custom target. This is necessary because Cargo ships pre-compiled standard library crates only for known built-in targets; a custom JSON target has no pre-built sysroot.
  • build-std-features = ["compiler-builtins-mem"]: Builds the memory intrinsics (memcpy, memset, etc.) into compiler_builtins rather than relying on a C runtime, which does not exist in a bare-metal environment.
  • json-target-spec = true: Unlocks support for .json custom target files in current Cargo nightly. Without this flag, Cargo rejects .json target specs.

Bootloader and Bootimage

The kernel ELF is combined with a real-mode x86 bootloader by the bootimage tool:

cargo bootimage --manifest-path kernel/Cargo.toml

This produces target/x86_64-os/debug/bootimage-kernel.bin, a raw x86 disk image.

The bootloader crate (bootloader = "0.9.x") includes its own target spec (x86_64-bootloader.json) and declares build-std = "core" in its own Cargo metadata. bootimage picks this up and compiles the bootloader from source using -Z build-std, just like the kernel — no separate cross-toolchain or cargo-xbuild is needed.

Why bootloader 0.9.x and not 0.8.x

bootloader 0.8.x was released before -Z build-std became stable enough for the bootloader’s own build. It fell back to cargo xbuild, a now-deprecated wrapper tool. The 0.9.x line added build-std to its metadata and has been actively maintained for compatibility with current Rust nightly (data-layout changes, rustc-abi: softfloat, integer target fields, json-target-spec). The kernel-facing API (entry_point!, BootInfo) is the same in both series.

Running Under QEMU

cargo bootimage run --manifest-path kernel/Cargo.toml

or directly:

qemu-system-x86_64 -drive format=raw,file=target/x86_64-os/debug/bootimage-kernel.bin -serial stdio

QEMU provides full x86_64 CPU emulation. The run-time arguments are configured in kernel/Cargo.toml under [package.metadata.bootimage].

Summary

| Concern | Solution |
|---|---|
| Compiling x86_64 code on ARM | rustc’s LLVM backend handles any target regardless of host |
| Linking for bare metal | rust-lld (cross-capable, bundled with rustc) |
| No pre-built sysroot for custom target | -Z build-std compiles core/alloc from source |
| No OS or C runtime | compiler-builtins-mem provides memory intrinsics |
| SSE disabled but ABI expects it | rustc-abi: softfloat in target spec |
| Bootable disk image | bootimage + bootloader 0.9.x (self-contained, no xbuild) |
| Running the kernel | qemu-system-x86_64 on any host |

Rust Cross-Compilation for ostoo Userspace

Build Rust userspace programs natively on macOS, producing static x86_64 ELF binaries that run on the ostoo kernel.

Architecture

user-rs/                        # Separate Cargo workspace
├── Cargo.toml                  # workspace: rt, hello-rs, hello-std
├── .cargo/config.toml          # custom target, build-std
├── x86_64-ostoo-user.json      # custom target spec (no CRT objects)
├── rt/                         # ostoo-rt runtime crate
│   ├── Cargo.toml              # features: no_std (default)
│   └── src/
│       ├── lib.rs              # _start, panic handler, global allocator
│       ├── syscall.rs          # inline-asm SYSCALL wrappers
│       ├── io.rs               # print!/println!/eprint!/eprintln! macros
│       └── alloc_impl.rs       # brk-based bump allocator
├── hello-rs/                   # no_std + alloc example (~5 KiB)
│   ├── Cargo.toml
│   └── src/main.rs
└── hello-std/                  # full std example (~54 KiB)
    ├── Cargo.toml
    └── src/main.rs
sysroot/                        # musl sysroot (extracted, gitignored)
└── x86_64-ostoo-user/
    ├── lib/                    # libc.a, crt*.o, libunwind.a stub
    └── include/                # C headers

Why a separate workspace?

The kernel uses a custom target (x86_64-os.json) that disables SSE and the red zone. Userspace needs standard x86_64 ABI with SSE and red zone enabled. A separate workspace with its own .cargo/config.toml avoids target conflicts.

Custom target (x86_64-ostoo-user.json)

Based on x86_64-unknown-linux-musl but with empty pre-link-objects and post-link-objects — we provide our own _start in ostoo-rt instead of using musl’s CRT startup files. Has crt-static-default: true so that the libc crate links libc.a statically when building std.

Building

# One-time: extract musl sysroot from the ostoo-compiler Docker image
scripts/extract-musl-sysroot.sh

# Build and deploy to user/ (visible at /host/ in guest via virtio-9p)
# (automatically calls extract-musl-sysroot.sh if needed)
scripts/user-rs-build.sh

# Or manually:
cd user-rs
cargo build --release

Uses build-std to compile std and panic_abort (and transitively core, alloc, compiler_builtins, libc, unwind) from source. Links against libc.a from the musl sysroot. Requires the nightly toolchain with rust-src component (already in rust-toolchain.toml).

Note: packages must be built separately (-p hello-rs, then -p hello-std) because Cargo feature unification would otherwise merge ostoo-rt’s no_std feature across the workspace, causing duplicate #[panic_handler] errors. The build script handles this automatically.

Release profile

opt-level = "s", lto = true, panic = "abort", strip = true — produces small binaries (the hello world example is ~4.6 KiB).
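In Cargo.toml terms, this profile likely looks like the following sketch (the placement in user-rs/Cargo.toml is assumed):

```toml
# user-rs/Cargo.toml — workspace release profile (illustrative)
[profile.release]
opt-level = "s"   # optimise for size
lto = true        # whole-program link-time optimisation
panic = "abort"   # no unwinding machinery or landing pads
strip = true      # strip symbols from the final ELF
```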

Runtime crate (ostoo-rt)

ostoo-rt has a no_std feature (enabled by default). With no_std, it provides a panic handler, global allocator, and OOM handler. Without it (for std programs), it provides only _start and syscall wrappers.

Tier 1: no_std + alloc programs

#![no_std]
#![no_main]
extern crate ostoo_rt;
use ostoo_rt::println;

#[no_mangle]
fn main() -> i32 {
    println!("Hello from Rust on ostoo!");
    0
}

Depend on ostoo-rt with default features (includes no_std).

Tier 2: std programs

#![feature(restricted_std)]
#![no_main]
extern crate ostoo_rt;

use std::collections::HashMap;

#[no_mangle]
fn main() -> i32 {
    println!("Hello from Rust std on ostoo!");
    let mut map = HashMap::new();
    map.insert("key", 42);
    println!("HashMap works: {:?}", map);
    0
}

Depend on ostoo-rt with default-features = false (disables no_std so std’s panic handler and allocator are used instead). The #![feature(restricted_std)] attribute is required for custom JSON targets.

What ostoo-rt provides

  • _start entry point (always, but behaviour differs by mode):

    • no_std: reads argc/argv from the stack, calls _start_rust → user’s main() -> i32 directly.
    • std: extracts argc/argv from the stack and calls musl’s __libc_start_main(main, argc, argv, ...) which initializes libc (TLS via arch_prctl, stdio, locale, auxvec parsing) before calling main(argc, argv, envp). This is essential — without libc init, musl’s write() and other functions fault on uninitialised TLS.
  • Syscall wrappers (always) — syscall0 through syscall4 via inline asm (SYSCALL instruction). Typed wrappers: write, read, open, close, exit, brk, getcwd, chdir, getdents64, wait4.

  • print!/println!/eprint!/eprintln! macros (always) — write to fd 1/2 via core::fmt::Write. In std mode, prefer std::println! instead.

  • Global allocator (no_std only) — brk-based bump allocator.

  • Panic handler (no_std only) — prints panic info to stderr, exits 101.
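The shape of an inline-asm syscall wrapper can be sketched as follows. This is a hedged, host-runnable approximation (it executes on x86_64 Linux because ostoo's syscall numbering follows the Linux/musl ABI, where write is syscall 1); the names `syscall3` and `write` are illustrative, not ostoo-rt's actual API.

```rust
use core::arch::asm;

/// Three-argument syscall via the SYSCALL instruction: number in rax,
/// args in rdi/rsi/rdx; the CPU clobbers rcx (RIP) and r11 (RFLAGS).
unsafe fn syscall3(n: u64, a1: u64, a2: u64, a3: u64) -> i64 {
    let ret: i64;
    asm!(
        "syscall",
        inlateout("rax") n => ret,
        in("rdi") a1,
        in("rsi") a2,
        in("rdx") a3,
        lateout("rcx") _, // SYSCALL stores the return RIP here
        lateout("r11") _, // SYSCALL stores RFLAGS here
        options(nostack),
    );
    ret
}

/// Typed wrapper in the style of ostoo-rt's `write` (fd 1 = stdout).
fn write(fd: u64, buf: &[u8]) -> i64 {
    unsafe { syscall3(1, fd, buf.as_ptr() as u64, buf.len() as u64) }
}

fn main() {
    let msg = b"hello via raw SYSCALL\n";
    let n = write(1, msg);
    assert_eq!(n as usize, msg.len()); // write returns the byte count
}
```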

Adding new programs

  1. Create user-rs/<name>/Cargo.toml with ostoo-rt dependency
  2. Add "<name>" to workspace members in user-rs/Cargo.toml
  3. Add the binary name to the deploy loop in scripts/user-rs-build.sh

Verification

# Binary format check
file user-rs/target/x86_64-ostoo-user/release/hello-rs
# → ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped

# Entry point is _start (verify disassembly shows mov rdi,[rsp]; lea rsi,[rsp+8])
llvm-objdump -d --start-address=<entry> target/x86_64-ostoo-user/release/hello-rs

# No .interp section (no dynamic linker)
llvm-readobj -S target/x86_64-ostoo-user/release/hello-rs | grep -c .interp  # → 0

Boot ostoo, then:

> spawn /host/hello-rs
Hello from Rust on ostoo!
Heap works: 42

Musl sysroot

The musl sysroot provides libc.a for Rust’s std to link against. It is extracted from the ostoo-compiler Docker image (which builds musl 1.2.5 via crosstool-ng).

# Extract sysroot (skips if already present)
scripts/extract-musl-sysroot.sh

The sysroot is placed at sysroot/x86_64-ostoo-user/ and is gitignored. It contains:

  • lib/libc.a — musl static C library
  • lib/crt1.o, crti.o, crtn.o — CRT objects (not linked by default; our target spec has empty pre-link-objects)
  • lib/libunwind.a — empty stub (satisfies unwind crate’s #[link]; musl’s unwinder is in libc.a, and with panic=abort unwinding is never invoked)
  • include/ — C headers

The cargo config passes -Lnative=../sysroot/x86_64-ostoo-user/lib to the linker so it can find libc.a and libunwind.a.

Binary sizes

| Program   | Mode           | Size    |
|-----------|----------------|---------|
| hello-rs  | no_std + alloc | ~5 KiB  |
| hello-std | full std       | ~55 KiB |

Both use opt-level = "s", LTO, panic = "abort", and stripping.

APIC and IO APIC Initialization

Background

The x86/x86_64 interrupt subsystem has two generations:

  • 8259 PIC (Programmable Interrupt Controller) — the legacy two-chip design. Master (IRQs 0–7) and slave (IRQs 8–15) are chained. Vectors are remapped to 0x20–0x2F to avoid conflicts with CPU exceptions.
  • APIC (Advanced Programmable Interrupt Controller) — the modern design, required for SMP. Consists of a Local APIC (LAPIC) per CPU core and one or more IO APICs for external devices.

ACPI describes which model the firmware uses via the MADT (Multiple APIC Description Table). On QEMU with default settings, the MADT reports InterruptModel::Apic, meaning APIC mode is required.

Architecture

   Device
     │
     ▼
 IO APIC  ──── Redirection Table ───► Local APIC ──► CPU
(external                              (per core)
 IRQs)                                  LAPIC ID

Local APIC (LAPIC)

  • One per CPU core, memory-mapped at physical address 0xFEE00000 by default.
  • Handles inter-processor interrupts (IPIs) and LAPIC-local sources (timer, thermal, etc.).
  • Must be enabled by writing to the Spurious Interrupt Vector Register (SIVR) at offset 0xF0. Setting bit 8 (APIC_ENABLE) activates the LAPIC. Bits 0–7 set the spurious interrupt vector (conventionally 0xFF).
  • EOI (End of Interrupt) is signalled by writing 0 to the EOI register at offset 0xB0. Unlike the PIC, no interrupt number is needed — the write itself is the acknowledgement.
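As a small worked example of the enable step, the value written to the SIVR combines the enable bit with the spurious vector (constant names here are illustrative, not the actual libkernel symbols):

```rust
// Sketch of the SIVR value written at LAPIC offset 0xF0.
const APIC_ENABLE: u32 = 1 << 8;   // bit 8: software-enable the LAPIC
const SPURIOUS_VECTOR: u32 = 0xFF; // bits 0–7: spurious interrupt vector

fn sivr_value() -> u32 {
    APIC_ENABLE | SPURIOUS_VECTOR
}

fn main() {
    assert_eq!(sivr_value(), 0x1FF);
    println!("SIVR = {:#x}", sivr_value());
}
```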

IO APIC

  • Handles external hardware interrupts (ISA IRQs, PCI interrupts).

  • Accessed via two MMIO registers: IOREGSEL (write selector) and IOWIN (read/write data window), both at the IO APIC base address.

  • Contains a Redirection Table with one 64-bit entry per input pin:

    | Bits  | Field            | Notes                                |
    |-------|------------------|--------------------------------------|
    | 0–7   | Vector           | IDT vector to deliver                |
    | 8–10  | Delivery mode    | 0 = fixed                            |
    | 11    | Destination mode | 0 = physical (LAPIC ID), 1 = logical |
    | 13    | Pin polarity     | 0 = active high, 1 = active low      |
    | 15    | Trigger mode     | 0 = edge, 1 = level                  |
    | 16    | Mask             | 1 = masked (disabled)                |
    | 56–63 | Destination      | Physical: target LAPIC ID            |
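The 64-bit entry layout can be sketched as a small encoder; this is an illustrative, host-runnable model following the bit positions above, not the kernel's actual API:

```rust
// Pack one 64-bit IO APIC redirection-table entry.
// Delivery mode (bits 8–10) and destination mode (bit 11) are left at 0
// (fixed delivery, physical destination), matching the common case.
fn redirection_entry(vector: u8, active_low: bool, level_triggered: bool,
                     masked: bool, lapic_id: u8) -> u64 {
    let mut e = vector as u64;            // bits 0–7: IDT vector
    if active_low      { e |= 1 << 13; }  // bit 13: pin polarity
    if level_triggered { e |= 1 << 15; }  // bit 15: trigger mode
    if masked          { e |= 1 << 16; }  // bit 16: mask
    e |= (lapic_id as u64) << 56;         // bits 56–63: destination LAPIC ID
    e
}

fn main() {
    // Timer: vector 0x20, edge-triggered, active-high, unmasked, LAPIC 0.
    assert_eq!(redirection_entry(0x20, false, false, false, 0), 0x20);

    // A masked keyboard entry targeting LAPIC 1.
    let e = redirection_entry(0x21, false, false, true, 1);
    assert_eq!(e & (1 << 16), 1 << 16); // mask bit set
    assert_eq!(e >> 56, 1);             // destination = LAPIC 1
}
```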

ACPI and Interrupt Source Overrides

ISA IRQs are conventionally edge-triggered, active-high. However, some IRQs are remapped: QEMU reports that ISA IRQ 0 (the PIT timer) is redirected to GSI 2 with edge/active-high signalling. The ACPI InterruptSourceOverride table entries describe these remappings:

| ISA IRQ | Default GSI | Override GSI | Override Polarity | Override Trigger |
|---------|-------------|--------------|-------------------|------------------|
| 0       | 0           | 2 (QEMU)     | Same as bus       | Same as bus      |
| 1       | 1           |              |                   |                  |

The init_io function reads these overrides from apic_info.interrupt_source_overrides and uses the correct GSI, polarity, and trigger mode when programming each redirection entry.
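The resolution step can be modelled in a few lines; this is a host-runnable sketch, with `Override` and `resolve_gsi` as illustrative stand-ins for the `acpi` crate's interrupt-source-override entries:

```rust
// Minimal model of ISA-IRQ → GSI resolution via ACPI overrides.
struct Override { isa_irq: u8, gsi: u32 }

fn resolve_gsi(isa_irq: u8, overrides: &[Override]) -> u32 {
    overrides
        .iter()
        .find(|o| o.isa_irq == isa_irq)
        .map(|o| o.gsi)
        .unwrap_or(isa_irq as u32) // identity mapping when no override exists
}

fn main() {
    // QEMU reports one override: ISA IRQ 0 (PIT) → GSI 2.
    let overrides = [Override { isa_irq: 0, gsi: 2 }];
    assert_eq!(resolve_gsi(0, &overrides), 2); // timer is remapped
    assert_eq!(resolve_gsi(1, &overrides), 1); // keyboard keeps GSI 1
}
```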

Initialization Sequence

1. Map Local APIC (libkernel::apic::init_local)

The LAPIC physical address is read from the IA32_APIC_BASE MSR. A virtual page at APIC_BASE is mapped to this physical frame (with NO_CACHE flag, as MMIO must not be cached):

Physical 0xFEE00000  →  Virtual 0xFFFF_8001_0000_0000

After mapping:

  • init() logs the LAPIC ID, version, and LVT register state.
  • enable() writes the SIVR: APIC_ENABLE | 0xFF (enable + spurious vector).
  • The EOI virtual address (APIC_BASE + 0xB0) is stored in libkernel::interrupts::LAPIC_EOI_ADDR so interrupt handlers can issue EOI without needing a reference to the apic module.

2. Map IO APIC(s) (libkernel::apic::init_io)

Each IO APIC listed in the ACPI MADT is mapped to consecutive virtual pages starting at APIC_BASE + 4KiB. The global_system_interrupt_base field records which GSIs this IO APIC handles (typically 0 for the first IO APIC).

After mapping all IO APICs:

  • Mask all entries — every redirection table slot is masked before programming, preventing spurious interrupts during setup.
  • Route ISA IRQs — IRQ 0 (timer) and IRQ 1 (keyboard) are routed to IDT vectors 0x20 and 0x21 respectively, targeting the BSP’s LAPIC ID. Source overrides are applied (e.g. timer GSI 2 on QEMU).

3. Update IDT and EOI (libkernel::interrupts)

The IDT is extended with a spurious interrupt handler at vector 0xFF. Spurious LAPIC interrupts must not receive an EOI.

The timer and keyboard handlers are updated to call send_eoi() instead of PICS.notify_end_of_interrupt(). send_eoi() checks LAPIC_EOI_ADDR: if non-zero (APIC mode), it writes 0 to the LAPIC EOI register; otherwise it falls back to the PIC path. This allows the same IDT to work in both PIC and APIC modes.

4. Disable the 8259 PIC (libkernel::interrupts::disable_pic)

After the IO APIC is programmed, the PIC is disabled by masking all IRQs:

use x86_64::instructions::port::Port;

let mut master = Port::<u8>::new(0x21);
let mut slave = Port::<u8>::new(0xA1);
unsafe {
    master.write(0xFF); // mask all IRQs on the master PIC
    slave.write(0xFF);  // mask all IRQs on the slave PIC
}

This prevents the PIC from delivering interrupts that would arrive at the wrong vectors or cause double-delivery with the IO APIC.

Key Constants

| Symbol            | Value                 | Description                         |
|-------------------|-----------------------|-------------------------------------|
| APIC_BASE         | 0xFFFF_8001_0000_0000 | Virtual base for LAPIC mapping      |
| LAPIC_EOI_OFFSET  | 0xB0                  | Offset of EOI register in LAPIC     |
| LAPIC_SIVR_OFFSET | 0xF0                  | Offset of SIVR in LAPIC             |
| SPURIOUS_VECTOR   | 0xFF                  | IDT vector for LAPIC spurious IRQs  |
| TIMER_VECTOR      | 0x20                  | IDT vector for timer (ISA IRQ 0)    |
| KEYBOARD_VECTOR   | 0x21                  | IDT vector for keyboard (ISA IRQ 1) |

Crate Location

The APIC code lives in libkernel::apic (module libkernel/src/apic/). It was originally a separate apic crate but was merged into libkernel so that libkernel::irq_handle can call IO APIC mask/unmask/write functions directly without duplicating raw MMIO code.

The LAPIC EOI address is communicated via a single AtomicU64 in libkernel::interrupts: the apic module writes the address after mapping the LAPIC, and interrupt handlers read it to perform EOI.

References

  • Intel SDM Vol. 3A, Chapter 10: Advanced Programmable Interrupt Controller (APIC)
  • OSDev Wiki: APIC, IO APIC, MADT
  • ACPI Specification, Section 5.2.12: Multiple APIC Description Table

LAPIC Timer

Overview

The kernel uses the Local APIC (LAPIC) per-core timer as the primary tick source at 1000 Hz (1 ms resolution), replacing the legacy 8253 Programmable Interval Timer (PIT).

| Property                | PIT                  | LAPIC timer             |
|-------------------------|----------------------|-------------------------|
| Frequency               | 100 Hz (10 ms/tick)  | 1000 Hz (1 ms/tick)     |
| Scope                   | System-wide, ISA bus | Per-core, MMIO          |
| Configuration           | Port I/O             | Memory-mapped registers |
| Timer future resolution | 10 ms                | 1 ms                    |

LAPIC Timer Calibration

The LAPIC timer counts down from a programmed initial value at a rate derived from the CPU bus frequency, which varies between machines. To determine the correct initial count for 1000 Hz, the kernel calibrates against the PIT.

Algorithm

  1. Start one-shot countdown — write 0xFFFF_FFFF to TimerInitialCount with divide-by-16.
  2. Wait 500 ms — busy-wait on TICK_COUNT for 50 PIT ticks (50 × 10 ms = 500 ms).
  3. Read elapsed countelapsed = 0xFFFF_FFFF - TimerCurrentCount.
  4. Compute bus frequency:
    lapic_bus_freq = elapsed × divide × PIT_HZ / PIT_ticks_waited
                   = elapsed × 16 × 100 / 50
    
  5. Compute initial count for 1000 Hz:
    initial_count = lapic_bus_freq / (divide × target_Hz)
                  = lapic_bus_freq / (16 × 1000)
    
  6. Start periodic timer with the computed initial count.
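Steps 4–5 can be checked with a worked example; the elapsed count below is illustrative, while the divisor, PIT rate, and wait window match the text:

```rust
// Worked example of the calibration arithmetic.
fn main() {
    let divide: u64 = 16;           // TimerDivideConfiguration = 0x3 (÷16)
    let pit_hz: u64 = 100;          // PIT tick rate during calibration
    let pit_ticks_waited: u64 = 50; // 50 × 10 ms = 500 ms window

    // Suppose the LAPIC counted down by 1_953_125 during the window.
    let elapsed: u64 = 1_953_125;

    // Step 4: bus frequency.
    let lapic_bus_freq = elapsed * divide * pit_hz / pit_ticks_waited;
    assert_eq!(lapic_bus_freq, 62_500_000); // a 62.5 MHz bus clock

    // Step 5: initial count for a 1000 Hz periodic timer at ÷16.
    let initial_count = lapic_bus_freq / (divide * 1000);
    assert_eq!(initial_count, 3906); // integer division truncates
}
```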

Implementation

libkernel::apic::calibrate_and_start_lapic_timer() in libkernel/src/apic/mod.rs:

  • Called from kernel/src/main.rs after libkernel::apic::init() and disable_pic().
  • Releases the LOCAL_APIC lock before entering the HLT loop (phase 2) so the PIT ISR can proceed without deadlock.
  • The LAPIC EOI address is already registered in libkernel::LAPIC_EOI_ADDR by init_local().

PIT Coexistence During Calibration

During the 500 ms calibration window, the PIT ISR (vector 0x20) is still active and increments TICK_COUNT. This is required — wait_ticks() depends on it. After the LAPIC timer starts:

  • PIT continues at 100 Hz (vector 0x20 → tick())
  • LAPIC fires at 1000 Hz (vector 0x30 → tick())

Both call tick(), giving approximately 1100 increments per second. The Delay future handles this correctly: early wakeups cause re-polls, which re-register the waker. Timing is slightly fast during calibration startup (~0.1% error), which is acceptable for kernel timers.

To eliminate the PIT contribution after calibration, mask GSI 2 in the IO APIC:

// follow-up: IO_APICS.lock()[0].mask_entry(2);

This is not yet implemented.

Multi-Waker Design

Problem with AtomicWaker

futures_util::task::AtomicWaker holds a single waker. With multiple concurrent Delay futures across different tasks, each poll() call overwrites the previous waker. When the ISR fires, only the last registered task is woken; others remain pending indefinitely.

Solution: Fixed Waker Array

libkernel/src/task/timer.rs uses a fixed array of 8 optional wakers behind a spinlock:

use core::task::Waker;

static WAKERS: spin::Mutex<[Option<Waker>; 8]> =
    spin::Mutex::new([const { None }; 8]); // inline const: Waker is not Copy

On each tick (tick() called from ISR):

  1. Increment TICK_COUNT.
  2. Acquire the lock (interrupts already disabled by CPU on IDT dispatch — no deadlock).
  3. Take and wake every non-empty slot.

In Delay::poll():

  1. Check TICK_COUNT >= target — return Ready immediately if done.
  2. Clone the waker (may allocate — must be done in task context, before disabling interrupts).
  3. Disable interrupts (without_interrupts) and lock WAKERS.
  4. Find an empty slot and insert the cloned waker. Panic if all slots are full (bug indicator).
  5. Re-check TICK_COUNT >= target — return Ready if the ISR fired between step 1 and step 4.
  6. Return Pending.
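The tick/poll interplay can be exercised with a host-side model; std's `Mutex` and a flag-setting `Wake` implementation stand in for the kernel's spinlock and the executor's real wakers (all names here are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

// Fixed waker array, as in libkernel/src/task/timer.rs but host-side.
static WAKERS: Mutex<[Option<Waker>; 8]> = Mutex::new([const { None }; 8]);

/// Waker that just sets a flag, standing in for the executor's waker.
struct Flag(AtomicBool);
impl Wake for Flag {
    fn wake(self: Arc<Self>) {
        self.0.store(true, Ordering::SeqCst);
    }
}

/// What the ISR does on each tick: take and wake every non-empty slot.
fn tick() {
    for slot in WAKERS.lock().unwrap().iter_mut() {
        if let Some(w) = slot.take() {
            w.wake();
        }
    }
}

/// What Delay::poll does when pending: park a clone in a free slot.
fn register(w: &Waker) {
    let mut slots = WAKERS.lock().unwrap();
    let free = slots.iter_mut().find(|s| s.is_none()).expect("waker slots full");
    *free = Some(w.clone());
}

fn main() {
    // Two concurrent Delay futures register their wakers...
    let a = Arc::new(Flag(AtomicBool::new(false)));
    let b = Arc::new(Flag(AtomicBool::new(false)));
    register(&Waker::from(a.clone()));
    register(&Waker::from(b.clone()));
    tick(); // ...and one tick wakes both, unlike a single AtomicWaker.
    assert!(a.0.load(Ordering::SeqCst) && b.0.load(Ordering::SeqCst));
}
```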

ISR/Task Locking Contract

| Context             | IF flag                        | Lock acquisition                                                       |
|---------------------|--------------------------------|------------------------------------------------------------------------|
| ISR (timer handler) | 0 (CPU clears on IDT dispatch) | Always succeeds immediately                                            |
| Task (Delay::poll)  | 1 (enabled)                    | Uses without_interrupts to prevent ISR re-entry while holding the lock |

If a task held the lock with interrupts enabled, the ISR could fire and spin forever trying to acquire the same lock — a deadlock. without_interrupts prevents this.

TICKS_PER_SECOND Constant

Defined in libkernel/src/task/timer.rs:

pub const TICKS_PER_SECOND: u64 = 1000;

Use it to convert between ticks and real time:

// Convert ticks to seconds elapsed
let secs = ticks() / TICKS_PER_SECOND;

// Create a 1-second delay
Delay::from_secs(1).await;

// Create a 250 ms delay
Delay::from_millis(250).await;

Delay::from_millis(ms) uses ceiling division to avoid returning early:

Self::new((ms * TICKS_PER_SECOND + 999) / 1000)

LAPIC Timer Registers

| Register                 | Offset | Purpose                                                     |
|--------------------------|--------|-------------------------------------------------------------|
| LvtTimer                 | 0x320  | Vector[7:0], mask[16], mode: one-shot[17]=0, periodic[17]=1 |
| TimerInitialCount        | 0x380  | Write to start countdown                                    |
| TimerCurrentCount        | 0x390  | Read-only; current value                                    |
| TimerDivideConfiguration | 0x3E0  | Bus clock divisor (0x3 = ÷16)                               |

The kernel uses divide-by-16 (0x3). The formula above accounts for this divisor.

Key Files

| File                          | Role                                                                              |
|-------------------------------|-----------------------------------------------------------------------------------|
| libkernel/src/task/timer.rs   | tick(), wait_ticks(), Delay, TICKS_PER_SECOND, waker array                        |
| libkernel/src/interrupts.rs   | LAPIC_TIMER_VECTOR = 0x30, IDT entry, lapic_timer_interrupt_handler               |
| apic/src/local_apic/mapped.rs | start_oneshot_timer(), start_periodic_timer(), stop_timer(), read_current_count() |
| apic/src/lib.rs               | calibrate_and_start_lapic_timer()                                                 |
| kernel/src/main.rs            | Calls calibration; spawns timer_task()                                            |

Microkernel Design

Overview

This document explores evolving ostoo towards a microkernel architecture where device drivers run as userspace processes rather than in the kernel. It covers the motivation, the kernel primitives required, how other systems solve this, and a migration path from the current monolithic design.

See also: networking-design.md for how networking specifically fits into either a monolithic or microkernel architecture.


Why Consider a Microkernel

  • Fault isolation. A buggy NIC or filesystem driver crashes its own process, not the kernel. The system can restart it.
  • Reduced TCB. The trusted computing base shrinks to just the kernel primitives. Less kernel code = fewer exploitable bugs.
  • Hot-swappable drivers. Replace or upgrade a driver without rebooting.
  • Security boundaries. A compromised driver only has access to the specific device it manages, not all of kernel memory.

How Other Systems Do It

Redox OS — Scheme-Based IPC

Redox uses schemes as both a namespace and IPC channel. A scheme is a named resource (like tcp:, udp:, disk:) backed by a userspace daemon. Standard file operations (open/read/write/close) become IPC messages routed through the kernel to the scheme daemon.

  • smolnetd daemon implements tcp:/udp: schemes using smoltcp
  • NIC drivers (e.g. e1000d) are separate userspace processes
  • The kernel provides irq:N and memory: schemes for hardware access
  • Recent work adds io_uring-style shared-memory rings for high-throughput driver-to-driver communication, bypassing the kernel data path

Strengths: elegant “everything is a file” model, reuses POSIX-like ops for IPC. Weaknesses: every cross-component operation involves at least one context switch through the kernel (mitigated by io_uring for data paths).

seL4 — Capabilities + Shared Memory Rings

seL4 provides minimal kernel primitives and lets userspace build everything else:

  • Synchronous endpoints for RPC (~0.2us round-trip on ARM64). Small messages transfer entirely in CPU registers (zero copy). The kernel has a fastpath for seL4_Call/seL4_ReplyRecv with direct process switch (sender switches directly to receiver without full scheduler invocation).
  • Notifications for async signaling (interrupt delivery, ring wakeups). A notification word acts as a bitmask of binary semaphores — different signalers use different bits, so one notification object can multiplex multiple event sources.
  • Capability system — every kernel object (endpoints, frames, interrupts, page tables) is accessed through unforgeable capability tokens stored in per-thread CSpaces. Capabilities can be derived with reduced rights, transferred via IPC, or revoked.
  • sDDF (seL4 Device Driver Framework) uses SPSC shared-memory ring buffers for zero-copy packet passing between NIC driver → multiplexer → application.

sDDF on an iMX8 ARM board with a 1 Gb/s NIC saturates the link at ~95% CPU while Linux on the same hardware maxes out at ~600 Mb/s. The shared-memory ring design avoids Linux’s sk_buff allocation/copy overhead.

Minix 3 — Classic Microkernel

Each driver is a separate user-mode process. The kernel provides:

  • sys_irqsetpolicy() — subscribe to hardware interrupts
  • HARD_INT notification messages — delivered on next receive()
  • SYS_DEVIO — read/write I/O ports from userspace
  • SYS_PRIVCTL — per-driver access control (which ports, IRQs, memory regions are permitted, declared in /etc/system.conf.d/)
  • SYS_UMAP/SYS_VUMAP — virtual-to-physical translation for DMA setup
  • Fixed-length synchronous IPC: send()/receive()/sendrec() + notify()

Networking uses lwIP in a separate server process. A received packet traverses: NIC driver → lwIP server → VFS → application (~4 IPC hops per direction).

Fuchsia (Zircon) — Driver Hosts + FIDL

Drivers are shared libraries loaded into driver host processes. Multiple drivers can be colocated in the same host for zero-overhead communication.

  • FIDL (Fuchsia Interface Definition Language) for typed IPC across process boundaries via Zircon channels
  • DriverTransport for in-process communication between colocated drivers (no kernel involvement — can invoke handler directly in the same stack frame)
  • VMO (Virtual Memory Objects) for shared memory and device MMIO. Created by the bus driver via zx_vmo_create_physical(), passed as handles to device drivers, mapped into their address space
  • BTI (Bus Transaction Initiators) for DMA with IOMMU control. zx_bti_pin() pins a VMO and returns device-physical addresses
  • Interrupt objects — created by bus drivers, delivered via zx_interrupt_wait() (sync) or port binding (async)
  • Control/data plane split: FIDL messages for setup, pre-allocated shared VMOs for bulk data transfer

Key insight: colocation lets Fuchsia avoid the microkernel IPC tax for tightly-coupled drivers while still getting process isolation for less trusted ones.


Kernel Primitives Needed

To support userspace drivers, ostoo must provide these minimal primitives.

1. Physical Memory Mapping

Map device MMIO BARs into a userspace process’s address space.

ostoo today: mmio_phys_to_virt() maps BARs into kernel space only. sys_mmap only supports anonymous private pages (returns -ENOSYS for non-anonymous).

Options:

  • Extend sys_mmap with MAP_SHARED + a device fd (Linux-like /dev/mem)
  • New sys_mmap_device(phys_addr, size, perms) syscall (simpler)
  • Capability-based: kernel creates a “device memory” handle, process maps it via mmap on that handle’s fd

The capability-based approach (device memory as an fd) fits ostoo’s existing fd_table model well and avoids giving processes a raw “map any physical address” primitive.

2. IRQ Delivery to Userspace

Deliver hardware interrupts as events to a userspace driver process.

ostoo today: Interrupts handled entirely in kernel (APIC/IOAPIC routing to kernel ISRs). No mechanism to notify userspace of IRQs.

Options:

| Approach            | Description                                       | Complexity                 |
|---------------------|---------------------------------------------------|----------------------------|
| IRQ fd              | open("/dev/irq/N"), read() blocks until IRQ fires | Low — reuses fd/FileHandle |
| eventfd             | New eventfd() syscall, kernel writes on IRQ       | Medium                     |
| Signal              | Deliver SIGIO to driver process on IRQ            | Medium — requires signals  |
| Notification object | Dedicated kernel object (seL4-style)              | High — new primitive       |
Recommendation: IRQ fd. Create an IrqHandle implementing FileHandle where read() blocks until the interrupt fires and returns a count. The kernel ISR masks the interrupt and calls unblock() on the waiting thread. After handling, the driver writes to the fd to re-enable the interrupt. This fits the existing scheduler block/unblock pattern and the fd_table model.
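The blocking behaviour can be modelled host-side with a condition variable; `IrqFd`, `fire`, and `read` are hypothetical names, and ostoo's FileHandle trait and scheduler block/unblock machinery are not reproduced here:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Host-side model of the proposed IrqHandle: read() blocks until the
// (simulated) ISR fires, then returns and resets the pending count.
struct IrqFd {
    pending: Mutex<u64>,
    cv: Condvar,
}

impl IrqFd {
    fn new() -> Self {
        IrqFd { pending: Mutex::new(0), cv: Condvar::new() }
    }

    /// Called from the simulated kernel ISR after masking the IRQ.
    fn fire(&self) {
        *self.pending.lock().unwrap() += 1;
        self.cv.notify_one();
    }

    /// Driver-side read(): block until at least one IRQ has fired.
    fn read(&self) -> u64 {
        let mut n = self.pending.lock().unwrap();
        while *n == 0 {
            n = self.cv.wait(n).unwrap();
        }
        std::mem::replace(&mut *n, 0)
    }
}

fn main() {
    let irq = Arc::new(IrqFd::new());
    let isr = Arc::clone(&irq);
    let driver = thread::spawn(move || irq.read()); // driver blocks in read()
    isr.fire(); // simulated hardware interrupt unblocks it
    assert_eq!(driver.join().unwrap(), 1);
}
```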

3. Fast IPC

Efficient communication between driver processes and between drivers and applications.

ostoo today: Only pipes (byte streams, no message boundaries, kernel-buffered copy).

Progressive options:

  1. Pipes (have now) — sufficient for prototyping, ~2 copies per message
  2. Unix domain sockets — adds message boundaries (SOCK_DGRAM), ancillary data for fd passing (needed to transfer device handles between processes)
  3. Shared memory regionsmmap(MAP_SHARED) or shmget/shmat for zero-copy ring buffers between cooperating processes
  4. io_uring-style rings — lock-free SPSC queues in shared memory, kernel only involved for wakeups when rings transition empty → non-empty

For initial microkernel work, pipes + shared memory is sufficient. The performance-critical path (packet data) goes through shared memory rings; the control path (setup, teardown) goes through pipes or sockets.

The long-term goal is a pattern where the control plane uses message-based IPC and the data plane uses shared-memory rings, matching seL4 sDDF and Fuchsia’s architecture.
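The data-plane ring of option 4 can be sketched as a lock-free SPSC queue; this is an illustrative host-side model (atomic-word payloads, fixed capacity), not ostoo code or the seL4 sDDF layout:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Single-producer single-consumer ring with monotonic head/tail counters.
struct Ring<const N: usize> {
    buf: [AtomicUsize; N],
    head: AtomicUsize, // next slot to write (producer-owned)
    tail: AtomicUsize, // next slot to read (consumer-owned)
}

impl<const N: usize> Ring<N> {
    fn new() -> Self {
        Ring {
            buf: [const { AtomicUsize::new(0) }; N],
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer side. Fails when full — no blocking, no kernel involvement.
    fn push(&self, v: usize) -> Result<(), usize> {
        let head = self.head.load(Ordering::Relaxed);
        if head - self.tail.load(Ordering::Acquire) == N {
            return Err(v); // ring full
        }
        self.buf[head % N].store(v, Ordering::Relaxed);
        self.head.store(head + 1, Ordering::Release); // publish the slot
        Ok(())
    }

    /// Consumer side. Returns None when empty.
    fn pop(&self) -> Option<usize> {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail == self.head.load(Ordering::Acquire) {
            return None; // ring empty
        }
        let v = self.buf[tail % N].load(Ordering::Relaxed);
        self.tail.store(tail + 1, Ordering::Release); // free the slot
        Some(v)
    }
}

fn main() {
    let ring = Arc::new(Ring::<8>::new());
    let (p, c) = (Arc::clone(&ring), ring);
    let producer = thread::spawn(move || {
        for i in 0..100 {
            while p.push(i).is_err() {} // spin when full: "polling under load"
        }
    });
    let mut got = Vec::new();
    while got.len() < 100 {
        if let Some(v) = c.pop() {
            got.push(v);
        }
    }
    producer.join().unwrap();
    assert_eq!(got, (0..100).collect::<Vec<_>>());
}
```

In the real design the payloads would be packet-buffer descriptors in a shared-memory region, with a notification fd used only when the ring transitions empty → non-empty.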

4. Shared Memory

Map the same physical pages into multiple process address spaces.

ostoo today: Each process gets a private PML4. create_user_page_table copies kernel entries (256-510) but there is no mechanism for two user processes to share pages.

What’s needed:

  • A shared memory object (named or anonymous) backed by physical frames
  • mmap() to map it into each participating process
  • Reference counting so pages are freed only when all mappers unmap
  • Access control: processes must hold a handle/capability to map the region

Design sketch:

Process A                  Kernel                     Process B
    │                                                      │
    ├─ shmget(key, size) ──→ allocate frames ←── shmget(key, size) ─┤
    │                        ref_count = 2                 │
    ├─ shmat(id) ──────────→ map into A's PML4             │
    │                                                      │
    │                        map into B's PML4 ←─── shmat(id) ──────┤
    │                                                      │
    │  (A and B now read/write the same physical pages)    │

Alternative: fd-based approach where mmap(fd, MAP_SHARED) on a memfd works like Linux. This avoids inventing SysV-style APIs and reuses the fd model.

5. DMA Support

Userspace drivers need physical addresses for device DMA programming.

ostoo today: KernelHal::dma_alloc() allocates physically-contiguous pages and returns (paddr, NonNull<u8>). share() translates virtual to physical via page table walk. Both are kernel-internal.

What’s needed:

  • A syscall to allocate DMA-capable memory: physically contiguous, pinned, and mapped into the calling process. Returns both the virtual address and the physical address (the driver needs the physical address to program the device’s DMA descriptors).
  • Or: a two-step model where the kernel allocates DMA buffers and provides an fd. The driver maps the fd and queries the physical address separately.

The fd-based model is safer (physical addresses are not exposed until the driver proves it holds the right handle) and aligns with Fuchsia’s BTI/PMT pattern.

6. Access Control

Prevent arbitrary processes from mapping device memory or claiming IRQs.

Options (increasing sophistication):

| Approach         | Description                                                                                             | Precedent                             |
|------------------|---------------------------------------------------------------------------------------------------------|---------------------------------------|
| Init-time grant  | Only the init process can spawn drivers with device access, configured at spawn time                    | Simple, sufficient for single-user OS |
| Capability-based | Kernel objects (device memory, IRQ handles) are capabilities obtained from a parent or resource manager | seL4, Fuchsia                         |
| Policy-based     | Configuration file declares which programs may access which devices                                     | Minix 3                               |

Recommendation: Init-time grant for the first iteration. The kernel shell or init process spawns driver processes and passes them fds for their device MMIO region and IRQ. The driver inherits these fds across exec. No new kernel objects needed — just careful fd management.

Later, this can evolve towards a capability model where device handles are kernel objects with typed permissions.


What Stays in the Kernel

Even in a full microkernel, some things must remain:

  • CPU scheduling — timer interrupts, context switching, thread states
  • Memory management — page tables, frame allocation, address space setup
  • IPC primitives — message passing, shared memory mapping, notifications
  • Interrupt routing — top-half ISR that masks IRQ and notifies userspace
  • Capability/access control — enforce which processes access which devices
  • Boot and early init — PCI enumeration can eventually be delegated, but initial hardware discovery typically starts in kernel

Everything else — device drivers, filesystems, network stacks, even the TCP/IP protocol processing — can live in userspace.


Current ostoo Gaps

Summary of what exists vs what’s needed:

| Primitive               | Current State                        | Gap                                    |
|-------------------------|--------------------------------------|----------------------------------------|
| Physical memory mapping | Kernel-only (mmio_phys_to_virt)      | Need userspace mapping syscall         |
| IRQ delivery            | Kernel ISRs only                     | Need IRQ fd or notification            |
| IPC                     | Pipes only (byte stream, ~2 copies)  | Need shared memory + message boundaries |
| Shared memory           | None (private PML4 per process)      | Need cross-process page sharing        |
| DMA                     | Kernel-only (dma_alloc/share)        | Need userspace DMA allocation          |
| Access control          | None (all processes equal)           | Need per-process device permissions    |
| mmap                    | Anonymous private only               | Need MAP_SHARED, device mapping        |
| ioctl                   | Not implemented                      | Need for device control                |

Migration Path

A phased approach that starts monolithic and progressively moves towards microkernel:

Phase A — Monolithic Drivers (current)

All drivers (virtio-blk, virtio-9p, and eventually virtio-net) run in kernel space via the devices crate. This is the working baseline. Networking is implemented in kernel with smoltcp (see networking-design.md).

Phase B — Add Kernel Primitives

Implement foundational primitives without yet moving drivers out. These are independently useful:

  1. mmap(MAP_SHARED) — shared memory between processes (needed for efficient multi-process programs even without microkernel goals)
  2. IRQ fdirq_create(gsi) syscall (504) returns an fd backed by FdObject::Irq. Used with OP_IRQ_WAIT on a completion port. Implemented (see libkernel/src/irq_handle.rs, osl/src/irq.rs)
  3. Device MMIO mapping — map physical BAR regions to userspace via an fd
  4. DMA allocation syscall — allocate pinned, physically-contiguous pages accessible from userspace

Phase C — Userspace NIC Driver

Move the virtio-net driver to a userspace process as a proof-of-concept:

  • Process receives device MMIO fd and IRQ fd from init
  • Maps the virtio-net PCI BAR into its address space
  • Opens IRQ fd and polls/blocks for interrupts
  • Allocates DMA buffers for virtqueue descriptors
  • Communicates with the in-kernel TCP/IP stack via shared memory ring buffers

The TCP/IP stack stays in kernel at this stage. This tests the driver primitive infrastructure with a single, well-understood device.

Phase D — Userspace TCP/IP Stack

Move smoltcp to a separate userspace server process:

  • Receives raw Ethernet frames from NIC driver via shared memory rings
  • Processes TCP/IP/ARP/DHCP
  • Delivers data to applications via shared memory or kernel-mediated IPC
  • The kernel’s socket syscall handlers become thin IPC stubs that route requests to this server (preserving POSIX compatibility for musl)

Phase E — Generalize

Apply the same pattern to other drivers:

  • virtio-blk → userspace block driver + userspace filesystem server
  • virtio-9p → userspace 9P client
  • Console/keyboard → userspace terminal driver

At this point the kernel is a true microkernel: scheduler, memory management, IPC, and capability enforcement only. The devices and osl crates either become userspace libraries or are restructured into per-driver binaries.


Performance Considerations

Context Switches Per Packet

| Path                 | Monolithic              | Microkernel               |
|----------------------|-------------------------|---------------------------|
| NIC IRQ → driver     | 0 (in kernel)           | 1 (kernel → driver)       |
| Driver → TCP/IP      | 0 (function call)       | 1 (shared memory signal)  |
| TCP/IP → application | 1 (return to userspace) | 1 (IPC or signal)         |
| Total                | 1                       | 3 (naive) / 1–2 (batched) |

Why This Is Acceptable

  • With shared-memory ring buffers and batching, the kernel is only involved for wakeups when rings transition empty → non-empty.
  • Under load, driver and TCP/IP server can poll their rings without any kernel involvement (similar to Linux NAPI busy-polling).
  • seL4 IPC round-trip: ~0.2us. Network I/O latency: 25-500+us. Even 3 IPC hops are a small fraction of total latency.
  • The historical Mach-era penalty (50-100% overhead) is now 5-10% for general workloads and near-zero for I/O-dominated workloads.

Key Optimisations

  1. Shared memory data plane — kernel only signals, never copies data
  2. Batching — process N packets per wakeup, not 1
  3. Direct process switch — IPC sender switches directly to receiver without full scheduler invocation (seL4 fastpath)
  4. Polling under load — skip notifications entirely when rings are busy
  5. Pre-allocated buffer pools — no per-packet allocation
  6. Driver colocation (Fuchsia-style) — run tightly-coupled drivers in the same address space when isolation between them is not needed

Comparison Summary

| Aspect                | Monolithic (Phase A)        | Full Microkernel (Phase E)          |
|-----------------------|-----------------------------|-------------------------------------|
| Kernel code size      | Large (drivers + protocols) | Small (primitives only)             |
| Driver crash          | Kernel panic                | Restart driver process              |
| Attack surface        | Entire kernel               | Minimal kernel + IPC                |
| Performance           | Best (no IPC overhead)      | Good (shared memory amortises cost) |
| Implementation effort | Low                         | High (needs IPC, shared mem, caps)  |
| POSIX compat          | Direct                      | Kernel-mediated IPC stubs           |
| ostoo readiness       | Ready now                   | Needs phases B–E                    |

Compositor Design

A Wayland-style userspace compositor for ostoo.

Overview

The compositor takes ownership of the BGA framebuffer, accepts client connections via a service registry, allocates shared-memory pixel buffers for clients, and composites their output to the screen.

MVP scope: window creation, buffer allocation, damage signaling, compositing. No input routing or window management.

Architecture

┌───────────┐  svc_lookup   ┌────────────┐  framebuffer_open  ┌─────┐
│  Client   │──────────────▶│ Compositor │───────────────────▶│ BGA │
│           │  IPC channels │            │  MAP_SHARED mmap   │ LFB │
│  shmem    │◀─────────────▶│  shmem     │                    └─────┘
│  buffer   │  notify fds   │  event     │
└───────────┘               │  loop      │
                            └────────────┘

Kernel Primitives Used

| Syscall | Nr | Purpose |
|---|---|---|
| svc_register | 513 | Compositor registers itself under "compositor" |
| svc_lookup | 514 | Client finds the compositor’s registration channel |
| framebuffer_open | 515 | Compositor gets an shmem fd wrapping the BGA LFB |
| ipc_create | 505 | Channel pairs for registration and per-client comms |
| ipc_send / ipc_recv | 506/507 | Message passing with fd-passing |
| shmem_create | 508 | Per-window pixel buffer allocation |
| notify_create | 509 | Per-window damage notification fd |
| notify | 510 | Client signals “buffer is ready” |
| io_create | 501 | Compositor’s completion port |
| io_submit | 502 | Arm OP_IPC_RECV and OP_RING_WAIT |
| io_wait | 503 | Block until events arrive |

Service Registry (syscalls 513–514)

A kernel-global BTreeMap<String, FdObject> keyed by null-terminated name.

  • svc_register(name, fd): clones the fd object + notify_dup(), inserts under name. Returns -EBUSY if taken.
  • svc_lookup(name): clones + notify_dup()s the stored object, allocates fd in caller’s table. Returns -ENOENT if not found.

Max name length: 128 bytes.
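
A minimal sketch of these semantics, with a u64 standing in for the kernel's FdObject and Linux-style errno values (names and the length-error choice are illustrative):

```rust
use std::collections::BTreeMap;

const EBUSY: i64 = 16;
const ENOENT: i64 = 2;
const EINVAL: i64 = 22; // assumed error for over-long names
const MAX_NAME: usize = 128;

/// Sketch of the kernel-global registry; a u64 stands in for the
/// cloned FdObject + notify_dup() pair.
pub struct ServiceRegistry {
    map: BTreeMap<String, u64>,
}

impl ServiceRegistry {
    pub fn new() -> Self {
        ServiceRegistry { map: BTreeMap::new() }
    }

    /// svc_register: insert under name; -EBUSY if the name is taken.
    pub fn register(&mut self, name: &str, fd_obj: u64) -> i64 {
        if name.len() > MAX_NAME {
            return -EINVAL;
        }
        if self.map.contains_key(name) {
            return -EBUSY;
        }
        self.map.insert(name.to_string(), fd_obj);
        0
    }

    /// svc_lookup: clone the stored object; -ENOENT if absent.
    pub fn lookup(&self, name: &str) -> Result<u64, i64> {
        self.map.get(name).copied().ok_or(-ENOENT)
    }
}
```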

Framebuffer Access (syscall 515)

framebuffer_open(flags) creates a SharedMemInner::from_existing() wrapping the BGA LFB physical frames. The frames are non-owning (MMIO memory is never freed). The caller mmaps with MAP_SHARED to get a user-accessible pointer.

The LFB physical address and size are stored in atomics during BGA init and read by the syscall handler.

Connection Protocol

Uses existing IPC fd-passing — no new kernel primitives needed.

Compositor Setup

  1. ipc_create() → [reg_send, reg_recv]
  2. svc_register("compositor\0", reg_send)
  3. Create CompletionPort, submit OP_IPC_RECV on reg_recv

Client Connects

  1. svc_lookup("compositor\0") → dup of reg_send
  2. Create two channel pairs: c2s (client→server) and s2c (server→client)
  3. ipc_send(reg_send, MSG_CONNECT { w, h, fds=[c2s_recv, s2c_send] })
  4. ipc_recv(s2c_recv) → MSG_WINDOW_CREATED { id, w, h, fds=[buf_fd, notify_fd] }
  5. mmap(MAP_SHARED, buf_fd) → pixel buffer
  6. Draw, then notify_signal(notify_fd)

Compositor Accepts

  1. Extract c2s_recv, s2c_send from message
  2. Allocate shmem buffer + notify fd
  3. ipc_send(s2c_send, MSG_WINDOW_CREATED { fds=[buf_fd, notify_fd] })
  4. Arm OP_RING_WAIT on notify fd, OP_IPC_RECV on c2s_recv
  5. Re-arm OP_IPC_RECV on reg_recv for next client

Wire Protocol

| Tag | Name | Direction | data[] | fds[] |
|---|---|---|---|---|
| 1 | MSG_CONNECT | client→compositor | [w, h, 0] | [c2s_recv, s2c_send] |
| 2 | MSG_WINDOW_CREATED | compositor→client | [wid, w, h] | [buf_fd, notify_fd] |
| 3 | MSG_PRESENT | client→compositor | [wid, 0, 0] | — |
| 4 | MSG_CLOSE | client→compositor | [wid, 0, 0] | — |

Compositor Event Loop

port.wait(min=1) → completions
  TAG_NEW_CLIENT    → handle_connect(), re-arm OP_IPC_RECV on reg_recv
  TAG_DAMAGE(wid)   → mark dirty, re-arm OP_RING_WAIT
  TAG_CMD(wid)      → handle MSG_PRESENT/MSG_CLOSE, re-arm OP_IPC_RECV
if any dirty → composite()

Compositing

  • BGRA throughout (matches BGA native format, zero conversion)
  • Background: solid dark grey (0x00282828)
  • Window placement: auto-tile (2×2 grid)
  • Full-screen repaint on any damage (acceptable at 1024×768)
  • Painter’s algorithm, back-to-front
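
The pass above reduces to a back-to-front blit of each window's buffer over the background. A simplified sketch (no decorations, u32 BGRA pixels, illustrative types):

```rust
/// One client window: position plus its BGRA pixel buffer.
pub struct Window {
    pub x: usize,
    pub y: usize,
    pub w: usize,
    pub h: usize,
    pub pixels: Vec<u32>, // w * h pixels, row-major
}

const BACKGROUND: u32 = 0x0028_2828; // solid dark grey

/// Painter's algorithm: clear, then blit windows back-to-front so
/// later (front-most) windows overwrite earlier ones. Windows are
/// clipped to the frame bounds.
pub fn composite(frame: &mut [u32], fw: usize, fh: usize, windows: &[Window]) {
    frame.fill(BACKGROUND);
    for win in windows {
        for row in 0..win.h {
            let fy = win.y + row;
            if fy >= fh { break; }
            for col in 0..win.w {
                let fx = win.x + col;
                if fx >= fw { break; }
                frame[fy * fw + fx] = win.pixels[row * win.w + col];
            }
        }
    }
}
```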

Files

| File | Role |
|---|---|
| libkernel/src/service.rs | Service registry (register, lookup) |
| libkernel/src/framebuffer.rs | LFB phys addr globals (set_lfb_phys, get_lfb_phys) |
| osl/src/syscalls/service.rs | sys_svc_register, sys_svc_lookup |
| osl/src/syscalls/fb.rs | sys_framebuffer_open |
| user-rs/rt/src/compositor_proto.rs | Protocol constants |
| user-rs/compositor/ | Compositor binary |
| user-rs/demo-client/ | Demo client binary |

Usage

# Build kernel
cargo bootimage --manifest-path kernel/Cargo.toml

# Build and deploy Rust userspace (compositor + demo-client)
scripts/user-rs-build.sh

# Run
scripts/run.sh

The compositor is auto-launched by the kernel at boot (see launch_compositor in kernel/src/main.rs). Run demo-client from the shell to display a test gradient.

Future Work

  • Input routing: keyboard events → focused window
  • Window management: move, resize, focus, Z-order
  • Double buffering: back-buffer swap
  • Alpha blending
  • Dirty rect optimization
  • Write-combining PAT entries for LFB pages
  • Service auto-cleanup on process exit

Display & Input Ownership

How the framebuffer and keyboard transition from kernel to compositor using the existing fd-passing and service-registry primitives.

Problem

At boot, three components compete for the display and keyboard:

  1. Kernel WRITERprintln!() renders to the BGA framebuffer via an IrqMutex-protected Framebuffer struct.
  2. User shell — reads from the console input buffer, writes to stdout (which goes through WRITER).
  3. Compositor — mmaps the same LFB via framebuffer_open, composites client windows.

Today there is no ownership model. The kernel WRITER and compositor both hold pointers to the same physical framebuffer memory and write concurrently. Keyboard input routes to the user shell via FOREGROUND_PID but the compositor has no way to receive it.

Design: Capability-Based Handoff

Ownership is expressed through who holds which fds, matching the existing IPC model.

Display Ownership

BOOT                            COMPOSITOR RUNNING
────                            ──────────────────
WRITER ──▶ LFB (active)        WRITER ──▶ serial only
                                Compositor ──▶ LFB (exclusive)

When the compositor calls framebuffer_open (syscall 515), two things happen:

  1. The compositor gets an shmem fd wrapping the BGA LFB (existing behaviour).
  2. Side effect: the kernel marks the WRITER backend as suppressed. All subsequent println!() / log::info!() output is redirected to serial only. The kernel no longer touches the LFB. The status bar and timeline strip are also suppressed.

If the compositor exits or crashes, the kernel detects this (via process exit cleanup in terminate_process) and unsuppresses the WRITER, calling repaint_all() to restore kernel display output.

Implementation: DISPLAY_SUPPRESSED: AtomicBool and DISPLAY_OWNER_PID: AtomicU64 in libkernel/src/vga_buffer/mod.rs.
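
A sketch of how those two atomics could coordinate the handoff (the function names are illustrative, not the kernel's actual API):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static DISPLAY_SUPPRESSED: AtomicBool = AtomicBool::new(false);
static DISPLAY_OWNER_PID: AtomicU64 = AtomicU64::new(0);

/// Side effect of the framebuffer_open handler: record the owner and
/// redirect kernel console output to serial only.
pub fn suppress_display(pid: u64) {
    DISPLAY_OWNER_PID.store(pid, Ordering::SeqCst);
    DISPLAY_SUPPRESSED.store(true, Ordering::SeqCst);
}

/// Called from process-exit cleanup: only the owning process's exit
/// unsuppresses the WRITER. Returns true if the caller should run
/// repaint_all() to restore kernel display output.
pub fn on_process_exit(pid: u64) -> bool {
    if DISPLAY_OWNER_PID.load(Ordering::SeqCst) == pid {
        DISPLAY_OWNER_PID.store(0, Ordering::SeqCst);
        DISPLAY_SUPPRESSED.store(false, Ordering::SeqCst);
        true
    } else {
        false
    }
}

pub fn display_suppressed() -> bool {
    DISPLAY_SUPPRESSED.load(Ordering::SeqCst)
}
```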

Input Ownership — Userspace Keyboard Driver

Instead of a kernel-level input_acquire syscall, the keyboard becomes a userspace service (/bin/kbd). This requires no new kernel interfaces — only existing primitives.

┌──────────┐  IRQ fd    ┌──────────┐  IPC channel  ┌────────────┐
│ IO APIC  │───────────▶│ /bin/kbd │──────────────▶│ Compositor │
│ (GSI 1)  │  scancode  │          │  key events   │            │
└──────────┘  in result └──────────┘               └────────────┘

How it works:

  1. /bin/kbd calls irq_create(1), claiming the keyboard IRQ via the existing IRQ fd mechanism. This reroutes the keyboard interrupt from the hardwired kernel ISR (vector 33) to a dynamic vector handled by irq_fd_dispatch, which reads port 0x60 and delivers the scancode in completion.result. The GSI is kept unmasked between interrupts so that edge-triggered interrupts are never lost; scancodes that arrive between OP_IRQ_WAIT re-arms are buffered in a 64-entry ring.
  2. Creates a registration channel and calls svc_register("keyboard").
  3. Event loop on CompletionPort:
    • OP_IRQ_WAIT → receives scancode → decodes via scancode set 1 state machine → produces key events
    • OP_IPC_RECV on registration channel → new client connecting (compositor sends a channel send-end) → stores client
  4. For each decoded key event, sends an IpcMessage to all connected clients.
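
The decode step in item 3 can be sketched as a tiny scancode set 1 state machine: bit 7 distinguishes make from break, and shift state persists across events. Only two letter keys are mapped here for illustration; the event shape mirrors MSG_KB_KEY's data words:

```rust
/// Decoded key event in the shape of MSG_KB_KEY's data words.
#[derive(Debug, PartialEq)]
pub struct KeyEvent {
    pub byte: u8,      // ASCII byte
    pub modifiers: u8, // bit 0 = shift (ctrl/alt omitted in this sketch)
}

pub struct Set1Decoder {
    shift: bool,
}

impl Set1Decoder {
    pub fn new() -> Self { Set1Decoder { shift: false } }

    /// Feed one raw scancode; Some(event) only on a mapped make code.
    pub fn feed(&mut self, scancode: u8) -> Option<KeyEvent> {
        let make = scancode & 0x80 == 0; // bit 7 set = break (release)
        let code = scancode & 0x7F;
        match code {
            0x2A | 0x36 => { self.shift = make; None } // left/right shift
            0x1E if make => Some(self.ascii(b'a')),    // 'A' key
            0x30 if make => Some(self.ascii(b'b')),    // 'B' key
            _ => None, // all other keys omitted from this sketch
        }
    }

    fn ascii(&self, lower: u8) -> KeyEvent {
        KeyEvent {
            byte: if self.shift { lower.to_ascii_uppercase() } else { lower },
            modifiers: self.shift as u8,
        }
    }
}
```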

When /bin/kbd exits, close_irq restores the original IO APIC entry, and the kernel keyboard actor resumes automatically — providing fallback.

Safety: if no client connects within 2 seconds, kbd exits to avoid capturing keyboard input with nobody listening.

Keyboard Protocol

MSG_KB_CONNECT (tag=1): client → keyboard service
  data = [0, 0, 0]
  fds  = [event_send_fd, -1, -1, -1]

MSG_KB_KEY (tag=1): keyboard service → client (via passed channel)
  data = [byte, modifiers, key_type]
  fds  = [-1, -1, -1, -1]
  • key_type: 0 = ASCII byte, 1 = special key (arrow, etc.)
  • modifiers: bitmask (bit 0 = shift, bit 1 = ctrl, bit 2 = alt)

Input Ownership — Mouse (Integrated into Compositor)

The mouse is handled directly by the compositor — no separate mouse driver process. The compositor claims IRQ 12 itself and decodes PS/2 packets inline, eliminating an IPC round-trip per mouse event.

┌──────────┐  IRQ fd    ┌────────────┐
│ IO APIC  │───────────▶│ Compositor │
│ (GSI 12) │  byte      │            │
└──────────┘  in result └────────────┘

How it works:

  1. The compositor calls irq_create(12) — claims mouse IRQ via the existing IRQ fd mechanism. The kernel automatically initializes the PS/2 auxiliary port (i8042 controller) when GSI 12 is claimed.
  2. Arms OP_IRQ_WAIT on the IRQ fd in its completion port event loop.
  3. On each IRQ completion, feeds the raw byte into an inline MouseDecoder that collects 3-byte PS/2 packets (sync on byte 0 bit 3), decodes signed deltas using the OSDev wiki formula, and updates the absolute cursor position (clamped to screen bounds).
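
The inline decoder described in step 3 can be sketched as a pure function over IRQ bytes, using the standard OSDev sign-extension formula (type names are illustrative):

```rust
/// Decoded relative mouse movement plus button state.
#[derive(Debug, PartialEq)]
pub struct MousePacket {
    pub dx: i32,
    pub dy: i32,
    pub left: bool,
    pub right: bool,
}

/// Collects 3-byte PS/2 packets; sketch of the compositor's inline
/// MouseDecoder.
pub struct MouseDecoder {
    buf: [u8; 3],
    idx: usize,
}

impl MouseDecoder {
    pub fn new() -> Self { MouseDecoder { buf: [0; 3], idx: 0 } }

    /// Feed one IRQ byte; Some(packet) when a full packet is assembled.
    pub fn feed(&mut self, byte: u8) -> Option<MousePacket> {
        // Resync: byte 0 of every packet has bit 3 set.
        if self.idx == 0 && byte & 0x08 == 0 {
            return None;
        }
        self.buf[self.idx] = byte;
        self.idx += 1;
        if self.idx < 3 {
            return None;
        }
        self.idx = 0;
        let s = self.buf[0] as i32;
        // OSDev wiki formula: bits 4 (x) and 5 (y) of byte 0 are the
        // ninth (sign) bits of the two deltas.
        let dx = self.buf[1] as i32 - ((s << 4) & 0x100);
        let dy = self.buf[2] as i32 - ((s << 3) & 0x100);
        Some(MousePacket { dx, dy, left: s & 0x01 != 0, right: s & 0x02 != 0 })
    }
}
```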

Compositor Key & Mouse Forwarding

The compositor connects to the keyboard service on startup using svc_lookup_retry(). Key events are forwarded to the focused window’s client via MSG_KEY_EVENT (tag 5). Mouse events (decoded directly from IRQ 12) drive the cursor, focus, window movement, and resizing.

Window Decorations (Server-Side, CDE Style)

The compositor draws server-side decorations inspired by the Common Desktop Environment (CDE) / Motif toolkit, with 3D beveled borders:

╔═══════════════════════════════╗ ─┐
║ ┌──┐                          ║  │
║ │▪▪│    Win 1 (centered)      ║  │ TITLE_H = 24px
║ └──┘                          ║  │
╠═══════════════════════════════╣ ─┘
║ ┌───────────────────────────┐ ║
║ │                           │ ║
║ │     Client Content        │ ║  client buffer (w × h)
║ │                           │ ║  (sunken inner bevel)
║ └───────────────────────────┘ ║
╚═══════════════════════════════╝
  BORDER_W = 4px, BEVEL = 2px
  • 3D bevels: draw_bevel() renders light/dark edge pairs on all four sides to create a raised or sunken look (2px bevel width)
  • Title bar (24px): raised bevel, blue when focused, grey when unfocused
  • Close button: raised square with inner square motif (CDE style), positioned in the top-left of the title bar
  • Window title: centered text rendered with 8×16 CP437 font
  • Client area: surrounded by a sunken inner bevel
  • Color palette: blue-grey CDE theme (slate blue desktop, cool grey-blue window frames, blue active title bars)

Window Management

Focus: Click anywhere in a window to focus it. The focused window moves to the top of the Z-order and receives keyboard input.

Move: Drag the title bar to move a window.

Resize: Drag the bottom edge, right edge, or bottom-right corner to resize. The cursor changes to indicate the resize direction: diagonal double-arrow for corners, horizontal for right edge, vertical for bottom edge. During drag, the window frame updates live. On mouse-up, the compositor allocates a new shared buffer and sends MSG_WINDOW_RESIZED to the client. The terminal emulator remaps the new buffer, recalculates cols/rows, clears the screen, and nudges the shell to redraw its prompt.

Close: Click the close button to close a window.

Compositor Double Buffering & Cursor-Only Rendering

The compositor uses an offscreen back buffer (heap-allocated, same size as the framebuffer) to eliminate flicker. Full composite passes clear and draw all windows (with decorations) into the back buffer, then copy the finished frame to the LFB in a single memcpy.

Cursor-only optimization: Mouse movement that doesn’t change the scene (no window drag, no focus change) takes a fast path: restore the old cursor rectangle from the back buffer (~12x8 pixels), draw the cursor at the new position, and patch only those two small rectangles on the LFB. This avoids the full 3 MB recomposite on every mouse event.
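
The fast path reduces to two small rectangle copies from the back buffer to the LFB. A sketch of that helper, assuming row-major u32 buffers with a shared stride (names and the 12×8 cursor size mirror the description above):

```rust
/// Copy a w x h rectangle at (x, y) from `src` to `dst`; both buffers
/// share the same `stride` (pixels per row), as the back buffer and
/// LFB do. The caller guarantees the rect lies within both buffers.
pub fn patch_rect(
    dst: &mut [u32],
    src: &[u32],
    stride: usize,
    x: usize, y: usize,
    w: usize, h: usize,
) {
    for row in 0..h {
        let start = (y + row) * stride + x;
        dst[start..start + w].copy_from_slice(&src[start..start + w]);
    }
}

/// Cursor-only update: restore the old cursor rect from the back
/// buffer, then patch the rect at the new position (the caller has
/// already drawn the cursor into the back buffer there). Only two
/// ~12x8 rects touch the LFB instead of a full-frame memcpy.
pub fn cursor_move(
    lfb: &mut [u32],
    back: &[u32],
    stride: usize,
    old: (usize, usize),
    new: (usize, usize),
) {
    patch_rect(lfb, back, stride, old.0, old.1, 12, 8);
    patch_rect(lfb, back, stride, new.0, new.1, 12, 8);
}
```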

Terminal Emulator and Shell

The terminal emulator (/bin/term) is a compositor client that spawns the shell with pipe-connected stdin/stdout.

Compositor                      Terminal Emulator              Shell
──────────                      ─────────────────              ─────
         MSG_KEY_EVENT           stdin pipe
  ─────────────────▶  translate  ──────────────▶  read(0)
         s2c channel             stdout pipe
  ◀─────────────────  render    ◀──────────────  write(1)
         damage notify           (pipe pair)
  ◀─────────────────

The terminal emulator:

  • Connects to compositor via svc_lookup_retry("compositor")
  • Gets a 640×384 window (80×24 cells at 8×16 font)
  • Creates pipe pairs for shell stdin/stdout
  • Spawns /bin/shell via clone(CLONE_VM|CLONE_VFORK) + execve (child_stack=0 shares the parent’s stack — safe because the parent is blocked until the child calls execve or _exit)
  • Event loop: key events → shell stdin pipe, shell stdout → VT100 parser → glyph rendering → damage signal

Command interpreter (shell)

A plain stdin/stdout program with no knowledge of the compositor:

  • Reads lines from stdin, writes output to stdout
  • Works identically in both graphical and fallback modes

Fallback Path

If the compositor binary is not present or fails to start:

  • framebuffer_open is never called → WRITER stays active
  • Keyboard IRQ stays with kernel actor → routes to console buffer
  • User shell reads/writes via console as it does today

The system degrades gracefully to the current behaviour.

Startup Sequence

Boot
 ├─ launch_keyboard_driver() [100ms VFS settle]
 │   └─ /bin/kbd: irq_create(1), svc_register("keyboard")
 │      keyboard IRQ rerouted → kernel actor dormant
 │
 ├─ launch_compositor() [100ms VFS settle]
 │   ├─ /bin/compositor: framebuffer_open → WRITER suppressed
 │   ├─ irq_create(12) → PS/2 aux init, inline mouse decoding
 │   ├─ svc_lookup_retry("keyboard") → receive key events
 │   ├─ svc_register("compositor")
 │   └─ kernel spawns /bin/term
 │       ├─ svc_lookup_retry("compositor") → get window
 │       ├─ pipe2 + clone/execve → spawn /bin/shell
 │       └─ event loop (keys → shell stdin, shell stdout → render)
 │
 └─ launch_userspace_shell() [polls DISPLAY_SUPPRESSED every 50ms, up to 1s]
     └─ if DISPLAY_SUPPRESSED → skip (compositor path active)
        else → launch /bin/shell directly (fallback)

Service readiness is coordinated via polling and retry loops rather than hardcoded sleep timings:

  • Userspace: svc_lookup_retry() retries service lookup with 50ms yields
  • Kernel: launch_userspace_shell polls DISPLAY_SUPPRESSED every 50ms (up to 20 iterations / 1 second) before falling back to standalone shell

Fallback Matrix

| kbd | compositor | term | Result |
|---|---|---|---|
| yes | yes | yes | Full graphical: kbd→compositor (mouse integrated)→term→shell |
| yes | yes | no | Compositor up, no terminal (display-only) |
| no | no | - | Classic fallback: kernel kbd actor + shell on console |

No New Syscalls

This design uses only existing kernel primitives:

| Primitive | Syscall | Use |
|---|---|---|
| irq_create | 504 | keyboard driver claims IRQ 1, compositor claims IRQ 12 (mouse) |
| svc_register / svc_lookup | 513/514 | keyboard and compositor service discovery |
| ipc_create / ipc_send / ipc_recv | 505-507 | key/mouse event delivery with fd passing |
| shmem_create | 508 | window buffers, resize buffer allocation |
| framebuffer_open | 515 | display ownership (with suppression side effect) |
| pipe2 | 293 | terminal↔shell communication |
| clone / execve / dup | 256/59/33 | terminal spawns shell |

The only kernel changes beyond the initial suppression flag are:

  • PS/2 auxiliary port initialization on irq_create(12) (libkernel/src/ps2.rs)
  • irq_fd_dispatch reads port 0x60 for GSI 12 (mouse) in addition to GSI 1 (keyboard)

Networking Design

Overview

This document describes the planned networking architecture for ostoo. The design adds TCP/IP networking via a VirtIO network device and the smoltcp protocol stack.

The initial implementation runs entirely in kernel space, matching the existing pattern where VFS and block I/O run in the devices crate. For the longer-term microkernel path where the NIC driver and TCP/IP stack move to userspace, see microkernel-design.md.


Architecture

Kernel-Space (Initial)

Userspace programs (socket/connect/bind/listen/accept/send/recv)
        │
  osl/src/net.rs                  ← syscall → smoltcp socket mapping
        │
  smoltcp::iface::Interface       ← protocol processing (TCP/IP/ARP/DHCP/DNS)
        │
  devices/src/virtio/net.rs       ← smoltcp Device trait wrapping VirtIONet
        │
  VirtIONet<KernelHal, PciTransport>  ← raw Ethernet frame send/receive
        │
  QEMU virtio-net-pci             ← -device virtio-net-pci,netdev=net0
                                     -netdev user,id=net0

Userspace (Future)

Once the microkernel primitives from microkernel-design.md are in place, networking can be restructured:

Userspace programs (socket syscalls, routed by kernel to TCP/IP server)
        │
  TCP/IP server process           ← smoltcp in a userspace daemon
        │
  shared memory ring buffers      ← zero-copy packet passing
        │
  NIC driver process              ← virtio-net via mapped MMIO + IRQ fd
        │
  virtio-net-pci hardware

The kernel’s socket syscall handlers become thin IPC stubs that route requests to the TCP/IP server, preserving POSIX compatibility for musl. This corresponds to Phase C/D in the microkernel migration path.


Kernel-Space vs Userspace

Decision: kernel-space first

The initial implementation runs in kernel space. Reasons:

  • Simpler. No IPC overhead — smoltcp directly accesses the virtio-net driver. No message-passing for every packet.
  • Lower latency. No user/kernel context switches per packet.
  • Matches existing patterns. VFS operations already run in kernel via the devices crate; networking follows the same model.
  • Proven. Hermit OS and Kerla both use smoltcp + virtio-net in kernel space successfully.

The microkernel path (Phases C-D in microkernel-design.md) moves the NIC driver and TCP/IP stack to userspace once shared memory, IRQ delivery, and device MMIO mapping primitives exist.


Protocols

| Layer | Protocol | Priority | Notes |
|---|---|---|---|
| Link | Ethernet II | Required | virtio-net provides raw frames |
| Link | ARP | Required | Automatic in smoltcp with Ethernet medium |
| Network | IPv4 | Required | Core routing |
| Network | ICMP | Required | Ping, error reporting |
| Transport | TCP | Required | Streams (HTTP, SSH, etc.) |
| Transport | UDP | Required | Datagrams (DNS, NTP, etc.) |
| Application | DHCPv4 | Required | Auto-configure IP/gateway/DNS from QEMU |
| Application | DNS | High | Name resolution |
| Network | IPv6 | Deferred | smoltcp supports it when ready |

Crates

virtio-drivers 0.13 (already in workspace)

Provides device::net::VirtIONet with:

  • new(transport, buf_len) — initialize with PCI transport
  • mac_address() — read hardware MAC
  • receive() / send(tx_buf) — raw Ethernet frame I/O
  • ack_interrupt() / enable_interrupts() — IRQ support

The existing KernelHal and create_pci_transport() work unchanged. Add device constants for virtio-net PCI IDs (modern: 0x1041, legacy: 0x1000).

smoltcp 0.12

no_std TCP/IP stack. Works with alloc (ostoo already has a heap).

Suggested Cargo features:

smoltcp = { version = "0.12", default-features = false, features = [
    "alloc", "log", "medium-ethernet",
    "proto-ipv4", "proto-dhcpv4", "proto-dns",
    "socket-raw", "socket-udp", "socket-tcp",
    "socket-icmp", "socket-dhcpv4", "socket-dns",
] }

Provides:

  • Interface — central type that drives all protocol processing
  • phy::Device trait — integrate a NIC via RxToken/TxToken (zero-copy, token-based)
  • Socket types — raw, ICMP, TCP, UDP, DHCPv4 client, DNS resolver
  • ARP / neighbor cache — automatic with Ethernet medium

No additional crates needed. embassy-net wraps smoltcp but is tied to the Embassy async runtime — skip it.


Integration Points

NIC Driver (devices/src/virtio/net.rs)

Wrap VirtIONet<KernelHal, PciTransport, 64> in a struct implementing smoltcp’s phy::Device trait:

  • receive() → RxToken (read raw frame from virtqueue)
  • transmit() → TxToken (write raw frame to virtqueue)
  • capabilities() → MTU 1514, Medium::Ethernet, no checksum offload
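
The token-based shape of the trait can be mimicked without the crate itself. The mock below is a loopback device whose transmit/receive closures mirror TxToken::consume / RxToken::consume — note that the real smoltcp 0.12 phy::Device trait also threads lifetimes and an Instant timestamp through these methods, omitted here:

```rust
use std::collections::VecDeque;

/// Simplified stand-in for smoltcp's token model: a loopback "NIC"
/// whose transmit token pushes frames onto a queue and whose receive
/// token pops them. (Mock only — not the real trait signatures.)
pub struct LoopbackDevice {
    frames: VecDeque<Vec<u8>>,
}

impl LoopbackDevice {
    pub fn new() -> Self { LoopbackDevice { frames: VecDeque::new() } }

    /// transmit(): the caller fills a buffer of `len` bytes inside the
    /// closure, mirroring TxToken::consume.
    pub fn transmit<F: FnOnce(&mut [u8])>(&mut self, len: usize, fill: F) {
        let mut buf = vec![0u8; len];
        fill(&mut buf);
        self.frames.push_back(buf);
    }

    /// receive(): the caller reads the frame inside the closure,
    /// mirroring RxToken::consume; None when no frame is pending.
    pub fn receive<R, F: FnOnce(&[u8]) -> R>(&mut self, read: F) -> Option<R> {
        self.frames.pop_front().map(|frame| read(&frame))
    }
}
```

The closure-based design is what makes smoltcp zero-copy: the stack parses and builds packets directly in the driver's buffers instead of copying frames across an API boundary.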

Polling

smoltcp requires periodic Interface::poll() calls. Two options:

  1. Timer-driven — poll every 10ms from the scheduler tick (simple, higher latency).
  2. IRQ-driven — virtio-net interrupt triggers poll (responsive, more complex).

Start with timer-driven. Migrate to IRQ-driven once the basics work.

Blocking Bridge

osl::blocking::blocking() already converts async → sync for VFS. Same pattern for socket operations: spawn an async task that polls smoltcp, block the calling thread until data arrives or the operation completes.

Socket File Descriptors

Create a SocketHandle struct implementing the FileHandle trait. Store in the process fd_table like pipes and files:

  • read() on a TCP socket → recv from smoltcp TCP socket
  • write() on a TCP socket → send to smoltcp TCP socket
  • close() → release smoltcp socket handle

UDP sockets need sendto/recvfrom for the address parameter.

DHCP at Boot

After virtio-net init, create a DHCPv4 socket, poll until configured, then set the interface IP/gateway/DNS. QEMU user-mode networking provides DHCP at 10.0.2.2 with default subnet 10.0.2.0/24.


Syscalls

New Linux-compatible syscall numbers to add in osl/src/syscalls/mod.rs:

| Nr | Name | Purpose |
|---|---|---|
| 41 | socket | Create AF_INET SOCK_STREAM/SOCK_DGRAM |
| 42 | connect | TCP connect to remote |
| 43 | accept | Accept incoming TCP connection |
| 44 | sendto | Send datagram with destination address |
| 45 | recvfrom | Receive datagram with source address |
| 46 | sendmsg | Scatter/gather send (needed by musl) |
| 47 | recvmsg | Scatter/gather receive (needed by musl) |
| 49 | bind | Bind to local address/port |
| 50 | listen | Mark socket as listening |
| 51 | getsockname | Get local address of socket |
| 54 | setsockopt | Set socket options (SO_REUSEADDR, etc.) |
| 55 | getsockopt | Get socket options |

Stubs returning ENOSYS for unsupported options are acceptable initially.


QEMU Configuration

Add to scripts/run.sh:

-device virtio-net-pci,netdev=net0 \
-netdev user,id=net0

QEMU user-mode networking (SLIRP) provides:

  • NAT — guest can reach the internet and the host
  • Built-in DHCP server at 10.0.2.2
  • Built-in DNS forwarder at 10.0.2.3
  • No host-side configuration needed

For host → guest connections, add port forwards: -netdev user,id=net0,hostfwd=tcp::8080-:80


What This Enables

With TCP/UDP + DNS + DHCP, userspace programs compiled against musl can use the standard POSIX socket API. This opens the door to:

  • Ping (ICMP echo)
  • Simple network tools (netcat-like, wget-like)
  • HTTP client/server
  • Eventually SSH (requires more crypto infrastructure)

Code Quality Audit

A review of code smells, magic numbers, duplicated code, and missing abstractions across the codebase. Companion to unsafe-audit.md which covers unsafe specifically.

Date: 2026-03-19


1. Magic Numbers

1.1 Syscall numbers — osl/src/syscalls/mod.rs ✅ DONE

Fixed in 95da4c0. Named constants in osl/src/syscall_nr.rs; dispatch match now uses SYS_READ, SYS_WRITE, etc.

1.2 MSR addresses ✅ DONE

Fixed in 95da4c0. Named constants in libkernel/src/msr.rs (IA32_FS_BASE, IA32_EFER, etc.); all 12+ inline uses replaced.

1.3 Page size ✅ DONE

Fixed in 95da4c0. PAGE_SIZE and PAGE_MASK in libkernel/src/consts.rs; all 20+ inline uses replaced.

1.4 Stack sizes ✅ DONE

Fixed in 95da4c0. KERNEL_STACK_SIZE in libkernel/src/consts.rs; all 4 locations updated.

1.5 I/O port addresses — interrupts.rs

  • 0x21, 0xA1 — PIC data ports (lines 111-112)
  • 0x43, 0x40 — PIT command/channel0 ports (lines 218-220)
  • 0x34 — PIT mode command byte (line 220)
  • 11932 — PIT reload for 100 Hz (line 216)

Fix: Named constants:

const PIC_MASTER_DATA: u16 = 0x21;
const PIC_SLAVE_DATA: u16 = 0xA1;
const PIT_COMMAND: u16 = 0x43;
const PIT_CHANNEL0: u16 = 0x40;
const PIT_MODE_RATE_GEN: u8 = 0x34;
const PIT_100HZ_RELOAD: u16 = 11932;

1.6 stat struct layout ✅ DONE (partial)

Fixed in 95da4c0. STAT_SIZE and S_IFCHR are now named constants in sys_fstat. 0o666 (permission mode) remains inline — acceptable as a well-known octal literal.

1.7 VirtIO vendor/device IDs — kernel/src/main.rs

0x1AF4, 0x1042, 0x1001 used inline in the virtio-blk PCI scan. Now also 0x1049, 0x1009 for virtio-9p.

Fix:

const VIRTIO_VENDOR_ID: u16 = 0x1AF4;
const VIRTIO_BLK_MODERN_DEVICE_ID: u16 = 0x1042;
const VIRTIO_BLK_LEGACY_DEVICE_ID: u16 = 0x1001;
const VIRTIO_9P_MODERN_DEVICE_ID: u16 = 0x1049;
const VIRTIO_9P_LEGACY_DEVICE_ID: u16 = 0x1009;

1.8 Other notable magic numbers

| Location | Value | Suggested name |
|---|---|---|
| scheduler.rs:138 | 0x202 | RFLAGS_IF_RESERVED |
| scheduler.rs:283,337 | 0x1F80 | MXCSR_DEFAULT |
| vga_buffer/mod.rs:85,306 | 0x20..=0x7e | PRINTABLE_ASCII |
| vga_buffer/mod.rs:308 | 0xfe | NONPRINTABLE_PLACEHOLDER |
| memory/mod.rs:333,335 | 0x1FF | PAGE_TABLE_INDEX_MASK |
| syscalls/io.rs | 16 | IOVEC_SIZE |
| syscalls/fs.rs | 4096 | MAX_PATH_LEN |
| gdt.rs:33 | 4096 * 5 | DOUBLE_FAULT_STACK_SIZE |

2. Duplicated Code

2.1 FD table retrieval ✅ DONE

Fixed in 95da4c0. get_fd_handle() helper (now in osl/src/fd_helpers.rs) eliminates 4 identical fd-lookup blocks.

2.2 Page alloc + zero + map loop ✅ DONE

Fixed in 95da4c0. MemoryServices::alloc_and_map_user_pages() in libkernel/src/memory/mod.rs replaces the alloc+zero+map loops in sys_brk and sys_mmap. (The spawn.rs loop is slightly different — it writes ELF segment data — so it was not collapsed.)

2.3 Page clearing ✅ DONE

Fixed in 95da4c0. clear_page() in libkernel/src/consts.rs replaces 6 inline write_bytes calls. (Some calls in spawn.rs that write non-zero data were not replaced.)

2.4 PageTableFlags construction ✅ DONE

Fixed in 95da4c0. USER_DATA_FLAGS constant in osl/src/syscalls/mem.rs replaces 3 identical flag expressions.

2.5 Path normalization — duplicated between crates ✅ DONE

Fixed in libkernel/src/path.rs. normalize() and resolve() are now shared; kernel/src/shell.rs and osl/src/syscalls/ both delegate to libkernel::path.

2.6 History entry restoration — keyboard_actor.rs ✅ DONE

Fixed alongside item 8. LineState::restore_from_history(&mut self, idx) eliminates the duplicated buffer-copy logic.

2.7 read_user_string → path error wrapping — 2 copies

let path = match read_user_string(path_ptr, 4096) {
    Some(p) => p,
    None => return -errno::EFAULT,
};

Fix: fn get_user_path(ptr: u64) -> Result<String, i64>
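
The wrapper is a one-liner over read_user_string; sketched here with a stand-in read_user_string (the real one copies a NUL-terminated string out of user memory):

```rust
const EFAULT: i64 = 14;
const MAX_PATH_LEN: usize = 4096;

/// Stand-in for the real read_user_string, which reads a NUL-terminated
/// string from the user address space. Here it just validates a pointer.
fn read_user_string(ptr: u64, _max: usize) -> Option<String> {
    if ptr == 0 { None } else { Some(format!("/path/{ptr}")) }
}

/// The proposed helper: collapses the repeated match at each call site
/// into Result plumbing.
fn get_user_path(ptr: u64) -> Result<String, i64> {
    read_user_string(ptr, MAX_PATH_LEN).ok_or(-EFAULT)
}
```

Inside a Result-returning inner handler, call sites shrink to `let path = get_user_path(path_ptr)?;`.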


3. Missing Abstractions / Interface Opportunities

3.1 ProcessManager struct

libkernel/src/process.rs has free functions find_zombie_child, has_children, mark_zombie, reap that all operate on the global PROCESS_TABLE. These should be methods on a ProcessManager type that encapsulates the table.

pub struct ProcessManager {
    table: Mutex<BTreeMap<ProcessId, Process>>,
}

impl ProcessManager {
    pub fn find_zombie_child(&self, parent: ProcessId, target: i64) -> Option<(ProcessId, i32)>;
    pub fn mark_zombie(&self, pid: ProcessId, code: i32);
    pub fn reap(&self, pid: ProcessId);
    pub fn has_children(&self, pid: ProcessId) -> bool;
}

3.2 FileHandle trait is monolithic

Every FileHandle implementor must provide read, write, close, kind, and getdents64, even when nonsensical (e.g. DirHandle::write returns Err).

Options:

  • Split into Readable, Writable, Directory traits
  • Or add default impls returning appropriate errors so implementors only override what they support
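
The default-impl option could look like this: operations default to an error, so a directory handle overrides only what it supports (error type and method signatures simplified from the real trait):

```rust
/// Simplified error type standing in for the kernel's errno values.
#[derive(Debug, PartialEq)]
pub enum FsError {
    NotSupported,
}

/// FileHandle with default impls: implementors override only the
/// operations that make sense for them.
pub trait FileHandle {
    fn kind(&self) -> &'static str;

    fn read(&mut self, _buf: &mut [u8]) -> Result<usize, FsError> {
        Err(FsError::NotSupported)
    }
    fn write(&mut self, _buf: &[u8]) -> Result<usize, FsError> {
        Err(FsError::NotSupported)
    }
    fn close(&mut self) {} // default no-op
}

/// A directory handle now provides only `kind`; write falls back to
/// the default error instead of a hand-written Err in every impl.
pub struct DirHandle;

impl FileHandle for DirHandle {
    fn kind(&self) -> &'static str { "dir" }
}
```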

3.3 MemoryServices is a god object

~500 lines mixing physical allocation, MMIO mapping, user page tables, address translation, and statistics.

Fix: Split into focused sub-types:

  • PhysicalMemoryManager — frame allocation, phys-to-virt translation
  • MmioMapper — MMIO region registration and caching
  • UserPageTableManager — create/map/switch user address spaces

3.4 SyscallContext struct

Syscall handlers pass (rdi, rsi, rdx, r10, r8, r9) as 6 separate u64 parameters. A context struct would be clearer:

pub struct SyscallContext {
    pub arg0: u64,
    pub arg1: u64,
    pub arg2: u64,
    pub arg3: u64,
    pub arg4: u64,
    pub arg5: u64,
}

This would also be the natural home for the fd-table helper method.

3.5 ConsoleInput encapsulation

libkernel/src/console.rs has CONSOLE_INPUT: Mutex<ConsoleInner> plus FOREGROUND_PID: AtomicU64 as separate globals. These form a single logical unit that should be one type:

pub struct ConsoleInput {
    inner: Mutex<ConsoleInner>,
    foreground_pid: AtomicU64,
}

3.6 Scattered global atomics

These related atomics are standalone statics when they could be encapsulated in manager types:

| Static | File | Could belong to |
|---|---|---|
| NEXT_PID, CURRENT_PID | process.rs | ProcessManager |
| NEXT_THREAD_ID, CURRENT_THREAD_IDX_ATOMIC | scheduler.rs | Scheduler |
| FOREGROUND_PID | console.rs | ConsoleInput |
| LAPIC_EOI_ADDR | interrupts.rs | Interrupt manager |
| CONTEXT_SWITCHES | scheduler.rs | Scheduler |

3.7 User vs kernel address types

The type system uses u64 for both user and kernel virtual addresses. Newtype wrappers would prevent accidental misuse:

pub struct UserVirtAddr(u64);
pub struct KernelVirtAddr(u64);

4. Long Functions / Deep Nesting

4.1 keyboard_actor.rs:on_key — 238 lines ✅ DONE

Fixed alongside item 8. Key-handling logic moved to LineState methods (submit, backspace, delete_forward, move_left/right, history_up/down, etc.). on_key is now a thin one-liner-per-key dispatch table.

4.2 scheduler.rs:preempt_tick — 102 lines ✅ DONE

Fixed alongside item 9. Decomposed into save_current_context(), restore_thread_state() (via SwitchTarget struct), and debug_check_initial_alignment(). preempt_tick itself has zero direct unsafe blocks.

4.3 syscalls/mem.rs:sys_mmap — 68 lines

Validation, allocation, and mapping all in one function.

Fix: Break into validate_mmap_request() and the shared alloc_and_map_user_pages() from section 2.2.

4.4 syscalls/mem.rs:sys_brk — 60 lines

Same issue as sys_mmap — does too many things.

4.5 Deep nesting in keyboard_actor.rs:159-331 ✅ DONE

Fixed alongside item 8. Each match arm is now a one-liner calling a LineState method; the actual logic lives in those methods at a single nesting level.


5. Other Code Smells

5.1 Repeated runnable-state check ✅ DONE

Fixed alongside item 9. ThreadState::is_runnable() method replaces the two identical != Dead && != Blocked checks.

5.2 VFS blocking wrappers

osl/src/syscalls/fs.rs has vfs_read_file and vfs_list_dir with identical structure (allocate String, call blocking() with async VFS call).

Fix: Macro or generic wrapper to eliminate the boilerplate.
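
One possible generic wrapper: a single function owning the shared "clone the path, run the async op, block" shape. The blocking() below is a busy-polling stand-in for osl::blocking::blocking(), which actually parks the calling thread:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

fn raw_waker() -> RawWaker {
    RawWaker::new(std::ptr::null(), &VTABLE)
}
static VTABLE: RawWakerVTable =
    RawWakerVTable::new(|_| raw_waker(), |_| {}, |_| {}, |_| {});

/// Stand-in for osl::blocking::blocking(): busy-polls the future to
/// completion with a no-op waker (the real one blocks the thread
/// until the async VFS task finishes).
fn blocking<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

/// The proposed generic wrapper: one function replaces the duplicated
/// vfs_read_file / vfs_list_dir boilerplate; each caller supplies only
/// the async VFS operation itself.
fn vfs_blocking<T, Fut, F>(path: &str, op: F) -> T
where
    Fut: Future<Output = T>,
    F: FnOnce(String) -> Fut,
{
    blocking(op(path.to_string()))
}
```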

5.3 Process exit + parent wake pattern

sys_exit (osl/src/syscalls/process.rs) does get-parent → mark_zombie → unblock-parent as separate steps. This should be a single ProcessManager::exit_and_notify(pid, code) method.


Tier 1: Easy wins with high readability payoff — ✅ ALL DONE

All Tier 1 items were completed in 95da4c0:

  1. Named constants for syscall numbers, MSRs, page sizes
  2. Extract get_fd_handle() helper (eliminates 4 copies)
  3. Extract alloc_and_map_user_pages() (eliminates 3 copies)
  4. const USER_DATA_FLAGS for page table flags
  5. clear_page() utility (eliminates 8 copies)

Tier 2: Structural improvements

  1. Share path normalization between kernel shell and osl
  2. ProcessManager struct to encapsulate process table
  3. Decompose on_key into a LineEditor state machine
  4. Decompose preempt_tick into smaller functions
  5. Break sys_brk / sys_mmap into validation + mapping

Tier 3: Architectural refinements

  1. Split MemoryServices into focused sub-managers
  2. SyscallContext struct for cleaner parameter passing
  3. ConsoleInput encapsulation
  4. UserVirtAddr / KernelVirtAddr newtypes
  5. FileHandle trait restructuring (split or default impls)

Unsafe Code Audit & Refactoring Opportunities

An audit of unsafe usage across the codebase, prioritised by density and refactoring payoff.


1. libkernel/src/vga_buffer/mod.rs — Raw pointer to MMIO buffer ✅ DONE

Writer stores a raw *mut Buffer pointer and dereferences it with unsafe { &mut *self.buffer } in 7 separate places. There is also a manual unsafe impl Send to paper over the raw pointer.

Completed (commit 75de8c4):

  • Introduced a VgaBuffer safe wrapper that encapsulates the raw pointer with unsafe confined to construction only. Safe read_cell / write_cell / set_hw_cursor methods replaced all interior unsafe blocks in Writer methods and free functions.
  • unsafe impl Send moved from Writer to VgaBuffer with documented invariant.
  • set_hw_cursor is now a safe method on VgaBuffer (was a standalone unsafe fn).
  • core::mem::transmute in tests replaced with a new Color::from_u8() constructor.
  • timeline_append refactored: ISR now pushes to a lock-free ArrayQueue instead of writing directly to VGA RAM with raw pointers. A new TimelineActor (stream-driven, using #[on_stream]) drains the queue and writes to VGA row 1 through the safe WRITER / VgaBuffer interface. Eliminates the last unsafe block and removes the VGA_BASE atomic.

2. libkernel/src/task/scheduler.rs — Raw stack frame construction & inline asm ✅ DONE

spawn_thread and spawn_user_thread both manually write 20 u64 values to raw stack pointers to construct fake iretq frames. preempt_tick reads raw pointers at computed offsets for sanity checks. process_trampoline contains a large unsafe asm block.

Completed (commit ac60740):

  • Introduced #[repr(C)] SwitchFrame with named fields matching the lapic_timer_stub push/pop order. Constructors new_kernel() and new_user_trampoline() replace magic-number frame.add(N).write(...) in both spawn_thread and spawn_user_thread.
  • preempt_tick sanity check reads frame.rip / frame.rsp through the typed struct instead of raw pointer arithmetic.
  • Extracted drop_to_ring3() unsafe helper from process_trampoline: GS MSR writes + CR3 switch + iretq in one well-documented unsafe fn, making the safety boundary explicit.

3. libkernel/src/syscall.rs — static mut per-CPU data ✅ DONE

PER_CPU and SYSCALL_STACK are static mut, accessed with bare unsafe throughout. sys_write creates a slice from a raw user-space pointer without any validation.

Completed (commit 1c28010):

  • Replaced static mut PER_CPU with an UnsafeCell wrapper (PerCpuCell) with documented safety invariant (single CPU, interrupts disabled).
  • Replaced static mut SYSCALL_STACK with a safe #[repr(align(16))] static. kernel_stack_top() is now fully safe.
  • sys_write now validates that the user buffer falls entirely within user address space (< 0x0000_8000_0000_0000), returning EFAULT for invalid pointers.
  • init(), set_kernel_rsp(), per_cpu_addr() updated to use new accessors — no more &raw const on static mut.

4. apic/src/local_apic/mapped.rs — Every method is unsafe ✅ DONE

MappedLocalApic has 15 public unsafe methods. The unsafety stems from MMIO access via raw pointers in read_reg_32 / write_reg_32, but the actual invariant is in construction (providing a valid base address), not in each register read/write.

Completed (commit 24a421d):

  • MappedLocalApic::new() is now the sole unsafe boundary with documented safety invariants.
  • All 15 public methods are now safe; read_reg_32 / write_reg_32 trait impl uses core::ptr::read_volatile / write_volatile.
  • Callers in apic/src/lib.rs and devices/src/vfs/proc_vfs/ updated — dozens of unsafe blocks removed.

5. apic/src/io_apic/mapped.rs — Same pattern as local APIC ✅ DONE

Same issue — every public method is unsafe, and register access helpers use raw pointer dereferences without read_volatile / write_volatile.

Completed (commit 24a421d):

  • MappedIoApic::new() is now the sole unsafe boundary with documented safety invariants. base_addr field made private with base_addr() getter.
  • All public methods (mask_all, mask_entry, set_irq, max_redirect_entries, read_version_raw, read_redirect_entry) are now safe. Internal calls to the IoApic trait methods remain unsafe blocks.
  • IoApic trait impl (read_reg_32 / write_reg_32 / read_reg_64 / write_reg_64) now uses core::ptr::read_volatile / write_volatile instead of raw dereferences — correct for MMIO.
  • Callers in apic/src/lib.rs and devices/src/vfs/proc_vfs/ updated.

6. kernel/src/kernel_acpi.rs — Repetitive raw pointer reads/writes

The acpi::Handler impl has 8 nearly identical read_uN / write_uN methods, each doing unsafe { *(addr as *const T) }. No volatile access, no alignment checks.

Recommendations:

  • Create a generic fn mmio_read<T>(addr: usize) -> T / fn mmio_write<T>(addr: usize, val: T) helper using read_volatile / write_volatile, then call it from each trait method. Reduces 16 lines of unsafe to 2.
  • Same for the IO port methods — a single port_read::<T>(port) / port_write::<T>(port, val) generic would collapse 6 methods.
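
A sketch of the generic helpers (hypothetical names). They add the volatile access and a debug-time alignment check that the current per-type methods lack. Demonstrated here on an ordinary variable; in kernel_acpi.rs the address would point at a physical-memory-mapped ACPI register.

```rust
/// Generic volatile MMIO read.
unsafe fn mmio_read<T>(addr: usize) -> T {
    debug_assert_eq!(addr % core::mem::align_of::<T>(), 0, "unaligned MMIO read");
    // Volatile: the compiler may not cache, merge, or elide the access.
    unsafe { core::ptr::read_volatile(addr as *const T) }
}

/// Generic volatile MMIO write.
unsafe fn mmio_write<T>(addr: usize, val: T) {
    debug_assert_eq!(addr % core::mem::align_of::<T>(), 0, "unaligned MMIO write");
    unsafe { core::ptr::write_volatile(addr as *mut T, val) }
}
```

Each acpi::Handler trait method then collapses to a one-liner along the lines of `unsafe { mmio_read::<u32>(addr) }`, and the same shape works for a `port_read::<T>` / `port_write::<T>` pair.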

7. kernel/src/ring3.rs — Scattered raw pointer copies

spawn_blob and spawn_process manually call core::ptr::write_bytes and core::ptr::copy_nonoverlapping on physical-memory-mapped addresses. The pattern phys_off + phys_addr → as_mut_ptr → write_bytes repeats multiple times.

Recommendations:

  • Add zero_frame(phys: PhysAddr) and copy_to_frame(phys: PhysAddr, data: &[u8]) utilities on MemoryServices that encapsulate the offset arithmetic and unsafe ptr operations. This would also clean up similar patterns in libkernel/src/memory/mod.rs.
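
A hypothetical sketch of the two utilities. phys_off is the kernel's physical-memory offset mapping; in this host-side demo a heap buffer plays the role of physical memory so the pointer arithmetic can actually run.

```rust
const FRAME_SIZE: usize = 4096;

/// Zero one 4 KiB physical frame through the phys-offset mapping.
unsafe fn zero_frame(phys_off: usize, phys: usize) {
    unsafe { core::ptr::write_bytes((phys_off + phys) as *mut u8, 0, FRAME_SIZE) }
}

/// Copy `data` into the start of a physical frame (must fit in one frame).
unsafe fn copy_to_frame(phys_off: usize, phys: usize, data: &[u8]) {
    assert!(data.len() <= FRAME_SIZE);
    unsafe {
        core::ptr::copy_nonoverlapping(
            data.as_ptr(),
            (phys_off + phys) as *mut u8,
            data.len(),
        )
    }
}
```

On MemoryServices these would take a PhysAddr and fetch phys_off internally, so spawn_blob / spawn_process never repeat the offset arithmetic.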

8. libkernel/src/gdt.rs — Mutable cast of static TSS

set_kernel_stack casts &*TSS through *const → *mut to write rsp0. This is technically UB (mutating through a shared reference to a lazy_static).

Recommendations:

  • Store the TSS in an UnsafeCell or Mutex so the mutation is sound. Since it is single-CPU and only called with interrupts off, an UnsafeCell wrapper with a documented invariant is sufficient.
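
A sketch of a sound wrapper, with a stub standing in for the x86_64 crate's TaskStateSegment. The unsafety moves from an ad-hoc `*const → *mut` cast to one documented method with an explicit invariant.

```rust
use core::cell::UnsafeCell;

// Stub for x86_64::structures::tss::TaskStateSegment.
struct TaskStateSegment { rsp0: u64 }

struct TssCell(UnsafeCell<TaskStateSegment>);

// SAFETY: sound only under the kernel's documented invariant — single CPU,
// and set_rsp0 is called exclusively with interrupts disabled.
unsafe impl Sync for TssCell {}

impl TssCell {
    const fn new(tss: TaskStateSegment) -> Self {
        Self(UnsafeCell::new(tss))
    }

    /// SAFETY: caller must uphold the single-CPU / interrupts-off invariant.
    unsafe fn set_rsp0(&self, stack_top: u64) {
        unsafe { (*self.0.get()).rsp0 = stack_top }
    }

    fn rsp0(&self) -> u64 {
        // Race-free under the same invariant.
        unsafe { (*self.0.get()).rsp0 }
    }
}
```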

9. libkernel/src/interrupts.rs — Crash-dump raw pointer reads

double_fault_handler and invalid_opcode_handler use core::ptr::read_volatile on raw addresses for crash diagnostics, and the inline-asm MSR reads are duplicated across fault handlers.

Recommendations:

  • Extract a fn dump_cpu_state(frame: &InterruptStackFrame) -> CpuState helper that reads CR2/CR3/CR4/GS MSRs once and returns a struct, eliminating duplicated inline asm across fault handlers.
  • A fn dump_bytes_at(addr: u64, len: usize) -> [u8; 16] helper would replace the raw pointer reads in both handlers.

10. devices/src/vfs/proc_vfs/ — Manual page-table walking

gen_pmap() manually walks PML4 / PDPT / PD / PT levels using raw pointer casts like unsafe { &*((phys_off + addr) as *const PageTable) }.

Recommendations:

  • Add a walk_page_tables iterator or visitor on MemoryServices that safely provides (virt_range, phys_base, flags) entries. Replaces 50+ lines of raw pointer walks.

Summary table

| Priority | File | Unsafe count | Refactor |
|----------|------|--------------|----------|
| High | scheduler.rs | 12 | ✅ Done — SwitchFrame struct, drop_to_ring3 |
| High | syscall.rs | 8 | ✅ Done — UnsafeCell, user pointer validation |
| High | local_apic/mapped.rs | 18 | ✅ Done — safe methods, unsafe-only construction |
| High | io_apic/mapped.rs | 12 | ✅ Done — same + read_volatile / write_volatile |
| Medium | vga_buffer/mod.rs | 14 | ✅ Done — VgaBuffer wrapper |
| Medium | kernel_acpi.rs | ~16 | Generic volatile MMIO helpers |
| Medium | ring3.rs | ~8 | zero_frame / copy_to_frame on MemoryServices |
| Medium | gdt.rs | 2 | UnsafeCell for TSS mutation |
| Low | interrupts.rs | ~10 | dump_cpu_state + dump_bytes_at helpers |
| Low | proc_vfs/ | ~5 | Page-table walk iterator |

SMP Safety Audit

An audit of concurrency issues that would arise when running on multiple CPUs. The kernel currently runs single-core only; this document catalogues what must change before bringing up Application Processors.

Issues are grouped by severity: Critical = data corruption / crash on SMP, High = deadlock or lost wakeup, Medium = ordering bugs or contention, Low = design limitation / hardening.


Critical

1. PerCpuData is a single static

libkernel/src/syscall.rs:44-65

PerCpuData (kernel_rsp, user_rsp, user_rip, user_rflags, user_r9, saved_frame_ptr) lives at a single address. Every CPU’s GS base points there. A SYSCALL on CPU 1 overwrites CPU 0’s saved registers mid-flight.

Impact: Stack corruption, wrong return to userspace, privilege escalation.

Fix: Allocate a distinct PerCpuData page per CPU and set IA32_GS_BASE / IA32_KERNEL_GS_BASE independently during AP bringup.


2. GDT / TSS / IST stacks are shared

libkernel/src/gdt.rs:29-54

A single TSS (with a single double-fault IST stack) and a single GDT are used by all CPUs. set_kernel_stack() (:77-84) unsafely mutates the shared TSS’s rsp0 field.

Impact: Two CPUs taking a ring-3 → ring-0 transition simultaneously use the same kernel stack. Two simultaneous double faults corrupt each other’s IST stack.

Fix: Per-CPU GDT, per-CPU TSS, per-CPU IST stacks.


3. Scheduler has a single current_idx

libkernel/src/task/scheduler.rs:130, 809, 836

SCHEDULER is a single SpinMutex<Scheduler> with one current_idx field that records which thread is currently executing. On SMP each CPU runs a different thread, but current_idx can only represent one.

Impact: Every use of sched.current_idx — preempt_tick, block, save context — operates on whichever CPU wrote it last, not the local CPU’s thread.

Fix: Per-CPU current_idx (or per-CPU scheduler instances).


4. block_current_thread() uses stale current_idx

libkernel/src/task/scheduler.rs:640-661

pub fn block_current_thread() {
    without_interrupts(|| {
        let mut sched = SCHEDULER.lock();
        let idx = sched.current_idx;        // ← global, not per-CPU
        sched.threads[idx].state = Blocked;
    });
    loop {
        enable_and_hlt();
        let state = without_interrupts(|| {
            let sched = SCHEDULER.lock();
            let idx = sched.current_idx;    // ← may now be another CPU's thread
            sched.threads[idx].state
        });
        if state != Blocked { break; }
    }
}

If CPU 0 blocks thread A and CPU 1 runs thread B, the re-check reads current_idx (now B) and tests the wrong thread. Thread A either never wakes or wakes with B’s state.

Impact: Hung threads, wrong-thread wakeup.

Fix: Save the thread’s own index before blocking; check that saved index in the loop, not current_idx.
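
Sketched with std types standing in for the kernel's SpinMutex, without_interrupts, and enable_and_hlt(), the fix looks like this — the thread's own index is captured once, under the lock, and the wake-up poll uses that saved index, never the (possibly foreign) current_idx:

```rust
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq)]
enum ThreadState { Runnable, Blocked }

struct Scheduler { current_idx: usize, threads: Vec<ThreadState> }

static SCHEDULER: Mutex<Scheduler> =
    Mutex::new(Scheduler { current_idx: 0, threads: Vec::new() });

fn block_current_thread() {
    // Capture our own index while we are still the thread current_idx names.
    let my_idx = {
        let mut sched = SCHEDULER.lock().unwrap();
        let idx = sched.current_idx;
        sched.threads[idx] = ThreadState::Blocked;
        idx
    };
    loop {
        std::thread::yield_now(); // stand-in for enable_and_hlt()
        // Re-check the SAVED index; current_idx may now be another CPU's thread.
        if SCHEDULER.lock().unwrap().threads[my_idx] != ThreadState::Blocked {
            break;
        }
    }
}
```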


5. CURRENT_THREAD_IDX_ATOMIC is a single global

libkernel/src/task/scheduler.rs:16-24

static CURRENT_THREAD_IDX_ATOMIC: AtomicUsize = AtomicUsize::new(0);

pub fn current_thread_idx() -> usize {
    CURRENT_THREAD_IDX_ATOMIC.load(Ordering::Relaxed)
}

Called from ISR context (interrupt handlers), syscall context (console, pipes, channels), and the scheduler itself. On SMP the value reflects whichever CPU wrote it last, not the caller’s CPU.

Impact: Signal delivery, pipe wakeup, IPC blocking — all index the wrong thread when the reading CPU differs from the writing CPU.

Fix: Per-CPU current-thread-index (read from a per-CPU variable or from a CPU-local register like GS).


6. IO APIC register select/window interleaving

libkernel/src/apic/io_apic/mapped.rs:120-142

64-bit redirection entries are read/written as two 32-bit MMIO accesses through a shared IOREGSEL / IOWIN register pair. Although callers hold the IO_APICS SpinMutex, the lock does not disable interrupts. If a timer ISR fires between the two halves of a 64-bit access on the same CPU, and the ISR path touches IO APIC registers, the IOREGSEL is clobbered.

Currently no ISR path touches the IO APIC, so this is latent. On SMP with multiple IO APICs, per-APIC locking would be needed.

Impact: Corrupted redirection entry → interrupt routed to wrong vector or silently masked.

Fix: Use IrqMutex (or at minimum without_interrupts) around all IO APIC register-pair accesses. Consider per-APIC locks for scalability.


High

7. SCHEDULER lock is a SpinMutex — ISR can spin on it

libkernel/src/task/scheduler.rs:130

SCHEDULER uses SpinMutex (interrupts stay enabled). Syscall-context callers (block_current_thread, unblock, spawn_thread) wrap acquisitions in without_interrupts, but the lock itself does not enforce this. If a code path acquires the lock without disabling interrupts and the timer ISR fires on the same CPU, preempt_tick (:803) will spin forever waiting for the syscall to release the lock — which it never can, because it’s preempted.

Impact: Deadlock (single-CPU or SMP).

Fix: Change SCHEDULER to IrqMutex, or ensure every acquisition site uses without_interrupts. All current sites do, but the type does not enforce it — a future caller could forget.


8. MEMORY lock is not ISR-safe

libkernel/src/memory/mod.rs:559

MEMORY uses SpinMutex. The comment warns “must not be called from interrupt context”, but this is not enforced by the type. Any future ISR path that triggers frame allocation or page-table manipulation will deadlock on single-CPU if a syscall holds the lock.

Impact: Deadlock.

Fix: Change to IrqMutex, or add a compile-time / runtime ISR guard.


9. Global heap allocator is not ISR-safe

libkernel/src/lib.rs:20

#[global_allocator]
static ALLOCATOR: LockedHeap = LockedHeap::empty();

LockedHeap uses spin::Mutex internally — no interrupt disabling. Any heap allocation from ISR context while a syscall holds the heap lock will deadlock.

The scheduler’s push_back on the ready queue can trigger a reallocation of the backing VecDeque if the queue grows. Currently the scheduler lock is acquired with interrupts disabled, so the heap allocation happens with IF=0 — safe on single-CPU. On SMP, CPU 1’s ISR could try to allocate while CPU 0 holds the heap lock.

Impact: Deadlock (ISR + heap contention).

Fix: Use an ISR-safe allocator wrapper, or guarantee no heap allocation from ISR context.


10. Console ISR → scheduler lock ordering

libkernel/src/console.rs:35-47

push_input() is called from the keyboard ISR. It acquires CONSOLE_INPUT (SpinMutex), then calls scheduler::unblock() which acquires SCHEDULER (SpinMutex, inside without_interrupts).

On SMP:

  • CPU 0: syscall holds SCHEDULER lock (IF disabled), tries to read console → acquires CONSOLE_INPUT.
  • CPU 1: keyboard ISR holds CONSOLE_INPUT, calls unblock() → spins on SCHEDULER.
  • CPU 0: still holds SCHEDULER, spins on CONSOLE_INPUT held by CPU 1.

Impact: Deadlock (lock-order inversion: SCHEDULER → CONSOLE_INPUT vs. CONSOLE_INPUT → SCHEDULER).

Fix: Don’t call unblock() while holding CONSOLE_INPUT. Buffer the thread index and call unblock() after dropping the console lock.
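
A sketch of the re-ordered wakeup, with std::sync::Mutex standing in for the kernel's SpinMutex. push_input takes the waiter out of ConsoleInput while holding only the console lock, drops that lock, and only then touches the scheduler — so the SCHEDULER → CONSOLE_INPUT order can never be inverted:

```rust
use std::sync::Mutex;

struct ConsoleInput { buf: Vec<u8>, waiter: Option<usize> }

static CONSOLE_INPUT: Mutex<ConsoleInput> =
    Mutex::new(ConsoleInput { buf: Vec::new(), waiter: None });
// Stand-in scheduler: one runnable flag per thread index.
static SCHEDULER: Mutex<Vec<bool>> = Mutex::new(Vec::new());

fn unblock(idx: usize) {
    SCHEDULER.lock().unwrap()[idx] = true;
}

fn push_input(byte: u8) {
    // Phase 1: console lock only — buffer the byte, take (don't wake) the waiter.
    let waiter = {
        let mut console = CONSOLE_INPUT.lock().unwrap();
        console.buf.push(byte);
        console.waiter.take()
    }; // console lock dropped here
    // Phase 2: scheduler lock only — no nesting, no inversion.
    if let Some(idx) = waiter {
        unblock(idx);
    }
}
```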


11. DONATE_TARGET is a single global

libkernel/src/task/scheduler.rs:682-700

DONATE_TARGET: AtomicUsize stores one target thread index, consumed by the next yield_tick. On SMP, CPU 0 stores a donate target, but CPU 1’s yield_tick consumes it first.

Impact: Scheduler donate delivers the wrong thread to the wrong CPU; intended recipient never gets donated to.

Fix: Per-CPU donate target, or pass the target through a different mechanism (e.g. IPI + per-CPU mailbox).


12. Lost wakeup in sys_wait4

osl/src/syscalls/process.rs:26-64

1. find_zombie_child(parent) → None
2. ← child exits on CPU 1, calls unblock(parent_wait_thread)
      but wait_thread is not yet set → no-op
3. set wait_thread = current_thread
4. block_current_thread()        → sleeps forever

The zombie check and the wait_thread registration are not atomic.

Impact: Parent process hangs forever waiting for an already-exited child.

Fix: Hold the process table lock across the zombie check and the wait_thread write, so that terminate_process() on another CPU sees the wait_thread before posting the zombie.


Medium

13. Relaxed ordering on cross-CPU atomics

Several atomics use Ordering::Relaxed where Acquire/Release would be more appropriate for cross-CPU visibility:

| Atomic | File | Line | Used by |
|--------|------|------|---------|
| CURRENT_THREAD_IDX_ATOMIC | scheduler.rs | 16 | ISR + syscall |
| current_pid | process.rs | 488 | syscall context |
| FOREGROUND_PID | console.rs | 31 | keyboard ISR |
| LAPIC_EOI_ADDR | interrupts.rs | 12 | ISR |

On x86-64 all loads/stores are implicitly acquire/release for aligned naturally-sized values, so this is a correctness concern primarily on weakly-ordered architectures or under compiler reordering. Using explicit Acquire/Release is still best practice for documentation and portability.

Impact: Stale reads possible under compiler reordering; wrong-process signal delivery, wrong-process console input routing.

Fix: Release on writes, Acquire on reads.
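
A minimal illustration of the recommended pairing: the writer publishes with Release, the reader observes with Acquire, so everything written before the store is visible after the load — on any architecture, not just x86-64.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn acquire_release_demo() -> usize {
    let payload = Arc::new(AtomicUsize::new(0));
    let ready = Arc::new(AtomicBool::new(false));
    let (p, r) = (Arc::clone(&payload), Arc::clone(&ready));
    let writer = thread::spawn(move || {
        p.store(42, Ordering::Relaxed);   // the data being published
        r.store(true, Ordering::Release); // release: orders the store above before the flag
    });
    while !ready.load(Ordering::Acquire) { // acquire: pairs with the Release store
        std::hint::spin_loop();
    }
    let value = payload.load(Ordering::Relaxed); // guaranteed to observe 42
    writer.join().unwrap();
    value
}
```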


14. Stack arena contention

libkernel/src/stack_arena.rs:16

A single SpinMutex<ArenaInner> protects a 32-bit free bitmap for all thread stack allocations / deallocations. On SMP with frequent thread creation, this becomes a serialisation bottleneck.

Impact: Performance (lock contention), not correctness.

Fix: Per-CPU arenas, or lock-free bitmap (atomic CAS on u32).
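
A hypothetical lock-free version of the arena: a single AtomicU32 where bit i set means slot i is in use, with compare_exchange_weak retries replacing the SpinMutex entirely.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct StackArena {
    bitmap: AtomicU32, // bit set ⇒ slot allocated
}

impl StackArena {
    const fn new() -> Self {
        Self { bitmap: AtomicU32::new(0) }
    }

    /// Claim the lowest free slot, or None if all 32 are taken.
    fn alloc(&self) -> Option<u32> {
        let mut cur = self.bitmap.load(Ordering::Relaxed);
        loop {
            if cur == u32::MAX {
                return None;
            }
            let slot = cur.trailing_ones(); // index of the lowest clear bit
            match self.bitmap.compare_exchange_weak(
                cur,
                cur | (1 << slot),
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(slot),
                Err(actual) => cur = actual, // raced with another CPU; retry
            }
        }
    }

    fn free(&self, slot: u32) {
        self.bitmap.fetch_and(!(1 << slot), Ordering::Release);
    }
}
```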


15. Lock ordering not documented or enforced

Multiple subsystems acquire locks in ad-hoc order. Observed nesting:

  • CONSOLE_INPUT → SCHEDULER (push_input → unblock)
  • IrqInner → CompletionPort (irq_fd_dispatch → post)
  • NotifyInner → CompletionPort (signal_notify → post)
  • PROCESS_TABLE → SCHEDULER (with_process → spawn_user_thread)

No static or runtime enforcement exists. Adding a second CPU increases the risk of discovering new inversion paths.

Impact: Latent deadlocks as code evolves.

Fix: Document a global lock ordering. Consider runtime lock-order checking in debug builds (e.g. per-CPU lock-stack tracking).


16. User memory TOCTOU with CLONE_VM

osl/src/user_mem.rs:27-45

user_slice() validates then returns a 'static slice. With CLONE_VM (vfork), the parent and child share an address space. If the child calls mmap / munmap while the parent is mid-syscall with a validated slice, the pages backing the slice may be unmapped.

Currently mitigated because CLONE_VM blocks the parent (vfork semantics), so only the child runs. If shared-address-space threading is added, this becomes exploitable.

Impact: Latent use-after-free in shared address spaces.

Fix: Pin pages for the duration of the syscall, or copy user data into a kernel buffer before releasing the process lock.


17. VMA / page-table flag divergence in mprotect

osl/src/syscalls/mem.rs (sys_mprotect)

The process lock is released between mprotect_vmas() (updates VMA metadata) and update_user_page_flags() (updates hardware page tables). A concurrent mmap or munmap on the same address range could see inconsistent state.

Currently safe because only one thread per process runs at a time (no kernel threading within a process).

Impact: Latent protection-flag inconsistency if intra-process parallelism is added.

Fix: Hold the process lock (or a per-address-space lock) across both the VMA update and the page-table update.


Low

18. LAPIC timer calibration is BSP-only

libkernel/src/apic/mod.rs:205-254

Calibration uses a global PIT busy-wait and assumes a single LAPIC. Each AP would need its own calibration pass (LAPIC frequencies can differ, especially under virtualisation).


19. Dynamic vector allocation uses without_interrupts

libkernel/src/interrupts.rs:22-75

register_handler() disables local interrupts and acquires DYNAMIC_HANDLERS (SpinMutex). On SMP, without_interrupts only affects the local CPU. Two CPUs calling register_handler() concurrently will correctly serialise via the SpinMutex — no bug, but the without_interrupts wrapper is unnecessary and misleading.


20. Single ready queue scalability

libkernel/src/task/scheduler.rs:103

The single VecDeque ready queue serialises all scheduling decisions behind one lock. This is the standard starting point but will need per-CPU run queues and work-stealing for acceptable SMP throughput.


Summary

| # | Severity | Component | One-line summary |
|---|----------|-----------|------------------|
| 1 | Critical | syscall.rs | PerCpuData is a single static shared by all CPUs |
| 2 | Critical | gdt.rs | GDT / TSS / IST stacks shared across CPUs |
| 3 | Critical | scheduler.rs | Single current_idx — meaningless on SMP |
| 4 | Critical | scheduler.rs | block_current_thread reads stale current_idx |
| 5 | Critical | scheduler.rs | CURRENT_THREAD_IDX_ATOMIC is one global |
| 6 | Critical | io_apic | Register select/window not ISR-safe |
| 7 | High | scheduler.rs | SCHEDULER SpinMutex not ISR-enforced |
| 8 | High | memory/mod.rs | MEMORY SpinMutex not ISR-safe |
| 9 | High | lib.rs | Global heap allocator not ISR-safe |
| 10 | High | console.rs | ISR lock-order inversion (CONSOLE → SCHEDULER) |
| 11 | High | scheduler.rs | DONATE_TARGET is a single global |
| 12 | High | process.rs | Lost wakeup in sys_wait4 |
| 13 | Medium | various | Relaxed ordering on cross-CPU atomics |
| 14 | Medium | stack_arena.rs | Single-lock bitmap contention |
| 15 | Medium | various | Lock ordering not documented |
| 16 | Medium | user_mem.rs | TOCTOU with CLONE_VM (latent) |
| 17 | Medium | mem.rs | VMA / page-table flag divergence (latent) |
| 18 | Low | apic/mod.rs | LAPIC calibration BSP-only |
| 19 | Low | interrupts.rs | Misleading without_interrupts wrapper |
| 20 | Low | scheduler.rs | Single ready queue scalability |

Suggested order of work for AP bring-up:
  1. Per-CPU infrastructure: PerCpuData, GDT, TSS, IST stacks, LAPIC init.
  2. Per-CPU scheduler state: current_idx, ready queue, donate target.
  3. Fix block_current_thread to use saved thread index.
  4. Promote SCHEDULER and MEMORY to IrqMutex (or add IF-disable wrappers).
  5. Fix lock-ordering inversions (console, notify, channel → scheduler).
  6. Fix sys_wait4 lost-wakeup race.
  7. Per-CPU LAPIC calibration.
  8. Document and enforce global lock ordering.

Testing

Overview

The kernel uses Rust’s custom_test_frameworks feature since the standard test harness requires std. Tests run inside QEMU in headless mode, communicate results over the serial port, and signal pass/fail via an ISA debug-exit device.

# Kernel crate tests (kernel unit tests + integration tests)
cargo test --manifest-path kernel/Cargo.toml

# libkernel tests (allocator, VGA, path, timer, interrupts, etc.)
cargo test --manifest-path libkernel/Cargo.toml

Or via the Makefile:

make test

Note: make test currently runs only the kernel crate tests. libkernel has its own test binary (with heap initialization) that must be run separately.


How It Works

Custom test runner

libkernel/src/lib.rs defines the framework:

#![feature(custom_test_frameworks)]
#![test_runner(crate::test_runner)]
#![reexport_test_harness_main = "test_main"]

test_runner iterates over every #[test_case] function, runs it, and writes results to serial. On completion it writes to I/O port 0xf4 to exit QEMU:

| Exit code | Meaning |
|-----------|---------|
| 0x10 | All tests passed |
| 0x11 | A test panicked |

QEMU configuration

kernel/Cargo.toml configures bootimage to launch QEMU with:

test-args = [
    "-device", "isa-debug-exit,iobase=0xf4,iosize=0x04",
    "-serial", "stdio",
    "-display", "none"
]
test-success-exit-code = 33
test-timeout = 30
  • isa-debug-exit — writing value v to port 0xf4 terminates QEMU with host exit code (v << 1) | 1, which is why test-success-exit-code is 33 ((0x10 << 1) | 1)
  • serial stdio — test output appears on the host terminal
  • display none — headless, no VGA window
  • timeout 30s — kills stuck tests

Serial output

Tests print to COM1 (0x3F8) via serial_print! / serial_println! from libkernel/src/serial.rs. Each test prints its name and [ok] on success; the panic handler prints [failed] and the error before exiting.


Test Types

Unit tests (#[test_case])

Standard tests that run inside the kernel. When built with cargo test, the kernel initialises GDT, IDT, heap, and memory, then calls test_main() which invokes test_runner with all collected test cases.

Integration tests (kernel/tests/)

Each file in kernel/tests/ compiles as a separate kernel binary with its own entry point. bootimage boots each one independently in QEMU.

Two integration tests use harness = false because they need custom control flow (e.g. verifying that a panic or exception fires correctly):

[[test]]
name = "should_panic"
harness = false

[[test]]
name = "stack_overflow"
harness = false

Test Inventory

Unit tests (37 tests)

All unit tests live in libkernel and run via cargo test --manifest-path libkernel/Cargo.toml.

| File | Tests | What they cover |
|------|-------|-----------------|
| libkernel/src/path.rs | 13 | normalize and resolve: dots, dotdot, root, relative, absolute |
| libkernel/src/vga_buffer/mod.rs | 8 | println output, color encoding, FixedBuf formatting |
| libkernel/src/md5.rs | 7 | MD5 hash (RFC 1321 test vectors) |
| libkernel/src/allocator/mod.rs | 3 | align_up correctness and boundary conditions |
| libkernel/src/memory/vmem_allocator.rs | 3 | Virtual memory allocator state and page tracking |
| libkernel/src/task/timer.rs | 2 | Delay struct millisecond/second calculations |
| libkernel/src/interrupts.rs | 1 | Breakpoint exception (int3) handling |

Integration tests (4 binaries)

| File | Harness | What it tests |
|------|---------|---------------|
| basic_boot.rs | standard | Kernel boots and VGA println works |
| heap_allocation.rs | standard | Box, Vec, and repeated allocation patterns |
| should_panic.rs | custom | Panic handler fires and exits correctly |
| stack_overflow.rs | custom | Double-fault handler catches stack overflow via IST |

Execution Flow

cargo test
  |
  bootimage compiles test binary (kernel + bootloader)
  |
  QEMU boots with isa-debug-exit, serial stdio, no display
  |
  Kernel init: GDT, IDT, heap, memory
  |
  test_main() calls test_runner(&[...])
  |
  For each #[test_case]:
      run test
      serial_println!("test_name... [ok]")
  |
  exit_qemu(Success)  -->  write 0x10 to port 0xf4
  |
  bootimage reads exit code, reports result

For harness = false tests, the binary manages its own flow and exit.


Key Files

| File | Role |
|------|------|
| libkernel/src/lib.rs | test_runner, test_panic_handler, QemuExitCode |
| libkernel/src/serial.rs | COM1 serial output (serial_print!, serial_println!) |
| kernel/Cargo.toml | bootimage test-args, exit codes, timeout |
| .cargo/config.toml | bootimage runner, build target |
| kernel/tests/ | Integration test binaries |

Adding a New Test

Unit test

Add #[test_case] to a function in any file that has #[cfg(test)] access to the test framework (libkernel modules or kernel crate modules):

#[test_case]
fn test_something() {
    serial_print!("test_something... ");
    assert_eq!(1 + 1, 2);
    serial_println!("[ok]");
}

Integration test

Create kernel/tests/my_test.rs with its own entry point:

#![no_std]
#![no_main]
#![feature(custom_test_frameworks)]
#![test_runner(libkernel::test_runner)]
#![reexport_test_harness_main = "test_main"]

use bootloader::{entry_point, BootInfo};
use core::panic::PanicInfo;

entry_point!(main);

fn main(boot_info: &'static BootInfo) -> ! {
    libkernel::init();
    // ... any additional setup ...
    test_main();
    libkernel::hlt_loop();
}

#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    libkernel::test_panic_handler(info)
}

#[test_case]
fn my_test() {
    // ...
}

For tests that need custom panic/exception handling, add harness = false to kernel/Cargo.toml and manage the entry point and exit manually.