Plan: User Space and Process Isolation
Context
The kernel currently runs everything — drivers, shell, filesystem — in a single ring-0 address space as async Rust tasks. This document outlines the path from that baseline to a system where untrusted programs run in isolated ring-3 processes with their own virtual address spaces, communicating with the kernel through system calls, and eventually linked against a ported musl libc.
Progress Summary
Phases 0–6 are complete. The kernel runs a musl-linked C shell
(`user/shell.c`) as its primary user interface. The shell auto-launches on
boot, supports line editing, built-in commands (`echo`, `pwd`, `cd`, `ls`,
`cat`, `exit`, `help`), and spawning external programs. Process creation uses
the standard Linux `clone(CLONE_VM|CLONE_VFORK)` + `execve` path, enabling
unpatched musl `posix_spawn` and Rust `std::process::Command`.
35+ syscalls are implemented, including `pipe2`, `dup2`, `fcntl`, `getpid`,
`getrandom`, `clone`/`execve`, and custom completion-port / IPC syscalls.
| Phase | Status | Milestone |
|---|---|---|
| 0 — Toolchain | Done | Hand-crafted assembly blobs and static ELF binaries load and run |
| 1 — Ring-3 + SYSCALL | Done | GDT has ring-3 segments; SYSCALL/SYSRET works; sys_write, sys_exit, sys_arch_prctl implemented |
| 2 — Per-process page tables | Done | create_user_page_table, map_user_page, CR3 switching on context switch; ring-3 page faults kill the process |
| 3 — Process abstraction | Done | Process struct, process table, ELF loader, exec shell command, zombie reaping |
| 4 — System call layer | Done | 14 syscalls implemented; initial stack with auxv; brk/mmap for heap; writev for musl printf |
| 5 — Cross-compiler + musl | Done | Docker-based musl cross-compiler (scripts/user-build.sh); static musl binaries run on ostoo |
| 6 — Spawn / wait / user shell | Done | clone(CLONE_VM|CLONE_VFORK) + execve for process creation; wait4; pipe2, dup2, fcntl, getpid, getrandom; userspace C shell with line editing, auto-launched on boot |
| 7 — Signals | Not started | Requires signal frame push/pop, rt_sigaction, rt_sigreturn |
What works today
- Userspace shell (`user/shell.c`): musl-linked C shell compiled via the
  Docker cross-compiler, deployed to the disk image at `/shell`. Auto-launched
  from `kernel/src/main.rs` on boot; falls back to the kernel shell if not
  found.
- Line editing in the shell: read char-by-char, echo, backspace, Ctrl+C
  (cancel line), Ctrl+D (exit on empty line).
- Built-in commands: `echo`, `pwd`, `cd`, `ls`, `cat`, `exit`, `help`.
- External programs: `posix_spawn(path)` + `waitpid` from the shell.
- Raw keypress delivery to userspace via `libkernel/src/console.rs`:
  foreground PID routing, blocking `read(0)`, keyboard ISR wakeup.
- Per-process FD table (fds 0–2 = `ConsoleHandle`); `FileHandle` trait with
  `ConsoleHandle`, `VfsHandle`, and `DirHandle` implementations.
- 35+ syscalls implemented (see `docs/syscalls/` for per-syscall docs):
  `read`, `write`, `open`, `close`, `fstat`, `lseek`, `mmap`, `mprotect`,
  `munmap`, `brk`, `ioctl`, `writev`, `exit`/`exit_group`, `wait4`, `getcwd`,
  `chdir`, `arch_prctl`, `futex`, `getdents64`, `set_tid_address`,
  `set_robust_list`, `clone`, `execve`, `pipe2`, `dup2`, `fcntl`, `getpid`,
  `getrandom`, `kill`, `rt_sigaction`, `rt_sigprocmask`, `rt_sigreturn`,
  `sigaltstack`, `madvise`, `sched_getaffinity`, `clock_gettime`, plus custom
  syscalls for completion ports (501–503), IRQ (504), and IPC channels
  (505–507).
- `open` resolves paths relative to the process CWD; supports both files
  (`VfsHandle`) and directories (`DirHandle` with `O_DIRECTORY`).
  `getdents64` returns `linux_dirent64` structs from `DirHandle`.
- `clone(CLONE_VM|CLONE_VFORK)` creates a child sharing the parent's address
  space; `execve` replaces it with a new ELF binary. `wait4` blocks the
  parent until the child exits/zombies.
- `writev` (used by musl's `printf`) writes scatter/gather buffers to VGA.
  `brk` grows the process heap by allocating and mapping zero-filled pages.
  `mmap` supports anonymous `MAP_PRIVATE` allocations via a bump-down
  allocator starting at `0x4000_0000_0000`.
- `Process` tracks `brk_base`/`brk_current` (computed from ELF segment
  extents), `mmap_next`/`mmap_regions`, `fd_table`, `cwd`, `parent_pid`,
  `wait_thread`.
- ELF parser extracts `phdr_vaddr`, `phnum`, and `phentsize` for the
  auxiliary vector (musl reads `AT_PHDR`/`AT_PHNUM`/`AT_PHENT` during
  startup). `spawn_process_full` (in `osl/src/spawn.rs`) builds the initial
  stack with `argc`, `argv` strings, `envp` (NULL), and the auxiliary vector.
- Async-to-sync bridge (`osl/src/blocking.rs`): spawns async VFS operations
  as kernel tasks, blocks the user thread, unblocks on completion.
- Unhandled syscalls log a warning with the syscall number and first 3 args,
  then return `-ENOSYS`.
- Ring-3 page faults, GPFs, and invalid opcodes log the fault, mark the
  process zombie, wake the parent's wait thread, restore kernel GS polarity,
  and kill the thread — no kernel panic.
- `test isolation` verifies that two independently-created PML4s have
  genuinely independent user-space mappings at the same virtual address.
- System info commands (cpuinfo, meminfo, memmap, pmap, threads, tasks, idt,
  pci, lapic, ioapic, drivers, uptime) are exposed as `/proc` virtual files
  accessible via `cat /proc/<file>`.
Key implementation files
| File | Role |
|---|---|
| `libkernel/src/gdt.rs` | GDT with kernel + user code/data segments, TSS, `set_kernel_stack` for rsp0 |
| `libkernel/src/syscall.rs` | SYSCALL MSR init, assembly entry stub, per-CPU data |
| `libkernel/src/file.rs` | `FileHandle` trait, `FileError` enum, `ConsoleHandle` |
| `libkernel/src/console.rs` | Console input buffer, foreground PID routing, blocking read |
| `libkernel/src/process.rs` | `Process` struct (fd_table, cwd, brk/mmap, parent/wait), `ProcessManager`, zombie lifecycle |
| `libkernel/src/elf.rs` | ELF64 parser (static ET_EXEC, x86-64) with phdr metadata for auxv |
| `libkernel/src/memory/mod.rs` | `create_user_page_table`, `map_user_page`, `switch_address_space` |
| `libkernel/src/task/scheduler.rs` | `spawn_user_thread`, `process_trampoline`, CR3 switching in `preempt_tick`, block/unblock |
| `libkernel/src/interrupts.rs` | Ring-3-aware page fault, GPF, and invalid opcode handlers |
| `osl/src/syscalls/` | `syscall_dispatch` + syscall implementations (io.rs, fs.rs, mem.rs, process.rs, misc.rs) |
| `osl/src/errno.rs` | Linux errno constants, `file_errno()` / `vfs_errno()` converters |
| `osl/src/file.rs` | `VfsHandle`, `DirHandle` (VFS-backed file handles) |
| `osl/src/blocking.rs` | Async-to-sync bridge for VFS calls |
| `osl/src/spawn.rs` | `spawn_process_full` (ELF spawning with argv and parent PID) |
| `kernel/src/ring3.rs` | Legacy `spawn_process` wrapper, `spawn_blob` (raw code), test helpers |
| `kernel/src/keyboard_actor.rs` | Foreground routing: raw bytes to console or kernel line editor |
| `kernel/src/main.rs` | Auto-launch `/shell` on boot |
| `devices/src/vfs/proc_vfs/mod.rs` | `ProcVfs` with 12+ virtual files (generator submodules) |
| `user/shell.c` | Userspace shell (musl, static) |
| `docs/syscalls/*.md` | Per-syscall documentation |
Virtual Address Space Layout
The kernel’s heap, APIC, and MMIO window live in the high canonical half
(≥ 0xFFFF_8000_0000_0000), so the entire lower canonical half is available
for user process address spaces. The kernel/user boundary is enforced at the
PML4 level: entries 0–255 (lower half) are user-private; entries 256–510
(high half) are kernel-shared; entry 511 is the per-PML4 recursive
self-mapping.
```
0x0000_0000_0000_0000 ← canonical zero (null pointer trap page, unmapped)
0x0000_0000_0040_0000 ← ELF load address (4 MiB, standard x86-64)
                      ↓ text, data, BSS
                      ↓ brk heap (grows up from page-aligned end of highest PT_LOAD segment)
...
0x0000_4000_0000_0000 ← mmap region (bump-down allocator, grows downward)
...
0x0000_7FFF_F000_0000 ← ELF user stack base (8 pages = 32 KiB)
0x0000_7FFF_F000_8000 ← ELF user stack top (RSP starts here minus auxv layout)
0x0000_7FFF_FFFF_FFFF ← top of lower canonical half (entire range = user)
                      (non-canonical gap)
0xFFFF_8000_0000_0000 ← kernel heap (HEAP_START, 512 KiB)
0xFFFF_8001_0000_0000 ← Local APIC MMIO (APIC_BASE)
0xFFFF_8001_0001_0000 ← IO APIC(s)
0xFFFF_8002_0000_0000 ← MMIO window (MMIO_VIRT_BASE, 512 GiB)
phys_mem_offset       ← bootloader physical memory identity map (high half)
0xFFFF_FF80_0000_0000 ← recursive PT window (PML4[511])
0xFFFF_FFFF_FFFF_F000 ← PML4 self-mapping
```
Kernel entries (PML4 indices 256–510) are copied into every process page table
without USER_ACCESSIBLE; they are invisible to ring-3 code.
Phase 0 — Toolchain and Build Infrastructure ✅ COMPLETE
Goal: produce user-space ELF binaries that the kernel can load, without needing musl yet.
0a. Custom linker script
Write user/link.ld:
```ld
ENTRY(_start)
SECTIONS {
    . = 0x400000;
    .text   : { *(.text*) }
    .rodata : { *(.rodata*) }
    .data   : { *(.data*) }
    .bss    : { *(.bss*) *(COMMON) }
}
```
0b. Rust no_std user target
Add a custom target JSON x86_64-ostoo-user.json with:
"os": "none","env": "","vendor": "unknown""pre-link-args": pass the linker script"panic-strategy": "abort"(no unwinding in user space initially)"disable-redzone": true(same requirement as kernel)
A minimal user/ crate can implement _start in assembly, call a main, then
invoke the exit syscall.
0c. Assembly user programs
Before the ELF loader exists, a hand-crafted binary blob (or raw ELF built from a few lines of NASM) is enough to verify the ring-3 transition and basic syscalls work.
Phase 1 — Ring-3 GDT Segments and SYSCALL Infrastructure ✅ COMPLETE
Goal: the kernel can jump to ring 3 and come back via SYSCALL/SYSRET. No process isolation yet — user code runs in the kernel’s own address space.
What was implemented:
- GDT extended with kernel data, user data, and user code segments in the
  order required by `IA32_STAR` (`libkernel/src/gdt.rs`). `TSS.rsp0` is
  updated via `set_kernel_stack()` on every context switch to a user process.
- SYSCALL MSRs (`STAR`, `LSTAR`, `FMASK`, `EFER.SCE`) configured in
  `libkernel/src/syscall.rs::init()`.
- Assembly entry stub with `swapgs`, per-CPU kernel/user RSP swap, and SysV64
  argument shuffle before calling `syscall_dispatch`.
- Three syscalls: `write` (fd 1/2 to VGA), `exit`/`exit_group` (mark zombie +
  kill thread), `arch_prctl(ARCH_SET_FS)` (write the `IA32_FS_BASE` MSR).
- Ring-3 test (`test ring3`): drops to user mode, writes "Hello from ring 3!"
  via syscall, exits cleanly.
1a. GDT additions (libkernel/src/gdt.rs)
Add four new descriptors in the order required by IA32_STAR:
```
Index  Selector  Descriptor
0      0x00      Null
1      0x08      Kernel code (ring 0, already exists)
2      0x10      Kernel data (ring 0)  ← new; SYSCALL loads SS from STAR[47:32]+8
3      0x18      User data (ring 3)    ← new; SYSRET loads SS from STAR[63:48]+8
4      0x20      User code (ring 3)    ← new; SYSRET loads CS from STAR[63:48]+16
5      0x28+     TSS (2 slots for the 16-byte system descriptor)
```
IA32_STAR layout: bits 47:32 = kernel CS (SYSCALL sets CS from it and SS from
it + 8); bits 63:48 = user CS − 16 (SYSRET sets CS from it + 16 and SS from
it + 8, which is why user data must sit immediately below user code).
Update the Selectors struct and init() in gdt.rs.
1b. TSS kernel-stack field
When the CPU delivers a ring-3 interrupt it loads RSP from TSS.rsp0. This
must point to the current process’s kernel stack top. For now a single global
TSS is fine; when processes exist, rsp0 is updated on every context switch.
1c. SYSCALL MSR setup (libkernel/src/interrupts.rs or new libkernel/src/syscall.rs)
```rust
pub fn init_syscall() {
    // IA32_STAR: kernel CS at bits 47:32, user CS-16 at bits 63:48
    let star: u64 = ((KERNEL_CS as u64) << 32) | ((USER_CS as u64 - 16) << 48);
    unsafe { Msr::new(0xC000_0081).write(star); } // STAR
    // IA32_LSTAR: entry point for 64-bit SYSCALL
    unsafe { Msr::new(0xC000_0082).write(syscall_entry as u64); }
    // IA32_FMASK: clear IF (bit 9) and DF (bit 10) on SYSCALL, keep other flags
    unsafe { Msr::new(0xC000_0084).write(0x0000_0600); } // IF | DF
    // Enable the SCE bit in EFER
    let efer = unsafe { Msr::new(0xC000_0080).read() };
    unsafe { Msr::new(0xC000_0080).write(efer | 1); }
}
```
1d. Assembly syscall entry stub
libkernel/src/syscall_entry.asm (or global_asm! in syscall.rs):
```asm
syscall_entry:
    swapgs                      ; switch to kernel GS (store user GS)
    mov [gs:USER_RSP], rsp      ; save user RSP into the per-CPU area
    mov rsp, [gs:KERN_RSP]      ; load kernel RSP
    push rcx                    ; user RIP (SYSCALL saved it here)
    push r11                    ; user RFLAGS
    ; save the user registers the dispatcher may clobber
    push rdi
    push rsi
    push rdx
    push r10
    push r8
    push r9
    ; shuffle syscall args (rax=nr, rdi/rsi/rdx/r10/r8/r9) into the SysV64
    ; positions of syscall_dispatch(nr, a1..a6); work high-to-low so no
    ; register is overwritten before it is read
    push r9                     ; a6 -> 7th SysV argument (stack)
    mov r9, r8                  ; a5
    mov r8, r10                 ; a4
    mov rcx, rdx                ; a3 (rcx is free: user RIP already saved)
    mov rdx, rsi                ; a2
    mov rsi, rdi                ; a1
    mov rdi, rax                ; syscall number
    call syscall_dispatch       ; -> rax = return value
    add rsp, 8                  ; drop the stacked 7th argument
    pop r9
    pop r8
    pop r10
    pop rdx
    pop rsi
    pop rdi
    ; rax keeps the return value
    pop r11                     ; restore user RFLAGS
    pop rcx                     ; restore user RIP
    mov rsp, [gs:USER_RSP]      ; restore user RSP
    swapgs
    sysretq
```
swapgs requires a per-CPU data block holding the kernel stack pointer.
Implement as a small struct at a known virtual address (or via GS_BASE MSR).
1e. Minimal syscall dispatch table
Start with just three numbers (matching Linux x86-64 for musl compatibility):
| Number | Name | Action |
|---|---|---|
| 0 | read | stub → return −ENOSYS |
| 1 | write | write to VGA console if fd==1/2 |
| 60 | exit | terminate current process |
1f. First ring-3 test
Write a tiny inline assembly test in kernel/src/main.rs that:
- Pushes a fake user-mode iret frame (SS, RSP, RFLAGS with IF set, ring-3 CS, RIP).
- `iretq` into ring 3.
- User code executes `syscall` with `rax=1` (write), prints one character.
- Kernel writes it to VGA and returns to ring 3.
- User code executes `syscall` with `rax=60` (exit).
This verifies the GDT, SYSCALL, and basic ABI without an ELF loader or address space isolation.
Phase 2 — Per-Process Page Tables and Address Space Isolation ✅ COMPLETE
Goal: each process has its own PML4; kernel mappings are shared; user mappings are private.
What was implemented:
- `MemoryServices::create_user_page_table()` allocates a fresh PML4, copies
  kernel entries (indices 256–510) without `USER_ACCESSIBLE`, and sets the
  recursive self-mapping at index 511.
- `MemoryServices::map_user_page()` maps individual 4 KiB pages in a
  non-active page table given its PML4 physical address.
- `unsafe switch_address_space(pml4_phys)` writes CR3.
- Page fault handler (`libkernel/src/interrupts.rs`) checks
  `stack_frame.code_segment.rpl()` — ring-3 faults mark the process zombie
  (exit code -11 / SIGSEGV), restore kernel GS via `swapgs`, and call
  `kill_current_thread()`. Kernel faults still panic.
- `test isolation` shell command verifies that two PML4s map the same user
  virtual address to different physical frames.
- Scheduler `preempt_tick` saves/restores CR3 when switching between threads
  with different page tables.
2a. Page table creation (libkernel/src/memory/)
Add to MemoryServices:
```rust
/// Allocate a fresh PML4, copy all kernel PML4 entries (indices where
/// virtual_address >= KERNEL_SPLIT) into it, and return the physical
/// address of the new PML4 frame.
pub fn create_user_page_table(&mut self) -> PhysAddr;

/// Map a single 4 KiB page in a specific (possibly non-active) page table.
pub fn map_user_page(
    &mut self,
    pml4_phys: PhysAddr,
    virt: VirtAddr,
    phys: PhysAddr,
    flags: PageTableFlags, // USER_ACCESSIBLE | PRESENT | WRITABLE | NO_EXECUTE as needed
) -> Result<(), MapToError<Size4KiB>>;

/// Switch the active address space. Must be called with interrupts disabled.
pub unsafe fn switch_address_space(&self, pml4_phys: PhysAddr);
```
2b. Kernel/user PML4 split
The layout gives a clean hardware-level split:
- PML4 indices 0–255 (lower canonical half, `0x0000_*`) — user-private. Left
  empty at process creation; populated by the ELF loader and `mmap`.
- PML4 indices 256–510 (high canonical half, `0xFFFF_8000_*` through
  `0xFFFF_FF7F_*`) — kernel-shared. Copied from the kernel PML4 at process
  creation; marked present but never `USER_ACCESSIBLE`.
- PML4 index 511 — the recursive self-mapping. Each process PML4 must have
  its own entry here pointing to its own physical PML4 frame (not the
  kernel's). `create_user_page_table` must set this explicitly.
2c. Page fault handler upgrade
Replace the panic in page_fault_handler with:
```rust
extern "x86-interrupt" fn page_fault_handler(frame: InterruptStackFrame, ec: PageFaultErrorCode) {
    let faulting_addr = Cr2::read();
    if frame.code_segment.rpl() == PrivilegeLevel::Ring3 {
        // Fault in user space — kill the process (deliver SIGSEGV later).
        kill_current_process(Signal::Segv);
        schedule_next(); // does not return to the faulting instruction
    } else {
        panic!("kernel page fault at {:?}\n{:#?}\n{:?}", faulting_addr, frame, ec);
    }
}
```
This is the minimum needed to prevent a kernel panic when user code accesses invalid memory; proper CoW / demand paging comes later.
2d. Address space switch on context switch
The scheduler’s preempt_tick function currently saves/restores only kernel
RSP. Extend it to also write CR3 when switching between processes with
different page tables.
Phase 3 — Process Abstraction ✅ COMPLETE
Goal: Process struct, a process table, and a working exec.
What was implemented:
- `Process` struct (`libkernel/src/process.rs`) with PID, state
  (Running/Zombie), PML4 physical address, heap-allocated 64 KiB kernel
  stack, entry point, user stack top, thread index, and exit code.
- Global `PROCESS_TABLE: Mutex<BTreeMap<ProcessId, Process>>` and
  `CURRENT_PID: AtomicU64`. `insert()`, `current_pid()`, `set_current_pid()`,
  `with_process()`, `mark_zombie()`, `reap()`, `reap_zombies()`.
- Scheduler integration: `SchedulableKind::Kernel | UserProcess(ProcessId)`.
  `spawn_user_thread` creates a thread targeting `process_trampoline`, which
  sets up TSS.rsp0, the per-CPU kernel RSP, PID tracking, GS polarity, the
  CR3 switch, and then does `iretq` into ring-3 user code.
- `kill_current_thread()` marks the thread Dead and spins; timer preemption
  skips dead threads.
- ELF loader (`libkernel/src/elf.rs`): minimal parser for static `ET_EXEC`
  x86-64 binaries. Returns `ElfInfo { entry, segments, phdr_vaddr, phnum,
  phentsize }`.
- `kernel/src/ring3.rs::spawn_process(elf_data)` — parses the ELF, creates a
  user PML4, maps all PT_LOAD segments (with correct R/W/X flags) plus a
  user stack page, creates a Process, and spawns a user thread. Returns
  `Ok(ProcessId)`.
- Shell command `exec <path>` reads an ELF from the VFS and calls
  `spawn_process`. `spawn_blob(code)` helper for test commands: maps a raw
  code blob + stack, creates a Process, spawns a user thread.
- Zombie reaping: `reap_zombies()` is called at the start of `spawn_blob` and
  `spawn_process` to free the kernel stacks of fully-exited processes.
3a. Process struct (libkernel/src/process/mod.rs)
```rust
pub struct Process {
    pub pid: ProcessId,
    pub state: ProcessState,    // Running, Ready, Blocked, Zombie
    pub pml4_phys: PhysAddr,    // physical address of the PML4
    pub kernel_stack: Vec<u8>,  // 64 KiB kernel stack
    pub saved_rsp: u64,         // kernel RSP when not running
    pub user_rsp: u64,          // user RSP (restored on ring-3 return)
    pub files: FileTable,       // open file descriptors
    pub parent: Option<ProcessId>,
    pub exit_code: Option<i32>,
}
```
3b. Process table
```rust
lazy_static! {
    static ref PROCESSES: Mutex<BTreeMap<ProcessId, Process>> = ...;
}
```
CURRENT_PID: AtomicU32 — the PID running on each CPU (single-CPU for now).
3c. Scheduler integration
Replace the bare Thread list in scheduler.rs with process-aware scheduling:
- On `preempt_tick`: save the user context (if coming from ring 3), switch
  CR3, load the next process's user context and kernel RSP.
- `TSS.rsp0` updated to point to the new process's kernel stack top.
3d. ELF loader (libkernel/src/elf.rs)
```rust
pub fn load_elf(
    bytes: &[u8],
    process: &mut Process,
    mem: &mut MemoryServices,
) -> Result<VirtAddr, ElfError> // returns the entry point
```
Steps:
- Validate the ELF magic, `e_machine == EM_X86_64`, `e_type == ET_EXEC`
  (static) or `ET_DYN` (PIE).
- For each `PT_LOAD` segment: allocate physical frames, map at `p_vaddr` with
  `USER_ACCESSIBLE` and flags derived from `p_flags` (R/W/X).
- Copy `p_filesz` bytes from the ELF image; zero-fill to `p_memsz`.
- Allocate and map a user stack (8–16 pages) just below the stack top.
- Set up the initial stack frame: `argc=0`, `argv=NULL`, `envp=NULL`, `auxv`
  entries for `AT_ENTRY`, `AT_PHDR`, `AT_PAGESZ` (required by musl's
  `_start`).
- Return `e_entry`.
3e. sys_execve syscall
```rust
fn sys_execve(path: *const u8, argv: *const *const u8, envp: *const *const u8) -> ! {
    let bytes = vfs::read_file(path_str).expect("exec: read failed");
    let process = current_process_mut();
    process.reset_address_space(); // drop the old page table
    let entry = load_elf(&bytes, process, &mut memory());
    switch_to_user(entry, process.user_stack_top); // does not return
}
```
Phase 4 — System Call Layer ✅ COMPLETE
Goal: a syscall table wide enough to run a static musl binary that prints “Hello, world!” and exits.
What was implemented:
4a. ELF parser extensions (libkernel/src/elf.rs)
ElfInfo now includes phdr_vaddr, phnum, and phentsize. The parser
looks for a PT_PHDR program header (type 6) to get the phdr virtual address
directly; fallback computes it from the PT_LOAD segment containing e_phoff.
These values populate the auxiliary vector that musl reads during startup.
4b. Process memory tracking (libkernel/src/process.rs)
Process gained four new fields:
| Field | Type | Purpose |
|---|---|---|
| `brk_base` | `u64` | Page-aligned end of the highest PT_LOAD segment (immutable) |
| `brk_current` | `u64` | Current program break (starts == `brk_base`) |
| `mmap_next` | `u64` | Bump-down pointer for anonymous mmap (starts at `0x4000_0000_0000`) |
| `mmap_regions` | `Vec<(u64, u64)>` | Tracked (vaddr, len) pairs |
Process::new() now takes a brk_base parameter. spawn_process computes it
from max(seg.vaddr + seg.memsz) page-aligned up.
4c. Initial stack layout (kernel/src/ring3.rs)
ELF processes get an 8-page (32 KiB) contiguous stack at 0x7FFF_F000_0000,
allocated via alloc_dma_pages(8) so the auxv layout can be written through the
kernel’s phys_mem_offset window. build_initial_stack() writes:
```
[stack_top]
  16 bytes pseudo-random data (AT_RANDOM target)
  alignment padding (8 bytes)
  AT_NULL   (0, 0)
  AT_RANDOM (25, addr)
  AT_ENTRY  (9, entry_point)
  AT_PHNUM  (5, phnum)
  AT_PHENT  (4, phentsize)
  AT_PHDR   (3, phdr_vaddr)
  AT_PAGESZ (6, 4096)
  AT_UID    (11, 0)
  NULL      ← envp terminator
  NULL      ← argv terminator
  0         ← argc = 0
[RSP points here, 16-byte aligned]
```
4d. Syscall table (osl/src/syscalls/mod.rs)
All syscalls use Linux x86-64 numbers for musl compatibility. Unhandled
numbers log a warning and return -ENOSYS. Errno constants are defined
in osl/src/errno.rs; libkernel uses FileError for structured errors.
| Nr | Name | Implementation |
|---|---|---|
| 0 | read | Via fd_table → FileHandle::read; ConsoleHandle blocks on empty input |
| 1 | write | Via fd_table → FileHandle::write |
| 2 | open | VFS read_file or list_dir → VfsHandle/DirHandle; path resolution relative to CWD |
| 3 | close | Via fd_table |
| 5 | fstat | S_IFCHR for console fds |
| 8 | lseek | Returns -ESPIPE (not seekable) |
| 9 | mmap | Anonymous MAP_PRIVATE only; bump-down allocator |
| 10 | mprotect | Updates page table flags for VMA regions |
| 11 | munmap | Unmaps pages, frees frames, splits/removes VMAs |
| 12 | brk | Query or grow heap; allocates+maps zero-filled pages |
| 16 | ioctl | Returns -ENOTTY |
| 20 | writev | Via fd_table; scatter/gather write |
| 60 | exit | Mark zombie, wake parent wait_thread, kill thread |
| 61 | wait4 | Find zombie child, block if none, reap and return |
| 72 | futex | No-op stub (single-threaded, lock never contended) |
| 79 | getcwd | Copy process.cwd to user buffer |
| 80 | chdir | Validate path via VFS list_dir, update process.cwd |
| 158 | arch_prctl | ARCH_SET_FS writes IA32_FS_BASE MSR |
| 217 | getdents64 | Via DirHandle::getdents64 |
| 218 | set_tid_address | Returns current PID as TID |
| 231 | exit_group | Same as exit (single-threaded) |
| 273 | set_robust_list | No-op, returns 0 |
Lock ordering for brk and mmap: process table lock acquired/released to
read state, then memory lock for frame allocation and page mapping, then process
table lock re-acquired to write updates. This avoids nested lock deadlocks.
See docs/syscalls/ for detailed per-syscall documentation.
4e. What’s still missing (deferred to later phases)
- SMAP enforcement: user pointers in `writev`, `fstat`, `brk` are accessed
  without `stac`/`clac`.
- Page deallocation: ✅ Fixed — `munmap` frees frames and splits VMAs; `brk`
  shrink unmaps and frees pages; process exit cleans up the entire user
  address space.
- `mprotect`: ✅ Fixed — updates page table flags for the target VMA range.
- FS_BASE save/restore: ✅ Fixed — FS_BASE is saved/restored per-thread in
  `preempt_tick` via `save_current_context`/`restore_thread_state`.
Phase 5 — Cross-Compiler and musl Port ✅ COMPLETE
Goal: compile C programs that run as ostoo user processes.
What was implemented:
- Docker-based build environment (`scripts/user-build.sh`) using the
  `x86_64-linux-musl-cross` toolchain.
- `user/Makefile` compiles `*.c` files to static musl-linked ELF binaries.
- `user/shell.c` is the primary musl binary (see Phase 6).
- Binaries are deployed to the exFAT disk image or shared via virtio-9p.
5a. Toolchain strategy
The simplest path: use an existing x86_64-linux-musl sysroot unmodified,
because we implement Linux-compatible syscall numbers (Phase 4). musl does not
inspect the OS name at runtime — it just issues syscalls.
Option A (quickest): install x86_64-linux-musl-gcc from
musl.cc or via brew install x86_64-linux-musl-cross.
Compile with:
```shell
x86_64-linux-musl-gcc -static -o hello hello.c
```
The resulting fully-static ELF should work on ostoo with the Phase 4 syscalls.
Option B (custom triple): build musl from source with a custom --target
configured for ostoo. This is useful once ostoo diverges from Linux’s ABI
(e.g. custom syscall numbers or a different startup convention).
5b. musl build recipe (Option B outline)
```shell
# Prerequisites: a bare x86_64-elf-gcc cross-compiler (via crosstool-ng or
# manual binutils + gcc build targeting x86_64-unknown-elf).
git clone https://git.musl-libc.org/cgit/musl
cd musl
./configure \
    --target=x86_64 \
    --prefix=/opt/ostoo-sysroot \
    --syslibdir=/opt/ostoo-sysroot/lib \
    CROSS_COMPILE=x86_64-elf-
make -j$(nproc)
make install
```
Key musl files:
- `arch/x86_64/syscall_arch.h` — `__syscall0` … `__syscall6` use the
  `syscall` instruction; no changes needed if syscall numbers match Linux.
- `crt/x86_64/crt1.o` — `_start` sets up `argc`/`argv`/`envp` from the
  initial stack (ABI defined by the ELF auxiliary vector; match what the ELF
  loader sets up in Phase 3d).
- `src/env/__init_tls.c` — calls `arch_prctl(ARCH_SET_FS, ...)`; requires the
  `sys_arch_prctl` syscall (Phase 4b).
5c. Rust user programs
For Rust programs targeting ostoo, add a custom target
x86_64-ostoo-user.json (from Phase 0b) and a minimal ostoo-rt crate that:
- Provides `_start` (sets up a stack frame; calls `main`; calls `sys_exit`).
- Provides a `#[panic_handler]` that calls `sys_exit(1)`.
- Wraps the small syscall ABI.
Users can then write:
```rust
#![no_std]
#![no_main]

extern crate ostoo_rt;

#[no_mangle]
pub extern "C" fn main() {
    ostoo_rt::write(1, b"Hello from Rust!\n");
}
```
Phase 6 — Spawn, Wait, and a Minimal Shell ✅ COMPLETE
Goal: a user-mode shell that can launch and wait for child programs.
What was implemented:
Process creation uses the standard Linux clone(CLONE_VM|CLONE_VFORK) +
execve path. musl’s posix_spawn and Rust’s std::process::Command
work unmodified.
6a. clone (syscall 56)
`clone(CLONE_VM|CLONE_VFORK|SIGCHLD)` creates a child sharing the parent's
address space. The parent blocks until the child calls `execve` or `_exit`.
See the `clone` entry in `docs/syscalls/`.
6b. execve (syscall 59)
Replaces the current process image with a new ELF binary. Reads from VFS, creates a fresh PML4, maps segments, builds the initial stack, closes CLOEXEC fds, unblocks the vfork parent, and jumps to userspace.
See the `execve` entry in `docs/syscalls/`.
6c. wait4 (syscall 61)
- `sys_wait4(pid, status_ptr, options)` — find a zombie child, write the exit
  status, reap, return the child PID.
- If no zombie is found: register `wait_thread` on the parent, block, retry
  on wake.
- `sys_exit` wakes the parent's `wait_thread` via `scheduler::unblock()`.
6d. Userspace shell (user/shell.c)
- Compiled with musl (static), deployed at `/shell`.
- Line editing: read char-by-char, echo, backspace, Ctrl+C, Ctrl+D.
- Built-in commands: `echo`, `pwd`, `cd`, `ls`, `cat`, `exit`, `help`.
- External programs: `posix_spawn(path)` + `waitpid(child, &status, 0)`.
- Auto-launched from `kernel/src/main.rs`; falls back to the kernel shell if
  `/shell` is not found on the filesystem.
6e. What’s deferred
- `fork` + CoW page faults — standard POSIX `fork` is not implemented. Adding
  it would require marking all user pages read-only in both parent and child,
  a CoW page fault handler that copies on write, and reference counting on
  physical frames.
Phase 7 — Signals ⬜ NOT STARTED
Signals are the last major piece of POSIX plumbing needed for a realistic user-space environment.
Minimal signal implementation
```rust
pub struct SigAction { handler: usize, flags: u32, mask: SigSet }
pub struct SigTable { actions: [SigAction; 32], pending: SigSet, masked: SigSet }
```
- `sys_rt_sigaction` installs handlers.
- Before returning to user space after a syscall or interrupt, check
  `pending & ~masked`.
- If set: push a signal frame on the user stack (siginfo + ucontext), set RIP
  to the handler, clear the pending bit.
- `sys_rt_sigreturn`: the signal handler calls this when done; the kernel
  pops the ucontext and resumes normal user execution.
Dependency Graph
```
Phase 0 ✅ ← Phase 1 ✅ ← Phase 2 ✅ ← Phase 3 ✅ ← Phase 4 ✅ ← Phase 5 ✅ ← Phase 6 ✅
(toolchain)  (ring-3,     (address     (Process,    (syscall     (musl)      (spawn/wait/
              syscall)     spaces)      ELF loader)  layer)                    shell)
                                                                                ↓
                                                                        Phase 7 (signals)
```
Key Risks and Design Decisions
SYSCALL vs INT 0x80
Use SYSCALL/SYSRET (64-bit, fast path). INT 0x80 is the 32-bit ABI; musl uses SYSCALL on x86-64 exclusively.
Kernel/user split
The kernel lives entirely in the high canonical half (0xFFFF_8000_* and
above): heap at 0xFFFF_8000_*, APIC at 0xFFFF_8001_*, MMIO window at
0xFFFF_8002_*. The entire lower canonical half is free for user processes.
The split is enforced at the PML4 level — user processes simply have no
mappings at indices 256–510, and the kernel entries they inherit are never
USER_ACCESSIBLE. SMEP (CR4.20) and SMAP (CR4.21) provide the hardware
enforcement layer once ring-3 processes exist.
SMEP and SMAP
Once ring-3 processes exist, enable SMEP (CR4.20) to prevent the kernel from
accidentally executing user-mapped code, and SMAP (CR4.21) to prevent the
kernel from silently accessing user memory without an explicit stac/clac
pair. Any kernel code that copies from user buffers must use a checked copy
function that uses stac to temporarily permit access.
Static-only ELF initially
Dynamic linking requires an in-kernel or user-space ELF interpreter (a
dynamic linker such as musl's `ld-musl`). Start with `-static` binaries and
the ELF loader described in Phase 3d. PIE static binaries (`ET_DYN` with no
`PT_INTERP` segment) should work with minor adjustments to the loader.
Single CPU for now
The process table and scheduler assume a single CPU. SMP support would require
per-CPU CURRENT_PID, per-CPU kernel stacks in the TSS, and IPI-based TLB
shootdown when modifying another process’s page table.
Heap size
The kernel heap is 1 MiB. Process control blocks each consume 64 KiB
(kernel stack) plus page table frames, plus Vec storage for mmap_regions.
Zombie processes are reaped via wait4 + reap(), but loading multiple
concurrent processes will still pressure the heap.
Memory management
munmap frees frames and splits/removes VMAs. brk shrink frees pages.
Process exit calls cleanup_user_address_space to walk and free all
user-half page tables and frames. The kernel heap (1 MiB) is the main
remaining pressure point for concurrent processes.
Milestones and Test Checkpoints
| Milestone | Observable result | Status |
|---|---|---|
| Phase 1 complete | iretq drops to ring 3; syscall returns to ring 0; “Hello from ring 3!” appears on VGA | ✅ Done |
| Phase 2 complete | Two user processes have separate address spaces; test isolation passes | ✅ Done |
| Phase 3 complete | exec /path/to/elf reads an ELF from the VFS, loads it into a fresh address space, and runs it | ✅ Done |
| Phase 4 complete | 14 syscalls, initial stack with auxv, brk/mmap heap, writev for printf | ✅ Done |
| Phase 5 complete | hello compiled with x86_64-linux-musl-gcc -static prints and exits cleanly | ✅ Done |
| Phase 6 complete | Userspace shell spawns children and waits for them; auto-launches on boot | ✅ Done |
| Phase 7 complete | SIGINT (Ctrl+C) terminates the foreground process | ⬜ |