Building a hypervisor, 2: Booting Linux

In this post, we'll make our hypervisor boot the Linux kernel into a basic userspace consisting of a dummy init, covering topics like long mode, paging, the Linux boot protocol, and various KVM APIs

This post will touch on a lot of topics briefly, but won't go super in-depth into all of them, mainly focusing on the blockers I faced personally as a novice systems programmer

Linux Boot Protocol

The Linux Boot Protocol describes how to set up the environment for booting the kernel: loading the kernel image, passing the kernel command line and other setup parameters, setting up segment registers, etc. Everything is laid out in memory like so:

        |                        |
0A0000  +------------------------+
        |  Reserved for BIOS     |      Do not use.  Reserved for BIOS EBDA.
09A000  +------------------------+
        |  Command line          |
        |  Stack/heap            |      For use by the kernel real-mode code.
098000  +------------------------+
        |  Kernel setup          |      The kernel real-mode code.
090200  +------------------------+
        |  Kernel boot sector    |      The kernel legacy boot sector.
090000  +------------------------+
        |  Protected-mode kernel |      The bulk of the kernel image.
010000  +------------------------+
        |  Boot loader           |      <- Boot sector entry point 0000:7C00
001000  +------------------------+
        |  Reserved for MBR/BIOS |
000800  +------------------------+
        |  Typically used by MBR |
000600  +------------------------+
        |  BIOS use only         |
000000  +------------------------+

The kernel can be booted from real mode, protected mode, long mode, or directly from UEFI via the EFI stub (not relevant here). When booting in long mode, the bootloader itself has to set up paging, various flags in registers, etc., which the kernel would do by itself if booted in real mode. We opt for long mode anyway, as this setup will be required when we modify our code to boot an uncompressed vmlinux image directly rather than a compressed bzImage, skipping quite a bit of startup code. Projects like Firecracker do this as well

We will be covering these topics briefly first, before finally getting to boot the kernel image

Paging

First, we'll enter long mode, execute 64-bit code under it, and then come back to the Linux boot protocol. In the previous post, we set up a basic environment using the KVM API capable of executing 16-bit real mode code and executed a toy program under it:

$ cargo run -- hello
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/intro hello`
Port: 0x3f8, Char: H
Port: 0x3f8, Char: e
Port: 0x3f8, Char: l
Port: 0x3f8, Char: l
Port: 0x3f8, Char: o
Port: 0x3f8, Char: ,
Port: 0x3f8, Char:  
Port: 0x3f8, Char: K
Port: 0x3f8, Char: V
Port: 0x3f8, Char: M
Port: 0x3f8, Char: !

The 64-bit boot protocol states: "At entry, the CPU must be in 64-bit mode with paging enabled. The range with setup_header.init_size from start address of loaded kernel and zero page and command line buffer get ident mapping; a GDT must be loaded with the descriptors for selectors __BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat segment; __BOOT_CS must have execute/read permission, and __BOOT_DS must have read/write permission; CS must be __BOOT_CS and DS, ES, SS must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base address of the struct boot_params"

So we have a couple of things to do:

Setup Paging (prerequisite for long mode)
Enter Long Mode
Setup Segment Registers

We just need to perform a basic paging setup as mentioned above, involving identity mapped pages, i.e. virtual addresses in a specific range are mapped to the same physical addresses. For instance, the virtual address 0x200000 gets mapped to the physical address 0x200000. We will do this for the kernel image (the aforementioned ident mapping); it's required because with paging enabled, every memory access goes through the page tables and physical memory can't be addressed directly

For entering long mode, we also need to enable Physical Address Extension (PAE). Long mode paging uses 4 levels of page tables, each having a maximum of 512 entries:

PML4: Page Map Level 4
PDPT: Page Directory Pointer Table
PD: Page Directory
PT: Final level page table, will be omitted in our case as our PD entries will point to a page directly

                             [0]   = 0x000000 - 0x200000
                           / 
PML4[0] -> PDPT[0] -> PD -   [1]   = 0x200000 - 0x400000
                           \
                             [511] = 0x3fe00000 - 0x40000000

Note that this is a very minimal setup, just enough to satisfy the kernel's requirement of an identity mapped region and a lot more entries will be added into these tables after it boots. Further resources about paging are linked at the end, but as an example (the process might feel a bit redundant here as we're just using identity paging), the address 0x401000 would be translated by breaking up the address into ranges of bits, and using them as indices into the various tables:

addr = 0x401000

0000 0000 0000 0000 0000 0000 0100 0000 0001 0000 0000 0000
|PML4      |PDPT      |PD        |Offset                  |
|47 - 39   |38 - 30   |29 - 21   |20 - 0                  |

PML4 = { [0] = PDPT }
PDPT = { [0] = PD }
PD   = { [0] = 0x000000, [1] = 0x200000, ..., [511] = 0x3fe00000 }

Page Offset = addr & 0x1FFFFF      = 4096
PD   Index  = (addr >> 21) & 0x1FF = 2
PDPT Index  = (addr >> 30) & 0x1FF = 0
PML4 Index  = (addr >> 39) & 0x1FF = 0

The lowest 21 bits represent the offset into the page, the next 9 bits the index into the Page Directory, and so on (ref. Figure 4.9 from Intel SDM Vol 3)
For retrieving the physical address, the PML4 index is used to fetch an entry from the PML4 table, whose address is stored in the cr3 register. Here this index is 0, and that entry points to the PDPT
Then, the PDPT index fetches the entry from the Page Directory Pointer Table, which is again at the 0th index, pointing to the start of the PD
Finally, the PD index fetches the entry from the Page Directory, which points to the physical address of the page. Here the index is 2, so the entry points to 0x400000. The Page Offset represents the offset into this page, which is 4096 (0x1000) here, so the final address is 0x400000 + 0x1000 = 0x401000
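As a quick way to check the arithmetic above, here is a minimal sketch that splits an address into its indices and translates it under our identity mapping. The names `TableIndices`, `split_address` and `translate_identity` are made up for this illustration and are not part of the hypervisor's code:

```rust
/// Indices used to walk a 4-level page table hierarchy with 2MB pages
/// (hypothetical helper type for illustration)
#[derive(Debug, PartialEq, Eq)]
pub struct TableIndices {
    pub pml4: usize,
    pub pdpt: usize,
    pub pd: usize,
    pub offset: u64,
}

/// Split a virtual address into table indices and a page offset
pub fn split_address(addr: u64) -> TableIndices {
    TableIndices {
        // Bits 47 .. 39
        pml4: ((addr >> 39) & 0x1FF) as usize,
        // Bits 38 .. 30
        pdpt: ((addr >> 30) & 0x1FF) as usize,
        // Bits 29 .. 21
        pd: ((addr >> 21) & 0x1FF) as usize,
        // Bits 20 .. 0, the offset into a 2MB page
        offset: addr & 0x1F_FFFF,
    }
}

/// Translate an address assuming the identity mapping described above,
/// where PD entry `n` points at physical address `n * 2MB`
pub fn translate_identity(addr: u64) -> Option<u64> {
    let idx = split_address(addr);

    // Only PML4[0] and PDPT[0] are populated in this setup
    if idx.pml4 != 0 || idx.pdpt != 0 {
        return None;
    }

    let page_base = (idx.pd as u64) << 21;
    Some(page_base + idx.offset)
}
```

For 0x401000 this yields PD index 2 and offset 0x1000, and the translated address is 0x401000 again, as expected for an identity mapping. Anything at or above 1GB falls outside the single PDPT entry we populate, so the sketch returns `None` for it.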

Coming to code, the paging setup can be implemented like this, building on top of what we developed in the previous post:

pub mod PageTables {
    /// Page Map Level 4 Table
    pub const PML4: usize = 0x1000;
    /// Page Directory Pointer Table
    pub const PDPT: usize = 0x2000;
    /// Page Directory
    pub const PD: usize = 0x3000;
}

pub mod PageFlags {
    /// The page is present in physical memory
    pub const PRESENT: u64 = 1 << 0;
    /// The page is read/write
    pub const READ_WRITE: u64 = 1 << 1;
    /// Make PDE map to a 4MiB page, Page Size Extension must be enabled
    pub const PAGE_SIZE: u64 = 1 << 7;
}

pub fn setup_paging(memory: &mut [u64]) {
    // We divide all the addresses by 8 as we treat the KVM's memory region as
    // a buffer of u64's rather than u8's in this function
    let entry_size = mem::size_of::<u64>();

    memory[PageTables::PML4 / entry_size] =
        PageFlags::PRESENT | PageFlags::READ_WRITE | PageTables::PDPT as u64;
    memory[PageTables::PDPT / entry_size] =
        PageFlags::PRESENT | PageFlags::READ_WRITE | PageTables::PD as u64;

    // We need 512 entries to cover 1GB
    let pd = &mut memory[(PageTables::PD / entry_size)..][..512];

    // Identity Mapping
    for (n, entry) in pd.iter_mut().enumerate() {
        *entry =
            PageFlags::PRESENT | PageFlags::READ_WRITE | PageFlags::PAGE_SIZE | ((n as u64) << 21);
    }
}

We choose page-aligned addresses for the tables, and set up an entry in each of them pointing to the next level table. In the Page Directory entries, we also set the PAGE_SIZE flag to indicate that the next level page table is omitted and that the entry points directly to a 2MB physical page (not 4MB as the comment says, since we also enable PAE)

Bits 0-11 of the entry hold flags (for a 2MB page, bit 12 is also an attribute bit and bits 13-20 are reserved), and the physical address of the page occupies bits 21 and up. So we OR the flags with the entry's index shifted by 21 bits, which gets us multiples of 2MB (e.g. 1 << 21 = 2097152), hence covering 1GB with 512 entries. This doesn't interfere with the flag bits, as the lower 21 bits of a 2MB-aligned address are always zero
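To convince ourselves that the tables really describe an identity mapping, we can walk them in software the way the MMU would. The following is a rough sketch, not the hypervisor's code: `build_tables` repeats the setup from above on a plain `Vec<u64>`, and `walk` together with the two address masks are names assumed for this example:

```rust
const PML4: usize = 0x1000;
const PDPT: usize = 0x2000;
const PD: usize = 0x3000;

const PRESENT: u64 = 1 << 0;
const READ_WRITE: u64 = 1 << 1;
const PAGE_SIZE: u64 = 1 << 7;

/// Physical address of a next-level table (bits 51 .. 12)
const TABLE_ADDR_MASK: u64 = 0x000F_FFFF_FFFF_F000;
/// Physical address of a 2MB page frame (bits 51 .. 21)
const PAGE_2M_ADDR_MASK: u64 = 0x000F_FFFF_FFE0_0000;

/// Build the same three tables as above in a plain buffer of u64's
pub fn build_tables() -> Vec<u64> {
    // 0x4000 bytes are enough to hold the tables at 0x1000, 0x2000 and 0x3000
    let mut memory = vec![0u64; 0x4000 / 8];

    memory[PML4 / 8] = PRESENT | READ_WRITE | PDPT as u64;
    memory[PDPT / 8] = PRESENT | READ_WRITE | PD as u64;

    for n in 0..512u64 {
        memory[PD / 8 + n as usize] = PRESENT | READ_WRITE | PAGE_SIZE | (n << 21);
    }

    memory
}

/// Walk the tables starting from `cr3`, returning the physical address
pub fn walk(memory: &[u64], cr3: u64, addr: u64) -> Option<u64> {
    // Read entry `index` of the table at byte address `table`, and treat
    // entries without the PRESENT flag as unmapped
    let read = |table: u64, index: u64| -> Option<u64> {
        let entry = *memory.get((table / 8 + index) as usize)?;
        (entry & PRESENT != 0).then_some(entry)
    };

    let pml4e = read(cr3, (addr >> 39) & 0x1FF)?;
    let pdpte = read(pml4e & TABLE_ADDR_MASK, (addr >> 30) & 0x1FF)?;
    let pde = read(pdpte & TABLE_ADDR_MASK, (addr >> 21) & 0x1FF)?;

    // This sketch only handles PD entries that map a 2MB page directly
    if pde & PAGE_SIZE == 0 {
        return None;
    }

    Some((pde & PAGE_2M_ADDR_MASK) | (addr & 0x1F_FFFF))
}
```

Walking 0x401000 with `cr3` set to 0x1000 reads PD entry 2, which holds 0x400083 (the page address 0x400000 plus the three flag bits), and returns 0x401000. Addresses from 1GB onwards hit an empty PDPT entry and are reported as unmapped.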

Segment Registers and Long Mode

The kernel requires a Global Descriptor Table (GDT) and segment registers to be set up as we mentioned above. They are mostly relevant for memory segmentation, which is not relevant as we're using paging, but we still need a minimal setup consisting of a code segment & data segment located at 0x10 and 0x18 respectively:

/// Read permission & Data Segment is implied
/// Code Segment is implicitly executable
pub mod SegmentFlags {
    /// Read permissions for Code Segment
    pub const CODE_READ: u8 = 1 << 1;
    /// Indicate that this is a Code Segment
    pub const CODE_SEGMENT: u8 = 1 << 3;
    /// Write permissions for Data Segment
    pub const DATA_WRITE: u8 = 1 << 1;
}

/// CS, placed at 0x10
pub const CODE_SEGMENT: kvm_segment = kvm_segment {
    base: 0,
    limit: 0xFFFFFFFF,
    selector: 0x10,
    type_: SegmentFlags::CODE_SEGMENT | SegmentFlags::CODE_READ,
    present: 1,
    dpl: 0,
    db: 0,
    s: 1,
    l: 1,
    g: 1,
    avl: 0,
    unusable: 0,
    padding: 0,
};

/// DS, placed at 0x18
pub const DATA_SEGMENT: kvm_segment = kvm_segment {
    base: 0,
    limit: 0xFFFFFFFF,
    selector: 0x18,
    type_: SegmentFlags::DATA_WRITE,
    present: 1,
    dpl: 0,
    db: 1,
    s: 1,
    l: 0,
    g: 1,
    avl: 0,
    unusable: 0,
    padding: 0,
};

We use the kvm_segment as a convenient way to define these segments, but we also need to encode the segments into a 64-bit value to write to the KVM memory at the addresses set in the selector field (ref Section 3.4.5 from Intel SDM Vol. 3):

pub fn pack_segment(segment: &kvm_segment) -> u64 {
    // We don't need to set a base address
    assert_eq!(segment.base, 0);

    // Bits 8 (Segment Type) .. 15 (P)
    let lo_flags =
        // 8 .. 11 (Segment Type)
        segment.type_
        // 12 (S, Descriptor Type)
        // It is set to indicate a code/data segment
        | (segment.s << 4)
        // 13 .. 14 (Descriptor Privilege Level)
        // Leave it as zeroes for ring 0
        | (segment.dpl << 5)
        // 15 (P, Segment-Present)
        // The segment is present (duh)
        | (segment.present << 7);

    // Bits 20 (AVL) .. 23 (G)
    let hi_flags =
        // 20 (AVL)
        // Available for use by system software, undesirable in our case
        segment.avl
        // 21 (L)
        // Code segment is executed in 64-bit mode
        // For DS, L bit must not be set
        | (segment.l << 1)
        // 22 (D/B)
        // Indicates 32-bit, must only be set for DS
        // For CS, if the L-bit is set, then the D-bit must be cleared
        | (segment.db << 2)
        // 23 (G, Granularity)
        // Scales the limit to 4-KByte units, so we can set the limit to 4GB
        // while just occupying 20 bits overall
        // (0xFFFFF * (1024 * 4)) == ((1 << 20) << 12) == (1 << 32) == 4GB
        | (segment.g << 3);

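    // With G set, the descriptor stores the limit in 4-KByte units, so only
    // the top 20 bits of the byte-granular `limit` are encoded
    let limit = if segment.g == 1 {
        (segment.limit >> 12) as u64
    } else {
        segment.limit as u64
    };

    // Upper 32 bits of the descriptor (bit numbers relative to this half)
    // 0 .. 7 and 24 .. 31 hold parts of the base address, which is zero
    let packed =
        // 8 .. 15 (Type, S, DPL, P)
        ((lo_flags as u64) << 8)
        // 16 .. 19 (Bits 19:16 of the limit)
        | (((limit >> 16) & 0xF) << 16)
        // 20 .. 23 (AVL, L, D/B, G)
        | ((hi_flags as u64) << 20);

    // Lower 32 bits: 0 .. 15 (Bits 15:0 of the limit), 16 .. 31 (Base Addr, zero)
    (packed << 32) | (limit & 0xFFFF)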
}

/// Sets up the GDT in the KVM memory region
pub fn setup_gdt(memory: &mut [u64]) {
    // CS (0x10)
    memory[2] = pack_segment(&CODE_SEGMENT);
    // DS (0x18)
    memory[3] = pack_segment(&DATA_SEGMENT);
}

The comments in the above function also elaborate on each of the flags we set. Coming to the kernel's requirements: a GDT must be loaded with the descriptors for selectors __BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat segment; __BOOT_CS must have execute/read permission, and __BOOT_DS must have read/write permission; CS must be __BOOT_CS and DS, ES, SS must be __BOOT_DS

We satisfy them as Code Segment is loaded at 0x10 and Data Segment at 0x18, along with the requested permissions, and limit is set to 4GB with the granularity flag
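One way to sanity-check the descriptor encoding is to compare it against the flat 64-bit code and data descriptors that commonly show up in boot code, 0x00AF9A000000FFFF and 0x00CF92000000FFFF. The sketch below is self-contained and uses a hypothetical `FlatSegment` type instead of `kvm_segment`, so it does not depend on the KVM bindings; `encode_flat` and `selector_to_offset` are likewise names assumed for this example:

```rust
/// Minimal stand-in for the fields of `kvm_segment` that matter here
/// (hypothetical type for illustration)
pub struct FlatSegment {
    pub type_: u8,
    pub s: u8,
    pub dpl: u8,
    pub present: u8,
    pub avl: u8,
    pub l: u8,
    pub db: u8,
    pub g: u8,
}

/// Encode a flat segment (base 0, 4GB limit) as an 8-byte GDT descriptor
pub fn encode_flat(seg: &FlatSegment) -> u64 {
    // Access byte, bits 40 .. 47: Type, S, DPL, P
    let access = (seg.type_ | (seg.s << 4) | (seg.dpl << 5) | (seg.present << 7)) as u64;
    // Flags nibble, bits 52 .. 55: AVL, L, D/B, G
    let flags = (seg.avl | (seg.l << 1) | (seg.db << 2) | (seg.g << 3)) as u64;
    // A 4GB limit with 4KB granularity is stored as the 20-bit value 0xFFFFF
    let limit: u64 = 0xFFFFF;

    // Bits 0 .. 15 hold limit 15:0, bits 48 .. 51 hold limit 19:16
    (limit & 0xFFFF) | (access << 40) | (((limit >> 16) & 0xF) << 48) | (flags << 52)
}

/// Executable, readable, 64-bit code segment
pub const CODE: FlatSegment = FlatSegment {
    type_: 0b1010, s: 1, dpl: 0, present: 1, avl: 0, l: 1, db: 0, g: 1,
};

/// Writable data segment
pub const DATA: FlatSegment = FlatSegment {
    type_: 0b0010, s: 1, dpl: 0, present: 1, avl: 0, l: 0, db: 1, g: 1,
};

/// Byte offset of a descriptor within the GDT for a given selector
pub fn selector_to_offset(selector: u16) -> usize {
    // The low three bits of a selector hold the RPL and the table indicator
    (selector & !0x7) as usize
}
```

Since each descriptor is 8 bytes, selector 0x10 corresponds to index 2 and selector 0x18 to index 3 when the GDT is viewed as a slice of u64's starting at address 0, which is why `setup_gdt` writes to `memory[2]` and `memory[3]`.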

Finally, for entering long mode, we just need to point the cr3 register to our PML4 table's address as discussed in the previous section, and set the relevant flags in various registers:

/// Control Register 0
pub mod Cr0Flags {
    /// Enable protected mode
    pub const PE: u64 = 1 << 0;
    /// Enable paging
    pub const PG: u64 = 1 << 31;
}

/// Control Register 4
pub mod Cr4Flags {
    /// Page Size Extension
    pub const PSE: u64 = 1 << 4;
    /// Physical Address Extension, size of large pages is reduced from
    /// 4MiB to 2MiB and PSE is enabled regardless of the PSE bit
    pub const PAE: u64 = 1 << 5;
}

/// Extended Feature Enable Register
pub mod EferFlags {
    /// Long Mode Enable
    pub const LME: u64 = 1 << 8;
    /// Long Mode Active
    pub const LMA: u64 = 1 << 10;
}

/// Setup the KVM segment registers in accordance with our paging & GDT setup
pub fn setup_sregs() -> kvm_sregs {
    kvm_sregs {
        // https://wiki.osdev.org/Setting_Up_Long_Mode
        cr3: PageTables::PML4 as u64,
        cr4: Cr4Flags::PAE,
        cr0: Cr0Flags::PE | Cr0Flags::PG,
        efer: EferFlags::LMA | EferFlags::LME,
        // `limit` is not required
        // The GDT starts at address 0
        // CS is at 16 (0x10), DS is at 24 (0x18)
        gdt: kvm_dtable {
            base: 0,
            ..Default::default()
        },
        cs: CODE_SEGMENT,
        ds: DATA_SEGMENT,
        es: DATA_SEGMENT,
        fs: DATA_SEGMENT,
        gs: DATA_SEGMENT,
        ss: DATA_SEGMENT,
        ..Default::default()
    }
}

/// Setup the KVM CPU registers in accordance with the Linux boot protocol
pub fn setup_regs(code64_start: u64, boot_params_addr: u64) -> kvm_regs {
    kvm_regs {
        // Just set the reserved bit, leave all other bits off
        // This turns off interrupts as well
        rflags: 1 << 1,
        // The instruction pointer should point to the start of the 64-bit kernel code
        rip: code64_start,
        // The `rsi` register must contain the address of the `boot_params` struct
        rsi: boot_params_addr,
        ..Default::default()
    }
}
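When a vCPU refuses to start or immediately triple-faults, a forgotten control bit is a likely suspect. As a small debugging aid, here is a sketch of a check over the register values we set above. The helper `missing_long_mode_bits` is hypothetical and only covers the bits discussed in this post:

```rust
pub const CR0_PE: u64 = 1 << 0;
pub const CR0_PG: u64 = 1 << 31;
pub const CR4_PAE: u64 = 1 << 5;
pub const EFER_LME: u64 = 1 << 8;
pub const EFER_LMA: u64 = 1 << 10;

/// Names of the bits that we rely on for 64-bit mode with paging but that
/// are missing from the given register values
pub fn missing_long_mode_bits(cr0: u64, cr4: u64, efer: u64) -> Vec<&'static str> {
    let checks = [
        (cr0 & CR0_PE != 0, "CR0.PE"),
        (cr0 & CR0_PG != 0, "CR0.PG"),
        (cr4 & CR4_PAE != 0, "CR4.PAE"),
        (efer & EFER_LME != 0, "EFER.LME"),
        (efer & EFER_LMA != 0, "EFER.LMA"),
    ];

    checks
        .iter()
        .filter(|(ok, _)| !*ok)
        .map(|(_, name)| *name)
        .collect()
}
```

With the values from `setup_sregs` (cr0 = 0x80000001, cr4 = 0x20, efer = 0x500) the returned list is empty. Dropping the paging bit from cr0, for example, would report "CR0.PG".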

Let's patch up our old main function to call all these helpers:

diff --git a/../intro/src/main.rs b/src/main.rs
index 315f4a9..0dbadf3 100644
--- a/../intro/src/main.rs
+++ b/src/main.rs
@@ -136,9 +136,10 @@ impl Kvm {
 }
 
 fn main() -> Result<(), Box<dyn std::error::Error>> {
-    // We don't need a large mapping as our code is tiny
-    // Must be page-size aligned, so minimum is 4KiB
-    const MAP_SIZE: usize = 0x1000;
+    // 1GB
+    const MAP_SIZE: usize = 0x40000000;
+    // Arbitrary (within the 1GB that we identity map)
+    const CODE_START: usize = 0x4000;
 
     let mut code = Vec::new();
 
@@ -165,31 +166,25 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
         },
     );
 
-    assert!(code.len() < MAP_SIZE);
+    assert!((CODE_START + code.len()) < MAP_SIZE);
 
     // The idiomatic way is to write a wrapper struct for `mmap`-ing regions
     // and exposing it as a slice (std::slice::from_raw_parts)
     // But we just copy the code directly here
     unsafe {
-        std::ptr::copy_nonoverlapping(code.as_ptr(), *mapping as *mut _, code.len());
+        std::ptr::copy_nonoverlapping(code.as_ptr(), (*mapping as *mut u8).add(CODE_START), code.len());
     };
 
-    let mut sregs = kvm.get_vcpu_sregs()?;
+    let mapped_slice = unsafe { slice::from_raw_parts_mut(*mapping as _, MAP_SIZE) };
 
-    // CS points to the reset vector by default
-    sregs.cs.base = 0;
-    sregs.cs.selector = 0;
+    util::setup_gdt(mapped_slice);
+    util::setup_paging(mapped_slice);
+
+    // Ignore boot_params for now
+    kvm.set_vcpu_regs(&util::setup_regs(CODE_START as u64, 0))?;
+    kvm.set_vcpu_sregs(&util::setup_sregs())?;
 
-    kvm.set_vcpu_sregs(&sregs)?;
     kvm.set_user_memory_region(0, MAP_SIZE, *mapping as u64)?;
-    kvm.set_vcpu_regs(&kvm_regs {
-        // The first bit must be set on x86
-        rflags: 1 << 1,
-        // The instruction pointer is set to 0 as our code is loaded with 0
-        // as the base address
-        rip: 0,
-        ..Default::default()
-    })?;
 
     loop {
         let kvm_run = kvm.run()?;

The final code for this section can be found here

Sanity Check

Before we go ahead with loading the kernel image, let's write a small hello world program as before, but 64-bit rather than 16-bit, hello.S:

BITS 64

; Output to port 0x3f8
mov dx, 0x3f8

; 0x4000 is added to the message address as that's where our code is loaded
mov rbx, message + 0x4000

loop:
    ; Load the byte at the address in `rbx` into the `al` register
    mov al, [rbx]

    ; Jump to the `hlt` instruction if we encountered the NUL terminator
    cmp al, 0
    je end

    ; Output to the serial port
    out dx, al
    ; Increment `rbx` by one byte to point to the next character
    inc rbx

    jmp loop

end:
    hlt

message:
    db "Hello, KVM!", 0

After building with nasm hello.S:

$ nasm hello.S && cargo run hello
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/intro hello`
Port: 0x3f8, Char: H
Port: 0x3f8, Char: e
Port: 0x3f8, Char: l
Port: 0x3f8, Char: l
Port: 0x3f8, Char: o
Port: 0x3f8, Char: ,
Port: 0x3f8, Char:  
Port: 0x3f8, Char: K
Port: 0x3f8, Char: V
Port: 0x3f8, Char: M
Port: 0x3f8, Char: !

Loading the Kernel Image

Now, we can actually get to the main topic! We don't need to do much now, just implement what the boot protocol asks for, and set up a few more things with KVM APIs

First off, we need to load the setup header from our kernel image which is used to provide details like the addresses of the kernel command line and initramfs, among other metadata. It is generated at build time and is embedded at the beginning of the kernel image. The main fields we need to modify are:

ramdisk_image, ramdisk_size: The address and size of the initramfs to be loaded
cmd_line_ptr: The address of the NUL-terminated kernel command line. The ext_cmd_line_ptr field is explicitly zeroed as it seems to be set to a garbage value as a side-effect of commit d9b6b6, breaking calculations in get_cmd_line_ptr() which caused a long drawn out debugging session :p
e820_table: E820 entries tell the kernel about the reserved and available memory regions. We make the memory starting from the kernel's start address till the end of our mapped memory usable, and mark a small section in the beginning as reserved for a 1KB EBDA (Extended BIOS Data Area) region. We don't interact with it directly but the kernel can misbehave without the area being marked as reserved:

[
    // Memory before the EBDA entry
    boot_e820_entry {
        addr: 0,
        size: 0x9fc00,
        // E820_RAM
        type_: 1,
    },
    // Reserved EBDA entry
    boot_e820_entry {
        addr: 0x9fc00,
        size: 1 << 10,
        // E820_RESERVED,
        type_: 2,
    },
    // Memory after the beginning of the kernel image
    boot_e820_entry {
        addr: 0x100000,
        size: MAPPING_SIZE as u64 - 0x100000,
        type_: 1,
    },
]
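A malformed E820 table is easy to produce by accident, so it can help to check a few properties before handing it to the kernel. The following sketch uses a simplified `Entry` type in place of `boot_e820_entry`; it and the helpers `example_map`, `is_sorted_and_disjoint` and `usable_bytes` are assumptions made for this example:

```rust
pub const E820_RAM: u32 = 1;
pub const E820_RESERVED: u32 = 2;

/// Simplified stand-in for `boot_e820_entry` (hypothetical type)
#[derive(Clone, Copy, Debug)]
pub struct Entry {
    pub addr: u64,
    pub size: u64,
    pub type_: u32,
}

/// The three entries from above, for a mapping of `mapping_size` bytes
pub fn example_map(mapping_size: u64) -> [Entry; 3] {
    [
        // Memory before the EBDA entry
        Entry { addr: 0, size: 0x9fc00, type_: E820_RAM },
        // Reserved 1KB EBDA entry
        Entry { addr: 0x9fc00, size: 1 << 10, type_: E820_RESERVED },
        // Memory after the beginning of the kernel image
        Entry { addr: 0x100000, size: mapping_size - 0x100000, type_: E820_RAM },
    ]
}

/// Check that the entries are ordered by address and never overlap
pub fn is_sorted_and_disjoint(entries: &[Entry]) -> bool {
    entries
        .windows(2)
        .all(|pair| pair[0].addr + pair[0].size <= pair[1].addr)
}

/// Total number of bytes reported to the kernel as usable RAM
pub fn usable_bytes(entries: &[Entry]) -> u64 {
    entries
        .iter()
        .filter(|entry| entry.type_ == E820_RAM)
        .map(|entry| entry.size)
        .sum()
}
```

For our 1GB mapping, the EBDA entry ends exactly at 0xA0000, and the range from 0xA0000 to 0x100000 is simply not listed, so the kernel will not treat it as RAM.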

pub fn new(
    bz_image: &'a [u8],
    cmdline_addr: u32,
    initramfs_addr: Option<u32>,
    initramfs_size: Option<u32>,
    e820_entries: &[boot_e820_entry],
) -> Result<BzImage<'a>, LoaderError> {
    // The setup_header is located at offset 0x1f1 (`hdr` field) from the start
    // of `boot_params` (which is also the start of the kernel image)
    let mut boot_params = boot_params::default();

    // Ref: 1.3. Details of Header Fields
    // We just need to modify a few fields here to tell the kernel about
    // the environment we're setting up. Rest of the information is already
    // filled in the struct (embedded in the bz_image)

    if bz_image.len() < mem::size_of_val(&boot_params) {
        return Err(LoaderError::ImageTooSmall);
    }

    unsafe {
        ptr::copy_nonoverlapping(bz_image.as_ptr().cast(), &mut boot_params, 1);
    }

    // `boot_flag` and `header` are magic values documented in the boot protocol
    // > Then, the setup header at offset 0x01f1 of kernel image on should be
    // > loaded into struct boot_params and examined. The end of setup header
    // > can be calculated as follows: 0x0202 + byte value at offset 0x0201
    // 0x0201 is the second byte of the 16-bit `jump` field (at 0x0200) of the
    // `setup_header` struct. The field contains an x86 jump instruction: 0xEB
    // followed by a signed offset relative to byte 0x0202
    // So we just read the high byte of `jump`, i.e. the offset from 0x0202
    // It should always be 106 unless a field after `kernel_info_offset` is added
    if boot_params.hdr.boot_flag != 0xAA55
        || boot_params.hdr.header != 0x53726448
        || (boot_params.hdr.jump >> 8) != 106
    {
        return Err(LoaderError::InvalidImage);
    }

    if bz_image.len() < kernel_byte_offset(&boot_params) {
        return Err(LoaderError::ImageTooSmall);
    }

    // 0xFFFF selects the "normal" video mode
    boot_params.hdr.vid_mode = 0xFFFF;

    // "Undefined" Bootloader ID
    boot_params.hdr.type_of_loader = 0xFF;

    // LOADED_HIGH: the protected-mode code is loaded at 0x100000
    // CAN_USE_HEAP: Self explanatory
    boot_params.hdr.loadflags |= (LOADED_HIGH | CAN_USE_HEAP) as u8;

    boot_params.hdr.ramdisk_image = initramfs_addr.unwrap_or(0);
    boot_params.hdr.ramdisk_size = initramfs_size.unwrap_or(0);

    // https://www.kernel.org/doc/html/latest/arch/x86/boot.html#sample-boot-configuration
    // 0xe000 - 0x200
    boot_params.hdr.heap_end_ptr = 0xde00;
    // The command line parameters can be located anywhere in 64-bit mode
    // Must be NUL terminated
    boot_params.hdr.cmd_line_ptr = cmdline_addr;
    boot_params.ext_cmd_line_ptr = 0;

    boot_params.e820_entries = e820_entries
        .len()
        .try_into()
        .map_err(|_| LoaderError::TooManyEntries)?;
    boot_params.e820_table[..e820_entries.len()].copy_from_slice(e820_entries);

    Ok(Self {
        bz_image,
        boot_params,
    })
}
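The constructor above relies on `kernel_byte_offset()`, which isn't shown in this post. As a rough sketch of what such a helper might compute, the version below works on the raw image bytes instead of `boot_params`. It follows the boot protocol's rule that the protected-mode kernel starts at `(setup_sects + 1) * 512`, with a `setup_sects` of 0 meaning 4. The names `protected_mode_offset`, `HeaderError` and `fake_image` are made up for this example:

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    TooSmall,
    BadMagic,
}

/// Validate the magic values and return the file offset at which the
/// protected-mode kernel starts
pub fn protected_mode_offset(image: &[u8]) -> Result<usize, HeaderError> {
    // We need everything up to the end of the `header` field (0x202 .. 0x206)
    if image.len() < 0x206 {
        return Err(HeaderError::TooSmall);
    }

    let boot_flag = u16::from_le_bytes([image[0x1fe], image[0x1ff]]);
    let header = u32::from_le_bytes([image[0x202], image[0x203], image[0x204], image[0x205]]);

    // 0xAA55 and "HdrS"
    if boot_flag != 0xAA55 || header != 0x5372_6448 {
        return Err(HeaderError::BadMagic);
    }

    // `setup_sects` lives at offset 0x1f1, a value of 0 means 4
    let setup_sects = match image[0x1f1] {
        0 => 4,
        n => n as usize,
    };

    // One legacy boot sector plus `setup_sects` sectors of setup code
    let offset = (setup_sects + 1) * 512;
    if image.len() < offset {
        return Err(HeaderError::TooSmall);
    }

    Ok(offset)
}

/// Build a zeroed fake image with valid magic values, for testing only
pub fn fake_image(setup_sects: u8, total_len: usize) -> Vec<u8> {
    let mut image = vec![0u8; total_len.max(0x206)];
    image[0x1f1] = setup_sects;
    image[0x1fe..0x200].copy_from_slice(&0xAA55u16.to_le_bytes());
    image[0x202..0x206].copy_from_slice(&0x5372_6448u32.to_le_bytes());
    image
}
```

Everything from the returned offset to the end of the file is what gets copied to 0x100000. Whether the real `kernel_byte_offset()` is implemented exactly like this is an assumption on my part, so treat it as a reference for the calculation rather than the actual code.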

Now, we just need to place everything at the respective physical addresses:

const MAPPING_SIZE: usize = 1 << 30;

const CMDLINE: &[u8] = b"console=ttyS0 earlyprintk=ttyS0 rdinit=/init\0";

const ADDR_BOOT_PARAMS: usize = 0x10000;
const ADDR_CMDLINE: usize = 0x20000;
const ADDR_KERNEL32: usize = 0x100000;
const ADDR_INITRAMFS: usize = 0xf000000;

The boot parameters, initramfs and kernel cmdline can be located at arbitrary addresses as we explicitly pass their addresses. Only the kernel has to be loaded at a fixed address. The initramfs is loaded at a higher address as we don't want the kernel image to overflow into the initramfs, causing corruption

We need to set up a few more fundamental components now:
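Since overlapping regions lead to corruption that is painful to debug, a cheap check before copying anything can be worthwhile. Here is a minimal sketch using the addresses from above. The function `layout_fits` is hypothetical, it assumes `boot_params` occupies one 4096-byte page, and it only looks at where things are loaded, not at where the kernel later decompresses itself:

```rust
pub const MAPPING_SIZE: usize = 1 << 30;

pub const ADDR_BOOT_PARAMS: usize = 0x10000;
pub const ADDR_CMDLINE: usize = 0x20000;
pub const ADDR_KERNEL32: usize = 0x100000;
pub const ADDR_INITRAMFS: usize = 0xf000000;

/// Size of `boot_params` (the "zero page")
pub const BOOT_PARAMS_SIZE: usize = 4096;

/// Check that the regions we load don't overlap each other and that they
/// all fit inside the mapped guest memory
pub fn layout_fits(cmdline_len: usize, kernel_len: usize, initramfs_len: usize) -> bool {
    // (start address, length), ordered by start address
    let regions = [
        (ADDR_BOOT_PARAMS, BOOT_PARAMS_SIZE),
        (ADDR_CMDLINE, cmdline_len),
        (ADDR_KERNEL32, kernel_len),
        (ADDR_INITRAMFS, initramfs_len),
    ];

    // Every region must end before the next one starts
    let disjoint = regions
        .windows(2)
        .all(|pair| pair[0].0 + pair[0].1 <= pair[1].0);

    // The last region must end within the mapping
    let (last_addr, last_len) = regions[regions.len() - 1];

    disjoint && last_addr + last_len <= MAPPING_SIZE
}
```

A 12MB kernel with a 1MB initramfs passes this check, while a protected-mode kernel larger than 0xf000000 - 0x100000 bytes would run into the initramfs and fail it.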

IRQCHIP, PIT2: They emulate the required machinery for handling interrupts inside the guest
CPUID: Setting the CPUID features influences the behaviour of the cpuid instruction inside the VM. kvm_get_supported_cpuid gets us the feature set of the host CPU, which we expose to the guest
Identity Map, TSS (Task State Segment): Intel-specific quirks, these pages can be located anywhere in the first 4GB of guest memory. We opt to store the Identity Map at 0xFFFFC000 and TSS at 0xFFFFD000 (one page after the Identity Map), same as most other projects

@@ -42,15 +42,24 @@ impl Kvm {
         let kvm =
             unsafe { OwnedFd::from_raw_fd(fcntl::open("/dev/kvm", OFlag::O_RDWR, Mode::empty())?) };
         let vm = unsafe { OwnedFd::from_raw_fd(kvm_create_vm(kvm.as_raw_fd(), 0)?) };
+
+        // TODO refactor this, it should be done outside `new`
+        unsafe {
+            kvm_create_irqchip(vm.as_raw_fd())?;
+            kvm_create_pit2(vm.as_raw_fd(), &kvm_pit_config::default())?;
+
+            let idmap_addr = 0xFFFFC000;
+            kvm_set_identity_map_addr(vm.as_raw_fd(), &idmap_addr)?;
+        };
+
         let vcpu = unsafe { OwnedFd::from_raw_fd(kvm_create_vcpu(vm.as_raw_fd(), 0)?) };
@@ -124,6 +133,23 @@ impl Kvm {
         Ok(())
     }
 
+    pub fn set_tss_addr(&self, addr: u64) -> Result<(), std::io::Error> {
+        unsafe { kvm_set_tss_addr(self.vm.as_raw_fd(), addr)? };
+
+        Ok(())
+    }
+
+    pub fn setup_cpuid(&self) -> Result<(), std::io::Error> {
+        let mut cpuid2 = CpuId::new(80).expect("should not fail to construct CpuId!");
+
+        unsafe {
+            kvm_get_supported_cpuid(self.kvm.as_raw_fd(), cpuid2.as_mut_fam_struct_ptr())?;
+            kvm_set_cpuid2(self.vcpu.as_raw_fd(), cpuid2.as_fam_struct_ptr())?;
+        };
+
+        Ok(())
+    }
+
     pub fn run(&self) -> Result<*const kvm_run, std::io::Error> {

That's it! Now, putting all these APIs together:

const CMDLINE: &[u8] = b"console=ttyS0 earlyprintk=ttyS0 rdinit=/init\0";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let kvm = Kvm::new()?;
    let mut bz_image = Vec::new();

    File::open(env::args().nth(1).expect("no bzImage passed!"))
        .expect("failed to open bzImage!")
        .read_to_end(&mut bz_image)
        .expect("failed to read!");

    let mut initramfs = Vec::new();

    File::open(env::args().nth(2).expect("no initramfs passed!"))
        .expect("failed to open initramfs")
        .read_to_end(&mut initramfs)
        .expect("failed to read!");

    let loader = BzImage::new(
        &bz_image,
        ADDR_CMDLINE.try_into().expect("cmdline address too large!"),
        Some(
            ADDR_INITRAMFS
                .try_into()
                .expect("initramfs address too large!"),
        ),
        Some(initramfs.len().try_into().expect("initramfs too big")),
        &[
            ...
            // Memory after the beginning of the kernel image
            boot_e820_entry {
                addr: 0x100000,
                size: MAPPING_SIZE as u64 - 0x100000,
                type_: 1,
            },
        ],
    )
    .expect("failed to construct loader!");

    // Create a mapping for the "user" memory region where we'll copy the
    // startup code into
    let wrapped_mapping = WrappedAutoFree::new(...);

    let mapped_slice = unsafe { slice::from_raw_parts_mut(*wrapped_mapping as _, MAPPING_SIZE) };

    unsafe {
        ...
        let kernel32 = loader.kernel32_slice();
        std::ptr::copy_nonoverlapping(
            kernel32.as_ptr(),
            wrapped_mapping.add(ADDR_KERNEL32) as *mut _,
            kernel32.len(),
        );
        ...
    }

    util::setup_gdt(mapped_slice);
    util::setup_paging(mapped_slice);

    kvm.set_user_memory_region(0x0, MAPPING_SIZE as u64, *wrapped_mapping as u64)?;
    kvm.set_vcpu_regs(&util::setup_regs(
        // 64-bit code is located 512 bytes ahead of the 32-bit code
        ADDR_KERNEL32 as u64 + 0x200,
        // boot params are stored in rsi
        ADDR_BOOT_PARAMS as u64,
    ))?;
    kvm.set_vcpu_sregs(&util::setup_sregs())?;
    kvm.set_tss_addr(0xFFFFD000)?;
    kvm.setup_cpuid()?;

    let mut buffer = String::new();

    loop {
        let kvm_run = kvm.run()?;

        unsafe {
            match (*kvm_run).exit_reason {
                KVM_EXIT_HLT => {
                    eprintln!("KVM_EXIT_HLT");
                    break;
                }
                KVM_EXIT_IO => {
                    let port = (*kvm_run).__bindgen_anon_1.io.port;
                    let byte = *((kvm_run as u64 + (*kvm_run).__bindgen_anon_1.io.data_offset)
                        as *const u8);

                    if port == 0x3f8 {
                        match byte {
                            b'\r' | b'\n' => {
                                println!("{buffer}");
                                buffer.clear();
                            }
                            c => {
                                buffer.push(c as char);
                            }
                        }
                    }

                    eprintln!("IO for port {port}: {byte:#X}");

                    // `in` instruction, tell it that we're ready to receive data (XMTRDY)
                    // arch/x86/boot/tty.c
                    if (*kvm_run).__bindgen_anon_1.io.direction == 0 {
                        *((kvm_run as *mut u8)
                            .add((*kvm_run).__bindgen_anon_1.io.data_offset as usize)) = 0x20;
                    }
                }
                reason => {
                    eprintln!("Unhandled exit reason: {reason}");
                    break;
                }
            }
        }
    }

    Ok(())
}
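The serial handling inside the exit loop can be pulled out into a small type that is testable without KVM. The sketch below is one possible shape for it; `LineBuffer` and `collect_lines` are names assumed for this example. Unlike the loop above, it skips empty lines, so a "\r\n" pair does not produce a blank line:

```rust
/// Collects bytes written to the serial port into complete lines
#[derive(Default)]
pub struct LineBuffer {
    buffer: String,
}

impl LineBuffer {
    /// Feed one byte, returns a finished line when a line ending arrives
    pub fn push(&mut self, byte: u8) -> Option<String> {
        match byte {
            b'\r' | b'\n' => {
                if self.buffer.is_empty() {
                    // Second half of a "\r\n" pair, or a blank line
                    None
                } else {
                    // Hand out the line and leave an empty buffer behind
                    Some(std::mem::take(&mut self.buffer))
                }
            }
            c => {
                self.buffer.push(c as char);
                None
            }
        }
    }
}

/// Convenience helper: run a whole byte stream through a `LineBuffer`
pub fn collect_lines(bytes: &[u8]) -> Vec<String> {
    let mut lines = LineBuffer::default();
    bytes.iter().filter_map(|&byte| lines.push(byte)).collect()
}
```

In the exit loop, the `KVM_EXIT_IO` arm for port 0x3f8 would then shrink to calling `push` with the byte and printing whatever it returns.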

Building a Kernel

We'll build a tiny kernel for testing our hypervisor, starting with make tinyconfig, and enabling these options to make it usable (options should be self explanatory):

General setup  --->
    [*] Initial RAM filesystem and RAM disk (initramfs/initrd) support
    [*] Configure standard kernel features (expert users)  --->
        [*]   Enable support for printk
[*] 64-bit kernel
Executable file formats  --->
    [*] Kernel support for ELF binaries
Device Drivers  --->
    Generic Driver Options  --->
        [*] Maintain a devtmpfs filesystem to mount at /dev
Kernel hacking  --->
    printk and dmesg options  --->
        [*] Show timing information on printks
    x86 Debugging  --->
        [*] Enable verbose x86 bootup info messages
        [*] Early printk

Note that the kernel command line we pass is console=ttyS0 earlyprintk=ttyS0 rdinit=/init, and our code prints out the data written to I/O port 0x3f8, which is the port the kernel uses for ttyS0

Now, running this kernel as it is would just panic as there's no init found:

# First argument is the kernel image and 2nd is the initramfs
# We don't have an initramfs so we pass /dev/null
# stderr is redirected as we don't want verbose output about IO ports
$ cargo run /tmp/linux-6.8.6/arch/x86/boot/bzImage /dev/null 2>/dev/null
early console in extract_kernel
input_data: 0x00000000018b8298
input_len: 0x00000000000a87c0
output: 0x0000000001000000
output_len: 0x0000000000936900
kernel_total_size: 0x0000000000818000
needed_size: 0x0000000000a00000
trampoline_32bit: 0x0000000000000000
Decompressing Linux... Parsing ELF... done.
Booting the kernel (entry_offset: 0x0000000000000000).
[    0.000000] Linux version 6.8.6 (testuser@shed) (gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.42) #5 Mon May  6 00:10:00 IST 2024
[    0.000000] Command line: console=ttyS0 earlyprintk=ttyS0 rdinit=/init
...
[    0.055361] Run /bin/sh as init process
[    0.055659] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    0.056588] Kernel Offset: disabled
[    0.056813] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---

For now, we can just provide a dummy init program that does an arbitrary task, like printing out the contents of /dev:

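#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

// Bounds checks omitted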
int main(void) {
  char msg[4096] = "Hello from userspace!";
  size_t idx = 0;

  if (mount("dev", "/dev", "devtmpfs", 0, NULL) == -1) {
    return EXIT_FAILURE;
  }

  int kmsg = open("/dev/kmsg", O_WRONLY | O_APPEND);
  if (kmsg == -1) {
    return EXIT_FAILURE;
  }

  DIR *dir = opendir("/dev");
  if (!dir) {
    return EXIT_FAILURE;
  }

  // Write the original message once before overwriting it
  if (write(kmsg, msg, strlen(msg)) == -1) {
    return EXIT_FAILURE;
  }

  for (struct dirent *dp = NULL; (dp = readdir(dir)) != NULL;) {
    for (char *name = dp->d_name; *name != '\0'; name++) {
      msg[idx++] = *name;
    }

    msg[idx++] = ' ';
  }

  msg[idx++] = '\0';

  if (write(kmsg, msg, idx) == -1) {
    return EXIT_FAILURE;
  }

  closedir(dir);
  close(kmsg);
}

We mount a devtmpfs filesystem at /dev to access /dev/kmsg, which lets us write to the kernel log buffer. We write to it directly as a hack, as we don't emulate a full-fledged serial console yet, so whatever we'd print to stdout would not make its way to us :p

As for getting this executed, we don't have any mechanism to share the filesystem directly for now, so we will just wrap this up in a dummy initramfs, bypassing the need for a filesystem altogether. We did something similar in this post about init systems:

# Must be a static binary as we don't have a proper rootfs setup
$ cc contrib/init.c -o init -static
# cpio takes the file list from stdin
$ echo init | cpio -o -H newc > initramfs
$ cargo run /tmp/linux-6.8.6/arch/x86/boot/bzImage initramfs 2>/dev/null
...

[    0.049393] Run /init as init process
[    0.049715] Hello from userspace!
[    0.049718] . .. kmsg urandom random full zero null 
[    0.049934] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
[    0.050717] Kernel Offset: disabled
[    0.050935] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000 ]---

It now prints the desired output, but we still get a panic as the init process exits, which is expected. Note that even though we used this tiny kernel here, even a larger distro kernel would boot just fine (you can try it!)

Conclusion

The final code for this post can be found here

We've got our hypervisor to boot into userspace, but it's still quite limited in functionality. We'll be covering VirtIO devices next so we can provide a serial console, and also cleaning up the un-rusty code full of raw pointers and as T spam :p

There was also quite a bit of debugging involved that was not covered here, which would probably need its own dedicated post. But more or less you're on your own here without a debugger. There's an enable_debug() function which enables single-step debugging, allowing instruction-by-instruction tracing, which can then be used to print out the state of registers at each step. Then, the instruction pointer can be correlated with the disassembly of the kernel image, adjusted for the offset calculated in kernel_byte_offset(). The 64-bit code starts here in arch/x86/boot/compressed/head_64.S, and inserting asm("hlt") statements here and there might help binary search the code being executed
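The correlation between the instruction pointer and the image boils down to a bit of arithmetic, sketched below. The function name `rip_to_image_offset` and its parameters are assumptions for this example. Note that the result only makes sense while the guest is still executing from the location we loaded it at, since the startup code relocates and decompresses the kernel later on:

```rust
/// Map a guest instruction pointer back to a byte offset in the bzImage
/// file, given the address the protected-mode kernel was loaded at and its
/// offset inside the file. Returns `None` if `rip` is below the load address.
pub fn rip_to_image_offset(rip: u64, load_addr: u64, kernel_byte_offset: u64) -> Option<u64> {
    rip.checked_sub(load_addr)
        .map(|delta| delta + kernel_byte_offset)
}
```

For instance, with the kernel loaded at 0x100000 and a hypothetical file offset of 0x4000, the 64-bit entry point at 0x100200 maps to byte 0x4200 of the image.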