Building a hypervisor, 1: KVM Introduction
In this series of blog posts, we'll build a KVM-based hypervisor from scratch, eventually making it boot the Linux kernel. The first post will go over a "hello world" KVM example in Rust, and assumes basic systems programming knowledge (writing trivial assembly, using syscalls and ioctls, etc.)
The KVM API
The KVM API allows us to execute code in a virtualized environment and emulate the hardware it needs. The virtualization is hardware-assisted, so we can't virtualize code built for another CPU architecture. The whole KVM subsystem is centered around ioctl calls, which are a bit boilerplate-y to write by hand in Rust (due to error handling), so we'll use the nix crate's ioctl helpers.
The flow for running a VM goes like this:
1. Obtain a handle to the KVM subsystem (/dev/kvm)
2. Issue a KVM_CREATE_VM ioctl on the handle to obtain a VM handle, i.e. a VM with no memory or vCPUs
3. Issue a KVM_CREATE_VCPU ioctl on the VM handle to obtain a vCPU handle, responsible for actually executing the code
4. Associate a shared memory region for communication with the guest with the vCPU handle (KVM_GET_VCPU_MMAP_SIZE + mmap)
5. Associate another memory region with the guest containing the code to be executed, using the KVM_SET_USER_MEMORY_REGION ioctl
6. Set up the guest's vCPU registers to execute the code, making progress on the execution by issuing the KVM_RUN ioctl
           vCPU
          /
KVM -> VM
          \
           Memory
Now, coming to making ioctl calls: in C, one would just call ioctl() with the appropriate arguments (error checking omitted):
int kvm = open("/dev/kvm", O_RDWR);
int vm = ioctl(kvm, KVM_CREATE_VM, 0);
In Rust, the aforementioned ioctl helpers allow us to generate a wrapper for these calls, including error handling. Refer to linux/kvm.h for the request code:
ioctl_write_int_bad!(kvm_create_vm, request_code_none!(KVMIO, 0x01));
This will generate a kvm_create_vm function that checks errno for errors and allows us to cleanly propagate them. We can similarly generate wrappers for all the other ioctl calls we discussed before:
ioctl_write_int_bad!(kvm_get_vcpu_mmap_size, request_code_none!(KVMIO, 0x04));
ioctl_write_int_bad!(kvm_run, request_code_none!(KVMIO, 0x80));
ioctl_write_int_bad!(kvm_create_vcpu, request_code_none!(KVMIO, 0x41));
ioctl_write_ptr!(
kvm_set_user_memory_region,
KVMIO,
0x46,
kvm_userspace_memory_region
);
ioctl_write_ptr!(kvm_set_regs, KVMIO, 0x82, kvm_regs);
ioctl_read!(kvm_get_sregs, KVMIO, 0x83, kvm_sregs);
ioctl_write_ptr!(kvm_set_sregs, KVMIO, 0x84, kvm_sregs);
We use the C structs like kvm_userspace_memory_region from the kvm_bindings crate, which is generated using bindgen.
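The macro invocations above also rely on a KVMIO constant and a few imports that aren't shown. A minimal sketch of that prelude might look like this (KVMIO is 0xAE in linux/kvm.h; adjust if your version of kvm_bindings already exports it, and note the full file needs more use statements, e.g. for OwnedFd and the nix mman/fcntl modules):

use kvm_bindings::{kvm_regs, kvm_run, kvm_sregs, kvm_userspace_memory_region};
use nix::{ioctl_read, ioctl_write_int_bad, ioctl_write_ptr, request_code_none};

// The ioctl "type" shared by all KVM ioctls, from linux/kvm.h
const KVMIO: u8 = 0xAE;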
Building the abstractions
Now, we can define a Kvm struct to provide an abstraction for setting up the VM:
struct Kvm {
    /// KVM subsystem handle
    kvm: OwnedFd,
    /// VM handle
    vm: OwnedFd,
    /// vCPU handle
    vcpu: OwnedFd,
    /// Shared kvm_run structure for communication
    kvm_run: WrappedAutoFree<*mut kvm_run, Box<dyn FnOnce(*mut kvm_run)>>,
}
Here, kvm, vm, and vcpu are just file descriptors, wrapped in the OwnedFd abstraction so that the FDs are closed automatically on drop. Similarly, the mmap'd kvm_run region is wrapped in a custom WrappedAutoFree abstraction, which unmaps the region when dropped by running a cleanup callback.
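The post doesn't show WrappedAutoFree itself; a minimal sketch of such a wrapper (the actual definition in the accompanying repository may differ) is a value paired with a cleanup closure, exposing the value through Deref and running the closure in Drop:

use std::ops::Deref;

/// Holds a value and a cleanup callback that runs when the wrapper is dropped
pub struct WrappedAutoFree<T: Copy, F: FnOnce(T)> {
    val: T,
    free: Option<F>,
}

impl<T: Copy, F: FnOnce(T)> WrappedAutoFree<T, F> {
    pub fn new(val: T, free: F) -> Self {
        Self { val, free: Some(free) }
    }
}

impl<T: Copy, F: FnOnce(T)> Deref for WrappedAutoFree<T, F> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.val
    }
}

impl<T: Copy, F: FnOnce(T)> Drop for WrappedAutoFree<T, F> {
    fn drop(&mut self) {
        // `take()` ensures the callback runs exactly once
        if let Some(free) = self.free.take() {
            free(self.val);
        }
    }
}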
Now, we can implement the new function to perform the basic setup, covering steps 1 to 4 described in the previous section:
pub fn new() -> Result<Self, std::io::Error> {
    let kvm =
        unsafe { OwnedFd::from_raw_fd(fcntl::open("/dev/kvm", OFlag::O_RDWR, Mode::empty())?) };
    let vm = unsafe { OwnedFd::from_raw_fd(kvm_create_vm(kvm.as_raw_fd(), 0)?) };
    let vcpu = unsafe { OwnedFd::from_raw_fd(kvm_create_vcpu(vm.as_raw_fd(), 0)?) };

    // Size of the shared `kvm_run` mapping
    let mmap_size = NonZeroUsize::new(unsafe {
        kvm_get_vcpu_mmap_size(kvm.as_raw_fd(), 0)?
            .try_into()
            .expect("mmap_size too big for usize!")
    })
    .expect("mmap_size is zero");

    let kvm_run = WrappedAutoFree::new(
        unsafe {
            mman::mmap(
                None,
                mmap_size,
                ProtFlags::PROT_READ | ProtFlags::PROT_WRITE,
                MapFlags::MAP_SHARED,
                Some(&vcpu),
                0,
            )? as *mut kvm_run
        },
        Box::new(move |map: *mut kvm_run| unsafe {
            mman::munmap(map as _, mmap_size.get()).expect("failed to unmap kvm_run!");
        }) as _,
    );

    Ok(Self { kvm, vm, vcpu, kvm_run })
}
We obtain the kvm, vm, and vcpu handles using the wrappers described before, converting them to OwnedFds. Then, we ask the kernel for the size of the region to be mapped for the kvm_run structure, and mmap it, associating it with the vcpu file descriptor.
For step 5, we have a wrapper for setting the code-containing memory region, set_user_memory_region:
pub fn set_user_memory_region(
    &self,
    guest_phys_addr: u64,
    memory_size: usize,
    userspace_addr: u64,
) -> Result<(), std::io::Error> {
    unsafe {
        kvm_set_user_memory_region(
            self.vm.as_raw_fd(),
            &kvm_userspace_memory_region {
                slot: 0,
                flags: 0,
                guest_phys_addr,
                memory_size: memory_size as u64,
                userspace_addr,
            },
        )?;
    }

    Ok(())
}
- guest_phys_addr: Since this mapping will be presented as "physical" memory to the guest, this field refers to the address where it will be placed. For example, setting this to 0x1000 will make the mapping accessible at physical address 0x1000 in the guest
- memory_size: The size of the mapping
- userspace_addr: The address of the mapping in the current process (virtual address)
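As a quick illustration of the 0x1000 example above (the values here are purely illustrative, not what the driver code below uses; mapping_addr is a hypothetical host virtual address obtained from mmap and cast to u64):

// Expose 4 KiB of host memory at `mapping_addr` to the guest as "physical"
// memory starting at guest address 0x1000
kvm.set_user_memory_region(0x1000, 0x1000, mapping_addr)?;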
Finally, for step 6, we define a few tiny helpers:
- get_vcpu_sregs: Fetches the vCPU's special registers, pre-filled with defaults, which we later modify. These consist of the segment registers, control registers, and a couple more; see kvm_sregs
pub fn get_vcpu_sregs(&self) -> Result<kvm_sregs, std::io::Error> {
    let mut sregs = kvm_sregs::default();
    unsafe { kvm_get_sregs(self.vcpu.as_raw_fd(), &mut sregs)? };

    Ok(sregs)
}
- set_vcpu_sregs: Sets the vCPU's special registers
pub fn set_vcpu_sregs(&self, regs: *const kvm_sregs) -> Result<(), std::io::Error> {
    unsafe { kvm_set_sregs(self.vcpu.as_raw_fd(), regs)? };
    Ok(())
}
- set_vcpu_regs: Sets the vCPU's general-purpose registers, like the instruction pointer (rip); see kvm_regs. We don't define a get_vcpu_regs here as we don't need to fetch them for now
pub fn set_vcpu_regs(&self, regs: *const kvm_regs) -> Result<(), std::io::Error> {
    unsafe { kvm_set_regs(self.vcpu.as_raw_fd(), regs)? };
    Ok(())
}
- run: Actually makes progress on the execution of the VM by issuing the KVM_RUN ioctl. Relevant information is filled into the kvm_run structure, and control is returned to the VMM whenever a VMexit occurs; so when the guest writes to a serial port, the structure will contain information like the port and the data to be written to it
pub fn run(&self) -> Result<*const kvm_run, std::io::Error> {
    unsafe { kvm_run(self.vcpu.as_raw_fd(), 0)?; }

    // The `kvm_run` struct is filled with new data as it was associated
    // with the `vcpu` FD in the mmap() call
    Ok(*self.kvm_run as _)
}
NOTE: The dereference operator might seem a bit confusing here. We use it to get the actual pointer to the structure out of the WrappedAutoFree object, which implements the Deref trait; *self.kvm_run yields the wrapped *mut kvm_run, which we then cast to *const kvm_run.
Driver code
Now, we can finally write the code to drive this VM! At this stage, the VM will run in real mode, which means the guest accesses "physical" memory directly and can only execute 16-bit code:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // We don't need a large mapping as our code is tiny
    // Must be page-size aligned, so minimum is 4KiB
    const MAP_SIZE: usize = 0x1000;

    let mut code = Vec::new();
    // Read the passed file into the `code` buffer
    File::open(env::args().nth(1).expect("no argument passed"))?.read_to_end(&mut code)?;

    let kvm = Kvm::new()?;

    // Mapping to store the code
    // MAP_ANONYMOUS is used as we're not backing this mapping by any fd
    let mapping = WrappedAutoFree::new(
        unsafe {
            mman::mmap(
                None,
                NonZeroUsize::new(MAP_SIZE).expect("mapping size is zero"),
                ProtFlags::PROT_READ | ProtFlags::PROT_WRITE,
                MapFlags::MAP_ANONYMOUS | MapFlags::MAP_SHARED,
                None::<BorrowedFd>,
                0,
            )?
        },
        |map| unsafe {
            mman::munmap(map, MAP_SIZE).expect("failed to unmap user memory region");
        },
    );

    assert!(code.len() < MAP_SIZE);

    // The idiomatic way is to write a wrapper struct for `mmap`-ing regions
    // and exposing it as a slice (std::slice::from_raw_parts),
    // but we just copy the code directly here
    unsafe {
        std::ptr::copy_nonoverlapping(code.as_ptr(), *mapping as *mut _, code.len());
    };

    let mut sregs = kvm.get_vcpu_sregs()?;

    // CS points to the reset vector by default
    sregs.cs.base = 0;
    sregs.cs.selector = 0;

    kvm.set_vcpu_sregs(&sregs)?;
    kvm.set_user_memory_region(0, MAP_SIZE, *mapping as u64)?;
    kvm.set_vcpu_regs(&kvm_regs {
        // Bit 1 of RFLAGS is reserved and must always be set on x86
        rflags: 1 << 1,
        // The instruction pointer is set to 0 as our code is loaded with 0
        // as the base address
        rip: 0,
        ..Default::default()
    })?;
    loop {
        let kvm_run = kvm.run()?;

        unsafe {
            match (*kvm_run).exit_reason {
                KVM_EXIT_HLT => break,
                KVM_EXIT_IO => { /* Handle IO here */ }
                reason => panic!("Unhandled exit reason: {reason}"),
            }
        }
    }

    Ok(())
}
We accept a path to the compiled 16-bit real-mode program as the first argument, and copy it into the shared mapping. The size of the mapping can be relatively small as our code won't do much, but it must be aligned to the page size, which is 4KiB. Then, we fetch the special registers and make the code segment point to address 0, rather than the reset vector. The general-purpose registers are all set to their defaults (zero), except for rflags, which has reserved bit 1 set; the instruction pointer (rip) also holds 0, as our code is loaded at physical address 0.
Finally, we have a loop to repeatedly issue the KVM_RUN ioctl. Whenever a VMexit is generated, we get back some information in the shared kvm_run mapping (as mentioned before), which we can use to emulate the behaviour of a physical device for the guest. The exit_reason tells us what exactly prompted a VMexit, but we only care about two constants here, treating any other value as an error:
- KVM_EXIT_HLT - The guest executed the hlt instruction, which we can use as an indication to stop the VM
- KVM_EXIT_IO - The guest performed some I/O on a serial port; we will implement a handler for this in the next section
Some assembly required
Now, we will write a simple 16-bit assembly program to print the string Hello, KVM!:
; Output to port 0x3f8
mov dx, 0x3f8
; Store the address of the message in bx, so we can increment it
mov bx, message

loop:
    ; Load a byte from `bx` into the `al` register
    mov al, [bx]
    ; Jump to the `hlt` instruction if we encountered the NUL terminator
    cmp al, 0
    je end
    ; Output to the serial port
    out dx, al
    ; Increment `bx` by one byte to point to the next character
    inc bx
    jmp loop

end:
    hlt

message:
    db "Hello, KVM!", 0
Build it with nasm -fbin hello.S -o hello
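If you want to sanity-check the flat binary before running it, ndisasm (which ships with NASM) can disassemble it as 16-bit code; this isn't part of the post's workflow, just a handy check:

$ ndisasm -b 16 hello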
Now, every time we hit the out instruction, it will generate a VMexit, and we'll be able to print out the character:
 unsafe {
     match (*kvm_run).exit_reason {
         KVM_EXIT_HLT => break,
-        KVM_EXIT_IO => { /* Handle IO here */ }
+        KVM_EXIT_IO => {
+            let port = (*kvm_run).__bindgen_anon_1.io.port;
+            let offset = (*kvm_run).__bindgen_anon_1.io.data_offset as usize;
+            let character = *((kvm_run as *const u8).add(offset)) as char;
+
+            println!("Port: {port:#x}, Char: {character}");
+        }
         reason => panic!("Unhandled exit reason: {reason}"),
     }
 }
Accessing kvm_run without any abstractions is quite unwieldy. The __bindgen_anon_* fields are generated by bindgen because C's anonymous unions can't be represented directly in Rust, so they end up as named union fields, of which only one can be active at a time.
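As a small illustration of how one might tuck the union access away, here is a hypothetical helper (not part of the post's code) that pulls the port and the written byte out of a KVM_EXIT_IO exit:

/// Extract the target port and the written byte from a `KVM_EXIT_IO` exit.
/// Safety: `run` must point to the vCPU's mmap'd `kvm_run` region, and the
/// current exit reason must be `KVM_EXIT_IO`.
unsafe fn io_exit_data(run: *const kvm_run) -> (u16, u8) {
    let io = (*run).__bindgen_anon_1.io;
    // `data_offset` is relative to the start of the `kvm_run` mapping
    let byte = *(run as *const u8).add(io.data_offset as usize);
    (io.port, byte)
}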
The io.port field tells us the target port, and io.data_offset gives us the offset into the kvm_run mapping where we can find the written byte, which we use to perform some pointer arithmetic to get the final result:
$ cargo run -- hello
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/intro hello`
Port: 0x3f8, Char: H
Port: 0x3f8, Char: e
Port: 0x3f8, Char: l
Port: 0x3f8, Char: l
Port: 0x3f8, Char: o
Port: 0x3f8, Char: ,
Port: 0x3f8, Char:
Port: 0x3f8, Char: K
Port: 0x3f8, Char: V
Port: 0x3f8, Char: M
Port: 0x3f8, Char: !
Now, obviously, there is quite a bit of overhead when emulating hardware in this manner, as every interaction will cause a VMexit. virtio devices are much more efficient, though we won't be covering them in this post.
Conclusion
In this post, we gave a brief overview of the Linux KVM API and implemented a hello-world-esque hypervisor. In the next posts, we'll be exploring long mode, paging, and implementing the Linux boot protocol to boot a small Linux kernel image. The code for this blog can be found here.