Testing NVMe CMB and PMR#
This is a how-to for exercising an emulated NVMe Controller Memory Buffer
(CMB) and Persistent Memory Region (PMR) inside a guest VM. It covers two
mutually exclusive paths: the SPDK userspace driver (the primary path) and
the in-kernel nvme driver (the no-SPDK alternative).
The guests are produced by the boot flow (f/qsu/boot); see
Inspecting guests for how to inspect a running guest. QEMU exposes the CMB
on PCI BAR 2 and the PMR on BAR 4/5; both are controlled by the NVMe knobs in
the same boot flow.
SPDK versus the kernel nvme driver#
SPDK is a userspace NVMe driver. Its scripts/setup.sh unbinds the
controller from the kernel nvme driver and rebinds it to vfio-pci,
then SPDK drives it from userspace over VFIO. The guest therefore needs three
things the imageless build provides automatically: a VFIO stack, an active
vIOMMU, and hugepages. The in-kernel nvme driver is the opposite: it keeps
the device and uses the CMB itself. The two methods are mutually exclusive on a
given controller at a given time.
- Owns the controller
  vfio-pci under SPDK; nvme for the kernel path.
- CMB
  SPDK uses spdk_nvme_cmb_copy and identify reports it; the kernel exposes /sys/class/nvme/nvmeN/cmb and places submission queues in the CMB.
- PMR
  SPDK uses spdk_nvme_pmr_persistence; Linux does not use the PMR, so you poke the BAR directly.
- Requirements
  SPDK needs VFIO, a vIOMMU, and hugepages; the kernel path needs nothing extra.
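The three SPDK requirements can be checked from inside the guest before running any example. The following is a minimal sketch, not part of the product: the helper name spdk_prereqs and its root parameter are hypothetical, and it assumes the usual Linux locations (/dev/vfio/vfio, /sys/kernel/iommu_groups, and HugePages_Total in /proc/meminfo). Hugepages are normally reserved by setup.sh, so that entry is expected to be false before setup.sh has run.

```python
from pathlib import Path


def spdk_prereqs(root: str = "/") -> dict:
    """Report the three SPDK prerequisites under the given filesystem root."""
    base = Path(root)
    # VFIO stack: the container device node exists once vfio is loaded.
    vfio = (base / "dev/vfio/vfio").exists()
    # Active vIOMMU: the kernel creates one directory per IOMMU group.
    groups_dir = base / "sys/kernel/iommu_groups"
    viommu = groups_dir.is_dir() and any(groups_dir.iterdir())
    # Hugepages: HugePages_Total is non-zero once pages are reserved.
    hugepages = 0
    meminfo = base / "proc/meminfo"
    if meminfo.is_file():
        for line in meminfo.read_text().splitlines():
            if line.startswith("HugePages_Total:"):
                hugepages = int(line.split()[1])
    return {"vfio": vfio, "viommu": viommu, "hugepages": hugepages > 0}


if __name__ == "__main__":
    print(spdk_prereqs())
```

The root parameter exists only so the check can be pointed at a copied or fake tree; in the guest the default is what you want.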
What is wired#
All of this is built into the imageless product; you do not configure it per-run beyond choosing the vIOMMU and the NVMe knobs.
- VFIO, VFIO_PCI (=m), and the AMD/Intel/virtio IOMMU drivers come from linux-config-fragments (the imageless preset plus core/vfio.config).
- The vIOMMU caching-mode=on / dma-remap=on settings are emitted by the qemu-system-units vm.env.j2 template whenever an intel or amd vIOMMU is set.
- spdk and the cmb_copy / pmr_persistence examples come from the nixos-flake overlays/spdk.nix and profiles/devel.
- vfio_iommu_type1 allow_unsafe_interrupts=1 is set by the nixos-flake profiles/devel (boot.extraModprobeConfig).
Use iommu=intel-iommu. QEMU’s emulated amd-iommu does not support guest
VFIO: even with dma-remap=on, DPDK (which SPDK uses underneath) fails with
failed to select IOMMU type. intel-iommu with caching-mode=on is the proven
path. The emulated vIOMMU is independent of the host CPU, so it also works on
an AMD host.
Quick start#
Boot a guest with NVMe CMB+PMR and an Intel vIOMMU. A full build picks up any
kernel or closure changes; pass nix_lock.update_lock=true so a
vendored-flake edit reaches the closure.
$ wmill flow run f/qsu/bringup -d '{
    "kernel_source": "build", "closure_source": "build",
    "qemu_source": "nixpkgs",
    "nix_lock": {"update_lock": true},
    "boot_vm": {"auto_vm_name": false, "vm_name": "vm-spdk"},
    "boot_qemu": {"iommu": "intel-iommu"},
    "boot_nvme": {"nvme_drive_count": 4, "customize_drives": true,
                  "cmb_size_mb": "64", "pmr_size": "16777216",
                  "pmr_share": true}
  }'
Then ssh vm-spdk. To change the vIOMMU later, reconfigure in place (reuse
the kernel and closure, just re-render): set kernel_source and
closure_source to reuse, reuse_from_vm to the guest, and flip
boot_qemu.iommu.
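As a rough illustration of the reconfigure step, the sketch below builds the flow arguments named in the paragraph above and prints the matching command line. It is hedged on several points: the helper names are hypothetical, placing reuse_from_vm at the top level of the arguments is an assumption, and any other arguments the flow may need for a reconfigure (for example the NVMe knobs) are left out because this page does not state them.

```python
import json
import shlex


def reconfigure_args(vm_name: str, iommu: str) -> dict:
    """Flow arguments that reuse the kernel and closure and flip the vIOMMU."""
    return {
        "kernel_source": "reuse",
        "closure_source": "reuse",
        # Assumption: reuse_from_vm is a top-level argument of the flow.
        "reuse_from_vm": vm_name,
        "boot_qemu": {"iommu": iommu},
    }


def bringup_command(args: dict) -> str:
    """Render the wmill invocation with the JSON safely shell-quoted."""
    return "wmill flow run f/qsu/bringup -d " + shlex.quote(json.dumps(args))


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(bringup_command(reconfigure_args("vm-spdk", "intel-iommu")))
```

Generating the JSON with json.dumps avoids the quoting mistakes that are easy to make when editing a multi-line payload by hand.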
Using SPDK#
Run these as root in the guest. setup.sh reserves hugepages and binds the
NVMe controllers to vfio-pci; allow_unsafe_interrupts is already set by
the devel profile, so no manual sysfs poke is needed.
$ SPDK=$(dirname $(dirname $(readlink -f $(command -v spdk_nvme_identify))))
$ HUGEMEM=1024 "$SPDK"/scripts/setup.sh # bind vfio-pci + hugepages
$ "$SPDK"/scripts/setup.sh status # list the NVMe BDFs
The BDFs shift with the vIOMMU. Pick a controller BDF from status (with
intel-iommu they are 0000:00:04.0 through 0000:00:07.0).
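Because the BDFs move, it can help to discover them rather than hard-code them. The sketch below is one possible way to do that without SPDK: it walks the PCI devices in sysfs, keeps those whose class code is 0x010802 (the standard PCI class for an NVMe controller), and reports which driver currently owns each one. The function name nvme_controllers and its root parameter are hypothetical.

```python
import os
from pathlib import Path

NVME_CLASS = "0x010802"  # PCI class code: mass storage, NVM, NVMe interface


def nvme_controllers(root: str = "/") -> dict:
    """Map each NVMe controller BDF to its bound driver name (None if unbound)."""
    devices = Path(root) / "sys/bus/pci/devices"
    found = {}
    if not devices.is_dir():
        return found
    for dev in sorted(devices.iterdir()):
        class_file = dev / "class"
        if not class_file.is_file():
            continue
        if class_file.read_text().strip() != NVME_CLASS:
            continue
        driver = dev / "driver"
        # The driver entry is a symlink; its target's basename is the driver name.
        if driver.is_symlink():
            found[dev.name] = os.path.basename(os.readlink(driver))
        else:
            found[dev.name] = None
    return found


if __name__ == "__main__":
    for bdf, driver in nvme_controllers().items():
        print(bdf, driver)
```

After setup.sh the driver column should read vfio-pci; after setup.sh reset it should read nvme again.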
Identify#
This proves SPDK drives the device over VFIO and reports CMB/PMR:
$ spdk_nvme_identify -r 'trtype:PCIe traddr:0000:00:04.0' \
| grep -iE 'Memory Buffer|Persistent Memory'
# Controller Memory Buffer Support
# Persistent Memory Region Support
PMR persistence#
The -p device, -n nsid, -r/-w LBAs, and -l count are all
mandatory:
$ spdk_nvme_pmr_persistence -p 0000:00:04.0 -n 1 -r 0 -l 1 -w 0
# attach_cb - attached 0000:00:04.0!
# PMR Data is Persistent across Controller Reset
CMB copy#
Copy a namespace between controllers using one controller’s CMB as the data
buffer. The parameters are <pci>-<ns>-<startLBA>-<nLBAs>; -c is the
controller whose CMB to use:
$ spdk_nvme_cmb_copy -r 0000:00:04.0-1-0-16 -w 0000:00:05.0-1-0-16 \
-c 0000:00:04.0
# attached 0000:00:04.0! / attached 0000:00:05.0! (exit 0)
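The <pci>-<ns>-<startLBA>-<nLBAs> parameters are easy to mistype, so a small helper can build them. This is only a sketch of the format described above: CopySpec, parse_copy_spec, and cmb_copy_args are hypothetical names, and the only flags used are the -r, -w, and -c shown in the example.

```python
from typing import NamedTuple


class CopySpec(NamedTuple):
    pci: str        # controller BDF, for example 0000:00:04.0
    nsid: int       # namespace ID
    start_lba: int  # first LBA of the range
    num_lbas: int   # number of LBAs in the range

    def __str__(self) -> str:
        return f"{self.pci}-{self.nsid}-{self.start_lba}-{self.num_lbas}"


def parse_copy_spec(text: str) -> CopySpec:
    """Parse a <pci>-<ns>-<startLBA>-<nLBAs> string."""
    # A BDF contains no '-', so split the three numeric fields off the right.
    pci, nsid, start, count = text.rsplit("-", 3)
    return CopySpec(pci, int(nsid), int(start), int(count))


def cmb_copy_args(read: CopySpec, write: CopySpec, cmb_pci: str) -> list:
    """Argument vector for spdk_nvme_cmb_copy, suitable for subprocess.run."""
    return ["spdk_nvme_cmb_copy", "-r", str(read), "-w", str(write), "-c", cmb_pci]
```

For the copy shown above, cmb_copy_args(parse_copy_spec("0000:00:04.0-1-0-16"), parse_copy_spec("0000:00:05.0-1-0-16"), "0000:00:04.0") reproduces the same command line.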
When done, return the controllers to the kernel with
"$SPDK"/scripts/setup.sh reset.
Kernel-only access (no SPDK)#
With the controllers on the in-kernel nvme driver (the default, or after
setup.sh reset), the CMB is reachable through P2PDMA and the PMR through
direct BAR access.
CMB via P2PDMA#
The kernel registers the CMB as P2P memory and places I/O submission queues in
it. This needs CONFIG_PCI_P2PDMA=y, which the imageless preset sets:
$ cat /sys/bus/pci/devices/0000:00:04.0/p2pmem/{size,published,available}
$ cat /sys/class/nvme/nvme0/cmb # cmbsz bit0 (SQS) set => SQs in CMB
$ dd if=/dev/nvme0n1 of=/dev/null bs=1M count=8 # exercise the CMB SQ path
An available value below size means the kernel has allocated SQs out of the
CMB. If PCI_P2PDMA were off, dmesg would show failed to register
the CMB and p2pmem/ would be absent.
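To read the cmb attribute without decoding bits by hand, the sketch below splits a CMBSZ value into its fields using the register layout from the NVMe specification (bit 0 SQS, bit 1 CQS, bit 2 LISTS, bit 3 RDS, bit 4 WDS, bits 11:8 SZU, bits 31:12 SZ, with a size unit of 4 KiB times 16 to the power SZU). The function names are hypothetical, and the parser assumes the attribute prints a line containing "cmbsz", a colon, and a hexadecimal value; adjust the pattern if your kernel formats it differently.

```python
import re


def decode_cmbsz(cmbsz: int) -> dict:
    """Split a CMBSZ register value into support flags and size in bytes."""
    szu = (cmbsz >> 8) & 0xF       # size units: 4 KiB * 16**SZU
    sz = (cmbsz >> 12) & 0xFFFFF   # size, counted in those units
    return {
        "sqs": bool(cmbsz & 0x01),    # submission queues supported in the CMB
        "cqs": bool(cmbsz & 0x02),    # completion queues supported
        "lists": bool(cmbsz & 0x04),  # PRP/SGL lists supported
        "rds": bool(cmbsz & 0x08),    # read data supported
        "wds": bool(cmbsz & 0x10),    # write data supported
        "size_bytes": sz * (4096 << (4 * szu)),
    }


def parse_cmb_attr(text: str) -> int:
    """Extract the cmbsz value from the text of the sysfs cmb attribute."""
    match = re.search(r"cmbsz\s*:\s*(?:0?x)?([0-9a-fA-F]+)", text)
    if match is None:
        raise ValueError("no cmbsz field found")
    return int(match.group(1), 16)


if __name__ == "__main__":
    with open("/sys/class/nvme/nvme0/cmb") as attr:
        print(decode_cmbsz(parse_cmb_attr(attr.read())))
```

A size_bytes that matches the configured cmb_size_mb, together with sqs being true, is a quick sanity check that the guest sees the CMB as intended.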
PMR via MMIO#
The Linux nvme driver does not use the PMR, so drive it as a userspace
driver: unbind nvme, enable PMRCTL.EN in BAR0, then read or write the
PMR data in BAR4. Register MMIO needs single aligned word accesses (ctypes,
not an mmap/struct bulk copy), which is why the BAR0 write below goes through
ctypes. The write reaches the share=on backing file in the per-VM
StateDirectory, proving persistence:
# guest, root, after: echo 0000:00:05.0 > /sys/bus/pci/drivers/nvme/unbind
import ctypes, mmap, os
DEV = "/sys/bus/pci/devices/0000:00:05.0"
MAGIC = b"PMRTEST"
# BAR0 holds the controller registers; PMRCTL is at offset 0xE04, EN is bit 0
m0 = mmap.mmap(os.open(DEV + "/resource0", os.O_RDWR | os.O_SYNC), 4096)
base = ctypes.addressof(ctypes.c_char.from_buffer(m0))
# one aligned 32-bit store sets PMRCTL.EN
ctypes.cast(base + 0xE04, ctypes.POINTER(ctypes.c_uint32)).contents.value = 1
# BAR4 holds the PMR data; write the marker at offset 0
m4 = mmap.mmap(os.open(DEV + "/resource4", os.O_RDWR | os.O_SYNC), 4096)
m4[0:len(MAGIC)] = MAGIC
Confirm on the host that the guest write appears in the backing file:
$ grep -a PMRTEST ~/.local/state/qemu-system/<vm>/nvme-pmr-1.img
The PMR size must be a power of two and at least one host page
(f/qsu/common rejects anything smaller). Here 00:05.0 maps to drive
index 1, which maps to nvme-pmr-1.img.
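The size rule is simple enough to check before starting a flow run. The sketch below is an illustration of the stated rule only, not the actual check in f/qsu/common: the function name pmr_size_ok is hypothetical, and mmap.PAGESIZE is the page size of whichever machine runs the script, so pass the host's page size explicitly if you run it elsewhere.

```python
import mmap


def pmr_size_ok(size: int, page_size: int = mmap.PAGESIZE) -> bool:
    """True when size is a power of two and at least one page."""
    # A positive power of two has exactly one bit set, so size & (size - 1) is 0.
    return size >= page_size and (size & (size - 1)) == 0


if __name__ == "__main__":
    # The quick-start value: 16 MiB.
    print(pmr_size_ok(16777216))
```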
Pitfalls#
- failed to select IOMMU type
  Either the vIOMMU is amd-iommu (use intel-iommu instead), or allow_unsafe_interrupts is not set. The devel profile sets it; a non-devel closure needs echo 1 > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts.
- No spdk after editing the vendored nixos-flake
  The closure pins it by narHash, so rebuild with nix_lock.update_lock=true. A new file (for example an overlay) must also be added with git add in vendor/nixos-flake, because a git flake sees only tracked files; copying it in is not enough.
- nixpkgs SPDK lags upstream
  The pinned channel ships an older SPDK than the latest vYY.MM tag. The overlay only recovers the missing example binaries; it does not bump the version.
References#
- SPDK upstream documentation for setup.sh and the NVMe examples.
- The NVMe knobs live in the boot flow’s NVMe group (f/qsu/boot); the CMB/PMR mechanics are in the qemu-system-units nvme.env.j2 macros.