Run an xfstests check#
The f/fstests/check flow runs an xfstests ./check against an
already-booted, fstests-ready guest: the Windmill equivalent of an xfstests
./check run. The guest is produced separately by f/qsu/bringup with a
writable fstests virtiofs share mounted at /var/lib/xfstests and the
test/scratch NVMe drives attached.
The flow is thin and mirrors xfstests/systemd vocabulary one-to-one:
discover: gate the guest over vsock-SSH and enumerate its devices andFSTYP.render_config: writelocal.config(HOST_OPTIONS) and thecheck.envEnvironmentFileonto the host side of the share. Its[section]names drive the loop.for each
sectionin turn:start→wait→collect.report: fold the per-section results into one verdict.
On the guest each [section] runs as a xfstests@<section>.service
template unit started with --no-block, executing ./check -s <section>.
The unit sets TimeoutStartSec to infinity, so a section is never
bounded by systemd’s start timeout. The patched check runs each individual
test inside its own transient scope, fstests-<test>.scope (for example
fstests-generic-310.scope), created with systemd-run in --scope
mode. This is what makes a single test independently observable and killable
from outside the run.
Every step carries a worker tag: the quick lifecycle and control steps run on
the vm tag, and the long-lived wait poll runs on the vm-run tag, so
a long run never starves a quick control op. The vm-run worker instance
count is the concurrent-test-run cap; see Nix.
Service units to query#
A run exposes two kinds of systemd object on the guest, which you drive with the
tools in Inspecting guests (systemctl --host <vm> … for the units, ssh <vm>
journalctl … for their logs):
xfstests@<section>.service: one per[section], running./check -s <section>. The<section>is the name as it appears inlocal.config, for examplexfs_realtime_rtx2_bs4k_ss4k.fstests-<test>.scope: the transient scope wrapping the single test currently executing inside that section, for examplefstests-generic-310.scope.
How a flow surfaces its state in the Windmill job log, and why these recipes are the out-of-band view, is covered in Inspecting guests.
Querying section status and logs#
List the sections currently running on a guest, and the per-test scope inside the live section:
$ systemctl --host <vm> list-units 'xfstests@*'
$ systemctl --host <vm> list-units --type=scope # the fstests-<test>.scope
Full status of one section (the cgroup line shows the running ./check and
the current test’s helper processes):
$ systemctl --host <vm> status xfstests@<section>.service
The three properties the wait step polls to decide a section is done are
Result, ExecMainStatus and ActiveState; read them the same way:
$ systemctl --host <vm> show xfstests@<section>.service \
--property=Result --property=ExecMainStatus --property=ActiveState
ActiveState=activating means the section is still running, active or
failed is terminal; Result carries systemd’s outcome enum
(success / exit-code / signal / timeout / …). Follow the live
journal of a section, the same stream the job log shows:
$ ssh <vm> journalctl --unit=xfstests@<section>.service --follow
Each test’s progress line (generic/310, then its elapsed seconds), its
[failed, ...] verdict, and the .out.bad path on a mismatch all appear
here. The per-section results, the .out.bad diffs and the check.log also
land on the host side of the share under
$WORKERS_DIR/shared/fstests/<vm>/<kver>/ once collect runs, and the
folded run verdict is written to report.json in that directory.
Restarting a hung test#
A single test can wedge: a livelock, or a thread stuck in uninterruptible
sleep. Because the section unit is TimeoutStartSec=infinity, nothing bounds
that one test unless the per-test watchdog is armed. The Per-test Timeout
form field (test_timeout → TEST_TIMEOUT) sets each test’s scope
RuntimeMaxSec, so systemd kills an overrunning test and the run
continues; it is 0 (no limit) by default, taking effect only on a guest
built with the patched xfstests. When it is unset, or you want to intervene on
a run already in flight, kill the test by hand: this reproduces exactly what the
watchdog would have done.
The symptom is a section that makes no progress: its journal stops emitting new
generic/<n> lines and status keeps reporting activating for far
longer than the test should take. Find the in-flight scope and confirm which
test it is:
$ systemctl --host <vm> list-units --type=scope
UNIT ACTIVE SUB DESCRIPTION
fstests-generic-310.scope active running [systemd-run] ... generic/310
Kill that scope:
$ systemctl --host <vm> kill --signal=SIGKILL fstests-generic-310.scope
check sees the test killed, records it as a failure, and proceeds to the
next test. The failure surfaces as an output mismatch with exit status 137
(128 + SIGKILL), the diff naming the killed systemd-run --scope command, for
example:
generic/310 [failed, exit status 137]- output mismatch
-*** done
+/tmp/xfstests.XXXXXX/check: line 700: NNNNNN Killed systemd-run ...
and the run moves on to generic/311.
To abort the whole section instead of skipping one test, stop its unit (this
is the documented fallback in f/fstests/stop.py, and also what the flow’s
failure_module runs when you cancel the Windmill job):
$ systemctl --host <vm> stop xfstests@<section>.service
$ systemctl --host <vm> reset-failed xfstests@<section>.service
The wait step observes the unit go inactive and the run ends that section.
Cancelling the Windmill job (a clean cancel, not a force-kill of the worker)
runs the failure_module for you, so it tears the running section down on the
guest; a force-kill bypasses that and leaves ./check burning CPU under
TimeoutStartSec=infinity.