Automated visual regression tests
The screenshot test suite is already run during CI in branches + main. There is already tooling written to diff the screenshots. Currently, this diffing is performed manually.
Maybe this could be automated instead? That is: on every pushed/merged commit, run the screenshot tests, and then trigger a downstream pipeline that diffs these screenshot artifacts against a known good baseline.
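As a rough sketch of what the downstream diffing step could do, the snippet below sorts the screenshots of a run into changed, added and removed relative to a baseline directory. It is not the existing diff tooling: it assumes both sets are plain directories of PNG files with matching names, and it compares file hashes rather than pixels, so it is stricter than a perceptual diff. The names `compare_to_baseline`, `baseline_dir` and `candidate_dir` are made up for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_to_baseline(baseline_dir: Path, candidate_dir: Path) -> Dict[str, List[str]]:
    """Classify PNG screenshots relative to the baseline.

    Assumption: screenshots are matched by file name, and any byte-level
    difference counts as a change (no pixel tolerance).
    """
    baseline = {p.name: p for p in baseline_dir.glob("*.png")}
    candidate = {p.name: p for p in candidate_dir.glob("*.png")}
    common = baseline.keys() & candidate.keys()
    return {
        "changed": sorted(
            name for name in common
            if file_digest(baseline[name]) != file_digest(candidate[name])
        ),
        "added": sorted(candidate.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - candidate.keys()),
    }
```

A pipeline job could call this on the two artifact directories and fail whenever any of the three lists is non-empty.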
Baseline
Currently, the baseline is maintained in the phosh-screenshots repo, and there's one baseline per released version.
For an automated regression suite to work, there would need to be a baseline that evolves alongside the code itself. This presents the first challenge - where should this baseline be stored? Storing it directly in Git seems problematic, as the baseline would likely evolve quite frequently, as the code does. I suspect it would add a lot of weight to the Git repository quite rapidly.
Some options here are:
- Use Git LFS to track these objects. Git LFS is a no-go.
- Don't track it in Git at all and instead pull the baseline from pipeline artifact storage
This is the main unresolved impediment for this machinery.
Baseline updates
Once a baseline is established and being checked regularly, the diffing pipeline would be expected to fail quite regularly. Every time a change is made that affects rendering or layout of the UI, in fact. For such a pipeline to not be considered a nuisance, it needs to be very easy to update the baseline after manual review.
Here's how I think it should work:
- The diff pipeline clearly logs which screenshot tests have differences, ideally with a URL that links directly to the artifacts for before+after+diff mask, so that the results can be quickly and easily examined.
- A manual "update screenshot baseline" pipeline is available that can be run, and will push the new "after" screenshots into the baseline for that branch.
In this way, it should be quite painless and quick to understand which screenshots have changed. And if the changes are expected and acceptable, it should be trivial to update the baseline.
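The two steps above could look roughly like the following. This is only a sketch under assumptions: the artifact layout (`before/`, `after/`, `diff/` below a base URL) and the function names are hypothetical, and the baseline is treated as a plain directory, since where it will actually live is still undecided.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def format_report(changed: Iterable[str], artifacts_url: str) -> List[str]:
    """Build log lines that link before/after/diff images for each change.

    Assumption: artifacts are browsable under <artifacts_url>/<kind>/<name>.
    """
    base = artifacts_url.rstrip("/")
    lines = []
    for name in sorted(changed):
        lines.append(f"CHANGED {name}")
        for kind in ("before", "after", "diff"):
            lines.append(f"  {kind}: {base}/{kind}/{name}")
    return lines


def update_baseline(changed: Iterable[str], after_dir: Path, baseline_dir: Path) -> List[Path]:
    """Copy the reviewed "after" screenshots over the baseline copies."""
    baseline_dir.mkdir(parents=True, exist_ok=True)
    updated = []
    for name in sorted(changed):
        target = baseline_dir / name
        shutil.copy2(after_dir / name, target)
        updated.append(target)
    return updated
```

The manual "update screenshot baseline" job would then only need the list of changed names that the diff job already produced.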
Expanding the suite
If this automated visual regression setup were to exist, it could then be expanded to cover more testing setups. That is, a CI test matrix could be set up that runs the screenshot tests against a wide variety of display resolutions and scale factors. Ideally this would help to catch regressions much earlier: during the MR rather than far, far downstream.
Such an idea presents another potential issue though: this may be too taxing on the existing build infra. Before such a thing is rolled out it would need to be discussed with gitlab.gnome.org sysadmins.
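To get a feel for the size of such a test matrix, here is a small sketch that enumerates resolution and scale factor combinations. The concrete resolutions, scale factors and the job naming scheme are placeholders picked for illustration, not a proposed list.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Placeholder values; the real set would need to be agreed on.
RESOLUTIONS = [(360, 720), (720, 1440), (1080, 2340)]
SCALES = [1.0, 1.5, 2.0]


def build_matrix(
    resolutions: Sequence[Tuple[int, int]], scales: Sequence[float]
) -> List[Dict[str, float]]:
    """Return one entry per (resolution, scale) combination."""
    return [
        {"width": width, "height": height, "scale": scale}
        for (width, height), scale in product(resolutions, scales)
    ]


def job_name(entry: Dict[str, float]) -> str:
    """Derive a hypothetical CI job name for a matrix entry."""
    return f"screenshots-{entry['width']}x{entry['height']}@{entry['scale']}"
```

The number of jobs is the product of both lists (nine here), so every extra resolution or scale factor multiplies the load on the runners.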
Determinism in tests
For this whole approach to work at all, some work will need to be done to ensure that the screenshots being taken are more deterministic. It is unacceptable to introduce a new CI pipeline that is flaky because the screenshots keep jumping around for reasons outside of the control of a contributor.
Time
This is the most obvious one. Many screenshots include the top bar or the lockscreen, which shows the current time. At present, this is being obtained directly from GnomeWallClock.
It's been discussed with Guido already, and the current idea is to derive a PhoshWallClock that can return a mocked (and static) time when enabled via PHOSH_DEBUG.
This work is now done and prepped for 0.39: !1408 (merged)
Consistency of screenshots across environments
This is the trickiest part: ensuring that screenshots taken from a dev machine are the same as the ones taken inside a Docker container in CI.
Username
At least in emergency contacts, the username is displayed. We'll need to find a way to mock this.
Guido suggested a small LD_PRELOAD shim for this.
File path in ticketbox prefs
Currently it's defaulting to the home directory of the current user. It would be better to pin this to something fixed so that it isn't subject to CI environmental changes, and is consistent on local development machines as well.
This can be overridden with gsettings already.
Battery info + Wifi + Bluetooth
These are all driven by DBus interfaces already. Spawning a python-dbusmock and using that from test fixtures seems like the way to go here.