Automated visual regression tests

The screenshot test suite already runs in CI, both on branches and on main, and tooling to diff the screenshots already exists. Currently, that diffing is performed manually.

Could this be automated instead? That is: on every pushed or merged commit, run the screenshot tests, then trigger a downstream pipeline that diffs the resulting screenshot artifacts against a known-good baseline.

Baseline

Currently, the baseline is maintained in the phosh-screenshots repo, and there's one baseline per released version.

For an automated regression suite to work, there would need to be a baseline that evolves alongside the code itself. This presents the first challenge: where should this baseline be stored? Storing it directly in Git seems problematic, as the baseline would likely change about as often as the code does. I suspect it would quickly add a lot of weight to the Git repository.

Some options here are:

~~Use Git LFS to track these objects~~ Git LFS is a no-go
Don't track it in Git at all and instead pull the baseline from pipeline artifact storage

This is the main unresolved impediment for this machinery.
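
As a rough illustration of the artifact-storage option, the sketch below builds the GitLab API URL for downloading the artifacts archive of the latest successful job on a given ref. The project path and the job name `screenshots` are assumptions made for the example, not names taken from the existing CI config, and artifact expiry would still need to be considered before relying on this.

```python
from urllib.parse import quote, urlencode

GITLAB_API = "https://gitlab.gnome.org/api/v4"


def baseline_artifacts_url(project: str, ref: str, job: str) -> str:
    """Return the URL of the artifacts archive for the latest successful
    run of `job` on `ref`, using GitLab's job artifacts endpoint."""
    # A project path used as an ID must be URL-encoded, including the slashes.
    project_id = quote(project, safe="")
    ref_name = quote(ref, safe="")
    query = urlencode({"job": job})
    return (
        f"{GITLAB_API}/projects/{project_id}"
        f"/jobs/artifacts/{ref_name}/download?{query}"
    )


if __name__ == "__main__":
    # Project path and job name are assumed for illustration only.
    print(baseline_artifacts_url("World/Phosh/phosh", "main", "screenshots"))
```

A diff pipeline could fetch this archive for the target branch and unpack it as the baseline, which keeps the images out of the Git history entirely.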

Baseline updates

Once a baseline is established and being checked regularly, the diffing pipeline would be expected to fail often: every time a change affects the rendering or layout of the UI, in fact. For such a pipeline not to be considered a nuisance, it needs to be very easy to update the baseline after manual review.

Here's how I think it should work:

The diff pipeline clearly logs which screenshot tests have differences, ideally with a URL that links directly to the artifacts for before+after+diff mask, so that the results can be quickly and easily examined.
A manual "update screenshot baseline" pipeline is available that can be run, and will push the new "after" screenshots into the baseline for that branch.

In this way, it should be quite painless and quick to understand which screenshots have changed. And if the changes are expected and acceptable, it should be trivial to update the baseline.
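
The reporting step described above could look roughly like the following sketch. It is not the existing diff tooling: it uses a byte-for-byte comparison as a stand-in for a real pixel diff, and the directory layout, function names, and artifacts URL are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents (a stand-in for a real pixel diff)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_screenshots(baseline: Path, current: Path) -> dict[str, list[str]]:
    """Classify PNG files by name as changed, added, or removed."""
    old = {p.name: file_digest(p) for p in baseline.glob("*.png")}
    new = {p.name: file_digest(p) for p in current.glob("*.png")}
    return {
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
    }


def format_report(result: dict[str, list[str]], artifacts_url: str) -> list[str]:
    """One log line per affected screenshot; changed ones get a direct link."""
    lines = [f"CHANGED {n}: {artifacts_url}/{n}" for n in result["changed"]]
    lines += [f"ADDED {n}" for n in result["added"]]
    lines += [f"REMOVED {n}" for n in result["removed"]]
    return lines
```

The pipeline would fail whenever the report is non-empty, and the manual "update screenshot baseline" job would then promote the current directory to become the new baseline.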

Expanding the suite

If this automated visual regression setup were to exist, it could then be expanded to cover more testing setups. That is, a CI test matrix could be set up that runs the screenshot tests against a wide variety of display resolutions and scale factors. Ideally this would help to catch regressions much earlier: during the MR rather than far, far downstream.

This presents another potential issue though: such a matrix may be too taxing on the existing build infrastructure. Before it is rolled out, it would need to be discussed with the gitlab.gnome.org sysadmins.
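
To get a feel for how quickly the load grows, here is a small sketch that enumerates such a matrix. The resolutions and scale factors are placeholder values chosen for illustration, not a proposal for what should actually be tested.

```python
from itertools import product

# Placeholder values for illustration only.
RESOLUTIONS = [(360, 720), (720, 1440)]
SCALES = [1.0, 1.5, 2.0]


def build_matrix(resolutions, scales):
    """Return one entry per (resolution, scale) combination."""
    return [
        {"width": width, "height": height, "scale": scale}
        for (width, height), scale in product(resolutions, scales)
    ]


if __name__ == "__main__":
    matrix = build_matrix(RESOLUTIONS, SCALES)
    print(f"{len(matrix)} screenshot jobs per pipeline")
    for entry in matrix:
        print(entry)
```

Since every added resolution or scale factor multiplies the number of screenshot runs, even a modest matrix turns one job into several per pipeline.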

Determinism in tests

For this whole approach to work at all, some work will need to be done to ensure that the screenshots being taken are more deterministic. It is unacceptable to introduce a new CI pipeline that is flaky because the screenshots keep jumping around for reasons outside of the control of a contributor.

Time

This is the most obvious one. Many screenshots include the top bar or the lockscreen, both of which show the current time. At present, the time is obtained directly from GnomeWallClock.

It's been discussed with Guido already, and the current idea is to derive a PhoshWallClock that can return a mocked (and static) time when enabled via PHOSH_DEBUG.

This work is now done and prepped for 0.39: !1408 (merged)

Consistency of screenshots across environments

This is the trickiest part: ensuring that screenshots taken from a dev machine are the same as the ones taken inside a Docker container in CI.

Username

At least in emergency contacts, the username is displayed. We'll need to find a way to mock this.

File path in ticketbox prefs

Currently it defaults to the home directory of the current user. It would be better to pin this to a fixed path so that it isn't subject to CI environment changes, and is consistent on local development machines as well.

This can be overridden with gsettings already.

Battery info + Wifi + Bluetooth

These are all driven by D-Bus interfaces already. Spawning a python-dbusmock instance and using it from the test fixtures seems like the way to go here.
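
For the battery case, a fixture along these lines might be a starting point. It is a minimal sketch that assumes python-dbusmock's bundled `upower` template is suitable; the battery values are made up, the helper name is hypothetical, and how this would be wired into the existing test fixtures is left open.

```python
import subprocess

# Made-up, fixed battery state so that every screenshot shows the same values.
MOCK_BATTERY = {
    "device_name": "mock_BAT0",
    "model_name": "Mock Battery",
    "percentage": 80.0,
    "seconds_to_empty": 3600,
}


def spawn_upower_mock(battery=MOCK_BATTERY):
    """Start a private system bus with a mocked UPower and add one battery.

    Returns the mock server process and the proxy object for the mock.
    """
    # Imported lazily so this module still loads where dbusmock is absent.
    import dbusmock

    dbusmock.DBusTestCase.start_system_bus()
    process, upower = dbusmock.DBusTestCase.spawn_server_template(
        "upower", {"OnBattery": True}, stdout=subprocess.PIPE
    )
    upower.AddDischargingBattery(
        battery["device_name"],
        battery["model_name"],
        battery["percentage"],
        battery["seconds_to_empty"],
    )
    return process, upower
```

The shell under test would need to be started after this call so that it connects to the mocked bus, and the returned process should be terminated once the screenshots have been taken. Wifi and Bluetooth would presumably follow the same pattern with other templates.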

Edited Apr 25, 2024 by Sam Day