test suite times out on arm64 when running the closures test
@mbiebl
Submitted by Michael Biebl Link to original bug (#791706)
Description
Version: 2.54.2 Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880883
In Debian we have the problem that glib does not build reliably on arm64; more specifically, the test suite fails rather often with a timeout when it runs on certain (slower) arm64 hardware.
See https://buildd.debian.org/status/logs.php?pkg=glib2.0&arch=arm64
This seems to happen when the closures test is run, which is killed by the buildd after 150 min. If you want to read the whole bug report, see the referenced downstream bug report. I'm copying here the replies from Punit Agrawal, which seem to be the most relevant ones:
I had a chance to dig into this further and think I now know what's going on.
I ran the test on an AMD Seattle system (same as arm-arm-04) and noticed that the closures test (tests/refcount/closures) didn't finish in >150 minutes. Profiling the test using "perf" pointed to g_closure_ref() and g_closure_unref() as taking the bulk of execution time. This confirms Emilio's suspicion in #20 (closed).
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:uppp'
# Event count (approx.): 1497293182208
#
# Overhead  Command   Shared Object               Symbol
# ........  ........  ..........................  ............................................
#
    67.85%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_ref
    29.10%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_unref
     2.45%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_invoke
     0.22%  closures  closures                    [.] test_closure
     0.11%  closures  closures                    [.] g_closure_unref@plt
Copying the annotated profile of g_closure_ref(). The first column is the % of samples at each instruction as observed by "perf record":
│
│ INC_ASSIGN (closure, ref_count, &new_ref_count);
0.20 │ 30: ldr w1, [x19]
0.31 │ str w1, [x29, #32]
1.06 │ ldr w0, [x29, #32]
0.32 │ ldr w0, [x29, #32]
0.24 │ ldr w2, [x29, #32]
│ add w0, w0, #0x1 <------- A
│ bfxil w2, w0, #0, #15
0.13 │ str w2, [x29, #32]
2.35 │ ldr w0, [x29, #32]
0.46 │ ldr w2, [x29, #32]
│ and w0, w0, #0x7fff
54.73│ 5c: ldxr w3, [x19] <------- B
│ cmp w3, w1
0.02 │ ↓b.ne 70
0.97 │ stlxr w4, w2, [x19]
│ ↑cbnz w4, 5c
38.56│ 70: dmb ish
│ ↑b.ne 30
│ g_return_val_if_fail (new_ref_count > 1, NULL);
0.00 │ cmp w0, #0x1
│ ↓b.ls c8
│
│ return closure;
The ldxr and dmb instructions together account for ~92% of all the samples.
Also, the generated code looks quite bad - notice the redundant loads to w0/w2. It's as if the code was compiled with optimisations turned off, which is not the case (it's compiled with -O2).
The duplicate loads also lead to a large window between incrementing the reference [A] and attempting the compare-and-swap starting at [B]. Another thread performing the same accesses between A and B will cause the cache line to bounce between the two cores and also fail the ldxr, restarting the loop.
Aside: The x86 assembly for ref/unref also has duplicate loads but it seems to not mind them much.
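For context, here is a rough paraphrase of what the INC_ASSIGN/CHANGE_FIELD macros in gobject/gclosure.c do (a sketch, not the verbatim source): snapshot the word holding the bit fields, bump ref_count in the copy, then try to publish the copy with an atomic compare-and-exchange, retrying on failure. The [A] and [B] markers in the disassembly above correspond to the field update and the compare-and-exchange:

#include <glib-object.h>

/* Rough paraphrase (not verbatim glib source) of the INC_ASSIGN /
 * CHANGE_FIELD pattern in gobject/gclosure.c.  All the GClosure bit
 * fields live in one 32-bit word, so the update copies that word,
 * modifies the field in the copy, and publishes the copy with a
 * compare-and-exchange. */
typedef union
{
  GClosure closure;
  gint     vint;          /* the bit-field word viewed as one integer */
} ClosureInt;

static void
closure_inc_ref (GClosure *closure, guint *new_ref_count)
{
  ClosureInt *cunion = (ClosureInt *) closure;
  gint old_int, new_int;

  do
    {
      ClosureInt tmp;

      tmp.vint = old_int = cunion->vint;    /* [A] snapshot the whole word  */
      tmp.closure.ref_count += 1;           /* modify the 15-bit field      */
      *new_ref_count = tmp.closure.ref_count;
      new_int = tmp.vint;
    }
  while (!g_atomic_int_compare_and_exchange (&cunion->vint,   /* [B] the ldxr/stlxr loop */
                                             old_int, new_int));
}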
On further inspection, I noticed the following definition of the GClosure structure.
struct _GClosure
{
  /*< private >*/
  volatile guint ref_count : 15;
  /* meta_marshal is not used anymore but must be zero for historical reasons
     as it was exposed in the G_CLOSURE_N_NOTIFIERS macro */
  volatile guint meta_marshal_nouse : 1;
  volatile guint n_guards : 1;
  volatile guint n_fnotifiers : 2;  /* finalization notifiers */
  volatile guint n_inotifiers : 8;  /* invalidation notifiers */
  volatile guint in_inotify : 1;
  volatile guint floating : 1;
  /*< protected >*/
  volatile guint derivative_flag : 1;
  /*< public >*/
  volatile guint in_marshal : 1;
  volatile guint is_invalid : 1;
  ...
};
Specifically, ref_count is defined as a volatile bit field. That, coupled with the implementation of the increment, explains the bad assembly sequence generated for g_closure_ref()/g_closure_unref().
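A minimal illustration of the effect (hypothetical code, not glib's): because the bit fields are declared volatile, every access to them - even on a local stack copy such as the snapshot taken by INC_ASSIGN - has to be a separate load or store of the containing word, and the compiler may not keep the value in a register. That is where the redundant [x29, #32] loads in the profile above come from:

/* Hypothetical demo struct mirroring GClosure's volatile bit fields. */
struct demo
{
  volatile unsigned int ref_count : 15;
  volatile unsigned int flags     : 17;
};

unsigned int
bump (struct demo *d)
{
  struct demo tmp = *d;   /* snapshot, like INC_ASSIGN does               */
  tmp.ref_count += 1;     /* volatile: load word, modify field, store word */
  return tmp.ref_count;   /* volatile: load the word yet again             */
}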
AFAICT, there is no need for the ref_count (or the other bit fields in the structure) to be volatile, especially as they seem to be updated with gcc __sync builtins.
I have a few ideas on how to improve the code by re-factoring INC_ASSIGN/DEC_ASSIGN to use specialised versions of CHANGE_FIELD in gobject/gclosure.c. I'll give those a shot and see if they lead to an improvement in performance. Happy to discuss any suggestions or try patches and report back.
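One possible shape of such a specialised helper (purely a sketch; the helper name and the exact bit layout are assumptions, based on ref_count occupying the low 15 bits of the first word as in the struct above): read the word once into an ordinary non-volatile local, compute the new value with plain integer arithmetic, and only touch memory atomically in the final compare-and-exchange:

#include <glib.h>

/* Hypothetical specialised increment (a sketch, not an actual glib patch).
 * Assumes ref_count occupies the low 15 bits of the 32-bit word that holds
 * all the GClosure bit fields.  The word is read into a plain local, so the
 * compiler can keep intermediate values in registers; the only synchronised
 * access is the final compare-and-exchange. */
static inline guint
closure_ref_count_inc (gint *word)   /* e.g. &cunion->vint */
{
  guint old_val, new_ref, new_val;

  do
    {
      old_val = (guint) *word;                      /* single plain load     */
      new_ref = ((old_val & 0x7fff) + 1) & 0x7fff;  /* bump the 15-bit field */
      new_val = (old_val & ~0x7fffu) | new_ref;     /* reinsert the field    */
    }
  while (!g_atomic_int_compare_and_exchange (word, (gint) old_val, (gint) new_val));

  return new_ref;
}

Whether something along these lines actually helps on the A57s would of course need measuring; it mainly removes the redundant stack traffic, not the ldxr/stlxr contention itself.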
It might also be worth raising/discussing the findings here with upstream. I am not at all familiar with glib and they might be able to help explain the code structure and have suggestions for improvement.
Since I hadn't dug into this enough... ;)
I also ran a few iterations of the test on some other ARM64 platforms, Juno R2 and Juno R0 (the latter is the same as arm-arm-01). On the R2 the test always finished in a couple of minutes. On the Juno R0, it sometimes finished in a couple of minutes but sometimes got tied up in knots and was still running after 15 min (at which point I terminated the test). After some experiments with restricting the cores the test ran on (using taskset), I concluded that the test took a long time when it ran on the A57s (incidentally the same cores as on the AMD Seattle system) but not when it ran on the A53s.
So it seems the micro-architectural behaviour combined with the poor code sequence makes this especially painful, while other cores seem to chug through without breaking a sweat.
Version: 2.54.x