test suite times out on arm64 when running the closures test
@mbiebl
Submitted by Michael Biebl Link to original bug (#791706)
Description
Version: 2.54.2 Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880883
In Debian we have the problem that glib does not build reliably on arm64; more specifically, the test suite fails rather often with a timeout when it runs on certain (slower) arm64 hardware.
See https://buildd.debian.org/status/logs.php?pkg=glib2.0&arch=arm64
This seems to happen when the closures test is run, which is killed by the buildd after 150 min. If you want to read the whole bug report, see the referenced downstream bug report. I'm copying here the replies from Punit Agrawal, which seem to be the most relevant ones:
I had a chance to dig into this further and think I now know what's going on.
I ran the test on an AMD Seattle system (same as arm-arm-04) and noticed that the closures test (tests/refcount/closures) didn't finish in >150 minutes. Profiling the test using "perf" pointed to g_closure_ref() and g_closure_unref() as taking the bulk of execution time. This confirms Emilio's suspicion in #20 (closed).
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:uppp'
# Event count (approx.): 1497293182208
#
# Overhead  Command   Shared Object               Symbol
# ........  ........  ..........................  ............................................
#
    67.85%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_ref
    29.10%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_unref
     2.45%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_invoke
     0.22%  closures  closures                    [.] test_closure
     0.11%  closures  closures                    [.] g_closure_unref@plt
Copying the annotated profile of g_closure_ref(). The first column is the % of samples at each instruction as observed by "perf record":
│
│ INC_ASSIGN (closure, ref_count, &new_ref_count);
0.20 │ 30: ldr w1, [x19]
0.31 │ str w1, [x29, #32]
1.06 │ ldr w0, [x29, #32]
0.32 │ ldr w0, [x29, #32]
0.24 │ ldr w2, [x29, #32]
│ add w0, w0, #0x1 <------- A
│ bfxil w2, w0, #0, #15
0.13 │ str w2, [x29, #32]
2.35 │ ldr w0, [x29, #32]
0.46 │ ldr w2, [x29, #32]
│ and w0, w0, #0x7fff
54.73│ 5c: ldxr w3, [x19] <------- B
│ cmp w3, w1
0.02 │ ↓b.ne 70
0.97 │ stlxr w4, w2, [x19]
│ ↑cbnz w4, 5c
38.56│ 70: dmb ish
│ ↑b.ne 30
│ g_return_val_if_fail (new_ref_count > 1, NULL);
0.00 │ cmp w0, #0x1
│ ↓b.ls c8
│
│ return closure;
The ldxr and dmb instructions together account for ~92% of all the samples.
Also, the generated code looks quite bad - notice the redundant loads to w0/w2. It's as if the code was compiled with optimisations turned off, which is not the case (it's compiled with -O2).
The duplicate loads also lead to a large window between incrementing the reference [A] and attempting the compare-and-swap starting at [B]. Another thread performing the same accesses between A and B will cause the cache line to bounce between the two cores and also fail the ldxr, restarting the loop.
Aside: The x86 assembly for ref/unref also has duplicate loads but it seems to not mind them much.
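For context, here is a rough paraphrase of what the INC_ASSIGN/CHANGE_FIELD macros in gobject/gclosure.c do (a sketch, not the verbatim source): snapshot the word holding the bit fields, bump ref_count in the copy, then try to publish the copy with an atomic compare-and-exchange, retrying on failure. The [A] and [B] markers in the disassembly above correspond to the field update and the compare-and-exchange:

#include <glib-object.h>

/* Rough paraphrase (not verbatim glib source) of the INC_ASSIGN /
 * CHANGE_FIELD pattern in gobject/gclosure.c.  All the GClosure bit
 * fields live in one 32-bit word, so the update copies that word,
 * modifies the field in the copy, and publishes the copy with a
 * compare-and-exchange. */
typedef union
{
  GClosure closure;
  gint     vint;          /* the bit-field word viewed as one integer */
} ClosureInt;

static void
closure_inc_ref (GClosure *closure, guint *new_ref_count)
{
  ClosureInt *cunion = (ClosureInt *) closure;
  gint old_int, new_int;

  do
    {
      ClosureInt tmp;

      tmp.vint = old_int = cunion->vint;    /* [A] snapshot the whole word  */
      tmp.closure.ref_count += 1;           /* modify the 15-bit field      */
      *new_ref_count = tmp.closure.ref_count;
      new_int = tmp.vint;
    }
  while (!g_atomic_int_compare_and_exchange (&cunion->vint,   /* [B] the ldxr/stlxr loop */
                                             old_int, new_int));
}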
On further inspection, I noticed the following definition of the GClosure structure.
struct _GClosure
{
  /*< private >*/
  volatile guint ref_count : 15;
  /* meta_marshal is not used anymore but must be zero for historical reasons
     as it was exposed in the G_CLOSURE_N_NOTIFIERS macro */
  volatile guint meta_marshal_nouse : 1;
  volatile guint n_guards : 1;
  volatile guint n_fnotifiers : 2;  /* finalization notifiers */
  volatile guint n_inotifiers : 8;  /* invalidation notifiers */
  volatile guint in_inotify : 1;
  volatile guint floating : 1;
  /*< protected >*/
  volatile guint derivative_flag : 1;
  /*< public >*/
  volatile guint in_marshal : 1;
  volatile guint is_invalid : 1;
  ...
};
Specifically, ref_count is defined as a volatile bit field. That, coupled with the implementation of the increment, explains the bad assembly sequence generated for g_closure_ref()/g_closure_unref().
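A minimal illustration of the effect (hypothetical code, not glib's): because the bit fields are declared volatile, every access to them - even on a local stack copy such as the snapshot taken by INC_ASSIGN - has to be a separate load or store of the containing word, and the compiler may not keep the value in a register. That is where the redundant [x29, #32] loads in the profile above come from:

/* Hypothetical demo struct mirroring GClosure's volatile bit fields. */
struct demo
{
  volatile unsigned int ref_count : 15;
  volatile unsigned int flags     : 17;
};

unsigned int
bump (struct demo *d)
{
  struct demo tmp = *d;   /* snapshot, like INC_ASSIGN does               */
  tmp.ref_count += 1;     /* volatile: load word, modify field, store word */
  return tmp.ref_count;   /* volatile: load the word yet again             */
}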
AFAICT, there is no need for the ref_count (or the other bit fields in the structure) to be volatile, especially as they seem to be updated with gcc __sync builtins.
I have a few ideas on how to improve the code by re-factoring INC_ASSIGN/DEC_ASSIGN to use specialised versions of CHANGE_FIELD in gobject/gclosure.c. I'll give those a shot and see if they lead to an improvement in performance. Happy to discuss any suggestions or try patches and report back.
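One possible shape of such a specialised helper (purely a sketch; the helper name and the exact bit layout are assumptions, based on ref_count occupying the low 15 bits of the first word as in the struct above): read the word once into an ordinary non-volatile local, compute the new value with plain integer arithmetic, and only touch memory atomically in the final compare-and-exchange:

#include <glib.h>

/* Hypothetical specialised increment (a sketch, not an actual glib patch).
 * Assumes ref_count occupies the low 15 bits of the 32-bit word that holds
 * all the GClosure bit fields.  The word is read into a plain local, so the
 * compiler can keep intermediate values in registers; the only synchronised
 * access is the final compare-and-exchange. */
static inline guint
closure_ref_count_inc (gint *word)   /* e.g. &cunion->vint */
{
  guint old_val, new_ref, new_val;

  do
    {
      old_val = (guint) *word;                      /* single plain load     */
      new_ref = ((old_val & 0x7fff) + 1) & 0x7fff;  /* bump the 15-bit field */
      new_val = (old_val & ~0x7fffu) | new_ref;     /* reinsert the field    */
    }
  while (!g_atomic_int_compare_and_exchange (word, (gint) old_val, (gint) new_val));

  return new_ref;
}

Whether something along these lines actually helps on the A57s would of course need measuring; it mainly removes the redundant stack traffic, not the ldxr/stlxr contention itself.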
It might also be worth raising/discussing the findings here with upstream. I am not at all familiar with glib and they might be able to help explain the code structure and have suggestions for improvement.
Since I hadn't dug into this enough... ;)
I also ran a few iterations of the test on some other ARM64 platforms, Juno R2 and Juno R0 (the latter is the same as arm-arm-01). On the R2 the test always finished in a couple of minutes. On the Juno R0, it sometimes finished in a couple of minutes but sometimes got tied up in knots and was still running after 15 min (at which point I terminated the test). After some experiments with restricting the cores the test ran on (using taskset), I concluded that the test took a long time when it ran on the A57s (incidentally the same cores as on the AMD Seattle system) but not when it ran on the A53s.
So it seems the micro-architectural behaviour combined with the poor code sequence makes this especially painful, while other cores seem to chug through without breaking a sweat.
Version: 2.54.x