Encoding problems with introspection in g_log_writer_format_fields()
From the user's point of view, the problem appears on Windows with Python apps that call g_log_writer_format_fields via GObject introspection: the call raises a Python exception if the returned string contains non-ASCII characters. The exception looks like:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 103: invalid start byte
The Python code is similar to:
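from gi.repository import GLib
def structured_log_adapter(level, fields, field_count, user_data):
    # Raises UnicodeDecodeError when the formatted text is not valid UTF-8
    message = GLib.log_writer_format_fields(level, fields, True)
GLib.log_set_writer_func(structured_log_adapter, None)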
The user-visible bug was triggered in meld on Windows: meld#222 (closed)
The reason is that g_log_writer_format_fields(), introduced in 2.50, returns a gchar* that points to a string in a non-UTF-8 encoding, but the special comments used for introspection generation do not say that the string is not UTF-8. Such an annotation appears to be possible; see for example the recent fix for g_locale_from_utf8, 8a93e2d5.
Also see meld#222 (comment 315507) for a few more details and example Python code.
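The failure can be reproduced without GLib. This is a minimal sketch, assuming the locale encoding is Windows-1252 (the actual code page in the report is not stated) and using a made-up message. Decoding the bytes as UTF-8 is roughly what happens when the return value is treated as a UTF-8 string:

```python
# Hypothetical message; cp1252 stands in for a non-UTF-8 Windows locale encoding.
message = "¡Hola!"
locale_bytes = message.encode("cp1252")  # first byte is 0xa1

try:
    # Treat the locale-encoded bytes as UTF-8, as a "utf8" return annotation implies.
    locale_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(decode_error)
```

The byte 0xa1 is a UTF-8 continuation byte, so it cannot start a character, which matches the "invalid start byte" wording in the reported exception.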
The first fix that comes to mind is to change the introspection comments so that the introspection system understands that the returned data is a null-terminated byte array and not a UTF-8 string. But after some analysis I found that other variants can be considered as well.
First, some plain facts:
- As of today, searching Google, GitHub and other public code suggests that the only open-source apps using this function are Python apps: meld and pychess. Other usages are either in GLib itself or in wrappers autogenerated from GI introspection data.
- pychess worked around the exception by catching UnicodeDecodeError, see https://github.com/pychess/pychess/commit/40631d75a607d1cfccfdefeac454dfdf7f0a2a85 (included in 0.99.0 release).
- meld worked around the exception recently (not yet released) in master and in the active meld-3-18 branch by catching all Exceptions: meld@9587146d
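Both workarounds amount to wrapping the call in a try/except. The sketch below only illustrates that pattern and is not the actual pychess or meld code; `safe_format` and `fake_format_fields` are hypothetical names, and the fake function stands in for `GLib.log_writer_format_fields` on a non-UTF-8 locale:

```python
def safe_format(format_fields, *args):
    """Call format_fields and return None instead of raising on bad bytes."""
    try:
        return format_fields(*args)
    except UnicodeDecodeError:
        # The message is lost; the app keeps running.
        return None


def fake_format_fields(level, fields, use_color):
    # Stand-in for the real call: decoding non-UTF-8 bytes as UTF-8 fails.
    return b"\xa1error".decode("utf-8")


result = safe_format(fake_format_fields, 0, [], True)
print(result)
```

The cost of this workaround is that the log message itself is dropped whenever it contains non-ASCII text.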
Now a list of the fix methods I suggest considering:
- The simple one: fix the comment used for introspection generation
- Change returned encoding to utf8: change the function, docs, and internal usages to a "returns utf8" interface.
- Deprecate as is, and add new API: keep the current behavior as is, and add a new function returning utf8 (with corresponding internal changes to avoid code duplication, etc.)
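With the first option, Python callers would presumably receive `bytes` instead of `str` and would have to decode the value themselves. A minimal sketch of such caller-side handling, assuming the bytes are in the locale encoding; `decode_log_message` is a hypothetical helper, not part of GLib or PyGObject:

```python
import locale


def decode_log_message(raw, encoding=None):
    """Decode locale-encoded bytes to str, replacing undecodable bytes."""
    if encoding is None:
        # Assumption: the message uses the process locale encoding.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, errors="replace")


# cp1252 is passed explicitly here only to make the example deterministic.
decoded = decode_log_message(b"\xa1Hola!", "cp1252")
print(decoded)
```

Existing callers that treat the result as `str` (for example by concatenating it with another string) would need a change of this kind.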
Now a table describing the positive and negative sides of those fixes.
A negative change is marked by -, a positive one by +, and no mark means no change in this area.
Affected subject | Fix introspection comment | Change returned encoding to utf8 | Deprecate, add new |
---|---|---|---|
Existing now-working Python apps on typical utf8 systems (Linux) would raise an exception several lines below every usage, because the Python return type changes from utf8 string to non-utf8 bytes | - | | |
Existing now-broken Python apps on typical non-utf8 systems (Windows) start to work due to the new return type | + | | |
Note: possibly-existing now-working C apps on utf8 systems (Linux) that rely on the header and not on introspection data are unaffected by all variants | | | |
Possibly-existing now-working C apps on non-utf8 systems (Windows) may break due to receiving data in utf8 instead of the locale encoding | | - | |
Documentation of the returned type would differ between GLib versions, causing confusion | | - | |
There would be a way to get the message in utf-8 on Windows, for example for outputting it to some other logging system or a utf8 log (without fallback-replaced characters, which appear when the string is coerced during conversion to the locale encoding) | | + | + |
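The "fallback-replacing characters" mentioned in the last row can be illustrated in plain Python. This is a sketch with a made-up message and cp1252 as an assumed locale encoding. It mimics lossy conversion using Python's own `errors="replace"` handling, not GLib's conversion code:

```python
text = "Привет"  # hypothetical log message with characters outside cp1252

# Converting to a locale encoding that lacks these characters replaces them.
lossy = text.encode("cp1252", errors="replace")
round_trip = lossy.decode("cp1252")

# UTF-8 can represent every character, so nothing is lost.
utf8_round_trip = text.encode("utf-8").decode("utf-8")

print(round_trip)
print(utf8_round_trip)
```

Once the text has been coerced to the locale encoding the original characters cannot be recovered, which is why a UTF-8 return value is useful for forwarding messages to another logging system.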