GLib issues
https://gitlab.gnome.org/GNOME/glib/-/issues
2023-01-26T15:36:23Z

https://gitlab.gnome.org/GNOME/glib/-/issues/2
g_filename_from_utf8() should normalize?
2023-01-26T15:36:23Z
Bugzilla
## Submitted by Owen Taylor
**[Link to original bug (#72190)](https://bugzilla.gnome.org/show_bug.cgi?id=72190)**
## Description
Should g_filename_from_utf8() convert to a normalized form
if the filename encoding is UTF-8? Apparently Mac OS X does this:
http://developer.apple.com/techpubs/macosx/Essentials/SystemOverview/FileSystem/File_Encodings_and_Fonts.html
Non-normalized filenames can cause ambiguity between visual
representation and on-disk representation.
Version: 1.3.x

https://gitlab.gnome.org/GNOME/glib/-/issues/58
g_unichar_isxdigit() and g_unichar_xdigit_value() should deal with full-width a-fA-F
2019-05-24T17:17:25Z
Bugzilla
## Submitted by Nikolai Weibull
**[Link to original bug (#347844)](https://bugzilla.gnome.org/show_bug.cgi?id=347844)**
## Description
Please describe the problem:
Currently, g_unichar_isxdigit() and g_unichar_xdigit_value() ignore the full-width representations of the characters 'a' through 'f' and 'A' through 'F', even though they do take full-width digits into consideration.
Other information:
A simple addition of
#define G_UNICHAR_FULLWIDTH_A 0xff21
#define G_UNICHAR_FULLWIDTH_F 0xff26
#define G_UNICHAR_FULLWIDTH_a 0xff41
#define G_UNICHAR_FULLWIDTH_f 0xff46
with the appropriate test should suffice.
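A minimal sketch of the range test suggested above, written as standalone C rather than as a patch to GLib. `sketch_xdigit_value` is a hypothetical name, not a GLib function, and the full-width digit range U+FF10..U+FF19 is added here as an assumption so the example is complete. Unlike the real `g_unichar_xdigit_value()`, it only knows ASCII and full-width forms.

```c
#include <stdint.h>

/* Code points from the report above. */
#define SKETCH_FULLWIDTH_A 0xff21u
#define SKETCH_FULLWIDTH_F 0xff26u
#define SKETCH_FULLWIDTH_a 0xff41u
#define SKETCH_FULLWIDTH_f 0xff46u
/* Assumption: FULLWIDTH DIGIT ZERO..NINE occupy U+FF10..U+FF19. */
#define SKETCH_FULLWIDTH_0 0xff10u
#define SKETCH_FULLWIDTH_9 0xff19u

/* Hypothetical helper: returns 0..15 for a hexadecimal digit in ASCII or
 * full-width form, or -1 for anything else. */
static int
sketch_xdigit_value (uint32_t c)
{
  if (c >= '0' && c <= '9')
    return (int) (c - '0');
  if (c >= 'A' && c <= 'F')
    return (int) (c - 'A') + 10;
  if (c >= 'a' && c <= 'f')
    return (int) (c - 'a') + 10;
  if (c >= SKETCH_FULLWIDTH_0 && c <= SKETCH_FULLWIDTH_9)
    return (int) (c - SKETCH_FULLWIDTH_0);
  if (c >= SKETCH_FULLWIDTH_A && c <= SKETCH_FULLWIDTH_F)
    return (int) (c - SKETCH_FULLWIDTH_A) + 10;
  if (c >= SKETCH_FULLWIDTH_a && c <= SKETCH_FULLWIDTH_f)
    return (int) (c - SKETCH_FULLWIDTH_a) + 10;
  return -1;
}
```

An "is hex digit" predicate then reduces to checking whether the value is non-negative.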
Version: 2.12.x

https://gitlab.gnome.org/GNOME/glib/-/issues/62
g_utf8_collate_key_for_filename breaks case ordering
2019-01-28T12:53:46Z
Bugzilla
## Submitted by Alexander Larsson `@alexl`
**[Link to original bug (#352237)](https://bugzilla.gnome.org/show_bug.cgi?id=352237)**
## Description
For some reason g_utf8_collate_key_for_filename sorts files whose names differ in case differently. As a test case, use: "Test1.txt", "test2.txt", "test3.txt". g_utf8_collate_key will show them in that order. However, g_utf8_collate_key_for_filename will order them as "test2.txt", "test3.txt", "Test1.txt".
Version: 2.10.x
### See also
* [Bug 355152](https://bugzilla.gnome.org/show_bug.cgi?id=355152)

https://gitlab.gnome.org/GNOME/glib/-/issues/72
glib should not create/handle long UTF-8 forms
2019-04-11T16:24:14Z
Bugzilla
## Submitted by Roozbeh Pournader
**[Link to original bug (#391261)](https://bugzilla.gnome.org/show_bug.cgi?id=391261)**
## Description
Presently, glib's UTF-8 functions use the ISO/IEC 10646 definition of UTF-8 both when handling and when generating UTF-8 data. This means that it accepts and generates UTF-8 for values larger than the largest allowed Unicode character, U+10FFFF. As a result, applications get invalid Unicode characters instead of an error, so glib does not conform to The Unicode Standard.
As an example, the following piece of code accepts the "ill-formed" UTF-8 sequence and gives an invalid Unicode code point of U+110000 without an error:
#include <glib.h>
#include <stdio.h>

int
main ()
{
  gunichar *result;
  gchar input[] = "\xF4\x90\x80\x80";

  result = g_utf8_to_ucs4 (input, -1, NULL, NULL, NULL);
  if (result != NULL)
    printf ("result is: U+%x\n", result[0]);
  g_free (result);
  return 0;
}
The same happens with g_unichar_to_utf8, which takes invalid Unicode code points and generates an ill-formed UTF-8 sequence.
Quoting relevant parts from the Unicode 5.0 book:
Page 73:
"[Conformance clause] C9 When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences.
[...]
C10 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."
Page 103:
"Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed." [The patterns in Table 3-7, on page 104, do not match `<F4 90 80 80>`.]
We can of course claim that "we support ISO/IEC 10646's UTF-8" and ignore the problem altogether, but this is considered a security problem. Quoting UTR #36, Unicode Security Considerations:
http://www.unicode.org/reports/tr36/#Non_Visual_Recommendations
"A. Ensure that all implementations of UTF-8 used in a system are conformant to the latest version of Unicode. In particular,
A. Always use the so-called "shortest form" of UTF-8
B. Never go outside of 0..10FFFF₁₆
C. Never use 5 or 6 byte UTF-8."
Going this way also improves the performance of at least those functions that handle UTF-8 data, as the tests become simpler.
Not doing a patch yet as this may be controversial. Please comment.
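To make the three recommendations concrete, here is a minimal standalone sketch of a strict decoder. It is not a proposed patch and does not use GLib; `sketch_utf8_decode_strict` is a hypothetical name. It is intended to reject 5 and 6 byte forms, overlong ("non-shortest") forms, surrogates, and anything above U+10FFFF, which should match the well-formedness rules quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not GLib API.  Decodes one code point from the first
 * bytes of s (at most len bytes are read).  Returns the number of bytes
 * consumed (1..4) and stores the code point in *out, or returns 0 if the
 * sequence is ill-formed or truncated. */
static size_t
sketch_utf8_decode_strict (const unsigned char *s, size_t len, uint32_t *out)
{
  uint32_t c, min;
  size_t n, i;

  if (len == 0)
    return 0;

  if (s[0] < 0x80)
    { c = s[0]; n = 1; min = 0; }
  else if ((s[0] & 0xe0) == 0xc0)
    { c = s[0] & 0x1f; n = 2; min = 0x80; }
  else if ((s[0] & 0xf0) == 0xe0)
    { c = s[0] & 0x0f; n = 3; min = 0x800; }
  else if ((s[0] & 0xf8) == 0xf0)
    { c = s[0] & 0x07; n = 4; min = 0x10000; }
  else
    return 0;                   /* stray continuation or 5/6-byte lead byte */

  if (len < n)
    return 0;                   /* truncated sequence */

  for (i = 1; i < n; i++)
    {
      if ((s[i] & 0xc0) != 0x80)
        return 0;               /* not a continuation byte */
      c = (c << 6) | (uint32_t) (s[i] & 0x3f);
    }

  if (c < min)
    return 0;                   /* overlong (non-shortest) form */
  if (c > 0x10ffff)
    return 0;                   /* beyond the Unicode code space */
  if (c >= 0xd800 && c <= 0xdfff)
    return 0;                   /* surrogate code point */

  *out = c;
  return n;
}
```

With this sketch, the `<F4 90 80 80>` input from the example is reported as an error, while `<F4 8F BF BF>` still decodes to U+10FFFF.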
Version: 2.12.x

https://gitlab.gnome.org/GNOME/glib/-/issues/75
UTF-8 decoder failure
2019-08-31T13:56:02Z
Bugzilla
## Submitted by Chris Wilson
**[Link to original bug (#400468)](https://bugzilla.gnome.org/show_bug.cgi?id=400468)**
## Description
I stumbled across this test file whilst looking for some example UTF-8 documents:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt and according to that document the utf8 validator fails test 3.1.9
Version: 2.12.x

https://gitlab.gnome.org/GNOME/glib/-/issues/114
add g_utf8_strtitle
2019-07-22T16:34:36Z
Bugzilla
## Submitted by Eric Albright
**[Link to original bug (#498065)](https://bugzilla.gnome.org/show_bug.cgi?id=498065)**
## Description
gchar *g_utf8_strtitle (gchar *s);
This should return a copy of the string s where the first letter of the string is titlecased (as defined by unicode) and all following characters are lowercase.
This is necessary to handle U+01f2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
This can be done with the current public API, but it is easy to get wrong since g_unichar_totitle is not locale sensitive.
http://bugzilla.gnome.org/show_bug.cgi?id=416390#c4 mentions g_utf8_strtitle as being problematic because "titlecase is not well defined: we just capitalize the first letter of each word, but sometimes people would like it to follow localized capitalization rules (e.g Phantom of the Opera vs Phantom Of The Opera, Murders in the Rue Morgue vs Murders In The Rue Morgue, etc)". It is true that there may be language specific overrides that need to happen, however by defining the semantics to titlecase the string instead of each word in the string, it moves that burden to the application where it belongs. Thus either of the choices above are possible and would delegate the titlecasing functionality to glib after splitting the string on spaces (and checking for short words like in, the, on, that aren't the first word of the string etc.)
The following is my implementation using glib public api but it could be simplified if within glib.
gchar *
g_utf8_strtitle (const gchar *str, gssize len)
{
  gunichar title_case_char;
  gchar *result;
  gchar *upperStr, *upperTail, *lowerTail;
  gchar title_case_utf8[7];
  gint utf8len;

  upperStr = g_utf8_strup (str, len); /* for locale sensitive casing */
  title_case_char = g_unichar_totitle (g_utf8_get_char (upperStr));
  utf8len = g_unichar_to_utf8 (title_case_char, title_case_utf8);
  title_case_utf8[utf8len] = '\0';
  upperTail = g_utf8_next_char (upperStr);
  lowerTail = g_utf8_strdown (upperTail, -1);
  result = g_strconcat (title_case_utf8,
                        lowerTail,
                        NULL);
  g_free (upperStr);
  g_free (lowerTail);
  return result;
}
Version: 2.14.x
### Blocking
* [Bug 735336](https://bugzilla.gnome.org/show_bug.cgi?id=735336)
* [Bug 751807](https://bugzilla.gnome.org/show_bug.cgi?id=751807)

https://gitlab.gnome.org/GNOME/glib/-/issues/115
add with_locale variety utf8 case mapping functions
2019-08-31T13:56:03Z
Bugzilla
## Submitted by Eric Albright
**[Link to original bug (#498068)](https://bugzilla.gnome.org/show_bug.cgi?id=498068)**
## Description
I need g_utf8_strup_with_locale and g_utf8_strdown_with_locale.
Looks to me like this functionality was planned but not implemented:
http://mail.gnome.org/archives/gtk-i18n-list/2001-June/msg00053.html states "Since we don't have a method of representing locale in GLib
right now, I think we should start out with:
g_utf8_toupper (string); [ priority A ]
g_utf8_tolower (string); [ priority A ]
Defined to use the "current" locale as the minimum.
We can add g_utf8_to_upper_with_locale (string, locale) later."
I don't see why locale cannot be const char * as would be returned by setlocale (LC_CTYPE, NULL) or g_win32_getlocale()
Version: 2.14.x

https://gitlab.gnome.org/GNOME/glib/-/issues/116
Document UTF-8 behaviour and requirements throughout GLib
2023-05-16T19:56:29Z
Bugzilla
GLib needs a documentation page somewhere which explains how UTF-8 is used throughout GLib (even on Windows), and that unless explicitly stated otherwise, all functions require *valid* UTF-8 as input or they will crash. This means potentially-invalid input must be validated at the program or library boundary.
It would be good if this could be linked to from all functions which take UTF-8 strings as input, although that could be left to future work depending on whether the new docs system supports that kind of thing at the moment.
The original case of `g_utf8_normalize()` has been fixed.
---
Old issue body:
## Submitted by Stian Skjelstad
**[Link to original bug (#501997)](https://bugzilla.gnome.org/show_bug.cgi?id=501997)**
## Description
Documentation
Section: glib/glib-Unicode-Manipulation.html#g-utf8-normalize
Nothing is said about what happens if the string is not valid UTF-8.
Correct version:
If the string is not valid UTF-8, NULL will be returned.
Other information:
Assignee: Philip Withnall

https://gitlab.gnome.org/GNOME/glib/-/issues/135
g_unichar_totitle(0) returns 0x00001F88 instead of 0
2019-08-31T13:56:03Z
Bugzilla
## Submitted by Ivan Peikov
**[Link to original bug (#526123)](https://bugzilla.gnome.org/show_bug.cgi?id=526123)**
## Description
Please describe the problem:
The title-case version of 0 is obviously 0 (according to the Unicode standard). However, in glib's implementation of title-case characters (title_table[]), 0 is also used as a placeholder whenever no lower/upper version of the title-case character exists.
This leads to the buggy behavior.
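The following standalone sketch illustrates how a 0 placeholder in such a table can capture a lookup for 0, and one possible guard. The table layout ({title, upper, lower}), the two sample rows, and the name `sketch_totitle` are assumptions made for illustration; this is not GLib's actual table or code, and the fallback for characters outside the table is simplified to returning the input.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative rows in {title, upper, lower} order.  A 0 entry means
 * "no such form exists" (assumed layout, see the description above). */
static const uint32_t sketch_title_table[][3] = {
  { 0x01c5, 0x01c4, 0x01c6 },   /* U+01C5 with its upper and lower forms */
  { 0x1f88, 0x0000, 0x1f80 },   /* U+1F88 has no separate upper form */
};

/* Hypothetical lookup: returns the title-case form of c if c appears in
 * any column of the table, otherwise c itself. */
static uint32_t
sketch_totitle (uint32_t c)
{
  size_t i;
  const size_t n = sizeof sketch_title_table / sizeof sketch_title_table[0];

  /* Guard: without this, c == 0 would match the 0 placeholder in the
   * second row and the function would return 0x1F88. */
  if (c == 0)
    return 0;

  for (i = 0; i < n; i++)
    {
      if (sketch_title_table[i][0] == c ||
          sketch_title_table[i][1] == c ||
          sketch_title_table[i][2] == c)
        return sketch_title_table[i][0];
    }

  return c;
}
```

An alternative design would be a placeholder value that can never be a valid input, but the early return keeps the table unchanged.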
Steps to reproduce:
1. pass 0 to g_unichar_totitle()
Actual results:
returns 0x00001F88
Expected results:
0
Does this happen every time?
yes
Other information:
Version: 2.16.x

https://gitlab.gnome.org/GNOME/glib/-/issues/178
wcwidth-like functions
2021-05-26T16:11:02Z
Bugzilla
## Submitted by Behdad Esfahbod
**[Link to original bug (#563503)](https://bugzilla.gnome.org/show_bug.cgi?id=563503)**
## Description
We provide all the ingredients in glib to write a `wcwidth()` function. Namely:
```
return g_unichar_iszerowidth(c) ? 0 : g_unichar_iswide(c) ? 2 : 1;
```
However, writing this and writing a string loop around it is uglier than I like. I need that in pangofc and pangocairo. Means that I had to repeat the following code in two places:
```
static inline G_GNUC_UNUSED int
pango_unichar_width (gunichar c)
{
return G_UNLIKELY (g_unichar_iszerowidth (c)) ? 0 :
G_UNLIKELY (g_unichar_iswide (c)) ? 2 : 1;
}
static G_GNUC_UNUSED glong
pango_utf8_strwidth (const gchar *p)
{
glong len = 0;
g_return_val_if_fail (p != NULL, 0);
while (*p)
{
len += pango_unichar_width (g_utf8_get_char (p));
p = g_utf8_next_char (p);
}
return len;
}
```
Which is short enough and OK. But try writing a version for non-nul-terminated strings and things quickly get hard to get right.
The reason we didn't add a `g_unichar_width()` to glib was that depending on the locale, the user may want to use `g_unichar_iswide_cjk()` instead. Ideally, my pango code should automatically use `iswide_cjk()` for CJK locales. But I didn't have the list in Pango, so didn't do that.
Note that vte also has all this code. But the requirements there are a bit more restricted, so I don't think we will be able to reuse code anyway.
So, here is one proposal:
* Add `g_get_lc_ctype()`, which will move code from pango down to glib. The code is:
```
static gchar *
_pango_get_lc_ctype (void)
{
#ifdef G_OS_WIN32
/* Somebody might try to set the locale for this process using the
* LANG or LC_ environment variables. The Microsoft C library
* doesn't know anything about them. You set the locale in the
* Control Panel. Setting these env vars won't have any effect on
* locale-dependent C library functions like ctime(). But just for
* kicks, do obey LC_ALL, LC_CTYPE and LANG in Pango. (This also makes
* it easier to test GTK and Pango in various default languages, you
* don't have to clickety-click in the Control Panel, you can simply
* start the program with LC_ALL=something on the command line.)
*/
gchar *p;
p = getenv ("LC_ALL");
if (p != NULL)
return g_strdup (p);
p = getenv ("LC_CTYPE");
if (p != NULL)
return g_strdup (p);
p = getenv ("LANG");
if (p != NULL)
return g_strdup (p);
return g_win32_getlocale ();
#else
return g_strdup (setlocale (LC_CTYPE, NULL));
#endif
}
```
* Add `g_unichar_width()` that uses the `_cjk` variant if running under a CJK locale. The list of CJK locales will be lifted from VTE. We can document that the user can determine whether they are running under a CJK locale by testing the return value of `g_unichar_width(SOME-AMBIGUOUS-CHAR)`. So no extra API is needed there.
* Add `g_utf8_strwidth()` that is similar to `g_utf8_strlen()` but uses `g_unichar_width()`.
How does it sound? Still too trivial stuff to live in glib?

https://gitlab.gnome.org/GNOME/glib/-/issues/196
g_utf8_collate_key_for_filename () should support Roman numbers
2022-01-19T12:13:58Z
Bugzilla
## Submitted by LAZA
**[Link to original bug (#571025)](https://bugzilla.gnome.org/show_bug.cgi?id=571025)**
## Description
https://bugs.launchpad.net/nautilus/+bug/238077
Version: 2.19.x

https://gitlab.gnome.org/GNOME/glib/-/issues/199
Add g_utf8_simple_casefold() for simple case folding (instead of full case folding)
2020-07-27T00:34:11Z
Bugzilla
## Submitted by Sébastien Bacher `@seb128`
**[Link to original bug (#572952)](https://bugzilla.gnome.org/show_bug.cgi?id=572952)**
## Description
the bug has been opened on https://bugs.launchpad.net/bugs/332321
"Version of Package: 2.24.2-0ubuntu1
I expect: When I search and replace "ß" (to change to &szlig; for html) I want all "ss" left unchanged.
What happens: "ss" is also replaced as if it is an "ß""
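The difference between the two folding modes named in the issue title can be sketched in a few lines of standalone C. Both function names are hypothetical and neither is GLib API; the sketch only covers ASCII letters and U+00DF (ß), which is enough to show why a search based on full case folding treats "ß" and "ss" as equal while simple case folding does not.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical full case folding: one code point may expand to several.
 * Writes up to 2 code points into out and returns how many were written.
 * Only ASCII letters and U+00DF are handled in this sketch. */
static size_t
sketch_fold_full (uint32_t c, uint32_t out[2])
{
  if (c == 0x00df)              /* LATIN SMALL LETTER SHARP S folds to "ss" */
    {
      out[0] = 's';
      out[1] = 's';
      return 2;
    }
  out[0] = (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
  return 1;
}

/* Hypothetical simple case folding: always exactly one code point, so
 * U+00DF stays as it is and can never compare equal to "ss". */
static uint32_t
sketch_fold_simple (uint32_t c)
{
  return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}
```

A case-insensitive search built on the simple variant would match "ß" only against "ß", which is the behaviour the reporter expects.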
Version: 2.37.x
### Blocking
* [Bug 703165](https://bugzilla.gnome.org/show_bug.cgi?id=703165)

https://gitlab.gnome.org/GNOME/glib/-/issues/212
sort "by name" shows files in wrong alphanumeric order
2019-05-14T13:10:17Z
Bugzilla
## Submitted by Savvas Radević
**[Link to original bug (#578048)](https://bugzilla.gnome.org/show_bug.cgi?id=578048)**
## Description
According to a bug report ( http://bugzilla.gnome.org/show_bug.cgi?id=547350#c3 ), nautilus uses glib for sorting filenames, so I thought that glib is the package that should handle this usability bug.
I have a list of filenames (images), which use alphanumeric characters in their names. But when I try to sort the files "by name", it shows them in this order:
13437719_8903f96f59_o.jpg
179584429_849c737f48_o.jpg
200396260-001.jpg
200500922-001.jpg
427180864_363a521a3f_b.jpg
1288699637_05815693d8_b.jpg
... etc, here's the image screenshot of nautilus:
http://img70.imageshack.us/img70/8084/nautiluswrongsortorder.png
However using "ls -1" in command line the files appear properly sorted:
$ ls -1
1288699637_05815693d8_b.jpg
13437719_8903f96f59_o.jpg
1468995258_d30e95ef1b_o.jpg
179584429_849c737f48_o.jpg
200396260-001.jpg
200500922-001.jpg
2293045156_bbf1d06a46_b.jpg
2371633439_1c1a327e26_b.jpg
2400320371_325da8d281_o.jpg
2530198045_088931c8be_b_remixed.png
2960052107_598573b960_o.jpg
3068888802_6b6b0a3ea4_o.jpg
427180864_363a521a3f_b.jpg
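The two listings are consistent with two different comparison rules: `ls` compares the names character by character, while the order shown in Nautilus looks like the leading digit runs are compared as numbers. The standalone sketch below shows the second rule next to plain `strcmp()`. `sketch_numeric_aware_cmp` is a hypothetical name and a simplification (ASCII only, no locale handling); it is not how GLib implements filename collation.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical comparator: runs of ASCII digits are compared by numeric
 * value, everything else byte by byte.  Returns -1, 0 or 1. */
static int
sketch_numeric_aware_cmp (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (isdigit ((unsigned char) *a) && isdigit ((unsigned char) *b))
        {
          const char *ea, *eb;
          size_t la, lb;
          int r;

          /* Leading zeros do not change the numeric value. */
          while (*a == '0')
            a++;
          while (*b == '0')
            b++;

          /* Find the end of each digit run. */
          for (ea = a; isdigit ((unsigned char) *ea); ea++)
            ;
          for (eb = b; isdigit ((unsigned char) *eb); eb++)
            ;
          la = (size_t) (ea - a);
          lb = (size_t) (eb - b);

          /* More significant digits means a larger number. */
          if (la != lb)
            return la < lb ? -1 : 1;

          /* Same length: digit-by-digit order equals numeric order. */
          r = memcmp (a, b, la);
          if (r != 0)
            return r < 0 ? -1 : 1;

          a = ea;
          b = eb;
        }
      else
        {
          if (*a != *b)
            return (unsigned char) *a < (unsigned char) *b ? -1 : 1;
          a++;
          b++;
        }
    }

  if (*a != '\0')
    return 1;
  if (*b != '\0')
    return -1;
  return 0;
}
```

For the names above, this comparator puts 427180864_363a521a3f_b.jpg before 1288699637_05815693d8_b.jpg because 427180864 is the smaller number, whereas `strcmp()` puts it after because '4' sorts after '1'.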
Other information:
Version: 2.20.x

https://gitlab.gnome.org/GNOME/glib/-/issues/302
New inline functions for iteration over UTF-8
2023-07-09T16:15:03Z
Bugzilla
## Submitted by Mikhail Zabaluev `@mzabaluev`
**[Link to original bug (#619437)](https://bugzilla.gnome.org/show_bug.cgi?id=619437)**
## Description
Moving out from bug #614856, patches adding two inline functions for faster iteration over UTF-8 characters, g_utf8_iterate() and g_utf8_iterate_back().
Version: 2.25.x
### Blocking
* [Bug 614856](https://bugzilla.gnome.org/show_bug.cgi?id=614856)

https://gitlab.gnome.org/GNOME/glib/-/issues/366
g_unichar_toupper/lower with any character types
2019-05-14T10:22:52Z
Bugzilla
## Submitted by Carlos Garcia Campos
**[Link to original bug (#633436)](https://bugzilla.gnome.org/show_bug.cgi?id=633436)**
## Description
g_unichar_tolower() and toupper() only work with uppercase/lowercase or titlecase letters, however there are other characters defined in UnicodeData.txt that have a 1:1 case mapping. ICU for example, doesn't restrict single character case mapping to upper/lower and title letters:
"A character is considered to have a lowercase, uppercase, or title case equivalent if there is a respective "simple" case mapping specified for the character in the Unicode Character Database (UnicodeData.txt). If a character has no mapping equivalent, the result is the character itself."
See: http://userguide.icu-project.org/transforms/casemappings
Version: 2.27.x

https://gitlab.gnome.org/GNOME/glib/-/issues/390
Wrong behaviour of g_utf8_strdown() using tr_TR.utf8 locale
2021-02-10T16:26:13Z
Bugzilla
## Submitted by Giulio Paci
**[Link to original bug (#640095)](https://bugzilla.gnome.org/show_bug.cgi?id=640095)**
## Description
Converting the string "i" to upper case and then back to lower case does not work properly in the tr_TR.utf8 locale. The upper case version of the string is right, but the lower case version is an i with a dot.
I did not try it, but I think that adding this code to the real_tolower() function should fix the issue:
else if (locale_type == LOCALE_TURKIC && c == 0x130)
  {
    /* LATIN CAPITAL LETTER I WITH DOT ABOVE => i */
    len += g_unichar_to_utf8 (0x069, out_buffer ? out_buffer + len : NULL);
  }
Another, probably related, issue is that using g_utf8_casefold() on "İi" and "iİ" leads to different results.

https://gitlab.gnome.org/GNOME/glib/-/issues/517
g_utf8_collate returns 0 on U+C5D0 vs U+CD94
2022-03-03T14:53:58Z
Bugzilla
## Submitted by Morten Welinder
**[Link to original bug (#670403)](https://bugzilla.gnome.org/show_bug.cgi?id=670403)**
## Description
U+C5D0 "에"
U+CD94 "추"
According to g_utf8_collate these two are identical. They don't look
the same, so I don't think that is correct.
### Blocking
* [Bug 670232](https://bugzilla.gnome.org/show_bug.cgi?id=670232)
Assignee: Philip Withnall

https://gitlab.gnome.org/GNOME/glib/-/issues/603
add API to access unicode grapheme break property
2021-06-09T11:33:38Z
Bugzilla
## Submitted by Christian Persch `@chpe`
**[Link to original bug (#684222)](https://bugzilla.gnome.org/show_bug.cgi?id=684222)**
## Description
We should add API to access the unicode character's GraphemeBreakProperty.
This will be necessary for the update to PCRE 8.32 (which uses this character property to implement \X ) and also be useful for pango.
Proposed API on wip/unicode-graphemebreak branch, using the same approach as the line break property API.
### Blocking
* [Bug 689791](https://bugzilla.gnome.org/show_bug.cgi?id=689791)

https://gitlab.gnome.org/GNOME/glib/-/issues/872
ucs4 functions have wrong return transfer
2019-06-24T02:53:37Z
Bugzilla
## Submitted by mod..@..ush.ai
**[Link to original bug (#729919)](https://bugzilla.gnome.org/show_bug.cgi?id=729919)**
## Description
The following functions are documented to allocate their return value but are marked in GI as having "none" transfer on their return type:
g_ucs4_to_utf16
g_utf16_to_ucs4
g_utf8_to_ucs4
g_utf8_to_ucs4_fast
g_utf8_to_utf16
g_unicode_canonical_decomposition
Also, the out arguments to g_unichar_decompose, g_unichar_fully_decompose, and g_unicode_canonical_decomposition are not known by GI to have 'out' direction.
These cause e.g. pygi to fail:
$ python3
Python 3.4.0 (default, Apr 27 2014, 23:33:09)
[GCC 4.8.2 20140206 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gi
>>> from gi.repository import GLib
>>> #pygi tries to interpret the pointer value of the ucs4 array as a unichar here:
...
>>> GLib.utf8_to_ucs4('foo', -1, 0, 0)
Traceback (most recent call last):
File "`<stdin>`", line 1, in `<module>`
TypeError: Invalid unicode codepoint 38962016
>>> #here, it doesn't recognize the second-to-last argument as being 'out':
...
>>> GLib.unichar_fully_decompose('a', 0, 'b', 0)
1
>>> GLib.unichar_fully_decompose('é', 0, 'b', 0)
2
>>> #and it just crashes on this one, because the second arg is treated as a raw pointer value rather than the ffi allocating an int temporary and passing its address:
...
>>> GLib.unicode_canonical_decomposition('a', 0)
zsh: segmentation fault python3
Version: 2.40.x
Milestone: 2.61.2

https://gitlab.gnome.org/GNOME/glib/-/issues/900
glib doesn't detect many invalid utf-8 sequences
2019-05-14T13:47:54Z
Bugzilla
## Submitted by Behdad Esfahbod
**[Link to original bug (#733073)](https://bugzilla.gnome.org/show_bug.cgi?id=733073)**
## Description
I recently reworked this in HarfBuzz. Can be copied.
https://github.com/behdad/harfbuzz/blob/master/src/hb-utf-private.hh