GLib issues
https://gitlab.gnome.org/GNOME/glib/-/issues
2019-01-28T12:53:46Z

g_utf8_collate_key_for_filename breaks case ordering
https://gitlab.gnome.org/GNOME/glib/-/issues/62
2019-01-28T12:53:46Z · Bugzilla
## Submitted by Alexander Larsson `@alexl`
**[Link to original bug (#352237)](https://bugzilla.gnome.org/show_bug.cgi?id=352237)**
## Description
For some reason g_utf8_collate_key_for_filename sorts case-different files differently. As a test case, use: "Test1.txt", "test2.txt", "test3.txt". g_utf8_collate_key will show them in that order. However, g_utf8_collate_key_for_filename will order them as "test2.txt", "test3.txt", "Test1.txt".
Version: 2.10.x
### See also
* [Bug 355152](https://bugzilla.gnome.org/show_bug.cgi?id=355152)

glib should not create/handle long UTF-8 forms
https://gitlab.gnome.org/GNOME/glib/-/issues/72
2019-04-11T16:24:14Z · Bugzilla
## Submitted by Roozbeh Pournader
**[Link to original bug (#391261)](https://bugzilla.gnome.org/show_bug.cgi?id=391261)**
## Description
Presently, glib's UTF-8 functions use the ISO/IEC 10646 definition of UTF-8 both when handling and when generating UTF-8 data. This means that it accepts and generates UTF-8 for values larger than the largest allowed Unicode character, U+10FFFF. This means that the applications will get invalid Unicode characters instead of an error, making glib not conforming to The Unicode Standard.
As an example, the following piece of code accepts the "ill-formed" UTF-8 sequence and gives an invalid Unicode code point of U+110000 without an error:
```c
#include <glib.h>
#include <stdio.h>

int
main ()
{
  gunichar *result;
  gchar input[] = "\xF4\x90\x80\x80";

  result = g_utf8_to_ucs4 (input, -1, NULL, NULL, NULL);
  if (result != NULL)
    printf ("result is: U+%x\n", result[0]);
  g_free (result);
  return 0;
}
```
The same happens with g_unichar_to_utf8, which takes invalid Unicode code points and generates an ill-formed UTF-8 sequence.
Quoting relevant parts from the Unicode 5.0 book:
Page 73:
"[Conformance clause] C9 When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences.
[...]
C10 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."
Page 103:
"Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed." [The patterns in Table 3-7, on page 104, do not match `<F4 90 80 80>`.]
We can of course claim that "we support ISO/IEC 10646's UTF-8" and ignore the problem altogether, but this is considered a security problem. Quoting UTR #36, Unicode Security Considerations:
http://www.unicode.org/reports/tr36/#Non_Visual_Recommendations
"A. Ensure that all implementations of UTF-8 used in a system are conformant to the latest version of Unicode. In particular,
    A. Always use the so-called "shortest form" of UTF-8
    B. Never go outside of 0..10FFFF₁₆
    C. Never use 5 or 6 byte UTF-8."
Going this way also improves the performance of at least those functions that handle UTF-8 data, as the tests become simpler.
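The checks quoted above (shortest form only, nothing beyond U+10FFFF, no 5- or 6-byte sequences) can be sketched in plain C. This is a minimal illustration, not GLib's decoder: `strict_utf8_decode()` is a hypothetical name, and the surrogate check is an extra assumption taken from the Unicode definition of well-formed UTF-8 rather than from this report.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not GLib API): decode one UTF-8 sequence from s,
 * reading at most len bytes. Returns the number of bytes consumed and stores
 * the code point in *out, or returns 0 if the sequence is ill-formed. */
static size_t
strict_utf8_decode (const unsigned char *s, size_t len, uint32_t *out)
{
  uint32_t c, min;
  size_t n, i;

  if (len == 0)
    return 0;

  if (s[0] < 0x80)
    { c = s[0]; n = 1; min = 0; }
  else if ((s[0] & 0xE0) == 0xC0)
    { c = s[0] & 0x1F; n = 2; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0)
    { c = s[0] & 0x0F; n = 3; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0)
    { c = s[0] & 0x07; n = 4; min = 0x10000; }
  else
    return 0; /* stray continuation byte, or a 5/6-byte lead byte */

  if (len < n)
    return 0; /* truncated sequence */

  for (i = 1; i < n; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0; /* not a continuation byte */
      c = (c << 6) | (s[i] & 0x3F);
    }

  if (c < min)
    return 0; /* overlong ("non-shortest") form */
  if (c > 0x10FFFF)
    return 0; /* beyond the Unicode code space */
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0; /* surrogate code point */

  *out = c;
  return n;
}
```

With this sketch, the `<F4 90 80 80>` input from the example is rejected because it decodes to 0x110000, while `<F4 8F BF BF>` (U+10FFFF) is accepted.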
Not doing a patch yet as this may be controversial. Please comment.
Version: 2.12.x

g_unichar_iszerowidth does not handle Prepended_Concatenation_Mark correctly
https://gitlab.gnome.org/GNOME/glib/-/issues/1286
2019-05-14T10:22:26Z · Bugzilla
## Submitted by Mike Frysinger
**[Link to original bug (#787229)](https://bugzilla.gnome.org/show_bug.cgi?id=787229)**
## Description
glib currently marks all Cf (Format Character) as zero width, but this ignores Prepended_Concatenation_Mark codepoints. I guess gen-unicode-tables.pl should be consulting PropList.txt from the Unicode releases.
Specifically, these should all return false with g_unichar_iszerowidth:
0600..0605 ; Prepended_Concatenation_Mark # Cf ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
06DD ; Prepended_Concatenation_Mark # Cf ARABIC END OF AYAH
070F ; Prepended_Concatenation_Mark # Cf SYRIAC ABBREVIATION MARK
08E2 ; Prepended_Concatenation_Mark # Cf ARABIC DISPUTED END OF AYAH
110BD ; Prepended_Concatenation_Mark # Cf KAITHI NUMBER SIGN
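The exception requested here can be sketched as a small range check. This is only an illustration under stated assumptions: both function names are hypothetical, the ranges are copied from the five PropList.txt entries listed above (later Unicode versions may list more), and the caller is assumed to have already established that the code point has General_Category Cf.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the Prepended_Concatenation_Mark code points
 * quoted in this report. */
static bool
is_prepended_concatenation_mark (uint32_t c)
{
  return (c >= 0x0600 && c <= 0x0605) /* ARABIC NUMBER SIGN..NUMBER MARK ABOVE */
      || c == 0x06DD                  /* ARABIC END OF AYAH */
      || c == 0x070F                  /* SYRIAC ABBREVIATION MARK */
      || c == 0x08E2                  /* ARABIC DISPUTED END OF AYAH */
      || c == 0x110BD;                /* KAITHI NUMBER SIGN */
}

/* Hypothetical rule for a code point already known to be Cf: treat it as
 * zero width unless it is a Prepended_Concatenation_Mark. */
static bool
cf_is_zero_width (uint32_t c)
{
  return !is_prepended_concatenation_mark (c);
}
```

Under this rule a Cf character such as U+200B ZERO WIDTH SPACE stays zero width, while U+06DD does not.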
Unicode 10.0.0 chapter 9 section 2 page 377-378 [1] states:
Signs Spanning Numbers. Several other special signs are written in association with numbers in the Arabic script. All of these signs can span multiple-digit numbers, rather than just a single digit. They are not formally considered combining marks in the sense used by the Unicode Standard, although they clearly interact graphically with their associated sequence of digits. In the text representation they precede the sequence of digits that they span, rather than follow a base character, as would be the case for a combining mark. Their General_Category value is Cf (format character). Unlike most other format characters, however, they should be rendered with a visible glyph, even in circumstances where no suitable digit or sequence of digits follows them in logical order. The characters have the Bidi_Class value of Arabic_Number to make them appear in the same run as the numbers following them.
A few similar signs spanning numbers or letters are associated with scripts other than Arabic. See the discussion of U+070F syriac abbreviation mark in Section 9.3, Syriac, and the discussion of U+110BD kaithi number sign in Section 15.2, Kaithi. All of these prefixed format controls, including the non-Arabic ones, are given the property value Prepended_Concatenation_Mark=True, to identify them as a class. They also have special behavior in text segmentation. (See Unicode Standard Annex #29, “Unicode Text Segmentation.”)
[1] http://unicode.org/versions/Unicode10.0.0/ch09.pdf

g_unichar_toupper/lower with any character types
https://gitlab.gnome.org/GNOME/glib/-/issues/366
2019-05-14T10:22:52Z · Bugzilla
## Submitted by Carlos Garcia Campos
**[Link to original bug (#633436)](https://bugzilla.gnome.org/show_bug.cgi?id=633436)**
## Description
g_unichar_tolower() and g_unichar_toupper() only work with uppercase, lowercase, or titlecase letters; however, there are other characters defined in UnicodeData.txt that have a 1:1 case mapping. ICU, for example, doesn't restrict single-character case mapping to upper, lower, and title letters:
"A character is considered to have a lowercase, uppercase, or title case equivalent if there is a respective "simple" case mapping specified for the character in the Unicode Character Database (UnicodeData.txt). If a character has no mapping equivalent, the result is the character itself."
See: http://userguide.icu-project.org/transforms/casemappings
Version: 2.27.x

glib doesn't detect many invalid utf-8 sequences
https://gitlab.gnome.org/GNOME/glib/-/issues/900
2019-05-14T13:47:54Z · Bugzilla
## Submitted by Behdad Esfahbod
**[Link to original bug (#733073)](https://bugzilla.gnome.org/show_bug.cgi?id=733073)**
## Description
I recently reworked this in HarfBuzz. Can be copied.
https://github.com/behdad/harfbuzz/blob/master/src/hb-utf-private.hh

add with_locale variety utf8 case mapping functions
https://gitlab.gnome.org/GNOME/glib/-/issues/115
2019-08-31T13:56:03Z · Bugzilla
## Submitted by Eric Albright
**[Link to original bug (#498068)](https://bugzilla.gnome.org/show_bug.cgi?id=498068)**
## Description
I need g_utf8_strup_with_locale and g_utf8_strdown_with_locale.
Looks to me like this functionality was planned but not implemented:
http://mail.gnome.org/archives/gtk-i18n-list/2001-June/msg00053.html states "Since we don't have a method of representing locale in GLib
right now, I think we should start out with:
g_utf8_toupper (string); [ priority A ]
g_utf8_tolower (string); [ priority A ]
Defined to use the "current" locale as the minimum.
We can add g_utf8_to_upper_with_locale (string, locale) later."
I don't see why locale cannot be const char * as would be returned by setlocale (LC_CTYPE, NULL) or g_win32_getlocale()
Version: 2.14.x

Wrong behaviour of g_utf8_strdown() using tr_TR.utf8 locale
https://gitlab.gnome.org/GNOME/glib/-/issues/390
2021-02-10T16:26:13Z · Bugzilla
## Submitted by Giulio Paci
**[Link to original bug (#640095)](https://bugzilla.gnome.org/show_bug.cgi?id=640095)**
## Description
Converting to upper case and then to lower case of the string "i" does not work properly in the tr_TR.utf8 locale. The upper case version of the string is right, but the lower case version is an i with a dot.
I did not try it, but I think that adding this code to the real_tolower() function should fix the issue:
```c
else if (locale_type == LOCALE_TURKIC && c == 0x130)
  {
    /* LATIN CAPITAL LETTER I WITH DOT ABOVE => i */
    len += g_unichar_to_utf8 (0x069, out_buffer ? out_buffer + len : NULL);
  }
```
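To make the intended mapping concrete, here is a stand-alone sketch of a per-code-point lowercase step with the Turkic special cases. It is not GLib's `real_tolower()`: `simple_tolower_for_locale()` and its `turkic` flag are hypothetical, and outside the two Turkic cases it only handles ASCII letters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: lowercase a single code point. Only ASCII letters and
 * the two Turkic special cases are handled in this sketch. */
static uint32_t
simple_tolower_for_locale (uint32_t c, bool turkic)
{
  if (turkic)
    {
      if (c == 0x0130)
        return 0x0069; /* LATIN CAPITAL LETTER I WITH DOT ABOVE => i */
      if (c == 0x0049)
        return 0x0131; /* LATIN CAPITAL LETTER I => dotless i */
    }

  if (c >= 'A' && c <= 'Z')
    return c + ('a' - 'A');

  return c; /* everything else is left unchanged here */
}
```

With the flag set, upper-casing "i" to U+0130 and lower-casing it again round-trips to a plain "i", which is the behaviour the report asks for.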
Another, probably related issue, is that using g_utf8_casefold() on "İi" and "iİ" leads to different results.

Support collation of non-ASCII digits with g_utf8_collate_key_for_filename()
https://gitlab.gnome.org/GNOME/glib/-/issues/1150
2022-01-19T12:20:00Z · Bugzilla
## Submitted by Mahdi Rajabi
**[Link to original bug (#764225)](https://bugzilla.gnome.org/show_bug.cgi?id=764225)**
## Description
I have many files whose names are Persian numbers (۱.mp4, ۲.mp4).
Nautilus doesn't sort them by their Persian numbers.
On Ubuntu 15.10
Arrange Item : By Name
Version: 2.48.x
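A digit-aware collation step first needs the numeric value of each digit, whatever its script. A minimal sketch, assuming a hypothetical helper `decimal_digit_value()` that knows only four blocks (ASCII, Arabic-Indic, Extended Arabic-Indic as used for Persian, and Bengali); a full implementation would consult the Unicode character database instead of a hard-coded table:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 0..9 if c is a decimal digit in one of the
 * blocks listed below, or -1 otherwise. */
static int
decimal_digit_value (uint32_t c)
{
  /* Code point of the digit zero in each handled block. */
  static const uint32_t zeros[] = {
    0x0030, /* ASCII digits */
    0x0660, /* Arabic-Indic digits */
    0x06F0, /* Extended Arabic-Indic digits (Persian) */
    0x09E6  /* Bengali digits */
  };
  size_t i;

  for (i = 0; i < sizeof zeros / sizeof zeros[0]; i++)
    if (c >= zeros[i] && c <= zeros[i] + 9)
      return (int) (c - zeros[i]);

  return -1;
}
```

For the file names in this report, ۱ (U+06F1) maps to 1 and ۲ (U+06F2) maps to 2, so a collation key built from these values would order them numerically.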
---
As per #2576, Bangla numbers are also not currently supported, and should be.

Introspection information of g_unicode_canonical_ordering incorrect
https://gitlab.gnome.org/GNOME/glib/-/issues/1840
2022-01-19T12:20:40Z · Ghost User
Hi all-
As far as I can tell, g_unicode_canonical_ordering should be marked up as a unichar array with an associated array size, but it seems to be marked up as a simple pointer to a single unichar.

g_unichar_to_utf8 creates invalid UTF-8
https://gitlab.gnome.org/GNOME/glib/-/issues/2778
2023-07-03T18:36:09Z · Jussi Pakkanen
According to the documentation, g_unichar_to_utf8 produces a string where "The value is a NUL terminated UTF-8 string." This does not seem to be the case for some unichar values. Here is a sample program:
```c
#include <glib.h>
#include <stdio.h>
#include <string.h> /* for memset() */
int main(int argc, char **argv) {
char buf[10];
memset(buf, 0xff, 10); // Filling with 0 works, this does not.
g_unichar_to_utf8(0xbb, buf);
if(g_utf8_validate(buf, -1, NULL)) {
printf("Output is valid utf-8.\n");
return 0;
} else {
printf("Output is NOT valid utf-8.\n");
return 1;
}
}
```
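One way to read the program above is that the encoder writes only the encoded bytes and reports how many, leaving termination to the caller. A minimal sketch of that contract, using a hypothetical stand-alone encoder `encode_utf8()` and wrapper `encode_utf8_terminated()` rather than GLib's own code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder (not GLib's): write the UTF-8 form of c into out,
 * which must have room for 4 bytes. Returns the number of bytes written, or
 * 0 if c is not a Unicode scalar value. No terminator is written. */
static size_t
encode_utf8 (uint32_t c, char *out)
{
  if (c < 0x80)
    {
      out[0] = (char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (char) (0xC0 | (c >> 6));
      out[1] = (char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      if (c >= 0xD800 && c <= 0xDFFF)
        return 0; /* surrogates are not encodable */
      out[0] = (char) (0xE0 | (c >> 12));
      out[1] = (char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c <= 0x10FFFF)
    {
      out[0] = (char) (0xF0 | (c >> 18));
      out[1] = (char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (char) (0x80 | (c & 0x3F));
      return 4;
    }
  return 0;
}

/* Hypothetical wrapper: the caller, not the encoder, adds the terminator.
 * out must have room for 5 bytes. */
static size_t
encode_utf8_terminated (uint32_t c, char *out)
{
  size_t n = encode_utf8 (c, out);

  out[n] = '\0';
  return n;
}
```

With GLib itself, the analogous pattern would rely on the byte count that `g_unichar_to_utf8()` returns, for example `buf[g_unichar_to_utf8 (0xbb, buf)] = '\0';`.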
The input is the guillemet character (») and the output is not valid UTF-8. If you change the `memset` to fill the buffer with zeros, the output is valid. This would indicate that the function does not write the zero terminator in its proper place.

UTF8 collation doesn’t respect underscore-prefixed files
https://gitlab.gnome.org/GNOME/glib/-/issues/1667
2023-12-27T01:06:36Z · Frank Brütting
Dot-prefixed files are sorted before others, but underscore-prefixed files are not. Can this please be corrected?
Here’s an example from Gnome Builder:
![Bildschirmfoto_von_2019-01-25_18-08-18](/uploads/d7aa753d75703ca575d79d3b1760f88a/Bildschirmfoto_von_2019-01-25_18-08-18.png)
Related issue: https://gitlab.gnome.org/GNOME/gnome-builder/issues/786

g_utf8_collate_key_for_filename() corner cases with digits
https://gitlab.gnome.org/GNOME/glib/-/issues/1344
2024-01-17T01:14:40Z · Bugzilla
## Submitted by Paul `@20YearsOfGnome`
**[Link to original bug (#793747)](https://bugzilla.gnome.org/show_bug.cgi?id=793747)**
## Description
Created attachment 368820
Screenshot of Nautilus sorting the test files by name
Moved here from the relevant Nautilus bug: https://gitlab.gnome.org/GNOME/nautilus/issues/264
Create some test files as follows:
`$ touch 000001000010-0.jpg 000001000010-A.jpg 000001A00010-0.jpg 000003BBF000-0.jpg 00003bA1A000-0.jpg 00003BD22000-0.jpg 0000A4AC3000-0.jpg 000100001 000100001.jpg 000200001`
View them at the command line and in Nautilus:
```
$ ls -1
000001000010-0.jpg
000001000010-A.jpg
000001A00010-0.jpg
000003BBF000-0.jpg
00003bA1A000-0.jpg
00003BD22000-0.jpg
0000A4AC3000-0.jpg
000100001
000100001.jpg
000200001
$ nautilus .
[see attached screenshot]
```
ls sorts files as one might expect. It is not case sensitive (unless you use a case sensitive locale, e.g. LANG=C), but sorts alphabetically.
Nautilus sorts the files in a bizarre order, regardless of which locale is used. Weird behaviours include:
* Longer but otherwise equal filenames sort before shorter ones
* Sometimes ignores runs of zeros, but not punctuation
* Seems to detect runs of digits and push them to the end
The actual behaviour is very complex and difficult to predict, though it must follow some internal logic. The end result is that files don't sort in any reasonable order. This impacts several Gnome applications, such as Eye of Gnome and Nautilus. Other applications, like Transmission, respect locale.
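Some of the surprise may come from digit runs being compared as numbers rather than character by character. The following sketch shows that general technique only; `natural_compare()` is a hypothetical function, it is byte-based and ASCII-only, and it is not the algorithm GLib uses to build filename collation keys.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparator: like strcmp(), except that runs of ASCII digits
 * are compared by numeric value. Runs that differ only in leading zeros
 * compare as equal in this sketch. */
static int
natural_compare (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (isdigit ((unsigned char) *a) && isdigit ((unsigned char) *b))
        {
          const char *ea, *eb;
          size_t la, lb;
          int r;

          /* Leading zeros do not change the numeric value. */
          while (*a == '0')
            a++;
          while (*b == '0')
            b++;

          /* Find the end of each digit run. */
          for (ea = a; isdigit ((unsigned char) *ea); ea++)
            ;
          for (eb = b; isdigit ((unsigned char) *eb); eb++)
            ;
          la = (size_t) (ea - a);
          lb = (size_t) (eb - b);

          /* More significant digits means a larger number. */
          if (la != lb)
            return la < lb ? -1 : 1;

          r = memcmp (a, b, la);
          if (r != 0)
            return r < 0 ? -1 : 1;

          a = ea;
          b = eb;
        }
      else
        {
          if (*a != *b)
            return (unsigned char) *a < (unsigned char) *b ? -1 : 1;
          a++;
          b++;
        }
    }

  if (*a == *b)
    return 0; /* both strings ended together */
  return *a == '\0' ? -1 : 1;
}
```

Under this comparator `000100001` sorts before `000001000010-0.jpg`, because 100001 is smaller than 1000010, whereas `strcmp()` puts them the other way round. That matches the impression that runs of digits are detected and pushed to the end.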
**Attachment 368820**, "Screenshot of Nautilus sorting the test files by name":
![there_was_an_attempt](/uploads/0a6444999fc1ca5275a0cd9dbf22390e/there_was_an_attempt.png)
Version: 2.54.x
### Blocking
* [Bug 355152](https://bugzilla.gnome.org/show_bug.cgi?id=355152)