[META] normalize strings for sorting, searching, comparison, filenames, etc.
@moyogo
Submitted by Denis Jacquerye Link to original bug (#423036)
Description
This is a metabug: it is not a GLib bug, but rather covers applications that use GLib without doing "the right thing" with strings.
Unicode defines canonically equivalent sequences of characters. For example, these are equivalent:
- ẹ́ <U+0065 LATIN SMALL LETTER E + U+0323 COMBINING DOT BELOW + U+0301 COMBINING ACUTE ACCENT>
- ẹ́ <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT + U+0323 COMBINING DOT BELOW>
- ẹ́ <U+1EB9 LATIN SMALL LETTER E WITH DOT BELOW + U+0301 COMBINING ACUTE ACCENT>
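The equivalence of the three sequences above can be checked mechanically. The sketch below uses Python's standard `unicodedata` module rather than GLib, purely as an illustration; in C, g_utf8_normalize() with G_NORMALIZE_NFD or G_NORMALIZE_NFC would play the same role. The variable names are made up for this example.

```python
import unicodedata

# The three spellings of the same character, written as escapes so the
# code points are unambiguous.
dot_then_acute = "\u0065\u0323\u0301"   # e + dot below + acute
acute_then_dot = "\u0065\u0301\u0323"   # e + acute + dot below
precomposed_dot = "\u1EB9\u0301"        # e-with-dot-below + acute

spellings = [dot_then_acute, acute_then_dot, precomposed_dot]

# The raw strings are three different code point sequences...
distinct_raw = len(set(spellings))

# ...but each normalization form maps all of them to one single string.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}

print(distinct_raw, len(nfd_forms), len(nfc_forms))
```

A plain byte-wise comparison sees three different strings here, which is exactly the situation this bug is about.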
For sorting, g_utf8_collate() should be used instead of strcmp().
For comparison, e.g. for matching strings in a search, g_utf8_normalize() should be used on both strings before strcmp(), with either G_NORMALIZE_DEFAULT (= G_NORMALIZE_NFD) or G_NORMALIZE_DEFAULT_COMPOSE (= G_NORMALIZE_NFC), as long as the same mode is used for both.
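As a rough sketch of the normalize-then-compare pattern, here is a small Python version using the standard `unicodedata` module as a stand-in for g_utf8_normalize() followed by strcmp(). The helper names `canonically_equal` and `find_matches` are hypothetical and not part of any library.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    # Both sides must use the same form; NFC is chosen arbitrarily here,
    # NFD would work equally well.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def find_matches(query: str, candidates: list[str]) -> list[str]:
    """Return the candidates that are canonically equivalent to the query."""
    return [c for c in candidates if canonically_equal(query, c)]


# The stored entry uses the precomposed spelling, while the query was
# typed as e + acute + dot below.
entries = ["\u1EB9\u0301", "e"]
query = "\u0065\u0301\u0323"
matches = find_matches(query, entries)
```

Without the normalization step, the query would match nothing, even though the user typed what looks like the same text.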
Applications should also normalize names before creating files, i.e. Unicode-equivalent filenames should be treated as one and the same filename.
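One way an application might check for such a collision before creating a file is sketched below. This is only an illustration in Python with the standard `unicodedata` module. It works on a plain list of names rather than a real directory, and `find_equivalent_name` is a hypothetical helper. A GLib application would do the equivalent with g_utf8_normalize().

```python
import unicodedata


def find_equivalent_name(new_name, existing_names):
    """Return the existing name canonically equivalent to new_name, or None."""
    wanted = unicodedata.normalize("NFC", new_name)
    for name in existing_names:
        if unicodedata.normalize("NFC", name) == wanted:
            return name
    return None


# Names already present, e.g. as returned by a directory listing.
existing = ["\u1EB9\u0301.txt", "notes.txt"]

# Same visible name, different code point sequence: this is a clash.
clash = find_equivalent_name("\u0065\u0301\u0323.txt", existing)

# A genuinely new name: no clash.
no_clash = find_equivalent_name("new.txt", existing)
```

If a clash is found, the application can reuse the existing name or warn the user instead of silently creating a second file that looks identical.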
Remember that the user doesn't care about byte values or character sequences. Input methods might produce one sequence or another; applications should handle the rest.