glib should not create/handle long UTF-8 forms
Submitted by Roozbeh Pournader
Link to original bug (#391261)
Description
Presently, glib's UTF-8 functions use the ISO/IEC 10646 definition of UTF-8 both when interpreting and when generating UTF-8 data. As a result, they accept and generate UTF-8 sequences for values beyond U+10FFFF, the largest code point allowed by Unicode. Applications therefore receive invalid Unicode characters instead of an error, making glib non-conformant to The Unicode Standard.
As an example, the following piece of code accepts the ill-formed UTF-8 sequence <F4 90 80 80> and returns the invalid Unicode code point U+110000 without reporting an error:
#include <glib.h>
#include <stdio.h>

int main ()
{
  gunichar *result;
  gchar input[] = "\xF4\x90\x80\x80";

  result = g_utf8_to_ucs4 (input, -1, NULL, NULL, NULL);
  if (result != NULL)
    printf ("result is: U+%x\n", result[0]);
  g_free (result);
  return 0;
}
The same happens with g_unichar_to_utf8, which accepts invalid Unicode code points and generates ill-formed UTF-8 sequences.
Quoting relevant parts from the Unicode 5.0 book:
Page 73: "[Conformance clause] C9 When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences. [...]
C10 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. "
Page 103:
"Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed." [The patterns in Table 3-7, on page 104, do not match <F4 90 80 80>.]
We can of course claim that "we support ISO/IEC 10646's UTF-8" and ignore the problem altogether, but this is considered a security problem. Quoting UTR #36, Unicode Security Considerations:
http://www.unicode.org/reports/tr36/#Non_Visual_Recommendations
"A. Ensure that all implementations of UTF-8 used in a system are conformant to the latest version of Unicode. In particular,
A. Always use the so-called "shortest form" of UTF-8
B. Never go outside of 0..10FFFF₁₆
C. Never use 5 or 6 byte UTF-8."
Going this way also improves the performance of at least the functions that handle UTF-8 data, as the validity tests become simpler.
Not doing a patch yet as this may be controversial. Please comment.
Version: 2.12.x