[BZ#696407] add a function to convert arbitrary data to valid JSON strings
Submitted by Emmanuele Bassi (:ebassi) <<eba..@..com>>
Assigned to json-glib-maint@gnome.bugs
Link to original bug (#696407)
Description
JSON strings are defined in the RFC as:
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
see: http://www.ietf.org/rfc/rfc4627.txt?number=4627
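as a worked example of the surrogate pair rule quoted above, here is a minimal sketch in plain C. the helper name `escape_supplementary` and its buffer convention are made up for illustration and are not part of JSON-GLib; it only shows the arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not JSON-GLib API: writes the twelve-character
 * escape for a code point outside the Basic Multilingual Plane into buf,
 * which must hold at least 13 bytes (12 characters plus the NUL). */
static void
escape_supplementary (uint32_t cp, char buf[13])
{
  /* Only U+10000..U+10FFFF need a surrogate pair. */
  assert (cp >= 0x10000 && cp <= 0x10FFFF);

  uint32_t v = cp - 0x10000;                       /* 20-bit value */
  unsigned int hi = 0xD800 + (unsigned int) (v >> 10);   /* high surrogate: top 10 bits */
  unsigned int lo = 0xDC00 + (unsigned int) (v & 0x3FF); /* low surrogate: bottom 10 bits */

  snprintf (buf, 13, "\\u%04X\\u%04X", hi, lo);
}
```

for the G clef, U+1D11E minus 0x10000 is 0xD11E; the top ten bits are 0x34 and the bottom ten are 0x11E, which gives 0xD834 and 0xDD1E, matching the RFC example.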
the parsing code can deal with escaped Unicode code points in both UTF-8 and UTF-16 surrogate pairs, but we don't have anything that can generate escape sequences from arbitrary data.
all functions in JSON-GLib dealing with strings assume that the string is UTF-8 encoded and contains no control characters; we cannot change that in a backward-compatible way: we'd either have to add a "length" argument to all functions dealing with strings, or duplicate each entry point dealing with strings.
instead, we could add a function like:
char *json_escape_string (const guint8 *data, gsize len);
that behaves like g_markup_escape_text(), and escapes all Unicode characters (including control characters) into \uXXXX and \uXXXX\uXXXX sequences.
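a rough sketch of what such a function could look like, written in plain C so it stands alone. everything here is an assumption rather than a proposed implementation: the name `sketch_json_escape_string`, the use of `unsigned char`/`size_t` in place of `guint8`/`gsize`, the choice to pass printable ASCII through unchanged, the choice to replace malformed UTF-8 bytes with U+FFFD, and the choice not to add the surrounding quotation marks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Decodes one UTF-8 sequence from p (n >= 1 bytes available). Returns the
 * number of bytes consumed. Malformed input (assumption of this sketch)
 * consumes one byte and yields U+FFFD. */
static size_t
decode_utf8 (const unsigned char *p, size_t n, uint32_t *cp)
{
  static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t need = 0;
  uint32_t c = p[0];

  if (c < 0x80) { *cp = c; return 1; }
  else if ((c & 0xE0) == 0xC0) { need = 2; c &= 0x1F; }
  else if ((c & 0xF0) == 0xE0) { need = 3; c &= 0x0F; }
  else if ((c & 0xF8) == 0xF0) { need = 4; c &= 0x07; }

  if (need == 0 || need > n)
    goto bad;
  for (size_t i = 1; i < need; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        goto bad;
      c = (c << 6) | (p[i] & 0x3F);
    }
  /* Reject overlong forms, out-of-range values and encoded surrogates. */
  if (c < min_cp[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    goto bad;
  *cp = c;
  return need;

bad:
  *cp = 0xFFFD;
  return 1;
}

/* Hypothetical sketch, not JSON-GLib API. Returns a newly allocated,
 * NUL-terminated ASCII string (free with free()), or NULL on failure. */
char *
sketch_json_escape_string (const unsigned char *data, size_t len)
{
  assert (data != NULL || len == 0);
  if (len > (SIZE_MAX - 1) / 6)
    return NULL;

  /* Worst case is six output characters per input byte. */
  char *out = malloc (len * 6 + 1);
  if (out == NULL)
    return NULL;

  char *w = out;
  size_t i = 0;
  while (i < len)
    {
      uint32_t cp;
      i += decode_utf8 (data + i, len - i, &cp);

      if (cp == '"' || cp == '\\')
        { *w++ = '\\'; *w++ = (char) cp; }        /* two-character escape */
      else if (cp >= 0x20 && cp < 0x7F)
        *w++ = (char) cp;                         /* printable ASCII as-is */
      else if (cp < 0x10000)
        w += sprintf (w, "\\u%04X", (unsigned int) cp);
      else
        {
          cp -= 0x10000;                          /* UTF-16 surrogate pair */
          w += sprintf (w, "\\u%04X\\u%04X",
                        (unsigned int) (0xD800 + (cp >> 10)),
                        (unsigned int) (0xDC00 + (cp & 0x3FF)));
        }
    }
  *w = '\0';
  return out;
}
```

because the length is passed explicitly, an embedded NUL byte comes out as \u0000 instead of truncating the input, which is the case the existing NUL-terminated entry points cannot handle. how malformed bytes should really be treated (replace, reject, or escape byte by byte) is left open here.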