... | ... | @@ -2,14 +2,6 @@ If you are not really familiar with Internationalization (usual shortcut is I18N |
If you don't understand why **it does not make sense to have a string without knowing what encoding it uses**, then, as Joel Spolsky said, [please do not write another line of code until you finish reading that article](http://www.joelonsoftware.com/articles/Unicode.html). It is a prerequisite for understanding this page, and it will help you avoid many problems with libxml2, XML, or text processing in general.
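
To make the point about strings and encodings concrete, here is a minimal sketch in plain C. It does not use libxml2's API; the helper names `decode_latin1` and `decode_utf8` are hypothetical and exist only for this illustration. The UTF-8 decoder is deliberately simplified: it rejects bad lead and continuation bytes but does not check for overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode ISO-8859-1. Every byte is one character
 * whose code point equals the byte value.
 * Returns the number of code points written to out. */
size_t decode_latin1(const unsigned char *in, size_t len,
                     uint32_t *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len && n < cap; i++)
        out[n++] = in[i];
    return n;
}

/* Hypothetical helper: decode UTF-8 (simplified, see the text above).
 * Returns the number of code points written to out,
 * or (size_t)-1 if the input is malformed. */
size_t decode_utf8(const unsigned char *in, size_t len,
                   uint32_t *out, size_t cap)
{
    size_t n = 0, i = 0;
    while (i < len && n < cap) {
        unsigned char b = in[i];
        uint32_t cp;
        size_t extra;              /* number of continuation bytes */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;    /* stray continuation or invalid byte */

        if (extra > len - i - 1)
            return (size_t)-1;     /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            if ((in[i + k] & 0xC0) != 0x80)
                return (size_t)-1; /* not a continuation byte */
            cp = (cp << 6) | (in[i + k] & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Given the two bytes `0xC3 0xA9`, `decode_utf8` yields the single code point U+00E9 (é), while `decode_latin1` yields two code points, U+00C3 and U+00A9 (Ã©). Conversely, the single byte `0xE9`, which is é in ISO-8859-1, is rejected by `decode_utf8` as a truncated sequence. The bytes alone do not tell you which reading is the right one.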

Table of Contents:

1. [What does internationalization support mean?](http://xmlsoft.org/encoding.html#What)
2. [The internal encoding, how and why](http://xmlsoft.org/encoding.html#internal)
3. [How is it implemented?](http://xmlsoft.org/encoding.html#implemente)
4. [Default supported encodings](http://xmlsoft.org/encoding.html#Default)
5. [How to extend the existing support](http://xmlsoft.org/encoding.html#extend)

### What does internationalization support mean?
XML was designed from the start to support any character set by using Unicode. Every conformant XML parser has to support the UTF-8 and UTF-16 encodings, both of which can express the full Unicode range. UTF-8 is a variable-length encoding whose main advantages are that it encodes ASCII characters unchanged and saves space for Western languages, but it is a bit more complex to handle in practice. UTF-16 uses 2 bytes per character (and sometimes a pair of 2-byte units, known as a surrogate pair, for a single character); this makes implementation easier but is somewhat wasteful for Western languages. Moreover, the XML specification allows a document to use other encodings on the condition that they are clearly labeled as such. For example, the following is a well-formed XML document encoded in ISO-8859-1 and using the accented letters that we French like, in both markup and content:
... | ... | |