Preamble

Lasaris

Outline

Characters in XML
UNICODE

Characters in XML documents

At some places in XML documents (e.g. element names, attribute content…) the specification does not allow all characters. It is good to know the meaning of:

character set (a set of characters with their respective codes/numbers), i.e. assigning an ordinal value to each character (e.g. Unicode), and
character encoding (of a given set), e.g. UTF-8, i.e. how the ordinal value is encoded into a sequence of bytes
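
The difference between the two terms can be illustrated with a minimal Python sketch. The sample character and the variable names are chosen only for illustration; the codec names are those of the Python standard library.

```python
# Character set: the character is assigned an ordinal value (code point).
ch = "š"                       # LATIN SMALL LETTER S WITH CARON
code_point = ord(ch)           # 0x161 in Unicode

# Character encoding: the same character becomes a sequence of bytes,
# and the bytes differ from one encoding to another.
utf8_bytes = ch.encode("utf-8")          # two bytes
latin2_bytes = ch.encode("iso-8859-2")   # one byte
cp1250_bytes = ch.encode("cp1250")       # one byte, different from Latin 2

print(hex(code_point), utf8_bytes, latin2_bytes, cp1250_bytes)
```

The ordinal value stays the same in all three cases; only the byte representation changes.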

Unicode and ISO 10646 Standards

Both standards address the same problem: character sets with more than 256 chars.

Originally 16-bit Unicode: up to 64 K chars, enough for European languages/alphabets, not sufficient for all world languages (e.g. Chinese).
32-bit Unicode: covers "everything".

Unicode and ISO 10646 Standards

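Nowadays, out of the 32-bit set mostly the Basic Multilingual Plane (BMP, code points x0000-xFFFF) is used; it covers most commonly used languages.
For names in XML (the non-terminal Qualified Name, QName), XML 1.0 up to its Fourth Edition allows only BMP chars; the Fifth Edition and XML 1.1 also admit chars beyond the BMP.
Elsewhere in the document, any Unicode char may be used.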
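
Whether a char belongs to the BMP can be checked from its ordinal value alone. A minimal sketch; the helper name in_bmp and the sample chars are assumptions made for this example.

```python
def in_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane (x0000-xFFFF)."""
    return ord(ch) <= 0xFFFF

# ASCII letter, Czech letter, Chinese ideograph, emoji (outside the BMP)
samples = ["A", "ř", "中", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X} in BMP: {in_bmp(ch)}")
```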

Unicode encodings

All XML applications (particularly parsers) must be able to process some Unicode encodings. The most common in CZ/SK/EU are:

8-bit, traditional: US-ASCII, ISO-8859-2 (ISO Latin 2), Windows-1250 (=Cp1250) - each covers just a subset of Unicode.
UTF-8: encoding of all chars in Unicode, each char to 1-4 bytes (variable length; the original definition allowed up to 6), US-ASCII to 1 byte, Czech/Slovak accented chars to 2 bytes.
UTF-16: same principle as UTF-8, but a 16-bit (2-byte) word is the basic unit; chars outside the BMP take two such units (a surrogate pair).

Unicode encodings

UCS-2: direct encoding of Unicode, chars from the BMP are directly represented by their ordinal numbers in 2 bytes
UCS-4: ditto, but for the whole of Unicode at 4 bytes per char - not efficient, 4 bytes even for US-ASCII, EU languages…
UTF encodings are the most important for XML, particularly UTF-8 (but parsers must support both UTF-8 and UTF-16).
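
The space cost of the individual encodings can be compared with a short sketch. It assumes that Python's utf-32 codec may stand in for UCS-4 (the two agree for valid code points); the -be codec variants are used only so that no byte order mark is added to the counts. The name encoded_sizes is hypothetical.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def encoded_sizes(ch: str) -> dict:
    """Number of bytes a single character occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

# ASCII letter, Czech letter, Chinese ideograph, char outside the BMP
for ch in ("A", "ř", "中", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 grows with the ordinal value, UTF-16 needs 2 bytes inside the BMP and 4 outside it, and the 4-byte encoding always spends 4 bytes.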

Allowed chars

Any chars from Unicode up to x10FFFF (except xFFFE, xFFFF, the range xD800-xDFFF, and control chars below x20 other than tab, line feed and carriage return).
Names must be composed of non-whitespace chars: numerals, letters, . (dot), - (hyphen, minus), _ (underscore), : (colon) etc., and must start with a letter, _ (underscore) or : (colon).
The encoding of the Unicode chars does not matter here: the rules apply to ordinal values, not to bytes.
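
The rule for allowed chars can be written as a check on the ordinal value. This is a sketch following the Char production of XML 1.0, which also excludes most control chars below x20; the function name is_xml_char is an assumption of this example.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point may appear in an XML 1.0 document."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # skips xD800-xDFFF, xFFFE, xFFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )

for cp in (0x41, 0x0, 0xD800, 0xFFFE, 0x10FFFF):
    print(f"x{cp:X}: {is_xml_char(cp)}")
```

The check works on ordinal values only, so it gives the same answer whatever encoding the document is stored in.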

Allowed chars

Unless the prolog indicates otherwise, e.g. <?xml version="1.0" encoding="Windows-1250"?>, UTF-8 or UTF-16 is used.
The distinction between UTF-8 and UTF-16 is made according to the first two bytes of the document entity (i.e. the file), the so-called byte order mark (BOM): the char xFEFF, stored in UTF-16 as bytes FE FF (big-endian) or FF FE (little-endian).
If it is not present, UTF-8 is assumed, thus UTF-8 is the default encoding of Unicode in XML.
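
The decision described above can be sketched as follows. This is a simplified illustration, not what a real parser does in full: it ignores the encoding declaration in the prolog and any UTF-32 input, and the function name guess_xml_encoding is hypothetical. The BOM constants come from the Python standard module codecs.

```python
import codecs

def guess_xml_encoding(data: bytes) -> str:
    """Guess UTF-8 vs UTF-16 from the leading bytes of a document entity."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    # No UTF-16 byte order mark: fall back to the default, UTF-8.
    # (A UTF-8 file may start with the optional mark EF BB BF.)
    return "utf-8"

doc = "<a/>"
print(guess_xml_encoding(codecs.BOM_UTF16_LE + doc.encode("utf-16-le")))
print(guess_xml_encoding(doc.encode("utf-8")))
```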