Outline
- Characters in XML
- Unicode
Characters in XML documents
The specification does not allow every character at every place in an XML document (e.g. in element names or attribute content). It is useful to distinguish:
- character sets (sets of characters with their respective codes/numbers), i.e. the assignment of an ordinal value to each character (e.g. Unicode), and
- character encodings (for a given set), e.g. UTF-8, i.e. the way the ordinal value is encoded into a sequence of bytes.
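The difference between a character set and a character encoding can be illustrated with a small Python sketch. The sample character and the variable names are chosen only for illustration; the codec names are those of the Python standard library.

```python
# Character set: every character has one ordinal value (code point).
char = "š"                      # LATIN SMALL LETTER S WITH CARON
code_point = ord(char)          # 0x161 in Unicode

# Character encoding: the same ordinal value becomes different byte sequences.
utf8_bytes = char.encode("utf-8")          # two bytes: C5 A1
latin2_bytes = char.encode("iso-8859-2")   # one byte: B9
cp1250_bytes = char.encode("cp1250")       # one byte: 9A

print(hex(code_point), utf8_bytes.hex(), latin2_bytes.hex(), cp1250_bytes.hex())
```

One character, one code point, but three different byte representations depending on the encoding.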
Unicode and ISO 10646 Standards
Both standards address the same problem: character sets with more than 256 characters.
- Originally 16-bit Unicode
  - up to 64K characters, enough for European languages/alphabets, but not sufficient for all world languages (e.g. Chinese).
- 32-bit Unicode
  - covers "everything".
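- Nowadays, out of the 32-bit set, mostly just the Basic Multilingual Plane (BMP) is used; it covers most of the commonly used languages.
- For names in XML (the non-terminal Qualified Name, QName), only BMP characters may be used (this holds for XML 1.0 up to the Fourth Edition; the Fifth Edition also admits characters beyond the BMP).
- Otherwise, any Unicode character may be used.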
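Whether a character belongs to the BMP can be checked from its ordinal value alone, since the BMP is the range x0000 to xFFFF. A minimal sketch; the helper name `in_bmp` is hypothetical:

```python
def in_bmp(ch: str) -> bool:
    """Return True if the single character ch lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

# Czech and Chinese characters are in the BMP, an emoji is not.
for ch in ["č", "中", "😀"]:
    print(f"U+{ord(ch):04X}", in_bmp(ch))
```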
Unicode encodings
All XML applications (particularly parsers) must be able to process some Unicode encodings. The most common in CZ/SK/EU are:
- 8-bit, traditional
  - US-ASCII, ISO-8859-2 (ISO Latin 2), Windows-1250 (= Cp1250); each covers just a subset of Unicode.
- UTF-8
  - encodes all Unicode characters, each into a variable number of bytes: 1 to 4 in the current definition (the original design allowed up to 6); US-ASCII characters take 1 byte, Czech/Slovak accented characters 2 bytes.
- UTF-16
  - same principle as UTF-8, but the basic unit is a 16-bit (2-byte) word.
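The variable length of both UTF encodings can be observed directly. The following sketch, with sample characters chosen for illustration, counts the bytes each character takes; `utf-16-be` is used so that no byte order mark is added to the result.

```python
samples = ["A", "č", "中", "😀"]   # US-ASCII, Czech, Chinese (BMP), outside the BMP

# Number of bytes per character in each encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
utf16_lengths = {ch: len(ch.encode("utf-16-be")) for ch in samples}

for ch in samples:
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_lengths[ch]} B, UTF-16 {utf16_lengths[ch]} B")
```

In UTF-8 the lengths grow from 1 to 4 bytes; in UTF-16 every BMP character takes one 16-bit unit and a character outside the BMP takes two.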
- UCS-2
  - direct encoding of Unicode: characters from the BMP are represented directly by their ordinal numbers (2 bytes each).
- UCS-4
  - ditto, but for the whole of Unicode at 4 bytes per character; not efficient, since even US-ASCII and European-language characters take 4 bytes.
- The UTF encodings are the most important for XML, particularly UTF-8 (but parsers must support both UTF-8 and UTF-16).
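"Direct encoding" means that the stored bytes are simply the ordinal number of the character. Python has no codec called UCS-4, so this sketch assumes `utf-32-be` as a stand-in, which produces the same 4 bytes per character; the variable names are illustrative.

```python
ch = "A"
ucs4_bytes = ch.encode("utf-32-be")               # 4 bytes even for US-ASCII
ucs4_value = int.from_bytes(ucs4_bytes, "big")    # the bytes read back as a number

# For a BMP character, one 16-bit unit holds the ordinal number directly.
bmp_char = "č"
two_bytes = bmp_char.encode("utf-16-be")
two_byte_value = int.from_bytes(two_bytes, "big")

print(ucs4_bytes.hex(), ucs4_value == ord(ch))
print(two_bytes.hex(), two_byte_value == ord(bmp_char))
```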
Allowed chars
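- Any Unicode character up to x10FFFF, except xFFFE, xFFFF, the range xD800 to xDFFF, and control characters below x20 other than tab, line feed and carriage return.
- Names must be composed of non-whitespace characters: numerals, letters, . (dot), - (hyphen, minus), _ (underscore), : (colon), etc.; a name must start with a letter or …
- The encoding of the Unicode characters does not matter here: the rules are stated in terms of character codes, not bytes.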
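The allowed ranges can be written down as a simple test on the ordinal value. This is a sketch following the Char production of XML 1.0, which in addition to the exclusions listed above also rules out most control characters below x20; the function name `is_xml_char` is hypothetical.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch may appear in an XML 1.0 document (Char production)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to just below the surrogate range
        or 0xE000 <= cp <= 0xFFFD      # skips xD800 to xDFFF, xFFFE and xFFFF
        or 0x10000 <= cp <= 0x10FFFF   # characters beyond the BMP
    )

print(is_xml_char("A"), is_xml_char("\x00"), is_xml_char("\ufffe"))
```

The check works on code points only, so it gives the same answer whatever encoding the document is stored in.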
- Unless the prolog indicates otherwise, e.g.
  <?xml version="1.0" encoding="Windows-1250"?>
  UTF-8 or UTF-16 is used by default.
- UTF-8 and UTF-16 are distinguished by the first two bytes of the document entity (i.e. the file), the so-called byte order mark (character xFEFF, stored as the bytes FE FF or FF FE depending on the byte order).
- If it is not present, UTF-8 is assumed; thus UTF-8 is the default encoding of Unicode in XML.
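The decision between UTF-8 and UTF-16 can be sketched as a check of the first two bytes. This is a simplified illustration, not what a real parser does in full: it ignores the encoding declaration in the prolog and any other encodings, and the function name `guess_xml_encoding` is hypothetical.

```python
import codecs

def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document entity from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return "utf-8"                             # no byte order mark: UTF-8 assumed

print(guess_xml_encoding("<a/>".encode("utf-8")))
print(guess_xml_encoding(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")))
```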