Preamble

Lasaris

Outline

  • Characters in XML
  • UNICODE

Characters in XML documents

The specification does not allow every character at every place in an XML document (e.g. in element names or attribute content). It is good to know the meaning of:

  • character sets (sets of characters with their respective codes/numbers), i.e. the assignment of an ordinal value to each character (e.g. Unicode), and
  • character encodings (of a given set), e.g. UTF-8, i.e. how the ordinal value is encoded into a sequence of bytes
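The distinction can be illustrated with a minimal Python sketch. The sample character and the variable names are chosen here for illustration only:

```python
# Character set: every character has an ordinal value (code point).
ch = "č"                      # LATIN SMALL LETTER C WITH CARON
code_point = ord(ch)          # ordinal value in Unicode

# Character encoding: the ordinal value is turned into a sequence of bytes.
utf8_bytes = ch.encode("utf-8")
latin2_bytes = ch.encode("iso-8859-2")

print(f"U+{code_point:04X}")  # the code point, independent of any encoding
print(utf8_bytes.hex())       # bytes in UTF-8
print(latin2_bytes.hex())     # bytes in ISO-8859-2
```

The code point stays the same, while the bytes depend on the chosen encoding.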

Unicode and ISO 10646 Standards

Both standards try to resolve the same problem: character sets with more than 256 characters.

Originally 16-bit Unicode

up to 64K characters (65,536 code points); enough for European languages/alphabets, but not sufficient for all world languages (e.g. Chinese).

32-bit Unicode

covers "everything".

Unicode and ISO 10646 Standards

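  • In practice, most text in commonly used languages falls within the Basic Multilingual Plane (BMP), the first 65,536 code points of the full set.
  • For names in XML (the non-terminal Qualified Name, QName), only BMP characters may be used under XML 1.0 up to the Fourth Edition; the Fifth Edition also admits characters from the supplementary planes.
  • Elsewhere, any allowed Unicode character may be used.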
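Whether a character lies in the BMP can be checked from its ordinal value alone. The following is a small sketch; the helper names `plane` and `in_bmp` are hypothetical and introduced only for this example:

```python
def plane(ch: str) -> int:
    """Return the Unicode plane number of a single character (0 = BMP)."""
    return ord(ch) >> 16


def in_bmp(ch: str) -> bool:
    """True if the character's code point is at most xFFFF."""
    return plane(ch) == 0


# "\U0001D11E" is MUSICAL SYMBOL G CLEF, a character outside the BMP.
for ch in ("A", "ž", "中", "\U0001D11E"):
    print(f"U+{ord(ch):04X} plane={plane(ch)} in BMP: {in_bmp(ch)}")
```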

Unicode encodings

All XML applications (particularly parsers) must be able to process some Unicode encodings. The most common in CZ/SK/EU are:

8-bit, traditional

US-ASCII, ISO-8859-2 (ISO Latin 2), Windows-1250 (=Cp1250); each covers just a subset of Unicode.

UTF-8

encodes all Unicode characters, each into 1-4 bytes (variable length; the original definition allowed up to 6): US-ASCII characters take 1 byte, Czech/Slovak accented characters 2 bytes.

UTF-16

same principle as UTF-8, but the basic unit is a 16-bit (2-byte) word: BMP characters take one unit, all other characters take two (a surrogate pair).
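The encodings on this slide can be compared with a short Python sketch. It assumes Python's built-in codec names (`ascii`, `iso-8859-2`, `cp1250`, `utf-8`, `utf-16-be`); the helper `encoded_lengths` and the sample characters are made up for this example:

```python
ENCODINGS = ("ascii", "iso-8859-2", "cp1250", "utf-8", "utf-16-be")


def encoded_lengths(ch: str) -> dict:
    """Map each encoding name to the byte length of ch, or None if ch is not in that set."""
    result = {}
    for name in ENCODINGS:
        try:
            # "utf-16-be" is used so that no byte-order mark is added to the count.
            result[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            result[name] = None
    return result


for ch in ("A", "š", "€", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# The same character may get a different byte value in different 8-bit encodings.
print("š".encode("iso-8859-2").hex(), "š".encode("cp1250").hex())
```

A `None` in the output marks a character that the given 8-bit set does not contain, which is the practical meaning of "just a subset of Unicode".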

Unicode encodings

UCS-2

direct encoding of Unicode: characters from the BMP are represented directly by their ordinal numbers, 2 bytes each

UCS-4

ditto, but for the whole of Unicode at 4 bytes per character; not efficient, as even US-ASCII and European-language text takes 4 bytes per character.

UTF

The UTF encodings are the most important for XML, particularly UTF-8 (but parsers must support both UTF-8 and UTF-16).
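The space cost of a fixed 4-byte encoding can be seen in a rough Python sketch. Python has no codec named UCS-4, so `utf-32-be` is used here as a stand-in (an assumption of this example; both store one character in 4 bytes). The helper `sizes` and the sample texts are illustrative only:

```python
def sizes(text: str) -> dict:
    """Return the encoded size in bytes of text for three Unicode encodings."""
    # The "-be" variants are used so that no byte-order mark is counted.
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


ascii_text = "Hello"
czech_text = "Příliš žluťoučký kůň"

for text in (ascii_text, czech_text):
    print(repr(text), len(text), "characters:", sizes(text))
```

For both samples the 4-byte form is the largest, since it spends 4 bytes on every character regardless of its ordinal value.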

Allowed chars

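  • Any Unicode character up to x10FFFF, except xFFFE, xFFFF and the surrogate range xD800 to xDFFF; of the control characters below x20, only tab (x9), line feed (xA) and carriage return (xD) are allowed.
  • Names must be composed of non-whitespace characters: numerals, letters, . (dot), - (hyphen/minus), _ (underscore), : (colon), etc., and must start with a letter, underscore or colon.
  • Which encoding of the Unicode characters is used is not important here.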
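These rules can be sketched in Python as follows. The character test follows the ranges of the XML 1.0 `Char` production; the name test is a deliberately simplified, ASCII-only approximation (the real rule also admits many non-ASCII letters). Both function names are hypothetical:

```python
import re


def is_xml_char(ch: str) -> bool:
    """True if ch is an allowed XML 1.0 character."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
            or 0x20 <= cp <= 0xD7FF        # up to just before the surrogates
            or 0xE000 <= cp <= 0xFFFD      # excludes xFFFE and xFFFF
            or 0x10000 <= cp <= 0x10FFFF)  # supplementary planes


# ASCII-only simplification: starts with a letter, "_" or ":",
# continues with letters, numerals, ".", "-", "_" or ":".
_ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")


def is_ascii_xml_name(name: str) -> bool:
    """True if name is a valid XML name built from ASCII characters only."""
    return _ASCII_NAME.fullmatch(name) is not None


print(is_xml_char("a"), is_xml_char("\uFFFE"))
print(is_ascii_xml_name("xsl:template"), is_ascii_xml_name("1abc"))
```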

Allowed chars

  • Unless the prolog indicates otherwise, e.g. <?xml version="1.0" encoding="Windows-1250"?>, UTF-8 or UTF-16 is assumed.
  • UTF-8 and UTF-16 are told apart by the first two bytes of the document entity (i.e. the file): a UTF-16 document starts with the byte-order mark (character xFEFF), which appears as the bytes xFE xFF (big-endian) or xFF xFE (little-endian).
  • If no byte-order mark is present, UTF-8 is assumed; thus UTF-8 is the default encoding of Unicode in XML.
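The detection steps can be sketched in Python roughly as below. This is a simplified illustration, not the full autodetection procedure of the XML specification: it ignores UTF-32 and UTF-16 documents without a byte-order mark, and the function name `guess_xml_encoding` is made up for this example. The check for a UTF-8 byte-order mark (bytes xEF xBB xBF) is an addition not mentioned in the slides.

```python
import re

# Encoding declaration in the XML prolog, e.g. encoding="Windows-1250".
_DECL = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document entity from its first bytes."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"          # UTF-8 byte-order mark
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"      # byte-order mark, big-endian
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"      # byte-order mark, little-endian
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"              # default when nothing else is indicated


doc = b'<?xml version="1.0" encoding="Windows-1250"?><a/>'
print(guess_xml_encoding(doc))
print(guess_xml_encoding(b"<a/>"))
```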