Parsing XML documents encoded in UTF-8

You can parse XML documents that are encoded in Unicode UTF-8 in a manner similar to parsing other XML documents. However, some additional requirements apply.

About this task

To parse a UTF-8 XML document, code the XML PARSE statement as you normally would for any XML document:


XML PARSE xml-document
    PROCESSING PROCEDURE xml-event-handler
    . . .
END-XML

Observe the following additional requirements:
  • The parse data item (xml-document in the example above) must be category alphanumeric, and the CHAR(EBCDIC) compiler option must not be in effect.
  • To have the XML document parsed as UTF-8 rather than ASCII, ensure that at least one of the following conditions applies:
    • The runtime locale is a UTF-8 locale.
    • The document contains an XML encoding declaration that specifies UTF-8 (encoding="UTF-8").
    • The document starts with a UTF-8 byte order mark.
  • The document must not contain any characters that have a Unicode scalar value that is greater than x'FFFF'. Use a character reference ("&#xhhhhh;") for such characters.

The parser returns the XML document fragments in the alphanumeric special register XML-TEXT.
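
The following is a minimal sketch of a complete program that meets the requirements above, not a definitive implementation. The program name PARSEU8, the sample document, and the handler logic are illustrative assumptions. The sample document contains only ASCII characters, which are also valid UTF-8, and it relies on the encoding declaration to request UTF-8 parsing. The sketch also assumes that the CHAR(EBCDIC) compiler option is not in effect.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PARSEU8.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical sample document. It is category alphanumeric,
      * and its encoding declaration requests UTF-8 parsing.
       01  XML-DOCUMENT PIC X(80) VALUE
           '<?xml version="1.0" encoding="UTF-8"?><msg>hi</msg>'.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Assumes the CHAR(EBCDIC) compiler option is not in effect.
           XML PARSE XML-DOCUMENT
               PROCESSING PROCEDURE XML-EVENT-HANDLER
               ON EXCEPTION
                   DISPLAY 'XML error, XML-CODE = ' XML-CODE
               NOT ON EXCEPTION
                   DISPLAY 'Parse complete'
           END-XML
           GOBACK.
       XML-EVENT-HANDLER.
      * XML-TEXT holds the fragment for the current event.
           EVALUATE XML-EVENT
               WHEN 'START-OF-ELEMENT'
                   DISPLAY 'Start of element: ' XML-TEXT
               WHEN 'CONTENT-CHARACTERS'
                   DISPLAY 'Content:          ' XML-TEXT
               WHEN 'END-OF-ELEMENT'
                   DISPLAY 'End of element:   ' XML-TEXT
               WHEN OTHER
                   CONTINUE
           END-EVALUATE.
```

The handler only displays XML-TEXT as a whole. It does not use reference modification or truncating moves on the fragment, so it cannot split a multibyte character.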

UTF-8 characters are encoded using a variable number of bytes per character. Most COBOL operations on alphanumeric data assume a single-byte encoding, in which each character is encoded in 1 byte. When you operate on UTF-8 characters as alphanumeric data, you must ensure that the data is processed correctly. Avoid operations (such as reference modification and moves that involve truncation) that can split a multibyte character between bytes. You cannot reliably use statements such as INSPECT to process multibyte characters in alphanumeric data.

Related concepts
XML-TEXT and XML-NTEXT  

Related references
CHAR
The encoding of XML documents
XML PARSE statement (COBOL for Linux® on x86 Language Reference)