Parsing XML documents encoded in UTF-8
You can parse XML documents that are encoded in Unicode UTF-8 in a manner similar to parsing other XML documents. However, some additional requirements apply.
About this task
To parse a UTF-8 XML document, code the XML PARSE statement as you would normally for parsing XML documents:
XML PARSE xml-document
PROCESSING PROCEDURE xml-event-handler
. . .
END-XML
- The parse data item (xml-document in the example above) must be category alphanumeric, and the CHAR(EBCDIC) compiler option must not be in effect.
- So that the XML document will be parsed as UTF-8 rather than ASCII, ensure that at least one of the following conditions applies:
  - The runtime locale is a UTF-8 locale.
  - The document contains an XML encoding declaration that specifies UTF-8 (encoding="UTF-8").
  - The document starts with a UTF-8 byte order mark.
- The document must not contain any characters that have a Unicode scalar value that is greater than x'FFFF'. Use a character reference ("&#xhhhhh;") for such characters.
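The requirements above can be illustrated with a minimal sketch. It assumes that the CHAR(EBCDIC) compiler option is not in effect and relies on the encoding declaration, rather than the runtime locale or a byte order mark, to identify the document as UTF-8. The program name, the data item, the paragraph name, and the sample document are all hypothetical.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PARSEU8.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical sample document. The encoding declaration
      * identifies it as UTF-8. The item is category alphanumeric.
       01  XML-DOCUMENT       PIC X(54) VALUE
           '<?xml version="1.0" encoding="UTF-8"?><msg>hello</msg>'.
       PROCEDURE DIVISION.
           XML PARSE XML-DOCUMENT
               PROCESSING PROCEDURE XML-EVENT-HANDLER
             ON EXCEPTION
               DISPLAY 'XML error, XML-CODE = ' XML-CODE
             NOT ON EXCEPTION
               DISPLAY 'Parse complete'
           END-XML
           GOBACK.
      * The parser passes control here once for each XML event.
       XML-EVENT-HANDLER.
           EVALUATE XML-EVENT
             WHEN 'START-OF-ELEMENT'
               DISPLAY 'Start element: ' XML-TEXT
             WHEN 'CONTENT-CHARACTERS'
               DISPLAY 'Content: ' XML-TEXT
             WHEN OTHER
               CONTINUE
           END-EVALUATE.
```

The sample document contains only single-byte characters, so displaying XML-TEXT directly is safe here. A document with multibyte characters needs the care described later in this topic.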
The parser returns the XML document fragments in the alphanumeric special register XML-TEXT.
UTF-8 characters are encoded using
a variable number of bytes per character. Most COBOL operations on
alphanumeric data assume a single-byte encoding, in which each character
is encoded in 1 byte. When you operate on UTF-8 characters as alphanumeric
data, you must ensure that the data is processed correctly. Avoid
operations (such as reference modification and moves that involve
truncation) that can split a multibyte character between bytes. You
cannot reliably use statements such as INSPECT to process multibyte characters in alphanumeric data.
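One way to avoid splitting multibyte characters is to convert each fragment to UTF-16 in a national data item before operating on it. The following sketch assumes that the NATIONAL-OF intrinsic function accepts CCSID 1208 (UTF-8) as its second argument. The program name, data names, and item sizes are hypothetical and chosen only for illustration.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. U8TONAT.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical names; sizes are illustrative only.
       01  XML-DOCUMENT       PIC X(54) VALUE
           '<?xml version="1.0" encoding="UTF-8"?><msg>hello</msg>'.
       01  CONTENT-UTF16      PIC N(100) USAGE NATIONAL.
       PROCEDURE DIVISION.
           XML PARSE XML-DOCUMENT
               PROCESSING PROCEDURE XML-EVENT-HANDLER
           END-XML
           GOBACK.
       XML-EVENT-HANDLER.
           IF XML-EVENT = 'CONTENT-CHARACTERS'
      *      Convert the UTF-8 fragment to UTF-16.
      *      1208 is the CCSID for UTF-8.
             MOVE FUNCTION NATIONAL-OF(XML-TEXT, 1208)
               TO CONTENT-UTF16
      *      Reference modification on a national item counts
      *      UTF-16 code units rather than UTF-8 bytes.
             DISPLAY CONTENT-UTF16(1:1)
           END-IF.
```

Because the receiving item is national, reference modification and truncation operate on 2-byte UTF-16 units, so they cannot cut through the middle of a multibyte UTF-8 sequence.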
Processing UTF-8 data using UTF-16 (national) data types
Parsing XML documents
Specifying the encoding
CHAR
The encoding of XML documents
XML PARSE statement (COBOL for Linux® on x86 Language Reference)