Parsing XML documents encoded in UTF-8
If the XMLPARSE(XMLSS) compiler option is in effect, you can parse
XML documents that are encoded in Unicode UTF-8 in a manner similar
to parsing other XML documents. However, some additional requirements
apply.
To parse a UTF-8 XML document, you must specify CCSID 1208 in the
ENCODING phrase of the XML PARSE statement, as shown in the following
code fragment:
    XML PARSE xml-document
        WITH ENCODING 1208
        PROCESSING PROCEDURE xml-event-handler
        . . .
    END-XML
You define xml-document as an alphanumeric data item or alphanumeric
group item in WORKING-STORAGE or LOCAL-STORAGE.
If you do not code the RETURNING NATIONAL phrase in the XML PARSE
statement, the parser returns the XML document fragments in the
alphanumeric special registers XML-TEXT, XML-NAMESPACE, and
XML-NAMESPACE-PREFIX.
UTF-8 characters are encoded using a variable number of bytes per
character (one to four). Most COBOL operations on alphanumeric data
assume a single-byte encoding, in which each character is encoded in
1 byte. When you operate on UTF-8 characters as alphanumeric data, you
must ensure that the data is processed correctly. Avoid operations
(such as reference modification and moves that involve truncation)
that can split a multibyte character in the middle of its byte
sequence. You cannot reliably use statements such as INSPECT to
process multibyte characters in alphanumeric data.
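The following program is a minimal sketch of one way to avoid splitting a multibyte character when you must shorten UTF-8 data that is held in an alphanumeric item. It relies on the fact that UTF-8 continuation bytes fall in the range X'80' through X'BF'. The program name, the data names utf8-text and cut-len, and the sample value are hypothetical, and the sketch assumes that cut-len starts out smaller than the length of utf8-text.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. UTF8CUT.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical sample: "AéB" in UTF-8. The character 'é'
      * occupies the two bytes X'C3A9'.
       01  utf8-text      PIC X(4) VALUE X'41C3A942'.
      * Desired maximum length in bytes (assumed to be less
      * than the length of utf8-text).
       01  cut-len        PIC 9(4) COMP VALUE 2.
       PROCEDURE DIVISION.
      * Back up while the first excluded byte is a UTF-8
      * continuation byte (X'80' through X'BF'), so that the
      * cut never falls inside a multibyte character.
           PERFORM UNTIL cut-len = 0
                   OR utf8-text(cut-len + 1:1) < X'80'
                   OR utf8-text(cut-len + 1:1) > X'BF'
               SUBTRACT 1 FROM cut-len
           END-PERFORM
           DISPLAY 'Safe length in bytes: ' cut-len
           GOBACK.
```

With the sample value, a cut after 2 bytes would separate X'C3' from X'A9', so the loop reduces cut-len to 1. You can then use utf8-text(1:cut-len) without dividing the 'é' character.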
You can more reliably process UTF-8 document fragments by specifying
the RETURNING NATIONAL phrase in the XML PARSE statement.
If you use the RETURNING NATIONAL phrase, XML document fragments are
efficiently converted to UTF-16 encoding and are returned to the
application in the national special registers XML-NTEXT,
XML-NNAMESPACE, and XML-NNAMESPACE-PREFIX. Then you can process the
XML text fragments in national data items. (The UTF-16 encoding in
national data items greatly facilitates Unicode processing in COBOL.)
The following code fragment illustrates the use of both the ENCODING
phrase and the RETURNING NATIONAL phrase for parsing a UTF-8 XML
document:
    XML PARSE xml-document
        WITH ENCODING 1208 RETURNING NATIONAL
        PROCESSING PROCEDURE xml-event-handler
        ON EXCEPTION
            DISPLAY 'XML document error ' XML-CODE
            STOP RUN
        NOT ON EXCEPTION
            DISPLAY 'XML document was successfully parsed.'
    END-XML
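As a hedged illustration of how a processing procedure might use the national special registers, the following sketch parses a small UTF-8 document and reports the length of its character content. The program name, the sample document, and the data name char-count are hypothetical. The sketch assumes that the program is compiled with XMLPARSE(XMLSS) in effect.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. UTF8NAT.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical document <a>é</a>, coded as UTF-8 bytes.
      * The character 'é' occupies the two bytes X'C3A9'.
       01  xml-document   PIC X(9)
                          VALUE X'3C613EC3A93C2F613E'.
       01  char-count     PIC 9(9) COMP.
       PROCEDURE DIVISION.
           XML PARSE xml-document
               WITH ENCODING 1208 RETURNING NATIONAL
               PROCESSING PROCEDURE xml-event-handler
               ON EXCEPTION
                   DISPLAY 'XML document error ' XML-CODE
           END-XML
           GOBACK.
       xml-event-handler.
      * With RETURNING NATIONAL, fragments arrive in XML-NTEXT
      * as UTF-16, so the length is counted in national
      * character positions rather than in UTF-8 bytes.
           EVALUATE XML-EVENT
             WHEN 'CONTENT-CHARACTERS'
               COMPUTE char-count = FUNCTION LENGTH(XML-NTEXT)
               DISPLAY 'Characters in content: ' char-count
             WHEN OTHER
               CONTINUE
           END-EVALUATE.
```

Because the content is returned in a national special register, operations such as reference modification on XML-NTEXT work on whole UTF-16 code units and do not need the byte-boundary checks that UTF-8 alphanumeric data requires.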
XML-TEXT and XML-NTEXT
XML-NAMESPACE and XML-NNAMESPACE
XML-NAMESPACE-PREFIX and XML-NNAMESPACE-PREFIX
XMLPARSE (compiler option)
The encoding of XML documents
XML PARSE statement (Enterprise COBOL for z/OS® Language Reference)