Parsing XML documents encoded in UTF-8

If the XMLPARSE(XMLSS) compiler option is in effect, you can parse XML documents that are encoded in Unicode UTF-8 in a manner similar to parsing other XML documents. However, some additional requirements apply.

To parse a UTF-8 XML document, you must specify CCSID 1208 in the ENCODING phrase of the XML PARSE statement, as shown in the following code fragment:


XML PARSE xml-document
    WITH ENCODING 1208  
    PROCESSING PROCEDURE xml-event-handler
    . . .
END-XML

You define xml-document as an alphanumeric data item or alphanumeric group item in WORKING-STORAGE or LOCAL-STORAGE.
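A minimal sketch of such a definition follows. The 4096-byte size is an assumption chosen only for illustration, and how the UTF-8 bytes reach the item (for example, from a file) is outside the scope of this sketch.

```cobol
       WORKING-STORAGE SECTION.
      * Hypothetical buffer for the document. The size is an
      * assumption; the content must be UTF-8 (CCSID 1208) bytes.
       01  xml-document          PIC X(4096).
```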

If you do not code the RETURNING NATIONAL phrase in the XML PARSE statement, the parser returns the XML document fragments in the alphanumeric special registers XML-TEXT, XML-NAMESPACE, and XML-NAMESPACE-PREFIX.

UTF-8 characters are encoded using a variable number of bytes per character (from 1 to 4). Most COBOL operations on alphanumeric data assume a single-byte encoding, in which each character is encoded in 1 byte. When you operate on UTF-8 characters as alphanumeric data, you must therefore ensure that no operation treats part of a multibyte character as a complete character. Avoid operations (such as reference modification and moves that involve truncation) that can split a multibyte character between bytes. You cannot reliably use statements such as INSPECT to process multibyte characters in alphanumeric data.
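The truncation hazard can be sketched as follows. The data names short-text and full-text and their sizes are hypothetical, introduced only for this illustration. The statements are assumed to run inside the processing procedure, where XML-TEXT holds the current fragment.

```cobol
      * Hypothetical receiving fields; names and sizes are
      * assumptions made for this sketch.
       01  short-text            PIC X(10).
       01  full-text             PIC X(200).

      * Inside the processing procedure:

      * Risky: if XML-TEXT holds more than 10 bytes, the move
      * truncates at byte 10, which can fall in the middle of a
      * multibyte UTF-8 character.
           MOVE XML-TEXT TO short-text

      * Safer: move the fragment only when all of its bytes fit
      * in the receiving field, so no character can be split.
           IF FUNCTION LENGTH(XML-TEXT) <= LENGTH OF full-text
              MOVE XML-TEXT TO full-text
           END-IF
```

The length check works in bytes, not characters, which is sufficient here because the goal is only to rule out truncation.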

You can more reliably process UTF-8 document fragments by specifying the RETURNING NATIONAL phrase in the XML PARSE statement. If you use the RETURNING NATIONAL phrase, XML document fragments are efficiently converted to UTF-16 encoding and are returned to the application in the national special registers XML-NTEXT, XML-NNAMESPACE, and XML-NNAMESPACE-PREFIX. Then you can process the XML text fragments in national data items. (The UTF-16 encoding in national data items greatly facilitates Unicode processing in COBOL.)

The following code fragment illustrates the use of both the ENCODING phrase and the RETURNING NATIONAL phrase for parsing a UTF-8 XML document:


XML PARSE xml-document
    WITH ENCODING 1208 RETURNING NATIONAL 
    PROCESSING PROCEDURE xml-event-handler
  ON EXCEPTION
     DISPLAY 'XML document error ' XML-CODE
     STOP RUN
  NOT ON EXCEPTION
     DISPLAY 'XML document was successfully parsed.'
END-XML
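A processing procedure for the statement above might look like the following sketch. The data name element-content and its size are hypothetical, and the sketch handles only two of the possible XML events. Because RETURNING NATIONAL is in effect, the fragments arrive in XML-NTEXT rather than XML-TEXT.

```cobol
      * Hypothetical national work field; the name and size are
      * assumptions made for this sketch.
       01  element-content       PIC N(100).

       xml-event-handler.
           EVALUATE XML-EVENT
             WHEN 'START-OF-ELEMENT'
      *        DISPLAY-OF converts the UTF-16 element name to
      *        alphanumeric form for display.
               DISPLAY 'Element: ' FUNCTION DISPLAY-OF(XML-NTEXT)
             WHEN 'CONTENT-CHARACTERS'
      *        LENGTH returns national character positions, so
      *        the check guards against truncating the content.
               IF FUNCTION LENGTH(XML-NTEXT) <= 100
                  MOVE XML-NTEXT TO element-content
               END-IF
             WHEN OTHER
               CONTINUE
           END-EVALUATE.
```

Each position of a national item holds one UTF-16 code unit, so characters outside the Basic Multilingual Plane occupy two positions.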

related references  
XMLPARSE (compiler option)
  
The encoding of XML documents  
XML PARSE statement (Enterprise COBOL for z/OS® Language Reference)