XML input document encoding

The XML PARSE statement can parse an XML document only if the document is encoded in a supported encoding.

The supported encodings for a given parse operation depend on:

  • The category of the data item that contains the XML document
  • The setting of the XMLPARSE compiler option
  • The optional phrases that are specified in the XML PARSE statement

For XML documents that are contained in a national data item, the supported encoding is Unicode UTF-16 in big-endian format, CCSID 1200.

For XML documents that are contained in an alphanumeric data item, the supported encodings when the XMLPARSE(XMLSS) compiler option is in effect are as follows:

  • If the RETURNING NATIONAL phrase is specified in the XML PARSE statement: UTF-8 or any EBCDIC or ASCII encoding that z/OS® Unicode Services supports for conversion to UTF-16
  • If the RETURNING NATIONAL phrase is not specified: UTF-8 or any of the single-byte EBCDIC CCSIDs listed in the related reference about the encoding of XML documents

For XML documents that are contained in an alphanumeric data item, the supported CCSIDs if XMLPARSE(COMPAT) is in effect are those specified in the related reference about the encoding of XML documents.

To parse an XML document that is encoded in an unsupported code page, first convert the document to national character data (UTF-16) by using the NATIONAL-OF intrinsic function. You can convert the individual pieces of document text that are passed to the processing procedure in special register XML-NTEXT back to the original code page by using the DISPLAY-OF intrinsic function.
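The round trip described above can be modeled outside COBOL. The following is a minimal sketch, not COBOL code: Python codecs stand in for the NATIONAL-OF and DISPLAY-OF intrinsic functions, and code page 500 is an arbitrary stand-in for the unsupported code page. The function names and the sample document are hypothetical.

```python
# Illustrative model only: Python codecs stand in for the COBOL
# NATIONAL-OF and DISPLAY-OF intrinsic functions.

SOURCE_CODEC = "cp500"  # assumption: stands in for the unsupported code page


def national_of(data: bytes, codec: str = SOURCE_CODEC) -> bytes:
    """Convert code-page bytes to UTF-16 big-endian (no byte order mark)."""
    return data.decode(codec).encode("utf-16-be")


def display_of(national: bytes, codec: str = SOURCE_CODEC) -> bytes:
    """Convert UTF-16 big-endian bytes back to the original code page."""
    return national.decode("utf-16-be").encode(codec)


# Hypothetical document held in the original code page.
original = "<greeting>hello</greeting>".encode(SOURCE_CODEC)

# Step 1: convert the whole document before parsing.
national_doc = national_of(original)

# Step 2: convert a text fragment (as the parser would pass it in
# XML-NTEXT) back to the original code page.
fragment = "hello".encode("utf-16-be")
restored = display_of(fragment)
```

In this model, each fragment is converted back individually, which mirrors how the processing procedure receives document text one piece at a time.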

XML declaration and white space

XML documents can begin with white space only if they do not have an XML declaration:

  • If an XML document begins with an XML declaration, the first angle bracket (<) in the document must be the first character in the document.
  • If an XML document does not begin with an XML declaration, the first angle bracket in the document can be preceded only by white space.

White-space characters have the hexadecimal values shown in the following table.

Table 1. Hexadecimal values of white-space characters

White-space character    EBCDIC    Unicode
Space                    X'40'     X'0020'
Horizontal tabulation    X'05'     X'0009'
Carriage return          X'0D'     X'000D'
Line feed                X'25'     X'000A'
New line / next line     X'15'     X'0085'
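The white-space rule and the EBCDIC values in Table 1 can be combined into a simple pre-check. The sketch below is a model written in Python, not part of the parser. It assumes an EBCDIC document in which "<" is X'4C' and the declaration prefix matches code page 037. The function names are hypothetical.

```python
# EBCDIC white-space byte values taken from Table 1.
EBCDIC_WHITE_SPACE = {0x40, 0x05, 0x0D, 0x25, 0x15}
EBCDIC_LT = 0x4C  # "<" in EBCDIC (assumption: same in the code page used)
XML_DECL_START = "<?xml".encode("cp037")  # assumption: code page 037


def starts_xml_declaration(doc: bytes, pos: int) -> bool:
    """True if '<?xml' followed by white space begins at offset pos."""
    end = pos + len(XML_DECL_START)
    return (
        doc.startswith(XML_DECL_START, pos)
        and end < len(doc)
        and doc[end] in EBCDIC_WHITE_SPACE
    )


def leading_white_space_ok(doc: bytes) -> bool:
    """Check the start of an EBCDIC document against the rule above."""
    pos = 0
    while pos < len(doc) and doc[pos] in EBCDIC_WHITE_SPACE:
        pos += 1
    if pos == len(doc) or doc[pos] != EBCDIC_LT:
        return False  # something other than white space precedes "<"
    if starts_xml_declaration(doc, pos):
        return pos == 0  # a declaration must start at the first character
    return True  # no declaration: leading white space is allowed
```

For example, a document that starts with a space followed by an XML declaration fails this check, while the same leading space before an ordinary start tag passes.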

Determining the encoding of an input XML document

The parser must know the encoding of an XML document in order to process the document correctly.

If the specified encoding is not one of the supported coded character sets, the parser signals an XML exception event before beginning the parse operation. If the actual document encoding does not match the specified encoding, the parser signals an appropriate XML exception event after beginning the parse operation.

Several sources are used in determining the encoding of an XML document:

  • If the XMLPARSE(XMLSS) option is in effect:
    • The data type of the data item that contains the XML document
    • The ENCODING phrase (if used) of the XML PARSE statement
    • The CCSID specified in the CODEPAGE compiler option
  • If the XMLPARSE(COMPAT) option is in effect:
    • The data type of the data item that contains the XML document
    • The actual encoding determined when the parser examines the first few bytes of the document
    • The encoding declaration specified within the XML document
    • The CCSID specified in the CODEPAGE compiler option

If XMLPARSE(XMLSS) is in effect:

  • Any encoding declaration specified within the XML document is ignored.
  • For XML documents that are contained in a national data item, the ENCODING phrase of the XML PARSE statement must be omitted or must specify CCSID 1200. The CCSID specified in the CODEPAGE compiler option is ignored. The parser signals an XML exception event if the actual document encoding is not UTF-16 in big-endian format.
  • For XML documents that are contained in an alphanumeric data item, the CCSID specified in the ENCODING phrase overrides the CODEPAGE compiler option. The parser signals an XML exception event at the beginning of the parse operation if the actual document encoding is not consistent with the specified CCSID.
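
The precedence rules for XMLPARSE(XMLSS) can be summarized as a small decision function. This is a sketch of the rules as stated above, written in Python for illustration. It does not reproduce how the compiler or parser actually reports a violation. The function name, the category strings, and the use of ValueError are assumptions.

```python
from typing import Optional

UTF16_BE_CCSID = 1200  # UTF-16 big-endian, as stated for national data items


def effective_ccsid(
    category: str,
    encoding_phrase: Optional[int],
    codepage_option: int,
) -> int:
    """Model which CCSID applies to a parse under XMLPARSE(XMLSS).

    category        -- "national" or "alphanumeric" (hypothetical labels)
    encoding_phrase -- CCSID from the ENCODING phrase, or None if omitted
    codepage_option -- CCSID from the CODEPAGE compiler option
    """
    if category == "national":
        # ENCODING must be omitted or specify 1200; CODEPAGE is ignored.
        if encoding_phrase not in (None, UTF16_BE_CCSID):
            raise ValueError("ENCODING must be omitted or specify CCSID 1200")
        return UTF16_BE_CCSID
    if category == "alphanumeric":
        # The ENCODING phrase, when present, overrides CODEPAGE.
        if encoding_phrase is not None:
            return encoding_phrase
        return codepage_option
    raise ValueError("unknown data item category: " + category)
```

With example values, an alphanumeric item parsed with an ENCODING phrase of 1208 and a CODEPAGE setting of 1140 resolves to 1208. Without the ENCODING phrase, the same item resolves to 1140.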

Related references
XMLPARSE (compiler option)
The encoding of XML documents
EBCDIC code-page-sensitive characters in XML markup