Character Encoding in the XML Parser

This section describes how the XML Parser determines the character encoding to use when it reads and writes XML.

The XML Parser has a Character Encoding parameter that sets the name of the encoding. When set, this encoding is used to decode the InputStream passed to the parser during initialization. Any value other than blank (an empty string) is used as is, and the parser makes no attempt to detect the encoding itself. For example, if the InputStream is UTF-16BE encoded, starts with a Byte Order Mark (BOM), and the Character Encoding parameter is set to "UTF-16BE", the parser recognizes the BOM sequence and skips it automatically. If the Character Encoding parameter is set to an encoding that is not compatible with the InputStream's encoding, an exception is thrown indicating that an inappropriate encoding is specified.
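
The behavior described above can be illustrated with a minimal Python sketch. It is not the parser's own code: the function name decode_with_explicit_encoding is hypothetical, and Python's UnicodeDecodeError merely stands in for the exception the parser throws when the configured encoding does not match the stream.

```python
import codecs


def decode_with_explicit_encoding(data: bytes, encoding: str) -> str:
    """Decode bytes with an explicitly configured encoding (hypothetical helper).

    Raises UnicodeDecodeError when the bytes are not valid in that encoding.
    """
    text = data.decode(encoding)
    # A BOM decodes to U+FEFF; drop it so it does not reach the XML content.
    return text[1:] if text.startswith("\ufeff") else text


# A UTF-16BE stream that starts with a BOM.
payload = codecs.BOM_UTF16_BE + "<root/>".encode("utf-16-be")
print(decode_with_explicit_encoding(payload, "utf-16-be"))
```

Decoding the same payload as "utf-8" fails in this sketch, because the BOM bytes are not valid UTF-8.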

If you are not sure about the encoding of the InputStream or file, you can let the parser discover it (if possible). When the encoding is not explicitly specified in the configuration (that is, the Character Encoding parameter is empty), the parser tries to discover the encoding of the XML in the following order:

1. The parser checks for a BOM. If one is found, the parser decodes the InputStream using the encoding indicated by that BOM. The encodings recognizable from the BOM are: UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE. Note: The parser does not recognize unusual (reversed) four-byte sequences similar to the UTF-32 sequences. In that case, explicit configuration is required (using the Character Encoding parameter).
2. If the InputStream or file has no BOM sequence and no explicit configuration is set, the parser tries to guess the encoding well enough to read the value of the encoding attribute in the XML declaration. It tries to read the declaration using the encodings UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE and IBM-1047 (an EBCDIC variation). If the attribute is found, its value is used to decode the rest of the InputStream or file. Note: The XML declaration must be on the first line of the document and must start at the first character.
3. If the Character Encoding parameter is not set, no BOM is found, and no XML declaration is found (or the XML declaration does not have the encoding attribute), the parser uses the default encoding, which is UTF-8.
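
The discovery order above can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser's implementation: detect_encoding and the constants are hypothetical names, the 512-byte look-ahead is an arbitrary choice, and IBM-1047 is left out because the Python standard library does not ship a codec for it.

```python
import codecs
import re

# UTF-32 BOMs are checked first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

# Encodings tried when reading the XML declaration (IBM-1047 omitted, see above).
CANDIDATES = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

# Captures the encoding attribute of a declaration that starts at the first character.
DECLARED = re.compile(r"""^<\?xml\b[^>]*?\bencoding\s*=\s*["']([^"']+)["']""")


def detect_encoding(data: bytes, configured: str = "") -> str:
    """Return the encoding name chosen by the discovery order (hypothetical helper)."""
    # An explicit configuration always wins.
    if configured:
        return configured
    # Step 1: look for a BOM.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: find a candidate encoding in which the stream starts with "<?xml",
    # then read the encoding attribute from the declaration.
    for candidate in CANDIDATES:
        if data.startswith("<?xml".encode(candidate)):
            head = data[:512].decode(candidate, errors="ignore")
            match = DECLARED.search(head)
            if match:
                return match.group(1)
            break
    # Step 3: fall back to the default.
    return "UTF-8"
```

For example, a UTF-16LE document without a BOM whose declaration names "UTF-16LE" resolves to that value in step 2, while a document with neither a BOM nor an encoding attribute resolves to "UTF-8".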

If the encoding is known at design time, we recommend setting it explicitly in the XML Parser's configuration. This speeds up the parser's initialization because no encoding lookup is performed.

When the parser is initialized for writing (Output Mode), it expects the Character Encoding parameter to be set explicitly. If it is not set, the parser uses UTF-8 as the default encoding (UTF-8 with no BOM sequence). If a BOM-compatible encoding is explicitly specified (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE), the parser writes a BOM sequence at the beginning of the stream.
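
The Output Mode rules can be sketched in Python in the same spirit. The names encode_output and OUTPUT_ENCODINGS are hypothetical, and the handling of encodings outside the BOM-compatible list (written without a BOM) is an assumption of this sketch rather than documented parser behavior.

```python
import codecs

# Configured name -> (Python codec name, BOM bytes) for the BOM-compatible encodings.
OUTPUT_ENCODINGS = {
    "UTF-8": ("utf-8", codecs.BOM_UTF8),
    "UTF-16BE": ("utf-16-be", codecs.BOM_UTF16_BE),
    "UTF-16LE": ("utf-16-le", codecs.BOM_UTF16_LE),
    "UTF-32BE": ("utf-32-be", codecs.BOM_UTF32_BE),
    "UTF-32LE": ("utf-32-le", codecs.BOM_UTF32_LE),
}


def encode_output(xml_text: str, configured: str = "") -> bytes:
    """Encode XML text following the Output Mode rules (hypothetical helper)."""
    # No explicit encoding: default to UTF-8 without a BOM.
    if not configured:
        return xml_text.encode("utf-8")
    # Explicit BOM-compatible encoding: prefix the matching BOM.
    # Assumption: any other encoding is written without a BOM.
    codec, bom = OUTPUT_ENCODINGS.get(configured.upper(), (configured, b""))
    return bom + xml_text.encode(codec)
```

Note the asymmetry this sketch reproduces: leaving the parameter empty yields UTF-8 without a BOM, while setting it to "UTF-8" explicitly yields UTF-8 with a BOM.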