Troubleshooting
Problem
Some XML entities are interpreted as 2 characters by the Xerces parser.
Symptom
A single character outside the Basic Multilingual Plane is counted as 2 characters by the Xerces parser, so XML validation against a length facet fails.
Cause
The incorrect length returned for certain characters is a limitation of the Xerces parser that occurs with surrogate characters.
The XML parser converts all XML data to an internal UCS format. Nearly all characters fit into 2 bytes. However, some need 4 bytes and are stored as a pair of 2-byte values called surrogates. UCS uses surrogates to address characters outside the Basic Multilingual Plane (BMP).
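The surrogate mechanism described above can be sketched in a few lines. This is an illustration of the standard UTF-16 surrogate calculation, not code taken from Xerces; the function name `to_utf16_code_units` is hypothetical.

```python
def to_utf16_code_units(code_point):
    """Return the 16-bit code units that encode one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Inside the Basic Multilingual Plane: one 2-byte unit is enough.
        return [code_point]
    # Outside the BMP: split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


units = to_utf16_code_units(0x1D400)  # mathematical bold capital A
print([f"{u:04X}" for u in units])    # ['D835', 'DC00']
print(len(units))                     # 2
```

A BMP character such as "A" (U+0041) yields a single unit, while U+1D400 yields two, which is where a length of 2 for one character comes from.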
Environment
For example:
Using IBM Transformation Extender (ITX) with the character 𝐀 (U+1D400) as XML input.
The character 𝐀 is rejected.
This code point represents a mathematical bold capital A. Its UTF-8 encoding is the 4 bytes f0 9d 90 80.
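The byte values quoted above can be checked with a short script. The character 𝐀 is code point U+1D400; the XML snippet below, which writes it as the numeric character reference `&#x1D400;` inside a made-up `<value>` element, is an assumption about how such input might look and is not taken from the original data.

```python
import unicodedata
import xml.etree.ElementTree as ET

char = "\U0001D400"  # the character from the example

name = unicodedata.name(char)
utf8_hex = char.encode("utf-8").hex()

# Hypothetical input: the same character written as a numeric character reference.
parsed = ET.fromstring("<value>&#x1D400;</value>").text

print(name)            # MATHEMATICAL BOLD CAPITAL A
print(utf8_hex)        # f09d9080
print(parsed == char)  # True
```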
Diagnosing The Problem
The following error appears in the ITX XML trace log file.
Error (-1), "XMLParser: Input XML data is invalid."
SAXParseException, Error [line: 29186 column: 28] Datatype error: Type:InvalidDatatypeValueException,
Message:Value '' with length '2' exceeds maximum length facet of '1' .
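The message reports a length of 2 against a maximum length of 1. A minimal sketch of how that mismatch can arise, assuming the length is counted in 16-bit units rather than in characters (an assumption consistent with the surrogate explanation in this document, not a description of Xerces internals). The helper names and the `max_length` variable are hypothetical.

```python
def length_in_characters(text):
    """Number of Unicode code points (Python strings count code points)."""
    return len(text)


def length_in_16bit_units(text):
    """Number of 2-byte units once the text is encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


value = "\U0001D400"  # one character outside the Basic Multilingual Plane
max_length = 1        # stands in for the schema's maximum length facet

print(length_in_characters(value) <= max_length)   # True: one character
print(length_in_16bit_units(value) <= max_length)  # False: two 16-bit units
```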
Resolving The Problem
The incorrect length returned for certain characters is a limitation of the Xerces parser that occurs with surrogate characters.
Document Information
Modified date:
16 June 2018
UID
swg21979084