After my previous posting CDATA 1/x I wanted to make some comments on CDATA sections in general and then for CDATA and DataPower.
Doing a simple search on developerWorks for "CDATA" showed an interesting hit: Dealing with data in XML It talks about character data, parsed character data, ... and so with that link I am done with the first part.
As already said in the previous posting XML data is stored internally as UTF-8 by DataPower XSLT processor.
Document cdata-alpha-iso.xml is used to show that there is no difference for DataPower XSLT processor if the greek alpha character is given in the input
in clear inside CDATA section
(the hexdump shows that alpha is encoded as e1 in ISO-8859-7 encoding):
$ cat cdata-alpha-iso.xml <?xml version="1.0" encoding="ISO-8859-7"?> <!DOCTYPE root [ <!ENTITY alpha "α"> ]> <tag>�<![CDATA[�]]>α</tag> $ $ od -Ax -tcx1 cdata-alpha-iso.xml | tail -5 000060 341 < ! [ C D A T A [ 341 ] ] > & a e1 3c 21 5b 43 44 41 54 41 5b e1 5d 5d 3e 26 61 000070 l p h a ; < / t a g > \n 6c 70 68 61 3b 3c 2f 74 61 67 3e 0a 00007c $
In previous CDATA posting I showed how to use string-length() to get an idea of DataPower internal data representation.
Here I want to show the "magnifying glass" technique, which may be used to inspect more internals than just CDATA.
The extension funtion dp:binary-encode() Base64-encodes arbitrary binary data for DataPower XSLT processor. Since base64 data is not that easy to inspect by humans extension function dp:radix-convert(_,64,16) may be used to generate hexadecimal output from base64 data.
The second line is the "magnifying glass" output that demonstrates that every time CEB1 is stored internally, which is the UTF-8 encoding of alpha,
One last comment on converting a base64 string to a hex string.
dp:radix-convert() is a "number" function, and therefore it will strip leading 0x00 bytes (no problem for above application).
In case you want to perserve the length of the encoded base64 string use this function:
<!-- convert base64 string $b64str to hex string (2 hex digits per byte) --> <func:function name="ub:base64-to-hex"> <xsl:param name="b64str"/>
This is a real customer problem and the first of several postings on CDATA sections.
The input XML document containted CDATA sections with an embedded embedded XML document.
Both XML declarations (see here), the XML file one and the one inside CDATA section are ISO-8859-1 encoded.
For ease of demonstration the XML file inside the CDATA section contains one element containing a single text character (German 'ü'):
Now what happens after the parser has parsed the XML file and passed it to DataPower XSLT processor? The CDATA section is removed and the complete input document will be converted to DataPower internal data format. And this internal data format is UTF-8.
So now the problem is that a dp:parse() on the inner XML document finds a mismatch of the encoding in XML declaration (ISO-8859-1) and the current (UTF-8) encoding of the document. While this is working as designed, there are the five options to get parsing done correctly:
do not use CDATA section
use CDATA section without an XML declaration inside
use CDATA section with an XML declaration inside without encoding
use CDATA section with an XML declaration inside with UTF-8 encoding
skip the XML declaration when parsing the inner XML document
Skipping the XML declaration can be simply done by "substring-after(_, '?>')".
Below stylesheet demonstrates
the difference. lenA is 2 because the UTF-8 encoding "fc 3c" for 'ü' will be counted as two (ISO-8859-1) characters. lenB is 1 because by skipping the XML declaration the UTF-8 default encoding is used which perfectly matches the (internal) encoding.