Topic
  • 18 replies
  • Latest Post - ‏2015-12-07T13:01:14Z by Daviiid
DesiInPhx
DesiInPhx
4 Posts

Pinned topic How to override UTF-8 presumption while processing XML?

‏2010-07-13T01:19:55Z |
Hi

We have a situation where the response from backend systems arrives as an ISO-8859-1 encoded XML document. However, this document has no XML prolog declaring its encoding.

I believe DataPower presumes this to be a UTF-8 encoded XML document (per the standard), since there is no prolog, and tries to process it. Unfortunately the document contains some characters from the Latin-1 Supplement set, e.g. the registered trademark symbol (0xAE), and this causes parsing to fail: U+00AE must be encoded as two bytes in UTF-8, but there is only one in the input.
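The failure is easy to reproduce outside DataPower: a bare 0xAE byte is not valid UTF-8, while the two-byte sequence C2 AE is the UTF-8 form of U+00AE. A quick check, using GNU iconv as a stand-in validator:

```shell
# A lone 0xAE byte (octal \256, '®' in ISO-8859-1) is rejected as UTF-8 input...
if printf '\256' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  echo 'lone 0xAE: valid UTF-8'
else
  echo 'lone 0xAE: invalid UTF-8'
fi
# ...while C2 AE (octal \302\256), the two-byte UTF-8 encoding of U+00AE, is accepted:
if printf '\302\256' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  echo 'C2 AE: valid UTF-8'
fi
```

This is exactly the one-byte-vs-two-bytes mismatch described above.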

So the question is - How do we override the UTF-8 presumption on the response policy so that we can transform the otherwise valid XML? Unfortunately it is not possible for the source system to set the correct encoding in the XML prolog - so that is not an option.

Appreciate all your help!
Updated on 2011-04-07T11:35:55Z at 2011-04-07T11:35:55Z by HermannSW
  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2010-07-13T12:25:16Z  
    > So the question is - How do we override the UTF-8 presumption on the
    > response policy so that we can transform the otherwise valid XML?
    > Unfortunately it is not possible for the source system to set the
    > correct encoding in the XML prolog - so that is not an option.
    >

    You are lucky: other customers have run into this before.

    What you are experiencing is a "broken Webservice", and here you may find how to (easily) deal with it on DataPower:
    http://www-01.ibm.com/support/docview.wss?uid=swg27019119&aid=1#page=6
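    The core idea of that repair (as described later in this thread: prepend an XML declaration naming the real encoding) can be sketched in shell. File names here are illustrative, and `printf`/`cat` stand in for the XSLT on the box:

```shell
# ISO-8859-1 bytes, no XML prolog; \256 is the 0xAE (registered trademark) byte:
printf '<name>\256</name>' > broken.xml
# The repair idea: prepend an XML declaration naming the real encoding,
# so a parser no longer falls back to the UTF-8 default:
{ printf '<?xml version="1.0" encoding="ISO-8859-1"?>\n'; cat broken.xml; } > repaired.xml
cat repaired.xml
```

    After this, the document is well-defined ISO-8859-1 XML and can be parsed and transcoded normally.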
  • DesiInPhx
    DesiInPhx
    4 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2010-07-13T15:11:10Z  
    • HermannSW
    • ‏2010-07-13T12:25:16Z
    Thanks a lot for linking to a very informative document. Internally we were indeed kicking around a similar proxying solution, but we are still interested in a solution that would not incur the overhead (development, maintenance, and performance) of an additional proxy. We expect to encounter this situation fairly often in our environment and were hoping it would be possible to "declare" the expected encoding in some fashion in the WS-Proxy configuration.

    Is there any such configuration available?

    Appreciate all the help. Thanks!
  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2010-07-13T16:45:34Z  
    • DesiInPhx
    • ‏2010-07-13T15:11:10Z
    > ... We expect to encounter this situation pretty often in our
    > environment and were hoping that it would be possible to "declare"
    > the expected encoding in some fashion in the WS-Proxy configuration.
    >
    > Is there any such configuration available?
    >

    What you get back from the Webservice might "look like" XML to humans, but it is Non-XML.
    This Non-XML must be converted to XML, e.g. by the service I pointed to.

    I am not aware of any option at any DataPower service helping you to resolve this issue by just configuration, sorry.

    I just did a quick test:
    • Apache bench from Laptop directly connected to a 9004
    • accessing an external Webserver, 330 byte XML document
    • through TCP Proxy
    • or XML FW with repair.xsl
    • 30000 requests, concurrency 200

    Through TCP proxy I got 5300 requests/sec.
    Through XML FW with repair.xsl I got 840 requests/sec.

    This scenario is a little bit unrealistic since CPU is up to 100%.

    You have to take measurements in your environment, with the traffic patterns you expect, to determine the cost of repair.xsl.

    (The cheapest solution is getting the Webservice fixed)
  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2010-07-21T13:58:09Z  
    • HermannSW
    • ‏2010-07-13T16:45:34Z
    Just found a misconfiguration in my setup.
    Here are the corrected numbers (nearly no overhead over the TCP Proxy pass-thru of the box):

    Through TCP proxy I got 5300 requests/sec.
    Through XML FW with repair.xsl I got 2370 requests/sec (not 840 as reported above).
  • DesiInPhx
    DesiInPhx
    4 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2010-07-23T22:24:30Z  
    • HermannSW
    • ‏2010-07-21T13:58:09Z
    Thanks for all the help. FYI - we are also considering a WTX solution to this issue and are testing that at the moment.
  • SystemAdmin
    SystemAdmin
    6772 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2010-07-26T15:52:06Z  
    • DesiInPhx
    • ‏2010-07-13T15:11:10Z
    This is NOT just WebSphere DataPower assuming that the XML is UTF-8, and therefore it is not something we control. The XML 1.0 specification is very clear on this point: any XML document without a byte order mark or an explicitly declared encoding should be treated as UTF-8. So the only other way (besides Hermann's supplied workaround) is to lobby the W3C to change the XML specification, which I think we can agree is not going to happen. Any parser that did not assume UTF-8 for an undeclared encoding would contradict the spec.

    HTH,
    Corey
  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2010-07-30T15:30:50Z  
    • DesiInPhx
    • ‏2010-07-23T22:24:30Z
    > ... FYI - we are also considering a WTX solution to this issue and are testing that at the moment.

    I asked a colleague for a WTX repair solution.
    He came up with the map you can see in the attached screenshot.
    With the corresponding .dpa file I got 1800 requests/sec in the test scenario described above.
  • DPHelp
    DPHelp
    1 Post

    Re: How to override UTF-8 presumption while processing XML?

    ‏2011-04-06T13:40:16Z  
    • HermannSW
    • ‏2010-07-13T12:25:16Z
    We ran into a similar problem.

    Most responses contain only characters that are valid UTF-8, but sometimes, when there is a need to embed the registered trademark symbol, ISO-8859-1 characters are included. The XML declaration still specifies UTF-8 encoding, which is why it fails.

    If I use repair.xsl, does it try to parse everything in all the responses coming through that processing rule, or only the ISO-8859-1 characters? If it tries to parse everything, do we run into parsing errors when handling UTF-8-only characters?

    Thanks in advance.
  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2011-04-07T11:35:55Z  
    • DPHelp
    • ‏2011-04-06T13:40:16Z
    > We ran into similar problem.
    >
    > Most of responses will have characters which are part of UTF-8, but sometimes when
    > there is need to embed the registered trademark, ISO-8859-1 characters are included.
    The ASCII character-encoding scheme is a subset of UTF-8 as well as ISO-8859-1.
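    This subset relationship is easy to verify: any pure-ASCII byte is unchanged by an ISO-8859-1 to UTF-8 conversion, while bytes above 0x7F are remapped (iconv shown purely as an illustration):

```shell
# 'A' (0x41) is the same byte in ASCII, ISO-8859-1 and UTF-8:
printf 'A' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
# '®' is the single byte 0xAE in ISO-8859-1, but becomes 0xC2 0xAE in UTF-8:
printf '\256' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```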

    > But the xml declaration still have the UTF-8 encoding due to which its failing.
    >
    OK, an XML file without an XML declaration (last entry under Key terminology) is considered to have UTF-8 encoding (the default).

    A Webservice returning a "file" which
    • "looks" like XML
    • contains no XML declaration
    • but is not UTF-8 encoded
    is what I called a "broken Webservice" on this slide:
    http://www-01.ibm.com/support/docview.wss?uid=swg27019119&aid=1#page=6

    Do I understand you right that your "webservice" returns a "file" which
    • "looks" like XML
    • contains an XML declaration with encoding="UTF-8"
    • but is not UTF-8 encoded?
    If that is true, I would call it a "really broken Webservice" ...

    Please clarify whether you have a "really broken Webservice".
    I may provide a modified repair.xsl that can deal with that
    (the current repair.xsl just "prepends" an XML declaration with ISO-8859-1 encoding).
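    For the "really broken" case, prepending is not enough, because a declaration is already there; a modified repair would have to rewrite the lying declaration instead. A shell sketch of the idea, with sed standing in for the XSLT and an illustrative file name:

```shell
# Declaration claims UTF-8, but the \256 (0xAE) byte is ISO-8859-1:
printf '<?xml version="1.0" encoding="UTF-8"?><x>\256</x>' > really-broken.xml
# Rewrite the declared encoding so it matches the actual bytes:
sed 's/encoding="UTF-8"/encoding="ISO-8859-1"/' really-broken.xml
```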

     
    Hermann<myXsltBlog/>
  • Asim80
    Asim80
    22 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2014-07-18T21:30:00Z  
    • HermannSW
    • ‏2011-04-07T11:35:55Z

    I have a similar problem; if Hermann can provide any input, that would be great.

    We are experiencing this issue on a response. The external service provider's response to DataPower is in JSON format, with charset ISO-8859-1 (Windows platform). The special character in the vendor's JSON response is "®" (0xAE in ISO-8859-1). The DataPower interface's "Response Type" is set to JSON, which appears to default to UTF-8. Since the incoming data is ISO-8859-1, parsing conflicts, and the failure happens right at the entry point of the response policy. The Probe displays the following error: Transaction aborted in Step 0
    Unable to parse JSON and generate JSONx: illegal character 0xae at offset 3027 of http:---------

    If you could elaborate on the steps in the broken-webservice slide, that would be great. Thanks!

  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2014-07-19T13:51:04Z  
    • Asim80
    • ‏2014-07-18T21:30:00Z


    DataPower JSON processing on firmware versions before 7.0.0.0 follows RFC 4627.

    The spec mentions only UTF encodings in its encoding section:
    http://tools.ietf.org/html/rfc4627#section-3

    And JSON processing by conversion to JSONx can only handle the various UTF-x encodings.

    Starting with simple UTF-8 encoded reg.json I created other encodings:

    $ cat reg.json 
    ["®"]
    $ 
    $ od -tcx1 reg.json
    0000000   [   " 302 256   "   ]  \n
             5b  22  c2  ae  22  5d  0a
    0000007
    $ 
    $ iconv -f utf-8 -t utf-16 reg.json  > reg.bom.utf-16.json
    $ iconv -f utf-8 -t utf-32 reg.json  > reg.bom.utf-32.json
    $ iconv -f utf-8 -t iso-8859-1 reg.json  > reg.8859-1.json
    $ 
    $ od -tcx1 reg.bom.utf-16.json 
    0000000 377 376   [  \0   "  \0 256  \0   "  \0   ]  \0  \n  \0
             ff  fe  5b  00  22  00  ae  00  22  00  5d  00  0a  00
    0000016
    $ od -tcx1 reg.bom.utf-32.json 
    0000000 377 376  \0  \0   [  \0  \0  \0   "  \0  \0  \0 256  \0  \0  \0
             ff  fe  00  00  5b  00  00  00  22  00  00  00  ae  00  00  00
    0000020   "  \0  \0  \0   ]  \0  \0  \0  \n  \0  \0  \0
             22  00  00  00  5d  00  00  00  0a  00  00  00
    0000034
    $ od -tcx1 reg.8859-1.json 
    0000000   [   " 256   "   ]  \n
             5b  22  ae  22  5d  0a
    0000006
    $
    

     

    Here you can see that JSON2JSONx processing for the UTF-x encodings is fine, all with same output:

    $ curl --data-binary @reg.json http://dp2-l3:2057 ; echo
    <?xml version="1.0" encoding="UTF-8"?>
    <json:array xsi:schemaLocation="http://www.datapower.com/schemas/json jsonx.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:json="http://www.ibm.com/xmlns/prod/2009/jsonx">
    <json:string>&#174;</json:string></json:array>
    $ 
    $ curl --data-binary @reg.bom.utf-16.json http://dp2-l3:2057 ; echo
    <?xml version="1.0" encoding="UTF-8"?>
    <json:array xsi:schemaLocation="http://www.datapower.com/schemas/json jsonx.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:json="http://www.ibm.com/xmlns/prod/2009/jsonx">
    <json:string>&#174;</json:string></json:array>
    $ 
    $ curl --data-binary @reg.bom.utf-32.json http://dp2-l3:2057 ; echo
    <?xml version="1.0" encoding="UTF-8"?>
    <json:array xsi:schemaLocation="http://www.datapower.com/schemas/json jsonx.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:json="http://www.ibm.com/xmlns/prod/2009/jsonx">
    <json:string>&#174;</json:string></json:array>
    $
    

     

    Starting with firmware 6.0.0.0 you can use JSONiq to process JSON as well.

    But JSONiq, too, only processes UTF-x encoded strings correctly.

    While the following looks as if JSONiq could deal with ISO-8859-1 (as in this posting):

    $ coproc2 identity-json.xq reg.8859-1.json http://dp2-l3:2226 -s | od -tcx1
    0000000  \n   [  \n           " 256   "  \n   ]
             0a  5b  0a  20  20  22  ae  22  0a  5d
    0000012
    $
    

     

    It does NOT, and you should not rely on this; see string-length() as proof:

    $ coproc2 string-length.xq reg.8859-1.json http://dp2-l3:2226 ; echo
    <?xml version="1.0" encoding="UTF-8"?>
    <x>0</x>
    $ 
    $ coproc2 string-length.xq reg.json http://dp2-l3:2226 ; echo
    <?xml version="1.0" encoding="UTF-8"?>
    <x>1</x>
    $ 
    $ coproc2 string-length.xq reg.bom.utf-16.json http://dp2-l3:2226 ; echo
    <?xml version="1.0" encoding="UTF-8"?>
    <x>1</x>
    $ 
    $ coproc2 string-length.xq reg.bom.utf-32.json http://dp2-l3:2226 ; echo
    <?xml version="1.0" encoding="UTF-8"?>
    <x>1</x>
    $
    

     

    You need a Transform Binary action for the conversion of ISO-8859-1 to UTF-8.
    I will post how later; I have to jump now.


    Hermann.

  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2014-07-19T16:19:07Z  
    • HermannSW
    • ‏2014-07-19T13:51:04Z


    It is even easier to see that JSON2JSONx conversion cannot deal with ISO-8859-1 encoded "JSON" input.

    $ cat regsp.json 
    ["® "]
    $ iconv -f utf-8 -t iso-8859-1 regsp.json > regsp.8859-1.json
    $ od -tcx1 regsp.8859-1.json 
    0000000   [   " 256       "   ]  \n
             5b  22  ae  20  22  5d  0a
    0000007
    $
    


    Send regsp.8859-1.json against a service with Non-XML request type, a convert-http(JSON) action and a final xform(store:///identity.xsl) action. The byte sequence "ae 20" will end up in ...<json:string>� </json:string>... after JSON2JSONx conversion and then error out in the xform, because 0xAE begins a UTF-8 multi-byte character, the 2nd byte of such a character has to be in the range 0x80-0xBF, and the DataPower internal encoding is UTF-8.
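    The failing byte sequence can be checked directly; as in the sessions above, iconv serves here as a stand-in validator for DataPower's internal UTF-8 handling:

```shell
# 0xAE (\256) opens a UTF-8 multi-byte sequence, but 0x20 (\040, space) is not
# a valid continuation byte (those are 0x80-0xBF), so "ae 20" fails validation:
if printf '\256\040' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  echo 'ae 20: valid UTF-8'
else
  echo 'ae 20: invalid UTF-8'
fi
```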


    This discussion, and the agreement that JSON has to be UTF-x encoded per the spec, is not restricted to DataPower:
    https://www.google.de/search?q=json+iso-8859-1


    Hermann.

  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2014-07-19T22:56:29Z  
    • HermannSW
    • ‏2014-07-19T16:19:07Z


    OK, let's state this again: JSON has to be UTF-x encoded per the spec.

    That does not mean that there is no workaround with DataPower.
    But it is really tricky.


    I have done a lot of binary transform stylesheets before and posted about them ([1], [2]).
    But I never had to do a binary xform like this one before; it is tricky.

    The problem is that "less than" and "ampersand" characters HAVE to be escaped by the XML spec.

    Therefore any use of the converted binary input in an XSLT statement immediately breaks strings containing either of the two special XML characters.

    The conversion from iso-8859-1 to utf-8 is done by this FFD (slide 10 of the 1st webcast link above):

    $ cat String.iso-8859-1.ffd 
    <!--  
        This FFD converts the input into an XML tree like this:
    
        <object>
          <message>***string data***</message> 
        </object> 
    --> 
    <File name="object" syntax="syn">
      <Syntax name="syn" encoding="iso-8859-1"/> 
      <Field name='message' type='String'/>
    </File>
    $
    

     

    As said, the result of this input-mapping conversion cannot be used in any XSLT construct.

    The solution I found to this dilemma is to immediately convert the input to a binaryNode.
    And that is done by  dp:binary-decode(dp:encode(., 'base-64'))  !

    So this is Transform Binary stylesheet doing "binary" iso-8859-1 to utf-8 conversion:

    $ cat iso-8859-1.2.utf-8.binary.xsl 
    <xsl:stylesheet version="1.0" 
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
      xmlns:dp="http://www.datapower.com/extensions" 
      extension-element-prefixes="dp" 
    >
      <dp:input-mapping  href="String.iso-8859-1.ffd" type="ffd"/>
      <dp:output-mapping href="store:///pkcs7-convert-input.ffd" type="ffd"/>
    
      <xsl:output omit-xml-declaration="yes"/>
    
      <xsl:template match="/">
        <object>
          <message>
            <xsl:value-of select="dp:binary-decode(dp:encode(., 'base-64'))"/>
          </message>
        </object>
      </xsl:template>
    </xsl:stylesheet>
    $
    

     

    The FFD file as well as the stylesheet are attached for your use.

    Just put a Transform Binary action with the above stylesheet in front of a convert-http action with the JSON conversion map, give the service a Non-XML request type, and you are good.


    Hermann <myBlog/> <myTweets/> | <GraphvizFiddle/> | <xqib/> | <myCE/> <myFrameless/>

  • Asim80
    Asim80
    22 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2014-07-21T14:48:39Z  
    • HermannSW
    • ‏2014-07-19T22:56:29Z


    Thanks, it seems to be working :)

  • Daviiid
    Daviiid
    340 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2015-12-02T14:56:18Z  
    • HermannSW
    • ‏2014-07-19T22:56:29Z


    Hi Hermann

    When I use <dp:input-mapping href="store:///pkcs7-convert-input.ffd" type="ffd"/>, the 0d0a characters are the same before and after the transformation.

    But when I use <dp:input-mapping href="String.iso-8859-1.ffd" type="ffd"/>, 0d0a is transformed into 0a.

    So I can't use the code below:

    <xsl:variable name="rowsep" select="'&#13;'"/>
     <xsl:for-each select="str:split($str,$rowsep)">
     ...
     </xsl:for-each>
    

    Is that normal?

    I changed it to

    <xsl:variable name="rowsep" select="'&#10;'"/>
    

    But I don't understand the conversion from 0d0a to 0a.

    Updated on 2015-12-02T15:03:06Z at 2015-12-02T15:03:06Z by Daviiid
  • Daviiid
    Daviiid
    340 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2015-12-07T10:37:15Z  
    • Daviiid
    • ‏2015-12-02T14:56:18Z


    Hermann,

    can you explain why 0d0a is converted into 0a?

  • HermannSW
    HermannSW
    6065 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2015-12-07T11:42:33Z  
    • Daviiid
    • ‏2015-12-07T10:37:15Z


    Hi,

    the "String.*.ffd" type conversions mentioned in the webcasts are one of the very few features you are allowed to use without having a Contivo Analyst license.
    We cannot change the behavior because of Contivo Analyst.

    Your change of rowsep should work; you now split on LF instead of CR.
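    The effect can be mimicked outside the box: once CRLF has been normalized to bare LF, a CR-based split finds no separators, while an LF-based split works. Here tr sketches the FFD's normalization, and the file name is illustrative:

```shell
# Two CRLF-terminated records; tr -d '\r' mimics the FFD's CRLF -> LF normalization:
printf 'a\r\nb\r\n' | tr -d '\r' > normalized.txt
# No 0x0d bytes remain, so splitting on CR (&#13;) would find nothing...
od -An -tx1 normalized.txt
# ...while splitting on LF (&#10;) yields the two records:
wc -l < normalized.txt
```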


    Hermann.

     

  • Daviiid
    Daviiid
    340 Posts

    Re: How to override UTF-8 presumption while processing XML?

    ‏2015-12-07T13:01:14Z  
    • HermannSW
    • ‏2015-12-07T11:42:33Z


    Thanks :-)