4 replies Latest Post - ‏2014-06-19T20:12:14Z by barbara_morris
L.Maartens

utf-8 parsing in httpapi with expat returns so si data

‏2014-06-18T21:02:21Z |

I am parsing an XML file with Expat. The file is correctly marked as CCSID(1208) and the data within is also UTF-8.

It contains lines like this:

<AdrLine>Industrieweg 43&#8211;hal 6</AdrLine>

In my program (CCSID 37, same as my job CCSID) this comes through as <AdrLine>Industrieweg 43ÚÚhal 6</AdrLine>

(in hex this is 3c 41 64 72 4c 69 6e 65 3e 49 6e 64 75 73 74 72 69 65 77 65 67 20 34 33 0e c3 9a c3 9a 0f 68 61 6c 20 36 3c 2f 41 64 72 4c 69 6e 65 3e )

The string x'0e c3 9a c3 9a 0f' is where the error is. The Expat parser provides a string to the HTTPAPI program, which is then converted to EBCDIC, but somewhere along the way the input string &#8211; is mangled into an SO/SI string.

Can anybody shed some light on this problem?

Kind regards,

Loek Maartens

 

Updated on 2014-06-18T23:31:28Z by L.Maartens
  • scott_klement

    Re: utf-8 parsing in httpapi with expat returns so si data

    ‏2014-06-19T14:32:20Z  in response to L.Maartens

    Are you truly using Expat directly?  Or are you using HTTPAPI's XML parsing routine (which uses Expat under the covers, but does additional processing of its own)?

    Expat does not understand CCSIDs or EBCDIC at all.  The way I typically configure it, it will read your data in binary, automatically detect whether it is ASCII, ISO-8859-1, UTF-8 or UTF-16.  (These are the only things Expat understands.)   On output, it'll always send it to your program as UTF-16, so you can use RPG's "UCS-2" data type with the output.
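
To see what Expat alone does with the character reference, outside HTTPAPI, here is a minimal sketch using Python's standard `xml.parsers.expat` binding. This is an assumption made for illustration: on IBM i you would call Expat from RPG or C and receive UTF-16 buffers, whereas the Python binding hands back Unicode strings. `parse_text` is a hypothetical helper name.

```python
import xml.parsers.expat

def parse_text(xml_bytes):
    """Return the concatenated character data Expat reports for a document."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver text in several pieces, so collect them all.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)

doc = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<AdrLine>Industrieweg 43&#8211;hal 6</AdrLine>')
text = parse_text(doc)

# List the code points outside ASCII that Expat produced.
print([f"U+{ord(c):04X}" for c in text if ord(c) > 0x7F])
```

In this sketch the only non-ASCII code point Expat reports is U+2013, with no SO/SI bytes, which suggests looking at the conversion that happens after Expat.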

    However, if you are using HTTPAPI's routines, they do additional processing.  They will convert the input data into UTF-8 using CCSIDs based on the options you pass to the routines.  They will also convert the result into EBCDIC for you, which is where the SO/SI is likely to come from.  (This is all done to simplify the process for programmers who don't want to do all of this work manually.)

    Can you explain how you're using it?   Also, can you explain what you're expecting to get from &#8211;?

    • L.Maartens

      Re: utf-8 parsing in httpapi with expat returns so si data

      ‏2014-06-19T17:45:12Z  in response to scott_klement

      Hi Scott,

      I am using the HTTPAPI support to parse an XML file. I did see in the Expat source code that it converts the &#8211; string to a Unicode value. This leads me to think that somewhere in the iconv (?) routines this Unicode value is converted into a DBCS value with the SO/SI codes. Not what I would have expected, since we have an SBCS system.

      I would expect the default substitution character (x'3F' in EBCDIC) to be returned. I would then convert that value to a full stop (.).  I still need to do some debugging on the case at hand to see what is returned from Expat into HTTPAPI, and then onward to my end-tag handler procedure.
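
The behaviour expected here (an unmappable character turning into a full stop, with no SO/SI bytes) can be sketched as follows. This is a hedged illustration only: it assumes Python's `cp037` codec as a stand-in for CCSID 37, and `to_ebcdic` is a hypothetical helper, not anything HTTPAPI or iconv provides.

```python
def to_ebcdic(text, fallback="."):
    """Encode text as EBCDIC (cp037), replacing unmappable characters."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp037")
        except UnicodeEncodeError:
            # No single-byte EBCDIC equivalent: emit the fallback instead.
            out += fallback.encode("cp037")
    return bytes(out)

converted = to_ebcdic("Industrieweg 43\u2013hal 6")
print(converted.hex(" "))
print(converted.decode("cp037"))
```

Doing the replacement per character keeps the output strictly single-byte, so the field length matches the character count of the input.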

      Just for information: I do like all of the projects you have made available to the RPG community. They taught me a lot about things an average RPG programmer would never be involved with, or would not know where to learn about, starting with stream files and going all the way to interfacing with Java programs. And all this in understandable code, not the rubbish (sorry, but it had to be said) that IBM provided as examples for the use of their XML parsers...

      Updated on 2014-06-19T17:46:54Z by L.Maartens
      • scott_klement

        Re: utf-8 parsing in httpapi with expat returns so si data

        ‏2014-06-19T19:43:17Z  in response to L.Maartens

        HTTPAPI is calling the IBM-supplied iconv() API to convert the data from CCSID 13488 (UCS-2) to the job CCSID (it uses CCSID 0, which defaults to the job CCSID.)

        Iconv is configured by calling the QtqIconvOpen() API.  All options are set to 0=default (besides the CCSID itself).  The IBM docs are here:

        http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i_71/apis/QTQICONV.htm?lang=en

         

        I don't know why it'd use SI/SO characters if your job CCSID is a SBCS CCSID...    But I guess that's what's happening?
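
As a debugging aid for the UCS-2 to job-CCSID step described above, the sketch below lists which code points in a UCS-2 buffer have no single-byte EBCDIC mapping. It assumes Python's `utf-16-be` and `cp037` codecs as stand-ins for CCSID 13488 and CCSID 37, and `unmappable` is a hypothetical name. It says nothing about what iconv on IBM i actually emits for such characters.

```python
def unmappable(ucs2_bytes, target="cp037"):
    """Return the code points in a UCS-2 buffer that the target cannot hold."""
    text = ucs2_bytes.decode("utf-16-be")
    bad = []
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            bad.append(f"U+{ord(ch):04X}")
    return bad

# The fragment around the EN DASH, as a UCS-2 (big-endian) buffer.
sample = "43\u2013hal".encode("utf-16-be")
print(unmappable(sample))
```

Running a check like this over the parsed values would show whether the SO/SI bytes line up with characters that cannot be represented in the job CCSID.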

  • barbara_morris

    Re: utf-8 parsing in httpapi with expat returns so si data

    ‏2014-06-19T20:12:14Z  in response to L.Maartens

    It looks like it's adding the x'0E' and x'0F' to the UTF-8 data.

    That hex string that Loek posted is UTF-8. The first three characters, x'3c4164', are UTF-8 "<Ad".

    x'0ec39ac39a0f' is the value being set for &#8211;. &#8211; (EN DASH) maps to UTF-16 x'2013' and UTF-8 x'E28093'.
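
These encodings can be checked off the IBM i with a short Python sketch (Python is used here only as a convenient calculator; the hex dump is copied from the original post):

```python
import html

# Resolve the numeric character reference and show its encodings.
dash = html.unescape("&#8211;")
utf16 = dash.encode("utf-16-be").hex()
utf8 = dash.encode("utf-8").hex()
print(f"U+{ord(dash):04X}", utf16, utf8)

# Decode the hex dump from the original post as UTF-8.
dump = bytes.fromhex(
    "3c 41 64 72 4c 69 6e 65 3e 49 6e 64 75 73 74 72 69 65 77 65 67 20 34 33 "
    "0e c3 9a c3 9a 0f 68 61 6c 20 36 3c 2f 41 64 72 4c 69 6e 65 3e"
)
decoded = dump.decode("utf-8")
print(repr(decoded))
```

The dump decodes cleanly as UTF-8, with x'0E' and x'0F' wrapped around two copies of U+00DA (x'C39A' each) where the single EN DASH (x'E28093') should be.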

    Here's how XML-INTO handles it, and also what the UCS2 value is when converted to UTF-8 (using the new 7.2 CCSID support in RPG). (I used a shorter XML value, just "abc&#8211;def")

        dcl-s a varucs2(10);                                   
        dcl-s b varchar(10) ccsid(*utf8);                     
        xml-into a %xml('<a>abc&#8211;def</a>' : 'ccsid=ucs2');
        b = a;  // convert UCS2 to UTF-8
        *inlr = '1';                                         

    In debug after the assignment b = a

    > EVAL a:x                                        
         00000     00070061 00620063 20130064 00650066
         00010     00000000 0000.... ........ ........
    > EVAL b:x                                        
         00000     00096162 63E28093 64656600 ........
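
The debug output can be cross-checked with a small Python sketch. It assumes the leading x'0007' and x'0009' are the two-byte length prefixes of the varying-length fields, so only the bytes after them are treated as data.

```python
# Data portion of field a (UCS-2), taken from the EVAL a:x output.
ucs2_hex = "0061 0062 0063 2013 0064 0065 0066"
value = bytes.fromhex(ucs2_hex).decode("utf-16-be")

# Re-encode as UTF-8 and compare with the EVAL b:x output.
as_utf8 = value.encode("utf-8")
print(len(value), len(as_utf8), as_utf8.hex().upper())
```

The lengths 7 and 9 match the two prefixes, and the UTF-8 bytes match the data shown for b, with the EN DASH as x'E28093' and no SO/SI bytes.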