Topic
9 replies. Latest post 2014-01-17T13:24:47Z by kimbert@uk.ibm.com
Professor

Pinned topic FileInput node processing of XML files

2013-08-14T21:44:29Z

WMB v8.0.2 on RHEL

I have an input XML file and need to process it with the FileInput node. For now, all of my individual XML messages are separated by a special separator so that the FileInput node parses them as separate records from the file. The file currently looks like this (pseudo-code):

----file start
<msg>........</msg>
separator
<msg>........</msg>
separator
<msg>........</msg>
separator
---end of file

This allows me to easily configure the FileInput node to parse each message. However, I would like to use true XML for the entire file, and I am not sure how to make the FileInput node split one large XML message into "chunks" of individual messages. My input file will look like this:

----file start
<container>
<msg>........</msg>
<msg>........</msg>
<msg>........</msg>
</container>
---end of file

Any suggestions?

Or do I have to use a Compute node to split my large message into chunks? If so, are there any high-performance code snippets for that?

  • KY0J_Simbu_Selvarasu

    2013-08-15T16:50:30Z, in response to Professor

    http://www.ibm.com/developerworks/websphere/library/techarticles/0505_storey/0505_storey.html

  • ThomasBien

    2013-08-19T17:27:04Z, in response to Professor

    The proper way to read an XML file would be to use the FileInput node in "Whole File" mode and then handle the records in your flow.
    The article linked by Simbu would be quite relevant for that.

    Another way would be to use the FileInput node in "Parsed Record Sequence" mode.
    The <container> element may cause some difficulty here.


    If you are absolutely set on using the FileInput node in "Delimiter" mode, you would need to ensure the physical format of the XML file being delivered.
    If you can guarantee a new line between each <msg> element, then your delimiter could be the new-line character(s).
    If you can guarantee there is no extraneous whitespace, then your delimiter could be the element tags (though this would be messy).
    Again, the <container> element may cause some difficulty here.

  • This reply was deleted by GeneRK4 2014-01-13T02:41:14Z.
  • GeneRK4

    2014-01-13T02:42:58Z, in response to Professor
    Hi,
     As I read through the suggested link, I see the code below, which is used before parsing on demand:
    "Before we can start processing the outer message, we need to take the InputRoot and make it mutable. This allows us to destroy parts of the tree after we have successfully parsed them thus freeing up memory resource. We can do this by copying the input tree backed by the bitstream to the environment."

     

    -- Copy the input tree, backed by the bitstream, to the environment
    -- Set a message pointer to this copied message tree
    SET Environment.Variables.InputRoot = InputRoot.XML;
    DECLARE InMessageCopy REFERENCE TO Environment.Variables.InputRoot;

     

    It looks like the entire message is stored in an Environment variable. If the file size is very large, even though we parse on demand, does it not consume more broker memory and heap space to store the entire message in the Environment variable?
    If this understanding is wrong, please correct me.
  • kimbert@uk.ibm.com

    2014-01-14T10:15:44Z, in response to Professor

    Only the bitstream (actually, a reference to the bitstream) gets copied to the Environment. Not the entire message, and certainly not the entire message tree. So memory usage will not be a problem, even for huge messages.

    • GeneRK4

      2014-01-15T02:55:54Z, in response to kimbert@uk.ibm.com

      Hi,

      Thanks for your response..

      As I read through the analysis below, it looks like the whole file gets copied, and not only a reference.

      http://www.mqseries.net/phpBB/viewtopic.php?t=56136&postdays=0&postorder=asc&start=15

      Could you please confirm whether this is correct?

      • kimbert@uk.ibm.com

        2014-01-15T10:30:09Z, in response to GeneRK4

        Fair question. I'll go into a bit more detail now (having discussed the question with the developer who implemented the streaming support in WMB).

        Firstly, let me emphasize that many customers are already using this technique to process multi-GB XML (and non-XML) files. WMB/IIB is explicitly designed to be capable of processing very large files, and it does work when configured correctly.

        XMLNSC is a streaming parser. If used correctly, it will never use more than 64KB of memory for the input stream. The BLOB parser is not a streaming parser: it will read the entire input document into memory. So don't try to test streaming scenarios using the BLOB domain!

        A message flow can accidentally cause high memory usage. This can happen in several ways:

        - Calling ASBITSTREAM(InputRoot.XMLNSC) in the message flow. This will assign the entire bitstream to a BLOB variable.

        - Forgetting to assign the correct domain to the Environment tree before copying the fragment of the input message. This will cause the tree to be inflated before the copy happens (when source and target trees have the same domain, it is just a bitstream copy).

        - Writing code that causes the entire message tree to be inflated.

        If you implement your flow as described in the document (but use XMLNSC, not the XML domain!) then none of these things will occur, and you should find that your flow can handle files of arbitrary size.
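
        As a rough ESQL sketch of that pattern (untested; the Environment field name 'InputCopy' is invented for illustration, and the <container>/<msg> element names are taken from the original question):

            -- Create the Environment subtree with the XMLNSC domain BEFORE copying,
            -- so the assignment below is a cheap bitstream copy, not a tree inflation
            CREATE LASTCHILD OF Environment.Variables DOMAIN('XMLNSC') NAME 'InputCopy';
            SET Environment.Variables.InputCopy = InputRoot.XMLNSC;

            -- Walk the repeating <msg> elements via a reference; the streaming
            -- XMLNSC parser inflates only the record currently being visited
            DECLARE msgRef REFERENCE TO Environment.Variables.InputCopy.container.msg;
            WHILE LASTMOVE(msgRef) DO
                SET OutputRoot.XMLNSC.msg = msgRef;
                PROPAGATE;
                MOVE msgRef NEXTSIBLING;
                -- Delete the record just processed so its memory can be reclaimed
                IF LASTMOVE(msgRef) THEN
                    DELETE PREVIOUSSIBLING OF msgRef;
                END IF;
            END WHILE;
            RETURN FALSE;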

        • PRichelle

          2014-01-15T16:16:16Z, in response to kimbert@uk.ibm.com

          Hello,

          In the past, I made a PoC where I needed to parse a 2GB XML file. The file had a root element InstrumentValues that contains a repeating "InstrumentValue" structure.

          I was able to process the files without consuming too much memory (just enough to parse each repeating structure, one at a time).

          To do this, I used a FileInput node with XMLNSC, parse on demand, and record detection set to "Parsed Record Sequence".

          This node was followed by a Compute node that propagates only one repeating structure at a time.

          The ESQL code used was:


          CREATE COMPUTE MODULE LARGE_FILE_PROCESSING_XML_MF_Compute
              CREATE FUNCTION Main() RETURNS BOOLEAN
              BEGIN
                  -- Alternative: copy the input tree to the Environment first:
                  -- CREATE FIRSTCHILD OF Environment DOMAIN 'XMLNSC' NAME 'XML_InstrumentValues';
                  -- SET Environment.XML_InstrumentValues = InputRoot.XMLNSC;
                  -- DECLARE ptrOLEV_InstValue REFERENCE TO Environment.XML_InstrumentValues.InstrumentValues.InstrumentValue;

                  -- Walk the repeating InstrumentValue records via a reference
                  DECLARE ptrOLEV_InstValue REFERENCE TO InputRoot.XMLNSC.InstrumentValues.InstrumentValue;
                  DECLARE numbRecords INT 0;
                  DECLARE numbRecordsMatched INT 0;
                  WHILE LASTMOVE(ptrOLEV_InstValue) DO
                      SET numbRecords = numbRecords + 1;
                      IF ptrOLEV_InstValue.CurrencyId = 'USD' THEN
                          -- CopyMessageHeaders() is the standard procedure generated by the Compute node
                          CALL CopyMessageHeaders();
                          SET OutputRoot.XMLNSC.InstrumentValue = ptrOLEV_InstValue;
                          SET numbRecordsMatched = numbRecordsMatched + 1;
                          PROPAGATE;
                      END IF;
                      MOVE ptrOLEV_InstValue NEXTSIBLING;
                      -- Delete the record just processed to release its memory
                      IF LASTMOVE(ptrOLEV_InstValue) THEN
                          DELETE PREVIOUSSIBLING OF ptrOLEV_InstValue;
                      END IF;
                  END WHILE;

                  SET Environment.NumberOfRecords = numbRecords;
                  SET Environment.NumberOfRecordsMatched = numbRecordsMatched;

                  PROPAGATE TO TERMINAL 1;
                  RETURN FALSE;
              END;
          END MODULE;

           

          • GeneRK4

            2014-01-16T01:31:48Z, in response to PRichelle

            Thanks a lot ! Very helpful replies... !

            One more question...

            If my input is a repetitive structure, then I can use the XMLNSC parser in the FileInput node.

            But if I have to use Fixed Length as the record detection type, is it still possible to split a 20MB input file into 100KB files?

            Or can I use Delimiter as the record detection type to split 20,000 records into individual files?

            Updated on 2014-01-16T01:43:47Z by GeneRK4
  • kimbert@uk.ibm.com

    2014-01-17T13:24:47Z, in response to Professor

    GeneRK4 said:

    "If my input is repetitive structure,then I can take it up as XMLNSC parser  in the FileInput node..

    But if I have to take up fixed length as Record detection type,then whether it is still possible to get the 20MB input file (to be splitted as 100 KB files) possible?

    Or using demiliter as Record detection type to split 20000 records into individual files...?"

    The 'Record Detection' property controls how the file is split into sub-transactions. You can use this low-memory-usage technique with any of its settings (Fixed Length / Delimited / Parsed Record Sequence). Does that answer your question?