Topic
  • 9 replies
  • Latest Post - ‏2014-01-17T13:24:47Z by kimbert@uk.ibm.com
Professor
9 Posts

Pinned topic FileInput node processing of XML files

2013-08-14T21:44:29Z

WMB v8.0.2 on RHEL

I have an input XML file that I need to process with the FileInput node. For now, my individual XML messages are separated by a special separator so that the FileInput node parses them as separate records from the file. The file currently looks like this (pseudo-code):

----file start
<msg>........</msg>
separator
<msg>........</msg>
separator
<msg>........</msg>
separator
---end of file

This allows me to easily configure the FileInput node to parse each message. However, I would like the entire file to be true XML, and I am not sure how to make the FileInput node split one large XML message into "chunks" of individual messages. My input file will look like this:

----file start
<container>
<msg>........</msg>
<msg>........</msg>
<msg>........</msg>
</container>
---end of file

Any suggestions?

Or do I have to use a Compute node to split my large message into chunks? If so, are there any code snippets for doing that with high performance?

  • KY0J_Simbu_Selvarasu
    40 Posts

    Re: FileInput node processing of XML files

    2013-08-15T16:50:30Z

    http://www.ibm.com/developerworks/websphere/library/techarticles/0505_storey/0505_storey.html

  • ThomasBien
    10 Posts

    Re: FileInput node processing of XML files

    2013-08-19T17:27:04Z

    The proper way to read an XML file would be to use the FileInput node in "Whole File" mode and then handle the records in your flow.
    The article linked by Simbu would be quite relevant for that.

    Another way would be to use the FileInput node in "Parsed Record Sequence" mode.
    The <container> element may cause some difficulty, here.


    If you are absolutely set on using the FileInput node in "Delimiter" mode, you would need to ensure the physical format of the XML file being delivered.
    If you can guarantee a new line between each <msg> element, then your delimiter could be the new line character(s).
    If you can guarantee there is no extraneous whitespace, then your delimiter could be the element tags (though this would be messy).
    Again, the <container> element may cause some difficulty, here.

  • GeneRK4
    8 Posts

    Re: FileInput node processing of XML files

    2014-01-13T02:42:58Z

    Hi,

    In the suggested article, I see the text and code below, just before the parse-on-demand step:
    "Before we can start processing the outer message, we need to take the InputRoot and make it mutable. This allows us to destroy parts of the tree after we have successfully parsed them thus freeing up memory resource. We can do this by copying the input tree backed by the bitstream to the environment."

     

    -- Copy the input tree, backed by the bitstream, to the environment
    -- Set a message pointer to this copied message tree
    SET Environment.Variables.InputRoot = InputRoot.XML;
    DECLARE InMessageCopy REFERENCE TO Environment.Variables.InputRoot;

     

    It looks like the entire message is stored in the Environment tree. If the file is very large, then even with parse on demand, does this not consume more broker memory and heap to hold the entire message in the Environment?
    If this understanding is wrong, please correct me.
  • kimbert@uk.ibm.com
    515 Posts

    Re: FileInput node processing of XML files

    2014-01-14T10:15:44Z

    Only the bitstream ( actually a reference to the bitstream ) gets copied to the Environment. Not the entire message, and certainly not the entire message tree. So memory usage will not be a problem, even for huge messages.

  • GeneRK4
    8 Posts

    Re: FileInput node processing of XML files

    2014-01-15T02:55:54Z

    Hi,

    Thanks for your response.

    According to the analysis linked below, it looks like the whole file gets copied, and not only the reference.

    http://www.mqseries.net/phpBB/viewtopic.php?t=56136&postdays=0&postorder=asc&start=15

    Could you please confirm whether this is correct?

  • kimbert@uk.ibm.com
    515 Posts

    Re: FileInput node processing of XML files

    2014-01-15T10:30:09Z

    Fair question. I'll go into a bit more detail now ( having discussed the question with the developer who implemented the streaming support in WMB ).

    Firstly, let me emphasize that many customers already use this technique to process multi-GB XML ( and non-XML ) files. WMB/IIB is explicitly designed to be capable of processing very large files, and it does work when configured correctly.

    XMLNSC is a streaming parser. If used correctly, it will never use more than 64 KB of memory for the input stream. The BLOB parser is not a streaming parser - it will read the entire input document into memory. So don't try to test streaming scenarios using the BLOB domain!

    A message flow can accidentally cause high memory usage. This can happen in several ways:

    - calling ASBITSTREAM(InputRoot.XMLNSC) in the message flow. This will assign the entire bitstream to a BLOB variable. 

    - forgetting to assign the correct domain to the Environment tree before copying the fragment of input message. This will cause inflation of the tree to occur before the copy happens ( when source and target trees have the same domain it is just a bitstream copy )

    - writing code that causes the entire message tree to be inflated.

    If you implement your flow as described in the document ( but use XMLNSC, not the XML domain! ) then none of these things will occur and you should find that your flow can handle files of arbitrary size.
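
    As a rough illustration of the second point, a minimal sketch of creating the Environment folder with the XMLNSC domain before the copy might look like the following. The module name and the folder name 'Source' are hypothetical, and the element names container and msg are taken from the file layout in the original question; this is not the code from the linked article.

    ```esql
    CREATE COMPUTE MODULE SplitContainer_Compute   -- hypothetical module name
        CREATE FUNCTION Main() RETURNS BOOLEAN
        BEGIN
            -- Create the target folder with the XMLNSC domain BEFORE copying,
            -- so that source and target share the same domain.
            -- 'Source' is an assumed folder name.
            CREATE LASTCHILD OF Environment DOMAIN('XMLNSC') NAME 'Source';
            SET Environment.Source = InputRoot.XMLNSC;

            -- Work on the mutable copy, one <msg> element at a time.
            DECLARE msgRef REFERENCE TO Environment.Source.container.msg;
            WHILE LASTMOVE(msgRef) DO
                -- OutputRoot is cleared after each PROPAGATE, so rebuild it every time.
                SET OutputRoot.Properties = InputRoot.Properties;
                SET OutputRoot.XMLNSC.msg = msgRef;
                PROPAGATE;
                MOVE msgRef NEXTSIBLING;
                -- Delete the element that has just been processed.
                IF LASTMOVE(msgRef) THEN
                    DELETE PREVIOUSSIBLING OF msgRef;
                END IF;
            END WHILE;
            -- Everything has already been propagated inside the loop.
            RETURN FALSE;
        END;
    END MODULE;
    ```

    The sketch assumes the input node uses the XMLNSC domain with on-demand parsing, and it avoids ASBITSTREAM and any reference that would walk the whole tree up front.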

  • PRichelle
    14 Posts

    Re: FileInput node processing of XML files

    2014-01-15T16:16:16Z

    Hello,

    In the past I built a PoC that needed to parse a 2 GB XML file. The file had a root element InstrumentValues containing a repeating "InstrumentValue" structure.

    I was able to process the file without consuming too much memory (just enough to parse one repeating structure at a time).

    To do this, I used a FileInput node with the XMLNSC domain, on-demand parsing, and record detection set to "Parsed Record Sequence".

    This node was followed by a Compute node that propagates only one repeating structure at a time.

    The ESQL code used was:

     

    CREATE COMPUTE MODULE LARGE_FILE_PROCESSING_XML_MF_Compute
        CREATE FUNCTION Main() RETURNS BOOLEAN
        BEGIN
            -- Disabled alternative: copy the input into an XMLNSC folder in the Environment first.
            -- CREATE FIRSTCHILD OF Environment DOMAIN 'XMLNSC' NAME 'XML_InstrumentValues';
            -- SET Environment.XML_InstrumentValues = InputRoot.XMLNSC;
            -- DECLARE ptrOLEV_InstValue REFERENCE TO Environment.XML_InstrumentValues.InstrumentValues.InstrumentValue;

            -- Point at the first repeating InstrumentValue element.
            DECLARE ptrOLEV_InstValue REFERENCE TO InputRoot.XMLNSC.InstrumentValues.InstrumentValue;
            DECLARE numbRecords INT 0;
            DECLARE numbRecordsMatched INT 0;
            -- Walk the siblings one at a time.
            WHILE LASTMOVE(ptrOLEV_InstValue) DO
                SET numbRecords = numbRecords + 1;
                -- Propagate only the USD records, each as its own message.
                -- CopyMessageHeaders() is the Toolkit-generated procedure, not shown here.
                IF ptrOLEV_InstValue.CurrencyId = 'USD' THEN
                    CALL CopyMessageHeaders();
                    SET OutputRoot.XMLNSC.InstrumentValue = ptrOLEV_InstValue;
                    SET numbRecordsMatched = numbRecordsMatched + 1;
                    PROPAGATE;
                END IF;
                MOVE ptrOLEV_InstValue NEXTSIBLING;
                -- Delete the record that has just been processed.
                IF LASTMOVE(ptrOLEV_InstValue) THEN
                    DELETE PREVIOUSSIBLING OF ptrOLEV_InstValue;
                END IF;
            END WHILE;

            -- Record the totals, then signal completion on a separate terminal.
            SET Environment.NumberOfRecords = numbRecords;
            SET Environment.NumberOfRecordsMatched = numbRecordsMatched;

            PROPAGATE TO TERMINAL 1;
            -- CALL CopyMessageHeaders();
            -- CALL CopyEntireMessage();
            RETURN FALSE;
        END;
    END MODULE;
     

     

  • GeneRK4
    8 Posts

    Re: FileInput node processing of XML files

    2014-01-16T01:31:48Z

    Thanks a lot! Very helpful replies!

    One more question.

    If my input is a repeating structure, I can use the XMLNSC parser in the FileInput node.

    But if I have to use Fixed Length as the Record detection type, is it still possible to split a 20 MB input file into 100 KB files?

    Or to use Delimited as the Record detection type to split 20000 records into individual files?

     

     

    Updated on 2014-01-16T01:43:47Z by GeneRK4
  • kimbert@uk.ibm.com
    515 Posts

    Re: FileInput node processing of XML files

    2014-01-17T13:24:47Z

    GeneRK4 said:

    "If my input is a repeating structure, I can use the XMLNSC parser in the FileInput node.

    But if I have to use Fixed Length as the Record detection type, is it still possible to split a 20 MB input file into 100 KB files?

    Or to use Delimited as the Record detection type to split 20000 records into individual files?"

    The 'Record Detection' property controls how the file is split into sub-transactions. You can use this low-memory-usage technique with any of its settings ( Fixed Length / Delimited / Parsed Record Sequence ). Does that answer your question?
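
    For the one-file-per-record case in the question, a minimal sketch of a Compute node placed between the FileInput node ( Record detection set to Delimited or Fixed Length ) and a FileOutput node could look like the following. It assumes that the FileOutput node keeps its default request location for the file name ( LocalEnvironment.Destination.File.Name ) and that the Compute node's Compute mode includes LocalEnvironment. The module name and the record_N.dat naming scheme are hypothetical.

    ```esql
    CREATE COMPUTE MODULE SplitRecordsToFiles_Compute   -- hypothetical module name
        CREATE FUNCTION Main() RETURNS BOOLEAN
        BEGIN
            -- The FileInput node invokes the flow once per record,
            -- so the record is simply passed through unchanged.
            SET OutputRoot = InputRoot;
            SET OutputLocalEnvironment = InputLocalEnvironment;

            -- LocalEnvironment.File.Record holds the number of this record in the input file.
            -- The output file name pattern below is an assumed naming scheme.
            SET OutputLocalEnvironment.Destination.File.Name =
                'record_' || CAST(InputLocalEnvironment.File.Record AS CHARACTER) || '.dat';

            RETURN TRUE;
        END;
    END MODULE;
    ```

    Because each record arrives as its own small message, the Compute node never sees the whole file, whichever Record detection setting is used.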