Parsing XML documents one segment at a time

You can parse XML documents by passing the parser one segment (or record) of XML text at a time. Two major applications of this technique are processing very large documents and processing XML documents that reside in a data set.

To use this feature, compile your program with the XMLPARSE(XMLSS) compiler option in effect.

You parse an XML document a segment at a time by initializing the parse data item to the first segment of the XML document, and then executing the XML PARSE statement. The parser processes the XML text and returns XML events to your processing procedure as usual.

At the end of the text segment, the parser signals an END-OF-INPUT XML event, with XML-CODE set to zero. If there is another segment of the document to process, in your processing procedure move the next segment of XML data to the parse data item, set XML-CODE to one, and return to the parser. To signal the end of XML segments to the parser, return to the parser with XML-CODE still set to zero.

The length of the parse data item is evaluated for each segment, and determines the segment length.
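The END-OF-INPUT handshake described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the program name, the data names XML-SEGMENT and SEGMENTS-SENT, and the two 12-byte literals are all invented for the example. It assumes a compile with XMLPARSE(XMLSS) in effect.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SEGDEMO.
      * Sketch only. Compile with XMLPARSE(XMLSS) in effect.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical data: one document split into two 12-byte
      * segments, with the split inside the element content.
       01  XML-SEGMENT        PIC X(12).
       01  SEGMENTS-SENT      PIC 9 VALUE 0.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Initialize the parse data item to the first segment.
           MOVE '<msg>Hello, ' TO XML-SEGMENT
           MOVE 1 TO SEGMENTS-SENT
           XML PARSE XML-SEGMENT
               PROCESSING PROCEDURE HANDLE-EVENT
               ON EXCEPTION
                   DISPLAY 'Parse failed, XML-CODE = ' XML-CODE
               NOT ON EXCEPTION
                   DISPLAY 'Parse completed'
           END-XML
           GOBACK.
       HANDLE-EVENT.
           EVALUATE XML-EVENT
               WHEN 'END-OF-INPUT'
                   IF SEGMENTS-SENT = 1
      * Another segment exists: refill the parse data item
      * and set XML-CODE to one to continue parsing.
                       MOVE 'world!</msg>' TO XML-SEGMENT
                       MOVE 2 TO SEGMENTS-SENT
                       MOVE 1 TO XML-CODE
                   END-IF
      * Otherwise XML-CODE stays zero: no more segments.
               WHEN OTHER
                   DISPLAY XML-EVENT ' [' XML-TEXT ']'
           END-EVALUATE.
```

Both literals fill the 12-byte parse data item exactly, so no padding spaces are passed to the parser. Because the second segment ends with the complete root end tag, the handler would normally see END-OF-INPUT only once.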

Variable-length segments: If the XML document segments are variable length, specify a variable-length item for the parse data item. For example, for variable-length XML segments, you can define the parse data item as one of the following items:

A variable-length group item that contains an OCCURS DEPENDING ON clause
A reference-modified item
An FD record that specifies the RECORD IS VARYING DEPENDING ON clause, where the depending-on data item is used as the length in a reference modifier or ODO object for the FD record

When you send an XML document to the parser in multiple segments, document content is in some cases returned to the processing procedure in multiple fragments by means of multiple events, rather than as one large fragment in a single event.

For example, if the document is split into two segments with the split point in the middle of a string of content characters, the parser returns the content in two separate CONTENT-CHARACTERS events. In the processing procedure, you must reassemble the string of content as needed by the application.
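One way to reassemble split content is to append each CONTENT-CHARACTERS fragment to a buffer and use the buffer only when the element ends. The sketch below assumes a flat document with no nested elements; CONTENT-BUFFER, CONTENT-NEXT, and the other data names are hypothetical, and the 100-byte buffer size is an arbitrary choice.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. JOINTEXT.
      * Sketch only. Compile with XMLPARSE(XMLSS) in effect.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  XML-SEGMENT        PIC X(12).
       01  SEGMENTS-SENT      PIC 9 VALUE 0.
      * CONTENT-BUFFER collects fragments; CONTENT-NEXT is the
      * position at which the next fragment is appended.
       01  CONTENT-BUFFER     PIC X(100).
       01  CONTENT-NEXT       PIC 9(4) COMP VALUE 1.
       PROCEDURE DIVISION.
       MAIN-PARA.
           MOVE '<msg>Hello, ' TO XML-SEGMENT
           MOVE 1 TO SEGMENTS-SENT
           XML PARSE XML-SEGMENT
               PROCESSING PROCEDURE HANDLE-EVENT
               ON EXCEPTION
                   DISPLAY 'Parse failed, XML-CODE = ' XML-CODE
           END-XML
           GOBACK.
       HANDLE-EVENT.
           EVALUATE XML-EVENT
               WHEN 'START-OF-ELEMENT'
      * Start a fresh string for this element.
                   MOVE 1 TO CONTENT-NEXT
               WHEN 'CONTENT-CHARACTERS'
      * Append this fragment after any earlier fragments.
                   STRING XML-TEXT DELIMITED BY SIZE
                       INTO CONTENT-BUFFER
                       WITH POINTER CONTENT-NEXT
                       ON OVERFLOW
                           DISPLAY 'Content buffer is full'
                   END-STRING
               WHEN 'END-OF-ELEMENT'
      * The buffer now holds the reassembled content.
                   IF CONTENT-NEXT > 1
                       DISPLAY 'Content: '
                           CONTENT-BUFFER(1:CONTENT-NEXT - 1)
                   END-IF
               WHEN 'END-OF-INPUT'
                   IF SEGMENTS-SENT = 1
                       MOVE 'world!</msg>' TO XML-SEGMENT
                       MOVE 2 TO SEGMENTS-SENT
                       MOVE 1 TO XML-CODE
                   END-IF
           END-EVALUATE.
```

The split point falls between "Hello, " and "world!", so the content is expected to arrive in two events and be displayed as a single string at the end tag. A document with nested elements would need one buffer per nesting level, or a different reset rule.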

Starting element tags, attribute names, namespace declarations, and ending element tags are always delivered to the processing procedure in a single event, even if those items are split between two segments of a document.

If a segment split occurs between the bytes of a multibyte character, the parser detects the split and reassembles the character for delivery in a single event.

If you are parsing an XML document with an unknown number of repetitive elements to be processed, use unbounded tables. For more information on unbounded tables, see Working with unbounded tables and groups.

For each such element in a given document, manage the table size using one of the following methods:

Calculating number of elements:
1. Count the number of elements in the document during an initial parse.
2. Set the OCCURS DEPENDING ON object for the table to that size.
3. Allocate storage for the table.
4. Parse the document a second time to process the XML.
Incremental expansion:
1. Set an initial size in the OCCURS DEPENDING ON object for the unbounded table.
2. Parse the document normally. For each element, check the limit and expand the unbounded table if necessary. To expand the table:
  1. Allocate a new, larger storage area.
  2. Copy the data from the smaller area.
  3. Free the smaller area.
  4. Set the table pointer to the address of the larger storage area.
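The incremental expansion method might look like the following sketch. It assumes a compiler level that supports the ALLOCATE and FREE statements and unbounded tables; the document literal, the doubling strategy, the 20-byte entry size, and every data name are hypothetical choices made for illustration. Checking NEW-PTR for NULL after ALLOCATE is omitted for brevity.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. GROWTBL.
      * Sketch only. Compile with XMLPARSE(XMLSS) in effect.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  XML-DOC            PIC X(37) VALUE
           '<list><i>a</i><i>b</i><i>c</i></list>'.
       01  ENTRY-SIZE         PIC 9(9) COMP-5 VALUE 20.
       01  ITEM-COUNT         PIC 9(9) COMP-5 VALUE 0.
      * ITEM-LIMIT is the ODO object: the current table capacity.
       01  ITEM-LIMIT         PIC 9(9) COMP-5 VALUE 2.
       01  NEW-LIMIT          PIC 9(9) COMP-5.
       01  OLD-PTR            POINTER.
       01  NEW-PTR            POINTER.
       LINKAGE SECTION.
       01  ITEM-TABLE.
           05  ITEM-ENTRY     PIC X(20)
                   OCCURS 1 TO UNBOUNDED TIMES
                   DEPENDING ON ITEM-LIMIT.
      * Second view, used only while copying to a larger area.
       01  NEW-TABLE.
           05  NEW-ENTRY      PIC X(20)
                   OCCURS 1 TO UNBOUNDED TIMES
                   DEPENDING ON NEW-LIMIT.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Allocate storage for the initial table size.
           ALLOCATE ITEM-LIMIT * ENTRY-SIZE CHARACTERS
               RETURNING NEW-PTR
           SET ADDRESS OF ITEM-TABLE TO NEW-PTR
           XML PARSE XML-DOC
               PROCESSING PROCEDURE HANDLE-EVENT
               ON EXCEPTION
                   DISPLAY 'Parse failed, XML-CODE = ' XML-CODE
           END-XML
           DISPLAY 'Items stored: ' ITEM-COUNT
           SET OLD-PTR TO ADDRESS OF ITEM-TABLE
           FREE OLD-PTR
           GOBACK.
       HANDLE-EVENT.
           IF XML-EVENT = 'START-OF-ELEMENT' AND XML-TEXT = 'i'
      * Check the limit and expand the table if necessary.
               IF ITEM-COUNT >= ITEM-LIMIT
                   PERFORM EXPAND-TABLE
               END-IF
               ADD 1 TO ITEM-COUNT
               MOVE SPACES TO ITEM-ENTRY(ITEM-COUNT)
           END-IF
           IF XML-EVENT = 'CONTENT-CHARACTERS' AND ITEM-COUNT > 0
               MOVE XML-TEXT TO ITEM-ENTRY(ITEM-COUNT)
           END-IF.
       EXPAND-TABLE.
      * Allocate a new, larger storage area (double the limit).
           COMPUTE NEW-LIMIT = ITEM-LIMIT * 2
           ALLOCATE NEW-LIMIT * ENTRY-SIZE CHARACTERS
               RETURNING NEW-PTR
           SET ADDRESS OF NEW-TABLE TO NEW-PTR
      * Copy the data from the smaller area.
           MOVE ITEM-TABLE TO NEW-TABLE
      * Free the smaller area.
           SET OLD-PTR TO ADDRESS OF ITEM-TABLE
           FREE OLD-PTR
      * Point the table at the larger area and raise its limit.
           SET ADDRESS OF ITEM-TABLE TO NEW-PTR
           MOVE NEW-LIMIT TO ITEM-LIMIT.
```

Doubling the limit keeps the number of reallocations small for long documents; a fixed increment would work equally well for documents whose size is roughly known in advance.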

QSAM and VSAM files: You can process XML documents stored in a QSAM or VSAM file as follows:

Open the file and read the first record of the XML document.
Execute the XML PARSE statement with the FD record as the parse data item.
In the processing-procedure logic for handling the END-OF-INPUT event, read the next record of the XML document into the parse data item. If the read succeeds (that is, end-of-file, file status code 10, was not reached), set XML-CODE to one and return to the parser. If the read reaches end-of-file, return to the parser with XML-CODE still set to zero.
In your processing procedure logic for the END-OF-DOCUMENT event, close the file.
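The steps above can be sketched for a QSAM file with variable-length records as follows. The ddname XMLIN, the 255-byte maximum record length, and all data names are assumptions made for this illustration; the FD record uses the RECORD IS VARYING DEPENDING ON data item as its ODO object so that each segment has the length of the record just read.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. FILEPARS.
      * Sketch only. Compile with XMLPARSE(XMLSS) in effect.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT XML-FILE ASSIGN TO XMLIN
               FILE STATUS IS XML-FILE-STATUS.
       DATA DIVISION.
       FILE SECTION.
       FD  XML-FILE
           RECORD IS VARYING FROM 1 TO 255 CHARACTERS
               DEPENDING ON RECORD-LENGTH
           RECORDING MODE V.
       01  XML-RECORD.
           05  PIC X OCCURS 1 TO 255 TIMES
                   DEPENDING ON RECORD-LENGTH.
       WORKING-STORAGE SECTION.
       01  XML-FILE-STATUS    PIC XX.
       01  RECORD-LENGTH      PIC 9(4) COMP-5.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Open the file and read the first record.
           OPEN INPUT XML-FILE
           IF XML-FILE-STATUS NOT = '00'
               DISPLAY 'Open failed, status ' XML-FILE-STATUS
               GOBACK
           END-IF
           READ XML-FILE
           IF XML-FILE-STATUS NOT = '00'
               DISPLAY 'No XML data, status ' XML-FILE-STATUS
               CLOSE XML-FILE
               GOBACK
           END-IF
      * Parse with the FD record as the parse data item.
           XML PARSE XML-RECORD
               PROCESSING PROCEDURE HANDLE-EVENT
               ON EXCEPTION
                   DISPLAY 'Parse failed, XML-CODE = ' XML-CODE
                   CLOSE XML-FILE
           END-XML
           GOBACK.
       HANDLE-EVENT.
           EVALUATE XML-EVENT
               WHEN 'END-OF-INPUT'
      * Read the next record into the parse data item.
                   READ XML-FILE
                   EVALUATE XML-FILE-STATUS
                       WHEN '00'
                           MOVE 1 TO XML-CODE
                       WHEN '10'
      * End-of-file: leave XML-CODE set to zero.
                           CONTINUE
                       WHEN OTHER
                           DISPLAY 'Read failed, status '
                               XML-FILE-STATUS
                   END-EVALUATE
               WHEN 'END-OF-DOCUMENT'
                   CLOSE XML-FILE
               WHEN OTHER
                   DISPLAY XML-EVENT ' [' XML-TEXT ']'
           END-EVALUATE.
```

In this sketch an unexpected read error also leaves XML-CODE at zero, so the parser receives no further segments; a production program would likely handle that case more explicitly.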

Miscellaneous information after the root element:

The root element of an XML document might be followed by zero or more occurrences of a comment or processing instruction, in any order. If you parse the document one segment at a time, the parser signals an END-OF-INPUT XML event after processing the end tag of the root element only if the last item in the segment is incomplete. If the segment ends with a complete XML item (such as the root element end tag, or after that tag, a complete comment or processing instruction), the next XML event after the event for the item itself is the END-OF-DOCUMENT XML event.

Tip: To provide successive segments of XML data after the end of the root element, include at least the first nonspace character of an XML item at the end of each segment. Include a complete item only on the last segment that you want the parser to process.

For instance, in the following example, in which each line represents a segment of an XML document, the segment that includes the text "This comment ends this segment" is the last segment to be parsed:

                                       
  <Tagline>                            
  COBOL is the language of the future! 
  </Tagline> <                         
  !--First comment--                   
  > <?pi data?> <!-                    
  -This comment ends this segment-->
  <!-- This segment is not included in the parse-->

Example: parsing an XML document one segment at a time

related concepts
XML events
XML-CODE

related references
XMLPARSE (compiler option)