Parsing XML with DFSORT
MartinPacker
Following on from Generating XML Using DFSORT - Part II, here are some thoughts on how to parse XML with DFSORT.
NOTE: For more complex XML than this entry describes you probably want to use the XML Toolkit for z/OS. This provides C++ and Java parsers for XML and a stand-alone XSLT processor.
In this example I'll show you how to take XML that looks like this and create a flat file from it:
<?xml version="1.0" encoding="UTF-8" ?>
<band>
  <member surname="Mercury" firstname="Freddie" job="Singer" />
  <member surname="May" firstname="Brian" job="Guitarist" />
  <member surname="Taylor" firstname="Roger" job="Drummer" />
  <member surname="Deacon" firstname="John" job="Bassist" />
</band>
and turn it into our (now familiar)
Mercury         Freddie         Singer
May             Brian           Guitarist
Taylor          Roger           Drummer
Deacon          John            Bassist
which could be mapped with DFSORT Symbols:
Surname,*,16,CH
Firstname,*,16,CH
Job,*,10,CH
In fact - in this example - we won't use these symbols.
The first thing we need to do is to keep only the member rows, which throws away the first two rows and the last row.
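As a hedged illustration of this filtering step, one way to keep only the member rows is a substring-search INCLUDE. This is a sketch rather than necessarily the statement used for this example: it assumes 80-byte fixed-length input records, and that every data row (and no other row) contains the literal <member followed by a blank.

```
* Assumption: FB input, LRECL=80. Keep rows containing '<member '
* anywhere in the record, so indentation does not matter.
  INCLUDE COND=(1,80,SS,EQ,C'<member ')
```

With the sample XML above, the XML declaration, the <band> row and the </band> row do not contain that literal, so only the four member rows survive.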
Next we need to parse the data rows using the following INREC statement:
INREC IFTHEN=(WHEN=INIT, PARS
which looks rather complicated.
This uses IFTHEN (introduced in 2004) and PARSE (introduced in 2006 with UK90006/UK90007).
In fact the IFTHEN clauses are a pipeline.
The first stage in this pipeline strips off the surrounding angle brackets from each line, producing an 80-byte record.
The second stage extracts the surname attribute into a 16-byte field, which is prepended onto the 80-byte record created in the first stage.
The third stage extracts the firstname attribute into a 16-byte field, which is placed in front of the 80-byte record created in the first stage but after the 16-byte surname field extracted in the second stage.
The fourth and final stage extracts the job attribute into a field, which is appended onto the 16-byte surname field extracted in the second stage and the 16-byte firstname field extracted in the third stage.
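To make the four stages concrete, here is a minimal sketch of how such a pipeline could be written. It is an illustration under assumptions, not necessarily the exact statement behind this post: the delimiter strings, the assignment of %1 to %4 to particular stages, the BUILD layouts and the IFOUTLEN=42 operand (16 + 16 + 10 bytes) are all my guesses, chosen to match the stage descriptions and the sample data.

```
* Sketch only. Assumptions: FB input, LRECL=80; delimiter strings,
* %n numbering and IFOUTLEN are illustrative, not from the original.
* Stage 1: text between < and > becomes an 80-byte record.
* Stage 2: surname   (16 bytes) goes in front of the 80 bytes.
* Stage 3: firstname (16 bytes) goes after surname, before the 80.
* Stage 4: job (10 bytes) follows the two names; the 80 bytes go.
  INREC IFTHEN=(WHEN=INIT,
      PARSE=(%1=(STARTAFT=C'<',ENDBEFR=C'>',FIXLEN=80)),
      BUILD=(%1)),
    IFTHEN=(WHEN=INIT,
      PARSE=(%2=(STARTAFT=C'surname="',ENDBEFR=C'"',FIXLEN=16)),
      BUILD=(%2,1,80)),
    IFTHEN=(WHEN=INIT,
      PARSE=(%3=(STARTAFT=C'firstname="',ENDBEFR=C'"',FIXLEN=16)),
      BUILD=(1,16,%3,17,80)),
    IFTHEN=(WHEN=INIT,
      PARSE=(%4=(STARTAFT=C'job="',ENDBEFR=C'"',FIXLEN=10)),
      BUILD=(1,32,%4)),
    IFOUTLEN=42
```

Each stage searches for its own attribute name, so in this sketch a row such as <member job="Singer" surname="Mercury" firstname="Freddie" /> should, as far as I can tell, produce the same output record as the row in the sample.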
I admit this looks complicated but it does allow for the attributes to appear in any order in an XML element. What it doesn't do is to allow any old multiple-line format for the input XML. For that you really do need the toolkit. But I'm convinced there are tricks we can teach DFSORT when it comes to parsing XML. It's just that we'd need time to think about them. :-)
A note on pipelining: When I first saw IFTHEN I thought of it potentially as a pipelining technique. This example is quite a good one for pipelining as everything happens in four IFTHEN WHEN=INIT stages. It's actually proved a lot simpler to construct the DFSORT processing this way - and it has isolated all the processing to the INREC statement. So there's lots you can do later on in the DFSORT invocation. (And the ability to allow the attributes (surname, firstname and job) to be in any order was made much easier by this pipelining approach.)
But I have to be realistic about pipelining: At this stage in DFSORT's development we don't have all the capabilities for branching etc. that CMS (/TSO) Pipelines has. But I offer you the pipelining model as another way of thinking about what IFTHEN can do, as well as the "treat different records in different ways" original intention. However, we do have nice constructs like WHEN=ANY, WHEN=NONE and HIT=NEXT to construct reasonable pipelines with.
What has been really nice about recent DFSORT innovations is that you find out more things you can do with them every day.
PARSE is worthy of some more discussion: It's brand new (April 2006) and allows DFSORT to parse (duh!) variable-format data. In this case the length of each attribute is variable. %1, %2, %3 and %4 refer to different variable-length fields that we can use in subsequent processing stages. Let's take one usage as an example:
In this case we extract into the variable %1 the first string in the input record that starts with
Note: We don't actually need to know how long the string we're extracting is. Prior to PARSE we would've had to know. And that, to me, is one of the very nice features of PARSE.
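As a separate, stand-alone illustration of the operands involved, here is a hedged sketch of a single PARSE definition. The delimiter strings and the FIXLEN value are assumptions chosen to fit the sample data, not a quotation of the usage discussed above.

```
* STARTAFT: extraction begins just after the first surname="
* ENDBEFR:  extraction ends just before the next "
* FIXLEN:   the value is padded with blanks on the right, or
*           truncated, to exactly 16 bytes in %1
  INREC PARSE=(%1=(STARTAFT=C'surname="',ENDBEFR=C'"',FIXLEN=16)),
        BUILD=(%1)
```

So "May" and "Mercury" both end up as 16-byte fields, without the statement saying anything about where in the record the value sits or how long it is.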