XProc is a markup language that describes processing pipelines composed of discrete steps that apply operations on XML documents. If a specification's importance is related to the quality of individuals working on it, then XProc is significant, indeed. The W3C XML Processing Model WG is packed with pragmatic XML practitioners and superstars as well as grizzled veterans of past XML-related efforts: Erik Bruchez, Andrew Fang, Paul Grosso, Rui Lopes, Murray Maloney, Alex Milowski, Michael Sperberg-McQueen, Jeni Tennison, Henry Thompson, Richard Tobin, Alessandro Vernet, Norman Walsh (Chair), and Mohamed Zergaoui, to name a few.
XProc is not the first W3C attempt to establish an XML processing pipelines standard. In 2002, as part of the XML Processing Model Workshop, there was the "XML Pipeline Definition Language Submission," submitted by Sun Microsystems, Alis Technologies, Arbortext, Cisco Systems, Fujitsu, Markup Technology, and Oracle. This submission was published on 28 February 2002 as "XML Pipeline Definition Language Version 1.0."
In 2004, a W3C Note attempted to set out requirements for an XML processing model: "XML Processing Model Requirements," W3C Working Group Note 05 April 2004. In 2005, another W3C member submission was proposed: "XML Pipeline Language (XPL) Version 1.0" (draft), submitted by Orbeon, Inc., on 11 March and published on 11 April.
I haven't seen any specific studies citing the need for XProc, so I here proffer a few of my own unabashedly biased opinions:
- XProc’s declarative format, combined with the simplicity of thinking in terms of pipelines, will mean that non-technical people can be involved in writing and maintaining processing workflows.
- XProc, in many configurations, is amenable to streaming, whereas other approaches to control XML processes are not (for example, XSLT).
- XProc steps focus on performing specific operations, which over time should experience greater optimization (in an XProc processor used by many) versus one-off code that you or I write (used by few).
- XProc's standard step library and extensibility mechanisms position XProc to be an all-encompassing solution.
- Structured data (such as XProc markup) is typically easier to reuse than structured code.
- One of XProc's inspirations is UNIX® pipelines, which hopefully all can agree is a good thing!
Not surprisingly, XProc will probably gain considerable favor among groups who work with and generate XML documents. You can also imagine that people with business workflows and XML documents flowing through them might be excited by the possibility of modeling their workflows with XProc pipelines, and then running them on their XML documents.
XProc comprises a small vocabulary divided into three categories: core elements, ancillary elements, and a standard step library. The core elements provide modern computing language constructs, such as conditional and iterative processing and try/catch error mechanisms:
<p:for-each>: Iterative processing statement
<p:choose>: Case logic statement (similar to XSLT's <xsl:choose>)
<p:group>: Groups a series of steps into a named sub-pipeline
<p:try>: Provides a try/catch mechanism to handle dynamic errors
<p:viewport>: Applies a sub-pipeline process to subtrees contained in a single XML document
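To give a feel for how two of these core elements might look in practice, here is a minimal sketch rather than a listing from the specification. The file name myschema.xsd, the section element, and the validation-failed marker are hypothetical, and the version attribute is an assumption that some processors require but that the listings later in this article omit.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- p:viewport: treat each matching subtree as its own document,
       run the sub-pipeline on it, and put the result back in place -->
  <p:viewport match="section">
    <p:add-attribute match="/*" attribute-name="checked" attribute-value="true"/>
  </p:viewport>

  <!-- p:try: attempt validation; if a dynamic error occurs,
       the p:catch branch supplies a small marker document instead -->
  <p:try>
    <p:group>
      <p:validate-with-xml-schema>
        <p:input port="schema">
          <p:document href="myschema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:group>
    <p:catch>
      <p:identity>
        <p:input port="source">
          <p:inline><validation-failed/></p:inline>
        </p:input>
      </p:identity>
    </p:catch>
  </p:try>

</p:pipeline>
```

Both branches of the try/catch end in a step with a single primary output, so whichever branch runs, the pipeline still produces one result document.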
The elements used in the declaration and definition of steps provide the basis for XProc extensibility and reusability:
<p:library>: Contains step declarations to provide reusable step libraries
<p:declare-step>: Defines a step and its functional signature, typically in a <p:library>
<p:import>: Brings in through a Uniform Resource Identifier (URI) any declared pipelines or library to the current pipeline
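As a rough illustration of how these three elements fit together, the following sketch shows a small library and a pipeline that imports it. The file name steps-library.xpl, the ex namespace URI, and the ex:strip-comments step type are all hypothetical, and the version attribute is an assumption that may not apply to every draft or processor.

```xml
<!-- steps-library.xpl (hypothetical file name) -->
<p:library xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:ex="http://example.org/ns/steps"
           version="1.0">

  <!-- The type attribute gives the new step a name that other pipelines can call -->
  <p:declare-step type="ex:strip-comments">
    <p:input port="source"/>
    <p:output port="result"/>
    <p:delete match="comment()"/>
  </p:declare-step>

</p:library>
```

A pipeline could then pull in the library by URI and use the declared step like any built-in one:

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:ex="http://example.org/ns/steps"
            version="1.0">
  <p:import href="steps-library.xpl"/>
  <ex:strip-comments/>
</p:pipeline>
```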
XProc ancillary elements are mainly children nodes of XProc steps and handle tasks such as step bindings, making it easy to configure a step. These elements consist of:
- Inputs and outputs: These elements define ports that can bind to the inputs or outputs of other steps and define the flow of XML documents. In addition, you can define XML documents inline (directly in the XProc document) or bring in documents through an external URI.
- Options: Options are the primary mechanism for configuring steps, set either with a <p:with-option> element or as a name-value attribute on the step instance. Note that options are part of the functional signature of a step, and their names are invariant.
- Variables: Variables are used with compound steps and define XPath variables for use within a compound step sub-pipeline.
- Parameters: Unlike options and variables, parameters have names that are computed at run time and are not related to any functional signature, as defined by <p:declare-step>.
Perhaps the most significant aspect of XProc is the 30-40 steps defined in a standard XProc library, which are split into a set of required and optional steps.
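The ancillary elements described above can be sketched in one small pipeline. This is an illustration under assumptions, not a listing from the specification: the status attribute, the id attribute on the input document, and the mode parameter are hypothetical, and the version attribute may not be needed by every draft or processor.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <p:group>
    <!-- Variable: computed once from the incoming document, visible inside this group -->
    <p:variable name="label" select="/*/@id"/>

    <!-- Options: match and attribute-name are set as name-value attributes,
         attribute-value is computed with p:with-option -->
    <p:add-attribute match="/*" attribute-name="status">
      <p:with-option name="attribute-value" select="concat('checked-', $label)"/>
    </p:add-attribute>
  </p:group>

  <p:xslt>
    <!-- Input: bind the stylesheet port to a document brought in by URI -->
    <p:input port="stylesheet">
      <p:document href="mystylesheet.xslt"/>
    </p:input>
    <!-- Parameter: the name is not part of the step signature,
         it is simply passed through to the style sheet -->
    <p:with-param name="mode" select="'draft'"/>
  </p:xslt>

</p:pipeline>
```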
The real power of XProc is embodied in its standard library of required and optional steps, which perform a wide variety of tasks, such as:
- XSLT, XQuery, XInclude processing
- Schema validation (DTD, RELAX NG, Schematron, XML Schema)
- XML update operations, such as inserting or deleting XML elements and attributes
- XML storage and retrieval
- Wrap, unwrap, escape, and unescape XML
- HTTP requests
- Execute native commands
Here is a brief overview of each step contained in the XProc standard library:
- Required steps:
<p:add-attribute>: Add an attribute to a set of matching elements.
<p:add-xml-base>: Add or correct xml:base attributes on elements.
<p:compare>: Compare two documents for equivalence.
<p:count>: Count the number of documents in source input.
<p:delete>: Delete items specified by a match pattern from the source input.
<p:directory-list>: Enumerate the directory listing into the result output.
<p:error>: Generate a dynamic error.
<p:escape-markup>: Escape source input.
<p:http-request>: Interact with resources identified by Internationalized Resource Identifiers (IRIs) over HTTP.
<p:identity>: Make an exact copy of an input source to the result output.
<p:insert>: Insert an XML selection into the source input.
<p:label-elements>: Create a label for each matched element, and store the value of the label in an attribute.
<p:load>: Load an XML resource that an IRI specifies and provide it as result output.
<p:make-absolute-uris>: Make the value of an element or attribute in the source input an absolute IRI value in the result output.
<p:namespace-rename>: Rename the namespace declarations.
<p:pack>: Merge two document sequences.
<p:parameters>: Make available a set of parameters as a c:param-set XML document in the result output.
<p:rename>: Rename elements, attributes, or processing instructions.
<p:replace>: Replace matching elements.
<p:set-attributes>: Set attributes on matching elements.
<p:sink>: Accept source input and generate no result output.
<p:split-sequence>: Divide a single sequence into two.
<p:store>: Store a serialized version of its source input to a URI.
<p:string-replace>: Perform string replacement on the source input.
<p:unescape-markup>: Unescape the source input.
<p:unwrap>: Replace matched elements with their children.
<p:wrap>: Wrap matching nodes in the source document with a new parent element.
<p:wrap-sequence>: Produce a new sequence of documents.
<p:xinclude>: Apply XInclude processing to the input source.
<p:xslt>: Apply an XSLT version 1.0 or XSLT version 2.0 style sheet to the input source.
- Optional steps:
<p:exec>: Apply an external command to the input source.
<p:hash>: Generate a message digest or a digital fingerprint for some value.
<p:uuid>: Generate a Universally Unique Identifier (UUID).
<p:validate-with-relax-ng>: Validate the input XML against a RELAX NG schema.
<p:validate-with-schematron>: Validate the input XML against a Schematron schema.
<p:validate-with-xml-schema>: Validate the input XML against an XML Schema.
<p:www-form-urldecode>: Decode the x-www-form-urlencoded string into a set of XProc parameters.
<p:www-form-urlencode>: Encode a set of XProc parameter values as an x-www-form-urlencoded string.
<p:xquery>: Apply an XQuery version 1.0 query.
<p:xsl-formatter>: Render an XSL version 1.1 document (as in XSL-FO).
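To show how a few of these library steps might be chained, here is a short sketch. The element names title and heading and the processed attribute are hypothetical, and the option names are written as I understand them; because the library is still changing, they may differ in the draft or processor you use.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- Remove every comment from the incoming document -->
  <p:delete match="comment()"/>

  <!-- Rename each title element to heading -->
  <p:rename match="title" new-name="heading"/>

  <!-- Mark the document element as processed -->
  <p:add-attribute match="/*" attribute-name="processed" attribute-value="true"/>

</p:pipeline>
```

Each step reads the result of the step before it, so no explicit bindings are needed for a simple linear chain like this one.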
It is easy to create new steps from existing pipelines. If you want, you can even create third-party libraries with extension steps that augment the XProc processor itself.
Note: As the specification process is ongoing, the standard library is one area that continues to experience a bit of volatility. I suggest referring to up-to-date definitions in the current WD (see Resources) for specific details.
Listing 1 illustrates an XProc pipeline with a single step that applies an XSLT operation to an XML document.
Listing 1. Simple implicit pipeline
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" name="xslt-example">
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="mystylesheet.xslt"/>
    </p:input>
  </p:xslt>
</p:pipeline>
XProc pipelines accept zero or more XML documents as their input and produce zero or more XML documents as output. The XProc code in Listing 1 consists of a <p:pipeline> top-level element, a <p:xslt> step, and not much else. An XML document that comes into the standard input of the XProc processor is handed off to the <p:xslt> step, which then applies an XSLT transformation using mystylesheet.xslt (the style sheet referenced by the <p:document> element) to the XML document.
Because the pipeline contains only a single step, that step's results are placed on the result port for the entire pipeline, which (incidentally) typically outputs the XML document to standard output. Figure 1 shows this process, outlining how the XML document flows between the source and result ports.
Figure 1. Logic flow for a simple pipeline
These connections between ports are known as step bindings, and they control the flow of XML document processing. These bindings can be implicitly or explicitly defined. In the Listing 1 example, bindings were implicit, with process flow dictated by XProc's natural defaulting mechanisms.
Listing 2 shows a functionally equivalent pipeline with explicit step bindings.
Listing 2. Simple explicit pipeline
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" name="xslt-example">
  <p:input port="my-source" primary="true" sequence="false"/>
  <p:output port="my-result" primary="true" sequence="false">
    <p:pipe step="step1" port="result"/>
  </p:output>
  <p:xslt name="step1">
    <p:input port="source">
      <p:pipe step="xslt-example" port="my-source"/>
    </p:input>
    <p:input port="stylesheet">
      <p:document href="mystylesheet.xslt"/>
    </p:input>
  </p:xslt>
</p:declare-step>
In Listing 1, I used <p:pipeline>, which implicitly declared a source input port and a result output port. Using <p:declare-step> now means that I have to explicitly define these ports as well as declare step bindings between sequential sibling steps. These bindings and ports are summarized below:
- Top-level my-source input port will receive any standard input.
- Top-level my-result output port will receive the results of the step1 result port and place them on the standard output.
- The step1 source input is bound to the my-source input port.
It's difficult to illustrate step bindings between steps using a pipeline with a single step; so, I created a nontrivial example in which I show several XProc steps and logic structures. Listing 3 presents a more representative XProc example containing multiple steps along with some conditional logic steps.
Listing 3. Complex pipeline
<p:pipeline name="mypipeline" type="myexample" xmlns:p="http://www.w3.org/ns/xproc">
  <p:xinclude name="step1"/>
  <p:choose name="step2">
    <p:when test="/*[@version &lt; 2.0]">
      <!-- subpipeline //-->
      <p:validate-with-xml-schema name="step2a1">
        <p:input port="schema">
          <p:document href="newer-schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:when>
    <p:otherwise>
      <!-- subpipeline //-->
      <p:validate-with-xml-schema name="step2b1">
        <p:input port="schema">
          <p:document href="older-schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:otherwise>
  </p:choose>
  <p:for-each name="step3">
    <p:iteration-source select="//div"/>
    <!-- subpipeline //-->
    <p:string-replace name="step3a1">
      <p:option name="match" value="//span[@class='css1']"/>
      <p:option name="replace" value=""/>
    </p:string-replace>
  </p:for-each>
  <p:wrap-sequence name="step4">
    <p:option name="wrapper" value="document"/>
  </p:wrap-sequence>
</p:pipeline>
This pipeline roughly translates to the following:
- Input stdin XML documents.
- Apply XInclude processing (step1).
- Choose (step2) between using a newer (step2a1) or older (step2b1) schema and validate.
- Extract each (step3) HTML <div> element, applying a string replace operation (step3a1).
- Wrap up (step4) the final sequence of <div> elements with a <document> element.
- Output the XML documents to stdout.
Steps can have other input or output ports defined that work with non-XML documents, but only XML documents (as in XML infoset) can flow between primary input and output ports.
I used all three kinds of XProc steps in this example. Step types are represented as rectangles in the workflow diagram that Figure 2 shows.
Figure 2. Logic flow for a complex pipeline
The largest rectangle represents the whole <p:pipeline>, which can itself be invoked as a step. Because it contains a sub-pipeline, it is a compound step. The <p:choose> step has two sub-pipelines, which makes it a multi-container step. It chooses which sub-pipeline to follow based on the evaluation of an XPath expression. The <p:for-each> step contains a single sub-pipeline consisting of one step, so it's a compound step.
Most XProc steps are atomic steps. These steps apply a specific operation to an XML document, examples of which are the <p:xinclude> and <p:validate-with-xml-schema> steps in Listing 3.
Throughout the XProc specification process, the WG had to navigate several issues:
- Namespaces: XProc works with XML documents, which means that for many operations a processor has to keep track of the namespaces in those documents. For example, consider the p:unwrap step, which can remove the top-level element of a document. If this step removes a top-level element that happens to carry an xmlns namespace declaration, the XProc processor must apply a fixup so that the stripped namespace declaration is copied to the child elements in the resulting XML documents, taking care not to overwrite any other valid namespace declaration. The WD contains a non-normative "Section E. Guidance on Namespace Fixup" that attempts to outline what a processor must handle.
- XSLT and XPath versions: The timing of the XProc effort means that it finds itself in the middle of an adoption cycle between versions of XPath and XSLT. I think that when we start to use XProc, we will see how well XProc navigated the thorny issue of supporting multiple versions of XPath and XSLT.
- Options, variables, and parameters: Too much seems to be going on with options, variables, and parameters. I think many people might agree with me that a single entity might serve just as adequately.
- Streaming: I feel that the WG might have given this requirement too much weight in their earlier decisions on XProc. Considering that the preponderance of XML technologies that XProc will control are themselves not 100% up to standardized streaming, it seems like premature optimization to me, especially in a world where MapReduce and parallelization techniques seem to be gaining in prominence.
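The namespace issue in the first item above is easier to see with a concrete case. The following is only a sketch of the situation, with a hypothetical input document and ex namespace; the output shown in the comment is what I would expect a processor to produce after fixup, not output captured from a real implementation.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- Hypothetical input document:
         <ex:report xmlns:ex="http://example.org/ns">
           <ex:body>text</ex:body>
         </ex:report>
       The ex prefix is declared only on the element that is about to be removed. -->

  <p:unwrap match="/*"/>

  <!-- With the wrapper gone, the child still uses the ex prefix, so the
       serialized result needs to carry the declaration itself:
         <ex:body xmlns:ex="http://example.org/ns">text</ex:body> -->

</p:pipeline>
```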
As with any W3C specification, XProc underwent a lot of changes in terms of syntax and semantics (see Resources). The latest WD, dated 1 May 2008, is evolutionary as it irons out the details from the churn of the previous draft's decision to allow both version 1 and version 2 of XPath and XSLT.
The current WD is also revolutionary in that the spec was completely rewritten to disentangle some of the various notions that the <p:option> element had come to represent. For example, <p:option> is now used only in the functional signature of a step (only in <p:declare-step>), with <p:with-option> used in the step instance itself when setting an option's value. In addition, the <p:variable> element was added to hold computed values, especially for use with compound steps.
The most interesting change with options, parameters, and variables is the removal of the value attribute; values are now the result of evaluating an XPath expression defined by a select attribute. Listing 4 illustrates how this works, using p:with-option as an example.
Listing 4. Example of p:with-option use with an XPath expression
<ex:someStep>
  <p:with-option name="some-option-name" select="'some value'"/>
</ex:someStep>
Defining simple values for options, using an XPath expression, becomes tedious [for example, the need to use nested double (") and single (') quotation marks]. The syntactic shorthand (see Listing 5) was retained as the preferred method whereby options can be defined with name-value attributes on the step element itself.
Listing 5. Setting an option with syntactic shorthand
<ex:stepType option-name="some value"/>
All these changes make the specification much clearer, but at the expense of a larger XProc vocabulary (<p:with-option>, <p:with-param>, and so on).
One last significant change to mention is that you now must use <p:declare-step/> consistently to declare new steps. This change adds a slight cognitive load for users, who now have to think about an instance of a step versus its declaration (in a library or pipeline). Overloading too many concepts onto a single step element might have proved constraining, so I think that the WG's decision to split up these concepts now was a pragmatic one.
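The split between declaring a step and using an instance of it can be sketched as follows. The ex namespace, the ex:label-root step type, and the label option are hypothetical names chosen for illustration, and the version attribute is an assumption that may not apply to every draft or processor.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:ex="http://example.org/ns/steps"
                type="ex:label-root" version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Declaration side: p:option makes label part of the step's signature -->
  <p:option name="label" required="true"/>

  <!-- Instance side: p:with-option sets a value on a step being called;
       the declared option is read here as the XPath variable $label -->
  <p:add-attribute match="/*" attribute-name="label">
    <p:with-option name="attribute-value" select="$label"/>
  </p:add-attribute>
</p:declare-step>
```

A pipeline that has this declaration in scope could then call the step with the shorthand form, <ex:label-root label="draft"/>, or with an explicit <p:with-option name="label" select="'draft'"/> child.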
It's important for XML technologists to remind themselves that some families and phyla of developers do not work with XML. When someone from these groups asks, "Why do I need XProc?" my first response is usually that XProc is designed to be platform neutral, meaning that XProc can run everywhere a compliant XProc processor can run. However, if you already work with XML documents and technologies, XProc is probably something you have emulated with other approaches (XSLT, Apache Ant, Apache Cocoon site maps, Jelly, and so on), and you will be happy to see the arrival of XProc processors.
Note: No implementations of XProc are production-ready, but several are in development. See Resources for more information.
With XML seeping into every aspect and tier of computing, having a single, easy-to-understand processing approach like XProc to orchestrate one's expanding XML ecosystem might be a disruptive technology. The XProc standard library, combined with the extensibility of writing your own third-party step libraries, provides a powerful facade over existing and future XML processors. Thus, rather than fit your workflows around the vagaries of any specific technology, you can now honestly define your processing as a series of operations on XML documents.
XProc is expected to go into a second Last Call sometime in the coming months, which seems to indicate that you might see a W3C recommendation before the end of 2008.
XML Pipeline Definition Language Version 1.0: W3C Note 28 February 2002: Read the note submitted by Sun Microsystems, Alis Technologies, Arbortext, Cisco Systems, Fujitsu, Markup Technology, and Oracle.
XML Processing Model Requirements: W3C Working Group Note 05 April 2004: Peruse the W3C WG Note from 05 April 2004.
XML Pipeline Language (XPL) Version 1.0 (Draft): Review the W3C Member Submission submitted by Orbeon on 11 March 2005 and published on 11 April 2005.
XProc: An XML Pipeline Language: Explore this W3C XProc Editors Draft (the W3C working draft dated 01 May 2008).
Section E. Guidance on Namespace Fixup: Review the non-normative list of suggestions for implementors to follow to reduce the need to fix up namespaces.
XProc: An XML Pipeline Language (with revision marks): Peruse the W3C Working Draft with differences dated 8 May 2008.
XML Processing Model Requirements and Use Cases: Read the W3C XProc requirements and use cases document dated 11 April 2006.
XML Pipeline Language (XPL) Version 1.0 (Draft): Check out this draft of the early W3C member submission of an XML pipeline language.
IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
developerWorks technical events and webcasts: Stay current with technology in these sessions.
The technology bookstore: Browse for books on these and other technical topics.
podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
List of XProc implementations: See XProc implementations currently undergoing development.
Smallx: Explore Smallx. It is, according to its developers, "a library and set of tools that is being developed to process XML infosets. It has two distinct features in that the infoset implementation allows streaming of documents and that processing of infosets can be accomplished using a concept called pipelines."
Sxpipe: Try Simple XML Pipelines (sxpipe) to build a simple processing model for XML documents and choose the order in which components are evaluated.
Apache Ant: Get more information about and download this Java™-based build tool.
Apache Cocoon: Download and get more information on this Web development framework built around the concepts of separation of concerns.
Jelly: Get the tool to turn your XML code into executable files.
trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
XProc Public Comment: Check out the W3C XProc public comments mailing list or view the archives.
XML zone discussion forums: Participate in any of several XML-related discussions.
developerWorks XML zone: Share your thoughts: After you read this article, post your comments and thoughts in this forum. The XML zone editors moderate the forum and welcome your input.
developerWorks blogs: Check out these blogs and get involved in the developerWorks community.
Jim Fuller has been a professional developer for 15 years, working with several blue-chip software companies in both his native USA and the UK. He has co-written a few technology-related books and regularly speaks and writes articles focusing on XML technologies. He is a founding committee member for XML Prague and was in the gang responsible for EXSLT. He spends his free time playing with XML databases and XQuery. Jim is technical director for a few companies (FlameDigital, Webcomposite s.r.o.) and can be reached at email@example.com.