Discovering XProc

Enable the XML ecosystem with pipelines

James R. Fuller (jim.fuller@webcomposite.com), Technical Director, FlameDigital Limited & Webcomposite s.r.o.
Jim Fuller has been a professional developer for 15 years, working with several blue-chip software companies in both his native USA and the UK. He has co-written a few technology-related books and regularly speaks and writes articles focusing on XML technologies. He is a founding committee member for XML Prague and was in the gang responsible for EXSLT. He spends his free time playing with XML databases and XQuery. Jim is technical director for a few companies (FlameDigital, Webcomposite s.r.o.) and can be reached at jim.fuller@webcomposite.com.

Summary:  Since October 2005, the W3C XML Processing Model Working Group (WG) has collaborated on a Working Draft (WD) specification titled "XProc: An XML Pipeline Language." With early implementations starting to appear and a second Last Call anticipated from the W3C WG (paving the way to a W3C draft recommendation), it is clear that the XProc specification effort has picked up pace over the past 12 months. Discover what XProc is today and where it is headed, get the back story on some of the more contentious issues, and run through a few examples.

Date:  24 Jun 2008
Level:  Intermediate
Also available in:   Chinese  Japanese


XProc is a markup language that describes processing pipelines composed of discrete steps that apply operations on XML documents. If a specification's importance is related to the quality of individuals working on it, then XProc is significant, indeed. The W3C XML Processing Model WG is packed with pragmatic XML practitioners and superstars as well as grizzled veterans of past XML-related efforts: Erik Bruchez, Andrew Fang, Paul Grosso, Rui Lopes, Murray Maloney, Alex Milowski, Michael Sperberg-McQueen, Jeni Tennison, Henry Thompson, Richard Tobin, Alessandro Vernet, Norman Walsh (Chair), and Mohamed Zergaoui, to name a few.

Frequently used acronyms

  • DTD: Document Type Definition
  • HTTP: Hypertext Transfer Protocol
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language
  • XSL: Extensible Stylesheet Language
  • XSLT: Extensible Stylesheet Language Transformations

XProc is not the first W3C attempt to establish an XML processing pipelines standard. In 2002, as part of the XML Processing Model Workshop, there was the "XML Pipeline Definition Language Submission," submitted by Sun Microsystems, Alis Technologies, Arbortext, Cisco Systems, Fujitsu, Markup Technology, and Oracle. This submission was published on 28 February 2002 as "XML Pipeline Definition Language Version 1.0."

In 2004, a W3C Note attempted to set out requirements for an XML processing model: "XML Processing Model Requirements," W3C Working Group Note 05 April 2004. In 2005, another W3C member submission was proposed: "XML Pipeline Language (XPL) Version 1.0" (draft), submitted by Orbeon, Inc., on 11 March and published on 11 April.

Use cases

XProc’s goal is to promote an interoperable and standard approach to the processing of XML documents. The requirements behind that goal were formally set out in a group of use cases (see Resources), some of which I list below:

  • Apply a sequence of operations to XML documents.
  • Parse XML, validate it against a schema, and then apply an XSLT transformation.
  • Combine multiple XML documents (document aggregation).
  • Interact with Web services.
  • Use metadata retrieval.

I haven't seen any specific studies citing the need for XProc, so I here proffer a few of my own unabashedly biased opinions:

  • XProc’s declarative format, combined with the simplicity of thinking in terms of pipelines, will mean that non-technical people can be involved in writing and maintaining processing workflows.
  • XProc, in many configurations, is amenable to streaming, whereas other approaches to controlling XML processing (for example, XSLT) are not.
  • XProc steps focus on performing specific operations, which over time should experience greater optimization (in an XProc processor used by many) versus one-off code that you or I write (used by few).
  • XProc's standard step library and extensibility mechanisms position XProc to be an all-encompassing solution.
  • Structured data (such as XProc markup) is typically easier to reuse than structured code.
  • One of XProc's inspirations is UNIX® pipelines, which hopefully all can agree is a good thing!

Not surprisingly, XProc will probably gain considerable favor among those groups who work with and generate XML documents. You can also imagine that people with business workflows and XML documents flowing through them might be excited by the possibility of modeling their workflows with XProc pipelines, and then running them on their XML documents.

The XProc vocabulary

XProc comprises a small vocabulary divided into three categories: core elements, ancillary elements, and a standard step library. The core elements provide constructs familiar from modern programming languages, such as conditional and iterative processing and try/catch error handling:

  • <p:for-each>: Iterative processing statement
  • <p:choose>: Case logic statement (similar to XSLT <xsl:choose>)
  • <p:group>: Groups a series of steps into a named sub-pipeline
  • <p:try>: Provides a try/catch mechanism to handle dynamic errors
  • <p:viewport>: Applies a sub-pipeline process to subtrees contained in a single XML document
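As a rough illustration of how these core elements are written, the following sketch wraps a validation step in <p:try>. It is not a listing from the specification: the file name schema.xsd and the <invalid-document/> placeholder are hypothetical, and element details may differ between drafts.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" name="try-example">
	<p:try>
		<!-- The p:group holds the steps that might fail -->
		<p:group>
			<p:validate-with-xml-schema>
				<p:input port="schema">
					<!-- schema.xsd is a hypothetical file name -->
					<p:document href="schema.xsd"/>
				</p:input>
			</p:validate-with-xml-schema>
		</p:group>
		<!-- The p:catch sub-pipeline runs only if the group raises a dynamic error -->
		<p:catch>
			<p:identity>
				<p:input port="source">
					<!-- Hypothetical placeholder document emitted on failure -->
					<p:inline><invalid-document/></p:inline>
				</p:input>
			</p:identity>
		</p:catch>
	</p:try>
</p:pipeline>
```

If validation succeeds, the validated document flows out of the pipeline. If it fails, the placeholder document is produced instead of the pipeline aborting.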

The elements used in the declaration and definition of steps provide the basis for XProc extensibility and reusability:

  • <p:library>: Contains step declarations to provide reusable step libraries
  • <p:declare-step>: Defines a step and its functional signature, typically in a <p:library> element
  • <p:import>: Imports, by Uniform Resource Identifier (URI), a declared pipeline or library into the current pipeline
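The three elements above might fit together as in the following sketch, which shows two separate files in one block. The file names (library.xpl, main.xpl, to-html.xslt), the ex namespace URI, and the ex:to-html step type are all hypothetical. The explicit empty binding on the parameters port reflects my reading of the specification and may not apply to every draft.

```xml
<!-- library.xpl: a hypothetical step library -->
<p:library xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:ex="http://example.org/ns/steps">
	<!-- Declares the signature (ports) and the body of a reusable step -->
	<p:declare-step type="ex:to-html">
		<p:input port="source"/>
		<p:output port="result"/>
		<p:xslt>
			<p:input port="stylesheet">
				<p:document href="to-html.xslt"/>
			</p:input>
			<!-- No style sheet parameters are passed in this sketch -->
			<p:input port="parameters">
				<p:empty/>
			</p:input>
		</p:xslt>
	</p:declare-step>
</p:library>

<!-- main.xpl: imports the library and calls the declared step -->
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:ex="http://example.org/ns/steps">
	<p:import href="library.xpl"/>
	<ex:to-html/>
</p:pipeline>
```

Once imported, the declared step is invoked by its type name just like a step from the standard library.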

XProc ancillary elements are mainly child nodes of XProc steps and handle tasks such as step bindings, making it easy to configure a step. These elements consist of:

  • Inputs and outputs: These elements define ports that can bind to the inputs or outputs of other steps and define the flow of XML documents. In addition, you can define XML documents inline (directly in the XProc document) or bring in documents through an external URI.
  • Options: Options are the primary mechanism for configuring steps, with the <p:with-option> element or as a name-value attribute on the step instance. Note that options are part of the functional signature of a step, and their names are invariant.
  • Variables: Variables are used with compound steps and define XPath variables for use within a compound step sub-pipeline.
  • Parameters: Unlike options and variables, parameters have names that are computed at run time and are not related to any functional signature, as defined by <p:declare-step>.

Perhaps the most significant aspect of XProc, and the source of its real power, is the 30-40 steps defined in its standard library. These are split into required and optional steps and perform a wide variety of tasks, such as:

  • XSLT, XQuery, XInclude processing
  • Schema validation (DTD, RELAX NG, Schematron, XML Schema)
  • XML update operations, such as inserting or deleting XML elements and attributes
  • XML storage and retrieval
  • Wrap, unwrap, escape, and unescape XML
  • HTTP requests
  • Execute native commands

Here is a brief overview of each step contained in the XProc standard library:

  • Required steps:
    • <p:add-attribute>: Add an attribute to a set of matching elements.
    • <p:add-xml-base>: Add or correct xml:base attributes on elements.
    • <p:compare>: Compare two documents for equivalence.
    • <p:count>: Count the number of documents in source input.
    • <p:delete>: Delete items specified by a match pattern from the source input.
    • <p:directory-list>: Enumerate the directory listing into the result output.
    • <p:error>: Generate a dynamic error.
    • <p:escape-markup>: Escape source input.
    • <p:http-request>: Interact with resources identified by Internationalized Resource Identifiers (IRIs) over HTTP.
    • <p:identity>: Make an exact copy of an input source to the result output.
    • <p:insert>: Insert an XML selection into the source input.
    • <p:label-elements>: Create a label for each matched element, and store the value of the label in an attribute.
    • <p:load>: Load an XML resource that an IRI specifies and provide it as result output.
    • <p:make-absolute-uris>: Make the value of an element or attribute in the source input an absolute IRI value in the result output.
    • <p:namespace-rename>: Rename the namespace declarations.
    • <p:pack>: Merge two document sequences.
    • <p:parameters>: Make available a set of parameters as a c:param-set XML document in the result output.
    • <p:rename>: Rename elements, attributes, or processing instructions.
    • <p:replace>: Replace matching elements.
    • <p:set-attributes>: Set attributes on matching elements.
    • <p:sink>: Accept source input and generate no result output.
    • <p:split-sequence>: Divide a single sequence into two.
    • <p:store>: Store a serialized version of its source input to a URI.
    • <p:string-replace>: Perform string replacement on the source input.
    • <p:unescape-markup>: Unescape the source input.
    • <p:unwrap>: Replace matched elements with their children.
    • <p:wrap>: Wrap matching nodes in the source document with a new parent element.
    • <p:wrap-sequence>: Produce a new sequence of documents.
    • <p:xinclude>: Apply XInclude processing to the input source.
    • <p:xslt>: Apply an XSLT version 1.0 or XSLT version 2.0 style sheet to the input source.
  • Optional steps:
    • <p:exec>: Apply an external command to the input source.
    • <p:hash>: Generate a message digest or a digital fingerprint for some value.
    • <p:uuid>: Generate a Universally Unique Identifier (UUID).
    • <p:validate-with-relax-ng>: Validate the input XML with a RELAX NG schema.
    • <p:validate-with-schematron>: Validate the input XML with a Schematron schema.
    • <p:validate-with-xml-schema>: Validate the input XML with an XML Schema.
    • <p:www-form-urldecode>: Decode the x-www-form-urlencoded string into a set of XProc parameters.
    • <p:www-form-urlencode>: Encode a set of XProc parameter values as an x-www-form-urlencoded string.
    • <p:xquery>: Apply an XQuery version 1.0 query.
    • <p:xsl-formatter>: Render an XSL version 1.1 document (as in XSL-FO).
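To give a feel for how steps from this library combine, here is a minimal sketch that chains three of the required steps. The match patterns, the strong element name, and the status attribute are hypothetical choices for illustration, and the options are set with the attribute shorthand.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" name="cleanup-example">
	<!-- Remove all comments from the source document -->
	<p:delete match="comment()"/>
	<!-- Rename every <b> element to <strong> -->
	<p:rename match="b" new-name="strong"/>
	<!-- Stamp the document element with a hypothetical status attribute -->
	<p:add-attribute match="/*" attribute-name="status" attribute-value="cleaned"/>
</p:pipeline>
```

Each step reads the result of the step before it, so the document leaves the pipeline with comments removed, elements renamed, and the attribute added.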

It is easy to create new steps from existing pipelines. If you want, you can even create third-party libraries with extension steps that augment the XProc processor itself.

Note: As the specification process is ongoing, the standard library is one area that continues to experience a bit of volatility. I suggest referring to up-to-date definitions in the current WD (see Resources) for specific details.


Example pipelines

Listing 1 illustrates an XProc pipeline with a single step that applies an XSLT operation to an XML document.


Listing 1. Simple implicit pipeline
                
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" name="xslt-example">
	<p:xslt>
		<p:input port="stylesheet">
			<p:document href="mystylesheet.xslt"/>
		</p:input>
	</p:xslt>
</p:pipeline>

XProc pipelines accept zero or more XML documents as their input and produce zero or more XML documents as output. The XProc code in Listing 1 consists of a <p:pipeline> top-level element, a <p:xslt> step, and not much else. An XML document that arrives on the standard input of the XProc processor is handed off to the <p:xslt> step, which then applies the style sheet mystylesheet.xslt (referenced by the <p:document> element inside <p:input>) to the XML document.

Because the pipeline has only a single step, that step's results are placed on the result port for the entire pipeline, which typically outputs the XML document to standard output. Figure 1 shows this process, outlining how the XML document flows between the source and result ports.


Figure 1. Logic flow for a simple pipeline

These connections between ports are known as step bindings, and they control the flow of XML document processing. These bindings can be implicitly or explicitly defined. In the Listing 1 example, bindings were implicit, with process flow dictated by XProc's natural defaulting mechanisms.

Listing 2 shows a functionally equivalent pipeline with explicit step bindings.


Listing 2. Simple explicit pipeline
                
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" name="xslt-example">

	<p:input port="my-source" primary="true" sequence="false"/>
	<p:output port="my-result" primary="true" sequence="false">     
		<p:pipe step="step1" port="result"/>
	</p:output>

	<p:xslt name="step1">
		<p:input port="source">       
			<p:pipe step="xslt-example" port="my-source"/>     
		</p:input>
		<p:input port="stylesheet">
			<p:document href="mystylesheet.xslt"/>
		</p:input>
	</p:xslt>

</p:declare-step>

In Listing 1, I used <p:pipeline>, which implicitly declared a source input and result output port. Using <p:declare-step> now means that I have to explicitly define these ports as well as declare step bindings between sequential sibling steps. These bindings and ports are summarized below:

  • Top-level my-source input port will receive any standard input.
  • Top-level my-result output port will receive the results of the step1 result port and place them on the standard output.
  • The step1 source input is bound to the my-source input port.

It's difficult to illustrate bindings between steps using a pipeline with a single step, so I created a nontrivial example. Listing 3 presents a more representative XProc pipeline containing multiple steps along with some conditional logic.


Listing 3. Complex pipeline
                
<p:pipeline name="mypipeline" type="myexample" xmlns:p="http://www.w3.org/ns/xproc">
	<p:xinclude name="step1"/>
	<p:choose name="step2">
		<p:when test="/*[@version &lt; 2.0]">
			<!-- subpipeline //-->
			<p:validate-with-xml-schema name="step2a1">
				<p:input port="schema">
					<p:document href="newer-schema.xsd"/>
				</p:input>
			</p:validate-with-xml-schema>
		</p:when>
		<p:otherwise>
			<!-- subpipeline //-->
			<p:validate-with-xml-schema name="step2b1">
				<p:input port="schema">
					<p:document href="older-schema.xsd"/>
				</p:input>
			</p:validate-with-xml-schema>
		</p:otherwise>
	</p:choose>
	<p:for-each name="step3">
		<p:iteration-source select="//div"/>
		<!-- subpipeline //-->
		<p:string-replace name="step3a1">
			<p:option name="match" value="//span[@class='css1']"/>
			<p:option name="replace" value=""/>
		</p:string-replace>
	</p:for-each>
	<p:wrap-sequence name="step4">
		<p:option name="wrapper" value="document"/>
	</p:wrap-sequence>
</p:pipeline>

This pipeline roughly translates to the following:

  1. Input stdin XML documents.
  2. Apply XInclude processing (step1).
  3. Choose (step2) between using a newer (step2a1) or older (step2b1) schema and validate.
  4. Extract each (step3) HTML <div> element, applying a string replace operation (step3a1).
  5. Wrap up (step4) the final sequence of <div> elements with a <document> element.
  6. Output the XML documents to stdout.
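The hand-off between consecutive steps in the list above relies on implicit bindings. As a sketch in the style of Listing 2, the first two stages could be written with explicit <p:pipe> bindings as shown here. The step names expand and validate are hypothetical, and the conditional logic is left out to keep the example short.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" name="explicit-example">
	<p:input port="source"/>
	<p:output port="result">
		<!-- The pipeline's result is whatever the last step produces -->
		<p:pipe step="validate" port="result"/>
	</p:output>

	<p:xinclude name="expand">
		<p:input port="source">
			<!-- Reads the document handed to the pipeline itself -->
			<p:pipe step="explicit-example" port="source"/>
		</p:input>
	</p:xinclude>

	<p:validate-with-xml-schema name="validate">
		<p:input port="source">
			<!-- Reads the result of the preceding sibling step -->
			<p:pipe step="expand" port="result"/>
		</p:input>
		<p:input port="schema">
			<p:document href="newer-schema.xsd"/>
		</p:input>
	</p:validate-with-xml-schema>
</p:declare-step>
```

Each <p:pipe> names the step and port it reads from, which is exactly the connection that the defaulting rules supply when the binding is omitted.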

Steps can have other input or output ports defined that work with non-XML documents, but only XML documents (as in XML infoset) can flow between primary input and output ports.

I used all three kinds of XProc steps in this example. Step types are represented as rectangles in the workflow diagram that Figure 2 shows.


Figure 2. Logic flow for a complex pipeline

The p:pipeline compound step

The largest rectangle represents the whole <p:pipeline>, which can itself be invoked as a step. Because it contains a sub-pipeline, it is a compound step.

The p:choose multi-container step

The <p:choose> step has two sub-pipelines, which makes it a multi-container step. It chooses which sub-pipeline to follow based on the evaluation of a <p:when> XPath expression.

The p:for-each compound step

The <p:for-each> step contains a single sub-pipeline consisting of one step, so it's a compound step.

Atomic steps

Most XProc steps are atomic steps. Each applies a specific operation to an XML document; the examples in Listing 3 are <p:xinclude>, <p:validate-with-xml-schema>, <p:string-replace>, and <p:wrap-sequence>.


Considerations in the development of XProc

Throughout the XProc specification process, the WG had to navigate several issues:

  • Namespaces: XProc works with XML documents, which means that for many operations, a processor has to keep track of namespaces in documents. For example, consider the p:unwrap step, which can remove the top-level element of a document. If the removed element carries an xmlns namespace declaration, the processor must apply a fixup so that the stripped declaration is copied to the child elements in the resulting XML documents, without overwriting any other valid namespace declaration.

    The WD contains a non-normative "Section E. Guidance on Namespace Fixup" that attempts to outline what a processor must handle.

  • XSLT and XPath versions: The timing of the XProc effort means that it finds itself in the middle of an adoption cycle between versions of XPath and XSLT. I think that when we start to use XProc, we will see how well XProc navigated the thorny issue of supporting multiple versions of XPath and XSLT.
  • Options, variables, and parameters: Too much seems to be going on with options, variables, and parameters. I think many people might agree with me that a single entity might serve just as adequately.
  • Streaming: I feel that the WG might have given this requirement too much weight in its earlier decisions on XProc. Because most of the XML technologies that XProc will control do not themselves fully support streaming, it seems like premature optimization to me, especially in a world where MapReduce and parallelization techniques are gaining prominence.
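To make the namespace issue from the first item concrete, the following sketch shows a hypothetical document before and after its top-level element is unwrapped. The element names and namespace URI are invented for illustration. The block holds two separate documents, not one well-formed file.

```xml
<!-- Input: the default namespace is declared only on the top-level element -->
<doc xmlns="http://example.org/ns">
	<chapter>
		<title>Pipelines</title>
	</chapter>
</doc>

<!-- After unwrapping <doc>, a fixup must carry the declaration down;
     otherwise the serialized <chapter> would fall out of its namespace -->
<chapter xmlns="http://example.org/ns">
	<title>Pipelines</title>
</chapter>
```

Without the fixup, the serialized output would place <chapter> and <title> in no namespace at all, silently changing the meaning of the document.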

Current status

As with any W3C specification, XProc underwent a lot of changes in terms of syntax and semantics (see Resources). The latest WD, dated 1 May 2008, is evolutionary as it irons out the details from the churn of the previous draft's decision to allow both version 1 and version 2 of XPath and XSLT.

The current WD is also revolutionary in that the spec was completely rewritten to disentangle the various roles that the <p:option> element had accumulated. For example, <p:option> is now used only in the functional signature of a step (only in <p:declare-step>), with <p:with-option> used in the step instance itself to set an option's value. In addition, the <p:variable/> element was added to hold computed values, especially for use with compound steps.

The most interesting change with options, parameters, and variables is the removal of the value attribute; values are now the result of evaluating an XPath expression defined by a select attribute. Listing 4 illustrates how this works, using p:with-option as an example.


Listing 4. Example of p:with-option use with an XPath expression
                			
<ex:someStep>
	<p:with-option name="some-option-name" select="'some value'"/>
</ex:someStep>

Defining simple values for options with an XPath expression becomes tedious [for example, a literal string needs single (') quotation marks nested inside the attribute's double (") quotation marks]. The syntactic shorthand (see Listing 5), whereby options are set as name-value attributes on the step element itself, was therefore retained as the preferred method.


Listing 5. Setting an option with syntactic shorthand
                			
<ex:stepType option-name="some value"/>

All these changes make the specification much clearer, but at the expense of a larger XProc vocabulary (<p:variable>, <p:with-option>, <p:with-param>, and so on).
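To show how the three elements divide the work, here is a sketch that uses them side by side. It reflects my reading of the current draft rather than a listing from the specification. The variable name section-count, the sections attribute, the heading parameter, and the step names are hypothetical.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" name="values-example">
	<p:group>
		<!-- Variable: computed once from the document flowing into the group -->
		<p:variable name="section-count" select="count(//section)"/>

		<!-- Option: set from an XPath expression that refers to the variable -->
		<p:add-attribute name="annotate" match="/*" attribute-name="sections">
			<p:with-option name="attribute-value" select="$section-count"/>
		</p:add-attribute>

		<!-- Parameter: passed through to the style sheet, which gives it meaning -->
		<p:xslt name="render">
			<p:input port="stylesheet">
				<p:document href="mystylesheet.xslt"/>
			</p:input>
			<p:with-param name="heading" select="'Draft'"/>
		</p:xslt>
	</p:group>
</p:pipeline>
```

The option name attribute-value is fixed by the signature of <p:add-attribute>, whereas the parameter name heading is known only to the style sheet, which is the distinction the specification draws between the two.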

One last significant change to mention is that you now must use <p:declare-step/> consistently to declare new steps. This change adds a slight cognitive load for users, who now have to think about an instance of a step versus its declaration (in a library or pipeline). Overloading too many concepts onto a single step element would probably have been constraining, so I think that the WG's decision to split the concepts up now was a pragmatic one.


Summary

It's important for XML technologists to remind themselves that some families and phyla of developers do not work with XML. When someone from these groups asks, "Why do I need XProc?," my first response is usually that XProc is designed to be platform neutral, meaning that XProc can run everywhere a compliant XProc processor can run. However, if you already work with XML documents and technologies, XProc is probably something you have emulated with other approaches (XSLT, Apache Ant, Apache Cocoon site maps, Jelly, and so on), and you will be happy to see the arrival of XProc processors.

Note: No implementations of XProc are production-ready, but several are in development. See Resources for more information.

With XML seeping into every aspect and tier of computing, having a single, easy-to-understand processing approach like XProc to orchestrate one's expanding XML ecosystem might be a disruptive technology. The XProc Standard library, combined with the extensibility of writing your own third-party step libraries, provides a powerful facade over existing and future XML processors. Thus, rather than fit your workflows around the vagaries of any specific technology, you can now honestly define your processing as a series of operations on XML documents.

XProc is expected to go into a second Last Call sometime in the coming months, which seems to indicate that you might see a W3C recommendation before the end of 2008.


Resources


Get products and technologies

  • List of XProc implementations: See XProc implementations currently undergoing development.

  • Smallx: Explore Smallx. It is, according to its developers, "a library and set of tools that is being developed to process XML infosets. It has two distinct features in that the infoset implementation allows streaming of documents and that processing of infosets can be accomplished using a concept called pipelines."

  • Sxpipe: Try Simple XML Pipelines (sxpipe) to build a simple processing model for XML documents and choose the order in which components are evaluated.

  • Apache Ant: Get more information about and download this Java™-based build tool.

  • Apache Cocoon: Download and get more information on this Web development framework built around the concepts of separation of concerns.

  • Jelly: Get the tool to turn your XML code into executable files.

  • IBM trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
