Get to know the QueryPath PHP library

A fast, easy way to work with XML and HTML


You could make the case that, over the past 15 years, the three technologies that have contributed most to the Web's explosive growth are HTML, HTTP, and XML. You might expand upon this triumvirate, pointing perhaps to CSS, JavaScript, and similar technologies. But the "big three" remain unchallenged at the top of the list.

PHP has also made a splash in the Web-development world. PHP powers Web sites from small home pages to the likes of Yahoo!, largely because of its ease of development and Web-centered model. But using PHP to work with the big three — especially with XML — can sometimes be tricky. In this article, learn about QueryPath, a PHP library designed with two goals in mind:

  • Simplicity, to make it easy to work with HTML, XML, and HTTP
  • Robustness, to provide a rich set of tools for working with these technologies

This article explores building QueryPath objects, traversing XML and HTML, manipulating XML and HTML, and using QueryPath to access a Web service (Twitter is the example service).

The next section has a brief introduction to the library and its design.


For simplicity, QueryPath uses a compact syntax. Method names are short and representative of what they do (for example, text(), append(), remove()). Since most methods return a QueryPath object, method calls are chainable, meaning that several methods can be called in one sequence. This convention is sometimes called a fluent interface. To keep things familiar to JavaScript developers, QueryPath implements the majority of the jQuery traversal and manipulation functions and behaviors.

For robustness, QueryPath provides tools designed to address typical use cases for loading, searching, reading, and writing XML and HTML content. Not all needs can be met by one general-purpose API, though, regardless of the library's size. To address the issue, QueryPath includes an extension mechanism that lets you add new methods to QueryPath. QueryPath also includes extensions to add database support, template support, and additional XML features.

You might be wondering, "Why another XML or HTML tool? PHP V5 already has a handful of XML tools, including a Document Object Model (DOM) implementation and the SimpleXML library. Why add another?" The short answer: QueryPath is designed to be a general-purpose tool. The DOM API is complex and cumbersome. Its object-oriented model may be powerful, but even the simplest of tasks can take dozens of lines of coding. SimpleXML, on the other hand, is too simple for many programming tasks. Unless the XML is entirely predictable, navigating a SimpleXML document can be anything but simple.

QueryPath is an attempt to find the sweet spot between DOM's feature richness and SimpleXML's simplicity.
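
To make the comparison concrete, here is a minimal sketch of setting a page title with the built-in DOM extension alone. The sample markup and variable names are assumptions chosen for illustration. Listing 1 later performs the equivalent task in a single QueryPath chain.

```php
<?php
// Plain DOM version of "set the page title" (the sample markup is hypothetical).
$html = '<html><head><title>Old title</title></head><body><p>Hi</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a DOMNodeList, not a single element.
foreach ($doc->getElementsByTagName('title') as $title) {
    // Remove the existing child nodes one by one.
    while ($title->firstChild !== null) {
        $title->removeChild($title->firstChild);
    }
    // Create the replacement text node and attach it.
    $title->appendChild($doc->createTextNode('Hello World'));
}

$result = $doc->saveHTML();
print $result;
```

Even for this small task, the DOM version needs an explicit loop, manual child removal, and a separate node-creation call.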


QueryPath is a pure PHP library. To use it, simply download it from the official Web site and add it to your PHP library path.

QueryPath has minimal system requirements. It will work on PHP V5 as long as the DOM extension is enabled. Most distributions of PHP V5 meet this requirement out of the box. QueryPath does not support the long-deprecated PHP V4.

Anatomy of a QueryPath chain

There are four concepts central to typical usage of QueryPath:

  • A QueryPath object is associated with a single XML or HTML document.
  • QueryPath can query the document, identifying a set of matches within the document.
  • Documents can be manipulated by QueryPath. New parts can be added, existing parts can be modified, and unwanted parts can be removed.
  • QueryPath methods can be chained together to execute many operations in a compact sequence. In just a few lines of code, a document can be loaded, parsed, queried, modified, and written.

The code in Listing 1 illustrates all of these points.

Listing 1. Basic QueryPath chain
require 'QueryPath/QueryPath.php';

qp('sample.html')->find('title')->text('Hello World')->writeHTML();

The example above requires one library, QueryPath/QueryPath.php. This is the only file you need to include to use QueryPath, unless you're also loading QueryPath extensions.

The next line of code in the example is a QueryPath chain, which does the following.

  1. Creates a new QueryPath object pointing to the sample.html document. When qp() is run, it will create a new QueryPath object, which will subsequently load and parse the document.
  2. Using the find() method, it searches through the document using the CSS3 (Cascading Style Sheets level 3) selector title, which matches all <title/> elements.

    In a valid HTML document, this will match only the single <title/> element in the head of the document.

  3. The text value of the title is set to Hello World. When this is executed, the title's child nodes will be replaced by the CDATA (character data) string Hello World. Any existing content will be destroyed.
  4. The entire document will be written to the standard output with the writeHTML() method.
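
The four steps above can also be written one call at a time, which may help when debugging a chain. This is a sketch only: the $html string is a hypothetical stand-in for the contents of sample.html, used so the example does not depend on a file.

```php
<?php
require 'QueryPath/QueryPath.php';

// Hypothetical stand-in for the contents of sample.html.
$html = '<html><head><title>Untitled</title></head><body></body></html>';

$qp = qp($html);                  // 1. Create the object; the markup is parsed.
$qp = $qp->find('title');         // 2. Select every <title/> element.
$qp = $qp->text('Hello World');   // 3. Replace the title's content.
$qp->writeHTML();                 // 4. Write the whole document to standard output.
```

Because each method returns a QueryPath object, the intermediate assignments are optional; the chained form in Listing 1 is equivalent.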

The example above can actually be shortened a bit, since the qp() factory function takes a CSS selector as an optional second parameter. Listing 2 shows the shortened version.

Listing 2. Abbreviated version of the basic QueryPath chain
require 'QueryPath/QueryPath.php';

qp('sample.html', 'title')->text('Hello World')->writeHTML();

Assuming that sample.html is just a bare-bones HTML document, the result of the above (either Listing 1 or Listing 2) would look something like Listing 3. The <title> line contains the title we set.

Listing 3. Example of the generated HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
	<title>Hello World</title>
</head>
<body>
</body>
</html>

These simple examples show the general families of tasks QueryPath can perform. The next few sections explore the families of methods. After that, you will assemble these building blocks to create a simple Web service client.

The qp() factory function

The function used most often in the QueryPath library is the qp() factory function. Essentially, it performs the task of creating new QueryPath objects. It's used in favor of a traditional constructor.

If you're familiar with object-oriented design patterns, you might recognize qp() as a variant of the factory pattern. Instead of defining a factory class with a builder method, QueryPath just uses a function. Along with saving a few keystrokes (important when chaining methods), this approach keeps QueryPath just a little closer to jQuery and reduces the learning curve if you're familiar with jQuery.

A QueryPath object will be associated with a single XML or HTML document. The document is bound to the QueryPath object when the object is constructed. The qp() function takes up to three arguments, all of which are optional:

  • A document: Can be a file name or URL, an XML or HTML string, a DOMDocument or DOMElement, a SimpleXMLElement, or an array of DOMElements. If nothing is supplied here, QueryPath will create an empty XML document for manipulation.
  • A CSS3 selector: If this is supplied, then QueryPath will, upon loading the document, query that document using the given selector.
  • An associative array of options: Provides a method of passing in a complex set of configuration parameters for this particular instance of QueryPath. The API reference details the options that can be passed in here.
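
As a rough sketch of the first argument's flexibility, the following builds QueryPath objects from a DOMDocument, from a SimpleXMLElement, and from nothing at all. The sample XML and variable names are assumptions for illustration; the options array is omitted because its keys are documented in the API reference rather than here.

```php
<?php
require 'QueryPath/QueryPath.php';

$xml = '<?xml version="1.0"?><doc><item/></doc>';

// From a DOMDocument that has already been parsed.
$dom = new DOMDocument();
$dom->loadXML($xml);
$fromDom = qp($dom, 'item');

// From a SimpleXMLElement.
$fromSimpleXml = qp(simplexml_load_string($xml), 'item');

// With no arguments: an empty XML document to build on.
$empty = qp();

// Both objects should report the single <item/> as their current match.
print $fromDom->size() . PHP_EOL;
print $fromSimpleXml->size() . PHP_EOL;
```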

qp() takes so many types of data for the first argument to make it easy to construct a QueryPath object. QueryPath can begin with a filename or URL and load a document. If a string of XML or HTML is passed in, QueryPath will parse the content. And, of course, it can receive documents in the other two common object representations of an XML document: DOM and SimpleXML. Listing 4 shows how the qp() function can parse a string containing XML.

Listing 4. Building a QueryPath object from an XML string
require 'QueryPath/QueryPath.php';

$xml = '<?xml version="1.0"?><doc><item/></doc>';
$qp  = qp($xml);

When the code in Listing 4 is run, $qp will reference a QueryPath object that internally points to a parsed representation of the XML. A previous example passed in a file name. If PHP is configured to allow HTTP/HTTPS stream wrappers (which is standard in most PHP V5 distributions), you can even load remote HTTP URLs, as shown below.

Listing 5. Building a QueryPath object from a URL
require 'QueryPath/QueryPath.php';

$qp = qp('');

This makes it possible to access Web services using QueryPath. (Stream contexts can be passed in using the third qp() parameter, allowing you to fine-tune connection settings.) When creating a new document, there is a shortcut for adding boilerplate HTML, as shown below.

Listing 6. Using the QueryPath::HTML_STUB constant
require 'QueryPath/QueryPath.php';

$qp = qp(QueryPath::HTML_STUB);

The QueryPath::HTML_STUB constant defines a basic HTML document, as shown below.

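Listing 7. QueryPath::HTML_STUB document
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
	<title>Untitled</title>
</head>
<body></body>
</html>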

Starting from this skeleton document will make HTML generation even faster.
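
A short sketch of building a page from the stub follows. It assumes the stub contains <title/> and <body/> elements; the title and paragraph text are placeholders.

```php
<?php
require 'QueryPath/QueryPath.php';

// Start from the stub, set the title, then add a paragraph to the body.
$page = qp(QueryPath::HTML_STUB, 'title')
    ->text('Status report')              // fill in the stub's title
    ->top()                              // jump back to the document element
    ->find('body')
    ->append('<p>All systems running.</p>');

$page->writeHTML();
```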

At this point, you know how to create a new QueryPath object pointing to a document, and you've also seen a simple CSS selector. The next section covers how QueryPath can be used to traverse documents.

Traversing a document

Once a document has been opened, you need to move around inside the document to find the content of interest. QueryPath is designed to make this easy. To facilitate the many traversal needs, QueryPath provides several methods for traversing. Most allow the use of CSS3 selectors to find the desired nodes.

Figure 1. Important QueryPath traversal methods

Figure 1 summarizes the commonly used traversal functions. Each is described below. Though there are other traversal methods, those discussed here cover most of the typical needs.

Table 1. Common traversing methods
Method | Description | Takes CSS selector?
find() | Select any element (beneath the currently selected nodes) that matches the selector | Yes
xpath() | Select any elements matching the given XPath query | No (XPath query instead)
top() | Select the document element (the root element) | No
parents() | Select any ancestor element | Yes
parent() | Select the direct parent element | Yes
siblings() | Select all siblings (both previous and next) | Yes
next() | Select the next sibling element | Yes
nextAll() | Select all siblings after the present element | Yes
prev() | Select the previous sibling | Yes
prevAll() | Select all previous siblings | Yes
children() | Select elements immediately beneath this one | Yes
deepest() | Select the deepest node or nodes beneath this one | No

Many methods in QueryPath can take queries that further specify what items should be selected. As shown in the third column of Table 1, almost all of these methods take a CSS3 selector as an optional parameter. (The xpath() function takes an XPath query instead of a CSS3 selector.) Only top() and deepest() do not take a query as an argument.

Look at another simple example to see how traversal works. Suppose you have an XML document like the one below.

Listing 8. A simple XML document
<?xml version="1.0"?>
<root>
  <child id="one"/>
  <child id="two"/>
  <child id="three"/>
  <ignore/>
</root>

The <root/> element has four children: Three are named <child/>, and one is named <ignore/>. You could select all four children of <root/> with a QueryPath query.

Listing 9. Selecting all children
require 'QueryPath/QueryPath.php';

$xml = '<?xml version="1.0"?>
<root>
  <child id="one"/>
  <child id="two"/>
  <child id="three"/>
  <ignore/>
</root>';

$qp = qp($xml, 'root')->children();
print $qp->size();

The children() method would select all of the immediate children of the <root/> element. The last line, which prints the number of matched items in the QueryPath object, will print 4.

Suppose you only want to select the three <child/> elements but not the <ignore/> element. Listing 10 shows how you can do it.

Listing 10. Querying with a filter
require 'QueryPath/QueryPath.php';

$xml = '<?xml version="1.0"?>
<root>
  <child id="one"/>
  <child id="two"/>
  <child id="three"/>
  <ignore/>
</root>';

$qp = qp($xml, 'root')->children('child');

print $qp->size();

The final print statement outputs the number of items currently selected by QueryPath, which is 3. Internally, QueryPath is tracking these three matches. They are stored as the current context. Should you decide to execute a further query, it would begin from these three elements. If you tried to append data, the data would be appended to these three elements.
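
The other traversal methods in Table 1 move the current context in the same way. The sketch below applies a few of them to a document with one <root/>, three <child/> elements, and one <ignore/> element; the variable names are illustrative, and each chain starts from a fresh QueryPath object so that the selections do not affect one another.

```php
<?php
require 'QueryPath/QueryPath.php';

$xml = '<?xml version="1.0"?>
<root>
  <child id="one"/>
  <child id="two"/>
  <child id="three"/>
  <ignore/>
</root>';

$start = 'child[id="two"]';

$next     = qp($xml, $start)->next();             // the sibling right after "two"
$prev     = qp($xml, $start)->prev();             // the sibling right before "two"
$after    = qp($xml, $start)->nextAll();          // every sibling after "two"
$siblings = qp($xml, $start)->siblings('child');  // only the <child/> siblings
$root     = qp($xml, 'ignore')->top();            // back to the document element

print $next->attr('id') . PHP_EOL;
print $prev->attr('id') . PHP_EOL;
print $after->size() . PHP_EOL;
print $siblings->size() . PHP_EOL;
print $root->children()->size() . PHP_EOL;
```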

CSS selectors

CSS selectors are the part of a CSS statement that selects an element to which a style will be applied. CSS selectors can also be used outside the context of a style sheet. QueryPath uses selectors as a query language and supports the feature set described in the CSS3 selectors standard.

CSS selectors play a big role in QueryPath. You've seen 10 functions that take a CSS selector as an argument. The selectors used thus far are simple tag name queries. CSS3 selectors are far more powerful than the previous examples might suggest. Detailed descriptions of CSS3 selectors are outside the scope of this article, but Table 2 provides examples of common selector patterns.

Table 2. Common CSS3 selector patterns
Selector pattern | Description | Example match
p | Find elements with tag name <p/>. | <p>
.container | Find elements with the class attribute set to container. | <div class="container"/>
#menu | Find the element with the id attribute set to menu. This is how ID-based searches are done. | <div id="menu"/>
[type="inline"] | Find elements where the type attribute has the value inline. | <code type="inline"/>
tr > th | Find <th> elements whose immediate parent element is a <tr>. | <tr><th/></tr>
table td | Find <td> elements that have a <table> element somewhere in their ancestry (such as a parent or grandparent). | <table><tr><td/></tr></table>
li:first | Get the first element named <li/>. Supported pseudo-classes include :last, :even, and :odd. | <li/>
RDF|seq | Find <RDF:seq> elements. QueryPath includes CSS3 selectors for XML namespaces. Namespace support extends to attributes as well as elements. | <RDF:seq>

These common selector patterns can be combined to build complex selectors, such as

div.content ul>li:first

This selector will search any <div/> with the class content. Inside of the div, it will search through all unordered lists (<ul>), returning the first list item (<li>) for each list.
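
To try selectors like these, a small inline document is enough. The following sketch uses hypothetical markup and combines class, ID, child, and descendant patterns from Table 2.

```php
<?php
require 'QueryPath/QueryPath.php';

// Hypothetical markup with one content list and one menu list.
$html = '<html><head><title>Demo</title></head><body>
<div class="content"><ul><li>alpha</li><li>beta</li></ul></div>
<div id="menu"><ul><li>home</li></ul></div>
</body></html>';

// Class + child + pseudo-class: first item of the list in div.content.
$first = qp($html, 'div.content ul>li:first')->text();

// ID + descendant: list items anywhere inside the menu.
$menu = qp($html, '#menu li')->text();

// Descendant only: every list item inside any div.
$count = qp($html, 'div li')->size();

print $first . PHP_EOL;
print $menu . PHP_EOL;
print $count . PHP_EOL;
```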

Iterating through matched items

You've learned about two facets of traversing a document: the methods provided by QueryPath and the CSS3 selector support. The third facet is iterating through selected items.

A QueryPath object is traversable. In PHP parlance, this means that the object can be treated as an iterator. The standard PHP looping structures can loop through a QueryPath object's selected elements. Recall the example in Listing 10, a simple query to retrieve three elements from an XML document. It is used as the basis for the next example.

What if you wanted to process each item individually? You can do so easily, since QueryPath is capable of being used as an iterator. Listing 11 shows an example.

Listing 11. Iterating through selected elements
require 'QueryPath/QueryPath.php';

$xml = '<?xml version="1.0"?>
<root>
  <child id="one"/>
  <child id="two"/>
  <child id="three"/>
  <ignore/>
</root>';

$qp = qp($xml, 'root')->children('child');

foreach ($qp as $child) {
  print $child->attr('id') . PHP_EOL;
}

As the foreach loop iterates it will assign each of the matched elements to the $child variable. However, $child isn't just the element; it is a QueryPath object pointing to the current element. You have at your disposal all of the usual QueryPath methods.

To maintain an API similar to jQuery's, QueryPath provides several methods that act as both accessors and mutators — or getters and setters. A single method may, depending on the arguments, either retrieve (access) data or change (mutate) data. The attr() function is one example. qp()->attr('name') retrieves the value of an attribute with the name name. qp()->attr('name', 'value') sets the value of the name attribute to value. Several other methods, including text(), html(), and xml(), perform double duty as both accessors and mutators.
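
As a sketch of this dual behavior, the loop below calls attr() once as an accessor and once as a mutator on each element. The sample XML and the item- prefix are assumptions made for the example.

```php
<?php
require 'QueryPath/QueryPath.php';

$xml = '<?xml version="1.0"?><root><child id="one"/><child id="two"/></root>';

$qp = qp($xml, 'child');
foreach ($qp as $child) {
    $id = $child->attr('id');            // accessor: one argument reads the value
    $child->attr('id', 'item-' . $id);   // mutator: two arguments set the value
}

// Read the attributes back to confirm that the document itself was changed.
$ids = array();
foreach ($qp as $child) {
    $ids[] = $child->attr('id');
}
print implode(', ', $ids) . PHP_EOL;
```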

In Listing 11, attr() is called as an accessor: attr('id') retrieves the value of the attribute named id from each <child/> element. The output of the code above is shown below.

Listing 12. Output from iterator example in Listing 11
one
two
three

You've learned how to traverse a document using QueryPath methods, CSS3 selectors, and iterating techniques. The next section explores how to modify documents with QueryPath.

Manipulating a document

In addition to using QueryPath to search a document, you can use it to add, modify, and remove data from a document. You caught a glimpse of QueryPath's capabilities in Listing 1. It's repeated below for your convenience.

Listing 13. Basic QueryPath chain
require 'QueryPath/QueryPath.php';

qp('sample.html')->find('title')->text('Hello World')->writeHTML();

In this example, the text() function is used to modify the contents of the <title/> element. QueryPath provides a dozen or so methods for changing a document. Figure 2 shows how several frequently used modifying methods work. These methods all add or replace data. The tag in green represents the currently selected element.

Figure 2. QueryPath methods for adding or replacing content

Each method takes string data, usually in the form of fragments of HTML or XML, and inserts the data into the document. The data is then immediately available for access and further manipulation.

Two classes of methods are represented. The methods in the first class work with arbitrary fragments of HTML or XML, as follows.

Working with fragments of HTML and XML
  • append(): Append data as the last child of a currently selected element or elements
  • prepend(): Prepend data as the first child of a currently selected element or elements
  • after(): Insert data immediately after the currently selected element or elements
  • before(): Insert data immediately before the currently selected element or elements
  • html(): Replace the child content of the current element or elements in an HTML document
  • xml(): Replace the child content of the current element or elements in an XML document

The items above expect an argument containing a string of well-formed XML or HTML data. Listing 14 has an example with the html() method.

Listing 14. A basic QueryPath chain
require 'QueryPath/QueryPath.php';


The remove() method is missing from Figure 2. (Removing is difficult to represent visually.) The remove() method removes elements from the document. If called with no parameters, it will remove the currently selected elements. But, as with many other QueryPath methods, remove() takes a CSS3 selector as an optional parameter. When a selector is provided, items matching the selector will be removed.
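
The following sketch strings several of these methods together: html() to replace content, prepend() and append() to add around it, and remove() to delete a selected element. The markup and the draft class name are hypothetical, and the fragments are kept well-formed as the methods expect.

```php
<?php
require 'QueryPath/QueryPath.php';

$doc = qp(QueryPath::HTML_STUB, 'body');

// html() replaces the children of the selected <body/>.
$doc->html('<div><p>Middle</p></div>');

// Select the new <div/> and add content around its existing child.
$doc->top()->find('div')
    ->prepend('<h1>Heading</h1>')            // becomes the first child
    ->append('<p class="draft">Last</p>');   // becomes the last child

// Select the draft paragraph, then remove the current selection.
$doc->top()->find('p.draft')->remove();

// Count the paragraphs left in the document, then write it out.
$remaining = $doc->top()->find('p')->size();
print $remaining . PHP_EOL;
$doc->writeHTML();
```

Each statement calls top() first so that it does not depend on whatever selection an earlier statement left behind.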

The second class of method in Figure 2 comprises methods that manipulate attributes within elements. In the example, two are shown.

Working with attributes
  • attr(): Gets the value or sets a value for a given attribute on every selected element.
  • addClass(): Adds a class to every element in the current selection.

There are other attribute-related methods. For example, removeClass(), which takes a class name as an argument, will remove an individual class from an element. removeAttr(), which takes an attribute name as an argument, will remove the named attribute from all of the currently selected elements.
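
A brief sketch of the attribute methods working together is shown below. The list markup, attribute values, and class names are placeholders.

```php
<?php
require 'QueryPath/QueryPath.php';

$doc = qp(QueryPath::HTML_STUB, 'body')
    ->append('<ul><li>one</li><li>two</li></ul>');

$items = $doc->find('li');

$items->attr('title', 'List item')   // set an attribute on every <li/>
      ->addClass('item')
      ->addClass('selected');

$items->removeClass('selected')      // drop one class and keep the other
      ->removeAttr('title');         // drop the attribute entirely

// As an accessor, attr() reads the value back from the selection.
$classes = trim($items->attr('class'));
print $classes . PHP_EOL;

$items->writeHTML();
```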

Now it's time to pull all these basic capabilities together into something interesting.

Example: Searching Twitter with QueryPath

Twitter is a popular microblogging service that lets you post short messages while following the microblogs of other Twitter users. Twitter provides a simple Web service that exposes many of the features of the platform.

The following example uses QueryPath to execute a search on Twitter's server and print the results as HTML. Such a tool might be added to an existing Web site to show recent Twitter activity on a topic of interest.

Twitter's search server listens on a standard HTTP server and can (when asked) return search results in the Atom XML format. Our example will search for the five most recent Twitter posts that mention QueryPath. To run such a search and return the contents in the Atom format, you need only encode the necessary information in the request URL.

Three portions of that URL represent the parameters tuned for this application.

  • .atom: Including this extension indicates to the server that you want Atom XML content returned.
  • rpp=5: RPP is for results per page. We want five results to be returned. By default, the five most recent results will be returned.
  • q=QueryPath: This is the query. Twitter supports more complex search queries, but this is all you need for this brief example.

When this URL is loaded, Twitter will return an Atom-formatted XML document. Listing 15 below shows a greatly simplified version of the returned document. Only the information that you're immediately concerned with is shown here (only one entry is shown).

Listing 15. Excerpt of XML returned from a Twitter search
<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <entry>
    <content type="html">
       Last night I added XSD schema validation and XSL
       Transformation (XSLT) support to &lt;b&gt;QueryPath&lt;/b&gt; (as
       extensions). Will commit them today.
    </content>
    <link type="image/png" rel="image" href=""/>
    <author>
      <name>technosophos (M Butcher)</name>
    </author>
  </entry>
</feed>

Listing 16 shows the brief QueryPath code that executes the search, sifts through the returned XML, and creates a document.

Listing 16. Processing the returned XML with QueryPath
require 'QueryPath/QueryPath.php';

$url = '';
$out = qp(QueryPath::HTML_STUB, 'body')->append('<ul/>')->find('ul');

foreach (qp($url, 'entry') as $result) {
  $title = $result->children('content')->text();
  $img = $result->siblings('link[rel="image"]')->attr('href');
  $author = $result->parent()->find('author>name')->text();
  $out->append("<li><img src='$img'/> <em>$author</em><br/>$title</li>");
}

$out->writeHTML();

If you were to execute the code above using a Web browser, you'd see something like Figure 3.

Figure 3. QueryPath displays Twitter search results

The code in Listing 16 is 14 lines long, and the work is done in only nine lines. How did that code translate into the view in Figure 3?

The $url variable holds the Twitter URL that you examined earlier. The $out variable points to the QueryPath object that you'll use to write HTML to the client. Starting with a basic document (QueryPath::HTML_STUB), you append an unordered list and (using find()) select that new list.

The foreach loop is the most important line in the script: foreach (qp($url, 'entry') as $result). Here, a new QueryPath object is created. Since a URL is passed in, QueryPath will retrieve the remote Atom document and parse the results. And, since the selector entry is passed in, QueryPath will select all entries in the document. Look back at Listing 15 to get an idea as to what part of the document this is. There will be five entries in the returned document (since that is how you set the rpp flag in the URL). Each of the five entries should look like the <entry/> in Listing 15.

Inside the loop, three pieces of data are fetched:

  • $title: Content of the entry
  • $img: URL to a profile image for the user who posted
  • $author: Name of the user who posted

To retrieve each bit of the data, you employ various QueryPath methods. For example, you can get the $title with $result->children('content')->text();.

This first selects all of the children with the tag name content, then gets the CDATA text from within the found nodes. Every entry will have one <content/> element.

Now you need to get the image URL. In the previous chain, you selected the <content/> element, so that is the starting point. You need to search the siblings of <content/> for an element that looks like <link rel="image"/>. To do that, use the siblings() function with a selector. Then use the attr() function to get the value of the element's href attribute.

Finally, get the author's name by jumping from the <link/> element, back up to its parent, then using find('author>name'). (See Table 2 for how this works.) From there, you can get the text of the author's name using text().

At the end of each iteration of the foreach loop, you build up a fragment of HTML and use append() to insert this into the $out QueryPath.

After the results from Twitter have been iterated, you can wrap up the script by writing the HTML document to the browser: $out->writeHTML();.
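
For experimenting without a network connection, the same loop can be run against an inline feed. The sketch below is an assumption-laden variant, not the article's script: the $atom string, the example.com image URL, and the author name are invented, and the htmlspecialchars() calls are an added precaution so that characters such as & in plain-text values cannot break the fragment passed to append(). The traversal chain itself is the one described above and relies on the behavior explained there.

```php
<?php
require 'QueryPath/QueryPath.php';

// Stand-in for the remote feed: one hypothetical entry shaped like Listing 15.
$atom = '<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <entry>
    <content type="html">Testing &lt;b&gt;QueryPath&lt;/b&gt; today.</content>
    <link type="image/png" rel="image" href="http://example.com/avatar.png?size=48&amp;v=2"/>
    <author>
      <name>example_user (A &amp; B)</name>
    </author>
  </entry>
</feed>';

$out = qp(QueryPath::HTML_STUB, 'body')->append('<ul/>')->find('ul');

foreach (qp($atom, 'entry') as $result) {
    $title  = $result->children('content')->text();
    $img    = $result->siblings('link[rel="image"]')->attr('href');
    $author = $result->parent()->find('author>name')->text();

    // Escape plain-text values before placing them in markup. $title is
    // left alone because the feed delivers it as HTML.
    $img    = htmlspecialchars($img, ENT_QUOTES);
    $author = htmlspecialchars($author, ENT_QUOTES);

    $out->append("<li><img src='$img'/> <em>$author</em><br/>$title</li>");
}

$out->writeHTML();
```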

There you have it. In about a dozen lines of code, you've interfaced with a remote Web service. QueryPath can be used in this way to access just about any Web service that uses HTTP and XML or HTML. The examples that ship with QueryPath show how to set connection parameters, execute SPARQL queries against SPARQL endpoints, and parse complex, multi-namespaced documents. QueryPath provides great potential for working with Web services.


In this article, you explored the basics of the QueryPath library. You learned how to create QueryPath objects, traverse documents, and manipulate content. You also built a small example script that worked with the Web services API of the popular Twitter microblogging service.

This article only began to mine the possibilities of the QueryPath library. For example, it only mentioned the database API, which can be used to integrate RDBMS support into QueryPath. Imagine running a SQL SELECT statement and merging the results directly into an HTML table marked up to your specifications. Or imagine building an XML importer that parsed data and inserted it straight into the database.

There are other features of QueryPath that weren't even mentioned. With mappers and filters, you can have QueryPath run custom functions to transform or filter QueryPath data. With the QPTPL extension, you can merge data into predefined pure-HTML templates. QueryPath also supports user-defined extensions. By writing a simple class definition, you can add your own methods to QueryPath.
