Validating XML in PHP

Ensure data integrity and validate XML documents against an XML schema in PHP

PHP developers often need to parse Extensible Markup Language (XML) in their code, and they frequently need to validate XML input as well. Fortunately, PHP makes this easy. This article shows you how to validate XML documents within PHP and determine the cause of validation failures.

Brian M. Carey, Senior Systems Engineer, Triangle Information Solutions

Brian Carey is an information systems consultant specializing in Java, Java Enterprise, PHP, Ajax, and related technologies. You can follow Brian Carey on Twitter at http://twitter.com/brianmcarey.



10 November 2009


Why XML validation?

XML is a markup language that enables you, as a developer, to create your own custom language. This language is then used to carry, but not necessarily display, data in a platform-independent fashion. The language is defined with the use of markup tags, much like Hypertext Markup Language (HTML).

XML has gained in popularity in recent years because it represents the best of two worlds: It is easily readable by humans and computers alike. XML languages are expressed in tree-like structure with elements and attributes describing key data. The element and attribute names are usually written in plain English (so humans can read them). They are also highly structured (so computers can parse them).

Now, for example, suppose you create your own XML language, called LuresXML. LuresXML describes the various types of lures that are offered on your Web site. First, you create an XML schema that defines what the XML document should look like, as in Listing 1.

Listing 1. lures.xsd
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:element name="lures">
 <xs:complexType> 
  <xs:sequence>
   <xs:element name="lure" maxOccurs="unbounded">
    <xs:complexType>
     <xs:sequence>
     <xs:element name="lureName" type="xs:string"/>
     <xs:element name="lureCompany" type="xs:string"/>
     <xs:element name="lureQuantity" type="xs:integer"/>
     </xs:sequence>
    </xs:complexType>
   </xs:element>
  </xs:sequence>
 </xs:complexType>
</xs:element>
</xs:schema>

This is, quite intentionally, a fairly simple example. The root element is called lures. It is the parent element of one or more lure elements, each of which is the parent of three other elements. The first element is the lure name (lureName). The second element is the name of the company that manufactures the lure (lureCompany). And, finally, the last element is the quantity (lureQuantity), or how many lures your company has in inventory. The first two of these child elements are defined as strings, whereas the lureQuantity element is defined as an integer.

Now, say you want to create an XML document (sometimes called an instance) based on that schema. It might look something like Listing 2.

Listing 2. lures.xml
<lures>
 <lure>
  <lureName>Silver Spoon</lureName>
  <lureCompany>Clark</lureCompany>
  <lureQuantity>Seven</lureQuantity>
 </lure>
</lures>

This is a simple XML document instance of the schema from Listing 1. In this case, the document instance lists only one lure. The name of the lure is Silver Spoon. The manufacturing company is Clark. And the quantity on hand is Seven.

Here is the question: How do you know that the XML document in Listing 2 is a proper instance of the schema defined in Listing 1? In fact, it isn't (this is also intentional).

Note the lureQuantity element as defined in Listing 1. It is of type xs:integer. Yet in Listing 2 the lureQuantity element actually contains a word (Seven), not an integer.

The purpose of XML validation is to catch exactly those kinds of errors. Proper validation ensures that an XML document matches the rules defined in its schema.

Continuing with this example, when you attempt to validate the XML document in Listing 2, you get an error. You fix this error (by changing the Seven to a 7) before using the document within your software application.

XML validation is important because you want to catch errors as early as possible in the information interchange process. Otherwise, unpredictable results can occur when you attempt to parse an XML document and it contains invalid data types or an unexpected structure.

Simple XML parsing in PHP

A complete overview of parsing XML documents in PHP is beyond the scope of this article. This section covers only the basics of loading an XML document in PHP.

Just to continue to keep things simple, keep using the schema from Listing 1 and the XML document from Listing 2. Listing 3 demonstrates some basic PHP code to load the XML document.

Listing 3. testxml.php
<?php

$xml = new DOMDocument(); 
$xml->load('./lures.xml'); 

?>

There is nothing complicated here either. The code uses the DOMDocument class to load the XML document, here called lures.xml. Note that because the code uses a relative path (./lures.xml), the lures.xml file must reside in the same directory as the PHP code for this to work on your own PHP server.
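Before validating against a schema, it can help to confirm that the document was parsed at all. The following is a minimal sketch, not part of the original listings. It assumes inline strings and loadXML() in place of load() so that it needs no files, and it calls libxml_use_internal_errors(true) only to keep parser warnings out of the output.

```php
<?php
// Sketch: check whether a document is well formed before doing anything else.
// loadXML() is used instead of load() only so the example needs no files.
libxml_use_internal_errors(true); // keep parser warnings out of the output

$wellFormed = '<lures><lure><lureName>Silver Spoon</lureName></lure></lures>';
$broken     = '<lures><lure><lureName>Silver Spoon</lure></lures>'; // mismatched tag

$goodDoc = new DOMDocument();
$okGood  = $goodDoc->loadXML($wellFormed); // true when parsing succeeds

$badDoc = new DOMDocument();
$okBad  = $badDoc->loadXML($broken);       // false when the XML cannot be parsed

echo $okGood ? "first document loaded\n" : "first document could not be parsed\n";
echo $okBad ? "second document loaded\n" : "second document could not be parsed\n";

libxml_clear_errors(); // discard the parser errors collected above
```

A document that fails this check is not well formed, which is a different problem from being invalid against a schema.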

At this point, it is tempting to start parsing the XML document. However, as you have seen, it is best to first validate the document to ensure that it matches the language specifications set forth in the schema.

Simple XML validation in PHP

Continue adding to the PHP code in Listing 3 by inserting some simple validation code, as in Listing 4.

Listing 4. Enhanced testxml.php
<?php

$xml = new DOMDocument(); 
$xml->load('./lures.xml');

if (!$xml->schemaValidate('./lures.xsd')) { 
   echo "invalid<p/>";
} 
else { 
   echo "validated<p/>"; 
} 

?>

Once again, note that the schema file from Listing 1 must be in the same directory where the PHP code is located. Otherwise, PHP cannot load the schema, and the validation reports a failure along with a warning.

This new code invokes the schemaValidate() method on the DOMDocument object that loaded the XML. The method takes one required parameter: the location of the XML schema used to validate the XML document. It returns a Boolean, where true indicates a successful validation and false indicates an unsuccessful validation.
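The Boolean result can be seen without any files on disk. The sketch below is an illustration only: it assumes the related method schemaValidateSource(), which takes the schema as a string, a copy of the schema from Listing 1, and a hypothetical helper named lures_is_valid().

```php
<?php
// Sketch: the same check as Listing 4, with the schema and documents in strings.
libxml_use_internal_errors(true); // collect errors instead of printing warnings

$xsd = <<<'XSD'
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
 <xs:element name="lures">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="lure">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="lureName" type="xs:string"/>
       <xs:element name="lureCompany" type="xs:string"/>
       <xs:element name="lureQuantity" type="xs:integer"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
</xs:schema>
XSD;

// Hypothetical helper: true only if the XML parses and matches the schema.
function lures_is_valid(string $xmlSource, string $xsdSource): bool
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xmlSource)) {
        libxml_clear_errors();
        return false; // not even well formed
    }
    $valid = $doc->schemaValidateSource($xsdSource);
    libxml_clear_errors();
    return $valid;
}

$template = '<lures><lure><lureName>Silver Spoon</lureName>'
          . '<lureCompany>Clark</lureCompany>'
          . '<lureQuantity>%s</lureQuantity></lure></lures>';

$wordResult   = lures_is_valid(sprintf($template, 'Seven'), $xsd);
$numberResult = lures_is_valid(sprintf($template, '7'), $xsd);

echo $wordResult ? "validated\n" : "invalid\n";
echo $numberResult ? "validated\n" : "invalid\n";
```

The first call uses the quantity from Listing 2 (the word Seven) and the second uses the numeral 7, so the two results differ.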

Now, deploy the PHP code from Listing 4 to your own PHP server. Call it testxml.php because that is the name given in Listings 3 and 4. Ensure that the XML document (from Listing 2) and XML schema (from Listing 1) are both in the same directory as the PHP file. Once again, the validation fails with a warning if this is not the case.

Point your browser to testxml.php. You should see one simple word on the screen: "invalid."

The good news is that the schema validation is working. It should return an error, and it did.

The bad news is that you have no idea where the error is located within the XML document. Okay, you might know because I mentioned the source of the error earlier in the article. But pretend that didn't happen, okay?

There is an error, but where?

It would be nice if the PHP code actually reported the location of the error, as well as the nature of the error, so that you can take corrective action. Something along the lines of "Hey! I can't accept a string for lureQuantity" would be nice.

By default, libxml reports each validation problem as a PHP warning. Unfortunately, the location given in that warning is the line of PHP code that triggered it, not the line of the XML document that contains the error. The libxml_get_errors() function can return the errors in a more useful form, but it has nothing to return unless libxml has been told to store errors instead of emitting them as warnings.

There is another PHP function called libxml_use_internal_errors(). This function accepts a Boolean as its only parameter. If you set it to true, you disable the default libxml error reporting and take responsibility for fetching the errors on your own. That's what you do here.
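As a minimal sketch of that switch in isolation, the following fragment turns internal error handling on, triggers a parser error with a deliberately malformed string (an assumption made only so the example needs no files), and then restores the earlier setting. It relies on libxml_use_internal_errors() returning the previous value of the setting.

```php
<?php
// Sketch: collect libxml errors yourself, then restore the earlier setting.
$previous = libxml_use_internal_errors(true); // returns the setting that was active before

$doc    = new DOMDocument();
$loaded = $doc->loadXML('<lures><lure></lures>'); // deliberately malformed

$collected = libxml_get_errors(); // array of LibXMLError objects
libxml_clear_errors();            // empty the buffer for the next operation
libxml_use_internal_errors($previous);

foreach ($collected as $error) {
    echo "line {$error->line}: " . trim($error->message) . "\n";
}
```

Restoring the previous value is a courtesy to other code in the same request that may expect the default behavior.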

Of course, that means that you have to write a bit more code. But the trade-off is more specific error reporting. In the long run, this saves a lot of time.

Listing 5 shows the finished product.

Listing 5. The final testxml.php
<?php
function libxml_display_error($error)
{
    $return = "<br/>\n";
    switch ($error->level) {
        case LIBXML_ERR_WARNING:
            $return .= "<b>Warning $error->code</b>: ";
            break;
        case LIBXML_ERR_ERROR:
            $return .= "<b>Error $error->code</b>: ";
            break;
        case LIBXML_ERR_FATAL:
            $return .= "<b>Fatal Error $error->code</b>: ";
            break;
    }
    $return .= trim($error->message);
    if ($error->file) {
        $return .= " in <b>$error->file</b>";
    }
    $return .= " on line <b>$error->line</b>\n";

    return $return;
}

function libxml_display_errors() {
    $errors = libxml_get_errors();
    foreach ($errors as $error) {
        print libxml_display_error($error);
    }
    libxml_clear_errors();
}

// Enable user error handling
libxml_use_internal_errors(true);

$xml = new DOMDocument();
$xml->load('./lures.xml');

if (!$xml->schemaValidate('./lures.xsd')) {
    print '<b>Errors Found!</b>';
    libxml_display_errors();
}
else {
    echo "validated<p/>";
}

?>

First, notice the function at the top of the code listing. It's called libxml_display_error() and accepts a LibXMLError object as its only parameter. It uses the all-too-familiar switch statement to determine the error level (warning, error, or fatal error) and begins the message with a label appropriate to that level, followed by the error code. The text of the error message itself is then appended.

Then, two more things happen. First, the error object is examined to determine whether or not a file property contains a value. If so, then that file value is appended to the error message so the location of the file is reported. Next, the line property is appended to the error message so the user can see exactly where in the XML file the error occurred. Needless to say, this is extremely important for debugging purposes.

It should also be noted that libxml_display_error() simply produces a string that describes the error. The actual printing of the error to the screen is left up to the caller, in this case libxml_display_errors().

The function below that is the previously mentioned libxml_display_errors(), which takes no parameters. The first thing this function does is call libxml_get_errors(). This returns an array of LibXMLError objects that represent all of the errors encountered when the schemaValidate() method was invoked on the XML document.

Next, you step through each of the errors you encountered and invoke the libxml_display_error() function for each error object. Whatever string is returned by that function is then printed to the screen. One great benefit of handling errors this way is that all of the errors are printed at once. This means that you only need to execute the code once to view all of the errors specific to that particular XML document.

Finally, libxml_clear_errors() clears out the errors recently encountered by the schemaValidate() method. This means that if schemaValidate() is executed again within the same code sequence, you will start with a clean slate, and only new errors will be reported. If you don't do this and you execute schemaValidate() again, then all of the errors from the first invocation of schemaValidate() remain in the array returned by libxml_get_errors(). Obviously, that presents problems if you're looking for a fresh set of errors.
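The effect of skipping libxml_clear_errors() can be sketched as follows. This is an illustration under stated assumptions: it uses a deliberately tiny one-element schema rather than lures.xsd, inline strings with schemaValidateSource() so that no files are needed, and a hypothetical helper named count_errors_after().

```php
<?php
// Sketch: errors accumulate across validations until libxml_clear_errors() is called.
libxml_use_internal_errors(true);

// Deliberately tiny schema (an assumption for this example, not lures.xsd).
$xsd = '<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="lureQuantity" type="xs:integer"/>
</xs:schema>';

// Hypothetical helper: validate, then report how many errors libxml is holding.
function count_errors_after(string $xml, string $xsd): int
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $doc->schemaValidateSource($xsd);
    return count(libxml_get_errors());
}

$first = count_errors_after('<lureQuantity>Seven</lureQuantity>', $xsd);

// The second document is valid, but the earlier errors are still in the buffer.
$stale = count_errors_after('<lureQuantity>7</lureQuantity>', $xsd);

libxml_clear_errors();

// After clearing, the same valid document leaves the buffer empty.
$fresh = count_errors_after('<lureQuantity>7</lureQuantity>', $xsd);

echo "after invalid document: $first\n";
echo "after valid document, not cleared: $stale\n";
echo "after valid document, cleared: $fresh\n";
```

The middle figure is the misleading one: a valid document appears to have errors only because the buffer was never emptied.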

It's also important to note that I made a slight change to the if-else statement at the bottom of the code in Listing 5. If an error is encountered, it prints "Errors Found!" in bold and then invokes the aforementioned libxml_display_errors() function, which displays all of the errors encountered before clearing out the error array. I opted for this solution instead of just printing out "invalid" as I did in Listing 4.

Second test

Now, it's time to test again. Move the PHP file from Listing 5 to your PHP server. Keep the file name the same (testxml.php). As before, ensure that both the XML Schema Definition (XSD) file and the XML files are in the same directory as the PHP file. Point your browser to testxml.php once again, and now you should see something like this:

Errors Found!
Error 1824: Element 'lureQuantity': 'Seven' is not a valid value of the atomic type 'xs:integer'. in /home/thehope1/public_html/example.xml on line 5

Well, that's fairly descriptive, isn't it? The error message tells you on what line the error occurred. It also tells you where the file is (as if you didn't know). And it tells you exactly why the error occurred. That's information you can use.

Fixing the problem

You can now leave the PHP file alone and work on fixing the problem in your XML document.

Because the error reportedly occurred on line 5 of the XML document, it's a good idea to look at line 5 and see what's there. Unsurprisingly, line 5 is the location of the lureQuantity element. And, as you look at it carefully, you suddenly have an epiphany that Seven is a string, not a number. So you change the string Seven to the numeral 7. The final copy of the XML document should look like Listing 6.

Listing 6. Updated XML file
<lures>
 <lure>
  <lureName>Silver Spoon</lureName>
  <lureCompany>Clark</lureCompany>
  <lureQuantity>7</lureQuantity>
 </lure>
</lures>

Now, copy this new XML file to your PHP server. And, once again, point your browser to testxml.php. You should see just one word: "validated." This is excellent news for two reasons. First, it means that the validation code is working properly because the XML document is, in fact, valid. Second, you have probably just validated your first XML document in PHP. Congratulations!

As I always advise, now it is time to tinker. Modify lures.xsd to make it a more complex schema. Modify lures.xml to make it a more complex instance of that schema. Copy those files to the PHP server and, once again, execute testxml.php. See what happens. Then intentionally make the document invalid in several different ways and see what is reported for each.

Also, note that when you tinker, you don't need to change the PHP code at all. Just make sure that the file names (lures.xml and lures.xsd) are the same and you can modify them to your heart's content.
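If you would rather experiment with other file names, one option is to wrap the steps from Listing 5 in a function that takes the two paths as parameters. The sketch below is an assumption-laden illustration, not part of the original article: validate_xml_file() is a hypothetical name, the messages are returned as plain strings rather than printed as HTML, and the demonstration writes throwaway files with a deliberately tiny schema.

```php
<?php
// Hypothetical helper (not from the article): returns an empty array when the
// document is valid, otherwise one "line N: message" string per libxml error.
function validate_xml_file(string $xmlPath, string $xsdPath): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors(); // start with a clean slate

    $messages = [];
    $doc = new DOMDocument();
    if (!$doc->load($xmlPath) || !$doc->schemaValidate($xsdPath)) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
        }
        if ($messages === []) {
            $messages[] = 'validation failed'; // no details were recorded
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// Demonstration with throwaway files and a deliberately tiny schema.
$xsdPath = tempnam(sys_get_temp_dir(), 'xsd');
$xmlPath = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($xsdPath, '<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="lureQuantity" type="xs:integer"/>
</xs:schema>');

file_put_contents($xmlPath, '<lureQuantity>Seven</lureQuantity>');
$problems = validate_xml_file($xmlPath, $xsdPath);

file_put_contents($xmlPath, '<lureQuantity>7</lureQuantity>');
$none = validate_xml_file($xmlPath, $xsdPath);

unlink($xsdPath);
unlink($xmlPath);

print_r($problems);
echo count($none) === 0 ? "validated\n" : "invalid\n";
```

Returning the messages instead of printing them leaves the caller free to decide how to display them, which is the same division of labor that Listing 5 uses between libxml_display_error() and libxml_display_errors().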

Conclusion

PHP makes it easy for developers to validate XML documents. Using the DOMDocument class in conjunction with the schemaValidate() method, you can ensure that your XML documents comply with the specifications in their respective schemas. This is important to ensure data integrity in your software applications.

