XML brings an interesting perspective to data management, one that many developers find new and strange at first. It offers flexible support for loosely structured and hierarchical data, but it also comes with inevitable performance problems. Unfortunately, developers often don't consider the problems that can arise from XML's transparency. Once you put something into an XML document, it's hard to obscure it (short of encryption, of course): any part of the document is as available as the document as a whole. This is so obvious that developers, to an alarming extent, fail to take it into account when they design XML-driven applications.
The dangers of this lassitude were underscored in 2004 when security researchers at Sanctum, Inc., reported Blind XPath Injection, a tedious but dangerous attack. Given an XPath engine used to query an XML document, Blind XPath Injection allows an attacker to retrieve the entire contents of the document without any specific knowledge of the XPath queries that the application uses. The attack can work even if the application's own queries are all limited to a subset of the document.
Many XML applications build on raw XML dumps from databases and legacy applications. Software vendors have encouraged this approach by making monolithic XML dumps the most prominent XML features in their repertoire. The promised ease with which you can transform one XML format to another using XSLT leads to a cavalier philosophy: "Throw it all out as XML, and pick through for what you need." The problem is that this leaves the door wide open to security issues, such as XPath injection attacks. XML doesn't offer access control, so a black hat who compromises an XML-driven application isn't just limited to the data the application cares about -- everything in the XML document is accessible. If the application is built on an XML repository, the situation might be even worse, although some repositories do provide for access control, mitigating such vulnerability. In this article, I discuss principles for managing XML deployment to avoid such vulnerabilities. The principles are quite simple, and yet not often enough discussed among XML professionals.
Complete document transparency
The brute-force Blind XPath Injection attack discovered in 2004 emphasizes the first principle: If any part of an XML document is exposed to an application, assume that the entire document is also exposed to that application, its users, and any other applications that interface with it. Figure 1 helps illustrate this.
Figure 1. Indirect exposure of an XML document
The arrows indicate the availability of data. The organizational chart (org chart) application is internal to your enterprise and is protected by the firewall. You can assume that all of its direct users are trusted (a naive assumption even for an internal application, but sufficient for this scenario). The org chart app accesses an XML dump of employee information. This XML file includes information such as employee name, department, and Social Security number. The app uses only the employee name and department, which it retrieves through an XPath query of the source file. The contact list application is outside the firewall, so both trusted and untrusted users can connect to it. It provides contact information for key employees, and to avoid problems with obsolete static information, it queries the org chart app dynamically through a controlled channel in the firewall. Through a Blind XPath Injection attack, an attacker might be able to access the entire XML dump, including the Social Security numbers that the org chart app never accesses directly. The diagram shows this: you can trace the arrows from the external users all the way to the XML dump.
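The reach described above can be seen in miniature with a small sketch. It is not the full attack: Python's standard ElementTree module implements only a limited subset of XPath, so the example shows a single yes/no probe rather than the extraction of a whole document. The element names, the sample values, and the names_in_department function are all hypothetical, and the vulnerable pattern assumed here is a query assembled by splicing user input into a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical employee dump; element names and values are invented.
DUMP = """
<employees>
  <employee>
    <name>Ada Example</name>
    <department>Engineering</department>
    <ssn>000-00-0001</ssn>
  </employee>
  <employee>
    <name>Bob Sample</name>
    <department>Sales</department>
    <ssn>000-00-0002</ssn>
  </employee>
</employees>
"""

root = ET.fromstring(DUMP)


def names_in_department(department):
    """Org chart lookup that unsafely splices user input into the query."""
    query = "./employee[department='%s']/name" % department
    return [elem.text for elem in root.findall(query)]


# Intended use: only <name> and <department> are involved.
print(names_in_department("Sales"))

# Injected input closes the first predicate and opens a second one on <ssn>.
# The application never queries <ssn> itself, yet a non-empty result tells
# the caller that the guessed value is correct.
print(names_in_department("Sales'][ssn='000-00-0002"))

# A wrong guess produces an empty result.
print(names_in_department("Sales'][ssn='999-99-9999"))
```

The first two calls return the same one-name list and the third returns an empty list, which is all an attacker needs for a yes/no oracle against a field the application never meant to expose. A full XPath 1.0 engine adds string functions such as substring() and string-length(), which would let such probes test a value piece by piece instead of guessing it whole.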
From a security standpoint, the main consequence of this principle is that you need to understand both how XML-driven applications connect to each other and the entire contents of the XML files involved in such chains. Awareness is your first defense, and every security plan should include a review of the reach of transparent data that accounts for the nonobvious ways in which that reach can be extended.
"Need-to-know" XML document deployment
One way to protect sensitive data from attacks built on document transparency is to curtail the transparency by encrypting parts of XML documents. In the scenario discussed in the previous section, you might keep the same source XML document for the org chart app, but encrypt the Social Security information. This is only a partial solution: usually an application that needs to use that information has access to the credentials for reading it, which still makes the application a potential vector for snooping on that data. It's safer to take an all-or-nothing approach to XML deployment. Rather than try to simplify matters by sharing XML data sets across domains and applications, it's better to deploy one trimmed data set for each application on a strict need-to-know basis, as expressed in this second principle: XML source documents available to an application should include only the information the application needs for its workings. Figure 2 demonstrates a scenario that respects this principle.
Figure 2. Trimming XML documents on a need-to-know basis
In this scenario, you still use an XML dump from the database, but no application ever accesses that full document directly, as you can tell by following the arrows that indicate access and availability. Rather, an automated process separate from these applications (the "XML trimmer" in Figure 2) generates a subset XML file for each application. These subset files contain only the information that the application needs to know. Ideally, each application would register only the XPath queries that it needs, and the trimmer would refer to this query list to drive the generation of subsets. In this scenario, each application is isolated from the full data set, which reduces the scope of unexpected information leakage.
It is not a cure-all for security problems from XML transparency, by any means. For one thing, you probably want to further limit the information available to each user session within an application. Even though you used the need-to-know pattern to make it impossible for a malicious user to access Social Security numbers, you probably have differentiated levels of access within the information the application uses. A human-resources manager using the application might have access to employee salary information, while other users don't. To avoid leakage that stems from document transparency, you might end up with a need-to-know system that isolates XML data sets by user session and not just by application. As such systems become more complex, you might find it harder to understand the full picture, and thus harder to maintain an effective security review, if you're not careful.
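The trimmer and its list of registered queries can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference design: the REGISTERED_QUERIES table, the application names, the trim function, and the flat employee records are all hypothetical, and the registered queries are limited to the simple paths that Python's ElementTree supports. The registry is assumed to come from trusted configuration, never from user input.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical full dump; in practice this would come from the database.
DUMP = """
<employees>
  <employee>
    <name>Ada Example</name>
    <department>Engineering</department>
    <ssn>000-00-0001</ssn>
  </employee>
  <employee>
    <name>Bob Sample</name>
    <department>Sales</department>
    <ssn>000-00-0002</ssn>
  </employee>
</employees>
"""

# Hypothetical registry: the paths, relative to each record, that an
# application has declared it needs.
REGISTERED_QUERIES = {
    "orgchart": ["name", "department"],
    "contacts": ["name"],
}


def trim(dump_root, queries):
    """Build a new tree that holds only what the registered queries select."""
    trimmed = ET.Element(dump_root.tag)
    for record in dump_root.findall("employee"):
        kept = ET.SubElement(trimmed, record.tag)
        for query in queries:
            for match in record.findall(query):
                # Copy, so the subset shares nothing with the full dump.
                kept.append(copy.deepcopy(match))
    return trimmed


full_dump = ET.fromstring(DUMP)
orgchart_view = trim(full_dump, REGISTERED_QUERIES["orgchart"])
print(ET.tostring(orgchart_view, encoding="unicode"))
```

Because the subset is built up from an allow-list rather than by deleting known sensitive fields, anything added to the dump later stays out of every subset until an application registers a need for it. Keying the registry by user role as well as by application would extend the same mechanism to the per-session isolation discussed above.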
Building on pipeline architecture
The difficulty of dealing with intricate data flow leads to the third principle: Design your applications around XML processing pipelines to make data flow easier to factor into threat assessment. XML pipeline architecture treats XML data flow as a set of small, well-defined processing stages (largely transforms). You can avoid the equivalent of large blocks filled with spaghetti code if you think in terms of how to combine a series of narrowly scoped modules to meet the XML data needs. There is a lot to XML pipeline processing, and it's important enough that the World Wide Web Consortium (W3C) has put together a working group primarily to standardize XML processing pipeline technology.
Pipeline architecture can be an important tool in managing security issues from XML's transparency. It reduces to bite-sized chunks the problem of what data is accessible from where, and thus what data is vulnerable to attack. It makes it easier to organize strategies such as need-to-know, even when they are broken down to fine-grained situations such as protecting data across user sessions. In all, the pipeline approach doesn't hide from the transparency of XML data, hoping to shrug off security problems. Rather, it embraces the transparency in order to maintain clarity of the flow of all data across XML applications. This clarity allows you to visualize and thus assess potential threats, and it makes sure you don't naively rely on security by obscurity.
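One way to picture the pipeline idea is as a list of small functions, each of which takes a tree and returns a new one. The sketch below uses plain Python functions as stand-ins for pipeline stages; it is not the pipeline language under discussion at the W3C, and the stage names, the run_pipeline helper, and the sample data are hypothetical. After every stage it records which element names are still present, so a reviewer can read off what data is reachable at each point.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical dump; element names and values are invented.
DUMP = """
<employees>
  <employee>
    <name>Ada Example</name>
    <department>Engineering</department>
    <ssn>000-00-0001</ssn>
  </employee>
  <employee>
    <name>Bob Sample</name>
    <department>Sales</department>
    <ssn>000-00-0002</ssn>
  </employee>
</employees>
"""


def keep_children(*tags):
    """Build a stage that keeps only the named children of each record."""
    allowed = set(tags)

    def stage(root):
        out = ET.Element(root.tag)
        for record in root:
            kept = ET.SubElement(out, record.tag)
            for child in record:
                if child.tag in allowed:
                    kept.append(copy.deepcopy(child))
        return out

    stage.__name__ = "keep_children(%s)" % ", ".join(tags)
    return stage


def only_department(department):
    """Build a stage that keeps only the records of one department."""

    def stage(root):
        out = ET.Element(root.tag)
        for record in root:
            if record.findtext("department") == department:
                out.append(copy.deepcopy(record))
        return out

    stage.__name__ = "only_department(%s)" % department
    return stage


def run_pipeline(root, stages):
    """Apply stages in order, noting which element names survive each one."""
    audit = []
    for stage in stages:
        root = stage(root)
        visible = sorted({elem.tag for elem in root.iter()})
        audit.append((stage.__name__, visible))
    return root, audit


contact_pipeline = [
    keep_children("name", "department"),
    only_department("Sales"),
    keep_children("name"),
]
result, audit = run_pipeline(ET.fromstring(DUMP), contact_pipeline)
for stage_name, visible in audit:
    print(stage_name, "->", visible)
```

Each stage is narrow enough to review on its own, and the audit trail makes the data flow explicit: in this run the ssn element is gone after the first stage, so no later stage, and no application fed by one, can leak it.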
Good design for security is not all that different from good design for software quality. The more you clump and tangle things together, the harder it is to spot and protect against problems. The increased transparency of XML data requires an increased transparency of application processing workflow in order to mitigate problems from security to state control. Applications that work with large dumps of XML data, and use complex processing to scratch needed information from these data sets, are vulnerable to a sophisticated attacker who takes advantage of your blind spots. If you design applications that package and exchange small, controlled chunks of XML data in manageable processing stages, you reduce these blind spots and make the application easier to maintain. Understanding the implications of transparent data flow is key to the security of XML-based applications.
Learn
- Learn about a prominent vulnerability in XML-based systems in Blind XPath Injection by the SecureIT Alliance.
- Keep an eye on the XML Processing Model Working Group, which aims to standardize architectural matters, such as XML pipeline processing.
- Read an earlier, less technical article on issues raised by Blind XPath Injection in Does XML give away the keys to the warehouse? by Uche Ogbuji (Application Development Trends, 2005).
- Find more XML resources on the developerWorks XML zone, where you'll find previous installments of the Thinking XML column. If you have comments on this article or any others in this column, please post them on the Thinking XML forum.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.




