
Thinking XML: The XML decade

Thoughts on IBM Systems Journal's retrospective of XML at ten years (or so)

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.

Summary:  IBM Systems Journal recently published an issue dedicated to XML's 10th anniversary. It is primarily a collection of interesting papers for XML application techniques, but some of its articles offer general discussion of the technical, economic and even cultural effects of XML. There is a lot in these papers to draw from in thinking about why XML has been successful, and what it would take for XML to continue its success. This article expands on some of these topics that are especially relevant to readers of this column.

Date:  14 Nov 2006
Level:  Intermediate

XML is approaching 10 years old. How closely depends on how you count. The W3C Recommendation Extensible Markup Language (XML) 1.0 was published on 10 February 1998. Work on XML started around 1996, however, rooted in almost thirty years of SGML. The design principles for XML, which guided its development, were published on 25 August 1996. The first working draft, published on 14 November 1996, defined documents very similar to the majority of XML you might see today. Many of the changes between that first draft and the final recommendation were in more obscure areas of the standard. The basic ideas of labeled, balanced, hierarchical tags and clearly defined text encoding were well in place in 1996, and so IBM Systems Journal counts 2006 as the year of XML's decade.

Regardless of whether you agree with their counting, the volume is well worth a thorough read by all XML professionals. It combines an interesting retrospective of XML with useful articles on specific techniques and developments, providing a glimpse into the future of the technology, and thus of our profession. In this article I offer some comment and expansion on the treatment in IBM Systems Journal, focusing on the keynote article "Technical context and cultural consequences of XML" and one of the other papers, "Emerging patterns in the use of XML for information modeling in vertical industries". The latter paper is concerned with a common theme of Thinking XML: the development and adoption of industry-specific XML vocabularies.

Avoiding the doom of history

Santayana's old adage, "Those who cannot remember the past are condemned to repeat it," has a corollary in technology: "Those who forget how a wheel was invented are doomed to reinvent it." To learn how to extract maximum value from XML, it's important to understand at least the basics of its motivations and guiding principles. One of the most important of these is mentioned in the keynote article of the IBM Systems Journal issue:

The original motivation for SGML, subsequently passed on to XML, was to ensure that the content or data residing in documents survived long after the application that processed it became obsolete or unusable; thus no processing or procedural information is embedded within the content; instead, content is encoded as clear text and available everywhere.

This is perhaps the most fundamental principle of XML, and a simple story illustrates why it is so important. COBOL endured as an essential skill sought by recruiters decades after most trends in programming had firmly abandoned the language. By the 1990s COBOL had long diminished in computer science curricula, and most professionals were looking to work in C++, Java, SQL and the like. Nevertheless there was stubborn demand for COBOL talent, and the industry was coming close to a crisis because of the difficulty of filling these needs. The reason for this crisis was that so much crucial business information was still locked into decades-old COBOL programs, dating from the times when companies were first making heavy investments in information systems. Numerous projects to extract this data into more modern forms such as relational databases failed or proved very difficult, in part because of the sheer volume of data, combined with the difficulty of gathering manpower. The year 2000 was fast approaching, with alarm after alarm going off about how much chaos could ensue from all the data in COBOL and other legacy systems that did not account for the rollover from "99" to "00". Many commentators look upon this period as an extraordinary waste of resources, spent agonizing over past assets rather than productively developing new ones. The original problem was that the data encoded all those years ago was geared towards one programming and processing system: COBOL. No thought was given to having that data outlive the predominant processing technology of its day.

This hard lesson and many similar ones have taught us that it is extremely valuable to develop data so that it outlives the applications that presently operate on it. XML, used properly, can help prevent such crises in productivity as the artificial COBOL boom of the 1990s. Even better, it can be a building block rather than a stumbling block for productivity, pointing the way to new applications in the constant quest for competitiveness. Charles Goldfarb himself, in "The Roots of SGML -- A Personal Recollection", a document reminiscing over his pioneering of structured markup (see Resources), put it this way:

Historically, electronic manuscripts contained control codes or macros that caused the document to be formatted in a particular way ("specific coding"). In contrast, generic coding, which began in the late 1960s, uses descriptive tags (for example, "heading", rather than "format-17").

Generic coding is the foundation of XML and related technologies. One of the most important principles you should adopt in using XML is "If any aspect of the XML design is too closely tied to the application, consider that a bug." It is useful to be familiar with the brief document "Design Principles for XML" (see Resources) from which you can derive more such guidance.
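As a rough illustration of why generic coding keeps data independent of any one application, here is a minimal Python sketch using only the standard library. The element names (doc, heading, para), the sample text, and both rendering functions are assumptions invented for this example; they do not come from Goldfarb's paper or from any standard vocabulary.

```python
import xml.etree.ElementTree as ET
from html import escape

# Generic coding: tags say what the content is, not how to format it.
# The element names "doc", "heading" and "para" are illustrative assumptions.
GENERIC = (
    "<doc>"
    "<heading>Avoiding the doom of history</heading>"
    "<para>Data should outlive its applications.</para>"
    "</doc>"
)


def to_plain_text(doc):
    """One hypothetical application: upper-case headings for a text terminal."""
    lines = []
    for node in doc:
        text = node.text or ""
        lines.append(text.upper() if node.tag == "heading" else text)
    return "\n".join(lines)


def to_html(doc):
    """A second hypothetical application: map descriptive tags to HTML."""
    html_tags = {"heading": "h2", "para": "p"}
    parts = []
    for node in doc:
        tag = html_tags[node.tag]
        parts.append(f"<{tag}>{escape(node.text or '')}</{tag}>")
    return "".join(parts)


doc = ET.fromstring(GENERIC)
print(to_plain_text(doc))
print(to_html(doc))
```

Neither renderer leaves any trace in the document itself. Had the source said "format-17" instead of "heading", the second application would have had to reverse-engineer the first one's formatting rules before it could do anything useful.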

Fruitful disagreements

XML's success is rooted in the convergence of a huge diversity of backgrounds and interests, and this same strength is the source of many conflicts. The world of XML has always had battling factions, more so, in my observation, than you find within other technologies of similar breadth. There is no aspect of XML that has not been exhaustively debated, or that does not lead to deep divisions in practical application. In many cases of technological factionalism the struggle is really over a prize in business competition: usually one vendor wants to enshrine its approach in the standard so as to improve its penetration of the marketplace. Certainly a good deal of that does go on in XML, but many basic philosophical differences also constantly threaten to tear the XML community into sub-groups. The keynote IBM Systems Journal article that I quoted also mentions this fact:

One of the most compelling aspects of [XML's] evolution was the intense and spirited collaboration of communities from different disciplines, each having its own ideas of what was important and often dismissing the requirements of other communities. This dissension might have destroyed the entire experience, but perseverance defeated it. Discounting the small but persistent set of detractors, the diverse communities realized that there was much to learn from each other and much value in considering the broad range of requirements. The communities began to understand that XML could allow the integration of data-centric, document-centric, forms-centric, protocol-centric, and process-centric views of information and the processing they undergo. For the first time, XML and its related standards enabled data interoperability, content manipulation, content sharing and reuse, document assembly, document security and access control, document filtering, and document formatting across all disciplines and for all types of devices and applications. This collaboration and discussion continues today as work progresses on the next tier of standards, taking into account the benefits and risks brought by XML's growing complexity and diversity.

Some of the most important fractures in today's XML world are similar to those a decade ago.

Prose versus structured records

Sometimes called the document/data divide, this issue tends to separate those from a document management and electronic messaging background from those of a DBMS and distributed programming background. Some prefer to design XML in ways suitable for prose. Think of a book--sometimes markup is mixed into text (such as an emphasized word), and order is almost always important. Some prefer to design XML like database records--there is a lot of repetition, order doesn't matter, and markup is always sharply separated from text. Obviously some uses of XML are more suited to one of these design styles, but the debate comes down to which approach should be preferred in the middle cases, and whether XML processing technologies should favor one or the other. All this leads to some tremendous differences in practical approaches, such as that between the RPC-inspired SOAP and WSDL brand of XML-based communications and ebXML and REST approaches.
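To make the two design styles concrete, here is a small Python sketch using the standard library's ElementTree. Both sample documents and both helper functions (has_mixed_content, record_as_dict) are hypothetical, made up for illustration. The sketch detects mixed content, where text and markup are interleaved, and shows that a record-style document flattens naturally into a field mapping while a prose-style one depends on reading order.

```python
import xml.etree.ElementTree as ET

# Prose-style XML: markup is mixed into running text, and order matters.
PROSE = "<para>Order is <em>almost always</em> important in a book.</para>"

# Record-style XML: fields are sharply separated from text.
RECORD = "<customer><id>42</id><name>Ada</name><city>Boulder</city></customer>"


def has_mixed_content(element):
    """Return True if any element holds both child elements and real text."""
    for node in element.iter():
        children = list(node)
        if not children:
            continue
        own_text = (node.text or "").strip()
        tail_text = any((child.tail or "").strip() for child in children)
        if own_text or tail_text:
            return True
    return False


def record_as_dict(element):
    """Flatten a record-style element into a field-name to text mapping."""
    return {child.tag: (child.text or "") for child in element}


prose = ET.fromstring(PROSE)
record = ET.fromstring(RECORD)

print(has_mixed_content(prose))    # text and markup interleaved
print(has_mixed_content(record))   # fields only
print("".join(prose.itertext()))   # document order recovers the sentence
print(record_as_dict(record))
```

Applying record_as_dict to the prose example would silently drop the text around the emphasized phrase. That is a small-scale version of what goes wrong when tooling built for one style is pointed at the other.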

Strong, static typing versus loose, dynamic typing

People from database and programming circles often want to use the technical data typing tools they are familiar with. Those from the LAMP camp, as well as much of the prose content faction, prefer to think of XML content strictly as text, which can only very loosely be considered as supporting data types. LAMP (originally Linux™, Apache, MySQL and Python/Perl/PHP) is a popular term that now applies to a whole philosophy of lightweight Web development using dynamic languages (including others such as Ruby) and open-source databases (including PostgreSQL). I have my own whimsical name for this divide: "bohemians versus gentry" (see Resources). In practice it is responsible for everything from the competition between W3C XML Schema Language (WXS) and RELAX NG to support for and opposition to XPath 2.0 and XQuery.
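The difference between the two views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item document and the TYPES mapping are hypothetical, and the mapping merely stands in for the kind of type annotation a schema language might supply. It is not how WXS or RELAX NG validation actually works.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical sample document; the element names are assumptions.
DOC = "<item><sku>00417</sku><price>19.90</price><qty>3</qty></item>"
item = ET.fromstring(DOC)

# Loose, dynamic view: every value is just text until a consumer says otherwise.
as_text = {child.tag: child.text for child in item}

# Strict, static view: a type is assigned to each field up front.
# TYPES is an illustrative stand-in for schema-supplied type information.
TYPES = {"sku": str, "price": Decimal, "qty": int}
as_typed = {child.tag: TYPES[child.tag](child.text) for child in item}

total = as_typed["price"] * as_typed["qty"]

print(as_text)
print(as_typed)
print(total)
```

The sku field shows why the choice is not obvious: read as text it keeps its leading zeros, while a consumer that eagerly typed it as an integer would turn "00417" into 417 and lose information the document actually contains.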

Must-understand versus must-ignore

The must-understand camp believes that it's important to insist on strict schemata (whether in the form of DTD, WXS, RELAX NG, Schematron or other) and full control over the data format. The must-ignore camp believes that it's OK to build formats around a very lightweight schema that can be informally extended. If an application runs across an extension it doesn't understand, it should take a relaxed approach and just ignore that part of the file, moving on. The former group believes that extensibility comes from careful schema evolution, while the latter trusts informal agreements and graceful degradation of features. This argument comes up in areas as diverse as Microformats on the Web, and process conventions for Web services.
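The two policies are easy to contrast in code. The following Python sketch is a minimal illustration, not a model of any particular schema language or Web services convention; the entry vocabulary, the KNOWN set, the mood extension element, and the read_entry function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements this hypothetical consumer was written to understand.
KNOWN = {"title", "author"}

# A producer has informally extended the format with a "mood" element.
DOC = (
    "<entry>"
    "<title>The XML decade</title>"
    "<author>Uche Ogbuji</author>"
    "<mood>whimsical</mood>"
    "</entry>"
)


class UnknownElementError(ValueError):
    """Raised under the must-understand policy for unrecognized elements."""


def read_entry(xml_text, must_understand):
    """Read known fields; reject or skip unknown ones depending on policy."""
    entry = ET.fromstring(xml_text)
    fields = {}
    for child in entry:
        if child.tag in KNOWN:
            fields[child.tag] = child.text or ""
        elif must_understand:
            raise UnknownElementError(f"unexpected element: {child.tag}")
        # must-ignore: fall through and move on to the next element
    return fields


print(read_entry(DOC, must_understand=False))

try:
    read_entry(DOC, must_understand=True)
except UnknownElementError as err:
    print(err)
```

Under must-ignore the extended document degrades gracefully to the fields the consumer knows. Under must-understand the same document is rejected until the schema and the consumer evolve together.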

Formal versus informal semantics

Some of the leaders in the W3C saw XML as merely a stepping stone to a vision of a Web with carefully-annotated semantics so that autonomous agents could effectively navigate the interlinked information with minimal human intervention. Others feel this is not a practical goal at present, and that XML should not be burdened with the sort of heavily structured metadata that makes such a scenario possible. Those who look for very formal semantics for XML want to try to connect XML constructs to registries or knowledge representation formats. This is a topic of discussion that I've covered especially closely in this column.

Increased simplification versus stability

From the earliest days of XML some people have argued that DTD features, entities and such arcana unnecessarily complicate the standard, and some even go as far as to claim that attributes are unnecessary complexity. Others argue that trying to lop such features off XML would impair compatibility, leading to the degeneration of XML as a stable basis for expressing information. One group would like something like a clean-room XML 2.0 that removes what they consider warts. The other would prefer very careful, incremental improvements, with greater focus on the spectrum of technologies building on the core standard.

Other loud arguments are over narrower, more technical issues: XML namespaces, binary XML, XLink and more, but the above are pervasive regardless of what specific area of XML interests you. It's useful to be aware of the many varying perspectives so that you have all the tools at your disposal for effective decision making.


Thinking "Thinking XML"

Since the beginning of this column, an important focus has been how industries come together to build vertical standards on top of XML. The IBM Systems Journal article "Emerging patterns in the use of XML for information modeling in vertical industries" is a close examination of this area using examples from the Open Applications Group, Inc. (OAGi), the Association for Cooperative Operations Research and Development (ACORD) and the OpenTravel Alliance (OTA). It suggests some useful ways to classify such initiatives, as well as some technical design patterns in the details of how XML and Web services are used. The article indirectly touches on many of the divisions in XML practice I discussed in the previous section. It also touches on a distinction I've covered in prior Thinking XML installments: top-down versus bottom-up modeling of XML vocabularies. In discussing top-down modeling the article provides one of the better descriptions I've seen in the general literature.

Formal modeling patterns, in contrast with ad hoc development, are used by some organizations for developing standard messages. Such formal Top-Down Modeling patterns are emerging within vertical-industry organizations. In these patterns, message specifications are developed with increased rigor, based on some form of information model. The technology, methodologies, and tools so developed may be accompanied by training classes to ensure consistency. In such environments, a large portion of the effort is spent on defining requirements, use cases and roles, and the information model. A Unified Modeling Language (UML) profile is often used.
This pattern provides the opportunity for standardization of elements that may be more difficult to standardize in less formal environments. A rigorous methodology can facilitate the specification of the usage of the information, and this has ramifications on the inclusion of the information elements, their cardinality, and even their semantics. As a natural consequence of this rigor, library and registry considerations arise that play a key role in the assembly of information and the definition of usage contexts.

It goes on to give HL7 as an example of such a modeling approach.

One of several organizations employing such a Top-Down Modeling pattern is the advanced Health Level 7 (HL7) organization, a health-care information standard in which, increasingly, XML is viewed primarily as an encoding technology rather than a source information model. This pattern holds much promise for increasing the precision of standards required to promote interoperability between businesses, reducing ambiguity and leading to reduced complexity and cost.

This is interesting because it illustrates how top-down modeling has a tendency to marginalize XML, making it little more than a thin skin over some other information architecture. This also puts pressure on the very foundation of XML as discussed earlier in this article. To the extent that top-down modeling optimizes for interoperability of known applications, it also compromises the need for XML to be independent of the application domain. This might bring about reduced complexity and cost in the short term, but there is always the danger of a later reckoning on the order of the COBOL crisis of the 1990s. It's important to consider that sometimes independence from specific process models is worth the cost of increased abstraction, over time. The problem with top-down modeling of XML applications is that the XML usually ends up locked up with that application, and migrating valuable information to future applications can be very painful.

Separating information from process

In some ways the vertical industries article shows a bias towards process integration, especially in its emphasis on how service-oriented architecture (SOAP and WSDL in particular) might drive XML design. This is a point of view that I think the keynote article hinted at in describing the following aspect of XML adoption:

The value of information hiding, generalization, encapsulation, and reuse in programming languages and methodologies--This work began in the early 1960s with the advent of such languages as Algol and Pascal, followed by the object-oriented approaches of Simula, Smalltalk, C++, Modula, Java, and C#. XML is a superb, common approach for defining interfaces to encapsulate abstractions.

Many XML experts (and I as well) believe that this is a very dangerous tendency that binds XML too tightly to the application domain, losing some of its key benefits. XML is perhaps better viewed as the inverse of the classic interface technology. It emphasizes opening up data rather than hiding it. Rather than representing an extension of interface methodologies that define process while hiding data, XML is suited for data-driven interchange where the content being exchanged is the basis of the contract, and each side is free to apply application semantics in their own way. Actors do have to meet business process constraints, but these should be separately expressed, and not be bound up in the XML. Such constraints can also be expressed using an XML format, of course, but this is different from the substance of the data being exchanged. The idea is that the data continues to be meaningful when industry dynamics, regulations and application technologies inevitably change. Business process and application process should probably also be separated, and it's possible that non-XML interface definition language (IDL) is a better fit for expressing the latter than XML vocabularies. The most important concern, though, is that the basic information be freed to survive all else, and this is best achieved by cleanly separating it from everything else.

I think the entire future of XML is bound up in this issue. Will XML become just a notational convenience in the latest fashions of tightly-coupled interchange technologies, or will it continue to influence the fundamental way we see such interchange? Unless industries come to appreciate XML as an important tool for separating the information from the machine, it's very likely that XML will lapse into obsolescence, because without the benefits of such separation, its costs start to weigh pretty heavily. XML is so much more bloated than traditional CORBA, EDI and ASN.1 exchange formats that if it is no more than a more readable version of these, there is little reason why it should take their place.


Wrap up

IBM developerWorks recently refreshed its "New to XML" page (see Resources). People come into the XML community every day. The ubiquity of XML leads programmers, database analysts, technical writers, systems integrators and others to run into requirements for processing XML in the normal course of their business. If XML is to stay relevant across the changing face of technology, it's important to educate newcomers about its fundamental goals. XML is about durable data, but adopting XML by itself does not necessarily make data durable. This ten-year milestone (give or take) is a good occasion to examine how to ensure that we will see the long-term benefits from having entrusted so much data to the XML sphere of technologies. I look forward to seeing further technical and non-technical assessments of XML's past, present and future over the next couple of years. And, of course, I'll continue to discuss in this column how intelligent use of XML multiplies the benefits from standardized formatting. As always, if you have thoughts on these matters, please post them on the Thinking XML forum.


Resources

Learn

  • Robin Cover of Organization for the Advancement of Structured Information Standards (OASIS): Glean the exhaustive details of XML and related technologies from this great chronicler of things XML.

  • Celebrating 10 Years of XML (Volume 45, Number 2, 2006): Read through the recent issue of IBM Systems Journal®.

  • "The Roots of SGML - A Personal Recollection" (1996), by Charles F. Goldfarb, the father of structured markup: Gain a perspective on the origins of structured markup (including the roots of HTML and XML).

  • "Principles of XML design:" For more on XML best practices, see this developerWorks series by Uche Ogbuji.

  • "XML's demi-decade:" Check out Ogbuji's ruminations on the occasion of the fifth anniversary of the XML Recommendation. In another article in the same column, "XML class warfare," he discusses the divide between data typing proponents and opponents ("bohemians and gentry").

  • "Design Principles for XML:" Review the ten principles that guided its development.

  • "New to XML" page: Check out the XML zone's updated resource central for XML. Readers of this column are probably too advanced for this page, but it's a great place to get your colleagues started.

  • developerWorks XML zone: Find more XML resources, including previous installments of the Thinking XML column. "Semantic anchors for XML" discusses top-down versus bottom-up approaches to vocabulary modeling. If you have comments on this article, or any others in this column, please post them on the Thinking XML forum.

  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

  • developerWorks technical events and webcasts: Stay current with technology in these sessions.

Get products and technologies

  • IBM trial software: Build your next development project with software tools available for download directly from developerWorks.

