
Introduction to XML

Doug Tidwell (dtidwell@us.ibm.com), XML Evangelist, EMC
Senior Programmer Doug Tidwell is IBM's evangelist for Web Services. He was a speaker at the first XML conference in 1997, and has been working with markup languages for more than a decade. He holds a bachelor's degree in English from the University of Georgia and a master's degree in Computer Science from Vanderbilt University. He can be reached at dtidwell@us.ibm.com. You can also see his Web page at ibm.com/developerWorks/speakers/dtidwell/.
(An IBM developerWorks Contributing Author)

Summary:  XML, the Extensible Markup Language, has gone from the latest buzzword to an entrenched eBusiness technology in record time. This newly revised tutorial discusses what XML is, why it was developed, and how it's shaping the future of electronic commerce. It also covers a variety of important XML programming interfaces and standards, and ends with two case studies showing how companies are using XML to solve business problems.

Date:  07 Aug 2002
Level:  Introductory

Defining document content

Overview: Defining document content

So far in this tutorial you've learned about the basic rules of XML documents; that's all well and good, but you need to define the elements you're going to use to represent data. You'll learn about two ways of doing that in this section.

  • One method is to use a Document Type Definition, or DTD. A DTD defines the elements that can appear in an XML document, the order in which they can appear, how they can be nested inside each other, and other basic details of XML document structure. DTDs are part of the original XML specification and are very similar to SGML DTDs.
  • The other method is to use an XML Schema. A schema can define all of the document structures that you can put in a DTD, and it can also define data types and more complicated rules than a DTD can. The W3C published the XML Schema specification as a Recommendation in 2001, three years after the original XML 1.0 spec.

Document Type Definitions

A DTD allows you to specify the basic structure of an XML document. The next couple of sections look at fragments of DTDs. First of all, here's a DTD that defines the basic structure of the address document example in the section What is XML?:

<!-- address.dtd -->
<!ELEMENT address (name, street, city, state, postal-code)>
<!ELEMENT name (title?, first-name, last-name)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT first-name (#PCDATA)>
<!ELEMENT last-name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postal-code (#PCDATA)>
          

This DTD defines all of the elements used in the sample document. It defines three basic things:

  • An <address> element contains a <name>, a <street>, a <city>, a <state>, and a <postal-code>. All of those elements must appear, and they must appear in that order.
  • A <name> element contains an optional <title> element (the question mark means the title is optional), followed by a <first-name> and a <last-name> element.
  • All of the other elements contain text. (#PCDATA stands for parsed character data; these elements can't contain other elements.)

Although the DTD is pretty simple, it makes it clear what combinations of elements are legal. An address document that has a <postal-code> element before the <state> element isn't legal, and neither is one that has no <last-name> element.
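
To make those rules concrete, here is a minimal sketch of a document that a validating parser should accept against address.dtd. The name and address values are made up for illustration, and the DOCTYPE line assumes the DTD is saved as address.dtd next to the document:

```xml
<?xml version="1.0"?>
<!DOCTYPE address SYSTEM "address.dtd">
<address>
  <name>
    <!-- <title> is optional, so it is simply left out here -->
    <first-name>Mary</first-name>
    <last-name>McGoon</last-name>
  </name>
  <street>1401 Main Street</street>
  <city>Anytown</city>
  <state>NC</state>
  <postal-code>34829</postal-code>
</address>
```

Swapping the <state> and <postal-code> lines, or deleting the <last-name> line, would leave the document well-formed but no longer valid.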

Also, notice that DTD syntax is different from ordinary XML syntax. (XML Schema documents, by contrast, are themselves XML, which has some interesting consequences.) Despite the different syntax for DTDs, you can still put an ordinary comment in the DTD itself.


Symbols in DTDs

There are a few symbols used in DTDs to indicate how often (or whether) something may appear in an XML document. Here are some examples, along with their meanings:

  • <!ELEMENT address (name, city, state)>

    The <address> element must contain a <name>, a <city>, and a <state> element, in that order. All of the elements are required. The comma indicates a list of items.

  • <!ELEMENT name (title?, first-name, last-name)>

    This means that the <name> element contains an optional <title> element, followed by a mandatory <first-name> and a <last-name> element. The question mark indicates that an item is optional; it can appear once or not at all.

  • <!ELEMENT addressbook (address+)>

    An <addressbook> element contains one or more <address> elements. You can have as many <address> elements as you need, but there has to be at least one. The plus sign indicates that an item must appear at least once, but can appear any number of times.

  • <!ELEMENT private-addresses (address*)>

    A <private-addresses> element contains zero or more <address> elements. The asterisk indicates that an item can appear any number of times, including zero.

  • <!ELEMENT name (title?, first-name, (middle-initial | middle-name)?, last-name)>

    A <name> element contains an optional <title> element, followed by a <first-name> element, possibly followed by either a <middle-initial> or a <middle-name> element, followed by a <last-name> element. In other words, both <middle-initial> and <middle-name> are optional, and you can have only one of the two. Vertical bars indicate a list of choices; you can choose only one item from the list. Also notice that this example uses parentheses to group certain elements, and it uses a question mark against the group.

  • <!ELEMENT name ((title?, first-name, last-name) | (surname, mothers-name, given-name))>

    The <name> element can contain one of two sequences: An optional <title>, followed by a <first-name> and a <last-name>; or a <surname>, a <mothers-name>, and a <given-name>.
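
As a rough sketch of how several of these symbols combine, the document below carries its DTD in an internal subset so that it stands alone. The declarations are assembled from the examples above; the names, cities, and other values are invented for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (address+)>
  <!ELEMENT address (name, city, state)>
  <!ELEMENT name (title?, first-name, (middle-initial | middle-name)?, last-name)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT first-name (#PCDATA)>
  <!ELEMENT middle-initial (#PCDATA)>
  <!ELEMENT middle-name (#PCDATA)>
  <!ELEMENT last-name (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
]>
<addressbook>
  <!-- No title; middle-initial is the one item chosen from the group -->
  <address>
    <name>
      <first-name>Mary</first-name>
      <middle-initial>Q</middle-initial>
      <last-name>McGoon</last-name>
    </name>
    <city>Sheboygan</city>
    <state>WI</state>
  </address>
  <!-- Title present; the optional middle group is omitted entirely -->
  <address>
    <name>
      <title>Dr.</title>
      <first-name>John</first-name>
      <last-name>Doe</last-name>
    </name>
    <city>Athens</city>
    <state>GA</state>
  </address>
</addressbook>
```

An empty <addressbook/> would be invalid here because of the plus sign, and a <name> containing both <middle-initial> and <middle-name> would be invalid because of the vertical bar.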


A word about flexibility

Before going on, a quick note about designing XML document types for flexibility. Consider the sample name and address document type; I clearly wrote it with U.S. postal addresses in mind. If you want a DTD or schema that defines rules for other types of addresses, you would have to add a lot more complexity to it. Requiring a <state> element might make sense in Australia, but it wouldn't in the UK. A Canadian address might be handled by the sample DTD in Document Type Definitions, but adding a <province> element is a better idea. Finally, be aware that in many parts of the world, concepts like title, first name, and last name don't make sense.

The bottom line: If you're going to define the structure of an XML document, you should put as much forethought into your DTD or schema as you would if you were designing a database schema or a data structure in an application. The more future requirements you can foresee, the easier and cheaper it will be for you to implement them later.


Defining attributes

This introductory tutorial doesn't go into great detail about how DTDs work, but there's one more basic topic to cover here: defining attributes. You can define attributes for the elements that will appear in your XML document. Using a DTD, you can also:

  • Define which attributes are required
  • Define default values for attributes
  • List all of the valid values for a given attribute

Suppose that you want to change the DTD to make state an attribute of the <city> element. Here's how to do that:

<!ELEMENT city (#PCDATA)>
<!ATTLIST city state CDATA #REQUIRED>
          

This defines the <city> element as before, but the revised example also uses an ATTLIST declaration to list the attributes of the element. The name city inside the attribute list tells the parser that these attributes are defined for the <city> element. The name state is the name of the attribute, and the keywords CDATA and #REQUIRED tell the parser that the state attribute contains text and is required (if it's optional, CDATA #IMPLIED will do the trick).

To define multiple attributes for an element, write the ATTLIST like this:

<!ELEMENT city (#PCDATA)>
<!ATTLIST city state CDATA #REQUIRED
               postal-code CDATA #REQUIRED>
          

This example defines both state and postal-code as attributes of the <city> element.
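
A small illustration of what that declaration allows, using an internal DTD subset so the example is complete on its own (the city and the attribute values are invented):

```xml
<?xml version="1.0"?>
<!DOCTYPE city [
  <!ELEMENT city (#PCDATA)>
  <!ATTLIST city state       CDATA #REQUIRED
                 postal-code CDATA #REQUIRED>
]>
<!-- Both attributes are #REQUIRED, so both must be present -->
<city state="NC" postal-code="27709">Research Triangle Park</city>
```

Leaving out either attribute would be a validity error. Had the attributes been declared with #IMPLIED instead, a bare <city> element would also be accepted.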

Finally, DTDs allow you to define default values for attributes and enumerate all of the valid values for an attribute:

<!ELEMENT city (#PCDATA)>
<!ATTLIST city state (AZ|CA|NV|OR|UT|WA) "CA">
          

The example here indicates that it only supports addresses from the states of Arizona (AZ), California (CA), Nevada (NV), Oregon (OR), Utah (UT), and Washington (WA), and that the default state is California. Thus, you can do a very limited form of data validation. While this is a useful function, it's a small subset of what you can do with XML schemas (see XML schemas).
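
The sketch below shows the default at work. It assumes the enumerated declaration is written without a CDATA keyword, which is the form the XML specification defines for enumerated attributes; the <cities> wrapper element and the city names are hypothetical additions for the sake of a complete document:

```xml
<?xml version="1.0"?>
<!DOCTYPE cities [
  <!ELEMENT cities (city+)>
  <!ELEMENT city (#PCDATA)>
  <!ATTLIST city state (AZ|CA|NV|OR|UT|WA) "CA">
]>
<cities>
  <!-- An explicit value taken from the enumerated list -->
  <city state="NV">Reno</city>
  <!-- No state attribute: a parser that reads the DTD supplies state="CA" -->
  <city>Fresno</city>
</cities>
```

A value outside the list, such as state="TX", would be reported as a validity error.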


XML schemas

With XML schemas, you have more power to define what valid XML documents look like. They have several advantages over DTDs:

  • XML schemas use XML syntax. In other words, an XML schema is an XML document. That means you can process a schema just like any other document. For example, you can write an XSLT style sheet that converts an XML schema into a Web form complete with automatically generated JavaScript code that validates the data as you enter it.
  • XML schemas support datatypes. While DTDs do support datatypes, it's clear those datatypes were developed from a publishing perspective. XML schemas support all of the original datatypes from DTDs (things like IDs and ID references). They also support integers, floating point numbers, dates, times, strings, URLs, and other datatypes useful for data processing and validation.
  • XML schemas are extensible. In addition to the datatypes defined in the XML schema specification, you can also create your own, and you can derive new datatypes based on other datatypes.
  • XML schemas have more expressive power. For example, with XML schemas you can define that the value of any <state> attribute can't be longer than 2 characters, or that the value of any <postal-code> element must match the regular expression [0-9]{5}(-[0-9]{4})?. You can't do either of those things with DTDs.
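
As an illustrative sketch of the datatype and extensibility points, the schema below uses two built-in types and derives one new type from another. The type names stateCode and westernStateCode, and the ship-date and quantity elements, are invented for this example and are not part of the tutorial's sample documents:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- A user-defined type: any string exactly two characters long -->
  <xsd:simpleType name="stateCode">
    <xsd:restriction base="xsd:string">
      <xsd:length value="2"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- A type derived from the user-defined type above -->
  <xsd:simpleType name="westernStateCode">
    <xsd:restriction base="stateCode">
      <xsd:enumeration value="AZ"/>
      <xsd:enumeration value="CA"/>
      <xsd:enumeration value="NV"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Built-in datatypes aimed at data processing -->
  <xsd:element name="ship-date" type="xsd:date"/>
  <xsd:element name="quantity"  type="xsd:positiveInteger"/>

  <!-- An element that uses the derived type -->
  <xsd:element name="state" type="westernStateCode"/>
</xsd:schema>
```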

A sample XML schema

Here's an XML schema that matches the original name and address DTD. It adds two constraints: The value of the <state> element must be exactly two characters long and the value of the <postal-code> element must match the regular expression [0-9]{5}(-[0-9]{4})?. Although the schema is much longer than the DTD, it expresses more clearly what a valid document looks like. Here's the schema:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="address">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="name"/>
        <xsd:element ref="street"/>
        <xsd:element ref="city"/>
        <xsd:element ref="state"/>
        <xsd:element ref="postal-code"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="name">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="title" minOccurs="0"/>
        <xsd:element ref="first-name"/>
        <xsd:element ref="last-name"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="title"      type="xsd:string"/>
  <xsd:element name="first-name" type="xsd:string"/>
  <xsd:element name="last-name"  type="xsd:string"/>
  <xsd:element name="street"     type="xsd:string"/>
  <xsd:element name="city"       type="xsd:string"/>

  <xsd:element name="state">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:length value="2"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <xsd:element name="postal-code">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="[0-9]{5}(-[0-9]{4})?"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
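
For comparison, here is a sketch of an instance document that this kind of schema is meant to accept. The file name address.xsd is an assumption, and the address values are invented. Element names are case-sensitive, so the instance spells first-name and last-name exactly as the DTD does; the schema's declarations have to use the same spelling for the document to validate:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- address.xsd is a hypothetical file name for the schema above -->
<address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="address.xsd">
  <name>
    <title>Mrs.</title>
    <first-name>Mary</first-name>
    <last-name>McGoon</last-name>
  </name>
  <street>1401 Main Street</street>
  <city>Anytown</city>
  <state>NC</state>
  <postal-code>34829-1234</postal-code>
</address>
```

The xsi:noNamespaceSchemaLocation attribute is only a hint telling a schema-aware parser where to look for a schema that has no target namespace; many validators also let you supply the schema separately.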
          


Defining elements in schemas

The schema in A sample XML schema defines a number of XML elements with the <xsd:element> element. The first two elements defined, <address> and <name>, are composed of other elements. The <xsd:sequence> element defines the sequence of elements that each one contains. Here's an example:

<xsd:element name="address">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="name"/>
      <xsd:element ref="street"/>
      <xsd:element ref="city"/>
      <xsd:element ref="state"/>
      <xsd:element ref="postal-code"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
          

As in the DTD version, the XML schema example defines that an <address> contains a <name>, a <street>, a <city>, a <state>, and a <postal-code> element, in that order. Notice that the schema actually defines a new datatype with the <xsd:complexType> element.
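
The sample schema marks <title> as optional with minOccurs="0". As a sketch of how the other DTD occurrence symbols might be written in the same style, the schema below uses the minOccurs and maxOccurs attributes, both of which default to 1. The <address> declaration is a simplified placeholder so that the example stands alone:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- DTD equivalent: <!ELEMENT addressbook (address+)> -->
  <xsd:element name="addressbook">
    <xsd:complexType>
      <xsd:sequence>
        <!-- minOccurs defaults to 1, so this means "one or more" -->
        <xsd:element ref="address" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- DTD equivalent: <!ELEMENT private-addresses (address*)> -->
  <xsd:element name="private-addresses">
    <xsd:complexType>
      <xsd:sequence>
        <!-- "zero or more" -->
        <xsd:element ref="address" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- Placeholder so the references above resolve; a real schema would
       use the full <address> declaration instead -->
  <xsd:element name="address" type="xsd:string"/>
</xsd:schema>
```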

Most of the elements contain text; defining them is simple. You merely declare the new element, and give it a datatype of xsd:string:

<xsd:element name="title"      type="xsd:string"/>
<xsd:element name="first-name" type="xsd:string"/>
<xsd:element name="last-name"  type="xsd:string"/>
<xsd:element name="street"     type="xsd:string"/>
<xsd:element name="city"       type="xsd:string"/>
          


Defining element content in schemas

The sample schema defines constraints for the content of two elements: The content of a <state> element must be two characters long, and the content of a <postal-code> element must match the regular expression [0-9]{5}(-[0-9]{4})?. Here's how to do that:

  <xsd:element name="state">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:length value="2"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <xsd:element name="postal-code">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="[0-9]{5}(-[0-9]{4})?"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
          

For the <state> and <postal-code> elements, the schema defines new data types with restrictions. The first case uses the <xsd:length> element, and the second uses the <xsd:pattern> element to define a regular expression that this element must match.
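
To illustrate, the fragments below show values these two restrictions should accept and reject. They are isolated elements rather than one complete document, and the values are invented. Note that an <xsd:pattern> has to match the entire value, not just part of it:

```xml
<!-- Accepted: exactly two characters -->
<state>NC</state>

<!-- Rejected: longer than two characters -->
<state>North Carolina</state>

<!-- Accepted: five digits, optionally followed by a hyphen and four digits -->
<postal-code>27709</postal-code>
<postal-code>27709-1234</postal-code>

<!-- Rejected: too few digits, or an incomplete four-digit extension -->
<postal-code>2770</postal-code>
<postal-code>27709-12</postal-code>
```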

This summary only scratches the surface of what XML schemas can do; there are entire books written on the subject. For the purpose of this introduction, suffice it to say that XML schemas are a very powerful and flexible way to describe what a valid XML document looks like.
