Tip: Use the right pattern for simple text in RELAX NG

Picking the right data type in your XML schema

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.

Summary:  The RELAX NG XML schema language allows you to say "permit some text here" in a variety of ways. Whether you're writing patterns for elements or attributes, it is important to understand the nuances between the different patterns for character data. In this tip, Uche Ogbuji discusses the basic foundations for text in RELAX NG.

Date:  28 Oct 2005
Level:  Introductory

In the most common element pattern, the content is a simple text value with no special interpretation. I call this simple character data (CDATA). Simple CDATA is also the default interpretation of an XML attribute, though a schema language can override this by specifying a data type for the attribute. Listing 1 is a complete XML document that contains one element with simple CDATA, and an attribute intended to be interpreted as simple CDATA.


Listing 1. Simple XML document
<?xml version="1.0" encoding="utf-8"?>
<msg recipient="world">hello</msg>

Using the compact syntax for the popular RELAX NG XML schema language, Listing 2 and Listing 3 are two ways to define the element.


Listing 2. Element definition using text pattern
element msg { text }


Listing 3. Element definition using string data type
element msg { string }

Listing 4 and Listing 5 are two ways to define the attribute.


Listing 4. Attribute definition using text pattern
attribute recipient { text }


Listing 5. Attribute definition using string data type
attribute recipient { string }


The full range of options

The text pattern and the built-in string data type are not the only options for expressing simple CDATA. You have the following options:

  • text -- fundamental simple text pattern
  • token -- simple string data type with whitespace normalization
  • string -- simple string data type without whitespace normalization

In addition, you can use xsd:string, which is available in RELAX NG engines that support W3C XML Schema (WXS) data types. xsd:string is nearly identical in basic behavior to string; the only difference is that xsd:string allows you to add further restrictions to the string, while string does not. Such restrictions are expressed as data type facets; examples include limiting the length of the string or forcing the string to match a regular expression. These restrictions are beyond the scope of this article, so I shall focus only on the patterns that come standard with RELAX NG -- the three shown in the above list.

text pattern

Use the text pattern by default -- in most cases this is just what you want. It matches zero or more text nodes (contiguous text fragments). You might on occasion need to consider what it means to match multiple text fragments when you use text together with combining patterns such as interleave, but treatment of these patterns is beyond the scope of this article. Another reason text patterns are attractive is that they are not subject to some of the arcane restrictions that RELAX NG places on data types. When in doubt, use text.

token data type

In most cases, you use the token data type in association with RELAX NG enumerations, which are defined using the value pattern as in Listing 6.


Listing 6. Enumeration example
attribute poll-response { "yes" | "no" | "unsure" }

In this example any of the specified values is allowed, and these values belong to the token data type. This means that whitespace is normalized when checking the values; for example, " yes " (notice the extra space characters around the value) would be a valid match. Use the token data type when you need such normalization.
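To make the normalization concrete, here is a minimal Python sketch of how a token-style comparison behaves. It is an illustration only, not code from any RELAX NG processor: the helper names normalize_token and matches_token_enum are hypothetical, and the sketch assumes that normalization means stripping leading and trailing XML whitespace and collapsing internal runs to a single space.

```python
import re
from typing import Sequence

# XML whitespace characters: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")

def normalize_token(value: str) -> str:
    """Collapse runs of XML whitespace to one space, then trim both ends."""
    return _XML_WS.sub(" ", value).strip(" ")

def matches_token_enum(value: str, allowed: Sequence[str]) -> bool:
    """Hypothetical helper: compare values after token-style normalization."""
    normalized = normalize_token(value)
    return any(normalized == normalize_token(a) for a in allowed)

# The enumeration from Listing 6.
POLL_RESPONSES = ("yes", "no", "unsure")

print(matches_token_enum(" yes ", POLL_RESPONSES))  # True
print(matches_token_enum("maybe", POLL_RESPONSES))  # False
```

The surrounding spaces in " yes " disappear before the comparison, which is why the value is accepted.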

string data type

Sometimes you may want to express an enumeration without any whitespace normalization. In this case, you specify the values that should not be normalized using the string data type. Listing 7 is similar to Listing 6 except that no normalization is applied when checking any of the values.


Listing 7. Enumeration example using string data type
attribute poll-response {
  string "yes" | string "no" | string "unsure"
}

In this case, " yes " would not be a valid match. Use the string data type for enumerations and other simple data typing situations where you do not want whitespace normalization.
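The exact matching that Listing 7 asks for with string can be sketched in Python as a plain character-for-character comparison. This is an illustrative sketch rather than how any particular validator is implemented, and matches_string_enum is a hypothetical name.

```python
from typing import Sequence

def matches_string_enum(value: str, allowed: Sequence[str]) -> bool:
    """Hypothetical helper: accept a value only if it equals an allowed value exactly."""
    return any(value == a for a in allowed)

# The enumeration from Listing 7.
POLL_RESPONSES = ("yes", "no", "unsure")

for candidate in ("yes", " yes ", "yes\n"):
    print(repr(candidate), matches_string_enum(candidate, POLL_RESPONSES))
# 'yes' True
# ' yes ' False
# 'yes\n' False
```

No whitespace is removed before the comparison, so any stray space or line break makes the value fail.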


Wrap-up

To answer a commonly-asked question, all of the options discussed in this article allow empty strings. If you want to mandate that a text value cannot be empty, the usual way is to specify xsd:string with a minimum length facet. text and string also offer no special treatment for text nodes consisting only of whitespace.
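As a rough illustration of what a minimum length restriction checks, the Python sketch below counts characters without trimming anything. It assumes a pattern along the lines of xsd:string { minLength = "1" } and relies on the fact that the W3C XML Schema string type preserves whitespace. The function has_min_length is a hypothetical stand-in, not part of any RELAX NG tool.

```python
def has_min_length(value: str, min_length: int = 1) -> bool:
    """Length check without trimming, in the spirit of xsd:string { minLength = "1" }."""
    return len(value) >= min_length

print(has_min_length(""))       # False
print(has_min_length("hello"))  # True
print(has_min_length("   "))    # True: whitespace-only text still has length 3
```

Under this assumption a minimum length rules out the empty string, but a value made only of spaces still passes, which is consistent with whitespace-only text receiving no special treatment.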

Text in XML is one of those topics that you might expect to be simple, but becomes unexpectedly complex once you consider the diversity of users' subtle attitudes. RELAX NG's treatment of this topic is actually quite straightforward compared to other XML technologies. The rule of thumb "Use the text pattern unless you have good reason not to" is a good start toward avoiding surprises when dealing with text-based patterns in RELAX NG.


Resources

Get products and technologies

  • Jing: Try out the examples in this article using this RELAX NG processor that supports compact syntax.
