Tip: Use the right pattern for simple text in RELAX NG

Picking the right data type in your XML schema

Level: Introductory

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.

28 Oct 2005

The RELAX NG XML schema language allows you to say "permit some text here" in a variety of ways. Whether you're writing patterns for elements or attributes, it is important to understand the nuances between the different patterns for character data. In this tip, Uche Ogbuji discusses the basic foundations for text in RELAX NG.

In the most common element pattern, the content is a simple text value with no special interpretation. I call this simple character data (CDATA). Simple CDATA is also the default interpretation of an XML attribute, though a schema language can override this by specifying a data type for the attribute. Listing 1 is a complete XML document that contains one element with simple CDATA, and an attribute intended to be interpreted as simple CDATA.


Listing 1. Simple XML document
<?xml version="1.0" encoding="utf-8"?>
<msg recipient="world">hello</msg>

Listing 2 and Listing 3 show two ways to define the element, using the compact syntax of the popular RELAX NG XML schema language.


Listing 2. Element definition using text pattern
element msg { text }


Listing 3. Element definition using string data type
element msg { string }

Listing 4 and Listing 5 are two ways to define the attribute.


Listing 4. Attribute definition using text pattern
attribute recipient { text }


Listing 5. Attribute definition using string data type
attribute recipient { string }
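
As a sketch, the element and attribute definitions can be combined into a single compact-syntax schema for the document in Listing 1. This version assumes the text pattern from Listing 2 and Listing 4; the string variants from Listing 3 and Listing 5 would slot in the same way.

```rnc
# Schema for Listing 1: a msg element with a recipient attribute
# and simple text content
element msg {
  attribute recipient { text },
  text
}
```

A RELAX NG processor that supports the compact syntax, such as Jing, can check Listing 1 against a schema like this one.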



The full range of options

The text pattern and the built-in string data type are not the only options for expressing simple CDATA. You have the following options:

  • text -- fundamental simple text pattern
  • token -- simple string data type with whitespace normalization
  • string -- simple string data type without whitespace normalization

In addition, you can use xsd:string, which is available in RELAX NG engines that support W3C XML Schema (WXS) data types. xsd:string is nearly identical in basic behavior to string; the only difference between the two is that xsd:string allows you to add further restrictions to the string, while string does not. Such restrictions are expressed as data type facets. Examples include limiting the length of the string or forcing the string to match a regular expression. These restrictions are beyond the scope of this article, so I shall focus only on the patterns that come standard with RELAX NG -- the three shown in the above list.
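
For orientation only, here is roughly what such facets look like in the compact syntax. The element name code, the length limit, and the regular expression are all made up for illustration, and the sketch assumes a processor with WXS data type support.

```rnc
# Hypothetical element: at most 8 characters, uppercase ASCII letters only
element code {
  xsd:string { maxLength = "8" pattern = "[A-Z]+" }
}
```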

text pattern

Use the text pattern by default -- in most cases this is just what you want. It matches zero or more text nodes (contiguous text fragments). You might on occasion need to consider what it means to match multiple text fragments when you are dealing with text in combination with combining patterns such as interleave, but treatment of these patterns is beyond the scope of this article. Another reason text patterns are attractive is that they are not subject to some of the arcane restrictions that RELAX NG places on data types. When in doubt, use text.

token data type

In most cases, you use the token data type in association with RELAX NG enumerations, which are defined using the value pattern as in Listing 6.


Listing 6. Enumeration example
attribute poll-response { "yes" | "no" | "unsure" }

In this example, any of the specified values is allowed, and these values belong to the token data type. This means that whitespace is normalized when checking the values; for example, " yes " (notice the extra space characters around the value) would be a valid match. Use the token data type when you need such normalization.
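
Listing 6 relies on a default: in the compact syntax, a literal value with no data type name in front of it is treated as a token. A sketch of the same enumeration with the data type written out:

```rnc
# Equivalent to Listing 6, with the token data type made explicit.
# Matches poll-response="yes" and also poll-response=" yes ".
attribute poll-response {
  token "yes" | token "no" | token "unsure"
}
```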

string data type

Sometimes you may want to express an enumeration without any whitespace normalization. In this case, you specify the values that should not be normalized using the string data type. Listing 7 is similar to Listing 6 except that no normalization is applied when checking any of the values.


Listing 7. Enumeration example using string data type
attribute poll-response {
  string "yes" | string "no" | string "unsure"
}

In this case, " yes " would not be a valid match. Use the string data type for enumerations and other simple data typing situations where you do not want whitespace normalization.



Wrap-up

To answer a commonly asked question: all of the options discussed in this article allow empty strings. If you want to mandate that a text value cannot be empty, the usual way is to specify xsd:string with a minimum length facet. text and string also offer no special treatment for text nodes consisting only of whitespace.
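
A minimal sketch of the non-empty case, applied to the msg element from Listing 1 and assuming a processor that supports WXS data types:

```rnc
# msg must contain at least one character,
# so <msg/> and <msg></msg> are rejected
element msg {
  xsd:string { minLength = "1" }
}
```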

Text in XML is one of those topics that you might expect to be simple, but that becomes unexpectedly complex once you consider the diversity of users' subtle attitudes. RELAX NG's treatment of this topic is actually quite straightforward compared to other XML technologies. The rule of thumb "Use the text pattern unless you have good reason not to" is a good start toward avoiding surprises when dealing with text-based patterns in RELAX NG.



Resources

Get products and technologies
  • Jing: Try out the examples in this article using this RELAX NG processor that supports compact syntax.



About the author

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.



