Add entities in XML

Create text macros for your documents

Many developers use entities in their XHTML for special characters, but in XML you can also define entities to make authoring easier, or to reference the content of external documents. Entities are also useful when you create a Document Type Definition (DTD) and want to reduce its apparent complexity to keep it readable by humans. This article will tell you all about XML entities and show you how to take advantage of them in your documents.

Chris Herborth (chrish@pobox.com), Freelance

Chris Herborth is an award-winning senior technical writer and software developer with more than 15 years of experience writing about operating systems and programming. When he's not playing with his son Alex or hanging out with his wife Lynette, Chris spends his spare time designing, writing, and researching (that is, playing) video games. He doesn't play World of Warcraft.



16 February 2010


Introduction

Frequently used acronyms

  • ASCII: American Standard Code for Information Interchange
  • DOM: Document Object Model
  • DTD: Document Type Definition
  • HTML: Hypertext Markup Language
  • W3C: World Wide Web Consortium
  • XHTML: Extensible Hypertext Markup Language
  • XML: Extensible Markup Language

XML is a well-supported Internet standard for encoding structured data in a way that can be easily decoded by practically any programming language and even read or written by humans using standard text editors. Many applications, especially modern standards-compliant Web browsers, can deal directly with XML data.

Entities in XML are used to represent specific characters (that are generally difficult or impossible to produce on a standard keyboard), to re-use snippets of XML, to organize your documents into several files, and to make it easier to write DTDs.


What are entities?

Entities are references to data; depending on the kind of entity, the XML parser replaces the entity reference with either the entity's replacement text or the contents of an external document. In this article, you look at several kinds of entities and learn how they work and how to take advantage of them in your own XML documents:

  • Character entities
  • Named entities
  • External entities
  • Parameter entities

All entities (except parameter entities) start with an ampersand (&) and end with a semicolon (;) character. The XML standard defines five standard entities that must be implemented by all XML parsers, regardless of what other entities they support:

  • &apos; is an apostrophe: '
  • &amp; is an ampersand: &
  • &quot; is a quotation mark: "
  • &lt; is a less-than symbol: <
  • &gt; is a greater-than symbol: >

You will often see and use the &amp;, &lt;, and &gt; entities in XHTML and XML documents, especially those that document markup by showing examples.
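
If you want to see the predefined entities at work, here is a minimal sketch in Python using only the standard library. The sample string and the helper names (predefined_demo, escape_for_xml) are made up for illustration; the point is that no DTD or declaration is needed for these five references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A document that uses all five predefined entities and declares nothing.
SOURCE = "<p>&lt;em&gt; &amp; &apos;single&apos; &quot;double&quot;</p>"


def predefined_demo():
    # The parser replaces each reference with the character it stands for.
    return ET.fromstring(SOURCE).text


def escape_for_xml(text):
    # escape() handles &, < and >; the extra mapping adds the two quote characters.
    return escape(text, {"'": "&apos;", '"': "&quot;"})


print(predefined_demo())                  # <em> & 'single' "double"
print(escape_for_xml(predefined_demo()))  # the content of SOURCE again
```

Going the other way with escape_for_xml is how you would prepare example markup for inclusion in a document that shows markup to its readers.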


Character entities

Character entities let you specify any Unicode character in decimal format (&#nnn;, where nnn is the decimal value of the character) or hexadecimal format (&#xhhh;, where hhh is the hexadecimal value of the character).

For example, the capital letter A is Unicode character U+0041. If you want to represent it as a character entity, you can type &#65; (the decimal value) or &#x41; (the hexadecimal value) instead. Another, more useful character might be ©, the copyright symbol. The copyright symbol's character entity is &#169; or &#xa9; because it is Unicode character U+00A9.

Character entities are replaced first, before any other entity-related parsing happens. As far as the XML parser is concerned, character entities are exactly the same as if you'd typed the specified character directly. They're similar to trigraphs in the C programming language, just an alternate representation of a specific character.
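
A quick way to convince yourself that a character entity really is the same as the typed character is to parse both forms and compare. This is a small sketch with Python's standard library; text_of and char_refs are hypothetical helper names, not part of any XML tool.

```python
import xml.etree.ElementTree as ET


def text_of(xml_source):
    # Parse a one-element document and return its character data.
    return ET.fromstring(xml_source).text


def char_refs(character):
    # Build the decimal and hexadecimal references for a single character.
    code = ord(character)
    return f"&#{code};", f"&#x{code:x};"


decimal = text_of("<p>&#65; &#169;</p>")
hexadecimal = text_of("<p>&#x41; &#xa9;</p>")
literal = text_of("<p>A \u00a9</p>")

# All three spellings produce identical character data.
print(decimal == hexadecimal == literal)  # True
print(char_refs("\u00a9"))                # ('&#169;', '&#xa9;')
```

The application reading the parsed document has no way to tell which spelling the author used.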

You should also note that even though the XML parser will have no trouble dealing with "any" character entity, your processing application and display systems might not be able to do anything useful with them. Be sure you've got access to the right fonts and encoding support systems to display any exotic characters in your documents. Most systems fall back to displaying a placeholder of some sort when an appropriate character can't be found in the current font.


Named entities

Named entities, also known as internal entities in the XML specifications, are what you usually refer to when you talk about "entities." You declare them in either the DTD or the internal subset (that is, as part of the <!DOCTYPE> statement in your document) and use them in your document as references. During the XML document parsing, the entity reference is replaced by its representation.

In plain English, these entities are simply macros that get expanded when you process your document.

The XHTML specification uses named entities to represent the special characters found in the ISO 8859-1 (Latin 1) character set, but not found on most keyboards. Additional entities are used to specify special characters and symbols. These standardized entities are well-supported by Web browsers, and let you write XHTML documents using plain text editors even on systems that only support 7-bit ASCII character sets (yes, there are still people using ancient mainframe systems). Listing 1 provides a brief list of XHTML 1.0's special characters.

Listing 1. A few of XHTML 1.0's special characters
<!ENTITY ndash   "&#8211;"> <!-- en dash, U+2013 ISOpub -->
<!ENTITY mdash   "&#8212;"> <!-- em dash, U+2014 ISOpub -->
<!ENTITY lsquo   "&#8216;"> <!-- left single quotation mark,
                                    U+2018 ISOnum -->
<!ENTITY rsquo   "&#8217;"> <!-- right single quotation mark,
                                    U+2019 ISOnum -->
<!ENTITY sbquo   "&#8218;"> <!-- single low-9 quotation mark, 
                                    U+201A NEW -->
<!ENTITY ldquo   "&#8220;"> <!-- left double quotation mark,
                                    U+201C ISOnum -->
<!ENTITY rdquo   "&#8221;"> <!-- right double quotation mark,
                                    U+201D ISOnum -->

As you can see in Listing 1, which is a hunk of the XHTML 1.0 Special Characters declaration, the named entities are replaced by character entities. As you'll recall from the previous section, the character entities are the same as if you'd typed the referenced character directly into the document.

When you use &ndash; in your document, it's replaced by Unicode character U+2013, the en dash (–) character. Because the replacement text for &ndash; is a character reference, it's exactly the same as typing an en dash character.
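
Python's standard library happens to ship a table of the HTML 4 entity names, which are the same names XHTML 1.0 uses, so you can cross-check the code points in Listing 1. The sketch below rebuilds the declarations from that table; the declaration helper and the NAMES list are only for illustration.

```python
import html
from html.entities import name2codepoint

# The entity names that appear in Listing 1.
NAMES = ["ndash", "mdash", "lsquo", "rsquo", "sbquo", "ldquo", "rdquo"]


def declaration(name):
    # Rebuild an ENTITY declaration whose replacement text is a character reference.
    return f'<!ENTITY {name} "&#{name2codepoint[name]};">'


for name in NAMES:
    print(declaration(name), f"U+{name2codepoint[name]:04X}")

# Expanding the named entity gives the same character as typing it directly.
print(html.unescape("&ndash;") == "\u2013")  # True
```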

References to named entities inside a declaration are not expanded when the declaration is read; they are expanded only when the entity is actually used in the body of the document. This makes it perfectly valid to have named entities that refer to other named entities, because all of the entities will be declared before any of them are expanded. For example, Listing 2 shows two entities, one referencing the other.

Listing 2. Entities referencing entities
<!ENTITY c "Chris">
<!ENTITY ch "&c; Herborth">

Using &c; in a document will expand to Chris, and &ch; will expand to the full Chris Herborth.
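
To try this out, you can wrap the declarations from Listing 2 in an internal subset and let a parser do the expansion. The sketch below uses Python's ElementTree; the names root element and the expand_names helper are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# Listing 2's declarations, placed in the internal subset of a tiny document.
DOC = """<!DOCTYPE names [
  <!ENTITY c "Chris">
  <!ENTITY ch "&c; Herborth">
]>
<names><first>&c;</first><full>&ch;</full></names>"""


def expand_names(source):
    root = ET.fromstring(source)
    # Collect the fully expanded text of each child element.
    return {child.tag: child.text for child in root}


print(expand_names(DOC))  # {'first': 'Chris', 'full': 'Chris Herborth'}
```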

Circular references are errors: an entity must not refer to itself, either directly or through another entity. Your parser will happily tell you when you break your document by creating one.
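
Here is a small sketch of what that looks like in practice, again with Python's ElementTree. The entities a and b and the is_rejected helper are invented for the example; the exact error message depends on your parser, so the sketch only checks that parsing fails.

```python
import xml.etree.ElementTree as ET

# Two entities that refer to each other.
CIRCULAR = """<!DOCTYPE r [
  <!ENTITY a "&b;">
  <!ENTITY b "&a;">
]>
<r>&a;</r>"""


def is_rejected(source):
    # Return True when the parser refuses the document.
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return True
    return False


print(is_rejected(CIRCULAR))     # True
print(is_rejected("<r>ok</r>"))  # False
```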

These entity references work as you might expect in XHTML documents stored as XML files using Firefox, Safari, and Chrome, but not in Microsoft® Internet Explorer® 8. I'll discuss how to declare named entities in the Defining entities in your documents section.


External entities

External entities represent the content of an external file. This is useful when, for example, you're creating a book and want to store each chapter in its own file. You might create a set of entities like those in Listing 3.

Listing 3. External entities refer to other files
<!ENTITY chap1 SYSTEM "chapter-1.xml">
<!ENTITY chap2 SYSTEM "chapter-2.xml">
<!ENTITY chap3 SYSTEM "chapter-3.xml">

Now when you put these together in your main book XML file (see Listing 4) the contents of these files will be inserted at the reference point.

Listing 4. Putting the chapters together
<?xml version="1.0" encoding="utf-8"?>
<!-- Pull in the chapter content: -->
&chap1;
&chap2;
&chap3;

Because the contents of these files are inserted into the XML document, they must also be valid XML, and they must be balanced. That is, any element that starts in an external entity's referenced file must also end in that same file. When the XML document in Listing 4 is parsed, it will be read as one large document, containing the contents of chapter-1.xml, chapter-2.xml, and chapter-3.xml; the XML processing application doesn't care that the document was written in four separate files.
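
The following sketch shows one way an application can stitch external entities together, using the expat bindings in Python's standard library. Several things here are assumptions made so the example runs on its own: the chapters live in an in-memory dictionary instead of real files, the book document has a DOCTYPE and a book root element, and the chapter contents are placeholders.

```python
import xml.parsers.expat

# Stand-ins for the chapter files, keyed by the system identifier.
CHAPTERS = {
    "chapter-1.xml": "<chapter>One</chapter>",
    "chapter-2.xml": "<chapter>Two</chapter>",
}

# The <book> root element and the DOCTYPE are assumptions added for this sketch.
BOOK = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE book [
  <!ENTITY chap1 SYSTEM "chapter-1.xml">
  <!ENTITY chap2 SYSTEM "chapter-2.xml">
]>
<book>&chap1;&chap2;</book>"""


def parse_book(source):
    elements, text = [], []
    parser = xml.parsers.expat.ParserCreate()

    def external_entity(context, base, system_id, public_id):
        # A child parser reads the referenced content and reuses the parent's handlers.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(CHAPTERS[system_id], True)
        return 1  # a nonzero value tells expat the entity was handled

    parser.StartElementHandler = lambda name, attrs: elements.append(name)
    parser.CharacterDataHandler = text.append
    parser.ExternalEntityRefHandler = external_entity
    parser.Parse(source, True)
    return elements, "".join(text)


print(parse_book(BOOK))  # (['book', 'chapter', 'chapter'], 'OneTwo')
```

The handlers see one continuous stream of elements and text, which is the "one large document" behavior described above.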


Parameter entities

Parameter entities are only available inside the DTD and the internal subset of your document. They use the percent (%) symbol instead of the ampersand, and can be either named entities or external entities.

In the XHTML DTD, parameter entities are used to reference the Latin 1, Special Characters and Symbols entity sets declared in external files, and as shortcuts for re-using parts of the DTD, such as the standard set of attributes supported by every XHTML element (see Listing 5).

Listing 5. A parameter entity in the XHTML 1 DTD
<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >
<!ENTITY % i18n
 "lang        %LanguageCode; #IMPLIED
  xml:lang    %LanguageCode; #IMPLIED
  dir         (ltr|rtl)      #IMPLIED"
  >
<!ENTITY % attrs "%coreattrs; %i18n; %events;">

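As you can see from Listing 5, parameter entities can refer to other parameter entities. Unlike named entities, a parameter entity must be declared before it is referenced: in the XHTML DTD itself, the declarations of coreattrs, i18n, and events all appear before the declaration of attrs that uses them.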
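
To get a feel for how nested parameter entities flatten out, here is a toy expander in Python. It is not a DTD parser and ignores details such as the whitespace a real parser adds around expanded references; the PARAMS table is an abridged, simplified version of Listing 5, and the expand function is a hypothetical helper.

```python
import re

# Abridged replacement text, loosely based on Listing 5.
PARAMS = {
    "attrs": "%coreattrs; %i18n;",
    "coreattrs": "id ID #IMPLIED",
    "i18n": "dir (ltr|rtl) #IMPLIED",
}

# Matches a parameter entity reference such as %attrs;
REFERENCE = re.compile(r"%([A-Za-z_:][\w.:-]*);")


def expand(text, params):
    # Replace each reference with its own expanded replacement text.
    # There is no protection against circular references in this sketch.
    return REFERENCE.sub(lambda match: expand(params[match.group(1)], params), text)


print(expand("%attrs;", PARAMS))  # id ID #IMPLIED dir (ltr|rtl) #IMPLIED
```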


Defining entities in your documents

As you've probably noticed already, entities are defined using the ENTITY declaration, either as part of an external DTD (see Listing 6) or as part of your document's internal subset (see Listing 7).

Listing 6. Entity declarations in a DTD
<!-- 6.1 Named entity for site name: -->
<!ENTITY dw "developerWorks">

<!-- 6.2 External entity for re-use: -->
<!ENTITY bio SYSTEM "dw-author-bio.xml">

<!-- 6.3 Parameter entity for use in DTD -->
<!ENTITY % English "en-US|en-CA|en-GB">

When declaring a named entity, you specify the entity's name, and its replacement text. The replacement text can include character entities, named entities, elements, etc. but not parameter entities.

Listing 7. Entity declarations in the internal subset
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html 
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
    <!ENTITY test-entity "This <em>is</em> an entity.">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml;charset=utf-8"/>
    <title>Entities in XML</title>
</head>
<body>
    <h1>Entities in XML</h1>

    <p>&test-entity;</p>

    <p>You can use it anywhere you'd use a standard XHTML entity:</p>

    <pre>&test-entity;</pre>
</body>
</html>

The XHTML document in Listing 7 (which I saved on my system as entities.xml) declares a new entity named test-entity in its internal subset. The internal subset is part of the <!DOCTYPE> declaration, after the PUBLIC and/or SYSTEM identifier for your DTD and inside a bracketed section.
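
If you'd like to check what a parser makes of an entity whose replacement text contains markup, the sketch below parses a cut-down version of Listing 7 with Python's ElementTree. The single p element and the describe helper are simplifications for the example, not the full XHTML page.

```python
import xml.etree.ElementTree as ET

# The test-entity declaration from Listing 7 in a minimal document.
DOC = """<!DOCTYPE p [
    <!ENTITY test-entity "This <em>is</em> an entity.">
]>
<p>&test-entity;</p>"""


def describe(source):
    p = ET.fromstring(source)
    em = p[0]  # the <em> element came from the entity's replacement text
    return p.text, em.tag, em.text, em.tail


print(describe(DOC))  # ('This ', 'em', 'is', ' an entity.')
print(ET.tostring(ET.fromstring(DOC), encoding="unicode"))
# <p>This <em>is</em> an entity.</p>
```

The em element in the replacement text becomes a real element in the parsed tree, just as if you had typed it inside the paragraph.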

Browsers don't seem to currently support external or parameter entities in the internal subset, probably as a security measure. Both types of entities might be used to create denial of service attacks or other malicious documents when used with a Web browser's rendering engine.


Using entities in your documents

You're already familiar with the mechanics of using entities in your documents; parameter entities in DTDs work the same way. The listings you've already seen show how to use the various kinds of entities.

You can take advantage of entities any time you end up typing the same text over and over. Good entity candidates are things like your company's official name; the name of the product you're documenting; copyright, trademark, and registered trademark notices; and your e-mail address (see Listing 8).

Listing 8. Save typing with entities
<!ENTITY co "Father Karass' Olde Tyme Steambots, LLC">
<!ENTITY prod "Semi-Autonomous Security Servant (SASSbot)">
<!ENTITY c "Copyright &copy; 2010 &co; All Rights Reserved.">
<!ENTITY author "Chris Herborth (chrish@pobox.com)">

Things that might change, such as a product name, make particularly good entities, much like declaring constants in program source code. If (and when) the product name gets changed, you only need to update the entity declaration instead of having to search and replace through all of your files (see Listing 9).

Listing 9. Easily update documents in flux with entities
<!-- Current name: -->
<!ENTITY prod "Semi-Autonomous Security Servant (SASSbot)">

<!-- Old names preserved for posterity: -->

<!-- Original R&D name: -->
<!--ENTITY prod "Security Bot"-->

<!-- Marketing name v1 -->
<!--ENTITY prod "Security Servant Bot"-->

<!-- Marketing name v2 -->
<!--ENTITY prod "Autonomous Security Servant Bot"-->
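
The "constant" behavior is easy to demonstrate: change one declaration and every reference follows. This sketch uses Python's ElementTree; the manual document, its element names, and the render helper are invented for the example, and the name you pass in must not contain characters such as a double quote, &, %, or < that would break the declaration.

```python
import xml.etree.ElementTree as ET

# A tiny manual that mentions the product twice but declares its name once.
TEMPLATE = """<!DOCTYPE manual [
  <!ENTITY prod "{name}">
]>
<manual><title>&prod; Guide</title><p>Thank you for buying the &prod;.</p></manual>"""


def render(name):
    # Fill in the single declaration, then let the parser expand every reference.
    root = ET.fromstring(TEMPLATE.format(name=name))
    return [child.text for child in root]


print(render("Security Bot"))
# ['Security Bot Guide', 'Thank you for buying the Security Bot.']
print(render("Semi-Autonomous Security Servant (SASSbot)"))
```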

Summary

XML-based standards like XHTML define a library of useful entities that make it possible to create documents with characters that you can't type directly on standard keyboards. Named entities can act like macros, letting you replace repetitive or difficult text with entity references. While Web browsers don't support external entities, you can use them to create composite documents using other XML applications, which makes it easier to standardize and re-use parts of your documents. Use parameter entities to pull external declarations into your DTD, or to create in-DTD macros to improve readability.

Declaring named entities in an XML document's DOCTYPE declaration is straightforward, and using them in the document body is something you probably already know how to do.
