XML is a well-supported Internet standard for encoding structured data in a way that can be easily decoded by practically any programming language and even read or written by humans using standard text editors. Many applications, especially modern standards-compliant Web browsers, can deal directly with XML data.
Entities in XML are used to represent specific characters (that are generally difficult or impossible to produce on a standard keyboard), to re-use snippets of XML, to organize your documents into several files, and to make it easier to write DTDs.
What are entities?
Entities are references to data; depending on the kind of entity, the XML parser replaces the entity reference with either the entity's replacement text or the contents of an external document. In this article, you look at several kinds of entities and learn how they work and how to take advantage of them in your own XML documents:
- Character entities
- Named entities
- External entities
- Parameter entities
All entities (except parameter entities) start with an ampersand (&) and end with a semicolon (;) character. The XML standard defines five standard entities that must be implemented by all XML parsers, regardless of what other entities they support:
- &apos; is an apostrophe: '
- &amp; is an ampersand: &
- &quot; is a quotation mark: "
- &lt; is a less-than symbol: <
- &gt; is a greater-than symbol: >
You will often see and use the &amp;, &lt;, and &gt; entities in XHTML and XML documents, especially those that document markup by showing examples.
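To see the standard entities at work when documenting markup, consider the following minimal sketch. It uses Python's standard library, which is my choice for illustration only (any XML parser behaves the same way for these five entities), and the sample string is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up sample: markup that should appear as literal text in a document.
sample = "<em>Q&A</em>"

# escape() replaces &, <, and > with their standard entity references.
escaped = escape(sample)

# Parsing the escaped text turns the references back into plain characters,
# so the paragraph contains text only and no <em> child element.
paragraph = ET.fromstring(
    "<p>" + escaped + " &apos;quoted&apos; &quot;text&quot;</p>"
)
round_trip = paragraph.text
```

After parsing, round_trip holds the original markup as ordinary text, which is exactly what you want when showing an example of markup inside a document.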
Character entities let you specify any Unicode character in decimal format (&#nnn;, where nnn is the decimal value of the character) or hexadecimal format (&#xhhh;, where hhh is the hexadecimal value of the character).
For example, the capital letter A is Unicode character U+0041. To represent it as a character entity, you can type &#65; (the decimal value) or &#x41; (the hexadecimal value) instead. Another, more useful character might be ©, the copyright symbol. The copyright symbol's character entity is &#169; or &#xA9; because it is Unicode character U+00A9.
Character entities are replaced first, before any other entity-related parsing happens. As far as the XML parser is concerned, character entities are exactly the same as if you'd typed the specified character directly. They're similar to trigraphs in the C programming language, just an alternate representation of a specific character.
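The following sketch shows the decimal and hexadecimal forms side by side. It assumes Python's standard-library parser, and char_refs is a hypothetical helper name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same two characters.
doc = "<p>&#65;&#x41; &#169;&#xA9;</p>"

# The parser hands back the characters themselves, not the references.
text = ET.fromstring(doc).text


def char_refs(ch):
    """Return the (decimal, hexadecimal) character references for ch.

    Hypothetical helper for illustration; not part of any XML library.
    """
    code = ord(ch)
    return "&#{};".format(code), "&#x{:X};".format(code)
```

For example, char_refs("A") gives the pair &#65; and &#x41;, both of which the parser treats exactly like a typed A.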
You should also note that even though the XML parser will have no trouble dealing with "any" character entity, your processing application and display systems might not be able to do anything useful with them. Be sure you've got access to the right fonts and encoding support systems to display any exotic characters in your documents. Most systems fall back to displaying a placeholder of some sort when an appropriate character can't be found in the current font.
Named entities, also known as internal entities in the XML specifications, are what you usually refer to when you talk about "entities." You declare them in either the DTD or the internal subset (that is, as part of the <!DOCTYPE> statement in your document) and use them in your document as references. During the XML document parsing, the entity reference is replaced by its representation.
In plain English, these entities are simply macros that get expanded when you process your document.
The XHTML specification uses named entities to represent the special characters found in the ISO 8859-1 (Latin 1) character set, but not found on most keyboards. Additional entities are used to specify special characters and symbols. These standardized entities are well-supported by Web browsers, and let you write XHTML documents using plain text editors even on systems that only support 7-bit ASCII character sets (yes, there are still people using ancient mainframe systems). Listing 1 provides a brief list of XHTML 1.0's special characters.
Listing 1. A few of XHTML 1.0's special characters
<!ENTITY ndash "&#8211;"> <!-- en dash, U+2013 ISOpub -->
<!ENTITY mdash "&#8212;"> <!-- em dash, U+2014 ISOpub -->
<!ENTITY lsquo "&#8216;"> <!-- left single quotation mark, U+2018 ISOnum -->
<!ENTITY rsquo "&#8217;"> <!-- right single quotation mark, U+2019 ISOnum -->
<!ENTITY sbquo "&#8218;"> <!-- single low-9 quotation mark, U+201A NEW -->
<!ENTITY ldquo "&#8220;"> <!-- left double quotation mark, U+201C ISOnum -->
<!ENTITY rdquo "&#8221;"> <!-- right double quotation mark, U+201D ISOnum -->
As you can see in Listing 1, which is a hunk of the XHTML 1.0 Special Characters declaration, the named entities are replaced by character entities. As you'll recall from the previous section, the character entities are the same as if you'd typed the referenced character directly into the document.
When you use &ndash; in your document, it's replaced by Unicode character U+2013, the en dash (–) character. Because the replacement text for &ndash; is a character reference, it's exactly the same as typing an en dash character.
Named entity references that appear inside an entity declaration are not expanded when the declaration is read; they are expanded only when the entity is actually referenced in the document, after the declarations have been processed. This makes it perfectly valid to have named entities that refer to other named entities. For example, Listing 2 shows two entities, one referencing the other.
Listing 2. Entities referencing entities
<!ENTITY c "Chris">
<!ENTITY ch "&c; Herborth">
Using &c; in a document will expand to Chris, and &ch; will expand to the full Chris Herborth.
Circular references are errors in entities, and your parser will happily tell you when you render your document invalid by creating one.
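If you want to watch a parser do this, here is a minimal sketch using Python's standard-library ElementTree module, which expands entities declared in the internal subset. The doc root element, the a and b entities, and the expand helper are assumptions made up for the example; the exact wording of the error message depends on the parser.

```python
import xml.etree.ElementTree as ET

# The two entities from Listing 2, declared in an internal subset.
NESTED = """<!DOCTYPE doc [
  <!ENTITY c "Chris">
  <!ENTITY ch "&c; Herborth">
]>
<doc>&ch;</doc>"""

# Two made-up entities that refer to each other: a circular reference.
CIRCULAR = """<!DOCTYPE doc [
  <!ENTITY a "&b;">
  <!ENTITY b "&a;">
]>
<doc>&a;</doc>"""


def expand(document):
    """Parse the document and return the root element's expanded text."""
    return ET.fromstring(document).text


full_name = expand(NESTED)

try:
    expand(CIRCULAR)
    circular_error = None
except ET.ParseError as err:
    # The parser refuses the document instead of expanding forever.
    circular_error = str(err)
```

Here full_name ends up as the fully expanded text, while the circular document is rejected with a parse error.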
These entity references work as you might expect in XHTML documents stored as XML files using Firefox, Safari, and Chrome, but not in Microsoft® Internet Explorer® 8. I'll discuss how to declare named entities in the Defining entities in your documents section.
External entities represent the content of an external file. This is useful when, for example, you're creating a book and want to store each chapter in its own file. You might create a set of entities like those in Listing 3.
Listing 3. External entities refer to other files
<!ENTITY chap1 SYSTEM "chapter-1.xml">
<!ENTITY chap2 SYSTEM "chapter-2.xml">
<!ENTITY chap3 SYSTEM "chapter-3.xml">
Now when you put these together in your main book XML file (see Listing 4) the contents of these files will be inserted at the reference point.
Listing 4. Putting the chapters together
<?xml version="1.0" encoding="utf-8"?>
<!-- Pull in the chapter content: -->
&chap1;
&chap2;
&chap3;
Because the contents of these files are inserted into the XML document, they must also be valid XML, and they must be balanced. That is, any element that starts in an external entity's referenced file must also end in that same file. When the XML document in Listing 4 is parsed, it will be read as one large document, containing the contents of chapter-1.xml, chapter-2.xml, and chapter-3.xml; the XML processing application doesn't care that the document was written in four separate files.
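Many parsers do not load external entities unless you ask them to. The sketch below shows one way an application might resolve them, using Python's xml.parsers.expat module. Several things here are assumptions for illustration: the book and chapter element names, the chapter contents, and the dictionary that stands in for files on disk. It handles only one level of inclusion.

```python
import xml.parsers.expat

# Stand-ins for the chapter files (hypothetical contents), kept in a
# dictionary so the sketch needs no disk access.
CHAPTER_FILES = {
    "chapter-1.xml": "<chapter>Getting started</chapter>",
    "chapter-2.xml": "<chapter>Going further</chapter>",
}

BOOK = """<!DOCTYPE book [
  <!ENTITY chap1 SYSTEM "chapter-1.xml">
  <!ENTITY chap2 SYSTEM "chapter-2.xml">
]>
<book>&chap1;&chap2;</book>"""


def parse_book(document, files):
    """Return (element names, text) seen while parsing the whole book."""
    names, text = [], []
    parser = xml.parsers.expat.ParserCreate()

    def attach(p):
        # Record every start tag and every chunk of character data.
        p.StartElementHandler = lambda name, attrs: names.append(name)
        p.CharacterDataHandler = text.append

    def on_external(context, base, system_id, public_id):
        # Parse the referenced "file" with a child parser so its content
        # is reported as if it appeared at the reference point.
        child = parser.ExternalEntityParserCreate(context)
        attach(child)
        child.Parse(files[system_id], True)
        return 1  # a nonzero result tells expat the entity was handled

    attach(parser)
    parser.ExternalEntityRefHandler = on_external
    parser.Parse(document, True)
    return names, "".join(text)


names, text = parse_book(BOOK, CHAPTER_FILES)
```

The application sees one book element containing two chapter elements, even though the chapters were stored separately.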
Parameter entities are only available inside the DTD and the internal subset of your document. They use the percent (%) symbol instead of the ampersand, and can be either named entities or external entities.
In the XHTML DTD, parameter entities are used to reference the Latin 1, Special Characters and Symbols entity sets declared in external files, and as shortcuts for re-using parts of the DTD, such as the standard set of attributes supported by every XHTML element (see Listing 5).
Listing 5. A parameter entity in the XHTML 1 DTD
<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >
<!ENTITY % i18n
 "lang        %LanguageCode; #IMPLIED
  xml:lang    %LanguageCode; #IMPLIED
  dir         (ltr|rtl)      #IMPLIED"
  >
<!ENTITY % attrs "%coreattrs; %i18n; %events;">
As you can see from Listing 5, parameter entities can refer to other parameter entities. Unlike named entities, a parameter entity must be declared before it is referenced, because parameter entity references inside a declaration are expanded as the DTD is read. That is why the declaration of attrs comes after the entities it uses (the declaration of events is not shown here).
Defining entities in your documents
Listing 6. Entity declarations in a DTD
<!-- 6.1 Named entity for site name: -->
<!ENTITY dw "developerWorks">
<!-- 6.2 External entity for re-use: -->
<!ENTITY bio SYSTEM "dw-author-bio.xml">
<!-- 6.3 Parameter entity for use in DTD -->
<!ENTITY % English "en-US|en-CA|en-UK">
When declaring a named entity, you specify the entity's name, and its replacement text. The replacement text can include character entities, named entities, elements, etc. but not parameter entities.
Listing 7. Entity declarations in the internal subset
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
  <!ENTITY test-entity "This <em>is</em> an entity.">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="application/xhtml+xml;charset=utf-8"/>
  <title>Entities in XML</title>
</head>
<body>
  <h1>Entities in XML</h1>
  <p>&test-entity;</p>
  <p>You can use it anywhere you'd use a standard XHTML entity:</p>
  <pre>&test-entity;</pre>
</body>
</html>
The XHTML document in Listing 7 (which I saved on my system as entities.xml) declares a new entity named test-entity in its internal subset. The internal subset is part of the <!DOCTYPE> declaration, after the PUBLIC and/or SYSTEM identifier for your DTD and inside a bracketed section.
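Because the replacement text of test-entity contains markup, the parser produces a real em element wherever the entity is referenced. The following sketch checks this with Python's standard-library ElementTree module. It uses a cut-down version of Listing 7 (no namespace and no external DTD reference), which is a simplification I made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified version of Listing 7: same entity, minimal surrounding markup.
DOC = """<!DOCTYPE html [
  <!ENTITY test-entity "This <em>is</em> an entity.">
]>
<html><body><p>&test-entity;</p></body></html>"""

p = ET.fromstring(DOC).find("body/p")

# The expanded entity contributes text, a child element, and trailing text.
leading_text = p.text
child = p[0]
all_text = "".join(p.itertext())
```

The paragraph ends up with the same structure it would have if you had typed the replacement text directly in place of the reference.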
Browsers don't seem to currently support external or parameter entities in the internal subset, probably as a security measure. Both types of entities might be used to create denial of service attacks or other malicious documents when used with a Web browser's rendering engine.
Using entities in your documents
You're already familiar with the mechanics of using entities in your documents; parameter entities in DTDs work the same way. The listings we've already seen show you how to use the various kinds of entities:
- character entities—Listing 1
- named entities—Listing 2 and Listing 7
- external entities—declared in Listing 3 and used in Listing 4
- parameter entities—Listing 5
You can take advantage of entities any time you end up typing the same text over and over. Good entity candidates are things like your company's official name; the name of the product you're documenting; copyright, trademark, and registered trademark notices; and your e-mail address (see Listing 8).
Listing 8. Save typing with entities
<!ENTITY co "Father Karass' Olde Tyme Steambots, LLC">
<!ENTITY prod "Semi-Autonomous Security Servant (SASSbot)">
<!ENTITY c "Copyright © 2010 &co; All Rights Reserved.">
<!ENTITY author "Chris Herborth (firstname.lastname@example.org)">
Things that might change, such as a product name, make particularly good entities, much like declaring constants in program source code. If (and when) the product name gets changed, you only need to update the entity declaration instead of having to search and replace through all of your files (see Listing 9).
Listing 9. Easily update documents in flux with entities
<!-- Current name: -->
<!ENTITY prod "Semi-Autonomous Security Servant (SASSbot)">
<!-- Old names preserved for posterity: -->
<!-- Original R&D name: -->
<!--ENTITY prod "Security Bot"-->
<!-- Marketing name v1 -->
<!--ENTITY prod "Security Servant Bot"-->
<!-- Marketing name v2 -->
<!--ENTITY prod "Autonomous Security Servant Bot"-->
XML-based standards like XHTML define a library of useful entities that make it possible to create documents with characters that you can't type directly on standard keyboards. Named entities can act like macros, letting you replace repetitive or difficult text with entity references. While Web browsers don't support external entities, you can use them to create composite documents using other XML applications, which makes it easier to standardize and re-use parts of your documents. Use parameter entities to pull external declarations into your DTD, or to create in-DTD macros to improve readability.
Declaring named entities in an XML document's DOCTYPE declaration is straightforward, and using them in the document body is something you probably already know how to do.