Using Emacs for XML documents

Install add-ons to the powerful Emacs text editor to build a platform-independent (and free) environment for working with XML

Though it's best known as a powerful text editor favored by UNIX developers, Emacs can also be used to work with XML on non-UNIX platforms such as Windows, MS-DOS, and MacOS. Emacs (see the sidebar Emacs in a nutshell) works as a full-blown development environment for processing text, writing applications, and, as I'll discuss, creating structured information like XML and SGML. I use it as a general-purpose editor for creating and managing some of my programming projects, and for writing XHTML and playing around with SGML and XML. In fact, I used it to write this article.

This article tells how to install Emacs and the extensions PSGML and OpenSP. It also outlines how to customize Emacs to make it function with a variety of DTDs. I present many of the Emacs customizations one piece at a time. However, you can download a zip file with sample DTDs and all of the Emacs customizations (see Related topics). My intent is to get you started using Emacs by providing you with just enough information for you to understand what's going on. Then you'll be able to add DTDs and customize Emacs based on your needs and preferences.

Getting and installing Emacs

Start by installing Emacs. You can access additional Emacs information and distributions from the GNU Web site or its mirrors (see Related topics). Some UNIX-based distributions come with Emacs. For example, my Red Hat Linux 7.1 came with Emacs version 20.5.1 (and an older version of PSGML) already installed.

Linux and UNIX installation

Because most UNIX and Linux users are savvy enough to get and install software without any guidance from me, I'll just direct you to the GNU project site. The customizations I describe in the rest of the article also apply to UNIX/Linux environments.

Windows installation

Windows users can find the latest binary distribution in the windows/emacs/ directory of any of the FTP sites listed on the GNU FTP list. The emacs-20.7-bin-i386.tar.gz file does not include the Emacs Lisp source. If you're interested in programming Emacs or seeing how particular functions are implemented, download the emacs-20.7-fullbin-i386.tar.gz file instead. (Editor's note: A newer version, version 21.1, was released in late October, while this article was in production. This article is based on the 20.7 version and will be updated to include details on the new version.) Download the .gz file to your local hard drive. Use WinZip or some other .gz-aware tool to extract the contents to a directory structure on your hard drive (make sure you "retain folder information" when you extract so the appropriate directory structure is created). If you allow the original directory structure to be created, you will end up with a base path of d:\emacs-20.7, where d: is the drive on which you unpacked the distribution. For the remainder of this article, I'll refer to this directory as d:\Emacs. The readme suggests that you avoid spaces in your install path. I'd heed this warning.

After you've unpacked the distribution, there will be a number of files and four sub-directories: bin, etc, info, and lisp under the main directory. The README.W32 file contains information on obtaining future distributions, setting up Emacs, and so on. (The README file also includes a URL for the FAQ for GNU Emacs on Windows 95/98/ME, and 2000.) Though it is not required, I suggest that you run the addpm.exe file in the bin subdirectory to register Emacs so that it's accessible from your Start menu. Once it is installed, select Start->Gnu Emacs->Emacs. If you opt not to register Emacs, start it up by double clicking the runemacs.exe file installed in the d:\Emacs\bin directory.

You can take a tutorial by starting Emacs and selecting Help->Emacs Tutorial. Don't get discouraged by the fact that you must use control-key sequences for many of the functions. You can begin by learning a few commonly used control-key sequences and learn new ones as you find you need them. Besides, in the GUI version of Emacs, many functions are accessible from menus. See Related topics for a couple of suggestions for other tutorials on Emacs and PSGML.

Customizing Emacs
The next step is to customize Emacs as necessary. Customization can include:

  • Setting variables to control various behaviors
  • Adding packages
  • Writing your own Emacs Lisp code

So first I'll cover how to set variables and add packages.

Your first step is to access an Emacs initialization file, which Emacs looks for in your home directory. In a UNIX environment, the initialization file is typically named .emacs.

On Windows, I use a file named _emacs since Windows doesn't generally like filenames that start with a period. On Windows, you specify the home directory by setting an environment variable or by setting a registry entry. As a last resort, Emacs looks for the initialization file in the directory c:\. (So for now, either create this file in c:\, or consult the GNU Emacs FAQ For Windows (see Related topics) for other options.)
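
If you are unsure which directory Emacs treats as home, a quick check from inside Emacs can help. This is only a sketch: it assumes you type the expressions into the *scratch* buffer and evaluate each one with [Ctrl]-j, and the values you see depend entirely on your own system.

```elisp
;; Value of the HOME environment variable as Emacs sees it (nil if unset).
(getenv "HOME")

;; The directory that "~" expands to; the initialization file is
;; looked up relative to this directory.
(expand-file-name "~")

;; The initialization file Emacs actually loaded, or nil if none was found.
user-init-file
```

If user-init-file comes back nil, Emacs did not find an initialization file and the clock test that follows will not show a clock.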

To test that Emacs is finding your initialization file, use your favorite text editor to add the entry in Listing 1, which turns on the clock in the Emacs status bar. After turning on the clock and starting Emacs, look for the time in the status area (after the name of the current file). If you see the clock, all is well.

Listing 1. Testing the Emacs initialization file
; Display the time in the Emacs status area (an easy way to test
; that we are picking up our Emacs customizations).
(display-time)

Now that you have Emacs installed and you've laid the foundation for customizing it, we'll look at how to add packages that provide an environment for editing and validating SGML and XML documents.

Adding PSGML for SGML and XML modes

The current distribution of GNU Emacs includes major editing modes for HTML and SGML. Generally, the function provided by these is limited to assisting with element/attribute entry and navigating among elements. The HTML support is based on earlier versions of HTML.

Around the time SGML was starting to become popular for document publishing, Lennart Staflin created PSGML, a package for adding an SGML major editing mode to Emacs. Because HTML is an application of SGML and XML is a subset of it, you can use PSGML for editing those as well. In fact, recent PSGML versions provide an XML editing mode.

PSGML also includes a built-in SGML parser that is DTD aware. If you have your own dialect of SGML or XML, you simply install your DTD(s). Changes in HTML standards are handled by installing a new DTD (or set of DTDs). PSGML provides context-sensitive editing, so you can add elements or attributes based on where you are in the document. Navigation features allow you to move among elements and even move to the next-trouble-spot to locate markup that doesn't conform to your DTD. Formatting features indent elements, based on nesting, or hide element content so you can restrict the view to specific areas. Finally, you can validate documents with an external validating parser, which I discuss later in this article.

Figure 1. Emacs with PSGML installed (editing the DITA FAQs)

Figure 1 shows some of the structured editing features PSGML adds to Emacs, including:

  • Colored markup syntax.
  • Markup indented based on nesting level.
  • Element folding. Note how the <prolog> and first two <section> elements have been collapsed to one line, to get them out of the way, while you can see the subelements in the unfolded Tips and Techniques <section> element.
  • Validation using an external parser. The results of the validation are displayed in a buffer below the document buffer. In this case, I used OpenSP to validate the document. If validation results in errors, you can use the Emacs next-error command ([Ctrl]-x `) to locate the error(s) in the source.

In addition to the features visible in Figure 1, PSGML adds many functions that you can access via pull-down menus, pop-up menus, or control-key sequences or commands.

Getting and installing PSGML

You'll need to download the current version of PSGML. Version 1.2.2, current as of this writing, is available from SourceForge (see Related topics). As with Emacs, PSGML is downloaded as a .gz file; unpack it using a .gz-aware utility such as WinZip. I unpacked the PSGML distribution into the site-lisp directory of my Emacs installation. Again, remember to retain directory information when you unpack. In my installation, I have d:\Emacs\site-lisp\psgml-1.2.2.

Once unpacked, consult README.psgml for some basic information, including how to install it in the UNIX version of Emacs.

Installing PSGML in Emacs for Windows
To prepare to install PSGML in the Windows version of Emacs, first create a directory for it (mine is site-lisp) and unpack the .gz file into it, retaining directory information.

Next you need to make sure that Emacs can find the files that make up PSGML. You do that by adding the contents of Listing 2 to your Emacs initialization file (_emacs).

Listing 2. Adding PSGML to the Emacs initialization file
;;; Set up PSGML
; Add PSGML to load-path so Emacs can find it.
; Note the forward slashes in the path... this is platform-independent so I 
; would suggest using them over back slashes. If you use back slashes, they 
; MUST BE doubled, as Emacs treats backslash as an escape character. 
(setq load-path (append (list nil "d:/Emacs/site-lisp/psgml-1.2.2") load-path))

; Use PSGML for sgml and xml major modes.
(autoload 'sgml-mode "psgml" "Major mode to edit SGML files." t)
(autoload 'xml-mode "psgml" "Major mode to edit XML files." t)

Now Emacs should have access to the PSGML files and it will use PSGML whenever you invoke sgml-mode or xml-mode. Later I'll show how to invoke those modes automatically, based on the file extension of the file being edited.
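
To confirm that the load-path change took effect, you can ask Emacs where it would load PSGML from. This is a minimal check, assuming the example installation path used in this article; evaluate the expressions in the *scratch* buffer with [Ctrl]-j after restarting Emacs.

```elisp
;; Full path of the PSGML library if Emacs can find it, otherwise nil.
(locate-library "psgml")

;; Non-nil if the example directory from Listing 2 is on load-path.
(member "d:/Emacs/site-lisp/psgml-1.2.2" load-path)
```

A nil result from the first expression usually points to a typo in the directory name or to single backslashes in the path.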

Compiling the PSGML files

Whether you're working in Linux or Windows, there's one more thing to do to complete the installation: compile the PSGML files. Look in the psgml directory and find a bunch of .el file-types. These are Emacs Lisp files. If you compile them, the PSGML support runs faster. Here's a simple way to accomplish this:

  1. Start Emacs.
  2. Type [Alt]-x.
  3. When prompted for a command, enter byte-force-recompile [Enter].
  4. When prompted for a directory name, change the path to your PSGML files, for example d:/Emacs/site-lisp/psgml-1.2.2, and press [Enter].

That ought to compile most of the .el files and display the results in a "*Compile-log*" buffer. (I received a couple of warnings about obsolete variables when I compiled, but I believe they are harmless enough to ignore.) The end result should be an .elc file for most of the .el files in the psgml directory (not all of the files will be compiled, so don't worry if some are missing).
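
If no .elc files appear, one thing to try is calling byte-recompile-directory directly. The Emacs documentation describes byte-force-recompile as recompiling files that already have a compiled counterpart, whereas passing 0 as the second argument of byte-recompile-directory also compiles .el files that have never been compiled. The sketch below assumes the example path from this article and that Listing 2 is already in your initialization file, so the PSGML files can find each other while compiling.

```elisp
;; Compile every .el file under the PSGML directory.
;; Second argument 0: also compile files that have no .elc yet, without asking.
;; Third argument t:  recompile even when an .elc already looks up to date.
(byte-recompile-directory "d:/Emacs/site-lisp/psgml-1.2.2" 0 t)
```

Evaluate the expression in the *scratch* buffer with [Ctrl]-j; the results are reported in the same "*Compile-log*" buffer.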

Adding DTDs

SGML and XML modes aren't much use without incorporating the DTDs to describe the types of documents you need to create. So here's how to add some DTDs and the appropriate configuration to make them useful with PSGML.

Let's start with XHTML 1.0, which is an XMLized version of HTML 4.01 (see Related topics for more information on XHTML). The XHTML DTDs will let you create HTML that conforms to the XML standard and can be validated with a parser (more on this later), thereby providing more robust and manageable documents. (See Related topics for a zip file that contains the XHTML 1.0 DTDs and catalog file I discuss in this section).

Here's how to download the XHTML DTDs and the related entities:

  1. Create a subdirectory for the XHTML DTDs. I keep all of my DTDs in one place on my system; let's assume they will reside under a DTDs folder at the same level as Emacs: d:\DTDs. Under there, create a folder for the XHTML DTDs, d:\DTDs\xhtml1.
  2. After creating a folder to hold them, simply go to the W3C's DTD site (see Related topics) to obtain the XHTML DTDs. There are three document types (strict, transitional, and frameset).
  3. For each of the three document types, click mouse-button-2 on the links and then save the target as a file. (You may need to remove the extra .txt extension that the browser adds when saving the files).
  4. Save the three entity sets (xhtml-lat1.ent, xhtml-special.ent, and xhtml-symbol.ent) into the same subdirectory as the DTDs.

Next, you need to create an SGML catalog file that PSGML can use to find these DTDs.

In the same directory as the DTDs, create a file called xhtml1.soc. The content should look like Listing 3.

Listing 3. xhtml1.soc - SGML catalog file for XHTML

PUBLIC  "-//W3C//DTD XHTML 1.0 Strict//EN"        "xhtml1-strict.dtd"
PUBLIC  "-//W3C//DTD XHTML 1.0 Transitional//EN"  "xhtml1-transitional.dtd"
PUBLIC  "-//W3C//DTD XHTML 1.0 Frameset//EN"      "xhtml1-frameset.dtd"
-- added DTDDECLs for use with onsgmls --
DTDDECL  "-//W3C//DTD XHTML 1.0 Strict//EN"        "xhtml1.dcl"
DTDDECL  "-//W3C//DTD XHTML 1.0 Transitional//EN"  "xhtml1.dcl"
DTDDECL  "-//W3C//DTD XHTML 1.0 Frameset//EN"      "xhtml1.dcl"

PUBLIC  "-//W3C//ENTITIES Latin 1 for XHTML//EN"  "xhtml-lat1.ent"
PUBLIC  "-//W3C//ENTITIES Symbols for XHTML//EN" "xhtml-symbol.ent"
PUBLIC  "-//W3C//ENTITIES Special for XHTML//EN" "xhtml-special.ent"

DOCTYPE html "xhtml1-transitional.dtd"

See Related topics for background on SGML Open Catalogs. For this article, I'll just explain the particular features that are used in Listing 3. The PUBLIC entries map what is referred to as a formal public identifier to a file system entity, which in this case is the file containing the corresponding DTD or entity set. This allows you to refer to these DTDs without having to know where they are in the file system. They require that your documents have a <!DOCTYPE xxxxx PUBLIC "yyyyy"> document type declaration, where the "yyyyy" matches the public identifier of one of the entries in your catalog file. The DTDDECL entries are not actually used by PSGML, but they will be used by the SGML parser (stay tuned!), and they indicate what SGML declaration should be used with the DTD that has the same formal public identifier.

Lastly, the DOCTYPE entry allows us to refer to a particular DTD without using the formal public identifier or an actual filename. The downside to this is that, for XHTML, there are several DTDs that define the same document type html, so you have to pick one. I would simply choose the one you'd expect to use the most. In Listing 3, I've chosen the transitional DTD. Remember, you can use any of the XHTML document types as long as you include the full !DOCTYPE declaration.

There's one more piece of configuration that you need to do. PSGML needs to know where to find the SGML catalog files. There are a couple of ways to accomplish this, as described in the PSGML documentation. I use the method that makes use of the environment variable SGML_CATALOG_FILES because it is also used by the SGML parser (patience, I come to it later in this article). So, now that you have a set of DTDs and a catalog file, create the aforementioned environment variable and set it to include the path to your xhtml1.soc file, for example d:\DTDs\xhtml1\xhtml1.soc. If you have more than one catalog file, you can include them all, separating them with a path delimiter (";" on Windows, ":" on UNIX-based systems).
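
One of the other ways is to keep the catalog list in the Emacs initialization file. The sketch below assumes PSGML's sgml-catalog-files variable, which the PSGML documentation describes as the list of catalog files (normally initialized from SGML_CATALOG_FILES), and it uses the example path from this article. Treat it as an alternative to try, not as a required step.

```elisp
;; Catalog files for PSGML's built-in parser (example path from this article).
(setq sgml-catalog-files '("d:/DTDs/xhtml1/xhtml1.soc"))

;; Export the same list to the environment so that an external parser
;; started from Emacs sees it too. path-separator is ";" on Windows
;; and ":" on UNIX-based systems.
(setenv "SGML_CATALOG_FILES"
        (mapconcat 'identity sgml-catalog-files path-separator))
```

Keeping the list in one place means a new catalog only has to be added once, but programs started outside Emacs will not see the value set by setenv.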

I'll show you how to add one more set of DTDs, this time for DITA:

  1. If necessary, create a subdirectory for the new DTDs, such as d:\DTDs\dita.
  2. Once you have the download, use your favorite utility to unpack the distribution to d:\DTDs\dita, once again preserving the directory information.
  3. Add the included catalog file to your SGML_CATALOG_FILES environment variable, so you might now have d:\DTDs\xhtml1\xhtml1.soc;d:\DTDs\dita\dtd\dita.soc.

Listing 4. dita.soc - SGML catalog file for DITA DTDs

-- For documents that don't include a DOCTYPE declaration --
DOCTYPE topic "topic.dtd"
--DOCTYPE topic "ditabase.dtd"--
DOCTYPE task "task.dtd"
DOCTYPE reftopic "reftopic.dtd"
DOCTYPE concept "concept.dtd"
DOCTYPE APIdesc "APIdesc.dtd"
DOCTYPE bctask "bctask.dtd"

-- There should probably be an entry here referencing the standard --
-- XML SGML declaration for example SGMLDECL or DTDDECL  --
-- (once we have public identifiers for the DTDs) --

As you can see, once you get things initially set up, adding new DTDs is relatively easy.

Editing a document with PSGML

Now that you have Emacs with PSGML installed and you have a set of DTDs to work with, you can begin editing documents using PSGML. Whenever you edit a document with an extension of .sgml or .xml, you will note that Emacs invokes SGML major mode (indicated in the status area) and the menu changes to look like the one shown in Figure 2.

Figure 2. Emacs menu with SGML editing mode

So far, if you edit an .html document, the old HTML major mode will be invoked. I'll show you how to fix that in a moment. In the meantime, you could invoke [Alt]-x and key in xml-mode to force XML mode.

To try using PSGML, edit a test file called test.html and insert beginning and ending html tags:

<html>
</html>

Turn on XML mode by invoking [Alt]-x and then keying in xml-mode. Next, click on the menu item DTD->Info->General DTD Info. This causes PSGML to parse the DTD and display general information in a buffer below your document. If your test was not successful, check for an error in your catalog file or environment variable. Also, this test assumes you have the DOCTYPE html entry in one of your SGML catalog files so that PSGML knows what DTD to associate with a doctype of "html". Alternatively, you could include a doctype declaration, such as <!DOCTYPE html PUBLIC ...>, where the PUBLIC identifier matches an entry in one of your SGML catalog files. If you have your catalogs and environment variables set up correctly, you should see something like this:

            Doctype: html
      Element types: 89
           Entities: 253
 Parameter entities: 63
         Files used: d:/DTDs/xhtml1/xhtml-special.ent

The output indicates that PSGML was able to locate the DTD and parse it, including all of the referenced entity modules.

Now PSGML is aware of your DTD, and you can begin utilizing some of PSGML's more powerful features. For example, place the cursor after the <html> tag and select menu item Markup->Insert Element. You will be presented with a list of elements that are valid at that location in the document. But before getting into any more of the editing features, let's do some more customization to get more out of PSGML.

More customization

Now that you can edit documents with PSGML, let's explore some more customizations that will exploit more of PSGML's features and make it easier to use. Listing 5 shows some more customizations you can append to your existing Emacs initialization file.

Listing 5. _emacs; more customizations for PSGML
;;; Set up file-extension/mode associations.   
; Note that I use xml-mode for html... that's because I'm writing 
; XHTML and I want my html to conform to XML.
(setq auto-mode-alist 
      (append '(
                ("\\.sgml" . sgml-mode)
                ("\\.idd" . sgml-mode)
                ("\\.ide" . sgml-mode)
                ("\\.htm" . xml-mode)
                ("\\.html" . xml-mode)
                ("\\.xml" . xml-mode)
                ("\\.xsl" . xml-mode)
                ("\\.fo" . xml-mode))
              auto-mode-alist))

;;; Set up and enable syntax coloring. 
; Create faces to assign to markup categories.
(make-face 'sgml-doctype-face)
(make-face 'sgml-pi-face)
(make-face 'sgml-comment-face)
(make-face 'sgml-sgml-face)
(make-face 'sgml-start-tag-face)
(make-face 'sgml-end-tag-face)
(make-face 'sgml-entity-face)

; Assign attributes to faces. Background of white assumed.

(set-face-foreground 'sgml-doctype-face "blue1")
(set-face-foreground 'sgml-sgml-face "cyan1")
(set-face-foreground 'sgml-pi-face "magenta")
(set-face-foreground 'sgml-comment-face "purple")
(set-face-foreground 'sgml-start-tag-face "Red")
(set-face-foreground 'sgml-end-tag-face "Red")
(set-face-foreground 'sgml-entity-face "Blue")

; Assign faces to markup categories.
(setq sgml-markup-faces
      '((doctype        . sgml-doctype-face)
        (pi             . sgml-pi-face)
        (comment        . sgml-comment-face)
        (sgml           . sgml-sgml-face)
        (start-tag      . sgml-start-tag-face)
        (end-tag        . sgml-end-tag-face)
        (entity         . sgml-entity-face)))

; PSGML - enable face settings
(setq sgml-set-face t)

; Auto-activate parsing the DTD when a document is loaded.
; If this isn't enabled, syntax coloring won't take effect until
; you manually invoke "DTD->Parse DTD"
(setq sgml-auto-activate-dtd t)

;;; Set up my "DTD->Insert DTD" menu.

(setq sgml-custom-dtd
      '(
       ( "DITA concept"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE concept SYSTEM \"concept.dtd\">" )
       ( "DITA task"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE task SYSTEM \"task.dtd\">" )
       ( "DITA reftopic"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE reftopic SYSTEM \"reftopic.dtd\">" )
       ( "DITA APIdesc"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE APIdesc SYSTEM \"apidesc.dtd\">" )
       ( "DITA topic"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE topic SYSTEM \"ditabase.dtd\">" )
       ( "HOD Script"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE HASCRIPT SYSTEM \"HAScript.dtd\">" )
       ( "XHTML 1.0 Strict"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"xhtml1-strict.dtd\">" )
       ( "XHTML 1.0 Transitional"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"xhtml1-transitional.dtd\">" )
       ( "XHTML 1.0 Frameset"
         "<?xml version=\"1.0\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Frameset//EN\" \"xhtml1-frameset.dtd\">" )
; I use XHTML now!
;       ( "HTML 4.01 Transitional"
;        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">" )
;       ( "HTML 4.01 Strict"
;        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\">" )
;       ( "HTML 4.01 Frameset"
;        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Frameset//EN\">" )
; An example of IBMIDDoc SGML DTD
;       ( "IBMIDDoc"
;        "<!DOCTYPE ibmiddoc PUBLIC \"+//ISBN 0-933186::IBM//DTD IBMIDDoc//EN\" [\n]>")
; An example of DOCBOOK XML DTD.
;       ( "DOCBOOK XML 4.1.2"
;        "<?xml version=\"1.0\"?>\n<!DOCTYPE book PUBLIC \"-//OASIS//DTD DocBook XML V4.1.2//EN\" \"\" [\n]>")
       )
)

; From Lennart Staflin - re-enabling launch of browser (from original HTML mode)
(defun my-psgml-hook ()
  (local-set-key "\C-c\C-b" 'browse-url-of-buffer))

(add-hook 'sgml-mode-hook 'my-psgml-hook)

The first section of Listing 5 tells Emacs which major mode to invoke when you load a file with a particular extension, similar to the way Windows associates application based on file type. Note here that I've set .htm and .html files to use xml-mode. This is because I'm actually writing XHTML.
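
To see which mode a given file name will get, you can query auto-mode-alist by hand. The sketch below is an illustration only: the file name is an example, and the second form is an optional, stricter variant of the .htm/.html entries, not part of the original listing.

```elisp
;; Look up the mode for a file name the way Emacs does: the first
;; entry whose regular expression matches wins. With the associations
;; from Listing 5 in effect, this should name xml-mode.
(assoc-default "test.html" auto-mode-alist 'string-match)

;; Optional stricter entry: "\\'" anchors the match at the end of the
;; file name, so a name such as "notes.html.bak" is not caught.
(setq auto-mode-alist
      (cons '("\\.html?\\'" . xml-mode) auto-mode-alist))
```

Because entries are tried in order, anything you put at the front of the list takes precedence over the associations that ship with Emacs.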

The next four sections of Listing 5 provide syntax-based highlighting, which causes different markup constructs to appear in different colors in the editor. By default, PSGML simply defines tags to appear in bold and comments to appear in italic. Here, I've set start and end tags to appear in red, comments to appear in purple, entity references to appear in blue, PIs to appear in magenta, and so on. In addition to the constructs I've modified, you can also define the appearance of ignored marked sections, marked section starts and ends, and short references. The purpose of the four sections is to:

  • Define a face
  • Set the characteristics of the face
  • Associate the face with the particular markup type
  • Activate the settings

The next section of Listing 5, sgml-auto-activate-dtd, causes the DTD associated with the document to be parsed as soon as the document is loaded. This is set to false by default because of the processing required. With processors as fast as they are, this shouldn't be a concern. Also, if this is not set to true, when a document is initially loaded, the syntax coloring will not take effect until you explicitly parse the DTD, using either the DTD->Parse DTD menu item or the [Ctrl]-c [Ctrl]-p key sequence.

The next section modifies the DTD->Insert DTD menu item to allow you to quickly insert the DOCTYPE declaration for a new document. I've included a variety of document types, including both SGML and XML document types (some are commented out). Note how the XML document types include the XML declaration. Whenever you add a new DTD, you'll probably want to update the sgml-custom-dtd variable to add your new DTD to the Insert DTD menu.

The last section defines my-psgml-hook and hooks it into the SGML mode. This allows you to launch your default browser against the current file you are editing. This is handy for viewing HTML and XHTML as you edit. It will be even more handy when browsers more fully support XML and XSLT.

A quick PSGML test drive

Now that you have some customizations in place, let's take a quick test drive to see some of the PSGML editing features.

  1. Start Emacs and open a file ([Ctrl]-x[Ctrl]-f) called test.html. That should put Emacs into XML mode, which you can verify by looking at the status line.
  2. From the menu, select DTD->Insert DTD->XHTML 1.0 Transitional. That should insert the XML declaration and a <!DOCTYPE html...> declaration for a document with the document type name "html." Also notice the syntax coloring of these two entries.
  3. Next, place the cursor after the DOCTYPE declaration and from the menu select Markup->Insert Element (or press Shift and mouse-button-2). You should see a pop-up menu with a list of elements that are valid at this point in the document, in this case the html element. Notice that when you insert the HTML element, its required elements, head and body, are also inserted. Also, a comment appears prompting you that you must insert either a title or base element. This feature is handy until you get used to a particular markup language, after which it's more annoying than helpful. You can disable the prompting by setting the sgml-insert-missing-element-comment variable to false in your Emacs initialization file.
  4. You can use the same technique to add or modify attributes: Place the cursor inside a start-tag and select from the menu Markup->Insert Attribute (or press [Shift]mouse-button-2). A pop-up menu appears that offers valid attributes for the selected element. Select an attribute from the pop-up menu.
  5. Note how the structure is indented based on element nesting. If you insert an H1 inside the body, it will not be indented. This is because the default settings do not indent mixed content elements (elements that may contain both markup and text, or PCDATA in SGML/DTD parlance). You can change the indenting assumptions by setting sgml-indent-data to true in your Emacs initialization file. Before doing that, consider whether white-space will be significant in your XML application (see Related topics).
  6. If you have already installed an external validator, try validating your document: Select SGML->Validate and then press Enter (you may be prompted to save your file) or press [Ctrl]-c [Ctrl]-v and then press Enter.
    Note: If validation doesn't work, install an external validator (as I explain how to do in the next section) and test drive that feature later. If validation does work, you should receive an error indicating the "head" is not finished. If you press Ctrl-x` (note the back-tic), you will be taken to the line number in the source where the error occurred. Go ahead and insert a title element.
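
The two settings mentioned in steps 3 and 5 can be written in the initialization file as follows. This is a sketch using the variable names given above; setq-default is used so the values apply to every buffer, and both lines are optional.

```elisp
;; Step 3: stop inserting the reminder comments for missing required
;; elements (nil is "false" in Emacs Lisp).
(setq-default sgml-insert-missing-element-comment nil)

;; Step 5: also indent elements that have mixed content. Leave this
;; out if white space is significant in your XML application.
(setq-default sgml-indent-data t)
```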

Using SP or OpenSP for SGML and XML validation

Although PSGML contains an SGML parser, it is not a fully functional parser. It does, however, provide the ability to validate SGML and XML documents using an external parser. This allows you to fully validate your source and find, for example, elements with IDREFs that lack a corresponding target element with a matching ID.

When you invoke SGML->Validate from the menu or keyboard ([Ctrl]-c [Ctrl]-v), PSGML starts a separate process that runs the SGML parser against the file you are currently editing. It displays the results of the validation in a buffer below the file you are currently editing. If it encounters errors, use the Emacs next-error command, [Ctrl]-x ` (note the back-tic), to have Emacs take you to the location of the error in your source document.

By default, PSGML is configured to invoke nsgmls, part of SP, an SGML parser originally written by James Clark. SP is no longer being supported, but it is the foundation for OpenSP, which is now maintained as part of the OpenJade project. (See Related topics for more information on SP and OpenSP.) You can download and use SP or OpenSP. I chose OpenSP because it is actively supported, and it contains support for the DTDDECL keyword of SGML catalogs whereas SP does not (DTDDECL is supported as of the 1.4 version of OpenSP). If you are dealing only with XML, you will need only a single SGML declaration defined for XML. If, however, you will also be dealing with SGML, the DTD you are using will probably reference its own declaration. Because PSGML allows you to specify only one particular SGML declaration to be used, via the sgml-declaration variable (or sgml-xml-declaration for XML mode), the DTDDECL catalog feature can come in handy. One last consideration is that I was unable to locate binaries for OpenSP for the Windows platform. Because the project maintains only source code, you will need to build the binaries yourself or locate them by searching more diligently than I did.

Using SP

If you prefer to use SP, all you really need to do is download SP (see Related topics), unpack it, and update two environment variables. You will need to add to your PATH so that nsgmls can be found when invoked by PSGML. Assuming you unpack the distribution to the path d:\SP, you would need to add d:\SP\bin to your PATH. Also, you will need to add an entry to your SGML_CATALOG_FILES so the SGML declaration for XML can be found. If you don't pick up the correct SGML declaration when validating your XML, you will probably receive a lot of error messages. This is because XML doesn't support SGML's OMITTAG feature, which requires the DTD to specify minimization information (XML DTDs do not include this information because all tags are required). Again, assuming you installed SP in d:\SP, an SGML declaration for XML will be in d:\SP\pubtext\xml.dcl, which is referenced by d:\SP\pubtext\xml.soc (see the SGMLDECL entry). So simply add d:\SP\pubtext\xml.soc to your SGML_CATALOG_FILES so nsgmls can find this catalog. Alternatively, you can set the Emacs/PSGML variable sgml-xml-declaration in your Emacs initialization file to point to this file as shown in Listing 6.

Listing 6. _emacs - enabling SP for validation
; Note the forward slashes in the path!!!! 
(setq sgml-xml-declaration "d:/SP/pubtext/xml.dcl")
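
Before trying a validation, it can be worth checking that Emacs can actually find the parser. The sketch below assumes the executable-find function is available in your Emacs version; evaluate the expression in the *scratch* buffer with [Ctrl]-j.

```elisp
;; Full path of the nsgmls program if it is on the search path that
;; Emacs uses for external programs, otherwise nil.
(executable-find "nsgmls")
```

A nil result usually means the PATH change has not reached Emacs yet; restarting Emacs after updating PATH is the first thing to try. The same check works for any other executable name.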

Using OpenSP

If you wish to use OpenSP, you need to make a couple of slight modifications to PSGML. However, all of this can be done using the Emacs initialization file.

Assuming you have built and installed OpenSP or found a pre-built binary distribution, again the first thing you need to do is update your PATH so the executables can be found. Assuming OpenSP is installed in d:\OpenSP, you would need to add d:\OpenSP\bin to your PATH. Note that you can have both SP and OpenSP installed and accessible at the same time because the executables in OpenSP have been renamed.

The next thing you need to do is update your Emacs configuration to alter the command used for validation. This would normally be done by setting the Emacs variable sgml-validate-command, and in fact we will set this variable to handle the case of using OpenSP's onsgmls executable to validate in sgml-mode. For xml-mode, however, this doesn't seem to work correctly: When I set this variable in my Emacs initialization file, sgml-mode picks up the change, but xml-mode does not. You can get around this issue by providing a mode-hook. The goal is to override the default xml-mode validate command, which is defined as nsgmls -wxml -s %s %s, setting it to onsgmls -wxml -s %s %s. The fragment of Emacs initialization code in Listing 7 takes care of both of these tasks.

Listing 7. _emacs - enabling OpenSP for validation
; override default validate command to utilize OpenSP's onsgmls executable
(setq sgml-validate-command "onsgmls -s %s %s")

; override default xml-mode validate command to utilize OpenSP's onsgmls
; executable by using a mode-hook, since there appears to be no other means 
; to accomplish it.
(defun my-psgml-xml-hook ()
  (setq sgml-validate-command "onsgmls -wxml -s %s %s")
  ; forward slashes again: backslashes are escape characters in Lisp strings
  (setq sgml-declaration "d:/OpenSP/pubtext/xml.dcl"))
(add-hook 'xml-mode-hook 'my-psgml-xml-hook)

You really don't need to understand what's going on here to make PSGML work with OpenSP. However, if you're interested, a mode-hook names an Emacs function that is invoked after the mode is initialized. This gives you an opportunity to override functions and settings established by that mode. In this case, since the validate command is hardwired in the PSGML code, you can use the mode-hook to override that setting without having to modify the PSGML code and recompile it (which would need to be done each time you install a new version of PSGML).
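
Because a mode-hook is an ordinary function, it can also make decisions when the mode starts. As an illustration, the following sketch is an alternative to the hook in Listing 7: it uses onsgmls when that executable is on the PATH and falls back to nsgmls otherwise. The names my-psgml-validator and my-psgml-xml-validator-hook are hypothetical helpers of my own, not part of PSGML, and I am assuming that executable-find is available in your version of Emacs once the executable library is loaded.

```elisp
; A sketch only: choose the validator when xml-mode starts.
; Use this in place of my-psgml-xml-hook, not in addition to it.
(require 'executable)   ; provides executable-find in older Emacs versions

; Hypothetical helper: return "onsgmls" if OpenSP is on the PATH,
; otherwise fall back to SP's "nsgmls".
(defun my-psgml-validator ()
  (if (executable-find "onsgmls") "onsgmls" "nsgmls"))

; Hypothetical hook: build the xml-mode validate command around
; whichever validator was found.
(defun my-psgml-xml-validator-hook ()
  (setq sgml-validate-command
        (concat (my-psgml-validator) " -wxml -s %s %s")))
(add-hook 'xml-mode-hook 'my-psgml-xml-validator-hook)
```

To check that the hook took effect, open an XML file and inspect the value of sgml-validate-command with C-h v.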

Suggestions and tips

Once you get comfortable with the basic functions I've described, try exploring each of the menus that PSGML adds to the Emacs menu bar:

  • On the SGML menu, experimenting with the File Options and User Options can give you a good idea of what you can customize within PSGML. For more information on particular settings, you can refer to the online documentation or consult the "Editing SGML with Emacs and PSGML" document included with PSGML. Changes you make through this menu persist only for that particular editing session. If you prefer to make a permanent change, you have to update your Emacs initialization file.
  • The Modify menu mainly provides functions for changing existing markup. Some of these functions, for example Normalize, might come in handy for trying to clean up HTML and make it XHTML.
  • Functions under the Move menu basically allow for quicker navigation of the structure of your document.
  • The Markup menu provides menu access for inserting elements, tags, attributes, entities, and so on. I'll just point out two things that might not be obvious. Tag Region allows you to wrap existing text inside an element, using PSGML's internal parser to determine what elements are valid for the highlighted location. Insert Entity allows you to insert general text entities defined in your DTD. If you define new text entities in your internal subset at the beginning of the document, you will need to reparse the DTD to pick up the newly defined entities during your editing session.
  • Items under View are self-explanatory.
  • Most of the items under the DTD menu have been covered. The Info items are worth a mention, however, because they can be useful for exploring your DTD if you are not already familiar with it.

Related topics

  • The XHTML 1.0 DTDs, _emacs customizations, and updated dita.soc files I described in this article are available in the
  • Download PSGML version 1.2.2 (or whatever version is current) from SourceForge.
  • The GNU Web site provides information on Emacs as well as numerous other GNU projects.
  • If you prefer to learn from a book, O'Reilly & Associates publishes a good book called Learning GNU Emacs, which provides information on how to accomplish basic editing tasks, use many of the major editing modes, customize, and even program Emacs.
  • There's also an excellent tutorial in Bob DuCharme's book SGML CD, in the chapter "Editing SGML with the Emacs Text Editor", which is available online. In addition to providing a tutorial on using Emacs, Bob also discusses using PSGML for editing SGML documents, and in fact this chapter is what got me started.
  • Check out the GNU Emacs FAQ for Windows.
  • For more information on XHTML, visit the XHTML 1.0 section of the W3C Web site.
  • The Darwinian Information Typing Architecture, DITA, is an architecture for creating article-based information. DITA includes a base set of DTDs and framework that allows for specialization using derived DTDs and processing conventions.
  • In the DTD samples file provided, I've included a DTD I used to edit Host On Demand (HOD) macros (in the hodmacro directory). This demonstrates how Emacs with PSGML can be used to edit XML that is not the traditional book or article type of information. For more information on HOD, see WebSphere Host Publisher. You can learn more about WebSphere Host Publisher file formats from WebSphere Host Publisher Programmer's and Reference.
  • For a more data-oriented XML editing tool, check out the replacement for the WebSphere Studio Application Developer environment -- WebSphere Studio Site Developer which contains a visual XML editor, or check out the Downloads and products section of developerWorks XML zone (to view editing tools only, select Editing in the View by field).
  • Find out more about SGML Open Catalogs.
  • SP is an SGML parser originally written by James Clark. It is no longer supported, but it is the foundation for OpenSP, which is now maintained as part of the OpenJade project. If you're looking for a pre-built RPM package for Linux, you can try RPM Find (SP) or RPM Find (OpenSP).
  • Another good source of publicly available XML/SGML tools is The XML Cover Pages.