DITA is a topic-oriented architecture now managed by the OASIS DITA Technical Committee. With DITA, you author content in small, independent units that you assemble into deliverables, such as online help, books, or courses. (See Resources for more information.)
Until recently, the primary large-scale authoring format inside IBM was IBMIDDoc SGML. The IBMIDDoc DTD had been in use for over 10 years -- enough time to develop large libraries and complicated processing tools. When we began the move to DITA, there were still many reasons to continue using these old tools:
- The existing tools had been extensively tested for translation and accessibility support.
- They were proven to work for very large books, at a time when no extremely large sets of information existed in DITA.
- Authors were familiar with the existing processes, and would continue to use them for old books.
- Many publishing options did not yet exist for DITA, such as transforms to README-style text or full book-like PDF.
Of course, we could have written new tools for DITA that matched the function of the old system, to minimize the learning curve for authors. However, writing a full set of new tools had a number of disadvantages:
- It required a large up-front investment.
- It ignored our investment in working tools.
- It didn't allow us to combine old SGML and new DITA information.
- Most importantly, we could not start using DITA until the new tools were ready.
This article highlights the pros and cons of alternative solutions we came up with during our evaluation. It also identifies the DITA solution we chose and the rationale for it, and describes the dirty technical details of that solution.
At IBM, we currently have a large toolset that we use to convert our IBMIDDoc SGML into many different formats. This toolset (the ID Workbench) ensures that all IBM documentation uses a common style, creates appropriate output for all supported languages, complies with all accessibility regulations, and meets many other requirements. It would take years to displace this system entirely.
Given the robust toolset that's available for IBMIDDoc, one of the first ideas for working with DITA was to use IBMIDDoc SGML as an intermediate format for some or all output transformations. This would have immediately provided access to numerous output formats, and would have guaranteed compliance with all IBM requirements. However, limitations quickly became apparent.
It is relatively simple to convert DITA to IBMIDDoc, because the specific structure and semantics of a DITA topic fit easily into the more general book structure of IBMIDDoc. It is also easy to take the SGML output and then start one of the existing IBMIDDoc transforms. However, this makes every simple transformation a two-step process, which annoys authors. It also leaves a lot of SGML files sitting around that look remarkably like source files, which can result in updates to the wrong files. Finally, even with many similarities in markup, it is impossible to preserve every semantic nuance in IBMIDDoc, especially when considering specialized DITA markup with very specific information.
All things considered, these are relatively easy problems to work around. Two steps are a pain, but shortcuts can make that easier. Intermediate SGML can be stored in a temporary folder so it doesn't mix with the actual source. The loss of semantics is an issue, but the output still appears correct visually.
One remaining problem is much more significant: The convert-to-SGML-and-go solution leaves no way to add DITA content to an existing IBMIDDoc book. IBM has large SGML libraries today. What happens when these books need to be updated -- should new content be authored in IBMIDDoc or DITA? What if the books will be converted to DITA eventually, but there is no time to do so now? Finally, what if the same content is reused in two products, one of which uses only DITA, and one of which still supports IBMIDDoc books?
Clearly, a solution was needed that could merge new DITA content with existing IBMIDDoc content. At this point, one proposal was simply to convert DITA to IBMIDDoc fragments instead of books. In our SGML DTD, a DITA topic corresponds roughly to a division element. It was possible to convert each topic to a division, insert the division into an existing book, and proceed. However, this solution also had clear limitations.
When including DITA as SGML fragments, you must declare those fragments as SGML entities, to be resolved later. It is tedious and error prone to declare these entities before the topics have been converted. The full IBMIDDoc book cannot be processed until after the DITA topics are converted to SGML. Once the fragments are created, you again end up with extra "source" files. You may be tempted to edit those SGML fragments to improve the book's flow -- but that results in two diverging copies of the source.
The real roadblock to this plan is the cost to the translation process. Say a DITA source file gets translated for use in one product. When the IBMIDDoc book that reuses this topic goes to translation, there are several alternatives:
- Convert the English DITA file to SGML, and send the whole English book to translation. The topics will be translated twice. The markup will not match exactly, so the process is not fully automatic. This results in extra costs.
- Convert the translated DITA to SGML, and send those translated fragments with the English book. In this case, a different package is needed for every language. Updates to the DITA content since its last use may not be reflected in the translated file. Also, this solution is not possible if the DITA topics and SGML files are being translated at the same time.
- Send the SGML and DITA files together, with instructions on how to merge them post-translation. This increases translation costs simply because it takes more time to handle -- a one-step process is always cheaper.
At this point we needed a solution that let us reuse SGML tools; one that let IBMIDDoc and DITA work together; and one that did this with minimal impact to authors and translation. Such a solution was only possible if IBMIDDoc could reference DITA content, and use it automatically.
The best way to include DITA without transforming in advance, without worrying about translation, and without generating new source copies is simply to do so with a pointer. We can create an element in IBMIDDoc which points to a DITA topic or map, and let the processing stream resolve everything behind the scenes.
Including DITA by reference solves all of the problems raised so far:
- It is simple to point to a DITA topic or map from an existing book. Authors have no extra processing steps to remember.
- Users never have to keep track of SGML versions of the DITA source.
- If a DITA topic has already been translated, the new translation compares the current DITA source to an existing DITA-based translation memory; so, all old content has an exact match in the memory.
In addition, one simple IBMIDDoc document shell, with no real content other than a DITA pointer, can be used to get to any output format already supported by the IBMIDDoc toolset.
To provide such a pointer, we created an element called XMLObj and included it in the IBMIDDoc DTD. It uses an entity reference to point to DITA topics or maps, and can be used in any location that allows divisions.
For example, assume that you have a book that defines everything about llamas. This book was written several years ago in IBMIDDoc, at the start of the llama craze. Today, with more and more people looking to alpacas as a smaller alternative, the book needs to be updated. With no time to convert the original book to DITA, this is an ideal time to use the XMLObj element. The markup is very simple -- just declare an entity and then reference it:
Listing 1. Sample IBMIDDoc markup using XMLObj
    <!DOCTYPE IBMIDDOC PUBLIC "+//ISBN 0-933186::IBM//DTD IBMIDDoc//EN" [
      <!ENTITY alpacas SYSTEM "allAboutAlpacas.ditamap" NDATA ditamap>
      <!ENTITY buyalpaca SYSTEM "buyingYourAlpaca.dita" NDATA dita>
    ]>
    <ibmiddoc>
      ....                         <!-- Document title and prolog -->
      <body>
        ...                        <!-- Existing description of llamas -->
        <xmlobj obj="alpacas">     <!-- Point to the DITA map -->
      </body>
      <backm>                      <!-- Appendix information -->
        ...                        <!-- Info on how to purchase llamas -->
        <xmlobj obj="buyalpaca">   <!-- Point to the DITA topic -->
      </backm>
    </ibmiddoc>
All of the IBMIDDoc output transforms begin with the same validation and normalization process. When this process encounters a DITA reference, the transform pauses to retrieve the indicated file. With the original process on hold, the DITA content is converted first to IBMIDDoc, and then to normalized IBMIDDoc. The DITA content -- now in normalized SGML form -- is inserted with a slight bit of modification into the original output stream. This can happen with any number of topics or maps. (The sample book above includes one map and one topic.)
For example, using the IBMIDDoc book above, the process is:
- Call the transform process to convert the book (llama.idd) to PDF.
- The transform begins with a normalization step.
- The document title and prolog are processed.
- The llama chapters are processed.
- The <xmlobj obj="alpacas"> element is found, and the file allAboutAlpacas.ditamap is retrieved.
- allAboutAlpacas.ditamap is converted to IBMIDDoc (allAboutAlpacas.idd). Note that translations are done on all files in the original format; so for NLS versions, this process merges translated DITA with translated IBMIDDoc.
- allAboutAlpacas.idd is normalized (allAboutAlpacas.idn).
- The normalized IBMIDDoc file allAboutAlpacas.idn is copied into the output stream (with some magic described below in "The dirty little details").
- Processing continues until the next <xmlobj> element is encountered; buyingYourAlpaca.dita is converted to buyingYourAlpaca.idd, then to buyingYourAlpaca.idn, and then included in the output stream.
- Processing completes, resulting in one complete normalized SGML file (llama.idn).
- Normalized SGML is converted to PDF with no change to the existing process.
- All of the IBMIDDoc fragments relating to alpacas exist only in a temporary processing directory, and are discarded at the end of the process.
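The steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the ID Workbench itself: the conversion and normalization functions are placeholders, the DITA sources are reduced to one-line strings, and the <division> element name is a stand-in for whatever the real conversion emits.

```python
import re

# Hypothetical stand-ins for the entity declarations and the DITA source files.
ENTITIES = {"alpacas": "allAboutAlpacas.ditamap", "buyalpaca": "buyingYourAlpaca.dita"}
DITA_SOURCE = {
    "allAboutAlpacas.ditamap": "Alpaca basics",
    "buyingYourAlpaca.dita": "Buying your alpaca",
}

XMLOBJ = re.compile(r'<xmlobj obj="([^"]+)">')
BODY = re.compile(r"<body>(.*)</body>", re.DOTALL)


def dita_to_ibmiddoc(text):
    """Placeholder for the DITA-to-IBMIDDoc conversion (.dita -> .idd)."""
    return f"<ibmiddoc>\n<body>\n<division>{text}</division>\n</body>\n</ibmiddoc>"


def normalize(sgml):
    """Placeholder for normalization (.idd -> .idn): drops whitespace between tags."""
    return re.sub(r">\s+<", "><", sgml.strip())


def normalize_book(book, temp_files):
    """Normalize a book, expanding each <xmlobj> reference where it occurs."""
    def expand(match):
        file_name = ENTITIES[match.group(1)]
        idn = normalize(dita_to_ibmiddoc(DITA_SOURCE[file_name]))
        # The intermediate .idn file is temporary; record its name only.
        temp_files.append(file_name.rsplit(".", 1)[0] + ".idn")
        # Keep the body content and drop the wrapper elements.
        return BODY.search(idn).group(1)

    return XMLOBJ.sub(expand, normalize(book))


if __name__ == "__main__":
    temp = []
    llama = """<ibmiddoc>
<body>
<division>Llamas</division>
<xmlobj obj="alpacas">
</body>
<backm>
<xmlobj obj="buyalpaca">
</backm>
</ibmiddoc>"""
    print(normalize_book(llama, temp))
    print(temp)
```

The point of the sketch is that the author only ever sees the book and the DITA files. Every intermediate file is created and dropped inside the normalization step.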
As demonstrated, this process allows us to easily integrate DITA with IBMIDDoc. So... are there any other advantages to this process?
One big advantage is that it provides a shortcut to output formats that our DITA toolkit does not yet support. Working with the sample above, what happens when you want to produce output using only the alpaca map? At the time of this writing, the formal DITA book model (known as bookmap) is still under discussion by the OASIS DITA Technical Committee. Although many vendors are successfully using bookmap today, we have so far chosen to wait for OASIS to discuss it. Until then, we can use this IBMIDDoc shell to produce a full book:
Listing 2. IBMIDDoc shell for producing a book
    <!DOCTYPE IBMIDDOC PUBLIC "+//ISBN 0-933186::IBM//DTD IBMIDDoc//EN" [
      <!ENTITY alpacas SYSTEM "allAboutAlpacas.ditamap" NDATA ditamap>
    ]>
    <ibmiddoc>
      <prolog>
        <title>Everything Alpaca</title>
        ... required metadata, such as document numbers or ISBN...
      </prolog>
      <body>
        <xmlobj obj="alpacas">     <!-- Point to the DITA map -->
      </body>
    </ibmiddoc>
This simple file uses DITA content to produce a PDF that automatically meets all style, translation, and accessibility rules. Most importantly, IBMIDDoc already has room for all of the information that IBM requires for books. We have been using this method to create books for over two years. If we had waited for formal approval of bookmap, we would still not be producing books from DITA today.
It is also worth pointing out that the same file can be used for any format already supported by our ID Workbench; in addition to PDF, it can produce README-style text output, PostScript, or legacy formats like BookManager.
As hinted at above, we needed to work out a few technical details when pulling DITA into the normalization processing stream. For a full solution, we needed to address all of the details listed below. While the list may seem long, most of these were quite easy, and the others were only necessary to support complex DITA markup. All were simple compared to rewriting the full toolset for DITA.
- Of course, there needed to be a DITA-to-SGML conversion process. For us, this was a given because we knew we would want to leverage existing IBMIDDoc transformations, especially for legacy formats.
- In the process described above, the temporary normalized file allAboutAlpacas.idn was placed in the middle of the normalized book llama.idn. In reality, allAboutAlpacas.idn contained several items that needed to be removed (such as the root <ibmiddoc> element and the <body> element) before it could be included. The easy fix was to scan the file and only include content from inside the <body> element.
- IBMIDDoc does not allow duplicate IDs. If two DITA topics used the same ID, or if one DITA topic used an ID that was also in the SGML book, a simple merge would have resulted in errors. We solved this problem by placing a generated dit2idNN prefix in front of each ID, where NN is a number that increases with each XMLObj reference.
- In our normalized files, items like revision definitions are placed in the main prolog. When the body of allAboutAlpacas.idn was pushed into llama.idn, revisions from each IDN file had to be merged. We solved this during the scan mentioned above. Instead of discarding information before the <body>, we simply saved prolog definitions and copied them into the main prolog for the "llama" book.
- In the llama sample, both allAboutAlpacas.ditamap and buyingYourAlpaca.dita were included in a book. These were each converted to IBMIDDoc independently, without knowing which topics would eventually be included in the book. This complicated links from one to another, because we did not know if the target topic would be included or what ID it would use when included. This was tougher than the previous issues -- it was the only one of these details not included in our initial DITA support. We solved it by keeping a list of the IDs used for every DITA target, as well as the IDs used for each (temporarily) invalid link. When the normalized file was complete, we used the full list to fix each link. For example:
  - allAboutAlpacas contained a link to buyingYourAlpaca.dita. After the conversion, it referenced an ID like "dit2id1_invalid_1". The target and ID were saved in a list of invalid IDs.
  - When buyingYourAlpaca.dita was included, it used an ID such as "dit2id2_buyAlpaca". The target and ID were saved in a list of valid IDs.
  - After all DITA content was processed, that content was scanned for invalid IDs.
  - When the scan found "dit2id1_invalid_1", it recognized this as a link to buyingYourAlpaca.dita; it knew from the valid list that buyingYourAlpaca.dita was present, with id="dit2id2_buyAlpaca".
  - The invalid link was replaced with the correct value. If no correct value was available, the user would have been warned at this point.
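The ID prefixing and the two-list link fix-up described in this list can be sketched as follows. This is a minimal illustration under stated assumptions rather than the production code: the LinkTable class, its method names, and the refid attribute used in the usage note are hypothetical; only the dit2idNN naming pattern comes from the description above.

```python
import re


def prefix_ids(fragment, ref_number):
    """Put a generated dit2idNN_ prefix in front of every id value in a fragment."""
    prefix = f"dit2id{ref_number}_"
    return re.sub(r'\bid="([^"]+)"',
                  lambda m: f'id="{prefix}{m.group(1)}"', fragment)


class LinkTable:
    """Tracks final IDs for included targets and placeholders for pending links."""

    def __init__(self):
        self.valid = {}     # DITA target file -> ID it received in the merged book
        self.invalid = {}   # placeholder ID -> DITA target file it stands for
        self.counters = {}  # XMLObj reference number -> placeholders issued so far

    def register_target(self, ref_number, target, topic_id):
        """Record the ID a target received when it was included."""
        final_id = f"dit2id{ref_number}_{topic_id}"
        self.valid[target] = final_id
        return final_id

    def placeholder_for(self, ref_number, target):
        """Issue a temporary ID for a link whose target is not yet known."""
        count = self.counters.get(ref_number, 0) + 1
        self.counters[ref_number] = count
        placeholder = f"dit2id{ref_number}_invalid_{count}"
        self.invalid[placeholder] = target
        return placeholder

    def fix_links(self, merged):
        """Replace placeholders with real IDs; collect a warning for each miss."""
        warnings = []

        def resolve(match):
            placeholder = match.group(0)
            target = self.invalid.get(placeholder)
            if target in self.valid:
                return self.valid[target]
            warnings.append(f"unresolved link to {target}")
            return placeholder

        fixed = re.sub(r"dit2id\d+_invalid_\d+", resolve, merged)
        return fixed, warnings
```

In use, the converter would call placeholder_for(1, "buyingYourAlpaca.dita") while processing the first reference and write the returned value into the link (for example, in an assumed refid attribute). It would call register_target(2, "buyingYourAlpaca.dita", "buyAlpaca") when the second reference is included, and run fix_links once over the complete normalized file. Deferring the fix until the end is what lets each piece be converted independently.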
The advantages of using DITA by reference are clear, but how useful is this information for companies that do not use IBMIDDoc (or even SGML)? The answer is that the same approach can be applied to any SGML or XML system, with varying degrees of complexity.
Any SGML or XML system should be able to add a new DITA object element to reference DITA content. In order to merge content or reuse existing tools, it is also necessary to have a conversion from DITA to the current format. Beyond that, the solution will vary based on that current format. If there is no prolog for central definitions, the transform can convert directly to nested content (no need for the root or <body> elements). If you do not need to support links between separately included content, you can skip the complex mechanism for recognizing IDs and references.
As an example, the attached ZIP file (see Download) contains a customization for DocBook that allows DocBook books and articles to import DITA content by reference. The customization adds a new element to DocBook that can refer to DITA topics or maps (including bookmaps) to supply part or all of the content within a DocBook-defined bibliographic unit. The prototype also adds a wrapper around the existing DITA-to-DocBook transform from the DITA Open Toolkit to process DocBook books or articles. The process replaces the new element with DocBook content generated from the referenced DITA content; the result can then be processed with DocBook tools. If you have the DocBook toolset, the included Ant file allows you to do this as one step, and the temporary merged file will be removed. Otherwise, the scripts leave you with a merged DocBook file.
This mechanism can be used to integrate DITA into DocBook environments; it can also be used to leverage DocBook's book model and tools to produce books from topics. Although this is only a proof-of-concept for simple DITA content, it can be extended for more robust support in a full production system.
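To make the general idea concrete for an XML host format, here is a minimal Python sketch of the replace-by-reference step. It is not the code in the ZIP file: the ditaref element name, its href attribute, and the convert_dita_to_docbook function are hypothetical placeholders, and a real system would call an actual DITA-to-DocBook transform at that point.

```python
import xml.etree.ElementTree as ET


def convert_dita_to_docbook(href):
    """Hypothetical stand-in for a real DITA-to-DocBook transform."""
    section = ET.Element("section")
    ET.SubElement(section, "title").text = f"Content converted from {href}"
    return [section]


def resolve_dita_references(root, tag="ditaref", convert=convert_dita_to_docbook):
    """Replace each reference element with the content converted from its target."""
    # ElementTree has no parent pointers, so visit every element as a parent.
    for parent in list(root.iter()):
        # Walk children in reverse so insertions do not shift pending indexes.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != tag:
                continue
            replacement = convert(child.get("href"))
            if replacement:
                # Keep any text that followed the reference element.
                replacement[-1].tail = child.tail
            parent.remove(child)
            for offset, new_element in enumerate(replacement):
                parent.insert(index + offset, new_element)
    return root


if __name__ == "__main__":
    book = ET.fromstring(
        "<book><title>Everything Alpaca</title>"
        '<ditaref href="allAboutAlpacas.ditamap"/></book>'
    )
    resolve_dita_references(book)
    print(ET.tostring(book, encoding="unicode"))
```

Because the merged tree contains only host-format elements afterwards, the existing downstream tools can process it unchanged.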
The XMLObj element in IBMIDDoc shows that it is possible to move to DITA without abandoning an existing publishing system. We were able to start using DITA content with our existing books long before time permitted a full-scale migration. We created a path to old output formats that were not yet supported from DITA. All of this allowed us to use DITA content without investing in an entirely new system, years before a DITA-based system could have been ready.
Download: Sample code to pull DITA content into DocBook -- x-dita11-dbdita.zip (20KB, HTTP)
- IBM donated DITA to the OASIS standards organization in March of 2004, where it is now managed by the OASIS DITA Technical Committee (http://www.oasis-open.org/committees/dita/). In April of 2005, OASIS approved Version 1.0 of the DITA specification, which consists of the following documents:
  - OASIS Darwin Information Typing Architecture (DITA) Language Specification: http://xml.coverpages.org/DITAv10-OS-LangSpec20050509.pdf
  - OASIS Darwin Information Typing Architecture (DITA) Architectural Specification: http://xml.coverpages.org/DITAv10-OS-ArchSpec20050509.pdf
- A consolidated .zip file with all specifications, DTDs, and Schemas is publicly available in the documents section of the OASIS DITA Technical Committee site: http://www.oasis-open.org/committees/download.php/15316/dita10.zip
- The DTDs and Schemas were updated recently with some bug fixes; this version (DITA 1.0.1) is also available in the documents section of the OASIS DITA Technical Committee site: http://www.oasis-open.org/committees/download.php/15396/dita-document-definitions-1.0.1.zip
- A reference implementation toolkit for both the developerWorks and OASIS 1.0 versions of the DITA DTDs/Schemas is available at the DITA Open Toolkit project site on SourceForge: http://dita-ot.sourceforge.net. The DITA Open Toolkit supersedes all previous versions published on developerWorks, the last version of which was commonly called "dita132".
- Read the updated developerWorks article "Introduction to the Darwin Information Typing Architecture" (developerWorks, updated September 2005).
- Define new topic structures by specializing topics in DITA (developerWorks, updated September 2005).
- Darwin Information Typing Architecture (DITA XML): The OASIS Cover Pages are a one-stop shop for everything related to DITA.
- Visit the OASIS DocBook Technical Committee to learn about DocBook.
- See the DocBook project on SourceForge for information on working with DocBook.
Get products and technologies
- Download the DITA Open Toolkit, the open source processing system required to use the attached ZIP file.
Robert D. Anderson is the chief architect for the DITA Open Toolkit, and is also a member of the OASIS DITA Technical Committee. He has worked on IBM's internal Information Development Workbench since 1999, supporting both XML (DITA) and SGML (IBMIDDoc).