Build a digital book with EPUB

The open XML-based eBook format

Liza Daly, Software Engineer and Owner, Threepress Consulting Inc.
Liza Daly is a software engineer who specializes in applications for the publishing industry. She has been the lead developer on major online products for Oxford University Press, O'Reilly Media, and other publishers. Currently she is an independent consultant and the founder of Threepress, an open source project developing ebook applications.

Summary:  Need to distribute documentation, create an eBook, or just archive your favorite blog posts? EPUB is an open specification for digital books based on familiar technologies like XML, CSS, and XHTML, and EPUB files can be read on portable e-ink devices, mobile phones, and desktop computers. This tutorial explains the EPUB format in detail, demonstrates EPUB validation using Java technology, and moves step-by-step through automating EPUB creation using DocBook and Python.

05 Feb 2009 - As a followup to reader comments, the author revised the content of Listing 3 and refreshed the epub-raw-files.zip file (see Downloads).

27 Apr 2010 - Refreshed the epub-raw-files.zip file (see Downloads).

03 Jun 2010 - At author request, revised the content of Listings 3 and 8. Also refreshed the epub-raw-files.zip file (see Downloads).

11 Jan 2011 - At author request, revised the content of Listing 5. Changed second line of code from <item id="ncx" href="toc.ncx" media-type="text/xml"/> to <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>.

12 Jul 2011 - As a followup to reader comments, revised the content of Listing 14. Removed ` character near end of first line of code from <?xml version="1.0" encoding="utf-8"?`>. Revised code now reads: <?xml version="1.0" encoding="utf-8"?>.

Date:  13 Jul 2011 (Published 25 Nov 2008)
Level:  Intermediate


From DocBook to EPUB

DocBook is a popular choice for developers who need to maintain long-form technical documentation. Unlike the files produced by traditional word-processing programs, DocBook files are plain text, so you can manage them with text-based version-control systems. Because DocBook is XML, you can easily transform it into multiple output formats. Since the summer of 2008, the official DocBook XSL project has included native support for EPUB as an output format.

Running the basic DocBook-to-EPUB pipeline with XSLT

Start with a simple DocBook document, shown in Listing 14. This document is defined as type book and includes a preface, two chapters, and an inline image displayed on the title page. The image must be in the same directory as the DocBook source file. Create this file and the title page image yourself, or download samples from Downloads.


Listing 14. A simple DocBook book

<?xml version="1.0" encoding="utf-8"?>  
<book>
  <bookinfo>
    <title>My EPUB book</title>
    <author><firstname>Liza</firstname>
            <surname>Daly</surname></author>
    <volumenum>1234</volumenum>
  </bookinfo>
  <preface id="preface">  
    <title>Title page</title>
    <figure id="cover-image">
      <title>Our EPUB cover image icon</title>
      <graphic fileref="cover.png"/>
    </figure>
  </preface>
  <chapter id="chapter1"> 
    <title>This is a pretty simple DocBook example</title>
    <para>
      Not much to see here. 
    </para>
  </chapter>
  <chapter id="end-notes"> 
    <title>End notes</title>
    <para>
      This space intentionally left blank.
    </para>
  </chapter>
</book>

Next, see Resources to download the latest version of the DocBook XSL stylesheets, and make sure that you have an XSLT processor such as xsltproc or Saxon installed. This example uses xsltproc, which is available on most UNIX-like systems. To convert the DocBook file, just run that file against the EPUB module included in DocBook XSL, as in Listing 15.


Listing 15. Converting DocBook into EPUB

$ xsltproc /path/to/docbook-xsl-1.74.0/epub/docbook.xsl docbook.xml
Writing OEBPS/bk01-toc.html for book
Writing OEBPS/pr01.html for preface(preface)
Writing OEBPS/ch01.html for chapter(chapter1)
Writing OEBPS/ch02.html for chapter(end-notes)
Writing OEBPS/index.html for book
Writing OEBPS/toc.ncx
Writing OEBPS/content.opf
Writing META-INF/container.xml
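
As a quick sanity check on the output above, the following sketch reads META-INF/container.xml to find the package file that the stylesheet wrote. The function name find_opf_path is hypothetical, and the sketch assumes the conversion was run in the directory passed as root. It uses only the Python standard library.

```python
import os
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the OCF container format
CONTAINER_NS = {'c': 'urn:oasis:names:tc:opendocument:xmlns:container'}

def find_opf_path(root='.'):
    """Return the path of the OPF package file named in container.xml.

    Hypothetical helper: `root` is assumed to be the directory in which
    the DocBook XSL wrote its META-INF and OEBPS folders.
    """
    container = ET.parse(os.path.join(root, 'META-INF', 'container.xml'))
    rootfile = container.find('c:rootfiles/c:rootfile', CONTAINER_NS)
    if rootfile is None:
        raise ValueError('container.xml does not list a rootfile')
    # full-path always uses forward slashes, whatever the host platform
    return os.path.join(root, *rootfile.get('full-path').split('/'))
```

For the run in Listing 15, this should point at OEBPS/content.opf under the working directory.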

Customizing DocBook XSL

The DocBook-to-EPUB conversion pipeline is still relatively new, and you might need to customize the XSLT to get the desired output.

Next, add the mimetype file and build the epub+zip archive yourself. Listing 16 shows those three quick commands and the result of a pass through the EpubCheck validator.


Listing 16. Creating the EPUB archive from DocBook

$ echo "application/epub+zip" > mimetype
$ zip -0Xq  my-book.epub mimetype
$ zip -Xr9D my-book.epub *
$ java -jar epubcheck.jar my-book.epub 
No errors or warnings detected

Pretty easy! Figure 3 shows your creation in ADE.


Figure 3. Converted DocBook EPUB in ADE
DocBook EPUB displayed in ADE

Automatic DocBook-to-EPUB conversion with Python and lxml

The DocBook XSL goes a long way toward making EPUB generation painless, but you must perform a few steps outside XSLT. This last section demonstrates a sample Python program that completes the creation of a valid EPUB bundle. I show individual methods in the tutorial; you can get the complete docbook2epub.py program from Downloads.

Several Python XSLT libraries are available, but my preference is lxml. It provides not just XSLT 1.0 functionality but also high-performance parsing, full XPath 1.0 support, and special extensions for handling HTML. If you prefer a different library or use a different programming language than Python, these examples should be easy to adapt.

Calling the DocBook XSL with lxml

The most efficient method to call XSLT using lxml is to parse the XSLT in advance, then create a transformer for repeated use. This is useful, as my DocBook-to-EPUB script accepts multiple DocBook files to convert. Listing 17 demonstrates this approach.


Listing 17. Running the DocBook XSL using lxml

import os.path
from lxml import etree

def convert_docbook(docbook_file):
    docbook_xsl = os.path.abspath('docbook-xsl/epub/docbook.xsl')
    # Give the XSLT processor the ability to create new directories
    xslt_ac = etree.XSLTAccessControl(read_file=True, 
                                      write_file=True, 
                                      create_dir=True, 
                                      read_network=True, 
                                      write_network=False)
    transform = etree.XSLT(etree.parse(docbook_xsl), access_control=xslt_ac)
    transform(etree.parse(docbook_file))

The EPUB module in DocBook XSL creates the output files itself, so nothing is returned from the evaluation of the transform here. Instead, DocBook creates two folders (META-INF and OEBPS) in the current working directory that contain the results of the conversion.
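
To make the "parse once, reuse many times" idea concrete, here is a minimal sketch that builds the transformer a single time and applies it to several input files. The names make_transform and convert_all are hypothetical, and the stylesheet path is whatever you pass in; this is one possible arrangement, not necessarily how docbook2epub.py is organized.

```python
from lxml import etree

def make_transform(xsl_path):
    """Parse the stylesheet once and return a reusable transformer."""
    # Same permissions as Listing 17: the stylesheet may write files
    # and create directories, but may not write to the network
    access = etree.XSLTAccessControl(read_file=True,
                                     write_file=True,
                                     create_dir=True,
                                     read_network=True,
                                     write_network=False)
    return etree.XSLT(etree.parse(xsl_path), access_control=access)

def convert_all(xsl_path, docbook_files):
    """Apply one parsed stylesheet to every file in docbook_files."""
    transform = make_transform(xsl_path)   # parsed a single time
    results = []
    for docbook_file in docbook_files:
        results.append(transform(etree.parse(docbook_file)))
    return results
```

Because the EPUB stylesheet writes into META-INF and OEBPS in the current working directory, a real script would need to bundle or move that output after each file before converting the next one.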

Copying the images and other resources into the archive

DocBook XSL does nothing about any images that you might supply for use with your document; it only creates the metadata files and the rendered XHTML. Because the EPUB specification requires that all resources be listed in the content.opf manifest, you can inspect the manifest to find any images that were referenced in the original DocBook file. Listing 18 shows this technique, which assumes that the path variable contains the path to your EPUB-in-progress, as created by the DocBook XSLT.


Listing 18. Parse the OPF content file to find any missing resources

import os.path, shutil
from lxml import etree

def find_resources(path='/path/to/our/epub/directory'):
    opf = etree.parse(os.path.join(path, 'OEBPS', 'content.opf'))

    # All the opf:item elements are resources
    for item in opf.xpath('//opf:item', 
                          namespaces= { 'opf': 'http://www.idpf.org/2007/opf' }):

        # If the resource was not already created by DocBook XSL itself, 
        # copy it into the OEBPS folder
        href = item.attrib['href']
        referenced_file = os.path.join(path, 'OEBPS', href)
        if not os.path.exists(referenced_file):
            shutil.copy(href, os.path.join(path, 'OEBPS'))
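
Before copying anything, it can help to see which manifest entries are actually missing. The following sketch is a read-only variant of the same idea, written against the standard library's ElementTree rather than lxml. The function name list_missing_resources is hypothetical, and path is assumed to be the EPUB-in-progress directory, as in Listing 18.

```python
import os
import xml.etree.ElementTree as ET

# Same OPF namespace as used in the XPath query of Listing 18
OPF_NS = {'opf': 'http://www.idpf.org/2007/opf'}

def list_missing_resources(path):
    """Return the manifest hrefs that have no matching file in OEBPS."""
    oebps = os.path.join(path, 'OEBPS')
    opf = ET.parse(os.path.join(oebps, 'content.opf'))
    missing = []
    for item in opf.iterfind('.//opf:item', OPF_NS):
        href = item.get('href')
        # Anything the stylesheet did not write itself still needs copying
        if not os.path.exists(os.path.join(oebps, href)):
            missing.append(href)
    return missing
```

For the sample book in Listing 14, you would expect the cover image to show up in this list until it is copied into place.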

Creating the mimetype file automatically

DocBook XSL won't create your mimetype file, either, but a quick bit of code from Listing 19 can take care of that.


Listing 19. Create the mimetype file

def create_mimetype(path='/path/to/our/epub/directory'):
    mimetype_path = '%s/%s' % (path, 'mimetype')
    with open(mimetype_path, 'w') as f:
        # Be careful not to add a newline here
        f.write('application/epub+zip')

Creating the EPUB bundle with Python

All that's left now is to bundle the files into a valid EPUB ZIP archive. This takes two steps: adding the mimetype file as the first in the archive with no compression, then adding the remaining directories. Listing 20 shows the code for this process.


Listing 20. Using the Python zipfile module to create an EPUB bundle

import zipfile, os

def create_archive(path='/path/to/our/epub/directory'):
    '''Create the ZIP archive.  The mimetype must be the first file in the archive 
    and it must not be compressed.'''

    epub_name = '%s.epub' % os.path.basename(path)

    # The EPUB must contain the META-INF and mimetype files at the root, so 
    # we'll create the archive in the working directory first and move it later
    os.chdir(path)    

    # Open a new zipfile for writing
    epub = zipfile.ZipFile(epub_name, 'w')

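    # Add the mimetype file first and set it to be uncompressed
    # ('mimetype' is the file name written by create_mimetype in Listing 19)
    epub.write('mimetype', compress_type=zipfile.ZIP_STORED)
    
    # For the remaining paths in the EPUB (the META-INF and OEBPS
    # directories), add all of their files using normal ZIP compression.
    # Plain files, such as mimetype and the archive itself, are skipped.
    for p in os.listdir('.'):
        if not os.path.isdir(p):
            continue
        for f in os.listdir(p):
            epub.write(os.path.join(p, f), compress_type=zipfile.ZIP_DEFLATED)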
    epub.close()

That's it! Remember to validate.
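
Full validation is a job for EpubCheck, but the container rules described above are easy to spot-check from Python. The sketch below is a rough pre-flight test, not a replacement for a validator. The function name check_epub_container and its messages are hypothetical.

```python
import zipfile

def check_epub_container(epub_path):
    """Return a list of problems found; an empty list means the
    basic container checks passed."""
    problems = []
    with zipfile.ZipFile(epub_path) as epub:
        entries = epub.infolist()
        # The mimetype file must be the first entry in the archive
        if not entries or entries[0].filename != 'mimetype':
            problems.append('mimetype is not the first entry')
            return problems
        first = entries[0]
        # It must be stored, not compressed
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append('mimetype is compressed')
        # Its content must be exactly this string, with no newline
        if epub.read(first) != b'application/epub+zip':
            problems.append('mimetype has unexpected content')
        if 'META-INF/container.xml' not in epub.namelist():
            problems.append('META-INF/container.xml is missing')
    return problems
```

Calling check_epub_container('my-book.epub') on an archive built as in Listing 20 should return an empty list. Anything else is worth fixing before you run the full validator.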
