Polish the EPUB

Find and correct problems in EPUB files

In EPUB documents, you cannot detect some problems with normal validation methods. As long as the document validates as well-formed XML and follows the EPUB standard, it can appear to be correct but might not read correctly in an e-Reader. Examples include broken paragraphs, bad page numbering, and spelling errors caused by OCR scanning. But you can view and correct errors using two methods: with the EPUB editor Sigil and with PHP in combination with SimpleXML and the Enchant libraries. Regular expressions provide the key to efficient processing.

Colin Beckingham, Writer and Researcher, Freelance

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.



30 August 2011


The EPUB format is an efficient method of presenting documents. Its XML structure ensures that document components are in their place and will be displayed reasonably well on a wide variety of devices. For an introduction to EPUB, see Liza Daly's article in Resources.

Frequently used acronyms

  • GUI: Graphical user interface
  • OCR: Optical character recognition
  • HTML: Hypertext Markup Language
  • WYSIWYG: What you see is what you get
  • XML: Extensible Markup Language

These documents can fail at two levels:

  • At a fundamental level, where the XML markup or the content is broken
  • More subtly—at a level that checking the XML cannot detect

For the former problem, where the EPUB is broken internally, you can use the EpubCheck project (see Resources for a link). The remainder of this article examines the second type of issue, which can be an annoyance for readers.

The tight control that XML enforces goes only so far. XML happily permits a number of errors that, although not sufficiently grave to cause a software fault, nevertheless impede smooth reading. It is easy to see how these errors can happen—if a publisher uses OCR on a printed page to transfer it to text format, then all of the oddities of the printed page are carried over, including errors resulting from font incompatibility. In a commercial situation, editors review the result by hand to produce a polished edition, but where the product is designed for free and open source distribution, publishers cannot absorb these costs as easily. So what ends up in your e-Reader is good but not as good as it might be. Examples include broken paragraphs, blank pages, odd page numbering, and spelling errors.

From a developer's point of view, the challenge is how you can tackle these issues by using the structure of the EPUB. This article looks at how you can use the Sigil EPUB editor to address some of them and employ PHP in combination with SimpleXML and spelling libraries to resolve many others.

Broken paragraphs and blank pages

Take the broken paragraph as an example of a secondary problem. In HTML markup, this problem appears as:

<p>This is where my paragraph begins, hits the end of a physical page here</p>
<div class="newpage" id="page-12"></div>
<p>and then continues from the top of the next physical page, 
     finally coming to an end here.</p>

The scanner has read to the end of a page and inserted a closing paragraph tag, whether or not the paragraph really ends there, so that the page is syntactically complete. It then starts at the top of the next page with a new opening paragraph tag, again whether or not it is appropriate. The result is complete code but incomplete paragraphs, split into orphaned sections. On the e-Reader, the user might see both sections on the same device page with no page marker displayed, but with the two sections separated as if they were independent paragraphs.

Similarly, consider blank pages:

<div class="newpage" id="page-128"></div>
<p></p>
<div class="newpage" id="page-129"></div>

Does page 128 in the snippet above really exist? It might be important to preserve the blank page, but otherwise it is inconvenient to have to turn two pages when only one should be necessary.
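A pattern search can list candidate blank pages for review. The following is a minimal sketch, not part of the original scripts. It assumes that the markers look exactly like the ones in the snippet above and that the chapter markup is already in a string; the function name find_blank_pages is hypothetical.

```php
<?php
// Sketch: report page markers that are followed only by an empty paragraph.
// Assumes markers of the form <div class="newpage" id="page-N"></div>.
function find_blank_pages(string $html): array {
    $pattern = '~<div class="newpage" id="page-(\d+)"></div>\s*'
             . '<p>\s*</p>\s*(?=<div class="newpage")~';
    preg_match_all($pattern, $html, $m);
    // $m[1] holds the captured page numbers as strings
    return array_map('intval', $m[1]);
}

$html = <<<HTML
<div class="newpage" id="page-128"></div>
<p></p>
<div class="newpage" id="page-129"></div>
<p>Text resumes here.</p>
HTML;

print_r(find_blank_pages($html));   // lists page 128 only
```

The function only reports; whether a blank page should be kept is still an editorial decision.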

Spelling errors are a different kind of problem: instead of looking for complex patterns, you compare the words in the text against a list of known words. You deal with this problem separately, using scripting methods.


Sigil

Sigil (see Resources for the website and support pages) is a WYSIWYG EPUB editor that can find the pattern-matching types of errors and allow programmers to correct them. See the Regular expressions sidebar for a quick introduction to regular expressions, and see Resources for more detailed information.

Regular expressions

Regular expressions provide a powerful way to search and replace text using pattern-matching techniques. The syntax is concise, so you need to exercise care to avoid unwanted effects.

An example of a regular expression is [^.]</p>, which searches for an end of paragraph tag that is not preceded by a period. This might or might not be a problem.

In this regular expression, the square brackets ([]) enclose a set of characters, exactly one of which is matched at that position. A caret (^) as the first character inside the brackets negates the set, so that any character not listed matches. The period (.) inside the brackets stands for itself, as do the symbols outside the brackets.

See Resources for a more in-depth discussion of this useful tool.

Sigil might not be available from your Linux® repository, but it is available as a precompiled binary or as source files. Once in the GUI, click File > Open to open your EPUB directly. Doing so extracts the EPUB and displays a directory of the component files on the left; it reveals a browser pane on the right in which you can display the contents of individual files either as you might view them in the e-Reader or as the marked-up code. This latter point is an essential feature in finding and correcting problems.

Choose one of the HTML files that your EPUB contains, and double-click it to open it in the browser window. Then, click View > Code View to display the code behind the file. All the tags should now be visible.

Suppose that you want to find orphaned paragraph chunks. You are looking for </p> end-of-paragraph tags that are not preceded by a normal end-of-sentence character, the most common of which is the period. Sigil provides a search function (Edit > Find), and the normal search mode lets you find strings like .</p>, but it cannot find an end of paragraph that does not have a period before it. For this, you need the regular expression search mode, which appears when you click More. Navigate to the top of the code in the browser window, then perform these steps:

  1. Select Down for the direction.
  2. Select Regular expression for the search mode.
  3. Type [^.]</p> as your Find what string.
  4. Click Find Next.

This process should find what you are seeking, if it exists. If there are no hits, you might want to create one temporarily just to check that the search function works.

After using this technique for a while, you find that paragraphs can legitimately end with characters other than periods. Double quotation marks ("), exclamation marks (!), question marks (?), and perhaps some other characters also mark a complete sentence. Regular expressions allow for this easily. Because the square brackets enclose a set, if you change the Find what string to [^.?!"]</p>, the search accepts as normal any paragraph that ends with a period, question mark, exclamation mark, or double quotation mark and flags anything else as a possible error.
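The same expression works outside Sigil. As a rough sketch, assuming the chapter markup is already in a string, PHP's preg_match_all can list every suspicious paragraph ending together with a little preceding text for context. The function name suspicious_endings and the 20-character context width are arbitrary choices made for this example.

```php
<?php
// Sketch: list paragraph endings that lack normal closing punctuation.
function suspicious_endings(string $html): array {
    // Same character set as the Sigil search, preceded by up to
    // 20 characters of non-tag text so each hit is easier to recognize.
    preg_match_all('~[^<>]{0,20}[^.?!"]</p>~', $html, $m);
    return $m[0];
}

$html = '<p>This sentence is complete.</p>'
      . '<p>This one hits the end of a page</p>'
      . '<p>"Really?"</p>';

foreach (suspicious_endings($html) as $hit) {
    echo $hit, "\n";   // only the second paragraph is reported
}
```

As in Sigil, a hit is only a candidate: a heading or a list item can properly end without punctuation.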

Another telltale sign of a broken paragraph is one that begins with <p> followed by a lowercase letter. The regular expression for this is <p>[a-z]. (here the final period is part of the expression and matches any single character). A related search is <p>[0-9]. (again with the period included), which finds paragraphs that begin with a digit. Such a paragraph can be legitimate, but it often means that the scanner has picked up a printed page number that is no longer relevant in an e-Reader.

How you decide to fix one of these errors is another matter. If a page marker separates the two pieces, you might move the marker to before or after the true paragraph and rejoin the two pieces to make one single paragraph. The page numbering is then approximately but not perfectly accurate.
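One way to automate this repair is sketched below. It is an illustration under stated assumptions, not the article's method: it works on the raw markup with a regular expression, handles only paragraphs that contain no nested tags, treats a split as real only when the first piece lacks closing punctuation and the second piece starts with a lowercase letter, and moves the marker in front of the rejoined paragraph. The function name rejoin_split_paragraphs is hypothetical.

```php
<?php
// Sketch: rejoin a paragraph split by a page marker and
// move the marker in front of the rejoined paragraph.
function rejoin_split_paragraphs(string $html): string {
    $pattern = '~<p>([^<]*[^.?!"<\s])\s*</p>\s*'                  // piece 1, no end punctuation
             . '(<div class="newpage" id="page-\d+"></div>)\s*'   // the page marker
             . '<p>\s*([a-z][^<]*)</p>~';                         // piece 2, starts lowercase
    return preg_replace($pattern, '$2' . "\n" . '<p>$1 $3</p>', $html);
}

$html = <<<HTML
<p>This is where my paragraph begins, hits the end of a physical page here</p>
<div class="newpage" id="page-12"></div>
<p>and then continues from the top of the next physical page.</p>
HTML;

echo rejoin_split_paragraphs($html), "\n";
```

Run a change like this on a copy of the file and review the result in Sigil, because a regular expression cannot tell a true split from a deliberately unpunctuated line.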

Searching for page markers is a similar process. With the regular expression option selected and a Find what string of page-[0-9]+, the editor searches for any string that begins with the literal characters p, a, g, e, and a hyphen, followed by one or more digits from zero to nine.
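In a script, the same expression with a capturing group pulls out the numbers themselves. This short sketch assumes the markup is in a string; page_numbers is a hypothetical helper name.

```php
<?php
// Sketch: collect the numeric part of every page marker id in a chunk of markup.
function page_numbers(string $html): array {
    // The parentheses capture just the digits that follow "page-".
    preg_match_all('/page-([0-9]+)/', $html, $m);
    return array_map('intval', $m[1]);
}

$html = '<div class="newpage" id="page-12"></div><p>text</p>'
      . '<div class="newpage" id="page-13"></div>';

print_r(page_numbers($html));   // the integers 12 and 13
```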

An interesting break that you can find easily is one where a word, paragraph, and page are all broken at the same time. The print version indicates the break with a hyphen or dash, which is easily visible and searched for in code view:

<p>This is where my paragraph begins, hits the end of a phys-</p>
<div class="newpage" id="page-12"></div>
<p>ical page and then continues from the top of the next physical page, 
     finally coming to an end here.</p>

In this case, a global normal search using the Find what string of -</p> should pick them out quite quickly.
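For a whole book, it can help to list these breaks together with the word that results from joining the two halves, so that you can decide case by case whether the hyphen belongs to the word. The sketch below assumes the marker format shown above and that the second half of the word starts the next paragraph in lowercase; hyphen_breaks is a hypothetical name.

```php
<?php
// Sketch: find words split by a hyphen across a page marker and
// report the page number together with the rejoined word.
function hyphen_breaks(string $html): array {
    $pattern = '~([A-Za-z]+)-</p>\s*'
             . '<div class="newpage" id="page-(\d+)"></div>\s*'
             . '<p>([a-z]+)~';
    preg_match_all($pattern, $html, $m, PREG_SET_ORDER);
    $found = [];
    foreach ($m as $hit) {
        // $hit[1] and $hit[3] are the two halves of the word
        $found[] = ['page' => (int) $hit[2], 'word' => $hit[1] . $hit[3]];
    }
    return $found;
}

$html = <<<HTML
<p>This is where my paragraph begins, hits the end of a phys-</p>
<div class="newpage" id="page-12"></div>
<p>ical page and then continues.</p>
HTML;

print_r(hyphen_breaks($html));   // one entry: page 12, word "physical"
```

A rejoined word that fails a spell check may be a true compound such as a hyphenated adjective, in which case the hyphen should stay.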


Review page numbers

Although you can use Sigil to find and review page breaks and numbering, doing so in a document of more than 100 pages might be tedious. An easier way is to iterate through the documents with PHP and review the numbering.

The script in Listing 1 finds and reviews the HTML pages and runs through the page breaks. It finds the number for the first page, which is quite often different from page 1, and verifies that each subsequent page is an increment from the first page. Although the page numbering test is fairly simple, it is an example of how to use the OPF file to find and examine the component HTML.

Listing 1. Page checking the EPUB with PHP and SimpleXML
<?php
/* epub is a zipped package containing many files
  the file "content.opf" contains the pointers to the constituent files
  inside content.opf you have 

  package (root)
    -> manifest
      -> item
          which we need to filter for media-type="application/xhtml+xml"
          and to check these are real text pages, not just full page images

  these are the text chapters which need to be checked one by one
*/
$firstpage = 0;
$oldpage = 0;
// look for the text to be checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
  //cleanup();
  die("Cannot find the OPF file\n");
} else {
  echo "Found it!\n";
  $xml = simplexml_load_file($opf_file);
  // get the manifest items
  foreach ($xml->manifest->item as $mi) {
    if ($mi['media-type']=='application/xhtml+xml') {
      echo "Found ".$mi['href']."\n";
      if (substr($mi['href'],0,4) == 'part') {
          echo "Page number check in document ".$mi['href']."\n";
          echo scan_chap("./OEBPS/".$mi['href']);
      }
    }
  }
}
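// Check one chapter file: each page marker should be one more than the last.
function scan_chap($chap) {
  global $firstpage, $oldpage;
  echo "Trying to page num check section $chap \n";
  if (!file_exists($chap)) {
    echo "Cannot find the chapter $chap\n";
  } else {
    echo "Found it!\n";
    $xml = simplexml_load_file($chap);
    // in this particular EPUB the page markers sit at body > div > div
    foreach ($xml->body->div->div as $pagnumdiv) {
      if ($pagnumdiv["class"]=='newpage') {
        echo $pagnumdiv["id"]."\n";
        // ids look like "page-12"; the number starts at offset 5
        $page = (int) substr((string) $pagnumdiv["id"],5);
        if ($firstpage == 0) {
          // let the book say where the numbering starts
          $firstpage = $oldpage = $page;
        } else {
          if ($page != $oldpage+1) echo "Problem at page after $oldpage\n";
          // resynchronize on the page just seen so that one gap is reported once
          $oldpage = $page;
        }
      }
    }
  }
  return "Done...\n";
}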
?>

The code first sets up global variables for the number of the first logical page found (set once at the beginning of the loop) and the number of the previous page checked (that changes with each iteration). It then declares the name of the OPF file, looks for that file, and—if it cannot find it—ends with an error. If the file is found, the script opens the file as an XML object and looks for the names of the files mentioned in the manifest that appear to be HTML using the media-type attribute. In this particular EPUB document, some HTML files contain only a full-page image and therefore can be ignored. The file names of these pages contain the string leaf; the other files that contain extended text have a part label. The code filters these out using substrings.

Now that you know the name of the file, you can read it into its own SimpleXML object. Iterating through the <div> tags and filtering for those that have a class attribute of newpage, you can find the value of the id attribute that contains the page number. You need to let the book tell you which number is the first page because this is often not page 1. After this value is stored in the global $firstpage variable, you can predict what the number of the next page should be. If it is not the expected number, the script reports the problem and continues checking.
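Listing 1 expects the markers at one fixed depth in the document. If your EPUB nests them differently, an XPath query can find them at any depth. The following is a sketch under the assumption that the chapter files declare the standard XHTML namespace; without that declaration the query returns nothing. The function name marker_ids and the prefix h are arbitrary.

```php
<?php
// Sketch: find page markers at any depth with XPath.
function marker_ids(string $xhtml): array {
    $xml = simplexml_load_string($xhtml);
    if ($xml === false) {
        return [];   // not well-formed XML
    }
    // XHTML elements live in a namespace, so XPath needs a prefix for it.
    $xml->registerXPathNamespace('h', 'http://www.w3.org/1999/xhtml');
    $ids = [];
    foreach ($xml->xpath('//h:div[@class="newpage"]') as $div) {
        $ids[] = (string) $div['id'];
    }
    return $ids;
}

$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><div>'
       . '<div class="newpage" id="page-7"></div><p>One</p>'
       . '<blockquote><div class="newpage" id="page-8"></div></blockquote>'
       . '</div></body></html>';

print_r(marker_ids($xhtml));   // page-7 and page-8, in document order
```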

This script does not attempt to make changes to the text. It merely flags what it thinks might need your attention.


Spell checking using PHP, XML, and Enchant

Spelling is a different problem. In this case, you are really after cases such as Upon, which the OCR has read as TJpon or IJpon: close, but not correct. The misreading might come in as any of a number of alternatives, and the spelling routine can find it so strange that the suggestions it offers are neither close nor helpful.
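When the dictionary's own suggestions are unhelpful, a small table of known OCR misreadings can supply better candidates. The sketch below is an illustration only. The table holds just the two misreadings mentioned above and is meant to be extended with those you see in your own scans, and the $isWord callback is a stand-in for a real dictionary lookup.

```php
<?php
// Sketch: propose corrections from a table of known OCR misreadings.
// The table and the $isWord callback are illustrative assumptions.
function ocr_candidates(string $word, callable $isWord): array {
    $confusions = ['TJ' => 'U', 'IJ' => 'U'];
    $found = [];
    foreach ($confusions as $bad => $good) {
        if (strpos($word, $bad) === false) {
            continue;   // this misreading does not occur in the word
        }
        $candidate = str_replace($bad, $good, $word);
        if ($isWord($candidate)) {
            $found[] = $candidate;
        }
    }
    return $found;
}

// A tiny word list stands in for the dictionary in this example.
$known  = ['Upon', 'Under'];
$isWord = fn(string $w): bool => in_array($w, $known, true);

print_r(ocr_candidates('TJpon', $isWord));   // suggests Upon
```

With Enchant loaded, the callback could wrap enchant_dict_check so that candidates are tested against the real dictionary.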

A spelling routine examines words one by one and compares them to a standard known list, pointing out those that don't match, making suggestions, and allowing you to make changes. Sigil can make replacements of specific strings across multiple documents in the EPUB package, but you need the power of a scripting engine such as PHP, Perl, Python, and so on, together with specialist libraries, for finer-grained control.

Newer versions of PHP now contain the hooks necessary not only for digging into XML and HTML files using SimpleXML but also for using the Enchant spelling manager library. Enchant is capable of managing multiple different base spelling lists. It helps to differentiate UK English from US English spellings, for example.

The script in Listing 2 iterates through the HTML component files named in the manifest using the same method as in Listing 1. This time, it goes through each file paragraph by paragraph and word by word, checking each word against the known spelling list, and it adds the instructions required to access the dictionaries.

Listing 2. Spell checking the EPUB with PHP, SimpleXML, and Enchant
<?php
  // spell check an epub
/* epub is a zipped package containing many files
  the file "content.opf" contains the pointers to the constituent files
  inside content.opf we have 

  package (root)
    -> manifest
      -> item
          which we need to filter for media-type="application/xhtml+xml"
          and to check these are real text pages, not just full page images

  these are the text chapters that need to be checked one by one

  Acknowledgment: Some of the dictionary-related code
  was copied from the PHP Enchant manual page

*/
// set up console for input
$console = fopen("php://stdin","r");
// set up enchant (from PHP manual)
$tag = 'en_CA';
$r = enchant_broker_init();
$bprovides = enchant_broker_describe($r);
echo "Current broker provides the following backend(s):\n";
print_r($bprovides);
$dicts = enchant_broker_list_dicts($r);
print_r($dicts);
if (enchant_broker_dict_exists($r,$tag)) {
    $d = enchant_broker_request_dict($r, $tag);
    $dprovides = enchant_dict_describe($d);
    echo "dictionary $tag provides:\n";
    // show the description that the line above announces
    print_r($dprovides);
} else {
  cleanup();
  die ("Cannot set up the spell checker\n");
}
// look for the text to be checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
  cleanup();
  die("Cannot find the OPF file\n");
} else {
  echo "Found it!\n";
  $xml = simplexml_load_file($opf_file);
  foreach ($xml->manifest->item as $mi) {
    if ($mi['media-type']=='application/xhtml+xml') {
      echo "Found ".$mi['href']."\n";
      if (substr($mi['href'],0,4) == 'part') {
          echo "Need to spell check ".$mi['href']."\n";
          echo scan_chap("./OEBPS/".$mi['href']);
      }
    }
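  }
  // release the dictionary and broker once every chapter has been checked
  cleanup();
}
// Free the Enchant resources. Safe to call even if the dictionary was never
// loaded. On PHP 8.0 and later these two functions are deprecated; unsetting
// $d and $r has the same effect.
function cleanup() {
global $d, $r;
  if (isset($d)) enchant_broker_free_dict($d);
  if (isset($r)) enchant_broker_free($r);
}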
function scan_chap($chap) {
  echo "Trying to spell check section $chap \n";
  if (!file_exists($chap)) {
    echo "Cannot find the chapter $chap\n";
  } else {
    echo "Found it!\n";
    $xml = simplexml_load_file($chap);
    $i = 0;
    foreach ($xml->body->div->p as $para) {
      echo $para."\n";
      // need to spell check the contents of $para
      spell_check(trim($para));
      $i++;
      // trial-run limit: stop after six paragraphs; remove to check the whole chapter
      if ($i > 5) break;
    }
  }
  return "Done...\n";
}
function spell_check($para) {
global $console, $d;
  // collapse runs of whitespace (including line breaks) and drop periods
  $para = preg_replace('/\s+/'," ",$para);
  $para = str_replace(".","",$para);
  $para = $para." ";
  echo "Checking text : $para\n";
  $start = 0;
  // walk the paragraph one space-delimited word at a time
  while (($pos = strpos($para," ",$start)) !== false) {
    echo "Found $pos\n";
    $len = $pos-$start;
    $theword = substr($para,$start,$len);
    // tidy up theword which may contain punctuation
    $punc = array(':',';',',','"','?','!');
    $theword = str_replace($punc,"",$theword);
    if ((strlen($theword) > 0) and (!is_numeric($theword))) {
      if (enchant_dict_check($d, $theword)) {
          echo "$theword is OK!\n";
      } else {
          $suggs = enchant_dict_suggest($d, $theword);
          echo "Suggestions for <$theword>:\n";
          // show only the first five suggestions
          foreach (array_slice((array) $suggs,0,5) as $k=>$sugg) {
            echo "$k => $sugg\n";
          }
          // pause until the user presses Enter
          $inp = fgets($console,1024);
      }
    }
    // continue just past the space that ended this word
    $start = $pos+1;
  }
}
?>

In this code, you start by declaring a file pointer to standard input so that you can get interactive information from the keyboard during the spell-check process. The next section establishes the connection to the dictionaries. Note that the tag variable is set to en_CA, which, in this instance, puts a preference on Canadian English. The result is that the checker chooses colour over color, acknowledgement over acknowledgment, and so on. A more standard setting for the tag is en_US. After the dictionary is connected, the script performs the same search for HTML text files as in Listing 1, but this time, instead of looking for page number <div> tags, it looks for paragraphs with real text.

Before performing the actual spell check, the script cleans up the paragraph text to make it more manageable by collapsing extra spaces and removing periods, because the goal is to examine the text word by word. The spell checking then moves from word to word in the paragraph, strips remaining punctuation such as commas from each word, ignores words that are numbers, and compares the word to the dictionary. Where the dictionary does not contain the word, the script suggests words that might be a better substitute, presenting only the first few alternates. The script halts at each problem word and waits for user input from the keyboard. At this point, you can add code to change, ignore once, ignore for the session, and so on.
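The two steps described above, splitting a paragraph into words and acting on the reviewer's reply, might be sketched as follows. This is one possible design, not the article's code. The reply conventions are assumptions made for this example: an empty line ignores once, ! ignores for the session, a number picks a suggestion, and anything else is a typed replacement. The function names words_in and apply_choice are hypothetical.

```php
<?php
// Sketch: split a paragraph into checkable words.
function words_in(string $para): array {
    // Split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", $para, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    // Drop pure numbers, as Listing 2 does.
    return array_values(array_filter($words, fn($w) => !is_numeric($w)));
}

// Sketch: decide what to keep for one flagged word.
// $ignored collects words that the reviewer skipped for the whole session.
function apply_choice(string $reply, string $word, array $suggestions, array &$ignored): string {
    $reply = trim($reply);
    if ($reply === '') {
        return $word;                          // ignore once
    }
    if ($reply === '!') {
        $ignored[$word] = true;                // ignore for the session
        return $word;
    }
    if (ctype_digit($reply) && isset($suggestions[(int) $reply])) {
        return $suggestions[(int) $reply];     // take the numbered suggestion
    }
    return $reply;                             // typed replacement
}

$ignored = [];
print_r(words_in("It's 1911, and the TJpon  page."));
echo apply_choice("0\n", 'TJpon', ['Upon', 'Tampon'], $ignored), "\n";   // Upon
```

Because the splitter treats the hyphen as a separator, a word such as e-Reader is checked as two words. With Enchant loaded, the session list could instead be handed to enchant_dict_add_to_session so that the dictionary itself stops flagging the word.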


Conclusion

Both Sigil and PHP scripting with XML and spelling libraries are helpful tools in finding and fixing errors that cannot be detected using normal EPUB checking routines. Whether these secondary errors are truly errors or just minor cosmetic inconveniences depends on the context in which you are using the document and the ability of the hardware reader and its own software to resolve these issues on the fly.

Resources

Learn

  • Build a digital book with EPUB (Liza Daly, developerWorks, updated January 2011, published November 2008): Read an introduction to EPUB and a list of EPUB resources.
  • Know your regular expressions (Michael Stutz, developerWorks, June 2007): Check out this introduction to regular expressions on UNIX® systems. Discover the available tools and techniques that can help you learn how to construct regular expressions for various programs and languages.
  • More articles by this author (Colin Beckingham, developerWorks, March 2009-current): Read articles about XML, voice recognition, XHTML, PHP, SMIL, and other technologies.
  • New to XML? Get the resources you need to learn XML.
  • XML area on developerWorks: Find the resources you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
  • developerWorks technical events and webcasts: Stay current with technology in these sessions.
  • developerWorks on Twitter: Join today to follow developerWorks tweets.
  • developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
  • developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.

Get products and technologies

  • Sigil: Explore this multi-platform WYSIWYG ebook editor, designed to edit books in EPUB format.
  • Enchant: Learn about spell checking with this wrapper that provides uniformity and conformity on top of several libraries.
  • EpubCheck project: Check out this useful tool to validate IDPF EPUB files. It can detect many types of errors in EPUB.
  • IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
