My first Web-based filtering proxy

Converting text to HTML using Txt2Html


In the course of writing articles in this developerWorks series, I faced a quandary about the best format to write in. Word processor formats are proprietary, and conversion between formats tends to be imperfect and troublesome (and such formats bind one to proprietary tools, contrary to an open source spirit). HTML is fairly neutral -- and is probably the form you are reading this article in -- but it also adds tags that are easy to mistype (or which commit one to an HTML-enhanced editor). DocBook is an interesting XML format that can be converted to many target formats, and which has the right semantics for technical articles (or books); but like HTML, there are lots of tags to worry about during the writing process. LaTeX is great for sophisticated typography; but it also has lots of tags, and these articles don't need typographic sophistication.

For real ease of composition -- and especially for platform and tool neutrality -- plain ASCII just cannot be beat. Beyond completely plain text, however, the Internet (especially Usenet) has prompted the development of an informal standard of "smart ASCII" documents. "Smart ASCII" adds just a little bit of extra semantic content and context in ways that look "natural" in text displays. E-mails, newsgroup posts, FAQs, project READMEs, and other electronic documents often include a few typographic/semantic elements like asterisks around emphasized words, underscores surrounding titles, vertical and horizontal whitespace to describe textual relations, selective ALLCAPS, and a few other tidbits. Project Gutenberg is a wonderful effort that put quite a bit of thought into its own consideration of formats, and decided on "smart ASCII" as the best choice for preserving and distributing great books for a long time. Even if these articles won't live as such literary classics, the decision was made to write them as "smart ASCII", and automate any conversions to other formats with handy Python scripts.

Introduction to Txt2Html

Txt2Html started out as a simple file converter, as the name suggests. But the Internet suggested several obvious enhancements to the tool. Since many of the documents one might want to view in an "html-ized" form live somewhere at the end of http: or ftp: links, the tool should really handle such remote documents straightforwardly (without the need for a download/convert/view cycle). And since the target of the conversion is HTML after all, what we would generally want to do with the target is view it in a Web browser.

Putting these things together, Txt2Html emerged as a "Web-based filtering proxy." Fancy words there, maybe even "fully buzzword compliant." They amount to the idea that a program might read a Web page (or other resource) on your behalf, massage the contents in some way, then present you with something that is better than the original page (at least for some particular purpose). A good example of such a tool is the Babelfish translation service (see Related topics). After running a URL through Babelfish, you see a Web page that looks pretty much like the original one, but has the pleasant feature of having words in a language you can read instead of in a language you do not understand. In a way, all the search engines that present little synopses of the pages they find for a search do the same thing. But those search engines (by design) take a lot more liberty with the formatting and appearance of a target page, and leave out a lot more. Txt2Html is certainly a lot less ambitious than Babelfish is; but conceptually, both do largely the same thing. See Related topics for more examples, some rather humorous.

Best of all, Txt2Html uses a number of programming techniques that are common to a lot of different Web-oriented uses of Python. This article will introduce those techniques, and give some pointers on coding techniques and the scope of some Python modules. Note: the actual module in Txt2Html is called dmTxt2Html to avoid conflict with the name of a module written by someone else.

Using the cgi module

Python's cgi module -- in the standard distribution -- is a godsend for anyone developing "Common Gateway Interface" applications in Python. You could create CGIs without it, but you wouldn't want to.

Most typically, you interact with CGI applications by means of an HTML form. You fill out a form that calls on the CGI to perform its action using your specifications. For example, the Txt2Html documentation uses this example for calling an HTML form (the one generated by Txt2Html itself is a bit more complicated, and may change, but the example will work perfectly well, even from within your own Web pages):

HTML form to call 'Txt2Html'
	<form method="get" action="">
	 URL: <input type="text" name="source" size="40">
	 <input type="submit" name="go" value="Display!">
	</form>

You may include many input fields within an HTML form, and the fields can have one of a number of different types (text, checkboxes, picklists, radio buttons, etc.). Any good book on HTML can help a beginner with creating custom HTML forms. The main thing to remember here is that each field has a name attribute, and that name is used later to refer to the field in our CGI script. Another detail worth knowing is that a form's method attribute can take one of two values: "get" and "post". The basic difference is that "get" includes the query information in the URL, which makes it easier for a user to save a specific query for later reuse. On the other hand, if you do not want users to save queries, use the "post" method.
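To make the "get" behavior concrete, here is a minimal sketch of how a browser packs the form's fields into the URL. It is written for a modern Python 3 rather than the older module layout used in this article's listings, and the address http://example.com/cgi-bin/txt2html.cgi and the helper name build_get_url are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_get_url(action, fields):
    """Return the URL a browser would request for a method="get" form."""
    # urlencode() percent-escapes each value and joins name=value pairs with '&'
    return action + '?' + urlencode(fields)

# Hypothetical CGI location; 'source' and 'go' are the field names used in the form
url = build_get_url('http://example.com/cgi-bin/txt2html.cgi',
                    {'source': 'http://example.com/readme.txt', 'go': 'Display!'})
print(url)

# The whole query can be recovered from the saved URL, which is what
# makes a "get" query easy to bookmark and reuse
query = parse_qs(urlsplit(url).query)
print(query['source'][0])
```

With "post", the same name=value pairs travel in the body of the request instead, so nothing about the query shows up in the address a user might save.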

The Python script that gets called by the above form does an import cgi to make sorting out its calling form easy. One thing this module does is hide any details of the difference between "get" and "post" methods from the CGI script. By the time the call is made, this is not a detail the CGI creator needs to worry about. The main thing done by the cgi module is to treat all the fields in the calling HTML form in a dictionary-like fashion. What you get is not quite a Python dictionary, but it is close enough to be easy to work with:

Using Python [cgi] module
import cgi, sys
cfg_dict = {'target': '<STDOUT>'}
sys.stderr = sys.stdout           # send tracebacks to the browser
form = cgi.FieldStorage()         # dictionary-like view of the form fields
if form.has_key('source'):
    cfg_dict['source'] = form['source'].value

There are a couple of little details to notice in the above few lines. One trick we do is to set sys.stderr = sys.stdout. By doing this, if our script encounters an untrapped error, the traceback will display back to the client browser. This can save a lot of time in debugging a CGI application. But it might not be what you want users to see (or it might, if they are likely to report problem details to you). Next, we read the HTML form values into the dictionary-like form instance. Much like a true Python dictionary, form has a .has_key() method. However, unlike a Python dictionary, to actually pull off the value within a key, we have to look at the .value attribute for the key.

From here, we have everything in the HTML form in plain Python variables, and we can handle them as in any other Python program.
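For readers on a recent Python 3, where the cgi module is no longer part of the standard library (it was removed in Python 3.13), the same idea can be sketched with urllib.parse alone. This is only an approximation under stated assumptions: the function name read_get_fields is hypothetical, it handles the "get" method only, and it keeps just the first value of each field.

```python
import os
from urllib.parse import parse_qs

def read_get_fields(environ=None):
    """Collect the fields of a method="get" request into a plain dict."""
    environ = os.environ if environ is None else environ
    # The Web server hands a CGI script the part of the URL after '?'
    # in the QUERY_STRING environment variable
    raw = parse_qs(environ.get('QUERY_STRING', ''))
    # parse_qs() returns a list of values per field; keep the first of each
    return {name: values[0] for name, values in raw.items()}

cfg_dict = {'target': '<STDOUT>'}
# A fake environment stands in for what the server would provide
form = read_get_fields(
    {'QUERY_STRING': 'source=http%3A%2F%2Fexample.com%2Freadme.txt&go=Display%21'})
if 'source' in form:
    cfg_dict['source'] = form['source']
print(cfg_dict)
```

Because the result here is a true dictionary, the usual `in` test and plain indexing replace the .has_key() call and the .value attribute.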

Using the urllib module

Like most things Python, urllib makes a whole bunch of complicated things happen in an obvious and simple way. The urlopen() function in urllib treats any remote resource -- whether http:, ftp:, or even gopher: -- just like it was a local file. Once you grab a remote (pseudo-)file object using urlopen(), you can do everything you would with the file object of a local (read-only) file:

Using Python [urllib] module
from urllib import urlopen
import string
source = cfg_dict['source']
if source == '<STDIN>':
    fhin = sys.stdin
else:
    try:
        fhin = urlopen(source)
    except IOError:             # urlopen() could not reach the resource
        ErrReport(source+' could not be opened!', cfg_dict)
doc = ''
for line in fhin.readlines():   # Need to normalize line endings!
    doc = doc+string.rstrip(line)+'\n'

One minor problem that I have encountered is that depending on the end-of-line convention used on the platform that produced the resource and on your own platform, some odd things can happen to the resulting text (this appears to be a bug in urllib). The cure for this problem is to perform the little .readlines() loop in the above code. Doing this gives you a string that has the right end-of-line conventions for the platform you are running on, regardless of what the source resource looked like (within reason, presumably).
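The same read-and-normalize step can be sketched for a modern Python 3, where urlopen() lives in urllib.request and returns bytes rather than text. The function name fetch_text and the UTF-8 default are my assumptions for illustration, not part of Txt2Html; a local file: URL stands in for a remote resource so the example runs without a network connection.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def fetch_text(url, encoding='utf-8'):
    """Read a URL that urlopen() understands; return text with uniform line endings."""
    with urlopen(url) as fhin:
        raw = fhin.read().decode(encoding)
    # splitlines() recognizes '\r\n', '\r' and '\n' (among other line
    # boundaries); rstrip() also drops any trailing blanks on each line
    return ''.join(line.rstrip() + '\n' for line in raw.splitlines())

# A sample file that deliberately mixes DOS, old Mac and Unix line endings
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.txt'
    sample.write_bytes(b'first line\r\nsecond line  \rthird line\n')
    doc = fetch_text(sample.as_uri())

print(repr(doc))
```

Every line in the result ends with a single '\n', which Python translates to the local convention when the string is later written out in text mode.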

Using the re module

There is certainly a lot more to regular expressions than can fit into this article. The re module is fairly widely used in Txt2Html to identify various textual patterns in the source texts. A moderately complex example is worth looking at:

Using Python [re] module
import re
def URLify(txt):
    txt = re.sub('((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)',
                 '<a href="\\1">\\1</a>\\2', txt)
    return txt

URLify() is a nice little function that does pretty much what it says. If something that looks like a URL is encountered in the "smart ASCII" file, it is converted into an actual hotlink to that same URL within the HTML output. Let's look at what the re.sub() is doing. First, in broadest terms, the function's purpose is to "match what is in the first pattern, then replace it with the second pattern, using the third argument as the string to operate on." Good enough, not much different from string.replace() in those terms.

The first pattern has several elements. Notice the parentheses first. The highest level consists of two pairs: a complicated bunch of stuff followed by (\s). Sets of parentheses match "subexpressions" that can potentially make up part of the replacement pattern. The second subexpression, (\s), just means "match any whitespace character and let us refer back to what was matched." So let's look at the first subexpression.

Python regular expressions have a couple tricks of their own. One such trick is the ?: operator at the beginning of a subexpression. This means "match a subpattern, but don't include the match in the back-references." So let's examine the subexpression:

	((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))

First notice that this subexpression is itself composed of two child subexpressions, with some stuff in the middle that is not part of any child subexpression. However, each of the children starts with ?:, which means that they get matched, but don't count for reference purposes. The first of these "non-reference" child subexpressions just says "match something that looks like http or that looks like ftp or ...". Next we get the short string ://, which means to match anything that looks exactly like it (simple, huh?). Finally, we get the second child subexpression, which other than the "don't refer" operator consists of some stuff in square brackets, and a plus sign.
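A quick experiment shows what the ?: operator changes. The sketch below, written for Python 3 with a made-up sample URL, matches the same text twice: once with ordinary parentheses everywhere, and once with the "non-reference" children used in the article's pattern.

```python
import re

url = 'ftp://example.com/pub/file.txt'

# Ordinary parentheses: every pair is numbered, so there are three groups
capturing = re.match(r'((http|ftp|gopher|file)://([^ \n\r<\)]+))', url)
print(capturing.groups())

# With ?: the children still take part in the match but get no number,
# so only the outer pair is left to refer back to
non_capturing = re.match(r'((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))', url)
print(non_capturing.groups())
```

Both versions match exactly the same text. The only difference is how many back-references are available afterward, which keeps the numbering in the replacement pattern simple.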

In regular expressions, square brackets just mean "match any character in the brackets." However, if the first character is a caret (^), the meaning is reversed, and it means "match anything not in the next characters." So we are looking for stuff that is not a space, CR, LF, "<" or ")" (notice also that characters that have special meaning to regular expressions can be "escaped" by having a "\" in front of them). The plus sign at the end means "match one or more of the last thing" (the asterisk is for "zero or more", and the question-mark is for "zero or one").
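The negated character class and the three quantifiers are easy to try out at an interactive prompt. Here is a small Python 3 sketch; the sample strings are invented purely for illustration.

```python
import re

# [^ \n\r<\)]+ means: one or more characters that are NOT
# a space, LF, CR, '<' or ')'
not_delim = re.compile(r'[^ \n\r<\)]+')
host = not_delim.match('example.com/page) and more').group()
print(host)   # the match stops just before the ')'

# The three quantifiers, each applied to the letter 'o'
one_or_more = re.findall(r'fo+', 'f fo foo')
zero_or_more = re.findall(r'fo*', 'f fo foo')
zero_or_one = re.findall(r'fo?', 'f fo foo')
print(one_or_more, zero_or_more, zero_or_one)
```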

This regular expression has a bunch to digest, but if you walk through it a few times, you can see that this is what a URL has to look like.

Next is the replacement chunk. This is simpler. The parts that look like \\1 and \\2 (or \\3, \\4, etc., if we needed them) are the "back references" discussed above. \\1 (or \\2) means the text matched by the first (or second) subexpression of the match expression. All the rest of the stuff in the replacement chunk just is what it is: some characters that are easily recognized as HTML codes. One little thing that is a bit subtle is that we bother to match \\2, which, as we saw above, is just a whitespace character. One might ask, "why bother? why not just insert a space as a literal?" Fair question, and we do not really need to do what we did for HTML. But aesthetically, it is better to let the HTML output stay as much as possible like the source text file was before our HTML markup. In particular, let's keep the line-breaks as line-breaks, and spaces as spaces (and tabs as tabs).

