You can find many tutorials that show you how to create an effective format for your 404 page. Most suggest that 404 pages contain static, suggested links that point to common areas on your site, such as the front page, downloads page, and your site's search engine, if you have one. The problem with generic 404 pages is that they do not reflect why the visitor came to the site. This article shows you how to build a suggestion-maker and a method of providing more useful redirect links that are based on the content of your Web site.
Current 404 handlers allow you to provide a few suggested links for all errors, such as pointing the users to the site directory. Spelling correctors, such as mod_speling (yes, it has one "l") can be used to correct errors in dictionary words that may lead a user to the right page. The code here will help you build a suggestion-making engine to handle nondictionary words and directory links based on the content of your Web site.
Suppose, for example, that you hear a Web page name during a teleconference and try a link to blegs/DavSmath.html. Current spelling-correction modules would be unable to provide a useful link for this case. Using the code in this article, you'll be able to generate a 404 page with a suggestion for the valid page at /blogs/DaveSmith.html.
Any modern PC manufactured after 2000 should provide plenty of horsepower for compiling and running the code in this article. You may need RAM-rich, high-powered hardware or patience if your Web site contains more than 10,000 or so distinct pages.
The Perl and CGI scripts provided work on a variety of UNIX® and Windows® flavors (see Download). Although this article uses Apache and a CGI script for the suggestion engine, the tools built should function with most Web servers. For metaphone matching, this article references the Text::Metaphone module by Michael Schwern. Install the Text::Metaphone module from your favorite CPAN mirror and you'll be ready to start. See Resources for downloads.
The sample files referred to in this article are available in Download.
Web server pages and metaphone codes
The primary method for suggesting alternatives to typographical and spelling errors will be metaphone matching. Metaphones, like Soundex and other algorithms, use an alphanumeric code to represent the spoken pronunciation of a word. Unlike Soundex, however, metaphone codes are built to match the linguistic variabilities of pronunciation in the English language. The average metaphone code is, therefore, a much more accurate representation of a given word, and provides an ideal basis for building a suggestion library.
Consider the following list of files in a sample Web server directory.
Listing 1. Web server files
./index.html
./survey.html
./search_tips.html
./about.html
./how.html
./why.html
./who.html
./NathanHarrington.html
./blogs/NathanHarrington.html
./blogs/DaveSmith.html
./blogs/MarkCappel.html
With this set of static HTML files, we'll use the buildMetaphoneList.pl program to create metaphones for each filename with an .html extension.
Listing 2. buildMetaphoneList.pl
#!/usr/bin/perl -w
# buildMetaphoneList.pl - / split filename, 0 score, metaphones
use strict;
use File::Find;
use Text::Metaphone;

find(\&htmlOnly,".");

sub htmlOnly
{
  # process only names that end in .html
  if( $File::Find::name =~ /\.html$/ )
  {
    my $clipFname = $File::Find::name;
    $clipFname =~ s/\.html$//;          # strip the extension
    my @slParts = split '/', $clipFname;
    shift(@slParts);                    # drop the leading "."
    print "$File::Find::name ### 0 ### ";
    for( @slParts ){ print Metaphone($_) . " " }
    print "\n";
  }#if a matching .html file
}#htmlOnly sub
The buildMetaphoneList.pl program processes only files with an .html extension, removes the .html from the filename, then generates metaphones for each part of the full path name. Copy the buildMetaphoneList.pl program to your Web server root directory and run the command perl buildMetaphoneList.pl > metaphonesScore.txt. For the files shown in Listing 1, the corresponding metaphonesScore.txt file contents are shown below.
Listing 3. metaphonesScore.txt
./index.html ### 0 ### INTKS
./survey.html ### 0 ### SRF
./search_tips.html ### 0 ### SRXTPS
./about.html ### 0 ### ABT
./how.html ### 0 ### H
./why.html ### 0 ### H
./who.html ### 0 ### H
./NathanHarrington.html ### 0 ### N0NHRNKTN
./blogs/NathanHarrington.html ### 0 ### BLKS N0NHRNKTN
./blogs/DaveSmith.html ### 0 ### BLKS TFSM0
./blogs/MarkCappel.html ### 0 ### BLKS MRKKPL
Each line in Listing 3 shows the actual link in the filesystem under the Web server root directory, the default score, and the metaphone code. Note how how.html, why.html, and who.html all resolve to the same metaphone code. To deal with this ambiguity, modify the score field to have the link-suggestion program provide links to your pages in the desired order. For example, change the "H" metaphone entries to be:
./how.html ### 100 ### H
./why.html ### 50 ### H
./who.html ### 0 ### H
This creates a straightforward reordering of the links, with room for further modification of the scores. Widely spaced scores are preferable because they leave room to insert files later that have the same metaphone but a different score. For example, a new hoo.html file could be given a score of 25 so that it appears above the who.html entry and below the why.html entry.
You can also use the score field for differentiation between files of the same name from differing directories. Modify the ./NathanHarrington.html line score to be 100, for example, and requests for pages like nathenHorrington.html will list the ./NathanHarrington.html link before the ./blogs/NathanHarrington.html page.
When choosing how to score your files, consider both the access statistics and the logical structure of your Web site. According to the log files, users may request the why.html page more frequently, but if you know it's more important that they see how.html, simply assign scores that produce the sort order you want.
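Before assigning scores by hand, it can help to see which metaphone codes are shared by more than one page. The following is a minimal sketch, not part of the article's download: it assumes the "path ### score ### metaphones" layout of Listing 3, and parse_score_lines and collisions are hypothetical helper names introduced only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse 'path ### score ### metaphones' lines into a list of hashes.
sub parse_score_lines {
    my @entries;
    for my $line (@_) {
        my ($path, $score, $codes) = split /\s*###\s*/, $line;
        next unless defined $codes;          # skip blank or malformed lines
        $codes =~ s/\s+$//;                  # drop trailing space and newline
        push @entries, { path => $path, score => $score, codes => $codes };
    }
    return @entries;
}

# Return a hash of metaphone code => [paths], highest score first,
# keeping only the codes shared by two or more files.
sub collisions {
    my %by_code;
    push @{ $by_code{ $_->{codes} } }, $_ for @_;
    my %shared;
    for my $code (keys %by_code) {
        my @group = sort { $b->{score} <=> $a->{score} } @{ $by_code{$code} };
        $shared{$code} = [ map { $_->{path} } @group ] if @group > 1;
    }
    return %shared;
}

# Sample data: the rescored "H" entries plus two unambiguous pages.
my @lines = (
    "./who.html ### 0 ### H \n",
    "./how.html ### 100 ### H \n",
    "./why.html ### 50 ### H \n",
    "./about.html ### 0 ### ABT \n",
    "./survey.html ### 0 ### SRF \n",
);

my %shared = collisions( parse_score_lines(@lines) );
for my $code (sort keys %shared) {
    print "$code: @{ $shared{$code} }\n";
}
```

With the sample lines, only the "H" code is reported, and its pages are listed in the order how.html, why.html, who.html. Running a check like this against a real metaphonesScore.txt shows where scores need attention and where the default of 0 is harmless.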
With the appropriate metaphones generated along with their associated scores, we can now build the actual suggestion-maker. A typical 404 error is caused by a typographical error in the link or by a bad link. Suggestions made by the code listed below are created by running three main tests: matching against the directory structure, matching with a combined metaphone, and "contains" matching when all else fails. These three tests are designed to handle the majority of 404 errors. The beginning of the MetaphoneSuggest CGI Perl script is shown below.
Listing 4. MetaphoneSuggest CGI Part 1
#!/usr/bin/perl -w
# MetaphoneSuggest - suggest links for typographical and other errors from 404s
use strict;
use CGI::Pretty ':standard'; #standard cgi stuff
use Text::Metaphone;

my @suggestLinks = (); # suggested link list
my %mt = (); # filename, score, metaphone code hash

my $origLink = substr($ENV{REDIRECT_URL},1); # remove leading /
$origLink =~ s/\.html$//; # remove trailing .html

open(MPH,'metaphonesScore.txt') or die "can't open metaphones";
while( my $line = <MPH> )
{
  # read line by line so the loop ends cleanly at end of file
  my @slPart = split '###', $line;
  next if( @slPart < 3 ); # skip blank or malformed lines
  $slPart[0] =~ s/ //g; #remove trailing space
  $mt{$slPart[0]}{ score } = $slPart[1];
  $mt{$slPart[0]}{ metaphones } = $slPart[2];
}
close(MPH);
After the usual library includes and variable declarations, the code will load the reported 404 text, as well as the metaphones created using the buildMetaphoneList.pl program. Now we're ready for the main program logic, as shown below.
Listing 5. Main program logic
push @suggestLinks, sortResults( directorySplitTest( $origLink ) );
push @suggestLinks, sortResults( combinedTest( $origLink ) );
push @suggestLinks, sortResults( containsTest( $origLink ) );

# from the book - unique-ify the array
my %seen = ();
@suggestLinks = grep{ ! $seen{$_}++ } @suggestLinks ;

print header;
# escape the requested URL before echoing it back into the page
my $safeUrl = escapeHTML( $ENV{REDIRECT_URL} );
print qq{Error 404: The file requested [$safeUrl] is unavailable.<BR >};
exit if( @suggestLinks == 0 ); # nothing to suggest ("next" is not valid outside a loop)
print qq{Please try one of the following pages:<BR >};
for my $link( @suggestLinks ){
  $link = substr($link,index($link,'./')+1);
  print qq{<a href="$link">$link</a><BR >};
}
The output of each section of match test code is sorted, then added to the overall suggestion link list. After sorting and unique-ifying the link list, printing out the suggested links is straightforward.
The three sorted result sets are pushed onto a single results array to create an ordered suggestion list. When a 404 comes in, the presence of directory delimiters very likely indicates that the desired Web page is at least one level down the directory tree. Take, for example, a page request like bloggs/nathenherringtoon.html. The directorySplitTest as called above will create a sorted list of pages that have a metaphone match for both BLKS and N0NHRNKTN in successive directory levels. This strategy provides the necessary distinction between files in the root directory, such as blogs.html and nathanharrington.html, and pages with a full path name match, like blogs/nathanharrington.html. The listing below shows the contents of the directorySplitTest subroutine.
Listing 6. directorySplitTest subroutine
sub directorySplitTest
{
  my @matchRes = ();
  my $inLink = $_[0];

  # process each chunk of the requested path as a directory
  my @inLinkMetas = ();
  for my $inP ( split '\/', $inLink ){ push @inLinkMetas, Metaphone($inP) }

  for my $fileName ( keys %mt )
  {
    my @metaList = split ' ', $mt{$fileName}{metaphones};
    next if( @metaList != @inLinkMetas );

    my $pos = 0;
    my $totalMatch = 0;
    for( @metaList )
    {
      $totalMatch++ if( $metaList[$pos] =~ /(\b$inLinkMetas[$pos]\b)/i );
      $pos++;
    }#for metaList

    # make sure there is a match in each metaphone chunk
    next if( $totalMatch != @metaList );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";
  }#for keys in metaphone hash

  return( @matchRes );
}#directorySplitTest
Following the directorySplitTest, the combined test will check for matches where the
metaphones are smooshed together — disregarding any directory structure. This
is useful for correcting a class of 404s that involve space, slash, backslash, colon,
and other nonpronounced characters in their filenames. For example, if a 404 request
comes in for blogs_nathanherrington.html, the directorySplitTest will return zero
results, but the combinedTest will find that the metaphones produced by that 404 are an
exact match with those of the blogs/NathanHarrington.html page when combined. Again,
these suggestions are lower priority than a directory match, so their sorted results
are pushed onto the suggestLinks array after the directorySplitTest. The listing below
shows the combinedTest subroutine.
Listing 7. combinedTest subroutine
sub combinedTest
{
  my @matchRes = ();
  my $inLink = $_[0];
  my $inLinkMeta = Metaphone($inLink);

  for my $fileName ( keys %mt )
  {
    # smoosh all of the keys together, removing spaces and trailing newline
    my $metaList = $mt{$fileName}{metaphones};
    $metaList =~ s/( |\n)//g;
    next if( $metaList !~ /(\b$inLinkMeta\b)/i );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";
  }#for filename keys in metaphone hash

  return(@matchRes);
}#combinedTest
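To make the combined comparison concrete, the sketch below joins the stored per-directory codes and compares the result with a request code. It is illustrative only: joinedCode is a hypothetical helper, and the request code BLKSN0NHRNKTN is assumed to be the code for blogs_nathanherrington (the concatenation of the BLKS and N0NHRNKTN codes from Listing 3). It is hard-coded so that the sketch runs without Text::Metaphone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove spaces and newlines from a stored metaphone field
sub joinedCode {
    my ($field) = @_;
    ( my $joined = $field ) =~ s/\s+//g;
    return $joined;
}

# Metaphone fields as they appear in metaphonesScore.txt (Listing 3)
my %stored = (
    './NathanHarrington.html'       => "N0NHRNKTN \n",
    './blogs/NathanHarrington.html' => "BLKS N0NHRNKTN \n",
    './blogs/DaveSmith.html'        => "BLKS TFSM0 \n",
);

my $requestCode = 'BLKSN0NHRNKTN';   # assumed code for blogs_nathanherrington

for my $file ( sort keys %stored ) {
    my $verdict = joinedCode( $stored{$file} ) eq $requestCode ? 'match' : 'no match';
    print "$file: $verdict\n";
}
```

Only ./blogs/NathanHarrington.html is reported as a match. The sketch uses a plain string comparison; the listing above uses a word-boundary regular expression, which gives the same result for these codes.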
After the combinedTest, the final attempt is made to match based on a broad-ranging
contains search. If the metaphone of the current 404 link is anywhere in any of the
available metaphones from metaphonesScore.txt, we will add it to the suggestion list.
The contains search is designed to pick up on severely incomplete URLs. The page
nathan.html is nowhere to be found, but a good suggestion would be
/NathanHarrington.html and /blogs/NathanHarrington.html, and these are sorted on score
and added to the suggestLinks array. Note that this approach
will also produce suggestions of NathanHarrington.html for one-letter metaphone 404s
like whoo.html. Because the NathanHarrington.html metaphone contains an "H," it will be
added to the suggestion list. Consider creating minimum lengths of metaphones to be
matched or providing a total limit to the number of contains matches to modify this
behavior. Listing 8 shows the containsTest and sortResults subroutines.
Listing 8. sortResults and containsTest subroutines
sub sortResults
{
  # simple procedure to sort an array of 'score ## filename' entries
  my @scored = @_;
  my @idx = (); #temporary index for sorting
  for my $entry( @scored ){
    # create an index of scores
    my $item = substr($entry,0,index($entry,'##'));
    push @idx, $item;
  }
  # sort the index of scores, highest score first
  my @sorted = @scored[ sort { $idx[$b] <=> $idx[$a] } 0 .. $#idx ];
  return( @sorted );
}#sortResults

sub containsTest
{
  my @matchRes = ();
  my $inLink = $_[0];
  my $inLinkMeta = Metaphone($inLink);

  for my $fileName ( keys %mt )
  {
    my $metaList = $mt{$fileName}{metaphones};
    next if( $metaList !~ /$inLinkMeta/i );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";
  }#for filename keys in metaphone hash

  return(@matchRes);
}#containsTest
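One way to apply the minimum-length and result-limit ideas is sketched below. This is an illustration under stated assumptions, not the article's code: containsTestLimited, $minLen, and $maxHits are hypothetical names, the function returns bare paths instead of 'score ## filename' strings, and the request's metaphone code is passed in precomputed (in the CGI script it would come from Metaphone($origLink)) so that the sketch runs without Text::Metaphone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stored metaphones and scores, in the shape loaded from metaphonesScore.txt
my %mt = (
    './who.html'                    => { score => 0,   metaphones => 'H' },
    './NathanHarrington.html'       => { score => 100, metaphones => 'N0NHRNKTN' },
    './blogs/NathanHarrington.html' => { score => 0,   metaphones => 'BLKS N0NHRNKTN' },
    './blogs/DaveSmith.html'        => { score => 0,   metaphones => 'BLKS TFSM0' },
);

# Hypothetical variant of containsTest: ignore codes shorter than $minLen
# and return at most $maxHits paths, highest score first.
sub containsTestLimited {
    my ($inLinkMeta, $minLen, $maxHits) = @_;
    return () if length($inLinkMeta) < $minLen;

    # case-insensitive substring test against each stored metaphone field
    my @hits = grep { index( uc $mt{$_}{metaphones}, uc $inLinkMeta ) >= 0 } keys %mt;

    # sort by score, then by path so that ties come out in a stable order
    @hits = sort { $mt{$b}{score} <=> $mt{$a}{score} or $a cmp $b } @hits;

    splice( @hits, $maxHits ) if @hits > $maxHits;
    return @hits;
}

# 'N0N' is assumed to be the code for nathan (a prefix of N0NHRNKTN)
print join( "\n", containsTestLimited( 'N0N', 2, 5 ) ), "\n";

# A one-letter code such as 'H' is rejected by the length check
my @short = containsTestLimited( 'H', 2, 5 );
print scalar(@short), " matches for H\n";
```

With the sample table, the first call lists the two NathanHarrington pages with the root-level page first, and the one-letter code returns no suggestions. Appropriate values for the minimum length and the cap depend on the site; the values 2 and 5 here are arbitrary.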
Modifying the Apache httpd.conf file
The MetaphoneSuggest script as designed above is a straightforward cgi-bin script to be called from Apache. You'll need to modify your httpd.conf file to run the MetaphoneSuggest script instead of displaying a 404 error page. For example, if your default httpd.conf file has the section:
Listing 9. Default httpd.conf section
# Customizable error responses come in three flavors:
# 1) plain text 2) local redirects 3) external redirects
#
# Some examples:
#ErrorDocument 500 "The server made a boo boo."
#ErrorDocument 404 /missing.html
#ErrorDocument 404 "/cgi-bin/missing_handler.pl"
#ErrorDocument 402 http://www.example.com/subscription_info.html
Insert the following line: ErrorDocument 404 "/cgi-bin/MetaphoneSuggest" after the commented-out ErrorDocument lines. Make sure the MetaphoneSuggest and metaphonesScore.txt files are in the <document_root>/cgi-bin/ directory on your Web server. Issue a server restart command as root (for example, /usr/local/apache2/bin/apachectl restart), and you're ready to start serving smart suggestions instead of dead-end errors.
Implementation options and usability considerations
Keep in mind when using the tools described in the MetaphoneSuggest program that a 404 page is an error condition. Consider providing just a few suggested alternatives and keeping the design simple. Consult the big names in Web design for information on why they do not provide automatic link suggestions, or usability studies for how best to implement a link suggestion tool into your site.
This article provides the tools and code necessary to create options for useful link suggestions from 404s. However you choose to implement them, you now have the ability to provide more than simple directory links or spelling suggestions. With results tailored for specific sites and content, the dead-end 404 can be a thing of the past.
| Description | Name | Size | Download method |
|---|---|---|---|
| Code | os-metaphone.web404MetaphoneSuggest.zip | 2KB | HTTP |
Learn
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Stay current with developerWorks' Technical events and webcasts.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
Get products and technologies
- Download Michael Schwern's Text::Metaphone module from CPAN.
- Check out Apache.org for the best in Web servers. You can download a version of Apache HTTP Server for almost any operating system.
- If your system doesn't have Perl, you can download it for almost any operating system.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
Discuss
- Participate in developerWorks blogs and get involved in the developerWorks community.