Web services are a late arrival to the World Wide Web. Many interesting Web applications are built as traditional CGI programs, with forms and responses going back and forth between the client and the server. Given the benefits of service architectures, you may want to expose the functionality of an existing CGI program as a Web service, but rewriting the application may be too complex or expensive. One easy trick is to have the Web service read and parse the Web page itself. This "page scraping" technique is often the handiest device in the modern programmer's toolbox, and it is the focus of this article.
This article assumes that you have some familiarity with Perl programming.
Figure 1: Slashdot's poll

To make this discussion more concrete, let's define a Web service for Slashdot's weekly poll (see Figure 1). Using this service, programs will be able to get the latest poll topic (along with its options), retrieve the current results of the poll, and vote for a particular option. This API is summarized in Table 1.
Table 1: Slashdot Web service API
URL: http://marian.daisypark.net/~jjohn/slashdot_poll.pl

| Method name | Input | Output | Description |
|---|---|---|---|
| getPollTopic | none | A structure | Returns the topic, the poll ID, and the voting options. If an error occurs, a special member called "error" is given a non-zero value. |
| vote | pollID and choiceID | Boolean indicating success or failure | Given the strings that identify the poll and the choice on Slashdot, this method attempts to vote. |
| getPollResults | pollID | A structure | Given a pollID, returns a structure that maps each choice to a two-element array. The array's first element is the number of votes cast for that choice, and the second element is the percentage of the total votes that choice received. |
Much like XML-RPC, SOAP can return arbitrarily complex data structures. SOAP has a much richer collection of data types, but all the expected ones (strings, integers, lists and dictionaries) are represented. Also like XML-RPC, the programmer is shielded from having to do anything about the underlying XML.
These three methods are conceptually uncomplicated. The getPollTopic() function returns all the important information needed for the other two methods, including the label used by Slashdot to identify this poll in its database. A dictionary that maps the human-readable voting choice to the Slashdot identifier for that choice is also returned by this method. Both the poll and individual choice identifiers are needed by the vote() method, which actually registers a vote on Slashdot. Lastly, getPollResults() reports the state of the current voting.
The hardest part of creating this service has nothing to do with SOAP; it is parsing the HTML. A new SOAP dispatch object is created on lines 4-7 in Listing 1. Here, the class to be published is called Slash, and it is implemented in the same script that dispatches the SOAP requests.
Listing 1: The Slashdot poll SOAP service
 1 #!/usr/bin/perl
 2 # Wrap a CGI script into a web service
 3 use strict;
 4 use SOAP::Transport::HTTP;
 5
 6 my $dispatch = SOAP::Transport::HTTP::CGI->dispatch_to('.', 'Slash');
 7 $dispatch->handle;
 8
 9 package Slash;
10 use HTML::TreeBuilder;
11 use LWP;
12 use constant BASE_URL => 'http://slashdot.org';
13
14 sub getPollTopic {
15     my $homepage = get_page();
16     return { error => 1 } unless $homepage;
17
18     my $tree = HTML::TreeBuilder->new;
19     $tree->parse($homepage);
20
21     my %poll = (); # to be returned
22
23     for my $td ($tree->look_down("_tag", "td") ){
24         if( my $form = $td->look_down("_tag", "form") ){
25             if($form->attr("action") eq 'http://slashdot.org/pollBooth.pl' ){
26
27                 # get the pollID
28                 if( my $input = $td->look_down("_tag", "input")){
29                     if( $input->attr("name") eq 'qid' ){
30                         $poll{pollid} = $input->attr("value");
31                     }
32                 }
33
34                 # get the poll topic
35                 my $b = $td->look_down("_tag", "b");
36                 $poll{topic} = $b->as_text;
37
38                 # the poll options
39                 for my $i ( $td->look_down("_tag", "input")){
40                     next unless $i->attr("name") eq "aid";
41                     my $label = ($i->right)[0];
42                     my $aid = $i->attr("value");
43                     $poll{options}->{$label} = $aid;
44                 }
45             }
46         }
47     }
48     return \%poll;
49 }
50
51 sub vote {
52     my ($class, $pollID, $choice) = @_;
53     if( get_page("/pollBooth.pl?aid=$choice&qid=$pollID") ){
54         return SOAP::Data->type( Boolean => 'true' );
55     }else{
56         return SOAP::Data->type( Boolean => 'false' );
57     }
58 }
59
60 sub getPollResults {
61     my ($class, $pollID) = @_;
62
63     my %choices;
64     if( my $content = get_page("/pollBooth.pl?qid=$pollID&aid=-1")){
65         my $tree = HTML::TreeBuilder->new;
66         $tree->parse( $content );
67         my $last;
68         for my $td ($tree->look_down("_tag", "td") ){
69             if( my $i = $td->look_down("_tag", "img") ){
70                 if( $i->attr("src") =~ /leftbar.gif$/ ){
71                     chop $last;
72
73                     # point to a list ( actual votes and percentage )
74                     $choices{ $last } = [ split( m!\s+/\s+!, $td->as_text, 2) ];
75                 }
76             }
77             $last = $td->as_text;
78         }
79
80     }else{
81         return {error => 1};
82     }
83     return \%choices;
84 }
85
86 # Helper function to get pages in a more robust way
87 sub get_page {
88     my ($path) = @_;
89     my $ua = LWP::UserAgent->new;
90     my $uri= URI->new_abs($path, BASE_URL);
91     my $rq = HTTP::Request->new( GET => $uri );
92     my $rs = $ua->request($rq);
93
94     return ($rs->is_error ? undef : $rs->content );
95 }
On line 9, the Slash class begins. For those unfamiliar with creating Perl classes, use the online documentation tool perldoc to look up Tom Christiansen's object-oriented tutorial (the keyword is "perltoot"). In order to fetch and parse Slashdot pages, the LWP and HTML::TreeBuilder modules need to be included after the package declaration. (These modules will be explained in a moment.) Line 12 introduces Perl's mechanism for creating unchangeable constants. This method of creating constants should be preferred over declaring file-scoped lexical variables or using package globals. After all, no programmer wants his constants accidentally changed at run time.
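As a minimal sketch of how such a constant behaves, the following script declares BASE_URL the same way Listing 1 does and then tries to overwrite it. The example.com address is only a placeholder, and the string eval is there solely so the script can report the compile-time error instead of dying.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Declared exactly as on line 12 of Listing 1.
use constant BASE_URL => 'http://slashdot.org';

# Constants are really subroutines, so they do not interpolate inside
# double-quoted strings; concatenate them instead.
my $poll_url = BASE_URL . '/pollBooth.pl';
print "$poll_url\n";

# Assigning to a constant is rejected when the code is compiled.
# Wrapping the attempt in a string eval lets this script catch the error.
my $ok = eval q{ BASE_URL = 'http://example.com'; 1 };
print "Assignment rejected: $@" unless $ok;
```

A file-scoped lexical such as `my $base_url` would accept the second assignment silently, which is the accident the constant guards against.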
This class isn't like most standard object classes because all its methods are class methods. There is no per-object information, so there is no constructor method (often called new()). Because Perl always passes the class name or object reference to invoked methods, every method will have one additional argument at the head of the parameter list.
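The extra argument is easy to see in isolation. The sketch below uses a hypothetical Greeter class (not part of the article's service) with a single class method; note that the caller passes one argument but the method receives two.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Greeter;    # hypothetical class, for illustration only

# A class method. Perl places the invocant (here the class name)
# ahead of whatever arguments the caller supplied.
sub describe {
    my ($class, $topic) = @_;
    return "$class was asked about $topic";
}

package main;

# One argument at the call site, two in @_ inside the method.
my $answer = Greeter->describe('polls');
print "$answer\n";
```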
The first method, getPollTopic(), gets the content of Slashdot's front page using the LWP library, which contains many helper classes used for getting data through HTTP. Those new to LWP should have a look at O'Reilly's Web Client Programming with Perl, which, although out of print, is available on O'Reilly's Open Book site (see Resources). To simplify things a bit, all the LWP code is centralized in the get_page() subroutine, starting on line 87. This subroutine expects to be passed only the path portion of the URL. Since this class only gets pages from Slashdot, constructing full URLs outside of this function would produce a lot of repeated code. The URI method new_abs creates an absolute URL from the path passed into get_page() and the BASE_URL constant.
Back in getPollTopic, the front page is parsed into a tree-like structure with HTML::TreeBuilder. Although not the fastest way to deconstruct HTML, it is a robust and easy-to-use library worth learning. A new tree builder is created on line 18 and fed the HTML content of the Slashdot home page on line 19.
By grouping HTML by tags, HTML::TreeBuilder allows the programmer to iterate through the set of objects corresponding to HTML tags for the current scope using the look_down method. On line 23, the scope is the entire HTML content of Slashdot's front page. The poll information is located in a <TD> tag. Every time look_down finds just such a tag, it returns an HTML::Element object whose scope includes all the HTML between the opening and closing <TD> tags. The loop that begins on line 23 is trying to find the <TD> tag that contains the form whose action attribute points to the pollBooth.pl program. (Wouldn't it be nice if HTML simply described content and not layout?) Notice how the scope of the HTML parsing narrows from the whole document on line 23 to just the contents of a single <TD> tag on line 24.
Once the right table data element is located, extracting the poll ID is easy. In this section of HTML, there is a form with a hidden input element whose name attribute is qid. The value attribute of this element is the key that Slashdot uses to identify this poll. Using the attr method, the content of this value attribute is extracted and stored in a hash called %poll on line 30.
Finding the poll topic is easy too because it is the only bolded text in this section. By using the look_down method for the current $td scope, the sole <B> element is located. The as_text method converts all the content of the current scope into plain, unmarked ASCII text. This is perfect for extracting the topic from the $b scope, as is done on line 36.
Listing 2: HTML for one poll choice
<BR><INPUT TYPE="radio" name="aid" VALUE="3">Heat Ray Vision
The last task for getPollTopic is the trickiest. All of the poll choices need to be extracted along with the values needed to vote for each choice. The choices in HTML are in the form of radio buttons (see Listing 2 for an example), which means that the value needed to vote for a choice is in the value attribute of an <input> tag whose name attribute equals aid. The human-readable choice is just to the right of that input tag. Finding the right <input> tags is simply a matter of iterating over the whole list of them and finding the ones with the desired name attribute. To get the text on the right of this <input> tag, use the cleverly named right method, which returns a list of either HTML::Element objects representing HTML tags or simply strings that are unmarked-up text. The human-readable text will always be the first element to the right of the <input> tag.
Line 43 builds a hash that maps each human-readable choice label to the value needed to register a vote for that choice. Because this hash is stored under the options key of %poll, the result is a hash of hashes. Those unfamiliar with complex Perl data structures should have a look at the Perl Data Structures Cookbook, which is shipped with Perl and viewable with the command perldoc perldsc.
Now that all this information about the current poll has been collected into a hash, it is time to return this information to the caller. Line 48 is returning a reference to this %poll hash. All SOAP methods are expected to return a single value, and this is the most natural way to squeeze a hash into a scalar.
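The shape of the returned structure can be sketched without any HTML parsing. In the example below, build_poll is a hypothetical stand-in for getPollTopic; the poll ID, the topic, and the "Flight" choice are made up, while "Heat Ray Vision" and its value 3 come from Listing 2.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for getPollTopic, with hard-coded data.
sub build_poll {
    my %poll = (
        pollid => '123',                    # made-up identifier
        topic  => 'Favorite superpower',    # made-up topic
    );

    # As on line 43 of Listing 1: the inner hash springs into
    # existence the first time a key is stored through it.
    $poll{options}->{'Heat Ray Vision'} = 3;
    $poll{options}->{'Flight'}          = 1;    # made-up choice

    # A reference squeezes the whole hash into a single scalar.
    return \%poll;
}

my $poll = build_poll();
for my $label (sort keys %{ $poll->{options} }) {
    print "$label => $poll->{options}{$label}\n";
}
```

The caller dereferences with `->` for single values and with `%{ ... }` when it needs the inner hash as a whole.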
The rest of the API methods are pretty easy. In order to vote, Slashdot needs a poll ID and a choice number. These can be submitted as a GET query to Slashdot's pollBooth.pl program. The API method vote packages both of these values passed in by the user into a URL that is submitted to Slashdot. Notice that although vote only takes two arguments, there is a third variable called $class at the head of its parameter list. Because this is a class method, Perl adds the class name to the parameter list.
The success or failure of the page request is returned to the caller as a boolean value. SOAP::Lite provides an easy facility for creating SOAP data types that do not have a direct Perl mapping. Using the SOAP::Data->type call, any SOAP data type can be created (as shown on lines 54 and 56 in Listing 1).
Listing 3: HTML for one poll result
<TR>
<TD width=100 align=right>Heat Ray Vision </TD>
<TD width=450><NOBR>
<IMG src=http://images.slashdot.org/leftbar.gif width=4 height=20 alt="">
<IMG src=http://images.slashdot.org/mainbar.gif height=20
width=4 alt="0%">
<IMG src=http://images.slashdot.org/rightbar.gif width=4 height=20 alt="">
256 / <FONT color=006666>0%</FONT></NOBR>
</TD>
</TR>
The last method, getPollResults, is only complicated because of the way Slashdot reports the poll results. Listing 3 shows the HTML that represents the results for one poll choice. The human-readable choice label, the actual number of votes, and the percentage of votes this choice received all need to be recorded. Unfortunately, the human-readable string comes before any clear indicator that it is a poll choice and not some random text. Therefore, the code on lines 68-78 of Listing 1 looks at every <TD> tag and remembers the text of the last one seen in the $last variable (line 77). Because all choices have graphs following them, a <TD> that contains an image called leftbar.gif must be preceded by the <TD> that holds the human-readable text.
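The "remember the previous cell" trick can be sketched on its own. Here the parsed <TD> elements are replaced by plain hashes, an assumption made so that the example needs no HTML parser; the has_bar flag stands in for the leftbar.gif test, and the "Flight" row is made up. The sketch trims trailing whitespace with a substitution where Listing 1 uses chop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the <TD> cells in document order. has_bar marks a
# cell that would contain the leftbar.gif image.
my @cells = (
    { text => 'Heat Ray Vision ', has_bar => 0 },
    { text => '256 / 0%',         has_bar => 1 },
    { text => 'Flight ',          has_bar => 0 },    # made-up row
    { text => '512 / 1%',         has_bar => 1 },    # made-up row
);

my (%results, $last);
for my $cell (@cells) {
    if ($cell->{has_bar}) {
        # The previous cell held the label for this graph cell.
        (my $label = $last) =~ s/\s+$//;
        $results{$label} = $cell->{text};
    }
    # Remember this cell's text for the next iteration.
    $last = $cell->{text};
}

print "$_: $results{$_}\n" for sort keys %results;
```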
The hackery isn't quite over. The actual number and percentage of votes are going to be extracted as one string (line 74). This string is split on the slash that separates these values, which creates a list with two elements. The list is turned into an anonymous array so that it can be stuck in the %choices hash. As always, this hash is returned to the caller as a reference.
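The split on line 74 is worth trying by itself. This sketch applies the same pattern to the text "256 / 0%" from Listing 3; it assumes the cell text carries no leading or trailing whitespace, which real page content may not guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Text of the graph cell from Listing 3, assumed already trimmed.
my $cell_text = '256 / 0%';

# Split on the slash and the whitespace around it. The limit of 2
# ensures at most two fields come back.
my ($votes, $percent) = split m!\s+/\s+!, $cell_text, 2;

# Square brackets build an anonymous array, so the pair can be
# stored as a single hash value.
my %choices = ( 'Heat Ray Vision' => [ $votes, $percent ] );

my $pair = $choices{'Heat Ray Vision'};
print "votes=$pair->[0] percent=$pair->[1]\n";
```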
Creating a Web client that takes advantage of this API is straightforward. Listing 4 is a simple CGI script that uses the SOAP service described above. Like many CGI programs, this one handles multiple states. When the user first runs this program, a screen very much like that in Figure 2 appears. It displays the topic and presents the voting choices as a drop-down menu.
Figure 2: Initial voting screen

In Listing 4, lines 17-20 act as a switch statement that runs either paint() or vote() (not to be confused with the API method vote) depending on the value of the CGI parameter action. Of course, the first time this program is called, this parameter is not set, so the paint() subroutine is called, which produces a screen like the one in Figure 2.
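For readers who have not met this idiom, here is a minimal sketch of the same for-based switch outside of CGI. The dispatch subroutine and the strings it returns are hypothetical; the real script calls vote() or paint() directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper that names the handler the switch would run.
sub dispatch {
    my ($action) = @_;
    $action = '' unless defined $action;    # parameter absent on first visit

    my $handler;
    for ($action) {    # aliases $_ to $action for a single pass
        /^vote/ && do { $handler = 'vote'; last; };
        $handler = 'paint';    # default case, reached when nothing matched
    }
    return $handler;
}

print dispatch(undef),  "\n";
print dispatch('vote'), "\n";
```

The `last` inside the do block leaves the for loop, which is what keeps the default case from running after a match.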
Drilling down into paint(), the first thing to notice is that this subroutine expects an initialized CGI object and possibly an HTML fragment. These are stored in the local variables $cgi and $mesg respectively. To get the poll information, a SOAP client call to getPollTopic begins on line 24. To avoid typing errors, both the SOAP URI and proxy values are stored in constants. Line 27 invokes the remote method. It is tempting to think of $resp as the return value of getPollTopic. While this object does have the return value somewhere, it also has information about any transmission errors that may have occurred during the SOAP call. You can invoke the fault() method of $resp to test for the presence of these errors. If a fault occurred, further information about it can be determined using the faultcode() and faultstring() methods. If there is no error, the return value of getPollTopic can be extracted with the result() method, as seen on line 35.
Recall that this API method returns a reference to a hash. In order to create a drop-down menu that lists the voting choices, a new hash is created. This new hash maps the Slashdot value for that voting choice to the human-readable label. This may seem counterintuitive, but this is the structure that CGI's popup_menu(), used in line 52, expects.
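The inversion itself is plain Perl and can be sketched without CGI. The option data below is illustrative: "Heat Ray Vision" and 3 come from Listing 2, and "Flight" is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# What getPollTopic returns under the options key: label => Slashdot id.
my %options = ( 'Heat Ray Vision' => 3, 'Flight' => 1 );

# Flip it to id => label, as lines 37-39 of Listing 4 do.
my %menu_options;
while ( my ($label, $id) = each %options ) {
    $menu_options{$id} = $label;
}

# The ids become the submitted values; the labels are what the user sees.
my @values = sort { $a <=> $b } keys %menu_options;
print "$_ => $menu_options{$_}\n" for @values;
```

When the identifiers are unique, `my %menu_options = reverse %options;` gives the same result in one line.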
Line 41 begins printing all the HTML elements necessary to create a voting form. If this subroutine was passed a message, the message is displayed after the voting form.
Figure 3: Voting results

Once the Web visitor has selected a choice and pressed the submit button, the CGI script will call the vote() subroutine. This subroutine is a little expensive because it needs to make two API method calls, vote and getPollResults. This means two HTTP calls and all the network latency those calls entail. Aside from a lot of error checking, the code is relatively terse. Starting on line 100, it gets the hash of voting results and builds a table that is then passed to the paint() subroutine for display.
Listing 4: A CGI client for the Slashdot poll service
  1 #!/usr/bin/perl
  2 # a web client for the Slashdot poll service
  3 use strict;
  4 use SOAP::Lite;
  5 use CGI qw/:all *table/;
  6 use CGI::Carp qw/fatalsToBrowser/;
  7
  8 use constant SOAP_URL =>
  9     'http://marian.daisypark.net/Slash';
 10 use constant SOAP_PROXY =>
 11     'http://marian.daisypark.net/~jjohn/slashdot_poll.pl';
 12
 13
 14 my $cgi= CGI->new;
 15 my $action = $cgi->param("action");
 16
 17 for($action){
 18     /^vote/ && do{ vote( $cgi ); last; };
 19     paint($cgi);
 20 }
 21
 22 sub paint {
 23     my ($cgi, $mesg) = @_;
 24     my $client = SOAP::Lite->uri(SOAP_URL);
 25     $client->proxy(SOAP_PROXY);
 26
 27     my $resp = $client->getPollTopic();
 28
 29     if( $resp->fault ){
 30         die
 31             "ERROR: SOAP Failure: ",
 32             $resp->faultcode, ":",
 33             $resp->faultstring;
 34     }
 35     my $poll = $resp->result();
 36     my %menu_options;
 37     while( my($k, $v) = each %{ $poll->{options} } ){
 38         $menu_options{$v} = $k;
 39     }
 40
 41     print
 42         header,
 43         start_html( -title => "Slash Poll Proxy",
 44                     -bgcolor => "#FFFFFF",
 45         ),
 46         h1("Slash Poll Proxy"),
 47         p("The current poll topic is: ", b($poll->{topic})),
 48         p("Cast your vote by selecting one of the following:"),
 49         start_form,
 50         qq(<input type="hidden" name="action" value="vote">),
 51         qq(<input type="hidden" name="pollID" value="$poll->{pollid}">),
 52         popup_menu(
 53             -name => 'choice',
 54             -labels => \%menu_options,
 55             -values => [ keys %menu_options ],
 56         ),
 57         submit,
 58         end_form,
 59         hr,
 60         $mesg,
 61         end_html;
 62 }
 63
 64 sub vote {
 65     my ($cgi) = @_;
 66     my $choice = $cgi->param("choice");
 67     my $pollid = $cgi->param("pollID");
 68
 69     if( !$choice || !$pollid ){
 70         return paint($cgi, font({color=>"#FF0000"},
 71             "Error! Vote again"));
 72     }
 73
 74     my $client = SOAP::Lite->uri(SOAP_URL);
 75     $client->proxy(SOAP_PROXY);
 76
 77     # vote
 78     my $resp = $client->vote($pollid, $choice);
 79
 80     if( $resp->fault ){
 81         die
 82             "ERROR: SOAP Failure: ",
 83             $resp->faultcode, ":",
 84             $resp->faultstring;
 85     }
 86
 87     unless( $resp->result ){
 88         return paint($cgi, font({color=>"#FF0000"},
 89             "Vote failed! Vote again"));
 90     }
 91
 92     # Get the results
 93     $resp = $client->getPollResults($pollid);
 94
 95     if( $resp->fault ){
 96         return paint($cgi, font({color=>"#FF0000"},
 97             "Can't get results"));
 98     }
 99
100     my $results = $resp->result();
101     my $ret = start_table;
102     for my $r ( keys %{$results} ){
103         $ret .= Tr(td(
104                 [$r, b($results->{$r}->[0])]
105             )
106         );
107     }
108     $ret .= end_table;
109     return paint( $cgi, $ret);
110 }
Although SOAP can be used like an RPC mechanism, its real strength and promise come from its object-oriented nature. Every object can store data specific to that particular instantiation. This means that Web objects can remember program states. Because SOAP is platform neutral, it is possible to create an object in a client Perl script that changes class data in a Python SOAP server. How far SOAP can implement OOP concepts remains to be seen. Can one derive subclasses from a SOAP object? What about inheritance? Even for moderately long inheritance trees that have to traverse several Web servers, a programmer may have to wait a long time for a SOAP call to return. In any case, SOAP has an interesting future ahead of it.
Resources
- The Quick Guide to SOAP::Lite is an excellent resource for learning how to use this Perl library.
- There's no substitute for reading the SOAP specification yourself.
- Read SOAP::Lite primers from its author, Paul Kulchenko, on perl.com.
- Free O'Reilly books! In particular, do read Web Client Programming with Perl.
- ZOPE is an open source application server with interfaces for Perl as well.
- Lotus eSuite also has a different method for CGI gateways.
- The IBM Framework for e-business can help you understand the technology choices behind deploying CGI applications.
By day, Joe Johnston (jjohn@cs.umb.edu) is a programmer for O'Reilly and Associates. Whenever his cat isn't sitting on his keyboard, he writes articles for The Perl Journal, use.perl.org, www.perl.com and the O'Reilly Network. Along with Michael Lord, he created the humorous UFO folklore site, Aliens, Aliens, Aliens.





