Repurposing CGI applications with SOAP

Wrapping an existing CGI application into a SOAP service

Joe Johnston, Senior Software Engineer, O'Reilly and Associates
By day, Joe Johnston (jjohn@cs.umb.edu) is a programmer for O'Reilly and Associates. Whenever his cat isn't sitting on his keyboard, he writes articles for The Perl Journal, use.perl.org, www.perl.com and the O'Reilly Network. Along with Michael Lord, he created the humorous UFO folklore site, Aliens, Aliens, Aliens.

Summary:  SOAP is a popular web service protocol that can be used in a surprising number of ways, even for Web applications. This article demonstrates how to wrap an existing CGI application into a SOAP service. Additionally, the article gives a quick tour of HTML page scraping that leads to a discussion of HTML::TreeBuilder.

Date:  01 May 2001
Level:  Introductory
Also available in:   Japanese


Putting CGI out to pasture

Web services are a late arrival to the structure of the World Wide Web. Many interesting Web applications are built as traditional CGI programs with request-response forms going back and forth between the client and the server. Given the benefits of service architectures, you may want to expose the functionality of an existing CGI program as a web service. However, it may be too complex or expensive to rewrite the application. One easy trick is to have the web service read and parse the Web page itself. This "page scraping" technique is often the handiest device in the modern programmer's toolbox, and it is the focus of this article.

This article assumes that you have some familiarity with programming Perl.


Figure 1: Slashdot's poll

To make this discussion more concrete, let's define a web service for Slashdot's weekly poll (see Figure 1). Using this service, programs will be able to get the latest poll topic (along with its options), retrieve the current results of the poll, and vote for a particular option. This API is summarized in Table 1.


Table 1: Slashdot web service API
URL: http://marian.daisypark.net/~jjohn/slashdot_poll.pl

Method name | Input | Output | Description
getPollTopic | none | A structure | Returns the topic, the poll ID, and the voting options. If an error occurs, a special member called "error" is given a non-zero value.
vote | pollID and choiceID | Boolean indicating success or failure | Given the strings that identify the poll and the choice on Slashdot, this method attempts to vote.
getPollResults | pollID | A structure | Given a pollID, returns a structure that maps each choice to a two-element array. The array's first element is the number of votes cast for that choice, and the second element is the percentage of the total votes that choice received.

Much like XML-RPC, SOAP can return arbitrarily complex data structures. SOAP has a much richer collection of data types, but all the expected ones (strings, integers, lists and dictionaries) are represented. Also like XML-RPC, the programmer is shielded from having to do anything about the underlying XML.

These three methods are conceptually uncomplicated. The getPollTopic() function returns all the important information needed for the other two methods, including the label used by Slashdot to identify this poll in its database. A dictionary that maps the human-readable voting choice to the Slashdot identifier for that choice is also returned by this method. Both the poll and individual choice identifiers are needed by the vote() method, which actually registers a vote on Slashdot. Lastly, getPollResults() reports the state of the current voting.
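To make the shape of that return value concrete, here is a minimal sketch of the kind of structure getPollTopic() hands back, written as a plain Perl hash reference. The key names (pollid, topic, options) follow Listing 1; the poll ID, the topic text, and the second choice are made-up sample values, not real Slashdot data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample of what getPollTopic() might return.
# The poll ID, topic, and the 'Flight' choice are invented for illustration.
my $poll = {
    pollid  => 'superpowers',
    topic   => 'Favorite Super Power',
    options => {
        'Heat Ray Vision' => 3,
        'Flight'          => 4,
    },
};

# A caller needs both identifiers in order to vote for a choice.
my $choice_id = $poll->{options}{'Heat Ray Vision'};
print "To vote: pollID=$poll->{pollid}, choiceID=$choice_id\n";
```

The nested options hash is what lets a client translate the label a person picks into the identifier that vote() expects.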


The Perl server

The hardest part of creating this service has nothing to do with SOAP; it is parsing the HTML. A new SOAP dispatch object is created on lines 4-7 in Listing 1. Here, the class to be published is called Slash, and it is implemented in the same script that dispatches the SOAP requests.

Listing 1: Wrapping a CGI script into a web service
 1  #!/usr/bin/perl
 2  # Wrap a CGI script into a web service
 3  use strict;
 4  use SOAP::Transport::HTTP;
 5
 6  my $dispatch = SOAP::Transport::HTTP::CGI->dispatch_to('.', 'Slash');
 7  $dispatch->handle;
 8
 9  package Slash;
10 use HTML::TreeBuilder;
11 use LWP;
12 use constant BASE_URL => 'http://slashdot.org';
13
14 sub getPollTopic {
15   my $homepage = get_page();
16   return { error => 1 } unless $homepage;
17
18   my $tree = HTML::TreeBuilder->new;
19   $tree->parse($homepage); $tree->eof;   # eof() tells the parser the document is complete
20
21   my %poll = (); # to be returned
22
23   for my $td ($tree->look_down("_tag", "td") ){
24     if( my $form = $td->look_down("_tag", "form") ){
25       if($form->attr("action") eq 'http://slashdot.org/pollBooth.pl' ){
26
27         # get the pollID
28         if( my $input = $td->look_down("_tag", "input")){
29           if( $input->attr("name") eq 'qid' ){
30             $poll{pollid} = $input->attr("value");
31           }
32         }
33
34        # get the poll topic
35        my $b = $td->look_down("_tag", "b");
36        $poll{topic} = $b->as_text;
37
38        # the poll options
39        for my $i ( $td->look_down("_tag", "input")){
40          next unless $i->attr("name") eq "aid";
41          my $label                       = ($i->right)[0];
42          my $aid                          = $i->attr("value");
43          $poll{options}->{$label}     = $aid
44          }
45        }
46      }
47    }
48    return \%poll;
49  }
50
51 sub vote {
52    my ($class, $pollID, $choice) = @_;
53    if( get_page("/pollBooth.pl?aid=$choice&qid=$pollID") ){
54      return SOAP::Data->type( Boolean => 'true' );
55    }else{
56      return SOAP::Data->type( Boolean => 'false' );
57    }
58  }
59
60  sub getPollResults {
61     my ($class, $pollID) = @_;
62
63     my %choices;
64     if( my $content = get_page("/pollBooth.pl?qid=$pollID&aid=-1")){
65       my $tree = HTML::TreeBuilder->new;
66       $tree->parse( $content ); $tree->eof;   # eof() tells the parser the document is complete
67       my $last;
68       for my $td ($tree->look_down("_tag", "td") ){
69         if( my $i = $td->look_down("_tag", "img") ){
70           if( $i->attr("src") =~ /leftbar.gif$/ ){
71             chop $last;
72
73             # point to a list ( actual votes and percentage )
74             $choices{ $last } = [ split( m!\s+/\s+!, $td->as_text, 2) ];
75           }
76         }
77         $last = $td->as_text;
78       }
79
80     }else{
81        return {error => 1};
82     }
83     return \%choices;
84   }
85
86   # Helper function to get pages in a more robust way
87   sub get_page {
88    my ($path) = @_;
89    my $ua = LWP::UserAgent->new;
90    my $uri= URI->new_abs($path, BASE_URL);
91    my $rq = HTTP::Request->new( GET => $uri );
92    my $rs = $ua->request($rq);
93
94    return ($rs->is_error ? undef : $rs->content );
95   }

On line 9, the Slash class begins. By the way, for those unfamiliar with creating Perl classes, use the online documentation tool perldoc to look up Tom Christiansen's object-oriented tutorial (the keyword is "perltoot"). In order to fetch and parse Slashdot pages, the LWP and HTML::TreeBuilder modules need to be included after the package declaration. (These modules will be explained in a moment.) Line 12 introduces Perl's mechanism for creating unchangeable constants. This method of creating constants should be preferred over declaring file-scoped lexical variables or using package globals. After all, no programmer wants constants accidentally changed at run time.

This class isn't like most standard object classes because all its methods are class methods. There is no per-object information, so there is no constructor method (often called new()). Because Perl always passes the class name or object reference to invoked methods, every method will have one additional argument at the head of the parameter list.
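The extra leading argument is easy to see in isolation. The following sketch uses a hypothetical Greeter class (not part of the article's code) with a single class method and no constructor.

```perl
use strict;
use warnings;

package Greeter;   # hypothetical class, for illustration only

# A class method: there is no constructor and no per-object data.
sub greet {
    my ($class, $name) = @_;   # Perl supplies the class name first
    return "$class says hello to $name";
}

package main;

# The caller passes one argument; the method receives two.
my $msg = Greeter->greet('Slashdot');
print "$msg\n";   # prints "Greeter says hello to Slashdot"
```

The methods in the Slash class follow the same pattern, which is why each one that takes arguments shifts the class name off the front of its parameter list.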

The first method, getPollTopic(), gets the content of Slashdot's front page using the LWP library, which contains many helper classes used for getting data through HTTP. Those new to LWP should have a look at O'Reilly's Web Client Programming with Perl, which, although out of print, is available on O'Reilly's Open Book site (see Resources). To simplify things a bit, all the LWP code is centralized in the get_page() subroutine, starting on line 87. This subroutine expects to be passed only the last part of the URL. Since this class only gets pages from Slashdot, constructing full URLs outside of this function would produce a lot of repeated code. We use the URI method new_abs to create an absolute URL based on the additional path information passed into get_page().
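As a small illustration of what new_abs does with a relative path, the sketch below resolves a path against the same base URL used in Listing 1. It assumes the URI module is installed (it is a dependency of LWP); the poll ID in the query string is a made-up value.

```perl
use strict;
use warnings;
use URI;

use constant BASE_URL => 'http://slashdot.org';

# Hypothetical path; 42 is not a real Slashdot poll ID.
my $path = '/pollBooth.pl?qid=42&aid=-1';

# new_abs resolves the relative reference against the base URL.
my $uri = URI->new_abs($path, BASE_URL);
print "$uri\n";
```

Because the resulting object stringifies to the full URL, it can be handed straight to HTTP::Request->new, as get_page() does.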

Back in getPollTopic, the front page is parsed into a tree-like structure with HTML::TreeBuilder. Although not the fastest way to deconstruct HTML, it is a robust and easy-to-use library worth learning. A new tree builder is created on line 18 and fed the HTML content of the Slashdot home page on line 19.

By grouping HTML by tags, HTML::TreeBuilder allows the programmer to iterate through the set of objects corresponding to HTML tags for the current scope using the look_down method. On line 23, the scope is the entire HTML content of Slashdot's front page. The poll information is located in a <TD> tag. Every time look_down finds just such a tag, it returns an HTML::Element object whose scope includes all the HTML between the opening and closing <TD> tags. The loop that begins on line 23 is trying to find the <TD> tag that contains the form whose action attribute points to the pollBooth.pl program. (Wouldn't it be nice if HTML simply described content and not layout?) Notice how the scope of the HTML parsing changes from the whole document on line 23 to just the contents of a <TD> tag.

Once the right table data element is located, extracting the poll ID is easy. In this section of HTML, there is a form with a hidden input element whose name attribute is qid. The value attribute of this element is the key that Slashdot uses to identify this poll. Using the attr method, the value attribute is extracted and stored in a hash called %poll on line 30.

Finding the poll topic is easy too because it is the only bolded text in this section. By using the look_down method for the current $td scope, the sole <B> element is located. The as_text method converts all the content of the current scope into plain, unmarked ASCII text. This is perfect for extracting the topic from the $b scope, as is done on line 36.

Listing 2: A poll choice as an HTML radio button
<BR><INPUT TYPE="radio" name="aid" VALUE="3">Heat Ray Vision

The last task for getPollTopic is the trickiest. All of the poll choices need to be extracted along with the value needed to vote for each choice. The choices in HTML are in the form of radio buttons (see Listing 2 for an example), which means that the value needed to vote for a choice is in the value attribute of an <input> tag whose name attribute equals aid. The human-readable choice is just to the right of that input tag. Finding the right <input> tags is simply a matter of iterating over the whole list of them and finding the ones with the desired name attribute. To get the text on the right of this <input> tag, use the cleverly named right method, which returns a list of either HTML::Element objects representing HTML tags or simply strings that are unmarked-up text. The human-readable text will always be the first element to the right of the <input> tag. On line 43, a hash that maps the human-readable choice label to the value needed to register a vote for that choice is created. Line 43 creates a hash of hashes. Those unfamiliar with complex Perl data structures should have a look at the Perl Data Structures Cookbook, which is shipped with Perl and is viewable with the command perldoc perldsc.
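The look_down and right calls can be tried on a small fragment without fetching anything from Slashdot. The sketch below assumes HTML::TreeBuilder is installed and uses an invented form shaped like Listing 2; the poll ID, the topic, and the second choice are made up. It trims whitespace around each label, which Listing 1 does not do.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# An invented fragment shaped like the radio buttons in Listing 2.
my $html = <<'HTML';
<form action="http://slashdot.org/pollBooth.pl">
<input type="hidden" name="qid" value="42">
<b>Favorite Super Power</b>
<br><input type="radio" name="aid" value="3">Heat Ray Vision
<br><input type="radio" name="aid" value="4">Flight
</form>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # signal the end of the document

my %options;
for my $input ($tree->look_down(_tag => 'input', name => 'aid')) {
    my ($label) = $input->right;              # first sibling to the right
    next if !defined $label || ref $label;    # skip elements; we want text
    $label =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
    $options{$label} = $input->attr('value');
}

print "$_ => $options{$_}\n" for sort keys %options;
$tree->delete;    # free the tree's memory
```

Passing name => 'aid' to look_down filters on the attribute directly, which replaces the explicit next unless test on line 40 of Listing 1.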

Now that all this information about the current poll has been collected into a hash, it is time to return this information to the caller. Line 48 is returning a reference to this %poll hash. All SOAP methods are expected to return a single value, and this is the most natural way to squeeze a hash into a scalar.

The rest of the API methods are pretty easy. In order to vote, Slashdot needs a poll ID and a choice number. These can be submitted as a GET query to Slashdot's pollBooth.pl program. The API method vote packages both of these values passed in by the user into a URL that is submitted to Slashdot. Notice that although vote only takes two arguments, there is a third argument called $class. Because this is a method, Perl passes the class name as the first element of the parameter list.

The success or failure of the page request is returned to the caller as a boolean value. SOAP::Lite provides an easy facility for creating SOAP data types that do not have a direct Perl mapping. Using the SOAP::Data->type call, any SOAP data type can be created (as shown on lines 54 and 56 in Listing 1).

Listing 3: HTML for the results of one poll choice
<TR>
          <TD width=100 align=right>Heat Ray Vision </TD>
          <TD width=450><NOBR>
          <IMG src=http://images.slashdot.org/leftbar.gif width=4 height=20 alt="">
          <IMG src=http://images.slashdot.org/mainbar.gif height=20
             width=4 alt="0%">
          <IMG src=http://images.slashdot.org/rightbar.gif width=4 height=20 alt="">
          256 / <FONT color=006666>0%</FONT></NOBR>
          </TD>
</TR>

The last method, getPollResults, is only complicated because of the way Slashdot reports the poll results. Listing 3 shows the HTML that represents the results for one poll choice. The human-readable choice label, the actual number of votes, and the percentage of votes this choice received all need to be recorded. Unfortunately, the human-readable string comes before any clear indicator that it is a poll choice and not some random text. Therefore, the code on lines 68-77 of Listing 1 looks for all <TD> tags and remembers the last one seen in the $last variable (line 77). Because all choices have graphs following them, a <TD> that contains an image called leftbar.gif tells us that the preceding <TD> held the human-readable text.

The hackery isn't quite over. The actual number and percentage of votes are going to be extracted as one string (line 74). This string is split on the slash that separates these values, which creates a list with two elements. The list is turned into an anonymous array so that it can be stuck in the %choices hash. As always, this hash is returned to the caller as a reference.
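The "remember the previous cell" trick and the split can be sketched without any HTML at all. In the example below, each hash stands in for one <TD>, and the has_bar flag is a hypothetical substitute for "this cell contains leftbar.gif". The first two cell texts come from Listing 3; the others are invented. The sketch trims trailing whitespace with a substitution where Listing 1 uses chop.

```perl
use strict;
use warnings;

# Hypothetical cell texts in document order. has_bar marks a cell that
# would contain the leftbar.gif image in the real page.
my @cells = (
    { text => 'Heat Ray Vision ', has_bar => 0 },
    { text => '256 / 0%',         has_bar => 1 },
    { text => 'Flight ',          has_bar => 0 },
    { text => '1024 / 2%',        has_bar => 1 },
);

my (%choices, $last);
for my $cell (@cells) {
    if ($cell->{has_bar}) {
        # The previous cell held the label for this result.
        (my $label = $last) =~ s/\s+$//;
        # Split "votes / percent" into a two-element anonymous array.
        $choices{$label} = [ split m!\s+/\s+!, $cell->{text}, 2 ];
    }
    $last = $cell->{text};
}

for my $label (sort keys %choices) {
    my ($votes, $percent) = @{ $choices{$label} };
    print "$label: $votes votes ($percent)\n";
}
```

Storing each pair as an array reference is what allows a single hash value to carry both numbers back to the caller.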


The CGI client

Creating a Web client that takes advantage of this API is straightforward. Listing 4 is a simple CGI script that uses the SOAP service described above. Like many CGI programs, this one handles multiple states. When the user first runs this program, a screen very much like that in Figure 2 appears. It displays the topic and presents the voting choices as a drop-down menu.


Figure 2: Initial voting screen

In Listing 4, lines 17-20 are a switch statement that runs either paint() or vote() (not to be confused with the API method vote) depending on the value of the CGI parameter action. Of course, the first time this program is called, this parameter is not set, so the paint() subroutine is called, which produces a screen like the one in Figure 2.
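For readers who have not seen a for loop used as a switch, here is a stripped-down sketch of the same idiom. The paint() and do_vote() subroutines are hypothetical stand-ins that only return a string, and the sketch defaults an unset action to the empty string, which Listing 4 does not do.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real paint() and vote() subroutines.
sub paint   { return 'paint' }
sub do_vote { return 'vote'  }

sub dispatch {
    my ($action) = @_;
    $action = '' unless defined $action;   # first visit: no parameter set

    # for aliases $_ to $action, so the bare pattern match tests it.
    for ($action) {
        /^vote/ && do { return do_vote() };
        return paint();                    # default case
    }
}

print dispatch(undef),  "\n";   # prints "paint"
print dispatch('vote'), "\n";   # prints "vote"
```

Adding another state is a matter of adding one more pattern line above the default.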

Drilling down into paint(), the first thing to notice is that this subroutine expects an initialized CGI object and possibly an HTML fragment. These are stored in the local variables $cgi and $mesg respectively. To get the poll information, a SOAP client call to getPollTopic begins on line 24. To avoid typing errors, both the SOAP URI and proxy values are stored in constants. Line 27 invokes the remote method. It is tempting to think of $resp as the return value of getPollTopic. While this object does have the return value somewhere, it also has information about any transmission errors that may have occurred during the SOAP call. You can invoke the fault() method of $resp to test for the presence of these errors. If a fault occurred, further information about it can be determined using the faultcode() and faultstring() methods. If there is no error, the return value of getPollTopic can be extracted with the result() method, as seen on line 35.

Recall that this API method returns a reference to a hash. In order to create a drop-down menu that lists the voting choices, a new hash is created. This new hash maps the Slashdot value for that voting choice to the human-readable label. This may seem counterintuitive, but this is the structure that CGI's popup_menu(), used in line 52, expects.
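The inversion itself is a one-liner in Perl. The sketch below shows an alternative to the while/each loop on lines 37-39 of Listing 4, using reverse on the flattened hash; the labels and IDs are sample values, and the shortcut is only safe when every choice ID is unique.

```perl
use strict;
use warnings;

# Labels and IDs here are sample values; the real ones come from
# the options member returned by getPollTopic().
my %options = ( 'Heat Ray Vision' => 3, 'Flight' => 4 );

# reverse() swaps keys and values; safe only while the values are unique.
my %menu_options = reverse %options;

print "$_ => $menu_options{$_}\n" for sort keys %menu_options;
```

The keys of the inverted hash become the submitted form values, and the hash values become the text the visitor sees in the menu.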

Line 41 begins printing all the HTML elements necessary to create a voting form. If this subroutine was passed a message, the message is displayed after the voting form.


Figure 3: Voting results

Once the Web visitor has selected a choice and pressed the submit button, the CGI script will call the vote() subroutine. This subroutine is a little expensive because it needs to make two API method calls, vote and getPollResults. This means two HTTP calls and all the network latency those calls entail. Aside from a lot of error checking, the code is relatively terse. Line 100 gets the hash of voting results, which is then turned into a visually appealing table and passed to the paint() subroutine for display.

Listing 4: A CGI client for the Slashdot poll web service
 1  #!/usr/bin/perl
 2  # a web client for the slashdot poll client
 3  use strict;
 4  use SOAP::Lite;
 5  use CGI qw/:all *table/;
 6  use CGI::Carp qw/fatalsToBrowser/;
 7
 8  use constant SOAP_URL =>
 9                              'http://marian.daisypark.net/Slash';
10 use constant SOAP_PROXY =>
11                             'http://marian.daisypark.net/~jjohn/slashdot_poll.pl';
12
13
14 my $cgi= CGI->new;
15 my $action = $cgi->param("action");
16
17 for($action){
18    /^vote/ && do{ vote( $cgi ); last; };
19    paint($cgi);
20   }
21
22 sub paint {
23    my ($cgi, $mesg) = @_;
24    my $client = SOAP::Lite->uri(SOAP_URL);
25    $client->proxy(SOAP_PROXY);
26
27    my $resp = $client->getPollTopic();
28
29    if( $resp->fault ){
30      die
31         "ERROR: SOAP Failure: ",
32         $resp->faultcode, ":",
33         $resp->faultstring;
34     }
35     my $poll = $resp->result();
36     my %menu_options;
37     while( my($k, $v) = each %{ $poll->{options} } ){
38        $menu_options{$v} = $k;
39      }
40
41      print
42         header,
43         start_html( -title => "Slash Poll Proxy",
44                         -bgcolor => "#FFFFFF",
45                        ),
46          h1("Slash Poll Proxy"),
47          p("The current poll topic is: ", b($poll->{topic})),
48          p("Cast your vote by selecting one of the following:"),
49          start_form,
50          qq(<input type="hidden" name="action" value="vote">),
51          qq(<input type="hidden" name="pollID" value="$poll->{pollid}">),
52          popup_menu(
53                               -name => 'choice',
54                               -labels => \%menu_options,
55                               -values => [ keys %menu_options ],
56                             ),
57 		   submit,
58         end_form,
59         hr,
60         $mesg,
61         end_html;
62   }
63
64   sub vote {
65     my ($cgi) = @_;
66     my $choice = $cgi->param("choice");
67     my $pollid = $cgi->param("pollID");
68
69     if( !$choice || !$pollid ){
70       return paint($cgi, font({color=>"#FF0000"},
71         						  "Error! Vote again"));
72     }
73
74    my $client = SOAP::Lite->uri(SOAP_URL);
75    $client->proxy(SOAP_PROXY);
76
77   # vote
78   my $resp = $client->vote($pollid, $choice);
79
80   if( $resp->fault ){
81     die
82         "ERROR: SOAP Failure: ",
83         $resp->faultcode, ":",
84         $resp->faultstring;
85   }
86
87   unless( $resp->result ){
88      return paint($cgi, font({color=>"#FF0000"},
89                                     "Vote failed! Vote again"));
90   }
91
92   # Get the results
93   $resp = $client->getPollResults($pollid);   # reuse $resp; a second "my" would mask line 78
94
95   if( $resp->fault ){
96     return paint($cgi, font({color=>"#FF0000"},
97                         			"Can't get results"));
98     }
99
100    my $results = $resp->result();
101    my $ret = start_table;
102    for my $r ( keys %{$results} ){
103        $ret .= Tr(td( 
104 	          		[$r, b($results->{$r}->[0])]
105    	  		           )
106 	  			   );
107     }
108    $ret .= end_table;
109     return paint( $cgi, $ret);
110 }


Conclusion

Although SOAP can be used like an RPC mechanism, its real strength and promise come from its object-oriented nature. Every object can store data specific to that particular instantiation. This means that Web objects can remember program states. Because SOAP is platform neutral, it is possible to create an object in a client Perl script that changes class data in a Python SOAP server. How far SOAP can implement OOP concepts remains to be seen. Can one derive subclasses from a SOAP object? What about inheritance? Even for moderately long inheritance trees that have to traverse several web servers, a programmer may have to wait a long time for a SOAP call to return. In any case, SOAP has an interesting future ahead of it.


Resources

