Web services for bioinformatics, Part 2

Integrate high-throughput services with web services

Current bioinformatics workflows require screen-scraping the results of different bioinformatics tools from several web sites. High-throughput services exposed as web services give researchers seamless access to the vast computational and storage resources of a virtual organization. In this article, you can learn the details of integrating Open Grid Services Architecture (OGSA), web services, and the NC BioGrid.

Mine Altunay (maltuna@unity.ncsu.edu), Student, North Carolina State University

Mine Altunay: Mine is currently pursuing her PhD at the Computer Engineering Department of North Carolina State University. Her studies focus on grid computing and workflow management in OGSA, with a strong emphasis on authorization and trust management issues. She is also a member of the Fungal Genomics Laboratory, where she has worked on several bioinformatics projects, as well as the establishment and integration of their computational and data grids with North Carolina BioGrid. You can contact Mine at maltuna@unity.ncsu.edu.



Daniel Colonnese (dcolonn@ncsu.edu), Student, North Carolina State University

Daniel Colonnese: Daniel has recently completed his master’s degree in computer science from NC State University. He has worked on a number of projects in ecommerce, life sciences, and grid computing. His interests include software reliability and service-oriented architectures. He will be joining Lotus/Portal technical sales in June 2004. You can contact Daniel at dcolonn@ncsu.edu.



Chetna Warade (warade@us.ibm.com), Developer, IBM Healthcare & Life Sciences

Chetna Warade: Since 1999, Chetna has worked on a wide range of projects varying from systems programming to bioinformatics. She has a strong interest and aptitude in software architecture and development, systems programming, and various emerging technologies such as web services, life sciences, and the new breed of Internet technologies. You can contact Chetna at warade@us.ibm.com.



25 May 2004

High-throughput services

There are considerable costs associated with running a high-throughput application, including hardware, storage, maintenance, and bandwidth. Researchers are now taking advantage of economies of scale by building large shared systems for bioinformatics processing. Some have invested in special-purpose hardware or configurable Field Programmable Gate Arrays (FPGAs) for specific applications. The methods of submitting a job from within a grid are well established, and the process of consuming web services from within a grid is addressed by the Open Grid Services Architecture (OGSA). This article explains the complementary process: accessing a high-throughput application remotely via web services.

The convergence of several trends, including grid technologies and web services, has made a new model for bioinformatics possible. Processing power, storage, and network bandwidth have all advanced to the point where it is now feasible to provide high-throughput applications as web services.

The size of XML output reports is a hindrance to integrating high-throughput bioinformatics applications with web services. Since processing power is usually less scarce than bandwidth, most high-throughput applications benefit from file compression, and XML documents are particularly well suited to it. For example, a series of BLAST output reports can be reduced to less than 1 percent of its original size using the zlib lossless data-compression library. The representation of DNA sequence data takes up the most space: the genetic coding data is a string over the characters A, C, G, and T, so you can compress each 8-bit character down to 2 bits with the mapping A=00, C=01, G=10, T=11. Since most existing programs expect character input, you are trading CPU time for bandwidth.
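The two-bit packing and the zlib compression described above can be sketched in a few lines of Python. This is a hypothetical illustration of the technique, not code from the GridBlast service itself.

```python
import zlib

# Two-bit codes for the four nucleotide characters: A=00, C=01, G=10, T=11.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_sequence(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    packed = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODES[base]
        packed.append(byte)
    return bytes(packed)

seq = "ACGT" * 1000
packed = pack_sequence(seq)               # 4,000 characters -> 1,000 bytes
compressed = zlib.compress(seq.encode())  # lossless compression of the text form
```

Either approach reduces the bytes on the wire; the two can also be combined by compressing the packed form, again trading CPU time for bandwidth.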

Security can also be a major concern, both with regard to protecting sensitive data and to preventing abuse of computational resources. Encrypting the network traffic with the Secure Sockets Layer (SSL) is usually sufficient to protect sensitive data. Our web services use HTTP basic authentication, as described in the section Secure web services and Globus Security Infrastructure.

Running genomics applications on the NC BioGrid

The NC BioGrid (see Resources) has been built by a consortium of over 70 organizations, including universities and colleges; biomedical, biotechnology, and information technology companies; nonprofit institutions; and foundations. It provides terascale computing, petascale data storage, and high-speed networking for consortium members. Since Globus Toolkit 2.0 (for the computational grid) and Avaki (for the data grid) are already installed on the NC BioGrid, our design followed the specifications of this underlying middleware.

There are two main challenges associated with running an application on the grid: submitting jobs remotely via web services and integrating secure web services standards with Globus Security Infrastructure (GSI).

When the web service receives a request, the XML request is parsed, and the necessary information, such as the nucleotide chain and the specifications of a sequence, is assembled to match the Basic Local Alignment Search Tool (BLAST) application interface. Since there may be more than one sequence inside one XML document, each submitted sequence is parsed separately, and a FASTA-format file is created for each one under the specific user name and job identification number. The statistical parameters required by the BLAST program are also collected from the XML document and passed to the BLAST executables on the computational nodes through the Resource Specification Language (RSL). Below is an example of how the globusrun command might be used.

Remote job submission
     globusrun -r bluejay002.ncbiogrid.org -f submit.rsl
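The per-sequence FASTA files described above could be produced with a short helper like the following Python sketch. The function name, file names, and directory layout are illustrative assumptions, not the service's actual code.

```python
import os

def write_fasta_files(sequences, user, job_id, base_dir="/tmp"):
    """Write one FASTA file per submitted sequence under a
    user- and job-specific directory, one sequence per file."""
    job_dir = os.path.join(base_dir, user, str(job_id))
    os.makedirs(job_dir, exist_ok=True)
    paths = []
    for n, (seq_id, seq) in enumerate(sequences):
        path = os.path.join(job_dir, "seq%d.fasta" % n)
        with open(path, "w") as f:
            # FASTA format: a ">" header line followed by the sequence.
            f.write(">%s\n%s\n" % (seq_id, seq))
        paths.append(path)
    return paths
```

Keeping each sequence in its own file under a per-user, per-job directory lets the master node hand each file to a separate BLAST subjob.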

The RSL invocation script must also set the environment variables that Globus Toolkit 2.0 requires. Variables such as the name of the authentication server, the MyProxy server, and the user proxy servers are all configured at runtime by the web service provider before the actual job submission. The following code is an example of how to set those variables.

Setting environment variables on the Grid
setenv MYPROXY_SERVER_DN "/O=Grid/OU=NCBioGrid/OU=TestBed/CN=host/bluejay015.ncbiogrid.org"
setenv MYPROXY_SERVER bluejay015.ncbiogrid.org

To fully exploit the processing power of the grid, users may submit several jobs simultaneously. For this reason, RSL is used in a completely generic manner: at runtime, the master node decides how many sequences to submit and produces the required RSL code, with different specifications for each job. To make the best use of resources, a resource manager is assigned to each job at runtime. Below is the code for generating RSL at runtime.

Generate RSL code at runtime
# Build a multi-request RSL string: one BLAST subrequest per sequence.
my $rsl_string = "+";
foreach (0 .. $numSeq - 1) {
    $rsl_string .= "(&(executable=/ncbg/apps/blast/bin/blastall)
 (directory = /home/\"$user\"/grid-blast)
 (arguments =
  -p \"$program\" -d \"$db\" -i \"$inputFiles[$_]\" -m 7 -o \"$outputFiles[$_]\")
 (environment = (BLASTDB /ncbg/data/blast/current/n/ecoli.nt))
 (count = 1)
 (resourceManagerContact=bluejay002.ncbiogrid.org))";
}

Most bioinformatics applications use databases to search for similarities between different species and their genomes. The National Center for Biotechnology Information (NCBI) maintains a separate database for each species' genome. Based on the bioinformaticist's requirements, the proper database must be installed and made available to all computational nodes participating in the computation. The Avaki data grid provides virtual shared directories, so each participating node can easily access a local replica of the required database. At runtime, a single primary node generates the RSL code and specifies the location of the virtual directory that contains the necessary database.


Secure web services and Globus Security Infrastructure

Another challenge of running web-service-enabled applications on a grid is integrating Globus Security Infrastructure (GSI) with secure web services. When you deploy a grid application as a web service on an HTTP server such as Apache, three separate security domains must interact. First, users need username/password credentials to access the web server. Second, users need separate credentials to access the specific web service, or any operation within it. Finally, to run applications on the grid, users must have proper Globus credentials, which consist of a username, a public/private key pair, a hostname, and the digital signature of the certificate authority.

In the Globus Toolkit GSI, you log into a host machine participating in the grid with a set of credentials, and you are authenticated based on them. If authentication succeeds, a server-side user proxy is created on your behalf. Whenever you want to access resources on the grid, your user proxy talks to the resource proxy and confirms your identity. This saves you from typing your password each time you access a different resource; in other words, single sign-on is provided through the interaction of the user and resource proxies.

However, exposing an application outside the grid via web services complicates the GSI system. From the Globus environment's perspective, the Apache user is the job owner, so Globus expects the Apache user to hold the proper credentials for job submission. However, only the end-user bioinformaticist has such credentials. Therefore, the credentials must be propagated to the Globus environment.

This propagation of credentials can be summarized as two steps: first, retrieving the username and password at the web service provider; second, passing those credentials from Apache to Globus.

HTTP basic authentication is the least common denominator for security among web service providers. While basic authentication is not strictly necessary to secure the web service, it adds both speed and standardization to the security system. There are several ways to map HTTP basic authentication to grid users. In most systems, valid grid users also have regular user accounts on the machines that participate in the grid. Preferably, these user accounts are stored in a standard identity registry such as LDAP; most often, one machine runs an LDAP server such as iPlanet or OpenLDAP. Several Apache modules, including mod_pam, auth_ldap, LDAP_auth, mod_authz_ldap, and mod_ldap, map valid users from an LDAP directory to valid Apache users. While every grid user is a user on a participating machine, the reverse is not true; therefore, security policies are defined in the Apache configuration for both groups and users. For most of these modules, everything that can be said about individual users also applies to groups of users.
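Such a mapping might look like the following Apache configuration fragment, which protects a location with HTTP basic authentication backed by an LDAP directory. This is a hedged sketch using auth_ldap-style directives; the hostname, base DN, and location are illustrative assumptions, and exact directive names vary between the modules listed above.

```
<Location /gridblast>
    AuthType Basic
    AuthName "NC BioGrid"
    # Look up the supplied username in the LDAP directory (hypothetical server).
    AuthLDAPURL ldap://ldap.example.org/ou=People,o=ncbiogrid?uid
    require valid-user
</Location>
```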

In Perl, you can retrieve the username and password at the web service provider by calling the following functions from a CGI script running under mod_perl.

Retrieve authentication information
# Inside a mod_perl handler: $r is the Apache request object.
my $r = shift;
# get_basic_auth_pw returns a status code and the cleartext password
# supplied by the client in the Authorization header.
my ($ret, $password) = $r->get_basic_auth_pw;
my $user = $r->user;

Once you have the username and password, the next step is to retrieve the grid credentials necessary to run a job.

An open source project called MyProxy, from the National Center for Supercomputing Applications (NCSA), provides a mechanism to map users to their grid credentials, which consist of a certificate and a private key. With MyProxy, the certificate and private key files need not be stored on the same machine as the web service consumer. This provides greater security and allows trusted parties to renew credentials, so that long-running tasks do not fail because of an expired proxy credential. NCSA runs a public MyProxy server, and the software is available from the Partnership for Advanced Computational Infrastructure and the NSF Middleware Initiative (see Resources). MyProxy APIs are also available through the Globus Commodity Grid Kits (see Resources).

To delegate the proper credentials to an Apache server on behalf of an end user, Apache must authenticate to the MyProxy server with that user's grid credentials and retrieve them into a file named with the appropriate user ID. An example for the user maltuna is given below:

Delegating credentials to an Apache server
myproxy-get-delegation -s bluejay015.ncbiogrid.org -l maltuna -o /tmp/x509up_maltuna

When many users access the Apache server simultaneously, each credential is stored under a specific user name, so no end user can access or overwrite someone else's credential file. By default, the credential would instead be stored under the Apache user's ID, for example /tmp/x509up_u6448.
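The per-user retrieval can be wrapped in a small helper. The following Python sketch only builds the myproxy-get-delegation command line and the per-user output path, following the server name and naming scheme from the example above; the function name is a hypothetical convenience, not part of the service.

```python
def delegation_command(user, myproxy_server="bluejay015.ncbiogrid.org"):
    """Build the myproxy-get-delegation invocation and the per-user
    proxy file path, so concurrent users never share a credential file."""
    proxy_path = "/tmp/x509up_%s" % user
    cmd = ["myproxy-get-delegation",
           "-s", myproxy_server,   # MyProxy server holding the credential
           "-l", user,             # account name under which it was stored
           "-o", proxy_path]       # per-user output file
    return cmd, proxy_path
```

The command list could then be handed to subprocess.call by the web service provider after it has authenticated the user.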

When the job is submitted to Globus, GSI checks the default user proxy server and the credentials under the job owner's name. Since MyProxy is not part of the GSI system, the Globus environment is not aware of any rights delegated to Apache. To link them together, the Apache server needs to set the X509_USER_PROXY environment variable to point at the delegated credential file.

Linking delegated rights and the Globus environment
setenv X509_USER_PROXY /tmp/x509up_$3
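In a Python-based submission wrapper, the same linkage could be set up by giving the globusrun child process an environment in which X509_USER_PROXY points at the user's delegated credential. This is a hedged sketch; the helper name and the per-user path scheme are assumptions following the examples above.

```python
import os

def submission_env(user, base_env=None):
    """Return an environment for the globusrun child process with
    X509_USER_PROXY pointing at the user's delegated credential file."""
    env = dict(base_env if base_env is not None else os.environ)
    env["X509_USER_PROXY"] = "/tmp/x509up_%s" % user
    return env

# e.g. subprocess.call(["globusrun", "-r", "bluejay002.ncbiogrid.org",
#                       "-f", "submit.rsl"], env=submission_env("maltuna"))
```

Setting the variable only in the child's environment keeps concurrent submissions for different users from clobbering each other's credentials.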

An example: A web service for BLAST

The NC BioGrid is now hosting a web service for BLAST that lets you execute thousands of BLAST operations in parallel.

The BLAST web service is a SOAP::Lite service using SOAP over HTTP transport. Authentication for the web service is handled with HTTP basic authentication from Apache, bound to an iPlanet LDAP server through the mod_pam module. The Globus framework handles authorization. To use GridBlast, you must be a member of the NC BioGrid.

The GridBlast service on the NC BioGrid provides a model for how document-style services can let existing applications access remote processing power. Currently, the NC BioGrid has Globus Toolkit 2.0 installed. Although this infrastructure does not support grid services, the document-style service currently deployed will become a grid service when the grid is upgraded to Globus Toolkit 3.0.


Summary

This article addressed several issues relating to high-throughput web services, such as large data sets and security. It demonstrated the BLAST web service deployed on the NC BioGrid, describing various problems and possible solutions.

Acknowledgements

This article describes the joint work of the summer 2003 Extreme Blue team, the Fungal Genomics Lab at NC State University, and the North Carolina BioGrid. Our team set up a framework for deploying bioinformatics applications as high-throughput web services on the North Carolina BioGrid. The intern team consists of Mine Altunay (maltuna@unity.ncsu.edu), Daniel Colonnese (dcolonn@ncsu.edu), Chetna Warade (warade@us.ibm.com), and Lindsay Wilber (WilberL04@darden.virginia.edu). The team was advised by members of the IBM Life Sciences Group, including Virinder Batra (batra@us.ibm.com), Madhu Gombar (mgombar@us.ibm.com), Rick Runyan (runyan@us.ibm.com), Prasad Vadlamudi (prasadv@us.ibm.com), and Doug Brown (debrown@unity.ncsu.edu).

Resources

