Open source in the biosciences

Freely available software plays a special role for Big Pharma and others

Cameron Laird (claird@phaseit.net), Consultant, Phaseit, Inc.
Cameron is a full-time consultant for Phaseit, Inc., who writes and speaks frequently on open source and other technical topics. You can contact him at claird@phaseit.net.

Summary:  Bioinformatics and the use of open source in the biosciences are both still in the take-off phase. There's a lot of growth ahead of us. Here are a few of the technical software developments that will matter most in bioinformatics over the next year.

Date:  01 Nov 2002
Level:  Introductory

There are two kinds of bioscience. Open source is important to both, but in different ways. Let's look at both kinds from a developer's perspective, starting with Edsger Dijkstra's sage counsel:

"The programmer is in the unique position that his is the only profession in which such a gigantic ratio [10^9], which totally baffles our imagination, has to be bridged by a single technology. He has to be able to think in terms of conceptual hierarchies that are much deeper than a single mind ever needed to face before. ... [A program] has, unavoidably, the uncomfortable property that the smallest possible perturbations -- i.e., changes of a single bit -- can have the most drastic consequences."
--Edsger Dijkstra, 1989

The first kind of bioscience is "small" bioscience: natural history, paleontology, limnology, and other traditional pursuits. "Small" here refers strictly to budget constraints, not constraints on intellectual excitement or even physical challenge. For the present purposes, though, it's convenient to lump these biosciences together with other academic disciplines. developerWorks recently profiled open source's growing contribution in general science and engineering (see Resources later in this article).

The other kind is the biosciences or bioinformatics you hear mentioned in business or technology circles. There, the speakers invariably have something narrower in mind: research of medical or, occasionally, agricultural benefit. Gargantuan investment pools are in pursuit of those biosciences, and it's essential to understand the consequences in order to see bioinformatics clearly.

The bioinformatics landscape

Bioinformatics development is currently concentrated in three broad and occasionally overlapping categories:

  • Molecular biology includes genomics, proteomics, molecular modelling, chemical analysis, and allied areas. This area combines interesting scientific challenges in comprehension of fundamental chemistry, with requirements to integrate huge datasets, real-time analysis, and management of innovative physical devices.

  • Medical imaging juggles such technologies as X-ray, ultrasound, positron emission, nuclear magnetic resonance, and more, to deliver diagnostically relevant images to radiologists and other specialists. The tensions here are between quality of image, cost, speed of delivery, and ability to render results remotely with adequate security and quickness.

  • Workflow management itself has two aspects: management of patient records, and medical research leading to pharmaceutical approval (including Big Pharma -- trade slang for the major international pharmaceutical companies).

What's the big deal? (Or, how big do these Big Pharma projects get, anyway?)

For a feeling of just how big the really big bioinformatics projects are, consider that they are among the few areas of computing that actually require the petabyte.

A petabyte is 2 to the 50th power (1,125,899,906,842,624) bytes. It is "about" one thousand terabytes (actually, it's exactly 1024 terabytes, which is perhaps even easier to remember).

One petabyte is about 400 billion pages of text. To compare, Google writes that its engine searches more than 2 billion Web pages, 35 million non-HTML documents, and one terabyte (about 50 million printed pages) of Usenet messages. So as big as it is, the Web (as indexed by Google, which many consider to be the biggest indexer of the Web) seems to be something along the lines of 2,100,000,000 pages big. Even if we assume it's actually twice that big, say 4 billion pages in all (or even two or ten times that), it still adds up to just a fraction of a petabyte.

Only a few very data-intensive fields already require such things -- mainly (you guessed it) Big Pharma research. Some researchers in the fields of genomics or proteomics either are building or already have storage systems that are measured in petabytes. And there's an IBM Research project working on a petaflop computer named Blue Gene. It, too, is being built for work in genomics (hence the name).
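The petabyte arithmetic in the sidebar is easy to check for yourself. The sketch below is only a back-of-the-envelope illustration: it assumes binary (power-of-two) units, as the sidebar does, and the helper names are made up for this example. One thing it brings out is that the two page counts quoted (400 billion pages per petabyte, 50 million pages per terabyte) imply rather different average page sizes, so they are best read as rough orders of magnitude.

```python
# Back-of-the-envelope check of the petabyte figures quoted in the sidebar.
# Assumption: binary units (1 TB = 2**40 bytes, 1 PB = 2**50 bytes).

TERABYTE = 2 ** 40
PETABYTE = 2 ** 50


def implied_bytes_per_page(total_bytes, pages):
    """Average page size implied by a quoted (bytes, pages) pair."""
    return total_bytes / pages


def fraction_of_petabyte(pages, bytes_per_page):
    """Share of one petabyte taken up by `pages` pages of the given size."""
    return pages * bytes_per_page / PETABYTE


# Page sizes implied by the two figures quoted in the text.
sidebar_page = implied_bytes_per_page(PETABYTE, 400_000_000_000)
usenet_page = implied_bytes_per_page(TERABYTE, 50_000_000)

print(f"1 PB = {PETABYTE:,} bytes = {PETABYTE // TERABYTE} TB")
print(f"400 billion pages per PB implies ~{sidebar_page:,.0f} bytes per page")
print(f"50 million pages per TB implies ~{usenet_page:,.0f} bytes per page")

# A 4-billion-page Web under either assumption is still a small slice of a PB.
for label, size in (("sidebar", sidebar_page), ("Usenet", usenet_page)):
    share = fraction_of_petabyte(4_000_000_000, size)
    print(f"4 billion pages at the {label} page size: {share:.1%} of a petabyte")
```

Whichever page size you prefer, the conclusion of the sidebar holds: the indexed Web of the time fits in a small fraction of a petabyte.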

It's hard to overstate the volatility and turbulence inherent in bioinformatics projects. These projects require the skills software architects hone, because, more than anything else, bioinformatics juggles comically disparate scales: everything from physicians' illegible handwriting and medical outcomes that unfold over decades, to the mortal macroscopic consequences of a single amino acid substitution encoded in the genomes of thousands.

To illustrate the difference between bioinformatics and all other scientific and engineering software work, consider the pay of chief legal officers (CLOs) of bioscience companies. Many companies in other sectors don't even have top-level titled CLOs. In the biosciences, though, legal proprieties are so important that CLOs' average pay exceeds that of chief financial officers (CFOs), according to a 2002 study by Clark/Bardes Consulting. Almost uniquely among all sectors, bioscience companies focus on intellectual property (IP) protection and regulatory compliance. While their external communications emphasize their hunger for scientific innovation, bioscience companies must be managed conservatively, to defend the narrow legal grounds of their intellectual capital. Science and technology are valuable only to the extent that they conform to patent or other IP law, and are approved by regulatory agencies.

This makes bioinformatics a strange landscape for software engineers. On one hand, companies are able and willing to pay huge licensing fees for approved software, and they spend large budgets on projects that would be technically easy to automate. On the other, there's widespread dissatisfaction with what one anonymous researcher characterizes as "slow, buggy, inflexible" commercial software. Experienced researchers are no longer shocked when technically superior programs turn out to be available and utterly free of charge.

That does not mean that bioscience companies are welcoming open source. Keep in mind that they're far more sensitive to regulatory details than to price or engineering merit. Even with the best of intentions and leadership, an automation project can bog down for months while considering whether "sex" is a boolean or character variable. Although such data dictionary disputes might sound frivolous to programmers, they're very real to the custodians of tons of medical records that are the basis for product approval. In such an environment, any change is difficult. Open source is a change.
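To see why the boolean-versus-character question is more than hair-splitting, consider what each choice can and cannot record. The following is a minimal sketch, not any regulator's or vendor's actual data dictionary: the one-letter codes and the `parse_sex` helper are hypothetical, chosen only to show that a boolean has no room for "not recorded," while an explicit coded type does.

```python
from enum import Enum


class Sex(Enum):
    """Hypothetical coded values; a real data dictionary would define its own."""
    FEMALE = "F"
    MALE = "M"
    UNKNOWN = "U"   # a boolean column has no way to say this


def parse_sex(raw):
    """Map a raw record field to a Sex code.

    Missing or blank fields become UNKNOWN; unrecognized codes raise
    ValueError rather than being silently guessed at.
    """
    if raw is None or not raw.strip():
        return Sex.UNKNOWN
    return Sex(raw.strip().upper())   # raises ValueError for unknown codes


if __name__ == "__main__":
    for field in ("f", " M ", "", None):
        print(repr(field), "->", parse_sex(field).name)
```

Every decision embedded here (which codes exist, what a blank means, whether bad input is rejected) is exactly the kind of detail that custodians of approval-critical records have to agree on before any automation can proceed.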


So what's the good news?

So is open source making any headway at all in biosciences? Absolutely. Lincoln Stein, a researcher at the Cold Spring Harbor Laboratory, enjoys considerable acclaim for his Perl-based work "to make the human genome both accessible and navigable by scientists." O'Reilly has already published two Perl-and-bioinformatics books. Protein folders rely on Linux clusters to build up the computational muscle they need. Other computational molecular biologists have organized themselves to the point of sponsoring the Biopython, BioJava, and BioRuby Project Web sites. Development teams in particular companies use Postgres, Tcl, Octave, and other high-profile open source technologies in crucial programming roles.

Perhaps equally important, IBM and a few other large players in bioinformatics seem to "get the point" in regard to standards. IBM makes the bioinformatics tools developed by its researchers publicly available for non-commercial use. The company appointed Caroline Kovac its general manager for life sciences just a couple of years ago. Dr. Kovac is known for her support of the Interoperable Informatics Infrastructure Consortium (I3C), and impatience with the present situation in which "[n]one of these databases can talk to one another, ... [so t]he researchers have to do it, and they have to do it with keystrokes."

The major international pharmaceutical companies remain ambivalent or worse about open source. Their culture is all about IP protection. On the other hand, the costs of record-keeping are so horrendous -- often around $20,000 per participant in clinical trials -- that Big Pharma has become receptive to the simplification that standards-based open source affords.

One of the deliberate aims of I3C members such as IBM is to distinguish levels of IP ownership. Big Pharma has experience with sharing basic scientific data, at the same time that it jealously guards product details and documentation. I3C casts itself in an analogous role: infrastructure, or middleware, can be standards-based and open source, even while companies rely on proprietary programs layered over the infrastructure.

Timely technical breakthroughs encourage the move to open source. More and more developers understand that the Web scraping common in programming with molecular data is, as Stein calls it, "medieval torture." Development by way of Web services is far more satisfying and robust. Understanding of computational clusters and grids has also "turned a corner" and made supercomputing seem affordable on even a modest research budget. Much of the leadership in Web services and clustering comes from open source projects.
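The contrast between scraping and Web services can be made concrete with a small sketch. Everything in it is invented for illustration: the HTML page, the XML response, the accession number, and the function names do not come from any real database or service. The point is only that a scraper is tied to page layout, while a client of a structured response is tied to the data's field names.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical human-oriented page, as a scraper would see it.
HTML_PAGE = """<html><body><table>
<tr><td><b>Accession:</b></td><td>X00001</td></tr>
<tr><td><b>Length:</b></td><td>1,234 bp</td></tr>
</table></body></html>"""

# Hypothetical machine-oriented response carrying the same record.
XML_RESPONSE = """<sequence>
  <accession>X00001</accession>
  <length unit="bp">1234</length>
</sequence>"""


def scrape_length(html):
    """Scraping: the pattern depends on the exact markup around the value."""
    match = re.search(r"<b>Length:</b></td><td>([\d,]+) bp</td>", html)
    if match is None:
        raise ValueError("page layout changed; the scraper must be rewritten")
    return int(match.group(1).replace(",", ""))


def parse_length(xml_text):
    """Structured data: look the field up by name, wherever it appears."""
    text = ET.fromstring(xml_text).findtext("length")
    if text is None:
        raise ValueError("response has no <length> element")
    return int(text)


if __name__ == "__main__":
    print(scrape_length(HTML_PAGE), parse_length(XML_RESPONSE))
    # A purely cosmetic redesign (bold tags become strong tags) breaks the scraper.
    restyled = HTML_PAGE.replace("<b>", "<strong>").replace("</b>", "</strong>")
    try:
        scrape_length(restyled)
    except ValueError as err:
        print("scraper failed:", err)
```

A cosmetic change to the page is enough to break the first function, whereas the second keeps working as long as the provider keeps the `length` field, which is the robustness that makes the Web services approach so much more satisfying.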


Unknown advantages

Beyond this progress, though, open source offers biosciences three significant advantages that Big Pharma has only begun to appreciate: security, strategic ownership, and extensibility.

Security matters to Big Pharma. Fines paid for mishandling medical and research data are a matter of public record. As Bernard P. Wess Jr., president of Perseid Software Ltd., observes, "The computer industry has been poor in quality control." Until recently, bioscience has reacted to this by going more proprietary, in a search for someone to sue. There's plenty of evidence, though, that open source has at least as good a record as proprietary vendors in delivery of high-quality, and particularly high-security, programs. Recent events in election vote-tallying and national security affairs have called into question whether proprietary programs can ever be trusted for sensitive matters. Expect dramatic events on this front over the next year, as bioscience companies take drastic steps to improve their data security.

Big Pharma's IP culture has always been seen as an impediment to open source. Indeed, there have been plenty of failures to communicate open source's benefits within the old-line bioscience companies. Eventually, though, one of the companies will turn this around and appreciate that open source supports their business strategy. This is the argument Eric Raymond has often made: why would a company entrust its strategic assets to a vendor whose interests are so divergent? In this perspective, open source is the ideal way to insure against vendor caprice. As Raymond points out, "If you depend on closed-source software for the critical infrastructure of your business, you don't have control of your business -- you don't know what's in there! Open source gives you a way to get back control."

The final battle for developers in biosciences is to communicate possibilities. Most of the researchers, physicians, and managers in bioscience companies are too focused on their immediate responsibilities to appreciate the opportunities of pervasive automation. They concentrate on, at most, getting data from one place to another, without allowing themselves to demand the qualitative improvement possible when dataflows are reliably and securely linked. Many, many bioscientists are committed to solving the one problem that's in front of them. Even though they invest large amounts of staff time and capital in information technology (IT), they don't have the "re-use culture" that's common among software engineers. Their IT productivity is low. Worse, though, is that they probably are missing opportunities to operate on larger-scale patterns of data and theory for lack of properly generalized, extensible, and open software.

Another possibility still too-little understood is open source's ability to complement both IP protection and use of proprietary software. Standards-based open source software is in a unique position to enhance the value derived from proprietary software -- existing programs and data immediately become more valuable when open source "glue" combines them with other processes and resources. True IP security only grows when open source software upgrades the quality of a lab's IT operations, and makes the boundaries between IP and "commodity" data more explicit.


Summary

Bioscience is special. The scales of money, people, and time involved in its research are without equal. Its culture of regulation and IP protection is closer to legal work than to other scientific fields.

Until recently, open source has often appeared to bioscientists as some sort of novelty, or, worse, a threat to IP protection. In the last few years, though, solid achievements in clustering, genomic data management, Web publication, and scores of specific "vertical" applications have established open source as a serious technical alternative.

Big Pharma and other biosciences are just starting to realize how open source can systematically cut costs, improve security, allow their own workers to shift attention back to their "core competences" from proprietary IT expertise, and even promote better science. We're in the midst of a dramatic evangelical movement that teaches better ways for open source IT to support bioscientific goals. Perhaps the most consequential shift is that participants have begun to understand that standards-based open source can enhance biosciences' fundamental values. These are exciting times for open source bioinformatics.


Resources
