Level: Introductory
Cameron Laird (claird@phaseit.net), Consultant, Phaseit, Inc.
01 Nov 2002
Bioinformatics and the use of open source in the biosciences are both still in the take-off phase. There's a lot of growth ahead of us. Here are a few of the technical software developments that will matter most in bioinformatics over the next year.
There are two kinds of bioscience. Open source is important to both,
but in different ways. Let's look at both kinds from a developer's perspective, starting with Edsger Dijkstra's sage counsel:
"The programmer is in the unique position that his is the
only profession in which such a gigantic ratio [10^9], which
totally baffles our imagination, has to be bridged by a single
technology. He has to be able to think in terms of conceptual
hierarchies that are much deeper than a single mind ever
needed to face before. ... [A program] has, unavoidably,
the uncomfortable property that the smallest possible
perturbations -- i.e., changes of a single bit -- can have the
most drastic consequences." --Edsger Dijkstra, 1989
The first kind of bioscience is "small" bioscience: natural history, paleontology, limnology,
and other traditional pursuits. "Small" here refers strictly to budget
constraints, not constraints on intellectual excitement or even physical
challenge. For the present purposes, though, it's convenient to lump
these biosciences together with other academic disciplines.
developerWorks recently profiled open source's growing contribution in
general science and engineering (see Resources later in this article).
The other kind is the "biosciences" or "bioinformatics" you hear mentioned in business or
technology circles. The speakers there invariably have something
narrower in mind, though: research of medical or,
occasionally, agricultural benefit. Gargantuan investment pools are in
pursuit of those biosciences, and it's essential to understand
the consequences in order to see bioinformatics clearly.
The bioinformatics landscape
Bioinformatics development is currently concentrated in three broad and
occasionally overlapping categories:
What's the big deal? (Or, how big do these Big Pharma projects get, anyway?)
For a feeling of just how big the really big bioinformatics projects are,
consider that they tend to be one of the few areas of computer science
that actually require the petabyte.
A petabyte is 2 to the 50th power (1,125,899,906,842,624) bytes. It is
"about" one thousand terabytes (actually, it's exactly 1024 terabytes,
which is perhaps even easier to remember).
One petabyte is about 400 billion pages of text. For comparison, Google writes that
its engine searches more than 2 billion Web pages, 35 million
non-HTML documents, and one terabyte (about 50 million printed pages) of
Usenet messages. So the Web, as indexed by Google (which
many consider to be the biggest indexer of the Web), seems to be something
along the lines of 2,100,000,000 pages big. Even if we assume it's
actually twice that big, say 4 billion pages in all (or even two
or ten times that), it still adds up to just a fraction of a
petabyte.
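The arithmetic behind that comparison can be checked in a few lines of Python. This is only a sketch that takes the sidebar's own estimate of about 400 billion pages per petabyte at face value; the function name is a hypothetical helper, and the page counts are the ones quoted above plus the "twice" and "ten times" cases.

```python
# Sizes as defined in the sidebar (binary units).
PETABYTE_BYTES = 2 ** 50
TERABYTE_BYTES = 2 ** 40

# The sidebar's own rough estimate, not a measured value.
PAGES_PER_PETABYTE = 400_000_000_000


def fraction_of_petabyte(pages: int) -> float:
    """Return what share of one petabyte the given number of text pages fills."""
    return pages / PAGES_PER_PETABYTE


print("bytes in a petabyte:", PETABYTE_BYTES)
print("terabytes in a petabyte:", PETABYTE_BYTES // TERABYTE_BYTES)

# Google's quoted index size, then 4 billion pages, then twice and ten times that.
for pages in (2_100_000_000, 4_000_000_000, 8_000_000_000, 40_000_000_000):
    share = fraction_of_petabyte(pages)
    print(f"{pages:,} pages fill {share:.1%} of a petabyte")
```

Even the most generous case, 40 billion pages, comes to a tenth of a petabyte under this estimate.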
Only a few very data-intensive fields already require such things --
mainly (you guessed it) Big Pharma research. Some researchers in the
fields of genomics or proteomics either are building or already have storage systems
that are measured in petabytes. And there's an IBM Research project
working on a petaflop computer named Blue Gene. It, too, is being
built for work in genomics (hence the name).
- Molecular biology includes genomics, proteomics, molecular modelling, chemical analysis, and allied areas. This area combines interesting scientific challenges in understanding fundamental chemistry with requirements to integrate huge datasets, perform real-time analysis, and manage innovative physical devices.
- Medical imaging juggles such technologies as X-ray, ultrasound, positron emission, nuclear magnetic resonance, and more, to deliver diagnostically relevant images to radiologists and other specialists. The tensions here are between quality of image, cost, speed of delivery, and the ability to render results remotely with adequate security and speed.
- Workflow management itself has two aspects: management of patient records, and medical research leading to pharmaceutical approval (including Big Pharma -- trade slang for the major international pharmaceutical companies).
It's hard to overstate the volatility and turbulence inherent in
bioinformatics projects. These projects require the skills software architects
hone, because, more than anything else, bioinformatics juggles comically
disparate scales: everything from physicians' illegible
handwriting and medical outcomes that unfold over decades, to the mortal
macroscopic consequences of a single amino acid substitution encoded in the
genomes of thousands.
To illustrate the difference between bioinformatics and all
other scientific and engineering software work, consider the pay of chief legal
officers (CLOs) of bioscience companies. Many companies in other sectors
don't even have top-level titled CLOs. In the biosciences,
though, legal proprieties are so important that CLOs' average pay exceeds that
of chief financial officers (CFOs), according to a 2002 study by
Clark/Bardes Consulting. Almost uniquely among all sectors, bioscience
companies focus on intellectual property (IP) protection and regulatory
compliance. While their external communications emphasize their hunger
for scientific innovation, bioscience companies must be managed
conservatively, to defend narrow legal grounds of their intellectual
capital. Science and technology are valuable only to the extent they
conform to patent or other IP law, and are approved by regulatory
agencies.
This makes bioinformatics a strange landscape for software engineers.
On one hand, companies are able and willing to pay huge licensing fees for
approved software, and they spend large budgets on projects that would be
technically easy to automate. On the other, there's widespread
dissatisfaction with what one anonymous researcher characterizes as "slow,
buggy, inflexible" commercial software. Experienced researchers are no
longer shocked when technically superior programs turn out to be
available and utterly free of charge.
That does not mean that bioscience companies are welcoming
open source. Keep in mind that they're far more sensitive to regulatory
details than to price or engineering merit. Even with the best of
intentions and leadership, an automation project can bog down for months
while considering whether "sex" is a boolean or character variable.
Although such data dictionary disputes might sound frivolous to
programmers, they're very real to the custodians of tons of medical
records that are the basis for product approval. In such an environment,
any change is difficult. Open source is a change.
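To make the data dictionary dispute concrete, here is a minimal Python sketch. The `SexCode` values, the `to_boolean` helper, and the sample records are hypothetical illustrations, not taken from any real regulatory standard or records system. The only point is that a character code can carry a value, such as "unknown," that a boolean column has no room for.

```python
from enum import Enum
from typing import Optional


class SexCode(Enum):
    """Hypothetical single-character coding for one data dictionary entry."""
    FEMALE = "F"
    MALE = "M"
    UNKNOWN = "U"


def to_boolean(code: SexCode) -> Optional[bool]:
    """Map the character coding onto a boolean 'is_male' column.

    Returns None when the value has no boolean equivalent, which is the
    case a boolean-only schema would have to drop or guess at.
    """
    if code is SexCode.MALE:
        return True
    if code is SexCode.FEMALE:
        return False
    return None


# Hypothetical stored values, as they might appear in existing records.
records = ["F", "M", "U"]
converted = [to_boolean(SexCode(value)) for value in records]
print(converted)
```

Whichever representation a project picks, every existing record has to survive the conversion, which is why custodians of approval data treat such choices cautiously.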
So what's the good news?
So is open source making any headway at all in biosciences? Absolutely.
Lincoln Stein, a researcher at the Cold Spring Harbor Laboratory, enjoys
considerable acclaim for his Perl-based work "to make the human genome
both accessible and navigable by scientists." O'Reilly has already
published two Perl-and-bioinformatics books. Protein folders rely on
Linux clusters to build up the computational muscle they need. Other
computational molecular biologists have organized themselves to the point
of sponsoring the Biopython, BioJava, and BioRuby Project Web sites.
Development teams in particular companies use Postgres, Tcl, Octave, and
other high-profile open source technologies in crucial programming
roles.
Perhaps equally important, IBM and a few other large players in
bioinformatics seem to "get the point" in regard to standards. IBM makes
the bioinformatics tools developed by its researchers publicly available
for non-commercial use. The company appointed Caroline Kovac its general
manager for life sciences just a couple of years ago. Dr. Kovac is known
for her support of the Interoperable Informatics Infrastructure Consortium
(I3C), and impatience with the present situation in which "[n]one of these
databases can talk to one another, ... [so t]he researchers have to do it,
and they have to do it with keystrokes."
The major international pharmaceutical
companies remain ambivalent or worse about open source. Their culture is
all about IP protection. On the other hand, the costs of record-keeping
are so horrendous -- often around $20,000 per participant in clinical
trials -- that Big Pharma has become receptive to the simplification that
standards-based open source affords.
One of the deliberate aims of I3C members such as IBM is to distinguish
levels of IP ownership. Big Pharma has experience with sharing basic
scientific data, at the same time it jealously guards product details and
documentation. I3C casts itself in an analogous role: infrastructure, or
middleware, can be standards-based and open source, even while companies
rely on proprietary programs layered over the infrastructure.
Timely technical breakthroughs encourage the move to open source. More
and more developers understand that the Web scraping common in programming
with molecular data is, as Stein calls it, "medieval torture."
Development by way of Web services is far more satisfying and robust.
Understanding of computational clusters and grids has also "turned a
corner" and made supercomputing seem affordable on even a modest research
budget. Much of the leadership in Web services and clustering comes from
open source projects.
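The contrast between scraping and service-style access can be sketched in a few lines of Python. Both documents below are invented for illustration: the accession string, the HTML layout, and the XML element names are hypothetical and do not reflect any real database's format. The structured version is plain XML parsed locally, standing in for what a Web service might return, not a call to an actual service.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page as a human-oriented site might render one record.
HTML_PAGE = """
<html><body>
<table><tr><td><b>Accession:</b></td><td>ABC123</td></tr>
<tr><td><b>Length:</b></td><td>1542 bp</td></tr></table>
</body></html>
"""

# Hypothetical machine-oriented response carrying the same record.
XML_RESPONSE = """
<record>
  <accession>ABC123</accession>
  <length unit="bp">1542</length>
</record>
"""


def scrape_length(html: str) -> int:
    """Scraping: tied to the exact markup, so any layout change breaks it."""
    match = re.search(r"<b>Length:</b></td><td>(\d+) bp</td>", html)
    if match is None:
        raise ValueError("page layout changed; scraper needs rewriting")
    return int(match.group(1))


def parse_length(xml_text: str) -> int:
    """Structured access: reads a named element, independent of presentation."""
    root = ET.fromstring(xml_text.strip())
    return int(root.findtext("length"))


print(scrape_length(HTML_PAGE), parse_length(XML_RESPONSE))
```

A cosmetic change to the page, such as replacing the bold tags, makes `scrape_length` fail, while the structured parser depends only on the element name.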
Unknown advantages
Beyond this progress, though, open source offers biosciences three
significant advantages that Big Pharma has only begun to appreciate:
security, strategic ownership, and extensibility.
Security matters to Big Pharma. Fines paid for mishandling medical and
research data are a matter of public record. As Bernard P. Wess Jr.,
president of Perseid Software Ltd., observes, "The computer industry has
been poor in quality control." Until recently, bioscience has reacted to
this by going more proprietary, in a search for someone to sue. There's
plenty of evidence, though, that open source has at least as good a record
as proprietary vendors in delivery of high-quality, and particularly
high-security, programs. Recent events in election vote-tallying and
national security affairs have called into question whether proprietary
programs can ever be trusted for sensitive matters. Expect
dramatic events on this front over the next year, as bioscience companies
take drastic steps to improve their data security.
Big Pharma's IP culture has always been seen as an impediment to open
source. Indeed, there have been plenty of failures to communicate open
source's benefits within the old-line bioscience companies. Eventually,
though, one of the companies will turn this around and appreciate that
open source supports their business strategy. This is the
argument Eric Raymond has often made: why would a company entrust its
strategic assets to a vendor whose interests are so divergent? In this
perspective, open source is the ideal way to insure against vendor
caprice. As Raymond points out, "If you depend on closed-source software
for the critical infrastructure of your business, you don't have control
of your business -- you don't know what's in there! Open source gives you
a way to get back control."
The final battle for developers in biosciences is to communicate
possibilities. Most of the researchers, physicians, and managers in
bioscience companies are too focused on their immediate responsibilities
to appreciate the opportunities of pervasive automation. They concentrate
on, at most, getting data from one place to one other, without allowing
themselves to demand the qualitative improvement possible when dataflows
are reliably and securely linked. Many, many bioscientists are committed
to solving the one problem that's in front of them. Even though they
invest large amounts of staff time and capital in information technology
(IT), they don't have the "re-use culture" that's common among software
engineers. Their IT productivity is low. Worse, though, is that they
probably are missing opportunities to operate on larger-scale patterns of
data and theory for lack of properly generalized, extensible, and open
software.
Another possibility still too little understood is open source's
ability to complement both IP protection and use of proprietary
software. Standards-based open source software is in a unique position to
enhance the value derived from proprietary software -- existing programs and
data immediately become more valuable when open source "glue" combines
them with other processes and resources. True IP security only grows when
open source software upgrades the quality of a lab's IT operations, and
makes the boundaries between IP and "commodity" data more explicit.
Summary
Bioscience is special. The scales of money, people, and time involved in its
research are without equal. Its culture of regulation and IP protection
is closer to legal work than to other scientific fields.
Until recently, open source has often appeared to bioscientists as some
sort of novelty, or, worse, a threat to IP protection. In the last few
years, though, solid achievements in clustering, genomic data management,
Web publication, and scores of specific "vertical" applications have
established open source as a serious technical alternative.
Big Pharma and other biosciences are just starting to realize how open
source can systematically cut costs, improve security, allow their own
workers to shift attention back to their "core competences" from
proprietary IT expertise, and even promote better science. We're in the
midst of a dramatic evangelical movement that teaches better ways for
open source IT to support bioscientific goals. Perhaps the most
consequential shift is that participants have begun to understand that
standards-based open source can enhance biosciences' fundamental
values. These are exciting times for open source bioinformatics.
Resources
- Learn more about computing pioneer Edsger Dijkstra from the resources listed in his very own Google category. His article "On the Cruelty of Really Teaching Computing Science", Communications of the ACM, Volume 32, Number 12 (December 1989), pages 1398-1400, characterizes computing work in terms of its ambition to engineer across unprecedented scales of time and space.
- "Open source in the lab" explains how many scientific laboratories are replacing or supplementing proprietary software products with others based on open source (developerWorks, October 2002).
- The purpose of OpenInformatics is "to help scientists to become more aware of Open Source Software."
- The similar and related OpenScience project is "dedicated to writing and releasing free and Open Source scientific software."
- "Combating Creative Chaos in Bioinformatics" explains Lincoln Stein's "code of conduct for bioinformaticians" manifesto. Lincoln Stein, one of the software-oriented celebrities of bioinformatics, operates from the Cold Spring Harbor Laboratory.
- Clark/Bardes Consulting researches executive compensation, especially in technical fields.
- Patrick O'Brien's "Beginning Python for Bioinformatics" walks bioscientists through examples that show how easy it is to work with Python, and mentions the language's applicability for programming in the large.
- Eric Raymond's "The Magic Cauldron" explains how open source supports organizations' business strategies.
- See the role IBM middleware played in these Linux case studies, "Structural Bioinformatics supports drug discovery with IBM and Linux", and "IBM and MDS Proteomics alliance aims to speed drug development".
- IBM researchers are involved in a number of scientific and technological disciplines, including chemistry, computer science, electrical engineering, materials science, math, and physics. Learn more about it at the IBM Research site.
- IBM Life Sciences addresses IT needs specific to biotechnology, pharmaceuticals, genomics, proteomics, and healthcare.
- Find the Linux resource you're looking for in the developerWorks Linux zone.
About the author
Cameron is a full-time consultant for Phaseit, Inc., who writes and speaks
frequently on open source and other technical topics. You can contact him
at claird@phaseit.net.