Level: Introductory
Benjamin Lieberman, Principal Software Architect, BioLogic Software Consulting, LLC
08 Apr 2008
One of the most interesting challenges for information architects
is the situation in which large, proprietary, widely distributed data stores are
necessary to address a specific research question. Learn about the difficulties
involved in mining distributed data sources and the strategies that have been
developed to address these issues.
Challenges to organizations with distributed data
The explosive growth in data-storage capabilities and rapid network communication
protocols has allowed organizations to collect and store a staggering amount of
information on specific topics. These databases may reach petabyte size
(1 x 10^15 bytes, or a billion megabytes), a truly awe-inspiring amount of
data! Such massive information stores are often found in research applications
(such as biology, medicine, physics, and astronomy) and government agencies (such
as the IRS, Department of Defense, and Department of Labor). They may also occur
in business: for example, in insurance calculations for underwriting risk.
Government agencies often need to share data, but different data schemas,
interfaces, and communication techniques complicate these transfers. This is
especially true with regard to sensitive information, such as that used by the
Department of Defense or Homeland Security. These agencies often have legacy
systems that are proprietary, difficult to extend, or otherwise closed to external
systems. The information stored in these systems may be in a variety of binary
formats, some of which are no longer properly documented. To further complicate the
situation, the data of interest may be spread among multiple systems, hosted on
different networks, or housed in a variety of physical locations.
Businesses are often faced with the issue of widely distributed data when they
acquire another company. In this case, the systems of the two companies are rarely
compatible, resulting in a great deal of difficulty in mining the joined company
for answers to common management questions of profit, loss, risk, and costs.
Issues can also arise with product or service offerings, delivery, inventory
management, scheduling, and so on. The cost of integrating these diverse data
sources is a significant expense to the newly joined company.
Researchers are focused on the discovery of new knowledge. To acquire new
knowledge, they often need to find and understand the previous discoveries of
others. There are now massive databases containing information on the entire human
genome (as well as the genomes of other species), astronomical observations,
particle physics, drug discoveries, and a host of other fields. The challenge is
no longer collecting information, but mining the data to answer specific
research questions, such as why the human genome contains only modestly more
genes than that of a fruit fly. These databases are hosted in research centers
around the world, each with its own unique storage structure, access interface,
and communication protocol. Researchers who wish to collaborate with colleagues
must be able to easily pass information back and forth between data stores, as
well as have efficient mechanisms for processing data.
Given the massively diffuse nature of these data stores, the challenge is for
organizations to discover, access, and effectively use distributed information.
Skills and competencies
The problem of distributed data mining has many considerations, but there are
three primary concerns: the ability to discover the information, access that
information securely, and transfer the data efficiently enough to support the
processing need.
Data mining
The first issue with data mining of distributed data sources is discovery.
Unless you can find the data of interest, it's highly unlikely that you'll be able
to use the data source. Mechanisms for discovery vary, but they fall into two
principal categories: static and dynamic. You make a static discovery by
manually identifying the data-source system and preconfiguring the processing
system to use the identified source in its processing. This approach is the most
common but the least flexible. If newer sources are made available, there is no
guarantee that they will be incorporated. It's likely that unless someone notices
a new source, it will go unused. A more flexible (but more difficult to implement)
mechanism is to dynamically discover appropriate data sources. Dynamic
discovery is the idea behind Universal Description, Discovery, and
Integration (UDDI) and the Open Grid Services Infrastructure (OGSI). A data source
registers its capabilities and content with a central registry that can be
automatically queried at run time for matches to your processing needs (for
example, an astronomical database for a sky-survey search).
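
The difference between the two discovery styles can be sketched in a few lines of Python. This is a minimal in-memory illustration, not an implementation of UDDI or OGSI: the `Registry` and `DataSource` classes, the capability tags, and the `grid://example.org/...` endpoints are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """A remote data store described by the capabilities it advertises."""
    name: str
    endpoint: str                      # hypothetical address, for illustration only
    capabilities: set = field(default_factory=set)


class Registry:
    """In-memory stand-in for a central discovery registry."""

    def __init__(self):
        self._sources = {}

    def register(self, source):
        # A source announces itself; re-registering replaces the old entry.
        self._sources[source.name] = source

    def find(self, *required):
        # Return every source that advertises all of the required capabilities.
        needed = set(required)
        return [s for s in self._sources.values() if needed <= s.capabilities]


registry = Registry()
registry.register(DataSource("sky-survey-a", "grid://example.org/sky-a",
                             {"astronomy", "sky-survey"}))
registry.register(DataSource("genome-db", "grid://example.org/genome",
                             {"biology", "sequence-search"}))

# Static discovery: the source list is written once and never revisited.
STATIC_SOURCES = ["sky-survey-a"]

# A new source comes online after the static list was written.
registry.register(DataSource("sky-survey-b", "grid://example.org/sky-b",
                             {"astronomy", "sky-survey", "infrared"}))

# Dynamic discovery: query at run time and pick up the newcomer automatically.
dynamic_sources = [s.name for s in registry.find("astronomy", "sky-survey")]
print("static: ", STATIC_SOURCES)
print("dynamic:", dynamic_sources)
```

The static list still names only the first survey, while the run-time query also returns the source that registered later. That is the flexibility the registry approach buys, at the cost of running and trusting a central registry.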
After discovery of the data source, the next step is to gain access to the
information. Gaining access involves the first of two security issues (see the
following section, Security): authenticating permitted
users. There are many protocols for authenticating remote users, such as
certificates or security tokens from trusted sources. But with distributed
databases, each source may use a separate mechanism. Consider the difficulty in
gaining access to multiple data stores, all of which require different
authentication techniques. This is a major problem with the distributed processing
model and a significant area of investigation and standardization.
Once you've gained access to a remote data source, the next issue is data
transfer. The difficulty in this step arises from the size of the data source in
question, often in the terabyte or petabyte range, which makes it
impractical to retrieve the data over a remote connection. In this case, you have
two possibilities: retrieve the data in batch amounts for processing locally, or
perform the processing on the remote platform. An example of the first situation
is the SETI@HOME project (see Resources), where packets
of data are distributed to volunteer processing sites, transformed locally, and
then transmitted back to the central server for consolidation and analysis. An
example of the second situation is running a Basic Local Alignment Search Tool
(BLAST) query on the remote server to find matches to a particular DNA, RNA, or
protein sequence.
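
The two options can be contrasted in a short Python sketch. Everything here is a stand-in chosen for illustration: the sample values, the `process_unit` analysis (a simple maximum), and the `RemoteSequenceStore` class, whose `search` method is a naive substring match and not the BLAST algorithm.

```python
def split_into_work_units(samples, unit_size):
    """Cut a large data set into fixed-size batches that are cheap to ship."""
    return [samples[i:i + unit_size] for i in range(0, len(samples), unit_size)]


def process_unit(unit):
    """Stand-in for the local analysis; here it just reports the peak value."""
    return max(unit)


def consolidate(partial_results):
    """The central server combines the per-unit results into one answer."""
    return max(partial_results)


# Option 1: ship batches out, process them locally, send results back.
samples = [3, 9, 4, 1, 7, 12, 5, 8, 2, 6]      # hypothetical signal samples
units = split_into_work_units(samples, unit_size=4)
partials = [process_unit(u) for u in units]    # each call could run on another machine
strongest = consolidate(partials)
print(units, partials, strongest)


class RemoteSequenceStore:
    """Stand-in for a remote database that runs searches on its own hardware."""

    def __init__(self, sequences):
        self._sequences = sequences

    def search(self, fragment):
        # Only the names of matching records cross the network, not the data.
        return [name for name, seq in self._sequences.items() if fragment in seq]


# Option 2: send the query to the data and bring back only the hits.
store = RemoteSequenceStore({
    "seq1": "ACGTTGCA",
    "seq2": "TTGACCAT",
    "seq3": "GGCATTGC",
})
hits = store.search("TTGC")
print(hits)
```

In the first option the data moves and the computation stays put on each worker; in the second the computation moves and only a small result set travels back.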
Finally, after the processing is complete, you need to consolidate the source
information or the results of the processing for analysis. As noted earlier, this
may require retrieving the data from the remote data source or consolidating the
processing results locally. Consolidating information requires the data to be
structured in a common way. Otherwise, it would be time-consuming to map each data
entry from one source data system to another.
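
As a rough sketch of what a common structure buys you, the following Python maps records from two sources onto one shared schema with a simple field-renaming table. The source names (`lab_a`, `lab_b`), field names, and record values are invented for the example; real sources would also need unit and type conversions that are not shown.

```python
# Hypothetical field mappings: source field name -> common field name.
FIELD_MAPS = {
    "lab_a": {"seq_id": "id", "organism": "species", "bases": "sequence"},
    "lab_b": {"accession": "id", "taxon": "species", "dna": "sequence"},
}


def to_common(source, record):
    """Rename a source record's fields to the shared schema."""
    mapping = FIELD_MAPS[source]
    missing = set(mapping) - set(record)
    if missing:
        raise ValueError(f"{source} record is missing fields: {sorted(missing)}")
    return {common: record[name] for name, common in mapping.items()}


records = [
    ("lab_a", {"seq_id": "A-001", "organism": "D. melanogaster", "bases": "ACGT"}),
    ("lab_b", {"accession": "B-917", "taxon": "H. sapiens", "dna": "TTGA"}),
]

consolidated = [to_common(source, record) for source, record in records]
for row in consolidated:
    print(row)
```

Once every record carries the same keys, downstream analysis can treat the consolidated list as a single data set. Without an agreed schema, a mapping table like `FIELD_MAPS` has to be written and maintained for every pair of systems.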
Security
Security for distributed processing is affected by the need to transmit
information from one site to another over a potentially nonsecure medium (such as
the Internet). This article doesn't cover security other than to note the issues
involved and some of the techniques available. One approach to the distributed
security-management problem, where many interacting parties may or may not be
directly known to one another, is to use the federated network model (see Figure
1).
Figure 1. Federated network
In the federated network model, each partner in a trusted federation is granted
access to the shared resources. A security check is performed upon entry into the
federation, after which the party has whatever access privileges are available to
the party's access group. The advantage of this approach is that all the data
sources and processing centers don't have to establish unique security protocols,
nor must they reauthenticate on every request for data. The disadvantage is that
if the federation is corrupted, few safeguards prevent an unauthorized user from
gaining access to controlled information.
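
One way to picture the "check once, then trust" behavior is a signed token. The Python sketch below uses the standard `hmac` and `hashlib` modules. The shared `FEDERATION_KEY`, the `user:group:signature` token layout, and the function names are assumptions made for illustration, not part of any federation standard.

```python
import hashlib
import hmac

# Assumption: one secret shared by every member of the federation.
FEDERATION_KEY = b"demo-key-not-for-real-use"


def issue_token(user, group):
    """Run once, at entry: sign the user's identity and access group."""
    claim = f"{user}:{group}"
    signature = hmac.new(FEDERATION_KEY, claim.encode(), hashlib.sha256).hexdigest()
    return f"{claim}:{signature}"


def verify_token(token):
    """Run by any member site: return (user, group) if the signature holds."""
    try:
        user, group, signature = token.split(":")
    except ValueError:
        return None                    # wrong number of fields
    expected = hmac.new(FEDERATION_KEY, f"{user}:{group}".encode(),
                        hashlib.sha256).hexdigest()
    if hmac.compare_digest(signature, expected):
        return user, group
    return None


token = issue_token("alice", "researchers")
accepted = verify_token(token)         # any member site can check this locally

# Editing the group without re-signing invalidates the token.
forged = token.replace("researchers", "admins")
rejected = verify_token(forged)

print(accepted, rejected)
```

The sketch also shows the weakness noted above: every site trusts the same key, so anyone who obtains it can mint tokens for any group.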
One safeguard that can be placed on any security model is graduated data
access. Many large databases are available to general users in read-only
mode, with limited bandwidth or processing time. A graduated model, however, can
provide select user groups larger slices of processing time or increased transfer
bandwidth. If select groups have update abilities (such as research labs that are
submitting sequence data to a central database), the security model can be
tailored to batch updates for validation prior to inclusion in the database.
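
A graduated model can be sketched as a table of tiers plus a holding queue for submissions. The tier names, the limits, and the validation rule in this Python example are all hypothetical; they only illustrate the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessTier:
    name: str
    bandwidth_mbps: int    # transfer cap for this tier
    cpu_seconds: int       # processing budget per request
    can_submit: bool       # whether the tier may propose updates


# Hypothetical tiers and limits.
TIERS = {
    "public": AccessTier("public", bandwidth_mbps=10, cpu_seconds=30,
                         can_submit=False),
    "partner": AccessTier("partner", bandwidth_mbps=1000, cpu_seconds=3600,
                          can_submit=True),
}

pending_updates = []       # submissions wait here until they are validated


def submit_update(tier_name, record):
    """Queue a record for later validation if the tier allows submissions."""
    tier = TIERS[tier_name]
    if not tier.can_submit:
        raise PermissionError(f"tier '{tier.name}' is read-only")
    pending_updates.append(record)


def validate_batch(database, is_valid):
    """Move valid pending records into the database; return the rejects."""
    rejected = []
    while pending_updates:
        record = pending_updates.pop(0)
        if is_valid(record):
            database.append(record)
        else:
            rejected.append(record)
    return rejected


database = []
submit_update("partner", "ACGT")
submit_update("partner", "AC?T")       # contains a character outside A, C, G, T
rejects = validate_batch(database, lambda seq: set(seq) <= set("ACGT"))
print(database, rejects)
```

Nothing reaches `database` until `validate_batch` runs, which is the point of batching updates: a bad submission is caught before it becomes part of the shared store.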
Information transfer mechanism
The hallmark of a distributed processing model is the need to transfer
information from one site to another. There are a variety of approaches you can
use to transfer source data or processing results, and they include the following:
- Private network. Collaborating groups share a network that is closed to
outside use. A private network is specifically established for the purpose of sharing data
among the partners. Examples include virtual private networks (VPNs) and
networks configured to a private domain.
- Public network. Public networks are available for general use and are
consequently less secure or reliable. The most common public network is the
Internet, where distributed parties may collaborate using some form of secure
communication (such as HTTPS or SFTP).
- Direct connection. A direct connection is created between partners
using rented or purchased network lines set up for point-to-point connectivity.
A critical factor in distributed processing is the bandwidth of the network
connection. The amount of data transferred between processing sites may be large,
so a corresponding network capability is required for adequate performance.
Terabyte amounts of data transfer often require gigabit/sec performance. The
recently completed Internet2 project has linked more than 300 academic sites in a
fully optical network providing 10 gigabit/second or higher transfer rates. This
network will permit government and research institutions, and eventually the
business community, to establish and use large distributed databases.
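
The link between data volume and required bandwidth is simple arithmetic, sketched here in Python. The figures are idealized: the calculation assumes decimal units and a link that delivers its full nominal rate, so real transfers will take longer. The `efficiency` parameter is a hypothetical knob for modeling that shortfall.

```python
def transfer_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Time to move size_bytes over a link; efficiency is the usable fraction."""
    return size_bytes * 8 / (link_bits_per_second * efficiency)


TB = 10**12      # bytes in a terabyte (decimal)
PB = 10**15      # bytes in a petabyte (decimal)
GBIT = 10**9     # bits per second in one gigabit/second

one_tb_at_1g = transfer_seconds(TB, 1 * GBIT)
one_tb_at_10g = transfer_seconds(TB, 10 * GBIT)
one_pb_at_10g = transfer_seconds(PB, 10 * GBIT)

print(f"1 TB at  1 Gbit/s: {one_tb_at_1g / 3600:.1f} hours")
print(f"1 TB at 10 Gbit/s: {one_tb_at_10g / 60:.1f} minutes")
print(f"1 PB at 10 Gbit/s: {one_pb_at_10g / 86400:.1f} days")
```

Under these assumptions a terabyte takes about 2.2 hours at 1 gigabit/second and about 13 minutes at 10 gigabit/second, while a petabyte still needs more than nine days at the faster rate. This is why petabyte-scale sources push processing toward the data.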
Tools and techniques
The ability of government, business, and research groups to access large,
distributed databases is becoming a critical factor in their ability to maintain
a leadership role in the world. Numerous research projects are involved in
developing standards and frameworks for distributed processing. The current
proposals for distributed data management mostly involve Web services and
standards that are under development, such as WS-Security, WS-Transfer, and the
updated version of the OGSI framework specification: Web Services Resource
Framework (WSRF).
Web services
Web services are a hot topic in the distributed processing field. The general
idea is to provide data and processing services in the form of a generic
Web-enabled service, such that an interested user can locate, bind, and access the
service of interest. The opaque nature of a Web service method, combined with the
descriptive power of XML documents for data, is well suited to integrating any number
of remote operations. The requester can call the Web service without knowing any
details of the implementation, the location of the remote data source, or the
communication protocols.
The drawback to using Web services for distributed data management is twofold: the
lack of support for critical data considerations such as scheduling, resource
management, and storage control, and the overhead associated with large-scale data
transfers. Using Web services for distributed computing is therefore a flexible
but somewhat limited approach. Recently, the WSRF was announced as the successor
to the OGSI framework, but significant controversy remains regarding the best way
to use Web services in a grid-computing environment.
Data grids
Similar to the Web service model, a data grid (sometimes referred to as a
computational grid) provides access to remote data stores by offering
authorized users a set of processing and data-management services. However, a data
grid goes beyond the Web service model by providing scheduling, resource
management, storage reservations, quality-of-service assurance, monitoring, and
other capabilities. These additional services provide for a better organized
shared-resource model that allows more efficient utilization of resources. The
OGSI and WSRF frameworks standardize these services, as well as the interface
presented by the remote data sources.
Structured data is the mainstay of a data grid, whether it's used for relational
data storage, hierarchical storage, XML tags, or specialized binary formats. These
structures are divided into several categories:
- Primary structured data. The original data source, such as images, raw
observation data, genetic sequences, and so on. This information is supplemented
by ancillary data.
- Ancillary data. Describes each data element within the bulk data store,
such as source organization, application support, data summary, index, catalog,
or digests.
- Collaboration data. Permits group behaviors, as illustrated by the KEGG
Biochemical Pathway map (see Resources).
- Personal data. Characterizes individual users and preferences, as well
as security permissions.
- Service data. Supports grid operations, as shown by the Globus Toolkit
monitoring and discovery services.
Milestones
Grid computing has been around for some time and is beginning to be viewed as the
future of large-scale computation. The ability to manage large distributed data
sets is a critical aspect of a significant grid effort. As noted in this article,
a number of challenges are involved in effectively mining the data contained in
these very large data repositories. The development of standards, such as OGSI and
WSRF, as well as the overall growth of standardized Web services for grid
computation, has provided the groundwork for the research and development of
grid-computing platforms such as the Globus Toolkit, GridFTP, and NeST at the
University of Wisconsin at Madison. Future developments in remote data management
for automated data-source discovery, common schema standards, task schedulers, and
federation of services will result in a more transparent and flexible grid
environment.
Resources
About the author
Benjamin A. Lieberman serves as the principal architect for BioLogic
Software Consulting. Dr. Lieberman provides consulting and training
services on a wide variety of software development topics, including
requirements analysis, software analysis and design, configuration
management, and development process improvement. Dr. Lieberman is also an
accomplished professional writer with a book (The Art of Software
Modeling) and numerous software-related articles to his credit. Dr.
Lieberman holds a doctorate degree in biophysics and genetics from the
University of Colorado, Health Sciences Center, Denver, Colorado.