This series explores how to process unstructured data in parallel, both
within a machine and across a series of machines, using IBM DB2 for Linux, UNIX and Windows (LUW) and GPFS shared-nothing cluster (SNC) to provide efficient, scalable access to unstructured data through a standard SQL interface.
In this article, see how the Java-based sashyReader framework
leverages the architectural features of DB2 LUW. Through an SQL interface,
sashyReader provides parallel, scalable processing of unstructured data
stored locally or in the cloud. This is useful for data
ingest, data cleansing, data aggregation, and other tasks requiring the scanning,
processing, and aggregation of large unstructured data sets. You also
learn how to extend the sashyReader framework to read arbitrary unstructured
text data by using dynamically pluggable Java classes.
See how unstructured data can be processed in parallel. Leverage the power
of IBM DB2 for Linux, UNIX and Windows to provide efficient, highly scalable access to
unstructured data stored in the cloud.
Learn how unstructured data can be processed in
parallel, both within a machine and across a series of
machines, by leveraging DB2 for Linux, UNIX, and Windows
and GPFS SNC to provide efficient, highly scalable access to unstructured
data, all through a standard SQL interface. Realize this capability with clusters
of commodity hardware, provisioned either in the cloud or directly on
bare metal. Scalability is achieved within the
framework via the principle of computation locality: computation is
performed on the host that has direct access to the data, which minimizes or
eliminates network bandwidth requirements and removes the need for any
shared compute resource.
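
To make the principle of computation locality concrete, here is a minimal Java sketch. It is not sashyReader code: the class `LocalitySketch`, the `Block` type, and the host names are hypothetical, and a thread pool stands in for the separate hosts of a cluster. Each block of text is grouped by the host assumed to store it, each group is scanned by its own task, and only the small per-host totals are merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalitySketch {

    /** A block of unstructured text and the (hypothetical) host that stores it. */
    static final class Block {
        final String host;
        final String text;

        Block(String host, String text) {
            this.host = host;
            this.text = text;
        }
    }

    /** Groups blocks by the host that holds them, so each host scans only its own data. */
    static Map<String, List<Block>> partitionByHost(List<Block> blocks) {
        Map<String, List<Block>> byHost = new TreeMap<>();
        for (Block b : blocks) {
            byHost.computeIfAbsent(b.host, h -> new ArrayList<>()).add(b);
        }
        return byHost;
    }

    /** The work done on one host: count whitespace-separated words in its local blocks. */
    static long countWordsLocally(List<Block> localBlocks) {
        long words = 0;
        for (Block b : localBlocks) {
            String trimmed = b.text.trim();
            if (!trimmed.isEmpty()) {
                words += trimmed.split("\\s+").length;
            }
        }
        return words;
    }

    /** Runs one task per host in parallel and merges only the per-host totals. */
    static long countWords(List<Block> blocks) throws Exception {
        Map<String, List<Block>> byHost = partitionByHost(blocks);
        // In this sketch, one thread simulates one host of the cluster.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, byHost.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Block> local : byHost.values()) {
                Callable<Long> task = () -> countWordsLocally(local);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // only a single number crosses the "network"
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Block> blocks = Arrays.asList(
                new Block("node1", "error disk full"),
                new Block("node2", "warning low memory on host"),
                new Block("node1", "ok"));
        System.out.println(countWords(blocks)); // prints 9
    }
}
```

The raw text never leaves the task that owns it, and only one number per host is combined at the end. This is why, under these assumptions, network traffic stays small as the data grows.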