Parallel processing of unstructured data, Part 3: Extend the sashyReader

From the developerWorks archives

Steve Raspudic, Alexander Abrashkevich, and Toni Kunic

Date archived: January 12, 2017 | First published: May 22, 2014

This series explores how to process unstructured data in parallel, both within a machine and across multiple machines, using IBM® DB2® for Linux®, UNIX®, and Windows® (LUW) and a GPFS™ shared-nothing cluster (SNC) to provide efficient, scalable access to unstructured data through a standard SQL interface. This article shows how the Java-based sashyReader framework takes advantage of the architectural features of DB2 LUW. The sashyReader provides parallel, scalable processing of unstructured data, stored locally or in a cloud, through an SQL interface. This is useful for data ingest, data cleansing, data aggregation, and other tasks that require scanning, processing, and aggregating large unstructured data sets. You also learn how to extend the sashyReader framework to read arbitrary unstructured text data by using dynamically pluggable Java classes.

This content is no longer being updated or maintained. The full article is provided "as is" in a PDF file. Given the rapid evolution of technology, some steps and illustrations may have changed.
