Extending data flows through catalog assets and external assets

You can track the flow of data in your enterprise, even when you use external processes that do not write to disk, or when you use ETL tools, scripts, and other programs that do not save their metadata to the catalog.

The catalog stores information from the tools in the InfoSphere® Information Server suite. A lineage report displays the flow of information from your data sources, through IBM® InfoSphere DataStage® and QualityStage® jobs, and into target data structures.

However, you might sometimes need to view data lineage that includes data flows that are not stored in the catalog. This situation can occur in the following cases:

Your enterprise uses third-party ETL (extract, transform, and load) tools whose information is not automatically stored in the catalog.
Your InfoSphere DataStage job depends on information from stored procedures.
You invoke web services that are located elsewhere.
You need to track lineage from mainframe applications and programs such as COBOL extracts.
You receive flat files or other types of feeds from third parties.
You often run scripts at the operating system level to copy or restructure files before processing them in jobs.

In these cases and others, the flow of data through tables and columns can extend beyond the catalog. By creating mappings in extension mapping documents and extended data sources, you can track that data flow and create data lineage reports from any asset in the data flow.

Mappings in extension mapping documents

Mappings in extension mapping documents are source-to-target mappings that represent an external flow of data from one or more sources to one or more targets. The source or target must exist within the catalog, either as database or data file metadata or as a component of an extended data source.

By using mappings, you can create data lineage reports for the following types of data flows:

Data flows that happen completely outside of InfoSphere Information Server
Data flows that happen both outside and inside InfoSphere Information Server

Mappings allow great flexibility in the types of assets that you can map to each other. You make your mapping decisions based on what information you want to see in your data lineage reports. For example, you might map an external stored procedure definition to a job, or you might map the output value of a column used in an ETL tool from an independent software vendor to an input parameter of a different ETL process. As another example, you might want to read Hadoop log files and create MapReduce job assets and data file assets in the catalog.
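The idea of source-to-target mappings can be illustrated with a minimal sketch in Python. This sketch is not the product's mapping document format or API. The asset names, the MAPPINGS list, and the build_graph and downstream functions are all hypothetical. It assumes only that each mapping is a (source, target) pair and shows how chaining such pairs lets lineage be traced from any asset in the flow.

```python
from collections import defaultdict, deque

# Hypothetical asset names for illustration only.
# Each tuple is one source-to-target mapping.
MAPPINGS = [
    ("cobol_extract.CUST_ID", "staging.customers.id"),
    ("vendor_feed.csv:name", "staging.customers.name"),
    ("staging.customers.id", "warehouse.dim_customer.customer_key"),
]


def build_graph(mappings):
    """Index the mappings as source -> set of direct targets."""
    graph = defaultdict(set)
    for source, target in mappings:
        graph[source].add(target)
    return graph


def downstream(graph, start):
    """Return, sorted by name, every asset reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # .get avoids creating empty entries for assets with no targets.
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


if __name__ == "__main__":
    graph = build_graph(MAPPINGS)
    for asset in downstream(graph, "cobol_extract.CUST_ID"):
        print(asset)
```

In this sketch, tracing from the hypothetical COBOL extract column follows one mapping into a staging table and a second mapping into a warehouse table, which mirrors a data flow that happens both outside and inside the catalog.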

Extended data sources

Before you create mappings, you can import the metadata from external databases and data files into the catalog. But some external processes, including web services and stored procedures, do not write their data to disk. You might need to report on information about the data structure in these processes to get a clear picture of each step of the transformation process. You can capture this information by creating extended data sources and importing them into the catalog.

You can create three distinct types of extended data sources: applications, stored procedure definitions, and files. The application type gives you the most flexibility, because you can create an extended data source at any level of granularity. You can define object types, methods, and input and output parameters for applications to match the structure of your external data sources. Although the parameters of an application usually represent columns in a data flow, you can make them represent whatever structure is most appropriate to your external metadata.
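The hierarchy that is described for the application type (object types, methods, and input and output parameters) can be pictured with a small Python sketch. The class names, field names, and the billing_service example are assumptions made for illustration. They do not represent the product's import format or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Parameter:
    name: str
    direction: str  # "input" or "output"


@dataclass
class Method:
    name: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class ObjectType:
    name: str
    methods: List[Method] = field(default_factory=list)


@dataclass
class Application:
    name: str
    object_types: List[ObjectType] = field(default_factory=list)


def qualified_parameters(app, direction=None):
    """List dotted paths for the application's parameters.

    If direction is given, only parameters with that direction are returned.
    """
    paths = []
    for obj in app.object_types:
        for method in obj.methods:
            for param in method.parameters:
                if direction is None or param.direction == direction:
                    paths.append(
                        f"{app.name}.{obj.name}.{method.name}.{param.name}"
                    )
    return paths


# Hypothetical web service modeled as an application.
billing = Application(
    name="billing_service",
    object_types=[
        ObjectType(
            name="Invoice",
            methods=[
                Method(
                    name="get_total",
                    parameters=[
                        Parameter("invoice_id", "input"),
                        Parameter("total_amount", "output"),
                    ],
                )
            ],
        )
    ],
)
```

Under these assumptions, each dotted path, such as billing_service.Invoice.get_total.total_amount, identifies one parameter that could serve as a source or target in a mapping, in the same way that a column would.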