Structure of data sets

A data set comprises a descriptor file and a number of other files that are added as the data set grows.

These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments. Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single IBM® InfoSphere® DataStage® job. A segment can therefore contain files from many partitions, and a partition can contain files from many segments.
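The relationship between partitions and segments can be pictured as a grid in which each data file sits at the intersection of one segment and one partition. The following Python sketch only illustrates that idea: the file paths, the dictionary keys, and the `group_by` helper are hypothetical and do not reflect how InfoSphere DataStage names or stores its files.

```python
from collections import defaultdict

# Hypothetical data files: two segments (two job runs) written across
# two partitions (two processing nodes). Paths are illustrative only.
data_files = [
    {"segment": 0, "partition": 0, "path": "/disk0/seg0.part0"},
    {"segment": 0, "partition": 1, "path": "/disk1/seg0.part1"},
    {"segment": 1, "partition": 0, "path": "/disk0/seg1.part0"},
    {"segment": 1, "partition": 1, "path": "/disk1/seg1.part1"},
]


def group_by(files, key):
    """Group file paths by the given key ("segment" or "partition")."""
    groups = defaultdict(list)
    for entry in files:
        groups[entry[key]].append(entry["path"])
    return dict(groups)


# A segment spans files from many partitions...
by_segment = group_by(data_files, "segment")
# ...and a partition holds files from many segments.
by_partition = group_by(data_files, "partition")

print(by_segment[0])
print(by_partition[1])
```

Grouping the same list by either key shows the two views of one data set: by job run (segment) or by processing node (partition).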

Figure 1. Structure of data sets

Shows a schematic diagram of data sets

The descriptor file for a data set contains the following information:

  • Data set header information.
  • Creation time and date of the data set.
  • The schema of the data set.
  • A copy of the configuration file used when the data set was created.

For each segment, the descriptor file contains:

  • The time and date the segment was added to the data set.
  • A flag marking the segment as valid or invalid.
  • Statistical information, such as the number of records and the number of bytes in the segment.
  • Path names of all data files, on all processing nodes.

This information can be accessed through the Data Set Manager.
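As a rough mental model, the descriptor contents listed above can be sketched as two record types, one for the data set as a whole and one per segment. This is a minimal illustration in Python, not the actual descriptor file format: the class names, field names, field types, and sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Segment:
    """Per-segment information (hypothetical field names)."""
    added_at: datetime                 # time and date the segment was added
    valid: bool                        # valid/invalid flag
    record_count: int                  # number of records in the segment
    byte_count: int                    # number of bytes in the segment
    data_files: Dict[str, List[str]]   # processing node -> data file paths


@dataclass
class Descriptor:
    """Data-set-level information (hypothetical field names)."""
    header: str                        # data set header information
    created_at: datetime               # creation time and date
    schema: str                        # schema of the data set
    config_copy: str                   # copy of the configuration file
    segments: List[Segment] = field(default_factory=list)

    def valid_record_count(self) -> int:
        """Total records across segments flagged as valid."""
        return sum(s.record_count for s in self.segments if s.valid)


# Illustrative values only.
descriptor = Descriptor(
    header="example header",
    created_at=datetime(2024, 1, 1, 9, 0),
    schema="example schema text",
    config_copy="example configuration text",
)
descriptor.segments.append(
    Segment(datetime(2024, 1, 1, 9, 5), True, 1000, 64000,
            {"node1": ["/disk0/seg0.part0"], "node2": ["/disk1/seg0.part1"]})
)
descriptor.segments.append(
    Segment(datetime(2024, 1, 2, 9, 5), False, 500, 32000,
            {"node1": ["/disk0/seg1.part0"]})
)

print(descriptor.valid_record_count())
```

Keeping the per-segment entries in a list mirrors the way the data set grows: each job that writes to the data set adds one more segment entry, while the data-set-level fields stay as they were at creation.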