Data ingestion in IBM InfoSphere Identity Insight is built around the concept of a data source: each data source is mapped to entities produced by extract, transform, and load (ETL) processes on the source system. In typical ingestion scenarios, you will have a number of data sources to process. Once the number of data sources grows beyond a few dozen, the process becomes complicated. Common complications when ingesting several data sources, especially during the initial load, are:
- Changing priorities for each data source load
- Reporting ingestion progress figures quickly
- Organizing the various UMF (Universal Message Format) files
- Validating ingestion results
- Validating data before ingestion
IBM InfoSphere Identity Insight provides programs to run an ingestion process. However, to overcome the complications mentioned above, a model-based approach is required, so each of those potential problems can be avoided or minimized.
Using the model
The model is built around the mechanisms that IBM InfoSphere Identity Insight already provides for managing the ingestion process. Some of the concepts in the ingestion flow that the model describes are new. The model consists of the following concepts:
- Load: a grouping of UMF files that come from different data sources, plus a set of characteristics that logically represent an ingestion milestone.
- File Set: the UMF files that comprise a specific load. Those files meet the criteria set by the Selector (for example, file name, date, or data source).
- Chunk: a part of a UMF file. Chunks allow the pipelines to process the UMF files gradually.
- Slot: a temporary holding area where external systems deposit UMF files for ingestion.
- Selector: a list of rules to identify files or chunks used by a load.
- Pre/Post Scripts: scripts to be run before or after certain stages of the load process. Those scripts can serve different purposes, such as monitoring, auditing, notification, or validation.
- Pipeline: the IBM InfoSphere Identity Insight program that processes UMF files for ingestion.
- Prioritizer: a list of rules and priorities applied to chunks and loads to determine the order of processing.
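The concepts above can be pictured as a small set of record types. The following Python sketch is only an illustration under stated assumptions: the class and field names are hypothetical and are not part of IBM InfoSphere Identity Insight.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Selector:
    """Rules that identify the files a load takes from the slots."""
    slot_specs: List[str] = field(default_factory=list)  # slot name patterns
    date_specs: List[str] = field(default_factory=list)  # dates or date ranges
    file_specs: List[str] = field(default_factory=list)  # file name patterns


@dataclass
class PrioritySpec:
    """A priority number paired with the pattern it applies to."""
    priority: int
    pattern: str


@dataclass
class Load:
    """A named grouping of UMF files that represents an ingestion milestone."""
    name: str
    selector: Selector
    chunk_priorities: List[PrioritySpec] = field(default_factory=list)
    file_set: List[str] = field(default_factory=list)     # selected UMF files
    chunk_queue: List[str] = field(default_factory=list)  # chunks awaiting the pipelines


# Hypothetical load that selects every .umf file from any slot.
load = Load(name="initial_load", selector=Selector(file_specs=["*.umf"]))
```

Slots, scripts, and pipelines are left out of the sketch because they correspond to directories and programs rather than to records.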
The flow of data and execution involving those components is depicted in Figure 1.
Figure 1. Overview of the load model
- An external system (usually a result of an ETL process) deposits UMF files in a slot.
- Files from the slots are taken by means of a load selector.
- Every time a file is selected from a slot, the file pre-script is executed.
- The files form a set of UMFs to be processed, and each file is sliced into chunks.
- A list of priorities, each one a priority number paired with a selector, establishes the order in which the load processes each file chunk.
- Before the chunk is actually sent to the pipelines for ingestion, the pre-script for the chunk is executed.
- At the system level, priorities are defined for the different loads, so the loads take turns using the pipelines.
- Based on the priorities defined for each load and its chunks, one chunk is taken to the pipelines for ingestion into IBM InfoSphere Identity Insight.
- Every time a chunk is processed by the pipelines, the post-script for the chunk is executed.
- After all the file's chunks have been processed by the pipelines, the post-script for the file is executed.
In the following sections, you will explore in more depth the functions and characteristics of each component of the model.
Working with data sources and slots
The model provides a number of slots for grouping UMF files. A slot can be named according to certain criteria, for instance, a data source, a date range, or a phase name. A slot can be, for example, a directory in a file system to which an external system transfers UMF files for ingestion. Figure 2 shows a case where two external systems deposit files in different slots.
Figure 2. External systems and slots
UMF file names can follow a naming convention, so that the selectors work on a known format and selections made on file names are consistent. The chunk prioritizer can also use the file name when populating the load chunk queue.
Working with loads
In a normal situation, there will be many requirements to group the UMF files for ingestion. Within the model, this is initially addressed with an appropriate name for the load. Figure 3 illustrates how different loads are named for various purposes. The model assumes you can have any number of loads to meet the grouping requirements of the overall ingestion process.
Figure 3. Naming loads
Every load will have a number of UMF files to process. Managing the priorities for several UMF files from a number of data sources can be a complicated and error-prone task if performed manually. The model addresses this by setting priorities at the chunk level. A few things happen before a chunk has a priority; the following sections explain them.
Exploring files and chunks
As mentioned earlier, the load has a selector to gather files from the slots. Figure 4 illustrates how a selector is composed to group files for the load. In general, a selector has three types of specs for selecting files:
- Slot Spec: defines the slots from which the load takes files; it can use wildcards
- Date Spec: contains file dates or date ranges
- File Spec: a list of file names or file name wildcards
Figure 4. File Selector Specs
Each spec can have many expressions, which together form a logical expression. There can be many instances of each spec type. The resulting set is the intersection of the result sets of each spec. Note that the resulting set for a given load changes over time as new files are deposited into the slots, so the selection has to be refreshed periodically.
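The selection logic can be sketched in Python as follows. This is a minimal illustration, not the product's behavior. It assumes that a spec is satisfied when any one of its expressions matches, that slot and file specs are shell-style wildcards, and that a date spec is an inclusive date range. The `SlotFile` type and the `select_files` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from fnmatch import fnmatch
from typing import List, Tuple


@dataclass
class SlotFile:
    slot: str        # name of the slot (directory) holding the file
    name: str        # file name inside the slot
    file_date: date  # date associated with the file


def select_files(files: List[SlotFile],
                 slot_specs: List[str],
                 date_specs: List[Tuple[date, date]],
                 file_specs: List[str]) -> List[SlotFile]:
    """Return the files that satisfy all three spec types.

    Assumption: within one spec type, a match on ANY expression is enough.
    The final set is the intersection across the three spec types.
    """
    selected = []
    for f in files:
        slot_ok = any(fnmatch(f.slot, pattern) for pattern in slot_specs)
        date_ok = any(start <= f.file_date <= end for start, end in date_specs)
        file_ok = any(fnmatch(f.name, pattern) for pattern in file_specs)
        if slot_ok and date_ok and file_ok:
            selected.append(f)
    return selected
```

Because new files keep arriving in the slots, a function like this would be called periodically, with the current slot contents as input each time.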
Just before a file enters the load file set area, the file pre-script is executed for that file, so the script has to accept the file name as a parameter. The pre-script can perform any validation you want on the file before it is passed to the set for ingestion. The pre-script returns a success or failure code; on failure, the file does not enter the load file set. Whatever the outcome of the pre-script, the load writes the result to a log file. Once a file is in the load file set, chunks can be cut from it to populate the load chunk queue according to the priorities that the chunk prioritizer dictates.
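The gatekeeping role of the file pre-script can be sketched as follows. This Python sketch assumes that the script receives the file name as its last argument and signals success with exit code 0. The function names are hypothetical, and the `logging` destination (a log file in the model) is left to the caller's configuration.

```python
import logging
import subprocess
import sys
from typing import List

log = logging.getLogger("load")


def run_script(script: List[str], target: str) -> bool:
    """Run a pre/post-script with the file or chunk name as its last argument.

    Returns True when the script exits with code 0. The outcome is logged
    whether the script succeeds or fails.
    """
    result = subprocess.run(script + [target])
    log.info("script=%s target=%s returncode=%d", script, target, result.returncode)
    return result.returncode == 0


def admit_files(candidates: List[str], pre_script: List[str]) -> List[str]:
    """Keep only the files whose pre-script reported success."""
    return [name for name in candidates if run_script(pre_script, name)]


# Stand-in pre-script for illustration: accepts only names ending in .umf.
check = [sys.executable, "-c",
         "import sys; sys.exit(0 if sys.argv[1].endswith('.umf') else 1)"]
```

In a real deployment, the stand-in `check` command would be replaced by the path of the validation script configured for the load.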
Like the selector, the chunk prioritizer has a list of specs, but each spec also carries a number that indicates the priority of every chunk that meets its criteria. To populate the load chunk queue, the chunk prioritizer applies the specs to the elements in the load file set, cuts each chunk out as a separate file, and puts it in the queue with the specified priority. Figure 5 depicts a sample instance of the chunk prioritizer specs.
Figure 5. Chunk prioritizer specs
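A chunk prioritizer along these lines could be sketched in Python as shown below. The sketch rests on several assumptions that the model does not prescribe: chunks are cut by line count, the first matching spec wins, files that match no spec get a default priority of 9999, and chunk names carry a four-digit sequence suffix. All function names are hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterator, List, Tuple


def priority_for(file_name: str, specs: List[Tuple[int, str]],
                 default: int = 9999) -> int:
    """Return the priority of the first spec whose pattern matches the file name."""
    for priority, pattern in specs:
        if fnmatch(file_name, pattern):
            return priority
    return default


def cut_chunks(lines: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most chunk_size lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


def enqueue(file_name: str, lines: List[str], specs: List[Tuple[int, str]],
            chunk_size: int) -> List[Tuple[int, str, List[str]]]:
    """Build (priority, chunk name, chunk lines) entries for one file."""
    priority = priority_for(file_name, specs)
    return [(priority, f"{file_name}.{index:04d}", chunk)
            for index, chunk in enumerate(cut_chunks(lines, chunk_size))]
```

For example, with the specs `[(10, "customers_*"), (20, "*.umf")]`, every chunk cut from `customers_01.umf` receives priority 10.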
When a chunk is taken from the load chunk queue, the chunk pre-script is executed. This script can do whatever is needed before the chunk is passed to the pipelines for ingestion; for instance, it can validate the chunk or notify external systems. The return code of the script indicates whether the chunk continues in the flow to the pipelines or is skipped. Whatever the outcome of the script, the result is written to a log file.
Understanding actual ingestion in the pipelines
Because you will most likely have several loads running simultaneously, system-wide load priorities determine which load gets its turn at the pipelines. The load prioritizer handles that task. It has a series of specs, each combining a priority number with a wildcard expression that matches one or more load names.
Figure 6. Load prioritizer specs
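The way load prioritizer specs could resolve to a single load is sketched below in Python. The sketch assumes that a lower number means a higher priority, that the best matching spec applies when several match, and that loads with an empty chunk queue are skipped. The `next_load` function is a hypothetical name.

```python
from fnmatch import fnmatch
from typing import Dict, List, Optional, Tuple


def next_load(load_specs: List[Tuple[int, str]],
              pending: Dict[str, int]) -> Optional[str]:
    """Pick the load that gets the next turn at the pipelines.

    load_specs pairs a priority number with a wildcard for load names.
    pending maps each load name to the number of chunks in its queue.
    Assumption: a lower number means a higher priority.
    """
    best = None
    for name, count in pending.items():
        if count == 0:
            continue  # nothing to send for this load
        matches = [prio for prio, pattern in load_specs if fnmatch(name, pattern)]
        if not matches:
            continue  # loads without a matching spec are not scheduled
        candidate = (min(matches), name)
        if best is None or candidate < best:
            best = candidate
    return best[1] if best else None
```

This is strict priority scheduling, so a busy high-priority load can hold back the others; an implementation that needs the loads to alternate more evenly would have to add a rotation rule.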
At this point in the process, once a chunk has been processed by the pipelines, the chunk post-script is executed. If the chunk was the last one for a given file, the file post-script is executed as well. The post-scripts can be used for notification or logging.
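Deciding when the file post-script is due requires knowing how many chunks of each file are still outstanding. The following Python sketch shows one way to track that. The `PostScriptTracker` class is hypothetical, and the two callbacks stand in for whatever mechanism actually runs the scripts.

```python
from typing import Callable, Dict


class PostScriptTracker:
    """Track outstanding chunks per file and fire post-scripts at the right time."""

    def __init__(self, chunk_post: Callable[[str], None],
                 file_post: Callable[[str], None]) -> None:
        self.chunk_post = chunk_post
        self.file_post = file_post
        self.remaining: Dict[str, int] = {}

    def register(self, file_name: str, chunk_count: int) -> None:
        """Record how many chunks a file was cut into."""
        self.remaining[file_name] = chunk_count

    def chunk_done(self, file_name: str, chunk_name: str) -> None:
        """Call after the pipelines finish a chunk."""
        self.chunk_post(chunk_name)
        self.remaining[file_name] -= 1
        if self.remaining[file_name] == 0:
            # Last chunk of the file: the file post-script runs as well.
            del self.remaining[file_name]
            self.file_post(file_name)
```

The chunk post-script always runs first, so a file-level notification is never sent before the last chunk has been reported.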
Walking through an implementation scenario
This section is a reference to use as the basis for a full implementation of the model. The following paragraphs will help you understand how all the concepts of the model can be realized in a hypothetical environment.
The model was conceived with a Linux/UNIX® environment in mind, since all of the required functionality can be programmed in shell scripts. In the following paragraphs, you will explore how each component or function can be addressed within a Linux/UNIX-like environment.
Slots map naturally to directories. External systems can deposit files in each slot with file transfer programs such as FTP (File Transfer Protocol), SFTP (SSH's Secure FTP), or SCP (SSH's Secure Copy), or through file-sharing protocols such as NFS (Network File System) on Linux/UNIX or CIFS (Common Internet File System) on Windows. Assume FTP will be used.
The FTP client connects to the FTP server, which is the IBM Identity Insight server. On the server side, the clients should see only the slots directory hierarchy. The client should send each file under a temporary name and rename it to the final name only after the transfer is complete. For example, during transfer the file is named filename.transfer, and when the transfer is done it becomes filename.umf. This way, the process that reads the slots does not pick up files that are still being transferred, only the ones with the final name.
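On the client side, the transfer-then-rename convention could look like the following Python sketch, which uses the standard `ftplib` module. The `upload_umf` function is a hypothetical helper, and the connection details are left to the caller.

```python
from ftplib import FTP
from typing import BinaryIO


def upload_umf(ftp: FTP, data: BinaryIO, base_name: str) -> str:
    """Upload under a temporary name, then rename to the final .umf name."""
    temp_name = f"{base_name}.transfer"
    final_name = f"{base_name}.umf"
    ftp.storbinary(f"STOR {temp_name}", data)  # not visible as .umf during transfer
    ftp.rename(temp_name, final_name)          # publish only after the transfer ends
    return final_name
```

A caller would open the connection, change to the slot directory with `ftp.cwd(...)`, and pass an open binary file object as `data`. If the transfer fails, `storbinary` raises an exception and the rename never happens, so the partial file keeps its temporary name.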
To implement a load, the configuration should be written in a file. This configuration file should have values for the following elements:
- File Selector Specs
- Load File Set directory path
- File Pre/Post-Scripts file paths
- Chunk Prioritizer Specs
- Load Chunk Queue directory path
- Chunk Pre/Post-Scripts file paths
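One possible shape for such a configuration file is an INI-style file, read here with Python's standard `configparser` module. The section names, key names, paths, and value syntax below are all assumptions made for illustration; the model does not prescribe a file format.

```python
import configparser
from typing import Dict, List, Tuple

# Hypothetical layout: every section, key, and path is illustrative only.
SAMPLE = """
[file_selector]
slot_specs = crm_*, hr
date_specs = 2010-01-01..2010-12-31
file_specs = *.umf

[file_set]
directory = /loads/initial_load/fileset
pre_script = /loads/scripts/file_pre.sh
post_script = /loads/scripts/file_post.sh

[chunk_prioritizer]
specs = 10:customers_*, 20:*.umf

[chunk_queue]
directory = /loads/initial_load/queue
pre_script = /loads/scripts/chunk_pre.sh
post_script = /loads/scripts/chunk_post.sh
"""


def split_list(value: str) -> List[str]:
    """Split a comma-separated value into trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_priorities(value: str) -> List[Tuple[int, str]]:
    """Turn 'number:pattern' items into (priority, pattern) pairs."""
    pairs = []
    for item in split_list(value):
        number, pattern = item.split(":", 1)
        pairs.append((int(number), pattern))
    return pairs


def read_load_config(text: str) -> Dict[str, object]:
    """Parse the load configuration into a plain dictionary."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    return {
        "slot_specs": split_list(parser["file_selector"]["slot_specs"]),
        "date_specs": split_list(parser["file_selector"]["date_specs"]),
        "file_specs": split_list(parser["file_selector"]["file_specs"]),
        "file_set_dir": parser["file_set"]["directory"],
        "file_scripts": (parser["file_set"]["pre_script"],
                         parser["file_set"]["post_script"]),
        "priority_specs": parse_priorities(parser["chunk_prioritizer"]["specs"]),
        "queue_dir": parser["chunk_queue"]["directory"],
        "chunk_scripts": (parser["chunk_queue"]["pre_script"],
                          parser["chunk_queue"]["post_script"]),
    }
```

Keeping one such file per load makes it possible to change a load's selection rules or priorities without touching the control scripts.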
The load itself is materialized as a file system directory. That directory holds the configuration file, the directory for the load file set, and the directory for the load chunk queue. A scripts directory can contain the pre/post-scripts and the control scripts that process the file selector specs and the chunk prioritizer specs. A log directory holds the file in which all activities are logged. The scripts directory and the log directory should be common to all loads.
The files in the load file set can retain their original slot file names, because they do not need to be processed in any specific order. The load chunk queue, however, needs a special naming scheme for the priorities to take effect. The simplest approach is to add a prefix composed of the priority number, zero-padded to four positions. In this way, a standard listing command shows the highest-priority files at the top.
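The effect of the zero-padded prefix on ordering can be checked with a short Python sketch. The underscore separator and the chunk names are hypothetical; only the four-position zero padding comes from the description above.

```python
def queue_name(priority: int, chunk_name: str) -> str:
    """Prefix the chunk name with the priority, zero-padded to four digits."""
    if not 0 <= priority <= 9999:
        raise ValueError("priority must fit in four digits")
    return f"{priority:04d}_{chunk_name}"


names = [queue_name(120, "orders.umf.0000"),
         queue_name(5, "customers.umf.0001"),
         queue_name(5, "customers.umf.0000")]

# Lexicographic order, as in a sorted directory listing: lower numbers first.
ordered = sorted(names)
```

Without the padding, a name starting with `120_` would sort before one starting with `5_`, which is why the fixed width matters.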
Load prioritizer implementation
The load prioritizer must have a configuration file that contains the specs for setting the priorities of the various loads. It must also have a script that drives the chunk selection among the loads and moves the chunks to the pipeline area. How the file chunks are moved depends on how the pipelines are set up. For example, if the pipelines are configured for message queue transports, the load prioritizer script places the files in the queue. The script also drives the execution of the post-scripts for the chunk and the file.
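The chunk-moving step of such a script could be sketched in Python as follows. The sketch assumes that the pipelines read files from a directory and that chunk names in the queue sort by priority. A message queue transport would replace the move with a put operation. The function name is hypothetical.

```python
import shutil
from pathlib import Path
from typing import Optional


def dispatch_next_chunk(queue_dir: Path, pipeline_dir: Path) -> Optional[Path]:
    """Move the first chunk, in sorted name order, to the pipeline area.

    Returns the new location of the chunk, or None when the queue is empty.
    """
    chunks = sorted(p for p in queue_dir.iterdir() if p.is_file())
    if not chunks:
        return None
    target = pipeline_dir / chunks[0].name
    shutil.move(str(chunks[0]), str(target))
    return target
```

A driver loop would first choose the load whose turn it is, call this function on that load's chunk queue directory, and then run the chunk and file post-scripts once the pipelines have finished with the chunk.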
Figure 7 shows the directory and file hierarchy for the implementation scenario we just described.
Figure 7. Directory hierarchy
The model presented in this article will make the data ingestion process with IBM Identity Insight more manageable and will contribute to the design of the overall process. During an implementation of an identity resolution solution there are many steps before getting to the ingestion process design, so you could potentially have less time to think about it. The concepts and implementation guidelines in this article can help you establish a starting point. This model can help you to envision the needs for loading priorities, UMF organization, validation, notification and logging early, thus simplifying the project.