Managing the data ingestion process in IBM InfoSphere Identity Insight

A high-volume model for fine-grained control of data ingestion

IBM® InfoSphere® Identity Insight is a real-time, scalable entity relationship and analysis platform. It uses entity and relationship disambiguation technology together with complex event processing techniques to fight threats, fraud, and risk. The process of bringing data into Identity Insight, or data ingestion, becomes highly complex when it involves multiple data sources and entity types and hundreds of millions of records. In this article, learn how to build a comprehensive approach for handling the various aspects of the load process, including priorities, performance, logging, and auditing.

Alberto Ortiz (aortizg@mx1.ibm.com), IT Specialist, IBM

Alberto has 14 years of IT experience and has worked for various companies in different industries (banking, government, high-tech, automotive, telecommunications, and IT outsourcing). In 2009 he joined IBM Mexico. He has experience in UNIX, databases, programming, and configuration management.



01 September 2011

Introduction

Data ingestion in IBM InfoSphere Identity Insight uses the concept of a data source, where each data source is mapped to entities that result from extract, transform, and load (ETL) processes on the source system. In typical ingestion scenarios, you will have a number of data sources to process, and once that number grows beyond a few dozen, the process starts to become complicated. Common complications when ingesting several data sources, especially during the initial load, are:

  • Changing priorities for each data source load
  • Reporting ingestion progress figures quickly
  • Organizing the various UMF (Universal Message Format) files
  • Validating ingestion results
  • Validating data before ingestion

IBM InfoSphere Identity Insight provides programs to run an ingestion process. However, to overcome the complications mentioned above, a model-based approach is required so that each of those potential problems can be avoided or minimized.

Using the model

The model is intended to work on top of the mechanisms that IBM InfoSphere Identity Insight already provides for managing the ingestion process. Some of the concepts in the ingestion flow that the model describes are new. The model consists of the following concepts:

  • Load: a grouping of UMF files that come from different data sources, together with a set of characteristics that logically represent an ingestion milestone
  • File Set: the UMF files that comprise a specific load; these files meet the criteria set by the Selector (e.g., file name, date, data source)
  • Chunk: a part of a UMF file; chunks are used by the pipelines to process the UMF files gradually
  • Slot: a temporary holding area where external systems deposit UMF files for ingestion
  • Selector: a list of rules that identify the files or chunks used by a load
  • Pre/Post Scripts: scripts to be run before or after certain stages of the load process; these scripts can serve different purposes, e.g., monitoring, auditing, notification, or validation
  • Pipeline: the IBM InfoSphere Identity Insight program that processes UMF files for ingestion
  • Prioritizer: a list of rules and priorities applied to chunks and loads to determine the order of processing

The flow of data and execution involving those components is depicted in Figure 1.

Figure 1. Overview of the load model
Overview of the load model showing the steps in the flow of execution.
  1. An external system (usually the result of an ETL process) deposits UMF files in a slot.
  2. Files are taken from the slots by means of a load selector.
  3. Every time a file is selected from a slot, the file pre-script is executed.
  4. The files form a set of UMFs to be processed, and each file is sliced into chunks.
  5. A list of priorities, each basically a number indicating the priority plus a selector, establishes the order in which the load's file chunks are processed.
  6. Before a chunk is actually sent to the pipelines for ingestion, the pre-script for the chunk is executed.
  7. At the system level, priorities are defined for the different loads, so each load takes turns making use of the pipelines.
  8. Based on the priorities defined for each load and its chunks, one chunk is taken by the pipelines for ingestion into IBM InfoSphere Identity Insight.
  9. Every time a chunk is processed by the pipelines, the post-script for the chunk is executed.
  10. After all of a file's chunks have been processed by the pipelines, the post-script for the file is executed.

In the following sections, you will explore in more depth the functions and characteristics of each component of the model.

Working with data sources and slots

The model considers having a number of slots for grouping UMF files. A slot can be named according to certain criteria, for instance, a data source, a date range, or a phase name. A slot can be, for example, a directory in a file system where an external system transfers UMF files for ingestion. Figure 2 shows a case where two external systems deposit files in different slots.

Figure 2. External systems and slots
External systems can deposit files in one or more slots.

UMF file names should follow a naming convention so that the selectors operate on a known format and selections made on file names are consistent. The chunk prioritizer can also use the file name when populating the load chunk queue.
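
For instance, a hypothetical convention of <data source>_<YYYYMMDD>_<sequence>.umf encodes everything the selectors and the chunk prioritizer need. The sketch below shows how simple shell globs can then express selections; the slot paths and file names are illustrative only.

Listing 1. Selecting files by a hypothetical naming convention
    # Convention (illustrative): <datasource>_<YYYYMMDD>_<seq>.umf,
    # e.g. PERSON_20110901_0001.umf deposited in slots/DataSourceA
    ls slots/DataSourceA/PERSON_201109??_*.umf   # all September 2011 Person files
    ls slots/*/ACCOUNT_*_0001.umf                # first file of every Account batch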

Working with loads

In a normal situation, there will be many requirements to group the UMF files for ingestion. Within the model, this is initially addressed with an appropriate name for the load. Figure 3 illustrates how different loads are named for various purposes. The model assumes you can have any number of loads to meet the grouping requirements of the overall ingestion process.

Figure 3. Naming loads
The loads can be named according to grouping requirements.

Every load will have a number of UMF files to process. Managing the priorities for several UMF files from a number of data sources can be a complicated and error-prone task if performed manually. The model addresses this by setting priorities at the chunk level. A few things happen before a chunk receives a priority, as the following sections explain.

Exploring files and chunks

As mentioned earlier, the load has a selector to gather files from the slots. Figure 4 illustrates how a selector is composed to group files for the load. A selector in general has three types of specs to select files:

  • Slot Spec: defines which slots the load takes files from; can use wild cards
  • Date Spec: contains file dates or date ranges
  • File Spec: a list of file names or file name wild cards

Figure 4. File selector specs
A selector can have several types of specs and many instances of each spec type.

Each spec can have many expressions, which form a logical OR expression; for instance, for SlotSpec 1 the expression would be file.slot=DataSource* or file.slot=Person. There can be many instances of each spec type. The resulting set is the intersection of the resulting sets of each spec. Note that the resulting set for a given load changes over time as new files are deposited into the slots, so the selection process must run periodically.
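
A minimal sketch of how a selector could evaluate its specs in a shell script follows. The spec values, slot layout, and the use of GNU find's -newermt test are assumptions for illustration, not part of the product.

Listing 2. A minimal file selector sketch
    #!/bin/sh
    # Within a spec, expressions are OR'ed (the shell expands each glob);
    # across specs, the results intersect (every find test must hold).
    SLOTSPEC="slots/DataSource* slots/Person"   # Slot Spec expressions
    FILESPEC="PERSON_*.umf"                     # File Spec expression
    SINCE="2011-08-01"                          # Date Spec lower bound

    for slot in $SLOTSPEC; do
        [ -d "$slot" ] || continue
        # -name applies the file spec; -newermt (GNU find) the date spec
        find "$slot" -maxdepth 1 -name "$FILESPEC" -newermt "$SINCE" -print
    done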

Just before a file makes it into the load file set area, the pre-script is executed for that file, which means the pre-script must accept a file name as a parameter. The pre-script can perform any validation you want on the file before it is passed to the set for ingestion. The pre-script returns a success or failure code; on failure, the file does not enter the load file set. Whatever the outcome of the pre-script, the load writes the results to a log file. Once a file is in the load file set, you can start taking chunks of the files to populate the load chunk queue according to the priorities that the chunk prioritizer dictates.
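
The fragment below sketches how a control script might drive the file pre-script; the script names, directory layout, and log format are assumptions.

Listing 3. Driving the file pre-script
    # Run the pre-script for a candidate file; a non-zero exit keeps the
    # file out of the load file set. Either way, log the outcome.
    run_file_prescript() {
        file="$1"
        if scripts/file_pre.sh "$file" >>log/load.log 2>&1; then
            mv "$file" fileset/                          # accepted into the file set
            echo "$(date '+%F %T') ACCEPT $file" >>log/load.log
        else
            echo "$(date '+%F %T') REJECT $file" >>log/load.log
        fi
    }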

The chunk prioritizer has a list of specs, like the selector, but each spec also contains a number indicating the priority of each chunk that meets the spec's criteria. To populate the load chunk queue, the chunk prioritizer applies the specs to the various elements in the load file set, cuts a chunk as a file, and puts it in the queue with the specified priority. Figure 5 depicts a sample instance of the chunk prioritizer specs.
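
As a sketch, the spec-to-priority mapping can be expressed with shell glob patterns; the patterns and priority numbers below are illustrative only.

Listing 4. Mapping chunk file names to priorities
    # Return the priority for a chunk, based on the name of the file
    # it was cut from. Lower numbers mean higher priority.
    priority_for() {
        case "$1" in
            PERSON_*.umf)  echo 10 ;;   # Person data first
            ACCOUNT_*.umf) echo 20 ;;   # Account data next
            *)             echo 99 ;;   # everything else last
        esac
    }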

Figure 5. Chunk prioritizer specs
The Chunk Prioritizer populates the load Chunk Queue.

When a chunk is taken from the load chunk queue, the chunk pre-script is executed. This script can do anything you want before the chunk is passed to the pipelines for ingestion; for instance, it can validate the chunk or notify external systems. The return code of the script indicates whether the chunk is allowed to continue in the flow, that is, to go to the pipelines, or is skipped. Whatever the outcome of the script, the results are written to a log file.
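
A chunk pre-script can be as small as the sketch below, which rejects empty chunks and chunks containing no entity tag; the tag name used here is illustrative, not the actual UMF schema.

Listing 5. A hypothetical chunk pre-script
    #!/bin/sh
    # $1 is the chunk file. Exit non-zero to skip the chunk.
    chunk="$1"
    [ -s "$chunk" ] || exit 1                  # reject empty chunks
    grep -q '<UMF_ENTITY>' "$chunk" || exit 1  # reject chunks with no entity tag
    exit 0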

Understanding actual ingestion in the pipelines

Since you will most likely have several loads running simultaneously, system-wide load priorities determine which load gets its turn at the pipelines. The load prioritizer takes care of that task. It has a series of specs, each consisting of a priority number and a wild card expression that matches one or more load names.

Figure 6. Load prioritizer specs
Load priorities at the system level dictate which loads the pipelines serve first.

At this point in the process, once a chunk has been processed by the pipelines, the chunk post-script is executed, and if the chunk was the last one for a given file, the file post-script is executed as well. The post-scripts can be used for notification or logging.

Walking through an implementation scenario

This section is a reference to use as the basis for a full implementation of the model. The following paragraphs will help you understand how all the concepts of the model can be realized in a hypothetical environment.

The model was conceived with a Linux/UNIX® environment in mind, since all of the required functionality can be programmed in shell scripts. In the following paragraphs, you will explore how each component can be implemented in a Linux/UNIX-like environment.

Slot implementation

Slots map intuitively to directories. External systems can deposit files in each slot by means of file transfer programs such as FTP (File Transfer Protocol), SFTP (SSH's secure FTP), or SCP (SSH's secure copy), or by file sharing protocols such as NFS (Network File System) on Linux/UNIX or CIFS (Common Internet File System) on Windows. Assume FTP is used. The FTP client connects to the FTP server, which is the IBM Identity Insight server. On the server side, clients should only see the slots directory hierarchy. The client should send each file under a temporary name during transfer and rename it to the final name once the transfer is complete. For example, during transfer the file is named filename.transfer, and when the transfer is done it is renamed to filename.umf. That way, the upstream process does not pick up files that are still being transferred, just the ones with the .umf extension.
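
The rename-after-transfer pattern can be scripted on the client side as sketched below; the host name, credentials, slot path, and file names are all hypothetical.

Listing 6. Depositing a file in a slot under a temporary name
    # Upload under a .transfer name, then rename to .umf when complete,
    # so the upstream process never sees a half-transferred file.
    printf '%s\n' \
        'user loader secret' \
        'cd /slots/DataSourceA' \
        'put PERSON_20110901_0001.umf PERSON_20110901_0001.transfer' \
        'rename PERSON_20110901_0001.transfer PERSON_20110901_0001.umf' \
        'quit' | ftp -n identity-insight-host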

Load implementation

To implement a load, the configuration should be written to a file. This configuration file should have values for the following elements (a sample appears after the list):

  • File Selector Specs
  • Load File Set directory path
  • File Pre/Post-Scripts file paths
  • Chunk Prioritizer Specs
  • Load Chunk Queue directory path
  • Chunk Pre/Post-Scripts file paths
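
A minimal sketch of such a configuration file follows; every key name and value is hypothetical, chosen only to mirror the elements listed above. A shell-sourceable format like this keeps the control scripts simple, since they can read the values with the shell's dot (source) command.

Listing 7. A sample load configuration file
    # load.conf -- all key names and values are illustrative
    FILE_SELECTOR_SLOT_SPEC="DataSource* Person"
    FILE_SELECTOR_DATE_SPEC="2011-08-01:2011-09-30"
    FILE_SELECTOR_FILE_SPEC="PERSON_*.umf"
    FILE_SET_DIR="/loads/initial-person/fileset"
    FILE_PRE_SCRIPT="/loads/scripts/file_pre.sh"
    FILE_POST_SCRIPT="/loads/scripts/file_post.sh"
    CHUNK_PRIORITIZER_SPECS="PERSON_*.umf:10 ACCOUNT_*.umf:20 *:99"
    CHUNK_QUEUE_DIR="/loads/initial-person/chunkqueue"
    CHUNK_PRE_SCRIPT="/loads/scripts/chunk_pre.sh"
    CHUNK_POST_SCRIPT="/loads/scripts/chunk_post.sh"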

The load itself is materialized as a file system directory. Under it sit the configuration file, the directory for the load file set, and the directory for the load chunk queue. One directory can contain the pre/post-scripts along with the control scripts that process the file selector specs and the chunk prioritizer specs, and another holds a file for logging all activities. The scripts directory and the log directory should be common to all loads.

The files in the load file set can retain their original slot file names, since no specific processing order is required there. For the load chunk queue, however, special naming is needed for the priorities to take effect. The simplest approach is to add a prefix composed of the priority number, zero-padded to 4 digits, for instance, 0010.filename.umf. That way, the standard listing command shows the highest-priority files at the top.
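
The sketch below cuts a file from the load file set into fixed-size chunks and enqueues them under the zero-padded priority prefix; the chunk size, priority value, and paths are illustrative.

Listing 8. Enqueueing chunks with a zero-padded priority prefix
    # Cut 100,000-line chunks and enqueue them under priority 10.
    prio=10
    src=PERSON_20110901_0001.umf
    split -l 100000 "fileset/$src" "$src."
    for c in "$src".*; do
        mv "$c" "chunkqueue/$(printf '%04d' "$prio").$c"
    done
    ls chunkqueue/    # lowest numbers (highest priority) list first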

Load prioritizer implementation

The load prioritizer must have a configuration file that contains the specs for setting the priorities of the various loads. It must also have a script that drives chunk selection among the loads and moves the chunks to the pipeline area. Depending on how the pipelines are set up, the file chunks must be moved accordingly; for example, if the pipelines are configured for message queue transports, the load prioritizer script places the files in the queue. The script also drives the execution of the post-scripts for the chunk and the file.
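
A minimal driver could look like the sketch below. The directory layout, polling interval, and the move into a pipeline intake directory are assumptions (with message queue transports, the mv would become a queue put), and the post-script hand-off is omitted for brevity.

Listing 9. A minimal load prioritizer driver loop
    #!/bin/sh
    # Each pass serves the highest-priority load that has a queued chunk,
    # runs the chunk pre-script, and hands the chunk to the pipelines.
    LOADS="0010.initial-person 0020.initial-account"  # priority encoded in the name
    while :; do
        for load in $LOADS; do
            dir="loads/${load#*.}/chunkqueue"
            chunk=$(ls "$dir" 2>/dev/null | head -n 1)  # highest-priority chunk
            if [ -n "$chunk" ]; then
                if scripts/chunk_pre.sh "$dir/$chunk"; then
                    mv "$dir/$chunk" pipeline/intake/
                else
                    mv "$dir/$chunk" skipped/           # pre-script said skip
                fi
                break   # restart from the highest-priority load
            fi
        done
        sleep 5
    done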

Figure 7 shows the directory and file hierarchy for the implementation scenario we just described.

Figure 7. Directory hierarchy
Sample directory hierarchy for the implementation scenario

Conclusion

The model presented in this article makes the data ingestion process with IBM Identity Insight more manageable and contributes to the design of the overall process. During an implementation of an identity resolution solution, there are many steps before you get to the ingestion process design, so you may have little time to think about it. The concepts and implementation guidelines in this article can help you establish a starting point. This model can help you envision the needs for load priorities, UMF organization, validation, notification, and logging early, thus simplifying the project.
