Abstract
In this technote we describe in greater detail how item matching and duplicate handling work in the Analysis Repository (Onyx) Importer. We cover the basic concepts, gotchas and recommendations.
Content
Environment
- IBM i2 Analyst’s Notebook Premium version 8.9.3+
- Analysis Repository Importer
- Either a Group (GAR) or Local Analysis Repository (LAR)
Steps
The Item Matching Page in the Import Specification Editor
The Item Matching page of the importer specification editor addresses three things:
- How imported data is added to cards on the item.
- How to discriminate items (what properties to use to find matches in the incoming data and in the Analysis Repository).
- What to do if no values have been provided for a discriminator property.

Card Creation
When data is imported, the information brought in is placed on a card on the Analysis Repository (AR) item. There are two options for adding imported data to an item in the AR:
- Create separate cards for each matching row
- Add information to a single card, discarding duplicate information (default)
What does each of the above options really do?
Create Separate Cards for Each Matching Row
For most imports into the AR this is the recommended option, as it creates a new card on the AR item only if no existing card already holds the same information as the incoming row of data. For example, if the incoming data contains the columns First Name, Last Name, Document Type and Document ID:
| Row# | FirstName | LastName | DocumentType | DocumentID |
| 1 | Joe | Blogs | Passport | PP12345 |
| 2 | Joe | Blogs | Drivers Licence | DL1234ABC |
| 3 | Joe | Blogs | Social Security Number | SS890123AZ |
at the end of the import three new cards will be added to the item in the AR (or to a newly created one), containing:
| Property | Value |
| First Name | Joe |
| Last Name | Blogs |
| Document Type | Passport |
| Document ID | PP12345 |
| Property | Value |
| First Name | Joe |
| Last Name | Blogs |
| Document Type | Drivers Licence |
| Document ID | DL1234ABC |
| Property | Value |
| First Name | Joe |
| Last Name | Blogs |
| Document Type | Social Security Number |
| Document ID | SS890123AZ |
So how are cards matched? The Onyx importer requires that an existing card hold exactly the same information as the incoming row of data. Thus, if the incoming row of data is a subset of the data on an existing card, that card will not be considered a match and a new card will be created.
If the same data is imported in a subsequent import, additional cards will not be created. This is important because the server limits how many cards can exist on an item in the AR; the default is 100 (it can be modified at the server level).
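The card-matching rule described above can be sketched as follows. This is an illustrative model, not the importer's actual implementation: a card matches an incoming row only when its property/value pairs are identical, so a row that is merely a subset of an existing card still produces a new card.

```python
def find_or_create_card(item_cards, incoming_row):
    """Return the existing matching card, or append and return a new one.

    item_cards: list of dicts, each dict holding one card's property/value pairs.
    incoming_row: dict of property/value pairs from one row of import data.
    """
    for card in item_cards:
        if card == incoming_row:          # exact match required
            return card                   # same data re-imported: no new card
    item_cards.append(dict(incoming_row))
    return item_cards[-1]

cards = [{"First Name": "Joe", "Last Name": "Blogs",
          "Document Type": "Passport", "Document ID": "PP12345"}]

# A subset of an existing card is NOT a match, so a new card is created:
find_or_create_card(cards, {"First Name": "Joe", "Last Name": "Blogs"})
assert len(cards) == 2

# Re-importing identical data creates no additional card:
find_or_create_card(cards, {"First Name": "Joe", "Last Name": "Blogs"})
assert len(cards) == 2
```

This mirrors why repeated imports of the same file do not push an item toward the server's card limit.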
Add Information to a Single Card
With this option, each import creates a single card on an item that captures all of the unique information in the incoming data file. A scenario where this option is suitable is when first ingesting entities in which two properties are used, firstly to identify the document type and secondly the document ID number. For example, if the incoming data contains the columns First Name, Last Name, Document Type and Document ID:
| Row# | FirstName | LastName | DocumentType | DocumentID |
| 1 | Joe | Blogs | Passport | PP12345 |
| 2 | Joe | Blogs | Drivers Licence | DL1234ABC |
| 3 | Joe | Blogs | Social Security Number | SS890123AZ |
at the end of the import a single new card will be added to the item in the AR (or to a newly created one), containing:
| Property | Value |
| First Name | Joe |
| Last Name | Blogs |
| Document Type | Passport |
| Document ID | PP12345 |
| Document Type | Drivers Licence |
| Document ID | DL1234ABC |
| Document Type | Social Security Number |
| Document ID | SS890123AZ |
Although this is the default option, it is not recommended for most imports because it creates a new card on a matched AR item on every import. This means that if the same data appears in a subsequent import, an additional card will be created, duplicating information.
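The "single card" option above can be sketched like this. Again, this is only an illustrative model: one new card is built per import, holding each unique (property, value) pair from the incoming rows in order, with duplicate information discarded.

```python
def build_single_card(rows):
    """Build one card from all rows, keeping each unique (property, value) pair once."""
    card = []                                   # ordered (property, value) pairs
    seen = set()
    for row in rows:
        for prop, value in row.items():
            if (prop, value) not in seen:       # discard duplicate information
                seen.add((prop, value))
                card.append((prop, value))
    return card

rows = [
    {"First Name": "Joe", "Last Name": "Blogs",
     "Document Type": "Passport", "Document ID": "PP12345"},
    {"First Name": "Joe", "Last Name": "Blogs",
     "Document Type": "Drivers Licence", "Document ID": "DL1234ABC"},
]

card = build_single_card(rows)
# "First Name"/"Last Name" appear once; each Document Type/ID pair appears once.
assert card[0] == ("First Name", "Joe")
assert ("Document Type", "Drivers Licence") in card
assert len(card) == 6
```

Because this function is run afresh on each import, re-importing the same file yields a second, identical card, which is exactly the duplication the recommendation above warns about.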
Item Matching Rules
Within the AR there is no general concept of a record identity that can be used for item matching during import. Instead, the importer allows "discriminating" properties to be defined. Item matching during import occurs at two levels:
- Identifying matching items, both entities and links, within the incoming data.
- Identifying matching existing entities in the AR against the resolved incoming data (note: existing links in the AR are not matched).
Any item property in the server’s schema can be used as a discriminating property. It is recommended that the properties used should:
- Have values for all rows in the incoming data file.
- Use a small number of strong properties; for example, a building number, street, city and ZIP/Postcode may suffice rather than every aspect of an address.
- Have had normalisation applied to the incoming data; for example, use "Street" for all instances of "Str", "St." and "Street" in the street column (use the "Column Actions" section in the import specification editor to pre-process the data).
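The street-name normalisation suggested above could be done in the importer's own "Column Actions" section; as a minimal sketch of the idea outside the tool, the regular expression below is purely illustrative.

```python
import re

def normalise_street(value):
    # Replace "Str" or "St." (as a whole word) with "Street",
    # leaving values that already say "Street" untouched.
    return re.sub(r"\b(?:Str|St\.)(?=\s|$)", "Street", value)

assert normalise_street("12 High St.") == "12 High Street"
assert normalise_street("12 High Str") == "12 High Street"
assert normalise_street("12 High Street") == "12 High Street"
```

Normalising before import matters because discriminators compare values literally: "12 High St." and "12 High Street" would otherwise be treated as different addresses.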
It is not possible to:
- Define different discriminators for the same type used multiple times in the data mapping in the “Assign Columns” section. They are applied globally for the same type throughout the import specification.
- Define OR operations; all discriminator properties are ANDed together. For example, you cannot specify First Name AND Last Name AND (Date of Birth OR Social Security Number).
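The AND-only matching rule can be sketched as follows. The property names are examples, not a fixed schema, and this is a model of the behaviour rather than the importer's code: two items match only when every discriminator value agrees.

```python
DISCRIMINATORS = ["First Name", "Last Name", "Date of Birth"]

def is_match(item_a, item_b, discriminators=DISCRIMINATORS):
    """All discriminator properties are ANDed: every one must agree."""
    return all(item_a.get(p) == item_b.get(p) for p in discriminators)

a = {"First Name": "Joe", "Last Name": "Blogs", "Date of Birth": "1970-01-01"}
b = {"First Name": "Joe", "Last Name": "Blogs", "Date of Birth": "1970-01-01"}
c = {"First Name": "Joe", "Last Name": "Blogs", "Date of Birth": "1980-05-05"}

assert is_match(a, b)        # every discriminator agrees
assert not is_match(a, c)    # one differing discriminator breaks the match
```

There is no way to express "Date of Birth OR Social Security Number" inside this scheme, which is why choosing discriminators that are populated for every row matters.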
When are Duplicates Created?
During import, duplicates can still be created even with item matching enabled. This can happen in the following scenarios:
- There are already duplicates on the server, that is, more than one item matches the item matching properties.
- The matching item already on the server is not visible to the user executing the import, owing to security.
- The user does not have "Update" permissions on the matching item on the server; in other words, they are unable to add information to it.
- The importer is unable to acquire a lock on the matching item already on the server, because it is either being edited by another user or is a matched item in another user's running import.
In all of the above cases, rather than stopping the import as soon as duplicates are detected, the importer continues by creating a new item to capture the data for the currently running import. Therefore, in scenarios 1 to 3 above, additional duplicate items will be created on the server.
Identifying and Removing Duplicates
Whenever duplicates are detected or created, a new set is created to aid later duplicate merging. To access the set at the end of the import, a "View Duplicates" button appears in the import results entry:

If the importer is closed, the same set can be found by browsing the AR through the Intelligence Portal. The set is named after the import with a "Duplicate Items" suffix.

The "<import name> Duplicate Items" set will contain:
- All duplicate items created for the whole import.
- All matching items found on the server.
To merge items together in the Intelligence Portal, select all the relevant records and select "Merge" from the "More actions" drop-down menu. The merge process allows the user to select a target record into which all other selected records are merged.
In some scenarios it is possible that the "<import name> Duplicate Items" set may contain only the new item created by the last import. So what happened to all the server duplicates in this case? During the import, when the importer asks the server whether there are any existing items that match the item matching properties, the server responds in one of four ways:
- No matches found.
- A single match found with identifier.
- Multiple matches found with identifiers.
- Too many matches found with no identifiers.
If the server responds with option 4 above, only the newly created item will be in the "<import name> Duplicate Items" set, because the server gives no information about the existing items. Why does response 4 exist? Simply to help server performance when finding and reporting matching items. The default reporting limit is 10 (a property named "DuplicateItemCountLimit" that can be found in the "ApolloServerSettingsCommonSearch.properties" file on the server). Unfortunately this is not treated as a top-N query (return the first N identifiers), which would be useful. Therefore, if there were 11 matches on the server, we do not get the first 10 identifiers; instead we get a flag saying there were too many matched items for their identifiers to be returned.
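The four response types can be modelled as below. This is an illustrative sketch of the behaviour described above, not the server's actual API; note that once the limit is exceeded, identifiers are withheld entirely rather than truncated to the first ten.

```python
DUPLICATE_ITEM_COUNT_LIMIT = 10   # default value of DuplicateItemCountLimit

def report_matches(match_ids, limit=DUPLICATE_ITEM_COUNT_LIMIT):
    """Return (response_type, identifiers) for a list of matching item ids."""
    if not match_ids:
        return ("no_matches", [])
    if len(match_ids) == 1:
        return ("single_match", match_ids)
    if len(match_ids) <= limit:
        return ("multiple_matches", match_ids)
    # Not a top-N: identifiers are withheld entirely once the limit is passed.
    return ("too_many_matches", [])

assert report_matches([]) == ("no_matches", [])
assert report_matches(["id1"]) == ("single_match", ["id1"])
assert report_matches(["id1", "id2"]) == ("multiple_matches", ["id1", "id2"])
assert report_matches([f"id{i}" for i in range(11)]) == ("too_many_matches", [])
```

The last case is why a "Duplicate Items" set can end up holding only the newly created item.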
In this scenario, how would the user find the matching items to merge? First, select the item in the set and click "Open":

Then click the “Find like this” button:

Adjust any options and then click “Search”:

In the resulting list, select all the items and then merge:

Finally work through the merge dialog to identify the target item to merge into:

Document Information
Modified date:
18 December 2018
UID
ibm10791719