Product Documentation
Abstract
IBM Global Name Recognition (GNR), Version 4.1 contains numerous features and enhancements that are intended to increase its utility and ease of use in typical name processing tasks.
The quickening tide of mergers and acquisitions in the commercial sector requires corporations to collect and retain massive data resources. This trend means that GNR licensees and GNR Business Partners are now facing ever larger collections of names. To meet these demands, GNR product development must place a special emphasis on performance, capacity, and efficiency of processing resources.
This document addresses the ways that GNR Version 4.1 can be used most effectively when processing large name lists. The tips and techniques described in this document have been collected from internal use and testing of GNR components, and represent current best practices.
Content
Introduction
Why large name lists require special consideration
Generally speaking, large name lists can pose numerous challenges that are either less likely to be present in smaller lists, or are present but with a lower impact on processing outcomes. The following factors become more prominent, in varying combinations, as the number of names grows:
Lower levels of adjudication
Larger lists typically receive less human processing, so validation of names against external references might not be feasible, making it harder to identify and correct errors.
More semantic inconsistencies
Semantic constraints (such as personal names only, initials cannot be used for given names, etc.) are harder to enforce in larger data sets, especially those that arise from aggregating several name lists that already contain conflicting name semantics constraints.
Greater levels of noise or errors
Ensuring error-free data capture and noise-free name representation of each name in the collection is difficult in larger lists, especially when aggregated from several lists.
Greater redundancy and less distinctiveness
The ability of a specific name to identify a particular entity is inversely proportional to the number of times that the name occurs in the collection (this is especially true for large collections of personal names). As more names are added, each name loses some of its distinctiveness, to the point where the name alone ceases to be an effective way of retrieving information about a specific individual.
Greater data preparation time required
As the number of names increases, so too does the effort and expertise that is required to extract names from data stores and format them for processing with GNR components.
Greater processing resources required
The CPU memory and intermediate storage that are required to achieve various GNR processing results are proportional to the size of the name collection that is to be processed.
Greater need for architectural efficiencies and processing economies
GNR name processing solutions are typically saved in CPU memory (memory resident), so processes that were able to run in a shared-processor environment might at some point require dedicated processor resources, if production time is held constant. Alternatively, processing that fits in a specific period of time (such as overnight) might not be containable as name lists grow.
Greater cultural and ethnic diversity
As a general rule, large collections of personal names exhibit considerable variety in the ethnic and cultural diversity that they present.
IBM InfoSphere Global Name Analytics functions and large name lists
Resource estimation and planning
Tips for planning large-scale name list processing with GNA
The IBM InfoSphere Global Name Analytics (GNA) components produce results based on the properties and characteristics of a particular name. Each name needs to be processed only one time, and the results can then be stored in a persistent way that removes the need for reprocessing. In addition, adding more or different names to the name set at a later time does not invalidate the results that were already delivered for a name.
The persistent value of GNA results means that planning for large-scale name list processing involves mainly per-transaction resources, and that a significant degree of parallelism is possible when processing the name set.
Caveats and cautions: when changes have impact
While GNA results have a generally persistent value, there are circumstances that can cause inconsistencies between names that are processed when a GNR application is first deployed and names that are processed in later sessions.
Any upgrade or version change in the GNA software components can create an inconsistency between previously calculated results and current results, because the linguistic algorithms and reference data that are used to make processing decisions in GNR products are constantly being refined.
Several GNA products support user-based extensions of the associated reference data. For example, NameParser allows users to extend reference data that is already present in the Name Data Object (NDO) file by adding new lexical material in an External Tokens File (ETF). Any addition, change, or deletion of user-defined reference data used by GNA processing components alters the persistence of previous results, requiring all names to be reprocessed. For this reason, user-initiated changes should be weighed carefully before they are implemented, because they can weaken the retained value of existing GNA processing results.
Preprocessing the data
Because GNA results can be retained, and because large name sets tend to exhibit a high degree of redundancy (that is, many names occur more than once, and some occur a large number of times), the most efficient way to process a large name set is to isolate the unique names so that GNA processes each distinct or unique full name one time. The GNA results are then shared across all actual instances of that same name.
Further efficiencies can be realized for large name sets that comprise both personal names (PNs) and organization names (ONs). By separating the ONs using NameSifter, these names can be omitted from the following GNA processing steps that are meaningful only for PNs:
· Parsing
· Culture categorization
· Genderization
· Generation of alternate spellings
After unique names in the name set have been isolated (while retaining a logical linkage back to the associated original instances) and ONs have been removed, the resultant unique PNs are much more efficient for GNA processing. Even when no ONs are present, isolation of unique names in an exclusively PN list can reduce the total GNA processing load by twenty to forty percent for typical large name lists.
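The isolation of unique names with a linkage back to the original instances can be sketched as follows. This is a minimal illustration, not part of any GNR product: the function name isolate_unique_names, the (record ID, full name) input layout, and the normalization rule (trimming whitespace and upper-casing) are all assumptions made for the example.

```python
from collections import defaultdict


def isolate_unique_names(records):
    """Return (unique_names, linkage) for an iterable of (record_id, full_name).

    unique_names: distinct names in first-seen order.
    linkage: maps each distinct name to the record IDs that carry it.
    Names are treated as equal after collapsing whitespace and upper-casing,
    which is an assumption of this sketch.
    """
    linkage = defaultdict(list)
    for record_id, full_name in records:
        key = " ".join(full_name.split()).upper()
        linkage[key].append(record_id)
    unique_names = list(linkage)  # dict keys keep insertion order
    return unique_names, dict(linkage)


records = [
    (1, "Maria Garcia"),
    (2, "John Smith"),
    (3, "maria  garcia"),
    (4, "John Smith"),
]
unique_names, linkage = isolate_unique_names(records)
# Four records reduce to two unique names; each unique name keeps the
# record IDs needed to share its results with every original instance.
```

Only the entries in unique_names would be passed to name analysis; the linkage table is what allows the results to be shared across all instances afterward.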
Parallelism in the GNA processing flow
How to expedite GNA processing of large name lists
An efficient processing sequence for processing large name sets with GNA begins with the aforementioned preprocessing steps and proceeds to a highly parallel processing step. The unique PNs are segmented into subsets so that there is a single subset available for each processor that can be used by the GNA application. The IBM NameWorks integrated API offers a single API call, Analytics::analyze(), that provides all available GNA results in a single invocation. Alternatively, each GNA processing component has an individually associated IBM NameWorks API, so customized preprocessing utilities capable of parallel operation can be developed quickly and easily, using either the C++ or Java interface.
Regardless of which GNA components are used, maximum parallelism yields the fastest possible delivery of GNA processing results. Because GNR products are supported on a wide variety of platforms that utilize different processors and operating systems, GNA processing can occur on heterogeneous host platforms.
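The segmentation step can be illustrated with a short sketch that divides a list of unique names into one subset per worker and processes the subsets concurrently. The analyze_name function is a hypothetical stand-in for whichever analysis call an application uses; it is not a GNR API. Threads are used only to keep the example short; a production deployment would more likely run one process per processor or per host.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(names, n_subsets):
    """Split names round-robin so that subset sizes differ by at most one."""
    return [names[i::n_subsets] for i in range(n_subsets)]


def analyze_name(name):
    """Hypothetical placeholder for a per-name analysis call (not a GNR API)."""
    return {"name": name, "token_count": len(name.split())}


def analyze_subset(subset):
    """Process one subset independently of all other subsets."""
    return [analyze_name(name) for name in subset]


def analyze_in_parallel(names, n_workers):
    """Analyze each subset on its own worker and collect all results."""
    subsets = partition(names, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_subset = list(pool.map(analyze_subset, subsets))
    return [result for subset in per_subset for result in subset]
```

Because each name is analyzed on its own properties, the subsets do not need to share any state, which is what makes this kind of partitioning straightforward.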
Federating GNA processing
When GNA processing operates in parallel, whether on the same or different host platforms, each processing unit is autonomous, delivering results that are separate from those derived by the other processes. Therefore, failures or processing interruptions in one parallel process do not affect other active segments, allowing each process to be restarted or executed independently.
After processing results have been successfully generated in all parallel segments, these results can be merged into a single file.
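A minimal sketch of this restart-and-merge pattern follows. The names run_segments, merge_results, and process_segment are hypothetical, and the retry limit is an arbitrary choice for the example. Only segments that fail are rerun; completed segments are left untouched, and the merged list can then be written to a single output file.

```python
def run_segments(segments, process_segment, max_attempts=2):
    """Process each segment independently, retrying only the ones that fail.

    segments: dict mapping a segment ID to its list of names.
    process_segment: callable that returns a list of results for one segment.
    """
    results = {}
    pending = dict(segments)
    for _ in range(max_attempts):
        failed = {}
        for seg_id, names in pending.items():
            try:
                results[seg_id] = process_segment(names)
            except Exception:
                failed[seg_id] = names  # other segments are unaffected
        pending = failed
        if not pending:
            break
    if pending:
        raise RuntimeError(f"segments still failing: {sorted(pending)}")
    return results


def merge_results(results):
    """Concatenate per-segment results in segment-ID order."""
    merged = []
    for seg_id in sorted(results):
        merged.extend(results[seg_id])
    return merged
```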
Managing GNA processing results
A key factor in reducing the processing time and required resources that are associated with large-scale processing of names is planning a name management regime that maximizes the retained value from the results of the initial, baseline processing sequence.
Establishing a name management protocol
Large name sets typically arise in operational settings where significant data management resources already exist. As such, it is feasible to plan for retention of GNA processing results in a persistent data store, ideally under DBMS management. Such a regime should be in place before GNA processing begins in order to maximize retention benefits and minimize subsequent reprocessing. Depending on the specific GNA components and the format of the results that they deliver, some minor reformatting might be required before the results can be inserted in the data store. The storage scheme should support retrieval of names by the names themselves and by shared values (such as given names that were identified as being female and are also primarily associated with the French culture), as expressed by standard Boolean search conditions commonly supported by most DBMS products.
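As an illustration of such a storage scheme, the following sketch uses an in-memory SQLite database. The table name, column names, and the encoded values ("F", "FRENCH") are assumptions made for the example and do not reflect the actual output format of any GNA component.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE given_name_results (
           name    TEXT PRIMARY KEY,  -- retrieval by the name itself
           gender  TEXT,
           culture TEXT
       )"""
)
# An index on the shared values supports Boolean retrieval across names.
conn.execute(
    "CREATE INDEX idx_gender_culture ON given_name_results (gender, culture)"
)
conn.executemany(
    "INSERT INTO given_name_results VALUES (?, ?, ?)",
    [
        ("AMELIE", "F", "FRENCH"),
        ("PIERRE", "M", "FRENCH"),
        ("MARIA", "F", "HISPANIC"),
    ],
)


def lookup(name):
    """Retrieve stored results by the name itself."""
    return conn.execute(
        "SELECT gender, culture FROM given_name_results WHERE name = ?", (name,)
    ).fetchone()


def female_french_given_names():
    """Retrieve names by shared values combined with a Boolean AND."""
    rows = conn.execute(
        "SELECT name FROM given_name_results "
        "WHERE gender = 'F' AND culture = 'FRENCH' ORDER BY name"
    )
    return [row[0] for row in rows]
```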
Fewer instances of new unique names are encountered as a large data list grows. An effective name management protocol establishes a baseline that serves as a local reference data repository, so that GNA components need to process only the new unique names as they are added to the list, instead of recalculating analysis results for all names. This investment saves time by reducing subsequent GNA processing.
Never process the same name twice
A well-designed name management regime for GNA results should seek to process each unique PN only one time, and to contain almost all GNA processing tasks in an initial phase, during which a highly parallel processing path is employed. Maintaining the GNA results in a highly searchable form facilitates their reuse, and obviates most use of GNA components after the initial processing phase.
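The process-once principle can be sketched as a lookup against the retained results before any analysis is requested. In this illustration the store is a plain dictionary and analyze is any callable supplied by the application; both are placeholders rather than GNR interfaces.

```python
def process_new_names(incoming, store, analyze):
    """Analyze only the names that are missing from the retained results.

    incoming: iterable of names arriving after the baseline run.
    store: dict of name -> previously retained result (updated in place).
    analyze: callable that produces a result for one name.
    Returns the number of names that actually required analysis.
    """
    new_names = {name for name in incoming if name not in store}
    for name in new_names:
        store[name] = analyze(name)
    return len(new_names)
```

With a baseline already in place, repeated and previously seen names cost only a lookup, so the analysis workload tracks the number of genuinely new unique names rather than the size of the incoming batch.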
IBM InfoSphere Global Name Scoring functions and large name lists
How GNS functions differ from GNA functions
The IBM InfoSphere Global Name Scoring (GNS) product bundle has, as its most basic function, the ability to identify and measure the degree of relatedness between two names. The most typical deployment of a GNS solution is one in which a single name is searched for in a list of candidate names. In GNR product terminology, the name being searched for is the query name and the set of candidate names in which the search takes place is called the name list or name data list, while the names on that list are sometimes referred to as database names.
The GNS component that performs the search is NameHunter. The search process is controlled by a set of over forty individual controls, collectively known as comparison parameters (CompParms). Additionally, external reference data files that contain variant forms of the compared names (VARDATA tables), as well as files listing minor name components such as titles, affixes and qualifiers (TAQ tables), all influence the outcome of a particular search transaction. GNS results are typically less persistent than GNA results because CompParms can be modified on a transactional basis, VARDATA and TAQ reference files can be expanded, and search outcomes depend on both the query name and the data list content. For example, the results of searching for a specific name in a list of 1 million candidate names are potentially invalidated if a single name is added to or removed from the search list.
For this reason, efficiencies around GNS processing of very large name lists should not be based on a process-and-store approach. A GNS solution should instead focus on minimizing individual search response times by taking advantage of the distributional properties of large name collections.
Implications for processing large name lists
Efficient processing with GNS depends on the following determinations:
· How the components are combined into an overall solution
· The physical and logical organization of the large name list to be searched
· How GNS search processes are mapped to available processor resources
Each of these factors is considered in greater detail in the following sections.
Resource estimation and planning
Before considering the efficiency factors that are associated with GNS components, it is useful to review the NameHunter requirements for query names.
The following best practices are recommended for a set or list of names to be searched by one or more instances of NameHunter.
Isolate unique names
As with GNA best practices, duplicate entries should be removed from the name list before searching so that each unique (full) name occurs only one time. To maintain a link between name-based search results and individual records, a linkage table must be retained so that each unique name can be connected to all of its redundant instances in the original name list. Name Preprocessor (NPP) is capable of isolating unique PNs in large lists, while also building linkage tables that are required to find each instance of every unique name so that NameHunter Distributed Search (NH-DS) can support unique name search mode of operation.
NH-DS is a prebuilt search application that comprises a search manager application and one or more autonomous searcher processes, all communicating with each other and with client processes through XML messages sent over standard TCP/IP socket connections. NH-DS also includes NPP, which converts name data files into a format to serve as the basis for populating a memory-resident name data list, and also divides the name list into separate partitions so that it can be distributed to multiple processes for parallel processing.
Convert to Romanized formatting
Query names must be in the Romanized format that NameHunter internal matching and scoring logic expects. Names are converted automatically into a searchable Romanized form by GNR as they are loaded into memory and added to NameHunter datalists, provided that they are represented in one of the writing systems that are supported by GNR:
· Arabic
· Cyrillic
· Greek
· ISO Latin
Separate personal names from organizational names
NameHunter provides the best search results on name lists that contain the same entity type. NameHunter currently supports two types of named entities: personal names (PNs) and organization names (ONs). PNs and ONs need to be differentiated before searching so that each type can be placed into a data list of that entity type. If manual separation of PNs and ONs is not possible, then this distinction can be made automatically using NameSifter.
Assign structure to PNs
Each PN that is subjected to NameHunter searches must be parsed into its two most basic structural elements, the given name (GN) and the surname (SN), according to the GNR name model. Each of these name fields can comprise multiple tokens, although a name might have either name field missing. ONs, on the other hand, are not currently searched based on the GN and SN, but are instead searched using the full name string. If a PN is not separated into its GN and SN components, NameParser can make this distinction, and can also suggest alternate parses for some PNs that might have more than one plausible mapping into GN and SN name fields.
Assign cultural context to PNs
Each PN that is subjected to NameHunter searches must be associated with one or more cultural categories. The cultural context allows NameHunter to apply valuable information from the GNR runtime reference data files that greatly improve search accuracy, and allows NameHunter to accommodate many frequent patterns of culture-specific name variation. The cultural context is required individually for the GN and SN name fields and for the PN as a whole. GNR Version 4.1 supports a total of 21 key culture classifications, including a Generic category that covers all cultures not otherwise distinguished or supported in GNR products. You can use NameClassifier and Country of Association (NC-COA) to determine culture classifications for the GN, SN, and full name.
Generate a regularized form of the name
An optional regularization feature is provided for similar sounding names whose spelled forms cannot be connected with NameHunter matching logic. PNs that contain unusual spelling can be mapped to more typical spellings that do not obscure the pronunciations of the related names. This feature is a powerful enhancement to the search logic, allowing visually dissimilar names (for example, KNOX and NAUCKS) to be related because of their underlying phonological similarities. Culture-specific regularization rules must be applied if these proxy forms are added to the data list.
Load search lists into memory
Because all NameHunter searches are based in memory, no searches can be performed until the name list is read from an external data source (typically, a CSV flat file) into a memory-based data structure. A NameHunter instance can comprise and control numerous distinct data lists, and a GNR search transaction can involve one or many NameHunter instances, depending on the scope of the search and the degree of parallelism configured into the runtime search environment.
Perform pair-wise comparisons
After a data list is saved in memory, a search transaction can be processed within its scope, using that data list. In such an instance, NameHunter compares each name in the list to the query name in a sequence of pair-wise comparisons. NameHunter provides an optional parallel data structure, called a bit signature file, for each name data list. The bit signatures allow NameHunter to identify and abandon unsuccessful pair-wise comparisons with minimal resource expenditure to greatly improve search times.
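The general idea of a signature-based prefilter can be illustrated as follows. This is a generic sketch and not the NameHunter bit signature format, which is not described in this document: here each name is reduced to a 26-bit mask of the letters it contains, and a pair is passed to full comparison only if the two masks share a minimum number of letters. The threshold of two shared letters is an arbitrary assumption.

```python
def bit_signature(name):
    """Return a 26-bit mask with one bit set per distinct letter A-Z in name."""
    sig = 0
    for ch in name.upper():
        if "A" <= ch <= "Z":
            sig |= 1 << (ord(ch) - ord("A"))
    return sig


def shares_enough_letters(sig_a, sig_b, min_shared=2):
    """Cheap test: count the letters present in both signatures."""
    return bin(sig_a & sig_b).count("1") >= min_shared


def candidates(query, data_list, signatures, min_shared=2):
    """Keep only names whose signature passes the cheap test.

    signatures must hold bit_signature(name) for each name in data_list,
    computed once when the list is loaded.
    """
    query_sig = bit_signature(query)
    return [
        name
        for name, sig in zip(data_list, signatures)
        if shares_enough_letters(query_sig, sig, min_shared)
    ]


data_list = ["SMITH", "SMYTHE", "WONG"]
signatures = [bit_signature(name) for name in data_list]
survivors = candidates("SMITH", data_list, signatures)
```

The signatures are computed once at load time, so each rejected pair costs a single bitwise AND and a bit count instead of a full name comparison.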
Maintain synchronization with underlying data store(s)
After memory-resident data lists have been populated and searching is underway, many GNR product deployments elect to make the memory-resident data list dynamically reflect changes in the associated data stores. Add, delete, and update transactions need to be supported, even if the data list has been reduced to unique name instances. These transactions also need to be handled in a way that minimizes delay on active search requests.
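One way to support add, delete, and update transactions against a list that holds only unique names is to keep an instance count per name, so that a name enters the search list with its first instance and leaves it with its last. The class below is a minimal sketch under that assumption; UniqueNameList is a hypothetical name and is not a GNR component.

```python
from collections import Counter


class UniqueNameList:
    """Tracks how many source records carry each unique name."""

    def __init__(self):
        self._counts = Counter()

    def add(self, name):
        """Return True if the name is new and must be added to the search list."""
        self._counts[name] += 1
        return self._counts[name] == 1

    def delete(self, name):
        """Return True if the last instance is gone and the name must be removed."""
        if self._counts[name] == 0:
            raise KeyError(name)
        self._counts[name] -= 1
        if self._counts[name] == 0:
            del self._counts[name]
            return True
        return False

    def update(self, old_name, new_name):
        """Treat an update as a delete followed by an add."""
        removed = self.delete(old_name)
        added = self.add(new_name)
        return removed, added

    def __contains__(self, name):
        return self._counts[name] > 0
```

The return values tell the caller whether the memory-resident search list itself needs to change; most transactions on a highly redundant list only adjust a count.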
Define and validate available search strategies and user preferences
Another valuable preparation step is conducting an assessment of the number and type of search strategies that are required to meet the business rules and individual preferences within the target user community. IBM NameWorks supports definition of search strategies, which are customized search settings that can be invoked and applied on a per-transaction basis during a GNS search session.
The degree of control and the format to be made available to users in a search strategy must be defined and established in the configuration files for NameHunter and IBM NameWorks so that they can be invoked as needed at search time. Both NameHunter and IBM NameWorks support a wide range of user-defined search strategies in order to present search outcomes that are closely tailored to organizational requirements and specific user needs. Implementing manual reviews of sample results and user focus groups can expedite the process of identifying the best roster of search strategies to implement. Search strategies are especially important for large name lists because many search requests can return large numbers of potential matches, and non-name search criteria are typically required in order to qualify and filter the match results that are delivered by GNR.
Preprocessing the data
Consistently delivering superior search results for large name lists entails a significant amount of architectural analysis and data preparation, even before the first search is performed. Much of what NameHunter does depends on the categorical, structural, and cultural distinctions that take place before names are loaded into memory for searching. These distinctions are made in a number of different ways, ranging from manual to fully automatic, so that each GNR product deployment can be fitted to existing local data management practices and business rules.
The following sections describe various levels of automation support that are provided in the GNS product bundle to enhance and preprocess large name lists for searching.
Custom utilities based on IBM NameWorks
The IBM NameWorks APIs expose individual GNS functions that are required to prepare name data for searching. In particular, the analyzeForSearch() API performs all preprocessing, but other APIs are available to perform individual processing for parsing and culture classification.
Development of a customized preprocessing utility with the IBM NameWorks APIs is an efficient way to perform only those name-data enhancements that are required for a particular operational setting. A customized preprocessing utility can also take advantage of the multithreading capability of IBM NameWorks, which helps to significantly reduce preprocessing time, especially in comparison with Name Preprocessor, which does not support multithreading.
Using GNS command line utilities
GNS also provides individual command line utilities that can be applied to a large name list. If development at the API level is not possible, then these utilities can be used to complete the following preprocessing functions:
· npclu: name parsing
· nc_coaclu: name classification
Use of individual command line utilities can allow for customized preprocessing in environments where some, but not all, of the information required by NameHunter is available. Processing results are typically returned in standard CSV format records, making it easy to store and combine output from individual utility programs.
The following preprocessing steps for name list data are not currently supported through GNS command line utilities, and can only be accomplished through the other automation services that are described in this document:
· Transliteration
· Categorization (PN versus ON)
· Regularization
The previous remarks concerning the efficiencies of preprocessing unique name versions of large name lists also apply here.
Using analyzeForSearch()
The IBM NameWorks Scoring APIs include analyzeForSearch(), a function that performs the following preprocessing functions in a single invocation:
· Transliteration
· Categorization
· Name parsing
· Name classification
· Name regularization
This comprehensive API enables you to develop custom preprocessing applications using C++, Java, or Web services interface layers. This API can be used to supplement preexisting information with automated preprocessing results and to generate output in custom formats.
The previous remarks concerning the efficiencies of preprocessing unique name versions of large name lists also apply when using the IBM NameWorks Scoring APIs for selective preprocessing functions.
Using Name Preprocessor with NH-DS
Name Preprocessor (NPP) accepts user name sets and formats them for use as input to one or more instances of an NH-DS searcher process. Each NH-DS searcher process comprises a memory-resident list of names that a query name can be searched against. When searching large name lists, multiple searchers are typically used to process subsets of the large list in parallel, a technique known as search federation.
Depending on the specified configuration options, NPP can also enrich the original name data by adding transliteration, alternate parses, regularized forms, and NameHunter character-level edits to remove noise and anomalous characters. NPP can also prepare the original name list for NH-DS when operating in unique name search mode by isolating unique full names and building cross-reference files that are required to link the derived list of unique names to the original names.
NPP is a single-threaded process, so processing a large name list with several configuration options can greatly increase processing time. The following strategies help to mitigate the processing delays imposed by NPP for very large name lists.
Turn off unnecessary processing options
NPP should be configured to omit any processing options that are not required or that add little value in improving search outcomes. For example, names might be pre-parsed with a known and acceptable degree of accuracy, or the regularization of names might not generate alternate spellings that enhance search results. Eliminating such options can help to reduce NPP processing time considerably.
Use a sample subset to estimate processing time
Pass a small percentage of the original name list through NPP to estimate both the approximate processing time for the entire list, as well as the associated benefits of various processing options.
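The estimate can be sketched as a timed run over a random sample followed by a linear extrapolation. The assumption that processing time scales linearly with list size is a simplification made for this example, and the process argument is a placeholder for whatever invokes the preprocessing step on one name.

```python
import random
import time


def extrapolate(sample_seconds, sample_size, total_size):
    """Scale a measured sample time to the full list, assuming linear cost."""
    return sample_seconds * total_size / sample_size


def estimate_total_seconds(names, process, sample_fraction=0.01, seed=0):
    """Time process() on a random sample and extrapolate to all names."""
    sample_size = max(1, int(len(names) * sample_fraction))
    sample = random.Random(seed).sample(names, sample_size)
    start = time.perf_counter()
    for name in sample:
        process(name)
    elapsed = time.perf_counter() - start
    return extrapolate(elapsed, sample_size, len(names))


# If a 10,000-name sample takes 12 seconds, a 1,000,000-name list is
# projected at 1,200 seconds under the linear assumption.
projected = extrapolate(12.0, 10_000, 1_000_000)
```

Repeating the measurement with different processing options enabled gives a rough cost for each option, which can then be weighed against its observed benefit on the same sample.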
Federate the processing if not in unique name search mode
If NPP output is not intended for use with NH-DS when operating in unique name search mode, then the original large name list can be divided into two or more subsets, each of which is processed separately by a different instance of NPP. Processing can occur on the same multiprocessor platform, or on a different host platform altogether. The resulting output from this type of federated processing can be configured with a single NH-DS instance.
Save the NPP interim data file
Retaining the NPP interim data file is another way to harness the time and processing resources invested in passing a large name list through NPP. As with GNA processing results, the NPP interim file remains valid until the names in the original data file change. NPP interim files can be stored as DBMS tables so that they can be recovered and reconstituted quickly. The resultant name data can be redistributed over a number of CPUs when NH-DS is operating in federated search mode. This same interim data file can also be used with the IBM NameWorks Embedded Search (NW-ES) facility.
Custom preprocessing for IBM NameWorks Embedded Search instances
Unlike NH-DS, the IBM NameWorks Embedded Search (NW-ES) search facility (also provided with the GNS product bundle) is not dependent upon an external utility program to enrich the name data that it searches. Instead, NW-ES can transliterate, categorize, parse, classify, and regularize each name that it finds in a source data file, depending on what name processing enrichments are specified in its configuration file.
NW-ES can enrich name data as it builds and populates individual memory-resident data lists. Because each list is associated with a single CPU, NW-ES enrichment can be multithreaded and federated, enabling a large name list to be processed many times faster than when using NPP.
Searching of large name lists is much more efficient when the lists are reduced to unique full names, but NW-ES cannot isolate unique names like NPP can. This limitation can be overcome by identifying and isolating the unique names before they are processed by GNR.
For example, a typical best practice is to isolate unique full names and build cross-reference linkage tables for large name lists that are maintained in commercial DBMS products (such as DB2). In this scenario, the unique name list (in which each name is associated with its own unique name ID value) can be manually divided into an appropriate number of mutually exclusive subsets for parallel preprocessing with NW-ES. For exceptionally large lists, the total preprocessing time can be reduced substantially compared to standard NPP performance, while retaining the benefits of unique name processing by NW-ES or by NH-DS.
GNR search results delivered by NW-ES or NH-DS contain an internal, unique name ID. Therefore, a join with the DBMS cross-reference table is required to locate the original name records that are associated with the unique name ID. This cost is offset in many instances because GNR search results that involve large name lists are often filtered, post-search, through DBMS queries on non-name search criteria.
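The join from unique name IDs back to the original records, combined with a non-name filter, can be sketched with an in-memory SQLite database. The table names, column names, and sample rows are hypothetical and are chosen only to show the shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE name_xref (name_id INTEGER, record_id INTEGER);
    CREATE TABLE records (record_id INTEGER PRIMARY KEY, full_name TEXT, city TEXT);
    INSERT INTO records VALUES
        (10, 'John Smith', 'BOSTON'),
        (11, 'JOHN SMITH', 'DENVER'),
        (12, 'Maria Garcia', 'BOSTON');
    INSERT INTO name_xref VALUES (1, 10), (1, 11), (2, 12);
    """
)


def expand_hits(name_ids, city=None):
    """Map unique name IDs from a search result to original record IDs.

    An optional non-name criterion (city) is applied in the same query.
    """
    if not name_ids:
        return []
    placeholders = ",".join("?" * len(name_ids))
    sql = (
        "SELECT r.record_id FROM name_xref x "
        "JOIN records r ON r.record_id = x.record_id "
        f"WHERE x.name_id IN ({placeholders})"
    )
    params = list(name_ids)
    if city is not None:
        sql += " AND r.city = ?"
        params.append(city)
    sql += " ORDER BY r.record_id"
    return [row[0] for row in conn.execute(sql, params)]
```

Because the non-name filter usually has to be run against the DBMS anyway, the join adds little beyond a query that would already be issued.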
Caveats concerning automatic name preprocessing tools
Each of the aforementioned preprocessing steps for large name lists adds to the quality and accuracy of searches with IBM InfoSphere Global Name Scoring (GNS). Each of these steps can be completed in a fully automated fashion using a combination of other GNR name analysis or name scoring products. The ability of GNR components to produce these processing results is dependent, in most instances, on reference data collected from many years of empirical analysis. In particular, many GNR processing components refer at run time to statistics and patterns that are stored in the Name Data Object (NDO) file, a highly compressed, memory-mapped data file that comprises a variety of reference data about names, originally extracted and summarized from the GNR Name Data Archive (NDA). The NDA is a collection of nearly 800 million names from countries all over the world, and the summary information present in the NDO serves as the knowledge base for many processing decisions made by GNR components.
While this repository of name-related information provides many valuable insights into the patterns of cultural variation and typical usage of names around the world, it is important to understand that processing results based on NDO data are not preferred to actual observations and data collected with the names in the original name list. That is, native information associated with a name should be considered more authoritative than results from when a name is processed by GNR components. For example, a personal name that is parsed into its given name and surname constituents might render a different parse when processed by NameParser. This generated parse is provided as a supplement to the original parse, but not as a replacement. Existing name annotations such as gender, ethnic/cultural affiliation, alternate spellings, or any other characteristic that was collected originally, should be used when these categories of data are available. The name annotations produced by GNR components are provably consistent with the large, global patterns of use for all names in the NDO, but these results cannot be considered equally applicable to each name therein. For this reason, automated name preprocessing results from GNR might in some instances contradict known information that is associated with particular names. The following caveats apply for name processing with GNR components.
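The precedence of native information over generated results can be expressed as a simple merge rule. In this sketch the annotation keys and values are hypothetical; generated values fill only the fields for which no native value was collected, and native values are never overwritten.

```python
def merge_annotations(native, generated):
    """Combine annotations, keeping the native value wherever one exists.

    native: annotations collected with the original name list.
    generated: annotations produced by automated preprocessing.
    Native fields that are None or empty are treated as missing.
    """
    merged = dict(generated)
    merged.update(
        {key: value for key, value in native.items() if value not in (None, "")}
    )
    return merged
```

For example, a record whose gender was collected at the source keeps that value even if automated processing suggests a different one, while a missing culture field is filled from the generated result.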
Transliteration
GNR name transliterations are rendered in a form that makes the associated name available for effective processing by all other GNR product components. In the Romanized forms that are rendered by GNR, downstream processing outcomes take priority over human legibility. In particular, GNR results are not intended for use as official or persistent representations, but only as intermediate, internal forms, in the same sense as generated keys and hashes used in many typical data processing applications.
Categorization
GNR decisions as to whether a specific name is best processed as a personal name (PN) or an organizational name (ON) are made based on previous use of the name. When little or no data is available concerning prior use, or when the name has internally conflicting indicators (such as an ON that contains PN information), the GNR name category decision might conflict with human intuition and other known facts about the name.
Current GNR quality assurance data indicates that automated name categorization accuracy is between ninety and ninety-seven percent for typical lists of mixed PNs and ONs. Thus, whenever the category that is associated with a record (person or organization) can be inferred from an external source, this information should be preferred over the decision of NameSifter, the GNR name categorization function. NameSifter should only be used when it is known that a data set contains both personal and organization names, and when there is no indication to differentiate whether a record relates to a person or an organization, and human adjudication of the records is not feasible. In addition, NH-DS and NW-ES can be configured to search both PN and ON lists using the doOnToPnListSearch=true configuration file setting. This setting ensures that both the PN and ON lists are searched when a search request for an ON is processed.
Parsing
NameParser identifies the two basic components of personal names, the given name and surname, without rearranging any name components. Occasionally, when NameParser determines that the order of name components is substantially different from the pattern in which those same components are typically found, an alternate parse (reparse) might be suggested for the name. NameParser assigns a validity score to both the standard parse and the reparse to measure the extent to which the parse conforms to observed sequences present in the NDO file. The reparse might therefore occasionally receive a higher validity score, and it is tempting to interpret it as a better parse. However, the validity score is more accurately interpreted as a measure of conformity, not as a measure of correctness. Reparses produced by NameParser (either as a component of NamePreprocessor, or as an embedded component in the IBM NameWorks search facility) are taken as supplemental data, and not as a substitute for the standard parse. After preprocessing with NamePreprocessor, reparsed names that are added automatically to NameHunter search lists are flagged. This flag is stored when the name is loaded into memory for searching so that the names are easily identified when they appear in subsequent search results. The presence of this flag in the search results also means that any match-name that was generated as a reparse can be omitted from the search results before they are sent to the user. Again, a preexisting parse of a PN is always to be preserved and considered, regardless of whether it happens to coincide with the parse(s) generated automatically by NameParser.
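Omitting reparse-generated matches before results reach the user can be sketched as a simple post-filter. The example assumes that each match has already been read into a dictionary and that the reparse flag is available under the key isAltParse with the values 0 and 1, as in the dsFile search output described in Attachment A; the dictionary layout itself is hypothetical.

```python
def drop_reparse_matches(matches):
    """Return only the matches that were not generated as an alternate parse.

    matches: list of dicts; 'isAltParse' is assumed to be 0 (no) or 1 (yes).
    A missing flag is treated as 0, which is an assumption of this sketch.
    """
    return [m for m in matches if int(m.get("isAltParse", 0)) != 1]
```

Whether reparse matches should be dropped or only marked for the user is a policy decision for each deployment.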
Culture classification
Associating personal names (PNs) with specific cultural or ethnic classifications is a powerful enhancement to NameHunter searches. Specifying the cultural background of a query name enables NameHunter to apply a wide range of specialized matching and scoring techniques, as well as pertinent reference data, that help to improve search accuracy and to suppress unwanted and irrelevant matches. Culture classifications made by NameClassifier Country of Association (NC-COA) depend on broad, pervasive patterns of spelling, letter sequences, and other culture-specific cues found in PNs, and are further filtered by NDO data to ensure consistency between the culture classification and the countries where the name has been observed in use.
NC-COA currently supports 21 cultural classifications, mainly representing cultures whose security, financial, or commercial prominence is important to GNR users. That being said, many names that are found in large data lists might stem from cultural backgrounds that are not yet supported by GNR. Although these names are intended to be assigned a Generic culture category, some names exhibit characteristics that are similar to an existing culture category. For example, many Italian names are similar in spelling and style to Spanish names. In addition, many names (especially given names) have achieved popularity across numerous cultures and could pertain to more than one culture classification. Errors and other anomalies can also cause some names to be assigned to culture classifications that are counterintuitive to standard usage.
GNR search components like NH-DS and NW-ES are capable of delivering excellent search results even if culture information is excluded. A name that has apparently been miscategorized by NC-COA can still match against a wide range of variant forms that are typically seen as related. The best way to mitigate cultural ambiguity of PNs is to rely on reported cultural association, and to consider the automated assessments that NC-COA generates as supplemental suggestions. NC-COA provides a measure of a PN’s cultural affinity against all 21 supported cultures, so it is possible to perform a search by each of the most plausible GNR culture associations. An efficient way to do this is to make use of the GNR “roll-up” culture categories (European, Southwest Asian, and Han), each of which comprises a set of closely related and interwoven cultures.
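Selecting the most plausible culture associations for repeated searches might look like the following sketch. It assumes the affinity measures have already been obtained and stored in a dictionary that maps a culture label to a numeric score where higher means more plausible; the labels, the score scale, and the function name are placeholders, not NC-COA output formats.

```python
def most_plausible_cultures(affinity, top_n=3, min_score=0.0):
    """Return up to top_n culture labels, best first, whose score is at least min_score.

    affinity: dict of culture label -> score (hypothetical scale, higher is better).
    """
    ranked = sorted(affinity.items(), key=lambda item: item[1], reverse=True)
    return [culture for culture, score in ranked if score >= min_score][:top_n]


# Hypothetical scores for one personal name.
scores = {"culture_a": 0.62, "culture_b": 0.58, "culture_c": 0.05, "culture_d": 0.31}
candidates = most_plausible_cultures(scores, top_n=2, min_score=0.3)
```

Each label in candidates would then drive one search request. Where the top candidates fall in the same roll-up category, a single search under that category covers them.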
Regularization
The result of GNR name regularization is intended for use as an intermediate form and not as a persistent representation of the input name for human interpretation. As with name parsing, the regularized name is intended for use as a supplement to the original form, and not as a replacement. Regularization is most useful when names in a large data list are known to have originated from oral sources where the sound of the name might have been captured relatively well, but the actual spelling is speculative. These circumstances arise frequently in many public sector name search operations, when names in search lists are captured from informal sources without the knowledge of or cooperation from the associated individuals. Contrastingly, names that are found in most commercial settings are typically captured from written, attested sources, with the best cooperation of the associated individuals. If the names in a large search list are known to be largely free of transcription errors, even without corresponding documents of verification, then adding regularized name forms will add little value because the additional processing time that is required to process the regularized forms is unlikely to result in valuable name matches. Regularization can be turned on or off in the NPP, NH-DS, and NW-ES configuration files.
Establishing an optimal GNS search architecture
A distinct set of challenges must be addressed to establish a runtime search architecture that represents the best balance of efficiency and quality in order to maximize GNR name matching and name scoring potential for the user community.
GNR products are, in their base form, API libraries that can be combined with each other and with other search tools, so architectural considerations are paramount to efficient searching. In collaboration with numerous clients, the answers to several key operational and semantic questions have proven to be benchmarks that distinguish highly successful outcomes.
Transactional versus batch processing
Will name searching take place in an online transaction processing (OLTP) environment where users await search outcomes one by one, or as part of a bulk processing environment where name matching is one facet of a more general extract-transform-load (ETL) processing sequence, where individual search results are not available for review until all processing has been completed? Will both OLTP and ETL modes be needed?
Data list size
What is the total number of names to be placed within the scope of the search system? What is the growth rate of this set of names?
Data list volatility
What rate of change do the names exhibit? That is, what share of all transactions are add, delete, and update transactions?
Name data latency
What is the maximum allowable time between changes in the underlying data stores and corresponding changes in the GNR memory-resident name data lists that are populated from those data stores?
Availability
What is the required level of availability for the search system? What degree of redundancy is required in order to meet this requirement? What is the maximum allowable time for a search system restart, including name preprocessing (if required)?
Transaction rate
For OLTP environments, what are the peak and average search transaction rates that must be met?
Search scope
Will each search operate over the same set of candidate names (fixed search scope) or will some searches operate over distinct or overlapping sets of names (dynamic scope)?
Search criteria
What criteria will be provided in a search? Is a name always present? How often are other, non-name criteria present? Can business rules for searching be used to organize names into mutually exclusive subsets (such as grouping names by state of residence if matching names must be from a specific state)?
Mapping to processor resources
How many CPUs can be dedicated, primarily or exclusively, to GNR-based search processing? What is the clock speed of those CPUs? Are additional CPU resources available for periods of peak demand?
Memory resources
How much internal memory is available for each CPU that is associated with a GNR-based search process?
Transactional versus batch processing
Name searching in an OLTP environment is addressed by IBM NameWorks Embedded Search (NW-ES) and NameHunter Distributed Search (NH-DS).
IBM NameWorks is a set of function calls that include the ability to search for a name in a list of names. The search function can be implemented by having IBM NameWorks make an external call to NH-DS, or by having IBM NameWorks search a list that it has created internally, known as an embedded search.
Either NW-ES or NH-DS can support OLTP search environments. Because NH-DS can be configured so that an indefinite number of search managers and searcher processes are available to service a single search request, NH-DS offers unlimited scalability. NW-ES is typically indicated for OLTP environments with high transaction rates but relatively small search lists (ideally, fewer than 100 million names). NW-ES does not provide any preprocessing utility, although NPP can also be used to generate input name data files that are suitable for use with NH-DS. Unlike NH-DS, NW-ES cannot be operated in unique name mode, so isolation of unique names and creation of cross-linkage tables must be accomplished manually before the unique name list can be input to NW-ES.
In general, OLTP search architectures are recommended for search environments that have more dynamic search conditions, such as when the search scope or user preferences must change often and when the query names are not all known in advance. The OLTP environment is also recommended when the non-name search criteria available to qualify the name matches might vary from one transaction to the next.
In OLTP search architectures, the predominant design guideline is minimizing the number of names that are assigned to each available CPU from the search list. This task is accomplished either by partitioning the list into equal subsets (as performed by NPP or by a manual process of subdivision), or by sorting the names into mutually exclusive subsets, if the search scope can be confined on a per-transaction basis to some small portion of the entire list. For example, divide the list by country of birth if all valid name matches must also match this non-name criterion. Reducing the original large name list to its unique name form helps to achieve the minimal name design principle, provided that a means is preserved (through NPP or through other, external means) to link from each unique name that is searched by GNS back to all original list entries that contain the name.
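Reducing a list to its unique name form while keeping a link back to the original entries can be sketched as follows. This is an illustration of the principle only, not the NPP implementation: the record layout and the way two names are judged identical (trimmed, uppercased surname and given name) are assumptions.

```python
from collections import defaultdict


def build_unique_name_list(records):
    """Collapse records to unique names and keep a cross-linkage table.

    records: iterable of (record_id, surname, given_name) tuples (hypothetical layout).
    Returns (unique_names, links), where links maps each unique
    (surname, given_name) key to the list of original record IDs.
    """
    links = defaultdict(list)
    for record_id, surname, given_name in records:
        # Assumption: names are identical if they match after trimming and uppercasing.
        key = (surname.strip().upper(), given_name.strip().upper())
        links[key].append(record_id)
    return sorted(links), dict(links)
```

Only the unique names need to be searched. The links table is then used to expand each matching name back to every original record that carries it.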
ETL or “batch” style name search processing can also be accomplished in an OLTP architectural framework because all GNR batch processing is implemented as a series of individual search transactions. However, batch processing with NH-DS and NW-ES (especially with NH-DS) requires greater CPU resources in order to process each transaction individually.
The recommended approach for batch-style name searching in IBM InfoSphere Global Name Scoring (GNS) Version 4.1 is to use the dsFile utility. This sample application from the NH-DS application suite is found in the <install>/bin directory. The application accepts bulk transactions for NH-DS as a series of CSV-format records in a file and produces search transaction output as a series of CSV-format records. A brief overview of the dsFile utility is provided in Attachment A.
The dsFile utility is most efficient when the following criteria are met:
· All query names are known in advance
· The search scope is fixed and identical for all query names
· The search strategy for names from each distinct cultural group can be defined in advance
· Examination of the search results can wait for all match processing to complete
The unique name principle can also be applied for batch name searching, because the set of query names can be reduced to reflect only unique full name entries before batch processing is initiated.
The input record format for bulk search requests that are prepared for use with dsFile was modified in the IBM InfoSphere Global Name Recognition Version 4.1 release. The input record must now contain a new field that specifies the entity type (personal=1, organization=2) of the query name. This value is placed in the third field from the left in the CSV-format input record, as indicated in the following example:
S,101,1,BRANSON,RICHARD,1,1
S,102,1,AL SAIHATI,RAMI,2,2
S,103,1,LOPEZ HIDALGO,ALFREDO,4,4
Input files for bulk search requests that were created for dsFile in previous versions must be modified to add this field before dsFile submits a request. Failure to add a value for the entity type causes NH-DS to generate an error message.
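Adding the entity type to older input records can be scripted. The sketch below assumes that an older record is identical to the Version 4.1 record except for the missing third field, and that no field contains an embedded comma; both points should be checked against the actual files before use. The function name is hypothetical.

```python
def add_entity_type(line, entity_type):
    """Insert the entity type as the third field of a comma-delimited record.

    entity_type: 1 for personal, 2 for organization.
    Assumes the record has no quoted or embedded commas.
    """
    if entity_type not in (1, 2):
        raise ValueError("entity type must be 1 (personal) or 2 (organization)")
    fields = line.rstrip("\r\n").split(",")
    fields.insert(2, str(entity_type))  # third field from the left
    return ",".join(fields)


converted = add_entity_type("S,101,BRANSON,RICHARD,1,1", 1)
```

The script cannot tell whether a record has already been converted, so it should be run only once per file.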
Parallelism and efficiency in the name score processing flow
Besides use of unique names during GNR search processing, the most powerful optimization that can be applied when processing large name lists is to make maximum use of parallelism. Achieving maximum parallelism includes minimizing the number of pair-wise comparisons and spreading that work evenly and efficiently over the maximum number of available processors (CPUs).
Both NW-ES and NH-DS can perform searches in federated mode by dividing large name lists into numerous smaller ones that can be searched by a process that operates on a separate CPU. NH-DS can distribute a search across federated CPUs and combine the results across an unlimited number of host platforms, even with a heterogeneous hardware base. NW-ES is limited by the number of processors that are available on a single host platform, and cannot distribute processing segments to other host platforms or recover original records that are associated with unique names.
The form of GNS search parallelism that renders the greatest efficiencies depends on the total size of the memory-resident search list. For small to medium search lists and moderate transaction loads, NW-ES can yield significant performance and throughput advantages over an equivalent search with NH-DS. For searches that involve large to very large search lists and heavy transaction loads, NH-DS offers greater scalability.
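The general shape of a federated search (split the list, search each part independently, merge the results) can be illustrated as follows. Nothing here uses GNR: the scoring function is Python's difflib.SequenceMatcher, used purely as a stand-in for name scoring, and the function names, partition count, and threshold are assumptions. Threads in standard Python do not provide true CPU parallelism for this kind of work, so the sketch shows the structure of the technique rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def partition(names, parts):
    """Split names into `parts` subsets whose sizes differ by at most one."""
    return [names[i::parts] for i in range(parts)]


def search_partition(query, names, threshold):
    """Score every name in one subset; the scorer is a stand-in, not GNR."""
    hits = []
    for name in names:
        score = SequenceMatcher(None, query, name).ratio()
        if score >= threshold:
            hits.append((score, name))
    return hits


def federated_search(query, names, parts=4, threshold=0.8):
    """Search each subset in its own worker and merge the hits, best first."""
    chunks = partition(names, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        per_chunk = pool.map(lambda chunk: search_partition(query, chunk, threshold), chunks)
        merged = [hit for hits in per_chunk for hit in hits]
    return sorted(merged, reverse=True)
```

Because each subset is scored independently, the merge step is a concatenation and a sort. That independence is what allows the work to be spread across processors or hosts.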
Other factors to consider
Unique name preprocessing and efficient parallelism in the search architecture are two central considerations when working with large name lists in a GNR search environment, but these are not the only factors that must be considered.
Significant additional efficiencies can be achieved in a number of ways by taking a somewhat broader view of the more general name search process itself. The following list provides other facets of name searching that should be considered:
Match what to what?
Consider the set of query names and the set of names in the search list. If one set is considerably smaller than the other, then it might prove more efficient to search for each record in the smaller list and make the larger list into the search list.
Name first versus name last search techniques
If search results are typically filtered by non-name criteria (such as date of birth), the most efficient search approach would first qualify match candidates based on these filtering criteria and then use NameHunter to search for matching names in the qualified records. A “name last” strategy is recommended for very large name lists where the name alone is typically not enough to confirm a search result, and where other identity factors are commonly considered in each search. A “name first” strategy entails maximum processing resources for the GNR name match phase, and many names that are qualified as matches by GNR are subsequently disqualified on the basis of non-name factors. A “name-last” approach, by contrast, makes more efficient use of GNR name matching capabilities.
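The “name last” ordering can be sketched as two passes: first qualify records on the non-name criterion, then run the name comparison only on what remains. The record layout (dictionaries with name and dob keys) and the name_match callable are assumptions; in a real deployment the second step would be a NameHunter search over the qualified records.

```python
def name_last_search(query_name, query_dob, records, name_match):
    """Filter on date of birth first, then compare names on the survivors.

    records:    list of dicts with 'name' and 'dob' keys (hypothetical layout).
    name_match: callable(query_name, candidate_name) -> bool, a stand-in
                for the real name matching step.
    """
    qualified = [r for r in records if r["dob"] == query_dob]
    return [r for r in qualified if name_match(query_name, r["name"])]
```

The cost of the name comparison step is then proportional to the number of qualified records rather than to the size of the full list.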
Use the ID field
Associated identity information that is required for the user to conduct manual adjudication can be placed in the ID field of the name record to ensure that it is returned in the search results with the matching name. The ID field for each name in a GNR memory-resident search list can contain up to 255 bytes of data. This information can be extracted by the search client instead of retrieving the ID from the backing data store before it is presented to the user.
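Packing adjudication data into the ID field requires staying within the 255-byte limit. The sketch below assumes a pipe character as the separator inside the ID and UTF-8 as the encoding used to count bytes; neither is specified by GNR, and the function name is hypothetical. Commas are rejected because the ID travels inside comma-delimited records.

```python
MAX_ID_BYTES = 255  # limit stated for the ID field of a memory-resident name


def build_id_field(record_id, *extras, sep="|"):
    """Join a record ID with extra identity values and enforce the byte limit."""
    value = sep.join([record_id, *extras])
    if "," in value:
        raise ValueError("ID field must not contain commas")
    # Assumption: the limit is measured on the UTF-8 encoded form.
    if len(value.encode("utf-8")) > MAX_ID_BYTES:
        raise ValueError("ID field exceeds 255 bytes")
    return value


packed = build_id_field("cust1234", "1970-01-01", "Springfield")
```

The search client can split the returned ID on the same separator to recover the individual values without a round trip to the backing data store.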
Common names and the search “choke-out” problem
Common names can occur thousands of times in large name lists, and can have numerous similar forms that occur within the same list. When a search is conducted against these names, the results can be overwhelming. To help relieve this problem, consult local data profiles of the name list to identify common names before they are submitted in a search request so that the user can provide additional qualifying search criteria. Similarly, consult the statistics in the Name Data Object (NDO) file (through NameParser output) to identify personal names that are observed as common. Either way, it is best to identify these names and take preemptive action to ensure that search results are kept within a reasonable quantity range, especially in an OLTP search environment.
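A local data profile for common names can be as simple as a frequency count over the name list. The sketch assumes full names are available as plain strings and that trimming and uppercasing is an acceptable way to group them. The count threshold is arbitrary and would be tuned per list.

```python
from collections import Counter


def common_names(names, min_count):
    """Return names that occur at least min_count times, with their counts."""
    counts = Counter(name.strip().upper() for name in names)
    return {name: count for name, count in counts.items() if count >= min_count}
```

A search client could consult this table before submitting a query and prompt the user for additional qualifying criteria when the query name appears in it.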
Partial queries: prevent or control?
Some users might have established a habit of basing search requests on partial names (only a given name or only a surname) as a way to broaden the search scope, or to compensate for perceived weakness in the underlying search logic. Such a querying style is unnecessary and counterproductive in a GNR search environment because NameHunter searches accommodate a much wider range of name variations than other name matching approaches. In addition, partial queries against large name lists can have a serious impact on system resources. Adopting a documented policy on controlling partial queries can yield useful efficiencies in the operating environment.
Business names
Omitting standard business markers (INC, CO, LTD, etc.) and noise words (OF, THE, AND) when searching for business names helps to optimize search results.
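Stripping markers and noise words from a business name before it is submitted might look like this. The word lists contain only the examples named above; a production list would be longer. Uppercasing and the removal of trailing periods and commas are assumptions of the sketch.

```python
BUSINESS_MARKERS = {"INC", "CO", "LTD"}
NOISE_WORDS = {"OF", "THE", "AND"}


def simplify_business_name(name):
    """Drop business markers and noise words from an organization name."""
    tokens = [token.strip(".,") for token in name.upper().split()]
    skip = BUSINESS_MARKERS | NOISE_WORDS
    return " ".join(token for token in tokens if token and token not in skip)


query = simplify_business_name("The Bank of Springfield, Inc.")
```

A name that consists only of markers and noise words reduces to an empty string, so callers should fall back to the original name in that case.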
Match-management regime
When the query name list and search list are both stable, establishing a “delta” regime can help to reduce the number of pair-wise comparisons by managing and tracking changes. A baseline matching is completed for both lists so that only the changes in one list are checked against the other list when the change involves a previously unseen name.
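The core of a “delta” regime is identifying which names were not part of the baseline matching. A minimal sketch, assuming names are compared as exact strings and that the baseline set is stored between runs:

```python
def names_to_recheck(baseline_names, current_names):
    """Return names in the current list that were not in the baseline matching."""
    return sorted(set(current_names) - set(baseline_names))
```

Only the names returned here need to be searched against the other list. Once they have been processed, they are added to the baseline.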
Allowlists
Over time, some names that GNR qualified as matches prove to be disqualified on the basis of non-name criteria (such as incorrect date of birth or address). Implementing a post-search “allowlist” for certain query names makes it possible to remove the qualified, but unwanted, matches from GNR search results before they are presented to the user.
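A post-search allowlist can be sketched as a lookup keyed by the query, listing the match IDs that have already been cleared. The data layout (a dictionary from a query identifier to a set of cleared IDs, and matches as dictionaries with an id key) is an assumption made for the example.

```python
def apply_allowlist(query_id, matches, allowlist):
    """Remove matches that were previously cleared for this query.

    allowlist: dict of query identifier -> set of match IDs already ruled out
               on non-name grounds (hypothetical layout).
    """
    cleared = allowlist.get(query_id, set())
    return [m for m in matches if m["id"] not in cleared]
```

Queries with no allowlist entry pass through unchanged.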
Many configuration options exist to help mitigate and reduce the processing effort that is associated with searching large name lists with IBM InfoSphere Global Name Recognition, Version 4.1. While there is no universal formula that can be applied in every operational setting, many techniques exist that can reduce processing time and resource consumption without negatively impacting the quality of search results that are delivered by GNS components.
Attachment A
Overview of dsFile
NameHunter Distributed Search (NH-DS) batch transaction utility
dsFile is a command line program which accepts an input file that describes transactions (queries, adds, etc.), sends messages to NH-DS, and places the output in a text file. Some of the dsFile properties can be controlled by a configuration file.
Configuration
The default configuration file for dsFile is named dsFile.config. The file has the following attributes:
· transFile – Name of the input file of transaction information
· resultFile – Name of the output file
· hostname – Hostname where NH-DS resides
· port – The port number where NH-DS accepts connections
· dataList – Not currently used.
· timeout – The number of seconds to wait for a response from NH-DS before giving up
· sendParmsName – Specifies whether you want dsFile to send name parameters with each query
· sendParmsGn – Specifies whether you want dsFile to send given name parameters with each query
· sendParmsSn – Specifies whether you want dsFile to send surname parameters with each query
The configuration file has entries for NameHunter comparison parameters (CompParms), which are in the same format as those used in the GNS search and why utilities. The GNR CompParms are the control parameters that are sent with each search transaction, as indicated by the sendParmsGn and sendParmsSn configuration entries. See the IBM InfoSphere Global Name Recognition, Version 4.1 product documentation for more information about the configuration files that are used with various command line utilities.
Transaction file
The input file for dsFile is a comma-delimited text file where the first field indicates the type of transaction. Subsequent fields depend on the transaction type.
The following transaction types are currently supported in dsFile:
Add (A) – Submits an add message to NH-DS with a transaction type of “A”. A sample Add message adheres to the following format:
A,100,1,finkelmyer, isiah,id1234,0,0,N
Where each field is defined as:
· A – transaction type of Add
· 100 – transaction ID, which can be any positive integer
· Entity Type – 1 for personal, 2 for organization
· finkelmyer – surname
· isiah – given name
· id1234 – customer ID
· 0 – surname culture (from the numeric culture classifications that are supported by NameHunter)
· 0 – given name culture
· N – alternate parse, where yes=Y, no=N
If the Add request succeeds, a confirmation message is sent to the console.
Delete (D) – Submits a delete message to NH-DS with a transaction type of “D”. A sample Delete message adheres to the following format:
D,3,2024
Where each field is defined as:
· D – transaction type of Delete
· 3 – transaction ID, which can be any positive integer
· 2024 – the customer ID to be deleted, as previously associated with this name when added to the name list
If the Delete request succeeds, a confirmation message is sent to the console.
Original Data (G) – Submits an original name data request to NH-DS when operating in unique name search mode. This transaction recovers the list of record IDs that are associated with all records that have the same name, as identified by the NameHunter ID that is automatically assigned by NamePreprocessor (NPP). The transaction type is “G”, and a sample message adheres to the following format:
G,3,3456
Where each field is defined as:
· G – transaction type of Original Data
· 3 – transaction ID which can be any positive integer
· 3456 – the NameHunter ID of the record for which you want the original record(s)
If the transaction succeeds, results are written to the specified results file, which contains the following comma-separated fields:
· transaction type – G
· transaction ID – the transaction ID supplied with the request
· ID – the NameHunter ID supplied with the request
· surname – surname that is supplied with the query
· given name – given name that is supplied with the query
· ID – customer ID returned by NH-DS
· surname culture – culture classification for the surname that is returned by NH-DS
· given name culture – culture classification for the given name that is returned by NH-DS
· isAltParse – indicates whether the name is an alternate name parse (0=NO, 1=YES)
· isReg – indicates whether the name is a regularized name (0=NO, 1=YES)
Search (S) – Submits a search message to NH-DS with a transaction type of “S”. A sample Search message adheres to the following format:
S,100,1,finkelmyer, isaac,1,1
Where each field is defined as:
· S – transaction type of Search
· 100 – transaction ID, which can be any number
· Entity Type – 1 for personal, 2 for organization
· finkelmyer - surname
· isaac - given name
· 1 - surname culture classification
· 1 - given name culture classification
If the Search request succeeds, search results are written to a results file as a set of records, each of which contains the following comma-separated fields:
· transaction type – S
· transaction ID – the transaction ID supplied with the request
· surname – surname that is supplied with the query
· given name – given name that is supplied with the query
· surname culture – culture classification for the surname that is supplied with the query
· given name culture – culture classification for the given name that is supplied with the query
· surname – surname that is returned by NH-DS
· given name – given name that is returned by NH-DS
· ID – the NameHunter ID returned by NH-DS
· surname culture – culture classification for the surname that is returned by NH-DS
· given name culture – culture classification for the given name that is returned by NH-DS
· name score – strength of match between the full query name and the full candidate match name
· surname score – strength of match between the surname portions of the query name and the candidate match name
· given name score – strength of match between the given name portions of the query name and the candidate match name
· name count – number of original names that are associated with this name (if operating in unique name search mode)
· isAltParse – indicates whether the name is an alternate name parse (0=NO, 1=YES)
· isReg – indicates whether the name is a regularized name (0=NO, 1=YES)
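Reading the search output into named fields makes downstream filtering easier. The sketch below follows the field order listed above; the Python field names are invented for readability, and the sample record is fabricated for illustration rather than taken from real NH-DS output. Values are kept as strings because the score scale is not stated here.

```python
import csv
from collections import namedtuple

# Field names are hypothetical; the order follows the list of output fields.
SearchResult = namedtuple("SearchResult", [
    "transaction_type", "transaction_id",
    "query_surname", "query_given_name",
    "query_surname_culture", "query_given_name_culture",
    "match_surname", "match_given_name", "match_id",
    "match_surname_culture", "match_given_name_culture",
    "name_score", "surname_score", "given_name_score",
    "name_count", "is_alt_parse", "is_reg",
])


def parse_search_results(lines):
    """Parse search result records; rows of other transaction types are skipped."""
    results = []
    for row in csv.reader(lines):
        if not row or row[0].strip() != "S":
            continue
        if len(row) != len(SearchResult._fields):
            raise ValueError("unexpected field count: %d" % len(row))
        results.append(SearchResult(*[field.strip() for field in row]))
    return results


# Fabricated sample record, for illustration only.
sample = ["S,100,finkelmyer,isaac,1,1,FINKELMEYER,ISAAC,3456,1,1,95,93,100,2,0,0"]
parsed = parse_search_results(sample)
```

With records in this form, reparse-generated or regularized matches can be selected by checking is_alt_parse and is_reg.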
Shutdown (X) - Submits a Shutdown request to NH-DS, and is indicated by a transaction type of “X”. A sample Shutdown request adheres to the following format:
X,99
Where each field is defined as:
· X – transaction type of Shutdown
· 99 – transaction ID, which can be any positive integer
If the Shutdown request succeeds, a NH-DS status message is sent to the console.
Status (T) - Submits a status request to NH-DS, and is indicated by a transaction type of “T”. A sample Status request adheres to the following format:
T,4
Where each field is defined as follows:
· T – transaction type of Status
· 4 – transaction ID, which can be any number
If the Status request succeeds, a NH-DS status message is sent to the console.
Update (U) – Submits an update message to NH-DS, and is indicated by a transaction type of “U”. A sample Update message adheres to the following format:
U,70,1,Abelo,Jaime,new1234,cust1234,8,8,N
Where each field is defined as follows:
· U – transaction type of Update
· 70 – transaction ID, which can be any positive integer
· Entity Type – 1 for personal, 2 for organization
· Abelo – surname to replace existing surname in updated record
· Jaime – given name to replace existing given name in updated record
· new1234 – new customer ID
· cust1234 – customer ID of the record(s) to update
· 8 – new surname culture classification
· 8 – new given name culture classification
· N – specify that the updated record is not an alternate parse
If the Update request succeeds, a confirmation message is sent to the console.
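Transaction records in the formats described above can be generated programmatically when building an input file. The helper names below are hypothetical, and the sketch assumes that fields never contain commas, since no quoting rules are described for the transaction file.

```python
def _join(fields):
    """Join fields into one comma-delimited record, rejecting embedded commas."""
    text = [str(field) for field in fields]
    if any("," in field for field in text):
        raise ValueError("fields must not contain commas")
    return ",".join(text)


def add_record(tx_id, entity_type, surname, given_name, customer_id,
               surname_culture=0, given_name_culture=0, alt_parse=False):
    """Build an Add (A) transaction record."""
    return _join(["A", tx_id, entity_type, surname, given_name, customer_id,
                  surname_culture, given_name_culture, "Y" if alt_parse else "N"])


def search_record(tx_id, entity_type, surname, given_name,
                  surname_culture, given_name_culture):
    """Build a Search (S) transaction record."""
    return _join(["S", tx_id, entity_type, surname, given_name,
                  surname_culture, given_name_culture])


def delete_record(tx_id, customer_id):
    """Build a Delete (D) transaction record."""
    return _join(["D", tx_id, customer_id])
```

For example, add_record(100, 1, "finkelmyer", "isiah", "id1234") produces a record with the same fields as the sample Add message, without the space after the comma.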
Execution of the dsFile Utility
To invoke the application, type the following command at a command prompt:
dsFile
A configuration file can optionally be specified. If no argument is supplied, the utility looks for a file named dsFile.config in the current directory.
Original Publication Date
10 April 2009
Document Information
Modified date:
20 April 2022
UID
swg27015521