IBM Support

Guidelines for term assignment in automated discovery and quick scan

How To


Summary

A key element of a successful information governance strategy is proper association of data assets such as tables, columns, and files with business vocabulary stored in the governance catalog. The automatic term assignment capability of Watson Knowledge Catalog supports users who are assigning terms to assets by suggesting and creating assignments automatically through a collection of term assignment services.

Objective

This document offers hints and guidelines about how to proactively maximize the value and leverage of automatic term assignment as part of automated discovery and quick scan in Watson Knowledge Catalog for Cloud Pak for Data. The information applies to these product versions:
  • For quick scan: product version 3.0.1 to product version 4.0
  • For automated discovery: product version 3.0.1 to product version 4.7
Equivalent information for term assignment in the context of metadata enrichment is available in the IBM Cloud Pak for Data documentation.

The purpose of automatic term assignment is to support the assignment of terms to data assets which, when done manually, is a complex and time consuming task. By suggesting assignment candidates, automatic term assignment offers users a list of recommended terms to select from rather than leaving the task of browsing and searching for appropriate candidates completely to them. Even if the list does not contain the perfect match it usually provides good entry points for manual investigation.

Typographic convention: Terms and data classes are shown in italic, names of data assets in UPPERCASE.

Automatic term assignment is a subtask of column analysis and an optional subtask of metadata discovery or quick scan. There are three services contributing to this task: Linguistic name matching, Class-based term assignment, and Machine-learning (ML)-based term assignment. Each of these services creates a list of term candidates for a given data asset. These lists are consolidated into a single one by a special supervisor service. The decision whether a service should add a term to the list is based on criteria such as the following:

  • A match between the name of an asset and term-related information in the catalog. An example would be a column CRM that matches the abbreviation of a term Customer Relationship Management. 

  • Sufficient similarity between metadata of the asset and information available in the catalog. Both linguistic name matching as well as ML-based matching take such similarity into account. An example would be a table CUSTADDR representing a Customer’s address. A term Customer Address would be in the list of candidates for CUSTADDR though CUSTADDR would not usually be viewed as a common abbreviation of this term and would not be represented as such in the catalog. 

  • Data of a certain class that is associated with a term, or data with a distribution that is similar to the data distribution of another data asset that already has a term assigned. An example would be a column holding telephone numbers identified by a data class PhoneNumber that is associated with the term Telephone Number. If another column in a different table has a similar distribution of values and already has the term Customer Phone Number assigned this makes Customer Phone Number an even better match and a more specific candidate for this column. Both class-based and ML-based assignment are capable of finding candidates based on data stored in a column or field. However, this capability is not available for ML-based assignment in quick scan. 

In some cases, term assignment can be applied to data assets right away. However, better results can usually be obtained if some thought is spent upfront on how to apply term assignment algorithms in the most efficient way. This can be achieved by setting customized term assignment parameters.

The following information applies to all exercises:

Up to version 4.0.2, any credentials of a user with administrative permissions can be used to run the commands. Starting with version 4.0.3, the commands must be run as the isadmin user.

Also, the oc commands assume that the current namespace is the namespace of the Watson Knowledge Catalog service (usually wkc). You can check the namespace by running the oc project command. To set the current namespace to wkc, run the command oc project wkc.

Exercise 1

The steps below set or modify project-specific term assignment parameters on a Watson Knowledge Catalog system. Project-specific term assignment parameters are not available for quick scan. To set term assignment parameters globally, omit the -projectName parameter. Project-specific parameter settings always take precedence over global settings.

Step 1

Version 4.0.2 and earlier: Store the name of the iis-services pod in a variable by running the following command:
 
IISPOD=$(oc get pod -l app=iis-services -o name | grep iis-services | awk -F/ '{print $2}')

Starting with version 4.0.3, store the name of the iis-services pod and the password of the isadmin user in variables by running the following commands:

IISPOD=$(oc get pod -l app=iis-services -o name | grep iis-services | awk -F/ '{print $2}'); echo $IISPOD
ISADMIN_PASSWORD=$(oc exec $IISPOD -- env | grep ISADMIN_PASSWORD | awk -F= '{print $2}'); echo $ISADMIN_PASSWORD

Step 2

Export project-specific term assignment parameters (not available for quick scan), in this example for the project MyProject, to a file.

Version 4.0.2 and earlier:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh \
 -user 'admin' -password 'password' \
 -projectName MyProject -getODFParams >/tmp/ta-params.json

Starting with Version 4.0.3:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh \
 -user isadmin -password $ISADMIN_PASSWORD \
 -projectName MyProject -getODFParams >/tmp/ta-params.json

Step 3

Open /tmp/ta-params.json in an editor and adjust parameter values as needed. See the references section at the end of this document for a link to an example JSON structure.
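As an alternative to an interactive editor, a single parameter can be changed in place with sed. The sketch below assumes the default values shown in the appendix example; adjust the old and new values as needed:

```shell
# Sketch: tighten the assignment threshold in place with sed instead of an
# interactive editor. Assumes the value format of the appendix example JSON.
set_threshold() {  # $1 = file, $2 = old value, $3 = new value
  sed -i.bak "s/\"assignmentThreshold\": $2/\"assignmentThreshold\": $3/" "$1"
}
# usage (after exporting the file in step 2):
#   set_threshold /tmp/ta-params.json 0.8 0.9
```

The `-i.bak` option keeps a backup of the original file, which is useful if an edit needs to be rolled back before step 4.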

Step 4

Copy the updated file to the iis-services pod:

oc cp /tmp/ta-params.json $IISPOD:/tmp/ta-params.json

Step 5

Update project-specific term assignment parameters (not available for quick scan), in this example for the project MyProject.

Version 4.0.2 and earlier:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh \
 -user 'admin' -password 'password' \
 -projectName MyProject -setODFParams \
 -content /tmp/ta-params.json

Starting with Version 4.0.3:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh \
 -user isadmin -password $ISADMIN_PASSWORD \
 -projectName MyProject -setODFParams \
 -content /tmp/ta-params.json

Step 6

Verify that parameters are indeed updated, in this example for the project MyProject.

Version 4.0.2 and earlier:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh \
 -user 'admin' -password 'password' \
 -projectName MyProject -getODFParams

Starting with Version 4.0.3:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh \
 -user isadmin -password $ISADMIN_PASSWORD \
 -projectName MyProject -getODFParams
 

The following hints offer some guidelines on how term assignment can be adjusted to specific properties of the domain at issue. Unless explicitly excluded, these hints apply equally to column analysis, metadata discovery, and quick scan.

Hint 1

When starting a discovery project, we recommend spending some time upfront on the selection of data sources and data assets to analyze. Create a temporary sandbox project and identify a representative subset of the data sources and/or assets of interest to better understand the nature of the available data and metadata. Run a discovery on these data sources and take your time to inspect profiling, data classification, and term assignment results prior to analyzing the complete set of assets.

Questions to be asked are:

  • Do asset names provide hints about which terms they can be mapped to?
  • Are there some systematic patterns that need consideration?
  • Does the asset metadata include descriptions?
  • Are names and/or descriptions in English or in another language?

Linguistic name matching relies on a certain level of similarity between asset names and terms. Asset names might be comprehensive or more or less technical. They might follow certain naming conventions, might have been invented by a genius author, or even generated by some tool. Linguistic name matching provides configuration options to deal with patterns (replacements) or with technical elements that don’t contribute to the meaning of an asset (wordsToIgnore). If a large number of asset names follow a certain pattern (COL1, COL2, ….) it can be helpful to adjust result limits and/or confidence thresholds to prevent too many candidates from being proposed.

Example: A term Name would match all occurrences of columns NAME_X for some numeric value X. If there are many such columns the list of suggested terms can get long and unwieldy. Restricting soft and hard limits helps to keep the list within a manageable size in cases like this.
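Settings like these could be captured in the parameters file from exercise 1. The limit and threshold parameters below appear in the appendix example; the placement and value syntax shown for wordsToIgnore and replacements, as well as the words and replacements themselves, are assumptions for illustration only:

```json
{
  "capabilities": {
    "com.ibm.iis.odf.capabilities.TermAssignment": {
      "softLimit": 10,
      "hardLimit": 50,
      "candidateThreshold": 0.6
    }
  },
  "services": {
    "com.ibm.iis.odf.services.termclassification.matching.bg.MatcherDiscoveryService": {
      "wordsToIgnore": "TBL,TMP,STG",
      "replacements": "NR=Number,ADDR=Address"
    }
  }
}
```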

Hint 2

Reserve time to create data classes in support of automatic term assignment to obtain higher quality results from class-based term assignment. Identify business standards and conventions that define the structure of certain classes of data such as identifiers or labels and implement custom data classifiers to identify these. The preferred way of doing this is to create data classes based on reference data sets. Other options are for example regular expressions or lists of values. Link these data classes to the respective terms such that automatic term assignment can assign the appropriate terms when data assets are identified by these classifiers.

Example: A company that represents product codes in the format XXX-XXX-YY where X represents a digit and Y a lowercase character might define a custom data class ProductCode based on a regular expression \d\d\d-\d\d\d-[a-z][a-z] and link it to the term Product Code. When run, term assignment will automatically assign the term Product Code to columns if a sufficient amount of the data matches this pattern.
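Such a pattern can be tried out locally against sample values before defining the data class. Note that \d is Perl regular expression syntax; the equivalent extended regular expression for grep writes digits as [0-9]:

```shell
# Try the ProductCode pattern from the example against sample values.
# grep's ERE dialect has no \d, so digits are written as [0-9].
pattern='^[0-9]{3}-[0-9]{3}-[a-z][a-z]$'
printf '%s\n' '123-456-ab' '123-456-AB' '12-345-ab' | grep -E "$pattern"
# Only 123-456-ab matches; the other values violate the case or digit rules.
```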

Hint 3

Select the term scope carefully to avoid generating candidates outside the area of interest, e.g. when dealing with data of a certain business domain restrict the scope of automatic term assignment to the categories of interest within this domain. If set appropriately, this provides a more focused list of term candidates for the scope of interest and reduces effort for manual post processing that would otherwise be needed to resolve ambiguities. Note that once set the scope applies globally for discovery and quick scan. For discovery the scope can be set at the project level by including the project name as a parameter of the IAAdmin command as shown in exercise 1.

Example: Ambiguities might result from terms that have different meanings depending on the category with which they are associated. For example, when focusing on the assignment of terms in the context of customer relationship management, we recommend restricting the scope to the relevant categories for the respective domain. This ensures that candidates for columns ADDRESS, PHONE_NR, ... are properly focused on Customer Address, Customer Phone Number, … instead of including terms from the human resources domain such as Employee Address, Employee Phone Number, …, which would likely be returned as candidates if the scope were not adjusted.

The scope is defined by a set of categories stored as value of the categoryScope parameter. Categories are represented by their unique ID in XMeta. Subcategories don't need to be listed explicitly. They are taken into account automatically.

Exercise 2

To set the category scope to a list of categories CAT1, CAT2, CAT3 perform the following steps:

Step 1

Export category information to a file:

oc exec is-en-conductor-0 -- \
 /opt/IBM/InformationServer/Clients/istools/cli/istool.sh \
 glossary export \
 -username isadmin -password $ISADMIN_PASSWORD \
 -format "xml" -filename "/tmp/category-export.xml" \
 -cat "CAT1,CAT2,CAT3" 

Step 2

Extract the unique IDs from the file:

oc exec is-en-conductor-0 -c iis-en-conductor -- \
 xmllint --xpath "//*[local-name()='category']/@rid" /tmp/category-export.xml | \
 sed -e "s/\"/'/g;s/ rid=/,/g"


 

The result of the above command might look like:

,'6662c0f2.ee6a64fe.03dn6ag3o.593u4os.6skcjt.oel8qvh8bqbimu5h8uhjc',
'6662c0f2.ee6a64fe.03dn6ag3n.nk2g1s4.08gghl.lt8eliljtd6m3921t6mkc',
'6662c0f2.ee6a64fe.03dn6ag3o.a0kb2gr.cppdeh.0ejhdbhrjhjlt4p598s9s'

Step 3

Delete the initial comma from the result and enclose the list in square brackets:

['6662c0f2.ee6a64fe.03dn6ag3o.593u4os.6skcjt.oel8qvh8bqbimu5h8uhjc',
'6662c0f2.ee6a64fe.03dn6ag3n.nk2g1s4.08gghl.lt8eliljtd6m3921t6mkc',
'6662c0f2.ee6a64fe.03dn6ag3o.a0kb2gr.cppdeh.0ejhdbhrjhjlt4p598s9s']

Store this as the value (all in one line) of the categoryScope parameter in /tmp/ta-params.json as described in step 2 of exercise 1.
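Steps 2 and 3 can also be scripted. The sketch below assumes the output of the xmllint command from step 2 was redirected to a local file:

```shell
# Turn the extracted RID list into the bracketed categoryScope value
# (assumes the xmllint output from step 2 was saved to a file).
make_scope() {  # $1 = file containing the extracted RID list
  ids=$(tr -d '\n' < "$1")   # join the lines
  echo "[${ids#,}]"          # drop the leading comma, add brackets
}
# usage: make_scope /tmp/ids.txt
```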

 
 

Hint 4

Adjusting service specific settings to peculiarities of the domain at issue can help to improve the quality of results. To understand the purpose of these settings we will first outline how term assignment services identify term candidates:

4.1 Linguistic name matching

Linguistic name matching relies on a set of metrics that assess the similarity of a data asset’s metadata with metadata of the term and maps the overall similarity measure onto a confidence value between 0 and 1. The primary metadata used is the name of the term and the name of the asset, secondary metadata are the abbreviations that might optionally be associated with terms.

If the name of a data asset matches a term or one of its abbreviations, the confidence value of the corresponding candidate is 1 independent of case and independent of periods in the abbreviation. Matches are added to the result list if their confidence exceeds a minimum confidence value defined by the configuration parameters.

Though not universally true, asset names tend to be shorter and more 'condensed' than term names, either due to technical restrictions of the data source (e.g. length limits) or because their technical authors prefer short, terse names. Asset names tend to consist of a single word or a small number of words that in some cases are concatenated by special characters such as an underscore, or joined based on conventions such as camel case. The words themselves might be abbreviated, truncated, or otherwise condensed, for example by leaving out vowels in alphabetic languages.

Terms, on the other hand, tend to be much more comprehensive and self-contained, as their purpose is to represent business vocabulary. Oftentimes they consist of two or more words, usually separated by blanks. In larger glossaries, terms consisting of 5 to 10 words are not uncommon.

Linguistic name matching accounts for these effects by segmenting asset names into smaller pieces (tokens) according to the type of character (simplified):

  • Sequences of alphabetic characters with same case or initial uppercase followed by lowercase characters (names) 

  • Sequences of special characters (including blanks) 

  • Sequences of digits 

The overall score is calculated by comparing these tokens to term constituents, computing similarity metrics, and combining these into an overall confidence score based on a weighted sum. If the algorithm parameter contains the setting +vow (see the appendix), this calculation is done in a second cycle without taking vowels into account. The final confidence score is the maximum of both cycles.
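The segmentation described above can be sketched roughly as follows. This is an illustrative simplification, not the actual service implementation:

```shell
# Simplified token segmentation: split on case changes, letter/digit
# boundaries, and runs of special characters (not the actual service code).
tokenize() {
  echo "$1" | sed -E \
    -e 's/([a-z])([A-Z])/\1 \2/g' \
    -e 's/([A-Za-z])([0-9])/\1 \2/g' \
    -e 's/([0-9])([A-Za-z])/\1 \2/g' \
    -e 's/[^A-Za-z0-9]+/ /g'
}
tokenize 'custAddr_2021'   # -> cust Addr 2021
```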

In a nutshell, the confidence score represents the number of tokens that have matching characters at neighboring positions, weighting matches at initial positions higher than those towards the end of a token, and including a penalty for characters that don't have a match at neighboring positions.

Keeping this in mind helps to understand why running preparatory experiments prior to applying term assignment on a larger amount of data is useful. If parameters are not adjusted, the lists of candidates returned by services might not be as elaborate as if some time has been spent upfront to better understand the nature of the metadata and to adjust parameters accordingly.

Example: The default thresholds for candidates and assignments might be too low to filter out candidates if asset names follow certain patterns that make them match a large number of terms. On the other hand, terms that match an asset name towards its end rather than at its beginning might have an unexpectedly low confidence, as linguistic name matching tends to prefer matches at the beginning of tokens. In this case, thresholds might need to be lowered so that such candidates still make it to the result list.

4.2 Class-based term assignment

Class-based term assignment determines the list of candidates based on data classes that are linked with terms. Data profiling identifies data classes which characterize the data of a column or field based on all or a subset of the data stored therein. IBM Watson Knowledge Catalog comes with a large number of predefined data classes. In addition, custom data classes can be defined based on reference data sets, regular expressions or other means. Defining custom data classes based on a deployed Java class provides a powerful mechanism that can even classify assets at the data set level. Follow the link to the documentation chapter on ‘Creating data classes’ shown at the end of this document for examples of custom data classes.

Example: If a column contains airport codes, profiling will likely associate it with the data class AirportCode and a high confidence such as 0.9. If the data class AirportCode is in turn linked to a term Airport Code automatic term assignment will assign the term Airport Code with the same confidence as the associated data class (0.9).

Class-based term assignment is the preferred mechanism whenever the type and format of data is well defined and a good indicator of the business meaning of the column or field containing it. Linking data classes to business terms and creating custom data classes is a key step to ensure high-quality term assignment results. The time spent on this upfront pays off in the quality of returned candidates when running term assignment on data assets with highly structured and well-organized data.

4.3 Machine-learning based term assignment

ML-based term assignment uses a mix of supervised and unsupervised learning to identify a list of candidate terms for a given asset. While the number of assigned terms is zero or low it relies on the similarity between asset metadata such as name, data source hierarchy, and description and term metadata such as name, description, and category for proposing candidates (unsupervised learning). The service utilizes RAKE (Rapid Automatic Keyword Extraction) for language-aware preprocessing and a compact vector representation of the relevant features for optimized training and prediction. The vector representation is based on the TF-IDF metric (term frequency-inverse document frequency) of the relevant keyword features.

As the number of assigned terms increases, ML-based term assignment takes key characteristics of these assignments into account to improve its model (supervised learning). This enables ML-based term assignment to propose new candidates based on the similarity of new data assets with existing assets that already have terms assigned. Terms manually assigned in the governance catalog as well as terms assigned by publishing project content to a catalog automatically trigger model re-training to ensure that proposed candidates take recent assignments into account. It is therefore recommended to select a few representative data assets with characteristic features and manually assign the proper terms to them. These will be added to the model when re-training, leading to better recommendations for data assets with similar features. For example, if the term Address is manually assigned to an asset ADDR_1, the term Address will likely be returned as a candidate for assets like ADDR_2, …, ADDR_x as well. Similarly, if the term Technical Person is assigned to the asset TECH_PRSN, ML-based term assignment will propose the term Sales Person for the asset SALES_PRSN with a higher confidence score.

The confidence score of candidates is based on the cosine similarity between the vector representations of the features derived from the metadata.
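The cosine similarity calculation can be illustrated with a small awk sketch. The vectors here are toy values; the real feature vectors come from the TF-IDF preprocessing, not hand-typed numbers:

```shell
# Cosine similarity of two comma-separated vectors (toy illustration of the
# confidence calculation; real features are TF-IDF vectors).
cosine() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); split(b, y, ",")
    for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]^2; nb += y[i]^2 }
    printf "%.3f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}
cosine '1,0,2' '2,0,4'   # same direction -> 1.000
cosine '1,0'   '0,1'     # orthogonal -> 0.000
```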

It is important to note that ML-based term assignment learns from term assignments that are stored in the public catalog. Term assignments stored temporarily in a project do not impact the model.

The configuration value retrainThreshold can be used to prevent ML-based term assignment from being busy with model re-training if assignments are created frequently. Setting this to a value higher than the default of 1 prevents the service from re-training the model on each public assignment, at the price that the model might not reflect the most recent assignments at any given point in time.

An important configuration parameter for ML-based term assignment is the selected language as this allows the preprocessing to adjust its keyword extraction to specifics of the language at issue (parameter language). See the section on 'Customizing term assignment parameters' in the IBM Information Server documentation for a list of supported languages.

Rejected terms

Users who review the list of candidate terms can mark candidates explicitly as ‘rejected’ to indicate that they do not fit the asset at issue. This is a means to report negative feedback to automatic term assignment services. It prevents terms from being returned again when re-running automatic term assignment. This feature is available for metadata discovery and analysis but not for quick scan.

It is important to note that terms rejected in the scope of a project only impact the results of linguistic name matching and class-based term assignment. Since ML-based term assignment only trains its model on published term assignments rejected terms have no impact on the model unless they are published.

The columns detail view of the data quality tab has six groups of entries that are related to term assignment. No term can appear in more than one of these groups. To the left it shows the published assigned and rejected terms in the groups 'Assigned terms' and 'Rejected terms'. To the right it shows the current state of the project in the groups 'New assigned terms', 'Suggested terms', 'New rejected terms', and 'Published terms to be rejected'.

While the two groups in the left column (Published terms) show the current state of what is published, the four groups on the right (Changes to be published) show the current state of the project and indicate what will happen to term assignments if the results of this asset are published.

Terms are represented by buttons that contain the term name, an optional confidence value, and, if in edit mode, a deletion icon and an optional assignment icon. The buttons are colored blue (assigned terms), gray (suggested terms), or red (rejected terms). The presence of a confidence value indicates that the term has been assigned or suggested by an automatic term assignment service. Buttons that have no confidence value represent terms that are created manually by a user.

image-20200807104056-1

Manually assigned term selected from the governance catalog.

image-20200807104056-2

Assigned term resulting from a service or services. Might have been assigned automatically (confidence exceeding assignment threshold) or manually from the list of suggested terms.

image-20200807104056-3

Suggested term.

image-20200807104056-4

Rejected term.

 
 

Table 1: Examples of buttons representing terms (in edit mode)

Hovering over a button shows the chain of categories from the category to which the term belongs to the top of the category hierarchy (Category path) and the owner of the assignment (Assigned By). Term candidates are assigned automatically if the assignAutomatically flag of the term assignment configuration is set to true (discovery only) and if the confidence of the candidate matches or exceeds the value of the assignmentThreshold parameter. Automatically created assignments are owned by the service or services that returned the candidate. Suggested terms are always owned by one or more term assignment services and always have a confidence.

In edit mode, one or two icons are shown at the right end of the buttons. Only grey buttons, representing suggested terms, show an assignment icon (checkmark). All three types of buttons show a deletion icon (cross). When clicked, they trigger the following state changes:

  • Clicking the assignment icon of a ‘Suggested term’ moves it to the ‘New assigned terms’ section and changes its ownership to the ID of the user who assigned the term. 

  • Clicking the deletion icon of a ‘New assigned term’ or ‘Suggested term’ moves it to the ‘New rejected terms’ section and changes its ownership to the ID of the user who rejected the term.

However, there is one special case to be aware of: If a ‘New assigned term’ was created by rejecting an ‘Assigned term’ (published), clicking the deletion icon restores its original state by moving it back to the ‘Assigned terms’ section (published).

  • Clicking the deletion icon of an ‘Assigned term’ (published) moves it to the ‘Published terms to be rejected’ section indicating that it will reject a public term when published. 

  • Clicking the deletion icon of a ‘New rejected term’ moves it to the ‘New assigned terms’ section if it has a confidence indicating that it was originally created by automatic term assignment. If the rejected term was created manually, clicking the deletion icon removes it. This enables users to delete a term that was created accidentally by rejecting it first and then deleting the rejected term. 

  • Clicking the deletion icon of a ‘Rejected term’ (published) moves it to the ‘New assigned terms’ section indicating that it will assign a new term when published. 

  • Clicking the deletion icon of a ‘Published term to be rejected’ undoes the rejection by moving it back to the ‘Assigned terms’ section (published). 

It is important to note that suggested terms cannot be removed. A user can leave them in the 'Suggested terms' section if (s)he does not care about them being assigned or rejected. Once assigned or rejected, a term cannot be put back into the 'Suggested terms' state.

Hint 5

It is a good practice to carefully review terms in the groups 'New assigned terms' and 'New rejected terms' prior to publishing the analysis results of an asset. This applies particularly to terms in the group 'Published terms to be rejected', since publishing them removes public assignments that might already have influenced other assignments through ML-based term assignment. Furthermore, it is important to note that publishing terms in this group might undo public assignments created by a different user.

This is what happens when publishing term assignments of an asset:

Members of ‘New assigned terms’

  • Move to ‘Assigned terms’ (published). 

Members of ‘Suggested terms’

  • Are not impacted 

Members of ‘New rejected terms’ or ‘Published terms to be rejected’

  • Move to ‘Rejected terms’ (published). 

Note that deleting a published term assignment directly from the catalog rather than from the data quality tab does not reject it but removes it completely, no matter how and by whom it was created.

Hint 6

We recommend making yourself familiar with the different states and the impact of assignment and deletion actions on these states prior to running such tasks in production. The UI has been created to support productive use of manual tasks by minimizing redundancy and focusing on the impact of a publish action. Some state transitions might not seem intuitive at first glance but triggering state changes and observing how this is reflected on the UI should help users to become familiar with it.

Disabling Term Assignment Services

There can be situations where it is useful to disable certain term assignment services. An example would be disabling linguistic name matching if the metadata is Chinese. The steps below show how to define the active list of term assignment services for analysis and metadata discovery by setting a global configuration parameter on the IIS services pod:

Exercise 3

Term assignment services are represented by their unique IDs:

Linguistic name matching: ibm.iis.odf.services.termclassification.matching.bg.MatcherDiscoveryService
Class-based term assignment: com.ibm.iis.odf.iisext.services.cbta.ClassBasedTermAssignmentService
ML-based term assignment: com.ibm.iis.odf.services.termassignment.finley.FinleyPredictorService

The sequence of term assignments services to be run is defined by the global configuration parameter

com.ibm.iis.ug11_7.odfservices.ta

To define a custom sequence of services, set this parameter to a comma-separated list of service IDs for the services to be run (see exercise 1 for how to set the variable IISPOD):

Step 1

To disable the linguistic name matching service, set the parameter as follows:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/iisAdmin.sh \
 -set -key com.ibm.iis.ug11_7.odfservices.ta \
 -value com.ibm.iis.odf.iisext.services.cbta.ClassBasedTermAssignmentService,com.ibm.iis.odf.services.termassignment.finley.FinleyPredictorService

Step 2

To verify that the setting was updated as expected, run:

oc exec $IISPOD -- \
 /opt/IBM/InformationServer/ASBServer/bin/iisAdmin.sh \
 -display \
 -key com.ibm.iis.ug11_7.odfservices.ta | \
 grep com.ibm.iis.ug11_7.odfservices.ta | \
 awk -F= '{print $2}'

The next time analysis or metadata discovery is run term candidates will only be proposed by class- and ML-based term assignment. This mechanism does not apply to quick scan.

 

Summary

Automatic term assignment is a powerful mechanism that can save governance staff a lot of time when creating and annotating metadata. It relies on term assignment services returning lists of representative term candidates based on different algorithms. Understanding the nature of data assets and their metadata is key. Spending time upfront on understanding samples of the data and metadata at hand as well as creating custom data classes and tuning service parameters prior to running automatic term assignment in production is an effort well-spent that usually results in higher quality results.

Understanding term assignment states, how they are represented and the impact of assignment and deletion actions helps users to manage their curation tasks using the term assignment details view more efficiently.

Appendix

Example of a parameter JSON:
{
  "capabilities": {
     "com.ibm.iis.odf.capabilities.TermAssignment": {
        "softLimit": 0,      
        "hardLimit": 100,
        "candidateThreshold": 0.7,
        "assignmentThreshold": 0.9,
        "assignAutomatically": true,
        "categoryScope": "'6662c0f2.ee6a64fe.kbn609t6l.3qiinrj.0ocdm1.91tk1pl6386rviv3l23f4','6662c0f2.ee6a64fe.kbn60ao6v.gr8ei58.g1d39q.a8u5uqf77g7f6nb4tani5'"
    }
  },
  "services": {
     "com.ibm.iis.odf.iisext.services.cbta.ClassBasedTermAssignmentService": {
        "sandboxCategory": "Terms representing classes (need review)",
        "classesWithNoBusinessMeaning": "C,I,D,T,X,Q,BOOL"
     },
     "com.ibm.iis.odf.services.termclassification.matching.bg.MatcherDiscoveryService": {
        "algorithm":"+vow"
     },
     "com.ibm.iis.odf.services.termassignment.finley.FinleyPredictorService": {
        "sandboxCategory":"Recommended domain terms (need review)",
        "maxWaitForConnection":30,
        "maxWaitForData":10,
        "retrainThreshold":1,
        "logLevel": "INFO",
        "logSize": 5242880,
        "language": "english"
     }
   }
}

Parameters relevant for all types of term assignment services:

softLimit
This setting specifies the maximum number of term candidates. The range of valid values is 0..<hardLimit>. It is an integer value.
Term suggestions with a higher confidence level have higher priority. If the limit is exceeded by terms that share a single confidence level, all terms of that level are included, which might result in exceeding the defined limit. For example, if the confidence threshold is set to 90%, the limit is set to 10, and the results are 5 terms with 95% confidence and 10 terms with 92% confidence, all of the results (15 terms) are displayed. If the results are 11 terms with 95% and 4 terms with 90%, only the terms with 95% are displayed (11 terms).
Default value: 0 (no limit)

hardLimit
This setting specifies the maximum number of term candidates created by term assignment. There is no upper limit for this parameter, but 100 is the suggested maximum value. It is an integer value.
Default value: 100

candidateThreshold
This parameter sets the confidence threshold that a term's match must reach for the term to be suggested as a candidate term for an asset. Values need to be entered as fractional numbers. For example, enter 1.0 instead of 1. It is a double value.
Default value: 0.5

assignmentThreshold
This parameter sets the confidence threshold that a term's match must reach for the term to be automatically assigned to an asset. Values need to be entered as fractional numbers. For example, enter 1.0 instead of 1. It is a double value.
Default value: 0.8

categoryScope
A comma-separated list of category RIDs, each enclosed in single quotes. For example:
"categoryScope": "'6662c0f2.ee6a64fe.kbn609t6l.3qiinrj.0ocdm1.91tk1pl6386rviv3l23f4','6662c0f2.ee6a64fe.kbn60ao6v.gr8ei58.g1d39q.a8u5uqf77g7f6nb4tani5'"
Default value: null (no restriction)

assignAutomatically
Terms are assigned to assets automatically if the confidence of the assignment matches or exceeds the value specified for assignmentThreshold. If you do not want terms to be assigned automatically, set this parameter to false.
Default value: true
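The softLimit grouping rule described above can be sketched in Python. This is a hypothetical illustration of the documented behavior (whole confidence levels are included or excluded, so the limit may be exceeded), not actual product code:

```python
def apply_soft_limit(candidates, soft_limit):
    """Keep term candidates in descending confidence order, including or
    excluding whole confidence levels, as described for softLimit.

    candidates: list of (term, confidence) pairs.
    soft_limit: 0 means no limit.
    """
    if soft_limit == 0:
        return list(candidates)
    # Group candidates by confidence level.
    by_confidence = {}
    for term, confidence in candidates:
        by_confidence.setdefault(confidence, []).append(term)
    kept = []
    # Walk the levels from highest to lowest confidence.
    for confidence in sorted(by_confidence, reverse=True):
        if len(kept) >= soft_limit:
            break  # limit already reached before this level starts
        # Include the entire level, even if that exceeds the limit.
        kept.extend((term, confidence) for term in by_confidence[confidence])
    return kept
```

With a limit of 10, 5 terms at 95% plus 10 terms at 92% yields all 15 candidates, while 11 terms at 95% plus 4 terms at 90% yields only the 11, matching the examples in the table above.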

Parameters for Linguistic Name Matching:
(com.ibm.iis.odf.services.termclassification.matching.bg.MatcherDiscoveryService)

confidenceAdjustmentFactor
A numeric factor between 0 and 1 that you can use to adjust the confidence value returned by the linguistic matching service. The confidence values returned by this service are multiplied by this factor before thresholds are applied. For example, a confidenceAdjustmentFactor of 0.8 reduces the confidence of assigning terms to a column as follows:
  • Regularly, term 'Address' would be assigned to column ADDRESS with confidence 1.0. An adjustment factor of 0.8 reduces the actual confidence to 0.8.
  • Regularly, term 'Address Detail' would be assigned to column ADDRESS with confidence 0.9. An adjustment factor of 0.8 reduces the actual confidence to 0.72 (= 0.8 x 0.9).
Thus, with assignmentThreshold set to 0.75, the term 'Address' would be assigned to the column ADDRESS because the confidence 0.8 exceeds the threshold of 0.75. The term 'Address Detail' would not be assigned because the confidence 0.72 is below the threshold of 0.75.
A typical use of this parameter would be to give results of linguistic matching a lower weight compared to the results of data-class-based or ML-based matching.
This parameter can be set starting with IBM Cloud Pak for Data 3.5.6 and IBM Cloud Pak for Data 4.0.1.
Default value: 1.0
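The arithmetic behind confidenceAdjustmentFactor can be shown in a short Python sketch. The function name and return values are hypothetical; the sketch only mirrors the documented rule that the factor is applied before the two thresholds:

```python
def classify_match(raw_confidence, factor, candidate_threshold, assignment_threshold):
    """Apply confidenceAdjustmentFactor before thresholds and report the
    outcome for one term: 'assigned', 'candidate', or 'rejected'.

    Hypothetical helper illustrating the documented arithmetic; not product code.
    """
    adjusted = raw_confidence * factor
    if adjusted >= assignment_threshold:
        return "assigned"    # meets assignmentThreshold
    if adjusted >= candidate_threshold:
        return "candidate"   # meets candidateThreshold only
    return "rejected"

# The 'Address' example from the table: factor 0.8, assignmentThreshold 0.75,
# candidateThreshold at its default of 0.5.
print(classify_match(1.0, 0.8, 0.5, 0.75))  # 'Address'        -> assigned (0.8 >= 0.75)
print(classify_match(0.9, 0.8, 0.5, 0.75))  # 'Address Detail' -> candidate (0.72 < 0.75)
```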

Parameters for Data-class-based Term Assignment:
(com.ibm.iis.odf.iisext.services.cbta.ClassBasedTermAssignmentService)

classesWithNoBusinessMeaning
A comma-separated list of data class codes indicating that these data classes should not be considered as candidates when running class-based term assignment because they do not convey a business meaning. For example, if you want to exclude the data class Identifier from being considered as a candidate (by means of automatic or manual linkage to a term), include I in the list of class codes.
Default value: "C,I,D,T,X,Q,BOOL" (excludes the data classes Code, Identifier, Date, Text, Indicator, Quantity, and Boolean)

Parameters for ML-based Term Assignment:
(com.ibm.iis.odf.services.termassignment.finley.FinleyPredictorService)

language
This parameter determines the language used for pre-processing. Valid values are english, danish, dutch, finnish, french, german, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, and swedish. You can use any other language, for example Chinese, but the pre-processing is then limited to basic string-based matching. As a value, use the English name of the language, for example chinese.
Default value: english

logLevel
The logLevel determines the type of entries that are written to the log file. Valid values are CRITICAL, ERROR, WARNING, INFO, or DEBUG.
Default value: INFO

logSize
The machine learning service maintains a log file /finley/PredictFinley.log in its Docker container. The logSize parameter determines the maximum size of this log file in bytes. When the file exceeds this size, old messages are deleted from it.
Default value: 5242880

maxWaitForConnection
Maximum number of seconds within which a connection to the machine learning service must be established. If the service does not respond within this time limit, it does not contribute to the overall term assignment result. In that case, a warning message is added to the server log files. You can increase the value of this parameter when your network connection is slow.
Default value: 30

maxWaitForData
Maximum number of seconds within which the machine learning service must return term assignment candidates. If the service does not respond within this time limit, it does not contribute to the overall term assignment result. In that case, a warning message is added to the server log files. The value of this parameter must be smaller than or equal to the value of the maxWaitForConnection parameter. You can increase the value of this parameter when your network connection is slow.
Default value: 10

retrainThreshold
The machine learning service is retrained on a regular basis so that it can provide better outcomes when using the term assignment feature. By default, the model is retrained each time a term is modified in any way, for example, when a new term is added, a term is deleted, or a term is manually assigned to an information asset. With a large glossary, this might impact the performance of the model learning process. To improve the performance, you can specify how many term-related modifications must be made for the retrain process to be started. The value of this parameter is a number greater than or equal to 1.
Default value: 1
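The retrainThreshold policy amounts to a simple modification counter. The following Python sketch is a hypothetical illustration of that policy only (class and method names are invented, and the retrain action is stubbed out):

```python
class RetrainCounter:
    """Count term modifications and trigger a retraining run once the
    configured retrainThreshold is reached, then reset the counter.

    Hypothetical illustration of the documented policy; not product code.
    """

    def __init__(self, retrain_threshold=1):
        self.retrain_threshold = retrain_threshold  # >= 1
        self.modifications = 0
        self.retrain_count = 0

    def on_term_modified(self):
        """Call for every term add, delete, or manual assignment."""
        self.modifications += 1
        if self.modifications >= self.retrain_threshold:
            self.modifications = 0
            self.retrain_count += 1  # stands in for starting a retraining run
```

With the default threshold of 1, every modification triggers retraining; raising the threshold batches modifications, which is the recommended mitigation for large glossaries.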

Document Location

Worldwide


Document Information

Modified date:
04 December 2023

UID

ibm16257479