Use CSV and XML import methods to populate, update, and enhance your InfoSphere Business Glossary content

IBM InfoSphere™ Business Glossary enables you to create, manage, and share an enterprise vocabulary and classification system. In Version 8.1.1, the InfoSphere Business Glossary introduced some new CSV and XML import and export methods to populate a business glossary with data. This tutorial provides technical instructions, tips, and examples to help you implement these new features to efficiently create a business glossary.

Share:

Shlomit Becker (shlomitb@il.ibm.com), Quality Assurance Engineer, I.B.M.

Photo of Shlomit BeckerShlomit Becker is a QA Engineer who works on the IBM InfoSphere Business Glossary team at the IBM Israel Software Lab. Shlomit joined IBM as part of the Unicorn Solutions acquisition in 2006. She has developed a suite of automated tests for the Business Glossary Import and Export features.



Yair Rinot (yairr@il.ibm.com), Software Developer, I.B.M.

Photo of Yair RinotYair Rinot is a Software Developer who works on the IBM InfoSphere Business Glossary team at the IBM Israel Software Lab. Yair holds a B.Sc in Computer Science from Hebrew University of Jerusalem. He has developed the Business Glossary CSV and XML import and export features.



16 September 2010

Also available in Chinese

Introduction

The InfoSphere Business Glossary (Business Glossary) enables you to use a controlled vocabulary to create, manage, and share standard definitions of business and organization concepts. The Business Glossary product uses a category hierarchy in which categories contain terms. You can use terms to classify data assets in the metadata repository based on the needs of your organization.

Populating a business glossary with data is the first step to using it. In Version 8.1.1, the Business Glossary introduced some new import and export methods that use CSV and XML formats such that glossary administrators can import and export glossary content to and from external files of these types. These methods introduced greater flexibility in creating glossary content, enabling users to populate their business glossaries more easily and in a more comprehensive manner.

This tutorial describes and explains how to use these new import and export features. The tutorial includes best practices, tips, and examples to help you implement these features efficiently to populate a business glossary.

The examples in this tutorial use data from the IBM Industry Models Telecom Business Glossary Content Pack. IBM provides InfoSphere Business Glossary content packs for various industries, including banking, insurance, telecommunications, retail, and healthcare.

Choosing the import method

CSV and XML are common import methods. Deciding which import method to use to populate a business glossary is based on a number of factors, including the existing glossary content and the skill level of the business glossary administration staff.

CSV
The CSV format is simple. It includes the capability to contain categories and terms with their property values, such as descriptions, abbreviations, and custom attribute values. It also enables you to define steward relationships.
XML
The XML format is more comprehensive and complex. It includes the capability to define every possible relationship among terms, categories, and other object types, including terms relating to other terms, categories referring to terms, terms linking to assigned assets.

You might already have a version of a glossary in some format and want to use that content to start to populate a new business glossary. Or you might choose to build a business glossary from scratch. Because these scenarios start from different points, the recommended import method might be different.

If your data is in a spreadsheet, you should probably convert your existing spreadsheet to the Business Glossary CSV format. If you are starting from scratch, choose the import method based on the kind of data you need to import and on your technical skill level.

Table 1 and Table 2 describe the values that can be included in a CSV file and an XML file.

Table 1. Values available for categories in CSV and XML files
Category propertyCSV fileXML file
NamesYes, if being added for a new category only; names of existing terms or categories cannot be changed.Yes
Long descriptionsYesYes
Short descriptionsYesYes
SubcategoriesYesYes
Parent categoriesYes, if being added for a new category only; parent categories of existing terms or categories cannot be changed.Yes
Referenced terms NoYes
Contained termsYesYes
Custom attributesNoYes
Custom attribute valuesYes, if the custom attribute already exists in the target metadata repository. If the custom attribute does not exist, the import fails.Yes
Links to stewardYes, if the steward already exists in the target metadata repository. If the steward does not exist, the import fails.Yes, if the user or user group already exists in the target metadata repository. If the user or user group does not exist, the steward relationship is not created, but other content is imported successfully.
Table 2. Values available for terms in CSV and XML files
Term propertyCSV fileXML file
NamesYes, if being added for a new term only; names of existing terms or categories cannot be changed. Yes
Long descriptionsYesYes
Short descriptionsYesYes
AbbreviationsYesYes
Usages YesYes
ExamplesYesYes
StatusYesYes
Related termsNo Yes
Attribute IsModifierYesYes
Attribute TypeYesYes
SynonymsNo Yes
Containing (parent) categoriesYes, if being added for a new term only; parent categories of existing terms or categories cannot be changed. Yes
Custom attributesNoYes
Custom attribute valuesYes, if the custom attribute already exists in the target metadata repository. If the custom attribute does not exist, the import fails. Yes
Links to stewardYes, if the steward already exists in the target metadata repository. If the steward does not exist, the import fails. Yes, if the user or user group already exists in the target metadata repository. If the user or user group does not exist, the steward relationship is not created, but other content is imported successfully.
Links to assigned assets (such as columns, jobs, tables)NoYes, if the asset already exists in the target metadata repository. If the asset does not exist, the relationship is not created, but other content is imported successfully.
References to assigned external assets (such as business process models or web services)NoYes

Note: If you need to transfer business glossary content from versions older than Version 8.1.1 to newer business glossary instances, the only available method is the Glossary Archive import and export (format is XMI), which lets you transfer all or a subset of glossary data between glossary instances without editing their contents. In Business Glossary Version 8.1, if you wanted to transfer only categories and terms with their basic properties, you could also use the CSV import and export.


Importing using CSV format

The business glossary comma-separated value (CSV) format was created to offer an easy way to import basic business glossary data into the metadata repository. Tables 1 and 2 described the properties that can be included in a CSV file for a category and a term.

Note: This CSV import is not the same as the Metabrokers and Bridges Categories and Terms CSV Import (released with the initial 8.0 version of Business Glossary). They are two separate applications, and they have different formats.

The Business Glossary CSV import functionality is accessed from the Glossary tab inside the InfoSphere Information Server Web console, as shown in Figure 1.

Figure 1. InfoSphere Information Server Web console Glossary Import CSV page
Screen cap: Glossary tab higlighted, Import CSV tab shown on pane, Download a sample CSV file highlighted

Complete the following steps to import a CSV file:

  1. Click the Glossary tab.
  2. Click Import CSV in the Import and Export section on the left side.
  3. Click Browse.
  4. Find the CSV file, and click Import.

To get started creating a CSV file, click the Download a sample CSV file link as shown in Figure 1 to get a simple CSV file into which you can enter test data. This file has a few sample categories and terms in it, and it contains instructions that explain the types of data that you can enter into the CSV file.

Using the CSV template

Two best practices for writing a CSV file are either to start with the sample file mentioned above or to export a CSV file from the business glossary that has some initial data in it. A key reason for starting with the sample file or existing glossary content is to ensure that the file includes all of the required lines, since, if any of them are missing, the import will fail. An advantage to starting with an export of existing content is that the file will contain the existing custom attributes as importable properties. You can choose to import categories, terms, or both with a CSV file.

Importing categories

If you want to import only categories and no terms, you need to have only the categories section in the CSV file, as shown in Figure 2.

Figure 2. Categories defined in a CSV with no terms defined
Screen cap: CSV that contains Import Categories between Begin and End, but no terms

In Figure 2, two categories in the CSV file are displayed. The first category in the file is called Business Concepts. It has no parent category defined, which means that on import, this category will become a top-level category. The second category is called Location, and it has the category Business Concepts defined as its parent category. Figure 2 also contains a term section, but there are no terms defined yet in this section, which is allowed.

When this CSV file is imported, two categories are created: one top-level category called Business Concepts and one subcategory of Business Concepts called Location.

Importing terms

If you want to import only terms and not categories, you need to have only the terms section in the CSV file. You can also have an empty category section remaining in the CSV.

Figure 3. Terms defined in a CSV with no categories defined
Screen cap: CSV that contains Import Terms between Begin and End, but no categories

It is best that a term's parent category is defined and not left blank. If the term's parent category is left blank, the terms below it are created. But within a category that is auto-generated during the import, the category name will be unusable, such as Uncategorized_1273649004500, as shown in Figure 4.

Figure 4. Terms defined without a parent category
Screen cap: CSV that contains Business Concepts as parent for Category, but no parent category for terms

Specifying the file format

The CSV file needs to have the file extension .csv. If your text editor does not have a .csv extension option, you can specify .txt and then edit the extension later. If your text editor doesn't have .csv in its file format type list, type in .csv, and save the file with that extension.

If you are using Microsoft® Excel, be sure to save the file as CSV (Comma delimited) (*.csv) and not as any other CSV type or as the standard .xls type, as shown in Figure 5.

Figure 5. Microsoft Excel CSV format to use when saving as .csv
Screen cap: Save as type field dropped down with *.csv selected

Assigning stewards

A steward is a business owner or a subject matter expert. Stewards can be assigned to terms or to categories. Note that the steward needs to exist in the repository before import, or the import will fail. You can assign a steward to a category or term in the CSV file by typing the steward's user name in the Steward column, as shown in Figure 6.

Note: a steward can be a user or a user group.

Figure 6. Assigning stewards to a category and a term in a CSV file
Screen cap: CSV that contains Import Terms and Import Categories, Steward robertj highlighted for each

The user name is not the same as the steward's first name, last name, or user group names. The value needs to be the correct value for the user name, otherwise the import will fail.

You can find the user name value for a user or user group in the InfoSphere Information Server Web console by viewing the stewards, as shown in Figure 7.

Figure 7. Stewards page in the InfoSphere Information Server Web console
Screen cap: Glossary tab browsing Stewards with robertj as User Name

This user name value in this console is the value that you need to enter into the CSV file for the steward.

Creating custom attributes

Custom attributes are properties of categories and terms that you can create to extend the standard glossary template, such as name, short description, and long description. Table 1 shows a comprehensive list of available properties.

You can define custom attributes in the InfoSphere Business Glossary Administrative interface of the Information Server Web console, as shown in Figure 8.

Figure 8. Custom Attributes page in the InfoSphere Information Server Web console
Screen cap: Glossary tab, Custom Attributes tab showing Term or Category for each of 4 attributes

Each custom attribute can be defined to be for categories or terms. And each custom attribute can be of type String or of type Enumerated. If an attribute is of type Enumerated, you need to define the possible enumerated values when you define the custom attribute.

In the CSV file, you can assign custom attribute values for category and term custom attributes by adding more columns to the file after the last required column. Each custom attribute with its associated values must be in a separate column.

In Figure 9, in the Categories section, there are two new columns in the CSV file that correspond to the custom attributes called Industry and Priority. In the Terms section, there are two new columns that correspond to the custom attributes called ID and Position.

Figure 9. Assigning custom attribute values for categories and terms in a CSV
Screen cap: CSV that contains Telecom Industry and Medium Priority for Category, and BCLCAD and BCLCAT as Position 2 for Term

When this CSV file is imported, two categories and two terms are created. And the categories and terms will have the custom attribute values defined for them on import.

Custom attributes need to be defined in the Information Server before any values for them can be imported, otherwise the import will fail. For example, if you try to import the CSV file in Figure 9 and the custom attribute Industry is not defined for categories in the Information Server, the import will fail with an error message explaining that the custom attribute Industry was not previously defined.

Note: If a custom attribute is of type Enumerated, the value entered in the CSV must be one of the valid values that are defined for that enumerated list. If an invalid value is defined in the CSV, the import fails with an error message that explains that the specific value is not valid, listing the valid values.

Exploring case sensitivity and merging in CSV

Category, term, and steward names in the Information Server are case sensitive. This means that category Location is not the same as category location, term address is not the same as term Address, and steward ROBERTJ is not the same as steward robertj. If you use the wrong case in the CSV for these values, then depending on the case, the import will either fail or you will not get the expected results. If the capitalization is consistent, the category or term is identified as already existing in the repository, and a merge of values will take place for that category or term.

The following sections describe merge scenarios and their results.

Name of category or term

If the name value of a category or term has different capitalization that an existing category or term, a new category or term is created instead of updating an existing category or term. Here are the basic scenarios and their outcomes:

Table 3. Basic scenarios for case sensitivity and merging in CSV
ScenarioCSVRepository before importRepository after import
Category names with different capitalizationCategory: Business ConceptsCategory: business conceptsCategories: Business Concepts + business concepts
Term names with different capitalization in same-capitalization categoryTerm: Address in category Business Concepts>>LocationTerm: address in category Business Concepts>>LocationTerms: Address + address in category Business Concepts>>Location
Category names with same capitalizationCategory: Business ConceptsCategory: Business ConceptsCategory: Business Concepts (values from the CSV for the category Business Concepts will overwrite any values that exist in the repository for this category)
Term names with same capitalization in same-capitalization category nameTerm: Address in category Business Concepts>>LocationTerm: Address in category Business Concepts>>LocationTerm: Address in category Business Concepts>>Location (values from the CSV for the term Address will overwrite any values that exist in the repository for this term. If the CSV has no value for a certain property, and the repository has a value for that property, then that value will remain; and values are not deleted on import.

Parent category

If the parent category of a category or term in the CSV has different capitalization than an existing parent category, the connection is not made on import, as shown in Table 4.

Table 4. Basic scenarios for case sensitivity and merging parent categories in CSV
ScenarioCSVRepository before importRepository after import
Category's parent category name with different capitalizationCategory: Location parent category: Business ConceptsCategory: Location parent category: business ConceptsCategories: Business Concepts>>Location + business concepts>>Location
Term's parent category names with different capitalizationTerm: Address with parent category defined as Business Concepts>>Location (if CSV does not have category Business Concepts>>Location defined, it is expected that it exists in the repository)Has category: business conceptsImport fails with an error message that term Address cannot be imported because its parent category Business Concepts>>Location could not be found.

Steward

If a category or term has a steward defined in the CSV as robertj, and the repository has a steward with the username ROBERTJ and does not have a steward with the username robertj, then the import fails with a message that the steward was not found, as shown in Table 5.

Table 5. Steward scenarios for case sensitivity and merging
ScenarioCSVRepository before importRepository after import
Category's steward name with different capitalizationCategory: Location with steward robertjCategory: Location with steward ROBERTJImport fails with an error message that category Location cannot be imported because its steward robertj could not be found.
Term's steward name with different caseTerm: Address with steward robertjTerm: Address with steward RobertjImport fails with an error message that term Address cannot be imported because its steward robertj could not be found.

Importing using XML format

The new Business Glossary XML file import, available since Version 8.1.1, contains the capability to import every type of value and relationship for a category or term. Because it supports every type of value, it is more complicated to write than the CSV file. But the design of the schema for this XML was created to make it as readable and understandable as possible so users can actually edit it.

The XML import process allows for merging of data. Objects in the XML file can connect to objects already in the repository on import, including references to technical assets.

Access the InfoSphere Business Glossary XML import from the Glossary tab inside the Information Server Web console, as shown in Figure 10.

Figure 10. Information Server Web console Glossary Import XML page
Screen cap: Glossary tab, Import XML with File is a BG 8.1 XML format and Download reference files highlighted

Complete the following steps to import an XML file:

  1. Click the Glossary tab.
  2. Click Import XML in the Import and Export section on the left side.
  3. Click Browse and select a file.
  4. Select one of the four merge methods. The merge options are discussed later in more detail.
  5. Click Import to complete the import.

Instead of browsing to one of your files, you can click the Download a sample XML file link to see a sample XML file that contains many of the different types of data that an XML can contain. You can also download the XML schema.

The XML format used to upload categories and terms in Business Glossary 8.0.1 and 8.1 is still supported in 8.1.1 and later versions. In order use the older XML format in 8.1.1 or later releases, you need to select the File is Business Glossary 8.1 or earlier XML format checkbox as shown in Figure 10. Because all options in 8.1.1 XML import are irrelevant for the older XML import, when this checkbox is selected, these options are not displayed.

Note: The older XML format has a different schema than the new XML format. The newer (8.1.1 and later) XML format is far more comprehensive than the older XML format and enables the merging of data. Use the new format when writing new Business Glossary XML files.

Importing the XML schema

The Business Glossary 8.1.1 XML schema was designed according to XML standards, where XML attributes define object attributes and XML elements define object relationships. When designing this XML schema, though, a deviation was made from the standard: categories and terms are not structured hierarchically. Categories and terms are defined as separate elements on the same hierarchical level, even though terms are contained in categories. This design accommodates that the 8.1.1 XML import supports partial import. For example, you can import terms without categories. Partial import will be detailed in a later section.

The XML schema was designed to prevent users from making mistakes when writing the XML file. There are many glossary relationships that are bidirectional, where the definition can be on either of the two objects. This is an area that could lead to errors on import if the XML data is not written properly. In most cases, the schema allows for only one of these objects to define the relationship. Specific cases are presented in a later section of this tutorial.

To help you write an XML file, examine the schema to see the different elements and attributes that can be added, and to understand the requirements for each type.

Writing the XML for two categories and a term

Listing 1 shows a simple example that illustrates the import of two categories and a contained term.

Listing 1. Two basic categories and a term in the XML
<categories>
    <category name="Business Concepts"/>
    <category name="Location">
        <parentCategory identity="Business Concepts"/>
    </category>
</categories>
<terms>
    <term name="Address">
        <parentCategory identity="Business Concepts::Location"/>
    </term>
</terms>

In Listing 1, the category Business Concepts has no parent category, therefore it is a root or top-level category.

Categories and terms in the business glossary can be uniquely identified by their names and parent categories. In the XML schema, when a category or term references another object, such as a parent category, you can use the identity attribute to hold the value of the unique identification for that referenced object.

For the parentCategory tag of a category or term, the identity value in the XML contains the full path to its parent category starting from the top-level category and going down to the direct-parent category. The categories in a path are separated with a double colon (::).

In Listing 1, category Location is contained in the root parent category Business Concepts, so the identity value for the parent category for Location is Business Concepts. The term Address is contained in category Location, which is itself contained in category Business Concepts. So, the identity value for the parent category for term Address is Business Concepts::Location.

Importing the XML for two categories and a term

When you import the XML, the import service looks at each identity value to see if a match can be found for an object with that identity. The service looks for this potential referenced object both in the XML file and in the target repository. If the match is found, the reference connection is made. If the match is not found, there are two possible outcomes:

  • The object that is making the reference is added to the target repository. If the referenced object is not mandatory for the new object to exist, the new object is added without adding that reference.

    For example, a category can exist without a parent category. So if the XML defines a new category with a parent category, but that parent category does not exist in the XML or in the repository, then on import the new category is created as a top-level category. The parent category defined for it in the XML is ignored.

    A term can exist without a referenced term. So if the XML defines a term with a referenced term, but that referenced term does not exist in the XML or in the repository, then on import the new term is created, but it does not have the referenced term.

  • The object that is making the reference is not added to the target repository. If the referenced object is mandatory for the new object to exist, the new object is not added to the repository.

    For example, if the XML defines a new term with a parent category, but that parent category does not exist in the XML or in the repository, then on import the new term is not created at all because the parent category is required for a term to exist.

Examples of import results

This section describes some examples of these import-result concepts.

Listing 2 adds a new term.

Listing 2. Adding a new term
<term name="Street">
    <parentCategory identity="Customer"/>
    <replacedByTerm identity="Business Concepts::Location::Address"/>
</term>

Listing 2 contains the term Street. For this example, Street's parent category, with identity Customer, does not exist either in the XML or in the target repository. On import, the term Street is not created because its parent category is required and was not found either in the XML or in the repository.

In the next example, the parent category Customer does exist in the target repository, but the term Address, which is set to be the replacedByTerm value for the term Street, does not exist in either the XML or in the target repository. Note that the identity for this term, Business Concepts::Location::Address, includes its category path and the term itself, and the potential reference is between two terms. In this case, on import, the term Street is created and has the parent category Customer. But the replacedByTerm relationship between Street and Address will not be created because the term Address does not exist in the XML or in the target repository.

You can conclude from this that, if Business Concepts::Location and Business Concepts:: Location::Address both exist in either the XML or in the target repository, then on import, the term Street will be added to the target repository and its replacedByTerm relationship will be created.

These same principles apply to other types of relationships defined in the XML where the referenced object is not found in the XML or in the repository. In the case where a relationship is required for a new item in the XML, if that referenced object is not found, then the new object is not added at all. In the case where the referenced object is not required for the existence of the new object, the import succeeds, but that relationship is not established.

Note: If some of your data does not get imported, this situation is something to check for. Typos, wrong case of letters, or other invalid relationship definitions can also cause this kind of import failure.

Matching on import

In Listing 2, the example added a new term to the target repository. The following sections describe scenarios where an XML import can update an object that already exists in the target repository.

During an import, the service first checks to see if each object in the XML already exists in the target repository. This determines whether a new object is to be added or an existing object is to be updated. When performing this check, the service first determines whether there is a repository ID (RID) defined for an object in the XML. A RID is a unique ID that is generated internally by the Information Server. If there is a RID defined, the service checks to see whether there is an object in the target repository that has that RID. If there is no RID defined for the object in the XML, or if the RID was not found, then the service looks at the identity of that object to see if there is a match.

When you export data in the XML format from a business glossary, the RID is exported as an attribute for each object. But note that when writing a new object in an XML file to be imported into the repository, an RID attribute and value should not be added into the XML file for this object. Only the InfoSphere Information Server can generate these values internally, and you cannot create them manually. In this case, you must add an identity value where relevant instead of an RID.

In Listing 3, an existing term Address was exported from a business glossary. The term has a RID attribute and value. This value is made up of a long set of numbers, letters, and other characters.

Listing 3. RID matching
<term name="Address" 
    rid="b1c497ce.e1b1ec6c.38683868.73b6c473-1dc5-45f6.8e65.550faa5565f8"
    shortDescription="LOCATION ADDRESS TYPE">
    <parentCategory identity="Business Concepts::Location"/>
</term>

In this example, the term has both its RID and its parent category identity specified. So on import, the order of checking for a matching object is as follows:

  1. A check is done to see if another term with the RID value of b1c497ce.e1b1ec6c.38683868.73b6c473-1dc5-45f6.8e65.550faa5565f8 already exists in the target repository. If it is found, then no new term is added, and the term in the target repository can be updated. In this case, where a match was made by RID, the name of the term or its parent is updated also.
  2. If a term with that RID is not found, an identity check is done, which checks whether a term named Address in category Business Concepts::Location exists in the target repository. If it is found, no new term is added, and the term in the target repository is updated.
  3. If neither the RID nor the identity is found, a new term is created in the target repository according to the XML data.

You can refer to objects by RID or by identity. In Listing 3, both an RID and an identity are specified for the term. When you write an XML file, only one RID or identity is actually required.

Listing 4 describes a scenario in which only the identity for a term is in the XML file, without an RID.

Listing 4. Identity matching
<term name="Address">
    <parentCategory identity="Business Concepts::Location"/>
</term>

In this example, there is no RID specified in the XML file for the term, but there is a parent category identity. When the term Address is imported into the business glossary, because there is no RID value in the XML, the import service checks by identity to see if this term already exists in the target repository. The following actions occur upon import in this sequence:

  1. A check is done to determine whether a term named Address in category Business Concepts::Location exists in the target repository. If the term is found, no new term is added, and the term in the target repository is updated.
  2. If the identity is not found, a new term is created in the target repository according to the XML data.

Merging imported attributes

Once it is determined that an object in the XML file already exists in the target repository, the issue then becomes how to merge all of its attribute values. For example, for a term, there could be different attribute values in the XML file than are in the target repository. This section describes the merge process in more detail.

Listing 5 imports Address in Business Concepts::Location.

Listing 5. Merging in XML
<term name="Address" 
    status="CANDIDATE"
    example="1222 Park Avenue"
    abbreviation="ADDR">
    <parentCategory identity="Business Concepts::Location"/>
</term>

If the term is found in the target repository, either having been matched by RID or identity, then each value of that term from the XML file is compared to the value for that term in the target repository. For example, in Listing 5, the value for abbreviation is ADDR, and in the target repository, the value for abbreviation is ADR. The question then becomes: is the value from the XML file used or the one in the target repository? And what happens if there are no values for an attribute in either the XML or in the target repository?

The answer is determined by the merge method you choose. The InfoSphere Business Glossary XML file import has four different merge options to choose from. Besides the four merge methods, the behavior of the merge depends on whether the attribute can contain a single value or if it can contain a list of many values.

Figure 11 shows how these merge methods are presented in the InfoSphere Information Server Web console and some examples.

Figure 11. GUI of XML merge methods
Radio buttons: Ignore the imported, Overwrite the existing, Merge & Ignore imported, or Merge &Overwrite existing

Listing 6 shows how to import the Address term from Listing 5 when an equivalent term exists in the repository.

Listing 6. Merging a single value attribute in XML
<term name="Address" 
    status="CANDIDATE"
    abbreviation="ADR"
    additionalAbbreviation="AD">
    <parentCategory identity="Business Concepts::Location"/>
</term>

XML and target repository both have a value, single value allowed

The abbreviation attribute can have a single value. In the XML, this abbreviation value is ADDR. In the target repository, this abbreviation value is ADR.

When importing this term, the result for the abbreviation attribute for each merge method is as follows:

Ignore
The imported value ADDR is ignored, and the value in the target repository ADR remains unchanged.
Overwrite
The imported value ADDR is overriding the value in the target repository ADR.
Merge and ignore
Because both the imported and existing values are populated, the merge is done the same way it is done for Ignore: the imported value is ignored, and the value in the target repository ADR remains.
Merge and overwrite
Because both imported and existing values are populated, the merge is done the same way it is done for Overwrite: the imported value ADDR is used.

XML has no value, and target repository has a value, single value allowed

The additionalAbbreviation attribute can also have a single value. In the XML there is no value for additionalAbbreviation, while in the target repository there is a value AD.

When importing this term, the result for the additionalAbbreviation attribute for each merge method is as follows:

Ignore
The imported value is ignored, and therefore the existing value AD remains unchanged.
Overwrite
The imported non-value overwrites the existing value AD, therefore in the result, additionalAbbreviation has no value.
Merge and ignore
Because the existing value is populated, the imported non-value is ignored, and the existing value AD remains unchanged.
Merge and overwrite
Because the imported additionalAbbreviation has no value, it does not overwrite the existing value AD.

XML has a value, and target repository has no value, single value allowed

For the example attribute, there is a value in the XML, but there is no value in the target repository. The results are as follows:

Ignore
The imported value is ignored, therefore the existing value remains unchanged, remaining with no value.
Overwrite
The imported value for the example overwrites the existing value, which is no value.
Merge and ignore
Because the existing example has no value, the imported value 1222 Park Avenue is used.
Merge and overwrite
Because the imported example has a value, it is used.

Merging properties with multiple cardinality values

The previous section explained how single attribute values are merged in XML import. This section describes merging values that have multiple cardinality, which means containing a list of more than one value.

In general, these merges are similar to single value merges. The difference is that with the Merge and ignore option and with the merge and overwrite option, the resulting list is a combination of the existing values and the imported values.

XML contains term Address with related term Street

In the example in Listing 7, the XML contains the term Address to merge with the related term Street.

Listing 7. XML value for a multiple value attribute
<term name="Address">
    <parentCategory identity="Business Concepts::Location"/>
    <relatedTerms>
        <termRef identity="Customer::Street"/>
    <relatedTerms/>
</term>

Target repository contains term Address with related term Road

In the example in Listing 8, the target repository contains the term Address to merge with the related term Road.

Listing 8. Target repository value for a multiple value attribute
<term name="Address">
    <parentCategory identity="Business Concepts::Location"/>
    <relatedTerms>
        <termRef identity="Industry::Road"/>
    <relatedTerms/>
</term>

The results for each merge method are as follows:

Ignore
The imported related term is ignored, which leaves the existing related term unchanged, remaining as Road.
Overwrite
The imported related term Street overwrites the existing related term Road.
Merge and ignore
The existing and imported related terms are combined, therefore the resulting related terms list contains both Street and Road.
Merge and overwrite
The existing and imported related terms are combined, therefore the resulting related terms list contains both Street and Road.

Understanding case sensitivity in XML

All names in the Information Server are case sensitive. This means that the term Street is not the same as the term street. When you refer to new or existing glossary data, be sure to use the right case, otherwise an incorrect glossary hierarchy might be created. For example, the target repository might have the term Business Concepts::Location::Street, but not Business Concepts::Location::street. If you are trying to update the term Street, but by mistake in the XML you type this term as Business Concepts::Location:street, then on import, the term street is added to Business Concepts::Location, and Street will not be updated as you expected.

Understanding partial import support

The new XML import introduced the capability to perform a partial import. This means that objects such as categories or terms can be defined in the XML file without adding their referenced objects to the XML file. In this case, the referenced objects need to exist in the target repository before import in order for the connections defined in the XML to be made.

Listing 2 offered an example of this type of import by importing the term Street. In that case, you added a new term to the glossary in which its parent category and replaced-by term were not in the XML, but were expected to be in the target repository. The exact identity or RID for these existing referenced objects needs to be defined in the XML file.

Assigning stewards

The XML import enables you to assign a steward to manage categories or terms. Listing 9 shows an example of how to assign a steward to manage a category or term in an XML. You need to define the steward's userName and type.

Listing 9. XML to assign a steward to a category and a term
<category name="Location" longDescription="Set of Terms relating to:  Location">
    <parentCategory identity="Business Concepts"/> 
    <steward userName="analysts" type="USERGROUP"/>
</category>
		
<term name="Address" shortDescription="LOCATION ADDRESS TYPE" 
      status="CANDIDATE" type="NONE" isModifier="false">
    <parentCategory identity="Business Concepts::Location"/>
    <steward userName="robertj" type="USER"/>
</term>

The userName value is the same as was described in the CSV section about assigning stewards. The type value can be either USER or USERGROUP, which are defined as valid values in the schema.

If the USER or USERGROUP defined in the XML does not exist in the repository, the import succeeds, but the connection from the term or category to the steward is not made. This follows the previously described principle that if a non-required reference is defined in the XML for a new category or term, and that non-required reference (a steward) does not exist in the XML or in the repository, then on import, the new category or term is created, but the steward connection is ignored.

Note: If the steward for a userName in the XML does not exist in the repository, but the USER or USERGROUP for the userName in the XML does exist in the repository, then on import, the steward is created for that userName. And the connection is made to the category or term, as was defined in the XML.

Handling categories that contain terms and subcategories

This tutorial describes several examples in which a term defines its parent category (parentCategory element) in the XML. It was decided when designing the XML schema that this relationship of a category containing terms can be defined only within the term tag, but it cannot be defined from the category side. This decision was made because a term can be imported without its parent category, and a term requires exactly one parent category in order to exist, therefore a parent category needs to be defined for a term. On the other hand, a category does not require contained terms to exist. By allowing this relationship to be defined only on the term level, it minimizes the risk of possible errors in the XML. Listing 10 shows the XML code to define a parent category of a term.

Listing 10. Parent category of a term
<term name="Address">
    <parentCategory identity="Business Concepts::Location"/>
</term>

A similar principle applies to the way that subcategories are defined in the XML schema. A category in the XML can define its parent category, but it cannot define its subcategories. The difference here is that a category is not required to have a parent category or subcategories. But because a category can have only one parent category, it is easier and less problematic to just allow the definition for this relationship to be defined by the parentCategory tag and not by a subCategories tag. Listing 11 shows the XML code to define a parent category of a category.

Listing 11. Parent category of a category
<category name="Location">
    <parentCategory identity="Business Concepts"/>
</category>

Understanding how categories reference terms

In the InfoSphere Business Glossary, a category can reference terms that it does not contain. In the XML you can define this relationship on both the category level and the term level. When defining the schema, it was decided to allow this capability because a category can reference many terms. A term can be referenced by many categories. Both directions have multi-cardinality, so when writing an XML, the chance of error is very low. Listing 12 shows the XML code for a category referencing a term.

Listing 12. Category referencing a term
<category name="Industry">
    <referencedTerms>
        <termRef identity="Business Concepts::Location::Address"/>
    </referencedTerms>
</category>

Creating custom attributes

Custom attributes are properties of categories and terms that can be created to extend the standard glossary template, as described in the CSV section. You can define custom attribute definitions and custom attribute values in the XML file.

Defining custom attributes

The custom attribute definitions are defined in their own section in the XML, separated from the category and term sections. Category custom attribute definitions and term custom attribute definitions are each defined in their own separate section.

Custom attribute definitions can be of type string or enumerated. When defining a string custom attribute definition, only the name is required, and a description is optional. When defining an enumerated custom attribute definition, an enumerated list of valid values is additionally required.

Listing 13 shows an example of category and term custom attribute definition sections in the XML, where the category custom attribute definition is of type string, and the term custom attribute definition is enumerated.

Listing 13. Custom attribute definitions
<categoryCustomAttributes>
    <customAttributeDef name="Priority">
        <validValues>
            <validValue value="high"/>
            <validValue value="medium"/>
            <validValue value="low"/>
        </validValues>
    </customAttributeDef>
</categoryCustomAttributes>
<termCustomAttributes>
    <customAttributeDef name="ID"/>
</termCustomAttributes>

The category custom attribute definition has the name Priority, and it is of type enumerated. Its enumerated values are high, medium, and low.

The term custom attribute definition has the name ID, and it is of type string. Because it is of type string, it can have any value.

On import, if these custom attribute definitions do not exist, they are created. If they already exist, they will remain as they are in the repository.

If an enumerated custom attribute definition already exists in the repository and the list of valid values is different in the repository than in the XML, then the list could be changed by changing the import merge option, similar to other category and term lists.

Defining custom attribute values

Custom attribute values are defined in the XML within the specific category and term sections. Listing 14 shows a custom attribute value defined.

Listing 14. Category with a custom attribute value
<category name="Location" longDescription="Set of Terms relating to:  Location">
    <parentCategory identity="Business Concepts"/>
    <customAttributes>
        <customAttributeValue customAttribute="Priority" value="medium"/>
    </customAttributes>
</category>

In Listing 14, the category Location has a custom attribute value of medium defined for the custom attribute definition Priority. Because the custom attribute definition Priority is of type enumerated, the valid values are high, medium, and low, so the custom attribute values need to be one of those three values. If in the XML a custom attribute value for Priority is defined to be a different value than one of these three, such as very high, then the import fails with an error message that says:

Custom attribute value very high defined for Category
Location does not match its custom attribute definition
Priority. Valid values are: high, low,
medium.

Listing 15 shows a term with a custom attribute value defined.

Listing 15. Term with a custom attribute value
<term name="Address" shortDescription="LOCATION ADDRESS TYPE"
    status="CANDIDATE" abbreviation="ADDR">
    <parentCategory identity="Business Concepts::Location"/>
    <customAttributes>
        <customAttributeValue customAttribute="ID" value="2"/>
    </customAttributes>
</term>

In Listing 15, the term Address has a custom attribute value of 2 defined for the custom attribute definition ID. Because the custom attribute definition Priority is of type string, the value can be any value.

The custom attribute definitions referred to by the custom attribute values need to be defined either in the XML or in the metadata repository in order for the custom attribute values to be imported. If the definitions do not exist, the import succeeds, but these custom attribute values are not created.

Defining synonym groups

In the business glossary, each term can have one or more synonyms. In the database, terms are set as synonyms by defining a synonym group, which contains the terms. Each term is only allowed to be included in a single synonym group. In order to ensure this, synonym groups are defined as separate objects in the XML, rather than being defined within the term tag like other term relationships.

Each synonym group can have up to one preferred synonym. This is also defined in the synonymGroup tag, as shown in Listing 16.

Listing 16. Synonyms and preferred synonym
<synonymGroups>
    <synonymGroup>
        <synonyms>
            <termRef identity="Business Concepts::Location::Address"/>
            <termRef identity="Customer::Street"/>
        </synonyms>
        <preferredSynonym identity="Customer::Street"/>
    </synonymGroup>
</synonymGroups>

In Listing 16, the two terms Street and Address are defined as synonyms. When an XML with synonyms is imported, if one or more of the terms in the synonym group are already in the repository and already have a different synonym, then the three synonyms are merged to be part of one synonym group. Note that for synonyms, this merge takes place in the same way when using all four XML merge options.

In Listing 16, the term Street is defined as being the preferred synonym of this synonym group.

Assigning assets

In the Business Glossary, terms can have a list of assigned assets, which can be many different types, including databases, columns, BI reports, and so on. Assigning assets to terms is one of the important capabilities of the Business Glossary. XML import lets you assign assets to terms in a straightforward manner. Listing 17 shows an example of assigning a host called BUSINESS_SERVER to term Street.

Listing 17. Assigned asset of type host
<term name="Street">
    <parentCategory identity="Business Concepts::Location"/>
    <assignedAssets>
        <hostRef host="BUSINESS_SERVER"/>
    </assignedAssets>
</term>

On import, if the host named BUSINESS_SERVER exists in the target repository, it is assigned to the term. If BUSINESS_SERVER does not exist, the term is imported without the assignment.

In Listing 17, the host is identified by its name. Each asset type has its own set of identifying attributes, which the schema defines. Some identity attributes are required in the XML, and some are not. The key for the match to take place on import is that in the XML it is necessary to specify all the relevant attributes that are defined for this asset in the repository.

Listing 18 shows an example that assigns a database column to a term.

Listing 18. Assigned asset of type column
<term name="Street">
    <parentCategory identity="Business Concepts::Location"/>
    <assignedAssets>
        <columnRef column="CUSTOMER_NAME" table="CUSTOMER" schema="DB1_SC" 
        databaseName="BUSINESS" host="BUSINESS_SERVER" />
    </assignedAssets>
</term>

In Listing 18, for column CUSTOMER_NAME there are five attribute values defined within the columnRef tag:

  • column
  • table
  • schema
  • databaseName
  • host

All of these attributes are required by the XML schema for an asset of type column. If a column with all of these five attribute values exists in the target repository on import, then the match is made.

Other types of assets might have a hierarchical definition. For example, the object can be contained in a hierarchical structure that can include several instances of the same type. Listing 19 shows an example of this kind of asset.

Listing 19. Assigned asset of type BI report field
<term name="Street">
    <parentCategory identity="Business Concepts::Location"/>
    <assignedAssets>
        <BIReportFieldRef field="Total net. profit">
            <BIReportGroupRef group="Sales">
                <BIReportGroupRef group="Products">
                    <BIReportRef report="Annual Report" />
                </BIReportGroupRef>
            </BIReportGroupRef>
        </BIReportFieldRef>
    </assignedAssets>
</term>

This assigned asset is of type BI report field, which is always contained in a BI report group. The BI report group might be contained in another BI report group, or it might be contained in a BI report. The hierarchy between the BI report group instances is represented by XML nested elements. In Listing 19, the BI report field named Total net. profit is contained in a BI report group called Sales. Sales is contained in a BI report group called Products. Products is then contained in a BI report called Annual Report. If a BI report field with the same name with the same hierarchy exists in the target repository on import, the match is made.

Some asset types have optional attributes defined by the schema.

Listing 20 shows an example of an assigned database in an XML file that has three identification attributes defined.

Listing 20. Assigned asset of type database
<term name="Street">
    <parentCategory identity="Business Concepts::Location"/>
    <assignedAssets>
        <databaseRef 
            databaseName="BUSINESS"
            host="BUSINESS_SERVER" 				
            databaseDBMS="DB2" />
    </assignedAssets>
</term>

For a database asset, according to the schema, only the databaseName and host attributes are required. A database has three other optional attributes:

  • databaseDBMS
  • databaseInstance
  • databasePath

In Listing 20, only one of these optional attributes is defined, databaseDBMS. When importing this XML, if a database with these three attribute values, and only these three attribute values, is found in the repository, the assignment will be made to the term.

If there is a database with the same name and host in the repository, but the databaseDBMS value is not defined there, the assignment is not made.

If there is a database with the same name, host, and databaseDBMS in the repository, but it also has a value defined for databaseInstance in the repository, then on import, the assignment to the term is not made, because the assets are not considered to be identical.


Conclusion

Exporting

The InfoSphere Business Glossary also supports export of content in CSV and XML formats, which can be used to transfer data between environments and server instances. This exported file can be edited, adding more categories and terms, and then re-imported to the same server from which it was exported to add new data.

Once a business glossary has been richly populated, exporting can be useful to copy or transfer this content from one business glossary to another, which is a method of populating that second repository. In addition to XML and CSV formats, exporting is available in an XMI format.

Conclusion

The CSV and XML import and export features in the InfoSphere Business Glossary, introduced in version 8.1.1, provide users with powerful tools to create and update business glossary data.

Acknowledgments

The authors would like to thank Michael Fankhauser, Benny Halberstadt, Roger Hecker, Nancy Navarro, and Erel Sharf for their feedback and review of this tutorial.


Download

DescriptionNameSize
Sample CSV and XML files for this tutorialSampleCSV_XML_ImportFiles.zip3KB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=522340
ArticleTitle=Use CSV and XML import methods to populate, update, and enhance your InfoSphere Business Glossary content
publish-date=09162010