Generating data classes
Data classes can be used with IBM Information Analyzer to classify data sources registered with IBM Information Server. For example, in a data lake environment, IBM Information Analyzer can use generated data classes to automatically assign UDMH business terms to columns, where possible.
Data class generation is relevant when data source content needs to be classified automatically. In InfoSphere Information Governance Catalog (IGC), a data class is a type of asset that categorizes columnar data. There are several data class types, but the only one relevant to data class generation is the data class of type valid values. Information Analyzer can use a data class with a list of valid values to classify the contents of a column automatically. The result can be an assigned data class and, through it, assigned UDMH business terms.
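The idea behind valid-values classification can be illustrated with a minimal sketch. It does not use the Information Analyzer API or reproduce its matching algorithm. The names ValidValuesDataClass and classify_column, the 0.8 threshold, and the case-insensitive comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ValidValuesDataClass:
    """Hypothetical stand-in for a data class of type valid values."""
    name: str
    valid_values: frozenset
    assigned_terms: list = field(default_factory=list)


def classify_column(values, data_classes, threshold=0.8):
    """Return the best-matching data class, or None if none reaches the threshold.

    The threshold and the upper-casing of values are illustrative assumptions,
    not documented product behavior.
    """
    non_null = [v.strip().upper() for v in values if v is not None and v.strip()]
    if not non_null:
        return None
    best, best_ratio = None, 0.0
    for dc in data_classes:
        # Fraction of non-null column values found in the valid values list.
        matches = sum(1 for v in non_null if v in dc.valid_values)
        ratio = matches / len(non_null)
        if ratio >= threshold and ratio > best_ratio:
            best, best_ratio = dc, ratio
    return best


country = ValidValuesDataClass(
    "Country Code", frozenset({"US", "DE", "FR", "JP"}), ["Country"]
)
currency = ValidValuesDataClass(
    "Currency Code", frozenset({"USD", "EUR", "JPY"}), ["Currency"]
)

# Five of the six non-null values are valid country codes.
column = ["US", "de", "FR", None, "JP", "US", "XX"]
match = classify_column(column, [country, currency])
```

In this sketch the column is classified as Country Code, and the business terms assigned to that data class (here, Country) would then apply to the column as well.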
UDMH can provide glossary content that the data class generator can use as a source to create data classes. Each generator describes in detail the type of content it expects. The generators are:
- Terms labeled 'enumeration' that have a custom attribute containing valid values.
- Terms labeled 'enumeration' whose 'has types' terms are labeled 'enumeration items'.
The generators also attempt to set one or more assigned terms for each data class, provided there is a recognizable navigation path from the term from which the data class is generated to the business terms.
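The two markup styles and the term assignment described above can be sketched as follows. This is a simplified illustration, not the generator's implementation or the IGC API. The Term structure, the custom attribute name "Valid Values", and the business_terms field (standing in for the result of following a navigation path) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical, simplified view of a glossary term."""
    name: str
    labels: set = field(default_factory=set)
    custom_attributes: dict = field(default_factory=dict)
    has_types: list = field(default_factory=list)       # child Term objects
    business_terms: list = field(default_factory=list)  # reachable by navigation


def valid_values_for(term, attribute_name="Valid Values"):
    """Collect valid values from either markup style.

    Returns an empty list if the term is not marked up as an enumeration.
    The attribute name is an assumption for this sketch.
    """
    if "enumeration" not in term.labels:
        return []
    # Style 1: valid values held in a custom attribute.
    from_attribute = term.custom_attributes.get(attribute_name, [])
    if from_attribute:
        return list(from_attribute)
    # Style 2: 'has types' terms labeled 'enumeration items'.
    return [t.name for t in term.has_types if "enumeration items" in t.labels]


def generate_data_classes(terms):
    """Build one data class description per term that yields valid values."""
    generated = []
    for term in terms:
        values = valid_values_for(term)
        if values:
            generated.append({
                "name": term.name,
                "valid_values": values,
                "assigned_terms": list(term.business_terms),
            })
    return generated


status = Term(
    "Order Status",
    {"enumeration"},
    {"Valid Values": ["OPEN", "CLOSED"]},
    business_terms=["Order Status"],
)
account = Term(
    "Account Type",
    {"enumeration"},
    has_types=[
        Term("Savings", {"enumeration items"}),
        Term("Checking", {"enumeration items"}),
        Term("Unlabeled note"),  # ignored: not labeled 'enumeration items'
    ],
)
plain = Term("Customer")  # ignored: not labeled 'enumeration'

generated = generate_data_classes([status, account, plain])
```

Here only the two terms marked up as enumerations produce data classes. The unlabeled term and the unlabeled child are skipped.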
You can mark up new content in IGC, following the rules in the generator links below, so that the data class generator can identify the terms and generate corresponding data classes for them:
- Enumeration terms with custom attribute
- Enumeration terms with 'has types' enumeration item terms
In summary, the glossary term content used as input to the data class generator has been marked up to identify lists of valid values. If these valid values exist in columnar data, IBM Information Analyzer can automatically classify that data to the generated data classes and, by extension, to any UDMH business terms assigned to those data classes.