Knowledge base properties

To specify knowledge base properties, use the Management Console.

Knowledge base name

Specifies a display name for the knowledge base. If you are adding a knowledge base to the IBM® Content Classification server, type a name that complies with 7-bit Unicode Transformation Format (UTF-7) character encoding.
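One conservative way to satisfy the 7-bit naming requirement is to restrict the name to plain 7-bit ASCII characters. The following sketch shows such a check. The function name is hypothetical, and the ASCII-only rule is an assumption about what the server accepts, not the product's own validation.

```python
def is_seven_bit_name(name: str) -> bool:
    """Return True if the name is non-empty and every character fits in 7 bits.

    Assumption: limiting names to 7-bit ASCII is a safe reading of the
    7-bit encoding requirement. The server's real validation may differ.
    """
    return bool(name) and name.isascii()


print(is_seven_bit_name("claims_kb"))
print(is_seven_bit_name("réclamations"))
```

Checking a name this way before you type it into the Management Console can catch accented or non-Latin characters early.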

Statistics

Statistics is information derived from texts that enables Content Classification to classify future texts.

Add an empty knowledge base

Adds a knowledge base that has no statistics to the Content Classification server. You can train an empty knowledge base by using Classification Workbench or by programmatically applying learning through a Content Classification application.

You can also add an empty knowledge base here, and then export a new knowledge base from Classification Workbench. If you use this approach to add a new knowledge base to the Content Classification server, the empty knowledge base must exist before you export the knowledge base from Classification Workbench for the first time.

Keep existing statistics

For a knowledge base that was previously added to the Content Classification server, specifies that you want to keep the current knowledge base statistics.

Import a knowledge base (KB file)

Imports a knowledge base that contains statistics. For example, you can import a knowledge base that you created in Classification Workbench, or a knowledge base that you analyzed and fine-tuned and whose latest statistics you now want to import back into IBM Content Classification.

The default path for knowledge bases that you create in Classification Workbench is Classification_Home\Classification Workbench\Projects_Unicode\project_name\project_name.kb, where project_name is the name that was given to the knowledge base project in Classification Workbench.
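The default path pattern can be assembled as in the following sketch. The installation directory and project name in the usage line are hypothetical placeholders, and the helper function is not part of the product.

```python
from pathlib import PureWindowsPath


def default_kb_path(classification_home: str, project_name: str) -> PureWindowsPath:
    """Build the default Classification Workbench path for a project's KB file."""
    return (
        PureWindowsPath(classification_home)
        / "Classification Workbench"
        / "Projects_Unicode"
        / project_name
        / f"{project_name}.kb"
    )


# Hypothetical installation directory and project name.
print(default_kb_path(r"C:\IBM\ContentClassification", "Claims"))
```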

To import statistics to an existing knowledge base, ensure that the knowledge base is running before you click OK. These options are available:

Import from the Management Console computer

Select this option to import small knowledge base files (up to 30 MB in size). Larger files can take a long time and might not be processed before time limits expire. You can also select this option to import files when there is no network connectivity between the Content Classification server and the directory where the file is stored.

If the knowledge base file is local, click Browse to find the file or specify the local path, for example, on Windows:
d:\directory_path\knowledge_base_name.kb
If the knowledge base file is on a different computer, specify the full network path that the Content Classification server needs to access the file, for example, on Windows:
\\computer_name\directory_path\knowledge_base_name.kb

Tip: If you have a large knowledge base file, and the Management Console and Content Classification server are on separate computers with no shared file systems between them, you can transfer the file to the Content Classification server by using a file transfer program such as FTP, and then select Import from the Data Server computer and specify the Content Classification server path for the file.

Import from the Data Server computer

Select this option to import a large knowledge base file. If the KB file is not on the Data Server computer, ensure that the file is in a shared directory and that the Content Classification server can access the file. The network path that you specify must be relative to the Content Classification server.
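The guidance for choosing between the two import options can be summarized as a small decision function. This is only a sketch: the function name is hypothetical, and treating 30 MB as 30 * 1024 * 1024 bytes is an assumption.

```python
# Assumption: 1 MB is counted as 1024 * 1024 bytes.
SMALL_KB_LIMIT_BYTES = 30 * 1024 * 1024

CONSOLE = "Import from the Management Console computer"
DATA_SERVER = "Import from the Data Server computer"


def suggest_import_option(size_bytes: int, server_can_reach_file: bool) -> str:
    """Suggest an import option from the file size and server connectivity."""
    if size_bytes <= SMALL_KB_LIMIT_BYTES:
        # Small files can be uploaded through the Management Console.
        return CONSOLE
    if server_can_reach_file:
        # Large files are read directly by the Content Classification server.
        return DATA_SERVER
    # Large file that the server cannot reach: copy it to the server first.
    return "Transfer the file to the server, then use: " + DATA_SERVER


print(suggest_import_option(5 * 1024 * 1024, server_can_reach_file=False))
print(suggest_import_option(200 * 1024 * 1024, server_can_reach_file=True))
```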

Properties

Specifies runtime options for the knowledge base. If you change any of the properties, you must restart the knowledge base for the changes to take effect.

Use a cache

Specifies whether the knowledge base is to use a cache. A cache enables the system to handle knowledge bases that are too large to load into memory in one piece. Because holding the statistics for all categories in memory at the same time requires a large amount of memory, such as 1 GB or more, knowledge bases that contain many (thousands of) categories must be loaded in cached mode.

Categories can be organized under a top-level classifier and be physically separated from other classifiers through principal nodes. For example, categories named Honda, Ford, or Saab might be organized under a classifier named Cars. When training a cached knowledge base, each principal node, which contains the set of categories that comprise a classifier, is loaded into memory and trained separately from the other principal nodes. After a classifier is trained, its statistical data is saved to disk, a digest of the results is created, and the next principal node is loaded into memory and trained.

Digests are all that is needed to run matches on a knowledge base. A read-only knowledge base, which contains only digests, does not require a cache. Because digests do not require much memory, all digests can be loaded into memory at the same time.

A read/write knowledge base can handle both matching and training. With a very large category set, a read/write knowledge base might require a cache so that the categories can be partitioned into classifiers (through principal nodes) which can be separately trained.

The read/write instance of a cached knowledge base is primarily suited for submitting feedback. In addition, a cached knowledge base should have at least one read-only instance to handle Suggest requests.

Cache size

Specifies the maximum number of principal nodes that can be loaded into memory at the same time. Typically, this value is 1, which ensures that each set of categories is trained on its own without overburdening memory resources.
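The effect of the cache size on training can be pictured with a toy model. The sketch below is not the product's implementation: the function name, the oldest-first eviction order, and the digest contents are all assumptions made for illustration.

```python
from collections import OrderedDict


def train_cached(principal_nodes, cache_size=1):
    """Train principal nodes one at a time, keeping at most cache_size in memory.

    Returns the digests and the peak number of nodes held in memory.
    """
    if cache_size < 1:
        raise ValueError("cache_size must be at least 1")
    in_memory = OrderedDict()
    digests = {}
    peak_loaded = 0
    for node_name, categories in principal_nodes.items():
        # Make room before loading the next principal node
        # (oldest-first eviction is an assumption).
        while len(in_memory) >= cache_size:
            in_memory.popitem(last=False)
        in_memory[node_name] = list(categories)
        peak_loaded = max(peak_loaded, len(in_memory))
        # Stand-in for training and saving statistics: record a small digest.
        digests[node_name] = {"categories": len(categories)}
    return digests, peak_loaded


# "Fruit" is a hypothetical second classifier added for the example.
nodes = {"Cars": ["Honda", "Ford", "Saab"], "Fruit": ["Apple", "Pear"]}
digests, peak = train_cached(nodes, cache_size=1)
print(digests, peak)
```

With a cache size of 1, the model never holds more than one principal node in memory, which mirrors the typical setting.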

Back up automatically

Automatically creates backup copies of the knowledge base when you make changes to it, such as importing knowledge base statistics, changing feedback options, and adding or removing a read-only instance. Having a backup is useful, for example, if you need to reproduce results from the previous version after the knowledge base is changed.

The backups are created in the Classification_Home/dserverdir/VERSIONS directory on the data server. The file name is the name of the knowledge base concatenated with the backup version number.

Attention:

If your knowledge base has associated learning data, you might have several identical knowledge bases with different version numbers. This occurs when the value of the global feedback frequency is less than the value of the learning data retrain frequency setting.

(The global feedback frequency is defined by the Train when this number of feedbacks are accumulated parameter in the Knowledge Base Training area of the Global Properties window. The learning data retrain frequency is defined by the Retrain frequency parameter in the Associate learning data (SARC file) area of the Knowledge Base Properties window.)

The knowledge base version number is updated each time the number of effective feedbacks reaches the limit that is specified by the global feedback frequency. The version number that is reported by the API is the version of the knowledge base in memory, which is the latest version that is saved in the Classification_Home/dserverdir/VERSIONS directory plus 1.
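The interaction between the two frequencies can be illustrated with a simplified model. In this sketch, a new version is counted every time the global feedback frequency is reached, and the statistics are assumed to change only when enough feedback has accumulated to meet the retrain frequency. The function names and this exact timing are assumptions, not a description of the server's internals.

```python
def reported_version(latest_saved_version: int) -> int:
    """Version reported by the API: latest saved backup version plus 1."""
    return latest_saved_version + 1


def simulate_versions(total_feedbacks, global_feedback_frequency, retrain_frequency):
    """Return (versions_created, unchanged_versions) for a run of feedbacks."""
    versions_created = 0
    unchanged_versions = 0
    pending_for_retrain = 0
    for count in range(1, total_feedbacks + 1):
        pending_for_retrain += 1
        if count % global_feedback_frequency == 0:
            # The version number is bumped at every global feedback limit.
            versions_created += 1
            if pending_for_retrain >= retrain_frequency:
                # Enough feedback accumulated: statistics are retrained.
                pending_for_retrain = 0
            else:
                # Version bumped without retraining: identical statistics.
                unchanged_versions += 1
    return versions_created, unchanged_versions


print(simulate_versions(60, global_feedback_frequency=10, retrain_frequency=30))
print(simulate_versions(60, global_feedback_frequency=30, retrain_frequency=10))
```

In this model, a global feedback frequency smaller than the retrain frequency produces versions whose statistics did not change, while the reverse setting produces none.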

Associate learning data (SARC file)

Select this option to store learning data with the knowledge base. Learning data is configured by using Classification Workbench.

SARC files are stored in Classification Workbench knowledge base project folders: Classification_Home\Classification Workbench\Projects_Unicode\knowledge_base_project_name\KBCache. SARC files on the server are stored in the following directory on the computer where the read/write instance of the knowledge base is running: Classification_Home/data/rw_knowledge_base_name.

Retrain frequency: Specifies the minimum number of new feedbacks that the server must send to the learning data file to trigger an update of knowledge base statistics. The rate at which the server sends feedback to the learning data file is controlled by the Knowledge Base Training settings in the Management Console's Global Properties window.

For more information about learning data, see Saving learning data.

Feedback

Specifies how feedback, which helps the knowledge base grow more accurate over time, is to be processed.

Process as accumulated: Applies feedback as it is accumulated. The feedback options that you specify when you configure global properties determine how frequently the accumulated feedback is applied to train the knowledge base. This default option has the fastest effect on improving the accuracy of knowledge base results.
Defer processing: Delays feedback processing until a later time. Select this option, for example, if you want to analyze the feedback programmatically or analyze it in Classification Workbench before using it to train the knowledge base.
Do not process: Does not apply feedback to the knowledge base.

Servers

Specifies the servers and ports for read/write and read-only instances of the knowledge base.

Read/write server

Specifies the server and port on which you want to run a read/write instance of the knowledge base. Select a server from the list of available servers and specify the port number.

A read/write instance is a server component that can handle read/write and read-only requests on a knowledge base. A single read/write instance must exist for each knowledge base.

Read-only servers

Specifies the servers and ports on which you want to run read-only instances of the knowledge base.

A read-only instance is a server component that can handle read-only requests on a knowledge base. Read-only requests can be processed by the read/write instance or forwarded to a read-only instance depending on the current workload of the read/write instance. Read-only instances are optional components. To provide scalability and enhance performance, you can configure the system to run any number of read-only instances of a specific knowledge base on multiple computers.
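The workload-based forwarding of read-only requests can be sketched as follows. The Instance class, the busy threshold, and the least-loaded selection rule are hypothetical. The product does not document its routing policy at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    host: str
    port: int
    active_requests: int = 0


def route_read_only_request(read_write, read_only_instances, busy_threshold=10):
    """Pick the instance that should handle a read-only request."""
    # The read/write instance handles the request while it is lightly loaded,
    # or when no read-only instances are configured.
    if read_write.active_requests < busy_threshold or not read_only_instances:
        return read_write
    # Otherwise forward to the least-loaded read-only instance (assumed policy).
    return min(read_only_instances, key=lambda inst: inst.active_requests)


rw = Instance("server-a", 18087, active_requests=25)
ro = [Instance("server-b", 18088, 7), Instance("server-c", 18089, 3)]
print(route_read_only_request(rw, ro).host)
```

The host names and port numbers in the usage lines are placeholders only.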

You can add, modify, and remove read-only instances:

To add an instance, click Add, select the server from the list of available servers, and specify the port number.
To change the server on which an instance resides or to change the port number, select the read-only instance and click Properties.
To remove an instance, select the read-only instance and click Remove.

Supported languages

Specifies the languages that the knowledge base is required to support.

Each knowledge base has its own set of supported languages. The classification technology provides suggestions only for texts that are written in a language that the knowledge base supports. A knowledge base can be either monolingual or multilingual, but must support at least one language. If a knowledge base is monolingual, all questions submitted to the knowledge base are assumed to be in that language.

The GenericLanguage option is provided for basic processing of texts in unsupported or partially supported languages. This option is available for monolingual knowledge bases and cannot be selected for use with other languages.

The set of languages that you define for the knowledge base determines memory consumption. Each language requires approximately 20 MB.
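As a rough planning aid, the language rules and the memory estimate can be combined in a small helper. The function name is hypothetical, and the sketch assumes that GenericLanguage costs the same as any other language, which the source does not state.

```python
MB_PER_LANGUAGE = 20  # approximate figure from the text above
GENERIC = "GenericLanguage"


def estimate_language_memory_mb(languages):
    """Validate a supported-language set and estimate its memory use in MB."""
    supported = set(languages)
    if not supported:
        raise ValueError("A knowledge base must support at least one language.")
    if GENERIC in supported and len(supported) > 1:
        raise ValueError("GenericLanguage cannot be combined with other languages.")
    # Assumption: every entry, including GenericLanguage, costs about 20 MB.
    return len(supported) * MB_PER_LANGUAGE


print(estimate_language_memory_mb(["English", "French", "German"]))
```

For the three languages in the usage line, the estimate is 3 × 20 MB = 60 MB.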

You can add or remove supported languages:

To add a language to the supported language set, select the language in the Available languages list and click the arrow that points towards the Supported languages list.
To remove a language from the supported language set, select the language in the Supported languages list and click the arrow that points towards the Available languages list.
Tip: Double-clicking a language in one list moves it to the other list.