Supervised Multi-labeling classifier

The supervised multi-labeling classifier classifies a document into predefined classes and assigns it labels that represent those classes. The set of classes into which documents are classified is defined by the training data, which is a set of documents with correct labels.

Document classification predicts the types, topics, or user-defined categories of a document, based on a set of parameters trained with user-provided training data.

The document classifier implementation of Watson™ Explorer is a supervised multi-labeling classifier. The word supervised means that the classifier uses supervised learning, a type of machine learning that requires training data to learn how to recognize classes of documents based on their metadata and the keywords in their text. The word multi-labeling means that the classifier may assign zero or more labels to a single document.
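
To make "zero or more labels" concrete, the following minimal sketch selects every label whose probability exceeds a threshold. It only illustrates the general multi-labeling idea; the function name, the probabilities, and the default threshold of 0.5 are assumptions for illustration, not the internal algorithm or defaults of Watson Explorer.

```python
def assign_labels(probabilities, threshold=0.5):
    """Return every label whose probability exceeds the threshold.

    Each label is judged independently, so the result may be empty
    or may contain several labels. The 0.5 default is illustrative only.
    """
    return sorted(label for label, p in probabilities.items() if p > threshold)


# No label clears the threshold: the document gets zero labels.
print(assign_labels({"politics": 0.1, "economy": 0.2, "sports": 0.3}))  # []

# Two labels clear the threshold: the document gets both.
print(assign_labels({"politics": 0.7, "economy": 0.9, "sports": 0.1}))  # ['economy', 'politics']
```

A single-label classifier would instead always return exactly one class, typically the one with the highest score.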

Supervised multi-labeling example

Suppose we apply the supervised multi-labeling classifier to news articles to estimate their topics, such as politics, economy, and sports, as in the figure below. A single article may cover more than one topic, so multi-labeling suits this task.

News articles normally have both metadata, such as the publishing media and the publisher, and text, such as the title and body. The multi-labeling classifier of Watson Explorer uses both the metadata and the text as features, in other words, as hints for estimating the classes of documents.

The classifier needs to learn which metadata or keywords in the text are likely to appear in articles on a particular topic. For instance, the publisher in this example, Sports News Inc., would likely appear in sports articles, and words such as evolution and technology would likely appear in science articles. Training data is used to learn such relationships between the contents of documents and their expected labels. Training data consists of pairs of a document (a news article in this example) and its expected labels (topics in this example). Such a pair of a document and a set of labels is normally called a training example in the machine learning field.
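
As a rough illustration of what a training example contains, the sketch below models one as a document (metadata plus text) paired with its expected labels. The class name, field names, article text, and the publisher "Daily Gazette" are hypothetical and chosen only for this example; they do not reflect the storage format used by Watson Explorer.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    """A document paired with its expected labels (hypothetical structure)."""
    metadata: dict  # e.g. publisher
    text: dict      # e.g. title and body
    labels: list    # expected topics for this document


# Illustrative data only; the article contents are invented.
examples = [
    TrainingExample(
        metadata={"publisher": "Sports News Inc."},
        text={"title": "Home team wins the final", "body": "The match went to extra time."},
        labels=["sports"],
    ),
    TrainingExample(
        metadata={"publisher": "Daily Gazette"},
        text={"title": "New technology in stadiums", "body": "Sensor technology tracks every play."},
        labels=["sports", "science"],
    ),
]


def defined_classes(training_examples):
    """Collect the distinct labels that appear anywhere in the training data."""
    return sorted({label for ex in training_examples for label in ex.labels})


print(defined_classes(examples))  # ['science', 'sports']
```

Because the set of classes comes from the training data, a topic that never appears as a label in any training example cannot be predicted later.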

After training is completed, the classifier can predict the topics of a given document based on its content. In this example, sports and science are the predicted topics for the document on the left.

Usage

A standard procedure for applying supervised multi-labeling to new documents is as follows:

  1. Prepare training data.

    Prepare a set of documents with correct class labels and ingest the documents as a dataset. All data formats supported by crawlers and converters are acceptable as the training data format, but CSV files imported by the CSV Importer and JSON files imported by the File system crawler are recommended.

  2. Create and train a classifier using the training data.
  3. Evaluate the accuracy of multi-labeling.
  4. Deploy a trained model as a classifier instance to annotate new documents with labels.
  5. Create a new collection by specifying the collection template created with the classifier.
  6. Check the prediction result in the miner: add a new dataset to the collection and create an index.
  7. Check the prediction result with the Realtime NLP API.

    The Realtime NLP API returns the labels predicted by the classifier together with their probabilities. Note that the collection must be associated with an enrichment that uses a classifier instance.

    If one or more labels are predicted with probabilities over the given threshold, each label and its probability are included as a pair in the metadata field of the API response under the key classes. The stored field also includes the prediction result, but it contains only a list of labels.

    API response example:

    {
      "id": "rd10xx",
      "tag": "crqsl111",
      "removed": false,
      "metadata": {
        "classes": [
          {
            "label": "label-1",
            "probability": 0.8
          },
          {
            "label": "label-2",
            "probability": 0.24
          }
        ]
      },
      "enriched": {
        "body": [
          {
            "text": "this is a field",
            "properties": {
              "locale": "en"
            },
            "features": [],
            "annotations": [
              {
                "id": "bd7230ca-c4cf-495a-b783-3182cf8a958d",
                "type": "uima:customtype",
                "text": "this",
                "beginIndex": 0,
                "endIndex": 4,
                "properties": {}
              }
            ]
          }
        ]
      },
      "stored": {
        "predicted_label": [
          "label-1",
          "label-2"
        ]
      }
    }
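
A response shaped like the example above could be read on the client side roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the sample is a trimmed copy of the example response, and tolerating a missing classes key is a defensive guess about what the API returns when no label is predicted.

```python
import json


def extract_predictions(response_text):
    """Return (label, probability) pairs from metadata.classes, highest first."""
    response = json.loads(response_text)
    # Assumption: "metadata" or "classes" may be absent when nothing is predicted.
    classes = response.get("metadata", {}).get("classes", [])
    pairs = [(c["label"], c["probability"]) for c in classes]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Trimmed copy of the example response, with the classes listed out of order.
sample = """
{
  "id": "rd10xx",
  "metadata": {"classes": [
    {"label": "label-2", "probability": 0.24},
    {"label": "label-1", "probability": 0.8}
  ]},
  "stored": {"predicted_label": ["label-1", "label-2"]}
}
"""

for label, probability in extract_predictions(sample):
    print(f"{label}: {probability:.2f}")
```

Reading from metadata rather than stored keeps the probabilities available, which is useful if the caller wants to rank labels or apply a stricter cutoff of its own.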

Applies to version 12.0.2.2 and subsequent versions unless specifically overridden. The figure below shows the data flow of the above process.

Questions

How many documents are required for training data?
It depends on the classification task to which the training data is applied. As a rule of thumb, at least 10 training examples are needed for each label that the classifier should predict, but that number varies depending on the amount of information in each document, the quality of each document, and the target accuracy for the task. At the bare minimum, the training data must contain 3 or more documents for the classifier to work without error, although that does not guarantee good predictions.
How many fields should documents in training data have?
At least two: an answer field that contains the correct labels, and a field that is used to predict those labels.
How do I add multiple labels to documents in training data?
You can specify multiple labels by giving a string value like ["label-1","label-2","label-3"] in the answer field of the document. Note that square brackets, double quotes, and commas must be escaped appropriately according to the file format of the training data.
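
For CSV training data, one way to get the escaping right is to let a CSV library do it. The sketch below uses Python's standard csv and json modules to write the answer field as a list string, then reads the file back to count training examples per label against the rule of thumb of 10. The field names body and labels and all function names are hypothetical; use the field names defined in your own dataset.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical field names, chosen only for this example.
TEXT_FIELD = "body"
ANSWER_FIELD = "labels"


def write_training_csv(rows, out):
    """Write (text, labels) pairs as CSV with labels as a list string."""
    writer = csv.DictWriter(out, fieldnames=[TEXT_FIELD, ANSWER_FIELD])
    writer.writeheader()
    for text, labels in rows:
        # Produces ["label-1","label-2"]; csv then quotes the field and
        # doubles the embedded double quotes as the CSV format requires.
        answer = json.dumps(labels, separators=(",", ":"))
        writer.writerow({TEXT_FIELD: text, ANSWER_FIELD: answer})


def count_examples_per_label(csv_text):
    """Count how many training examples carry each label."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts.update(json.loads(row[ANSWER_FIELD]))
    return counts


def labels_below(counts, minimum=10):
    """List labels with fewer examples than the rule-of-thumb minimum."""
    return sorted(label for label, n in counts.items() if n < minimum)


buffer = io.StringIO()
write_training_csv(
    [("first article", ["label-1", "label-2"]), ("second article", ["label-1"])],
    buffer,
)
print(buffer.getvalue())
print(labels_below(count_examples_per_label(buffer.getvalue())))
```

In the written file, the answer field of the first row appears as "[""label-1"",""label-2""]", which is the CSV-escaped form of the list string. Other file formats, such as JSON, have their own escaping rules.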