Document Classification Training

Document Classification Training is part of the Training tools system.

Follow the training steps:
  1. Upload files associated with a document class.
  2. You can start training when there are at least two classes with 25 files each. A maximum of five pages from each file is used for training.
  3. Three models are returned to you, which gives you the flexibility to decide which model works best for your documents by testing files against the models.
  4. Validate or pick a model either by testing files against the three returned models or by using the validation results that Content Analyzer produces after training.
    Note: It is recommended that you test the files against the models and not base your decision on the validation results returned by Content Analyzer.

    It is recommended that you test the files that belong to a class and files that do not belong to any of the classes that you have trained. You can select the most precise model for your document set.

  5. Before deploying, select the model that works best for your document set by using the intuition report that Content Analyzer provides and by testing files.
  6. After deployment, new system-generated classification words are added to your ontology. These words cannot be modified or deleted.
  7. Document class similarity and intuition reports are also shown on each document class after deployment.
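
The prerequisites in step 2 can be checked before you upload. The sketch below is illustrative only and does not call any Content Analyzer API. The function names and the dictionary layout are hypothetical, and it assumes that "at least two classes with 25 files each" means two or more classes that each hold 25 or more files.

```python
from typing import Dict, List

# Values taken from the training steps above.
MIN_CLASSES = 2
MIN_FILES_PER_CLASS = 25
MAX_PAGES_PER_FILE = 5


def ready_for_training(files_by_class: Dict[str, List[str]]) -> bool:
    """Return True when at least two classes hold 25 or more files each.

    files_by_class is a hypothetical mapping of class name to file names.
    """
    qualifying = [
        name for name, files in files_by_class.items()
        if len(files) >= MIN_FILES_PER_CLASS
    ]
    return len(qualifying) >= MIN_CLASSES


def pages_used_for_training(page_count: int) -> int:
    """Return how many pages of a file count toward training (at most five)."""
    return min(page_count, MAX_PAGES_PER_FILE)
```

For example, a 12-page file contributes only its five-page maximum, so adding very long files does not add proportionally more training material.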

Best Practices

  1. Ensure that documents belonging to one document class do not exist in another document class.
  2. Ensure the document set provided to each document class adequately represents the class.
  3. Test files against the models to determine which model works best for your document set. Include files that do not belong to any of your document classes in the testing.
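
One way to check the first best practice before uploading is to look for files whose contents appear under more than one class. The following is a minimal sketch, not a Content Analyzer feature. The function name and input layout are hypothetical, and it detects only byte-for-byte identical files, not documents that are merely similar.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def find_cross_class_duplicates(
    contents_by_class: Dict[str, List[bytes]]
) -> Dict[str, Set[str]]:
    """Map each SHA-256 digest to the classes it appears in.

    Only digests that occur in more than one class are returned.
    contents_by_class is a hypothetical mapping of class name to raw file bytes.
    """
    classes_by_digest = defaultdict(set)
    for class_name, contents in contents_by_class.items():
        for content in contents:
            digest = hashlib.sha256(content).hexdigest()
            classes_by_digest[digest].add(class_name)
    return {d: c for d, c in classes_by_digest.items() if len(c) > 1}
```

An empty result means that no identical file was placed in two classes. Near-duplicates still need a manual review.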

Model Accuracy

You can get more information about the accuracy results to better understand your document set. The model accuracy intuition report gives you an opportunity to supply better files or more information for training. The intuition report is provided for the training sets at the document class level and for the model.

A model's accuracy is based on the correct predictions that are made for your document classes. Training files are grouped into a data set and then verified against algorithms to predict accuracy. 70% of the data set is used to generate the classifier, and the remaining 30% is compared against the classifier to measure how well it predicts the results. Accuracy is the number of correct predictions divided by the total number of predictions made.
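
The 70/30 evaluation described above can be sketched as follows. This is an illustration of the general holdout technique, not the algorithm that Content Analyzer runs internally. The function names, the `Sample` type, and the random shuffle are assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (document text, document class label)


def split_70_30(
    samples: Sequence[Sample], seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle the data set and split it into 70% training and 30% held-out."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict: Callable[[str], str], held_out: Sequence[Sample]) -> float:
    """Return correct predictions divided by all predictions made."""
    if not held_out:
        raise ValueError("held_out must contain at least one sample")
    correct = sum(1 for text, label in held_out if predict(text) == label)
    return correct / len(held_out)
```

Here `predict` stands in for whatever classifier was built from the 70% portion. Only the 30% portion is passed to `accuracy`, so the score reflects documents that the classifier did not see during training.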

Confidence Scores

The score levels are based on the documents that you provide. Follow the recommendations in the Best Practices section to get good confidence scores.

Model

  1. Low: The model has a low confidence, which might lead to incorrect document classes.
  2. Medium: The model is likely to get a mixture of correct and incorrect classes.
  3. High: The model has a high confidence, which means the model is likely to get all the correct classes. Due to the conservative nature of the confidence scoring, it is possible that some documents cannot be classified successfully.

Possible reasons for low scores

  • Sufficient features might not be extracted from the documents that are provided for each document class.
  • Documents that are submitted under each document class can be similar to other document classes.
  • The document set that is provided is insufficient to get a good confidence score. Consider adding more training files to help improve the confidence levels.

Document Class

  1. Low: Documents belonging to that document class are likely to be incorrectly classified.
  2. Medium: Documents belonging to that document class are likely to be a mixture of correctly and incorrectly classified.
  3. High: Documents belonging to that document class are likely to be correctly classified, although you might get a low accuracy for some documents.

Possible reasons for low scores

  • Sufficient features are not available to correctly identify documents under the document class.
  • Documents that are submitted under the document class can be similar to other document classes.