Understanding model accuracy
Document Processing Extension provides a pre-trained model for common document types. As you add sample documents and fields, the overall model is trained to match your specific requirements. You evaluate the resulting models to determine which model to use for your project.
Best practices for model building
- Ensure that documents in one document type do not also appear in another document type.
- Ensure that the document set provided for each document type adequately represents the type.
- Ensure that you use the same number of files to train or test each of your document types, including the pre-trained document types.
- Test files against the models to determine which model works best for your document set. Include test files that do not belong to any of your document types.
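The balance check from the best practices above can be sketched as a small script, assuming your sample documents are organized in one folder per document type (the folder layout and function names here are illustrative assumptions, not part of the product):

```python
from collections import Counter
from pathlib import Path


def count_samples(root: str) -> Counter:
    """Count the sample files under each document-type folder."""
    counts = Counter()
    for type_dir in Path(root).iterdir():
        if type_dir.is_dir():
            counts[type_dir.name] = sum(1 for f in type_dir.iterdir() if f.is_file())
    return counts


def underrepresented(counts: Counter) -> list[str]:
    """Flag document types with fewer samples than the largest type."""
    if not counts:
        return []
    target = max(counts.values())
    return sorted(name for name, n in counts.items() if n < target)
```

Running `underrepresented` over the counts tells you which document types need more files before the training and test sets are balanced.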
Model accuracy
- Confidence scores
- The score levels are based on the documents provided. Using the best practices for model building ensures better confidence scores.
- Possible reasons for low scores
    - Sufficient features might not be extracted from the documents that are provided for each document type.
    - Documents that are submitted under one document type might be similar to documents in other document types.
    - The document set provided is insufficient to produce a good confidence score. Consider adding more training files to help improve the confidence levels.
    - Insufficient features are available to correctly identify documents in the document class.
Evaluating the models
- Classification
- It might seem obvious that the model with the highest confidence score is the appropriate model
to choose for your project. However, when you evaluate the models, you should consider the accuracy
for your most important document types.
For example, if a document from one of your document types is misclassified with low confidence, that might mean that the model has the correct information about the document type but encountered an unusual sample.
A low confidence classification means that if your model encounters that document type in a processing application, the application flags it for a user to fix the classification.
On the other hand, if you have a model that classifies an invoice as a bill of lading with a high degree of confidence, you can see that the model is not going to be as useful when it is time to process documents. If the model is not accurate, check where your known categories do not match the document types that the model assigns.
Adding more samples can help refine the model.
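The two cases described above, low-confidence flags and high-confidence misclassifications, can be separated with a simple triage pass. The result tuples and the 0.8 threshold below are illustrative assumptions; adapt them to whatever your processing application actually returns:

```python
def triage(results, threshold=0.8):
    """Split classification results into confident errors and low-confidence flags.

    Each result is a tuple: (known type, predicted type, confidence score 0-1).
    """
    confident_errors, flagged = [], []
    for known, predicted, score in results:
        if score < threshold:
            # Low confidence: the application flags these for a user to fix.
            flagged.append(known)
        elif known != predicted:
            # Confident but wrong: the model likely needs more or better samples.
            confident_errors.append((known, predicted))
    return confident_errors, flagged
```

A nonempty `confident_errors` list is the warning sign: those misclassifications would flow through processing without being flagged for review.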
- Extraction
- An extraction model tells you how many fields or values it found, with low, medium, or high
confidence.
However, this model does not know whether the values that are extracted are what you expect them to be. Check the extraction results to ensure that you are getting the expected values.
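Checking extraction results can also be partly automated: walk the expected fields and collect anything missing, empty, or extracted with low confidence. The dictionary shape (field name mapped to a value and a low/medium/high band) is an assumption for illustration:

```python
def review_extraction(extraction, expected_fields):
    """Return the expected fields that are missing, empty, or low confidence.

    `extraction` maps field name -> (extracted value, confidence band),
    where the band is one of "low", "medium", or "high".
    """
    needs_review = []
    for field in expected_fields:
        value, confidence = extraction.get(field, (None, "low"))
        if value is None or confidence == "low":
            needs_review.append(field)
    return needs_review
```

Even fields that pass this check still need a spot check by eye, because a high-confidence value can be well-formed yet still not be the value you expected.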