Parameswaran Venkatraman is a software engineer on the IBM StoredIQ team in the Enterprise Content Management group at IBM India Software Labs. He has a bachelor's degree in Information Technology and over four years of IT experience in the design and development of front ends, APIs, and visual effects. As part of the engineering team, he works extensively across the StoredIQ product suite and applications. His primary responsibilities are designing and implementing features for the user interface and application layer.
This blog is about using IBM StoredIQ to automatically classify documents at scale.
What is IBM StoredIQ?
IBM StoredIQ allows enterprises to identify, analyze, and manage their data at scale. The core platform (the gateway and data servers) has a distributed architecture that provides scalability. In addition to this, the application stack provides users with a number of applications to identify and act on data.
IBM StoredIQ uses information sets, abbreviated as infosets, to identify all data that shares some common properties and to encapsulate that data in a set. IBM StoredIQ has two types of infosets:
- System Infosets
- User Infosets
Here we consider only the user infosets.
The first step in using IBM StoredIQ to manage data is to create such infosets of data with common properties. This is done by using IBM StoredIQ Data Workbench which allows users to create their own filters using metadata, full-text, and properties. These filters can be applied to a system infoset or user infoset to generate a new user infoset. Filters can be created using a number of attributes, some of which are specific to the data source on which the object resides.
IBM StoredIQ Data Workbench provides a form-view, which allows users to build filters using a combination of select boxes and text fields. This is a quick and simple method of identifying data of a certain type. In some cases, however, describing a data object with such filters becomes complex, especially when several attributes must be combined or excluded. This can be done using the code-view, which allows an arbitrary number of filter statements to be combined using several forms of conjunction and negation. For example, an enterprise needs to identify all documents (DOC/PDF/WORD/TXT) created five years ago that are either in a specific folder, have a specific name, or contain specific phrases, and that are between 2 MB and 5 MB in size. These could be documents that were created using a proprietary tool or template and share the properties mentioned above. In such cases the code-view is an excellent tool for identifying data.
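The logic of the example filter above can be sketched in plain Python. This is not StoredIQ code-view syntax; the `Doc` class, the `matches` function, and all parameter names are hypothetical, and "created five years ago" is assumed here to mean "created on or before a cutoff date".

```python
from dataclasses import dataclass
from datetime import date

MB = 1024 * 1024  # bytes per megabyte (binary convention, an assumption)


@dataclass
class Doc:
    """Hypothetical stand-in for a data object and its metadata."""
    path: str
    created: date
    size_bytes: int
    text: str = ""


def matches(doc, cutoff, folder, name, phrases):
    """Return True if doc satisfies every clause of the example filter."""
    filename = doc.path.rsplit("/", 1)[-1]
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""

    right_type = ext in {"doc", "docx", "pdf", "txt"}
    old_enough = doc.created <= cutoff
    # The three alternatives are joined with OR ...
    located = (
        doc.path.startswith(folder.rstrip("/") + "/")
        or filename == name
        or any(p.lower() in doc.text.lower() for p in phrases)
    )
    right_size = 2 * MB <= doc.size_bytes <= 5 * MB
    # ... and the groups are joined with AND.
    return right_type and old_enough and located and right_size


doc = Doc("/contracts/legacy/lease.pdf", date(2010, 1, 1), 3 * MB,
          "standard lease agreement")
print(matches(doc, date(2011, 1, 1), "/contracts/legacy", "other.doc", ["nda"]))
```

Even this small example needs one OR group nested inside four AND clauses, which is the kind of structure that is awkward to express with select boxes alone.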
The code-view is a powerful tool on its own, so one may wonder what the problem is. The standard filter attributes are great for identifying data when the connection between the data is evident. But what if this connection is not so evident, or simply cannot be translated into a filter? There is also the question of how reliable the results generated by such a filter are, in other words, how many false positives or false negatives it produces.
This is where auto-classification can help. IBM StoredIQ Data Workbench supports automatic classification of documents using a combination of machine learning, natural language processing, and semantic analysis. Consider a corporation that processes a large number of contracts every year. These contracts range from simple lease agreements to sensitive defense contracts. Each contract is significantly different from the others and must be handled accordingly. When dealing with millions of these documents, categorizing them becomes all the more important.
How does auto-classification work?
IBM StoredIQ uses a classification model from IBM Content Classification in the infoset generation process to automatically classify data. To use auto-classification, a classification model must be created and uploaded in IBM StoredIQ Administrator. The classification model itself is created using IBM Content Classification and an initial corpus of data. A classification model comprises a decision plan and a knowledge base. Once the categories against which documents should be classified have been identified, IBM Content Classification can ingest the corpus data for each category into its knowledge base. This knowledge base serves as the baseline for identifying and classifying documents. While classifying documents into specific categories, the auto-classification process also assigns a confidence score to each document for each category. For example, a document may be classified as type X with a score of 0.25.
Depending on the use case, this score can be treated either as too low to include the document (a narrow search with a high threshold) or as high enough to include it (a broad search with a low threshold) for that category. Similarly, if the same document has a score of 0.25 for type X and 0.6 for type Y, then selecting by the highest score results in the document being included in a filter for type Y.
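The two ways of reading a score can be illustrated with a minimal Python sketch. The function names and the threshold values 0.2 and 0.5 are hypothetical choices for illustration; only the scores 0.25 and 0.6 come from the example above.

```python
def top_category(scores):
    """Return the category with the highest confidence score."""
    return max(scores, key=scores.get)


def passes(scores, category, threshold):
    """Return True if the document's score for category meets the threshold."""
    return scores.get(category, 0.0) >= threshold


# Per-category confidence scores for one document, as in the example.
scores = {"X": 0.25, "Y": 0.6}

print(top_category(scores))       # Y
print(passes(scores, "X", 0.2))   # True: a broad search accepts 0.25
print(passes(scores, "X", 0.5))   # False: a narrow search rejects 0.25
```

The same score of 0.25 is accepted or rejected purely as a function of the threshold, which is why the threshold has to be chosen per use case.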
After a classification model has been created, it can be uploaded in IBM StoredIQ Administrator. Before this model can be used to filter an infoset, that infoset must be enhanced to support the model. This can be done from the "Enhancements" tab in IBM StoredIQ Data Workbench. Multiple classification models can be created, and each infoset can be enhanced using any number of models. When an infoset has been enhanced using a classification model, that model is available under the "Auto-Classify" filter. Selecting that model lists the categories available for that model. One or more categories can be selected for the filter. By default, a document is included in the filter results if it belongs to any of the selected categories, but it is also possible to include a document only if a selected category is the one in which it has its highest score. The score threshold can be controlled as well, for example, to include only documents whose score in a category is greater than 0.6.
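The filter options described above (selected categories, score threshold, and highest-score-only) can be sketched as follows. This is an illustration of the selection logic under stated assumptions, not the StoredIQ implementation; `auto_classify_filter`, its parameters, and the sample scores are all hypothetical.

```python
def auto_classify_filter(docs, selected, threshold=0.0, highest_only=False):
    """Return names of documents that match the auto-classify options.

    docs         -- mapping of document name to {category: score}
    selected     -- categories chosen for the filter
    threshold    -- a score must be strictly greater than this value
    highest_only -- if True, the matching category must also be the
                    document's highest-scoring category
    """
    results = []
    for name, scores in docs.items():
        if not scores:
            continue
        best = max(scores, key=scores.get)
        for category in selected:
            score = scores.get(category)
            if score is None or score <= threshold:
                continue
            if highest_only and category != best:
                continue
            results.append(name)
            break  # one matching category is enough
    return results


docs = {
    "a.pdf": {"A": 0.7, "B": 0.2},
    "b.pdf": {"A": 0.65, "B": 0.8},
    "c.pdf": {"A": 0.3, "C": 0.5},
}
print(auto_classify_filter(docs, {"A"}, threshold=0.6))
# ['a.pdf', 'b.pdf']
print(auto_classify_filter(docs, {"A"}, threshold=0.6, highest_only=True))
# ['a.pdf']
```

In the second call, b.pdf drops out because its score for A exceeds 0.6 but its highest-scoring category is B.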
Picking up on the example scenario from earlier, assume an enterprise user is responsible for three types of contracts: A, B, and C. These contracts contain sensitive information, and as part of the enterprise's agreement with its customers, the contracts must be retained on a specific repository and cannot be stored on any other repository for more than 24 hours. Each contract can have different content depending on the entity with which the contract has been entered into. For example, contract type A can have different content for two different clients.
Using auto-classification, it is possible to periodically harvest an enterprise's data sources and classify data accordingly. To do this, a classification model for categories A, B, and C is created using a corpus for each category. This classification model is then uploaded to IBM StoredIQ Administrator to create a model that can be used to enhance infosets. A user infoset is selected and enhanced using this model. A new filter is created using the "Auto-Classify" option and selecting the model that was just created. As the user is interested in all three contract types, all the categories under that model are selected.
Because the documents contain sensitive information, an appropriate threshold is selected to minimize false negatives. This filter can be saved to the filter library and reused later. The last step is to create an infoset using the filter. After the infoset creation is complete, the user can verify the objects listed in that infoset.
Every time the auto-classification filter is used, it identifies new documents that can be used to improve the corpus. Thus the classification model is self-sustaining. After creating an infoset using the filter, the copy action can be used to copy the objects to a location that can be accessed by IBM Content Classification. The classification model can then be regenerated using the updated corpus.
Auto-classification with User Feedback
Auto-classification filters can be used to filter documents based on the model's categories. But in some cases a document may be incorrectly classified or may have a higher score for a category to which it does not belong. In such cases IBM StoredIQ allows users to provide feedback for each document. Once sufficient feedback has been collected for a model, the administrator can choose to retrain the model. Retraining is the process of applying the feedback to the decision plan and knowledge base. Once the model has been retrained, the infoset must be enhanced with that model again. After this process, the documents' scores will have changed based on the feedback provided in the previous step.
Feedback can be submitted through the "Filter" tab of IBM StoredIQ Data Workbench. The desired model must be selected as the "Auto-Classify" filter. The categories and score can be adjusted based on the data that the user is targeting. For example, if the user wants to inspect only contracts of type A, they can leave the other categories unchecked. This results in a more refined subset of documents, which the user can inspect to determine whether they have been correctly classified. Once the filter has been finalized, clicking "Preview Filter Results" returns the objects matching the filter. Clicking a document in the displayed list opens the document viewer, which contains a selection box with the categories and scores. If a document has been correctly classified but the score is not as high as the user expects, the user can select the same category and submit feedback. If a document has been incorrectly classified or does not have the highest score for its correct category, the user can select the correct category and submit feedback.
As users submit feedback on each document, the feedback is aggregated. Information on the feedback received for each model is available in IBM StoredIQ Administrator under the "Auto Classification" tab. Selecting a classification model from the list of available models displays the feedback received for that model. If the administrator feels that sufficient feedback has been received, they can initiate a retrain of that model by clicking "Retrain". This moves the model to the "Retraining" state, and the model cannot be used for enhancing infosets until the training completes. When the model returns to the "Available" state, it can be used to enhance an infoset. Running the filter after enhancing the infoset again returns the documents with updated scores based on the feedback.
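The feedback and retraining lifecycle described above can be modeled as a small state machine. This is a conceptual sketch only: the `Model` class and its methods are hypothetical and do not correspond to any StoredIQ API. Only the state names "Available" and "Retraining" are taken from the description.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Model:
    """Hypothetical stand-in for a classification model and its feedback."""
    name: str
    state: str = "Available"
    # Counts of (assigned category, category chosen by the reviewer).
    feedback: Counter = field(default_factory=Counter)

    def submit_feedback(self, assigned, chosen):
        """Aggregate one piece of user feedback for a document."""
        self.feedback[(assigned, chosen)] += 1

    def can_enhance(self):
        """Infosets can be enhanced only while the model is available."""
        return self.state == "Available"

    def start_retrain(self):
        if self.state != "Available":
            raise RuntimeError(f"model is already {self.state}")
        self.state = "Retraining"

    def finish_retrain(self):
        self.state = "Available"


model = Model("contracts")
model.submit_feedback("B", "A")  # misclassified: reviewer picked A instead of B
model.submit_feedback("B", "A")
model.submit_feedback("A", "A")  # correct category, reviewer confirms it

model.start_retrain()
print(model.can_enhance())       # False while retraining
model.finish_retrain()
print(model.can_enhance())       # True once the model is available again
```

Keeping feedback as counts of (assigned, chosen) pairs makes it easy to see which categories are most often confused before deciding whether a retrain is worthwhile.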
Although auto-classification is a powerful tool for classifying and filtering documents, the accuracy of the model depends on the corpus. If the initial training set is small, the accuracy of the model decreases significantly. Similarly, if new documents added to the corpus are drastically different from the original corpus, accuracy is adversely affected. In some cases it is more prudent to regenerate the corpus if the documents have changed significantly.