Planning for classifying text

Before you use text classification in IBM RPA, review a few key concepts so that you can make an informed business decision.

The following considerations describe how text classification works in IBM RPA and what you need to prepare before you use it.

Preprocessing and cleaning data
You don't need to preprocess the text and remove unneeded words or sentences. The chosen algorithms automatically clean and normalize the text according to the specified groups or tags.

Folder structure to build a text classification model
When you build a text classification model, you must create a folder structure of documents that serves as the basis for the labels in your model. The folder structure must form a balanced tree, with text files as the leaf nodes of the tree.

The folder names are used as labels for classifying text, and the outermost folders (those farthest from the root) must contain the text files. Each text file must contain words or sentences that relate to the subject of the group that its folder represents.

The text classification algorithms ignore the names of the files. Within a label, you can therefore split the content across several files, for example by subject or by synonyms, as long as all of those files belong to the same label.

The following diagram represents a valid folder structure:

[Figure: Text classification folder structure]

Consider the scenario where a company wants to classify the feedback from their customers on their product. The company's IBM RPA developers built the following folder structure:

[Figure: Text classification model example, with a root node and folders for customer feedback on the product and their website, questions asked to the support channel, and bug reports]

Notice that the name of the root folder is not considered for text classification. Only the folders below it are used as labels. You can also have folders inside each folder, but ensure that every branch of the tree has the same height.

If you add folders or files and one of the nodes of the tree becomes unbalanced, so that its branches have different heights, you get an error message stating that the tree is unbalanced. Ensure that the tree is always balanced when you build a new text classification model.
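
Before you build a model, it can help to check the folder structure locally. The following Python sketch is one possible interpretation of the balance rule and is not the validation that IBM RPA performs: it assumes that "balanced" means every file sits at the same depth below the root folder, and it also reports empty folders. The function name `find_balance_problems` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def find_balance_problems(root):
    """Return a list of human-readable problems; an empty list means the tree looks balanced."""
    root = Path(root)
    problems = []

    # Depth of a file = number of folders between the root folder and the file.
    depths = {
        path: len(path.relative_to(root).parts) - 1
        for path in root.rglob("*")
        if path.is_file()
    }
    if not depths:
        return [f"{root.name}: no files found"]

    # Assumption: an empty folder is a label without training text, so report it.
    for folder in root.rglob("*"):
        if folder.is_dir() and not any(folder.iterdir()):
            problems.append(f"{folder.relative_to(root).as_posix()}: empty folder")

    # Treat the most common depth as the intended one and report every outlier.
    expected, _ = Counter(depths.values()).most_common(1)[0]
    for path, depth in sorted(depths.items()):
        if depth != expected:
            rel = path.relative_to(root).as_posix()
            problems.append(f"{rel}: file at depth {depth}, expected {expected}")
    return problems
```

Calling `find_balance_problems("feedback")` on a hypothetical root folder named `feedback` returns an empty list when every file sits at the same depth, and one message per outlier otherwise.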

Supported document formats
IBM RPA supports the following formats: txt, pdf, docx, and xlsx.

The most commonly used file formats are txt and xlsx.
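
To catch files in other formats before you build a model, you can scan the folder structure for unexpected extensions. The following Python sketch uses only the four formats listed above. The function name `unsupported_files` is hypothetical, and comparing extensions case-insensitively is an assumption, not documented IBM RPA behavior.

```python
from pathlib import Path

# Formats listed in this topic: txt, pdf, docx, and xlsx.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx", ".xlsx"}


def unsupported_files(root):
    """Return the relative paths of files whose extension is not in the supported list."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        # Assumption: extensions are compared case-insensitively.
        if path.is_file() and path.suffix.lower() not in SUPPORTED_EXTENSIONS
    )
```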

Text classification algorithms
You can use the algorithms that are provided by the tool to build your text classification model. Picking the right algorithm is essential for your model to work as expected.

To learn the differences between the algorithms and how to use them effectively, see Text classification models.

Training data
After you build the model, you must train it with data to improve the model's accuracy.

You can't upgrade models. If you need to improve an existing model, you must build it again with a new data set that combines the previous training data with the additional data. The build operation always overwrites the previous model if you use an existing model name.
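
Because a rebuild needs the previous training data together with the new data, you might prepare the combined data set as a separate folder first. The following Python sketch shows one way to do that, assuming both data sets use the same label folders. The function name `merge_training_sets` and its parameters are hypothetical. When two files share the same path, the sketch renames the second file instead of overwriting the first, which relies on the fact that file names are not considered by the algorithms.

```python
import shutil
from pathlib import Path


def merge_training_sets(previous, additional, combined):
    """Copy both folder trees into `combined`, keeping every file. Return the number of files copied."""
    combined = Path(combined)
    copied = 0
    for source in (Path(previous), Path(additional)):
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            target = combined / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            # On a name collision, pick a new file name rather than overwrite existing data.
            counter = 1
            while target.exists():
                target = target.with_name(f"{path.stem}_{counter}{path.suffix}")
                counter += 1
            shutil.copy2(path, target)
            copied += 1
    return copied
```

The combined folder can then serve as the input for a new build. Use a new model name if you want to keep the previous model.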

Model output
The trained model is uploaded to the server's repository. You can test the model in the IBM RPA Control Center and use the model in bots that are connected to the same tenant to which the model was published.

Use the Classify Text (classifyText) command to classify a text with the trained model.

What to do next

Knowing the basics of text classification is essential for building an accurate model. After you consider these key points and decide to build a new model for your business, learn how to use the tool to your advantage. See the following topics to learn more about using the text classification tool.

To learn the differences between text classification algorithms in more detail, see Text classification algorithms.

To build and train your own text classification model, see Training a text classification model.

To learn how to use the text classification model in your scripts, see Using a text classification model.