Sampling Upstream to Save Time

When you have a large amount of data, processing can take anywhere from minutes to hours, especially in an interactive workbench session. The larger the dataset, the longer the extraction and categorization processes take. To work more efficiently, you can add an IBM® SPSS® Modeler Sample node upstream from your Text Mining node. Use this Sample node to draw a random sample, that is, a smaller subset of documents or records, for the first few passes.

A smaller sample is often perfectly adequate for deciding how to edit your resources and even for creating most, if not all, of your categories. Once you have run on the smaller dataset and are satisfied with the results, you can apply the same category-building technique to the entire set of data. Then you can look for documents or records that do not fit the categories you have created and make adjustments as needed.

Note: The Sample node is a standard IBM SPSS Modeler node.
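
The sample-first workflow described above can be illustrated outside of Modeler with a minimal Python sketch. This is not how the Sample node or the Text Mining node is implemented. The function names, the keyword-based stand-in for categories, and the example records are all hypothetical and serve only to show the idea: draw a reproducible random subset, build categories from it, then check the full dataset for records that fit none of them.

```python
import random

def sample_records(records, fraction, seed=42):
    """Return a reproducible simple random sample of the given fraction.

    Hypothetical helper: a stand-in for what a random-percentage
    sample does conceptually, not the Sample node's actual logic.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so repeated passes see the same subset
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)

def uncategorized(records, categories):
    """Return the records that match no keyword in any category.

    Categories are modeled here as {name: [keywords]}, which is a
    deliberate simplification of real text-mining categories.
    """
    keywords = [kw for kws in categories.values() for kw in kws]
    return [r for r in records if not any(kw in r.lower() for kw in keywords)]

# Made-up example data: 100 short text records.
records = [
    "The battery died after a week",
    "Great screen, poor battery",
    "Shipping was slow",
    "Love the screen",
    "Refund took too long",
] * 20

# First passes: work on a 10% sample instead of the full dataset.
subset = sample_records(records, fraction=0.1)

# Categories you might settle on after reviewing the sample (assumed here).
categories = {"battery": ["battery"], "screen": ["screen"]}

# Final pass: apply the categories to all records and find the leftovers.
leftovers = uncategorized(records, categories)
```

Records that end up in the leftovers list are the ones to review when you adjust categories after running on the full dataset.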