File List Node

The File List node generates a list of documents or folders as input to the text mining process. Use it to read text from unstructured documents saved in formats such as Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Adobe PDF, XML, and HTML, among others. The node is necessary because unstructured text documents cannot be represented by fields and records (rows and columns) in the same manner as other data used by IBM® SPSS® Modeler. This node can be found on the Text Mining palette.

The File List node functions as a source node. Besides reading and outputting the actual data of the source files, the node can instead read the names of the documents or directories below the specified root and produce these as a list. When used to read document or directory names, the output is a single field with one record for each file listed, which can be selected as input for a subsequent Text Mining or Text Link Analysis node.
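The list output described above can be pictured with a short sketch in plain Python. This is only an illustration of the idea of "one record per file below a root", not the node's implementation or an IBM SPSS Modeler API; the function name `list_file_names` is hypothetical.

```python
import os


def list_file_names(root):
    """Return one record (a full path string) per file found below root.

    Hypothetical helper for illustration only: it mimics the shape of the
    File List node's list output (a single field, one record per file),
    not the node's actual behavior or settings.
    """
    records = []
    # os.walk visits root and every directory beneath it.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            records.append(os.path.join(dirpath, name))
    # Sort so the result does not depend on file system ordering.
    return sorted(records)


if __name__ == "__main__":
    # List every file below the current directory, one per line.
    for record in list_file_names("."):
        print(record)
```

Each string in the returned list plays the role of one record in the single output field that a downstream node would consume.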

You can find this node on the IBM SPSS Modeler Text Analytics tab of the nodes palette at the bottom of the IBM SPSS Modeler window. See the topic IBM SPSS Modeler Text Analytics Nodes for more information.

Important: Directory names and file names containing characters that are not included in the machine's local encoding are not supported. When you execute a stream containing a File List node, any file or directory name containing such characters causes the stream execution to fail. This can happen with directory or file names written in a language that the locale's encoding does not cover, such as a Japanese file name on a French locale.
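One way to look for such names before running a stream is to test whether each name can be encoded in the machine's encoding. The following is a minimal sketch in plain Python, not part of IBM SPSS Modeler. The helper names `is_supported_name` and `find_unsupported_names` are hypothetical, and the sketch assumes that the encoding reported by `locale.getpreferredencoding` matches the local encoding the product uses, which the documentation does not confirm.

```python
import locale
import os


def is_supported_name(name, encoding):
    """Return True if every character of name can be encoded in encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def find_unsupported_names(root, encoding=None):
    """Return paths below root whose file or directory name cannot be encoded.

    Assumption: when no encoding is given, the preferred locale encoding
    stands in for the machine's local encoding.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    unsupported = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Check directory names as well as file names.
        for name in dirnames + filenames:
            if not is_supported_name(name, encoding):
                unsupported.append(os.path.join(dirpath, name))
    return sorted(unsupported)


if __name__ == "__main__":
    # cp1252 is used here as an example of a Western European encoding.
    print(is_supported_name("résumé.doc", "cp1252"))
    print(is_supported_name("報告書.pdf", "cp1252"))
```

An empty result from `find_unsupported_names` only means that no name failed this particular check; it is not a guarantee that the stream will execute.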

Local data support. If you are connected to a remote IBM SPSS Modeler Text Analytics Server and have a stream with a File List node, either the data must reside on the same machine as the IBM SPSS Modeler Text Analytics Server, or the server machine must have access to the folder where the source data for the File List node is stored.

Note: You cannot use the File List node for scoring within an IBM SPSS Collaboration and Deployment Services - Scoring configuration.