Reading in Source Text
Data for text mining can be in any of the standard formats that are used by IBM® SPSS® Modeler, including databases or other "rectangular" formats that represent data in rows and columns, or in document formats, such as Microsoft Word, Adobe PDF, or HTML, that do not conform to this structure.
- To read in text from documents that do not conform to a standard data structure, such as Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Adobe PDF, XML, and HTML files, use the File List node to generate a list of documents or folders as input to the text mining process. For more information, see File List node.
- To read in text from web feeds, such as blogs or news feeds in RSS or HTML formats, use the Web Feed node to format the web feed data for input into the text mining process. For more information, see Web Feed node.
- To read in text from any of the standard data formats used by SPSS Modeler, such as a database with one or more text fields for customer comments, you can use any of the SPSS Modeler source nodes. For more information, see the SPSS Modeler node documentation.
- When you are processing large amounts of data, which might include text in several different languages, use the Language node to identify the language used in a specific field. For more information, see Language node.
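
To make the idea of "generating a list of documents or folders" more concrete, the following is a minimal Python sketch of that concept outside SPSS Modeler. It is not the File List node's implementation and does not use the SPSS Modeler scripting API. The function name `list_documents`, the `recurse` parameter, and the extension set are all hypothetical and chosen only for illustration.

```python
import os

# Hypothetical extension filter for illustration only; the File List node's
# actual options may differ.
DOC_EXTENSIONS = {
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".pdf", ".xml", ".htm", ".html", ".txt",
}


def list_documents(root, extensions=DOC_EXTENSIONS, recurse=True):
    """Return a sorted list of file paths under `root` whose extension
    (compared case-insensitively) is in `extensions`."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in extensions:
                found.append(os.path.join(dirpath, name))
        if not recurse:
            # Emptying dirnames in place stops os.walk from descending
            # into subfolders.
            dirnames.clear()
    return sorted(found)


if __name__ == "__main__":
    # Print one document path per line for the current folder.
    for path in list_documents("."):
        print(path)
```

A list like this, with one document path per record, is the kind of input that a downstream text mining step would then read from.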