Web Feed Node: Input Tab

Use the Input tab to specify one or more Web addresses (URLs) from which to capture text data. In the context of text mining, you can specify URLs for feeds that contain text data.

Important: When working with non-RSS data, you may prefer to use a web scraping tool, such as WebQL®, to automate content gathering and then refer to the output from that tool using a different source node.

You can set the following parameters:

Enter or paste URLs. In this field, you can type or paste one or more URLs. If you are entering more than one, enter only one per line and use the Enter/Return key to separate lines. Enter the full URL path to the file. These URLs can be for feeds in one of two formats:

  • RSS format. RSS is a simple, standardized, XML-based format for Web content. The URL for this format points to a page that has a set of linked articles, such as syndicated news sources and blogs. Because RSS is a standardized format, each linked article is automatically identified and treated as a separate record in the resulting data stream. No further input is required to identify the important text data and the records in the feed, unless you want to apply a filtering technique to the text.
  • HTML format. You can define one or more URLs to HTML pages on the Input tab. Then, on the Records tab, define the record start tag, identify the tags that delimit the target content, and assign those tags to the output fields of your choice (description, title, modified date, and so on). See the topic Web Feed Node: Records Tab for more information.
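As a rough illustration of the two ideas above (one URL per line, and one record per RSS article), the following Python sketch may help. It is not the node's implementation; the function names, the sample feed, and the output field names are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be retrieved from one of the URLs.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <description>Text of the first article.</description>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second article</title>
      <description>Text of the second article.</description>
      <pubDate>Tue, 02 Jan 2024 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_url_field(text):
    """Split pasted field contents into one URL per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rss_to_records(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            # Field names are illustrative, not the node's actual output fields.
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "modified_date": item.findtext("pubDate", default=""),
        })
    return records


urls = parse_url_field("http://example.com/news.xml\n\n  http://example.com/blog.xml  \n")
records = rss_to_records(SAMPLE_RSS)
```

Because every article sits in its own item element, no tag mapping is needed for RSS. An HTML page has no such convention, which is why the Records tab asks for the record start tag and the content tags.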

Number of most recent entries to read per URL. This field specifies the maximum number of records to read for each URL listed in the Enter or paste URLs field, starting with the first record found in the feed. The amount of text affects processing speed during extraction downstream in a Text Mining node or Text Link Analysis node.
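The effect of this limit can be pictured with a minimal Python sketch. The function name limit_entries and the sample data are hypothetical and only illustrate the per-URL cap described above.

```python
def limit_entries(records_by_url, max_entries):
    """Keep at most max_entries records per URL, counting from the first record."""
    if max_entries < 0:
        raise ValueError("max_entries must be non-negative")
    return {url: records[:max_entries] for url, records in records_by_url.items()}


# Hypothetical records already read from two feeds, in feed order.
feeds = {
    "http://example.com/news.xml": ["n1", "n2", "n3", "n4"],
    "http://example.com/blog.xml": ["b1"],
}
limited = limit_entries(feeds, 2)
```

The cap applies to each URL separately, so a feed with fewer records than the limit is read in full.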

Save and reuse previous web feeds when possible. With this option, web feeds are scanned and the processed results are cached. On subsequent stream executions, if the contents of a given feed have not changed or if the feed is inaccessible (because of an Internet outage, for example), the cached version is used to reduce processing time. Any new content discovered in these feeds is also cached for the next time you execute the node.

  • Label. If you select Save and reuse previous web feeds when possible, you must specify a label name for the results. This label is used to describe the cached feeds on the server. If no label is specified or the label is unrecognized, no reuse will be possible.
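The caching behavior described above can be sketched in Python as follows. This is only an assumed model, not the product's implementation: the function read_feed, the in-memory dictionary standing in for the server-side cache, the use of a content hash to detect changes, and the fetch and process callables are all hypothetical.

```python
import hashlib


def read_feed(url, label, cache, fetch, process):
    """Return processed records for url, reusing cached results when possible.

    cache maps (label, url) to (content_digest, records). Without a label,
    nothing is stored and nothing can be reused.
    """
    key = (label, url) if label else None
    try:
        raw = fetch(url)
    except OSError:
        # Feed inaccessible: fall back to the cached version if one exists.
        if key in cache:
            return cache[key][1]
        raise
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    if key in cache and cache[key][0] == digest:
        # Contents unchanged: skip reprocessing.
        return cache[key][1]
    records = process(raw)
    if key is not None:
        # New or changed content is cached for the next execution.
        cache[key] = (digest, records)
    return records


# Hypothetical stand-ins for retrieving and processing a feed.
process_calls = []

def fetch_ok(url):
    return "entry one\nentry two"

def fetch_down(url):
    raise OSError("feed unreachable")

def process(raw):
    process_calls.append(raw)
    return raw.splitlines()

cache = {}
first = read_feed("http://example.com/news.xml", "news", cache, fetch_ok, process)
second = read_feed("http://example.com/news.xml", "news", cache, fetch_ok, process)
offline = read_feed("http://example.com/news.xml", "news", cache, fetch_down, process)
```

In this sketch the second call and the offline call both return the cached records without reprocessing. A call with an empty label always reprocesses, which mirrors the statement that no reuse is possible without a recognized label.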