Web Feed Node: Records Tab

The Records tab is used to specify the text content of non-RSS feeds by identifying where each new record begins, as well as other relevant information regarding each record. If you know that a non-RSS feed (HTML) contains text that is in multiple records, you must identify the record start tag here or else the text will be treated as one record. While RSS feeds are standardized and do not require any tag specification on this tab, you can still preview the content in the Preview tab.

Important: When working with non-RSS data, you may prefer to use a web scraping tool, such as WebQL®, to automate content gathering and then reference the output from that tool using a different source node.

URL. This drop-down list contains the URLs entered on the Input tab. Both HTML and RSS formatted feeds are listed. If a URL is too long for the drop-down list, it is automatically clipped in the middle, with an ellipsis replacing the clipped text, such as http://www.ibm.com/example/start-of-address...rest-of-address/path.htm.

  • With HTML formatted feeds, if the feed contains more than one record (or entry), you can define which HTML tags contain the data corresponding to the fields shown in the table. For example, you can define the start tag that indicates a new record has started, a modified date tag, or an author name tag.
  • With RSS formatted feeds, you are not prompted to enter any tags since RSS is a standardized format. However, you can view sample results on the Preview tab if desired. All recognized RSS feeds are preceded by the RSS logo image.

Source tab. On this tab, you can view the source code for any HTML feeds. This code is not editable. You can use the Find field to locate specific tags or information on this page that you can then copy and paste into the table below. The Find field is not case sensitive and will match partial strings.

Preview tab. On this tab, you can preview how a record will be read by the Web feed node. This is particularly useful for HTML feeds since you can change how a record will be read by defining HTML tags in the table below the Preview tab.

Non-RSS record start tag. This option only applies to non-RSS feeds. If your HTML feed contains text that you want to break up into multiple records, specify the HTML tag that signals the beginning of a record (such as an article or blog entry) here. If you don't define one for a non-RSS feed, Modeler will try to guess the XML format and return corresponding records. If Modeler can't guess the XML format, nothing will be returned. If your goal is to import the whole content of a page and then process it later, we recommend using separate XML readers with more powerful functionality and then importing the result into Modeler Text Analytics.
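
The effect of a record start tag can be illustrated with a minimal Python sketch. This is not how the Web Feed node is implemented; the class RecordSplitter, the function split_records, and the sample page are hypothetical names introduced only for this illustration. The sketch assumes that text appearing before the first record start tag is ignored and that nested occurrences of the start tag also begin a new record.

```python
from html.parser import HTMLParser


class RecordSplitter(HTMLParser):
    """Hypothetical helper: starts a new record at every matching start tag."""

    def __init__(self, record_tag, record_attrs=None):
        super().__init__()
        self.record_tag = record_tag
        self.record_attrs = record_attrs or {}
        self.records = []  # one list of text chunks per record

    def handle_starttag(self, tag, attrs):
        found = dict(attrs)
        same_name = tag == self.record_tag
        has_attrs = all(
            key in found and found[key] == value
            for key, value in self.record_attrs.items()
        )
        if same_name and has_attrs:
            self.records.append([])  # a new record begins here

    def handle_data(self, data):
        text = data.strip()
        # Assumption: text before the first record start tag is dropped.
        if text and self.records:
            self.records[-1].append(text)


def split_records(html, record_tag, record_attrs=None):
    """Return the text of each record found in `html` as a list of strings."""
    splitter = RecordSplitter(record_tag, record_attrs)
    splitter.feed(html)
    splitter.close()
    return [" ".join(chunks) for chunks in splitter.records]


sample = (
    '<html><body><h1>Blog</h1>'
    '<div class="post"><h2>First</h2><p>Alpha text.</p></div>'
    '<div class="post"><h2>Second</h2><p>Beta text.</p></div>'
    '</body></html>'
)
print(split_records(sample, "div", {"class": "post"}))
```

With <div class="post"> as the start tag, the sample page yields two records, one per blog entry. Without a start tag that matches, this sketch returns no records at all, which loosely mirrors the case where nothing is returned.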

Field table. This option only applies to non-RSS feeds. In this table, you can break up the text content into specific output fields by entering a start tag for any of the predefined output fields. Enter the start tag only. All matches are done by parsing the HTML and matching the table contents to the tag names and attributes found in the HTML. You can use the buttons at the bottom to copy the tags you have defined and reuse them for other feeds.

Table 1. Possible output fields for non-RSS feeds (HTML formats)
Output Field Name | Expected Tag Content
Title | The tag delimiting the record title. (optional)
Short Desc | The tag delimiting the short description or label. (optional)
Description | The tag delimiting the main text. If left blank, this field will contain all other content in either the <body> tag (if there is a single record) or the content found inside the current record (when a record delimiter has been specified).
Author | The tag delimiting the author of the text. (optional)
Contributors | The tag delimiting the names of the contributors. (optional)
Published Date | The tag delimiting the date when the text was published. If left blank, this field will contain the date when the node reads the data.
Modified Date | The tag delimiting the date when the text was modified. If left blank, this field will contain the date when the node reads the data.
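
The fallback behavior described in Table 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the node's actual logic: the function fill_defaults, its parameters, and the ISO date format are hypothetical, and the sketch simply treats a field with no tag entered as missing.

```python
from datetime import datetime


def fill_defaults(fields, remaining_text, read_time):
    """Apply the Table 1 fallbacks to one record.

    fields         -- mapping of output field name to extracted text;
                      a field with no tag entered is absent or None
    remaining_text -- content of the record (or of <body>) not claimed
                      by any other field
    read_time      -- datetime at which the data was read
    """
    result = dict(fields)
    # Description falls back to all other content of the record.
    if result.get("Description") is None:
        result["Description"] = remaining_text
    # Both date fields fall back to the time the data was read.
    # The ISO string format is an assumption made for this sketch.
    for name in ("Published Date", "Modified Date"):
        if result.get(name) is None:
            result[name] = read_time.isoformat()
    return result


record = fill_defaults(
    {"Title": "First"},
    "Alpha text.",
    datetime(2024, 1, 2, 3, 4, 5),
)
print(record)
```

Optional fields such as Title or Author are left untouched here: if no tag is entered for them, the sketch does not invent a value.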

When you enter a tag into the table, the feed is scanned using this tag as the minimum tag to match rather than as an exact match. For example, if you enter <div> for the Title field, it matches any <div> tag in the feed, including those with attributes (such as <div class="post three">), and the content of each matching tag is used for the Title output field. In other words, a root tag entered without attributes matches that tag with any attributes. If you enter a tag with attributes, only tags that include those attributes are matched, although the matched tags may carry additional attributes.

Table 2. Examples of HTML tags used to identify the text for the output fields
If you enter: | It would match: | And also match: | But not match:
<div> | <div> | <div class="post"> | any other tag
<p class="auth"> | <p class="auth"> | <p color="black" class="auth" id="85643"> | <p color="black">
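
The "minimum tag" rule shown in Table 2 can be expressed as a short Python sketch. It is one possible reading of the rule, not the product's implementation: the helper names parse_tag and matches are hypothetical, and attribute values are compared as whole strings, so how the node treats multi-valued attributes such as class="post three" is not covered here.

```python
from html.parser import HTMLParser


class _FirstTag(HTMLParser):
    """Captures the name and attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if self.tag is None:
            self.tag = tag
            self.attrs = dict(attrs)


def parse_tag(text):
    """Return (tag name, attribute dict) for a single start tag string."""
    parser = _FirstTag()
    parser.feed(text)
    parser.close()
    return parser.tag, parser.attrs


def matches(entered, candidate):
    """True if `candidate` has the same tag name as `entered` and carries
    at least every attribute/value pair that `entered` specifies."""
    entered_tag, entered_attrs = parse_tag(entered)
    candidate_tag, candidate_attrs = parse_tag(candidate)
    if entered_tag is None or entered_tag != candidate_tag:
        return False
    return all(
        key in candidate_attrs and candidate_attrs[key] == value
        for key, value in entered_attrs.items()
    )


# The rows of Table 2, checked against the sketch.
print(matches('<div>', '<div class="post">'))
print(matches('<p class="auth">', '<p color="black" class="auth" id="85643">'))
print(matches('<p class="auth">', '<p color="black">'))
```

Extra attributes on the candidate tag never prevent a match in this sketch; only a different tag name or a missing or different entered attribute does.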