4 replies Latest Post - ‏2011-10-11T16:12:36Z by bfoyle
reddz

Web crawling: the RSS feed is excluded with the reason "No index META tag"

2011-10-07T12:56:42Z
Hi All,

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://cf3.corp.pep.pvt/pepline/rss/pepline_8.xml" rel="self" type="application/rss+xml" />
    <title>Bradenton Local News</title>
    <description>Recent PEPLine stories</description>
    <link>http://cf3.corp.pep.pvt/pepline/rss/pepline_8.xml</link>
    <lastBuildDate>Tue, 06 Sep 2011 12:43:09 GMT</lastBuildDate>
    <language>en-us</language>
    <copyright>2011 PepsiCo, Inc. All rights reserved.</copyright>
    <docs>http://cf3.corp.pep.pvt/pepline/rss/pepline_8.xml</docs>
    <item>
      <title>Calendar of Events (Aug 29, 2011)</title>
      <category>Events This Week</category>
      <link>http://cf3.corp.pep.pvt/pepline/rss/story_2011.08.29_59307.xml</link>
      <author>peptrop@tropicana.com</author>
      <pubDate>Mon, 29 Aug 2011 05:00:00 GMT</pubDate>
      <guid>http://cf3.corp.pep.pvt/pepline/rss/story_2011.08.29_59307.xml</guid>
      <description>Date Event Location/Time Contact Sept. 1 EnAble-General Meeting HQ Conf. Room B/10-11 a.m. Marji Drake ext. 3464Click on the date to add to your Outlook Calendar.</description>
    </item>
  </channel>
</rss>

While analysing this, I found that if I remove the rss element from the XML file, the file gets indexed. I would like to know how the special meaning attached to the <rss> tag can be suppressed.
Thanks in advance.
Updated on 2011-10-11T16:12:36Z by bfoyle
  • reddz
    2011-10-07T13:03:40Z in response to reddz
    There is an RSS feed that is linked from an HTML page. During web crawling, the RSS feed is excluded with the reason "No index META tag" (status 2004). The XML file is the one shown in the first post above, and the question is the same: the file gets indexed once the rss element is removed, and I would like to know how the special meaning attached to the <rss> tag can be suppressed.
    • bfoyle
      2011-10-07T16:02:45Z in response to reddz
      http://publib.boulder.ibm.com/infocenter/discover/v9r1m0/topic/com.ibm.discovery.es.ad.doc/iiysafollow.htm

      I think this describes what you are encountering. There is more information there on how to override that behavior.

      bf
      • reddz
        2011-10-09T11:26:26Z in response to bfoyle
        Thanks bfoyle,
        I have tried these options before and they didn't solve the problem. The XML file gets indexed when the <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"> element and its closing tag </rss> are removed. So the problem occurs during the crawling process: XML files with the RSS tags are somehow marked with the status "No index META tag-2004" due to the occurrence of crawling instructions. Please let me know how we can override this and make all the XML files with <rss> elements searchable.
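
The observation above (the file is indexed once the <rss> wrapper is gone) could be reproduced offline with a small preprocessing step. The following is only a sketch in Python, assuming you control a copy of the feed files. `unwrap_rss` is a hypothetical helper name, not part of any crawler product, and whether the crawler then indexes the output is something this thread reports, not something the code verifies.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def unwrap_rss(feed_xml: bytes) -> str:
    """Return the feed with <channel> promoted to the document root.

    Hypothetical helper: it only removes the <rss> wrapper element and
    keeps everything inside <channel> unchanged.
    """
    # Keep the "atom" prefix instead of an auto-generated one such as "ns0".
    ET.register_namespace("atom", ATOM_NS)

    root = ET.fromstring(feed_xml)
    if root.tag != "rss":
        raise ValueError("document root is not <rss>")

    channel = root.find("channel")
    if channel is None:
        raise ValueError("RSS document has no <channel> element")

    # Drop the whitespace that followed </channel> inside <rss>.
    channel.tail = None
    return ET.tostring(channel, encoding="unicode")
```

Note that the result is no longer a valid RSS feed, so this would suit a copy kept for indexing rather than the feed that readers subscribe to.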
        • bfoyle
          2011-10-11T16:12:36Z in response to reddz
          Here is the response I got from engineering upon further investigation...

          ... there's no way to override that noindex directive, even if you use followindex.rules. (The manual description was wrong.)

          http://www-01.ibm.com/support/docview.wss?uid=swg21512261

          Thus, if the page includes that noindex meta tag, the page won't be crawled, and there's no way to make the web crawler crawl it.
          The only options are to remove the noindex meta tag from the target page, to inject the page through the REST API, or to develop a custom crawler for the page, but each of those would require additional effort.

          Sorry for the misdirection.
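
For the injection or custom-crawler route mentioned above, the first step would be turning the feed into plain records. The following Python sketch shows only that parsing step, under the assumption that the feed follows the RSS 2.0 layout posted at the top of the thread. `FeedItem` and `parse_items` are hypothetical names, and fetching the feed and sending the records to the search product's API are deliberately left out because the thread does not describe them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class FeedItem:
    """One <item> from the feed, reduced to the fields worth indexing."""
    title: str
    link: str
    published: str
    description: str


def parse_items(feed_xml: bytes) -> List[FeedItem]:
    """Extract every <item> element from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append(
            FeedItem(
                # findtext returns the default when the child is missing.
                title=item.findtext("title", default=""),
                link=item.findtext("link", default=""),
                published=item.findtext("pubDate", default=""),
                description=item.findtext("description", default=""),
            )
        )
    return items
```

Each `FeedItem` could then be handed to whatever ingestion mechanism is available, with `link` serving as the document identifier.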