Topic
  • 4 replies
  • Latest Post - ‏2011-10-11T16:12:36Z by bfoyle
reddz
23 Posts

Web crawling, the rss feed is excluded stating the reason "No index META "

2011-10-07T12:56:42Z
Hi All,

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<atom:link href="http://cf3.corp.pep.pvt/pepline/rss/pepline_8.xml" rel="self" type="application/rss+xml" />
<title>Bradenton Local News</title>
<description>Recent PEPLine stories</description>
<link>http://cf3.corp.pep.pvt/pepline/rss/pepline_8.xml</link>
<lastBuildDate>Tue, 06 Sep 2011 12:43:09 GMT</lastBuildDate>
<language>en-us</language>
<copyright>2011 PepsiCo, Inc. All rights reserved.</copyright>
<docs>http://cf3.corp.pep.pvt/pepline/rss/pepline_8.xml</docs>
<item>
<title>Calendar of Events (Aug 29, 2011)</title>
<category>Events This Week</category>
<link>http://cf3.corp.pep.pvt/pepline/rss/story_2011.08.29_59307.xml</link>
<author>peptrop@tropicana.com</author>
<pubDate>Mon, 29 Aug 2011 05:00:00 GMT</pubDate>
<guid>http://cf3.corp.pep.pvt/pepline/rss/story_2011.08.29_59307.xml</guid>
<description>Date Event Location/Time Contact Sept. 1 EnAble-General Meeting HQ Conf. Room B/10-11 a.m. Marji Drake ext. 3464Click on the date to add to your Outlook Calendar.</description>
</item>
</channel>
</rss>

On analysis, it seems that if I remove the rss element from the XML file, the file gets indexed. I would like to know how the special meaning attached to the <rss> tag can be suppressed.
Thanks in advance.
Updated on 2011-10-11T16:12:36Z by bfoyle
  • reddz
    23 Posts

    Re: Web crawling, the rss feed is excluded stating the reason "No index META "

    ‏2011-10-07T13:03:40Z  
    There is an RSS feed that is linked from an HTML page. During web crawling, the RSS feed is excluded with the reason "No index META tag" (2004). The XML file is the one shown in my original post above, and the question is the same: the file gets indexed only when the rss element is removed, so how can the special meaning attached to the <rss> tag be suppressed?
    Thanks in advance.
  • bfoyle
    29 Posts

    Re: Web crawling, the rss feed is excluded stating the reason "No index META "

    ‏2011-10-07T16:02:45Z  
    http://publib.boulder.ibm.com/infocenter/discover/v9r1m0/topic/com.ibm.discovery.es.ad.doc/iiysafollow.htm

    I think this describes what you are encountering. There is more information there on how to override that behavior.

    bf
  • reddz
    23 Posts

    Re: Web crawling, the rss feed is excluded stating the reason "No index META "

    ‏2011-10-09T11:26:26Z  
    Thanks bfoyle,
    I have tried these options before and they didn't solve the problem. The XML file gets indexed when the <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"> element and its closing tag </rss> are removed. So the problem occurs during the crawling process: XML files with the RSS tags are somehow marked with the status "No index META tag-2004", as if crawling instructions had been found in them. Please let me know how we can override this and make all the XML files with <rss> elements searchable.
  • bfoyle
    29 Posts

    Re: Web crawling, the rss feed is excluded stating the reason "No index META "

    ‏2011-10-11T16:12:36Z  
    Here is the response I got from engineering upon further investigation...

    ... there's no way to override that noindex directive, even if you use followindex.rules. (The manual description was wrong.)

    http://www-01.ibm.com/support/docview.wss?uid=swg21512261

    Thus, if the page includes that noindex meta tag, the page won't be crawled, and there's no way to have the web crawler crawl it.
    The only options are to remove the noindex meta tag from the target page, to inject the page through the REST API, or to develop a custom crawler to crawl the page, but any of these would require additional effort.

    Sorry for the misdirection.
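
    For anyone who goes the REST API or custom crawler route, the first step is turning each feed item into a record that can be pushed to the index. Below is a minimal Python sketch of that step only, using the standard library. The function name rss_items_to_records and the record field names are made up for illustration, the sample feed is a trimmed copy of the one posted above, and the actual call that injects the records is deliberately left out because it depends on the product's API.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Trimmed copy of the feed from this thread, used only as sample input.
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Bradenton Local News</title>
<description>Recent PEPLine stories</description>
<link>http://cf3.corp.pep.pvt/pepline/rss/pepline_8.xml</link>
<item>
<title>Calendar of Events (Aug 29, 2011)</title>
<category>Events This Week</category>
<link>http://cf3.corp.pep.pvt/pepline/rss/story_2011.08.29_59307.xml</link>
<pubDate>Mon, 29 Aug 2011 05:00:00 GMT</pubDate>
<description>Sept. 1 EnAble-General Meeting HQ Conf. Room B/10-11 a.m.</description>
</item>
</channel>
</rss>
"""


def rss_items_to_records(feed_xml: bytes) -> List[Dict[str, str]]:
    """Turn each <item> of an RSS 2.0 feed into a plain dict.

    The field names (url, title, date, text) are hypothetical; rename
    them to whatever the target index expects.
    """
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        return []  # not an RSS 2.0 document, nothing to extract

    records = []
    for item in channel.findall("item"):
        records.append({
            "url": (item.findtext("link") or "").strip(),
            "title": (item.findtext("title") or "").strip(),
            "date": (item.findtext("pubDate") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return records


if __name__ == "__main__":
    for record in rss_items_to_records(SAMPLE_FEED.encode("utf-8")):
        # Pushing the record to the index would happen here (not shown).
        print(record["url"], "|", record["title"])
```

    Parsing the feed yourself sidesteps the crawler's handling of the <rss> element, since the items are submitted as ordinary documents instead of being discovered by the web crawler.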