1 reply Latest Post - 2012-08-28T23:49:15Z by bwchang

James_Gan
2 Posts
Pinned topic Configure MaxErrorCount of seedlist crawler (OF 9.1 and WCM 7.0)

2012-08-26T22:42:27Z
Hi all,

I'm facing a difficult problem in my OF 9.1 + WCM 7.0 setup. WCM exports a seedlist that contains many inaccessible URLs (404 errors). These URLs should not appear in search results, so I would like OmniFind to simply ignore them. However, the seedlist crawler stops after it encounters 100 inaccessible URLs. Is there a way to configure the MaxErrorCount for this crawler?

I found a solution for OF 8.5, shown below, though it doesn't seem to work on 9.1 in my tests.

IC65924: The Web Content Management crawler stops when it encounters errors caused by links to inaccessible documents more than 6 times. Configuration parameters were added to change the maximum number of consecutive error documents that can be skipped and the maximum number of retries per error document. To configure this support, create a file named ES_NODE_ROOT/master_config/<Collection ID>.<Crawler ID>/wcmcrawler_ext.xml with the following content and restart the crawler:

<?xml version="1.0" encoding="UTF-8"?>
<ExtendedProperties>
<AppendChild XPath="/Crawler/DataSources/Server" Name="MaxErrorCount">100</AppendChild>
<AppendChild XPath="/Crawler/DataSources/Server" Name="MaxRetryPerDoc">2</AppendChild>
</ExtendedProperties>
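As a sketch of what creating that override file looks like, the snippet below writes the v8.5-style XML to the expected directory. The install path, collection ID, and crawler ID here are placeholders for your own environment, not real values:

```python
# Sketch: write the v8.5-style wcmcrawler_ext.xml override file.
# "ES_NODE_ROOT", "col_12345", and "WCM_67890" are placeholders for your
# actual install path, collection ID, and crawler ID.
import os

es_node_root = "ES_NODE_ROOT"   # placeholder install path
collection_id = "col_12345"     # placeholder collection ID
crawler_id = "WCM_67890"        # placeholder crawler ID

conf_dir = os.path.join(es_node_root, "master_config",
                        f"{collection_id}.{crawler_id}")
xml = """<?xml version="1.0" encoding="UTF-8"?>
<ExtendedProperties>
<AppendChild XPath="/Crawler/DataSources/Server" Name="MaxErrorCount">100</AppendChild>
<AppendChild XPath="/Crawler/DataSources/Server" Name="MaxRetryPerDoc">2</AppendChild>
</ExtendedProperties>
"""
os.makedirs(conf_dir, exist_ok=True)
with open(os.path.join(conf_dir, "wcmcrawler_ext.xml"), "w",
          encoding="utf-8") as f:
    f.write(xml)
```

After writing the file, the crawler still has to be restarted for the override to take effect.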
Updated on 2012-08-28T23:49:15Z by bwchang
  • bwchang
    146 Posts
    ACCEPTED ANSWER

    Re: Configure MaxErrorCount of seedlist crawler (OF 9.1 and WCM 7.0)

    2012-08-28T23:49:15Z in response to James_Gan
    James,

    Basically, this is not a problem that OmniFind should handle. The customer should report to WCM support that too many 404 documents are present in the seedlist.

    Nevertheless, the crawler does have a configuration to tolerate errors. The XML element names were changed in 9.1. In v8.5, the error count was cumulative; from v9.1, the error count is reset to 0 whenever the crawler successfully obtains a document.

    <?xml version="1.0" encoding="UTF-8"?>
    <ExtendedProperties>
    <AppendChild XPath="/Crawler/DataSources/Server" Name="ErrorThreshold">100</AppendChild>
    <AppendChild XPath="/Crawler/DataSources/Server" Name="MaxRetry">2</AppendChild>
    </ExtendedProperties>
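    The cumulative-versus-consecutive difference can be illustrated with a small sketch. This is not OmniFind code, just a hypothetical model of the two counting policies described above:

```python
# Hypothetical sketch of the two error-counting policies; not actual
# OmniFind crawler code, only an illustration of the behavior.
def crawl(docs, threshold, cumulative):
    """docs is a sequence of booleans: True = fetched OK, False = error.

    Returns how many documents are processed before the crawler stops."""
    errors = 0
    processed = 0
    for ok in docs:
        if ok:
            if not cumulative:
                errors = 0          # v9.1+: reset on every successful fetch
        else:
            errors += 1
            if errors > threshold:  # too many errors: stop crawling
                break
        processed += 1
    return processed

# Two errors, a success, two more errors, with threshold 3:
docs = [False, False, True, False, False, True]
# A v8.5-style cumulative count reaches 4 and stops after 4 documents,
# while a v9.1-style consecutive count never exceeds 2, so all 6 pass.
```

    With the same error pattern, the v9.1 behavior only trips the threshold on an unbroken run of failures, which is why an occasional 404 no longer halts the crawl.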

    The default threshold is 5, intended to ride out accidental or short network outages. Please be aware that documents with errors are simply ignored, so data integrity may not be guaranteed if the threshold is set too large, say 100.