Topic
  • 6 replies
  • Latest Post - 2013-12-12T15:40:40Z by dschoppmann
reddz
23 Posts

Pinned topic Omnifind seedlist crawler not obeying robots.txt

2011-09-10T13:04:33Z
Hi,
We are trying to exclude site areas of WCM 7.0 from the OmniFind seedlist crawler using robots.txt, but OmniFind is not obeying robots.txt.
We followed the steps in this link:
https://www-304.ibm.com/support/docview.wss?uid=swg21443701
Updated on 2011-09-14T12:24:29Z by reddz
  • reddz
    23 Posts

    Re: Omnifind seedlist crawler not obeying robots.txt

2011-09-10T15:43:56Z
    Steps we followed

    By default, robots.txt was present in the wps.ear in the installed apps.
    It was accessible by URL:
    http://10.226.173.89:10039/wps/robots.txt
    We edited the robots.txt in the wps.ear as below.

    Contents of robots.txt to exclude the site area:
    User-agent: *
    Disallow: /wps/wcm/myconnect/Web+Content/SiteAreaName/

    That did not exclude the specified area, so we tried excluding everything from being crawled by OmniFind with the robots.txt below.
    Contents of robots.txt:
    User-agent: *
    Disallow: /

    The content was still being crawled by OmniFind.

    I am not sure whether OmniFind really reads robots.txt from WCM, or whether I am doing a step incorrectly.

    Thanks
    Reddz
  • bwchang
    146 Posts

    Re: Omnifind seedlist crawler not obeying robots.txt

2011-09-13T18:06:21Z
    Reddz,

    OEE's WCM crawler is a seedlist crawler, meaning the crawler only crawls what is sent from the seedlist received from the WCM server. Thus, it will not make any attempt to consult any robots.txt file to exclude content. Only OEE's web crawler honors the robots.txt file.

    While I'm not well versed in WCM, I think the administrator needs to configure the WCM server so that its seedlist includes only the content that should be crawled.
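One way to see what the crawler will actually receive is to inspect the seedlist feed itself. A minimal sketch of that idea (the feed snippet below is hypothetical, intended only to illustrate extracting entry URLs, and is not the exact WCM seedlist schema):

```shell
# Save a (hypothetical) seedlist excerpt; a real one would come from the
# portal's seedlist servlet, e.g. the GetDocuments URL shown later in this thread.
cat > /tmp/seedlist-sample.xml <<'EOF'
<feed>
  <entry><id>http://host:10039/wps/wcm/myconnect/Web+Content/SiteAreaName/doc1</id></entry>
  <entry><id>http://host:10039/wps/wcm/myconnect/OtherArea/doc2</id></entry>
</feed>
EOF

# List the document URLs the seedlist advertises: everything printed here is
# what the seedlist crawler will fetch, regardless of any robots.txt content.
grep -o '<id>[^<]*</id>' /tmp/seedlist-sample.xml | sed -e 's|<id>||' -e 's|</id>||'
```

If the site areas you want excluded still show up in the real feed, the filtering has to happen on the WCM/seedlist side, not via robots.txt.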

    Hope this helped.

    Billy Chang.
  • reddz
    23 Posts

    Re: Omnifind seedlist crawler not obeying robots.txt

2011-09-14T12:24:29Z
    Bwchang,

    Thanks a lot for the information. Yes, this surely helps.
  • RNQN_divya_s
    3 Posts

    Re: Omnifind seedlist crawler not obeying robots.txt

2013-12-06T12:36:12Z

    Hi all,

    I have an issue with creating a crawler in OmniFind 9.1. I chose the WebSphere Portal crawler type and gave the portal server URL, user ID, and password.

    I am getting the below exception when I test the configuration:

    FFQD5311E An error occurred while parsing portal seed list url.

    Thanks in advance

  • RNQN_divya_s
    3 Posts

    Re: Omnifind seedlist crawler not obeying robots.txt

2013-12-09T07:18:06Z


    Hi all,

    I am getting the below exception when I tried to create a crawler with the WebSphere Portal crawler type in OmniFind 9.1:

    <OFMsg>251658517"1"1386572739495"603984915"9608"13" " discovery" S1T1458-A70Z" WPLibrary.java"208"3 javax.net.ssl.SSLHandshakeException: com.ibm.jsse2.util.h: PKIX path building failed: java.security.cert.CertPathBuilderException: PKIXCertPathBuilderImpl could not build a valid CertPath.

    Thanks in advance

  • dschoppmann
    8 Posts

    Re: Omnifind seedlist crawler not obeying robots.txt

2013-12-12T15:40:40Z


    Sounds like your server is using a custom certificate. Can you please check this by accessing the seedlist crawl URL in a browser? The URL should look like this:
    https://<HOSTNAME>.com/seedlist/myserver?Action=GetDocuments&Start=0&Range=500&Format=com.ibm.lotus.search.plugins.seedlist.ATOMFormatterFactory&Locale=en_US&SeedlistId=72dfb5804eddaef68749af7269006f77&Source=com.ibm.workplace.wcm.plugins.seedlist.retriever.WCMRetrieverFactory

    If you are using a custom certificate, you need to import it into the crawler's JVM trust store. Remember that the crawlers use the 32-bit JVM:

    When the crawler is supposed to crawl secured documents over the https protocol, the corresponding certificates for the SSL handshake have to be available. Therefore, import them into the cacerts keystore on the indexing server using the keytool delivered with the WAS installation.

    On OmniFind with WAS:
    $WAS_HOME/java/jre/bin/keytool -import -file [CERTIFICATE] -keystore [WAS TRUST FILE] -alias [CERTIFICATE ALIAS] -trustcacerts

    On OmniFind without WAS (Jetty only):
    $ES_INSTALL_DIR/_jvm64/jre/bin/keytool -import -file [CERTIFICATE] -keystore [JVM TRUST FILE] -alias [CERTIFICATE ALIAS] -trustcacerts

    [TRUST FILE]: <JVM>/jre/lib/security/cacerts
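Before the keytool import above, you first need the server's certificate on disk. A hedged sketch of the full round trip (the hostname, port, alias, and file names are placeholders, not values from this thread): export the certificate with openssl, import it, then verify it is in the trust store.

```shell
# 1. Export the portal server's certificate to a PEM file (placeholder host/port).
openssl s_client -connect <HOSTNAME>:443 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -outform PEM > portal-cert.pem

# 2. Import it into the crawler JVM's cacerts trust store
#    (the default keystore password for cacerts is "changeit").
$WAS_HOME/java/jre/bin/keytool -import -file portal-cert.pem \
  -keystore $WAS_HOME/java/jre/lib/security/cacerts \
  -alias portalcert -trustcacerts

# 3. Verify the certificate is now listed under its alias.
$WAS_HOME/java/jre/bin/keytool -list \
  -keystore $WAS_HOME/java/jre/lib/security/cacerts -alias portalcert
```

Restart the crawler sessions afterwards so the JVM picks up the updated trust store.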