Santiago Montico
8 Posts

Web crawling secure site problem https

‏2012-02-17T22:32:03Z |
Hi,

I'm trying to crawl the content of a secure (https) site, and I have developed a jar file containing a PrefetchPlugin and a PostParsePlugin to log in and handle the redirections.
The login in the init() method of the PrefetchPlugin succeeds, and the secure (https) URL is processed by the processDocument method of the PrefetchPlugin. The crawler seems to reach the site, because I can see the https URL, but it does not crawl the content. In fact, the processDocument method of the PostParsePlugin is never executed, and I don't know why.

The log shows the following error:
<OFMsg>67113772"4"1329326462693"#X'0'"30094"34" " " crsssp" com.ibm.es.wc.http.commons.HttpClientWrapper.java"160"3 rewrite request header"3
GET /login/main.htm HTTP/1.1
User-Agent: siia-agent
Host: 192.168.202.31:8443
From: smontico@gmail.com
Accept: text/*,application/pdf,application/msword,application/rtf,application/x-msexcel,application/x-mspowerpoint,application/vnd.ms-powerpoint,application/xml,application/vnd.lotus-1-2-3,application/vnd.lotus-freelance,application/vnd.lotus-wordpro,application/vnd.ms-excel
Cookie: ; JSESSIONID=1A80DCB802B5963A96AC3715FC3B7A29

""</OFMsg>
<OFMsg>402654392"4"1329326462693"#X'0'"30094"34" " " crsssp" com.ibm.es.wc.http.commons.HttpSecureSocketFactory.java"160"3 ssl context: TLS""</OFMsg>
<OFMsg>251658517"1"1329326462699"16780123"30094"34" " " crsssp" AuditLogger.java"374"3 com.ibm.es.wc.WebCrawlerException"3 com.ibm.es.wc.WebCrawlerException
at com.ibm.es.wc.log.LogUtil.getException(LogUtil.java:58)
at com.ibm.es.wc.http.commons.HttpSecureSocketFactory.loadKeyStore(HttpSecureSocketFactory.java:96)
at com.ibm.es.wc.http.commons.HttpSecureSocketFactory.<init>(HttpSecureSocketFactory.java:60)
at com.ibm.es.wc.scan.Bucket.getSocketFactory(Bucket.java:1940)
at com.ibm.es.wc.http.commons.HttpClientWrapper.setSSLSocketFacotry(HttpClientWrapper.java:683)
at com.ibm.es.wc.http.commons.HttpClientWrapper.getMethod(HttpClientWrapper.java:696)
at com.ibm.es.wc.http.commons.HttpClientWrapper.download(HttpClientWrapper.java:380)
at com.ibm.es.wc.scan.UserAgent._downloadAndParse(UserAgent.java:387)
at com.ibm.es.wc.scan.UserAgent._clientRun(UserAgent.java:227)
at com.ibm.es.wc.th.WCRunnableImpl.run(WCRunnableImpl.java:105)
at java.lang.Thread.run(Thread.java:736)
""</OFMsg>
<OFMsg>251658517"1"1329326462701"16780123"30094"34" " " crsssp" AuditLogger.java"374"3 com.ibm.es.wc.err.OperationFailedException: java.lang.IllegalArgumentException: socketFactory is null"3 com.ibm.es.wc.err.OperationFailedException: java.lang.IllegalArgumentException: socketFactory is null
at com.ibm.es.wc.scan.UserAgent._downloadAndParse(UserAgent.java:506)
at com.ibm.es.wc.scan.UserAgent._clientRun(UserAgent.java:227)
at com.ibm.es.wc.th.WCRunnableImpl.run(WCRunnableImpl.java:105)
at java.lang.Thread.run(Thread.java:736)
Caused by: java.lang.IllegalArgumentException: socketFactory is null
at org.apache.commons.httpclient.protocol.Protocol.<init>(Protocol.java:180)
at com.ibm.es.wc.http.commons.HttpClientWrapper.setSSLSocketFacotry(HttpClientWrapper.java:689)
at com.ibm.es.wc.http.commons.HttpClientWrapper.getMethod(HttpClientWrapper.java:696)
at com.ibm.es.wc.http.commons.HttpClientWrapper.download(HttpClientWrapper.java:380)
at com.ibm.es.wc.scan.UserAgent._downloadAndParse(UserAgent.java:387)
... 3 more
""</OFMsg>
What could be wrong in the web crawler?

Thanks a lot.

Santiago.
Updated on 2013-03-14T08:38:37Z at 2013-03-14T08:38:37Z by SystemAdmin
  • SystemAdmin
    2014 Posts

    Re: Web crawling secure site problem https

    ‏2013-03-14T08:38:37Z  
    Hi Santiago,

    We faced exactly the same issue after moving the collection configuration to another environment.
    The problem was that a password had somehow been defined for the keystore used for SSL connections. Clearing that field through the administration interface did not work. No such password was defined for another crawler that crawls an https site, and that one works fine.
    So the solution was:
    • stop the crawler
    • find the crawler configuration file <ICA install>/esadmin/master_config/<collection ID>.<crawler ID>/crawl.properties
    • make a backup of that file
    • open the file in an editor that won't corrupt the encoding (I used Notepad++)
    • remove the line with the cacert_password property and save the file
    • start the crawler
    Now the crawler should work again.
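
    If you would rather script the edit than do it by hand, here is a minimal Java sketch of the backup and remove steps. It is not an IBM tool: the class and method names (RemoveCacertPassword, removeProperty, setsKey) are made up for illustration, and it assumes cacert_password is written as key=value (or key:value) on a single line. It reads and writes the file as ISO-8859-1, which maps every byte to one character and back, so all other lines should come out byte for byte as they went in. Stop the crawler before running it and start it again afterwards.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class RemoveCacertPassword {

    /**
     * Removes every line that sets the given key from a properties file.
     * If anything is removed, the original is first copied to <name>.bak.
     * Returns the number of lines removed. (Hypothetical helper, not an IBM API.)
     */
    static int removeProperty(Path file, String key) throws IOException {
        // ISO-8859-1 is a lossless byte <-> char mapping, so untouched lines keep their exact bytes.
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);

        StringBuilder kept = new StringBuilder(text.length());
        int removed = 0;
        int pos = 0;
        while (pos < text.length()) {
            int nl = text.indexOf('\n', pos);
            int end = (nl < 0) ? text.length() : nl + 1; // keep the line ending attached to its line
            String line = text.substring(pos, end);
            if (setsKey(line, key)) {
                removed++;
            } else {
                kept.append(line);
            }
            pos = end;
        }

        if (removed > 0) {
            // Back up first, then overwrite the original.
            Path backup = file.resolveSibling(file.getFileName() + ".bak");
            Files.copy(file, backup, StandardCopyOption.REPLACE_EXISTING);
            Files.write(file, kept.toString().getBytes(StandardCharsets.ISO_8859_1));
        }
        return removed;
    }

    /** True if the line looks like "key=..." or "key:..." (comment lines do not match). */
    static boolean setsKey(String line, String key) {
        String trimmed = line.trim();
        if (!trimmed.startsWith(key)) {
            return false;
        }
        String rest = trimmed.substring(key.length()).trim();
        return rest.isEmpty() || rest.startsWith("=") || rest.startsWith(":");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java RemoveCacertPassword <path to crawl.properties>");
            return;
        }
        int removed = removeProperty(Paths.get(args[0]), "cacert_password");
        System.out.println("removed " + removed + " line(s)");
    }
}
```

    Run it with the full path to the crawl.properties file of the affected crawler. If it reports 0 removed lines, the file is left alone and no backup is written.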

    David