Topic
  • 7 replies
  • Latest Post - ‏2015-07-20T09:49:49Z by NAnand
SystemAdmin
603 Posts

Pinned topic BigInsights Webcrawler / Apache Nutch / Proxy Setting

‏2013-03-21T16:16:21Z |
Hi all,

We installed BigInsights on a server that reaches the internet via a proxy.

We are trying out the web crawler:
1) We go to the app and pick the web crawler.
2) We enter all the relevant information in the form. We tried, e.g., wikipedia.org.
3) We "run" the task.
-> After the job completes we look at the output (as basic crawler data) and get no results.

The problem seems to be the proxy.
In segments/part-000000/data there is a hint: java.net.UnknownHostException: de.wikipedia.org
This means that the crawler cannot reach the internet. We suspect the proxy, because the job itself otherwise seemed to work fine.
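
As a rough illustration of why this exception points at name resolution: java.net.UnknownHostException is thrown when the JVM cannot determine the IP address of a host, which is what one would expect if the crawler attempts a direct lookup on a machine that only reaches external sites through a proxy. The Python sketch below performs the same kind of direct lookup. The helper name can_resolve and its injectable resolver parameter are assumptions made for illustration, not part of BigInsights or Nutch.

```python
import socket


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if `host` resolves via a direct lookup, False otherwise.

    `resolver` is injectable only so the function can be exercised without
    network access; by default the system resolver is used.
    """
    try:
        resolver(host, None)
        return True
    except socket.gaierror:
        # Roughly analogous to java.net.UnknownHostException
        return False


if __name__ == "__main__":
    for name in ("localhost", "de.wikipedia.org"):
        print(name, "resolves directly:", can_resolve(name))
```

If the external name fails to resolve directly while lynx works with the proxy configured, the crawler is probably bypassing the proxy rather than being rejected by it.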

What we did to find a solution:
We checked the proxy settings of Apache Nutch:
1) We uncompressed the file /opt/ibm/biginsights/sheets/libext/nutch-1.4.jar
2) We put our proxy in the file nutch-site.xml:

<property>
  <name>http.proxy.host</name>
  <value>our.proxy</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>

3) We compressed it again into a *.jar file.
4) This change had no effect.
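
One way to double-check the repacked jar is to read nutch-site.xml straight out of the archive and list the proxy properties it contains. The sketch below does that with the Python standard library. It assumes that nutch-site.xml sits at the root of the jar and that the file has the usual <configuration>/<property> layout; the function name proxy_properties is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET


def proxy_properties(jar, entry="nutch-site.xml"):
    """Return the http.proxy.* properties found in `entry` inside `jar`.

    `jar` may be a path or a file-like object. `entry` is assumed to sit at
    the root of the archive; adjust it if the file lives elsewhere.
    """
    with zipfile.ZipFile(jar) as zf:
        root = ET.fromstring(zf.read(entry))
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name.startswith("http.proxy."):
            found[name] = (prop.findtext("value") or "").strip()
    return found


if __name__ == "__main__":
    import sys
    # Usage: python check_jar.py /path/to/nutch-1.4.jar
    print(proxy_properties(sys.argv[1]))
```

If the properties show up in the jar and the crawl still fails, the running job may be reading its configuration from somewhere else. That is only a guess; nothing in this thread confirms it.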

Other important characteristics of our system:
  • lynx http://wikipedia.org works fine with the same proxy settings
  • crawling localhost works without problems
Any idea how to configure nutch for access to the www via a proxy?

Thanks in advance for any hint or advice.
Simon
===================
Further information
Oozie job description

<java xmlns="uri:oozie:workflow:0.1">
  <job-tracker>localhost.localdomain:9001</job-tracker>
  <name-node>hdfs://localhost.localdomain:9000</name-node>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>default</value>
    </property>
  </configuration>
  <main-class>com.ibm.biginsights.apps.nutch.NutchApp</main-class>
  <arg>http://de.wikipedia.org/wiki/Nutch</arg>
  <arg>+</arg>
  <arg>hdfs://localhost.localdomain:9000/user/biadmin/Test3</arg>
  <arg>5</arg>
  <arg>50</arg>
</java>
  • rfchong
    5 Posts

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2013-04-19T20:18:55Z  

    Hi Simon,

    Sorry for the delay getting back to you. I've contacted the BigInsights Tech Support team who in turn have contacted the Development team to take a look at your issue in more detail.  I expect someone from Development to provide an answer shortly. Just to confirm...are you using BigInsights 2.0?

    Thanks,

    Raul Chong

    Senior Program Manager - Cloud and Big Data

  • ZachZ
    14 Posts

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2013-04-22T20:02:51Z  

    Hi Simon,

    What kind of proxy server are you using, and what kind of authentication?

    Thank you,


    Zach

  • simon123
    2 Posts

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2013-04-25T15:43:33Z  

    Thanks for your reply, ZachZ.

    Our proxy server is squid. Authentication is based on a user account (username and password), not on the MAC or IP address of the device.
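
To check the account outside the crawler, a small script can send one request through the proxy with the same credentials. The sketch below uses only the Python standard library and assumes the proxy asks for HTTP Basic authentication; Digest or NTLM setups would need a different handler. The host, port, user name and password are placeholders, and build_proxy_opener is a name chosen for this example. It does not configure Nutch itself.

```python
import urllib.request


def build_proxy_opener(proxy_host, proxy_port, username, password):
    """Build a urllib opener that routes requests through an authenticating proxy."""
    proxy_url = "http://%s:%d" % (proxy_host, proxy_port)
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None: use these credentials for whatever realm the proxy announces
    password_mgr.add_password(None, proxy_url, username, password)
    auth_handler = urllib.request.ProxyBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(proxy_handler, auth_handler)


if __name__ == "__main__":
    # Placeholder values; replace with the real proxy and account.
    opener = build_proxy_opener("our.proxy", 8080, "user", "secret")
    with opener.open("http://de.wikipedia.org/wiki/Nutch", timeout=10) as resp:
        print(resp.status)
```

A 200 response here would suggest that the proxy and account are fine and that the remaining problem lies in how the crawler picks up its proxy settings.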

    Simon

    Updated on 2013-04-25T15:44:38Z by simon123
  • simon123
    2 Posts

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2013-04-25T15:45:11Z  

    Thanks for your reply, rfchong.

    We're using BigInsights 2.0

  • rfchong
    5 Posts

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2013-05-10T02:09:00Z  

    Hi Simon,

    I've sent a note to IBM Germany colleagues with an attachment with instructions on how this can be done (from the development team).  They should be reaching out to you.

    Cheers,

    Raul.

  • dougspadotto
    8 Posts

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2014-07-29T12:51:41Z  

    Hi everyone,

    Could the solution be shared publicly, please?

    I'm facing a similar issue and need to customize the BigInsights Nutch Crawler to use a proxy that authenticates via Active Directory. Has a technote been published on this?

    Thanks in advance,

    Douglas

  • NAnand
    2 Posts

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2015-07-20T09:49:49Z  

    I am using BigInsights v3.0.0.2 and also need to configure the crawler to work through a squid proxy with username/password authentication. Could you please share the solution?