6 replies Latest Post - ‏2014-07-29T12:51:41Z by dougspadotto
SystemAdmin
SystemAdmin
603 Posts
ACCEPTED ANSWER

Pinned topic BigInsights Webcrawler / Apache Nutch / Proxy Setting

‏2013-03-21T16:16:21Z |
Hi all,

We installed BigInsights on a server that reaches the internet via a proxy.

We are trying out the web crawler:
1) We go to the applications and pick the web crawler.
2) We enter all the relevant information in the form. We tried, e.g., wikipedia.org.
3) We "run" the task.
-> After the job has completed we look at the output (as basic crawler data) and we don't get any results.

The problem seems to be the proxy
In segments/part-000000/data there is a little hint: java.net.UnknownHostException: de.wikipedia.org
This means that the crawler is not able to reach the internet -> a problem with the proxy, because everything else seemed to work fine.
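
For context: java.net.UnknownHostException is what Java raises when a hostname cannot be resolved to an IP address. On a network where only the proxy can resolve external names, that would be consistent with the crawler trying to connect directly instead of handing the hostname to the proxy. A minimal sketch for checking name resolution on the node, assuming Python is available there (the function name can_resolve is made up for illustration):

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if the local resolver can map hostname to an address.

    The resolver argument is only there so the check can be exercised
    without network access; by default the system resolver is used.
    """
    try:
        resolver(hostname, 80)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved, the rough Python
        # counterpart of java.net.UnknownHostException.
        return False


# Example, to be run on the node that executes the crawl:
#   can_resolve("de.wikipedia.org")
```

If this returns False on the crawl node while lynx works through the proxy, the node itself probably cannot resolve external names, and every fetch that bypasses the proxy would fail in the same way.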

What we did to find the solution
We checked the proxy settings of Apache Nutch:
1) We uncompressed the file /opt/ibm/biginsights/sheets/libext/nutch-1.4.jar
2) We put our proxy in the file nutch-site.xml:

<property>
  <name>http.proxy.host</name>
  <value>our.proxy</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>

3) We compressed it again as a *.jar file.
4) This change had no impact.
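
One thing that can be ruled out after steps 1) to 3) is that the repacked jar does not contain the edited file. A minimal sketch, assuming Python is available on the node, that lists the http.proxy.* properties found in a nutch-site.xml inside a jar (the function name read_proxy_properties is made up for illustration):

```python
import xml.etree.ElementTree as ET
import zipfile


def read_proxy_properties(jar_path, config_name="nutch-site.xml"):
    """Return the http.proxy.* properties from config_name inside the jar.

    Returns an empty dict if the jar contains no such file.
    """
    with zipfile.ZipFile(jar_path) as jar:
        # The file may sit at the jar root or in a subdirectory.
        matches = [
            name for name in jar.namelist()
            if name == config_name or name.endswith("/" + config_name)
        ]
        if not matches:
            return {}
        root = ET.fromstring(jar.read(matches[0]))

    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name.startswith("http.proxy."):
            props[name] = (prop.findtext("value") or "").strip()
    return props


# Example, using the jar path from step 1):
#   read_proxy_properties("/opt/ibm/biginsights/sheets/libext/nutch-1.4.jar")
```

If the properties show up here and the crawl still fails, it may be worth checking whether another copy of nutch-site.xml is picked up earlier on the classpath. That is a guess, not something confirmed in this thread.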

Other important characteristics of our system
  • lynx http://wikipedia.org works fine; we used the same proxy settings there
  • if we crawl localhost there is no problem
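
The lynx check can be reproduced outside of Nutch to confirm the proxy host, port and credentials independently of the crawler. A minimal sketch using Python's standard urllib, assuming the proxy accepts HTTP Basic authentication; the function names build_proxy_opener and fetch_status are made up for illustration, and our.proxy:8080 is the placeholder from this post:

```python
import urllib.request


def build_proxy_opener(proxy_host, proxy_port, username=None, password=None):
    """Build a urllib opener that sends all requests through the proxy."""
    proxy_url = f"http://{proxy_host}:{proxy_port}"
    handlers = [
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    ]
    if username is not None:
        # Basic proxy authentication only; NTLM is not handled by urllib.
        password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        password_mgr.add_password(None, proxy_url, username, password)
        handlers.append(urllib.request.ProxyBasicAuthHandler(password_mgr))
    return urllib.request.build_opener(*handlers)


def fetch_status(opener, url, timeout=10):
    """Fetch url through the opener and return the HTTP status code."""
    with opener.open(url, timeout=timeout) as response:
        return response.status


# Example (needs network access and a reachable proxy):
#   opener = build_proxy_opener("our.proxy", 8080, "user", "secret")
#   fetch_status(opener, "http://de.wikipedia.org/wiki/Nutch")
```

If this works with the same host and port that were put into nutch-site.xml, the proxy itself is fine and the remaining question is whether the crawler actually reads those settings.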
Any idea how to configure Nutch to access the web via a proxy?

Thanks in advance for any hint or advice.
Simon
===================
Further information
Oozie job description

<java xmlns="uri:oozie:workflow:0.1">
  <job-tracker>localhost.localdomain:9001</job-tracker>
  <name-node>hdfs://localhost.localdomain:9000</name-node>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>default</value>
    </property>
  </configuration>
  <main-class>com.ibm.biginsights.apps.nutch.NutchApp</main-class>
  <arg>http://de.wikipedia.org/wiki/Nutch</arg>
  <arg>+</arg>
  <arg>hdfs://localhost.localdomain:9000/user/biadmin/Test3</arg>
  <arg>5</arg>
  <arg>50</arg>
</java>
  • rfchong
    rfchong
    5 Posts
    ACCEPTED ANSWER

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2013-04-19T20:18:55Z  in response to SystemAdmin

    Hi Simon,

    Sorry for the delay getting back to you. I've contacted the BigInsights Tech Support team who in turn have contacted the Development team to take a look at your issue in more detail.  I expect someone from Development to provide an answer shortly. Just to confirm...are you using BigInsights 2.0?

    Thanks,

    Raul Chong

    Senior Program Manager - Cloud and Big Data

    • simon123
      simon123
      2 Posts
      ACCEPTED ANSWER

      Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

      ‏2013-04-25T15:45:11Z  in response to rfchong

      Thanks for your reply, rfchong.

      We're using BigInsights 2.0

  • ZachZ
    ZachZ
    14 Posts
    ACCEPTED ANSWER

    Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

    ‏2013-04-22T20:02:51Z  in response to SystemAdmin

    Hi Simon,

    What kind of proxy server are you using? And what kind of authentication?

    Thank you,


    Zach

    • simon123
      simon123
      2 Posts
      ACCEPTED ANSWER

      Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

      ‏2013-04-25T15:43:33Z  in response to ZachZ

      Thanks for your reply, ZachZ.

      Our proxy server is squid. The authentication is based on a user account (containing username and password) and not on the MAC- or IP-address of the device.

      Simon

      Updated on 2013-04-25T15:44:38Z by simon123
      • rfchong
        rfchong
        5 Posts
        ACCEPTED ANSWER

        Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

        ‏2013-05-10T02:09:00Z  in response to simon123

        Hi Simon,

        I've sent a note to IBM Germany colleagues with an attachment with instructions on how this can be done (from the development team).  They should be reaching out to you.

        Cheers,

        Raul.

        • dougspadotto
          dougspadotto
          8 Posts
          ACCEPTED ANSWER

          Re: BigInsights Webcrawler / Apache Nutch / Proxy Setting

          ‏2014-07-29T12:51:41Z  in response to rfchong

          Hi everyone,

          Could the solution be shared publicly, please?

          I'm facing a similar issue and need to customize the BigInsights Nutch Crawler to use a proxy that authenticates via Active Directory. Was any technote published on this?

          Thanks in advance,

          Douglas