We installed BigInsights on a server that reaches the internet via a proxy.
We are trying out the web crawler:
1) We go to the applications and pick the web crawler.
2) We fill in all the relevant information in the form. As a test we used, e.g., wikipedia.org.
3) We "run" the task.
-> After the job has completed we look at the output (as basic crawler data) and we don't get any results.
The problem seemed to be the proxy
In segments/part-000000/data there is a small hint: java.net.UnknownHostException: de.wikipedia.org
This means that the crawler cannot resolve the host name and so cannot reach the internet. We suspect a problem with the proxy, because everything else seemed to work fine.
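A java.net.UnknownHostException is raised when a host name cannot be resolved to an IP address. On a machine that only reaches the internet through a proxy, direct DNS lookups for external names may fail even though proxied requests succeed. The following is a minimal sketch for checking that from the server itself; the function name can_resolve is hypothetical and not part of BigInsights or Nutch.

```python
import socket


def can_resolve(hostname):
    """Return True if this machine can resolve hostname directly (no proxy involved)."""
    try:
        # getaddrinfo asks the local resolver; it never goes through an HTTP proxy.
        socket.getaddrinfo(hostname, 80)
        return True
    except socket.gaierror:
        # Name resolution failed, the Python counterpart of UnknownHostException.
        return False
```

Calling can_resolve("de.wikipedia.org") on the BigInsights server shows whether the name can be resolved without the proxy. If it returns False while lynx works through the proxy, that would be consistent with the crawler not using the proxy settings at all, although it does not prove it.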
What we did to find the solution
We checked the proxy settings of Apache Nutch:
1) We unpacked the file /opt/ibm/biginsights/sheets/libext/nutch-1.4.jar
2) We put our proxy in the file nutch-site.xml:
<property>
  <name>http.proxy.host</name>
  <value>our.proxy</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
3) We repacked everything as a .jar file.
4) This change had no effect.
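The unpack, edit and repack steps above can be sketched as a single script, which makes the change easier to repeat and to verify. This is only an illustration under assumptions: the function name add_proxy_to_jar is hypothetical, the entry name nutch-site.xml is assumed to sit at the root of the jar (adjust it if the file lives in a subdirectory), and our.proxy and 8080 are the placeholder values from this post.

```python
import os
import tempfile
import zipfile

# Proxy properties to inject; host and port are the placeholders from this post.
PROXY_PROPERTIES = """<property>
  <name>http.proxy.host</name>
  <value>our.proxy</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
"""


def add_proxy_to_jar(jar_path, entry_name="nutch-site.xml",
                     proxy_xml=PROXY_PROPERTIES):
    """Rewrite the jar so that entry_name gains proxy_xml before </configuration>."""
    jar_dir = os.path.dirname(os.path.abspath(jar_path))
    # Build the new jar next to the old one so the final swap stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".jar", dir=jar_dir)
    os.close(fd)
    patched = False
    try:
        with zipfile.ZipFile(jar_path) as src, \
                zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                if item.filename == entry_name:
                    text = data.decode("utf-8")
                    if "</configuration>" not in text:
                        raise ValueError("no </configuration> tag in " + entry_name)
                    # Insert the properties just before the closing tag.
                    text = text.replace(
                        "</configuration>", proxy_xml + "</configuration>", 1)
                    data = text.encode("utf-8")
                    patched = True
                # All other entries are copied unchanged.
                dst.writestr(item, data)
        if not patched:
            raise FileNotFoundError(entry_name + " not found in " + str(jar_path))
        os.replace(tmp_path, jar_path)
    finally:
        # Remove the temporary file if the swap did not happen.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Reading the entry back afterwards with zipfile.ZipFile(jar_path).read("nutch-site.xml") confirms that the repacked jar really contains the proxy properties. It is worth working on a backup copy of the jar. Whether the crawler actually reads its configuration from this jar is a separate question that this sketch does not answer.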
Other important characteristics of our system
- lynx http://wikipedia.org works fine with the same proxy settings
- if we crawl localhost, there is no problem
Thanks in advance for any hint or advice.
Oozie job description
<java xmlns="uri:oozie:workflow:0.1">
  <job-tracker>localhost.localdomain:9001</job-tracker>
  <name-node>hdfs://localhost.localdomain:9000</name-node>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>default</value>
    </property>
  </configuration>
  <main-class>com.ibm.biginsights.apps.nutch.NutchApp</main-class>
  <arg>http://de.wikipedia.org/wiki/Nutch</arg>
  <arg>+</arg>
  <arg>hdfs://localhost.localdomain:9000/user/biadmin/Test3</arg>
  <arg>5</arg>
  <arg>50</arg>
</java>