Topic
  • 9 replies
  • Latest Post - 2014-03-31T12:57:01Z by AshishKumar9
dhmlau
23 Posts

Pinned topic how to view Web Crawler app result file

2012-05-03T22:44:40Z
Hi,

I'm trying out the "Web Crawler" app and have some problems in viewing the resulting file.

In the web crawler, I have the following:
urls: http://publib.boulder.ibm.com/infocenter/bigins/v1r3/index.jsp
filters: +install
output directory: /diana/test
max crawl depth: 5
max pages per crawl depth: 5

After the application ran successfully, I saw two files ("data" and "index") under /diana/test/crawldb/current/part-00000. I picked the "Sheet" view for both files and tried all the available readers, but still couldn't see anything in the sheet.

I'm using BigInsights 1.3 enterprise edition.

Thanks,
Diana
Updated on 2012-05-10T15:22:29Z by dhmlau
  • SystemAdmin
    603 Posts

    Re: how to view Web Crawler app result file

    2012-05-04T19:13:26Z
    Hi Diana,

    I think I am able to re-create the issue.

    I get something like this in my data file:
    SEQorg.apache.hadoop.io.Text!org.apache.nutch.crawl.CrawlDatumG�Eh��;�e�r�T
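
    For what it's worth, the readable part of that output looks like the start of a Hadoop SequenceFile header: the magic bytes SEQ, a version byte, then the key and value class names, each preceded by a length byte. The "!" in front of org.apache.nutch.crawl.CrawlDatum is byte 0x21 = 33, which is exactly the length of that class name. So the file is most likely binary crawl data rather than text, which would explain why it looks garbled in a plain viewer. Below is a minimal Python sketch, not anything shipped with BigInsights, that reads just those header fields. The function names, the sample bytes, and the version byte 6 are assumptions for illustration, and it only handles class names shorter than 128 bytes.

```python
import io


def read_short_text(stream):
    """Read one length-prefixed UTF-8 string.

    Sketch only: handles the single-byte length case (0..127).
    Longer strings use a multi-byte length encoding that is not covered here.
    """
    first = stream.read(1)
    if len(first) != 1:
        raise ValueError("unexpected end of header")
    length = first[0]
    if length > 127:
        raise ValueError("multi-byte length prefix not handled in this sketch")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of header")
    return data.decode("utf-8")


def describe_sequence_file(stream):
    """Return the version and key/value class names from a SequenceFile header."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile (missing SEQ magic bytes)")
    version_byte = stream.read(1)
    if len(version_byte) != 1:
        raise ValueError("unexpected end of header")
    return {
        "version": version_byte[0],
        "key_class": read_short_text(stream),
        "value_class": read_short_text(stream),
    }


def encode_short_text(name):
    """Helper for building the sample below: length byte followed by the name."""
    raw = name.encode("utf-8")
    return bytes([len(raw)]) + raw


# Hypothetical header bytes, built to resemble the output shown above.
# The version byte (6) is an assumption; it is not visible in the pasted text.
sample = (
    b"SEQ"
    + bytes([6])
    + encode_short_text("org.apache.hadoop.io.Text")
    + encode_short_text("org.apache.nutch.crawl.CrawlDatum")
)

info = describe_sequence_file(io.BytesIO(sample))
print(info)
```

    To try it on a real file, you would first copy the "data" file out of HDFS and pass open(path, "rb") instead of the BytesIO sample.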

    I will check and get back to you on this.

    Thanks,
    Zach
  • SystemAdmin
    603 Posts

    Re: how to view Web Crawler app result file

    2012-05-06T19:05:06Z
    I have not seen any resolution to this question. I have run several successful jobs, but I never seem to be able to actually see any output.

    I always apply a filter on the job's input screen (in the Filters pane under the URL pane), prefixed with a + or -. But when the job finishes, it seems to say "no filter applied" (see screen capture).

    That could be why there is no output. If I prefix the filter with a -, I get a set of URLs, presumably the URLs that were scanned.

    But if I prefix the search string with a +, I don't seem to get any URLs, and certainly no data from within the web page.
    It would be great to get a working example of what the output should look like. That would help a lot.
  • SystemAdmin
    603 Posts

    Re: how to view Web Crawler app result file

    2012-05-08T18:27:49Z
    1) I found out that there are some known "Web Crawler" app defects in the v1.3 release, which were fixed in v1.3 FP1.
    To be able to use the "Web Crawler" app, you should upgrade to FP1.

    2) To view the results in sheets:

    a. Click on the root folder where you placed your output (and not the data file)
    b. Click the Sheet radio button
    c. Select the 'Basic Crawler Data' reader

    See the attached file for more details.

    Note to Diana:

    For filters you are using +install, which I think you intended as a keyword ("give me all the pages which mention the word install").

    I found out that those are URL filters. It means the crawler will search only URLs that contain "install", and our Information Center doesn't have such URLs.

    So it was suggested to me that you change your filter to something like +publib.boulder.ibm.com
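
    To illustrate the difference between a keyword and a URL filter, here is a small Python sketch of how +/- rules of this kind could be evaluated. It is modeled on the way Apache Nutch regex URL filters are usually described: each rule is a + or - followed by a pattern that is tested against the URL, and the first matching rule wins. Whether the Web Crawler app evaluates filters exactly this way is an assumption. parse_rules, accepts, and the default_accept parameter are hypothetical names, and what the app does when no rule matches is not something this thread confirms.

```python
import re


def parse_rules(lines):
    """Turn lines such as '+install' or '-\\.gif$' into (include, pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url, default_accept=False):
    """Return True if the URL passes the filter.

    The pattern is searched for anywhere in the URL string itself;
    the page content is never looked at. The first matching rule decides.
    default_accept is an assumption for the case where no rule matches.
    """
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return default_accept


seed = "http://publib.boulder.ibm.com/infocenter/bigins/v1r3/index.jsp"

# '+install' only lets through URLs whose text contains "install".
keyword_style = parse_rules(["+install"])
print(accepts(keyword_style, seed))

# '+publib.boulder.ibm.com' matches the host name in the seed URL.
host_style = parse_rules(["+publib.boulder.ibm.com"])
print(accepts(host_style, seed))
```

    In this sketch the seed URL fails the +install rule simply because the string "install" does not appear in the URL, regardless of what the page says.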

    Thank you,

    Zach
  • SystemAdmin
    603 Posts

    Re: how to view Web Crawler app result file

    2012-05-10T14:56:56Z
    Zach, very good, thanks. That is exactly what I am seeing, i.e. a list of URLs.
  • dhmlau
    23 Posts

    Re: how to view Web Crawler app result file

    2012-05-10T15:22:29Z
    It worked for me as well. Thanks, Zach.
  • aaronglg
    14 Posts

    Re: how to view Web Crawler app result file

    2013-11-13T04:14:18Z
    Hi, could you please tell me the solution for viewing the results of the Web Crawler?

    I get the following result:

    SEQorg.apache.hadoop.io.Text!org.apache.nutch.crawl.CrawlDatumG�Eh��;�e�r�T

    I didn't see what Zach said, sorry.

    Thank you

  • PUNSAR
    4 Posts

    Re: how to view Web Crawler app result file

    2014-03-06T21:43:25Z
    I have the same issue. I am using BigInsights QuickStart V2.1. You suggested the three steps above, but I don't know how to do step c on the console:

    a. Click on the root folder where you placed your output (and not the data file)
    b. Click Sheet radio button
    c. Select 'Basic Crawler Data' reader

  • PUNSAR
    4 Posts

    Re: how to view Web Crawler app result file

    2014-03-06T21:51:11Z
    It is working now. I was able to find the basic crawler data reader option.

    Thanks

  • AshishKumar9
    4 Posts

    Re: how to view Web Crawler app result file

    2014-03-31T12:57:01Z
    I am facing the same issue.

    When I ran the Web Crawler application with the required configuration, it completed successfully and wrote the crawler output to the specified HDFS output directory.

    When I look at the output directory, it does contain a number of directories and files.

    To read the output, the recommendation is to convert it to a BigSheets sheet using the Basic Crawler Data reader.
    However, although the directory has data, the reader is not able to convert it, so no data is seen in BigSheets.

    Is there a bug in the Basic Crawler Data reader?