Topic
  • 3 replies
  • Latest Post - ‏2012-10-09T16:51:12Z by SystemAdmin
VsV
10 Posts

Web crawl. Content type

‏2012-09-28T17:42:15Z |
Hello!
How do I specify the 'Content type' for the web crawler? I want to crawl only plain HTML files, not images or video.
Is that possible?
Thanks.
  • VsV
    10 Posts

    Re: Web crawl. Content type

    ‏2012-10-08T10:59:20Z  
    In plain Nutch you can edit conf/regex-urlfilter.txt
    and add a rule that ignores URLs by file suffix: -\.(jpg|gif|zip|ico)$

    I think it will also work in BigInsights, but I have not checked it.
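
    A minimal sketch of how such a rule behaves, written in Python rather than as Nutch code. It assumes the usual regex-urlfilter.txt semantics: rules are tried top to bottom, the first matching rule decides, '-' excludes, '+' includes, and a URL that matches no rule is ignored. The RULES list and the accept function are hypothetical names used only for illustration. Note that this filters on the URL suffix, not on the HTTP Content-Type header.

```python
import re

# Hypothetical rule list in the style of conf/regex-urlfilter.txt.
# '-' excludes, '+' includes; the first rule whose pattern is found decides.
RULES = [
    ("-", re.compile(r"\.(jpg|gif|zip|ico)$")),  # skip these file suffixes
    ("+", re.compile(r".")),                     # accept everything else
]


def accept(url, rules=RULES):
    """Return True if the first matching rule is an include ('+') rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: the URL is ignored


if __name__ == "__main__":
    for u in ("http://example.com/index.html",
              "http://example.com/logo.jpg",
              "http://example.com/photo.JPG"):
        print(u, accept(u))
```

    The pattern is case-sensitive and anchored at the end of the URL, so in this sketch a URL ending in .JPG, or one with a query string after the suffix, is not excluded. Listing both cases in the rule is one way to cover the first situation.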
  • SystemAdmin
    603 Posts

    Re: Web crawl. Content type

    ‏2012-10-08T17:23:51Z  
    I am checking with our dev team on this. Hope to have an answer soon.
  • SystemAdmin
    603 Posts

    Re: Web crawl. Content type

    ‏2012-10-09T16:51:12Z  
    Hi,

    Our dev team thinks the filter you have should work in BigInsights, but we have not tried it.

    Please let me know if you have any more questions after you try it.

    Thank you,

    Zach