3 replies Latest Post - ‏2012-10-09T16:51:12Z by SystemAdmin
VsV
10 Posts

Web crawl. Content type

2012-09-28T17:42:15Z
Hello!
How do I specify the 'Content type' for the web crawler? I want to crawl only plain HTML files, not images or video.
Is that possible?
Thanks.
Updated on 2012-10-09T16:51:12Z by SystemAdmin
  • VsV
    10 Posts

    Re: Web crawl. Content type

    ‏2012-10-08T10:59:20Z  in response to VsV
    In plain Nutch you can edit conf/regex-urlfilter.txt and add a rule listing the file suffixes to ignore:
    -\.(jpg|gif|zip|ico)$

    I think it will also work in BigInsights (BI), but I have not checked it.
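
To see what such a rule would do before running a crawl, here is a minimal Python sketch that mimics how I understand the Nutch regex URL filter to behave: rules are tried in order, the first pattern found anywhere in the URL decides ('-' skips, '+' keeps), and a URL that matches no rule is skipped. The names RULES, compile_rules and accepts are made up for illustration and are not part of Nutch or BigInsights. Note that this filters on the URL suffix only; it does not look at the HTTP Content-Type header.

```python
import re

# Hypothetical rule list mirroring lines of conf/regex-urlfilter.txt.
# A leading '-' excludes matching URLs, a leading '+' includes them.
RULES = [
    r"-\.(jpg|gif|zip|ico)$",  # skip URLs ending in these suffixes
    r"+.",                      # keep everything else
]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (include, compiled_pattern) pairs."""
    compiled = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in the URL.

    Assumption: a URL that matches no rule is rejected.
    """
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False
```

With these rules, accepts("http://example.com/index.html", compile_rules(RULES)) is True and accepts("http://example.com/logo.jpg", compile_rules(RULES)) is False. The match is case-sensitive, so a URL ending in .JPG would still pass unless upper-case suffixes are added to the rule.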
    • SystemAdmin
      603 Posts

      Re: Web crawl. Content type

      ‏2012-10-08T17:23:51Z  in response to VsV
      I am checking with our dev team on this and hope to have an answer soon.
      • SystemAdmin
        603 Posts

        Re: Web crawl. Content type

        ‏2012-10-09T16:51:12Z  in response to SystemAdmin
        Hi,

        The dev team thinks the filter you have should work in BigInsights, but we have not tried it.

        Please let me know if you have any more questions after you try it.

        Thank you,

        Zach