How to specify 'Content type' for web crawler? I want to crawl only plain HTML files, no images or video.
Is it possible?
VsV 120000JNKJ 10 Posts
Re: Web crawl. Content type
2012-10-08T10:59:20Z
This is the accepted answer.
In pure Nutch it is possible to edit conf/regex-urlfilter.txt:
Add a rule listing the file suffixes to ignore: -\.(jpg|gif|zip|ico)$
I think it will also work in BigInsights (BI), but I have not checked it.
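To see how such a rule behaves before touching a real crawl, here is a rough Python sketch that imitates the rule file. It assumes the usual regex-urlfilter behaviour: each rule is a `+` or `-` followed by a regular expression, rules are tried top to bottom, the first one that matches anywhere in the URL decides, and a URL that matches no rule is dropped. The names `RULES`, `parse_rules` and `accepts` are made up for this illustration, and the extra video suffixes are only examples. Note that this filters on the URL text, not on the Content-Type the server returns, so an image served from a URL without a suffix would still get through.

```python
import re

# Hypothetical rule list written in the style of conf/regex-urlfilter.txt.
RULES = r"""
# skip URLs ending in image, video, or archive suffixes (illustrative list;
# matching is case-sensitive, so add upper-case variants if you need them)
-\.(jpg|jpeg|gif|png|ico|zip|avi|mp4|mov)$
# accept everything else
+.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(url, rules):
    """First matching rule decides; a URL that matches no rule is dropped."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

rules = parse_rules(RULES)
for url in ["http://example.com/index.html",
            "http://example.com/logo.gif",
            "http://example.com/clip.mp4",
            "http://example.com/docs/"]:
    print(url, "->", "crawl" if accepts(url, rules) else "skip")
```

The order of the rules matters here: if the `+.` line came first, it would match every URL and the suffix rule would never be reached.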
SystemAdmin 110000D4XK 603 Posts
Re: Web crawl. Content type
2012-10-09T16:51:12Z
This is the accepted answer.
The development team thinks the filter you have should work in BigInsights, but we have not tried it.
Please let me know if you have any more questions after you try it.