Filtering URLs At Crawl Time
About this task
If you do not want certain URLs to be crawled at all, you can add a conditional setting to your OneDrive for Business connector seed configuration. The setting filters URLs by specifying conditions that they must satisfy or conditions that they must never satisfy. For example, you can create a filter so that the crawler never crawls URLs that match a certain pattern.
Adding a Custom attribute URL filter causes the OneDrive for Business connector to ignore URLs that meet the filter criteria. As a result, the crawler does not feed filtered URLs to the index.
To add a Custom attribute URL filter as a conditional setting, navigate to the Crawling Configuration page of your search collection, and do the following:
Procedure
Results
The custom attribute URL filter will now be applied as a conditional setting to your search collection.
Optionally, if you want your filter to include files located in the folder indicated by the filter URL https://xx555.genericsite.it:30000/ccxxxx/200001, you can add URL filtering by appending the following to your XML:
*xx.foo.com *xx360.foo.com/ *xx360.foo.com:30000 *xx360.foo.com:30000/ *xxfoo/ *xxfoo *xxfoo/200001*
If a site collection with a Managed Path URL is specified as a seed URL for the OneDrive for Business connector, then all of the children of the site collection will be excluded from the crawl (including the site collection's immediate children).
An example of a Managed Path URL site collection:
https://onedrive-host/not-root/site-collection/
An example of a root site collection URL, which includes all children in the crawl:
https://onedrive-host/
<call-function name="vse-crawler-allow">
<with name="field">url</with>
<with name="pattern">io-sp://onedrive-host/*</with>
</call-function>
To enable child URLs, add a Custom attribute URL filter condition:
<crawl-condition-when how="wc-set" field="protocol">
<crawl-pattern>io-sp</crawl-pattern>
<call-function name="vse-crawler-custom-attribute-url-filter">
<with name="urls">
https://onedrive-host/not-root/site-collection https://onedrive-host/not-root/site-collection/*
</with>
</call-function>
</crawl-condition-when>
<crawl-condition-when how="wc-set" field="protocol">
<crawl-pattern>io-sp</crawl-pattern>
<curl-options>
<crawl-extender-option name="seed-protocol">https</crawl-extender-option>
</curl-options>
</crawl-condition-when>
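To check which URLs the two entries in the urls value of the Custom attribute URL filter condition would select, a small script can help before you change the crawl configuration. The following is a minimal sketch in Python, not the connector's own matching logic. It assumes glob-style matching, where * matches any run of characters (including none); the matching rules that the connector actually applies are not described here. The function name matches_filter and the candidate URLs are hypothetical and used only for illustration.

```python
from fnmatch import fnmatchcase

# The two entries from the "urls" value in the filter condition above.
PATTERNS = [
    "https://onedrive-host/not-root/site-collection",
    "https://onedrive-host/not-root/site-collection/*",
]


def matches_filter(url, patterns=PATTERNS):
    """Return True if url matches at least one pattern.

    Assumption: glob-style, case-sensitive matching, where "*"
    matches any run of characters. The connector may behave differently.
    """
    return any(fnmatchcase(url, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical candidate URLs, for illustration only.
    candidates = [
        "https://onedrive-host/not-root/site-collection",
        "https://onedrive-host/not-root/site-collection/Documents/report.docx",
        "https://onedrive-host/not-root/site-collection-2/Documents/report.docx",
        "https://onedrive-host/",
    ]
    for url in candidates:
        print(url, matches_filter(url))
```

Under this assumption, the first entry selects the site collection URL itself and the second entry selects everything below it. A sibling collection whose name merely begins with the same text, such as site-collection-2, is not selected, because the second entry requires the / separator before the wildcard.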