Filtering URLs At Crawl Time
About this task
If you do not want certain URLs to be crawled at all, you can add a conditional setting to your OneDrive for Business connector seed configuration. The setting filters URLs by conditions that they must satisfy or conditions that they must never satisfy. For example, you can create a filter so that the crawler never crawls URLs that match a certain pattern.
Adding a Custom attribute URL filter causes the OneDrive for Business connector to ignore URLs that meet the filter criteria. As a result, the crawler does not feed filtered URLs to the index.
To add a Custom attribute URL filter as a conditional setting, navigate to your search collection Crawling Configuration page, and do the following:
- In the Conditional Settings menu, click Add a new condition.
- From the list of options, scroll down and select Custom attribute URL filter and click Add. The Custom attribute URL filter menu expands.
- Set the desired options for the custom attribute URL filter. You can set the filter to retrieve or ignore all URLs that match the specified criteria. You can also specify the type of URL to which the filter applies, the URL string matching method, and wildcard criteria.
Tip: If you want to know exactly which URLs were ignored by the crawler, enable Filter level logging. The URLs that matched the filter and were ignored by the crawler are then shown on the live status page for your collection.
- After making the desired configuration changes, click OK/Apply.
The custom attribute URL filter will now be applied as a conditional setting to your search collection.
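To make the "retrieve" versus "ignore" behavior and the wildcard criteria more concrete, the following is a minimal sketch in Python. It is not the connector's implementation: the function name `is_crawlable`, the `mode` values, and the example host and paths are all hypothetical, and the standard library `fnmatch` wildcard rules are only an approximation of whatever matching method you select in the filter options.

```python
from fnmatch import fnmatchcase


def is_crawlable(url, patterns, mode="ignore"):
    """Decide whether a URL passes a wildcard filter (illustrative only).

    mode="ignore":   URLs matching any pattern are skipped.
    mode="retrieve": only URLs matching some pattern are crawled.
    """
    if mode not in ("ignore", "retrieve"):
        raise ValueError("mode must be 'ignore' or 'retrieve'")
    # fnmatchcase treats '*' as "any run of characters", including '/'.
    matched = any(fnmatchcase(url, pattern) for pattern in patterns)
    return matched if mode == "retrieve" else not matched


# Hypothetical patterns and URLs, chosen only for this example.
patterns = ["https://example-host/archive", "https://example-host/archive/*"]
urls = [
    "https://example-host/archive/2019/report.docx",
    "https://example-host/projects/plan.docx",
]

for url in urls:
    print(url, is_crawlable(url, patterns, mode="ignore"))
```

Listing both the bare folder URL and the folder URL followed by `/*` covers the folder itself and everything beneath it without also matching sibling paths that merely share a prefix. Note that `fnmatch` also treats `?` and `[` as wildcard characters, which may differ from the crawler's own wildcard syntax.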
Optionally, if you want your filter to include files located in the folder indicated by the filter URL https://xx555.genericsite.it:30000/ccxxxx/200001, you can add filtering on URLs by appending the following to your XML:
*xx.foo.com *xx360.foo.com/ *xx360.foo.com:30000 *xx360.foo.com:30000/ *xxfoo/ *xxfoo *xxfoo/200001*
If a site collection with a Managed Path URL is specified as a seed URL for the OneDrive for Business connector, all of the children of the site collection are excluded from the crawl, including the site collection's immediate children.
This behavior differs from that of a root site collection URL, which includes all children in the crawl.
To enable child URLs, add a Custom attribute URL filter condition:
<call-function name="vse-crawler-allow">
  <with name="field">url</with>
  <with name="pattern">io-sp://onedrive-host/*</with>
</call-function>
<crawl-condition-when how="wc-set" field="protocol">
  <crawl-pattern>io-sp</crawl-pattern>
  <call-function name="vse-crawler-custom-attribute-url-filter">
    <with name="urls">
      https://onedrive-host/not-root/site-collection
      https://onedrive-host/not-root/site-collection/*
    </with>
  </call-function>
</crawl-condition-when>
<crawl-condition-when how="wc-set" field="protocol">
  <crawl-pattern>io-sp</crawl-pattern>
  <curl-options>
    <crawl-extender-option name="seed-protocol">https</crawl-extender-option>
  </curl-options>
</crawl-condition-when>
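Because these fragments are edited by hand, a well-formedness check before pasting them into the configuration can catch problems such as a misspelled closing tag. The following is a minimal sketch that uses Python's standard `xml.etree.ElementTree` parser. The helper name `find_xml_error` is hypothetical and not part of the product, and the check confirms only that a fragment is well-formed XML, not that the crawler accepts its elements or values.

```python
import xml.etree.ElementTree as ET


def find_xml_error(fragment):
    """Return None if the fragment parses as XML, else the parser's message."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError as err:
        return str(err)
    return None


# A well-formed fragment, and one whose closing tag is misspelled.
good = "<crawl-condition-when><crawl-pattern>io-sp</crawl-pattern></crawl-condition-when>"
bad = "<crawl-condition-when><crawl-pattern>io-sp</crawl-pattern></craw-condition-when>"

print(find_xml_error(good))
print(find_xml_error(bad))
```

Each fragment must have a single root element for this check, so validate one `call-function` or `crawl-condition-when` block at a time.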