Filtering URLs At Crawl Time

About this task

If you do not want certain URLs to be crawled at all, you can add a conditional setting to your OneDrive for Business connector seed. The setting filters URLs by specifying conditions that they must satisfy or conditions that they must never satisfy. For example, you can create a filter so that the crawler never crawls URLs that match a certain pattern.

Adding a Custom attribute URL filter to the seed configuration causes the OneDrive for Business connector to ignore URLs that meet the filter criteria. As a result, the crawler does not feed filtered URLs to the index.

Note: If you have URLs that have already been crawled and are part of the available index, you can blacklist those URLs if you do not want them returned as search results. This may seem like the same thing as adding a Custom attribute URL filter, but the key distinction is that a Custom attribute URL filter prevents URLs from being indexed at crawl time, whereas a blacklisted URL is indexed, but prevented from being returned at search time.

To add a Custom attribute URL filter as a conditional setting, navigate to your search collection Crawling Configuration page, and do the following:

Procedure

  1. In the Conditional Settings menu, click Add a new condition.
  2. From the list of options, scroll down, select Custom attribute URL filter, and click Add. The Custom attribute URL filter menu expands.
  3. Set the desired options for the custom attribute URL filter. You can set the filter to retrieve or ignore all URLs that match the specified criteria. You can also specify the exact type of URL to which the filter applies, as well as URL string matching methods and wildcard criteria.
    Tip: If you want to know exactly which URLs the crawler ignored, enable Filter level logging. The URLs that matched the filter and were ignored by the crawler are shown on the live status page for your collection.
  4. After making the desired configuration changes, click OK/Apply.

Results

The custom attribute URL filter will now be applied as a conditional setting to your search collection.

Optionally, if you want your filter to include files located in the folder indicated by the filter URL https://xx555.genericsite.it:30000/ccxxxx/200001, you can add filtering on URLs by appending the following to your XML:

*xx.foo.com
*xx360.foo.com/
*xx360.foo.com:30000
*xx360.foo.com:30000/
*xxfoo/
*xxfoo
*xxfoo/200001*

If a site collection with a Managed Path URL is specified as a seed URL for the OneDrive for Business connector, then all of the children of the site collection are excluded from the crawl, including the site collection's immediate children.

An example of a Managed Path URL site collection: https://onedrive-host/not-root/site-collection/

An example of a root site collection URL, which includes all children in the crawl: https://onedrive-host/

To ensure that all child URLs are crawled when you specify a Managed Path URL, you need to add a crawl condition that enables the child URLs to be crawled. You can do this by using an Allow URLs crawl condition that matches the specified crawl URL:

<call-function name="vse-crawler-allow">
  <with name="field">url</with>
  <with name="pattern">io-sp://onedrive-host/*</with>
</call-function>
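
As a narrower variant of the condition above, the pattern can be limited to the Managed Path site collection instead of the whole host. This is a sketch only: it assumes that the crawler addresses child URLs under the io-sp scheme with the same path as the seed URL, and it reuses the placeholder host and path from the examples in this topic.

```xml
<!-- Assumption: child URLs appear as io-sp://<host>/<managed-path>/... -->
<call-function name="vse-crawler-allow">
  <with name="field">url</with>
  <with name="pattern">io-sp://onedrive-host/not-root/site-collection/*</with>
</call-function>
```

The host-wide pattern is simpler. The narrower one may be preferable if other site collections on the same host should stay out of the crawl.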

To enable child URLs by adding a Custom attribute URL filter condition:

<crawl-condition-when how="wc-set" field="protocol">
  <crawl-pattern>io-sp</crawl-pattern>
  <call-function name="vse-crawler-custom-attribute-url-filter">
    <with name="urls">
https://onedrive-host/not-root/site-collection https://onedrive-host/not-root/site-collection/*
    </with>
  </call-function>
</crawl-condition-when>

Note: If the site collection's protocol is HTTP over SSL (HTTPS), then you also need to add a crawl condition that sets the seed-protocol setting on the children of the site collection:

<crawl-condition-when how="wc-set" field="protocol">
  <crawl-pattern>io-sp</crawl-pattern>
  <curl-options>
    <crawl-extender-option
        name="seed-protocol">https</crawl-extender-option>
  </curl-options>
</crawl-condition-when>