SharePoint crawler - configuration properties

The SharePoint crawler crawls Microsoft SharePoint Server and SharePoint Online.

The Create crawler: SharePoint screen is where you enter the configuration parameters for this crawler.

Prerequisites for SharePoint on-premises

The web service packages are bundled in the Docker container, in the /opt/ibm/wex/zing/resources directory.

Before you configure a SharePoint crawler, you must deploy web services on the SharePoint server to allow the crawler to access content. For more information, see Deploying web services for SharePoint crawlers and Deploying web services for SharePoint crawlers to SharePoint 2016.

Then complete the following procedure:

From the SharePoint Management Shell, run the following command: Add-SPSolution -LiteralPath C:\ESSPSolution.wsp
Open SharePoint Central Administration > System Settings.
Select the link at Farm Management > Manage Farm Solutions.
Click Deploy Solution.

When you create the crawler, you can specify credentials that allow the crawler to connect to the sources to be crawled.

To crawl a SharePoint server, the connection user ID that the crawler uses must be able to access the target SharePoint server URL. This ID must have Full Read permission for the web application to be crawled.
To crawl social elements, the connection ID that the crawler uses must be added as an Administrator under the User Profile Service Application settings. The connection ID must also be granted the Manage Social Data permission.
When you create a SharePoint crawler, only the sites that can be crawled with the specified connection credentials are listed as candidates.

Prerequisites for SharePoint Online

Legacy authentication must be enabled for crawl user accounts.
To enable legacy authentication, go to the Azure portal or contact your Azure Active Directory administrator.
The connector supports only the Password hash synchronization (PHS) method for enabling hybrid identity. Use any other method (such as pass-through authentication or federation) at your own risk.
Two-factor authentication must be disabled for the account. Unless you created your SharePoint Online account before January 2020, two-factor authentication is enabled for the account by default.
The crawl account must have an Azure Active Directory user ID with permission to access all the objects that you want to crawl, for example, admin_user@company.onmicrosoft.com. The user ID must have Site Collection Administrator permission.
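
As a rough pre-check, the account name can be compared with the form shown in the example above. The following is a minimal sketch only: the function name looks_like_azure_ad_user and the regular expression are assumptions made for illustration and are not part of the product. The pattern is deliberately loose and does not confirm that the account exists or has the required permission.

```python
import re

# Assumed pattern for illustration: <username>@<domain>.onmicrosoft.com
_AZURE_AD_USER = re.compile(r"[^@\s]+@[A-Za-z0-9-]+\.onmicrosoft\.com", re.IGNORECASE)


def looks_like_azure_ad_user(user_id: str) -> bool:
    """Return True if user_id has the form <username>@<domain>.onmicrosoft.com."""
    return _AZURE_AD_USER.fullmatch(user_id) is not None
```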

Crawler Properties

Crawler name

The name of the crawler. Alphanumeric characters, hyphens, underscores, and spaces are allowed.
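
The naming rule can be expressed as a simple pattern check. This is a sketch under assumptions: it reads "alphanumeric" as ASCII letters and digits, and the helper name is_valid_crawler_name is hypothetical. The product might accept a wider character set.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9_\- ]+")


def is_valid_crawler_name(name: str) -> bool:
    """Return True if name contains only letters, digits, hyphens, underscores, and spaces."""
    return _ALLOWED_NAME.fullmatch(name) is not None
```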

Crawler description

A description of the crawler.

Advanced options

Time to wait between retrieval requests: The time to wait between retrieval requests, in milliseconds.
Maximum number of active crawler threads: The maximum number of crawler threads that can be active at the same time.
Maximum document size: The maximum size of a document, in kilobytes. The maximum value is 131,071 kilobytes.
When the crawler session is started: Specifies which content to crawl.
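
The units and the upper limit above can be checked before the values are entered. This is a minimal sketch: only the 131,071 kilobyte limit comes from the text. The lower bounds and the function name check_advanced_options are assumptions made for illustration.

```python
from typing import List

MAX_DOCUMENT_SIZE_KB = 131_071  # upper limit stated in the text


def check_advanced_options(wait_ms: int, max_threads: int, max_doc_size_kb: int) -> List[str]:
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    if wait_ms < 0:
        # Assumed lower bound: a wait time cannot be negative.
        problems.append("wait time must not be negative")
    if max_threads < 1:
        # Assumed lower bound: at least one thread is needed to crawl.
        problems.append("at least one crawler thread is needed")
    if not 1 <= max_doc_size_kb <= MAX_DOCUMENT_SIZE_KB:
        # The lower bound of 1 is assumed; the upper bound is from the text.
        problems.append(f"document size must be 1..{MAX_DOCUMENT_SIZE_KB} KB")
    return problems
```

For reference, 131,071 kilobytes is 134,216,704 bytes, which is 1 kilobyte less than 128 megabytes.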

Data Source Properties

SharePoint Web Service URL

The SharePoint web service URL. For example, http://wss.com/.

User name

The user name that the crawler uses to access SharePoint.

Password

The password of the specified user.

View names of SharePoint Lists

SharePoint view names to be used for crawling. The first existing view in the list is used.
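
The selection rule can be sketched as follows. The sketch reads "the list" as the ordered list of view names entered in this field. The function name pick_view, the exact-match comparison, and the None result when no view exists are assumptions, because the text does not describe those details.

```python
from typing import Iterable, Optional, Sequence


def pick_view(configured_views: Sequence[str], existing_views: Iterable[str]) -> Optional[str]:
    """Return the first configured view name that exists on the SharePoint list, else None."""
    existing = set(existing_views)
    for name in configured_views:
        if name in existing:  # assumed: exact, case-sensitive match
            return name
    return None
```

For example, pick_view(["Crawl View", "All Items"], ["All Items", "By Author"]) returns "All Items", because "Crawl View" does not exist on that list.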

Respect SharePoint search visibility for Sites

Enable this option to crawl only sites that are configured to be visible in search results.

Respect SharePoint search visibility for Lists

Enable this option to crawl only lists that are configured to be visible in search results.

Crawl social data

Enable this option to crawl social data.

SharePoint 2010 and later support several social elements, including comments, rating averages, and tags. If the crawler has permission to access social elements, these elements are mapped as field metadata when documents are crawled. Add the Manage Social Data permission for the crawler user in SharePoint before you create the SharePoint crawler.

Proxy settings

Enable this option when you use a proxy server to access the data source server.

Proxy server hostname: The proxy server hostname.
Proxy server port: The proxy server port.
User ID for the proxy server: The user ID for the proxy server.
Password for the proxy server: The password for the proxy server.

Use SAML authentication

Enable this option to use SAML authentication. This option is ignored when the Crawl SharePoint Online option is enabled. This option must be enabled to activate document-level security for SharePoint Server (on-premises versions of SharePoint).

Identity Provider endpoint: The URL of the Identity Provider endpoint (for example, https://adfs.server.example.com/adfs/services/trust/2005/UsernameMixed). This option is ignored when the Crawl SharePoint Online option is enabled.
Relying Party Trust endpoint: The URL of the Relying Party Trust endpoint (if empty, the following value is used: https://<sharepoint_server>:<port>/_trust/). This option is ignored when the Crawl SharePoint Online option is enabled.
Relying Party Trust identifier: The URL of the Relying Party Trust identifier (if empty, the following value is used: https://<sharepoint_server>:<port>/_trust/). This option is ignored when the Crawl SharePoint Online option is enabled.

Crawl SharePoint Online

Enable this option when you crawl SharePoint Online. The SharePoint crawler currently can crawl SharePoint Online only with the default Azure Active Directory (Azure AD) authentication. Crawling SharePoint Online with other types of authentication, such as your own local Active Directory Federation Service (ADFS), is not supported. The user name for the default Azure AD authentication takes the form <username>@<domain>.onmicrosoft.com. Consult Microsoft support for more details about SharePoint Online configuration.

Identity Provider endpoint: The URL of the Identity Provider endpoint (if empty, the following value is used: https://login.microsoftonline.com/extSTS.srf).
Relying Party Trust endpoint: The URL of the Relying Party Trust endpoint (if empty, the following value is used: https://<sharepoint_server>:<port>/_forms/default.aspx/).
Relying Party Trust identifier: The URL of the Relying Party Trust identifier (if empty, the following value is used: https://<sharepoint_server>:<port>).
Application (Client) ID assigned on Azure Portal: The application (client) ID that was assigned when you registered an application with Azure Active Directory (Azure AD). This ID must be set when you enable document-level security for SharePoint Online. You can find the ID in the Microsoft Azure Portal. For more information about this ID, see Configuring Document-level Security for SharePoint Online.
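
The fallback values for empty endpoint fields, for both SAML authentication and SharePoint Online, can be summarized in a short sketch. The function name effective_endpoints, its parameters, and the dictionary keys are hypothetical. The sketch also assumes that base is given as https://<sharepoint_server>:<port> without a trailing slash, and that the caller passes the fields from whichever section applies.

```python
from typing import Dict

ONLINE_IDP_DEFAULT = "https://login.microsoftonline.com/extSTS.srf"


def effective_endpoints(base: str, online: bool, idp: str = "",
                        rp_endpoint: str = "", rp_identifier: str = "") -> Dict[str, str]:
    """Fill empty fields with the defaults that the text lists for each mode."""
    if online:
        # Crawl SharePoint Online is enabled, so the SAML authentication settings are ignored.
        return {
            "idp": idp or ONLINE_IDP_DEFAULT,
            "rp_endpoint": rp_endpoint or f"{base}/_forms/default.aspx/",
            "rp_identifier": rp_identifier or base,
        }
    # SAML authentication for SharePoint Server: the text lists no default
    # for the Identity Provider endpoint, so it is passed through unchanged.
    return {
        "idp": idp,
        "rp_endpoint": rp_endpoint or f"{base}/_trust/",
        "rp_identifier": rp_identifier or f"{base}/_trust/",
    }
```

Values that are entered explicitly always take precedence in this sketch. The defaults apply only to fields that are left empty.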

Crawl space Properties

You can find and add multiple crawl spaces for a SharePoint crawler. For instructions, see Finding and adding crawl spaces in a SharePoint crawler.

Crawler plug-in

Data source crawler plug-ins are Java™ applications that can change the content or metadata of crawled documents. You can configure a data source crawler plug-in for all non-web crawler types. For more information, see Crawler plug-ins.

Enable the crawler plug-in: Enable this option when you use the crawler plug-in.
Plug-in class name: The class name for the crawler plug-in.
Plug-in class path: The JAR file location of the crawler plug-in. The folder that contains the JAR file must be mounted so it is available. For more information, see Providing access to the local filesystem from Watson Explorer oneWEX.