SharePoint crawlers

You can configure a SharePoint crawler to include Windows SharePoint Server content in a collection.

Before you configure a SharePoint crawler, you must deploy the provided web services on the SharePoint server. These web services enable support for several functions, including farm-aware crawling and secure search. If you upgrade from an earlier version of Watson Explorer Content Analytics, you must redeploy the web services before you run the crawler in Version 11.0.2.

Crawler connection credentials

When you create the crawler, you can specify credentials that allow the crawler to connect to the sources to be crawled. You can also configure connection credentials when you specify general security settings for the system. If you use the latter approach, multiple crawlers and other system components can use the same credentials. For example, the search servers can use the credentials when determining whether a user is authorized to access content.

Restrictions:
  • To crawl a SharePoint server, the connection user ID that the crawler uses must be able to access the target SharePoint server URL. This ID must have Full Read permission for the web application to be crawled.
  • To crawl social elements, the connection user ID that the crawler uses must be added as an Administrator under the User Profile Service Application settings and granted the Manage Social Data permission.
  • When you create a SharePoint crawler, only the sites that can be crawled with the specified connection credentials are listed as candidates.
  • By SharePoint design, metadata for an attachment is inherited from the parent list item. Individual attachments do not have any metadata other than the file name. Therefore, the file size, file type, and other metadata cannot be indexed for attachments to SharePoint list items.

Farm-aware crawling

To crawl sites in a SharePoint farm, you must specify the URL of the web application, such as http://server (for the default port 80) or http://server:10000 (to specify a port), as the SharePoint Web Service URL when you create a SharePoint crawler. The crawler can detect all top-level sites that belong to the specified web application. You can configure the crawler to collect content from multiple top-level sites and specify options for crawling content from sites that belong to a top-level site.
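As a quick check of the URL form described above, the following minimal Python sketch splits a web application URL into scheme, host, and port, and fills in the default port when none is given. The function name web_application_endpoint is hypothetical, and the https default of 443 is standard HTTP behavior rather than something stated in this topic; the crawler itself does not expose such a function.

```python
from urllib.parse import urlsplit

def web_application_endpoint(url):
    """Return (scheme, host, port) for a web application URL.

    Hypothetical helper for illustration only; it is not part of the crawler.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a web application URL: {url!r}")
    # Port 80 is the default for http (443 for https, an assumption added here).
    default_port = 80 if parts.scheme == "http" else 443
    return parts.scheme, parts.hostname, parts.port or default_port

print(web_application_endpoint("http://server"))        # ('http', 'server', 80)
print(web_application_endpoint("http://server:10000"))  # ('http', 'server', 10000)
```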

When you select the SharePoint content to crawl, you specify filters to include content in or exclude content from the crawl space. The filter rules can include wildcard characters for site collection names, site names, and URLs. Use wildcard characters to include and exclude content that matches a pattern.
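The general idea of include and exclude rules with wildcard characters can be sketched in Python as follows. This is an illustration only: it uses shell-style patterns from the standard fnmatch module, and it assumes that an exclude rule overrides an include rule. The crawler's actual wildcard syntax and rule precedence are not described in this topic, and the function name and sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase

def in_crawl_space(url, include_patterns, exclude_patterns):
    """Return True if url matches an include pattern and no exclude pattern.

    Assumption: exclusion wins over inclusion. The product may behave differently.
    """
    included = any(fnmatchcase(url, p) for p in include_patterns)
    excluded = any(fnmatchcase(url, p) for p in exclude_patterns)
    return included and not excluded

# Hypothetical rules: crawl everything under /sites/ except archive sites.
include = ["http://server/sites/*"]
exclude = ["http://server/sites/archive*"]

print(in_crawl_space("http://server/sites/hr/policies", include, exclude))    # True
print(in_crawl_space("http://server/sites/archive2019/x", include, exclude))  # False
```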

To enable the crawler to collect content from a SharePoint farm, you must deploy provided web services on the SharePoint server before you create a SharePoint crawler.

Social elements

SharePoint supports several social elements, such as comments, rating averages, and tags. If the crawler has permission to access social elements, these social elements are mapped as field metadata when documents are crawled. You must add the Manage Social Data permission for the crawler user in SharePoint before you create the SharePoint crawler.

Social search

If you create a collection that supports social search, you can collect person information that enables users to explore relationships between people, documents, and tags. For example, users can see person cards for people relevant to a query, see recommendations for related documents and people, explore relationships through a social network graph, and explore tags through a weighted tag cloud. For more information, read about support for social search and creating person crawlers.

Compound documents

If a document contains multiple parts, and you want all parts of the document to be treated as a single document in the search results, you can configure the crawler to support compound documents. In this case, a parent document that contains child documents can be searched as a single document. If the search terms are found, all of the child documents are listed with the parent document in the search results. If support for compound documents is not enabled in the crawler configuration, the parent and child documents are searched separately and returned as separate documents in the search results. For more information, read about support for crawling compound documents.

SAML authentication

To crawl a SharePoint site that uses SAML authentication, select Use SAML authentication in the Advanced options and enter the Identity Provider endpoint URI. Other settings might be necessary, depending on the SAML configuration. Document-level security based on SAML authentication is not supported.

Security

If security is enabled when a collection is created, the crawler can associate security data with documents in the index. This data enables applications to pre-filter the search results based on the indexed access control list (ACL). The system indexes the document-level ACL, not the list-level ACL. If you prefer to index the list-level ACL, edit the ES_NODE_ROOT/master_config/collection_ID.crawler_ID/crawler_config.xml file. Under the crawler element, set the sp:crawl_document_acl property to false, and then restart the crawler:

<property name="sp:crawl_document_acl" type="boolean" options="empty_if_missing_at_runtime">false</property>
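If you prefer to script this change rather than edit the file by hand, a minimal Python sketch might look like the following. It assumes that the property element can be found by its name attribute and that the file uses no XML namespaces; the sample document is a heavily simplified, hypothetical stand-in for crawler_config.xml, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

PROPERTY_NAME = "sp:crawl_document_acl"

def set_crawl_document_acl(root, enabled):
    """Set the text of the sp:crawl_document_acl property; return True if found."""
    for prop in root.iter("property"):
        if prop.get("name") == PROPERTY_NAME:
            prop.text = "true" if enabled else "false"
            return True
    return False

# Hypothetical, simplified stand-in for the real crawler_config.xml content.
sample = (
    '<crawler>'
    '<property name="sp:crawl_document_acl" type="boolean" '
    'options="empty_if_missing_at_runtime">true</property>'
    '</crawler>'
)

root = ET.fromstring(sample)
found = set_crawl_document_acl(root, enabled=False)
print(found, root.find("property").text)  # True false
```

To apply the same function to the real file, you could load it with ET.parse, pass the result of getroot() to the function, and save it with the tree's write method. Back up the file first, because ElementTree does not preserve comments or the original formatting by default. Restart the crawler after the change.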

To support pre-filtering, so that users search only the documents that they are allowed to see, ensure that users have the following SharePoint permissions:
  • View Items
  • View Application Pages
  • View Pages
  • Open

You can also configure security settings to validate user credentials when a user submits a query. In this case, instead of comparing user credentials to indexed security data, the system post-filters the search results by comparing the user credentials to current ACLs that are maintained by the original data source. The SharePoint crawler supports BASIC authentication, Digest authentication, and NT LAN Manager (NTLM) authentication through Internet Information Services (IIS).
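The difference between pre-filtering and post-filtering can be illustrated with a small Python sketch. The data structures, function names, and group names below are hypothetical and do not reflect how the system stores or checks ACLs; the sketch only shows that pre-filtering consults the ACL captured at index time, while post-filtering consults the ACL that the source holds at query time.

```python
def pre_filter(results, user_groups, indexed_acl):
    """Keep results whose indexed ACL shares at least one group with the user."""
    return [doc for doc in results if indexed_acl.get(doc, set()) & user_groups]

def post_filter(results, user_groups, current_acl):
    """Keep results that the source's current ACL lets the user read."""
    return [doc for doc in results if current_acl.get(doc, set()) & user_groups]

# ACLs as captured when the documents were indexed (illustrative data).
indexed_acl = {"doc1": {"hr"}, "doc2": {"finance"}, "doc3": {"hr", "finance"}}
# ACLs at query time: access to doc3 was narrowed after indexing.
current_acl = {"doc1": {"hr"}, "doc2": {"finance"}, "doc3": {"finance"}}

hits = ["doc1", "doc2", "doc3"]
user_groups = {"hr"}

print(pre_filter(hits, user_groups, indexed_acl))   # ['doc1', 'doc3']
print(post_filter(hits, user_groups, current_acl))  # ['doc1']
```

In this example, pre-filtering still returns doc3 because the index has not yet picked up the ACL change, whereas post-filtering reflects the current permissions at the cost of a check against the source for each query.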

When you select the content to crawl on a SharePoint server, the crawler can automatically detect whether the SharePoint server is configured to use claims-based authentication or classic-mode authentication. Both modes are supported. To respect claims-based authentication, you must deploy the provided ESSPSolution.wsp web services on the SharePoint server.

The following tables summarize support for searching SharePoint content, depending on whether security is or is not enabled in the SharePoint crawler configuration.

Table 1. Support for searching SharePoint content when security is enabled

SharePoint 2010 Non-Farm    SharePoint 2010 Farm
Not supported               Supported (note 1)

Table notes:
  1. To support farm-aware crawling and secure search, you must deploy the provided web services on the SharePoint server.

Table 2. Support for searching SharePoint content when security is not enabled

SharePoint 2010 Non-Farm    SharePoint 2010 Farm
Supported                   Supported (note 1)

Table notes:
  1. To support farm-aware crawling, you must deploy the provided web services on the SharePoint server.