Windows file system crawlers

To include documents that are stored in Microsoft Windows file systems in a collection, you can configure a Windows file system crawler.

If you install the crawler server on AIX® or Linux®, you cannot crawl Windows file system sources with the Windows file system crawler (the crawler does not appear in the list of available crawler types). However, you can install an agent server that enables you to crawl remote Windows file systems with the Agent for Windows file systems crawler.

You can use the Windows file system crawler to crawl any number of Windows file systems. When you configure the crawler, you select the local and remote directories and subdirectories that you want to crawl.

The user ID that the crawler uses to access the documents to be crawled must have the following Windows administrator rights:
  • List Folder Contents. This right allows the crawler to list the documents in the folder.
  • Read Permissions. This right allows the crawler to access the access control list (ACL) information for each document.
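These rights correspond roughly to list and read access on the file system. As a rough, portable sketch (a hypothetical helper, not part of the product), a pre-flight check of the crawler account's access might look like this:

```python
import os

def can_crawl(path):
    """Rough pre-flight check of the crawler account's access rights.

    List Folder Contents maps approximately to read+execute access on a
    directory. Read Permissions (reading the ACL itself) has no portable
    os.access equivalent, so this sketch covers only listing and reading.
    """
    if os.path.isdir(path):
        return os.access(path, os.R_OK | os.X_OK)
    return os.access(path, os.R_OK)
```

On Windows, the authoritative check is the effective NTFS permissions for the service account; this sketch only approximates that with generic read/list access.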

Crawling shared network directories

The Windows file system crawler crawls documents according to the read permissions that are specified for the administrator, that is, the account under which the Watson Explorer Content Analytics services run.

You can specify a user ID and password for the directories to be crawled. However, the user ID and password are used only to connect to shared network directories. The crawler crawls files according to the read permissions that are set for this user for the shared network directories, not for local drives.

Connections to network directories are not disconnected until you restart the Watson Explorer Content Analytics service. Because the connection persists, later Windows file system discovery and crawler sessions can access the directory even if they specify an incorrect user ID and password. This access is allowed only for sessions that run under the control of the system. To prevent possible security risks, ensure that authorizations for the administrator's account (under which the Watson Explorer Content Analytics service runs) are set properly.

To avoid connection problems, always specify the same user ID and password for a given network directory. If you specify the wrong user ID and password and then restart the Watson Explorer Content Analytics service, the Windows file system crawler might fail to crawl because it attempts to connect to the directory with the incorrect credentials. The crawl can still succeed if another Windows file system crawler has already established the network connection with the correct user ID and password.
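Because Windows keeps one connection per share for the life of the service, conflicting credentials for the same share are a common source of crawl failures after a restart. The consistency rule above can be sketched as a small helper (hypothetical code, illustrating the rule rather than the product's implementation):

```python
class ShareCredentialStore:
    """Track one credential set per network share.

    Illustrates why each share should always be configured with the
    same user ID and password: registering a second, different
    credential set for a share that is already configured is rejected.
    """

    def __init__(self):
        self._creds = {}

    def register(self, share, user, password):
        existing = self._creds.get(share)
        if existing is not None and existing != (user, password):
            raise ValueError(
                "conflicting credentials for %s: already registered "
                "for user %r" % (share, existing[0]))
        self._creds[share] = (user, password)
```

Registering the same credentials twice is harmless; registering different credentials for the same share raises an error, which is the situation that causes crawl failures in practice.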

Crawler connection credentials

When you create the crawler, you can specify credentials that allow the crawler to connect to the sources to be crawled. You can also configure connection credentials when you specify general security settings for the system. If you use the latter approach, multiple crawlers and other system components can use the same credentials. For example, the search servers can use the credentials when determining whether a user is authorized to access content.

Configuration overview

When you create the crawler, a wizard helps you do these tasks:
  • Specify properties that control how the crawler operates and uses system resources. The crawler properties control how the crawler crawls all subdirectories in the crawl space.
  • Set up a schedule for crawling the file systems.
  • Select subdirectories to crawl.

    You can specify how many levels of subdirectories you want the crawler to crawl. To crawl remote file systems, you also specify a user ID and password that enable the crawler to access data.

  • Specify options for making documents in subdirectories searchable. For example, you can exclude certain types of documents from the crawl space or specify a user ID and password that enable the crawler to access files in a particular subdirectory.
  • Configure document-level security options. If security was enabled when the collection was created, the crawler can associate security data with documents in the index. This data enables applications to enforce access controls based on the stored access control lists or security tokens.

    You can also select an option to validate user credentials at the time a user submits a query. In this case, instead of comparing the user's credentials to indexed security data, the system compares the credentials to current access control lists that are maintained by the original data source.

    To enforce document-level security, you must ensure that user and domain account information is configured correctly on the crawler server.
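The depth setting described above limits how far the crawler descends below each configured directory. A minimal sketch of that behavior (hypothetical code; `crawl_paths` and its depth counting are assumptions, not the product's algorithm):

```python
import os

def crawl_paths(root, max_depth):
    """Yield file paths under root, descending at most max_depth levels
    of subdirectories. Depth 0 crawls only files directly in root."""
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # prune: do not descend any further
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Pruning `dirnames` in place is the standard way to stop `os.walk` from descending, so directories beyond the configured depth are never even listed.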
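The default, index-based enforcement described above amounts to comparing a user's security tokens against the access control data stored with each document. A simplified sketch of that comparison (hypothetical code; the token format and function are assumptions):

```python
def is_authorized(user_tokens, document_acl_tokens):
    """Index-based document-level security, simplified: a user may see
    a document if any of the user's security tokens (user ID, group
    memberships) appears in the ACL tokens stored with the document."""
    return not set(user_tokens).isdisjoint(document_acl_tokens)
```

With query-time validation enabled, this comparison would instead be made against the current access control lists in the original data source, trading query performance for up-to-date enforcement.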

To maximize performance, the Windows file system crawler can detect updates to files and determine whether a file needs to be recrawled without opening it.
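Deciding whether a file changed without opening it is typically done by comparing file system metadata, such as the modification time and size, against values recorded during the previous crawl. A sketch of the general technique (hypothetical code; the product's actual mechanism is not documented here):

```python
import os

def needs_recrawl(path, last_seen):
    """Return True if the file changed since the last crawl, using only
    file system metadata. last_seen maps path -> (mtime_ns, size) and is
    updated in place when a change is detected."""
    st = os.stat(path)
    signature = (st.st_mtime_ns, st.st_size)
    if last_seen.get(path) == signature:
        return False  # metadata unchanged: skip without opening the file
    last_seen[path] = signature
    return True
```

Because `os.stat` reads only directory metadata, this check avoids the cost of opening and reading every file on each crawl.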