IBM Connections crawler - configuration properties

The IBM Connections crawler crawls documents on an IBM Connections server.

Before you create an IBM Connections crawler, you must configure a search administrator user for IBM Connections as a crawler user. For more information, see Configuring a search administrator user for IBM Connections.

The Create crawler: IBM Connections screen is where you enter the configuration parameters for this crawler.

Crawler Properties

Crawler name
The name of the crawler. Alphanumeric characters, hyphens, underscores, and spaces are allowed.
Crawler description
A description of the crawler.
Advanced options
Time to wait between retrieval requests
The time to wait between retrieval requests, expressed in milliseconds.
Maximum number of active crawler threads
The maximum number of active crawler threads.
Maximum number of documents to crawl
The maximum number of documents to crawl.
Maximum document size
The maximum size expressed in kilobytes. The maximum value is 131,071 kilobytes.
Time to wait before a request times out
The timeout value (in seconds) for access to the data source server. If the data source server does not respond within this time, the crawler skips the corresponding document.
When the crawler session is started
Specifies which content to crawl when a crawler session starts.

Data Source Properties

Version
Select the version of the IBM Connections server to crawl from the drop-down list.
Protocol
Select the protocol to access the IBM Connections server from the drop-down list.
Host name
The host name of the IBM Connections server.
Port
The port of the IBM Connections server.
User name
The user name that is used to crawl the IBM Connections server.
Password
The password of the specified user.

Crawl Space Properties

Specify how you want the content of this data source to be made available for searching. To apply changes, restart the crawler. The crawler automatically does a full crawl to apply changes to indexed documents.

Check or uncheck the following items to specify which types of content should be crawled: Activities, Blogs, Bookmark (Dogear), Communities, Events (available only when Communities is selected), Libraries (available only when Communities is selected), Files, Forums, Profiles, Status Update, Wikis.

Use an incremental seed list
The data source server supports an incremental seedlist, which lists the documents that were inserted, updated, or deleted since a specified date. With this option enabled, the crawler can efficiently crawl only the documents that were inserted, updated, or deleted since the last crawl. This option applies only when the crawl mode is normal or quick; when the crawl mode is full, the crawler consumes the full seedlist.
Crawl only public documents
Enable this option to crawl only documents that every user has the right to view.
URL patterns to exclude
The list of URL patterns that are not to be crawled.
Content type filter
Select Included filter or Excluded filter. Based on this selection, the File content types to exclude or File content types to include property is displayed. The crawler gets the content type of each document from the data source server and filters the document with the content type filter. In most cases, image and movie files are not appropriate to index, so it is better to filter them out at the crawler. If the content type filter and the extension filter conflict because one is an included filter and the other is an excluded filter, only the excluded filter takes effect, as illustrated in the sketch after this list of properties.
Extension filter
Select Included filter or Excluded filter. Based on this selection, the File extensions to exclude or File extensions to include property is displayed. The selected file extensions are then excluded or included. If the content type filter and the extension filter conflict because one is an included filter and the other is an excluded filter, only the excluded filter takes effect.
Automatic code page detection
When this property is disabled, the encoding converter detects the code pages of crawled documents automatically. To give the converter a hint, enable this property and select the code page from the list.
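
The following sketch illustrates the precedence rule that applies when the content type filter and the extension filter conflict. It is only an illustration of the documented behavior; the class, method, and variable names are hypothetical and are not part of the product.

import java.util.Set;

public class FilterPrecedenceSketch {

    enum Mode { INCLUDED, EXCLUDED }

    // Returns true when a single filter lets the value through.
    static boolean passes(Mode mode, Set<String> listedValues, String value) {
        boolean listed = listedValues.contains(value);
        return mode == Mode.INCLUDED ? listed : !listed;
    }

    // When one filter is an included filter and the other is an excluded filter,
    // only the excluded filter is effective; otherwise both filters apply.
    static boolean shouldCrawl(Mode contentTypeMode, Set<String> contentTypes, String contentType,
                               Mode extensionMode, Set<String> extensions, String extension) {
        if (contentTypeMode != extensionMode) {
            return contentTypeMode == Mode.EXCLUDED
                    ? passes(contentTypeMode, contentTypes, contentType)
                    : passes(extensionMode, extensions, extension);
        }
        return passes(contentTypeMode, contentTypes, contentType)
                && passes(extensionMode, extensions, extension);
    }

    public static void main(String[] args) {
        // Content type filter excludes images; extension filter includes only "doc".
        // The filters conflict, so only the excluded (content type) filter is effective:
        // a PDF document is crawled even though "pdf" is not an included extension.
        System.out.println(shouldCrawl(
                Mode.EXCLUDED, Set.of("image/jpeg", "image/png"), "application/pdf",
                Mode.INCLUDED, Set.of("doc"), "pdf"));   // prints true
    }
}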

Crawler plug-in

Data source crawler plug-ins are Java™ applications that can change the content or metadata of crawled documents. You can configure a data source crawler plug-in for all non-web crawler types. For more information, see Crawler plug-ins.

Enable the crawler plug-in
Enable this option to use a crawler plug-in.
Plug-in class name
The class name for the crawler plug-in.
Plug-in class path
The JAR file location of the crawler plug-in. The folder that contains the JAR file must be mounted so it is available. For more information, see Providing access to the local filesystem from Watson Explorer oneWEX.
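
As an illustration of what such a plug-in can look like, the following is a minimal sketch. The class name, method names, and signatures are hypothetical placeholders, not the product's plug-in interface; see Crawler plug-ins for the actual interface to implement.

import java.util.Map;

// Hypothetical sketch of a data source crawler plug-in. The class name, method
// names, and signatures are placeholders for illustration only; see the Crawler
// plug-ins topic for the interface that the product actually provides.
public class SampleCrawlerPlugin {

    // Called once when the crawler session starts (hypothetical hook).
    public void init(Map<String, String> pluginConfig) {
        // Read plug-in specific configuration here.
    }

    // Called for each crawled document; a plug-in can change content or metadata.
    public void processDocument(Map<String, Object> metadata, byte[] content) {
        // Example: add a custom metadata field to every crawled document.
        metadata.put("crawled_by_plugin", "true");
    }

    // Called once when the crawler session ends (hypothetical hook).
    public void term() {
        // Release any resources that the plug-in holds.
    }
}

Under those assumptions, the compiled class would be packaged into a JAR file, its class name entered in Plug-in class name, and the JAR file placed in the mounted folder that is specified in Plug-in class path.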