Selecting an Appropriate Base Search Collection

The Watson™ Explorer Engine platform expedites search collection creation by enabling you to create a new collection based on an existing collection. The Watson Explorer Engine platform provides three configurations that are particularly relevant to how applications ingest data:

  • default: This is the default configuration for search collections that will be populated by conventionally crawling a data source. With this configuration, you can browse the crawler's log in the Watson Explorer Engine administration tool by using the Live Status > URLs tab for a search collection, or by using the search-collection-url-status-query API command.

    This configuration is most useful for small collections (fewer than a million URLs) or test collections that benefit from the ability to browse the crawler's log, view statistics associated with the crawl, retrieve cached copies of documents in the search results, and use crawler duplicate filtering. Do not use this configuration when you are trying to optimize performance or minimize the disk space that a collection uses.

    Tip: In collections based on the default configuration, you may want to change the Maximum idle time option, located in the General section of a collection's Configuration > Crawling tab, to a non-zero value. If this value is zero (the default), the crawler exits immediately after it indexes the data that has been sent to it. A positive value keeps the crawler active longer, enabling it to service any additional URLs that are enqueued during that time. A value of -1 prevents the crawler from exiting, so that it is always available to process enqueued items.

    Changing this value improves performance for collections to which items are frequently enqueued, or in which enqueued items must be processed as quickly as possible. The change is optional and is not recommended for collections that are updated infrequently.

  • default-broker-push: This is an advanced configuration for use with the Collection Broker and its associated API calls. The Collection Broker is discussed in detail in the Watson Explorer Engine API Developer's Guide. Use this configuration only for search collections that will be managed by the Collection Broker.

  • default-push: This is the recommended configuration for applications that manually enqueue data to a search collection. Such applications are often referred to as push applications. This configuration is optimized for scalability in the total number of documents, for ingestion throughput, and for a minimal secondary storage footprint.

    The disadvantage of basing your search collections on this configuration is that the crawler does not keep statistics or log records for anything other than errors. This configuration also precludes crawler duplicate filtering and the display or retrieval of cached copies of documents.
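
The guidance above can be summarized as a small decision helper. The following Python sketch is illustrative only and is not part of the Watson Explorer Engine API: the CollectionNeeds class and the two function names are hypothetical, and only the three configuration names and the Maximum idle time values (0, positive, -1) come from this topic. How values below -1 are handled is not described here, so the sketch rejects them as an assumption.

```python
from dataclasses import dataclass


@dataclass
class CollectionNeeds:
    """Hypothetical summary of what an application needs from a collection."""
    managed_by_collection_broker: bool = False
    pushes_data: bool = False                # application enqueues data itself
    needs_crawl_log: bool = False            # browse crawler log and statistics
    needs_cached_copies: bool = False        # cached documents in search results
    needs_duplicate_filtering: bool = False  # crawler duplicate filtering


def choose_base_collection(needs: CollectionNeeds) -> str:
    """Return the name of the base configuration suggested by this topic."""
    # Collections managed by the Collection Broker use the broker configuration.
    if needs.managed_by_collection_broker:
        return "default-broker-push"

    # default-push gives up the crawl log, cached copies, and duplicate
    # filtering, so fall back to default when any of those is required.
    needs_default_only_feature = (
        needs.needs_crawl_log
        or needs.needs_cached_copies
        or needs.needs_duplicate_filtering
    )
    if needs.pushes_data and not needs_default_only_feature:
        return "default-push"

    return "default"


def describe_max_idle_time(value: int) -> str:
    """Describe crawler behavior for a Maximum idle time setting."""
    if value == 0:
        return "exits immediately after indexing the data sent to it"
    if value == -1:
        return "never exits; always available for enqueued items"
    if value > 0:
        return "stays active longer to service newly enqueued URLs"
    # Assumption: values below -1 are not covered by this topic.
    raise ValueError(f"unsupported Maximum idle time value: {value}")


if __name__ == "__main__":
    push_app = CollectionNeeds(pushes_data=True)
    print(choose_base_collection(push_app))
    print(describe_max_idle_time(-1))
```

A push application that still needs cached copies or the crawl log would be steered to default by this helper, which matches the trade-off described for default-push.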