Indexing, Reindexing, and Caching

Crawling, converting, and indexing of any resource happen concurrently. Each conversion process runs in the crawler until the data is converted into VXML, at which point it can be indexed. As discussed in the previous section, the conversion process consists of a number of consecutive steps determined by the current state of each document being converted and the conversion rules that apply to that state.
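The idea that each step is chosen by the document's current state can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Document` type, the two toy converters, and the `RULES` table are stand-ins invented for illustration, not the actual converters or rule format used by Watson Explorer Engine. Here a document's "state" is reduced to its content type, and the loop applies one rule after another until the data reaches application/vxml.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Document(NamedTuple):
    content_type: str
    data: str


def html_to_unnormalized(doc: Document) -> Document:
    # Toy converter (assumption): strip tags and wrap the text in a placeholder element.
    text = re.sub(r"<[^>]+>", "", doc.data).strip()
    return Document("application/vxml-unnormalized", f"<document>{text}</document>")


def normalize(doc: Document) -> Document:
    # Toy converter (assumption): collapse whitespace to stand in for normalization.
    return Document("application/vxml", re.sub(r"\s+", " ", doc.data))


# Hypothetical rule table: the document's current content type selects the next step.
RULES: Dict[str, Callable[[Document], Document]] = {
    "text/html": html_to_unnormalized,
    "application/vxml-unnormalized": normalize,
}


def convert(doc: Document, target: str = "application/vxml") -> List[Document]:
    """Apply rules consecutively until the target content type is reached.

    Returns every state the document passed through, ending with the target.
    """
    states = [doc]
    while doc.content_type != target:
        rule = RULES.get(doc.content_type)
        if rule is None:
            raise ValueError(f"no conversion rule for {doc.content_type}")
        doc = rule(doc)
        states.append(doc)
    return states
```

Calling `convert(Document("text/html", "<p>Hello   world</p>"))` returns three states, one per content type the document passed through, and only the last one would be ready for indexing.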

During the conversion process, a copy of the data that is being converted is cached so that it can be accessed through the Cache link in Watson Explorer Engine search results. The type of data that is cached depends on the content types listed in the Cache content types variable, which you can set for a search collection on the Configuration tab's Crawling sub-tab. By default, this variable lists the available content types (text/html, text/plain, text/xml, application/vxml-unnormalized, and application/vxml). The cached copy is the data in the first of these content types that the conversion process produces.
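The selection of the cached copy can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: `select_cached_copy` is a hypothetical helper, and it reads "first" as the earliest state produced during conversion whose content type appears in the configured list. The text does not say whether the order of the list itself also matters.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Default value of the Cache content types variable, as listed in the text.
DEFAULT_CACHE_CONTENT_TYPES = (
    "text/html",
    "text/plain",
    "text/xml",
    "application/vxml-unnormalized",
    "application/vxml",
)


def select_cached_copy(
    produced: Iterable[Tuple[str, str]],
    cache_content_types: Sequence[str] = DEFAULT_CACHE_CONTENT_TYPES,
) -> Optional[Tuple[str, str]]:
    """Return the first (content_type, data) pair whose type is cacheable.

    `produced` lists the states of one document in the order the
    conversion produced them (an assumption of this sketch).
    """
    allowed = set(cache_content_types)
    for content_type, data in produced:
        if content_type in allowed:
            return content_type, data
    return None  # nothing produced matches the configured types
```

With the default list, a PDF that is converted to HTML and then to VXML would have its HTML form cached. Restricting the list to application/vxml alone would cache the final VXML instead.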

Refreshing a crawl deletes the cached data for any URLs that have changed and re-creates that cached data from the newly retrieved URLs. The process of determining whether a URL has changed is specific to each data source. For example, change detection for data retrieved via HTTP depends on the web server, change detection for files or fileshares depends on timestamps, and so on. No additional processing is done for the data associated with any URLs that are identified as unchanged.
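The refresh behavior can be sketched as a loop over known URLs. All names here are hypothetical: `get_fingerprint` stands in for whatever source-specific check applies (for instance a file timestamp, or a value reported by a web server), and `fetch_and_convert` stands in for retrieving and converting the resource. The sketch only shows the control flow described above: unchanged URLs are skipped entirely, and changed URLs have their cached data deleted and re-created.

```python
from typing import Callable, Dict, Iterable, List


def refresh(
    urls: Iterable[str],
    fingerprints: Dict[str, str],
    cache: Dict[str, str],
    get_fingerprint: Callable[[str], str],
    fetch_and_convert: Callable[[str], str],
) -> List[str]:
    """Refresh cached data for changed URLs and return the URLs that changed.

    `fingerprints` holds the value recorded for each URL at the previous
    crawl; both it and `cache` are updated in place.
    """
    changed = []
    for url in urls:
        current = get_fingerprint(url)
        if fingerprints.get(url) == current:
            continue  # unchanged: no additional processing
        cache.pop(url, None)  # delete the stale cached data
        cache[url] = fetch_and_convert(url)  # re-create it from the new retrieval
        fingerprints[url] = current
        changed.append(url)
    return changed
```

A URL that has no recorded fingerprint is treated as changed, so newly discovered URLs are fetched and cached by the same path.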

Any changes made to converters require a new crawl, because there is no way to guarantee that the information produced by the new conversion process is the same as that produced by the previous conversion process.