Crawling and Indexing Your Search Collection

About this task

The Watson™ Explorer Engine crawls and indexes the documents in a search collection so that the data can be searched quickly and flexibly. When you perform a search against the search collection, Watson Explorer Engine uses its index to identify matching documents, and then returns the title and a relevant portion of the document for each result. You can then click the title to go to the original document, wherever the original data is located.

To begin crawling and indexing the email data in the Enron email archive that we installed in About This Tutorial, complete the following steps:

Procedure

  1. Open your search collection by using the List icon beside the Search Collections entry in the left-hand navigation menu of the Watson Explorer Engine administration tool, or by using the quick jump field (explained in the previous section).
  2. Once the collection displays, select the Overview tab if it is not already selected.
  3. Click start to the right of the Live Status heading.

Results

Crawling and indexing begin!

Note: The Enron email collection that we are using as sample data contains over 500,000 documents, and can take an hour or two, or longer, to crawl and index, depending on the speed, performance, and amount of memory available in the machine on which you are working.

You can monitor the progress of crawling and indexing from the screen where you started the process; the page is automatically updated every 5 seconds. When the number of pending and unprocessed URLs in the Crawling section is 0, the crawler is done and quits. When the number of uncommitted URLs in the Indexing section is 0, indexing is complete, and the indexer also goes into an idle state.
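The completion rule described above can be written down as a small check. The following is only a sketch of that rule in Python, not part of Watson Explorer Engine: the counter names (pending, unprocessed, uncommitted) are hypothetical labels that mirror the headings on the status page, and read_counts is a caller-supplied function that you would have to write yourself to obtain the numbers. The sketch also assumes that polling starts after the crawl has already begun, since all counters could plausibly read 0 before any URLs are enqueued.

```python
import time
from typing import Callable, Mapping, Optional

# Hypothetical counter names. They mirror the labels on the status page
# described above, not any documented Watson Explorer Engine field names.
CRAWL_KEYS = ("pending", "unprocessed")
INDEX_KEYS = ("uncommitted",)


def crawler_done(counts: Mapping[str, int]) -> bool:
    """True when no pending or unprocessed URLs remain."""
    return all(counts[key] == 0 for key in CRAWL_KEYS)


def indexer_idle(counts: Mapping[str, int]) -> bool:
    """True when no uncommitted URLs remain."""
    return all(counts[key] == 0 for key in INDEX_KEYS)


def is_complete(counts: Mapping[str, int]) -> bool:
    """Crawling and indexing are both finished."""
    return crawler_done(counts) and indexer_idle(counts)


def wait_until_complete(
    read_counts: Callable[[], Mapping[str, int]],
    poll_seconds: float = 5.0,  # matches the 5-second page refresh
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> int:
    """Poll read_counts until all counters are 0; return the number of polls."""
    polls = 0
    while True:
        polls += 1
        if is_complete(read_counts()):
            return polls
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("crawl and index did not finish in time")
        sleep(poll_seconds)
```

To use the sketch, pass wait_until_complete a function that returns the three counters as a dictionary. Given the size of the Enron collection, a generous max_polls value (or none at all) is advisable.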

Once crawling and indexing complete, we are almost ready to classify the data in our collection. The remaining steps are to create a project that unites all of the components that we have created into a single Watson Explorer Engine application, and to do the classification itself.

To proceed to the next section, click Creating Your Project.