Defining Resources to Crawl and Index

About this task

After creating a search collection, you must identify the location of the information that you are going to crawl and index. The starting point for a search collection is usually referred to as a seed. This tutorial uses the directory of sample files that you installed in About This Tutorial, which you will crawl as Files. You must therefore be executing this tutorial on the system on which the Watson™ Explorer Engine software is installed.

Tip: If you do not have access to the system where Watson Explorer Engine has been installed, you can still crawl the directory where you installed the sample data. You can crawl it via URLs if you export it from a web server, or as an SMB share if you can export it as a shared network drive on a Microsoft Windows system or via Samba on a Linux system. If you cannot work through this tutorial literally, you will need to make the appropriate selections from the pop-up that displays after selecting Add a new seed from the screen shown in Figure 2.

To identify the location of the files that you want to crawl for your search collection:

Click Add a new seed from the screen shown in Figure 2.

For the purposes of this tutorial, select Files in the pop-up that displays.

Click Add to close the pop-up.

A screen like the one shown in Figure 1 displays.

Figure 1. Defining the Seed for Your Collection

In the Files text area that displays, enter /data/enron/maildir on a Linux system, or c:\data\enron\maildir on a Microsoft Windows system.

Click OK.

Note: If you are following along with this tutorial in a Watson Explorer Engine training class, enter /data/training/maildir80 in the Files text area instead.

The files used in this tutorial are email messages. To improve the classification process, you will want to configure a custom conditional setting that specifies the file type.

Procedure

  1. Select the Crawling tab.
  2. Scroll down.
  3. Click Add a new condition.
  4. Select Custom conditional settings from the list.
  5. Click Add.

    You will want to create a condition that matches a regular expression, such as a string or pattern that you provide.

  6. For this tutorial, select the radio button beside Conditions apply for a.
  7. Select url from the drop-down list.

    The condition should match a regular expression.

  8. From the second drop-down list, select regex.
  9. In the text box, type the following:
    ^file.*$
  10. Scroll down to Retrieval and encodings.
  11. Expand that section.
  12. In the text box by Forced content type, type text/mail.
  13. Click OK.

    Finally, we need to add a binning set to this search collection, so that the classification hierarchy that we generate is displayed beside your search results as a hierarchical list of entries and sub-entries. The term binning set describes sets of results that are grouped together based on the values of certain fields or on logical relationships between their content. Each of these sets is referred to as a bin.

    To add a binning set to your search collection:

  14. Click the Binning tab.
  15. Click Add Top-Level Component.
  16. Select Binning Set from the dialog that displays.
  17. Click Add to add a new binning set to your search collection.

    A screen like the one shown in Figure 2 displays.

    Figure 2. Adding a Binning Set to Your Collection
  18. Enter the XPath expression for the annotation that you defined in your display in Adding Annotations to a Display (which should be $tags if you've been following along with this tutorial).

  19. For the purposes of this tutorial, let's also add the label Taxonomy in the Label field.

    This will make it easy for us to identify the binning set in the display for the final application.

  20. Click OK to save your changes.
  21. Click Add Child Component.
  22. Select Binning Tree from the dialog that displays.
  23. Click Add.
  24. Click OK to save the binning tree without making any modifications to it.
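
The regular expression typed in step 9 can be checked outside the product with a short Python sketch. It only illustrates which strings the pattern ^file.*$ accepts. The sample URLs and the condition_applies helper are hypothetical, and the exact form of the URLs that Watson Explorer Engine assigns to crawled files is an assumption, not something this tutorial states.

```python
import re

# Pattern typed in step 9 of the procedure above.
URL_CONDITION = re.compile(r"^file.*$")

# Hypothetical URLs, for illustration only. The form that the crawler
# actually assigns to files under the seed directory is an assumption.
SAMPLE_URLS = [
    "file:///data/enron/maildir/example-message",
    "http://example.com/index.html",
    "smb://server/share/maildir/example-message",
]


def condition_applies(url: str) -> bool:
    """Return True when the url matches the step 9 pattern."""
    return URL_CONDITION.search(url) is not None


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        # Show, for each sample, whether the conditional setting would apply.
        print(url, condition_applies(url))
```

Under these assumptions, only URLs that begin with the string file satisfy the condition, so the forced content type text/mail from step 12 would be limited to documents crawled as Files and would not affect web or SMB seeds.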

Results

To proceed to the next section, click Customizing the Source for Your Search Collection.