Defining Resources to Crawl and Index
About this task
After creating a search collection, you must identify the location of the information that you want to crawl and index. The starting point for a search collection is usually referred to as a seed. This tutorial uses the directory of sample files that you installed in About This Tutorial. You will crawl that directory as Files, which means that you must run this tutorial on the system on which the Watson™ Explorer Engine software is installed.
To identify the location of the files that you want to crawl for your search collection:
Click Add a new seed from the screen shown in Figure 2.
For the purposes of this tutorial, select Files in the pop-up that displays.
Click Add to close the pop-up.
A screen like the one shown in Figure 1 displays.
In the Files text area that displays, enter /data/enron/maildir on a Linux system, or c:\data\enron\maildir on a Microsoft Windows system.
If you are following along with this tutorial in a Watson Explorer Engine training class, enter /data/training/maildir80 instead.
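Before starting a crawl, it can help to confirm that the seed directory exists on the local system and contains files. The following is a minimal Python sketch of such a check. It is not part of Watson Explorer Engine, and the function name count_seed_files is a hypothetical helper introduced only for illustration.

```python
import os


def count_seed_files(seed_dir):
    """Count the regular files under seed_dir, including subdirectories.

    Hypothetical helper for illustration; not part of the product.
    Raises FileNotFoundError if seed_dir is not an existing directory.
    """
    if not os.path.isdir(seed_dir):
        raise FileNotFoundError(f"Seed directory not found: {seed_dir}")

    total = 0
    # os.walk yields (directory, subdirectories, files) for each level.
    for _root, _dirs, files in os.walk(seed_dir):
        total += len(files)
    return total
```

For example, calling count_seed_files("/data/enron/maildir") on a Linux system returns the number of files that a Files seed for that directory could reach, or raises an error if the path was mistyped.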
The files used in this tutorial are email messages. To improve the classification process, configure a custom conditional setting that specifies the file type.
- Select the Crawling tab.
- Scroll down.
- Click Add a new condition.
- Select Custom conditional settings from the list.
Create a condition that matches a regular expression, that is, a string or pattern that you provide.
- For this tutorial, you will want to check the radio button beside Conditions apply for a
Select url from the drop-down list.
The condition should match a regular expression.
- From the second drop-down list, select regex.
In the text box, type the following:
- Scroll down to Retrieval and encodings.
- Expand that section.
- In the text box by Forced content type, type text/mail.
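The effect of this conditional setting can be pictured as follows: when a crawled URL matches the regular expression, the crawler treats the content as text/mail regardless of how the file would otherwise be detected. The Python sketch below illustrates that idea only. The pattern shown is a hypothetical stand-in and is not the regular expression that this tutorial asks you to type, and forced_content_type is an illustrative function, not a Watson Explorer Engine API.

```python
import re

# Hypothetical stand-in pattern for illustration only; it is NOT the
# regular expression used by the tutorial.
MAIL_URL_PATTERN = re.compile(r"/maildir/")


def forced_content_type(url, pattern=MAIL_URL_PATTERN, forced="text/mail"):
    """Return the forced content type if the URL matches, otherwise None.

    None means "no override": normal content-type detection would apply.
    """
    return forced if pattern.search(url) else None
```

With this stand-in pattern, a URL such as file:///data/enron/maildir/allen-p/inbox/1. yields text/mail, while a URL outside the mail directory yields no override.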
Finally, add a binning set to this search collection so that the classification hierarchy that you generate is displayed beside your search results as a hierarchical list of entries and sub-entries. The term binning set describes sets of results that are grouped together based on the values of certain fields or on logical relationships between their content. Each of these sets is referred to as a bin.
To add a binning set to your search collection:
- Click the Binning tab.
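To make the idea of bins and sub-entries concrete, the following Python sketch groups documents into a tree of bins based on the values of one field. It is a conceptual illustration only and does not reflect how Watson Explorer Engine implements binning. The field name tags, the "/" separator for hierarchy levels, and the function build_bin_tree are all assumptions made for this example.

```python
def build_bin_tree(documents, field="tags", sep="/"):
    """Group documents into a nested tree of bins.

    Each value of `field` is treated as a path such as "Legal/Contracts"
    (an assumed convention). Every node records how many tag values
    passed through it, plus its child bins.
    """
    tree = {}
    for doc in documents:
        for value in doc.get(field, []):
            node = tree
            for part in value.split(sep):
                entry = node.setdefault(part, {"count": 0, "children": {}})
                entry["count"] += 1
                node = entry["children"]
    return tree


# Small illustrative data set (invented for this example).
sample_docs = [
    {"id": 1, "tags": ["Legal/Contracts"]},
    {"id": 2, "tags": ["Legal/Compliance"]},
    {"id": 3, "tags": ["Trading"]},
]
bin_tree = build_bin_tree(sample_docs)
```

Here bin_tree has two top-level bins, Legal and Trading, and the Legal bin has two sub-entries, which mirrors the hierarchical list displayed beside search results.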
- Click Add Top-Level Component.
- Select Binning Set from the dialog that displays.
Click Add to add a new binning set to your search collection.
A screen like the one shown in Figure 2 displays.
Enter the XPath expression for the annotation that you defined in your display in Adding Annotations to a Display (which should be $tags if you've been following along with this tutorial).
For the purposes of this tutorial, let's also add the label Taxonomy in the
This label makes it easy to identify the binning set in the display for the final application.
- Click OK to save your changes.
- Click Add Child Component.
- Select Binning Tree from the dialog that displays.
- Click Add.
- Click OK to save the binning tree without making any modifications to it.
To proceed to the next section, click Customizing the Source for Your Search Collection.