Tutorial

This tutorial guides you through how to create and train the IBM Watson™ Retrieve and Rank service with sample data. The tutorial uses a predefined data set to demonstrate the capabilities of the service, but you can use the same steps to create a service instance that uses your own data.

To complete this tutorial, you use the publicly available test data that is called the Cranfield collection. The collection contains abstracts of aerodynamics journal articles, a set of questions about aerodynamics, and indicators of how relevant an article is to a question.

Before you begin

To complete this tutorial, you need the following pieces:

  • You need a Bluemix account.

  • You need cURL.

    To check whether cURL is installed, enter curl -V at a command prompt:

    • If you see a response that includes a version number, you're all set.

    • If you need to install cURL, see the "Download Wizard" on the download page.

      Make sure to select the SSL-enabled version of cURL.

  • You need Python version 2.

    To check whether Python is installed, enter python --version at a command prompt:

    • If you see a response that includes a version number that starts with 2, you're all set.
    • If you need to install Python, see Downloading Python.

Stage 1: Get your service credentials

Before you can work with a service in Bluemix, you need service credentials. If you already have credentials for the Retrieve and Rank service, you can skip this stage.

To get your service credentials, follow these steps:

  1. Log in to Bluemix.

  2. Create an instance of the service:

    1. In the Bluemix Catalog, select the Retrieve and Rank service.
    2. Under Add Service, type a unique name for the service instance in the Service name field. For example, type rr_tutorial_{username}, and replace {username} with your name. Leave the default values for the other options.
    3. Click Use.
  3. Copy your credentials:

    1. On the left side of the page, click Service Credentials to view your service credentials.
    2. Copy the username and password values from these service credentials. You'll need them in the following stages.

Stage 2: Create a cluster

To use the Retrieve and Rank service, you must create a Solr cluster. A Solr cluster manages your search collections, which you will create later.

  1. Issue the following cURL command to call the "Create Solr cluster" method. Replace {username} and {password} with the service credentials you copied in Stage 1:

    $ curl -X POST -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters" -d ""
    

    The response includes the cluster ID.

  2. Copy the solr_cluster_id. You'll need it later.

The response also includes the cluster availability status. The cluster must be ready before you can use it. Continue with the tutorial; you check the status at the start of Stage 3.
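
If you save the response from the "Create Solr cluster" call, a short script can pull out the two fields this tutorial relies on. The following is a minimal sketch, not part of the service tooling: the function name parse_cluster_response is hypothetical, and the sample body contains made-up placeholder values (including the NOT_AVAILABLE status). Only the field names solr_cluster_id and solr_cluster_status and the READY value come from this tutorial.

```python
import json


def parse_cluster_response(body):
    """Return (solr_cluster_id, solr_cluster_status) from a JSON response body."""
    data = json.loads(body)
    return data["solr_cluster_id"], data["solr_cluster_status"]


# Hypothetical response body; the ID and the status value are placeholders.
sample = '{"solr_cluster_id": "sc_example_id", "solr_cluster_status": "NOT_AVAILABLE"}'

cluster_id, status = parse_cluster_response(sample)
print(cluster_id)
# The cluster can hold a collection only when the status is READY.
print(status == "READY")
```

In practice, you would pass the text that cURL printed instead of the sample string.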

Warning: The free cluster you can create to test the Retrieve and Rank demo application is a single reduced-size unit consisting of a maximum of 50 MB of disk storage. It does not guarantee any specific amount of RAM. The free cluster is meant only to run the demonstration application or small proof-of-concept applications. It cannot be used as a unit in a paid Retrieve and Rank cluster. It is not intended for production use. See Sizing your Retrieve and Rank cluster for more information.

Stage 3: Create a collection and add documents

A Solr collection is a logical index of the data in your documents. A collection is a way to keep data separate in the cloud. In this stage, you create a collection, associate it with a configuration, and upload and index your documents.

  1. First, let's check the status of your cluster. Issue the following command to retrieve the status of the cluster that you created in Stage 2. Replace {username}, {password}, and {solr_cluster_id} with the information you copied earlier:

    $ curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/{solr_cluster_id}"
    

    If the solr_cluster_status is READY, you can create a collection. If it's not ready, check the status every few minutes.

  2. Download the sample cranfield-solr-config.zip configuration set. The Solr configuration identifies how to index the documents so that you can search the important fields. For the tutorial, we edited the default Solr configuration files to work with the Cranfield collection.

  3. Issue these commands to create a collection:

    1. Upload the sample configuration that you downloaded in Step 2. We name the configuration example_config:

      $ curl -X POST -H "Content-Type: application/zip" -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/{solr_cluster_id}/config/example_config" --data-binary @{/path_to_file}/cranfield-solr-config.zip
      
      
    2. Create a collection that is named example_collection and associate it with the example_config configuration:

      $ curl -X POST -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/{solr_cluster_id}/solr/admin/collections" -d "action=CREATE&name=example_collection&collection.configName=example_config"
      

      You now have a Solr cluster that holds a collection and configuration. Your Solr instance is ready for documents.

  4. Add the documents that you will search:

    1. Download the cranfield-data.json file. For this tutorial, we converted the Cranfield collection documents to JSON format.

    2. Issue the following command to upload the cranfield-data.json data to the example_collection collection. Replace {username}, {password}, {solr_cluster_id}, and {/path_to_file} with your information:

      $ curl -X POST -H "Content-Type: application/json" -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/{solr_cluster_id}/solr/example_collection/update" --data-binary @{/path_to_file}/cranfield-data.json
      

    Important: If you use the cranfield-data.json sample file as a model for adding your own data, be sure to include the commit statement at the end of your file.
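
To illustrate the commit statement, here is a minimal sketch that builds an upload file for your own documents. It assumes that the file follows Solr's JSON update command syntax, in which each document is wrapped in an "add" command and the file ends with a "commit" command. The function name build_update_payload and the sample documents are hypothetical; the id and title fields are the ones that the queries in this tutorial return.

```python
import json


def build_update_payload(docs):
    """Serialize docs as Solr JSON update commands, ending with a commit."""
    # The "add" key repeats once per document, so the text is assembled by
    # hand: a Python dict cannot hold the same key more than once.
    commands = ['"add": {"doc": %s}' % json.dumps(doc, sort_keys=True) for doc in docs]
    # Without the final commit, the added documents are not made searchable.
    commands.append('"commit": {}')
    return "{\n" + ",\n".join(commands) + "\n}"


# Placeholder documents; replace them with your own data.
docs = [
    {"id": "1", "title": "first placeholder title"},
    {"id": "2", "title": "second placeholder title"},
]

payload = build_update_payload(docs)
print(payload)
```

You could write the returned string to a file and upload it with the same cURL command that you used for cranfield-data.json.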

Stage 4: Create and train the ranker

To return the most relevant documents at the top of your results, the Retrieve and Rank service uses a machine learning component called a ranker. You send queries to the trained ranker.

The ranker learns from examples before it can rerank results from queries that it hasn't seen before. Collectively, the examples are referred to as "ground truth."

  1. Download the sample cranfield-gt.csv ground truth file.

    The file is a set of questions that a user might ask about the documents. The file provides the example information to train the ranker about questions and relevant answers.

    For each question, there is at least one identifier of an answer (the Doc ID). Each Doc ID is paired with a number that indicates how relevant the answer is to the question. The Doc ID points to the answer in the cranfield-data.json file that you downloaded in Stage 3.

  2. Download the train.py Python script file.

    The script takes care of the details of converting the ground truth file to training data for the ranker. It then uploads the training file and creates and trains the ranker.

    Note: The script requires Python major version 2, not major version 3, to run successfully. See Preparing training data for information.

  3. Navigate to the downloaded train.py script.

  4. Run the train.py script. Replace {username}, {password}, {/path_to_file}, and {solr_cluster_id} with your information:

    $ python ./train.py -u {username}:{password} -i {/path_to_file}/cranfield-gt.csv -c {solr_cluster_id} -x example_collection -n "example_ranker"
    

    The script takes less than 5 minutes to finish. You'll know that the script is finished when a new ranker ID and status are displayed in the script window. For example:

    {
      "name": "example_ranker",
      "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/rankers/6C76AF-ranker-43",
      "ranker_id": "6C76AF-ranker-43",
      "created": "2015-09-21T18:01:57.393Z",
      "status": "Training",
      "status_description": "The ranker instance is in its training phase, not yet ready to accept requests"
    }
    
  5. Copy the ranker_id. You'll need it later.

Continue with the tutorial; you check the status of the ranker training in Stage 6.

Note: When you set up a Retrieve and Rank service instance that uses your own data, you can create training data manually instead of using the train.py script.
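
If you write a ground truth file for your own data, it can help to check each row before you train. The following sketch assumes a row layout of one question followed by alternating Doc ID and relevance values, with the relevance written as an integer. That layout is an assumption drawn from the description in this stage, so confirm it against cranfield-gt.csv. The function name parse_ground_truth_row is hypothetical, and the Doc IDs and relevance numbers in the sample row are made up.

```python
import csv


def parse_ground_truth_row(row):
    """Split a row into its question and a list of (doc_id, relevance) pairs."""
    question, rest = row[0], row[1:]
    # Walk the remaining cells two at a time: Doc ID, then its relevance.
    pairs = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return question, pairs


# Made-up sample row; the Doc IDs and relevance values are placeholders.
lines = ['"what is the basic mechanism of the transonic aileron buzz",101,3,202,1']

for row in csv.reader(lines):
    question, pairs = parse_ground_truth_row(row)
    print(question)
    print(pairs)
```

To check a whole file, pass an open file object to csv.reader instead of the sample list.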

Stage 5: Retrieve some answers

While you're waiting for the ranker to finish training, you can search your documents. This search, which uses the "Retrieve" part of the Retrieve and Rank service, does not use the machine learning ranking features. It's a standard Solr search query.

You can search your collection from your browser or by issuing a cURL command. We'll use the browser:

  • Paste the following URL into your browser. Replace {username}, {password}, and {solr_cluster_id} with your information.

    https://{username}:{password}@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/{solr_cluster_id}/solr/example_collection/select?q=what is the basic mechanism of the transonic aileron buzz&wt=json&fl=id,title
    

    Note: The /select request handler in the URL indicates that this is a standard Solr search query, not a query that uses the Rank portion of the Retrieve and Rank service. Standard Solr search results are not reordered by a trained ranker and might not show the most significant results first. In Stage 6, we use the /fcselect request handler with a trained ranker to obtain results that are ordered (ranked) by significance.

Your query returns the 10 most relevant results from Solr.

If you want to try other queries, look in the cranfield-gt.csv that you downloaded in Stage 4. Copy a question from the first column and paste it as the value of the q parameter in your browser. For example:

&q=can the three-dimensional problem of a transverse potential flow about a body of revolution be reduced to a two-dimensional problem.&wt=json&fl=id,title
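
If you prefer to build the search URL in a script, for example to pass it to cURL, the question text must be URL-encoded. The following is a minimal sketch with a hypothetical function name, build_search_url. The base URL, the select and fcselect handlers, and the q, wt, fl, and ranker_id parameters are the ones used in this tutorial. The cluster and ranker IDs in the example are placeholders, and credentials are left out of the URL on purpose.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

BASE = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"


def build_search_url(cluster_id, collection, question, handler="select", ranker_id=None):
    """Build a search URL; pass handler="fcselect" and a ranker_id to rerank."""
    params = []
    if ranker_id is not None:
        params.append(("ranker_id", ranker_id))
    params += [("q", question), ("wt", "json"), ("fl", "id,title")]
    # urlencode escapes spaces and punctuation in the question text.
    return "%s/solr_clusters/%s/solr/%s/%s?%s" % (
        BASE, cluster_id, collection, handler, urlencode(params))


# "sc_example_id" is a placeholder cluster ID.
url = build_search_url(
    "sc_example_id", "example_collection",
    "what is the basic mechanism of the transonic aileron buzz")
print(url)
```

You can then supply the credentials separately, for example with the -u option of cURL.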

Stage 6: Rerank the results

  1. Check the status of the ranker until you see a status of Available. With this sample data, training takes about 5 minutes, counted from when the script finished in Stage 4.

    Issue the following command to retrieve the status of the ranker. Replace {username} and {password} with your information. Replace {ranker_id} with the information you copied in Stage 4:

    $ curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/rankers/{ranker_id}"
    
  2. Query the ranker to review the reranked results, now that the ranker is trained.

    Here is an example call. It's the same query as in Stage 5, but the /select request handler is changed to /fcselect. Using /fcselect runs the request against the trained ranker instead of against Solr alone as in Stage 5. When you call /fcselect, you must specify the ID of the trained ranker against which to run the request.

    Note: See Reranking results for a list of standard Solr query modifications that are not supported by /fcselect. You can use any standard Solr query modifications with /select.

    Paste the following URL into your browser. Replace {username}, {password}, {solr_cluster_id}, and {ranker_id} with your information:

    https://{username}:{password}@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/{solr_cluster_id}/solr/example_collection/fcselect?ranker_id={ranker_id}&q=what is the basic mechanism of the transonic aileron buzz&wt=json&fl=id,title
    

    Note: The /fcselect request handler in the URL indicates that the query uses the ranker specified by the value of the ranker_id parameter. The ranker returns results in descending order of significance, according to the training data that was provided to it.

    The query returns your reranked search results in JSON format. You can compare these results against the results you got with the simple search in Stage 5.

    After evaluating the reranked search results, you can refine them by repeating Stages 4, 5, and 6. You can also add new documents, as described in Stage 3, to broaden the scope of the search. Repeat the process until you are completely satisfied with the results. This can require multiple iterations of refining and reranking.

    Tip: Try out the demo, which also uses the Cranfield collection, to see the difference between the Solr search results and the reranked results.

  3. Experiment with other queries. Look in the cranfield-gt.csv file that you downloaded in Stage 4. Copy a question from the first column and use it to replace the value of the q parameter in your browser. For example:

    &q=can the three-dimensional problem of a transverse potential flow about a body of revolution be reduced to a two-dimensional problem.
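
The repeated status check in step 1 can be scripted as a simple polling loop. The following sketch is one way to do it under stated assumptions: wait_for_status and its parameters are hypothetical names, and fetch_body stands for whatever code you use to retrieve the JSON status response (for example, a wrapper around the cURL command in step 1). The example uses canned responses instead of calling the service. The status field and the Training and Available values come from this tutorial.

```python
import json
import time


def wait_for_status(fetch_body, field, wanted, attempts=10, delay_seconds=30,
                    sleep=time.sleep):
    """Poll fetch_body() until the JSON field equals wanted; return True on success."""
    for attempt in range(attempts):
        status = json.loads(fetch_body())[field]
        if status == wanted:
            return True
        # Wait before the next check, but not after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Canned responses stand in for real calls to the service.
canned = iter(['{"status": "Training"}', '{"status": "Available"}'])
ready = wait_for_status(lambda: next(canned), "status", "Available",
                        sleep=lambda seconds: None)
print(ready)
```

The same loop could watch a cluster in Stage 3 by checking the solr_cluster_status field for READY.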
    

Stage 7: Clean up

You might want to delete the Solr components and ranker that you created in this tutorial. To clean up, you delete the cluster that you created in Stage 2 and the ranker that you created in Stage 4.

In the following examples, replace {username}, {password}, {solr_cluster_id}, and {ranker_id} with your information:

  1. Delete the cluster:

    $ curl -i -X DELETE -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/{solr_cluster_id}"
    

    When the cluster is deleted, the response includes HTTP status code 200.

  2. Delete the ranker:

    $ curl -X DELETE -u "{username}":"{password}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/rankers/{ranker_id}"
    

    When the ranker is deleted, the response is an empty JSON object.
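
If you script the cleanup, you can verify both outcomes that are described above. This is a minimal sketch with hypothetical function names. It assumes that you captured the output of the cluster command, which uses the -i option so that cURL prints the status line and headers before the body, and the body of the ranker command. The sample strings are made up for illustration.

```python
import json


def status_code_from_curl_i(output):
    """Extract the HTTP status code from the first line of `curl -i` output."""
    status_line = output.splitlines()[0]
    # A status line looks like "HTTP/1.1 200 OK"; the code is the second token.
    return int(status_line.split()[1])


def ranker_deleted(body):
    """Return True when the response body is an empty JSON object."""
    return json.loads(body) == {}


# Made-up captured output for the two delete commands.
cluster_output = "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{}"
ranker_body = "{}"

print(status_code_from_curl_i(cluster_output) == 200)
print(ranker_deleted(ranker_body))
```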

What to do next

You have a basic understanding of how to use the Retrieve and Rank service. Now dive deeper: