Migrating from Watson Retrieve and Rank to Discovery, Part 1

Migrating from Watson Retrieve and Rank to Watson Discovery Service

Content series:

This content is part 1 of 2 in the series: Migrating from Watson Retrieve and Rank to Discovery

Stay tuned for additional content in this series.

This tutorial guides you through the process of migrating a Watson Retrieve and Rank example by creating and training a Watson™ Discovery Service instance (Discovery) with the data from the example. It uses the same data set as the Retrieve and Rank Getting started tutorial, but you can follow the same approach to create a service instance that uses your own data. The steps here cover migrating data used with the Retrieve and Rank API; the next tutorial will cover how to export data from the Retrieve and Rank tooling to Discovery.

The process for users migrating data from Retrieve and Rank to Discovery consists of two main steps:

  1. Migrating the collection data
  2. Migrating the training data

When migrating your collection data, the most important requirement is to keep the document IDs the same. Your training data uses those IDs to reference the ground truth, and if the IDs change in the move from Retrieve and Rank to Discovery, your re-ranking will be completely off (or training might not start at all). Discovery lets you specify the document ID when you upload through the API, so you can avoid this problem by following the guidelines in this document.

The Retrieve and Rank training data is usually stored in a CSV file. In this tutorial, that CSV file is used to upload the sample training data into Discovery. This tutorial assumes Retrieve and Rank was set up as in the Retrieve and Rank Getting started tutorial and follows the Migrate from source path. For other paths, including migration from the Retrieve and Rank tooling, additional tutorials will be available soon.
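As an illustration of keeping IDs stable, Discovery lets you add (or replace) a single document under a specific ID by posting to the .../documents/{document_id} endpoint. The following sketch assumes a sample file named doc1.json and placeholder IDs; you create the environment and collection later in this tutorial:

curl -X POST -u "{username}":"{password}" -F "file=@{path_to_file}/doc1.json;type=application/json" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/documents/{document_id}?version=2017-09-01"

The upload script used later in this tutorial applies the same approach for every document in the Cranfield data set.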

To complete this tutorial, you need the following:

  • A Watson Discovery Service instance that you have already created.
  • Your service credentials. To get them:
    1. When in the Watson Discovery Service instance on Bluemix®, click Service credentials.
    2. Click View credentials under Actions.
    3. Copy the username and password values.
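When you view the credentials, they typically look something like the following; the values shown here are placeholders:

{
  "url": "https://gateway.watsonplatform.net/discovery/api",
  "username": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "password": "xxxxxxxxxxxx"
}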

Adding Cranfield data to Discovery

  1. Create an environment:
    curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{
    "name": "my_environment", "description": "My environment" }' https://gateway.watsonplatform.net/discovery/api/v1/environments?version=2017-09-01

    Copy your environment ID.
  2. Create a collection:
    curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{ "name": "test_collection", "description": "My test collection", "configuration_id": "{configuration_id}", "language_code": "en" }' "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections?version=2017-09-01"

    Copy your collection ID.
    You can use the default configuration to create your collection, as shown in the command above. Alternatively, you can create the environment and collection through the Discovery tooling instead of the API.
  3. Add the documents to be searched.
    • Download the cranfield-data.json file if you haven't already. This file is the source of documents that are used in Retrieve and Rank. The Cranfield collection documents are in JSON format, which Retrieve and Rank accepted and which also works well for Discovery.
      Note: Discovery does not require uploading the Solr schema. This is because Discovery infers the schema from the JSON structure automatically.
    • Download the data upload script. This script uploads the Cranfield JSON into Discovery.
      The script reads through the JSON file and sends each individual JSON document to the Watson Discovery Service using a default configuration in Discovery. The default configuration in Discovery provides settings similar to the default Solr config in Retrieve and Rank. (A minimal sketch of this logic appears at the end of this section.)
    • Issue the following command to upload the cranfield-data.json data to the collection you created. Replace {username}, {password}, {path_to_file}, {environment_id}, and {collection_id} with your information. Note that there are additional options: -d for debug and -v for verbose output from cURL.
      python ./disco-upload.py -u {username}:{password} 
      -i {path_to_file}/cranfield-data.json -e {environment_id} -c {collection_id}

      This upload should take about 20 minutes to complete.
    • Once the upload process has completed, you can check that the documents are there by issuing the following command to view the collection details:
      curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}?version=2017-09-01"

The output will look something like this:

{
  "collection_id" : "01743d74-564a-4a12-b270-b58908526a9c",
  "name" : "Cranfield",
  "configuration_id" : "10324002-3abb-4477-b264-76cf59d00695",
  "language" : "en",
  "status" : "active",
  "description" : null,
  "created" : "2017-09-28T18:38:27.552Z",
  "updated" : "2017-09-28T18:38:27.552Z",
  "document_counts" : {
    "available" : 1400,
    "processing" : 0,
    "failed" : 0
  },
  "disk_usage" : {
    "used_bytes" : 3322085
  },
  "training_status" : {
    "data_updated" : "",
    "total_examples" : 0,
    "sufficient_label_diversity" : false,
    "processing" : false,
    "minimum_examples_added" : false,
    "successfully_trained" : "",
    "available" : false,
    "notices" : 0,
    "minimum_queries_added" : false
  }
}

Look at the document_counts to see how many documents were uploaded successfully. We aren't expecting any document failures with this sample data set, but with other data sets you may see failed document counts. If any documents failed, you can use the notices API to see the error messages.
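For example, the following call (with your values substituted) lists the notices for your collection:

curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/notices?version=2017-09-01"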

The training_status section gives you information about relevancy training. We'll review that section after you upload your training data.
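For reference, here is a minimal sketch of the kind of logic an upload script such as disco-upload.py performs. It is not the downloadable script itself, and it assumes that the source file can be parsed into a list of document objects that each carry an id field:

import json

import requests  # assumed HTTP client; the downloadable script may use a different one

USERNAME = "{username}"
PASSWORD = "{password}"
ENVIRONMENT_ID = "{environment_id}"
COLLECTION_ID = "{collection_id}"
BASE_URL = "https://gateway.watsonplatform.net/discovery/api/v1"

# Assumption: the source JSON can be loaded as a list of document dicts,
# each containing an "id" field plus the document fields (title, body, and so on).
with open("cranfield-data.json") as f:
    documents = json.load(f)

for doc in documents:
    doc_id = doc["id"]  # reuse the Retrieve and Rank document ID unchanged
    url = ("{base}/environments/{env}/collections/{coll}/documents/{doc_id}"
           "?version=2017-09-01").format(base=BASE_URL, env=ENVIRONMENT_ID,
                                         coll=COLLECTION_ID, doc_id=doc_id)
    # Posting to .../documents/{document_id} stores the document under that ID,
    # which keeps the ground-truth references in the training data valid.
    files = {"file": ("doc.json", json.dumps(doc), "application/json")}
    response = requests.post(url, auth=(USERNAME, PASSWORD), files=files)
    response.raise_for_status()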

Adding training data into Discovery

Discovery uses a machine-learning model to re-rank documents. To do so, you need to train a model. Training occurs after you have loaded example queries along with their rated documents. By loading enough examples with enough variance into Discovery, you are teaching it what a "good" document is. In this step, we will use the existing Cranfield "ground truth," which is used in Retrieve and Rank, to train Discovery.

  1. Download the sample Cranfield ground truth CSV file from the Retrieve and Rank tutorial if you haven't already done so. The file is a set of questions that a user might ask about the data, together with the information needed to train a ranker in Retrieve and Rank or to perform relevancy training in Discovery. For each question, there is at least one identifier for an answer (the document ID), and each document ID is paired with a label that indicates how relevant the answer is to the question. The document IDs point to the answers in the cranfield-data.json file that you downloaded in the previous step.
  2. Download the training data upload script. You will use this script to upload the training data into Discovery:
    • The script transforms the CSV file into a set of JSON queries and examples and sends them to the Discovery service using the training data APIs (an example call is shown after this list).
    • Discovery manages training data within the service, so when generating new examples and training queries, they can be stored in Discovery itself rather than as part of a separate CSV file that needs to be maintained.
  3. Execute the training upload script to upload the training data into Discovery. Replace {username}, {password}, {path_to_file}, {environment_id}, {collection_id} with your information. Note that there are additional options: -d for debug and -v for verbose output from cURL.
    python ./disco-train.py -u {username}:{password} 
    -i {path_to_file}/cranfield-gt.csv -e {environment_id} -c {collection_id}

    This may take 2-3 minutes to complete.
  4. Once the data is loaded, you can check the status of training by using the collection details command we saw in the previous section. Watson automatically checks about once per hour to see whether there is any new data; if there is, it begins processing it and turns it into a machine-learning model. When a model is training, you will see "processing": false change to "processing": true in the training_status section. Once the model has been trained, you will see "available": false change to "available": true, and the date for "successfully_trained" will be updated. If there are any errors, you can view them by looking at the notices API as described in the previous section.
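For reference, each row of the ground truth corresponds roughly to one call to the Discovery training data API, which is what a script like disco-train.py issues. A single call might look something like the following; the question text, document IDs, and relevance values here are placeholders rather than actual rows from the Cranfield file:

curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{
  "natural_language_query": "example question text from the first column of the CSV",
  "examples": [
    { "document_id": "184", "relevance": 3 },
    { "document_id": "29", "relevance": 2 }
  ]
}' "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/training_data?version=2017-09-01"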

Searching for documents

Discovery will automatically use a trained model to re-rank search results if available. When an API call is made with natural_language_query instead of query, a check will be made to see if there is a model available. If a model is available, Watson will use that model to re-rank results. This is similar to using the /fcselect endpoint in Retrieve and Rank with a specified ranker_id. Discovery manages the model itself, so you don't need to provide any indicator in the natural_language_query. First, we will do a search over unranked documents, then we will do a search using the trained model:

  1. You can search for documents in your collection by using a cURL command. Perform a query using the query API call to see unranked results. Replace {username}, {password}, {environment_id}, and {collection_id} with your own values. The results returned are unranked and use the default Discovery ranking formulas. You can try other queries by opening the training data CSV file and copying the value of the first column into the query parameter (URL-encoding spaces as %20):
    curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-09-01&query=what%20is%20the%20basic%20mechanism%20of%20the%20transonic%20aileron%20buzz"
  2. Now perform a search that uses the model by setting the natural_language_query parameter. Before you do so, make sure that you have a trained model, as described in the previous section. Paste the following code in your console, replacing {username}, {password}, {environment_id}, and {collection_id} with your values.
    curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-09-01&natural_language_query=what%20is%20the%20basic%20mechanism%20of%20the%20transonic%20aileron%20buzz"

    This command returns re-ranked results using the model you trained earlier. Compare these results with the unranked results you saw earlier, as well as with the results of some of the other searches you tried. You may see some differences from what you see in Retrieve and Rank. This is because some of the techniques used for search have changed to simplify the experience and improve results, but overall the quality of results should be similar.

    After evaluating the re-ranked search results, you can refine them in Discovery by repeating the step of uploading training data with additional training queries and examples, and viewing the search results. You can also add new documents, as described in the first step, to broaden the scope of the search. Similar to Retrieve and Rank, improving results with training is an iterative process.

Next steps

This tutorial showed how to use the same data from a Retrieve and Rank example to create and train a Discovery collection. You can use the scripts and code in this tutorial as examples for how to migrate your own data to Discovery to take advantage of Discovery's latest capabilities and improvements. Check out Part 2 for additional approaches to migrating your data from Watson Retrieve and Rank, as well as best practices for evaluating and improving the relevancy of your search results.

