Enqueueing Data

When enqueueing data directly to a search collection, rather than letting the crawler fetch it via a URL reference, you will need to use the search-collection-enqueue function. (If you are using the Collection Broker, you can also use the collection-broker-enqueue-xml function.)

XML message:

      <SearchCollectionEnqueue xmlns="urn:/velocity/types">
        <collection>COLLECTION</collection>
        <crawl-urls>
        ...
        </crawl-urls>
      </SearchCollectionEnqueue>

In C#:

    SearchCollectionEnqueue sce = new SearchCollectionEnqueue();
    sce.collection = COLLECTION;
    sce.crawlurls = new SearchCollectionEnqueueCrawlurls();
    // Size the array for all of the crawl-urls built across the C#
    // snippets in this example (three are used in total).
    crawlurl[] cus = new crawlurl[3];
    sce.crawlurls.crawlurl = cus;

In Java:

    SearchCollectionEnqueue sce = new SearchCollectionEnqueue();
    sce.setCollection(COLLECTION);
    
    CrawlUrls sceurls = new CrawlUrls();
    sce.setCrawlUrls(sceurls);
    java.util.List<CrawlUrl> lcu = sceurls.getCrawlUrl();
Note: As shown in the previous example, you must fully qualify the Java List object because of a potential name collision with the Watson Explorer Engine list object.

The crawl-urls element can be passed a list or array of crawl-url objects, which allows you to batch your enqueue requests. Batching requests is usually recommended if you are planning to call the service more than 10 times per second, as in the sketch below.
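
For example, here is a minimal Java sketch of batching, using the same binding classes and port object as the examples below (the urlsToEnqueue collection is a hypothetical batch of URL strings):

    SearchCollectionEnqueue sce = new SearchCollectionEnqueue();
    sce.setCollection(COLLECTION);
    CrawlUrls sceurls = new CrawlUrls();
    sce.setCrawlUrls(sceurls);
    java.util.List<CrawlUrl> lcu = sceurls.getCrawlUrl();
    // Add one crawl-url per document to the same request.
    for (String url : urlsToEnqueue) {
        CrawlUrl cu = new CrawlUrl();
        cu.setUrl(url);
        lcu.add(cu);
    }
    // A single service call enqueues the whole batch.
    SearchCollectionEnqueueResponse resp = port.searchCollectionEnqueue(sce);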

Each crawl-url entry can reference a URL for the crawler to fetch, in effectively the same way that the search-collection-enqueue-url function does:

XML message:

      <crawl-urls>
        <crawl-url url="http://vivisimo.com"/>
        ...
      </crawl-urls>

In C#:

    cus[0] = new crawlurl();
    cus[0].url = ENQ_URL;

In Java:

    CrawlUrl cu0 = new CrawlUrl();
    cu0.setUrl(ENQ_URL);
    lcu.add(cu0);

When using the crawl-urls element, it is also possible to add content directly in the call. You must still provide a URL, which acts as a reference for future updates or deletes.

warning: Make sure to set the status property to complete when using URLs for reference only (that is, when you do not want the crawler to fetch any content). If you do not set the status to complete, the crawler will try to fetch the URL, which is unnecessary at best and could even generate an error.

It is possible to pass content in multiple formats. The most commonly used formats are:

XML message:

      <crawl-urls>
      ...
        <crawl-url url="myproto://doc?id=1" status="complete">
          <crawl-data encoding="text" content-type="text/html"><![CDATA[
            <html>
            <head><title>My HTML page title</title></head>
            <body>My body</body>
            </html>
          ]]></crawl-data>
          <crawl-data encoding="xml" content-type="application/vxml">
            <document>
              <content name="field1" weight="3">my first field</content>
              <content name="field2" weight="1">my second field</content>
            </document>
          </crawl-data>
          <crawl-data encoding="base64">YmFkAA==</crawl-data>
        </crawl-url>
      ...
      </crawl-urls>

In C#:

    // First crawl-url: carries three crawl-data entries (text, VXML, and
    // base64, matching the XML message above). Only the VXML entry at
    // index 1 is shown in this excerpt; the text and base64 entries are
    // elided.
    cus[1] = new crawlurl();
    cus[1].url = MY_CRAWL_URL;
    cus[1].status = crawlurlStatus.complete;
    cus[1].crawldata = new crawldata[3];

    cus[1].crawldata[1] = new crawldata();
    cus[1].crawldata[1].contenttype = "application/vxml-unnormalized";
    cus[1].crawldata[1].vxml = new crawldataVxml();
    document[] doc = new document[1];
    cus[1].crawldata[1].vxml.document = doc;
    doc[0] = new document();
    doc[0].url = MY_CRAWL_URL;
    doc[0].vsekey = "my_vse_key_1";
    doc[0].vsekeynormalizedSpecified = true;
    doc[0].content = new content[2];
    doc[0].content[0] = new content();
    doc[0].content[0].name = "field1";
    doc[0].content[0].Value = "My first field";
    doc[0].content[1] = new content();
    doc[0].content[1].name = "field2";
    doc[0].content[1].Value = "My second field";

    // Second crawl-url, identified by a distinct URL.
    cus[0] = new crawlurl();
    cus[0].url = MY_CRAWL_URL + "a";
    cus[0].status = crawlurlStatus.complete;
    cus[0].crawldata = new crawldata[1];

    cus[0].crawldata[0] = new crawldata();
    cus[0].crawldata[0].contenttype = "application/vxml-unnormalized";
    cus[0].crawldata[0].vxml = new crawldataVxml();
    document[] doc2 = new document[1];
    cus[0].crawldata[0].vxml.document = doc2;

    doc2[0] = new document();
    doc2[0].vsekey = "my_vse_key_1";
    doc2[0].vsekeynormalizedSpecified = true;
    doc2[0].content = new content[2];
    doc2[0].content[0] = new content();
    doc2[0].content[0].name = "field1a";
    doc2[0].content[0].Value = "My first field a";
    doc2[0].content[1] = new content();
    doc2[0].content[1].name = "field2a";
    doc2[0].content[1].Value = "My second field a";

    // Third crawl-url.
    cus[2] = new crawlurl();
    cus[2].url = MY_CRAWL_URL + "b";
    cus[2].status = crawlurlStatus.complete;
    cus[2].crawldata = new crawldata[1];

    cus[2].crawldata[0] = new crawldata();
    cus[2].crawldata[0].contenttype = "application/vxml-unnormalized";
    cus[2].crawldata[0].vxml = new crawldataVxml();
    document[] doc3 = new document[1];
    cus[2].crawldata[0].vxml.document = doc3;
    doc3[0] = new document();
    doc3[0].vsekey = "my_vse_key_1";
    doc3[0].vsekeynormalizedSpecified = true;
    doc3[0].content = new content[2];
    doc3[0].content[0] = new content();
    doc3[0].content[0].name = "field1b";
    doc3[0].content[0].Value = "My first field b";
    doc3[0].content[1] = new content();
    doc3[0].content[1].name = "field2b";
    doc3[0].content[1].Value = "My second field b";

    // Submit all three crawl-urls in a single call.
    SearchCollectionEnqueueResponse enqresp = port.SearchCollectionEnqueue(sce);

In Java:

    CrawlUrl cu1 = new CrawlUrl();
    lcu.add(cu1);
    cu1.setUrl(MY_CRAWL_URL);
    cu1.setStatus("complete");
    java.util.List<CrawlData> lcd = cu1.getCrawlData();

    // Text crawl-data carrying raw HTML.
    CrawlData cd0 = new CrawlData();
    lcd.add(cd0);
    cd0.setContentType("text/html");
    cd0.setText("<html><head><title>My HTML page title</title></head><body>My body</body></html>");

    // VXML crawl-data with two named contents.
    CrawlData cd1 = new CrawlData();
    lcd.add(cd1);
    cd1.setContentType("application/vxml");
    Vxml v1 = new Vxml();
    cd1.setVxml(v1);
    java.util.List<Document> ld = v1.getDocument();
    Document d0 = new Document();
    ld.add(d0);
    java.util.List<Content> lc = d0.getContent();
    Content c0 = new Content();
    lc.add(c0);
    c0.setName("field1");
    c0.setValue("My first field");
    Content c1 = new Content();
    lc.add(c1);
    c1.setName("field2");
    c1.setValue("My second field");

    // Base64-encoded binary crawl-data (Base64 here is, for example,
    // org.apache.commons.codec.binary.Base64).
    CrawlData cd2 = new CrawlData();
    lcd.add(cd2);
    byte[] binarydata = { 'b', 'a', 'd', '\0' };
    cd2.setBase64(new String(Base64.encodeBase64(binarydata)));
    cd2.setContentType("text/plain");

    SearchCollectionEnqueueResponse enqresp =
      port.searchCollectionEnqueue(sce);

Watson Explorer Engine does not require a pre-defined schema. Content elements (contents) in the input become searchable in both a fielded and a non-fielded way as soon as they are indexed. Watson Explorer Engine uses the notion of a field at query time and a content at indexing time. By default, the name of a searchable field is mapped to the same name as the content name used at indexing time (see Submitting Structured Queries). All contents are also searchable through the special query field, which is the default search field and is efficiently mapped to all contents.

The mapping for text fields can be customized at query time, by using the field-map parameter, or at the source level. A field can be mapped to multiple content names. The contents searched by the default query field can also be defined at the collection level.

Non-textual fields (like dates or numbers) must be fast indexed so that they can be searched using numerical operations (like comparison or more advanced XPath expressions). This can be conveniently specified at the time of ingestion (to avoid having to modify the collection configuration dynamically when creating new fields). See Filtering and Organizing Query Results for examples of queries using these numeric fields.

Note: Fast indexing the relevant contents is also required if you want to provide structured navigation (also referred to as binning) or sorting within the standard Watson Explorer Engine user interface.

In C#:

    crawlurl[] cus = new crawlurl[2];
    sce.crawlurls = new SearchCollectionEnqueueCrawlurls();
    sce.crawlurls.crawlurl = cus;
    cus[0] = new crawlurl();
    cus[0].enqueuetype = crawlurlEnqueuetype.reenqueued;
    cus[0].status = crawlurlStatus.complete;
    cus[0].crawldata = new crawldata[1];
    cus[0].crawldata[0] = new crawldata();
    cus[0].crawldata[0].vxml = new crawldataVxml();
    cus[0].crawldata[0].contenttype = "application/vxml";
    document[] docs1 = new document[1];
    cus[0].crawldata[0].vxml.document = docs1;
    docs1[0] = new document();
    docs1[0].vsekeynormalizedSpecified = true;
    docs1[0].vsekey = "key-3";
    docs1[0].content = new content[1];
    docs1[0].content[0] = new content();
    docs1[0].content[0].name = "field1";
    docs1[0].content[0].Value = "My field 1";
    // Fast index field1 as a set (enumerated string values).
    docs1[0].content[0].fastindex = fastindextype.set;
    docs1[0].content[0].fastindexSpecified = true;
    
    cus[1] = new crawlurl();
    cus[1].enqueuetype = crawlurlEnqueuetype.reenqueued;
    cus[1].status = crawlurlStatus.complete;
    cus[1].crawldata = new crawldata[1];
    cus[1].crawldata[0] = new crawldata();
    cus[1].crawldata[0].vxml = new crawldataVxml();
    cus[1].crawldata[0].contenttype = "application/vxml";
    document[] docs2 = new document[1];
    cus[1].crawldata[0].vxml.document = docs2;
    docs2[0] = new document();
    docs2[0].vsekeynormalizedSpecified = true;
    docs2[0].vsekey = "key-3";
    docs2[0].content = new content[1];
    docs2[0].content[0] = new content();
    docs2[0].content[0].name = "field2";
    docs2[0].content[0].Value = "12345";
    // Fast index field2 as an integer so that it supports numerical operations.
    docs2[0].content[0].fastindex = fastindextype.@int;
    docs2[0].content[0].fastindexSpecified = true;
    SearchCollectionEnqueueResponse enqresp = port.SearchCollectionEnqueue(sce);

In Java:

    java.util.List<CrawlUrl> lcu = sceurls.getCrawlUrl();
    CrawlUrl cu0 = new CrawlUrl();
    lcu.add(cu0);
    cu0.setStatus("complete");
    java.util.List<CrawlData> lcd0 = cu0.getCrawlData();
    CrawlData cd0 = new CrawlData();
    lcd0.add(cd0);
    cd0.setContentType("application/vxml");
    cd0.setVxml(new CrawlData.Vxml());
    java.util.List<Document> ldocs0 = cd0.getVxml().getDocument();
    Document doc0 = new Document();
    ldocs0.add(doc0);
    doc0.setVseKey("3");
    doc0.setVseKeyNormalized("vse-key-normalized");
    java.util.List<Content> contents0 = doc0.getContent();
    Content c0 = new Content();
    contents0.add(c0);
    c0.setName("field1");
    c0.setValue("My first field");
    c0.setFastIndex(FastIndexType.SET);
    CrawlUrl cu1 = new CrawlUrl();
    lcu.add(cu1);
    cu1.setStatus("complete");
    java.util.List<CrawlData> lcd1 = cu1.getCrawlData();
    CrawlData cd1 = new CrawlData();
    lcd1.add(cd1);
    cd1.setContentType("application/vxml");
    cd1.setVxml(new CrawlData.Vxml());
    java.util.List<Document> ldocs1 = cd1.getVxml().getDocument();
    Document doc1 = new Document();
    ldocs1.add(doc1);
    doc1.setVseKey("3");
    doc1.setVseKeyNormalized("vse-key-normalized");
    java.util.List<Content> contents1 = doc1.getContent();
    Content c1 = new Content();
    contents1.add(c1);
    c1.setName("field2");
    c1.setValue("12345");
    c1.setFastIndex(FastIndexType.INT);
    SearchCollectionEnqueueResponse enqresp = port.searchCollectionEnqueue(sce);

To illustrate that Watson Explorer Engine is completely schema-less, the following examples update a content with a different data type.

In C#:

    cus = new crawlurl[1];
    sce.crawlurls.crawlurl = cus;
    cus[0] = new crawlurl();
    cus[0].enqueuetype = crawlurlEnqueuetype.reenqueued;
    cus[0].status = crawlurlStatus.complete;
    cus[0].crawldata = new crawldata[1];
    cus[0].crawldata[0] = new crawldata();
    cus[0].crawldata[0].vxml = new crawldataVxml();
    cus[0].crawldata[0].contenttype = "application/vxml";
    document[] docs3 = new document[1];
    cus[0].crawldata[0].vxml.document = docs3;
    docs3[0] = new document();
    docs3[0].vsekeynormalizedSpecified = true;
    docs3[0].vsekey = "key-3";
    docs3[0].content = new content[1];
    docs3[0].content[0] = new content();
    docs3[0].content[0].name = "field2";
    docs3[0].content[0].Value = DateTime.Now.ToString("d");
    // field2 was previously fast indexed as an integer; it is now a date.
    docs3[0].content[0].fastindex = fastindextype.date;
    docs3[0].content[0].fastindexSpecified = true;
    SearchCollectionEnqueueResponse enqresp = port.SearchCollectionEnqueue(sce);

In Java:

    CrawlUrl cu3 = new CrawlUrl();
    lcu.add(cu3);
    cu3.setStatus("complete");
    cu3.setEnqueueType("reenqueued");
    java.util.List<CrawlData> lcd1 = cu3.getCrawlData();
    CrawlData cd1 = new CrawlData();
    lcd1.add(cd1);
    cd1.setContentType("application/vxml");
    cd1.setVxml(new CrawlData.Vxml());
    java.util.List<Document> ldocs1 = cd1.getVxml().getDocument();
    Document doc1 = new Document();
    ldocs1.add(doc1);
    doc1.setVseKey("3");
    doc1.setVseKeyNormalized("vse-key-normalized");
    java.util.List<Content> contents1 = doc1.getContent();
    Content c1 = new Content();
    contents1.add(c1);
    c1.setName("field2");
    DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
    c1.setValue(df.format(new java.util.Date()));
    c1.setFastIndex(FastIndexType.DATE);
    SearchCollectionEnqueueResponse enqresp = port.searchCollectionEnqueue(sce);

It is also possible to enqueue multiple content elements with the same name that contain different data. From a text-search standpoint, it is as if their text were concatenated, except that the content boundaries act as phrase breakers (that is, AND searches can span content boundaries but phrase searches cannot). From an XPath standpoint, these contents are treated as nodesets: for example, $price > 10 is true if at least one of the price contents is greater than 10. For sorting purposes, only the first content that occurs is used.

In C#:

    crawlurl[] cus = new crawlurl[1];
    sce.crawlurls.crawlurl = cus;
    cus[0] = new crawlurl();
    cus[0].status = crawlurlStatus.complete;
    cus[0].crawldata = new crawldata[1];
    cus[0].crawldata[0] = new crawldata();
    cus[0].crawldata[0].vxml = new crawldataVxml();
    cus[0].crawldata[0].contenttype = "application/vxml";
    document[] docs1 = new document[1];
    cus[0].crawldata[0].vxml.document = docs1;
    docs1[0] = new document();
    docs1[0].vsekeynormalizedSpecified = true;
    docs1[0].vsekey = "key-2";
    docs1[0].content = new content[3];
    docs1[0].content[0] = new content();
    docs1[0].content[0].name = "emailTo";
    docs1[0].content[0].Value = "sample1@vivisimo.com";
    docs1[0].content[1] = new content();
    docs1[0].content[1].name = "emailTo";
    docs1[0].content[1].Value = "sample2@vivisimo.com";
    docs1[0].content[2] = new content();
    docs1[0].content[2].name = "emailTo";
    docs1[0].content[2].Value = "sample3@vivisimo.com";
    
    SearchCollectionEnqueueResponse enqresp = port.SearchCollectionEnqueue(sce);

In Java:

    CrawlUrls sceurls = new CrawlUrls();
    sce.setCrawlUrls(sceurls);
    java.util.List<CrawlUrl> lcu = sceurls.getCrawlUrl();
    CrawlUrl cu0 = new CrawlUrl();
    lcu.add(cu0);
    cu0.setStatus("complete");
    java.util.List<CrawlData> lcd0 = cu0.getCrawlData();
    CrawlData cd0 = new CrawlData();
    lcd0.add(cd0);
    cd0.setContentType("application/vxml");
    cd0.setVxml(new CrawlData.Vxml());
    java.util.List<Document> ldocs0 = cd0.getVxml().getDocument();
    Document doc0 = new Document();
    ldocs0.add(doc0);
    doc0.setVseKey("2");
    doc0.setVseKeyNormalized("vse-key-normalized");
    java.util.List<Content> contents0 = doc0.getContent();
    Content c0 = new Content();
    contents0.add(c0);
    c0.setName("emailTo");
    c0.setValue("sample1@vivisimo.com");
    Content c1 = new Content();
    contents0.add(c1);
    c1.setName("emailTo");
    c1.setValue("sample2@vivisimo.com");
    Content c2 = new Content();
    contents0.add(c2);
    c2.setName("emailTo");
    c2.setValue("sample3@vivisimo.com");
    
    SearchCollectionEnqueueResponse enqresp = port.searchCollectionEnqueue(sce);

warning: The fast-index type specified at the content level overrides whatever is specified at the collection level.

While it is possible to specify different types for the same content (even in the same document), this is not recommended. Watson Explorer Engine will not generate any error or warning in this case because it can interpret each value separately. For example, if a content is not fast-indexed, it is as if it does not exist with regard to XPath expressions. Similarly, if a content has mixed types, the XPath expressions will independently cast the value to the proper type for each content.
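
As a minimal Java sketch of this behavior, assume a document is enqueued with two price contents, only the first of which is fast indexed (the names and values are illustrative; the binding classes are the same ones used above):

    // Two contents with the same name in one document; only the first is
    // fast indexed. An XPath expression such as $price > 10 only sees the
    // fast-indexed value, while text search still sees both values.
    Content p0 = new Content();
    p0.setName("price");
    p0.setValue("15");
    p0.setFastIndex(FastIndexType.INT);
    contents.add(p0);

    Content p1 = new Content();
    p1.setName("price");
    p1.setValue("8");   // not fast indexed, so invisible to XPath comparisons
    contents.add(p1);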

If you plan to support the creation of custom fields by end users, it is good practice to prefix their names to avoid collisions with pre-defined fields (such as query, num, and so on). Watson Explorer Engine uses the v. prefix for its own hidden fields, and you should use a similar prefix if you cannot control the names of the fields that are created, as sketched below.
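
A minimal Java sketch of such prefixing (the myapp. prefix and the userFieldName and userFieldValue variables are hypothetical; Content is the same binding class used above):

    // Prefix user-defined field names so they cannot collide with
    // pre-defined fields such as query or num.
    Content c = new Content();
    c.setName("myapp." + userFieldName);
    c.setValue(userFieldValue);
    contents.add(c);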

warning: The synchronization property is important. If you want to guarantee that an enqueued URL or content will be processed, even in the case of a system reboot or failure, you should use the to-be-crawled value for this property. Specifying this value guarantees that, if the enqueue call is successful, the request has been committed to disk and will be processed at some point, while still allowing a synchronous response to the enqueue request. If the crawler or indexer services were killed before the request was fully processed, they would reconsider it upon restart.

This feature allows for a more robust and simpler queue design. By handing content over to a collection with the guarantee that it will be processed, an application can synchronously remove items from its queue without the need for asynchronous callbacks.
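
A minimal Java sketch, assuming the generated binding exposes the synchronization property through a setSynchronization setter that follows the same naming pattern as the other setters in these examples:

    SearchCollectionEnqueue sce = new SearchCollectionEnqueue();
    sce.setCollection(COLLECTION);
    // Commit the request to disk before the call returns, so that it
    // survives a crawler or indexer restart.
    sce.setSynchronization("to-be-crawled");
    CrawlUrls sceurls = new CrawlUrls();
    sce.setCrawlUrls(sceurls);
    CrawlUrl cu = new CrawlUrl();
    cu.setUrl(ENQ_URL);
    sceurls.getCrawlUrl().add(cu);
    // A successful response now guarantees that the URL will eventually
    // be processed.
    SearchCollectionEnqueueResponse enqresp = port.searchCollectionEnqueue(sce);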