Transactionally Grouping Enqueue/Delete Operations

In some cases, you may want to group enqueue and/or delete operations so that all of the enqueued URLs are indexed and all of the deletes are performed at one time. This prevents a query from returning a result found in one of the URLs in the group while missing results found in other URLs in that same group. This is referred to as atomic indexing: an atomic operation is composed of other operations, all of which must occur together or must not appear to have occurred at all. Atomicity is a standard property of transactional systems. If your system restarts while an atomic operation is in progress, none of the in-progress indexing or deletes for any component of the atomic operation will be visible until the entire atomic operation has been re-indexed and completes.

Grouping multiple enqueue operations into a single atomic enqueue request is most commonly used when crawling and indexing content that has multiple components, such as email messages with attachments, content management systems that maintain multiple versions of documents, and so on.

Important: In traditional transaction processing systems, an error in any component of an atomic transaction causes the entire transaction to be aborted. That is not the case with atomic indexing, because its focus is on guaranteeing that a set of items is indexed (and deletes are optionally performed) as a single unit. Any errors that occur in the enqueue or delete operations within an atomic indexing operation are noted in the audit log for the associated search collection, but do not cause the atomic indexing operation to abort. The traditional abort-on-error behavior that is normally associated with transactions can be enabled by setting the abort-batch-on-error attribute on the atomic indexing operation.

Watson Explorer Engine's atomic indexing implementation supports both enqueueing URLs for indexing (using crawl-url elements or objects) and deleting indexed URLs (using crawl-delete elements or objects). Elements that are to be processed atomically must be placed inside an index-atomic node, which is itself located within a standard crawl-urls element, as in the following XML example:

<crawl-urls>
  <crawl-url url="http://somewhere.com/file1"/>
  <index-atomic originator="MyApp" enqueue-id="1">
    <crawl-url url="http://somewhere.com/file2"/>
    <crawl-url url="http://somewhere.com/file3"/>
    <crawl-url url="http://somewhere.com/file4"/>
  </index-atomic>
</crawl-urls>

In this example:

the file http://somewhere.com/file1 will be enqueued and can be successfully or unsuccessfully indexed on its own.
the files http://somewhere.com/file2, http://somewhere.com/file3, and http://somewhere.com/file4 will be enqueued as a single atomic operation. All of the data associated with all of these URLs will appear in the index at the same time. If any of these operations fails, the data associated with that crawl-url is not present in the index, and an audit log message indicating the failure of that crawl-url will be logged as part of the audit log entry for that atomic operation.

Using the SOAP API and C#, the code for this same operation would look something like the following:

// Requires: using System.Collections.Generic;
// COLLECTION (the collection name) and port (the SOAP service proxy)
// are assumed to be defined elsewhere in your application.
SearchCollectionEnqueueXml scex =
    new SearchCollectionEnqueueXml();
scex.collection = COLLECTION;
scex.crawlnodes = new SearchCollectionEnqueueXmlCrawlnodes();
scex.crawlnodes.crawlurls = new crawlurls();

List<object> objList = new List<object>();

// Standalone URL, enqueued outside the atomic group.
crawlurl cu = new crawlurl();
cu.url = "http://somewhere.com/file1";
objList.Add(cu);

// Atomic group containing file2, file3, and file4.
indexatomic ia0 = new indexatomic();
ia0.originator = "MyApp";
ia0.enqueueid = "1";
ia0.crawlurl = new crawlurl[3];
ia0.crawlurl[0] = new crawlurl();
ia0.crawlurl[0].url = "http://somewhere.com/file2";
ia0.crawlurl[1] = new crawlurl();
ia0.crawlurl[1].url = "http://somewhere.com/file3";
ia0.crawlurl[2] = new crawlurl();
ia0.crawlurl[2].url = "http://somewhere.com/file4";
objList.Add(ia0);

// Submit the standalone URL and the atomic group in one request.
scex.crawlnodes.crawlurls.Items = objList.ToArray();
SearchCollectionEnqueueResponse enqresp =
    port.SearchCollectionEnqueue(scex);
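
The examples above only enqueue URLs. Because an index-atomic node can also hold crawl-delete elements, the following is a minimal sketch of building a mixed group as XML using the standard System.Xml.Linq classes rather than the SOAP proxy objects. It assumes that crawl-delete identifies its target with a url attribute in the same way as crawl-url, which this section does not show. The AtomicGroupSketch class, the BuildAtomicGroup helper, and the file5 URL are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AtomicGroupSketch
{
    // Hypothetical helper: wraps enqueues and deletes in one index-atomic node.
    public static XElement BuildAtomicGroup(
        string originator,
        string enqueueId,
        IEnumerable<string> urlsToEnqueue,
        IEnumerable<string> urlsToDelete)
    {
        return new XElement("index-atomic",
            new XAttribute("originator", originator),
            new XAttribute("enqueue-id", enqueueId),
            // One crawl-url element per URL to index.
            urlsToEnqueue.Select(u =>
                new XElement("crawl-url", new XAttribute("url", u))),
            // Assumption: crawl-delete takes a url attribute like crawl-url.
            urlsToDelete.Select(u =>
                new XElement("crawl-delete", new XAttribute("url", u))));
    }

    public static void Main()
    {
        XElement group = BuildAtomicGroup(
            "MyApp", "2",
            new[] { "http://somewhere.com/file5" },
            new[] { "http://somewhere.com/file2" });

        // The atomic group must sit inside a standard crawl-urls element.
        XElement crawlUrls = new XElement("crawl-urls", group);
        Console.WriteLine(crawlUrls);
    }
}
```

In this sketch, the new content in file5 would become visible at the same time as file2 disappears from the index, which is the kind of replacement that a mixed atomic group is intended to support.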

As discussed earlier in this section, the standard transaction processing behavior of aborting an atomic operation when an error occurs within it can be enabled by setting the abort-batch-on-error attribute on the index-atomic node. The following example repeats the earlier XML example with this attribute set:

<crawl-urls>
  <crawl-url url="http://somewhere.com/file1"/>
  <index-atomic originator="MyApp" enqueue-id="1"
                abort-batch-on-error="abort-batch-on-error">
    <crawl-url url="http://somewhere.com/file2"/>
    <crawl-url url="http://somewhere.com/file3"/>
    <crawl-url url="http://somewhere.com/file4"/>
  </index-atomic>
</crawl-urls>

In this example:

the file http://somewhere.com/file1 will be enqueued and can be successfully or unsuccessfully indexed on its own.
the files http://somewhere.com/file2, http://somewhere.com/file3, and http://somewhere.com/file4 will be enqueued as a single atomic operation. Every one of these URLs must be crawled and indexed successfully for the atomic operation to complete. If any of these operations fails, all of them fail: any changes already made on behalf of the atomic operation are undone in the index (known as rolling back those actions), so no changes associated with any of those operations will be present in the index for the associated collection.
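
If your application builds or receives the crawl-urls XML as text, the attribute can also be switched on or off programmatically. The following is a minimal sketch using the standard System.Xml.Linq classes; the AbortOnErrorSketch class and the SetAbortOnError helper are hypothetical names, and the attribute value simply mirrors the XML example above.

```csharp
using System;
using System.Xml.Linq;

public static class AbortOnErrorSketch
{
    // Hypothetical helper: turns abort-on-error behavior on or off for
    // every index-atomic node found under a crawl-urls element.
    public static void SetAbortOnError(XElement crawlUrls, bool abort)
    {
        foreach (XElement atomic in crawlUrls.Descendants("index-atomic"))
        {
            // The attribute value repeats the attribute name, as in the XML
            // example above; passing null removes the attribute.
            atomic.SetAttributeValue(
                "abort-batch-on-error",
                abort ? "abort-batch-on-error" : null);
        }
    }

    public static void Main()
    {
        XElement crawlUrls = XElement.Parse(
@"<crawl-urls>
  <crawl-url url=""http://somewhere.com/file1""/>
  <index-atomic originator=""MyApp"" enqueue-id=""1"">
    <crawl-url url=""http://somewhere.com/file2""/>
    <crawl-url url=""http://somewhere.com/file3""/>
    <crawl-url url=""http://somewhere.com/file4""/>
  </index-atomic>
</crawl-urls>");

        // Request rollback of the whole group if any member fails.
        SetAbortOnError(crawlUrls, true);
        Console.WriteLine(crawlUrls);
    }
}
```

The standalone crawl-url for file1 is left untouched, because only index-atomic nodes carry the attribute.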

Watson Explorer Engine's index-atomic element and object log the success or failure of index-atomic operations in the same log (referred to as the audit log) that is used to record this information for standalone enqueue (crawl-url) and delete (crawl-delete) operations. See Logging and Examining Enqueue/Delete Status for detailed information about the audit log.

Warning: If you use the indexed or indexed-no-sync synchronization modes on an enqueue with the partial attribute set, no response is returned until all index-atomic elements that share the same enqueue-id have been indexed. Depending on the amount of data that you are enqueueing atomically and the structure of your application, this can cause enqueue timeouts. See Synchronization Modes for Enqueue Operations for information about available synchronization modes.
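
To make the relationship between the partial attribute and a shared enqueue-id more concrete, the following is a rough sketch of splitting one large atomic group into several index-atomic nodes. It rests on assumptions that this section does not confirm: that the attribute is written as partial="partial" on the index-atomic node, and that the group is completed by a final node that omits it. The PartialBatchSketch class, the SplitIntoPartialGroups helper, the chunk size, and the file5 and file6 URLs are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PartialBatchSketch
{
    // Hypothetical helper: one index-atomic node per chunk of URLs,
    // all sharing the same originator and enqueue-id.
    public static List<XElement> SplitIntoPartialGroups(
        string originator, string enqueueId,
        IReadOnlyList<string> urls, int chunkSize)
    {
        if (chunkSize < 1)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var groups = new List<XElement>();
        for (int start = 0; start < urls.Count; start += chunkSize)
        {
            var atomic = new XElement("index-atomic",
                new XAttribute("originator", originator),
                new XAttribute("enqueue-id", enqueueId),
                urls.Skip(start).Take(chunkSize).Select(u =>
                    new XElement("crawl-url", new XAttribute("url", u))));

            // Assumption: every node except the last is marked partial,
            // and the unmarked final node completes the atomic group.
            bool isLast = start + chunkSize >= urls.Count;
            if (!isLast)
                atomic.SetAttributeValue("partial", "partial");

            groups.Add(atomic);
        }
        return groups;
    }

    public static void Main()
    {
        var urls = new[]
        {
            "http://somewhere.com/file2",
            "http://somewhere.com/file3",
            "http://somewhere.com/file4",
            "http://somewhere.com/file5",
            "http://somewhere.com/file6",
        };

        // Each group would be sent in its own crawl-urls enqueue request.
        foreach (XElement g in SplitIntoPartialGroups("MyApp", "1", urls, 2))
            Console.WriteLine(new XElement("crawl-urls", g));
    }
}
```

Per the warning above, a request sent with the indexed or indexed-no-sync mode would not return until every group sharing the enqueue-id has been indexed, so an application that sends these groups one after another may need a different synchronization mode.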