Optimizing Content

The importance of content to Watson™ Explorer Engine cannot be overstated: your platform cannot deliver the results that your users need unless it has access to the appropriate information, stored in the correct manner. What information you will index (store) and where it will come from are two of the first questions that you will need to answer when deciding how to optimize your content. The answer to the first question is another question: What will a user be looking for? If you ask a user to answer this question, the basic answer will be: "Everything, but just give me the thing I want." Mind-reading is unfortunately not a current Watson Explorer Engine feature; however, you can index information from almost anywhere on your network with Watson Explorer Engine. This lets you store the information and then work on ensuring that what the user is requesting (or thinking about) is delivered.

Tip: It takes time to crawl (collect) and index information, and it may take you multiple iterations to decide on the correct optimizations for your content. When you are building and testing your platform, consider testing your optimizations with smaller subsets of the final data, as this will produce sample results and reduce the lag associated with data collection.

The following table is designed to help you decide which features will be beneficial to you when optimizing content. The left column of the table identifies a result that you may want to achieve, and the right column describes the feature(s) that can be utilized to achieve that goal.

Table 1. Content Optimization Features

If you want to...

Then the feature you need to learn about is:

...add information from a specific application into a Watson Explorer Engine collection

Connectors - Watson Explorer Engine uses connectors to extract information from different applications to store as items in the index that users can search.

Many applications store enterprise data, and they store it in many different ways. To communicate with these applications, Watson Explorer Engine uses applications known as connectors. Some connectors require only simple parameters, while others are more complex in order to meet the requirements of the applications that they connect to.

...split a single source of information into multiple documents that can be queried

...make some parts of a document more important than others

...limit which content in a document is indexed or returnable

Converters - Once you have connected to your data, choosing which data will be stored in the index is a decision that will be guided by what the user will need to search and how the Watson Explorer Engine Platform will be deployed in your network.

Watson Explorer Engine converters are used to process the information from the original application and deliver it as a Watson Explorer Engine document to the search collection. The converters can provide specific processing directions for different parts of an incoming file, even forking into multiple conversion streams as necessary. Each component of the document to be stored in an index (a content element) can carry specific attributes that identify how it will be processed: whether it should be indexed, the relative importance of the content, whether it is available as a refinement, whether it is returnable in a result, and many more. How information is stored in Watson Explorer Engine is a very important part of the platform, and understanding this will make the use of the stored information much simpler.
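The idea of per-element attributes can be illustrated with a minimal Python sketch. This is not the Watson Explorer Engine document schema or API: the class name, the field names (`indexed`, `weight`, `refinement`, `returnable`), and the helper functions are all hypothetical and exist only to show how such flags could separate what is searchable from what is shown in a result.

```python
from dataclasses import dataclass


@dataclass
class ContentElement:
    """Hypothetical model of one content element; not the product's schema."""
    name: str
    text: str
    indexed: bool = True       # should the text be searchable?
    weight: float = 1.0        # relative importance of this content
    refinement: bool = False   # offer the value as a refinement?
    returnable: bool = True    # may the text appear in a result?


def searchable_text(elements):
    """Return (name, text, weight) for every element marked as indexed."""
    return [(e.name, e.text, e.weight) for e in elements if e.indexed]


def result_fields(elements):
    """Return only the fields that may be shown to the user."""
    return {e.name: e.text for e in elements if e.returnable}


# An illustrative document made of four content elements.
document = [
    ContentElement("title", "Quarterly report", weight=3.0),
    ContentElement("body", "Revenue grew in the third quarter."),
    ContentElement("author", "J. Smith", indexed=False, refinement=True),
    ContentElement("internal-notes", "Draft, do not publish.", returnable=False),
]
```

In this sketch the title counts three times as much as the body, the author is available for refinement but is not searched, and the internal notes are searched but never returned.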

...have a lot of information stored in a quickly accessible way

Multi-Installation Deployments - Indexing and searching data takes disk space and processing power. The amount of space required depends on how much of the original data will be stored in the index. When data is stored in Watson Explorer Engine, the result may be a net increase or decrease relative to the original storage size of the data.

Note: This may not make immediate sense, but consider two different file types: an untagged text file and a Microsoft Word document. These two documents may contain the same number of words but differ greatly in size, because of the extra data that is added to a Word document (page layout, formatting, styles, etc.). Watson Explorer Engine also adds data to the raw text that is stored, to identify the content and its use to the user. Typically this results in Watson Explorer Engine using more space than an untagged text file to store the same data, but much less than many metadata-heavy document types.

Very large Watson Explorer Engine Platform Applications can be distributed across multiple installations of Watson Explorer Engine, sharing the load of storage, query processing, and traffic. Distributed indexing has different deployment options: you can store Watson Explorer Engine indices across multiple synchronized Watson Explorer Engine installations (this can be useful when dealing with very large indices), or have multiple copies of the same index replicated periodically or synchronized in real time across several installations (delivering high availability and load-balanced querying). Front-end/Back-end solutions store the indices on dedicated machines, with load-balanced front-end machines receiving each query and distributing it to the appropriate back-end(s). Distributed Indexing and Front-end/Back-end solutions can be implemented simultaneously to provide as much power as the Watson Explorer Engine Platform Application requires.
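The front-end/back-end pattern can be sketched generically in Python. This is an illustration of the fan-out-and-merge idea only, not how Watson Explorer Engine implements it: the shards are plain in-memory dictionaries, the scoring is a simple count of term occurrences, and the function names `search_shard` and `front_end_query` are hypothetical.

```python
def search_shard(shard, term):
    """Score each document in one back-end shard by counting term occurrences."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return hits


def front_end_query(shards, term, limit=3):
    """Send the query to every shard and merge the highest-scoring hits."""
    all_hits = []
    for shard in shards:
        all_hits.extend(search_shard(shard, term))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(all_hits, key=lambda hit: (-hit[0], hit[1]))[:limit]


# Two hypothetical back-ends, each holding part of the index.
shards = [
    {"a1": "index the index", "a2": "crawl the seeds"},
    {"b1": "index size on disk"},
]
results = front_end_query(shards, "index")
```

Here `results` holds the merged hits from both back-ends, with the document that mentions the term most often ranked first. A real deployment would query the back-ends over the network and in parallel; the sequential loop keeps the sketch short.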

...keep current with all new and modified content

Pushing Content and Scheduled Re-crawling or Refreshing - You can add content to an existing collection by pushing the information to the collection or by scheduling a recrawl of the seeds. A recrawl of the seeds will only add information from the defined seeds, whereas pushed content can be added from any source.

...have specific documents that are designed to be highlighted to the user, and can be developed by non-administrative personnel

The Spotlight Manager module for Results Module - Spotlight Manager lets Results Module users add highlighted information to the Watson Explorer Engine Platform Application that is returned to the user based on keyword matches. This lets you add information that may be important but time-sensitive to any set of results. The user-friendly interface of Results Module means that users with minimal technical knowledge (some HTML may be beneficial) can create eye-catching customized documents for specific queries.

...index and return results in more than one language

Multiple Language Support - Watson Explorer Engine supports many language options that can be used to augment other features, including segmentation, language detection, and multiple index streams. For example, the language of a query can be identified using language detection, and the query can then be spell checked using the correct dictionary for that language.

...automatically use the root of words in a query (for example, run would match runs, running, and ran)

Stemming Dictionaries - You can specify different types of stemming to automatically expand user queries and include the stemming variations of the original word(s). Standard stemmers include case and depluralize; Watson Explorer Engine also includes many language stemmers for use in your applications. Dictionaries can be created from existing word lists and from words/phrases collected from search collections in your platform.
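Query expansion with a stemming dictionary can be illustrated with a short Python sketch. It is not the Watson Explorer Engine stemmer: the dictionary contents and the function names `build_variants` and `expand_query` are hypothetical, and the dictionary is a hand-written word list of the kind the paragraph above describes.

```python
# Hypothetical stemming dictionary mapping each word form to its stem.
STEM_DICTIONARY = {
    "run": "run", "runs": "run", "running": "run", "ran": "run",
    "index": "index", "indexes": "index", "indices": "index",
}


def build_variants(dictionary):
    """Group every known word form under its stem."""
    variants = {}
    for word, stem in dictionary.items():
        variants.setdefault(stem, set()).add(word)
    return variants


def expand_query(query, dictionary=STEM_DICTIONARY):
    """Replace each query word with the sorted list of its stemming variations."""
    variants = build_variants(dictionary)
    expanded = []
    for word in query.lower().split():
        stem = dictionary.get(word, word)
        # Words missing from the dictionary are passed through unchanged.
        expanded.append(sorted(variants.get(stem, {word})))
    return expanded


expanded = expand_query("Running indices")
```

Each word of the query becomes a list of alternatives that a search could match, so a query for "running" also finds documents that contain "ran" or "runs". Lowercasing the query first plays the role of a simple case stemmer.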