Glossary

This glossary defines terms that are used in the IBM® Watson Explorer Content Analytics product interfaces and documentation.

For more information about linguistic terms, see the Glossary of linguistic terms from the Summer Institute of Linguistics. For more information about Unicode-related terms, see the Glossary of Unicode terms from the Unicode Consortium.

access control list (ACL)

In computer security, a list associated with an object that identifies all the subjects that can access the object and their access rights.

administrative role

A classification that determines the access that is granted to a user.

analysis engine

See text analysis engine.

analysis results

The information that is produced by annotators. Analysis results are written to a data structure called a common analysis structure. Analysis results produced by the custom text analysis engines (annotators) can be made available for search by inclusion in the index.

annotation

Information about a span of text. For example, an annotation could indicate that a span of text represents a company name. In the Unstructured Information Management Architecture (UIMA), an annotation is a special feature structure.

annotator

A software component that performs specific linguistic analysis tasks and produces and records annotations. An annotator is the analysis logic component in an analysis engine.

base annotators

A set of standard text analysis engines used for default document analysis processing.

Boolean search

A search in which one or more search terms are combined by using operators such as AND, NOT, and OR.
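
As a minimal sketch of how these operators combine terms, the example below models each operator as a set operation over a toy inverted index. The index contents and the docs helper are hypothetical and do not reflect how the product evaluates queries.

```python
# Toy inverted index: term -> set of document IDs (all data here is made up).
index = {
    "crawler": {1, 2, 4},
    "parser": {2, 3},
    "index": {2, 4, 5},
}

def docs(term):
    """Return the set of documents that contain the term (empty if unknown)."""
    return index.get(term, set())

both = docs("crawler") & docs("parser")     # crawler AND parser
either = docs("crawler") | docs("parser")   # crawler OR parser
without = docs("crawler") - docs("index")   # crawler NOT index

print(both, either, without)
```

Here AND maps to set intersection, OR to set union, and NOT to set difference.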

boost class

An object that contains specifications that can influence the relative rank of a document in the search results.

boost word

A word that can influence the relative rank of a document in the search results. During query processing, the importance of a document that contains a boost word might be raised or lowered, depending on a score that is predefined for the word.

category tree

A hierarchy of categories.

certificate

In computer security, a digital document that binds a public key to the identity of the certificate owner, thereby enabling the certificate owner to be authenticated. A certificate is issued by a certificate authority and is digitally signed by that authority.

certificate authority

A trusted third-party organization or company that issues the digital certificates used to create digital signatures and public-private key pairs. The certificate authority guarantees the identity of the individuals who are granted the unique certificate.

character normalization

A process in which the variant forms of a character, such as capitalization and diacritical marks, are reduced to a common form.
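
One possible way to illustrate this reduction uses the Python standard library. The function name normalize_chars is an assumption made for this sketch, and the exact normalization rules that the product applies may differ.

```python
import unicodedata

def normalize_chars(text):
    """Reduce case and diacritic variants of characters to a common form."""
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (the diacritics), keeping the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() removes capitalization differences.
    return stripped.casefold()

print(normalize_chars("Café"))    # cafe
print(normalize_chars("RÉSUMÉ"))  # resume
```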

clitic

A word that syntactically functions separately but is phonetically connected to another word. A clitic can be written as connected or separate from the word it is bound to. Common examples of clitics include the last part of a contraction in English (such as n't in wouldn't or 're in you're).

collection

A set of data sources and options for crawling, parsing, indexing, and searching those data sources.

common analysis structure (CAS)

A structure that stores the content and metadata of a document, and all analysis results that are produced by a text analysis engine. All data exchange during document analysis is handled by using the common analysis structure.

common analysis structure consumer (CAS consumer)

A consumer that does the final processing on the analysis results that are stored in the common analysis structure. For example, a consumer indexes the contents of the common analysis structure in a search engine or it populates a relational database with specific analysis results.

common communication layer (CCL)

The communication infrastructure that unites the various product components (controller, parser, crawler, and index server).

concept extraction

A text analysis function that identifies significant vocabulary items (such as people, places, or products) in text documents and produces a list of those items. See also theme extraction.

correlation

An indication of how relevant a facet value is in documents that match the query conditions. The correlation score measures the uniqueness and frequency of a facet value in some documents as compared to other documents that match the query. A correlation value that is higher than 1.0 represents an anomaly that might require further investigation.

crawl space

A set of sources that match specified patterns (such as Uniform Resource Locators (URLs), database names, file system paths, domain names, and IP addresses) that a crawler reads from to retrieve items for indexing.

crawler

A software program that retrieves documents from data sources and gathers information that can be used to create search indexes.

credential

Detailed information, acquired during authentication, that describes the user, any group associations, and other security-related identity attributes. Credentials can be used to perform a multitude of services, such as authorization, auditing, and delegation. For example, the sign-on information for a user (user ID and password) is a credential that allows the user to access an account.

custom text analysis engine

A text analysis engine that is created by using the Unstructured Information Management Architecture (UIMA) software development kit (SDK) and can be added to the set of standard text analysis engines (also known as base annotators). See also text analysis engine.

data source

Any repository of data from which documents can be retrieved, such as the web, relational and nonrelational databases, and content management systems.

data source type

A grouping of data sources according to the protocol that is used to access the data.

data store

A data structure where documents are kept in their parsed form.

dequeue

To remove items from a queue.

diacritic

A mark indicating a change in the phonetic value of a character or a combination of characters.

discoverer

A function of a crawler that determines which data sources are available for the crawler to retrieve information from.

distinguished name

The name that uniquely identifies an entry in a directory. A distinguished name consists of attribute=value pairs, separated by commas. Also, a set of name-value pairs (such as CN=person's name and C=country or region) that uniquely identifies an entity in a digital certificate.
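
The following sketch shows how a simple distinguished name breaks down into its comma-separated pairs. The sample name and the parse_dn function are hypothetical, and the naive split does not handle escaped commas or multi-valued components that real directory names can contain.

```python
def parse_dn(dn):
    """Split a simple distinguished name into (attribute, value) pairs."""
    pairs = []
    for part in dn.split(","):
        # partition() splits on the first "=" only, so values may contain "=".
        name, _, value = part.strip().partition("=")
        pairs.append((name.strip(), value.strip()))
    return pairs

print(parse_dn("CN=Jane Doe, OU=Sales, O=Example, C=US"))
```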

Document Object Model

A system in which a structured document, such as an XML file, is viewed as a tree of objects that can be programmatically accessed and updated.

Domino® Document Manager cabinet

A Domino Document Manager database that is used to organize documents. Cabinets hold Domino databases.

Domino Document Manager library

A Domino Document Manager database that is the entry point to Domino Document Manager.

Domino Internet Inter-ORB Protocol (DIIOP)

A server task that runs on the server and works with the Domino Object Request Broker to allow communication between Java™ applets that are created with the Notes® Java classes and the Domino server. Browser users and Domino servers use DIIOP to communicate and to exchange object data.

dynamic ranking

A type of ranking in which the terms in the query are analyzed with respect to the documents that are being searched to determine the rank of results. See also text-based scoring. Contrast with static ranking.

dynamic summarization

A type of summarization in which the search terms are highlighted and the search results contain phrases that best represent the concepts of the document that the user is searching for. Contrast with static summarization.

enqueue

To put a message or item in a queue.

escape character

A character that suppresses or selects a special meaning for one or more characters that follow.

facet

A clearly defined property of a subject. Facets for a given subject are mutually exclusive and collectively exhaustive. Faceted classification schemes differ from hierarchical categorization schemes in that more than one facet can be used to find items of interest.

facet value

The combination of a facet with a specific character string, such as a facet named City combined with the string New York.

faceted browsing

A process of browsing information by filtering a set of topics by progressively selecting from only valid values of a faceted classification system, which is a predefined collection of facets.

feature path

A path that is used to access the value of a feature in an Unstructured Information Management Architecture (UIMA) feature structure.

feature structure

The underlying data structure that represents the result of text analysis. A feature structure is an attribute-value structure. Each feature structure is of a type, and every type has a specified set of valid features or attributes, much like a Java class.

federated search

A search capability that enables searches across multiple search services and returns a consolidated list of search results.

federation

The process of combining naming systems so that the aggregate system can process composite names that span the naming systems.

field

An area into which a particular category of data or control information is entered.

fielded search

A query that is restricted to a particular field.

free-form text

Unstructured text consisting of words or sentences.

free text search

A search in which the search term is expressed as free-form text.

frequency

An indication of how many documents in the queried document set contain a given facet value.

full-text index

A data structure that references data items to enable a search to find documents that contain the query terms.

fuzzy search

A search that returns words with spelling that is similar to that of the search term.
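
A common way to measure spelling similarity is edit distance (the number of single-character insertions, deletions, and substitutions needed to turn one word into another). The sketch below assumes that measure. The glossary does not state which similarity measure the product uses, and the function names and vocabulary are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_matches(term, vocabulary, max_distance=1):
    """Return the vocabulary words within max_distance edits of the term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_distance]

print(fuzzy_matches("serch", ["search", "merge", "sketch"]))
```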

gloss

A unit of information that is associated with a Content Analytics Studio dictionary entry, such as the lemma, part of speech, or synonyms.

hybrid search

A combined Boolean search and free text search.

identity management

A set of APIs that control access to secure data and enable users to search a collection without being required to specify a user ID and password for each repository in the collection.

index

See full-text index.

index cache

A buffer that holds data that enables the index to be rebuilt without recrawling documents.

index field

A field that exists only in the index to represent data that is common between multiple input sources. Index fields can help users retrieve documents without needing to be knowledgeable about actual field names.

inflection

A variation in the form of a word to reflect grammatical information, such as gender, tense, number or person. Inflections are typically generated by adding affixes.

information extraction

A type of concept extraction that automatically recognizes significant vocabulary items, such as names, terms, and expressions, in text documents.

IP address

A unique address for a device or logical unit on a network that uses the IP standard.

Java Database Connectivity (JDBC)

An industry standard for database-independent connectivity between the Java platform and a wide range of databases. The JDBC interface provides a call-level API for SQL-based database access.

JavaScript

A web scripting language that is used in browsers and web servers.

JavaServer Pages (JSP)

A server scripting technology that enables Java code to be dynamically embedded within web pages (HTML files) and executed when the page is served, in order to return dynamic content to a client.

Java virtual machine (JVM)

A software implementation of a processor that runs compiled Java code (applets and applications).

Katakana

A character set that consists of symbols that are used in one of the two common Japanese phonetic alphabets, which is used primarily to write foreign words phonetically.

key database file

See key ring.

key ring

In computer security, a file that contains public keys, private keys, trusted roots, and certificates. See also keystore file.

keystore file

A key ring that contains both public keys that are stored as signer certificates and private keys that are stored in personal certificates.

language identification

A search function that determines the language of a document.

lemma

The base form of a word plus inflected forms that share the same part of speech.

lemmatization

A process that determines the lemma for each word form that occurs in text. The lemma of a word encompasses its base form plus inflected forms that share the same part of speech. For example, the lemma for go encompasses go, goes, went, gone, and going. Lemmas for nouns group singular and plural forms (such as calf and calves). Lemmas for adjectives group comparative and superlative forms (such as good, better, and best). Lemmas for pronouns group different grammatical cases of the same pronoun (such as I, me, my, and mine).

lexical affinity

The relationship of search words in a document that are close to each other in meaning. Lexical affinity is used to calculate the relevancy of a result.

lexical analysis

The process by which a sequence of characters is grouped into a series of lexical items, known as tokens, and all available dictionary data is associated with the lexical items. Lexical analysis comprises three separate steps: segmentation, normalization, and annotation.

library

A system object that serves as a directory to other objects. See also Domino Document Manager library.

ligature

Two or more characters that are connected so they appear as one character. For example, ff and ffi are characters that can be presented as ligatures.

Lightweight Directory Access Protocol (LDAP)

An open protocol that uses TCP/IP to provide access to directories that support an X.500 model and that does not incur the resource requirements of the more complex X.500 Directory Access Protocol (DAP). For example, LDAP can be used to locate people, organizations, and other resources in an Internet or intranet directory.

linguistic search

A search type that browses, retrieves, and indexes a document with terms that are reduced to their base form (for example, so that mice is indexed as mouse) or expanded with their base form (as with compound words).

link analysis

A method that is based on the analysis of hyperlinks between documents and used to determine what pages in the collection are important to users.

local federator

A client object created by the search and index APIs that enables users to search a set of heterogeneous collections and obtain a unified set of search results.

Lotus Quickr place

A web venue that is provided by Lotus® Quickr® that enables geographically dispersed participants to collaborate on projects and communicate online in a structured and secure workspace.

Lotus Quickr room

A partitioned area of a Lotus Quickr place that is restricted to authorized members who share a common interest and a need to work collectively.

masking character

A character that is used to represent optional characters at the front, middle, or end of a search term. Masking characters are normally used for finding variations of a term in an index. See also wildcard character.

master administrator

An administrative role that enables a user to administer the entire Watson Explorer Content Analytics system.

MIME type

An Internet standard for identifying the type of object that is being transferred across the Internet.

monitor

A user who has the authority to observe collection-level processes.

newline character

A control character that causes the print or display position to move down one line.

n-gram segmentation

A segmentation method that considers overlapping sequences of a specific number of characters as a single word. See also segmentation. Contrast with Unicode-based white space segmentation.
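
As an illustration, the sketch below produces overlapping character sequences of a fixed length. The function name char_ngrams and the choice of two characters per sequence are assumptions for this example only.

```python
def char_ngrams(text, n=2):
    """Return every overlapping run of n consecutive characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("search"))    # ['se', 'ea', 'ar', 'rc', 'ch']
print(char_ngrams("東京都庁"))  # ['東京', '京都', '都庁']
```

Because the sequences overlap, each character except the first and last appears in two of them, which is useful for text in which words are not separated by white space.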

no-follow directive

A directive in a web page that instructs robots (such as the Web crawler) to not follow links found in that page.

no-index directive

A directive in a web page that instructs robots (such as the Web crawler) to not include the contents of that page in the index.

normalization

The process of replacing surface form representations with their canonical form. This can include case normalization (such as replacing Run with run), grammatical normalization (such as replacing runs with run), and lexicographical normalization (such as replacing Unicode full width characters with Unicode basic form, or removing white spaces from Chinese text).

normalized form

A form of a word or multi-word unit after it has undergone a process of normalization. The normalized form is also known as a lemma or stem.

Notes remote procedure call (NRPC)

A communication mechanism of Lotus Notes® that is used for all Notes-to-Notes communication.

out of vocabulary (OOV) word

A word that is not included in the base Content Analytics Studio dictionary that is used for word recognition.

opaque term

A query term that is not parsed by the linguistic query parser. Instead, opaque terms are identified by their syntax to be implementation-specific, such as specific to the syntax for searching XML documents with an XML query language. Opaque query terms begin with the @ character and the query language identifier. For example, @xmlf2 specifies that the query is to be handled by the XML fragment query language, and @xmlp specifies that the query is to be handled by the XPath query language.

operator

A user who has the authority to observe, start, and stop collection-level processes.

parametric search

A type of search that looks for objects that contain a numeric value or attribute, such as dates, integers, or other numeric data types within a specified range.

parser

A program that interprets documents that are added to the data store. The parser extracts information from the documents and prepares them for indexing, search, and retrieval.

parser driver

A service that feeds the parser service with documents. There is one parser driver for each collection. A collection's parser driver service corresponds to the collection's parser in the administration console.

parser service

The service that handles all document parsing and text analysis processing across document collections. At least one parser service is running at all times.

place

A virtual location that is visible in the portal where individuals and groups meet to collaborate. In a portal, each user has a personal place for private work, and individuals and groups have access to a variety of shared places, which can be either public places or restricted places. See also Lotus Quickr place.

popular ranking

A type of ranking that raises a document's existing ranking based on the document's popularity.

processing engine archive

A .pear zip archive file that includes an Unstructured Information Management Architecture (UIMA) analysis engine and all of the resources required to use it for custom analysis.

proximity search

A text search that returns a result when two search patterns occur within a specified distance from each other.
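
A minimal sketch of this check follows. It assumes that the two search patterns are single words and that distance is counted in tokens. The function name and sample text are hypothetical.

```python
import re

def within_distance(text, first, second, max_distance):
    """Return True if the two words occur within max_distance tokens of each other."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    first_positions = [i for i, token in enumerate(tokens) if token == first.lower()]
    second_positions = [i for i, token in enumerate(tokens) if token == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)

text = "The crawler retrieves documents and the parser prepares them for indexing"
print(within_distance(text, "crawler", "parser", 5))  # True
print(within_distance(text, "crawler", "parser", 4))  # False
```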

proxy server

A server that acts as an intermediary for HTTP web requests that are hosted by an application or a web server. A proxy server acts as a surrogate for the content servers in the enterprise.

query expansion

Adding search terms to a user's search string. For example, the search string phone might be expanded to include the terms telephone, mobile phone, and cellular phone.
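
The phone example can be sketched with a simple lookup table. The EXPANSIONS table and the expand_query function are assumptions for illustration and are not a product API.

```python
# Hypothetical expansion table, using the terms from the example above.
EXPANSIONS = {
    "phone": ["telephone", "mobile phone", "cellular phone"],
}

def expand_query(query):
    """Return the original search string followed by any expansion terms."""
    terms = [query]
    terms.extend(EXPANSIONS.get(query.lower(), []))
    return terms

print(expand_query("phone"))
```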

quick link

An association between a Uniform Resource Identifier (URI) and keywords or phrases.

ranking

The assignment of an integer value to each document in the search results from a query. The order of the documents in the search results is based on the relevance to the query. A higher rank signifies a closer match. See also dynamic ranking and static ranking.

raw data store

A data structure where crawled documents are stored before they are sent to the parser. Crawlers write to the raw data store, and the parser reads from the raw data store. When documents have been parsed, they are removed from the raw data store. Not to be confused with data store.

regular expression annotator

A software component that detects entities or units of information in a text document, such as product numbers, based on regular expressions that describe the exact patterns that are searched for in the document text. If one of the regular expressions matches parts of the document text, the regular expression annotator creates the corresponding annotations that cover the match or part of it. These annotations are then stored either in the index by using an index mapping file, or in a JDBC-capable database by using a database mapping file.
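
The idea can be sketched with the Python re module. The product number pattern (two uppercase letters, a hyphen, and four digits), the annotation fields, and the annotate function are all invented for this example. The actual annotator is configured through the product and is not written this way.

```python
import re

# Hypothetical product number format: two uppercase letters, a hyphen, four digits.
PRODUCT_NUMBER = re.compile(r"\b[A-Z]{2}-\d{4}\b")

def annotate(text):
    """Create one annotation (type plus covered span) for each pattern match."""
    return [
        {
            "type": "ProductNumber",
            "begin": match.start(),
            "end": match.end(),
            "covered_text": match.group(),
        }
        for match in PRODUCT_NUMBER.finditer(text)
    ]

for annotation in annotate("Order AB-1234 and CD-5678"):
    print(annotation)
```

Each annotation records where the match begins and ends, so that later processing can refer back to the span of text it covers.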

remote federator

A server federator that federates a set of searchable objects.

Robots Exclusion Protocol

A protocol that allows website administrators to indicate to visiting robots which parts of their site should not be visited by the robot.

room

A program that allows users to create documents for others to read, respond to comments from others, and review project status and deadlines. Users can also chat with others who are in the same room. See also Lotus Quickr room.

rule-based category

A category that is created by rules that specify which documents are associated with which categories. For example, you can define rules to associate documents that contain or exclude certain words, or that match a Uniform Resource Identifier (URI) pattern, with specific categories.

search application

A program that processes queries, searches the index, returns the search results, and retrieves the source documents.

search cache

A buffer that holds the data and results of previous search requests.

search engine

A program that accepts a search request and returns a list of documents to the user.

search results

A list of documents that match the search request.

Secure Sockets Layer (SSL)

A security protocol that provides communication privacy. With SSL, client/server applications can communicate in a way that is designed to prevent eavesdropping, tampering, and message forgery.

security token

Information about identity and security that is used to authorize access to documents in a collection. Different data source types support different types of security tokens. Examples include user roles, user IDs, group IDs, and other information that can be used to control access to content.

seed list page

In WebSphere Portal, an XML page that contains links to the pages that are available on a portal. Crawlers use the seed list to identify the documents to crawl. The seed list page also contains metadata that is stored with the crawled documents in the index.

segmentation

The division of text into distinct lexical units such as words, phrases, sentences, paragraphs, or lemmas. See also n-gram segmentation and Unicode-based white space segmentation.

semantic search

A type of keyword search that incorporates linguistic and contextual analysis. See also text analysis.

servlet

A Java program that runs on a web server and extends the server's functionality by generating dynamic content in response to web client requests. Servlets are commonly used to connect databases to the web.

shingle

A string of consecutive tokens (words) that are taken from a sentence. For example, from "This is a very short sentence.", the 3-word shingles (or trigrams) are:

This is a
is a very
a very short
very short sentence

Shingles can be used in statistical linguistics. For example, if two different texts have a lot of common shingles, the texts are probably related somehow.
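
The example sentence can be reproduced with a short sketch. The function name shingles and the simple word tokenization are assumptions made for illustration.

```python
import re

def shingles(sentence, n=3):
    """Return all runs of n consecutive words from the sentence."""
    tokens = re.findall(r"\w+", sentence)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = shingles("This is a very short sentence.")
print(first)

# Two texts that share shingles are probably related.
second = shingles("That is a very short story.")
print(set(first) & set(second))
```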

soft error page

A type of web page that provides information about why the requested web page cannot be returned. For example, instead of returning a simple status code, the HTTP server can return a page that explains the status code in detail.

static ranking

A type of ranking in which factors about the documents that are being ranked, such as date, the number of links that point to the document, and so on, augment the rank. Contrast with dynamic ranking.

start Uniform Resource Locator (URL)

The starting point for a crawl.

static summarization

A type of summarization in which the search results contain a specified, stored summary from the document. Contrast with dynamic summarization.

stemming

See word stemming.

stop word

A word that is commonly used, such as the, an, or and, that is ignored by a search application.

stop word removal

The process of removing stop words from the query to ignore common words and return more relevant results.
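
A minimal sketch of this filtering step follows. The stop word list contains only the three example words from the stop word entry, and the function name is hypothetical. A real stop word list is much longer and language specific.

```python
# Stop words taken from the examples in the stop word entry.
STOP_WORDS = {"the", "an", "and"}

def remove_stop_words(query):
    """Drop stop words from a query, keeping the remaining words in order."""
    return [word for word in query.split() if word.lower() not in STOP_WORDS]

print(remove_stop_words("the crawler and an index"))  # ['crawler', 'index']
```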

surface form

The form of a word or multi-word unit as it is found in the unprocessed input text.

summarization

The process of including non-redundant sentences in search results to briefly describe the content of a document. See also dynamic summarization and static summarization.

synonym dictionary

A dictionary that enables users to search for synonyms of their query terms when they search a collection.

taxonomy

A classification of objects into groups based on similarities. A taxonomy organizes data into categories and subcategories. See also category tree.

text analysis

The process of extracting semantics and other information from text to enhance the retrievability of data in a collection. See also semantic search.

text analytics

A form of natural language processing that includes linguistic, statistical, and machine learning techniques for analyzing text and extracting key information for business integration.

text analysis engine

A software component that is responsible for finding and representing context and semantic content in text.

text-based scoring

The process of assigning an integer value to a document that signifies the relevance of the document with respect to the terms in a query. A higher integer value signifies a closer match to the query. See also dynamic ranking.

text extractor

A component that uses document filtering technology based on Oracle Outside In Content Access to identify document formats.

text segmentation

See segmentation.

theme extraction

A type of concept extraction that automatically recognizes significant vocabulary items in text documents to extract the theme or topic of a document. See also concept extraction.

token

A span of text to be considered as a meaningful unit for higher level processing, such as indexing. A token is typically a word, a number, an acronym, or other entity that has syntactic or semantic value.

tokenization

The process of parsing input into tokens.

tokenizer

A text segmentation program that scans text and determines if and when a series of characters can be recognized as a token.

trailing character

A character that holds the last position in a word.

type system

The type system defines the types of objects (feature structures) that may be discovered by a text analysis engine in a document. The type system defines all possible feature structures in terms of types and features. You can define any number of different types in a type system. A type system is domain and application specific.

Unicode-based white space segmentation

A method of tokenization that uses Unicode character properties to distinguish between token and separator characters. See also segmentation. Contrast with n-gram segmentation.
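
One way to picture this method is to classify each character by its Unicode general category. The sketch below assumes that letters and digits are token characters and that everything else is a separator. The rules that the product applies are likely more detailed.

```python
import unicodedata
from itertools import groupby

def is_token_char(ch):
    """Treat Unicode letters (L*) and numbers (N*) as token characters."""
    return unicodedata.category(ch)[0] in ("L", "N")

def segment(text):
    """Group consecutive token characters into tokens and drop the separators."""
    return ["".join(group)
            for is_token, group in groupby(text, key=is_token_char)
            if is_token]

print(segment("Hello, world 42!"))  # ['Hello', 'world', '42']
```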

Uniform Resource Identifier (URI)

A compact string of characters that identifies an abstract or physical resource.

Uniform Resource Locator (URL)

The unique address of an information resource that is accessible in a network such as the Internet. The URL includes the abbreviated name of the protocol used to access the information resource and the information used by the protocol to locate the information resource.

Unstructured Information Management Architecture (UIMA)

An IBM architecture that defines a framework for implementing systems for the analysis of unstructured data.

user agent

An application that browses the web and leaves information about itself at the sites that it visits. For example, the Web crawler is a user agent.

Web crawler

A type of crawler that explores the web by retrieving a web document and following the links within that document.

weighted term search

A query in which certain terms are given more importance.

wildcard character

A character that is used to represent optional characters at the front, middle, or end of a search term.

word stemming

A process of linguistic normalization in which the variant forms of a word are reduced to a common form. For example, words like connections, connective, and connected are reduced to connect.
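
The connect example can be reproduced with a toy suffix-stripping sketch. The suffix list and the stem function are assumptions chosen to fit this one example. Production stemmers use much larger, language-specific rule sets.

```python
# Toy suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ive", "ed", "s")

def stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([stem(w) for w in ["connections", "connective", "connected"]])
```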

XML Path Language (XPath)

A language that is designed to uniquely identify or address parts of source XML data, for use with XML-related technologies, such as XSLT, XQuery, and XML parsers. XPath is a World Wide Web Consortium standard.