source

Necessary information to meta-search a remote resource

Usage

Can be stored in and retrieved from the repository (using the element's name and its name attribute).

Description

Sources are the building blocks of the meta-search. They contain the information necessary to query a remote resource and convert its output into XML meaningful to Watson™ Explorer.

From the XML point of view, sources are much like functions:

  • A source definition (with a name attribute) is not processed immediately
  • A source element does not correspond to any action. When a source reference is processed, it is replaced by the XML contained in its definition.

Most of the "work" is actually done by the submit sub-element. Although sources are strongly associated with remote resources, they can correspond to any kind of action (that is, they can contain any XML).

There are a few subtle differences between sources and functions:

  • Source references are made through the add-sources element.
  • When add-sources is resolved, it can add constraints to the submit elements contained in the sources it references. For example, when its num attribute is specified, Watson Explorer sets the num attribute of each submit so that the specified number of documents is retrieved in total.
  • Source references, unless tagged as templates, are not transparently resolved. They generate an added-source element, which collects the result of processing what is underneath the source element and would otherwise have appeared at the top level of the output.
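
To make the second difference concrete, the following is a minimal sketch of one add-sources call that references two sources with a shared num. The source names (siteA, siteB) and the form URLs are hypothetical. How the 100 documents are split between the two sources is decided by Watson Explorer and is not shown here.

```xml
<!-- Two hypothetical source definitions; nothing is processed at this point. -->
<source name="siteA">
  <submit>
    <form action="http://a.example.com/search">
      <input name="q" field="query"/>
    </form>
  </submit>
</source>
<source name="siteB">
  <submit>
    <form action="http://b.example.com/search">
      <input name="q" field="query"/>
    </form>
  </submit>
</source>
<!-- The reference: both sources are resolved here, and num constrains
     their submit elements so that 100 documents are requested in total. -->
<add-sources names="siteA siteB" num="100"/>
```

Because neither source is a template, each reference would be expected to produce its own added-source element in the output.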

The fact that sources are mere functions and can be used to process any kind of XML is especially useful in the following situations:

  • Sources can have specific knowledge, variables and options associated with them by just adding the corresponding elements underneath them.
  • Complex login operations can be done by adding parse elements inside submit elements. Complex logout operations can be done by adding parse elements outside submit elements. For example code covering both, see the second set of input and output examples.
  • Use source elements to create bundles of sources (see example 3). Bundles of sources can be used recursively and are used by Watson Explorer in order to organize the sources into a hierarchy.

    A given source, if it is not a template, cannot be added more than once in a single add-sources call. It is therefore possible, and even recommended, to create overlapping bundles and search them simultaneously.

  • It is possible, and advised, to pass your own information through to the user for a given source. Unknown elements and attributes are collected under the corresponding added-source element. This is especially useful for collecting information while parsing the search results output.
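
The point about overlapping bundles can be sketched as follows. The bundle name News and the source name Reuters are hypothetical, and the definitions of the individual sources are omitted. Web, Google and AltaVista are the names used in the examples on this page.

```xml
<!-- Two bundles that both reference the Google source. -->
<source name="Web">
  <add-sources names="Google"/>
  <add-sources names="AltaVista"/>
</source>
<source name="News">
  <add-sources names="Google"/>
  <add-sources names="Reuters"/>
</source>
<!-- Searching both bundles at once: because Google is not a template,
     the rule above implies it is added only once, although both
     bundles reference it. -->
<add-sources names="Web News"/>
```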

Sources can be flagged as being templates, in which case, unlike regular sources:

  • They can be loaded more than once in a single add-sources call.
  • They do not generate an added-source element. This makes them conveniently transparent to any post-processing (the XSL rendering in particular). For this reason they should never be called directly, but rather through another, non-template source.

As with functions, it is possible to pass variables when adding a source. Unlike functions, sources have two types of variables:

  • Template variables: similar to the arguments of a function, they should only be used for sources flagged as templates.
  • User variables: these variables can be defined globally using a with element with an in-source attribute. They are usually defined in a user profile and specified by the end user (giving Watson Explorer the ability to act as a username/password repository). Even when required, user variables do not generate errors when they are not specified; they generate a warning, and the add-sources call is discarded with its status set to skipped-missing-variables.

Attributes

  • async (Boolean default: true) - Asynchronous processing. For parse tags, determines whether the request is enqueued (true) or processed before its next sibling (false). For other elements, this attribute only makes a difference when they contain asynchronous requests that need to be processed before the element itself is processed. In that case, when false, the element's next sibling is processed only after the current element; when true, Watson Explorer does not wait for the current element to be processed before processing its next sibling.
  • elt-id (Integer) - Usage: Internal
  • max-elt-id (Integer) - Usage: Internal
  • execute-acl (Text)
  • process (Text) - An XPath determining which of the attributes and/or children will be processed. Currently only "", "*", "@*" and "*|@*" are supported.
  • internal (Text)
  • overrides (Text)
  • overrides-status (Any of: identical, merge)
  • no-override (May only be: no-override)
  • modified (Integer)
  • modified-by (Text)
  • do-not-delete (May only be: do-not-delete) - Do not allow deletion of the element in the repository from the admin interface. Usage: Internal
  • read-only (May only be: read-only) - Do not allow modification of the element in the repository from the admin interface. Usage: Internal
  • products (List of: Any of: all, vivisimo, velocity, discovery, clustermed, clustergoogle, life-sciences, japanese, chinese, mobile, ius, admin, admin-full) - Usage: Internal
  • num (Integer) - The total number of documents to be retrieved across all the sources referenced. When this is specified, Watson Explorer tries to distribute the number of documents requested among the queried sources that support the query syntax, taking into account limitations such as max or min. The over-request multiplier is eventually applied to this number.
  • num-per-source (Integer default: 50) - The number of documents to be retrieved for each source referenced. num takes precedence over this attribute. Note that the default only applies when the element is not the descendant of another add-sources element.
  • over-request (Decimal number default: 1) - A number by which to multiply num in order to compensate for the loss of results due to duplicates. This number is only applied when more than one source is queried.
  • weight (Decimal number) - Weight multiplier for the documents retrieved. Allows you to give more weight to certain sources.
  • no-fetch (May only be: no-fetch) - If this flag is set, the added source will not enqueue any request (this is useful to find out the hierarchy of sources without requesting anything, e.g., in the advanced form).
  • sort (May only be: sort) - Force the documents to be loaded in sorted order across the different sources.
  • max-passes (Integer default: 1) - Allows going back to each source to get more documents.
  • aggregate (May only be: aggregate) - If specified, the source will appear as a single source to its parents.
  • transparent (May only be: transparent) - If this flag is set, the added source will not generate an added-source element.
  • maintainers (Text) - Set of users to email when testing of this source fails.
  • test-strictly (May only be: test-strictly) - Source testing should fail on this source when it contains no tests. This is enabled by default on newly-created sources, but was not in older versions of Watson Explorer.
  • categories (Text) - Usage: This functionality is deprecated
  • type (Any of: bundle, ref, key-match, vse, normal)
  • name (NMToken) - Usage: Must be specified once all the forced-attribute child elements have been processed
  • logo (Text) - The URL of an image associated with the source.
  • status (Any of: ignore, disabled, broken) -
    • ignore: Source is disabled and cannot be queried.
    • disabled: Source is disabled and cannot be queried.
    • broken: Source is disabled and cannot be queried because its definition needs to be fixed (could be a bad parser, a bad form etc).
  • template (May only be: template) - Source is a template. By default, when querying multiple sources at the same time (see add-sources), the same source cannot be queried twice. Turn on this flag if you want to waive this restriction for this specific source. This is useful when a source is never used as-is but corresponds to multiple real sources.
  • display-name (Text) - Name displayed to the user.
  • source-type (Text) - Informal label describing the type of information returned by this source.
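
As an illustration of how some of these attributes combine, here is a hedged sketch. The source name docs, the URLs and the attribute values are invented for the example, google refers to the source defined in the first example below, and placing over-request and no-fetch on the add-sources reference is an assumption based on the descriptions above.

```xml
<!-- Hypothetical source with descriptive attributes. -->
<source name="docs" display-name="Product Documentation"
        source-type="documentation" logo="http://example.com/docs.png">
  <submit>
    <form action="http://example.com/search">
      <input name="q" field="query"/>
    </form>
  </submit>
</source>
<!-- num="100" multiplied by over-request="1.5" would request 150 documents
     in total; the multiplier applies only when more than one source is queried. -->
<add-sources names="docs google" num="100" over-request="1.5"/>
<!-- no-fetch: resolve the source hierarchy without enqueuing any request. -->
<add-sources names="docs" no-fetch="no-fetch"/>
```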

Children

  • Use these in the listed order. The sequence may not repeat.
    • prototype: (At most 1) - Prototype of a function or source
    • add-sources: (At most 1) - Process the sources referenced
    • submit: (At most 1) - Submits the current structured query using the underlying form(s) (query to CGI conversion step)
    • tests: (At most 1) - List of tests for a source
    • help: (At most 1) - Contains XML which gives a description of a source and some useful information about using the source
    • description: (At most 1) - A (possibly long) description associated with a prototype or a declare in the admin interface.
    • parser: (Zero or more) - Parser definition
    • Choose any number of these in any order.
      • stopword: (Exactly 1) - Clustering stopwords are words that are not interesting or at least less interesting in the input data.
      • stem: (Exactly 1) - Override the stemming of a specific word
      • redisplay: (Exactly 1) - Hard code the output of a word in the cluster labels
      • rephrase: (Exactly 1) - Conceptually like a search-and-replace on the clustered input
      • evoke: (Exactly 1) - Input text triggers the addition of new data for clustering
      • option: (Zero or more) - Specify Watson Explorer options
      • tag: (Zero or more) - Customizes the handling of HTML tags
      • field: (Zero or more) - Query field definition

Examples

Input Example:

  <source name="google">
    <submit forms="google">
      <form name="google" action="http://google.com/search">
        <input name="q" field="query"/>
        <input name="s" field="start"/>
        <input name="num" field="per" max="100"/>
      </form>
    </submit>
  </source>
  <query>
    <operator logic="and">
      <term str="test" field="query"/>
    </operator>
  </query>
  <field name="query" record="record"/>
  <add-sources names="google" num="200"/>

Output Example:

  <meta query=" test "/>
  <source name="google">
    <submit forms="google">
      <form name="google" action="http://google.com/search">
        <input name="q" field="query"/>
        <input name="s" field="start"/>
        <input name="num" field="per" max="100"/>
      </form>
    </submit>
  </source>
  <query>
    <operator logic="and">
      <term str="test" field="query" processing="strict"/>
    </operator>
  </query>
  <field name="query" record="record"/>
  <added-source name="google" num="200" status="queried" requested="200">
    <submit status="translated" source="google" max="200" num="200" last-rank="200" last-page="2">
      <form name="google" action="http://google.com/search" normalized="normalized" status="trans-succeeded">
        <input name="q" field="query" logic="and" delimiters=" &#13;&#10;&#9;&#x3000;" position="0" value="test"/>
        <input name="s" field="start" logic="and" delimiters=" &#13;&#10;&#9;&#x3000;" position="1"/>
        <input name="num" field="per" max="100" logic="and" delimiters=" &#13;&#10;&#9;&#x3000;" position="2"/>
      </form>
      <form name="google" action="http://google.com/search" status="resolved">
        <input name="q" field="query"/>
        <input name="s" field="start"/>
        <input name="num" field="per" max="100"/>
      </form>
    </submit>
    <scope max="200" orig-tag="submit">
      <parse url="http://google.com/search?q=test&amp;s=0&amp;num=100" source="google" per="100" page="0" start="0" parser="#vxml#" ref="0"/>
      <parse url="http://google.com/search?q=test&amp;s=100&amp;num=100" source="google" per="100" page="1" start="100" parser="#vxml#" ref="1"/>
    </scope>
  </added-source>

Source with login and logout (remember that child elements are processed BEFORE their parents).

Input Example:

  <_comment_>variables with source scope which could be user-defined</_comment_>
  <with name="username" in-source="Google" value="johnq"/>
  <with name="password" in-source="Google" value="xxxxxx"/>
  <query>
    <term str="test" field="query"/>
  </query>
  <source name="Google">
    <prototype>
      <declare name="username" required="required"/>
      <declare name="password" required="required"/>
    </prototype>
    <_xml_>
      <a ref="schema.x.group.kb">source specific knowledge and options</a>
    </_xml_>
    <parse url="http://mydomain.com/logout">
      <_comment_>logout (with void parser)</_comment_>
      <parser name="test1"/>
      <submit>
        <parse url="http://mydomain.com/login">
          <_comment_>login with username/password</_comment_>
          <parse-param name="username">
            <value-of-var name="username"/>
          </parse-param>
          <parse-param name="password">
            <value-of-var name="password"/>
          </parse-param>
        </parse>
        <form>
          <input name="q" field="query"/>
        </form>
        <parser name="test2">
          <_xml_>
            parser definition
          </_xml_>
        </parser>
      </submit>
    </parse>
  </source>
  <add-sources names="Google" num="100"/>

Output Example:

  <_comment_>variables with source scope which could be user-defined</_comment_>
  <query>
    <term str="test" field="query"/>
  </query>
  <source name="Google">
    <prototype>
      <declare name="username" required="required"/>
      <declare name="password" required="required"/>
    </prototype>
    <_xml_>
      <a ref="schema.x.group.kb">source specific knowledge and options</a>
    </_xml_>
    <parse url="http://mydomain.com/logout">
      <_comment_>logout (with void parser)</_comment_>
      <parser name="test1"/>
      <submit>
        <parse url="http://mydomain.com/login">
          <_comment_>login with username/password</_comment_>
          <parse-param name="username">
            <value-of-var name="username"/>
          </parse-param>
          <parse-param name="password">
            <value-of-var name="password"/>
          </parse-param>
        </parse>
        <form>
          <input name="q" field="query"/>
        </form>
        <parser name="test2">
          <_xml_>
            parser definition
          </_xml_>
        </parser>
      </submit>
    </parse>
  </source>
  <added-source name="Google" num="100" variables="username password">
    <declare name="password" initial-value="xxxxxx"/>
    <declare name="username" initial-value="johnq"/>
    <_xml_>
      <a ref="schema.x.group.kb">source specific knowledge and options</a>
    </_xml_>
    <parse url="http://mydomain.com/logout">
      <_comment_>logout (with void parser)</_comment_>
      <parser name="test1"/>
      <submit>
        <parse url="http://mydomain.com/login?username=johnq&amp;password=xxxxxx" source="Google" ref="0">
          <_comment_>login with username/password</_comment_>
        </parse>
        <form>
          <input name="q" field="query"/>
        </form>
        <parser name="test2">
          <_xml_>
            parser definition
          </_xml_>
        </parser>
      </submit>
    </parse>
  </added-source>

Source bundle.

Input Example:

  <source name="Web">
    <_xml_>
      <a ref="schema.x.group.kb">bundle knowledge</a>
    </_xml_>
    <add-sources names="Google"/>
    <add-sources names="AltaVista"/>
    <add-sources names="Overture"/>
  </source>

Output Example:

  <source name="Web">
    <_xml_>
      <a ref="schema.x.group.kb">bundle knowledge</a>
    </_xml_>
    <add-sources names="Google"/>
    <add-sources names="AltaVista"/>
    <add-sources names="Overture"/>
  </source>

Example of a template source. Note that the tmpl source does not generate an added-source element.

Input Example:

  <source name="tmpl" template="template">
    <prototype>
      <declare name="sw"/>
    </prototype>
    <stopword>
      <attribute name="word">
        <value-of-var name="sw"/>
      </attribute>
    </stopword>
  </source>
  <source name="s1">
    <add-sources names="tmpl">
      <with name="sw" value="v1"/>
    </add-sources>
  </source>
  <source name="s2">
    <add-sources names="tmpl">
      <with name="sw" value="v2"/>
    </add-sources>
  </source>
  <add-sources names="s1 s2"/>

Output Example:

  <source name="tmpl" template="template">
    <prototype>
      <declare name="sw"/>
    </prototype>
    <stopword>
      <attribute name="word">
        <value-of-var name="sw"/>
      </attribute>
    </stopword>
  </source>
  <source name="s1">
    <add-sources names="tmpl">
      <with name="sw" value="v1"/>
    </add-sources>
  </source>
  <source name="s2">
    <add-sources names="tmpl">
      <with name="sw" value="v2"/>
    </add-sources>
  </source>
  <added-source name="s1">
    <stopword word="v1"/>
  </added-source>
  <added-source name="s2">
    <stopword word="v2"/>
  </added-source>

Unknown elements added under a source will be collected under the related added-source element, making it easy to pass information through to the user for each source independently.

Input Example:

  <source name="a">
    <mytag/>
  </source>
  <add-sources names="a"/>

Output Example:

  <source name="a">
    <mytag/>
  </source>
  <added-source name="a">
    <mytag/>
  </added-source>

Adding unknown elements under a source is especially convenient for passing through to the user information that is relevant only to a given source and is collected while parsing its results.

Input Example:

  <query>
    <operator logic="and">
      <term str="ok" field="query"/>
    </operator>
  </query>
  <source name="b">
    <submit>
      <form action="http://dmoz.org/">
        <input name="q" field="query"/>
      </form>
      <parser type="html-xsl" name="test">
        <xsl:template match="/">
          <attribute name="message">
            Search succeeded
          </attribute>
          <mymessage>
            Search succeeded
          </mymessage>
        </xsl:template>
      </parser>
    </submit>
  </source>
  <add-sources names="b" num="50"/>

Output Example:

  <meta query=" ok "/>
  <query>
    <operator logic="and">
      <term str="ok" field="query" processing="strict"/>
    </operator>
  </query>
  <source name="b">
    <submit>
      <form action="http://dmoz.org/">
        <input name="q" field="query"/>
      </form>
      <parser type="html-xsl" name="test">
        <xsl:template match="/">
          <attribute name="message">
            Search succeeded
          </attribute>
          <mymessage>
            Search succeeded
          </mymessage>
        </xsl:template>
      </parser>
    </submit>
  </source>
  <added-source name="b" num="50" status="queried" requested="50" message="&#10;    Search succeeded&#10;  ">
    <submit status="translated" source="b" max="50" num="50" last-rank="50" last-page="1"/>
    <parse url="http://dmoz.org/?q=ok" source="b" per="50" page="0" start="0" parser="test" ref="0" start-time="98" end-time="701" http-status="200 OK" body-length="6818" status="fetched processed parsed" parsing-time="25"/>
    <mymessage>
      Search succeeded
    </mymessage>
  </added-source>

Unlike functions, sources can have "user variables" defined globally. These are usually specified by the end user and are especially convenient for creating a username/password repository.

Input Example:

  <source name="s1">
    <prototype>
      <declare name="password" required="required"/>
    </prototype>
    <parse url="http://z.com/login">
      <parse-param name="password">
        <value-of-var name="password"/>
      </parse-param>
    </parse>
  </source>
  <add-sources names="s1"/>
  <with name="password" in-source="s1" value="xxxx"/>
  <add-sources names="s1"/>

Output Example:

  <log function="vivisimo_input_xml" fid="0">
    <warning time="0" date="1170275776" cputime="0" id="XML_RESOLVE_MISSING_VAR" function="vivisimo_input_xml" fid="0">Required variable
      <string>password</string> has not been passed when resolving the reference
      <xmlnode/> to
      <xmlnode xpath="////source[@name='s1']">
        <source name="s1">[...]</source>
      </xmlnode>
    </warning>
  </log>
  <source name="s1">
    <prototype>
      <declare name="password" required="required"/>
    </prototype>
    <parse url="http://z.com/login">
      <parse-param name="password">
        <value-of-var name="password"/>
      </parse-param>
    </parse>
  </source>
  <added-source name="s1" status="skipped-missing-variables"/>
  <added-source name="s1" variables="password">
    <declare name="password" initial-value="xxxx"/>
    <parse url="http://z.com/login?password=xxxx" source="s1" ref="0"/>
  </added-source>