parse

Parse data (optionally after fetching it from a URL) into XML.

Description

  • When the url attribute is specified, the data is retrieved from that remote location. Note that this may delay the processing of this element, of its parent element and, depending on the async attribute, of its following siblings.
  • When no url is specified, the first child text element is parsed.

When Watson™ Explorer processes this element, it replaces it with the XML output of the parsing.
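
As a minimal sketch of the case where no url is given, the element below carries its data as its first child text. The parser name example-parser and the text content are hypothetical placeholders, not names defined by the product.

```xml
<!-- No url attribute: the first child text is what gets parsed.
     "example-parser" is a hypothetical parser name used for illustration. -->
<parse parser="example-parser">raw text to be parsed</parse>
```

After processing, this element would be replaced by whatever XML the named parser produces from the text.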

When fetching URLs, cookies are collected along the way, as a browser does during a browsing session. Cookies are always passed to subsequent requests (in terms of XML, the parse elements containing this element or the ones generated by the parsing). Note that two sibling parses are not considered subsequent requests, as they are usually fetched in parallel.
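
The cookie rule can be illustrated with the sketch below. The example.com URLs are placeholders, and direct nesting of parse elements is assumed to behave like the nesting shown in the Examples section.

```xml
<!-- Nested: the inner request is fetched first, and the cookies it
     collects are passed to the containing request. -->
<parse url="http://example.com/account">
  <parse url="http://example.com/login"/>
</parse>

<!-- Siblings: usually fetched in parallel, so neither request is a
     subsequent request of the other and no cookies pass between them. -->
<parse url="http://example.com/a"/>
<parse url="http://example.com/b"/>
```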

Attributes

  • async (Boolean default: true) - Asynchronous processing. For parse tags, determines whether the request is enqueued (true) or processed before its next sibling (false). For other elements, this attribute only makes a difference when they contain asynchronous requests which need to be processed before the element is processed. In this case, when false, the element's next sibling will only be processed after the current element; when true, Watson Explorer won't wait for the current element to be processed before processing its next sibling.
  • elt-id (Integer) - Usage: Internal
  • max-elt-id (Integer) - Usage: Internal
  • execute-acl (Text)
  • process (Text) - An XPath determining which of the attributes and/or children will be processed. Currently only "", "*", "@*" and "*|@*" are supported.
  • cookie-jar (Text) - Usage: Internal
  • timeout (One of the types: xs:unsignedInt May only be: -1) - Number of milliseconds after which this request will be terminated. Note that this is used in combination with the global timeout passed to vivisimo_fetch_uris_finish.
  • max-size (One of the types: xs:unsignedInt May only be: -1 default: -1) - Limits the number of bytes that can be fetched by this request. No limit if -1 is specified.
  • proxy ( Restricted form of xs:string) - Proxy (hostname:port) to use when fetching this request.
  • proxy-user-password (Text) - Colon separated username and password to use for the proxy.
  • cache-write (Boolean default: false) - If (and only if) the request is successful, save a copy of the response (headers + body) in the cache directory so that it can be reused later.
  • cache-read (Boolean default: false) - If this request has already been cached and the cache is still valid (see cache-write), then read the response from the cache.
  • cache-max-age (Integer default: -1) - Load the cache only if it has been created less than cache-max-age seconds ago.
  • headers (Text) - HTTP header fields (cookies, user-agent, etc) which will be sent to the remote server.

    The headers specified here will overwrite the default headers of Watson Explorer which cannot take multiple values. Note that the default Watson Explorer header looks like the following:

      HTTP/1.1
      User-Agent: curl/7.11.1 (os) libcurl/7.11.1 OpenSSL/0.9.7b zlib/1.1.4
      Host: xxx.xxx.xxx.xxx
      Pragma: no-cache
      Accept: */*
      Accept-Encoding: deflate, gzip
  • username (NMToken) - Username used for authentication. Watson Explorer will use the right protocol when available. For HTTP, Basic, Digest and NTLM are currently supported.
  • password (Text) - Password used for authentication.
  • method (Any of: GET, POST, HEAD, GET-POST, POST-XML, POST-SOAP default: GET) - Method (HTTP protocol) used to fetch a URL.
  • xml-container (Text) - When the method is POST-XML or POST-SOAP, the name of the containing tag.
  • xml-namespace-url (Text) - When the method is POST-SOAP, the namespace associated with the request elements.
  • content-type ( Restricted form of xs:string) - Overwrites the content-type returned by the remote server (useful in case the remote server specifies an erroneous content-type).
  • separator (Text default: &) - Separator used to separate CGI parameters in the query string of the url sent to the remote server.
  • ignore-http-status (May only be: ignore-http-status) - Use this flag to force parsing of the output regardless of the HTTP status returned. For example, Watson Explorer follows redirections automatically, looking at the Location and Refresh HTTP headers and at the http-equiv="refresh" meta tag. When redirecting, Watson Explorer does not execute any parsing on the redirected URL. Use this flag if you need to parse something from the redirected page.
  • disable-compression (May only be: disable-compression) - By default, Watson Explorer accepts compressed data by adding the following to the HTTP header:
      Accept-Encoding: deflate, gzip

    This flag allows you to disable this in case it is causing problems.

  • store-headers (May only be: store-headers) - Store the HTTP headers as the parse of the parse elements.
  • ssl-version (Any of: Any, TLSv1, SSLv2, SSLv3 default: Any) - Specify a version of SSL to use for HTTPS connections. By default the strongest protocol available is used.
  • ssl-cert (NMToken) - Full path to a file containing an SSL certificate to be used for HTTPS connections. This may or may not contain a private key.
  • ssl-cert-type (Any of: pem, der, p12 default: pem) - Type of the certificate referred to by the ssl-cert attribute.
  • ssl-key (NMToken) - Full path to a file containing an SSL private key to be used for HTTPS connections.
  • ssl-key-type (Any of: pem, eng, der, p12 default: pem) - Type of the private key referred to by the ssl-key attribute.
  • ssl-key-password (NMToken) - Password used to read the private key file (or the certificate file when it contains the private key).
  • ssl-verify-peer (May only be: ssl-verify-peer) - Determines whether Watson Explorer verifies the authenticity of the peer's certificate. When negotiating an SSL connection, the server sends a certificate indicating its identity. Watson Explorer verifies whether the certificate is authentic, i.e. that you can trust that the server is who the certificate says it is. This trust is based on a chain of digital signatures, rooted in certification authority (CA) certificates supplied through the meta.ssl-ca-cert option.
  • url (URI) - A fully qualified URL to be parsed. The http, https, ftp, ftps and file protocols are supported.
  • uri (URI) - Same as url. Overwrites it.
  • paging-url (URI) - When efficient paging is enabled, this attribute will be set to the URL that was actually fetched to retrieve stubs.
  • paging-doc-cond-url (URI) - When efficient paging is enabled, this attribute will be set to the URL that was fetched to retrieve complete documents.
  • url-encoding ( Restricted form of xs:string default: UTF-8) - Encoding that should be used to encode the URL (before URL escaping it).
  • filename ( Restricted form of xs:string) - Full path of a file to be parsed. This is similar to using the file:// protocol, except that a failure to fetch the file will result in an error.
  • display-url (URI) - An alternative URL to be displayed to the user (instead of url).
  • source (NMToken) - The source from which this parse comes. Useful as a label when displaying it to the user.
  • encoding ( Restricted form of xs:string) - Overwrites the encoding returned by the remote server (useful in case the remote server returns an erroneous encoding).
  • headers-sent (Text) - The actual HTTP header sent to the remote server (with all the collected cookies). Usage: Internal
  • headers-received (Text) - The actual HTTP header returned by the remote server. This will only be set when the form flag attribute is set. Usage: Internal
  • post-data (Text) - CGI parameters which will be sent through the POST protocol (note that if this is specified, the parameters specified in the url will be passed with the URL as with GET, independently of the method)
  • message (Text) - Message reporting a problem while fetching.
  • disable-global-timeout (May only be: disable-global-timeout) - Do not apply the global timeout to this request. This is especially useful for logouts (which must be processed in any case).
  • base-64-encoded (May only be: base-64-encoded) - Base 64 decode the string before processing it.
  • start (xs:unsignedInt default: 0) - Start rank to use when assigning ranks to the documents parsed.
  • per (xs:unsignedInt) - Number of documents expected to be parsed. If more documents are extracted they will be discarded.
  • page (xs:unsignedInt) - Specified for parse tags extracted Usage: Internal
  • weight (Decimal number default: 1) - Weight multiplier applied to the scores of the documents parsed.
  • parser (One of the types: NMToken Restricted form of xs:string: Pattern \#anonymous\#\d+) - Name of the parser used to parse the (fetched) string.
  • root-id (NMToken) - Root id used to build the document ids...
  • inherit (Text default: per parser method url-encoding display-url source encoding username password content-type proxy proxy-user-password max-size weight id disable-global-timeout ssl-cert ssl-cert-type ssl-key ssl-key-type ssl-key-password ssl-verify-peer store-headers) - Space separated list of attributes which will be automatically inherited by a request (i.e., parsing of a URL) from a parent request (unless they are specified for the child request).
  • process-output (Text) - If specified, a corresponding process attribute will be added to the output.
  • query (NMToken) - The name of the query from which this parse is issued. Usage: Read-only
  • new-cookies (Text) - The cookies returned by the remote server for this specific request.
  • base-url (URI) - Use this instead of url to convert the URLs of the child documents and parses from relative to absolute.
  • ref (NMToken) - Usage: Internal
  • depth (xs:unsignedInt) - Nesting level of this request (including the redirects). Note that because child elements are processed first in the XML, a parse element containing another parse element will have a higher depth. Usage: Read-only
  • debug-id (xs:unsignedInt) - Usage: Read-only
  • parse-debug-type (Any of: xsl, regexp) - Usage: Read-only
  • parse-debug-id (xs:unsignedInt) - Usage: Read-only
  • http-status (Text) - The HTTP status returned by the remote server. Usage: Read-only
  • body-length (Text) - The number of bytes retrieved in the body of this request. Usage: Read-only
  • retrieved (Integer) - The number of documents retrieved by this request. Usage: Read-only
  • start-time (Integer) - The number of ms elapsed between the creation of the Vivisimo object and the start of the processing of this request. Usage: Read-only
  • end-time (Integer) - The number of ms elapsed between the creation of the Vivisimo object and the end of the processing of this request. Usage: Read-only
  • parsing-time (Integer) - The number of ms elapsed during the execution of the parsing only (not including any fetching). Usage: Read-only
  • status (List of: Any of: processed, skipped, partially-fetched, parsing-failed, parsed, failed, connection-failed, dns-timeout, timeout, url-encoding-error, content-encoding-error, fetched, redirected, cached, not-redirected, http-error, too-deep, malformed-url, base64-decoding-failed) -
    • processed: The fetching (if any) and the parsing have been completed.
    • skipped: Skipped because enough documents have been retrieved (only for subsequent requests generated by parsing)
    • partially-fetched: The fetching has been aborted (due to timeout) but part of the content has been retrieved. Note that in this case Watson Explorer will try to parse the partial content anyway.
    • parsing-failed: A problem occurred during the parsing (possible causes are wrong parser definitions, broken XML etc). Complementary log messages will be issued.
    • parsed: The data has been parsed successfully.
    • failed: An unidentified error occurred during the fetching.
    • connection-failed: The connection to the remote server could not be established.
    • dns-timeout: The DNS resolution of the domain name timed out.
    • timeout: The fetching timed out.
    • url-encoding-error: The URL cannot be converted to url-encoding encoding.
    • content-encoding-error: There was a problem converting the content retrieved to UTF-8 (some characters may not respect the original encoding or the original encoding may be unknown, see the debug log for more details).
    • fetched: The data has been fully fetched successfully.
    • redirected: The remote server redirected the request to another URL.
    • cached: The data has been cached.
    • not-redirected: The remote server redirected the request to another URL but the meta.redirect option was turned off.
    • http-error: The remote server returned an error.
    • too-deep: This request will not be processed because its depth exceeds the maximum allowed.
    • malformed-url: The URL to be fetched is not well-formed.
    • base64-decoding-failed: Failed to decode base 64 encoding.
  • encode (May only be: encode) - Return the data (which may be binary) in a schema.x.element.encoded-data node.
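
As a rough illustration of how several of the attributes above combine on one request, consider the sketch below. The URL and all values are hypothetical and chosen only for illustration.

```xml
<!-- Hypothetical request: URL and values are for illustration only. -->
<parse url="http://example.com/search"
       method="GET"
       async="false"
       timeout="5000"
       max-size="1048576"
       cache-write="true"
       cache-read="true"
       cache-max-age="3600"
       ignore-http-status="ignore-http-status"/>
```

Here timeout is expressed in milliseconds, max-size in bytes and cache-max-age in seconds, as described in the corresponding entries. Flag attributes such as ignore-http-status take their own name as their only permitted value.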

Children

  • Choose any number of these in any order.
    • parse-param: (Zero or more) - CGI parameter (name/value pair) for a containing element
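
A possible use of parse-param children is sketched below. This page does not list the attributes of parse-param, so the name and value attributes shown here are an assumption based on its "name/value pair" description. The URL and parameter values are placeholders.

```xml
<!-- Assumption: parse-param carries its pair in "name" and "value" attributes. -->
<parse url="http://example.com/search" method="POST">
  <parse-param name="q" value="watson"/>
  <parse-param name="lang" value="en"/>
</parse>
```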

Examples

In the following example, the login URL is fetched before the query is submitted, which in turn happens before the logout URL is fetched.

Input Example:

<parse url="http://vivisimo.com/logout">
<submit>
<_xml_/>
<parse url="http://vivisimo.com/login"/>
</submit>
</parse>

Output Example:

<parse url="http://vivisimo.com/logout">
<submit>
<_xml_/>
<parse url="http://vivisimo.com/login" ref="0"/>
</submit>
</parse>