parser
Parser definition
Usage
Can be stored and retrieved (using the element's name and its name attribute) in the repository.
Description
A parser converts a string (any text, HTML, XML, etc.) into XML that is interpreted by the Vivisimo object. Watson Explorer supports two types of parsers: one based on regular expressions, the other based on XSL.
See also:
- parsing tutorial in the online or printed documentation
Attributes
- async (Boolean default: true) - Asynchronous processing. For parse tags, this controls whether the request is enqueued (false) or processed before its next sibling. For other elements, this attribute only makes a difference when they contain asynchronous requests that need to be processed before the element itself is processed. In that case, when false, the element's next sibling is processed only after the current element; when true, Watson Explorer does not wait for the current element to be processed before processing its next sibling.
- elt-id (Integer) - Usage: Internal
- max-elt-id (Integer) - Usage: Internal
- execute-acl (Text)
- process (Text) - An XPath determining which of the attributes and/or children will be processed. Currently only "", "*", "@*" and "*|@*" are supported.
- internal (Text)
- overrides (Text)
- overrides-status (Any of: identical, merge)
- no-override (May only be: no-override)
- modified (Integer)
- modified-by (Text)
- do-not-delete (May only be: do-not-delete) - Do not allow deletion of the element in the repository from the admin interface. Usage: Internal
- read-only (May only be: read-only) - Do not allow modification of the element in the repository from the admin interface. Usage: Internal
- products (List of: Any of: all, vivisimo, velocity, discovery, clustermed, clustergoogle, life-sciences, japanese, chinese, mobile, ius, admin, admin-full) - Usage: Internal
- url (URI)
- name (One of the types: NMToken, or a restricted form of xs:string with pattern \#anonymous\#\d+)
- type (Any of: regex, regex-text, case-insensitive-regex, case-insensitive-regex-text, perl-regex, perl-regex-text, case-insensitive-perl-regex, case-insensitive-perl-regex-text, xsl, html-xsl, java default: regex) -
- regex: Regular expression-based parser. Much like flex, these parsers use regular expressions to switch from one state to another and delimit contents. Outputs XML conformant to the Watson Explorer schema. This parser uses POSIX regular expressions.
- regex-text: Same as regex but outputs text instead of XML. It is useful for transforming data such as HTML web pages.
- case-insensitive-regex: Same as regex but case insensitive.
- case-insensitive-regex-text: Same as regex-text but case insensitive.
- perl-regex: Regular expression-based parser. Much like flex, these parsers use regular expressions to switch from one state to another and delimit contents. Outputs XML conformant to the Watson Explorer schema. This parser uses Perl-style regular expressions.
- perl-regex-text: Same as perl-regex but outputs text instead of XML. It is useful for transforming data such as HTML web pages.
- case-insensitive-perl-regex: Same as perl-regex but case insensitive.
- case-insensitive-perl-regex-text: Same as perl-regex-text but case insensitive.
- xsl: XSL transformation parser. The XSL needs to appear as a text node under the parser node. The XML declaration and the enclosing xsl:stylesheet tags are optional. If the XSL is malformed XML, Watson Explorer will attempt to fix it.
- html-xsl: Same as xsl except that it is preceded by an HTML to XML conversion. Since HTML is ambiguous, the HTML to XML conversion can be done in many ways. Watson Explorer will try to close unclosed tags, add missing tags (like html and body), escape entities, and perform other normalizing steps.
- java: Instantiates and runs a Java class on the input data. This parser type is not finished and should not be used.
- java-classname (Text) - The fully-qualified Java classname to instantiate. This attribute is only applicable with the java parser type.
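The flex-like behavior described for the regex parser types (regular expressions that switch from one state to another and delimit contents) can be illustrated with a small state machine. The following is a minimal Python sketch, not Watson Explorer's implementation: the `RULES` table, the state names and the `run_state_parser` function are hypothetical, and Python's `re` module stands in for the POSIX or Perl-style engines. The `flags` argument mimics the case-insensitive variants.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical rule table: state -> list of (pattern, next_state, content_name).
# When a pattern matches in the current state, the parser switches to
# next_state. If content_name is set, the text between the previous match
# and this one is emitted as a <content> element.
RULES = {
    "start": [(r"<title>", "title", None)],
    "title": [(r"</title>", "start", "title")],
}


def run_state_parser(text, rules, flags=0):
    """Scan text, switching states on regex matches and delimiting contents."""
    state, pos, contents = "start", 0, []
    while pos <= len(text):
        best = None
        # Pick the earliest match among the rules of the current state.
        for pattern, next_state, name in rules.get(state, []):
            m = re.compile(pattern, flags).search(text, pos)
            if m and (best is None or m.start() < best[0].start()):
                best = (m, next_state, name)
        if best is None:
            break  # no rule of the current state matches the remaining text
        m, next_state, name = best
        if name is not None:
            contents.append((name, text[pos:m.start()]))
        state = next_state
        # Always advance, so an empty match cannot loop forever.
        pos = m.end() if m.end() > pos else pos + 1
    body = "".join(
        "<content name=%s>%s</content>" % (quoteattr(n), escape(v))
        for n, v in contents
    )
    return "<document>%s</document>" % body
```

Calling `run_state_parser("<TITLE>x</TITLE>", RULES)` finds no match because the patterns are lowercase, while passing `flags=re.IGNORECASE` extracts the title, which is the kind of difference the case-insensitive types make. The sketch only covers XML output; the -text variants, which output text instead of XML, are not modeled.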
Children
- Choose 1 of these.
- scope: (Exactly 1) - Variable scope
- Choose any number of these in any order.
- match: (Zero or more) - Defines regular expression matches to go from one state to another in a regexp parser
- state: (Zero or more) - Flex-like state for regexp parsers
- add-content: (Zero or more) - In a regexp parser, add a content element to the XML output when entering the containing state/match
Examples
Input Example:
<parser name="p">
  <match token="ok">
    <add-document/>
    <add-content name="c"/>
  </match>
</parser>
<parse parser="p">ok</parse>
Output Example:
<parser name="p">
  <match token="ok">
    <add-document/>
    <add-content name="c"/>
  </match>
</parser>
<documents>
  <document id="Ndoc0">
    <content name="c" type="html" action="cluster" weight="1">ok</content>
  </document>
</documents>
Input Example:
<parser name="p" type="xsl">
  <xsl:template match="/test">
    <document>
      <content name="test">
        <xsl:value-of select="."/>
      </content>
    </document>
  </xsl:template>
</parser>
<parse parser="p"><?xml version="1.0"?><test>ok</test></parse>
Output Example:
<parser name="p" type="xsl">
  <xsl:template match="/test">
    <document>
      <content name="test">
        <xsl:value-of select="."/>
      </content>
    </document>
  </xsl:template>
</parser>
<documents>
  <document id="Ndoc0">
    <content name="test" type="html" action="cluster" weight="1">ok</content>
  </document>
</documents>
Input Example:
<parser name="p" type="xsl">
  <xsl:template match="/html/body">
    <document>
      <content name="test">
        <xsl:value-of select="."/>
      </content>
    </document>
  </xsl:template>
</parser>
<parse parser="p"><html><body>ok</body></html></parse>
Output Example:
<parser name="p" type="xsl">
  <xsl:template match="/html/body">
    <document>
      <content name="test">
        <xsl:value-of select="."/>
      </content>
    </document>
  </xsl:template>
</parser>
<documents>
  <document id="Ndoc0">
    <content name="test" type="html" action="cluster" weight="1">ok</content>
  </document>
</documents>
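The input in this last example happens to be well-formed, but real HTML often is not, which is why the html-xsl type runs an HTML to XML conversion first. The kind of normalization described for that type (closing unclosed tags, adding missing html and body tags, escaping entities) can be sketched in Python with the standard library's `html.parser`. This is an illustrative approximation, not Watson Explorer's converter: the `HtmlToXml` class, the `html_to_xml` function and the `VOID` tag set are hypothetical, and many cases a production converter handles are ignored.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumed subset of HTML elements that never have a closing tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class HtmlToXml(HTMLParser):
    """Collects well-formed XML while reading possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open elements

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(" %s=%s" % (k, quoteattr(v or "")) for k, v in attrs)
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close every element opened after the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, < and >

    def close(self):
        super().close()  # flush any buffered text first
        while self.stack:  # then close whatever is still open
            self.out.append("</%s>" % self.stack.pop())


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    body = "".join(parser.out)
    # Add the html and body wrappers when the input omitted them.
    if not body.startswith("<html"):
        body = "<html><body>%s</body></html>" % body
    return body
```

With this sketch, `html_to_xml("<p>a & b<br><b>bold")` yields `<html><body><p>a &amp; b<br/><b>bold</b></p></body></html>`, which an XML parser accepts and on which a template such as `match="/html/body"` could then operate.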