Release Notes
Abstract
This document describes IBM Content Analytics Studio Version 3.0, provides installation instructions, and describes known issues and workarounds.
Content
Description
IBM Content Analytics Studio 3.0 is the latest release of an Eclipse-based development environment for building custom text analyzers in various languages.
Note: The name of this product has changed since its last release. Efforts were made to update references in the product and documentation to the new name, IBM Content Analytics Studio. However, if you find any references to LanguageWare Resource Workbench (LRW for short) in the product or documentation, be aware that they refer to the same product.
IBM Content Analytics Studio 3.0 allows users to easily:
- Develop rules to spot facts, entities, and relationships by using a simple drag-and-drop paradigm.
- Build language and domain resources into a dictionary or ontology.
- Import and export dictionary data to or from a database.
- Browse the dictionaries to assess their content and quality.
- Test rules and dictionaries in real time on documents.
- Create UIMA annotators for annotating text with the contents of dictionaries and rules.
- Annotate text and browse the contents of each annotation.
- Export a UIMA annotator as a UIMA PEAR file so the same annotator can be run outside of IBM Content Analytics Studio. This PEAR file can optionally be tailored to provide Index field information to IBM Content Analytics with Enterprise Search. PEAR files exported from IBM Content Analytics Studio and not used in IBM Content Analytics with Enterprise Search are not supported by IBM.
- Export a UIMA annotator directly to IBM Content Analytics with Enterprise Search. This pipeline can either be automatically added as the Custom stage of the collection pipeline, or it can be manually added to replace the Lexical Analysis stage of the collection pipeline. By replacing the Lexical Analysis stage of the collection pipeline, it is possible to use IBM Content Analytics Studio to add support for additional languages to IBM Content Analytics with Enterprise Search.
- Annotate a collection of documents, and compare the results with previous runs to review the improvement in the accuracy of the annotator.
- Send documents to an IBM Content Analytics with Enterprise Search collection for annotation, and retrieve the resulting annotations which can then be viewed and compared.
- Work with a team of developers to collaboratively develop linguistic resources and models to analyze text, by sharing your IBM Content Analytics Studio project in a Source Control repository.
- Check the spelling of a document. This spell checking function of IBM Content Analytics Studio is provided AS IS, and is not supported by IBM.
Requirements
For the most current information about the hardware and software requirements for running IBM Content Analytics Studio, see System requirements.
Installation
IBM Content Analytics Studio 3.0 requires Java 6 or later. The installation procedure is described below. If an earlier version (the product was previously called LanguageWare Resource Workbench, Version 7.2) is already installed on your system, you can keep it: you can install IBM Content Analytics Studio alongside LanguageWare Resource Workbench provided that you install the software into a different directory.
Installing IBM Content Analytics Studio interactively
- Obtain the installation file ContentAnalyticsStudio-install.exe either from the DVD, or by extracting it from the ZIP file that you download from Passport Advantage.
- Launch the installation program.
- Read and accept the license agreement and terms of use.
- Accept the default installation path, or specify a new folder where you want IBM Content Analytics Studio to be installed. Each version of IBM Content Analytics Studio is a separate application. Do not install a new version in the same folder as a previous version. Instead, either remove your previous version or ensure that all versions are installed in separate locations. If in doubt, accept the default folder selected by the installation program, which ensures this behavior.
When it runs, the installation program creates relevant directories, files and shortcuts.
Note: The installation program does not support installing multiple instances of IBM Content Analytics Studio onto a computer. Doing so can cause data to be overwritten and other unexpected results. For example:
- No attempt is made to prevent the Start menu links that were created by one installation from overwriting the links from an earlier installation.
- Multiple installations of the Studio Demonstrator workspace will result in one workspace overwriting another, with unpredictable results.
- Installing IBM Content Analytics Studio without the Studio Demonstrator workspace, and then using the Studio to create a workspace in the default location, might cause problems. If you reinstall IBM Content Analytics Studio and choose to install the Studio Demonstrator workspace, the Studio Demonstrator workspace will overwrite the workspace that you created. It might be possible to recover files you created in the workspace by creating a new project in the Demonstrator workspace with the same name that you used for the project in the original workspace. However, unexpected consequences can occur.
Installing IBM Content Analytics Studio from the command line
To install IBM Content Analytics Studio by using a command line interface (rather than using user interface dialogs), run the following command from a command window and follow the instructions:
ContentAnalyticsStudio-install.exe -i console
To perform a silent installation, you must put the installation options in a response file, and then run the following command from a command window:
ContentAnalyticsStudio-install.exe -i silent -f response_file
The easiest way to create a response file is to install IBM Content Analytics Studio on one system by using the record option. All of your responses to options on the installation dialogs will be recorded into the specified response file. This response file can then be used to install IBM Content Analytics Studio on any number of other systems. To run the installation in record mode, use the command below:
ContentAnalyticsStudio-install.exe -r response_file
Specify the response file name with a full path name, for example:
C:/temp/studio.properties
The generated response file contains the following lines that you can customize:
USER_INSTALL_DIR=studio_home
-fileOverwrite_studio_home\\Uninstall\\Uninstall\ IBM\ Content\ Analytics\ Studio.lax=Yes
-fileOverwrite_studio_home\\Uninstall\\resource\\iawin32.dll=Yes
-fileOverwrite_studio_home\\Uninstall\\resource\\win64_32_x64.exe=Yes
-fileOverwrite_studio_home\\Uninstall\\resource\\remove.exe=Yes
-fileOverwrite_user_home\\studioworkspace\\README.txt=Yes
In this example, studio_home is the target installation directory that you chose for IBM Content Analytics Studio, and user_home is the home directory of the target user. Because the backslash character is treated as an escape character in the response file, use a double backslash when defining paths.
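For example, a customized response file for an installation into C:\IBM\ContentAnalyticsStudio-3.0 (this path is only an illustration) might contain lines similar to the following:
USER_INSTALL_DIR=C:\\IBM\\ContentAnalyticsStudio-3.0
-fileOverwrite_C:\\IBM\\ContentAnalyticsStudio-3.0\\Uninstall\\resource\\iawin32.dll=Yes
You could then run the silent installation with a command such as:
ContentAnalyticsStudio-install.exe -i silent -f C:/temp/studio.properties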
Starting IBM Content Analytics Studio for the first time
You can start IBM Content Analytics Studio by using the Start>Programs>IBM Content Analytics Studio>IBM Content Analytics Studio, Version 3.0 Windows menu option.
When you launch IBM Content Analytics Studio you will be prompted to select a workspace. Either:
- Accept the default workspace, which will open the Demonstrator Workspace (provided that you chose to install it). Note: If you intend to access the same workspace while logged on to Windows with different user IDs, do not use the default workspace, because the default workspace for one user will not be accessible to other users.
- Select a new folder in which IBM Content Analytics Studio will create a new empty workspace.
- Select the location of an existing workspace, for example, one created with an earlier version of the product (which was called LanguageWare Resource Workbench). If the workspace was created by a previous version, create a backup copy of the workspace before upgrading. If you are using IBM Content Analytics Studio with a workspace that was created on an earlier release of LanguageWare Resource Workbench, you must upgrade your workspace in order to enable some of the new features of this release.
Information on all features is available in the IBM Content Analytics Studio online help, which can be accessed by selecting the Help / Help Contents menu option.
If you open the Demonstrator Workspace, you can find information about the two projects in the workspace by looking at the files in the README folders of each of the projects.
Uninstalling IBM Content Analytics Studio
To remove IBM Content Analytics Studio from your system, select the Programs>IBM Content Analytics Studio>Uninstall IBM Content Analytics Studio, Version 3.0 option from your system Start menu. Then follow the dialogs to uninstall the IBM Content Analytics Studio software.
To uninstall IBM Content Analytics Studio by using a command line interface:
- Change to the Uninstall directory in the installation directory.
- Run the command "Uninstall IBM Content Analytics Studio.exe" -i console and follow the instructions.
To perform a silent uninstallation by using a command line interface:
- Change to the Uninstall directory in the installation directory.
- Run the command "Uninstall IBM Content Analytics Studio.exe" -i silent.
Note: If you have updated your Studio installation, for example by installing a Source Control client or a Fix Pack, the uninstallation process will not remove this additional software from your system. After running the Studio uninstallation program, you can remove the additional software by manually deleting the target installation directory, for example "c:/Program Files/IBM/ContentAnalyticsStudio-3.0".
Upgrading a workspace created on an earlier release
With each new release of IBM Content Analytics Studio, new functions become available, and changes are applied. Because of these changes, a workspace created on an earlier release of IBM Content Analytics Studio might not function as expected on the current release.
So that you do not have to discard your old workspace and start again, a Workspace Upgrade utility is provided that can make any changes needed to your old workspace. Running the Upgrade utility on an old workspace should make it function correctly on IBM Content Analytics Studio 3.0. The utility can upgrade any workspace created on LanguageWare Resource Workbench release 7.2 or a later fix pack.
Before running the Workspace Upgrade utility, create a backup copy of the workspace. This enables you to revert to the old release of LanguageWare Resource Workbench if you encounter a problem.
To run the Workspace Upgrade utility, select the Help/Upgrade Workspace menu option. This will scan your workspace looking for resources that need updating and tasks that need running. Once the scan is complete, you will be presented with a list of tasks that the utility thinks need to be run. You can override the default action by un-checking tasks but, for best results, accept the default and run all of the tasks. When you click OK, the utility will perform each of the tasks in the list that are checked.
For information on the function of the individual tasks, refer to the online help by clicking the help (?) button at the lower left of the dialog.
For further details about migrating custom annotators, see Upgrading to IBM Content Analytics with Enterprise Search Version 3.0.
Changes and enhancements
The following section contains a list of what is new in this release:
- A simpler process for adding support for extra languages in IBM Content Analytics Studio.
- A mechanism to normalize the features of an annotation created by Parsing Rules. The value of a feature can be built or calculated by taking the values of some of the other features and applying a selected function on them (for example string concatenation, date conversion, multiplication).
- Support for source control (also known as revision control, version control, or source code management) systems to manage IBM Content Analytics Studio resources. This allows a team of developers to collaborate on the development of linguistic resources and models. The following Source Control repositories and Eclipse clients have been tested with IBM Content Analytics Studio:
  - SVN: The Subclipse client from Tigris. Although this client works, it does not recognize when a database is open and changes have been made. Therefore, changes made to the database will not be synchronized unless the database is closed first.
  - Rational Team Concert: The Rational Team Concert Jazz client works with IBM Content Analytics Studio. However, it does not notice if any file (including a database) is open in an editor with unsaved data. Therefore, changes made to files or databases will not be synchronized unless the file or database is closed first.
  Note: If you installed Content Analytics Studio 3.0 in the Program Files or Program Files (x86) directory of your computer (this is the default), you must run the Studio as Administrator when installing the source control client code. Right-click the Studio icon on your desktop, and select Run as administrator.
- A tool to extract document text from an IBM Content Analytics with Enterprise Search server. Documents can either be selected because they have a specified flag in the IBM Content Analytics with Enterprise Search server, or because they have been dropped by the server during analysis because of an error.
- Support for Czech, Hebrew, Polish, and Russian languages.
- A Semantic Analysis stage can be included in a UIMA Pipeline. This stage uses a Semantic dictionary to identify associations between annotations already found in the pipeline.
- The IBM Content Analytics Studio installation program has an option to install a sample workspace, called the Studio Demonstrator. This workspace contains the following two projects:
  - Project IBM contains a set of sample annotators and resources that provide some suggestions for how to develop your own linguistic models.
  - Project Addendum contains an English Phrase Annotator and some example character rules dictionaries. The character rules dictionaries are used to match strings by using regular expressions.
- When exporting a UIMA Pipeline to IBM Content Analytics with Enterprise Search:
- A UIMA annotation can be associated directly with a facet, so there is no longer the need to define an index field. With direct facet mapping, it is also possible to assign a literal value to each facet that is created in this way.
- Field/Facet mapping data can be updated without updating the PEAR file. This is useful if the Annotator and all of its resources have not changed and all you want to do is to update the Field/Facet mapping. Because the PEAR file deployment is time consuming, it is useful in these situations to be able to skip the deployment, and just update the Field/Facet mapping.
- The Field/Facet mapping is now specific to a collection. Therefore it is possible to export the same pipeline to several collections by using different Field/Facet mappings.
- When exporting a UIMA Pipeline as a PEAR file for use on IBM Content Analytics with Enterprise Search, it is now possible to define facets (previously it was only possible to define fields). Often the production IBM Content Analytics with Enterprise Search server is not accessible to the data modeller, who cannot therefore export a PEAR file directly to the production server.
- Support for Custom Character Class for character rules with the Character Rules Editor. This enables you to define custom sets of characters.
- The ability to add features to annotations generated by Character Rules by using the Character Rules Editor.
- Overall analysis performance has been improved.
Known limitations
Lexical Analysis limitations
- Decomposition of plural possessives such as houses' is not consistently supported.
- Lexical analysis may appear inconsistent where a word occurs in text with a hyphen or apostrophe attached to the end. This inconsistency is only evident when the word is explicitly marked in the dictionary as being invalid stand-alone.
- No validation of Roman numerals is performed. Invalid combinations of letters resembling a Roman numeral may be recognized. Letter j is not considered a valid Roman digit.
- Lexical analysis only recognizes abbreviations that have internal punctuation. Abbreviations without punctuation are generally categorized as UPPERCASEWORD unless they are explicitly recognized in the dictionary.
- Where there is an overlap of multi-word unit annotations, the matched text assigned to both annotations is the union of both spans.
- There is a potential problem when using a mixed collection of lowercase and allcase dictionaries. It is possible that custom dictionary entries will not be annotated due to lookup ordering.
- When creating a Dictionary entry containing multiple words, where one or more of those words are compounds, the compounds must be separated by a space for them to be recognized by Lexical Analysis. For example, in English the term "won't pay" would need to be saved in a dictionary entry as "wo n't pay".
- After migrating a workspace from LanguageWare Resource Workbench 7.2 to IBM Content Analytics Studio 3.0, the Part-of-Speech tagging results for English and German might be different. This is caused by improvements in the precision of the Part-of-Speech tagging for these languages, and in most cases this is an advantage. If however you have rules that rely on the old Part-of-Speech tagging results, then the change might result in some rules no longer being triggered. This issue is described in more detail in this Technote.
Parsing Rules limitations
- The parsing engine will freeze on unbounded repetitions of sequences that have a possible empty path, for example a repeating ordered group covering two optional annotations. To avoid this problem, make sure all repeating groups always have at least one required annotation.
- An optional Token test followed by an ambiguous choice of annotations that includes a dictionary annotation (for example, where an annotation created by an earlier parsing stage exactly covers a dictionary annotation) may behave as a required test, causing the rule to fail on some phrases where it should apply. To work around this problem, change any rule that creates a new annotation exactly matching the span of an existing dictionary annotation to delete the covered annotation.
- Phrases and Entities Rules cannot currently be created across sentence boundaries. A sentence boundary is usually determined by the presence of a dot (.) or a new line character in the text. For terms such as John A. Waters, the . will be detected as ending the sentence, unless A. is in a dictionary. The solution is to put any terms such as A., Cllr. (short for councillor) in a dictionary.
- The Rules engine does not have good support for detecting changes to the annotation types required as input for rules. This can cause issues when running the annotator against a document. The most common issue is that rules will not be triggered because the rules are still trying to match the original type. The following changes can cause this issue:
- Changing the definition of a dictionary type, such as changing the type for uima.tt.Day to be com.ibm.Day. The rules were based on uima.tt.Day as input, and do not automatically detect the change. You might need to delete these rules and recreate them.
- Changing the names of annotations created by a rule. For example, one rule might create a type Org, and another rule might match Org and create a Company annotation on top of it. If the Org is changed to Organization in the first rule, the second rule might not fire because it still expects Org. The solution is to recreate any rules that depended on the Org type so that they now have Organization as input.
- By default, Dictionary entries covering Token Annotations are ignored by rules. This allows Token-based rules to match even when Dictionary Annotations are covering some of the Tokens. However, when a Dictionary Annotation covers more than one Token, it is not ignored in the current release.
- When creating rules, the Constraints tab lets users specify dictionary types to be prioritized over Token Annotations. This means that the dictionary type will hide the token for the purposes of the rule activation. However, if a prioritized Dictionary type and a non-prioritized Dictionary type cover the same Token, then the Token will not be hidden.
- In the Selection tab of the Rule Editor, if a word was found in more than one Custom dictionary, the word will be displayed with one of the custom dictionary types together with a drop-down icon to allow you to select an alternative type to represent this word. Sometimes when clicking the drop-down icon, nothing happens. To clear this problem, click on another Token in the tree to highlight it, and then click on the drop-down icon again.
- There is an issue with writing regular expressions in Arabic. The regular expression editor does not handle right-to-left (Arabic) input well. While it is possible to write a regular expression correctly, it is unlikely to look correct on the screen.
- This release added support for a new annotation type that sits above the Token annotation. It is called Annotation and covers any type of annotation irrespective of its type, such as token, dictionary, rule, punctuation, and so on. As such, using this annotation type in a rule is very aggressive as it will ingest all annotations until the next break point (sentence in the case of phrase/entity rules, and sentence/paragraph/document in the case of aggregates). Therefore use this annotation type with caution.
There is one use case where this annotation type (and its aggressive behavior) can be extremely useful: when it is used in combination with the custom break rules. If you have a semi-structured document with a special field delimiter, then you can use the custom break rules to define that delimiter by setting the character in question (such as a tab, double tab, pipe, colon, single carriage return, and so on) as the sentence breaker, and then use the annotation type within a rule to mark up the fields between these delimiters.
For example, consider a semi-structured document that contains the following fields:
  ID: 1234-3434 23D D33
  Name: Marie Wallace
  Address: 1 Studio Road, Dublin, Ireland
If you set a single carriage return as the sentence delimiter in the break rules, you can then create the following rules:
  {token = ID:} + {Annotation}* = {token = ID:} + [Document ID]
  {token = Name:} + {Annotation}* = {token = Name:} + [Person]
  {token = Address:} + {Annotation}* = {token = Address:} + [Address]
- Parsing Rules have a feature to help in the logical grouping of annotations, called the Group function. This allows you to group patterns of annotations in order to create larger annotations and relationships across them. There is one limitation: the Group function does not support the use of features associated with the components of a group. You can physically add features to any annotations that you create under a group; however, this feature will incorporate the aggregate of all instances of the group that are matched in the rule.
- If you click the icon to create a pre-condition, the Create Parsing Rules tab will indicate the rule should be saved. That is, an asterisk indicating the rule has been modified and not saved will be displayed even if you have cancelled the pre-condition dialog.
- If you create an annotation over two or more tokens in a rule, and then go back to the Selection tab to delete one of the tokens that the annotation covered, this will cause the annotation to be deleted.
- If a Parsing Rule has all of the annotations in the selection tree marked as optional (occurring zero or one time), then this rule will never be triggered. At least one annotation in the selection criteria must be a required annotation.
- If you create a Rule in the rule editor and save it and then, without closing the rule editor, you modify the rule and save it again, the creation date of the rule remains correct, but the creation time is set to 00:00:00.
- If you create a Rule that creates a Normalized Feature by using the Covered Text of that rule, the rule will fail to build and will generate an error. You can work around this problem by creating two rules. The first rule matches the pattern and creates a single annotation over the entire span. The second rule matches the annotation created by the first rule and creates a feature that is the "value" of that annotation. This value will be the covered text of the first rule. You can then create a Normalized Feature in this second rule that uses the feature that you just created.
- If you close a Rules database while there is an unsaved rule open in the Rule editor, you will be prompted to save the rule, but the saved rule will not be written to the CSV file used for Source Control. Therefore if you then commit your changes to a source control repository, the modified rule will not be included in those changes. To avoid this problem, always save your rule before closing the database.
- The Parsing Rule editor allows you to add a Feature of type "String Array" to the list of tests in a Rule. IBM Content Analytics Studio does not support tests on this type of feature. Adding a test will cause errors in the UIMA pipeline.
- In the Selection tab of the Parsing Rule editor, if you add an annotation over two or more selected annotations, then the results can be unpredictable. This problem will only occur if:
  1. The first selected annotation is the first node in the parse tree.
  2. The annotation that is added is the same type as the first selected annotation.
- Annotation type names that match internal IBM Content Analytics Studio or UIMA types may be treated incorrectly in the current release. The following names are therefore reserved and should not be used in user dictionaries or as rule outputs:
  - Document
  - Paragraph
  - Sentence
  - Annotation
  - Token
  - Lemma
  - WordLikeToken
  - Alphabetic
  - UppercaseAlphabetic
  - TitlecaseAlphabetic
  - LowercaseAlphabetic
  - Arabic
  - Hebrew
  - Syllabic
  - Hiragana
  - Katakana
  - Hangul
  - Ideographic
  - Han
  - Numeric
  - ChineseNumeral
  - Punctuation
  - ClauseEndingPunctuation
Character Rules limitations
- Character rules will be unable to recognize entities starting and ending in punctuation when that punctuation character is adjacent to multiple punctuation characters. This is due to the fact that character rule matches must start and end on token boundaries, and Lexical Analysis interprets longer sequences of punctuation as one token. For example, the open bracket in C++</a> is not a token boundary, and thus a character rule finding HTML closing tags will not be triggered on this sequence.
- Setting a character rules file to affect tokenization may cause the rules to fail to be triggered in situations where the match starts with the separating character of the previous match. For example, a tokenization-affecting rule for matching currency expressions will fail to match $13.5 in the string US$13.5 because the dollar symbol is the break that ends the token US. This is due to the fact that lexical analysis treats the text as a sequence of tokens and breaks. The former are handled by standard dictionaries and tokenization-affecting character rules, the latter are normally consumed by break rules. Character rules that do not affect tokenization do not exhibit this behavior.
- Performing a Split operation with more than one Character class node selected results in Character class nodes being added in an indeterminate order.
- When writing Character Rules in a Character Rules database, you will receive an error if you are using a UIMA Pipeline with a Cleanup stage that deletes SentenceAnnotation types. Change the Cleanup stage so that SentenceAnnotation types are not deleted and try again.
- When creating Features in an Annotation generated by a Character Rule, the feature will be shown in the Rule Editor as a List, and multiple sub-annotations can be added to it, but when the Rule is used, the Feature is of type string (not list).
Break Rules limitations
- The option to report one letter abbreviations as one token may be overruled by dictionary entries even when dictionary entries are shorter. For example, in English, the word I will cause the abbreviation I. to be identified as a sequence of two tokens. The lexical analysis stage lacks the analysis depth required to decide whether I. is an abbreviation (as in I. B. M.) or a sequence of a valid English word and a sentence ending (as in So do I.). As the lexical analysis engine is designed to be simple, fast and completely data driven, in these cases it gives priority to dictionary data over break rules based on the assumption that a match for a dictionary word carries more information. This feature can also be used to ensure that problematic abbreviations are always kept together: if a user dictionary contains I., the other interpretation will be overridden and the sequence will be treated as an abbreviation.
Dictionary Editor limitations
- If you have a Dictionary Database open, and open a Term to edit the contents of it, and then you Refresh the Database View and try to save the Term, you will get the error ResultSet not open, and you will be unable to save the Term. In this case, the only option is to close the Term without saving it, and then re-edit and save the term.
- When creating a new dictionary database in DB2 by using the Create IBM Content Analytics Studio tables in an empty database option, the input database name must be specified in uppercase. DB2 database names are uppercase and specifying a different casing in IBM Content Analytics Studio generates SQL errors.
- If you edit the values in a dictionary database by using the in-line editor of the database view, sometimes changes will not be actioned. If you change the value of one field, and then without pressing Enter, click a second field and change the value of that field, the second change will not be actioned. This is caused by a defect in Eclipse.
Semantic Analysis limitations
- The script to build a semantic dictionary (build.bat) cannot contain non-ASCII characters. For example, the names of the input RDF file and the output semantic dictionary file must contain only ASCII characters.
- In the configuration of a Semantic stage of a UIMA Pipeline, there are two "Value" fields. The description of these fields in the online help is incorrect.
- For a Semantic Type, the value specifies how likely it is that concepts of this Type will be annotated by this Semantic stage. A value of 1 means that it is more likely that concepts of this Type will be annotated, and a value of 0 means that it is unlikely that concepts of this Type will be annotated by this Semantic stage.
- For a Semantic Link, the value specifies how likely it is that the link will be traversed. A value of 1 means that the link is likely to be traversed, and a value of 0 means that it is unlikely to be traversed.
- When exporting an IBM Content Analytics Studio pipeline that includes the Semantic Analysis stage for use with IBM Content Analytics with Enterprise Search, com.ibm.langware.semantic.Concept:ref, com.ibm.langware.semantic.Relation:subject, and com.ibm.langware.semantic.Relation:object are not selectable in the Add Index Field or Facet wizard because these features are not primitive. To work around this limitation, edit cas2index.xml located in an exported PEAR file and upload it to the IBM Content Analytics with Enterprise Search server manually.
For example, if you set a mapping between the surface form of com.ibm.langware.semantic.Concept:ref and an IBM Content Analytics with Enterprise Search facet named concept_ref, you can add the following definition to cas2index.xml:
<indexBuildItem>
  <name>com.ibm.langware.semantic.Concept</name>
  <indexRule>
    <style name="Facet">
      <attribute name="fixedName" value="$.concept_ref"/>
      <pathComponent name="feature" value="ref/coveredText()"/>
    </style>
  </indexRule>
</indexBuildItem>
IBM Content Analytics Studio limitations
- If a project is renamed, any documents with an extension of .txt in the project will not be opened with the annotations editor until IBM Content Analytics Studio is restarted.
- If you are using source control to share an IBM Content Analytics Studio workspace between several users, and one user deletes a database and synchronizes that change to the source control repository, then when another user checks out those changes, they will find that the database has been correctly removed, but that it has been replaced by a folder containing a number of files. This folder and its contents should be deleted.
- When installing IBM Content Analytics Studio on any drive that is not C:, the installation wizard will create a "temp" folder on the drive you select. This temp folder will not be deleted after the installation is completed. The same problem occurs if you uninstall IBM Content Analytics Studio.
- If you export a UIMA pipeline to an IBM Content Analytics with Enterprise Search server and define an index field that maps to a facet, and then at a later time you export the same pipeline to the server and change or remove the index field to facet mapping, the original mapping will not be deleted on the server. To work around this problem, you must open the IBM Content Analytics with Enterprise Search administration console and manually delete the mapping.
- When using the Type Catalog, the option to show the UIMA pipeline configuration and the Type System input files that use the selected type does not always show the full list of UIMA pipeline configuration files that output the selected type.
- Problems occur if you create a Dictionary, Parsing Rule, or Character Rule database with a name that contains a hash (#) character.
- If you have an open document annotated with a UIMA annotator, and then close and re-open the Properties view, the view will fail with an exception. To work around this issue, close all document editors and re-open them.
- If you import an IBM Content Analytics Studio Project into a workspace, you will be prompted whether you want to restore each Dictionary or Rule database from the CSV files. If you respond Yes, you will receive an error message indicating that the database table already exists. This error message can be ignored as the database is up to date and intact.
- The Collection Analysis view shows an asterisk (*) in the title indicating that it has unsaved data even if the collection analysis data has been saved.
Japanese limitations
- The Japanese lexical analyzer cannot work without the built-in Japanese lexical analysis dictionary.
- Custom break rules files are not supported. The Lexical Analysis stage of an annotator must use the Studio's default segmentation rules for Japanese.
- Custom dictionary entries will only be annotated if the underlying token is classified as a "Noun" or "Unknown" part of speech.
Chinese limitations
- The Chinese lexical analyzer cannot work without the built-in Chinese lexical analysis dictionary.
- Custom break rules files are not supported. The Lexical Analysis stage of an annotator must use the Studio's default segmentation rules for Chinese.
Arabic limitations
- There is no support for word boundary detection if more than one space is missing in a sentence fragment. Such a fragment will not be lexically analyzed and will be tagged as Unknown.
Supported languages
The following languages are built into IBM Content Analytics Studio and are fully supported (that is, Language Identification, Lexical Analysis, and Part of Speech Disambiguation):
- Arabic
- Chinese
- Czech
- Danish
- Dutch
- English
- French
- German
- Hebrew
- Italian
- Japanese
- Polish
- Portuguese
- Russian
- Spanish
Document Information
Modified date:
17 June 2018
UID
swg27024048