Training an Ontolection

To train on a text corpus and generate an ontolection, you must also specify a PEAR file for the language the corpus is written in. Watson™ Explorer ships with a PEAR file for each supported language in the data/pears directory.

An example with an English text corpus:

ontolectiontrainer --trainOntolection --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection] 

To save the trained language model to disk for later reuse or additional training, use --persistModel [path to save to]. To load a saved language model, use --loadModel [path to saved model]. For example, to load a previously trained model, train it further on a new text corpus, and output a new ontolection:

ontolectiontrainer --trainOntolection --loadModel [path to saved model] --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection] 

Loading a saved model also lets you change the generated ontolection without retraining the model. For example, to output 5 related terms per vocabulary word:

ontolectiontrainer --trainOntolection --loadModel [path to saved model] --outputPath [desired filename of ontolection] --numNearestNeighbors 5 
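
The save-and-reuse workflow described above can be sketched as a small shell script. This is only an illustration: all paths and output filenames are hypothetical placeholders, and it assumes that --persistModel can be combined with --trainOntolection in a single invocation, which the text above suggests but does not show. The run_trainer helper is not part of the product; it prints the command instead of running it when ontolectiontrainer is not on the PATH.

```shell
#!/bin/sh
# Placeholder paths (hypothetical); substitute your own.
CORPUS="/path/to/corpus.txt"
PEAR="/path/to/language.pear"
MODEL="/path/to/saved.model"

# Helper (not part of the product): run the trainer if it is installed,
# otherwise print the command as a dry run.
run_trainer() {
    if command -v ontolectiontrainer >/dev/null 2>&1; then
        ontolectiontrainer "$@"
    else
        echo "ontolectiontrainer $*"
    fi
}

# 1. Train once and save the model to disk (assumes --persistModel may be
#    combined with --trainOntolection).
run_trainer --trainOntolection --corpus "$CORPUS" --pear "$PEAR" \
    --outputPath first.ontolection --persistModel "$MODEL"

# 2. Reuse the saved model to write ontolections with different numbers of
#    related terms per word, without retraining.
for n in 5 10; do
    run_trainer --trainOntolection --loadModel "$MODEL" \
        --outputPath "neighbors-$n.ontolection" --numNearestNeighbors "$n"
done
```

Keeping the saved model around makes it cheap to compare several --numNearestNeighbors settings, since only the output step is repeated.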

Blacklisting Vocabulary for Ontolection Training

Ontolection Trainer supports blacklisting vocabulary. Use the --blacklist option to pass the path to a file containing blacklisted words, one per line; words in the blacklist are excluded from the generated ontolection entirely.
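
As a minimal sketch of the file format, the script below writes a three-word blacklist and prints a training command that uses it. The words and the blacklist.txt filename are arbitrary examples, and attaching --blacklist to a --trainOntolection run is an assumption based on this section's title; the bracketed values are placeholders to fill in.

```shell
#!/bin/sh
# Write a blacklist file, one word per line (the words are arbitrary examples).
BLACKLIST="blacklist.txt"
printf '%s\n' 'lorem' 'ipsum' 'dolor' > "$BLACKLIST"

# Report how many words will be excluded.
COUNT=$(wc -l < "$BLACKLIST" | tr -d ' ')
echo "blacklisting $COUNT words"

# Training command with the blacklist attached. It is printed rather than
# executed so the sketch runs without the trainer installed.
echo "ontolectiontrainer --trainOntolection --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection] --blacklist $BLACKLIST"
```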

Whitelisting Vocabulary for Ontolection Training

Ontolection Trainer supports whitelisting vocabulary. Use the --whitelist option to pass the path to a file containing whitelisted words, one per line; words that are not in the whitelist are excluded from the generated ontolection entirely.

Generating a Vocabulary Whitelist from a Watson Explorer Engine dictionary

If you generate a text corpus from indexed documents in Watson Explorer Engine and train an ontolection on that corpus without preprocessing it, you might find that the generated ontolection contains many nonsense words. One way to reduce them is to build a whitelist from a term expansion dictionary:
  1. Turn on term expansion dictionary generation for your Watson Explorer search collection (Configuration > Indexing > Term expansion support > Generate dictionaries) and refresh your crawl.
  2. Locate the generated term expansion dictionary's XML file in your Watson Explorer deployment. It is in the expansions directory under your search collection's crawl directory, and its filename is an alphanumeric hash.
  3. Extract the words from this file and put them in a text file; this will be your whitelist. (For example, awk '{ print $1 }' [some_hash_filename] > whitelist.txt.)
  4. Pass this whitelist to Ontolection Trainer using the --whitelist option; your ontolection will then contain only words found in your dictionary.
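
The extraction in step 3 can be made a little more robust by skipping blank lines and removing duplicates. The sketch below runs on a small made-up sample file; the real dictionary's layout may differ, and the only assumption carried over from the awk example above is that the word is the first whitespace-delimited field on each line. The sample_dictionary.txt filename and its contents are hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for the hash-named dictionary file; real files will differ.
DICT="sample_dictionary.txt"
printf '%s\n' 'server 42' 'database 17' '' 'server 3' 'index 8' > "$DICT"

# Keep the first whitespace-delimited field of each non-empty line, then
# sort and de-duplicate so that each word appears exactly once.
# LC_ALL=C makes the sort order independent of the local locale.
awk 'NF { print $1 }' "$DICT" | LC_ALL=C sort -u > whitelist.txt

cat whitelist.txt
```

The resulting whitelist.txt holds one word per line, which is the format the --whitelist option expects.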

Acronym Extraction

Ontolection Trainer implements a rules-based approach to learning acronyms from a text corpus and outputting an ontolection that contains their expansions. So far, acronym extraction has been tested only with English. For example:

ontolectiontrainer --extractAcronyms --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection]

Phrase Extraction

Ontolection Trainer implements a statistical approach to phrase extraction. When you pass the --learnPhrases command line option, Ontolection Trainer analyzes a plain text corpus (passed using --corpus) and extracts word phrases from it.

To process a text corpus and output a list of phrases, you must also specify a PEAR file for the language the corpus is written in. An example with an English text corpus:

java -jar ontolectiontrainer.jar --learnPhrases --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of output]