Training an Ontolection
To train on a text corpus and generate an ontolection, you must also specify a PEAR file corresponding to the language the corpus is written in. Watson™ Explorer comes packaged with a PEAR file for each supported language in the data/pears directory.
An example with an English text corpus:
ontolectiontrainer --trainOntolection --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection]
To save the trained language model to disk for later reuse or additional training, use --persistModel [path to save to]. To load a saved language model, use --loadModel [path to saved model]. For example, to load an existing, previously trained model, train it further using a new text corpus, and output a new ontolection:
ontolectiontrainer --trainOntolection --loadModel [path to saved model] --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection]
Loading a saved model also lets you change the output ontolection without retraining the model. For example, to output 5 related terms per vocabulary word:
ontolectiontrainer --trainOntolection --loadModel [path to saved model] --outputPath [desired filename of ontolection] --numNearestNeighbors 5
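The train-once, re-output-many-times workflow described above can be sketched as a small script. This is only an illustration under stated assumptions: the paths, the output filenames, and the `run` dry-run helper are hypothetical, and only options already shown in this section are used. By default the script prints the commands instead of executing them; set DRY_RUN=0 to run them for real.

```shell
#!/bin/sh
# Hypothetical paths; replace them with your own.
CORPUS="path/to/corpus.txt"
PEAR="path/to/language.pear"
MODEL="path/to/saved_model"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Step 1: train once and save the language model to disk.
run ontolectiontrainer --trainOntolection --corpus "$CORPUS" --pear "$PEAR" \
    --persistModel "$MODEL" --outputPath "ontolection_default"

# Step 2: reuse the saved model to write ontolections with
# different numbers of related terms, without retraining.
for n in 5 10; do
  run ontolectiontrainer --trainOntolection --loadModel "$MODEL" \
      --outputPath "ontolection_${n}" --numNearestNeighbors "$n"
done
```

Keeping the saved model separate from the output ontolections means the expensive training step runs once, and each later variation only needs the load-and-output command.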
Blacklisting Vocabulary for Ontolection Training
Ontolection Trainer supports blacklisting vocabulary. Use the --blacklist option to pass it the path to a file containing blacklisted words, one per line; words in the blacklist are omitted entirely from the generated ontolection.
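As a minimal sketch of preparing such a file, the script below writes a few arbitrary example words and then removes empty lines and duplicates so that each word appears exactly once, one per line. The filenames and the example words are hypothetical, and the cleanup step is a convenience rather than a documented requirement of Ontolection Trainer.

```shell
#!/bin/sh
# Write a raw word list, one word per line (example words are arbitrary).
printf '%s\n' "the" "and" "" "the" "of" > blacklist_raw.txt

# Drop empty lines and repeated words, keeping the original order.
awk 'NF && !seen[$0]++' blacklist_raw.txt > blacklist.txt

cat blacklist.txt

# The cleaned file can then be passed with the --blacklist option, e.g.:
# ontolectiontrainer --trainOntolection --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection] --blacklist blacklist.txt
```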
Whitelisting Vocabulary for Ontolection Training
Ontolection Trainer supports whitelisting vocabulary. Use the --whitelist option to pass it the path to a file containing whitelisted words, one per line; words not in the whitelist are omitted entirely from the generated ontolection.
Generating a Vocabulary Whitelist from a Watson Explorer Engine dictionary
- Turn on term expansion dictionary generation for your Watson Explorer search collection and refresh your crawl.
- Locate the generated term expansion dictionary's XML file in your Watson Explorer deployment; it is in the expansions directory under your search collection's crawl directory, and its filename is an alphanumeric hash.
- Extract the words from this hash file and put them in a text file; this will be your whitelist. (For example, cat [some_hash_filename] | awk '{ print $1 }' > whitelist.txt.)
- Pass Ontolection Trainer this whitelist; your ontolection will only contain words found in your dictionary.
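The extraction step above can be made slightly more robust by skipping empty lines and removing duplicate words. The sketch below assumes, as the awk example in the list does, that each word is the first whitespace-separated field of a line; the sample file and its contents are made up for illustration, and the real dictionary file's layout may differ.

```shell
#!/bin/sh
# Stand-in for the dictionary file; the real file is named by a hash
# and its layout may differ from this made-up sample.
printf '%s\n' "engine 0.9" "search 0.8" "" "engine 0.7" > sample_dictionary.txt

# Take the first field of each non-empty line, then sort and de-duplicate.
# LC_ALL=C keeps the sort order independent of the local locale.
awk 'NF { print $1 }' sample_dictionary.txt | LC_ALL=C sort -u > whitelist.txt

cat whitelist.txt
```

The resulting whitelist.txt holds one word per line, which is the form the --whitelist option expects.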
Acronym Extraction
Ontolection Trainer implements a rules-based approach to learn acronyms from a text corpus and output an ontolection containing their expansions. Currently, acronym extraction has only been tested with English.
ontolectiontrainer --extractAcronyms --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of ontolection]
Phrase Extraction
Ontolection Trainer implements a statistical approach to extract phrases. When you pass the --learnPhrases command line option, Ontolection Trainer uses statistical methods to analyze a plain text corpus (passed using --corpus) and extract word phrases.
To process a text corpus and output a list of phrases, you must also specify a PEAR file corresponding to the language the corpus is written in. An example with an English text corpus:
java -jar ontolectiontrainer.jar --learnPhrases --corpus [path to corpus] --pear [path to PEAR file] --outputPath [desired filename of output]