Extracting IS-A relationships
CraigTrim
Rules for extracting IS-A relations
Searching for these patterns in unstructured text is a good way of finding is-a relationships.
Y such as X ( (, X)* (, and|or) X)*
to either i/o devices such as disk tape or optical drives or i/o cards.
Parsing this sentence for the pattern
Y such as X (or X)*
yields
disk tape IS-A i/o device
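As a rough illustration, the first stage of such a match can be sketched with a regular expression. This is a minimal sketch, not a full implementation: the function name `split_such_as` and the connector regex are assumptions made for the example. It only splits the sentence around "such as" and breaks the right-hand side into X candidates; deciding how many tokens belong to X and Y is a separate problem.

```python
import re

# Connectors allowed between X items: ", and" / ", or" / "," / "and" / "or".
_CONNECTOR = re.compile(r"\s*,\s*(?:and|or)\s+|\s*,\s*|\s+(?:and|or)\s+")


def split_such_as(sentence):
    """Return (left_context, [x_candidates]) for the first 'such as', or None."""
    left, sep, right = sentence.partition(" such as ")
    if not sep:
        return None  # the pattern does not occur in this sentence
    right = right.rstrip(" .;:!?")  # drop sentence-final punctuation
    xs = [x.strip() for x in _CONNECTOR.split(right) if x.strip()]
    return left.strip(), xs


sentence = "to either i/o devices such as disk tape or optical drives or i/o cards."
print(split_such_as(sentence))
# ('to either i/o devices', ['disk tape', 'optical drives', 'i/o cards'])
```

Note that the left context still contains "to either", and the X candidates are still plural. Both are left untouched here on purpose.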
1. The parsing solution must rely on part-of-speech tagging to retrieve the correct number of tokens for both X and Y in the patterns above. If only a single token is retrieved for X and Y, the example above would yield (tape IS-A i/o) instead of (disk tape IS-A i/o device). I recommend using a part-of-speech tagger to find the simple forms [1] for each token, and then using only adjectives and nouns for the X and Y values. Some tweaking of this algorithm may yield more favorable results depending on the corpus being parsed.
2. Normalization of tokens. I use normalization as an umbrella term covering lemmatization, stemming, and any other morphological processing needed to yield the most favorable results. Without normalization, the above example would yield plural forms (optical drives IS-A i/o devices).
3. The patterns above will not yield perfect results. Rather than tweaking and over-fitting the pattern matcher against the training set, run it over the largest volume of data possible. False positives should occur less frequently, while true hypernyms should be the most frequent, and therefore the most likely.
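The three notes above can be sketched together in a few lines of Python. Everything here is an assumption made for illustration: the input is hand-tagged with Penn Treebank tags rather than produced by a real tagger, `normalize` is a toy stand-in for a proper lemmatizer or stemmer, and the function names are hypothetical. The sketch keeps only runs of adjectives and nouns around "such as", normalizes them, and counts the resulting pairs.

```python
from collections import Counter

# Penn Treebank tags treated as "adjective or noun" (an assumption of this sketch).
ADJ_NOUN = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}
CONNECTORS = {",", "and", "or"}


def leading_phrase(tagged):
    """Longest run of adjectives/nouns at the start of a tagged token list."""
    phrase = []
    for token, tag in tagged:
        if tag not in ADJ_NOUN:
            break
        phrase.append(token)
    return phrase


def trailing_phrase(tagged):
    """Longest run of adjectives/nouns at the end of a tagged token list."""
    return leading_phrase(tagged[::-1])[::-1]


def normalize(tokens):
    """Toy normalization: lowercase, then strip a plural 's' from the last token."""
    words = [t.lower() for t in tokens]
    if words[-1].endswith("s") and not words[-1].endswith("ss"):
        words[-1] = words[-1][:-1]
    return " ".join(words)


def extract(tagged):
    """Return (X, Y) pairs, read as 'X IS-A Y', for 'Y such as X (or X)*'."""
    pairs = []
    words = [token.lower() for token, _ in tagged]
    for i in range(len(words) - 1):
        if (words[i], words[i + 1]) != ("such", "as"):
            continue
        y = trailing_phrase(tagged[:i])  # Y sits directly before "such as"
        pos = i + 2
        while y:
            x = leading_phrase(tagged[pos:])
            if not x:
                break
            pairs.append((normalize(x), normalize(y)))
            pos += len(x)
            # Step over ",", "and", "or" (or ", and") before looking for another X.
            while pos < len(tagged) and words[pos] in CONNECTORS:
                pos += 1
    return pairs


# Hand-tagged version of the example sentence (the tags are assumptions).
EXAMPLE = [
    ("to", "TO"), ("either", "CC"), ("i/o", "NN"), ("devices", "NNS"),
    ("such", "JJ"), ("as", "IN"), ("disk", "NN"), ("tape", "NN"),
    ("or", "CC"), ("optical", "JJ"), ("drives", "NNS"),
    ("or", "CC"), ("i/o", "NN"), ("cards", "NNS"), (".", "."),
]

corpus = [EXAMPLE]  # in practice: as many tagged sentences as possible
counts = Counter()
for tagged_sentence in corpus:
    counts.update(extract(tagged_sentence))

for (x, y), n in counts.most_common():
    print(f"{x} IS-A {y} ({n})")
```

On the example sentence this sketch also emits (i/o card IS-A i/o device), which is arguably a false positive caused by the "either ... or" construction. That is the kind of noise that frequency counting over a large corpus is meant to push down the ranking.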
Terminology:
1. Hypernym: Above Word. Car is a Hypernym of Corolla.
2. Hyponym: Below Word. Car is a Hyponym of Vehicle.
WordNet is a good source for Hypernyms and Hyponyms:
Car > Motor Vehicle > Self-Propelled Vehicle > Vehicle > Conveyance
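To make the terminology concrete, here is a minimal sketch of walking a hypernym chain. The table `HYPERNYM_OF` is a hypothetical, hand-built stand-in that contains only the terms used in this post; it does not call WordNet, and the function names are assumptions made for the example.

```python
# Hypothetical hand-built table: term -> its direct hypernym ("above word").
HYPERNYM_OF = {
    "corolla": "car",
    "car": "motor vehicle",
    "motor vehicle": "self-propelled vehicle",
    "self-propelled vehicle": "vehicle",
    "vehicle": "conveyance",
}


def hypernym_chain(term):
    """Return the term followed by each successively broader hypernym."""
    chain = [term]
    seen = {term}
    while chain[-1] in HYPERNYM_OF:
        parent = HYPERNYM_OF[chain[-1]]
        if parent in seen:  # guard against cycles in a hand-built table
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def is_a(hyponym, hypernym):
    """True if `hypernym` appears anywhere above `hyponym` in the chain."""
    return hypernym in hypernym_chain(hyponym)[1:]


print(" > ".join(hypernym_chain("car")))
# car > motor vehicle > self-propelled vehicle > vehicle > conveyance
print(is_a("corolla", "vehicle"))  # True
print(is_a("vehicle", "car"))      # False
```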
1. (Hearst, 1992): Automatic Acquisition of Hyponyms from Large Text Corpora