Entity-mentions
At a glance
The entity-mentions task encapsulates algorithms for extracting mentions of entities (such as persons, organizations, and dates) from the input text. The task offers implementations of strong entity extraction algorithms from each of three families: rule-based, classic ML, and deep learning.
Class definitions |
---|
watson_nlp.blocks.entity_mentions.rbr.RBR |
watson_nlp.workflows.entity_mentions.sire.SIRE |
watson_nlp.workflows.entity_mentions.bilstm.BiLSTM |
watson_nlp.workflows.entity_mentions.bert.BERT |
watson_nlp.workflows.entity_mentions.transformer.Transformer |
For language support, see Supported languages.
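If you work with these classes directly in a Python environment (for example, a Watson Studio notebook), the typical pattern is to load a pretrained model and run it on text. The following is a minimal sketch, assuming the watson_nlp library's load()/run() convention; the model name shown is illustrative, so substitute one that is available in your environment.
import watson_nlp

# Load a pretrained entity-mentions workflow (illustrative model ID; replace
# with one available in your environment).
entity_model = watson_nlp.load("entity-mentions_bert-workflow_lang_multi_stock")

# Run entity extraction on a short text and inspect the predicted mentions.
mentions = entity_model.run("IBM announced a partnership with NASA on Monday.")
print(mentions)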
Algorithms available
This table provides a brief overview of each algorithm and the features it uses.
Block name | Algorithm | Features |
---|---|---|
rbr | Rule-based algorithm expressed in AQL | Any construct available in AQL |
sire | Maximum Entropy; CRF | Linguistic token; Dictionaries; Regular expressions |
bilstm | BiLSTM | Linguistic token; GloVe embeddings; Character embeddings |
bert | BERT | Linguistic token; BERT word pieces; Google Multilingual BERT-Base model, Cased (104 languages) |
transformer | Transformer | Linguistic token; Supports any Watson NLP pretrained model of type transformer, such as IBM Slate models, or any transformer from the HuggingFace library |
Pretrained models
Several pretrained models are available for common entities such as persons, organizations, and dates. Model names are listed below.
Model ID | Container Image |
---|---|
BERT models | |
entity-mentions_bert-workflow_lang_multi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bert-workflow_lang_multi_stock:1.4.1 |
BiLSTM models | |
entity-mentions_bilstm-workflow_lang_ar_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ar_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_de_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_de_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_en_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_en_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_es_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_es_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_fr_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_fr_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_it_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_it_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_ja_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ja_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_ko_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ko_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_nl_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_nl_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_pt_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_pt_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_zh-cn_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_zh-cn_stock:1.4.1 |
ensemble-workflow | |
entity-mentions_ensemble-workflow_lang_multi_distilwatbert | cp.icr.io/cp/ai/watson-nlp_entity-mentions_ensemble-workflow_lang_multi_distilwatbert:1.4.1 |
entity-mentions_ensemble-workflow_lang_multi_distilwatbert-cpu | cp.icr.io/cp/ai/watson-nlp_entity-mentions_ensemble-workflow_lang_multi_distilwatbert-cpu:1.4.1 |
RBR models | |
entity-mentions_rbr_lang_ar_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ar_stock:1.4.1 |
entity-mentions_rbr_lang_cs_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_cs_stock:1.4.1 |
entity-mentions_rbr_lang_da_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_da_stock:1.4.1 |
entity-mentions_rbr_lang_de_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_de_stock:1.4.1 |
entity-mentions_rbr_lang_en_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_en_stock:1.4.1 |
entity-mentions_rbr_lang_es_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_es_stock:1.4.1 |
entity-mentions_rbr_lang_fi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_fi_stock:1.4.1 |
entity-mentions_rbr_lang_fr_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_fr_stock:1.4.1 |
entity-mentions_rbr_lang_he_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_he_stock:1.4.1 |
entity-mentions_rbr_lang_hi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_hi_stock:1.4.1 |
entity-mentions_rbr_lang_it_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_it_stock:1.4.1 |
entity-mentions_rbr_lang_ja_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ja_stock:1.4.1 |
entity-mentions_rbr_lang_ko_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ko_stock:1.4.1 |
entity-mentions_rbr_lang_nb_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nb_stock:1.4.1 |
entity-mentions_rbr_lang_nl_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nl_stock:1.4.1 |
entity-mentions_rbr_lang_nn_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nn_stock:1.4.1 |
entity-mentions_rbr_lang_pl_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_pl_stock:1.4.1 |
entity-mentions_rbr_lang_pt_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_pt_stock:1.4.1 |
entity-mentions_rbr_lang_ro_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ro_stock:1.4.1 |
entity-mentions_rbr_lang_ru_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ru_stock:1.4.1 |
entity-mentions_rbr_lang_sk_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_sk_stock:1.4.1 |
entity-mentions_rbr_lang_sv_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_sv_stock:1.4.1 |
entity-mentions_rbr_lang_tr_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_tr_stock:1.4.1 |
entity-mentions_rbr_lang_zh-cn_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_zh-cn_stock:1.4.1 |
entity-mentions_rbr_lang_zh-tw_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_zh-tw_stock:1.4.1 |
SIRE models | |
entity-mentions_sire-workflow_lang_en_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_sire-workflow_lang_en_stock:1.4.1 |
Transformer models | |
entity-mentions_transformer-workflow_lang_multi_distilwatbert | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_distilwatbert:1.4.1 |
entity-mentions_transformer-workflow_lang_multi_distilwatbert-cpu | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_distilwatbert-cpu:1.4.1 |
entity-mentions_transformer-workflow_lang_multi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_stock:1.4.1 |
entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled:1.4.1 |
entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu:1.4.1 |
entity-mentions_transformer-workflow_lang_multilingual_slate.270m | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.270m:1.4.1 |
Entity models (PII) | |
entity-mentions_rbr_lang_multi_pii | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_multi_pii:1.4.1 |
For details of the Entity-mention type system, see Understanding model type systems.
The generic entity models
The models for entity type systems have been trained and tested on labeled data from news reports. These models have two parts:
- A rule-based model (the rbr models), which handles syntactically regular entity types such as number, email, and phone.
- A model trained on labeled data for the more complex entity types such as person, organization, or location.
The rbr, sire, and bilstm models are monolingual: each model analyzes input text in a single language.
The bert model is multilingual: the single model can analyze input texts from multiple languages.
The bilstm models use GloVe embeddings trained on the Wikipedia corpus in each language.
The bert model uses the Google Multilingual BERT-Base model (Cased, 104 languages).
The entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled model is optimized for GPU but also supports CPU usage. The entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu model is CPU-only and is optimized for better speed on CPU. Both models are trained on 24 languages and are based on the IBM Slate multilingual pretrained resource.
All models output non-overlapping entity mention spans. That is, each character in the input text belongs to at most one entity mention, so mentions never overlap.
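As a concrete illustration of this invariant, the following sketch (a hypothetical check, assuming mentions shaped like the REST response shown later on this page) verifies that no two returned spans overlap.
# Hypothetical check of the non-overlap invariant. Assumes mentions are dicts
# shaped like the REST response below: {"span": {"begin": ..., "end": ...}, ...}
def spans_do_not_overlap(mentions):
    spans = sorted((m["span"]["begin"], m["span"]["end"]) for m in mentions)
    # Each span must end at or before the point where the next one begins.
    return all(prev_end <= next_begin
               for (_, prev_end), (next_begin, _) in zip(spans, spans[1:]))

print(spans_do_not_overlap([
    {"span": {"begin": 12, "end": 24}},
    {"span": {"begin": 30, "end": 36}},
]))  # True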
The PII entity models
The PII models recognize personally identifiable information such as person names, Social Security numbers (SSNs), bank account numbers, and credit card numbers.
Due to the nature of PII, it is difficult to train machine learning models for the majority of PII types, especially credit card numbers, passport numbers, and other identifiers. Therefore, the PII model has two parts:
- A rule-based model that handles the majority of the types by identifying common formats of PII entities and performing checksum or other validations as appropriate for each entity type. For example, credit card number candidates are validated using the Luhn algorithm (see the sketch after this list).
- A model trained on labeled data for types where labeled data can be obtained, such as person and location. For this, use one of the models available for the Entity-mention V2 type system.
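For reference, the Luhn check itself is a simple checksum. The sketch below shows the standard algorithm for illustration only; it is not the library's internal implementation, and the sample number is a well-known test value, not a real card number.
# Standard Luhn checksum, shown for illustration only (not the library's
# internal implementation).
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Double every second digit from the right, subtracting 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True; a well-known test number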
Creating a custom model
For the current release of Watson Natural Language Processing Library for Embed, you can work with Python notebooks in Watson Studio to train some Watson NLP models with your own data. See Creating custom models for information.
Running models
The Entity-mentions model request accepts the following fields:
Field | Type | Required / Optional / Repeated | Description |
---|---|---|---|
raw_document | watson_core_data_model.nlp.RawDocument | required | The input document on which to perform entity analysis |
language_code | str | optional | Language code corresponding to the text of the raw_document |
Example requests
REST API
curl -s \
"http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/EntityMentionsPredict" \
-H "accept: application/json" \
-H "content-type: application/json" \
-H "Grpc-Metadata-mm-model-id: entity-mentions_rbr_lang_multi_pii" \
-d '{ "raw_document": { "text": "My email is john@ibm.com." }, "language_code": "en" }'
Response
{"mentions":[
{"span":{
"begin":12,
"end":24,
"text":"john@ibm.com"
},
"type":"EmailAddress",
"producerId":{
"name":"RBR mentions",
"version":"0.0.1"
},
"confidence":0.8,
"mentionType":"MENTT_UNSET",
"mentionClass":"MENTC_UNSET",
"role":""
}
],
"producerId":{
"name":"RBR mentions",
"version":"0.0.1"
}
}
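If you prefer to issue the same REST call from a script instead of curl, a minimal Python sketch using the requests library (an assumption; it is not shipped with the product) looks like this. The endpoint, headers, and payload mirror the curl example above.
import requests

# Same REST request as the curl example above. The requests library is an
# assumption here; install it separately if needed.
resp = requests.post(
    "http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/EntityMentionsPredict",
    headers={
        "accept": "application/json",
        "content-type": "application/json",
        "Grpc-Metadata-mm-model-id": "entity-mentions_rbr_lang_multi_pii",
    },
    json={"raw_document": {"text": "My email is john@ibm.com."}, "language_code": "en"},
)

# Print the text, type, and confidence of each detected mention.
for mention in resp.json()["mentions"]:
    print(mention["span"]["text"], mention["type"], mention["confidence"])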
Python
import grpc
from watson_nlp_runtime_client import (
    common_service_pb2,
    common_service_pb2_grpc,
    syntax_types_pb2,
)

# Connect to the gRPC endpoint exposed by the Watson NLP Runtime container
channel = grpc.insecure_channel("localhost:8085")
stub = common_service_pb2_grpc.NlpServiceStub(channel)

# Build the Entity-mentions request from a raw document and a language code
request = common_service_pb2.EntityMentionsRequest(
    raw_document=syntax_types_pb2.RawDocument(text="My email is john@ibm.com"),
    language_code="en",
)

# Select the model via the mm-model-id metadata header and run the prediction
response = stub.EntityMentionsPredict(
    request, metadata=[("mm-model-id", "entity-mentions_rbr_lang_multi_pii")]
)
print(response)
Response
mentions {
span {
begin: 12
end: 24
text: "john@ibm.com"
}
type: "EmailAddress"
producer_id {
name: "RBR mentions"
version: "0.0.1"
}
confidence: 0.8
}
producer_id {
name: "RBR mentions"
version: "0.0.1"
}
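If you need individual fields rather than the full printed message, the same values are available as attributes on the response object. A short usage sketch based on the response structure shown above:
# Iterate over the predicted mentions and print selected fields; attribute
# names mirror the response structure shown above.
for mention in response.mentions:
    print(mention.span.text, mention.type, mention.confidence)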