Entity-mentions
At a glance
The entity-mentions task encapsulates algorithms for extracting mentions of entities (such as persons, organizations, and dates) from the input text. The task offers implementations of strong entity extraction algorithms from each of three families: rule-based, classic ML, and deep learning.
Class definitions |
---|
watson_nlp.blocks.entity_mentions.rbr.RBR |
watson_nlp.workflows.entity_mentions.sire.SIRE |
watson_nlp.workflows.entity_mentions.bilstm.BiLSTM |
watson_nlp.workflows.entity_mentions.bert.BERT |
watson_nlp.workflows.entity_mentions.transformer.Transformer |
For language support, see Supported languages.
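If you work with these classes directly in a Python environment (for example, a Watson Studio notebook), the typical pattern is to load a pretrained model and run it on text. The following is a minimal sketch, assuming the watson_nlp library's load()/run() convention; the model name shown is illustrative, so substitute one that is available in your environment.
import watson_nlp

# Load a pretrained entity-mentions workflow (illustrative model ID; replace
# with one available in your environment).
entity_model = watson_nlp.load("entity-mentions_bert-workflow_lang_multi_stock")

# Run entity extraction on a short text and inspect the predicted mentions.
mentions = entity_model.run("IBM announced a partnership with NASA on Monday.")
print(mentions)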
Algorithms available
This table provides a brief overview of each algorithm and the features it uses.
Block name | Algorithm | Features |
---|---|---|
rbr | Rule-based algorithm expressed in AQL | Any construct available in AQL |
sire | Maximum Entropy; CRF | Linguistic token; Dictionaries; Regular expressions |
bilstm | BiLSTM | Linguistic token; GloVe embeddings; Character embeddings |
bert | BERT | Linguistic token; BERT word pieces; Google Multilingual BERT-Base model, Cased (104 languages) |
transformer | Transformer | Linguistic token; Supports any Watson NLP pretrained model of type transformer, such as IBM Slate models, or any transformer from the HuggingFace library |
Pretrained models
Several pretrained models are available for common entities such as persons, organizations, and dates. Model names are listed below.
Model ID | Container Image |
---|---|
BERT models | |
entity-mentions_bert-workflow_lang_multi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bert-workflow_lang_multi_stock:1.4.1 |
BiLSTM models | |
entity-mentions_bilstm-workflow_lang_ar_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ar_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_de_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_de_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_en_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_en_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_es_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_es_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_fr_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_fr_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_it_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_it_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_ja_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ja_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_ko_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ko_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_nl_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_nl_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_pt_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_pt_stock:1.4.1 |
entity-mentions_bilstm-workflow_lang_zh-cn_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_zh-cn_stock:1.4.1 |
ensemble-workflow | |
entity-mentions_ensemble-workflow_lang_multi_distilwatbert | cp.icr.io/cp/ai/watson-nlp_entity-mentions_ensemble-workflow_lang_multi_distilwatbert:1.4.1 |
entity-mentions_ensemble-workflow_lang_multi_distilwatbert-cpu | cp.icr.io/cp/ai/watson-nlp_entity-mentions_ensemble-workflow_lang_multi_distilwatbert-cpu:1.4.1 |
RBR models | |
entity-mentions_rbr_lang_ar_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ar_stock:1.4.1 |
entity-mentions_rbr_lang_cs_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_cs_stock:1.4.1 |
entity-mentions_rbr_lang_da_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_da_stock:1.4.1 |
entity-mentions_rbr_lang_de_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_de_stock:1.4.1 |
entity-mentions_rbr_lang_en_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_en_stock:1.4.1 |
entity-mentions_rbr_lang_es_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_es_stock:1.4.1 |
entity-mentions_rbr_lang_fi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_fi_stock:1.4.1 |
entity-mentions_rbr_lang_fr_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_fr_stock:1.4.1 |
entity-mentions_rbr_lang_he_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_he_stock:1.4.1 |
entity-mentions_rbr_lang_hi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_hi_stock:1.4.1 |
entity-mentions_rbr_lang_it_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_it_stock:1.4.1 |
entity-mentions_rbr_lang_ja_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ja_stock:1.4.1 |
entity-mentions_rbr_lang_ko_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ko_stock:1.4.1 |
entity-mentions_rbr_lang_nb_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nb_stock:1.4.1 |
entity-mentions_rbr_lang_nl_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nl_stock:1.4.1 |
entity-mentions_rbr_lang_nn_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nn_stock:1.4.1 |
entity-mentions_rbr_lang_pl_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_pl_stock:1.4.1 |
entity-mentions_rbr_lang_pt_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_pt_stock:1.4.1 |
entity-mentions_rbr_lang_ro_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ro_stock:1.4.1 |
entity-mentions_rbr_lang_ru_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ru_stock:1.4.1 |
entity-mentions_rbr_lang_sk_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_sk_stock:1.4.1 |
entity-mentions_rbr_lang_sv_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_sv_stock:1.4.1 |
entity-mentions_rbr_lang_tr_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_tr_stock:1.4.1 |
entity-mentions_rbr_lang_zh-cn_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_zh-cn_stock:1.4.1 |
entity-mentions_rbr_lang_zh-tw_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_zh-tw_stock:1.4.1 |
SIRE models | |
entity-mentions_sire-workflow_lang_en_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_sire-workflow_lang_en_stock:1.4.1 |
Transformer models | |
entity-mentions_transformer-workflow_lang_multi_distilwatbert | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_distilwatbert:1.4.1 |
entity-mentions_transformer-workflow_lang_multi_distilwatbert-cpu | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_distilwatbert-cpu:1.4.1 |
entity-mentions_transformer-workflow_lang_multi_stock | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_stock:1.4.1 |
entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled:1.4.1 |
entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu:1.4.1 |
entity-mentions_transformer-workflow_lang_multilingual_slate.270m | cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.270m:1.4.1 |
Entity models (PII) | |
entity-mentions_rbr_lang_multi_pii | cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_multi_pii:1.4.1 |
For details of the Entity-mention type system, see Understanding model type systems.
The generic entity models
The models for entity type systems have been trained and tested on labeled data from news reports. These models have two parts:
- A rule-based model (the rbr models), which handles syntactically regular entity types such as number, email, and phone.
- A model trained on labeled data for the more complex entity types such as person, organization, or location.
The rbr, sire, and bilstm models are monolingual: each model analyzes input text in a single language.
The bert model is multilingual: the single model can analyze input texts from multiple languages.
The bilstm models use GloVe embeddings trained on the Wikipedia corpus in each language.
The bert model uses the Google Multilingual BERT-Base model (Cased, 104 languages).
The entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled model is optimized for GPU but also supports CPU usage. The entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu model is CPU-only and is optimized for better speed on CPU. Both models are trained on 24 languages and are based on the IBM Slate multilingual pretrained resource.
All models output non-overlapping entity mention spans. That is, each character in the input text belongs to at most one entity mention, so mentions never overlap.
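As a concrete illustration of this invariant, the following sketch (a hypothetical check, assuming mentions shaped like the REST response shown later on this page) verifies that no two returned spans overlap.
# Hypothetical check of the non-overlap invariant. Assumes mentions are dicts
# shaped like the REST response below: {"span": {"begin": ..., "end": ...}, ...}
def spans_do_not_overlap(mentions):
    spans = sorted((m["span"]["begin"], m["span"]["end"]) for m in mentions)
    # Each span must end at or before the point where the next one begins.
    return all(prev_end <= next_begin
               for (_, prev_end), (next_begin, _) in zip(spans, spans[1:]))

print(spans_do_not_overlap([
    {"span": {"begin": 12, "end": 24}},
    {"span": {"begin": 30, "end": 36}},
]))  # True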
The PII entity models
The PII models recognize personally identifiable information such as person names, Social Security numbers (SSNs), bank account numbers, and credit card numbers.
Due to the nature of PII, it is difficult to train machine learning models for the majority of PII types, especially credit card numbers, passport numbers, and other identifiers. Therefore, the PII model has two parts:
- A rule-based model that handles the majority of the types by identifying common formats of PII entities and performing checksum or other validations as appropriate for each entity type. For example, credit card number candidates are validated using the Luhn algorithm (see the sketch after this list).
- A model trained on labeled data for types where labeled data can be obtained, such as person and location. For this, use one of the models available for the Entity-mention V2 type system.
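For reference, the Luhn check itself is a simple checksum. The sketch below shows the standard algorithm for illustration only; it is not the library's internal implementation, and the sample number is a well-known test value, not a real card number.
# Standard Luhn checksum, shown for illustration only (not the library's
# internal implementation).
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Double every second digit from the right, subtracting 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True; a well-known test number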
Creating a custom model
For the current release of Watson Natural Language Processing Library for Embed, you can work with Python notebooks in Watson Studio to train some Watson NLP models with your own data. See Creating custom models for information.
Running models
The Entity-mentions model request accepts the following fields:
Field | Type | Required / Optional / Repeated | Description |
---|---|---|---|
raw_document | watson_core_data_model.nlp.RawDocument | required | The input document on which to perform entity analysis |
language_code | str | optional | Language code corresponding to the text of the raw_document |
Example requests
REST API
curl -s \
"http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/EntityMentionsPredict" \
-H "accept: application/json" \
-H "content-type: application/json" \
-H "Grpc-Metadata-mm-model-id: entity-mentions_rbr_lang_multi_pii" \
-d '{ "raw_document": { "text": "My email is john@ibm.com." }, "language_code": "en" }'
Response
{"mentions":[
{"span":{
"begin":12,
"end":24,
"text":"john@ibm.com"
},
"type":"EmailAddress",
"producerId":{
"name":"RBR mentions",
"version":"0.0.1"
},
"confidence":0.8,
"mentionType":"MENTT_UNSET",
"mentionClass":"MENTC_UNSET",
"role":""
}
],
"producerId":{
"name":"RBR mentions",
"version":"0.0.1"
}
}
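If you prefer to issue the same REST call from a script instead of curl, a minimal Python sketch using the requests library (an assumption; it is not shipped with the product) looks like this. The endpoint, headers, and payload mirror the curl example above.
import requests

# Same REST request as the curl example above. The requests library is an
# assumption here; install it separately if needed.
resp = requests.post(
    "http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/EntityMentionsPredict",
    headers={
        "accept": "application/json",
        "content-type": "application/json",
        "Grpc-Metadata-mm-model-id": "entity-mentions_rbr_lang_multi_pii",
    },
    json={"raw_document": {"text": "My email is john@ibm.com."}, "language_code": "en"},
)

# Print the text, type, and confidence of each detected mention.
for mention in resp.json()["mentions"]:
    print(mention["span"]["text"], mention["type"], mention["confidence"])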
Python
import grpc
from watson_nlp_runtime_client import (
    common_service_pb2,
    common_service_pb2_grpc,
    syntax_types_pb2,
)

# Connect to the gRPC endpoint exposed by the Watson NLP Runtime container
channel = grpc.insecure_channel("localhost:8085")
stub = common_service_pb2_grpc.NlpServiceStub(channel)

# Build the Entity-mentions request from a raw document and a language code
request = common_service_pb2.EntityMentionsRequest(
    raw_document=syntax_types_pb2.RawDocument(text="My email is john@ibm.com"),
    language_code="en",
)

# Select the model via the mm-model-id metadata header and run the prediction
response = stub.EntityMentionsPredict(
    request, metadata=[("mm-model-id", "entity-mentions_rbr_lang_multi_pii")]
)
print(response)
Response
mentions {
span {
begin: 12
end: 24
text: "john@ibm.com"
}
type: "EmailAddress"
producer_id {
name: "RBR mentions"
version: "0.0.1"
}
confidence: 0.8
}
producer_id {
name: "RBR mentions"
version: "0.0.1"
}
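If you need individual fields rather than the full printed message, the same values are available as attributes on the response object. A short usage sketch based on the response structure shown above:
# Iterate over the predicted mentions and print selected fields; attribute
# names mirror the response structure shown above.
for mention in response.mentions:
    print(mention.span.text, mention.type, mention.confidence)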