Entity-mentions

At a glance

The entity-mentions task encapsulates algorithms for extracting mentions of entities (persons, organizations, dates, and so on) from input text. The task offers implementations of strong entity extraction algorithms from each of three families: rule-based, classic machine learning, and deep learning.

Class definitions
watson_nlp.blocks.entity_mentions.rbr.RBR
watson_nlp.workflows.entity_mentions.sire.SIRE
watson_nlp.workflows.entity_mentions.bilstm.BiLSTM
watson_nlp.workflows.entity_mentions.bert.BERT
watson_nlp.workflows.entity_mentions.transformer.Transformer

For language support, see Supported languages.

Algorithms available

The following list gives a brief overview of each block's algorithm and the features it uses.

rbr: Rule-based algorithm expressed in AQL. Features: any construct available in AQL.
sire: Maximum Entropy and CRF. Features: linguistic tokens, dictionaries, regular expressions.
bilstm: BiLSTM. Features: linguistic tokens, GloVe embeddings, character embeddings.
bert: BERT. Features: linguistic tokens, BERT word pieces, the Google multilingual BERT-Base cased model (104 languages).
transformer: Transformer. Features: linguistic tokens; supports any Watson NLP pretrained model of type transformer, such as the IBM Slate models, or any transformer from the Hugging Face library.

Pretrained models

Several pretrained models are available for common entities such as persons, organizations, and dates. Model names are listed below.

Model ID Container Image
BERT models
entity-mentions_bert-workflow_lang_multi_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bert-workflow_lang_multi_stock:1.4.1
BiLSTM models
entity-mentions_bilstm-workflow_lang_ar_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ar_stock:1.4.1
entity-mentions_bilstm-workflow_lang_de_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_de_stock:1.4.1
entity-mentions_bilstm-workflow_lang_en_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_en_stock:1.4.1
entity-mentions_bilstm-workflow_lang_es_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_es_stock:1.4.1
entity-mentions_bilstm-workflow_lang_fr_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_fr_stock:1.4.1
entity-mentions_bilstm-workflow_lang_it_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_it_stock:1.4.1
entity-mentions_bilstm-workflow_lang_ja_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ja_stock:1.4.1
entity-mentions_bilstm-workflow_lang_ko_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_ko_stock:1.4.1
entity-mentions_bilstm-workflow_lang_nl_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_nl_stock:1.4.1
entity-mentions_bilstm-workflow_lang_pt_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_pt_stock:1.4.1
entity-mentions_bilstm-workflow_lang_zh-cn_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_bilstm-workflow_lang_zh-cn_stock:1.4.1
Ensemble models
entity-mentions_ensemble-workflow_lang_multi_distilwatbert cp.icr.io/cp/ai/watson-nlp_entity-mentions_ensemble-workflow_lang_multi_distilwatbert:1.4.1
entity-mentions_ensemble-workflow_lang_multi_distilwatbert-cpu cp.icr.io/cp/ai/watson-nlp_entity-mentions_ensemble-workflow_lang_multi_distilwatbert-cpu:1.4.1
RBR models
entity-mentions_rbr_lang_ar_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ar_stock:1.4.1
entity-mentions_rbr_lang_cs_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_cs_stock:1.4.1
entity-mentions_rbr_lang_da_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_da_stock:1.4.1
entity-mentions_rbr_lang_de_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_de_stock:1.4.1
entity-mentions_rbr_lang_en_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_en_stock:1.4.1
entity-mentions_rbr_lang_es_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_es_stock:1.4.1
entity-mentions_rbr_lang_fi_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_fi_stock:1.4.1
entity-mentions_rbr_lang_fr_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_fr_stock:1.4.1
entity-mentions_rbr_lang_he_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_he_stock:1.4.1
entity-mentions_rbr_lang_hi_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_hi_stock:1.4.1
entity-mentions_rbr_lang_it_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_it_stock:1.4.1
entity-mentions_rbr_lang_ja_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ja_stock:1.4.1
entity-mentions_rbr_lang_ko_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ko_stock:1.4.1
entity-mentions_rbr_lang_nb_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nb_stock:1.4.1
entity-mentions_rbr_lang_nl_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nl_stock:1.4.1
entity-mentions_rbr_lang_nn_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_nn_stock:1.4.1
entity-mentions_rbr_lang_pl_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_pl_stock:1.4.1
entity-mentions_rbr_lang_pt_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_pt_stock:1.4.1
entity-mentions_rbr_lang_ro_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ro_stock:1.4.1
entity-mentions_rbr_lang_ru_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_ru_stock:1.4.1
entity-mentions_rbr_lang_sk_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_sk_stock:1.4.1
entity-mentions_rbr_lang_sv_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_sv_stock:1.4.1
entity-mentions_rbr_lang_tr_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_tr_stock:1.4.1
entity-mentions_rbr_lang_zh-cn_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_zh-cn_stock:1.4.1
entity-mentions_rbr_lang_zh-tw_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_zh-tw_stock:1.4.1
SIRE models
entity-mentions_sire-workflow_lang_en_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_sire-workflow_lang_en_stock:1.4.1
Transformer models
entity-mentions_transformer-workflow_lang_multi_distilwatbert cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_distilwatbert:1.4.1
entity-mentions_transformer-workflow_lang_multi_distilwatbert-cpu cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_distilwatbert-cpu:1.4.1
entity-mentions_transformer-workflow_lang_multi_stock cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multi_stock:1.4.1
entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled:1.4.1
entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu:1.4.1
entity-mentions_transformer-workflow_lang_multilingual_slate.270m cp.icr.io/cp/ai/watson-nlp_entity-mentions_transformer-workflow_lang_multilingual_slate.270m:1.4.1
Entity models (PII)
entity-mentions_rbr_lang_multi_pii cp.icr.io/cp/ai/watson-nlp_entity-mentions_rbr_lang_multi_pii:1.4.1

For details of the Entity-mention type system, see Understanding model type systems.

The generic entity models

The models for entity type systems have been trained and tested on labeled data from news reports. These models have two parts:

  • A rule-based model (the rbr models), which handles syntactically regular entity types such as number, email address, and phone number.

  • A model trained on labeled data for the more complex entity types such as person, organization, or location.

The rbr, sire, and bilstm models are monolingual: each model knows how to analyze input text in a single language.

The bert model is multilingual: the single model can analyze input texts from multiple languages.

The bilstm models use GloVe embeddings trained on the Wikipedia corpus in each language.

The bert model uses the Google multilingual BERT-Base cased model (104 languages).

The entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled model is optimized for GPU but also supports CPU usage. entity-mentions_transformer-workflow_lang_multilingual_slate.153m.distilled-cpu is a CPU-only model, optimized for better speed on CPU. These models are trained on 24 languages and are based on the IBM Slate multilingual pretrained resource.

All models output non-overlapping entity mention spans. That is, each character in the input text belongs either to no entity type or to exactly one entity type; there are no overlapping entities.
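This guarantee is easy to verify on a model's output. The helper below is not part of the Watson NLP API, just a minimal sketch that checks a list of (begin, end) spans (end exclusive, matching the span offsets in the response format) for overlap:

```python
def spans_non_overlapping(spans):
    """Return True if no two (begin, end) spans share a character.

    Spans use exclusive end offsets, so adjacent spans like
    (0, 5) and (5, 9) do not overlap.
    """
    ordered = sorted(spans)
    return all(
        prev_end <= begin
        for (_, prev_end), (begin, _) in zip(ordered, ordered[1:])
    )

# The span from the example response below: "john@ibm.com" at 12..24.
assert spans_non_overlapping([(12, 24), (30, 35)])
assert not spans_non_overlapping([(0, 5), (3, 8)])
```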

The PII entity models

The PII models recognize personally identifiable information (PII), such as person names, Social Security numbers, bank account numbers, and credit card numbers.

Due to the nature of PII, it is difficult to obtain labeled training data for most PII types, especially credit card numbers, passport numbers, and other identifiers. Therefore, the PII model has two parts:

  • A rule-based model handles the majority of the types by identifying common formats of PII entities and performing checksum validation as appropriate for each entity type. For example, credit card number candidates are validated using the Luhn algorithm.

  • A model trained on labeled data for types where labeled data can be obtained, such as person and location. For this, use one of the models available for the Entity-mention V2 type system.
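The Luhn checksum mentioned above is a standard public algorithm; the sketch below illustrates it (it is not the library's internal implementation):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: working from the rightmost digit, double every
    second digit, subtract 9 from any double above 9, and require
    the total to be divisible by 10."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# "4539 1488 0343 6467" is a commonly used Luhn-valid test number.
assert luhn_valid("4539 1488 0343 6467")
assert not luhn_valid("4539 1488 0343 6468")
```

A rule-based PII extractor uses a check like this to discard digit sequences that merely look like card numbers.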

Creating a custom model

For the current release of Watson Natural Language Processing Library for Embed, you can work with Python notebooks in Watson Studio to train some Watson NLP models with your own data. See Creating custom models for information.

Running models

The Entity-mentions model request accepts the following fields:

raw_document (watson_core_data_model.nlp.RawDocument, required): The input document on which to perform entity analysis.
language_code (str, optional): Language code corresponding to the text of the raw_document.

Example requests

REST API

curl -s \
  "http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/EntityMentionsPredict" \
  -H "accept: application/json" \
  -H "content-type: application/json" \
  -H "Grpc-Metadata-mm-model-id: entity-mentions_rbr_lang_multi_pii" \
  -d '{ "raw_document": { "text": "My email is john@ibm.com." }, "language_code": "en" }'

Response

{"mentions":[
  {"span":{
    "begin":12,
    "end":24,
    "text":"john@ibm.com"
    },
   "type":"EmailAddress",
   "producerId":{
    "name":"RBR mentions",
    "version":"0.0.1"
    },
   "confidence":0.8,
   "mentionType":"MENTT_UNSET",
   "mentionClass":"MENTC_UNSET",
   "role":""
   }
   ],
   "producerId":{
    "name":"RBR mentions",
    "version":"0.0.1"
   }
  }
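The JSON response above can be consumed with the standard library alone. The snippet below parses that exact payload and pulls out each mention's type, text, and confidence:

```python
import json

# The REST response shown above, as returned for the PII model request.
response_json = (
    '{"mentions":[{"span":{"begin":12,"end":24,"text":"john@ibm.com"},'
    '"type":"EmailAddress","producerId":{"name":"RBR mentions","version":"0.0.1"},'
    '"confidence":0.8,"mentionType":"MENTT_UNSET","mentionClass":"MENTC_UNSET",'
    '"role":""}],"producerId":{"name":"RBR mentions","version":"0.0.1"}}'
)

payload = json.loads(response_json)

# Collect (type, text, confidence) for each detected mention.
found = [
    (m["type"], m["span"]["text"], m["confidence"])
    for m in payload["mentions"]
]
print(found)  # [('EmailAddress', 'john@ibm.com', 0.8)]
```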

Python

import grpc

from watson_nlp_runtime_client import (
    common_service_pb2,
    common_service_pb2_grpc,
    syntax_types_pb2,
)

channel = grpc.insecure_channel("localhost:8085")

stub = common_service_pb2_grpc.NlpServiceStub(channel)

request = common_service_pb2.EntityMentionsRequest(
    raw_document=syntax_types_pb2.RawDocument(text="My email is john@ibm.com"),
    language_code='en'
)

response = stub.EntityMentionsPredict(
    request, metadata=[("mm-model-id", "entity-mentions_rbr_lang_multi_pii")]
)

print(response)

Response

mentions {
  span {
    begin: 12
    end: 24
    text: "john@ibm.com"
  }
  type: "EmailAddress"
  producer_id {
    name: "RBR mentions"
    version: "0.0.1"
  }
  confidence: 0.8
}
producer_id {
  name: "RBR mentions"
  version: "0.0.1"
}
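Rather than printing the whole protobuf message, you will usually post-process it. The sketch below filters mentions by confidence; it relies only on the field names visible in the response above, and uses SimpleNamespace as a stand-in for the real response message so the example is self-contained:

```python
from types import SimpleNamespace


def high_confidence_mentions(response, threshold=0.5):
    """Return (type, text) pairs for mentions at or above the threshold."""
    return [
        (m.type, m.span.text)
        for m in response.mentions
        if m.confidence >= threshold
    ]


# SimpleNamespace mimics the response message's field structure
# (mentions[].type, mentions[].span.text, mentions[].confidence).
fake_response = SimpleNamespace(mentions=[
    SimpleNamespace(
        type="EmailAddress",
        span=SimpleNamespace(begin=12, end=24, text="john@ibm.com"),
        confidence=0.8,
    ),
])

print(high_confidence_mentions(fake_response))
# [('EmailAddress', 'john@ibm.com')]
```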