Detag
At a glance
The Detag task enables you to process HTML text. It removes HTML tags from the input text and returns the resulting plain text along with a character offset mapping, which can be used to map span offsets in the plain text to span offsets in the original HTML text. This is useful when you want to highlight extracted results in the original HTML text.
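As a minimal sketch of how the offset mapping might be used for highlighting, the Python below maps a plain-text span back to the HTML and wraps it in a `<mark>` element. It assumes that `offsets[i]` is the index in the original HTML of plain-text character `i`, counted in the same units as Python string indices, which is consistent with the sample responses on this page. `map_span_to_html` and `highlight` are hypothetical helper names, not part of the Watson NLP API.

```python
def map_span_to_html(offsets, begin, end):
    """Map a half-open plain-text span [begin, end) to a half-open HTML span.

    Assumption: offsets[i] is the HTML index of plain-text character i.
    """
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span is empty or outside the offset mapping")
    # Anchor the end on the last character inside the span, so that any
    # tags that follow the span in the HTML are not swept into the result.
    return offsets[begin], offsets[end - 1] + 1


def highlight(html, offsets, begin, end, tag="mark"):
    """Wrap the HTML region for a plain-text span in <tag>...</tag>."""
    html_begin, html_end = map_span_to_html(offsets, begin, end)
    return (
        f"{html[:html_begin]}<{tag}>{html[html_begin:html_end]}</{tag}>"
        f"{html[html_end:]}"
    )


html = "<html><body>The only text left</body></html>"
text = "The only text left"
offsets = list(range(12, 31))  # values taken from the sample response

begin = text.index("text")
end = begin + len("text")
print(map_span_to_html(offsets, begin, end))  # (21, 25)
print(highlight(html, offsets, begin, end))
# <html><body>The only <mark>text</mark> left</body></html>
```

A span that crosses a tag boundary in the HTML would need extra handling, since wrapping it in a single element could produce badly nested markup.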
Pretrained models
Model names are listed below. For language support, see Supported languages.
Model ID | Container Image |
---|---|
detag_rbr_lang_en_stock | cp.icr.io/cp/ai/watson-nlp_detag_rbr_lang_en_stock:1.4.1 |
Processing HTML
The Detag model request accepts the following fields:
Field | Type | Required / Optional / Repeated | Description |
---|---|---|---|
raw_document | watson_core_data_model.nlp.RawDocument | required | The input document on which to perform the detagging. An HTML-encoded string, e.g. <html><body>text</body></html> |
Example requests
REST API
curl -s \
"http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/DetagPredict" \
-H "accept: application/json" \
-H "content-type: application/json" \
-H "Grpc-Metadata-mm-model-id: detag_rbr_lang_en_stock" \
-d '{ "raw_document": { "text": "<html><body>The only text left</body></html>" } }'
Response
{
  "html": "<html><body>The only text left</body></html>",
  "text": "The only text left",
  "offsets": [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
  "producerId": null
}
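The same REST call could also be issued from Python using only the standard library. This is a sketch that mirrors the curl command above. `build_detag_request` and `detag` are hypothetical helpers, and the URL assumes the runtime is listening on localhost:8080 as in the curl example.

```python
import json
import urllib.request

DETAG_URL = (
    "http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/DetagPredict"
)


def build_detag_request(html, model_id="detag_rbr_lang_en_stock", url=DETAG_URL):
    """Build (but do not send) a POST request equivalent to the curl example."""
    payload = {"raw_document": {"text": html}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "content-type": "application/json",
            "Grpc-Metadata-mm-model-id": model_id,
        },
        method="POST",
    )


def detag(html):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_detag_request(html)) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_detag_request("<html><body>The only text left</body></html>")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the payload and headers easy to inspect without a server; `detag(...)` would only be called once the runtime is up.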
Python
import grpc
from watson_nlp_runtime_client import (
    common_service_pb2,
    common_service_pb2_grpc,
    syntax_types_pb2,
)

# Open an insecure (unencrypted) gRPC channel to the runtime and create the service stub.
channel = grpc.insecure_channel("localhost:8085")
stub = common_service_pb2_grpc.NlpServiceStub(channel)

# Wrap the HTML input in a RawDocument.
request = common_service_pb2.DetagRequest(
    raw_document=syntax_types_pb2.RawDocument(
        text="<html><body>The only text left</b></body></html>"
    ),
)

# The model is selected through the mm-model-id metadata entry.
response = stub.DetagPredict(
    request, metadata=[("mm-model-id", "detag_rbr_lang_en_stock")]
)
print(response)
Response
html: "<html><body>The only text left</b></body></html>"
text: "The only text left"
offsets: 12
offsets: 13
offsets: 14
offsets: 15
offsets: 16
offsets: 17
offsets: 18
offsets: 19
offsets: 20
offsets: 21
offsets: 22
offsets: 23
offsets: 24
offsets: 25
offsets: 26
offsets: 27
offsets: 28
offsets: 29
offsets: 30
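One way to sanity-check a response like the one above is to confirm that each plain-text character matches the HTML character its offset points to. The sketch below assumes `offsets[i]` indexes the character in `html` that produced `text[i]`. `check_alignment` is a hypothetical helper, and a mismatch need not indicate an error if the service decodes entities such as `&amp;`, which this page does not specify.

```python
def check_alignment(html, text, offsets):
    """Return the plain-text indices whose mapped HTML character differs."""
    mismatches = []
    for i, (ch, j) in enumerate(zip(text, offsets)):
        if j >= len(html) or html[j] != ch:
            mismatches.append(i)
    return mismatches


# Values copied from the sample gRPC response. With a live response,
# list(response.offsets) would give the same kind of list.
html = "<html><body>The only text left</b></body></html>"
text = "The only text left"
offsets = list(range(12, 31))

print(check_alignment(html, text, offsets))  # []
```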