Detag

At a glance

The Detag task processes HTML text. It removes the HTML tags from the input and returns the resulting plain text together with a character offset mapping. The mapping lets you translate span offsets on the plain text into span offsets in the original HTML text, which is useful when you want to highlight extracted results in the original HTML.
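The offset mapping can be applied roughly as in the following minimal sketch. It uses the values from the sample response shown later on this page and assumes that `offsets[i]` holds the position in the original HTML of the i-th plain-text character. The helper `map_span_to_html` is a hypothetical name for illustration, not part of the Watson NLP API.

```python
def map_span_to_html(offsets, begin, end):
    """Map a half-open [begin, end) span on the plain text to a half-open span on the HTML.

    Assumption: offsets[i] is the HTML position of plain-text character i.
    """
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span is outside the detagged text")
    html_begin = offsets[begin]
    # Use the last covered character so that tags following the span are not included.
    html_end = offsets[end - 1] + 1
    return html_begin, html_end


# Values taken from the sample response on this page.
html = "<html><body>The only text left</body></html>"
text = "The only text left"
offsets = list(range(12, 31))

begin = text.index("only")
end = begin + len("only")
html_begin, html_end = map_span_to_html(offsets, begin, end)
print(html[html_begin:html_end])  # prints "only"
```

The end of the span is derived from the last covered character rather than from `offsets[end]`, so the mapped span stops before any tag that follows the text.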

Pretrained models

Model names are listed below. For language support, see Supported languages.

Model ID	Container Image
detag_rbr_lang_en_stock	cp.icr.io/cp/ai/watson-nlp_detag_rbr_lang_en_stock:1.4.1

Processing HTML

The Detag model request accepts the following fields:

Field	Type	Required / Optional / Repeated	Description
`raw_document`	`watson_core_data_model.nlp.RawDocument`	required	The input document on which to perform the detagging, given as an HTML-encoded string, for example `<html><body>text</body></html>`
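When the HTML contains quotes or other characters that need escaping, it can be easier to build the JSON request body programmatically than to write it by hand. The following is a minimal sketch using only the Python standard library. The function name `build_detag_payload` is hypothetical; the `raw_document` and `text` field names follow the request shown on this page.

```python
import json


def build_detag_payload(html: str) -> str:
    """Serialize an HTML string into the JSON body of a Detag request."""
    # json.dumps escapes quotes and backslashes inside the HTML.
    return json.dumps({"raw_document": {"text": html}})


payload = build_detag_payload('<p class="note">Hello</p>')
print(payload)
```

The resulting string can be passed as the request body, for example as the `-d` argument of a curl call.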

Example requests

REST API

curl -s \
  "http://localhost:8080/v1/watson.runtime.nlp.v1/NlpService/DetagPredict" \
  -H "accept: application/json" \
  -H "content-type: application/json" \
  -H "Grpc-Metadata-mm-model-id: detag_rbr_lang_en_stock" \
  -d '{ "raw_document": { "text": "<html><body>The only text left</body></html>" } }'

Response

{"html":"<html><body>The only text left</body></html>",
 "text":"The only text left",
 "offsets":[
  12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 "producerId":null
}
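As an illustration of how this response might be used for highlighting, the sketch below parses the JSON shown above and wraps a phrase found in the plain text with `<mark>` tags in the original HTML. It assumes that `offsets[i]` is the HTML position of the i-th character of `text`, and that the phrase is not interrupted by tags in the HTML. The `highlight` helper is hypothetical.

```python
import json

# The sample response from this page, as a JSON string.
response_body = """
{"html":"<html><body>The only text left</body></html>",
 "text":"The only text left",
 "offsets":[
  12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 "producerId":null
}
"""


def highlight(result, phrase):
    """Wrap the first occurrence of phrase in <mark> tags within the original HTML."""
    text, html, offsets = result["text"], result["html"], result["offsets"]
    begin = text.find(phrase)
    if begin < 0:
        return html  # phrase not present, return the HTML unchanged
    last = begin + len(phrase) - 1
    html_begin = offsets[begin]
    html_end = offsets[last] + 1
    return (
        html[:html_begin]
        + "<mark>" + html[html_begin:html_end] + "</mark>"
        + html[html_end:]
    )


result = json.loads(response_body)
highlighted = highlight(result, "text left")
print(highlighted)
# prints <html><body>The only <mark>text left</mark></body></html>
```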

Python

import grpc

from watson_nlp_runtime_client import (
    common_service_pb2,
    common_service_pb2_grpc,
    syntax_types_pb2,
)

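# Open a plaintext gRPC channel to the runtime (port 8085 in this example)
channel = grpc.insecure_channel("localhost:8085")

stub = common_service_pb2_grpc.NlpServiceStub(channel)

# Build the request around the HTML string to detag
request = common_service_pb2.DetagRequest(
    raw_document=syntax_types_pb2.RawDocument(text="<html><body>The only text left</b></body></html>"),
)

# The model is selected through the mm-model-id metadata entry
response = stub.DetagPredict(
    request, metadata=[("mm-model-id", "detag_rbr_lang_en_stock")]
)

print(response)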

Response

html: "<html><body>The only text left</b></body></html>"
text: "The only text left"
offsets: 12
offsets: 13
offsets: 14
offsets: 15
offsets: 16
offsets: 17
offsets: 18
offsets: 19
offsets: 20
offsets: 21
offsets: 22
offsets: 23
offsets: 24
offsets: 25
offsets: 26
offsets: 27
offsets: 28
offsets: 29
offsets: 30