wx-ai text tokenize

Shows how the provided input is converted to tokens for a given model. The tokenizer splits text into words or subwords, which are then mapped to IDs through a look-up table (the vocabulary). Splitting into subwords lets the model keep a reasonably sized vocabulary.

Syntax

cpdctl wx-ai text tokenize \
--input INPUT \
--model-id MODEL-ID \
[--cpd-scope CPD-SCOPE] \
[--parameters PARAMETERS | --parameters-return-tokens PARAMETERS-RETURN-TOKENS] \
[--project-id PROJECT-ID] \
[--space-id SPACE-ID]

Options

Table 1: Command options
Option Description
--cpd-scope (string)

The IBM Software Hub space, project, or catalog scope. For example, cpd://default-context/spaces/7bccdda4-9752-4f37-868e-891de6c48135. Optional.

Syntax: --cpd-scope=<cpd-scope>

There is no default value.

--input (string)

The input string to tokenize. Required.

--model-id (string)

The ID of the model to use for this request. For more information, see the list of models. Required.

--parameters

The parameters for text tokenization, given as JSON. Instead of this option, you can set individual fields with granular options such as --parameters-return-tokens. The JSON option and the granular options are mutually exclusive.

Provide either a JSON string or the path to a JSON file prefixed with @, for example --parameters=@path/to/file.json.
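The file-based form can be sketched as follows. This is a minimal example under assumptions: the file name tokenize-params.json and the PROJECT_ID environment variable are hypothetical placeholders, and the cpdctl call is skipped when the CLI is not installed or PROJECT_ID is unset.

```shell
# Write the tokenization parameters to a JSON file (hypothetical file name).
params_file="$(mktemp -d)/tokenize-params.json"
cat > "$params_file" <<'EOF'
{"return_tokens": true}
EOF

# Pass the file to --parameters with the @ prefix.
# PROJECT_ID is a placeholder that you must set yourself.
if command -v cpdctl >/dev/null 2>&1 && [ -n "${PROJECT_ID:-}" ]; then
  cpdctl wx-ai text tokenize \
    --model-id google/flan-ul2 \
    --input 'Write a tagline for an alumni association: Together we' \
    --project-id "$PROJECT_ID" \
    --parameters=@"$params_file"
fi
```

Keeping the parameters in a file avoids shell quoting problems with inline JSON.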

--parameters-return-tokens (Boolean)

If this option is true, the actual tokens are also returned in the response. This option provides a value for a sub-field of the JSON option 'parameters'. It is mutually exclusive with that option.

The default value is false.

--project-id (string)

The project that contains the resource. Either --space-id or --project-id must be given.

The value must be exactly 36 characters long and match the regular expression /[a-zA-Z0-9-]*/.

--space-id (string)

The space that contains the resource. Either --space-id or --project-id must be given.

The value must be exactly 36 characters long and match the regular expression /[a-zA-Z0-9-]*/.
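The length and character constraints on project and space IDs can be checked locally before calling the command. The helper below is a hypothetical sketch, not part of cpdctl. It assumes the regular expression is meant as the set of allowed characters for the whole value.

```shell
# Hypothetical helper: succeed when the value is exactly 36 characters
# long and contains only letters, digits, and hyphens.
is_valid_id() {
  case "$1" in
    *[!a-zA-Z0-9-]*) return 1 ;;  # a character outside the allowed set
  esac
  [ "${#1}" -eq 36 ]
}

is_valid_id "12ac4cf1-252f-424b-b52d-5cdd9814987f" && echo "ID looks valid"
is_valid_id "too-short" || echo "ID is not valid"
```

The check covers format only. It does not confirm that the project or space exists.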

Examples

cpdctl wx-ai text tokenize \
    --model-id google/flan-ul2 \
    --input 'Write a tagline for an alumni association: Together we' \
    --project-id 12ac4cf1-252f-424b-b52d-5cdd9814987f \
    --parameters '{"return_tokens": true}'

Alternatively, use the granular options to set the sub-fields of the JSON option:

cpdctl wx-ai text tokenize \
    --model-id google/flan-ul2 \
    --input 'Write a tagline for an alumni association: Together we' \
    --project-id 12ac4cf1-252f-424b-b52d-5cdd9814987f \
    --parameters-return-tokens true

Example output

The response contains the token count and, if requested, the tokens.

{
  "model_id" : "google/flan-ul2",
  "result" : {
    "token_count" : 11,
    "tokens" : [ "Write", "a", "tag", "line", "for", "an", "alumni", "associ", "ation:", "Together", "we" ]
  }
}
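A common use of the response is to compare token_count against a limit before sending a prompt. The sketch below reads the count from the sample response with sed. The max_tokens value is a hypothetical limit, not something this command defines, and a real script would more likely use a JSON parser.

```shell
# Sample response copied from the example output.
response='{
  "model_id" : "google/flan-ul2",
  "result" : {
    "token_count" : 11,
    "tokens" : [ "Write", "a", "tag", "line", "for", "an", "alumni", "associ", "ation:", "Together", "we" ]
  }
}'

# Extract the integer that follows "token_count".
token_count=$(printf '%s\n' "$response" |
  sed -n 's/.*"token_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')

max_tokens=512  # hypothetical limit, not taken from the command reference
if [ "$token_count" -le "$max_tokens" ]; then
  echo "within budget ($token_count of $max_tokens tokens)"
else
  echo "over budget ($token_count of $max_tokens tokens)"
fi
```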