Billing details for generative AI assets

Learn how usage for generative AI assets is measured by using resource units (RUs), hourly rates, or a flat rate.

Review the details for how resources are measured using:

  • Resource units to measure inferencing activities for foundation models provided by watsonx.ai.
  • Hourly rates for custom foundation models you import and deploy with watsonx.ai.
  • Hourly rates for curated foundation models deployed on demand on dedicated hardware.
  • Flat rates per page for document text classification and extraction.

A resource unit is used to measure the following resources:

  • Tokens used for inferencing a foundation model to generate text or text embeddings.
  • Data points used by a time series foundation model for forecasting future values.

Prompt Lab usage is measured by the number of processed tokens.

Tuning a model in the Tuning Studio consumes capacity units per hour (CUH). For details, see Billing details for machine learning assets.

Billing rates for inferencing multitenant foundation models

Each multitenant foundation model that is provided by IBM watsonx.ai is assigned a model-specific multiplier. Foundation model inferencing or forecasting is metered by tracking the tokens, data points, or characters in the input and output of a foundation model, or in the output of an encoder model. A token is a basic unit of text, typically 4 characters or 0.75 words; for example, a 300-word prompt is roughly 400 tokens.

For details about billing rates, see Calculating the rate of token usage per model on IBM Cloud or Calculating the rate of token usage per model on AWS.

Important: There are limits by plan on the number of inferencing requests per second that are submitted to a model. If a user exceeds an inferencing request limit, a system notification provides guidance.

For the list of supported foundation models for generating text and their prices, see Supported foundation models. For the list of supported encoder models for reranking and generating text embeddings and their prices, see Supported encoder models.

A tuned foundation model is assigned the same price as the underlying foundation model. For details about tunable foundation models, see Choosing a foundation model to tune.

Billing rates for inferencing multitenant time series foundation models

When measuring foundation model forecasting on IBM Cloud, a Resource Unit (RU) is equal to 1,000 data points in the foundation model input and output. A data point is a unit of input and output content that is expressed as one or more numbers.

When measuring foundation model usage on AWS, data points are counted in batches of 1,000. The total number of batches is then scaled by a model-specific multiplier. A Resource Unit (RU) is equal to 10,000 such scaled batches.

Billing classes by multiplier

If you are monitoring model usage with the watsonx.ai API, model prices are listed by pricing tier, as follows:

Table 1. API pricing tiers
Model pricing tier   Resource type   Price per RU in USD (IBM Cloud)   Multiplier
Class 1              Tokens          $0.0006                           6
Class 2              Tokens          $0.0018                           18
Class 3              Tokens          $0.0050                           50
Class C1             Tokens          $0.0001                           1
Class 5              Tokens          $0.00025                          2.5
Class 7              Tokens          $0.016                            160
Class 8              Tokens          $0.00015                          1.5
Class 9              Tokens          $0.00035                          3.5
Class 10             Tokens          $0.0020                           20
Class 11             Tokens          $0.000005                         0.05
Class 12             Tokens          $0.0002                           2
Class 13             Tokens          $0.00071                          7.1
Class 14             Data points     $0.00013                          1.3
Class 15             Data points     $0.00038                          3.8
Class 16             Tokens          $0.0014                           14
Class 17             Tokens          $0.0003                           3
Class 18             Tokens          $0.00006                          0.6
Note: Certain models, such as Mistral Large, have special pricing that is not assigned by a multiplier. That pricing is listed in Supported foundation models.
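The price per RU for each tier is the base price multiplied by the tier's multiplier. The following minimal Python sketch expresses that relationship; the dictionary and function names are illustrative and are not part of the watsonx.ai API.

    # Price per RU = base price x model multiplier (values from Table 1).
    BASE_PRICE_PER_RU_USD = 0.0001  # Class C1 price, multiplier 1

    TIER_MULTIPLIERS = {
        "Class 1": 6, "Class 2": 18, "Class 3": 50, "Class C1": 1,
        "Class 5": 2.5, "Class 7": 160, "Class 8": 1.5, "Class 9": 3.5,
        "Class 10": 20, "Class 11": 0.05, "Class 12": 2, "Class 13": 7.1,
        "Class 14": 1.3, "Class 15": 3.8, "Class 16": 14, "Class 17": 3,
        "Class 18": 0.6,
    }

    def price_per_ru(tier: str) -> float:
        """Return the USD price per resource unit for a pricing tier."""
        return BASE_PRICE_PER_RU_USD * TIER_MULTIPLIERS[tier]

    print(f"{price_per_ru('Class 2'):.4f}")  # 0.0018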

Calculating the rate of token usage per model on IBM Cloud

To calculate charges for foundation model inferencing, divide the total number of tokens consumed during the month by 1,000 and round up to the nearest whole number to obtain the total number of resource units (RUs). Multiply the total number of RUs by the base price per RU and by the model-specific multiplier to obtain the total usage charge. The multiplier varies by model and can also differ for the input and output tokens of the same model.

The basic formula is as follows:

Total tokens used/1000 = Resource Units (RU) consumed
RUs consumed x base price per RU x model multiplier = Total usage charge

The base price for an RU is $0.0001. The price for each foundation model is a multiple of the base price.
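The following minimal Python sketch applies this formula; the function name and the example token count are illustrative.

    import math

    BASE_PRICE_PER_RU_USD = 0.0001

    def monthly_token_charge_usd(total_tokens: int, model_multiplier: float) -> float:
        """Tokens -> RUs (rounded up) -> total usage charge in USD."""
        rus_consumed = math.ceil(total_tokens / 1000)  # 1 RU = 1,000 tokens
        return rus_consumed * BASE_PRICE_PER_RU_USD * model_multiplier

    # Example: 2,500,000 tokens on a Class 2 model (multiplier 18):
    # 2,500 RUs x $0.0001 x 18 = $4.50
    print(round(monthly_token_charge_usd(2_500_000, 18), 2))  # 4.5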

Calculating the rate of token usage per model on AWS

To calculate charges for foundation model inferencing, divide the total number of tokens consumed by 1,000 and round up to the nearest whole number to obtain the number of token batches consumed. Multiply the number of consumed batches by the model-specific multiplier to obtain the total number of token batches. Divide the total number of token batches by 10,000 to obtain the total number of RUs.

The basic formula is as follows:

Total tokens used/1000 = Batches of tokens consumed
Batches of tokens consumed x model multiplier = Total number of token batches
Total number of token batches / 10,000 = Resource Units (RUs) measured

You purchase the RUs required for your use case in advance. RUs are then consumed based on your resource usage.
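The same calculation can be expressed as a minimal Python sketch; the function name and example numbers are illustrative.

    import math

    def aws_rus_consumed(total_tokens: int, model_multiplier: float) -> float:
        """Tokens -> 1,000-token batches (rounded up) -> scaled batches -> RUs."""
        batches = math.ceil(total_tokens / 1000)
        total_token_batches = batches * model_multiplier
        return total_token_batches / 10_000  # 1 RU = 10,000 token batches

    # Example: 2,500,000 tokens with a model multiplier of 18:
    # 2,500 batches x 18 = 45,000 token batches; 45,000 / 10,000 = 4.5 RUs
    print(aws_rus_consumed(2_500_000, 18))  # 4.5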

Calculating the resource unit rate of data points per model

To calculate charges for forecasting with a time series foundation model, use the following equations:

  • Input calculation: context length x number of series x number of channels
  • Output calculation: prediction length x number of series x number of channels

These equations use the following parameters:

  • Context length refers to the number of historical data points that a time series foundation model uses as input to make predictions.
  • A series is a collection of observations made sequentially over time. For example, when comparing stock prices for many companies, the observed stock price history for each company is a separate series.
  • Channels are the specific features or variables that are measured within a time series dataset.
  • Prediction length is the number of future data points for the model to predict.

For more information about these values, see Forecast future data values.

Data point pricing
Resource type        Model pricing tier   Price per RU in USD (IBM Cloud)
Input data points    Class 14             $0.00013
Output data points   Class 15             $0.00038

The following example shows how to calculate the cost for a time series forecasting request with the following parameters:

Parameters used to calculate data point usage
Parameter                                       Example quantity
Context length (granite-ttm-1536-96-r2 model)   1,536
Channels                                        10
Series                                          1,000
Prediction length                               96

  • Total input data points: 15,360,000 (context length of 1,536 x 10 channels x 1,000 series)

    15,360,000 / 1,000 = 15,360 RUs
    15,360 RUs x $0.00013 = $1.9968

  • Total output data points: 960,000 (prediction length of 96 x 10 channels x 1,000 series)

    960,000 / 1,000 = 960 RUs
    960 RUs x $0.00038 = $0.3648

  • Total price for the time series forecast request: $2.36 (input cost $1.9968 + output cost $0.3648)

    $1.9968 + $0.3648 = $2.3616
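
The worked example can be reproduced with the following minimal Python sketch; the function and parameter names mirror this section and are illustrative.

    def forecast_cost_usd(context_length: int, prediction_length: int,
                          num_series: int, num_channels: int) -> float:
        """Cost in USD of one forecasting request on IBM Cloud."""
        input_points = context_length * num_series * num_channels
        output_points = prediction_length * num_series * num_channels
        input_cost = (input_points / 1000) * 0.00013    # Class 14 price per RU
        output_cost = (output_points / 1000) * 0.00038  # Class 15 price per RU
        return input_cost + output_cost

    # granite-ttm-1536-96-r2: context 1,536, prediction 96, 1,000 series, 10 channels
    print(round(forecast_cost_usd(1536, 96, 1000, 10), 4))  # 2.3616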
    

Hourly billing costs for custom foundation models and deploy on demand models

Billing rates depend on the model hardware configuration and cover both hosting and inferencing the model. Charges begin when the custom foundation model is successfully deployed and continue until the model is deleted.

On IBM Cloud, you are billed in USD per hour based on actual resource consumption. On AWS, you are charged in terms of RUs that are consumed by your deployment per hour.

Deploying custom foundation models and working with deploy on demand foundation models requires the Standard plan.

The following table provides billing rates to calculate the model hosting price when you specify a hardware specification to use to deploy your model:

Custom foundation model and deploy on demand model billing rates based on GPU hardware configuration
Configuration   Total GPU memory   Billing rate per hour in USD on IBM Cloud   Consumption rate per hour in RU on AWS
1 L40S GPU      48 GB              $4.43                                       4.43
2 L40S GPUs     96 GB              $8.86                                       8.86
1 A100 GPU      80 GB              $5.80                                       5.80
2 A100 GPUs     160 GB             $11.60                                      11.60
4 A100 GPUs     320 GB             $23.20                                      23.20
8 A100 GPUs     640 GB             $46.40                                      46.40
1 H100 GPU      80 GB              $14.50                                      14.50
2 H100 GPUs     160 GB             $29.00                                      29.00
4 H100 GPUs     320 GB             $58.00                                      58.00
8 H100 GPUs     640 GB             $116.00                                     116.00
1 H200 GPU      141 GB             $16.00                                      16.00
2 H200 GPUs     282 GB             $32.00                                      32.00
4 H200 GPUs     564 GB             $64.00                                      64.00
8 H200 GPUs     1,128 GB           $128.00                                     128.00
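
Because charges accrue hourly from deployment until deletion, the hosting cost is the hourly rate multiplied by the hours the model stays deployed. The following minimal Python sketch uses the IBM Cloud rates from the table above; the dictionary and function names are illustrative.

    GPU_RATE_USD_PER_HOUR = {
        "1 L40S GPU": 4.43, "2 L40S GPUs": 8.86,
        "1 A100 GPU": 5.80, "2 A100 GPUs": 11.60,
        "4 A100 GPUs": 23.20, "8 A100 GPUs": 46.40,
        "1 H100 GPU": 14.50, "2 H100 GPUs": 29.00,
        "4 H100 GPUs": 58.00, "8 H100 GPUs": 116.00,
        "1 H200 GPU": 16.00, "2 H200 GPUs": 32.00,
        "4 H200 GPUs": 64.00, "8 H200 GPUs": 128.00,
    }

    def hosting_cost_usd(configuration: str, hours_deployed: float) -> float:
        """Hourly rate x hours from successful deployment until deletion."""
        return GPU_RATE_USD_PER_HOUR[configuration] * hours_deployed

    # Example: one A100 GPU deployed for a 30-day month (720 hours)
    print(hosting_cost_usd("1 A100 GPU", 720))  # 4176.0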

The following table provides billing rates to calculate the model hosting price when you specify a configuration size to use to deploy your model:

Custom foundation model and deploy on demand model billing rates based on configuration size
Configuration size   Billing rate per hour in USD on IBM Cloud
Extra small          $4.43
Small                $5.22
Medium               $10.40
Large                $20.85

Important: You can deploy a maximum of four small custom foundation models, two medium models, or one large model per account.

For details on choosing a configuration for a custom foundation model, see Planning to deploy a custom foundation model.

Important: Some deploy on demand foundation models have an additional access fee.

For details about deploy on demand foundation models, see Supported foundation models.

Rates per page for document text processing

Use the document text classification and extraction methods of the watsonx.ai API to convert highly structured PDF files that use diagrams and tables to convey information into a file format that AI models can process. For more information, see Understanding documents.

A page can be a page of text (up to 3000 characters), an image, or a .tiff frame.

With the Lite plan, you can use text classification and extraction to process up to 100 pages per month. Billing is charged at a flat rate per processed page, and the rate depends on your plan type.

Text classification and extraction pricing
IBM Cloud plan type   Price per page in USD
Essential             $0.038
Standard              $0.030
HIPAA-Ready           Not supported

Usage is also measured in resource units (RUs): processing 33 pages of a document consumes 1 RU. On AWS, only the text extraction API is available.
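
As an illustration, the following minimal Python sketch estimates document processing charges; the names are illustrative, and rounding the RU count up to a whole number is an assumption, not documented behavior.

    import math

    PRICE_PER_PAGE_USD = {"Essential": 0.038, "Standard": 0.030}

    def page_cost_usd(pages: int, plan: str) -> float:
        """Flat per-page rate on IBM Cloud by plan type."""
        return pages * PRICE_PER_PAGE_USD[plan]

    def rus_consumed(pages: int) -> int:
        """33 processed pages consume 1 RU; round-up is assumed."""
        return math.ceil(pages / 33)

    # Example: 100 pages on the Standard plan
    print(round(page_cost_usd(100, "Standard"), 2))  # 3.0
    print(rus_consumed(100))                         # 4 RUs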

Learn more