Billing details for generative AI assets
Learn how usage for generative AI assets is measured using resource units (RUs), hourly rates, or a flat rate.
Review the details for how resources are measured using:
- Resource units to measure inferencing activities for foundation models provided by watsonx.ai.
- Hourly rates for custom foundation models you import and deploy with watsonx.ai.
- Hourly rates for curated foundation models deployed on demand on dedicated hardware.
- Flat rates per page for document text classification and extraction.
A resource unit is used to measure the following resources:
- Tokens used for inferencing a foundation model to generate text or text embeddings.
- Data points used by a time series foundation model for forecasting future values.
Prompt Lab usage is measured by the number of processed tokens.
Tuning a model in the Tuning Studio consumes capacity units per hour (CUH). For details, see Billing details for machine learning assets.
Billing rates for inferencing multitenant foundation models
Each multitenant foundation model provided by IBM watsonx.ai is assigned a model-specific multiplier. Foundation model inference or forecasting is measured by tracking the tokens, data points, or characters used in the input and output of a foundation model, or in the output of an encoder model. A token is a basic unit of text (typically 4 characters or 0.75 words).
For details about billing rates, see Calculating the rate of token usage per model on IBM Cloud.
For details about billing rates, see Calculating the rate of token usage per model on AWS.
For the list of supported foundation models for generating text and their prices, see Supported foundation models. For the list of supported encoder models for reranking and generating text embeddings and their prices, see Supported encoder models.
A tuned foundation model is assigned the same price as the underlying foundation model. For details about tunable foundation models, see Choosing a foundation model to tune.
Billing rates for inferencing multitenant time series foundation models
When measuring foundation model forecasting, a Resource Unit (RU) is equal to 1,000 data points in the foundation model input and output. A data point is a unit of input and output content that is expressed as one or more numbers.
When measuring foundation model usage on AWS, the number of data points is counted in batches of 1,000. A data point is a unit of input and output content that is expressed as one or more numbers. The total number of batches is then scaled by a model-specific multiplier. A Resource Unit (RU) is equal to 10,000 such batches.
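The AWS measurement described above can be sketched in a few lines of Python. This is a minimal sketch under the rules stated here; the function name and the example counts are illustrative, and the multiplier 1.3 is the Class 14 value from the table below:

```python
import math

def aws_data_point_rus(total_data_points: int, multiplier: float) -> float:
    """RUs measured for time series forecasting on AWS."""
    # Data points are counted in batches of 1,000; a partial batch rounds up.
    batches = math.ceil(total_data_points / 1000)
    # Scale by the model-specific multiplier; 1 RU = 10,000 scaled batches.
    return batches * multiplier / 10_000

# Illustrative: 15,360,000 input data points with a multiplier of 1.3.
print(aws_data_point_rus(15_360_000, 1.3))
```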
Billing classes by multiplier
If you are monitoring model usage with the watsonx.ai API, model prices are listed by pricing tier, as follows:
| Model pricing tier | Resource type | Price per RU in USD (IBM Cloud) | Multiplier |
|---|---|---|---|
| Class 1 | Tokens | $0.0006 | 6 |
| Class 2 | Tokens | $0.0018 | 18 |
| Class 3 | Tokens | $0.0050 | 50 |
| Class C1 | Tokens | $0.0001 | 1 |
| Class 5 | Tokens | $0.00025 | 2.5 |
| Class 7 | Tokens | $0.016 | 160 |
| Class 8 | Tokens | $0.00015 | 1.5 |
| Class 9 | Tokens | $0.00035 | 3.5 |
| Class 10 | Tokens | $0.0020 | 20 |
| Class 11 | Tokens | $0.000005 | 0.05 |
| Class 12 | Tokens | $0.0002 | 2 |
| Class 13 | Tokens | $0.00071 | 7.1 |
| Class 14 | Data points | $0.00013 | 1.3 |
| Class 15 | Data points | $0.00038 | 3.8 |
| Class 16 | Tokens | $0.0014 | 14 |
| Class 17 | Tokens | $0.0003 | 3 |
| Class 18 | Tokens | $0.00006 | 0.6 |
Certain models, such as Mistral Large, have special pricing that is not assigned by a multiplier. The pricing is listed in Supported foundation models.
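As the table shows, each tier's IBM Cloud price is its multiplier times a base price of $0.0001 per RU. A minimal sketch that derives tier prices from the multipliers; the dictionary holds a sample of rows copied from the table, and the names are illustrative:

```python
# Base price per resource unit (RU) in USD on IBM Cloud.
BASE_PRICE_PER_RU = 0.0001

# Model-specific multipliers, sampled from the pricing table above.
MULTIPLIERS = {"Class 1": 6, "Class 2": 18, "Class 3": 50, "Class 13": 7.1}

def price_per_ru(pricing_tier: str) -> float:
    """Return the USD price per RU for a pricing tier."""
    return BASE_PRICE_PER_RU * MULTIPLIERS[pricing_tier]

print(round(price_per_ru("Class 1"), 6))  # 0.0006, matching the table
```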
Calculating the rate of token usage per model on IBM Cloud
To calculate charges for foundation model inference, divide the total number of tokens consumed during the month by 1,000, rounding up, to obtain the total number of resource units (RUs). Multiply the total number of RUs by the base price per RU and by the model-specific multiplier to obtain the total usage charge. The multiplier varies by model and can also differ for input and output tokens of the same model.
The basic formula is as follows:
Total tokens used/1000 = Resource Units (RU) consumed
RUs consumed x base price per RU x model multiplier = Total usage charge
The base price for an RU is $0.0001. The price for each foundation model is a multiple of the base price.
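The formula above can be expressed as a short function. This is a minimal sketch; the function name and the token count are illustrative, and the multiplier 18 is the Class 2 value from the table above:

```python
import math

# Base price per resource unit (RU) in USD on IBM Cloud.
BASE_PRICE_PER_RU = 0.0001

def ibm_cloud_token_charge(total_tokens: int, multiplier: float) -> float:
    """Total monthly usage charge in USD for token consumption on IBM Cloud."""
    # 1 RU = 1,000 tokens; a partial RU rounds up to a whole RU.
    rus_consumed = math.ceil(total_tokens / 1000)
    return rus_consumed * BASE_PRICE_PER_RU * multiplier

# Illustrative: 2.5 million tokens on a Class 2 model (multiplier 18).
print(ibm_cloud_token_charge(2_500_000, 18))
```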
Calculating the rate of token usage per model on AWS
To calculate charges for foundation model inference, divide the total number of tokens consumed by 1,000, rounding up, to obtain the number of batches of tokens consumed. Multiply the number of consumed batches by the model-specific multiplier to obtain the total number of batches. Divide the total number of batches by 10,000 to obtain the total number of RUs.
The basic formula is as follows:
Total tokens used/1000 = Batches of tokens consumed
Batches of tokens consumed x model multiplier = Total number of token batches
Total number of token batches / 10,000 = Resource Units (RUs) measured
You purchase the RUs required for your use case in advance. RUs are then consumed based on your resource usage.
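The AWS formula above follows the same shape in code. A minimal sketch; the function name and token count are illustrative, and the multiplier 18 is the Class 2 value from the table above:

```python
import math

def aws_rus_measured(total_tokens: int, multiplier: float) -> float:
    """Resource units (RUs) measured for token consumption on AWS."""
    # Tokens are counted in batches of 1,000; a partial batch rounds up.
    batches = math.ceil(total_tokens / 1000)
    # Scale by the model-specific multiplier; 1 RU = 10,000 scaled batches.
    return batches * multiplier / 10_000

# Illustrative: 2.5 million tokens on a Class 2 model (multiplier 18).
print(aws_rus_measured(2_500_000, 18))  # 4.5
```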
Calculating the resource unit rate of data points per model
To calculate charges for forecasting with a time series foundation model, use the following equations:
- Input calculation: context length x number of series x number of channels
- Output calculation: prediction length x number of series x number of channels
These equations use the following parameters:
- Context length refers to the number of historical data points that a time series foundation model uses as input to make predictions.
- A series is a collection of observations made sequentially over time. For example, when comparing stock prices for many companies, the observed stock price history for each company is a separate series.
- Channels are the specific features or variables that are measured within a time series dataset.
- Prediction length is the number of future data points for the model to predict.
For more information about these values, see Forecast future data values.
| Resource type | Model pricing tier | Price in USD per RU (IBM Cloud) |
|---|---|---|
| Input data points | Class 14 | $0.00013 |
| Output data points | Class 15 | $0.00038 |
The following example shows how to calculate the cost for a time series forecasting request with the following parameters:
| Parameter | Example quantity |
|---|---|
| Context length (granite-ttm-1536-96-r2 model) | 1,536 |
| Channels | 10 |
| Series | 1,000 |
| Prediction length | 96 |
- Total input data points: 15,360,000 (context length of 1,536 x 10 channels x 1,000 series)
  15,360,000 / 1,000 = 15,360 RUs; 15,360 x $0.00013 = $1.9968
- Total output data points: 960,000 (prediction length of 96 x 10 channels x 1,000 series)
  960,000 / 1,000 = 960 RUs; 960 x $0.00038 = $0.3648
- Total price for the time series forecast request: $2.36 (input cost $1.9968 + output cost $0.3648)
  $1.9968 + $0.3648 = $2.3616
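The worked example above can be reproduced with a short script. A minimal sketch; the function name is illustrative, and the per-RU prices are the Class 14 and Class 15 rates from the table:

```python
PRICE_PER_RU_INPUT = 0.00013   # Class 14, input data points
PRICE_PER_RU_OUTPUT = 0.00038  # Class 15, output data points

def forecast_cost(context_length, prediction_length, series, channels):
    """USD cost of one time series forecasting request on IBM Cloud."""
    input_points = context_length * series * channels
    output_points = prediction_length * series * channels
    # 1 RU = 1,000 data points.
    return (input_points / 1000) * PRICE_PER_RU_INPUT + \
           (output_points / 1000) * PRICE_PER_RU_OUTPUT

cost = forecast_cost(context_length=1536, prediction_length=96,
                     series=1000, channels=10)
print(round(cost, 4))  # 2.3616, matching the example above
```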
Hourly billing costs for custom foundation models and deploy on demand models
Billing rates depend on the model hardware configuration and cover both hosting and inferencing the model. Charges begin when the custom foundation model is successfully deployed and continue until the model is deleted.
On IBM Cloud, you are billed in USD per hour based on actual resource consumption.
On AWS, you are charged in terms of RUs consumed by your deployment per hour.
Deploying custom foundation models and working with deploy on demand foundation models requires the Standard plan.
The following table provides billing rates to calculate the model hosting price when you specify a hardware specification to use to deploy your model:
| Configuration | Total GPU memory | Billing rate per hour in USD on IBM Cloud | Consumption rate per hour in RU on AWS |
|---|---|---|---|
| 1 L40S GPU | 48 GB | $4.43 | 4.43 |
| 2 L40S GPUs | 96 GB | $8.86 | 8.86 |
| 1 A100 GPU | 80 GB | $5.80 | 5.80 |
| 2 A100 GPUs | 160 GB | $11.60 | 11.60 |
| 4 A100 GPUs | 320 GB | $23.20 | 23.20 |
| 8 A100 GPUs | 640 GB | $46.40 | 46.40 |
| 1 H100 GPU | 80 GB | $14.50 | 14.50 |
| 2 H100 GPUs | 160 GB | $29.00 | 29.00 |
| 4 H100 GPUs | 320 GB | $58.00 | 58.00 |
| 8 H100 GPUs | 640 GB | $116.00 | 116.00 |
| 1 H200 GPU | 141 GB | $16.00 | 16.00 |
| 2 H200 GPUs | 282 GB | $32.00 | 32.00 |
| 4 H200 GPUs | 564 GB | $64.00 | 64.00 |
| 8 H200 GPUs | 1128 GB | $128.00 | 128.00 |
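Because charges accrue for every hour the deployment exists, a cost estimate is a single multiplication of the hourly rate from the table by the deployment lifetime. A minimal sketch; the dictionary holds a sample of rows copied from the table, and the 730-hour month is an assumption used for illustration:

```python
# Hourly rates sampled from the table above (USD on IBM Cloud; RUs on AWS).
HOURLY_RATE = {"1 L40S GPU": 4.43, "1 A100 GPU": 5.80, "1 H100 GPU": 14.50}

def hosting_cost(configuration: str, hours: float) -> float:
    """Cost of keeping a deployment up for the given number of hours."""
    return HOURLY_RATE[configuration] * hours

# Illustrative: one L40S GPU deployed for a 730-hour month.
print(round(hosting_cost("1 L40S GPU", 730), 2))
```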
The following table provides billing rates to calculate the model hosting price when you specify a configuration size to use to deploy your model:
| Configuration size | Billing rate per hour in USD on IBM Cloud |
|---|---|
| Extra small | $4.43 |
| Small | $5.22 |
| Medium | $10.40 |
| Large | $20.85 |
For details on choosing a configuration for a custom foundation model, see Planning to deploy a custom foundation model.
For details about deploy on demand foundation models, see Supported foundation models.
Rates per page for document text processing
Use the document text classification and extraction methods of the watsonx.ai API to convert highly structured PDF files that use diagrams and tables to convey information into a file format that is friendly to AI models. For more information, see Understanding documents.
A page can be a page of text (up to 3,000 characters), an image, or a .tiff frame.
You can use text classification and extraction to process up to 100 pages per month with the Lite plan. Billing is charged at a flat rate per page processed and the billing rate depends on your plan type.
| IBM Cloud plan type | Price per page in USD |
|---|---|
| Essential | $0.038 |
| Standard | $0.030 |
| HIPAA-Ready | Not supported |
Usage is measured in resource units (RUs) consumed. Processing 33 pages of a document uses 1 RU. Only the text extraction API is available on AWS.
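The per-page rates and the 33-pages-per-RU measurement above combine into a short sketch. The function names and the 100-page count are illustrative; the prices are the plan rates from the table:

```python
import math

PRICE_PER_PAGE = {"Essential": 0.038, "Standard": 0.030}  # USD, IBM Cloud
PAGES_PER_RU = 33  # processing 33 pages of a document uses 1 RU

def document_processing_cost(pages: int, plan: str) -> float:
    """USD cost of processing pages under an IBM Cloud plan."""
    return pages * PRICE_PER_PAGE[plan]

def rus_consumed(pages: int) -> int:
    """RUs consumed for document processing; partial RUs round up."""
    return math.ceil(pages / PAGES_PER_RU)

print(round(document_processing_cost(100, "Standard"), 2))  # 3.0
print(rus_consumed(100))                                    # 4
```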
Learn more
- For details on pricing for machine learning assets, see Billing rates for machine learning assets.
- For details on tracking computing resource allocation and consumption, see Runtime usage.
- For details about regional support for each model, see Regional availability of foundation models.