Adversarial robustness evaluation metric

The adversarial robustness metric measures how well your AI assets maintain performance against adversarial attacks such as prompt injections and jailbreaks.

Metric details

Adversarial robustness is a metric that measures how well your model refuses to provide responses to attack vectors across different categories of jailbreak and prompt injection attacks. The metric is available only when you use the Python SDK to calculate evaluation metrics.

The following attack categories are evaluated with the adversarial robustness metric:

Basic: Basic attacks use direct prompts to generate unwanted responses for models that are not trained to protect against any attacks.
Intermediate: Intermediate attacks use natural language to pre-condition foundation models to follow instructions.
Advanced: Advanced attacks require knowledge of model encoding or access to internal resources.

Scope

The adversarial robustness metric evaluates generative AI assets only.

Types of AI assets: Prompt templates
Generative AI tasks:
- Text classification
- Text summarization
- Content generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
Supported languages: English, French

Scores and values

The adversarial robustness metric score indicates how resiliently your prompt template performs against adversarial attacks. Lower scores indicate that the prompt template is weak and can be easily attacked. Higher scores indicate that the prompt template is strong and more resistant to attacks.

Range of values: 0.0-1.0
Best possible score: 1.0

Settings

Thresholds:
- Lower bound: 0
- Upper bound: 1

Evaluation process

To calculate the adversarial robustness metric, evaluations use a keyword detector that includes a list of phrases that indicate refusals from the model to provide responses to attacks. The model responses are compared to the list of phrases to calculate the metric score. These scores represent the lower bound of the actual model robustness. If the model does not explicitly refuse to provide responses to attacks, scores indicate that the prompt template is not robust.

Limitations

Rejection phrase detection:

The metric relies on a pre-determined list of rejection phrases to evaluate model responses.
Different models might use different phrases to reject harmful requests, which requires periodic updates to the detection list.
The evaluation might underestimate robustness when models respond with:
- Clarifying questions instead of explicit refusals
- Explanations about request vulnerabilities
- Unrelated information to deflect harmful requests

Technical constraints:

Each evaluation requires a minimum of 50 inferences per prompt template variable, which might impact costs.
Sampling during computation leads to slightly different scores between evaluations
Attack vectors require periodic updates to address newly discovered threats.
Metric computation requires standard/essential plans of Watsonx.governance.

Next steps

You can use the following strategies to mitigate the susceptibility of your prompt template to adversarial robustness attacks:

Model selection and testing:

You can mitigate attack susceptibility by:

Selecting safety-trained models
Using models with built-in guardrails
Testing different model endpoints as they receive safety updates

Prompt template enhancement:

Improve your prompt templates with:

Clear scope limitations and objectives
Explicit instructions against sharing unnecessary information
Structured formatting to prevent instruction overwrites
Counter-instructions against role-playing scenarios
Language engagement restrictions to combat advanced attacks

Implementing guardrails:

You can establish protective measures through:

Input-stage guardrails:
- Attack intent detection
- Proactive filtering to prevent unnecessary inference calls
Output-stage guardrails:
- Content moderation
- Response inspection for attack success criteria
Combined guardrail approaches:
- Implementing both on/off-topic and jailbreak protections
- Using multiple filter layers

Application design:

Enhance your application security by:

Constraining input to permissible languages only
Setting appropriate input size limitations
Implementing user input validation

Note:

The actions that you take to improve or validate performance are not prescribed and depend on your model use case and the goals that you want to achieve. The effectiveness of each approach might vary based on your implementation and requirements.

Learn more

For more information about red teaming metrics, see the following sample notebooks: