Adversarial robustness evaluation metric
The adversarial robustness metric measures how well your AI assets maintain performance against adversarial attacks such as prompt injections and jailbreaks.
Metric details
Adversarial robustness is a metric that measures how well your model refuses to provide responses to attack vectors across different categories of jailbreak and prompt injection attacks. The metric is available only when you use the Python SDK to calculate evaluation metrics. For more information, see Computing Adversarial robustness and Prompt Leakage Risk using IBM watsonx.governance.
The following attack categories are evaluated with the adversarial robustness metric:
- Basic: Basic attacks use direct prompts to generate unwanted responses for models that are not trained to protect against any attacks.
- Intermediate: Intermediate attacks use natural language to pre-condition foundation models to follow instructions.
- Advanced: Advanced attacks require knowledge of model encoding or access to internal resources.
Scope
The adversarial robustness metric evaluates generative AI assets only.
- Types of AI assets: Prompt templates
- Generative AI tasks:
- Text classification
- Text summarization
- Content generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Supported languages: English
Scores and values
The adversarial robustness metric score indicates how resiliently your prompt template performs against adversarial attacks. Lower scores indicate that the prompt template is weak and can be easily attacked. Higher scores indicate that the prompt template is strong and more resistant to attacks.
- Range of values: 0.0-1.0
- Best possible score: 1.0
Settings
- Thresholds:
- Lower bound: 0
- Upper bound: 1
Evaluation process
To calculate the adversarial robustness metric, evaluations use a keyword detector that includes a list of phrases that indicate refusals from the model to provide responses to attacks. The model responses are compared to the list of phrases to calculate the metric score. These scores represent the lower bound of the actual model robustness. If the model does not explicitly refuse to provide responses to attacks, scores indicate that the prompt template is not robust.
Limitations
Rejection phrase detection:
- The metric relies on a pre-determined list of rejection phrases to evaluate model responses.
- Different models might use different phrases to reject harmful requests, which requires periodic updates to the detection list.
- The evaluation might underestimate robustness when models respond with:
- Clarifying questions instead of explicit refusals
- Explanations about request vulnerabilities
- Unrelated information to deflect harmful requests
Technical constraints:
-
Each evaluation requires a minimum of 50 inferences per prompt template variable, which might impact costs.
-
Sampling during computation leads to slightly different scores between evaluations
-
Attack vectors require periodic updates to address newly discovered threats.
-
Metric computation requires standard/essential plans of Watsonx.governance.
Next steps
You can use the following strategies to mitigate the susceptibility of your prompt template to adversarial robustness attacks:
Model selection and testing:
You can mitigate attack susceptibility by:
- Selecting safety-trained models
- Using models with built-in guardrails
- Testing different model endpoints as they receive safety updates
Prompt template enhancement:
Improve your prompt templates with:
- Clear scope limitations and objectives
- Explicit instructions against sharing unnecessary information
- Structured formatting to prevent instruction overwrites
- Counter-instructions against role-playing scenarios
- Language engagement restrictions to combat advanced attacks
Implementing guardrails:
You can establish protective measures through:
- Input-stage guardrails:
- Attack intent detection
- Proactive filtering to prevent unnecessary inference calls
- Output-stage guardrails:
- Content moderation
- Response inspection for attack success criteria
- Combined guardrail approaches:
- Implementing both on/off-topic and jailbreak protections
- Using multiple filter layers
Application design:
Enhance your application security by:
- Constraining input to permissible languages only
- Setting appropriate input size limitations
- Implementing user input validation
Parent topic: Evaluation metrics