Evaluate metrics

The evaluate metrics module can help you calculate LLM metrics.

Evaluate metrics is a module in the ibm-watsonx-gov Python SDK that contains methods to compute scores for metrics such as context relevance, faithfulness, and answer correctness. You can use model insights to visualize the evaluation results.

Examples

The following examples show how to calculate metrics with the evaluate metrics module:

Step 1: Create the AI configuration object and select the metrics to compute:

from ibm_watsonx_gov.config import GenAIConfiguration
from ibm_watsonx_gov.metrics import ContextRelevanceMetric, FaithfulnessMetric, AnswerCorrectnessMetric
from ibm_watsonx_gov.entities.enums import TaskType

# Column names of the input data
question_field = "question"
context_field = "contexts"

# Map the data columns to the fields that the RAG task expects
config = GenAIConfiguration(
    input_fields=[question_field, context_field],
    question_field=question_field,
    context_fields=[context_field],
    output_fields=["answer"],
    reference_fields=["ground_truth", "answer"],
    task_type=TaskType.RAG,
)

# Metrics to compute; ContextRelevanceMetric and AnswerCorrectnessMetric
# (imported above) can be added to this list in the same way
metrics = [
    FaithfulnessMetric(method="token_k_precision"),
]
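The evaluation in step 2 runs over tabular input data whose columns match the field names in the configuration. The following is a minimal sketch of such a pandas DataFrame; the column contents, for example whether the contexts column holds one passage or a list of passages, are assumptions that depend on your retrieval application.

import pandas as pd

# Hypothetical sample data whose columns match the configured field names
input_df = pd.DataFrame([
    {
        "question": "What is the capital of France?",
        "contexts": ["Paris is the capital and largest city of France."],
        "answer": "The capital of France is Paris.",
        "ground_truth": "Paris is the capital of France.",
    }
])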

Step 2: Calculate the metrics:

from ibm_watsonx_gov.evaluate import evaluate_metrics

# credentials: your watsonx.governance service credentials
# input_df: a pandas DataFrame that contains the configured fields
#           ("question", "contexts", "answer", "ground_truth")
evaluation_result = evaluate_metrics(
    credentials=credentials,
    configuration=config,
    metrics=metrics,
    data=input_df,
    output_format="dataframe",
)
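After the call returns, you can inspect the scores before visualizing them with model insights. A minimal inspection sketch, assuming that the object returned for output_format="dataframe" prints as a table with one row per record and one column per metric:

# Print the computed metric scores; the exact layout of the returned
# object depends on the SDK version (assumed to be tabular here)
print(evaluation_result)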

For more information, see the Evaluate metrics notebook.

Parent topic: Metrics computation using Python SDK