Accelerators are specialized tools or platforms designed to expedite and streamline specific processes. They allow IBMers to achieve results more quickly and efficiently by providing pre-built frameworks, intuitive interfaces, and plug-and-play functionalities that reduce the complexity and time required to develop, deploy, and optimize AI solutions. Typically, accelerators cater to various stages of AI implementation, from data ingestion and processing to model training, deployment, and evaluation. They are particularly valuable for enabling rapid prototyping, demonstrating proof-of-concept projects, and facilitating experimentation without the need for extensive coding or configuration. By offering integrated solutions and user-friendly features, accelerators enhance productivity and support both novice and experienced developers in harnessing advanced technologies effectively.
LLMaaJ (LLM as a Judge) has emerged over the last year as a leading evaluation technique for overcoming the challenge of building a reference-based evaluation engine. It has been shown to correlate reasonably well with human judgment. Here are several properties that cannot be quantified by existing metrics and benchmarks but can be evaluated by LLMaaJ:
For example, when using a scoring model to evaluate the output of other models, the scoring prompt should contain a description of the attributes to score and the grading scale, and should be interpolated to include the response to be evaluated.
In this example, the model is asked to evaluate the response’s sentiment and return a classification.
You are asked to classify a chatbot's response according to its sentiment.
Evaluate the response below and extract the corresponding class.
Possible classes are POSITIVE, NEUTRAL, NEGATIVE.
Explain your reasoning and conclude by stating the classified sentiment.
{{response}}
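The interpolation step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `build_judge_prompt` and `parse_sentiment` are hypothetical, the call to the judge model itself is left out, and the parser simply assumes the judge states the final class last, as the prompt requests.

```python
import re
from typing import Optional

# Scoring prompt from the text above; {{response}} is the placeholder to fill in.
JUDGE_TEMPLATE = (
    "You are asked to classify a chatbot's response according to its sentiment.\n"
    "Evaluate the response below and extract the corresponding class.\n"
    "Possible classes are POSITIVE, NEUTRAL, NEGATIVE.\n"
    "Explain your reasoning and conclude by stating the classified sentiment.\n"
    "{{response}}"
)


def build_judge_prompt(response: str) -> str:
    """Interpolate the response to be evaluated into the scoring prompt."""
    return JUDGE_TEMPLATE.replace("{{response}}", response)


def parse_sentiment(judge_output: str) -> Optional[str]:
    """Return the last class label mentioned in the judge output, or None.

    Assumption: the judge concludes with the class, so the last match wins.
    """
    matches = re.findall(r"\b(POSITIVE|NEUTRAL|NEGATIVE)\b", judge_output.upper())
    return matches[-1] if matches else None
```

Taking the last label rather than the first matters because the judge is asked to explain its reasoning, and the explanation may mention other classes before the conclusion.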
Or, for instance, below is a few-shot prompting example of LLM-driven evaluation for NER (named-entity recognition) tasks.
You are a professional evaluator, and your task is to assess the accuracy of entity extraction as a Score in a given text. You will be given a text, an entity, and the entity value.
Please provide a numeric score on a scale from 0 to 1, where 1 is the best score and 0 is the worst score. Strictly use numeric values for scoring.
Here are the examples:
Text: Where is the IBM's office located in New York?
Entity: organization name
Value: IBM's
Score: 0
Text: Call the customer service at 1-800-555-1234 for assistance.
Entity: phone number
Value: +1 888 426 4409
Score: 1
Text: watsonx has three components: watsonx.ai, watsonx.data, and watsonx.governance.
Entity: product name
Value: Google
Score: 0.33
Text: The conference is scheduled for 15th August 2024.
Entity: date
Value: 15th August 2024
Score: 1
Text: My colleagues John and Alice will join the meeting.
Entity: person’s name
Value: Alice
Score: 1
----------------Output------------------------------------------
Score: 0.67
-------------------------------
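Because the judge is told to answer with a strictly numeric score, the reply can be checked mechanically before it is aggregated. The sketch below is an illustration under assumptions: `format_example` and `parse_score` are hypothetical helper names, and the regular expression assumes the judge answers in the same `Score: <number>` layout used in the few-shot examples.

```python
import re
from typing import Optional


def format_example(text: str, entity: str, value: str, score: Optional[float] = None) -> str:
    """Render one example in the Text / Entity / Value / Score layout.

    Leaving score as None produces the open-ended item the judge must complete.
    """
    lines = [f"Text: {text}", f"Entity: {entity}", f"Value: {value}"]
    lines.append("Score:" if score is None else f"Score: {score}")
    return "\n".join(lines)


def parse_score(judge_output: str) -> Optional[float]:
    """Extract the numeric score and reject anything outside the 0 to 1 scale."""
    match = re.search(r"Score:\s*([0-9]*\.?[0-9]+)", judge_output)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 1.0 else None
```

Returning `None` for out-of-range or missing scores keeps malformed judge replies from silently skewing an average.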
The complexity of RAG systems is significantly influenced by the enigmatic nature of Large Language Models (LLMs), as well as the intricate and interconnected components within the RAG pipeline. As technology continues to progress at an unprecedented rate, evaluating such a complex system becomes an increasingly arduous task. To address this challenge, a myriad of benchmarks and evaluation tools have been developed specifically for RAG systems. These resources serve to provide a standardized and systematic approach to assessing the performance and efficacy of these systems.
For instance, as illustrated in the table below, there exists a diverse array of RAG evaluation methods and tools, each with its unique strengths and applications. This table, which is not exhaustive, serves to provide a succinct overview of the current landscape of RAG evaluation [1].
In the retrieval component of RAG systems, challenges arise primarily from the extensive and dynamic nature of prospective knowledge repositories, the temporal facets of data, and the heterogeneity of information sources. Given these challenges, conventional evaluation metrics, such as Recall and Precision, are inadequate for a comprehensive assessment. Instead, there is a need for more nuanced and context-dependent metrics that can capture the complexities and subtleties of the retrieval process.
Concerning the generation component, it is crucial to consider the intricate relationship between the precision of the retrieval process and the quality of the generated output. This necessitates the development and implementation of comprehensive evaluation metrics that can provide a holistic and nuanced assessment of the system's performance.
In turn, the evaluation of the RAG system as a whole requires a thorough examination of the impact of the retrieval component on the generation process, as well as an assessment of the overall effectiveness and efficacy of the system in achieving its intended goals and objectives.
The retrieval metrics below are reference-based, which means that every chunk must be uniquely identified (contexts_id) and every question must carry the unique IDs of its ground-truth contexts.
Rank-aware evaluation metrics used for recommendation systems are appropriate for RAG.
Mean Reciprocal Rank (MRR) is used in Unitxt and measures the position of the first relevant document in the search results. A higher MRR, close to 1, indicates that relevant results appear near the top, reflecting high search quality. Conversely, a lower MRR means lower search performance, with relevant answers positioned further down in the results.
Pros: Emphasizes the importance of the first relevant result, which is often critical in search scenarios.
Cons: Does not penalize the retrieval for assigning a low rank to other ground truths; Not suitable for evaluating the entire list of retrieved results, because it focuses only on the first relevant item.
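As a worked illustration of the definition, MRR averages the reciprocal rank 1/rank of the first relevant chunk over all questions. The sketch below is plain Python and is not Unitxt's implementation; the function names and the chunk IDs in the usage note are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(retrieved_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (retrieved IDs, ground-truth IDs) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
```

For three questions whose first relevant chunk appears at rank 2, at rank 1, and not at all, the reciprocal ranks are 0.5, 1.0, and 0.0, so the MRR is 0.5. The second ground-truth chunk of a question never affects the score, which is the limitation noted in the cons.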
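Normalized Discounted Cumulative Gain (NDCG) is a ranking quality metric that assesses how well a list of items is ordered compared to an ideal ranking, where all relevant items are positioned at the top.
NDCG@k is the DCG of the top k results divided by the ideal DCG, which represents a perfect ranking.
DCG measures the total relevance of the items in a list, discounting items that appear lower in the ranking.
NDCG ranges from 0 to 1.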
Pros: Takes into account the position of relevant items, providing a more holistic view of ranking quality; Can be adjusted for different levels of ranking (e.g., NDCG@k).
Cons: More complex to compute and interpret compared to simpler metrics like MRR; Requires an ideal ranking for comparison, which may not always be available or easy to define.
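The NDCG@k computation can be sketched as follows, using the common log2 discount, DCG@k = sum over positions i of rel_i / log2(i + 1). This is an illustrative sketch with hypothetical function names, not the code of any specific library. It assumes the caller supplies the relevance of each retrieved item in ranked order, plus the relevance values of all ground-truth items so the ideal ranking can be built.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: Sequence[float], ideal_relevances: Sequence[float], k: int) -> float:
    """DCG@k of the actual ranking divided by DCG@k of the ideal ranking."""
    ideal = sorted(ideal_relevances, reverse=True)  # perfect ordering: most relevant first
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0
```

With binary relevance, a ranking that places the two relevant chunks first scores 1.0, while a ranking that places them at positions 2 and 3 scores roughly 0.69, because both hits are discounted.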
Mean Average Precision (MAP) is a metric that evaluates the ranking of each correctly retrieved document within a list of results.
It is beneficial when your system needs to consider the order of results and retrieve multiple documents in a single run.
Pros: Considers both precision and recall, providing a balanced evaluation of retrieval performance; Suitable for tasks requiring multiple relevant documents and their correct ordering.
Cons: Can be more demanding in terms of computation compared to simpler metrics; May not be as straightforward to interpret as other metrics, requiring more context to understand the results fully.
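To make the definition concrete, average precision for one question sums the precision at each rank where a relevant chunk appears and divides by the number of ground-truth chunks; MAP is the mean over questions. The sketch below is a minimal illustration with hypothetical function names, and it assumes each retrieved ID appears only once in the result list.

```python
from typing import List, Sequence, Set, Tuple


def average_precision(retrieved_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Mean of precision@rank taken at every rank that holds a relevant chunk."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    # Dividing by all ground truths penalizes relevant chunks that were never retrieved.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average AP over (retrieved IDs, ground-truth IDs) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, g) for r, g in runs) / len(runs)
```

For example, retrieving `["a", "x", "b"]` when the ground truth is `{"a", "b"}` gives (1/1 + 2/3) / 2, which is about 0.83.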
Faithfulness measures whether the output is based on the given context or if the model generates hallucinated responses.
Pros: Ensures that the generated responses are trustworthy and based on the provided context; Vital for applications where factual correctness is paramount.
Cons: Often requires human judgment to assess, making it labor-intensive and subjective; May not fully capture partial inaccuracies or subtle hallucinations.
Robustness is generally defined as the solution's capability of adapting to different input variations such as data perturbations like whitespace, lower/upper case, tabs, etc.
Testing Robustness is an important aspect of the evaluation process and can be achieved for instance using Unitxt semantic
Pros: Ensures the model performs reliably across varied input conditions; Real-World Applicability: Important for practical applications where input data may not be perfectly formatted.
Cons: Requires thorough testing across many variations, which can be time-consuming; Defining Variations: Challenging to define and measure all possible input perturbations.
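A simple way to probe robustness is to apply the perturbations named above (case changes, extra whitespace, tabs) and check how often the output stays the same. The sketch below is a hand-rolled illustration, not the Unitxt robustness tooling; `perturb`, `robustness_score`, and the `predict` callable are hypothetical names standing in for your own pipeline.

```python
from typing import Callable, Dict


def perturb(text: str) -> Dict[str, str]:
    """Produce surface-level variants of the input that keep its meaning."""
    return {
        "lowercase": text.lower(),
        "uppercase": text.upper(),
        "extra_whitespace": "  " + text.replace(" ", "  ") + "  ",
        "tabs": text.replace(" ", "\t"),
    }


def robustness_score(predict: Callable[[str], str], text: str) -> float:
    """Fraction of perturbed inputs whose prediction equals the unperturbed one."""
    baseline = predict(text)
    variants = perturb(text)
    unchanged = sum(predict(variant) == baseline for variant in variants.values())
    return unchanged / len(variants)
```

Exact string equality is a strict criterion. For generative outputs, a semantic comparison (for example an LLM judge or an embedding similarity threshold) may be a better fit, at the cost of more computation.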
ROUGE measures the quality of text generation by comparing the overlap of n-grams, word sequences, and word pairs between the machine-generated text and a set of reference texts. It is widely used for evaluating tasks such as text summarization and translation.
Pros: Established and recognized in the NLP community, providing a standard for comparison; Suitable for tasks where capturing all relevant information is important.
Cons: Focuses on n-gram overlap, which may not capture semantic quality or fluency; Can be influenced by the length of the generated text, potentially penalizing shorter or more concise outputs.
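The n-gram overlap idea can be illustrated with a simplified recall-oriented score: the share of reference n-grams that also appear in the generated text. This is a teaching sketch only. It uses whitespace tokenization and a single reference, and it omits the stemming, longest-common-subsequence, and skip-bigram variants that full ROUGE implementations provide. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that are covered by the candidate."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated n-grams
    return overlap / total
```

For the candidate "the cat sat" and the reference "the cat sat on the mat", three of the six reference unigrams are covered, so the unigram recall is 0.5. This shows how a short output is penalized even when every word in it is correct.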
BLEU measures the quality of machine-translated text by comparing it to one or more reference translations. It evaluates the precision of n-grams in the generated text with respect to the reference texts. It is primarily used for evaluating translation.
Pros: Effective for tasks where precision and exact matches are important; Standard Metric: Widely adopted in the machine translation community, providing a benchmark for comparison.
Cons: May penalize legitimate variations in phrasing that do not exactly match reference texts; Partial Inaccuracy Blindness: May not fully capture partial inaccuracies or subtle differences in meaning.
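The precision-oriented computation can be sketched as the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. The version below is deliberately simplified: one reference, whitespace tokenization, n-grams up to `max_n`, and no smoothing, so any missing n-gram order yields 0. The function names are hypothetical; for reported results, an established implementation should be used instead.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(cand_tokens: List[str], ref_tokens: List[str], n: int) -> float:
    """Share of candidate n-grams found in the reference, clipped by reference counts."""
    cand = ngram_counts(cand_tokens, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ngram_counts(ref_tokens, n)).values()) / total


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Brevity penalty times the geometric mean of 1..max_n clipped precisions."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Candidates shorter than the reference are penalized; longer ones are not.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return brevity * geo_mean
```

The candidate "the cat" against the reference "the cat sat on the mat" has perfect unigram and bigram precision, yet it scores only exp(-2), about 0.14, because of the brevity penalty.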
Utilization in terms of total number of tokens, number of 429 responses received
Example: Cost from OpenAI API calls
Costs from storage, networking, computing resources, etc.
Costs from maintenance, support, monitoring, logging, security measures, etc.
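The utilization figures mentioned above (total tokens and the number of 429 responses) can be tallied from per-call logs. The sketch below is only an illustration: the `CallRecord` structure and its field names are hypothetical, and `price_per_1k_tokens` is a placeholder you supply, not a real price from any provider.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CallRecord:
    """One logged API call (hypothetical log schema)."""
    status_code: int
    prompt_tokens: int = 0
    completion_tokens: int = 0


def summarize_usage(records: Iterable[CallRecord], price_per_1k_tokens: float) -> Dict[str, float]:
    """Total tokens of successful calls, count of 429 responses, and estimated cost."""
    records = list(records)
    total_tokens = sum(
        r.prompt_tokens + r.completion_tokens for r in records if r.status_code == 200
    )
    rate_limited = sum(1 for r in records if r.status_code == 429)
    return {
        "total_tokens": total_tokens,
        "rate_limited_calls": rate_limited,
        "estimated_cost": total_tokens / 1000 * price_per_1k_tokens,
    }
```

This covers only the API-call part of the cost. Infrastructure and operational costs need to be tracked separately.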
If the retrieval metrics indicate suboptimal performance, yet the generation metrics yield favourable results, it is advisable to:
Conversely, if the retrieval metrics demonstrate strong performance but the generation results are suboptimal, consider the following strategies to improve model performance:
In the scenario where both the retrieval and generation metrics exhibit subpar performance, it would be prudent to revisit and reconsider the initial stages of the pipeline, such as enhancing metadata, refining the knowledge base, and optimizing the search mechanism.
This approach emphasizes the importance of a comprehensive and nuanced evaluation of the RAG system, taking into account the interplay between retrieval and generation components and the overall effectiveness of the system in achieving its intended goals and objectives.
Rather than blindly choosing a retrieval strategy, our CE team has created an evaluation framework that allows us to systematically compare different techniques and choose one that aligns with our computational resources and complexity requirements. This asset uses reference-based and reference-less metrics and can generate Q&A pairs from chunks.
Additionally, passing a session ID or user ID with NeuralSeek to the RAG system can help maintain context and track the conversation effectively.
You can use the ilab model evaluate command to evaluate the models you are training against several benchmarks. Currently, four benchmarks are supported.
|Benchmark |Measures |Full Name |Description |Reference
|MMLU |Knowledge |Massive Multitask Language Understanding |Tests a model against a standardized set of knowledge data and produces a score based on the model's performance |Measuring Massive Multitask Language Understanding
|MMLUBranch |Knowledge |N/A |Tests your knowledge contributions against a base model and produces a score based on the difference in performance |N/A
|MTBench |Skills |Multi-turn Benchmark |Tests a model's skill at applying its knowledge against a judge model and produces a score based on the model's performance |MT-Bench (Multi-turn Benchmark)
|MTBenchBranch |Skills |N/A |Tests your skill contributions against a judge model and produces a score based on the difference in performance |N/A
This directory contains an example of a “guard rail” used in generative AI applications: detection of hate, abuse, and profanity in a prompt, the output, or both. The notebook provides information about the granite.38m.en.guardrail model, which is designed for detecting Hate, Abuse, and Profanity (HAP) in text. The model has been fine-tuned on several English HAP benchmarks and uses the slate.38m.english.distilled base model.
FM-eval is an IBM internal framework to evaluate and benchmark LLMs.
It allows you to easily compose and run benchmarks across various ML environments and computing infrastructures.
Benchmarks for RAG used by FM-eval:
|Name |Description |Type
|RAG (generation only) |IBM collected data (SAP, askHR, maximo, clapNQ, w3IT) |IBM internal
|RAG (generation only) |Academic datasets (clapNQ, ChatRAG-bench, legal Australian) |open source
|RAG E2E |watson documentation |open source
|Robustness |Measures LLMs' sensitivity to (naturally occurring) variations in their input |open source
|RAG triad metrics (faithfulness, context relevance, answer correctness) |Evaluate RAG metrics |open source
FM-eval relies on the Unitxt open source package, which is used to define the available datasets, the preprocessing required to convert raw datasets to the input required by LLMs, and the metrics used to evaluate the results. Unitxt is automatically installed by FM-eval. Furthermore, FM-eval results are automatically uploaded to the Lakehouse for dashboards.
Unitxt is a Python library for textual data preparation and evaluation of generative language models.
With Unitxt, RAG evaluation is performed using both automatic reference-based and reference-less metrics.
Unitxt is open source and focuses on preparing the data and running the metrics.
Unitxt is a flexible framework that allows you to define resources and tasks and format them using a cataloging approach.
You can create your custom tasks, add them to the catalog, and bring your own data to the framework.
The catalog is shown below:
|Resource type |Description |Cataloging
|Card |Resource loading + task standardization (select RAG) |cards.qa.squad
|Template |Template for instruction task |cards.qa.squad
|Format |Overall textual layout of the example |formats.input_output
|Instruction |Adding system prompt |instructions.helpful
|Metrics |Large collection of metrics |metrics.rag
These resources are used during the Unitxt evaluation flow to ensure that data is correctly processed and that model performance is evaluated against the most relevant metrics.
The unitxt.eval_utils package is designed to help users assess the performance of RAG models by evaluating the correctness of generated answers, the relevance of retrieved contexts, and the faithfulness of the answers to those contexts.
Unitxt's approach to RAG evaluation revolves around a triad of critical components: inputs, outputs, and reference fields. This structure ensures that every aspect of the RAG task is clearly defined and thoroughly evaluated.
Below is a boilerplate code snippet that demonstrates how to use the unitxt.eval_utils package for evaluating RAG metrics. This code includes setting up the data, performing the evaluation, and saving the results.
import json
from ast import literal_eval

import pandas as pd
from unitxt.eval_utils import evaluate

# Path to the CSV file containing the RAG task data
data_path = "rag.csv"

# Read the data into a pandas DataFrame
df = pd.read_csv(
    filepath_or_buffer=data_path,
    converters={
        "ground_truths": literal_eval,  # Convert ground_truths from string to list
        "ground_truths_context_ids": literal_eval,  # Convert ground_truths_context_ids from string to list
        "contexts": literal_eval,  # Convert contexts from string to list
        "context_ids": literal_eval,  # Convert context_ids from string to list
    },
)

# Evaluate RAG metrics using the evaluate function
result, _ = evaluate(
    df.to_dict("records"),  # Convert DataFrame rows to a list of dictionaries
    metric_names=[
        "metrics.rag.answer_correctness",  # Measure how correct the generated answer is
        "metrics.rag.context_relevance",  # Measure the relevance of the retrieved contexts
        "metrics.rag.faithfulness",  # Measure how faithful the answer is to the retrieved contexts
        "metrics.rag.context_correctness",  # Measure the correctness of the retrieved contexts
    ],
)

# Save the evaluation results to a JSON file for detailed inspection
with open("dataset_out.json", "w") as f:
    json.dump(result, f, indent=4)

# Optionally, save the evaluation results to a CSV file with rounded values.
# Assumption: result is a list of records here (the input was a list),
# so it is wrapped in a DataFrame before rounding.
pd.DataFrame(result).round(2).to_csv("dataset_out.csv")
Here’s an example of what the CSV file might look like:
question,answer,contexts,context_ids,ground_truths,ground_truths_context_ids,question_id
"What is the dressing code of our company?","The dressing code of the company is professional attire. ...","['context1', 'context2', 'context3']","['context_id1', 'context_id2', 'context_id3']","['Ground truth 1', 'Ground truth 2']","['ground_truth_id1', 'ground_truth_id2']","question-id-1"
This IBM Research project provides a unified framework to test generative language models on a large number of different evaluation tasks.
Features:
In order to be able to answer the following questions :
watsonx.gov introduced evaluation of RAG tasks in version 2.0.0. It is supported only at development time; runtime support is expected in future releases.
|Metric
|What it measures
|Faithfulness
|- Measures the extent to which the output is based on the context.
- Provides attributions from the context to show the most important sentences that contribute to the output.
- Scores are between 0 and 1.
- A higher value suggests the response is more grounded in, or “faithful” to, the context.
|Answer Relevance
|- Measures how relevant the answer is to the question.
- Computed with a pre-trained reward model trained from human feedback.
- Scores are between 0 and 1.
|Unsuccessful Requests
|- Measures how many responses were “unsuccessful” relative to the total number of responses.
- To assess whether a response is “unsuccessful”, it is matched against out-of-the-box phrases (such as “I don’t know” or “No idea”) or phrases that the end user configures.
- Currently supported only for English.
- Scores are between 0 and 1.
- Example: if 10 out of 100 responses in an evaluation are matched as “unsuccessful”, the value is 10/100 = 0.1 for that evaluation.
|Answer Coverage
|- Measures the proportion of the words in the response that are derived from the context.
- Scores are between 0 and 1.
- A higher value suggests that a greater proportion of the response is drawn from the context.
|Keyword Inclusion
|- Measures the proportion of keywords in responses with respect to the context.
- “Keywords” are primarily nouns in the context.
- Scores are between 0 and 1.
- A higher value suggests the response has more keywords from the context.
|Spelling robustness of question
|- Detects spelling errors in the question and reports the corrected spelling (currently supported for English only).
- Measures the proportion of questions with spelling errors with respect to total questions.
- Scores are between 0 and 1.
- A higher value suggests the LLM can identify spelling errors, understand the question, and provide a response.
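The arithmetic behind the Unsuccessful Requests metric can be reproduced with simple phrase matching. The sketch below only illustrates the idea and is not the watsonx.governance implementation: the function name, the default phrase list (taken from the two examples in the table), and the substring-matching rule are all assumptions.

```python
from typing import Sequence

# Phrases quoted in the table; a real deployment would let the end user configure these.
DEFAULT_PHRASES = ("i don't know", "no idea")


def unsuccessful_rate(responses: Sequence[str], phrases: Sequence[str] = DEFAULT_PHRASES) -> float:
    """Share of responses that contain any 'unsuccessful' phrase (case-insensitive)."""
    if not responses:
        return 0.0
    flagged = 0
    for response in responses:
        # Normalize case and curly apostrophes before matching.
        text = response.lower().replace("\u2019", "'")
        if any(phrase in text for phrase in phrases):
            flagged += 1
    return flagged / len(responses)
```

With 10 matching responses out of 100, the function returns 0.1, which is the same value as the worked example in the table.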
IBM watsonx.gov can be utilized to measure the RAG Triad metrics: Context Relevance, Faithfulness, and Answer Relevance.
|Metric |What it measures
|Context Relevance |Measures whether the retrieved contexts are pertinent to the question (will be released shortly).
|Faithfulness |Assesses whether the model is deriving the answer from the provided context or if it is being creative and providing its own answer, which could lead to hallucinations.
|Answer Relevance |Evaluates whether the obtained answer is relevant to the question.
OOTB metrics: Working with 3rd party (detached) Prompts/Prompt Template Assets
Other custom metrics, such as faithfulness, using the custom metrics notebook: Text Summarization Monitoring using AWS Bedrock Anthropic Claude-v2 LLM
You can publish the metrics from the detached prompt to OpenScale so that all the metrics are displayed together. All the out-of-the-box metrics and custom metrics will appear together in the watsonx.governance console and also in Factsheets.
Watsonx.governance helps you to solve the following challenges for your enterprise:
Your team can track your machine-learning models and prompt templates from request to production and evaluate whether they comply with your organization's regulations and requirements.
|What you can use
|What you can do
|Best to use when
|Factsheets
|- Create an AI use case to track and govern AI assets from request through production.
- View lifecycle status for all of the registered assets and drill down to detailed factsheets for models, deployments, or prompt templates that are registered to the model use case.
- Review the details that are captured for each tracked asset in a factsheet associated with an AI use case.
- View evaluation details, quality metrics, fairness details, and drift details.
|- You need to request a new model or prompt template from your data science team.
- You want to make sure that your model or prompt template is compliant and performing as expected.
- You want to determine whether you need to update a model or prompt template based on tracking data.
- You want to run reports on tracked assets to share or preserve details.
You can evaluate machine learning models and prompt templates in projects or deployment spaces to measure their performance. For machine learning models, evaluate the model for quality, fairness, and accuracy. For foundation models, evaluate foundation model tasks, and understand how your model generates responses.
|What you can use
|What you can do
|Best to use when
|Projects
|Use a project as a collaborative workspace to build machine learning models, prompt foundation models, save machine learning models and prompt templates, and evaluate machine learning models and prompt templates. By default, your sandbox project is created automatically when you sign up for watsonx.
|You want to collaborate on machine learning models and prompt templates.
|Spaces user interface
|Use the Spaces UI to deploy and evaluate machine learning models, prompt templates, and other assets from projects to spaces.
|You want to deploy and evaluate machine learning models and prompt templates and view deployment information in a collaborative workspace.
After deploying models, it is important to govern and monitor them to make sure that they are explainable and transparent. Data scientists must be able to explain how the models arrive at certain predictions so that they can determine whether the predictions have any implicit or explicit bias. You can configure drift evaluations to measure changes in your data over time to ensure consistent outcomes for your model. Use drift evaluations to identify changes in your model output, the accuracy of your predictions, and the distribution of your input data.
|What you can use
|What you can do
|Best to use when
|Watson OpenScale
|- Monitor model fairness issues across multiple features.
- Monitor model performance and data consistency over time.
- Explain how the model arrived at certain predictions with weighted factors.
- Maintain and report on model governance and lifecycle across your organization.
|- You have features that are protected or that might contribute to prediction fairness.
- You want to trace model performance and data consistencies over time.
- You want to know why the model gives certain predictions.
Watsonx.governance offers comprehensive governance capabilities for LLM-powered applications, beyond just computing metrics.
During the design phase, LLM application developers can use the IBM watsonx.governance monitoring toolkit to evaluate the RAG Triad Metrics on the RAG prompts output. This helps refine the retrieval of relevant contexts from the vector database and adjust the prompt generation parameters as needed.
Once the prompt is designed by LLM application developers, it can be further developed and tested in a project and tracked as part of an AI use case. If the metric values are satisfactory during testing, the prompt can be validated by LLM application validators in a pre-production or validation environment, known as the Validation Space.
After validation, the prompt can be promoted to a production space for continuous monitoring. As users ask questions, the combination of questions, contexts, and answers are payload logged with watsonx.governance.
At every phase, IBM watsonx.governance provides capabilities to track the prompt as part of the AI use cases.
Checking the quality of the embedding models supported by watsonx to ensure that the embedding model is well-trained and suitable for your data. Use cosine similarity or other distance metrics, for example from sklearn, to evaluate the quality of embeddings.
Analyzing query processing to verify that queries are being processed and embedded correctly.
Tuning similarity search parameters like k in similarity search to optimize the performance.
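A quick sanity check for embeddings and for the choice of k is to compute cosine similarities by hand on a few question and chunk pairs whose relevance you already know. The sketch below uses only the standard library so that it runs anywhere; `cosine_similarity` and `top_k` are hypothetical helper names, and the vectors in the usage note are toy values rather than real embeddings. For larger batches, `sklearn.metrics.pairwise.cosine_similarity` computes the same quantity on arrays.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunk_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """IDs of the k chunks most similar to the query, best first."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()),
        reverse=True,
    )
    return [chunk_id for _, chunk_id in scored[:k]]
```

With a query vector `[1, 0]` and chunks `a = [1, 0]`, `b = [0, 1]`, `c = [1, 1]`, `top_k(..., k=2)` returns `["a", "c"]`. If chunks you know to be relevant do not rank near the top for their questions, the embedding model or the query processing deserves a closer look before k is tuned.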
A Framework for Evaluating Advanced Retrieval Techniques