What is question answering?

30 January 2025

Authors

Tim Mucci

IBM Writer

Question answering (QA) is a branch of computer science, within natural language processing (NLP) and information retrieval, dedicated to developing systems that respond to natural language questions with natural language answers. These systems determine the context behind a question, extract relevant information from large amounts of data and present it back to the user in a concise, readable way.


Types of question answering

Question answering systems can be categorized based on how they generate answers to the user’s questions, the scope of knowledge they possess and the types of questions or modalities they support.

Extractive and generative question answering

Extractive QA systems work by identifying and extracting answers directly from provided text or data sources. They use techniques such as named entity recognition and span prediction to locate the text segments that answer a given question.

For example, an extractive QA system might be asked to pinpoint the population of a country in a document.

In contrast, generative QA systems synthesize their own answers by using knowledge learned during training. These systems are not limited to extracting information verbatim but instead generate creative and nuanced responses, often relying on large language models (LLMs).

Well-known examples of generative QA are OpenAI's GPT-3 and ChatGPT, which are powered by generative artificial intelligence (gen AI).
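
The contrast can be made concrete with a toy extractive system. The sketch below is a minimal, illustrative span finder that scores sentences by word overlap with the question; real extractive models use trained span-prediction networks, and the document and question here are invented for illustration.

```python
import re

def extractive_answer(question, document):
    """Toy extractive QA: return the document sentence that shares the
    most words with the question. Real extractive models predict answer
    spans with a trained network rather than bag-of-words overlap."""
    q_words = set(re.findall(r"\w+", question.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", document)
    overlap = lambda s: len(q_words & set(re.findall(r"\w+", s.lower())))
    return max(sentences, key=overlap)

doc = ("France is a country in Western Europe. "
       "The population of France is about 68 million. "
       "Its capital is Paris.")
print(extractive_answer("What is the population of France?", doc))
# → The population of France is about 68 million.
```

A generative system, by contrast, would compose its own phrasing of the answer rather than returning a verbatim span.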

Open-domain and closed-domain question answering

Another way to classify QA systems is by the scope of knowledge they operate within. Open-domain QA systems are designed to handle questions on virtually any topic.

They rely on vast general knowledge and use frameworks such as ontologies to retrieve and organize information effectively. These systems are ideal for applications requiring broad versatility, such as virtual assistants or search engines.

In contrast, closed-domain QA systems specialize in specific areas, such as medicine, law or engineering. They use domain-specific knowledge to deliver detailed and accurate answers tailored to their field.

For instance, a closed-domain medical QA system might assist doctors by answering diagnostic questions based on clinical data.

Closed-book and open-book question answering systems

QA systems can also be categorized as closed-book or open-book, depending on how they access and use information. Closed-book systems rely entirely on knowledge memorized during their training and do not refer to external sources.

For example, GPT-3 can provide answers without real-time access to data. Open-book systems, by contrast, can access external knowledge bases or data sources during operation, allowing them to provide answers that are up to date and contextually relevant. Search-engine-integrated QA systems are a common example of open-book systems.

Conversational, mathematical and visual systems

Specialized QA systems are designed for specific types of input or interaction. Conversational QA systems can maintain context across multiple turns of a conversation, enabling coherent and natural exchanges. This makes them suitable for chatbots and virtual assistants, where continuity and context are essential.  

Mathematical QA systems, by contrast, focus on answering questions that require mathematical reasoning and calculations. These systems must understand mathematical notations and perform calculations to provide answers, such as solving equations or applying formulas.

Visual QA systems are designed to answer questions about images, combining NLP with computer vision techniques. For instance, given an image of a car, a visual QA system could analyze the image and answer a question such as, “What color is the car?” Visual QA has applications in areas including accessibility tools, image captioning and multimodal search engines. 


Datasets

Datasets provide the raw information needed to train models, evaluate their performance and measure advancements in the field. QA datasets typically consist of questions paired with their corresponding answers, often drawn from specific contexts such as documents, knowledge bases or structured datasets.

QA models use high-quality training data to associate questions with appropriate answers and identify patterns within the dataset. This process enables models to generalize from the examples they have seen to new, unseen questions.

Datasets also serve as benchmarks that allow researchers and practitioners to compare the capabilities of different QA models. Baseline models are often used as reference points to measure the effectiveness of new or advanced systems against established performance standards.  

Different datasets are designed to test various aspects of QA systems. For example, some datasets evaluate a system's ability to answer questions derived from a wide range of sources, while others focus on understanding complex or ambiguous questions.

Certain datasets test multihop reasoning, where the system must integrate information from multiple documents or sections to arrive at an answer. Some datasets even include unanswerable questions, challenging models to recognize when no answer exists among the sources rather than fabricating one.
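
A common way such datasets are structured is illustrated below with a SQuAD 2.0-style record. The context, questions and character offsets are invented for illustration, but the field layout (`answers`, `answer_start`, `is_impossible`) follows that dataset's published format.

```python
# A SQuAD 2.0-style record (illustrative values, not from the real dataset):
example = {
    "context": "The Amazon rainforest covers much of the Amazon basin "
               "of South America.",
    "qas": [
        {
            "question": "What does the Amazon rainforest cover?",
            # answer_start is the character offset of the span in context
            "answers": [{"text": "much of the Amazon basin",
                         "answer_start": 29}],
            "is_impossible": False,
        },
        {
            # An unanswerable question: the context says nothing about
            # rivers, so the gold answer list is empty.
            "question": "How long is the Amazon river?",
            "answers": [],
            "is_impossible": True,
        },
    ],
}
```

Pairing answerable and unanswerable questions over the same context is what forces a model to learn abstention as well as extraction.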

The availability of diverse and carefully constructed datasets has significantly advanced the field of QA. By presenting systems with increasingly complex and varied challenges, these datasets have encouraged the development of more sophisticated and robust models capable of handling a wide range of real-world scenarios.

Measuring QA systems' performance

Evaluation metrics provide a standardized way to measure performance, enabling developers to identify areas for improvement and refine their models. By offering objective, quantifiable insights, these metrics go beyond subjective assessments and help clarify how effectively a QA system can answer questions.

Metrics play a crucial role in identifying the strengths and weaknesses of a QA system and guide developers in focusing their efforts on improving specific aspects of their systems.

By using consistent benchmarks, such as the Stanford Question Answering Dataset (SQuAD), researchers can assess how their models stack up against others in the field. These benchmarks not only promote fairness in comparisons but also track progress and highlight the most effective techniques for advancing QA technology.

Evaluation metrics help prevent overfitting, a common challenge in machine learning. By testing models on separate datasets, developers can verify that their systems generalize well to new, unseen data rather than memorizing the training set.

In addition, metrics can highlight the limitations of current systems. For instance, a model’s underperformance can signal areas that require further research. This continual pursuit of better scores encourages the development of more advanced QA models capable of handling increasingly complex tasks and datasets.

Reliability is another critical focus of evaluation metrics. They provide a means to validate the accuracy of a QA system's answers and minimize errors. Metrics also guide the iterative development of models by offering feedback on how well a system is performing and helping developers fine-tune its components for optimal results.

Different metrics serve different needs within QA systems. For example, some metrics focus on exact matches between answers, while others assess the degree of overlap between predicted and actual responses.

These distinctions help ensure that the evaluation process is tailored to the specific requirements of various QA tasks and models.

  • Exact match (EM): This metric checks whether the predicted answer exactly matches the correct answer. It is a strict, all-or-nothing measure of a model's ability to answer correctly.
  • F1-score: The F1-score is a balanced measure that considers both precision (how many predicted answers are correct) and recall (how many correct answers are found). It provides a single score that reflects a model's overall accuracy by accounting for both false positives and false negatives, making it more nuanced than exact match (EM), which only checks for perfect matches.
  • Relevance: Depending on the system's architecture and retriever, a model can assess how confident it is that a certain document is relevant to a query.
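
The first two metrics can be sketched in a few lines. The normalization below (lowercasing, stripping punctuation and articles) mirrors the approach of SQuAD's official evaluation script, though this is a simplified illustration rather than the exact script.

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1: harmonic mean of precision and recall over the
    overlapping tokens of the predicted and gold answers."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))      # → 1
print(round(f1_score("in Paris, France", "Paris"), 2))      # → 0.5
```

Note how normalization lets "The Eiffel Tower" count as an exact match for "eiffel tower", while F1 gives partial credit for answers that overlap the gold span without matching it.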

However, existing metrics might not fully capture the complexities of understanding and answering questions effectively.

  • Contextual understanding: While metrics can measure if a model gives a correct answer, they don't always show how well a system understands the totality of a question.
  • Reasoning and synthesis: Some question answering tasks require reasoning and synthesis of information from different parts of a text, which can be difficult to evaluate when using simple metrics.
  • Subjectivity: Some questions might have more than one correct answer, and evaluating these questions can be subjective.
  • No answer: In some datasets, certain questions cannot be answered from the provided information, and systems need to recognize this. Metrics have been developed to account for such unanswerable questions.
  • Out-of-vocabulary words: Metrics might not fully capture the performance of systems dealing with words that are not in the system's vocabulary.

Despite these challenges, evaluation metrics remain essential for assessing the effectiveness of QA systems. They help developers determine how well a system answers questions and identify areas for improvement. Because QA models are trained on human-generated data, any inaccuracies or biases in the data can lead to biased answers, even if the model scores highly on evaluation metrics.

Another concern is the potential for models to "cheat" by exploiting statistical biases in datasets. For instance, a model might learn to associate specific keywords in a question with a particular answer span without genuinely understanding the query.

To address this issue, some datasets include questions written without allowing access to the corresponding source text during their creation. This approach reduces the likelihood of models relying on superficial patterns instead of meaningful comprehension.

Challenges in question answering systems

Question answering systems face several operational challenges that impact their effectiveness. One major hurdle is understanding the meaning and intent behind a question. This involves not just interpreting the words but also discerning the question's purpose, even when it is phrased ambiguously or unclearly.

QA systems must handle complex language structures, distinguish between similar-sounding words or phrases and recognize subtle variations in meaning.

Questions might be phrased in various ways, presented as multisentence queries or lack explicit clarity, demanding advanced natural language understanding capabilities.

Another significant challenge lies in efficiently retrieving relevant information from vast amounts of data. QA systems must employ sophisticated information retrieval techniques, such as semantic analysis and information extraction, to identify pertinent sources and pinpoint specific answers.

The sheer volume of data these systems process adds to the complexity of managing them.

QA systems also need robust mechanisms for representing and organizing knowledge. Techniques such as ontologies and semantic networks enable models to categorize and relate concepts, improving their ability to understand how words and ideas connect within a sentence or across a dataset.

Word tokenization, for instance, breaks text into smaller, analyzable units, helping systems better understand relationships between words and their contexts.
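
A minimal sketch of word tokenization, assuming a simple regular expression over words and punctuation (production systems typically use trained subword tokenizers such as BPE instead):

```python
import re

def tokenize(text):
    """Simple word tokenizer: split text into word runs and standalone
    punctuation marks so each can be analyzed separately."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("What's the capital of France?"))
# → ['What', "'", 's', 'the', 'capital', 'of', 'France', '?']
```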

Contextual reasoning presents another layer of complexity. Beyond understanding the question itself, QA systems must consider the broader context, synthesizing information from multiple sources or documents to provide appropriate answers.

This requires models to evaluate relationships between data points and draw meaningful conclusions based on their interconnections.

Finally, verifying the accuracy of answers is essential for QA systems. They must critically evaluate the reliability of their sources and account for potential biases in the data.

This involves cross-referencing information, identifying inconsistencies and helping to ensure that responses are supported by credible evidence. 

Applications of QA systems

Applications of QA systems are diverse, spanning industries and use cases, with a focus on automating information retrieval and delivering quick, accurate responses to natural language queries.  

One prominent application is in customer service, where QA systems streamline operations by automating responses to frequently asked questions using a knowledge base. This enhances efficiency and improves customer satisfaction by providing instant, consistent answers.

Similarly, in technical support, QA systems offer both employees and customers immediate access to relevant information, reducing wait times and increasing productivity. Virtual assistants also benefit from QA capabilities, enabling them to understand and respond to user queries more effectively through natural language.

In research and education, QA systems generate reports, assist with research and support fact-checking efforts. These systems help students by providing on-demand answers to educational questions and offering real-time support.

They are also used in academic assessments, such as grading assignments or evaluating answers in university exams, by interpreting a submission's text and assessing it against the relevant source material.

In search engine functions, QA systems enhance user experiences by providing instant answers directly relevant to user queries. Instead of merely delivering a list of related web pages, modern search systems use QA technology to extract specific information from documents, offering users concise and actionable responses.

Also, QA systems are increasingly applied to internal organizational tasks. They facilitate the efficient processing of information within large repositories of medical records, banking documents and travel logs.

By enabling quick and precise searches through structured and unstructured data, these systems save time and improve decision-making in professional environments. 

Implementation of QA

Implementing an effective QA system requires careful planning and execution across multiple stages, starting with data collection and preprocessing. This involves gathering a large and diverse corpus of text data from sources such as news articles, books and databases.

The data must be cleaned to remove irrelevant content, standardized through stemming or lemmatization and tokenized into individual words or phrases. Sometimes, human annotators create question-answer pairs or translate existing datasets into other languages.

High-quality, human-generated datasets typically lead to better performance than machine-translated ones, underscoring the importance of dataset quality.
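
These preprocessing steps can be sketched as follows. The stopword list and suffix-trimming "stemmer" are deliberately crude stand-ins for the real stemmers or lemmatizers a production pipeline would use, and the input text is invented.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Minimal preprocessing sketch: strip markup, lowercase and
    tokenize, drop stopwords, and trim common suffixes as a naive
    stand-in for stemming."""
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML remnants
    tokens = re.findall(r"[a-z0-9]+", text.lower()) # lowercase + tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Crude stemming: trim a few frequent suffixes (illustrative only)
    return [re.sub(r"(ing|ed|es|s)$", "", t) for t in tokens]

print(preprocess("<p>The questions were answered quickly.</p>"))
# → ['question', 'were', 'answer', 'quickly']
```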

Information retrieval is another critical component of a QA system. Algorithms are developed to extract relevant information from the text corpus in response to user questions.

Techniques such as keyword search, text classification and named entity recognition help narrow down relevant documents. To optimize efficiency, passage-ranking models can prioritize documents likely to contain the answer before applying a more computationally intensive QA model.

A common architecture is the retriever-reader pipeline, where the retriever identifies a subset of relevant documents and the reader extracts or generates the specific answer. Dense passage retrieval, which uses deep learning for retrieval, is a promising approach that improves both speed and accuracy.
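
A minimal retriever for such a pipeline might rank documents by TF-IDF overlap with the question, as in this illustrative sketch (real systems use inverted indexes or dense embeddings, and the documents here are invented):

```python
import math
from collections import Counter

def retrieve(question, documents, k=2):
    """Toy TF-IDF retriever: rank documents by the summed TF-IDF weight
    of the question's words. A reader model would then extract or
    generate the answer from the top-k documents."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    df = Counter()                     # document frequency of each word
    for toks in tokenized:
        df.update(set(toks))

    def score(toks):
        tf = Counter(toks)
        return sum(tf[w] * math.log(n / df[w])
                   for w in question.lower().split() if w in tf)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]),
                    reverse=True)
    return [documents[i] for i in ranked[:k]]

docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
]
print(retrieve("where is the eiffel tower", docs, k=1))
# → ['the eiffel tower is in paris']
```

Words that appear in every document (such as "is" or "the") get an IDF weight of zero, which is what keeps the retriever focused on the distinctive terms of the question.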

Another consideration in QA system design is the size of the context window, which determines the amount of information a model can process at once. For example, models such as IBM® Granite™-3, with a context window of 128,000 tokens, can efficiently handle large documents.

When processing extensive datasets, retriever-reader pipelines play a crucial role, allowing systems to filter out irrelevant documents before extracting answers, thereby maintaining both efficiency and accuracy.
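
One way a fixed context window is handled in practice is document chunking: splitting long texts into overlapping pieces that each fit the window. The sketch below counts whitespace-separated words rather than model tokens, and the window sizes are illustrative.

```python
def chunk_document(text, max_tokens=128, overlap=16):
    """Split a long document into overlapping word-level chunks so each
    fits a model's context window. Overlap reduces the chance that an
    answer span is cut in half at a chunk boundary."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(300))
chunks = chunk_document(doc, max_tokens=128, overlap=16)
print(len(chunks))  # → 3
```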

Current QA research and trends

Current research and trends in question answering systems focus on enhancing their ability to handle complex and varied tasks while improving efficiency and robustness. A key area of development is open-domain question answering, where systems address questions on virtually any topic by using general ontologies and world knowledge.  

Multilingual QA is another significant trend, with models such as XLM-RoBERTa demonstrating the ability to handle multiple languages simultaneously while maintaining performance on par with single-language systems.

The development of multilingual QA systems is crucial for global applications, enabling accessibility across diverse languages and communities.

Similarly, the rise of multimodal QA systems marks a transformative shift, allowing systems to process and integrate information from text, images and audio.

These capabilities are especially valuable for question answering tasks about the content of images or videos, enabling a more comprehensive understanding and the ability to deliver richer, more sophisticated answers.

Efforts are also underway to improve model architectures for better performance and efficiency. Transformer-based models such as BERT, which rely on extensive pretraining to capture nuanced language understanding—made widely accessible through platforms such as Hugging Face—have enhanced QA systems by significantly boosting accuracy, making them viable for real-world applications.  

Current research explores methods to reduce the computational demands of these models through techniques such as model distillation, which trains smaller, more efficient networks to replicate the performance of larger models.
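
The soft-label term at the heart of distillation can be written down directly: the student is trained to match the teacher's temperature-softened output distribution, typically via a KL-divergence term. A minimal sketch, with illustrative logits and temperature:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution; higher temperature
    softens the distribution, exposing more of the teacher's knowledge."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's temperature-
    softened outputs. In practice this is combined with the usual
    hard-label loss during student training."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

# A student matching the teacher exactly incurs zero distillation loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # → 0.0
```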

Also, new datasets are being designed to challenge QA systems further by introducing tasks requiring multistep reasoning, handling ambiguous or unanswerable questions and addressing more complex queries.

Improvements in retrieval methods are another focus area. Modern QA systems often use a two-stage approach, comprising a retriever to identify the most relevant documents and a reader, typically built with an encoder-based architecture, to extract the answer from these documents.

Innovations including dense passage retrieval, which employs deep learning for the retrieval process, are proving effective in enhancing both speed and accuracy. This is particularly important for scaling QA systems to operate efficiently across massive datasets.

Interactivity is also becoming a central feature of next-generation QA systems. Researchers are developing question answering models that can engage in clarifications, refine their understanding of ambiguous queries, reuse prior answers and present responses in more detailed and intuitive formats. 
