HumanEval is a benchmark for assessing the code generation capabilities of large language models (LLMs). It was developed by OpenAI to evaluate early versions of the AI models powering Codex, the company’s software engineering agent.
The HumanEval benchmark is designed specifically to evaluate Python code generation. It goes beyond checking syntax, validating that the generated code is both accurate and functions as intended.
The benchmark’s framework can be accessed on the OpenAI HumanEval GitHub repository. HumanEval also has a leaderboard ranking the performance of different code generation models, including the Claude suite, Kimi K2, Google Gemma and Gemini, GPT-5 and the older GPT-4o and GPT-4, and the IBM® Granite® family, among others.
The HumanEval dataset consists of 164 handwritten programming problems with their corresponding unit tests.1 These problems gauge a model’s ability to comprehend language, manipulate strings, search and sort. They also assess problem-solving skills in terms of simple math and complex algorithms. These programming tasks are similar to the algorithmic questions, coding exercises or system design challenges software developers work through during technical interviews.
Each code generation task contains the following components:
The signature defines the function’s name and parameters. As an example, here is the signature for a function that calculates the product of two integers:
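The original example is not reproduced here, but a signature for the multiply function described above might look like the following (the name and type annotations are illustrative, not verbatim from the HumanEval dataset):

```python
# Illustrative signature for a "product of two integers" task.
# The body is left unimplemented; the model is expected to fill it in.
def multiply(a: int, b: int) -> int:
    ...
```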
A docstring is a natural language prompt or description of the function’s expected behavior, goals, inputs and outputs. These comments outline what a function does, guiding the model when generating Python code.
For instance, the multiply function’s docstring will be:
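A docstring in this style might read as follows (again illustrative, not the verbatim dataset entry). HumanEval docstrings often include worked input-output examples like the ones shown here:

```python
def multiply(a: int, b: int) -> int:
    """Return the product of the two integers a and b.

    >>> multiply(3, 4)
    12
    >>> multiply(-2, 5)
    -10
    """
```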
The solution body is the segment allocated to the code a model produces. It holds the implemented solution to the problem, given the function signature and the docstring.
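For the illustrative multiply task, a completed solution body might be a single return statement:

```python
def multiply(a: int, b: int) -> int:
    """Return the product of the two integers a and b."""
    # The model-generated solution body:
    return a * b
```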
These test cases verify the generated code's functional correctness across different scenarios. Each test feeds specific inputs to the function, then checks the outputs against the expected results.
Here are some sample unit tests for the multiply function:
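The tests below are illustrative, written in the style of HumanEval's `check` convention, where a `check` function asserts expected outputs against a candidate implementation:

```python
def multiply(a: int, b: int) -> int:
    """Return the product of the two integers a and b."""
    return a * b

# HumanEval-style check function: feeds inputs to the candidate
# implementation and asserts the outputs match expected results.
def check(candidate):
    assert candidate(2, 3) == 6
    assert candidate(-4, 5) == -20
    assert candidate(0, 7) == 0
    assert candidate(-3, -3) == 9

check(multiply)  # raises AssertionError if any case fails
```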
Many code LLM benchmarks apply methodologies used for text generation, such as match-based metrics that compare generated code samples to a reference solution. But match-based metrics don’t usually factor in the various ways a problem can be solved, any of which can be functionally equivalent to the reference solution.
That’s why the HumanEval benchmark turned to functional correctness, which deems a generated code sample correct if it passes a suite of unit tests. This approach mirrors how developers assess the success of their code by running it through a series of unit tests and making sure it passes each one.
HumanEval measures functional correctness using the pass@k metric. For each problem, a model generates k code samples. If any of those samples pass the unit tests, then the problem is considered correctly solved. The pass@k metric estimates the probability that at least one of the k samples is functionally correct.
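In practice, the HumanEval paper computes pass@k with an unbiased estimator: generate n ≥ k samples per problem, count the number c that pass the unit tests, and estimate the probability that a random size-k subset contains at least one passing sample. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for a problem
    c: number of those samples that passed the unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Too few failures: every size-k subset contains a passing sample.
        return 1.0
    # Probability that all k drawn samples fail, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is this quantity averaged over all 164 problems. For example, `pass_at_k(n=200, c=37, k=1)` estimates pass@1 for a single problem from 200 generated samples.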
HumanEval is just one of many benchmarks to appraise code LLMs. Software development teams must still evaluate LLM-generated code using their own internal tests and combine multiple metrics for a more comprehensive view of model performance. A human-in-the-loop approach also remains crucial to help ensure the accuracy of AI-generated code and fine-tune and improve machine learning models over time.
Here are some limitations of the HumanEval benchmark:
Programming problems included in the dataset might have been encountered during model training due to their broad availability. The number of problems is also small enough that code generation models can perhaps memorize them all.
Code generation tasks within HumanEval typically fall in the easy-to-medium difficulty range. Yet real-world programming tasks tend to be more complex, encompassing API integrations with multiple systems, huge codebases and large datasets.
The benchmark also fails to reflect the often tangled state of real-world software development environments and workflows: evolving use cases, incomplete test cases, inconsistent requirements, legacy code or vague specifications, to name a few.
There’s more to programming than just functional correctness. For instance, HumanEval doesn’t take into account efficiency. This means LLM-generated code that’s accurate and works as expected might not be the most efficient and optimized solution performance-wise.
The benchmark also doesn’t take into account programming best practices, such as coding conventions, style standards, error handling, input validation and secure coding.
HumanEval is tailored specifically for the open-source Python programming language. Source code generated in other languages must be evaluated using other benchmarks.
The benchmark has a few different versions that address some of its limitations:
HumanEval+
HumanEval-V
HumanEval-X
HumanEvalNext
Each programming problem in HumanEval has an average of around 7 to 8 unit tests.1 HumanEval+ boosts that test coverage significantly to an average of 764 tests per problem for more rigorous assessment.2
HumanEval-V builds upon its predecessor to create a benchmark for multimodal AI models, specifically vision language models (VLMs). It gauges the ability of VLMs to understand and reason over charts, diagrams and graphs in programming contexts, generating code based on flowcharts of algorithms or matrix transformations, for example.
HumanEval-X extends the original benchmark to include the C++, Go, Java and JavaScript programming languages. Its 820 tasks can be used to evaluate code generation and code translation skills.
HumanEvalNext improves upon HumanEval. It adds more context through type annotations (programming syntax to indicate the data types of function parameters and return values), incorporates more edge cases, introduces more unit tests and raises the difficulty of problems.3
1. Evaluating Large Language Models Trained on Code, arXiv, 14 July 2021
2. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, arXiv, 30 October 2023
3. Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality, arXiv, 12 December 2025