HumanEval is a benchmark for assessing the code generation capabilities of large language models (LLMs). It was developed by OpenAI to evaluate early versions of the AI models powering Codex, the company’s software engineering agent.
The HumanEval benchmark is designed specifically to evaluate Python code generation. It goes beyond checking syntax, validating that the generated code is both accurate and functions as intended.
The benchmark’s framework can be accessed on the OpenAI HumanEval GitHub repository. HumanEval also has a leaderboard ranking the performance of different code generation models, including the Claude suite, Kimi K2, Google Gemma and Gemini, GPT-5 and the older GPT-4o and GPT-4, and the IBM® Granite® family, among others.
The HumanEval dataset consists of 164 handwritten programming problems with their corresponding unit tests.1 These problems gauge a model’s ability to comprehend language, manipulate strings, search and sort. They also assess problem-solving skills in terms of simple math and complex algorithms. These programming tasks are similar to the algorithmic questions, coding exercises or system design challenges software developers work through during technical interviews.
Each code generation task contains the following components:
The signature defines the function’s name and parameters. As an example, here is the signature for a function that calculates the product of two integers:
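The original example is not reproduced here, but a signature for the multiply function described above might look like the following (the name and type annotations are illustrative, not verbatim from the HumanEval dataset):

```python
# Illustrative signature for a "product of two integers" task.
# The body is left unimplemented; the model is expected to fill it in.
def multiply(a: int, b: int) -> int:
    ...
```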
A docstring is a natural language prompt or description of the function’s expected behavior, goals, inputs and outputs. These comments outline what a function does, guiding the model when generating Python code.
For instance, the multiply function’s docstring will be:
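A docstring in this style might read as follows (again illustrative, not the verbatim dataset entry). HumanEval docstrings often include worked input-output examples like the ones shown here:

```python
def multiply(a: int, b: int) -> int:
    """Return the product of the two integers a and b.

    >>> multiply(3, 4)
    12
    >>> multiply(-2, 5)
    -10
    """
```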
The solution body is the segment allocated to the code a model produces. It holds the implemented solution to the problem, given the function signature and the docstring.
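For the illustrative multiply task, a completed solution body might be a single return statement:

```python
def multiply(a: int, b: int) -> int:
    """Return the product of the two integers a and b."""
    # The model-generated solution body:
    return a * b
```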
These test cases verify the generated code's functional correctness across different scenarios. Each test feeds specific inputs to the function, then checks the outputs against the expected results.
Here are some sample unit tests for the multiply function:
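The tests below are illustrative, written in the style of HumanEval's `check` convention, where a `check` function asserts expected outputs against a candidate implementation:

```python
def multiply(a: int, b: int) -> int:
    """Return the product of the two integers a and b."""
    return a * b

# HumanEval-style check function: feeds inputs to the candidate
# implementation and asserts the outputs match expected results.
def check(candidate):
    assert candidate(2, 3) == 6
    assert candidate(-4, 5) == -20
    assert candidate(0, 7) == 0
    assert candidate(-3, -3) == 9

check(multiply)  # raises AssertionError if any case fails
```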
Many code LLM benchmarks apply methodologies used for text generation, such as match-based metrics that compare generated code samples to a reference solution. But match-based metrics don’t usually factor in the various ways a problem can be solved, any of which can be functionally equivalent to the reference solution.
That’s why the HumanEval benchmark turned to functional correctness, which deems a generated code sample correct if it passes a suite of unit tests. This approach mirrors how developers assess the success of their code by running it through a series of unit tests and making sure it passes each one.
HumanEval measures functional correctness using the pass@k metric. For each problem, a model generates k code samples. If any of those samples pass the unit tests, then the problem is considered correctly solved. The pass@k metric estimates the probability that at least one of the k samples is functionally correct.
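In practice, the HumanEval paper computes pass@k with an unbiased estimator: generate n ≥ k samples per problem, count the number c that pass the unit tests, and estimate the probability that a random size-k subset contains at least one passing sample. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for a problem
    c: number of those samples that passed the unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Too few failures: every size-k subset contains a passing sample.
        return 1.0
    # Probability that all k drawn samples fail, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is this quantity averaged over all 164 problems. For example, `pass_at_k(n=200, c=37, k=1)` estimates pass@1 for a single problem from 200 generated samples.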
HumanEval is just one of many benchmarks to appraise code LLMs. Software development teams must still evaluate LLM-generated code using their own internal tests and combine multiple metrics for a more comprehensive view of model performance. A human-in-the-loop approach also remains crucial to help ensure the accuracy of AI-generated code and fine-tune and improve machine learning models over time.
Here are some limitations of the HumanEval benchmark:
Programming problems included in the dataset might have been encountered during model training due to their broad availability. The number of problems is also small enough that code generation models can perhaps memorize them all.
Code generation tasks within HumanEval typically fall in the easy-to-medium difficulty range. Yet real-world programming tasks tend to be more complex, encompassing API integrations with multiple systems, huge codebases and large datasets.
The benchmark also fails to reflect the often tangled state of real-world software development environments and workflows: evolving use cases, incomplete test cases, inconsistent requirements, legacy code or vague specifications, to name a few.
There’s more to programming than just functional correctness. For instance, HumanEval doesn’t take into account efficiency. This means LLM-generated code that’s accurate and works as expected might not be the most efficient and optimized solution performance-wise.
The benchmark also doesn’t take into account programming best practices, such as coding conventions, style standards, error handling, input validation and secure coding.
HumanEval is tailored specifically for the open-source Python programming language. Source code generated in other languages must be evaluated using other benchmarks.
The benchmark has a few different versions that address some of its limitations:
HumanEval+
HumanEval-V
HumanEval-X
HumanEvalNext
Each programming problem in HumanEval has an average of around 7 to 8 unit tests.1 HumanEval+ boosts that test coverage significantly to an average of 764 tests per problem for more rigorous assessment.2
HumanEval-V builds upon its predecessor to create a benchmark for multimodal AI models, specifically vision language models (VLMs). It gauges the ability of VLMs to understand and reason over charts, diagrams and graphs in programming contexts, generating code based on flowcharts of algorithms or matrix transformations, for example.
HumanEval-X extends the original benchmark to include the C++, Go, Java and JavaScript programming languages. Its 820 tasks can be used to evaluate code generation and code translation skills.
HumanEvalNext improves upon HumanEval. It adds more context through type annotations (programming syntax to indicate the data types of function parameters and return values), incorporates more edge cases, introduces more unit tests and raises the difficulty of problems.3
1. Evaluating Large Language Models Trained on Code, arXiv, 14 July 2021
2. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, arXiv, 30 October 2023
3. Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality, arXiv, 12 December 2025