Many code LLM benchmarks apply methodologies used for text generation, such as match-based metrics that compare generated code samples to a reference solution. But match-based metrics usually don’t account for the many ways a problem can be solved, each of which can be functionally equivalent to the reference solution.

That’s why the HumanEval benchmark turned to functional correctness, which deems a generated code sample correct if it passes a suite of unit tests. This approach mirrors how developers assess the success of their code by running it through a series of unit tests and making sure it passes each one.
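
As a minimal sketch of the difference, consider a hypothetical problem (`add`) with hypothetical test cases, neither taken from HumanEval. A candidate that is worded differently from the reference fails an exact string match yet passes the same unit tests. The helper names `exact_match` and `passes_unit_tests` are assumptions made for illustration, and `exec` is used only for brevity; generated code is untrusted and would normally be run in a sandbox.

```python
# Hypothetical reference solution and two model-generated candidates.
REFERENCE = "def add(a, b):\n    return a + b\n"
CANDIDATE = "def add(x, y):\n    return sum([x, y])\n"  # different text, same behavior
BROKEN = "def add(a, b):\n    return a - b\n"           # similar text, wrong behavior

# Hypothetical unit tests: (arguments, expected result).
TESTS = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]


def exact_match(candidate, reference):
    """Match-based check: the candidate must equal the reference text."""
    return candidate.strip() == reference.strip()


def passes_unit_tests(source, tests, func_name="add"):
    """Functional check: the candidate must pass every unit test.

    WARNING: exec runs arbitrary code; this is for illustration only.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


print(exact_match(CANDIDATE, REFERENCE))    # False
print(passes_unit_tests(CANDIDATE, TESTS))  # True
print(passes_unit_tests(BROKEN, TESTS))     # False
```

Here the match-based check rejects a correct solution, while the functional check accepts it and still rejects the broken one.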

HumanEval measures functional correctness using the pass@k metric. For each problem, a model generates k code samples, and the problem is considered solved if at least one of those samples passes the unit tests. Averaged across problems, the pass@k metric estimates the probability that at least one of the k samples is functionally correct.
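
One common way to estimate this probability with less variance is to generate n ≥ k samples per problem, count the c samples that pass, and compute 1 − C(n − c, k) / C(n, k), which is the chance that a random draw of k samples from the n contains at least one correct sample. The sketch below assumes that setup; the names `pass_at_k`, `n`, and `c` are introduced here for illustration and are not described in the text above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem (assumed n >= k)
    c: number of those samples that pass all unit tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that all k drawn samples are incorrect, subtracted from 1.
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Example: three problems, 10 samples each, with 0, 3, and 10 passing samples.
print(mean_pass_at_k([(10, 0), (10, 3), (10, 10)], k=1))
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n, which is a quick sanity check on the formula.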