Adverse impact is a critical consideration when making decisions about employees. Anyone using tools to assist with decisions – whether traditional assessment methods or newer technologies such as artificial intelligence (AI) – needs to understand what adverse impact is, how to measure it, and how to address it. Here we discuss some of the data challenges in monitoring and addressing adverse impact.
Defining adverse impact, bias, and fairness
Let’s set the scene by defining adverse impact and describing how it is similar to and different from the related concepts of bias and fairness. Among laypeople, and likely among many HR professionals who are not psychologists, the term adverse impact is used interchangeably with the terms bias and fairness. To psychologists, however, adverse impact, bias, and fairness are distinct concepts.
Adverse impact occurs when different groups (such as men and women, or different race and ethnic groups) are selected at different rates due to an employment practice. The effects of adverse impact often arise from the use of a selection tool, such as a cognitive ability test, but the principles apply to other employment-related outcomes as well, such as who is promoted and how much people are paid. In this overview, we focus on the use of selection tools.
Bias exists where equally capable individuals drawn from different groups are expected to get different scores, or when the same predictive model cannot describe the relationship between a test score and performance equally well across groups.
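One common way psychologists look for the second kind of bias, differential prediction, is to fit the same simple test-score-to-performance model separately for each group and compare the intercepts and slopes. The sketch below illustrates the idea with ordinary least squares on hypothetical scores and performance ratings; the data and group labels are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical test scores and performance ratings for two groups.
group_a_scores, group_a_perf = [10, 20, 30, 40], [2.0, 3.0, 4.0, 5.0]
group_b_scores, group_b_perf = [10, 20, 30, 40], [1.5, 2.5, 3.5, 4.5]

a_intercept, a_slope = fit_line(group_a_scores, group_a_perf)
b_intercept, b_slope = fit_line(group_b_scores, group_b_perf)

# Equal slopes but different intercepts: a single common regression line
# would systematically over-predict performance for one group and
# under-predict it for the other -- one signature of predictive bias.
print(a_intercept, a_slope, b_intercept, b_slope)
```

In practice this comparison is done with moderated regression and significance tests rather than by eyeballing coefficients, but the underlying question is the same: does one model describe both groups equally well?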
Fairness is a social judgment. For example, fairness may relate to perceptions of procedural justice, which reflects employees’ perceptions of processes and the allocation of resources. Unfairness concerns can’t be resolved with statistics alone; you need to engage with affected individuals and groups about their concerns. A good way to minimize concerns about fairness is to involve the groups who will be affected by the decisions of selection systems in the design of those systems.
In the machine learning (ML) community, fairness refers more narrowly to the outcomes of an ML model. There are multiple definitions of fairness in ML, but they typically fall into two categories: individual fairness, where an algorithm is fair if it ensures that equally qualified individuals are treated equally by the algorithm, and group fairness, where an algorithm is deemed fair if its outcomes do not disadvantage one group relative to another.
The legal primacy of adverse impact
The key employment legislation affecting selection processes targets adverse impact. Psychological interpretations of bias and perceptions of fairness are important, but because required standards for these concepts are not prescribed in law, they have taken a back seat in the minds of HR professionals. It is fortunate, then, that there has been considerable research on methods to both detect and address adverse impact.
Methods of assessing adverse impact
The most common method of assessing adverse impact is to check whether the four-fifths rule has been violated, i.e., whether any group is selected at less than 80% of the rate of the majority group. For instance, if 50 out of every 100 men who applied for a role were selected, adverse impact would occur if fewer than 40 out of every 100 women who applied for the same role were selected. The majority group has often been, and often still is, represented by white males under 40 years of age. Today, however, it is common to compare the selection rates of all groups against one another for evidence of adverse impact.
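The four-fifths check is simple enough to sketch in a few lines of Python. The group names and counts below are illustrative, reusing the 50-of-100 versus 38-of-100 style of example from above; the function just compares each group's selection rate against the highest-rate group.

```python
def selection_rates(applied, selected):
    """Selection rate per group: number selected / number who applied."""
    return {g: selected[g] / applied[g] for g in applied}

def four_fifths_violations(applied, selected, threshold=0.8):
    """Flag groups selected at less than `threshold` (four-fifths)
    of the highest group's selection rate; returns {group: impact ratio}."""
    rates = selection_rates(applied, selected)
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()
            if rate / best < threshold}

# Illustrative numbers: 50 of 100 men selected, 38 of 100 women selected.
applied = {"men": 100, "women": 100}
selected = {"men": 50, "women": 38}
print(four_fifths_violations(applied, selected))  # women flagged: 0.76 < 0.80
```

Note that the check compares every group against the highest-scoring group, consistent with the modern practice of comparing all groups against one another rather than against a fixed majority group.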
Checking the four-fifths rule is usually supplemented with a statistical test. This is because the four-fifths rule can be violated in a sample of your job applicants even when it is not violated in the overall population of applicants. The different tests often perform similarly when sample sizes are large. However, the number of applicants for some jobs may be small. Not only that, but you may not have access to the relevant data because, for example, applicants may choose not to self-identify with respect to their protected variable status. Given these practical constraints, how should a decision-maker proceed?
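One widely used supplement to the four-fifths check is a significance test on the difference in selection rates. Below is a sketch of a pooled two-proportion z-test using only the standard library; the applicant counts are hypothetical, and for small samples a practitioner would typically prefer an exact test (such as Fisher's exact test) instead.

```python
import math

def two_proportion_z(selected_a, n_a, selected_b, n_b):
    """Pooled two-proportion z-test for a difference in selection rates.
    Returns (z statistic, two-sided p-value)."""
    rate_a, rate_b = selected_a / n_a, selected_b / n_b
    pooled = (selected_a + selected_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    # Two-sided p-value from the standard normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical sample: 50 of 100 men vs. 38 of 100 women selected.
z, p = two_proportion_z(50, 100, 38, 100)
print(round(z, 2), round(p, 3))
```

With these illustrative numbers the difference violates the four-fifths rule yet is not significant at the conventional 0.05 level, which is exactly why the two checks are run together.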
How to run adverse impact tests when data are missing or incomplete
While there are ways of examining evidence of adverse impact in the absence of data (e.g., an inference might be made by comparing the workforce composition to the local population, or a selection tool may have been shown to produce adverse impact when used in other selection settings, such as in other organizations), the most dependable way is to compare group differences in hiring rates for roles where a selection instrument is used. Here are some suggestions on how to conduct such comparisons when data are missing or incomplete:
Undertake a data discovery audit. We often find early intuitions about the level of missing data either under- or over-estimate the extent of the problem. We also find that more candidates self-identify their group status than choose not to. After verifying the extent of the problem in your existing dataset, check whether any procedural changes need to be made to improve data collection in the future. Perhaps a clearer rationale for collection of this information is needed (e.g., letting applicants know that group status information is necessary to ensure processes are unbiased). It is also possible that information about protected status is not being asked for appropriately, or may not be captured at all. If this is the case, make any changes necessary to ensure that, going forward, you accurately capture all of the data that applicants are disclosing.
Run the analyses you can for the data that you have. Analyzing the data that you do have is an important step in developing the procedures and capabilities that you will need as you move forward. Most organizations are analyzing selection rates based on samples of their applicants, rather than a census of all applicants. The dependability of analyses based on group sizes of fewer than 30 may be questionable because of sampling error. Group sizes of 50 would also be considered small but interpretable with caution, while group sizes of 100 might be considered moderate and acceptable. Importantly, the reliability of results also depends on the proportion selected within each group, not only the absolute number per group. For these reasons, it is unwise to be too prescriptive; interpretation of results depends on the context. If you need help running the appropriate analyses, tools like Adverse Impact Analysis within IBM Watson Recruitment contain functionality to run all the required tests using an intuitive user interface.
Investigate ways to increase sample sizes. Finally, if you find your groups are too small, or if the selection rates across all groups are very low, you may need to wait until additional data can be collected or consider longer durations over which you analyze applicant data. You could also carefully investigate the merits of analyzing flows across jobs or regions (e.g., if they draw from the same applicant pools and involve very similar work). Refer to the Equal Employment Opportunity Commission (EEOC) Uniform Guidelines (1978) for more discussion of these issues.
What to do when you observe adverse impact
If, following your tests, adverse impact is observed, it is important to examine the job relatedness of your selection procedure. Job relatedness is a defense against adverse impact claims, albeit with a requirement to look for equally valid methods with less adverse impact. Nonetheless, even a selection method that is job-related, such as a cognitive test, may be considered undesirable by many of us if it produces adverse impact.
It is also important to examine what part of the selection tool or practice might be contributing to the group differences. Psychologists have identified other techniques that can reduce adverse impact; however, these usually come at the cost of decreased accuracy in the prediction of job performance. An experienced practitioner can help with appropriate trade-off decisions.
Methods of reducing adverse impact
Adjusting cut scores or norms
Methods to reduce adverse impact include lowering the cut score on the test that produces adverse impact, or potentially adjusting norms, so long as all applicants applying to the same job are treated in the same way. Cut scores or norms should not be applied differently for different groups applying to the same job; doing so could result in inappropriate differences in selection or pass rates on a given selection tool. Norms, also referred to as benchmarks, are often used to compare applicants to a group of other applicants. They aid the interpretation of assessment scores, but they should not differ between groups of people based on, for example, race or ethnicity. When cut scores and norms are implemented, there is a need to balance the number of qualified applicants passing through the candidate funnel against the potential for adverse impact.
Alternatively, psychologists may employ techniques such as score banding to reduce adverse impact. Score banding treats similar scores as equivalent scores. There are multiple approaches to banding, and again, any reduced adverse impact comes at the cost of predictive accuracy.
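To make the banding idea concrete, here is a minimal sketch of one simple approach: fixed-width bands anchored at the top score, with every score inside a band treated as equivalent. The applicant names, scores, and five-point band width are hypothetical; in practice the band width is often derived psychometrically, for example from the standard error of the difference between two scores.

```python
def band_scores(scores, band_width):
    """Assign each score to a fixed-width band anchored at the top score.
    Scores in the same band are treated as equivalent; band 0 is the top band."""
    top = max(scores.values())
    return {name: int((top - s) // band_width) for name, s in scores.items()}

# Hypothetical applicant scores with an assumed band width of 5 points.
scores = {"ana": 92, "ben": 90, "kim": 84, "lee": 83}
print(band_scores(scores, band_width=5))
# ana and ben fall in the top band; kim and lee share the next band.
```

Within a band, selection can then proceed on other job-related criteria, which is how banding trades a little predictive precision for a reduction in score-driven group differences.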
Broadening definitions of job performance
It is also possible to broaden the interpretation of job performance beyond the quality and quantity of work performed to include extra-role behavior such as volunteering for additional work and continuing to work hard when work is stressful. When a wider view of performance is taken, the best predictors of job performance broaden to include tests for which there is less adverse impact. These include constructs like personality and other methods like situational judgment tests.
Using alternative statistical approaches
Finally, statistical approaches such as Pareto optimal selection can be applied that calculate the minimum adverse impact possible, given a minimum desired level of predictive validity.
Training the AI models
In machine learning (ML), addressing any group differences found through adverse impact analysis should be handled primarily through the training of the artificial intelligence (AI) models. Because machine learning discovers patterns in the training data, the primary focus of any remediation should be an examination of the training data.
First to consider is the field used as the target for training (called the “label” in machine learning). This field could have a strong correlation with gender, race, or other group-membership values. As one example, in the training data for IBM Watson Recruitment’s Success Score feature, one of the input fields is the “success” field, where the client provides an indication of whether they consider the candidate to have been a successful hire. This target field is based on a performance measure such as hired vs. not hired (not ideal) or hired and was a good performer (better). If adverse impact exists in the coding of success (yes or no) in this training data, then it can potentially cause adverse impact in the AI model.
A second way that training data could be contributing to adverse impact is when one of the input fields in the training data strongly correlates with group membership. This input field then becomes a proxy for the group membership. For example, the university someone attended could serve as a proxy for gender or race. One example of this is historically black colleges in the United States. One way to address this is to exclude this kind of field (university attended) as an input to the training of the AI model.
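A quick screen for this kind of proxy is to measure the statistical association between each candidate input field and group membership; for two categorical fields, Cramér's V (a chi-square-based measure running from 0 for no association to 1 for perfect association) is a common choice. The sketch below uses only the standard library, and the university and group values are deliberately artificial to show a perfect proxy.

```python
from collections import Counter
import math

def cramers_v(xs, ys):
    """Cramér's V between two categorical variables: 0 = no association,
    1 = perfect association. A high value flags a potential proxy field."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n          # expected count if independent
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    k = min(len(px), len(py)) - 1
    return math.sqrt(chi2 / (n * k))

# Artificial data where university attended perfectly predicts group membership.
university = ["u1", "u1", "u1", "u2", "u2", "u2"]
group      = ["a",  "a",  "a",  "b",  "b",  "b"]
print(cramers_v(university, group))  # perfect proxy -> 1.0
```

A field that scores high on such a screen is a candidate for exclusion from model training, exactly as described above for university attended.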
The IBM Watson Recruitment Trainer tool allows users to turn on and off feature inputs (e.g. university) to the AI model training process. It is worth noting that the Success Score feature already adjusts for this specific potential problem by abstracting away from specific universities to only using groups (tiers) of universities in the model.
A summary of recommendations
Be diligent. Routinely monitor selection systems and respective tools for the potential of adverse impact on an annual or more frequent basis. Involve representatives of diverse groups in the design of selection systems.
Take care with sample sizes. Sample sizes matter, and having more than 100 people in each group is ideal. If only smaller samples are available, analyze what you have, but interpret with caution and continue to gather more data.
Combine jobs where appropriate. When samples are too small, one approach is to move up from the job role level to the job family level, which should give a larger sample for adverse impact analysis. Jobs should be aggregated only after careful consideration of whether the work is similar and applicants come from the same applicant pool.
Use expert guidance. Adverse impact analyses should be interpreted by experienced practitioners, either internal to your organization, or through a third party such as IBM’s industrial-organizational psychologists and consultants.
Find the source. If adverse impact is detected, investigate what tool or process in the selection system is creating the selection rate or score differences between groups.
Be aware of inherent adverse impact. Investigate criterion, or performance data, to identify inherent adverse impact as a result of your organization’s success metrics.
Use validated assessments. Any selection tool that screens out applicants from the recruitment process should be validated and job related.
To find out more about how IBM could help your organization address the challenges of adverse impact, schedule a personalized demo.