Back in June, we unveiled our company’s commitment to change as the global Black Lives Matter movement – and the incidents that sparked it – took the world by storm. We acknowledged our shortcomings as a company, from the lack of visible diversity in our leadership team to the need to improve the design of Casper, our situational judgement test. Today, we’ve released a new research report that sheds light on some of the causes of demographic differences in standardized test performance and shows how improved test design can help minimize bias, supporting a fairer and more equitable admissions process. Here are some of the things we uncovered:
There are two main categories of differences between demographic groups on assessments
The differences we see in test performance between demographic groups can be explained by two kinds of variance: construct-relevant and construct-irrelevant. What are they?
- Construct-relevant variance means that differences in test scores are directly related to the constructs the test is designed to measure. For Casper, this might mean differences in English language communication ability, ethics, or professionalism relative to the cultural norms of the country where the applicant’s program is based. Each country tends to set its own standards for professions, rooted in the rules and social practices of its society. This means changes to the design of a standardized test are not sufficient to address some of the differences we see; a large-scale review of education systems or an update to the standards being measured would be more appropriate.
- Construct-irrelevant variance is test bias. For Casper, one example of test bias is an applicant who speaks English as a second language not understanding a slang phrase used in a scenario, which in turn makes the questions difficult to answer. Another is a rater assigning low scores based on easy-to-identify problems with a response, like grammatical mistakes, instead of more nuanced problems in the content, like a lack of empathy or subtle forms of racism or discrimination. This type of variance can be addressed with improvements in test design and delivery.
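The report’s own methodology is not reproduced here, but a common way researchers quantify the kind of demographic score differences discussed above is the standardized mean difference (Cohen’s d): a value near zero suggests little difference in average performance between two groups, while larger values flag a gap worth investigating for construct-irrelevant causes. A minimal sketch, using only Python’s standard library and made-up score samples purely for illustration:

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups' test scores.

    Divides the gap in group means by the pooled standard deviation,
    so the result is expressed in standard-deviation units.
    """
    na, nb = len(group_a), len(group_b)
    # Pooled standard deviation across both groups
    pooled = sqrt(((na - 1) * stdev(group_a) ** 2 +
                   (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / pooled

# Hypothetical score samples (not real Casper data)
scores_group_a = [6.1, 5.8, 6.4, 6.0, 5.9, 6.2]
scores_group_b = [5.7, 5.9, 5.6, 6.0, 5.8, 5.5]
print(round(cohens_d(scores_group_a, scores_group_b), 2))
```

Note that an effect size alone cannot say whether a gap is construct-relevant or construct-irrelevant; that judgment requires examining the test content and rating process, as described above.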
Key existing features help minimize demographic differences to a limited extent
Many key features of Casper were selected to reduce test bias while still maintaining high reliability and validity. Existing literature suggests that these features will likely lead to lower test bias, and therefore smaller demographic differences, than other types of tests. This means that changing any of these features will likely increase demographic differences on Casper.
Some of the contributing factors behind these smaller demographic differences include:
- use of behavior-based questions to focus on non-cognitive competencies
- open-ended response format, which means there are no right or wrong answers and applicants can focus on providing the rationale for their answers
- lack of job-specific content so as not to disadvantage applicants who may not have had access to opportunities in the field
- diverse pool of highly trained human raters who are blinded to applicants’ identities
There are four key areas to improve Casper test design
The data in our research report has helped us to identify four key areas to further reduce bias in the Casper test:
- Interview raters to better understand their cognitive processes while rating
- Invest more in subject matter expert review of Casper video scripts
- Review performance on past scenarios to detect and understand any concerning trends or themes
- Update rater and content creator training to target specific types of bias
Download our research report to learn more about how we measured demographic differences in Casper test performance and details on our next steps.