Researchers compared minority-majority differences in performance across different SJT response formats. In two field experiments, the first study (n = 274) showed that written constructed responses produce much smaller minority-majority differences than traditional multiple-choice formats (d = .28 vs. d = .92). The audiovisual format further reduced minority-majority differences (d = .09 vs. d = .41), with validities remaining the same. Scores from raters evaluating transcribed audiovisual responses, which anonymized test-takers, produced larger differences. In sum, altering response modality via more realistic response formats (i.e., an audiovisual constructed format) leads to significant reductions in minority-majority differences without impairing criterion-related validity.
This book chapter provides a brief overview of the history of Situational Judgement Tests (SJTs) and the rationale and theory behind their effectiveness as selection tests. It goes on to summarize the research regarding their reliability and validity, as well as future trends in SJT research.
The authors conducted an extensive literature review to examine the effectiveness of various selection methods. Their results suggest that structured interviews/MMIs, SJTs, and Selection Centres are more effective and generally fairer than traditional interviews, references, and personal statements. The evidence is mixed regarding the effectiveness and fairness of aptitude tests, and more long-term validity evidence is required for personality assessments.
Of the 740 violations among 235 physicians that led to disciplinary action by 40 state medical boards, unprofessional behavior was the basis for 74% of the violations. For 94% of those who were disciplined, one or more violations involved unprofessional behavior. Unprofessional behavior as a student was by far the strongest predictor of disciplinary action, above and beyond undergraduate MCAT scores and grades in the first 2 years of medical school.
A large sample of medical students in Belgium completed a closed-response SJT in video format in 2000 (n = 1,159), and another group completed a written-format version in 2003 (n = 1,700). The predictive validity of the video format was markedly higher than that of the text format (r = .34 vs. r = .08) in predicting scores on interpersonally-oriented courses; the text format also had higher cognitive loading (r = .18) than the video format (r = .11); and the video-based SJT had more incremental predictive validity over cognitive predictors than the written SJT (11% vs. 1%). The video-based SJT has better fidelity and more closely resembles the outcome variable, which accounts for its higher validity in predicting interpersonal outcomes. Both SJTs were also viewed as more face valid (i.e., related to activities in medical education) than other parts of the exam (a cognitive ability measure).
Based on a sample of undergraduate students in the United States (n = 411), coaching improved performance on one of the closed-response SJTs (the CSQ, +.24 SD) but not the other (the SJI). Coaching was useful only when the SJT was straightforward and simple coaching strategies could be used; it was ineffective when the SJT itself and the coaching strategies were more complex. However, coaching effects did not influence the validity of the SJTs in predicting academic performance.
Many programs in Quebec and Ontario accept scores from both the English and French versions of CASPer, so it is important to ensure that an applicant is not advantaged or disadvantaged by taking one version over the other. There were some indications that the French version may be slightly more difficult than the English version of the CASPer test. However, it is unclear whether this difficulty is due to the difficulty of the individual test items or to differences in the characteristics of the cohort. Nevertheless, a comparison of the psychometric indicators suggests that both French and English versions of CASPer are psychometrically sound and equivalent.
Researchers examined how CASPer can add to other non-cognitive admissions measures collected at their institution (interview, Defining Issues Test, Professional Identity Formation). Based on a sample of two entering cohorts (n = 596), CASPer and in-person admissions interview ratings had significant positive correlations, suggesting that CASPer can contribute to effective screening processes. In addition, CASPer demonstrated statistically significant positive relationships with professional identity (CASPer and PIE, r = .10, p < .05) and a measure of moral reasoning (CASPer and DIT2 type indicator, r = .09, p < .05). The association between CASPer and PIE remained consistent even after controlling for MCAT, interview, and GPA.
Traditional cognitive assessments such as MCAT and GPA are known to restrict diversity as minority applicants tend to score lower than their majority counterparts. Noncognitive assessments have been shown to be less harmful to diversity as the subgroup differences tend to be smaller. Based on a sample of 9,096 applicants, the subgroup differences for CASPer were smaller than for GPA and MCAT.
Of the 77 applicants applying for general surgery residency, 85% of respondents reported that completing the CASPer as a requirement for submitting a residency application would either have no bearing on their decision or would make them more likely to apply to the program. Applicants felt that the traditional method (i.e., a panel of faculty members assessing the ERAS application) was more accurate but less objective than the CASPer, while faculty members thought the CASPer would be more accurate than and equally as objective as the traditional method. Using data from the 23 applicants who took the CASPer test, the traditional assessment and the CASPer correlated with applicant rank to a similar degree (ρ = .45, p = .03 and ρ = .41, p = .06, respectively).
Scores on the CASPer-SJT from a cohort of Canadian medical students (n = 109) in 2006/2007 and 2008/2009 predicted later scores on the personal and professional characteristic subsections (CLEO and PHELO) of the Canadian medical licensing exam taken three to six years after training (r = .3 to .5).
The MMI has been useful for assessing the noncognitive skills of medical applicants, but it is not feasible to administer to the thousands who apply, so researchers developed the Computer-based Multiple Sample Evaluation of Noncognitive Skills (CMSENS) to extend the interview to all medical school candidates. In a sample of one cohort of McMaster medical students (n = 110), both the audio and typewritten versions of the CMSENS demonstrated good test generalizability (~.8) and good inter-rater reliability (~.8), but only the typewritten version demonstrated concurrent validity with the MMI (r = .5), and rating the audio responses required much more time. Thus, in the subsequent study (n = 167), only the typewritten version of the CMSENS (i.e., CASPer) was used; it replicated the results of the first study in a larger sample (generalizability = .8, inter-rater reliability = .9, correlation with the MMI corrected for disattenuation = .6) and was additionally found to correlate with MCAT verbal reasoning (r = .38).