The Double-Edged Sword of Practice for Testing

The age-old cliché “Practice makes perfect” is one no student is a stranger to. Generally speaking, it holds for most tests: students tend to perform better the second time around (though this effect is often exaggerated). Educators often want to encourage their students to spend more time studying for tests so that their efforts are rewarded with a higher grade. A higher retake score is not a concern if it is a more accurate representation of the student’s true abilities – for instance, a student who studies harder for a biology test likely has a stronger foundation of biological knowledge, and their test score should reflect that effort.

However, for standardized testing, where test makers are often more interested in assessing students’ cumulative knowledge over a longer period of time, practice effects are seen as a threat because they may introduce “construct-irrelevant variance”.

For example, a math test is meant to measure a student’s math skills, so anything else the test measures is considered irrelevant and makes the scores less valid. If the math test consists of dense, text-heavy paragraphs, it may be testing not only the student’s math skills but also their reading comprehension – and so the resulting score would not be a pure assessment of their math skills.

When students retake a test, there is some concern that another irrelevant variable may be introduced that compromises the validity of the test. For instance, retest scores may be higher because students remembered some of the questions, or because they used a better test-taking strategy the second time around. This is problematic because the test is now partly measuring the student’s memory or their knowledge of test-taking strategies.

Additionally, practice effects can give an unfair advantage to repeat applicants who have already seen some of the content of the test during their prior application. A more pressing concern is that opportunities for practice are often tied to students’ financial status – students who have the resources to put time and money into test preparation courses and materials will have more content to practice from. Hence, including a test with large practice effects in the admissions process may further disadvantage applicants from low-SES backgrounds who are already greatly disadvantaged.

Looking across the different types of standardized testing, a recently published review paper found that retest score increases were highest for knowledge tests, compared to cognitive tests and method-oriented tests (e.g., traditional situational judgment tests, biodata). Interestingly, the greatest score gains were observed among high-ability, female, younger, and White test takers. Retest scores also tended to be higher when the stakes were high – as is the case for admissions and employment selection. There was minimal evidence of score improvements for rating-based assessments such as interviews and assessment centers.

This aligns with the discussion in our previous blog post, which highlighted the resilience of Situational Judgement Tests (SJTs) to coaching effects – particularly for more complex and challenging SJTs. In other words, for a constructed-response SJT like CASPer®, where students’ responses are scored by a panel of raters, there seems to be little evidence of a practice effect. In contrast, one study found that students tended to do slightly better (an increase of about one-third of a standard deviation) when retaking a traditional closed-response SJT, though both the original and retake scores were found to be equally valid. We examined this possibility further by looking at our internal data across all test-takers thus far to see whether there is any evidence of practice effects with CASPer®, a constructed-response SJT.
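As an aside, the “one-third of a standard deviation” figure is a standardized mean difference, commonly reported as Cohen’s d. A minimal sketch of how such an effect size might be computed – using entirely made-up original and retake scores, not data from any actual study:

```python
import statistics

# Hypothetical scores: ten students' original attempts and their retakes.
original = [68, 72, 75, 70, 74, 69, 73, 71, 76, 70]
retake = [69, 73, 75, 71, 75, 69, 74, 72, 77, 71]

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5

print(round(cohens_d(original, retake), 2))  # → 0.3
```

With these fabricated numbers, the retake mean is 0.8 points higher against a pooled standard deviation of about 2.7, yielding d ≈ 0.3 – roughly the one-third-SD gain reported for closed-response SJTs.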

Practice Effect for CASPer®

Although we offer a number of sample scenarios and questions online so that all test-takers can familiarize themselves with the format of the test, we were interested in whether any additional benefit is gained from seeing even more.

One way to examine this question is to look within the test itself: do students’ scores improve as they complete each section? If a practice effect exists, students should do better and better with each section as they gain more exposure to the content of the test. Because the order of the sections is completely randomized across test takers, section difficulty is not confounded with section order.
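To make the setup concrete, here is a minimal sketch of this kind of analysis in Python, using simulated data rather than our actual scores or analysis pipeline. If section position has no effect, the mean score at each position should stay essentially flat:

```python
import random
import statistics

random.seed(0)

NUM_TAKERS = 1000
NUM_SECTIONS = 12

# Each row holds one simulated test taker's scores in the order the
# sections appeared. Scores are drawn independently of position (i.e.,
# no practice effect is built in), on a 1-9 rating scale with noise.
scores_by_position = [
    [min(9.0, max(1.0, random.gauss(5, 1.5))) for _ in range(NUM_SECTIONS)]
    for _ in range(NUM_TAKERS)
]

# Mean score at each position in the test. A practice effect would show
# up as a rising trend from the first section to the last.
position_means = [
    statistics.mean(row[pos] for row in scores_by_position)
    for pos in range(NUM_SECTIONS)
]
print([round(m, 2) for m in position_means])
```

Because section order is randomized across test takers, a flat line of position means is evidence against a practice effect; in this simulation the means all hover near 5 with only sampling noise separating them.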

From our examination of all test-takers thus far, we see no change in CASPer® scores as students progress through the test. In other words, there seems to be no evidence of any practice effect with CASPer®.

Looking at this another way – focusing on a single scenario of the test – we see that the average score students receive on that scenario does not fluctuate in any meaningful pattern, regardless of where in the test it was presented.

This finding also suggests that the test is resilient to fatigue effects, in which a test becomes so long that performance begins to suffer. Since there is no decline in CASPer® scores across the 12 sections, the 75-minute test does not appear to be too long for students.

From the data above, and given our continued efforts to ensure test security, programs can be confident that the scores they receive are a fair and accurate representation of students’ intrapersonal and interpersonal abilities. CASPer®, as an assessment of inherent professionalism, breaks free from that age-old cliché, showing that practice does not, in fact, make perfect.

Published: November 22, 2017
By: Christopher Zou, Ph.D.
Education Researcher at Altus Assessments