What Makes a “Good” Test?

As we’ve talked about in previous blog posts, we want our tests to produce scores that are both valid and reliable in predicting some meaningful outcome. We’ve discussed ways to assess validity and reliability, but how do we actually go about creating a test that is valid and reliable? A good test = good questions. It sounds simple, but figuring out what exactly makes a test question “good” is often difficult. In this post, we’ll go over different ways to assess whether your test questions are doing what they’re supposed to do, so you can feel more confident that your test is delivering meaningful results.

1. Face validity

The first step is to look at the face validity of your questions. Face validity means that your test questions look like they’re measuring what you are actually trying to measure. For example, if you’re trying to evaluate students’ knowledge of Greek mythology, questions about American history wouldn’t be face valid. Ensuring face validity isn’t as easy as it might sound. One of the most common complaints from test-takers is about the seeming irrelevance of test items. You’ve probably heard the following at some point or another:
- “The questions had nothing to do with the readings or the lectures.”
- “I studied hard for the test, but I was never asked about anything I studied.”
- “The test asked too many questions about minor details, and not about the overall conceptual knowledge.”
A good test is face valid overall, differentiates the strong students from the weaker ones, and is neither too easy nor too difficult. A collection of good test items will ensure that you have a good test. Ideally, you want your overall test scores to follow approximately a normal distribution, where most students perform somewhere around the average and a few students end up at either extreme. This means that you’re able to differentiate between the stronger and the weaker students, and to identify the top performers and high achievers.

At Altus Assessments, we continuously strive to ensure that CASPer® generates scores that reliably assess the personal and professional competencies of applicants, which predict meaningful outcomes in school and, more importantly, on the job. Working with our content creators, we regularly and closely assess each of our test items to ensure that they measure the essential competencies valued by our academic partners. We monitor each question’s item difficulty and item discrimination using a variety of metrics, so that our overall test scores will follow a normal distribution.

The graph on the left illustrates the distribution of CASPer® scores from one of our recent test sessions for U.S. medical programs. As you can see, the scores follow a normal distribution, which allows programs both to “select out” the weaker applicants and to “select in” the superstar candidates who possess strong intrapersonal and interpersonal skills. If the scores were clustered at the upper end of the spectrum, like the graph on the right, it would suggest that candidates generally did well and that the test may have been too easy. While this is useful for weeding out the weaker applicants at the lower end of the spectrum, a negatively skewed distribution would not help programs distinguish the top candidates from the rest of the pack. CASPer® offers the best of both worlds.
It helps programs avoid admitting potentially problematic students, while helping them identify the students with the highest potential for success.
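To make the item statistics above concrete, here is a minimal sketch of classical item analysis in plain Python. This is a toy illustration, not the actual CASPer® scoring pipeline: the response matrix, the 1/0 correct–incorrect coding, and all function names are made up for the example (CASPer® responses are rated on a scale, not scored right/wrong). It computes item difficulty (proportion correct), item discrimination (corrected item–total correlation, a point-biserial), and the skewness of the total-score distribution, which is how you would spot the “too easy,” negatively skewed case described above.

```python
# Toy classical item analysis: difficulty, discrimination, and score skew.
# Rows are test-takers, columns are items; 1 = correct, 0 = incorrect.
# All data and names here are hypothetical, for illustration only.
from statistics import mean, pstdev

responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

def item_difficulty(item):
    """Proportion of test-takers answering the item correctly.
    Higher values mean an *easier* item."""
    return mean(row[item] for row in responses)

def item_discrimination(item):
    """Corrected item-total correlation: correlation between the item score
    and the total score on the *remaining* items. Positive values mean the
    item separates stronger test-takers from weaker ones."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sx, sy = pstdev(item_scores), pstdev(rest_scores)
    if sx == 0 or sy == 0:
        return 0.0  # item with no variance cannot discriminate
    mx, my = mean(item_scores), mean(rest_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, rest_scores))
    return cov / (sx * sy)

def skewness(xs):
    """Fisher-Pearson skewness; negative values indicate scores clustered
    at the upper end (a test that may be too easy)."""
    m, sd = mean(xs), pstdev(xs)
    return mean((x - m) ** 3 for x in xs) / sd ** 3

for i in range(4):
    print(f"item {i}: difficulty={item_difficulty(i):.2f}, "
          f"discrimination={item_discrimination(i):.2f}")

totals = [sum(row) for row in responses]
print(f"total-score skewness: {skewness(totals):.2f}")
```

In practice, items with very high difficulty values (nearly everyone correct) and low discrimination values are the ones to revise or drop, since they contribute little to separating candidates.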
Published: September 22, 2017 | By: Christopher Zou, Ph.D., Education Researcher at Altus Assessments