What Makes a “Good” Test?
As we’ve talked about in previous blog posts, we want our tests to produce scores that are both valid and reliable in predicting some meaningful outcome. We’ve discussed the ways to assess validity and reliability, but how do we actually go about creating a test that is valid and reliable?
A good test = good questions.
It sounds simple, but figuring out exactly what makes a test question “good” is often difficult.
In this post, we’ll go over different ways to assess whether your test questions are doing what they’re supposed to do, so you can feel more confident that your test is delivering meaningful results.
1. Face validity
The first step is to look at the face validity of your question. Face validity means that your test questions look like they’re measuring what you are indeed trying to measure. For example, if you’re trying to evaluate students’ knowledge of Greek mythology, questions about American history wouldn’t be face valid.
Ensuring face validity isn’t as easy as it might sound. One of the most common complaints from test-takers is about the seeming irrelevance of test items. You’ve probably heard the following at some point or another:
- “The questions had nothing to do with the readings or the lectures.”
- “I studied hard for the test, but I was never asked about anything I studied.”
- “The test asked too many questions about minor details, and not about the overall conceptual knowledge.”
When students don’t feel like the questions on a test measured the important aspects of their knowledge, that’s an issue of face validity. And while we may not always agree with the students’ perceptions, making sure that the items are face valid is important for the acceptability of the test (or “political validity,” as we talked about in a previous post).
Being known among students as a “bad” test maker can hurt your teaching evaluations, and as a consequence, fewer students may enroll in your future classes.
2. Item discrimination
If you ask students in an astronomy class “What is in the centre of our solar system?”, the resounding answer would (hopefully) be “the Sun.” Because the question is relevant to astronomy, it’s a face valid question. But it has low item discrimination.
In other words, that particular question does a poor job of differentiating between students who are highly knowledgeable about astronomy and those who never opened their textbook. Because nearly everyone answers it correctly, it doesn’t help discriminate between those who mastered the material and those who did not.
There are a number of ways to assess item discrimination, with the most common method being the point-biserial (or item-total) correlation. This method looks at the correlation between a student’s score on one item and their score on the overall test.
Because it is a correlation, the score ranges from -1.0 to +1.0, with anything below 0 indicating that there’s a problem with that particular item. When a question is discriminating negatively, it means that knowledgeable students are getting the wrong answer to the question, and students who did poorly on the overall test are getting the right answer to the question.
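To make the item-total correlation concrete, here is a minimal sketch in Python. The response matrix is made up for illustration (six hypothetical students, four right/wrong items), and the function names are our own, not part of any particular library. The total score here includes the item itself, as described above; some analysts prefer to leave the item out of the total, which this sketch does not do.

```python
from math import sqrt, isnan

def pearson(x, y):
    """Pearson correlation; returns NaN when either variable has no spread."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

def item_total_correlations(responses):
    """responses: one row per student, each row a list of 0/1 item scores."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[i] for row in responses], totals)
            for i in range(n_items)]

# Hypothetical data: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

correlations = item_total_correlations(responses)
for i, r in enumerate(correlations):
    if isnan(r):
        label = "no spread (everyone scored the same)"
    elif r < 0:
        label = "negative: review this item"
    else:
        label = "positive"
    print(f"item {i}: r = {r:.2f} ({label})")
```

In this made-up data, the first item behaves like the “Sun” question: everyone gets it right, so no correlation can be computed at all. The last item comes out negative because only lower-scoring students answered it correctly.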
A simpler method of assessing item discrimination is to look at the item’s standard deviation — that is, to look at the spread of the scores on a particular item to make sure that students are producing a range of scores. A standard deviation close to zero means that students are generally getting the same scores on that particular item, making the item a poor discriminator. A high standard deviation indicates a good discriminator.
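One major caveat with the standard deviation method is that it’s only useful in assessing item discrimination when the questions are scored on a scale, such as with essay grades. It’s not useful for multiple-choice tests with right or wrong answers, because the standard deviation of a right-or-wrong item is determined entirely by the proportion of students who answered correctly and can never exceed 0.5. For those questions, it tells you about the item’s difficulty rather than how well it discriminates.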
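As a rough illustration of the standard deviation method and its caveat, the sketch below compares two essay-style items and then shows what happens with right-or-wrong scoring. All scores are invented, and the 0-to-10 essay scale is an assumption made for the example.

```python
from math import sqrt
from statistics import pstdev

# Hypothetical essay items, each scored from 0 to 10 for six students.
spread_item = [2, 5, 7, 8, 9, 4]   # students produce a range of scores
flat_item = [7, 7, 7, 7, 7, 7]     # every student gets the same score

sd_spread = pstdev(spread_item)
sd_flat = pstdev(flat_item)
print(f"spread item SD: {sd_spread:.2f}")
print(f"flat item SD:   {sd_flat:.2f}")   # zero spread, poor discriminator

def binary_item_sd(p):
    """SD of a right/wrong (0/1) item when a proportion p answers correctly."""
    return sqrt(p * (1 - p))

# For 0/1 items the SD depends only on p and peaks at p = 0.5.
binary_sds = {p / 10: binary_item_sd(p / 10) for p in range(0, 11)}
largest_binary_sd = max(binary_sds.values())
print(f"largest possible SD for a right/wrong item: {largest_binary_sd:.2f}")
```

The last part shows why the method breaks down for multiple-choice questions: once you know the proportion of correct answers, the standard deviation adds no new information.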
3. Item difficulty
“That test was too hard” may be the most commonly heard complaint from students. Most of the time, test makers shrug their shoulders and assume that students are just being students. But test difficulty is and should be a legitimate concern for test creators.
Tests should not be so difficult that no one gets the right answer. But they also shouldn’t be so easy that everyone gets the right response. Item difficulty is closely related to item discrimination; questions that are too difficult or too easy will tend to have poor item discrimination, as most students will produce similar scores in both instances.
Thankfully, there is a straightforward way to address students’ concerns about item difficulty using psychometrics. Instructors can calculate the item difficulty statistic, which is simply the proportion of students who got the correct response, or in the case of scaled scoring (such as essay grades), the students’ mean score.
If the average score on the test was quite high, or if a large proportion of students answered the question correctly, then it would be hard for a student to argue that the test was too difficult.
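Here is a minimal sketch of the item difficulty calculation for right-or-wrong items. The response data are hypothetical, and the cut-offs used to flag items (below 0.2 as possibly too hard, above 0.9 as possibly too easy) are assumptions chosen for illustration, not established standards.

```python
def item_difficulty(responses):
    """Proportion of students answering each item correctly.

    responses: one row per student, each row a list of 0/1 item scores.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students
            for i in range(n_items)]

def flag_items(difficulties, too_hard=0.2, too_easy=0.9):
    """Label each item using illustrative (assumed) cut-offs."""
    labels = []
    for p in difficulties:
        if p < too_hard:
            labels.append("possibly too hard")
        elif p > too_easy:
            labels.append("possibly too easy")
        else:
            labels.append("ok")
    return labels

# Hypothetical data: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

difficulties = item_difficulty(responses)
labels = flag_items(difficulties)
for i, (p, label) in enumerate(zip(difficulties, labels)):
    print(f"item {i}: proportion correct = {p:.2f} ({label})")
```

For scaled items such as essays, the same idea applies with the mean score in place of the proportion correct.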
A good test is overall face valid, is able to differentiate the strong students from the weaker ones, and is neither too easy nor too difficult. A collection of good test items will ensure that you have a good test.
Ideally, you want your overall test scores to look approximately like a normal distribution, where most students perform somewhere around the average, and a few students end up in either extreme. This means that you’re able to differentiate between the stronger and the weaker students, and that you’re able to identify the top performers and high achievers.
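One rough way to check the shape of your overall scores is to compute their skewness, which is close to zero for a symmetric distribution and negative when scores pile up at the high end. The sketch below uses the simple moment-based formula with two invented sets of test scores; it is a quick check, not a formal test of normality.

```python
def skewness(scores):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    if m2 == 0:
        return float("nan")  # no spread, skewness is undefined
    return m3 / m2 ** 1.5

# Hypothetical overall test scores (percent correct).
balanced_test = [50, 60, 65, 70, 70, 75, 80, 90]   # symmetric around 70
easy_test = [95, 92, 90, 90, 88, 85, 70, 50]       # most students near the top

skew_balanced = skewness(balanced_test)
skew_easy = skewness(easy_test)
print(f"balanced test skewness: {skew_balanced:.2f}")
print(f"easy test skewness:     {skew_easy:.2f}")   # negative: long tail of low scores
```

A clearly negative value suggests the test may have been too easy to separate the strongest students from the rest.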
At Altus Assessments, we continuously strive to ensure that CASPer® generates scores that reliably assess the personal and professional competencies of applicants, which predict meaningful outcomes in school and, more importantly, on the job.
Working with our content creators, we closely assess each of our test items on a regular basis to ensure that they’re assessing the essential competencies valued by our academic partners. We monitor each question’s item difficulty and item discrimination using a variety of metrics, so that our overall test scores will generate a normal distribution.
The graph on the left illustrates the distribution of CASPer® scores on one of our recent test sessions for U.S. medical programs. As you can see, the scores follow a normal distribution, which allows programs to “select-out” the weaker applicants, but also to “select-in” the superstar candidates who possess strong intrapersonal and interpersonal skills.
If the scores were too clustered at the upper end of the spectrum, like the graph on the right, it would show that candidates generally did well and that the test may have been too easy. While this is useful in weeding out the weaker applicants on the lower end of the spectrum, a negatively skewed distribution would not help programs distinguish the top candidates from the rest of the pack.
CASPer® offers the best of both worlds. It helps programs avoid admitting potentially problematic students, while helping them identify the students with the highest potential for success.
Published: September 22, 2017
By: Christopher Zou, Ph.D.
Education Researcher at Altus Assessments