What Makes a Good Multiple Choice Test Question?

It measures nothing but subject knowledge. Here’s how.

Lucia Bevilacqua
7 min read · Jun 7, 2021

Lament all you want about “closed-ended” multiple choice questions, but this format clearly isn’t going away. With so many state standardized tests, college entrance exams, and certification assessments taken every year, we need items that are simple to grade — simple to distinguish those who know the stuff from those who don’t.

Thankfully, some of the biggest pitfalls can be avoided with thoughtful design. Some of this lies in the wording of the answers themselves: test-takers should be getting them right because they actually know the answers, not because they know what a “right” answer tends to look like. Some of it is also shaped by data from real test-takers: a right answer should predict better performance overall.

Here are the techniques in a skilled test maker’s toolbox.

Good Questions Don’t Test “Test-Wiseness”

If you don’t know a subject well enough to tell which answer might be right and which ones can’t be, your probability of getting the question right should be no better than chance. That’s the ideal, if the question really only tests subject knowledge.

But if you catch the cues that hint at the goodness of an answer, you can score better than a random guess, even when you never would’ve arrived at that answer by yourself. In other words, that question tests test-wiseness, a skill separate from the subject.

With too many questions like these, differences in test scores wouldn’t necessarily reflect differences in what test-takers know. So how well can we predict how they’d perform relative to each other in the real world, where they won’t get those hints?

These cues, identified by Gibb (1964) and still cited to this day, are the big hints the “test-wise” know — so a wise test writer remembers to avoid them:

  1. Phrase-Repeat: A correct answer contains a key sound, word, or phrase that also appears in the question’s stem.
  2. Absurd Relationship: Distractors are unrelated to the stem.
  3. Categorical Exclusive: Distractors contain words such as “all” or “every.”
  4. Precise: A correct answer is more precise, clear, or nuanced than the distractors.
  5. Length: A correct answer is longer than the distractors.
  6. Grammar: Distractors do not match the verb tense of the stem or do not agree with its articles ("a", "an", "the").
  7. Give-Away: A correct answer is given away by another item in the test.

Here’s a question from a “test-wiseness” test given to pharmacy students in a 2006 study. Which one stands out as the “test-wise” answer?

Hapincantin:
(a) should be taken with food to prevent nausea
(b) must be taken at least two hours after any cardiac medication
(c) cannot be taken with Chanto-Berchunin
(d) will reduce effectiveness of birth-control pills

Hapincantin isn’t even real, yet most test-takers deemed (a) most likely. A definitive must, cannot, or will can easily be wrong; a qualified should sounds more like reality.

Gaming the test like this typically works on tests written by your own instructors. I also find these cues shockingly often in certification tests, such as the one I took for radiation safety during a summer internship. (Surely something needs to be done about this?!)

However, these tricks are not so handy on tests made by dedicated testing organizations: the SAT, the ACT, and AP exams. With all the effort they put into designing valid questions, you might as well put your effort into learning what they actually test.

After all, many of their questions are recycled from previous testing sessions. They know how well these questions worked.

Good Questions Distinguish High Performers From Low Performers — Statistically Speaking

Sometimes you’ll hear that a question on an AP exam or SAT got dropped because it was “unreliable.” How does that happen? Do College Board officials just look at it after the fact and decide, “Oops, that wasn’t really fair of us to ask! Our bad!”? If that were all it took, it surely would’ve never made it into the test in the first place. And it can’t be because only a low percentage of test-takers got it right; of course some questions will be much harder than others.

What matters is who’s getting it right. For every question, we can compute the discriminatory index — the difference between the percent of top scorers (say, the top 25% of test-takers) who got it right versus the percent of bottom scorers (say, the bottom 25%) who got it right. A question answered correctly by 65% of top scorers and 33% of the bottom scorers would have a discriminatory index of 0.32, for example.
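Here’s what that calculation might look like in Python. This is just a minimal sketch: the 1/0 response format and the 25% cutoff are my own assumptions for illustration, not any testing organization’s actual code.

```python
def discrimination_index(item_correct, total_scores, frac=0.25):
    """Proportion of top scorers who got this item right minus the
    proportion of bottom scorers who did.

    item_correct: 1 or 0 per test-taker for this one question.
    total_scores: each test-taker's total score on the whole test.
    frac: fraction of test-takers counted as "top" and "bottom" (here 25%).
    """
    n = len(total_scores)
    k = max(1, int(n * frac))
    ranked = sorted(range(n), key=lambda i: total_scores[i])  # low to high
    bottom, top = ranked[:k], ranked[-k:]
    p_top = sum(item_correct[i] for i in top) / k
    p_bottom = sum(item_correct[i] for i in bottom) / k
    return p_top - p_bottom
```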

Ideally, a question’s discriminatory index should be greater than 0.40, and an index under 0.20 is deemed unacceptably low. Such a question may be so easy that it serves as a freebie, or so hard that even the top performers who got it right were probably guessing. Either way, a low index means a correct answer isn’t good evidence that someone understands the subject better than someone who got it wrong.

(A negative index indicates a very poorly discriminating question. If low scorers got it “right” more often than high scorers, maybe those high scorers had a point, thinking of something the test makers didn’t consider.)

Another way to judge test questions is by finding each answer’s point-biserial correlation: compare the total test scores of everyone who chose that answer with the total scores of everyone who didn’t, and compute the correlation between choosing it and overall performance. A correlation of at least 0.30 for the correct answer suggests a solid enough relationship between getting it right and performing well.

For the right answer, this largely repeats what the discriminatory index already tells you. But point-biserial correlations are also helpful for judging distractors: every wrong answer should have a negative correlation, meaning low scorers pick it more often than high scorers. Otherwise, there might be too much truth to it and not enough reason to reject it in favor of the correct answer.
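For the curious, here’s a rough sketch of that formula in Python, meant to be run once per answer option. The data layout is hypothetical; real item-analysis software will be more careful about edge cases.

```python
from statistics import mean, pstdev

def point_biserial(chose_option, total_scores):
    """Point-biserial correlation between picking one option (1/0) and total score:
    r = (M1 - M0) / s * sqrt(p * (1 - p)),
    where M1 and M0 are the mean totals of those who did and didn't pick it,
    s is the standard deviation of all totals, and p is the share who picked it.
    """
    picked = [s for c, s in zip(chose_option, total_scores) if c]
    skipped = [s for c, s in zip(chose_option, total_scores) if not c]
    if not picked or not skipped:
        return 0.0  # everyone (or no one) picked it; the correlation is undefined
    p = len(picked) / len(total_scores)
    return (mean(picked) - mean(skipped)) / pstdev(total_scores) * (p * (1 - p)) ** 0.5
```

Checked against each option: the keyed answer should land at 0.30 or above, and every distractor below zero.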

And when a test really measures what it’s meant to measure, all the items are internally consistent. Say a test of job satisfaction asked these questions:

  1. On a scale of 1 to 10 (1=“highly dissatisfied,” 10=“highly satisfied”), how satisfied are you with your job?
  2. On a scale of 1 to 10 (1=“strongly recommend against,” 10=“strongly recommend in favor”), how strongly would you recommend this job to someone else?
  3. On a scale of 1 to 10 (1=“strongly disagree,” 10=“strongly agree”), how much do you agree with the following statement: “I am not interested in changing my job”?
  4. On a scale of 1 to 10 (1=“strongly disagree,” 10=“strongly agree”), how much do you agree with the following statement: “If I could go back, I would choose this job again”?
  5. On a scale of 1 to 10 (1=“not at all,” 10=“very much”), how much do you enjoy hot dogs with ketchup?

The first four items don’t ask exactly the same thing, but together they contribute to an overall score of job satisfaction. Answers on all four should be correlated: a high score on one predicts higher scores on the others. Measuring the same concept in different ways allows a greater range of possible scores, so the more questions you have, the better you can detect respondents’ different levels of job satisfaction.
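As a quick sanity check on those four items, you could compute their pairwise correlations. A toy example in Python (the responses here are made up):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical responses to the first four items, aligned by respondent.
items = {
    "satisfied": [8, 3, 9, 6, 7],
    "recommend": [7, 4, 9, 5, 8],
    "not_changing": [9, 2, 8, 6, 6],
    "choose_again": [8, 3, 9, 5, 7],
}

# For an internally consistent scale, every pair should correlate positively.
names = list(items)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: {correlation(items[a], items[b]):.2f}")
```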

But what if the fifth item were included in the job satisfaction questionnaire, and using the previous techniques, we found that those with higher job satisfaction indeed enjoyed hot dogs with ketchup more? Would that make it a good question? Surely not. It’s inconsistent with what the other questions are trying to measure. Even if it’s correlated with job satisfaction scores, it’s not as correlated as an actual question about job satisfaction; the test would be more valid without it.

An inconsistent item in a test typically isn’t that obvious to spot. To detect it, we first calculate Cronbach’s alpha, a measure of how strongly the test’s items correlate with one another. The more questions a test has, the higher the alpha value, as long as they’re all internally consistent. Then for each question, we calculate the “alpha-if-deleted”: the alpha the test would have if that question were removed. If deleting a question would increase the alpha, the question is out of place.
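Both calculations fit in a few lines. A back-of-the-envelope sketch, assuming each question’s scores are kept in one list, aligned across respondents:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per question, all aligned by respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total
    return k / (k - 1) * (1 - sum(pvariance(q) for q in items) / pvariance(totals))

def alpha_if_deleted(items):
    """Alpha recomputed with each question left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]
```

Any question whose alpha-if-deleted comes out above the full test’s alpha (the hot dog item, most likely) is dragging down the scale’s consistency.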

Similarly, the AP Biology exam covers many topics, but they all point to the same thing: how well test-takers have studied “biology” up to the introductory college level. An alpha-if-deleted higher than the exam’s overall alpha, then, flags a question outside this scope; perhaps it’s a better measure of how well they’ve studied chemistry or environmental science in prior courses.

By running these statistics on every test question, test makers can see which questions are worth repeating, which may be worth revising with better answer options, and which are worth dropping. This way, entire tests aren’t generated fresh every time; they’re the refined products of years of data.

Clearly, test questions can be “put to the test” and found highly valid and reliable. But this probably doesn’t address the big concern about closed-ended questions: the concern that they promote “closed-ended” thinking.

In other words, who are we to tell test-takers what’s “right”? Why can’t they solve problems creatively?

Many “creativity”-testing alternatives to multiple choice just combine testing the subject matter with judging skills outside the subject matter, such as video production or artistic design. This grading process is confounded. Why is it my English teacher’s business to judge how well I can draw? You didn’t teach me how to draw; you taught me the book!

Besides, here’s the issue. Reality is filled with closed-ended questions. Laws of physics exist. Mathematical procedures deliver an output. Vocabulary is intended to mean something specific in an author’s sentence.

People aren’t going to come up with “creative” ideas involving physics if they can’t understand the constraints they’re working with — and if they can’t apply the relevant law in a physics multiple choice question, it’s reasonable to assume they don’t understand the law (or even know it exists).

People aren’t going to come up with “creative” interpretations of a text if they wrongly assume the definitions of key English words in it — and if they choose an answer option that incorrectly paraphrases a sentence, it’s reasonable to assume they misinterpreted it.

And so on for any discipline. Your thinking can’t be shaped by the knowledge you don’t have. How are you supposed to know you should have it?

A well-designed multiple choice question measures subject knowledge. Nothing more. That’s what makes it so powerful.
