Using ChatGPT for Creating Multiple- and Single-Choice R/exams Questions
R/exams logo (CC-BY) and ChatGPT logo (Trademark of OpenAI).

Using ChatGPT for Creating Multiple- and Single-Choice R/exams Questions

Tutorial for creating knowledge quiz questions with the help of ChatGPT in multiple- or single-choice format for R/exams.

Guest post by Andreas Steinmayr (Universität Innsbruck, Faculty of Economics and Statistics).

Idea

Large language models such as ChatGPT have received much attention in recent months. There has been a lot of praise, but also criticism. One of the issues raised was that ChatGPT makes up answers to questions, and does so in a way that the answers appear plausible to readers not familiar with the subject. This might be harmful behavior in many instances, especially in academic settings, but there might also be an application: Creating multiple-choice and single-choice questions for exams.

Multiple-choice and single-choice questions are commonly used in assessments, exams, and surveys. They are popular because they can be quickly and efficiently graded by computers or instructors for many test-takers. However, creating high-quality questions in multiple-choice or single-choice format is challenging and time-consuming for various reasons:

  • Questions must be unambiguous, and the false answers must be plausible enough to make the question challenging.
  • Test-takers should have to distinguish right from wrong answers based on the substantive content of the answers, and not from the way how answers are formulated.
  • Questions need to be updated regularly (e.g., every semester) if test-takers can collect and distribute them to future test-takers.

I am currently teaching a class in introductory economics, and we have regular short take-home assignments with single-choice questions using some random variation in the selection of questions and answer alternatives. The assignments are prepared with R/exams and administered in our university’s learning management system (OpenOlat).

Recently, I have started to use ChatGPT to create those questions and found it a useful - yet imperfect - tool. Below, I show how I create a question on the income elasticity of demand with ChatGPT and what further steps I take to turn it into an R/exams question with some shuffling and randomization.

Naturally, the illustrative example I use comes from economics but the same strategy can also be applied for creating knowledge quiz questions in other fields.

Step 1: Elicit the answer list from ChatGPT

In general, I don’t ask ChatGPT to come up with the question but provide the question myself. But that’s pretty much it. What I ask for is a list of correct and incorrect answers.

Note that I ask for more than one correct answer, despite creating a single-choice question, because I can use R/exams’ built-in shuffling to select a different correct answer for different test-takers. Similarly, I ask for more incorrect answers than I need for each test-taker in order to apply some subsampling and shuffling. The same strategy can also be leveraged for multiple-choice questions.

Prompt: Create seven correct and seven incorrect answers for an exam in introductory economics based on the following question: Which of the following statements about income elasticity of demand are correct?

Correct answers:

A) If the income elasticity of demand is positive, it is called a normal good.
B) Income elasticity of demand measures the extent to which demand for a good responds to a change in income.
C) Luxury goods typically have a higher income elasticity of demand than necessities.
D) If the income elasticity of demand is negative, it is called an inferior good.
E) Income elasticity of demand can vary between zero and infinity.
F) A Giffen good is a special case of an inferior good in which demand increases as income decreases.
G) An income elasticity of demand of one means that demand for the good increases proportionally to the change in income.

Incorrect answers:

H) Income elasticity of demand measures the extent to which demand for a good responds to a change in price.
I) If the income elasticity of demand is positive, it is called an inferior good.
J) Income elasticity of demand is the same for all goods.
K) If the income elasticity of demand is zero, it is called a luxury good.
L) If the income elasticity of demand is greater than one, it is called a normal good.
M) Income elasticity of demand is the same for all consumers.
N) Income elasticity of demand is independent of the price of the good.

Step 2: Review answers, potentially modify the prompt

Above, I have color-coded the suggested answers. Green means I am okay with the answer being categorized as correct/incorrect. Yellow means it’s not necessarily incorrect but imprecise or ill-crafted. Red means definitely incorrect.

For answer C), ChatGPT seems to conflate the term “luxury good” in everyday language and economics, where luxury goods are defined by having an income elasticity larger than one.

Answer E) is incorrect, as the income elasticity can be negative. But this will make for a nice answer in the incorrect list.

Answer F) is incorrect because a Giffen good is a good for which demand increases if its price increases.

I decided to ask for more incorrect answers, giving ChatGPT a bit more guidance:

Prompt: Add additional incorrect answer options that include the terms “normal goods” and “Giffen goods.”

Additional incorrect answers:

O) If the income elasticity of demand for normal goods is positive, it is called an inferior good.
P) Giffen goods always have a positive income elasticity of demand.
Q) The income elasticity of demand for Giffen goods cannot be greater than one.
R) If the income elasticity of demand for normal goods is negative, it is called a superior good.
S) If the income elasticity of demand for Giffen goods is negative, demand increases with an increase in income.
T) Giffen goods are luxury goods for which demand decreases as income increases.
U) Normal goods are goods preferred by a specific segment of the population.

I decided to use the two answers highlighted in green and added them to the list.

Step 3: Edit manually

In the next step, I edited the answers I was not fully happy with and added my own answers to the list:

Correct answers:

  • If the income elasticity of demand is positive, the good is said to be a normal good.
  • The income elasticity of demand indicates how strongly the demand for a good reacts to a change in income.
  • If the income decreases by 2% and thus the demand for a good decreases by 3%, it is called a luxury good.
  • If the income elasticity of demand is negative, one speaks of an inferior good.
  • The income elasticity of demand can be positive or negative.
  • A Giffen good is a special case of an inferior good where demand increases if the price of the good increases.
  • An income elasticity of demand equal to one means that the demand for the good increases proportional to the change in income.

Incorrect answers:

  • The income elasticity of demand indicates how strongly the demand for a good responds to a change of the price.
  • If the income elasticity of demand is positive, the good is said to be inferior.
  • The income elasticity of demand is the same for all goods.
  • If the income elasticity of demand is zero, the good is called a luxury good.
  • If the income elasticity of demand is greater than one, the good is called a normal good.
  • The income elasticity of demand is the same for all consumers.
  • The income elasticity of demand is independent of the good’s price.
  • The income elasticity of demand can range between zero and infinity.
  • Giffen goods are luxury goods for which the demand decreases as income increases.
  • Common goods are goods that are purchased preferentially by a certain part of the population.

Step 4: Implement in R/exams format

Finally, let’s turn this into a dynamic R/exams exercise in both R/Markdown (Rmd) and R/LaTeX (Rnw) format. The first version of the exercise, linked in the table below, simply lists all answer alternatives from Step 3, flags them as correct/incorrect in the exsolution meta-information, specifies extype as schoice for single-choice, and sets exshuffle to 5 so that one correct and 4 incorrect answer alternatives are randomly selected.

The second version in the table below is virtually identical to the first one, but extype is now set to mchoice for multiple-choice. In this case five answer alternatives are chosen randomly with at least one correct and at least one incorrect alternative.

# Exercise templates Invert? Type Description
1 income_elasticity1.Rmd
income_elasticity1.Rnw
No schoice Select single correct answer from shuffled subset of five answer alternatives.
2 income_elasticity2.Rmd
income_elasticity2.Rnw
No mchoice As in #1 but with one or more correct answer alternatives.
3 income_elasticity3.Rmd
income_elasticity3.Rnw
Yes schoice As in #1 but randomly vary between selection of correct or incorrect answer.
4 income_elasticity4.Rmd
income_elasticity4.Rnw
Yes mchoice As in #3 but with one or more (in)correct answer alternatives.

In the third and fourth version of the exercise, I add some further dynamic elements to the question in order to create some more random variation and make cheating harder. Specifically, I randomly vary between the original task (Which of the statements is/are correct?) and the inverted task (Which of statements is/are not correct?). Moreover, some of the answer alternatives are randomly modified (increase vs. decrease; different percentages) without changing the nature of the statement (correct vs. incorrect).

The fourth version again differs from the third only by using the mchoice (instead of schoice) type and “are” (instead of “is”) in the question.

For more guidance on using the schoice and mchoice exercise types in R/exams see the corresponding YouTube tutorial which starts out from a knowledge quiz question about capitals.

Conclusion

Just like it’s not a good idea for students to rely on ChatGPT to answer exam questions, it’s not a good idea to entirely rely on it to create them. Correct answers might be phrased incorrectly, vice versa, or you might find answers to be ill-crafted.

But what ChatGPT is very useful for is to give you a set of potential answers, especially incorrect ones, that sound plausible. You can then use this list, use only selected answer options, or modify others. This is what I find most useful at the moment, as it can take a lot of time and cognitive energy to come up with high-quality answer lists from an empty slate.