Guest post by Andreas Steinmayr (Universität Innsbruck, Faculty of Economics and Statistics).
Large language models such as ChatGPT have received much attention in recent months. There has been a lot of praise, but also criticism. One of the issues raised was that ChatGPT makes up answers to questions, and does so in a way that the answers appear plausible to readers not familiar with the subject. This might be harmful behavior in many instances, especially in academic settings, but there might also be an application: Creating multiple-choice and single-choice questions for exams.
Multiple-choice and single-choice questions are commonly used in assessments, exams, and surveys. They are popular because they can be quickly and efficiently graded by computers or instructors for many test-takers. However, creating high-quality questions in multiple-choice or single-choice format is challenging and time-consuming for various reasons:
- Questions must be unambiguous, and the false answers must be plausible enough to make the question challenging.
- Test-takers should have to distinguish right from wrong answers based on the substantive content of the answers, and not on how the answers are phrased.
- Questions need to be updated regularly (e.g., every semester) if test-takers can collect and distribute them to future test-takers.
I am currently teaching a class in introductory economics, and we have regular short take-home assignments with single-choice questions using some random variation in the selection of questions and answer alternatives. The assignments are prepared with R/exams and administered in our university’s learning management system (OpenOlat).
Recently, I have started to use ChatGPT to create those questions and found it a useful, yet imperfect, tool. Below, I show how I create a question on the income elasticity of demand with ChatGPT and what further steps I take to turn it into an R/exams question with some shuffling and randomization.
Naturally, the illustrative example I use comes from economics but the same strategy can also be applied for creating knowledge quiz questions in other fields.
Step 1: Elicit the answer list from ChatGPT
In general, I don’t ask ChatGPT to come up with the question but provide the question myself. But that’s pretty much it. What I ask for is a list of correct and incorrect answers.
Note that I ask for more than one correct answer, despite creating a single-choice question, because I can use R/exams’ built-in shuffling to select a different correct answer for different test-takers. Similarly, I ask for more incorrect answers than I need for each test-taker in order to apply some subsampling and shuffling. The same strategy can also be leveraged for multiple-choice questions.
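To make the subsampling idea concrete, here is a minimal base R sketch of what drawing one correct and four incorrect alternatives from larger pools amounts to. It only illustrates the logic and is not how R/exams is implemented internally; in practice the package handles this through the exercise meta-information. The helper name draw_single_choice and the shortened answer pools are hypothetical.

```r
# Hypothetical, shortened answer pools for illustration only.
correct <- c(
  "If the income elasticity of demand is positive, the good is a normal good.",
  "If the income elasticity of demand is negative, the good is an inferior good.",
  "The income elasticity of demand can be positive or negative."
)
incorrect <- c(
  "The income elasticity of demand is the same for all goods.",
  "The income elasticity of demand is the same for all consumers.",
  "If the income elasticity of demand is positive, the good is inferior.",
  "If the income elasticity of demand is zero, the good is a luxury good.",
  "The income elasticity of demand can range between zero and infinity.",
  "The income elasticity of demand measures the response of demand to a price change."
)

# Draw one correct and n_incorrect incorrect alternatives, then shuffle them.
# (Hypothetical helper, not part of R/exams.)
draw_single_choice <- function(correct, incorrect, n_incorrect = 4) {
  stopifnot(length(correct) >= 1, length(incorrect) >= n_incorrect)
  answers <- c(
    correct[sample.int(length(correct), 1)],
    incorrect[sample.int(length(incorrect), n_incorrect)]
  )
  solution <- c(TRUE, rep(FALSE, n_incorrect))
  ord <- sample.int(length(answers))  # random order of the five alternatives
  list(answers = answers[ord], solution = solution[ord])
}

set.seed(1)  # only to make the illustration reproducible
q <- draw_single_choice(correct, incorrect)
```

Because every test-taker gets a different draw, the pools can be larger than the five alternatives shown, which is why it pays to ask for more answers than a single question needs.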
Prompt: Create seven correct and seven incorrect answers for an exam in introductory economics based on the following question: Which of the following statements about income elasticity of demand are correct?
A) If the income elasticity of demand is positive, it is called a normal good.
B) Income elasticity of demand measures the extent to which demand for a good responds to a change in income.
C) Luxury goods typically have a higher income elasticity of demand than necessities.
D) If the income elasticity of demand is negative, it is called an inferior good.
E) Income elasticity of demand can vary between zero and infinity.
F) A Giffen good is a special case of an inferior good in which demand increases as income decreases.
G) An income elasticity of demand of one means that demand for the good increases proportionally to the change in income.
H) Income elasticity of demand measures the extent to which demand for a good responds to a change in price.
I) If the income elasticity of demand is positive, it is called an inferior good.
J) Income elasticity of demand is the same for all goods.
K) If the income elasticity of demand is zero, it is called a luxury good.
L) If the income elasticity of demand is greater than one, it is called a normal good.
M) Income elasticity of demand is the same for all consumers.
N) Income elasticity of demand is independent of the price of the good.
Step 2: Review answers, potentially modify the prompt
Above, I have color-coded the suggested answers. Green means I am okay with the answer being categorized as correct/incorrect. Yellow means it’s not necessarily incorrect but imprecise or ill-crafted. Red means definitely incorrect.
For answer C), ChatGPT seems to conflate the everyday meaning of the term “luxury good” with its meaning in economics, where a luxury good is defined as a good with an income elasticity larger than one.
Answer E) is incorrect, as the income elasticity can be negative. But this will make for a nice answer in the incorrect list.
Answer F) is incorrect because a Giffen good is a good for which demand increases if its price increases.
I decided to ask for more incorrect answers, giving ChatGPT a bit more guidance:
Prompt: Add additional incorrect answer options that include the terms “normal goods” and “Giffen goods.”
Additional incorrect answers:
O) If the income elasticity of demand for normal goods is positive, it is called an inferior good.
P) Giffen goods always have a positive income elasticity of demand.
Q) The income elasticity of demand for Giffen goods cannot be greater than one.
R) If the income elasticity of demand for normal goods is negative, it is called a superior good.
S) If the income elasticity of demand for Giffen goods is negative, demand increases with an increase in income.
T) Giffen goods are luxury goods for which demand decreases as income increases.
U) Normal goods are goods preferred by a specific segment of the population.
I decided to use the two answers highlighted in green, T) and U), and added them to the list.
Step 3: Edit manually
In the next step, I edited the answers I was not fully happy with and added my own answers to the list:
- If the income elasticity of demand is positive, the good is said to be a normal good.
- The income elasticity of demand indicates how strongly the demand for a good reacts to a change in income.
- If the income decreases by 2% and thus the demand for a good decreases by 3%, it is called a luxury good.
- If the income elasticity of demand is negative, one speaks of an inferior good.
- The income elasticity of demand can be positive or negative.
- A Giffen good is a special case of an inferior good where demand increases if the price of the good increases.
- An income elasticity of demand equal to one means that the demand for the good increases proportionally to the change in income.
- The income elasticity of demand indicates how strongly the demand for a good responds to a change in the price.
- If the income elasticity of demand is positive, the good is said to be inferior.
- The income elasticity of demand is the same for all goods.
- If the income elasticity of demand is zero, the good is called a luxury good.
- If the income elasticity of demand is greater than one, the good is called a normal good.
- The income elasticity of demand is the same for all consumers.
- The income elasticity of demand is independent of the good’s price.
- The income elasticity of demand can range between zero and infinity.
- Giffen goods are luxury goods for which the demand decreases as income increases.
- Common goods are goods that are purchased preferentially by a certain part of the population.
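The luxury-good statement in the list above (income decreases by 2%, demand decreases by 3%) can be checked by hand against the definition of the income elasticity of demand. The symbols below are notation assumed for this illustration: Q is the quantity demanded, I is income, and the elasticity is written as ε with subscript I.

```latex
\varepsilon_I
  = \frac{\%\,\Delta Q}{\%\,\Delta I}
  = \frac{-3\,\%}{-2\,\%}
  = 1.5 > 1
\quad\Rightarrow\quad \text{luxury good}

% Classification used in the answer list:
%   \varepsilon_I < 0 : inferior good
%   \varepsilon_I > 0 : normal good
%   \varepsilon_I > 1 : luxury good
```

The same ratio is obtained when both changes are increases, so the direction can be varied without changing whether the statement is correct.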
Step 4: Implement in R/exams format
Finally, let’s turn this into a dynamic R/exams exercise in both R/Markdown (Rmd) and R/LaTeX (Rnw) format. The first version of the exercise, linked in the table below, simply lists all answer alternatives from Step 3, flags them as correct/incorrect in the exsolution meta-information, specifies schoice for single-choice, and sets exshuffle to 5 so that one correct and 4 incorrect answer alternatives are randomly selected.
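As a rough sketch of what such an Rmd exercise can look like, the following R snippet writes a shortened exercise file and, when run interactively with the exams package installed, previews a few random versions. The file name, the exname entry, and the reduced answer list are assumptions for illustration; this is not the exercise file linked in the post.

```r
# Shortened exercise in R/Markdown format (illustrative, not the original file).
exercise <- c(
  "Question",
  "========",
  "Which of the following statements about the income elasticity of demand is correct?",
  "",
  "Answerlist",
  "----------",
  "* If the income elasticity of demand is positive, the good is said to be a normal good.",
  "* If the income elasticity of demand is negative, one speaks of an inferior good.",
  "* The income elasticity of demand is the same for all goods.",
  "* If the income elasticity of demand is zero, the good is called a luxury good.",
  "* The income elasticity of demand is the same for all consumers.",
  "* The income elasticity of demand can range between zero and infinity.",
  "* Giffen goods are luxury goods for which the demand decreases as income increases.",
  "",
  "Meta-information",
  "================",
  "exname: Income elasticity (sketch)",  # hypothetical name
  "extype: schoice",
  "exsolution: 1100000",                 # one digit per alternative, 1 = correct
  "exshuffle: 5"                         # show 5 randomly selected alternatives
)

# Hypothetical file name in the session's temporary directory.
ex_file <- file.path(tempdir(), "income_elasticity_sketch.Rmd")
writeLines(exercise, ex_file)

# Preview three random versions in the browser (needs the exams package).
if (interactive() && requireNamespace("exams", quietly = TRUE)) {
  exams::exams2html(ex_file, n = 3)
}
```

The exsolution string must have exactly one digit per answer alternative, in the same order as the list, so it needs to be updated whenever an alternative is added or removed.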
The second version in the table below is virtually identical to the first one, but extype is now set to mchoice for multiple-choice. In this case, five answer alternatives are chosen randomly with at least one correct and at least one incorrect alternative.
| # | Description |
|---|-------------|
| 1 | Select single correct answer from shuffled subset of five answer alternatives. |
| 2 | As in #1 but with one or more correct answer alternatives. |
| 3 | As in #1 but randomly vary between selection of correct or incorrect answer. |
| 4 | As in #3 but with one or more (in)correct answer alternatives. |
In the third and fourth versions of the exercise, I add some further dynamic elements to the question in order to create more random variation and make cheating harder. Specifically, I randomly vary between the original task (Which of the statements is/are correct?) and the inverted task (Which of the statements is/are not correct?). Moreover, some of the answer alternatives are randomly modified (increase vs. decrease; different percentages) without changing the nature of the statement (correct vs. incorrect).
The fourth version again differs from the third only by using the mchoice (instead of schoice) type and “are” (instead of “is”) in the question.
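The two kinds of variation can be sketched in a few lines of base R. This is only an illustration of the idea under stated assumptions, not the code of the linked exercises: the helper name make_variant, the ranges for the percentages, and the exact wording are hypothetical.

```r
# Hypothetical helper: builds one random variant of the luxury-good statement
# together with a randomly chosen task (original vs. inverted).
make_variant <- function() {
  # Randomly switch between the original and the inverted task.
  invert <- sample(c(FALSE, TRUE), 1)
  question <- if (invert) {
    "Which of the following statements is not correct?"
  } else {
    "Which of the following statements is correct?"
  }

  # Vary direction and percentages; demand always changes by more than income,
  # so the implied income elasticity stays above one.
  direction <- sample(c("increases", "decreases"), 1)
  income_change <- sample(2:5, 1)
  demand_change <- income_change + sample(1:3, 1)

  statement <- sprintf(
    "If the income %s by %d%% and thus the demand for a good %s by %d%%, it is called a luxury good.",
    direction, income_change, direction, demand_change
  )

  # The statement is true whenever the elasticity exceeds one.
  is_true <- demand_change / income_change > 1

  list(
    question = question,
    statement = statement,
    is_true = is_true,
    # In the inverted task, the false statements are the ones to select.
    to_select = xor(is_true, invert)
  )
}

set.seed(123)  # only to make the illustration reproducible
v <- make_variant()
```

Keeping the truth value of the statement separate from the flag for what the test-taker should select makes it harder to mix the two up when the task is inverted.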
Just like it’s not a good idea for students to rely on ChatGPT to answer exam questions, it’s not a good idea to rely on it entirely to create them. Answers presented as correct might actually be incorrect, or vice versa, and some answers may simply be ill-crafted.
What ChatGPT is very useful for, however, is giving you a set of potential answers, especially incorrect ones, that sound plausible. You can then use this list as is, use only selected answer options, or modify others. This is what I find most useful at the moment, as it can take a lot of time and cognitive energy to come up with high-quality answer lists from a blank slate.