Determine the "difficulty" of quiz with multiple weights?

I'm trying to determine the "difficulty" of a quiz object.

My ultimate goal is to be able to create a "difficulty score" (DS) for any quiz. This would allow me to compare one quiz to another accurately, despite being made up of different questions/answers.

When creating my quiz object, I assign each question a "difficulty index" (DI), which is a number on a scale from 1 to 15.

15 = most difficult
1 = least difficult

Now, a straightforward way to measure this "difficulty score" could be to add up each question's "difficulty index" and then divide by the maximum possible total "difficulty index" for the quiz (e.g. 16/30 = 53.3% difficulty).
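As a minimal sketch of this unweighted score, assuming the DIs are held in a plain list (the function name and the two-question quiz are hypothetical, chosen only to reproduce the 16/30 example):

```python
def simple_difficulty_score(difficulty_indexes, max_di=15):
    """Sum of the questions' DIs divided by the maximum possible total."""
    if not difficulty_indexes:
        raise ValueError("a quiz needs at least one question")
    return sum(difficulty_indexes) / (max_di * len(difficulty_indexes))


# Hypothetical two-question quiz: DI = 1 and DI = 15, i.e. 16 out of a possible 30.
score = simple_difficulty_score([1, 15])
print(f"{score:.1%}")  # prints 53.3%
```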

However, I also have multiple "weighting" properties associated with each question. These weights are on a scale of 1-5.

5 = most impact
1 = least impact

The reason I have two weights instead of the more common single weight is so I can accommodate a scenario as follows...

If the student is presented with a very difficult question (DI=15) and answers incorrectly, it shouldn't hurt their score much, BUT if they answer correctly, it should improve their score greatly. I call these my "positive" (PW) and "negative" (NW) weights.

Quiz Example A:
Question 1: DI = 1 | PW = 3 | NW = 3
Question 2: DI = 1 | PW = 3 | NW = 3
Question 3: DI = 1 | PW = 3 | NW = 3
Question 4: DI = 15 | PW = 5 | NW = 1

Quiz Example B:
Question 1: DI = 1 | PW = 3 | NW = 3
Question 2: DI = 1 | PW = 3 | NW = 3
Question 3: DI = 1 | PW = 3 | NW = 3
Question 4: DI = 15 | PW = 1 | NW = 5

Technically the above two quizzes are very similar BUT Quiz B should be more "difficult" because the hardest question will have the greatest impact on your score if you get it wrong.

My question now becomes how can I accurately determine the "difficulty score" when considering the complex weighting system?

Any help is greatly appreciated!

Caddie answered 19/8, 2016 at 15:1 Comment(7)
Are the questions multiple-choice? I'm asking because the model would be somewhat different: in a multiple-choice quiz, even the hardest questions still have a 1/x chance of being answered correctly just by guessing. Also, for a multiple-choice question the score is all or nothing. – Synchromesh
In my case, questions would be "single" choice. Like "true or false" or choose "a, b, c, or d". Only one option would ever be correct. I do see what you're saying though: having more options to pick from would increase the difficulty of the question as well. – Caddie
Your weights system suggests that sometimes it may be better not to answer than to give the wrong answer (e.g. when PW=1 and NW=5). Is that so, or does "no answer" = "wrong answer"? – Synchromesh
Oh, I see what you're asking now. Yes, technically you must answer every question for the quiz to be considered complete at all. – Caddie
On a side note: you could also change a question's difficulty score based on what percentage of people get it right. (I remember programming an online quiz like this when I was learning C; good times :-) – Emmanuelemmeline
@m69 - This would really help correct any errors made when setting the question's "difficulty index", but it relies on students taking the quiz. I really want to be able to measure the quiz's difficulty before anyone has taken the quiz at all. – Caddie
This is way too vague and lacks a model (or constants). Normally you would introduce some statistical assumptions, like "with what chance will the student answer correctly", which might be equivalent to your DI score but needs to be normalized with a constant. Without these assumptions not much can be done. One easy approach: just calculate DS = sum(DI_i * PW_i) for all i / card(i), which results in lower value = higher difficulty (you could always swap PW/NW). In the current formulation above, PW/NW also look symmetric and should be reduced (which has already been done in my simple approach). – Cockneyism

The challenge of course is to determine the difficulty score for each single question.

I suggest the following model:

  • Hardness (H): Define a hard question as one where the chance of answering it correctly is lower. The hardest question is one where (1) the chance of answering it correctly is equal to random choice (because it is inherently very hard), and (2) it has the largest number of possible answers. We'll define such a question as (H = 15). At the other end of the scale, we'll define (H = 0) for a question where the chance of answering it correctly is 100% (because it is trivial) (I know - such a question will never appear). Now define the hardness of each question by subjective extrapolation (remember that one can always guess between the given options). For example, if a (H = 15) question has 4 answers, and another question with similar inherent hardness has 2 answers - it would be (H = 7.5). Another example: if you believe that an average student has a 62.5% chance of answering a question correctly - it would also be a (H = 7.5) question (this is because an H = 15 question has a 25% chance of a correct answer, while an H = 0 question has 100%; the midpoint is 62.5%).

  • Effect (E): Now, we'll measure the effect of PW and NW. For questions with a 50% chance of being answered correctly, the effect is E = 0.5*PW - 0.5*NW. For questions with a 25% chance of being answered correctly, the effect is E = 0.25*PW - 0.75*NW. For a trivial question NW doesn't matter, so the effect is E = PW.

  • Difficulty (DI): The last step is to integrate the hardness and the effect - and call it difficulty. I suggest DI = H - c*E, where c is some positive constant. You may want to normalize again.

    Edit: Alternatively, you may try the following formula: DI = H * (1 - c*E), where the effect magnitude is not absolute, but relative to the question's hardness.


Clarification:

The teacher needs to estimate only one parameter for each question: the probability that an average student would answer it correctly. This estimate, e, will be in the range [1/k, 1], where k is the number of answer options.

The hardness, H, is a linear function of e such that 1/k is mapped to 15 and 1 is mapped to 0. The function is: H = 15 * k / (k-1) * (1-e)

The effect E depends on e, PW and NW. The formula is E = e*PW - (1-e)*NW


Example based on OP comments:

Question 1:

k = 4, e = 0.25 (hardest). Therefore H = 15

PW = 1, NW = 5, e = 0.25. Therefore E = 0.25*1 - 0.75*5 = -3.5

c = 5. DI = 15 - 5*(-3.5) = 32.5

Question 2:

k = 4, e = 0.95 (very easy). Therefore H = 1

PW = 1, NW = 5, e = 0.95. Therefore E = 0.95*1 - 0.05*5 = 0.7

c = 5. DI = 1 - 5*(0.7) = -2.5
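The clarified formulas and the two worked questions above can be sketched in Python as follows. This is only an illustration under the stated model: the function names and the `relative` flag (which switches to the alternative DI = H * (1 - c*E) form) are hypothetical, and c = 5 is simply the constant used in the example.

```python
def hardness(e, k, max_h=15):
    """Linear map: e = 1/k (pure guessing) -> max_h, e = 1 (trivial) -> 0."""
    if k < 2:
        raise ValueError("a question needs at least two answer options")
    if not 1 / k <= e <= 1:
        raise ValueError("e must lie in the range [1/k, 1]")
    return max_h * k / (k - 1) * (1 - e)


def effect(e, pw, nw):
    """Expected weight contribution: E = e*PW - (1 - e)*NW."""
    return e * pw - (1 - e) * nw


def difficulty(e, k, pw, nw, c=5, relative=False):
    """DI = H - c*E, or DI = H * (1 - c*E) when relative=True."""
    h = hardness(e, k)
    eff = effect(e, pw, nw)
    return h * (1 - c * eff) if relative else h - c * eff


# Question 1 and Question 2 from the example (k = 4, PW = 1, NW = 5, c = 5).
print(f"{difficulty(0.25, 4, pw=1, nw=5):.2f}")
print(f"{difficulty(0.95, 4, pw=1, nw=5):.2f}")
```

Up to floating-point rounding, the two calls should reproduce the DI values of 32.5 and -2.5 computed by hand above.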

Synchromesh answered 19/8, 2016 at 16:8 Comment(5)
Thanks for the feedback! So using the approach you described and the following question. Question A: H = 15, PW = 1, NW = 5, (4) possible options. H is what the "teacher" thinks the hardness is. I would get a DI of 21.25 with a 'c' constant of 5. – Caddie
Question B: H = 1, PW = 1, NW = 5, (4) possible options. I would get a DI of 17.75. Now technically question "B" should be easier than "A", but I would think it would be dramatically more so. Am I missing something? – Caddie
@jrucci: I made some clarifications and addressed your questions. Please see my edit. – Synchromesh
I'm running through some tests and will update my question when I have some results. Just so I'm clear, you're saying that if a question has 4 possible options and only 1 is correct, then the maximum difficulty is .25 and the lowest difficulty is 1? – Caddie
That's true. The highest difficulty is when the average student has no clue. His chances of answering correctly by guessing are 25%. The lowest difficulty is when you believe that the average student has a 100% chance of answering correctly. – Synchromesh

I'd say the core of the problem is that mathematically your example quizzes A and B are identical, except that quiz A awards the student 4 gratuitous bonus points (or, equivalently, quiz B arbitrarily takes 4 points away from them). If the same students take both of them, the score distribution will be the same, except shifted by 4 points. So while the two quizzes may feel different psychologically (because, let's face it, getting extra points feels good, and losing points feels bad, even if you technically did nothing to deserve it), finding an objective way to distinguish them seems tricky.

That said, one reasonable measure of "psychological difficulty" could simply be the average score (per question) that a randomly chosen student would be expected to get from the quiz. Of course, that's not something you can reliably calculate in advance, although you could estimate it from actual quiz results after the fact.

However, if you could somehow relate your (presumably arbitrary) difficulty ratings to the fraction of students likely to answer the question correctly, then you could use that to estimate the expected average score. So, for example, we might simply assume a linear relationship with the question difficulty as the success rate, with difficulty 1 corresponding to a 100% expected success rate, and difficulty 15 corresponding to a 0% expected success rate. Then the expected average score S per question for the quiz could be calculated as:

  • S = avg(PW × X − NW × (1 − X))

where the average is taken over all questions in the quiz, PW and NW are the point weights for a correct and an incorrect answer respectively, DI is the difficulty rating for the question, and X = (15 − DI) / 14 is the estimated success rate.

Of course, we might want to also account for the fact that, even if a student doesn't know the answer to a question, they can still guess. Basically this means that the estimated success rate X should not range from 0 to 1, but from 1/N to 1, where N is the number of options for the question. So, taking that into account, we can adjust the formula for X to be:

  • X = (1 + (N − 1) × (15 − DI) / 14) / N
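To make the two formulas concrete, here is a rough Python sketch applied to Quiz A and Quiz B from the question. The function names and the (DI, PW, NW) tuple layout are assumptions made for illustration, as is using N = 4 options for every question; the linear DI-to-success-rate mapping is the assumption stated above, not something measured.

```python
def success_rate(di, n_options=None, max_di=15):
    """Estimated success rate X for a question with difficulty index di.

    Without n_options: X = (15 - DI) / 14.
    With n_options = N: X = (1 + (N - 1) * (15 - DI) / 14) / N, so X >= 1/N.
    """
    base = (max_di - di) / (max_di - 1)
    if n_options is None:
        return base
    return (1 + (n_options - 1) * base) / n_options


def expected_score(questions, n_options=None):
    """S = avg(PW * X - NW * (1 - X)) over (DI, PW, NW) tuples."""
    total = 0.0
    for di, pw, nw in questions:
        x = success_rate(di, n_options)
        total += pw * x - nw * (1 - x)
    return total / len(questions)


# Quiz A and Quiz B from the question, as (DI, PW, NW) tuples.
quiz_a = [(1, 3, 3), (1, 3, 3), (1, 3, 3), (15, 5, 1)]
quiz_b = [(1, 3, 3), (1, 3, 3), (1, 3, 3), (15, 1, 5)]

print(expected_score(quiz_a, n_options=4))  # prints 2.375
print(expected_score(quiz_b, n_options=4))  # prints 1.375
```

The two results differ by exactly 1 point per question, i.e. 4 points over the whole quiz, which matches the observation that the quizzes are identical apart from a 4-point shift.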

One problem with this estimated average score S as a difficulty measure is that it isn't bounded in either direction, and provides no obvious scale to indicate what counts as an "easy" quiz or a "hard" one. The fundamental problem here is that you haven't specified any limits for the question weights, so there's technically nothing to stop someone from making a question with, say, a positive or negative weight of one million points.

That said, if you do impose some reasonable limits on the weights (even if they're only recommendations), then you should be able to also establish reasonable thresholds on S for a quiz to be considered e.g. easy, moderate or hard. And even if you don't, you can still at least use it to rank quizzes relative to each other by difficulty.

P.S. One way to present the expected score in a UI might be to multiply it by the number of questions in the quiz, and display the result as "par" for the quiz. That way, students could roughly judge their own performance against the difficulty of the quiz by seeing whether they scored above or below par.

Obannon answered 19/8, 2016 at 21:42 Comment(1)
Thanks for the input @Imari! I'm having a little trouble following you (these concepts are a little out of my league at this point). Could you break down your formula with these example questions? Question A (Hardest): Difficulty Input: 15 out of 15, Positive Weight: 1 out of 5, Negative Weight: 5 out of 5, (4) possible options to choose from. Question B (Easiest): Difficulty Input: 1 out of 15, Positive Weight: 5 out of 5, Negative Weight: 1 out of 5, (4) possible options to choose from. Thanks again! – Caddie
