Setting and maintaining standards

Setting the pass mark is a key stage in the development of many assessments. For high-stakes assessments in particular, it is crucial that when a candidate passes an exam, the certification body can be confident that they have reached a defined level of minimum competence.

It’s worth stating the obvious – a pass mark is an absolute thing; if the pass mark is 30 out of 50, a candidate who scores 30 or more will pass, and a candidate who scores 29 or fewer will fail. Although some systems have riders to this (‘borderline fail’ candidates can be reconsidered, or candidates may be permitted to appeal their result), the general principle still holds. Ultimately, there is a cut-off; candidates scoring n marks pass, but those scoring (n – 1) marks fail.

N.B. Rather than a single pass/fail cut score, some exams (e.g. GCSEs and A Levels) have grade boundaries. These are set in essentially the same way as pass marks: typically, the methods below are used to set, say, the A grade and the C grade boundaries, and the rest are extrapolated from those marks.

Setting pass marks

A wide range of standard setting methods exists, but most fall into a few broad categories:

  • Absolute (test-centred) methods – the cut score is derived from expert judgements about the test items:
    • Bookmark – items are ordered by difficulty, and experts place a “bookmark” at the last item a minimally competent candidate would still have a specified probability of answering correctly.
    • Angoff – experts estimate the proportion of minimally competent candidates who would answer each item correctly. These estimates are summed across items and averaged across experts to give the cut score (see the worked sketch after this list).
    • Modified Angoff – as above, but iterative: experts may discuss and amend their judgements over successive rounds.
    • Ebel – experts categorise each item as essential/nice-to-know/supplementary and as easy/medium/hard. For each category, they judge the proportion of items a minimally competent candidate would answer correctly, and these proportions are weighted by the number of items in each category and summed to give the cut score.
    • Nedelsky – suitable only for multiple-choice tests: for each item, experts identify the distractors a minimally competent candidate could rule out as implausible. The item’s probability of correct response is then one divided by the number of remaining options, and the cut score is the sum of these probabilities across items.
  • Compromise methods – combine judgements about acceptable marks with judgements about acceptable pass rates:
    • Hofstee – experts judge the minimum and maximum acceptable cut scores and the minimum and maximum acceptable failure rates. These judgements are plotted against the cumulative mark distribution, and the point where the line joining them crosses the distribution gives the cut score.
  • Relative (candidate-centred) methods – the cut score is derived from judgements about candidates rather than items:
    • Borderline candidates – experts identify candidates judged to be on the “borderline” between passing and failing; the median score of this group is used as the cut score.
    • Contrasting groups – experts identify groups of “masters” and “non-masters”, who then sit the test. The two score distributions are compared, and the point where they intersect is used to locate the cut score.
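
To make the Angoff arithmetic concrete, here is a minimal Python sketch using invented judge ratings (they are not taken from any real exercise): each judge’s item-level probability estimates are summed, and the judges’ totals are averaged to give a recommended cut score.

    # Basic Angoff arithmetic for an invented five-item test.
    # Each row is one expert judge; each value is that judge's estimate of the
    # probability that a minimally competent candidate answers the item correctly.
    ratings = [
        [0.6, 0.8, 0.4, 0.9, 0.7],  # judge 1
        [0.5, 0.7, 0.5, 0.8, 0.6],  # judge 2
        [0.7, 0.9, 0.3, 0.9, 0.8],  # judge 3
    ]

    # Each judge's implied cut score is the sum of their item estimates.
    judge_totals = [round(sum(judge), 2) for judge in ratings]   # [3.4, 3.1, 3.6]

    # The recommended cut score is the mean of the judges' totals,
    # usually rounded to a whole mark before it is applied.
    cut_score = sum(judge_totals) / len(judge_totals)            # about 3.37 out of 5

In a modified Angoff exercise the same calculation is simply repeated after each round of discussion.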

Maintaining standards over time

It is usually an essential requirement for validity and fairness that the pass criteria for the test forms are:

  1. Equivalent from one test form (a test form is a version of a test) to the next, so that candidates have the same chance of passing whichever test form they are presented with.
  2. Equivalent over time, so that the standard that candidates must meet is consistent from one year to the next (except, of course, where the standard itself is adjusted).

Dealing with (2) is a key part of test maintenance, and there are several ways of approaching this:

  • Repeating the standard setting procedure initially selected. This is risky, particularly with expert judgement approaches. For instance, repeated presentation of the same items can lead to experts perceiving them as easier.
  • Conducting a classical equating procedure, e.g. mean equating or linear equating (see the sketch after this list).
  • Carrying out item response theory (IRT) test equating (or scaling), which uses commonalities between tests to put both onto the same scale:
    • Common item equating seeds a set of identical items into each test (or uses a separate “anchor test” where there are security concerns).
    • Common person equating instead relies on (at least some of) the same candidates taking each test.
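
To make these ideas concrete, here is a minimal Python sketch using invented scores and item calibrations (not real data). The first part applies mean and linear equating to place new-form scores on the reference form’s scale; the second shows the mean-difficulty shift behind a simple Rasch common-item link.

    import statistics

    # Invented scores for equivalent groups of candidates on a reference form (Y)
    # and a new test form (X).
    reference_scores = [32, 35, 28, 40, 31, 37, 29, 34]
    new_form_scores  = [30, 33, 25, 38, 28, 35, 27, 31]

    mu_y, sd_y = statistics.mean(reference_scores), statistics.stdev(reference_scores)
    mu_x, sd_x = statistics.mean(new_form_scores),  statistics.stdev(new_form_scores)

    def mean_equate(x):
        # Mean equating: shift new-form scores by the difference in means.
        return x + (mu_y - mu_x)

    def linear_equate(x):
        # Linear equating: match both the mean and the standard deviation
        # of the reference form.
        return mu_y + (sd_y / sd_x) * (x - mu_x)

    # If the pass mark on the reference form is 30, the equivalent mark on the new
    # form is the score whose linearly equated value is 30 (the inverse function).
    equivalent_pass_mark = mu_x + (sd_x / sd_y) * (30 - mu_y)

    # Common-item (anchor) linking under the Rasch model, mean/mean method: the
    # difference in mean difficulty of the shared items is the shift needed to
    # place new-form calibrations on the reference scale.
    anchor_difficulties_reference = [-0.4, 0.1, 0.8, 1.2]
    anchor_difficulties_new_form  = [-0.6, -0.1, 0.6, 0.9]
    shift = (statistics.mean(anchor_difficulties_reference)
             - statistics.mean(anchor_difficulties_new_form))
    # Adding `shift` to every new-form difficulty (and ability estimate) expresses
    # them on the reference scale, so the originally set cut score can be re-used.

In practice, IRT equating is normally carried out in specialist software with more robust linking methods, but the underlying logic is the same.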

Our experience

AlphaPlus advises professional examination agencies on the best standard setting method for their particular context and the types of questions in their examinations. We carry out standard setting analysis for clients – both advising on best practice and doing the calculations for methods such as Angoff, borderline regression and so on.

We also work with bodies to move from standard setting to standard maintaining. Our recommendation is that a standard setting method (Angoff, say) is used when a new exam is first run, and that on subsequent occasions the initially set standard is maintained using statistical techniques such as IRT equating.

To find out more about standard setting and maintenance, and how AlphaPlus can help your organisation, please contact our Director of Research, Andrew Boyle.