Inter-rater reliability

A topic in research methodology

Reliability concerns the reproducibility of measurements. A measurement instrument is reliable to the extent that it gives the same measurement on different occasions.

(Read about 'Reliability')

Inter-rater reliability

Often research papers include reports of inter-rater reliability. This is relevant where several researchers have been involved in making judgements (e.g., in using an observation schedule; in applying an analytical scheme to some data). The inter-rater reliability is simply the proportion of judgements they make in common. So if two researchers were asked to analyse 50 student responses to a test question according to whether each response was adequate to be considered correct (by whatever criteria had been specified), and they agreed on 44/50 responses (but disagreed on 6/50), the inter-rater reliability would be 88%.
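
As a minimal sketch of this calculation (the judgement labels and the helper function below are illustrative, not taken from any real study), the proportion of agreement between two raters might be computed like this:

    # Simple percent agreement between two raters' judgements
    def percent_agreement(ratings_a, ratings_b):
        """Proportion of items on which two raters made the same judgement."""
        agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
        return agreements / len(ratings_a)

    # Illustrative use: the raters agree on 3 of 4 judgements, giving 0.75 (75%)
    rater_a = ['correct', 'correct', 'incorrect', 'correct']
    rater_b = ['correct', 'incorrect', 'incorrect', 'correct']
    print(percent_agreement(rater_a, rater_b))  # 0.75

In the example above, 44 agreements out of 50 judgements would give 0.88, i.e. 88%.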

Inter-rater reliability can take any value from 0 (0%, complete lack of agreement) to 1 (100%, complete agreement).

Inter-rater reliability may be measured during a training phase to establish and assure high agreement between researchers in their use of an instrument (such as an observation schedule) before they go into the field and work independently.

It can also be used when analysing data, especially when the analysis is shared out among a team. Analysts may independently analyse a sample of data according to the agreed scheme (the 'codebook') before comparing responses and then discussing disagreements. This process may be repeated with further samples until judgements are highly consistent, at which point it is considered that the analysts can reliably work independently.

When is inter-rater reliability useful?

The concept is relevant in confirmatory research designs with positivist assumptions, where objectivity is assumed to be possible and desirable. In some forms of research, trustworthiness comes from a researcher working intimately alongside study participants and developing a strong rapport – in which circumstances seeking inter-rater reliability with another researcher may be neither desirable nor sensible.

Checking one’s own analysis

A lone analyst can also return to a sample of data previously analysed and repeat the analysis without sight of the original results. A statistic parallel to inter-rater reliability (i.e., a repeated-rating, or intra-rater, reliability) can easily be calculated. This may give some assurance that the analysis is being carried out consistently.
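
The same simple calculation applies, except that the two sets of judgements being compared are the analyst's original and repeated codings of the same sample (the codes below are purely illustrative):

    # Agreement between an analyst's original and repeated codings of a sample
    original = ['A', 'B', 'B', 'A', 'C']
    repeated = ['A', 'B', 'C', 'A', 'C']
    agreement = sum(o == r for o, r in zip(original, repeated)) / len(original)
    print(agreement)  # 4 of 5 codings match, giving 0.8 (80%)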

Cohen’s kappa

Although inter-rater reliability can be readily calculated in the manner suggested above, more sophisticated approaches are available. One of these is Cohen's kappa coefficient (κ), a statistic intended to make allowance for chance agreement between raters. Kappa also has a maximum value of 1, but it can take negative values, with 0 indicating that the raters' level of agreement is no better than chance. Sometimes p (probability) values are given for calculated values of κ, but these are generally not especially useful.
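
A minimal sketch of the calculation follows, assuming two raters each sorting the same 50 responses into 'correct'/'incorrect' (the counts are hypothetical and chosen only to illustrate how chance agreement is taken into account). Kappa is computed as κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance given each rater's marginal proportions:

    # Cohen's kappa for two raters, from a cross-tabulation (confusion) matrix
    # where confusion[i][j] counts items rater A placed in category i
    # and rater B placed in category j. The counts used here are hypothetical.
    def cohens_kappa(confusion):
        total = sum(sum(row) for row in confusion)
        k = len(confusion)

        # Observed agreement: proportion of items on the diagonal
        p_o = sum(confusion[i][i] for i in range(k)) / total

        # Chance agreement: product of each rater's marginal proportions, summed
        row_totals = [sum(row) for row in confusion]
        col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
        p_e = sum((row_totals[i] / total) * (col_totals[i] / total) for i in range(k))

        return (p_o - p_e) / (1 - p_e)

    # 50 responses: raters agree on 44 (40 'correct', 4 'incorrect'), disagree on 6
    confusion = [[40, 2],
                 [4, 4]]
    print(round(cohens_kappa(confusion), 2))  # about 0.50, despite 88% raw agreement

This illustrates why kappa is often preferred to simple percent agreement: the 88% raw agreement looks impressive, but much of that agreement would be expected by chance when most responses fall into one category.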
