Reliability

A topic in research methodology

“the extent to which an instrument can be expected to give the same measured outcome when measurements are repeated”

Taber, 2018, p.1274

Reliability concerns the reproducibility of measurements.

A measurement instrument is reliable to the extent that it gives the same measurement on different occasions.

Reliability is different from validity or precision. An instrument may not measure what it is claimed to measure (it lacks validity) yet still be reliable. If you decided to measure the weight of your computer screen with a ruler, you might get highly reproducible results (37.9 cm, 38.1 cm, 37.9 cm, 38.0 cm) – but the instrument was not valid and did not give a measure of weight!
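The reproducibility of the ruler readings can be put into numbers. The following is a minimal sketch using the four readings quoted above; the choice of range and sample standard deviation as summaries of spread is an assumption made here for illustration, not something the text prescribes.

```python
import statistics

# The four ruler readings quoted in the text, in centimetres.
readings_cm = [37.9, 38.1, 37.9, 38.0]

# Two simple summaries of how closely repeated readings agree.
# (Choosing these two statistics is an illustrative assumption.)
spread_cm = max(readings_cm) - min(readings_cm)
sd_cm = statistics.stdev(readings_cm)  # sample standard deviation

print(f"range of readings: {spread_cm:.1f} cm")
print(f"standard deviation: {sd_cm:.2f} cm")

# A small spread suggests the readings are reproducible (reliable).
# It says nothing about validity: these are lengths, not weights.
```

However small the spread, the numbers remain measurements of length, so they cannot answer a question about weight.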

(Read about 'Validity')

Similarly, reproducible results are not evidence of precision. If you timed the length of a school's morning break (as indicated by the bell) each day for a fortnight using a stopwatch that was running 10% fast, you might find that the breaks were all reliably in the range 22 minutes ± 15 seconds (when the actual break was 20 minutes – outside the range of your measurements).
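The stopwatch example can be sketched as a small simulation. This is an illustration under stated assumptions: the stopwatch is modelled as registering 10% more than the true elapsed time (which is what readings of about 22 minutes for a 20-minute break imply), the day-to-day variation is drawn uniformly within ± 15 seconds, and ten school days stand in for the fortnight. All names and the random seed are hypothetical.

```python
import random
import statistics

TRUE_BREAK_MIN = 20.0  # actual length of the break, in minutes
RATE_FACTOR = 1.10     # assumption: stopwatch registers 10% more than true time
JITTER_SEC = 15.0      # assumption: day-to-day variation of up to +/- 15 seconds
SCHOOL_DAYS = 10       # assumption: ten school days in a fortnight


def timed_break(rng):
    """Return one stopwatch reading (in minutes) for a single break."""
    jitter_min = rng.uniform(-JITTER_SEC, JITTER_SEC) / 60.0
    return TRUE_BREAK_MIN * RATE_FACTOR + jitter_min


rng = random.Random(0)  # fixed seed so the sketch is repeatable
readings = [timed_break(rng) for _ in range(SCHOOL_DAYS)]

spread_min = max(readings) - min(readings)            # how consistent?
bias_min = statistics.mean(readings) - TRUE_BREAK_MIN  # how far off?

print(f"spread of readings: {spread_min:.2f} min")
print(f"mean error vs true value: {bias_min:.2f} min")
```

The readings cluster tightly, so they look reliable, yet every one of them is about two minutes away from the true value.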

Validity, precision/accuracy and reliability may then all vary independently.

Why might measures of reliability be low?

If repeated measurements show a large range, then this suggests there is a lack of reliability in the measurements. This may be a problem with the measuring instrument not being well-designed, or a problem with the researcher not having the skills to use the instrument (as some instruments require expert administration and so careful training and practice).

However, it may also be an issue related to the nature of what is being measured (an issue of ontology). Consider a simple instrument to gauge how much a learner is enjoying their course, and imagine repeated administrations of the instrument at different times during an academic year. We might expect that there could be a shift in enjoyment over the year; moreover, the level of enjoyment at any particular time could be influenced by all sorts of factors, both those relating to the course itself that change during the year (which topic, which teacher, how close to a course submission) and other factors (state of health, weather, relationship issues, money worries…).

Some things we measure in education we would expect to be changeable (hopefully including knowledge, understanding, skill level!), and others may not be as fixed as some people might assume (enjoyment of statistics classes? intelligence? creativity? giftedness?).

Many questionnaires and scales seek to find out about constructs that cannot be directly observed (e.g., self-efficacy, motivation, confidence), and may be considered as 'latent variables' which can only be measured indirectly – often by looking at pooled/aggregate responses to a range of items that are considered to reflect the variable of interest.

This is particularly considered to be the case with affective variables. However, the same may be true of cognitive variables. We cannot directly observe someone's knowledge and understanding, but infer this from their responses to a sample of test items that are designed to offer relevant evidence.

Inter-rater reliability

Often research papers include reports of inter-rater reliability. This is used where several researchers have been involved in making judgements (e.g., in using an observation schedule; in applying an analytical scheme or 'codebook' to some data), and is a measure of their level of agreement.
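As an illustration of how a level of agreement might be quantified, the sketch below computes two commonly reported statistics: simple percent agreement and Cohen's kappa (which corrects for agreement expected by chance). The text above does not name a particular statistic, so this choice is an assumption; the two raters, the codes 'on-task' and 'off-task', and the data are all hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Proportion of cases to which both raters gave the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters coding the same cases.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when both raters use a single, identical code throughout.
    """
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    categories = set(codes_a) | set(codes_b)
    # Agreement expected by chance, from each rater's code frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical codes given by two observers to the same ten lesson segments.
rater_1 = ["on-task", "on-task", "off-task", "on-task", "off-task",
           "on-task", "on-task", "off-task", "on-task", "on-task"]
rater_2 = ["on-task", "on-task", "off-task", "off-task", "off-task",
           "on-task", "on-task", "on-task", "on-task", "on-task"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa comes out lower than raw agreement here because two raters who both use 'on-task' most of the time would agree fairly often by chance alone.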

(Read about 'Inter-rater reliability')

Reliability and internal consistency

Sometimes the term reliability (scale reliability) is also used for what is otherwise (and more helpfully) known as the internal consistency of a scale.
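One widely reported statistic for the internal consistency of a scale is Cronbach's alpha. The text above does not name a statistic, so using alpha here is an assumption made for illustration, and the response data (five respondents, four items rated 1 to 5) are hypothetical. The sketch assumes every respondent answered every item and that total scores are not all identical.

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha for a table of scores.

    rows: one list per respondent, one score per item.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])                 # number of items in the scale
    items = list(zip(*rows))         # regroup scores item by item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: each row is one respondent's four item scores.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

A high value indicates that respondents who score high on one item tend to score high on the others; it does not show that the items measure the intended construct.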

(Read about 'Internal consistency')

Source cited:

My introduction to educational research:

Taber, K. S. (2013). Classroom-based Research and Evidence-based Practice: An introduction (2nd ed.). London: Sage.