Why write about Cronbach's alpha?

Keith S. Taber

What is Cronbach's alpha?

It is a statistic that is commonly quoted by researchers when reporting the use of scales and questionnaires.

Why carry out a study of the use of this statistic?

I am primarily a qualitative researcher, so do not usually use statistics in my own work. However, I regularly came across references to alpha in manuscripts I was asked to review for journals, and in manuscripts submitted to the journal I was editing myself (i.e., Chemistry Education Research and Practice).

I did not really understand what alpha was, or what it was supposed to demonstrate, or what value was desirable – which made it difficult to evaluate that aspect of a manuscript citing the statistic. So, I thought I had better find out more about it.

So, what is Cronbach's alpha?

It is a statistic that tests for internal consistency in scales. It should only be applied to a scale intended to measure a unidimensional factor – something it is assumed can be treated as a single underlying variable (perhaps 'confidence in physics learning', 'enjoyment of school science practicals', or 'attitude to genetic medicine').

If someone developed a set of questionnaire items intended to find out, say, how skeptical a person was regarding scientific claims in the news, and administered the items to a sample of people, then alpha would offer a measure of the similarity of the set of items in terms of the patterns of responses from that sample. As the items are meant to be measuring a single underlying factor, they should all elicit similar responses from any individual respondent. If they do, then alpha would approach 1 (its maximum value).
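For readers who like to see the arithmetic, here is a minimal sketch (not from the original post) of how alpha can be computed from a table of responses, where rows are respondents and columns are the items of one scale; the responses are made up, and the point is simply that consistent response patterns across items push alpha towards 1.

# Minimal sketch of Cronbach's alpha for a respondents-by-items matrix (made-up data).
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a 2-D array: rows = respondents, columns = items."""
    k = scores.shape[1]                               # number of items in the scale
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Five (hypothetical) respondents answering four items on a 1-7 scale;
# each person answers all four items in much the same way.
responses = np.array([[6, 7, 6, 6],
                      [2, 2, 3, 2],
                      [5, 5, 5, 6],
                      [3, 2, 3, 3],
                      [7, 6, 7, 7]])
print(round(cronbach_alpha(responses), 2))  # about 0.98 - close to the maximum of 1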

Does alpha not measure reliability?

Often, studies state that alpha is measuring reliability – as internal consistency is sometimes considered a kind of reliability. However, more often in research what we mean by reliability is that repeating the measurements later will give us (much) the same result – and alpha does not tell us about that kind of reliability.

I think there is a kind of metaphorical use of 'reliability' here. The technique derives from an approach used to test equivalence based on dividing the items in a scale into two subsets*, and seeing whether analysis of the two subsets gives comparable results – so one could see if the result from the 'second' measure reliably reproduced that from the 'first' (but of course the ordering of the two calculations is arbitrary, and the two subsets of items were actually administered at the same time as part of a single scale).

* In calculating alpha, all possible splits are taken into account.
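For the record (the post itself does not quote it), the usual computational formula relates the variances of the individual items to the variance of respondents' total scores; Cronbach (1951) showed this is equivalent to averaging the split-half coefficient over all possible splits:

\[
\alpha \;=\; \frac{k}{k-1}\left(1 \;-\; \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right)
\]

where k is the number of items in the scale, the variances in the numerator are those of the individual items, and the denominator is the variance of the total (summed) scores.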

Okay, so that's what alpha is – but, still, why carry out a study of the use of this statistic?

Once I understood what alpha was, I was able to see that many of the manuscripts I was reviewing did not seem to be using it appropriately. I got the impression that alpha was not well understood among researchers even though it was commonly used. I felt it would be useful to write a paper that both highlighted the issues and offered guidance on good practice in applying and reporting alpha.

In particular, studies would often cite alpha for broad features like 'understanding of chemistry' where it seems obvious that we would not expect understanding of pH, understanding of resonance in benzene, understanding of oxidation numbers, and understanding of the mass spectrometer to be the 'same' thing (or, if they are, we could save a lot of time and effort by reducing exams to a single question!).

It was also common for studies using instruments with several different scales not only to quote alpha for each scale (which is appropriate), but also to give an overall alpha for the whole instrument, even though it was intended to be multidimensional. So imagine a questionnaire which had a section on enjoyment of physics, another on self-confidence in genetics, and another on attitudes to science-fiction elements in popular television programmes: why would a researcher want to claim there was a high level of internal consistency across what are meant to be such distinct scales?

There was also incredible diversity in how different authors described the values of alpha they calculated – so the same value of alpha might be 'acceptable' in one study, 'fairly high' in another, and 'excellent' in a third (see figure 1).


Fig. 1 Qualitative descriptors used for values/ranges of values of Cronbach's alpha reported in papers in leading science education journals (The Use of Cronbach's Alpha When Developing and Reporting Research Instruments in Science Education)

Some authors also suggested that a high value of alpha for an instrument implied it was unidimensional – that all the items were measuring the same thing – which is not the case.

But isn't it the number that matters: we want alpha to be as high as possible, and at least 0.7?

Yes, and no. And no, and no.

But the number matters?

Yes of course, but it needs to be interpreted for a reader: not just 'alpha was 0.73'.

But the critical value is 0.7, is that right?

No.

It seems extremely common for authors to assume that they need alpha to reach, or exceed, 0.7 for their scale to be acceptable. But that value seems to be completely arbitrary (and was not what Cronbach was suggesting).

Well, it's a convention, just as p<0.05 is commonly taken as a critical value.

But it is not just like that. Alpha is very sensitive to how many items are included in a scale. If there are only a few items, then a value of, say, 0.6 might well be sensibly judged acceptable. In any case it is nearly always possible to increase alpha by adding more items till you reach 0.7.
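A rough illustration of that sensitivity (my sketch, using the standardised form of alpha, which depends only on the number of items and their average inter-item correlation): keeping the same modest average correlation and simply lengthening the scale is enough to carry alpha past 0.7.

# Standardised alpha depends only on the number of items (k) and the average
# inter-item correlation (r_bar): lengthening a scale raises alpha even though
# the items are no more strongly related to one another.
def standardised_alpha(k, r_bar):
    return k * r_bar / (1 + (k - 1) * r_bar)

for k in (4, 8, 16, 32):
    print(k, round(standardised_alpha(k, r_bar=0.2), 2))
# prints: 4 0.5, 8 0.67, 16 0.8, 32 0.89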

But only if the added items genuinely fit the scale?

Sadly, no.

Adding a few items that are similar to each other, but not really fitting the scale, would usually increase alpha. So adding 'I like Manchester United', 'Manchester United are the best soccer team', and 'Manchester United are great' as items to be responded to in a scale about self-efficacy in science learning would likely increase alpha.

Are you sure: have you tried it?

Well, no. But, as I pointed out above, instruments often contain unrelated scales, and authors would sometimes calculate an overall alpha which turned out to be greater than the alpha of each of its component scales – at least, that would seem to be the implication, if a larger alpha were assumed to mean a higher internal consistency without factoring in how alpha tends to be larger the more items are included in the calculation.

But still, it is clear that the bigger the alpha, the better?

Up to a point.

But consider a scale with five items where everybody responds to each item in exactly the same way (not, that is, that different people respond in the same way as each other, but that whatever response a person gives to one item – e.g., 2 on a scale of 1-7 – they also give to the other items). So alpha should be 1, as high as it can get. Yet Cronbach would suggest you are wasting researcher and participant effort by having several items if they all elicit the same response. The point of scales having several items is that we assume no single item perfectly captures what we are trying to measure. Whether that is so or not, there is no point in multiple items that are effectively equivalent.
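A quick sanity check with the formula given earlier: if all k items are, in effect, copies of one another, each item has the same variance σ² and the total score has variance k²σ², so

\[
\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{k\sigma^{2}}{k^{2}\sigma^{2}}\right) \;=\; \frac{k}{k-1}\cdot\frac{k-1}{k} \;=\; 1 .
\]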

Was it necessary to survey science education journals to make the point?

I did not originally think so.

My draft manuscript made the argument by drawing on some carefully selected examples of published papers in relation to the different issues I felt needed to be highlighted and discussed. I think the draft manuscript effectively made the point that there were papers getting published in good journals that quoted alpha but seemed to simply assume it demonstrated something (unexplained) to readers, and/or used alpha when their instrument was clearly not meant to be unidimensional, and/or took 0.7 as a definitive cut-off regardless of the number of items concerned, and/or quoted alpha values for overall instruments as well as for the distinct scales as if that added some evidence of instrument quality, or claimed that a high value of alpha for an instrument demonstrated it was unidimensional.

So why did you then spend time reviewing examples across four journals over a whole year of publication?

Although I did not think this was necessary, when the paper was reviewed for publication a journal reviewer felt the paper was too anecdotal: just because a few papers included weak practice did not mean the problem was especially significant. I think there was also a sense that a paper critiquing a research technique did not fit the usual categories of study published in the journal, whereas a study with more empirical content (even if the data were published papers) would fit the journal better.

At that point I could have decided to try and get the paper published elsewhere, but Research in Science Education is a good journal and I wanted the paper in a good science education journal. This took extra work, but satisfied the journal.

I still think the paper would have made a contribution without the survey BUT the extra work did strengthen the paper. In retrospect, I am happy that I responded to the review comments in that way – as it did actually show just how frequently alpha is used in science education, and the wide variety of practice in reporting the statistic. Peer review is meant to help authors improve their work, and I think it did here.

Has the work had impact?

I think so, but…

The study has been getting a lot of citations, and it is always good to think someone notices a study, given the work it involves. Perhaps a lot of people have genuinely thought about their use of alpha as a result of reading the paper, and perhaps there are papers out there which do a better job of using and reporting alpha as a result of authors reading my study. (I would like to think so.)

However, I have also noticed that a lot of papers citing this study as an authority for using alpha in the reported research are still doing the very things I was criticising, and sometimes directly justifying poor practice by citing my study! These authors either had not actually read the study (but were just looking for something about alpha to cite) or perhaps did not fully appreciate the points made.

Oh well, I think it was Oscar Wilde who said there is only one thing in academic life worse than being miscited…

Author: Keith

Former school and college science teacher, teacher educator, research supervisor, and research methods lecturer. Emeritus Professor of Science Education at the University of Cambridge.

2 thoughts on “Why write about Cronbach's alpha?”

  1. Hello Keith,
    Thanks for this detailed write-up and also for the paper on this.
    I need help in understanding about using Cronbach alpha. I think you are the best person to whom I could write to get my query clarified.

    Is it legitimate to use Cronbach's alpha when I want to understand the inter-evaluator consistency in quantitative scores obtained by students in an assessment scored using a rubric? I think Cronbach is better suited for this case than Cohen's kappa, as the latter would suggest agreement for a qualitative assessment.
    I could not get a reference for the above. In your paper too, you are talking about the use of Cronbach's alpha for affective constructs, and for knowledge and understanding, for which researchers have used this measure, and I could not find anything related to reliability in the assessment of students' scores using a rubric.
    Could you please help me in my understanding of the usage of Cronbach alpha for the context I have mentioned.

    1. Thank you for your comments Sujatha.

      Firstly, an important caveat – I am not a statistician. (Any statisticians reading this may wish to add their own comments.) I was motivated to write this paper after needing to learn about Cronbach as a reviewer and editor to make sense of journal submissions using this technique. (And consequently discovering that many of those using Cronbach did not seem to understand what it was intended for.)

      You are right that in the paper I suggest that Cronbach is more suited to use in the affective domain, when the construct being considered is something like self-efficacy in … or attitude to … , rather than in the cognitive domain. This is because it is assumed that a scale which is subject to analysis using the Cronbach’s alpha statistic is testing a ‘unitary’ construct. That is, a single construct rather than a complex of related notions.

      Now that does not mean it could not apply in relation to knowledge and understanding (rather that in most of the applications I have seen its use is questionable). Consider an example. We might consider assessing knowledge of the idea ‘that force is measured in newtons’. We might feel that is a sufficiently ‘unitary’ construct, and design an instrument with a range of items all intended to test whether students understand that force is measured in newtons. If we wanted to check the internal consistency of this instrument, it may be sensible to calculate Cronbach’s alpha.

      However, this is an unlikely scenario. More likely we would be testing the student on their knowledge and understanding of the force concept, and there might perhaps be one item to see if the students knew that force is measured in newtons, and other items on other aspects of the topic – that forces are vectors, that forces may balance (and in effect cancel), that net force gives rise to acceleration, that forces occur in ‘action-reaction’ pairs, or whatever. It only makes sense to test the internal consistency of such an instrument, looking to assess knowledge and understanding across a topic, if we really think that knowledge and understanding of force (or whatever the topic is) is effectively a single construct. That seems wrong. Clearly some students will know force is measured in newtons but not that it is a vector, and some will know force is measured in newtons, and is a vector, but not understand the nature of action-reaction pairs, and so forth. So we expect (and in an assessment are even looking for) systematic differences in response patterns across the set of items. We should therefore not expect all the items on the test to be targeting the same thing, and so there is no rationale for looking for a high internal consistency where respondents would tend to have similar response patterns across the set of items. A high value of Cronbach’s alpha on such a test would be of interest – but would not usually be a sign of a good assessment, as it suggests that students tend to offer similar patterns of responses, when an assessment is normally designed to identify differences between the students – to find out which students have mastered the various different learning objectives.

      Now, you ask about something else – about looking at evaluator consistency. So, here the construct you are interested in is something like the correctness of a student response in relation to a specified marking scheme: you are looking for an agreed understanding of what will, and what will not, get credit as a correct answer. I think that is in principle something which might be considered a unitary construct…but only for any specific item (or a set of very similar items with entirely parallel marking guidelines). In general, in an assessment, different questions will be targeting different knowledge, understanding, and application – and each question will have its own specific marking guidelines about what responses are worthy of credit (so an item about identifying action-reaction pairs will have marking guidelines linked to that focus; an item to test if a student treats forces as vectors will have guidelines designed according to that aspect of the force concept). So, I do not think it would make sense to use Cronbach’s alpha to test inter-evaluator consistency across a range of items which are each testing different things, and which therefore have different considerations informing which responses deserve credit.

      If the assessment is a single essay-type question, there will still likely be a rubric which offers credit for a range of different things (perhaps structuring a response, knowledge demonstrated, clarity of argumentation, acknowledging different perspectives, originality, and so forth), and you are still interested in how different evaluators apply the criteria in each of these areas – so I would still tend to think that Cronbach would not be a good choice.

      Cronbach’s alpha is about seeing if different items seem to be accessing the same underlying attitude, belief, etc. If you are interested in examining how closely different evaluators are scoring the same student scripts, then I would stick to one of the statistics that are more commonly used to measure inter-rater reliability.
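      By way of illustration (a sketch added here, not part of the original reply, and not a recommendation of any particular package): a weighted Cohen's kappa treats rubric levels as ordinal and directly compares two evaluators' scores for the same scripts; the scores below are invented.

      # Hedged sketch: comparing two evaluators' rubric scores (levels 1-4) for the
      # same ten scripts with weighted Cohen's kappa, an established inter-rater
      # statistic, rather than with Cronbach's alpha. The data are made up.
      from sklearn.metrics import cohen_kappa_score

      rater_a = [3, 2, 4, 4, 1, 3, 2, 4, 3, 1]   # scores awarded by evaluator A
      rater_b = [3, 2, 3, 4, 1, 2, 2, 4, 3, 2]   # scores awarded by evaluator B

      # 'quadratic' weights penalise large disagreements more than near-misses,
      # which suits ordinal rubric levels.
      print(round(cohen_kappa_score(rater_a, rater_b, weights="quadratic"), 2))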
