Keith S. Taber
What is Cronbach's alpha?
It is a statistic that is commonly quoted by researchers when reporting the use of scales and questionnaires.
Why carry out a study of the use of this statistic?
I am primarily a qualitative researcher, so do not usually use statistics in my own work. However, I regularly came across references to alpha in manuscripts I was asked to review for journals, and in manuscripts submitted to the journal I was editing myself (i.e., Chemistry Education Research and Practice).
I did not really understand what alpha was, or what it was supposed to demonstrate, or what value was desirable – which made it difficult to evaluate that aspect of a manuscript which was citing the statistic. So, I thought I had better find out more about it.
So, what is Cronbach's alpha?
It is a statistic that tests for internal consistency in scales. It should only be applied to a scale intended to measure a unidimensional factor – something it is assumed can be treated as a single underlying variable (perhaps 'confidence in physics learning', 'enjoyment of school science practicals', or 'attitude to genetic medicine').
If someone developed a set of questionnaire items intended to find out, say, how skeptical a person was regarding scientific claims in the news, and administered the items to a sample of people, then alpha would offer a measure of the similarity of the set of items in terms of the patterns of responses from that sample. As the items are meant to be measuring a single underlying factor, they should all elicit similar responses from any individual respondent. If they do, then alpha would approach 1 (its maximum value).
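To make this concrete, here is a minimal sketch of the usual variance-based calculation: alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name cronbach_alpha and the response data are hypothetical, invented purely for illustration, and the sketch assumes complete responses (no missing values) and uses sample variances.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items in the scale
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total (summed) score.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses on a 1-7 agreement scale: 6 respondents x 4 items.
responses = [
    [6, 7, 6, 5],
    [2, 1, 2, 3],
    [4, 4, 5, 4],
    [7, 6, 6, 7],
    [3, 3, 2, 2],
    [5, 5, 4, 5],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

In this made-up sample each respondent answers the four items in much the same way, so the item variances account for only a small share of the variance in total scores and alpha comes out close to 1.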
Does alpha not measure reliability?
Often, studies state that alpha is measuring reliability – as internal consistency is sometimes considered a kind of reliability. However, more often in research what we mean by reliability is that repeating the measurements later will give us (much) the same result – and alpha does not tell us about that kind of reliability.
I think there is a kind of metaphorical use of 'reliability' here. The technique derives from an approach used to test equivalence based on dividing the items in a scale into two subsets*, and seeing whether analysis of the two subsets gives comparable results – so one could see if the result from the 'second' measure reliably reproduced that from the 'first' (but of course the ordering of the two calculations is arbitrary, and the two subsets of items were actually administered at the same time as part of a single scale).
* In calculating alpha, all possible splits are taken into account.
Okay, so that's what alpha is – but, still, why carry out a study of the use of this statistic?
Once I understood what alpha was, I was able to see that many of the manuscripts I was reviewing did not seem to be using it appropriately. I got the impression that alpha was not well understood among researchers even though it was commonly used. I felt it would be useful to write a paper that both highlighted the issues and offered guidance on good practice in applying and reporting alpha.
In particular, studies would often cite alpha for broad features like 'understanding of chemistry' where it seems obvious that we would not expect understanding of pH, understanding of resonance in benzene, understanding of oxidation numbers, and understanding of the mass spectrometer, to be the 'same' thing (or if they are, we could save a lot of time and effort by reducing exams to a single question!).
It was also common for studies using instruments with several different scales to not only quote alpha for each scale (which is appropriate), but to also give an overall alpha for the whole instrument even though it was intended to be multidimensional. So imagine a questionnaire which had a section on enjoyment of physics, another on self-confidence in genetics, and another on attitudes to science-fiction elements in popular television programmes: why would a researcher want to claim there was a high level of internal consistency across what are meant to be such distinct scales?
There was also incredible diversity in how different authors describe different values of alpha they might calculate – so the same value of alpha might be 'acceptable' in one study, 'fairly high' in another, and 'excellent' in a third (see figure 1).

Fig. 1 Qualitative descriptors used for values/ranges of values of Cronbach's alpha reported in papers in leading science education journals (The Use of Cronbach's Alpha When Developing and Reporting Research Instruments in Science Education)
Some authors also suggested that a high value of alpha for an instrument implied it was unidimensional – that all the items were measuring the same things – which is not the case.
But isn't it the number that matters: we want alpha to be as high as possible, and at least 0.7?
Yes, and no. And no, and no.
But the number matters?
Yes of course, but it needs to be interpreted for a reader: not just 'alpha was 0.73'.
But the critical value is 0.7, is that right?
No.
It seems extremely common for authors to assume that they need alpha to reach, or exceed, 0.7 for their scale to be acceptable. But that value seems to be completely arbitrary (and was not what Cronbach was suggesting).
Well, it's a convention, just as p<0.05 is commonly taken as a critical value.
But it is not just like that. Alpha is very sensitive to how many items are included in a scale. If there are only a few items, then a value of, say, 0.6 might well be sensibly judged acceptable. In any case it is nearly always possible to increase alpha by adding more items till you reach 0.7.
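The dependence on the number of items can be illustrated with the standardized form of alpha, k × r / (1 + (k - 1) × r), where k is the number of items and r is the mean inter-item correlation. This is a sketch under assumptions that do not come from the paper: items are treated as standardized (equal variances), added items are assumed to correlate with the others as strongly as the existing ones do, and the value r = 0.2 and the function names are illustrative only.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def items_needed(target, mean_r, max_items=1000):
    """Smallest number of items (at least 2) whose standardized alpha reaches target."""
    for k in range(2, max_items + 1):
        if standardized_alpha(k, mean_r) >= target:
            return k
    return None  # target not reached within max_items


mean_r = 0.2  # assumed, fairly weak, mean inter-item correlation
for k in (3, 5, 10, 20):
    print(k, "items:", round(standardized_alpha(k, mean_r), 2))
print("items needed to reach 0.7:", items_needed(0.7, mean_r))
```

With the same weak average correlation throughout, three items give an alpha of about 0.43, while ten items give about 0.71. Nothing about how well the items hang together has changed, only how many of them there are.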
But only if the added items genuinely fit the scale?
Sadly, no.
Adding a few items that are similar to each other, but not really fitting the scale, would usually increase alpha. So adding 'I like Manchester United', 'Manchester United are the best soccer team', and 'Manchester United are great' as items to be responded to in a scale about self-efficacy in science learning would likely increase alpha.
Are you sure: have you tried it?
Well, no. But, as I pointed out above, instruments often contain unrelated scales, and authors would sometimes calculate an overall alpha which the computer found to be greater than that of each of the component scales. The implication would be that the whole instrument was more internally consistent than any one of its scales, at least if it were assumed that a larger alpha means a higher internal consistency without factoring in how alpha tends to be larger the more items are included in the calculation.
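The effect of adding off-topic but mutually similar items can at least be explored on paper. The sketch below is not empirical evidence: it computes alpha for standardized items directly from an assumed correlation matrix, using alpha = k/(k-1) × (1 - k / sum of all entries in the matrix). Every number is an assumption chosen for illustration: four scale items correlating 0.3 with each other, three added items correlating 0.9 with each other, and zero correlation between the two groups. The function names are hypothetical.

```python
def alpha_from_correlations(corr):
    """Alpha for standardized items, given their correlation matrix (list of lists)."""
    k = len(corr)
    total = sum(sum(row) for row in corr)  # variance of the summed standardized items
    return (k / (k - 1)) * (1 - k / total)  # the diagonal (item variances) sums to k


def block_matrix(n_scale, r_scale, n_extra, r_extra, r_cross):
    """Correlation matrix for n_scale scale items followed by n_extra added items."""
    k = n_scale + n_extra
    corr = []
    for i in range(k):
        row = []
        for j in range(k):
            if i == j:
                row.append(1.0)
            elif i < n_scale and j < n_scale:
                row.append(r_scale)  # two original scale items
            elif i >= n_scale and j >= n_scale:
                row.append(r_extra)  # two added, off-topic items
            else:
                row.append(r_cross)  # one item of each kind
        corr.append(row)
    return corr


# Assumed values only: 4 scale items (r = 0.3), 3 added items (r = 0.9), unrelated groups.
scale_only = alpha_from_correlations(block_matrix(4, 0.3, 0, 0.0, 0.0))
with_extras = alpha_from_correlations(block_matrix(4, 0.3, 3, 0.9, 0.0))
print(f"scale items only:  {scale_only:.3f}")
print(f"with added items:  {with_extras:.3f}")
```

Under these assumptions alpha rises from about 0.63 to about 0.66, even though the added items have nothing to do with the original scale. The outcome depends on the assumed values: if the added items correlated only 0.3 with each other, the same calculation gives about 0.51, a decrease.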
But still, it is clear that the bigger alpha the better?
Up to a point.
But consider a scale with five items where every respondent answers all five items in exactly the same way. (That is not to say that different people respond in the same way as each other, only that whatever response a person gives to one item – e.g., 2 on a scale of 1-7 – they also give to the other items.) So alpha should be 1, as high as it can get. But Cronbach would suggest you are wasting researcher and participant effort by having many items if they all elicit the same response. The point of scales having several items is that we assume no one item directly catches perfectly what we are trying to measure. Whether any one item does or not, there is no point in multiple items that are effectively equivalent.
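That alpha equals 1 in this situation can be checked against the usual variance form of alpha. The notation is an assumption for this sketch, not taken from the paper: k items with scores X_i, and total score T equal to the sum of the X_i. If each person gives the same score X to every item, each item has the same variance, and the total is kX, so its variance is k squared times that of X (provided people differ from one another, so the variance is not zero).

```latex
\alpha
  = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{X_i}^{2}}{\sigma_{T}^{2}}\right)
  = \frac{k}{k-1}\left(1 - \frac{k\,\sigma_{X}^{2}}{k^{2}\,\sigma_{X}^{2}}\right)
  = \frac{k}{k-1}\cdot\frac{k-1}{k}
  = 1
```

The result holds for any k of two or more, including the five items of the example, which is why the extra items add nothing in this case.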
Was it necessary to survey science education journals to make the point?
I did not originally think so.
My draft manuscript made the argument by drawing on some carefully selected examples of published papers in relation to the different issues I felt needed to be highlighted and discussed. I think the draft manuscript effectively made the point that there were papers getting published in good journals that quoted alpha but seemed to simply assume it demonstrated something (unexplained) to readers, and/or used alpha when their instrument was clearly not meant to be unidimensional, and/or took 0.7 as a definitive cut-off regardless of the number of items concerned, and/or quoted alpha values for overall instruments as well as for the distinct scales as if that added some evidence of instrument quality, or claimed a high value of alpha for an instrument demonstrated it was unidimensional.
So why did you then spend time reviewing examples across four journals over a whole year of publication?
Although I did not think this was necessary, when the paper was reviewed for publication a journal reviewer felt the paper was too anecdotal: that just because a few papers included weak practice, that may not have been especially significant. I think there was also a sense that a paper critiquing a research technique did not fit in the usual categories of study published in the journal, but a study with more empirical content (even if the data were published papers) better fitted the journal.
At that point I could have decided to try and get the paper published elsewhere, but Research in Science Education is a good journal and I wanted the paper in a good science education journal. So I added the survey of published papers. This took extra work, but satisfied the journal.
I still think the paper would have made a contribution without the survey BUT the extra work did strengthen the paper. In retrospect, I am happy that I responded to review comments in that way – as it did actually show just how frequently alpha is used in science education, and the wide variety of practice in reporting the statistic. Peer review is meant to help authors improve their work, and I think it did here.
Has the work had impact?
I think so, but…

The study has been getting a lot of citations, and it is always good to think someone notices a study, given the work it involves. Perhaps a lot of people have genuinely thought about their use of alpha as a result of reading the paper, and perhaps there are papers out there which do a better job of using and reporting alpha as a result of authors reading my study. (I would like to think so.)
However, I have also noticed that a lot of papers citing this study as an authority for using alpha in the reported research are still doing the very things I was criticising, and sometimes directly justifying poor practice by citing my study! These authors either had not actually read the study (but were just looking for something about alpha to cite) or perhaps did not fully appreciate the points made.
Oh well, I think it was Oscar Wilde who said there is only one thing in academic life worse than being miscited…