Critical Reading of Empirical Studies Tool evaluation
The Critical Reading of Empirical Studies Tool was designed as a simple aid to help those reading research studies keep in mind the need to read critically – not just to seek to find out what researchers are claiming to have found, but to reach a view on the extent to which study conclusions and recommendations are justified given the reported details of the study.
You can read about, and access, the tool here.
A small-scale evaluation of the tool was undertaken to ascertain whether it might prove useful to scholars, and whether any modifications might be indicated.
Executive summary
Participants
I had thought there might be some interest in this tool, given how important critical reading of studies is in university courses – especially for research students.
Volunteers were sought through the Moodle (V.L.E.) sites of students in the Faculty of Education at the University of Cambridge. Permission was first sought from the Director and Deputy Director of Learning and Teaching, who were considered appropriate 'gatekeepers' for the students, and the Faculty's ethical procedures were followed (with details of the project lodged with the Faculty Research Office).
[You can read the invitation to volunteer for the project here.]
Volunteers were sent an electronic copy of the tool (a one-page form for noting ratings of aspects of a study), along with guidance giving instructions on how to use the tool (and how to return feedback for the evaluation). 1
Respondents were requested to try out the form (when they needed to read a research paper for their studies), and to return a copy of the annotated form (showing the ratings given) along with their feedback responses.
In the event, 19 Cambridge students volunteered to receive the materials, but only 3 returned feedback. One volunteer sent apologies and an explanation for not returning materials. Whether the others decided there was no value in trying out the activity, did not have the opportunity (unlikely on a university course), or simply forgot, remains unknown. However, the generally positive feedback from those who did respond should be read in the light of their being a minority of those who had initially volunteered to participate.
I therefore sought additional volunteers from further afield, and seven more students volunteered to receive materials, of whom three sent feedback. (One of these respondents offered three examples of using the tool.)
The pattern of participants sending feedback was:
| | Undergraduate | Doctoral | Total |
| --- | --- | --- | --- |
| Cambridge Faculty of Education | 2 | 1 | 3 |
| External to Cambridge | – | 3 | 3 |
| Total | 2 | 4 | 6 |
Clearly, given the size and nature of the sample, it was not considered useful to seek to associate patterns of feedback with any particular classes of respondent.
Feedback requested
The following information was requested:
Participant ratings
Participants generally rated the studies they read as offering conclusions they could have confidence in. The ratings themselves offer little information, as participants rated studies of their own choice which have not been independently assessed. One noteworthy feature of the participant ratings, however, was that participants' overall confidence in the conclusions of a study was generally higher than might have seemed justified by their confidence in its component parts: in this very small sample, most participants reported a higher overall level of confidence in a study's conclusions than their level of confidence in at least one of the component stages of the study. Although the numerical ratings cannot be seen as having any precision, this suggests something that may be worth further attention.
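To make the comparison concrete, a minimal sketch is shown below (in Python, purely for illustration): given one reader's confidence ratings on the tool's 0.0–1.0 scale, it simply checks whether the overall rating exceeds the lowest component rating – the pattern noted above. The ratings in the example are hypothetical and are not taken from the returned forms; on a 'weakest link' view of an argument one might expect overall confidence not to exceed the lowest component rating, and the check flags when it does.

```python
# A minimal illustrative sketch (not part of the CREST materials): the ratings
# below are hypothetical examples, not values taken from the returned forms.

def overall_exceeds_weakest_component(component_ratings, overall_rating):
    """Return True if the overall confidence rating is higher than the
    reader's confidence in at least one component stage of the study."""
    return overall_rating > min(component_ratings.values())

# Hypothetical ratings on the tool's 0.0-1.0 scale (in steps of 0.1)
components = {
    "study design": 0.7,
    "sampling": 0.5,
    "research instruments": 0.7,
    "analysis": 0.6,
    "interpretation of results": 0.7,
}
overall = 0.8

if overall_exceeds_weakest_component(components, overall):
    print("Overall confidence is higher than confidence in the weakest component.")
```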
Read about the ratings given by participants in the evaluation
Feedback from participants
Participant feedback and responses to their comments and questions are provided below.
Was the tool useful?
"Did you think the tool was useful in helping you critically read the study"
Respondents were asked if the tool was useful and the following options were suggested:
| Response option | Number of respondents |
| --- | --- |
| very helpful | 1 |
| quite helpful | 3 |
| only slightly helpful | 2 |
| made no difference | 0 |
| impeded reading | 0 |
| other | 0 |
| Total | 6 |
One of the respondents who found the tool "quite helpful" also commented
"This tool pushes me to think about the article from different perspectives to develop my critical thinking skills. A very good tool…
Overall, I like this tool".
One of the respondents who selected "only slightly helpful" explained this response:
"I choose 'only slightly helpful' for the second question because I'm already familiar with reading academic papers critically and pay attention to different aspects of a study's design. I think it would be more helpful for first-year Tripos [undergraduate] students who do not have much experiences reading academic papers."
A student who had rated the tool 'quite helpful' commented that
"the questions and scales were definitely very useful. Even though I would not see myself using them for every study, it definitely made me think. The critical analysis of the different aspects of a study (study design, sampling, research instruments, analysis, interpretation of results) were never at the forefront of my mind (unless there was a really glaring problem in the study) but the tool did make we more conscious of this while reading."
Is the tool recommended?
Respondents were also asked,
Would you recommend that other students try the tool to see if it helps them with their studies?
All respondents suggested the tool should be recommended, at least to some users. Two respondents simply replied: Yes. The others responded:
"yes (different things work for different people, but this tool is definitely worth trying)"
"Yes, to first-year [undergraduate] students."
"I will recommend to my students and colleagues for using this tool."
"Yes
With an accompanying guideline and possible modifications, the tool may be very useful. I would even recommend it for school students in secondary education and beyond. The tool will be particularly helpful for students who have started systematic scientific writing as well as anyone who is new to critical reading of research."
This last respondent makes an insightful observation about research writing – being a critical reader of research helps develop self-criticality as a writer: experience of critiquing other people's work is valuable in learning to read one's own work as if a critic, and so in identifying areas where one's writing needs to be strengthened.
Response: CREST as a 'scaffolding' tool
The CREST is envisaged as a kind of study aid that does not seek to provide the learner with new information, but rather seeks to help structure a learning activity. That is, the tool does not in itself provide knowledge and understanding (although relevant web-links were included in the instructions for any user who was unclear about the features they were being asked to evaluate), but rather is meant to help the user keep in mind and marshal existing knowledge and understanding. Such a tool is likely to be superfluous to a very experienced critical reader of research, but (it was conjectured) may be useful for someone still developing expertise in this area.
In educational terms, such a tool may help overcome the limitations of working memory by acting as an external memory aid, and by helping to structure activity so that the learner can focus on each of a series of discrete, manageable tasks sequentially, rather than trying to hold a good many objectives in mind at once. That is, this is a kind of 'scaffolding' tool that helps a person bring to mind, and make the best use of, existing relevant prior knowledge and understanding.
Read more about the idea of tools to scaffold learning
Possible improvements
Respondents were asked,
Do you have any comments on how the tool might be improved to be made more useful?
There were a number of comments in response to this question. Given the small sample size, however, there was little possibility of identifying common responses which might have highlighted changes that would have been widely valued. The main points that arose are discussed in the sections below, along with my responses/observations. A miscellaneous section responds to a range of comments and questions posed by one of the participants in her/his feedback.
Prompt questions / more space for notes
One of the participants noted:
"I am a person who tends to annotate texts quite heavily, so I did not like to have a separate sheet of notes. I realized that I couldn't help myself from writing the notes on the separate sheet onto the margins of the document itself."
In a similar way, one participant suggested:
"If I can comment on your tool, I would like to say that this instrument is relatively easy to use either the main instrument and its direction. But if every item has blank space for putting comments and the title of articles or books that have read, it will be better."
The version shared with participants had included a modest space for generic notes, such as bibliographic details, but not dedicated spaces to expand upon the distinct ratings given for the different features of a study. Another of the participants made a similar point.
"Consider having the tool accompanied by guiding questions that explain what is meant by each of the different items to be assessed….
Consider including a 'notes' area for justification after each question, for such notes can help document the reasons for selecting one rating vs. another…
Would it be of more value if a rubric was used as an exemplar framework alongside the visual device?"
In the evaluation the volunteers had been sent guidance for completing the task, which included explanations of the points to be evaluated, along with web-links to more detailed discussions of these topics. However, this participant may have meant that such information could have been included on the form itself. That was an explicit suggestion made by another volunteer,
"Maybe add a short description of the criteria under the criteria, or formulate it as a question while still highlighting the criteria (e.g. 'to what degree do the authors convince you that the study they have designed is suitable for answering their research questions?') because I found myself going back to the instructions a lot"
This was a useful suggestion. The design of the materials was undertaken with a view to keeping the evaluation sheet as uncluttered as possible, whilst putting detail in the instructions. This is indeed likely to lead to some 'back-and-forth' initially, but my assumption was (and remains) that with repeated use someone would find they no longer needed to refer to the instructions. However, if someone initially finds the need to keep referring to another document too distracting, then they will not reach this level of familiarity, and may well abandon use of the tool.
As one of the participants noted:
"As with any instrument, I think that the tool itself may need some practice if it were not to impede the reading pace but organize thoughts and make judgements concerning the confidence in the research presented."
Response:
In response to this suggestion, an alternative version of the tool has been prepared, including a sequence of prompt questions, each with space for making brief notes in response.
Both formats are available for download for anyone who wishes to try out the tool. 2
One respondent suggested having a Google docs version.
"Consider designing the tool in a Google form or something similar; this will come in handy and save the users time to organize/re-enter the data and impressions made of the different articles read."
The existing version can be printed out and used in hard copy, or used as an electronic document, and annotated using a computer/tablet etc. I assume the advantage of a Google forms version would be to allow on-line collaboration between individuals. I did not envisage the tool being used that way, but am happy for anyone to make such a version if they feel it will be useful.
Have fewer scale points
One of the participants asked:
"Why did the tool use an 11-point scale?
Would it be easier to have a 5-point scale?
What constitutes a difference between 0.1 and 0.2 for example? Or between 0.2 and 0.3?"
Another participant commented:
"I am confused about how many points I should give for each aspect. Different people have different views, because I think for this aspect I give 1.0, other people might think it should be given 0.6, for example. So, it it possible to make the features to evaluate more detailed?"
Response:
There is no 'right' number of points for such a scale, although it is sometimes recommended that an even number of points is used on questionnaire scales to discourage respondents from choosing neutral responses. In terms of the tool acting as a study aid, I am not sure it matters too much. (What is more important is having reasons for – being able to justify – giving a particular level of rating.)
The choice of a numeric scale was deliberate, as it offers an advantage in considering how overall evaluations link to the evaluation of the component aspects of a study. In the sample of completed forms returned, there was a sense that readers perhaps have more confidence in the overall conclusions of a study than their evaluations of its different components would seem to warrant.
Read about the ratings given by participants in the evaluation
However, an alternative format has been prepared which uses a six-point verbal scale. 2 Both versions are available to anyone who wishes to use the tool. Clearly other options are possible, and if any potential user has an especially strong preference for something different then they may wish to develop their own bespoke version.
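For anyone wanting to compare notes across the two formats, the numeric ratings could be banded into six categories along the following lines. This is only a rough sketch: the verbal labels and cut-points shown here are hypothetical illustrations, not the wording used in the verbal-scale versions of the tool.

```python
# A rough illustrative mapping only: these six labels and cut-points are
# hypothetical, not the wording used in the verbal-scale versions of the tool.

VERBAL_LABELS = [
    "no confidence",
    "little confidence",
    "some confidence",
    "moderate confidence",
    "considerable confidence",
    "full confidence",
]

def numeric_to_verbal(rating: float) -> str:
    """Band an 11-point numeric rating (0.0-1.0 in steps of 0.1) into one of
    six roughly equal-width verbal categories."""
    if not 0.0 <= rating <= 1.0:
        raise ValueError("rating should lie between 0.0 and 1.0")
    band = min(int(rating * 6), 5)  # a rating of 1.0 falls in the top band
    return VERBAL_LABELS[band]

print(numeric_to_verbal(0.3))  # -> little confidence
```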
The choice of features to evaluate
One of the participants questioned the specific choice of features that were highlighted on the form:
"Consider having an item that relates to the results themselves."
and also:
"The distinction between confidence "in the analysis" and "in the interpretation of results" is not very clear without the guiding questions.
The distinction between confidence in "the interpretation of results" (to mean in the conclusions) and in "the overall study's conclusion" (to mean the knowledge claims) seems to be thin. Wouldn't the authors often conclude by summing up the discussion and revisiting the RQs?
Would confidence in the analysis and interpretation overlap if the journal required researchers to have one section for both results and discussion vs. two separate sections?"
These are thoughtful questions.
In devising the form I wanted to limit the number of specific features rated, and have features which would apply to most studies. But of course research, and presentation of research, vary greatly.
For example, I did not specify questions about the thoroughness of the literature review, the adequacy of the conceptual framework developed from it, or the framing of the research questions – but started with a fairly general point about overall study design (which might lead to readers not taking into account the background that led to that choice of design).
In most studies, I doubt there is a basis for (as this participant suggested) evaluating the results per se as a feature separate from the analysis (which produced the findings) and the way the results were interpreted by the researchers. However, it is difficult to offer a template which works for all kinds of empirical study and all kinds of write-up. A user who feels strongly that a somewhat different parsing of the research process would be more useful to them may wish to modify the format as they see fit.
However, unlike this particular participant, I did think there was a clear distinction between 'the interpretation of results' ["To what degree are you confident that the researchers' conclusions (e.g., answers to research questions) actually follow from their findings (results)?"], which refers to a step in the argument, and 'overall level of confidence in the study's conclusions' ("To what degree are you confident that the new 'knowledge claims' made by the authors of the paper are justified in terms of the quality of the study overall?"), which refers to an overview of the different steps collectively.
However, I would reiterate that this is only meant to be a tool to support the thinking process, and that if any users find it more helpful to make their own versions of the tool by changing or adding sections (for specific studies, or more generally) then they are welcome to do that.
Providing exemplars to increase reliability
One of the participants suggested
"Consider adding exemplars of evaluating the strength of each feature and what constitutes a 0.0 or a 0.1, etc.
The reader's knowledge and/or understanding of the methodology used in the paper for example might differ. Does the tool assume some 'pre-requisites'?
Should the student using the tool have adequate knowledge about different research designs that may be used to answer different RQs to make a "sound judgement"?
How would the tool help students/users make a better judgement?"
The same respondent also mooted:
"The same article might be marked as a 9.0 by someone and a 6.0 by another person. Alternatively, the same article might be marked differently by the same person if read at different times.
If users don't explain their reasoning following each section, wouldn't their judgment change if they read it again later?
Since evaluation is subjective, wouldn't the ratings change if the user read the article at one point then used the tool to evaluate it at a different time point? How reliable would the instrument then be and/or what would its main purpose become?"
Response:
Reliability (producing replicable results) is important in a research instrument.
Read about reliability of research instruments
I can see that if the purpose of the tool were to train readers to be able to assign numerical scores in a reliable way, then this would be a serious concern – but this would require there to be some consensus on such scoring within the research community. I do not think such consensus actually exists, even within particular fields or disciplines.
At this level of study there are not straightforward right answers, so that people from different backgrounds or having different priorities may reasonably come to quite different views about the adequacy of a necessary research compromise or the seriousness of some limitation of technique. (See, for example, the discussions of strengths and weaknesses of specific papers in Taber, 2013).
However, I was not considering that the Critical Reading of Empirical Studies Tool might be used as a research instrument, but rather as a study tool. So, the purpose of scoring is not to arrive at a score, but to engage in the kind of thinking which leads to making a principled (rather than just an impressionistic) evaluation.
I therefore think it would likely be impossible – and, indeed, not so important – to offer exemplars from actual studies of what might equate to specific ratings. (Some examples of potential issues with published studies that might be considered to undermine their conclusions are discussed in some of the postings in the blog section of the site.)
What could be very valuable for readers of research (especially relative novices), in the spirit of dialogic pedagogy, would be for several people to independently read and rate a study, and then have a conversation to compare and explain their ratings. Again, this would not be to force agreement, but to share (perhaps diverse) perspectives and to practise making explicit the basis of evaluations.
To respond to some of the specific questions here:
- "Does the tool assume some 'pre-requisites'?: Yes, evaluation of a study requires an understanding of research methods and the overall logic of a research study. The tool is meant to help learners apply this prior learning.
- "Should the student using the tool have adequate knowledge about different research designs that may be used to answer different RQs to make a "sound judgement"?: Again, yes a student's evaluations will be better informed when familiar with the methodology being adopted. The tool does not teach this, but might help a learner become aware of gaps in their knowledge/understanding.
- "How would the tool help students/users make a better judgement?": The tool does not, of itself, provide tuition on research methods, but rather seeks to support a learner in keeping in mind aspects of a study that require critical evaluation.
- ."..the same article might be marked differently by the same person if read at different times. If users don't explain their reasoning following each section, wouldn't their judgment change if they read it again later?": A person's judgement may well shift. People may have complex knowledge and understanding and their evaluations at any moment can be influenced by contextual features (including what they have been thinking, reading and talking about just prior to the activity!) One might also hope that people's evaluations become more nuanced and sophisticated as they learn more about research and become more familiar with critical engagement with research. In education, change is not necessarily a bad thing!
- "Since evaluation is subjective, wouldn't the ratings change if the user read the article at one point then used the tool to evaluate it at a different time point? How reliable would the instrument then be and/or what would its main purpose become?": As suggested above, the tool is not intended to offer a means of carrying out a fully objective evaluation that would be invariant across users or time, but rather to support the user in keeping in mind different features that are worthy of close attention to support them in making an evaluation informed by the knowledge and understanding they can being to bear.
That said, although one would expect different researchers with different backgrounds to reach somewhat different judgements of the merits of studies, there clearly are aspects of some studies which should act as 'red flags' and give any reader cause for concern. For example:
- if no information is given on how data were analysed
- if information about the population being sampled is inconsistent in different parts of the same paper
- if a study is supposed to be an 'experiment', but there was no intervention or comparison condition reported
- if a study is described as a 'case study', but the participants are 116 students from four different classes, two from each of two schools
- if a study design requires participants to be randomised into different conditions, and this was done on the basis of "even and odd birth dates"
- etc.
Other comments
One of the participants offered a number of other questions/comments/suggestions on various points, which I respond to here:
- "Would the users' level of confidence in the assessed aspects of the work reported help users better build their argument should they find some kind of 'gap' in the researchers' thinking?": I would assume so. I imagine that it is also the case that if a reader suspected a 'gap', but was not confident in their own knowledge or understanding of that feature (and so perhaps unsure of the validity of such a criticism) this might prompt them to do some reading, review their notes, speak to a teacher or more experienced peer, etc.
- "Or is evaluating articles with 'low ratings' a judgement for the 'low quality' of the article?": Most research studies have limitations and weaknesses due to both inherent difficulties in undertaking some kinds of research, and the inevitable limitations on the researcher's resources. However, one would expect an author to highlight and discuss these issues, and make any relevant caveats and provisos to their conclusions and recommendations explicit. But, yes, in general, a study which is evaluated as having low ratings should be considered of low quality.
- "Shouldn't causes for the 'lack of confidence' be picked up by the reviewers of good journals?": Yes – but 'ought' is not 'is'. But in much the same way as people should be nice to each other, and honesty is the best policy, and violence should always be avoided if at all possible…the world is not perfect. Leaving aside the genuine differences of opinion that various experts might have due to different commitments and experiences, (a) sometimes reviewers and editors of decent journals miss things, and poor research 'slips through';(b) there is now a very large number of recently established journals with weak quality standards either because (i) newly established journals cannot attract sufficient qualified and experienced editors and reviewers when there is already a range of prestigious journals in that field, or, (ii), more worryingly, they are 'predatory journals' that will publish almost any non-sense for a fee. (This is not mere rhetoric, or exaggeration – see some of the examples highlighted in the blog postings.) These predatory journals often look superficially no different to reputable journals. (Read about predatory journals)
- "To what extent does this tool differ compared to the criteria a reviewer uses to assess a new submission for a research article?": The criteria are entirely relevant to someone reviewing for a journal. Reviewers evaluating submissions for publication (in high quality journals) may ALSO be asked to comment on the novelty and significance of the work. (Read about peer review) That is a study might be technically very good (and so get high ratings on the tool), but only make a modest contribution as it is considered largely derivative and so only adding incrementally to a large existing pool of related work.
- "Would the tool further help in the evaluation process when it comes to confidence in the authors' research claims?": I would hope the tool might be useful for those new to evaluating work for publication.
- "To what extent can the tool be used to evaluate empirical replication studies … (when studies are convincing but do not contribute 'new knowledge')?": The focus of the tool is on the justification of research claims based on argument from analysis of relevant evidence. A replication study should be judged in its own terms just as any other (so the tool is just as relevant). So, this should make no difference. In research in education 'replication' studies may be very valuable as questions such as the effectiveness of pedagogy (for example) may be contextually bound – that is what works at one place and time may not work somewhere else, or even in the same place some time later. Arguably, in research in education and other social science, there are strictly no replication studies as conditions can never be recreated exactly. (Read about replication studies).
- "To what extent can the tool be used to evaluate … studies with negative results (when studies are convincing but do not contribute 'new knowledge')?":If a study produces 'negative results' then these depends upon the same process of argumentation as if positive results. So, again, this should make no difference to the relevance of the criteria.
- "When evaluating confidence in the sample, case studies are challenging for numbers of usually small. In this case, what would 'confidence' entail for the population to be 'sufficiently well sampled'?" : A case study explores one instance of some phenomenon (a group discussion; a BEd programme; a physics examination paper; a lesson on flame tests). Often the full population can be included in the 'sample' – all the children involved in a group discussion; all the diagrams in a biology text book) so in effect the sample is the population. Where this is not possible (e.g., a case study of a university course with several hundred students) then a judgment needs to be made over whether the sample size is sufficiently large and whether those sampled are sufficiently representative of the population – as in any other type of study. Note that this is a separate issue to the question of whether the findings from a case study can be generalised beyond the case (i.e., that may be an issue to consider in deciding whether conclusions convincingly follow from the results of a study). (Read about case study as a methodology.)
- "Do journal articles usually include a full justification of the rational for selecting one population vs. another (especially when conducting the research with both samples may yield significant results)?": Certainly not always. More seriously in my experience (certainly in education) too often papers are published without any explicit description of the population, and sometimes even without any explicit statement of where (even in which country) the research was carried out which may give the impression the population is (and research findings can automatically be generalised to) such wide groups as 'chemistry teachers', or 'undergraduate engineering students', nine year olds, etc. (Read about populations of interest in research studies.)
- "With regards to the naming of the tool, I was wondering to what extent does 'confidence in the research claims' genuinely reflect 'critical reading'?": The tool asks users to evaluate some key aspects of a paper (by rating confidence in those aspects of the author's argument) as a process that may be useful to support the development of critical reading habits.
Work cited:
- Taber, K. S. (2013). Classroom-based Research and Evidence-based Practice: An introduction (2nd ed.). London: Sage.
Notes:
1 The instructions linked here have been modified (a) to remove references to returning feedback that were included in the instructions to evaluation participants; (b) to refer to the alternative versions offered following feedback as discussed above. (Anyone who wishes to see the document in the form sent to study participants can request this by emailing creste@science-education-research.com)
2 So there are four versions of the tool available for download:
| | numeric scale | verbal scale |
| --- | --- | --- |
| without prompt questions | CREST-n | CREST-v |
| with prompt questions | CREST-n+ | CREST-v+ |