Rating features of research studies

Examples of use of the Critical Reading of Empirical Studies Tool

A topic in research methodology

This page presents examples of the use of the Critical Reading of Empirical Studies Tool, drawn from an evaluation of the tool.

The tool is intended to support learners in critically reading empirical research reports by suggesting a number of features of studies (i.e., steps in the research process; links in the argument for research conclusions) to evaluate when reading a study.

The evaluation asked volunteers to try out the tool, offer some feedback, and provide (at least) one example of a completed form. A numerical scale was offered, but participants were told: "Do NOT worry about being precise, as this is meant to be purely impressionistic".

A blank copy of CREST

You can read about CREST (the Critical Reading of Empirical Studies Tool) here.

You can read about the CREST evaluation, CRESTe, here.

The respondents were assured of anonymity, and they were not required to give details of the study they had read. (In order to make the activity authentic, participants were asked to use the tool when they would in any case be reading a research report for their own studies.) A notes panel was provided in case this was useful to participants, but information entered in that panel is not reported. To maintain the confidentiality of the data, the figures used here are redrawn from the submitted forms.

The format of the form has been changed to juxtapose the overall rating next to the other ratings.

A person's overall confidence in the conclusions of a study should be informed by how convincing they found the component features of the study

In the redrawn figures below, the red rings reproduce the ratings given by the evaluation participants in relation to five specific features of a research study being read, as well as an overall evaluation of how confident they were in the study conclusions, given their reading of the report.

Participant ratings (redrawn)

The returned ratings are reproduced below:

Undergraduate participant's ratings evaluating a research study (Participants were asked to ignore any scale where they could not offer a rating)
Undergraduate participant's ratings evaluating a research study
Doctoral participant's ratings evaluating a research study
Doctoral participant's ratings evaluating a research study (three figures: this participant sent ratings for three different studies)

There is an interesting question of how one might expect one's overall confidence in the conclusions of a research report to be related to one's confidence in the discrete processes and steps that make up the study. In each example, the participants rated their overall confidence in the study within the range of their ratings of component aspects:

Example   Range of ratings of components   Overall evaluation
CU1       0.7 – 0.9                        0.8
CU2       0.8 – 0.9                        0.9
CD3       0.7 – 0.9                        0.8
XD4a      0.7 – 0.8                        0.7
XD4b      0.8 – 0.9                        0.8
XD4c      0.8 – 0.9                        0.8
XD5       0.7 – 0.9                        0.8
XD6       1.0                              1.0
How participants rated their overall confidence in a study, in relation to their ratings of some component features.

The final example (marked XD6) is obviously highly consistent – this participant reported finding all the rated aspects of the study entirely convincing and, not unreasonably therefore, found the conclusions of the study entirely convincing as an overall evaluation.

Should ratings of features be treated as independent components?

It may seem at face value entirely reasonable and consistent that someone evaluating a study in this way would rate their overall confidence within the range of ratings of discrete components of the study. However, when we think of research conclusions as depending upon a chain of logical connections (something suggested in the CREST tool guidance), we might question this:

Part of the guide offered for using the CREST in the evaluation

The process reflects the logical 'AND' function – confidence in the strength of the chain requires confidence in each link. For example, if a totally invalid research instrument is used, then it becomes irrelevant how strong other aspects of the study are, as, logically, the conclusions cannot be supported by the research. (This does not mean the conclusions are necessarily wrong as propositions – but it does mean they do not logically follow from the research – the research does not justify them or offer a basis for supporting them.)

A calculus for combining confidence ratings?

If the different features evaluated were considered as totally independent aspects of a study then we might feel that overall confidence would be better reflected by finding the product of the confidence levels of all the different components of the study (including perhaps aspects not directly represented in the CREST tool).
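A minimal sketch of this 'product' calculus, in Python, might look as follows (the ratings used are hypothetical, not taken from any participant's form):

from math import prod

def combined_confidence(ratings):
    # Overall confidence if each rated feature is treated as an
    # independent 'link' in the argument chain (a logical AND of links).
    return prod(ratings)

# Hypothetical component ratings:
print(round(combined_confidence([0.9, 0.8, 0.95, 0.85]), 2))  # 0.58

# A single totally unconvincing link (rated 0) breaks the whole chain,
# however convincing the other links are:
print(combined_confidence([1.0, 1.0, 0.0, 1.0, 1.0]))  # 0.0

Note how, on this view, even a set of individually quite convincing links combines to give markedly lower overall confidence, and a single worthless link reduces the whole chain to nothing.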

An analogy with an unsafe building

As an analogy, suppose a structural survey suggested that a block of flats was dangerous, and that the walls/supports of each of its floors independently had a 10% chance of collapsing (i.e., only had a 90% chance of not collapsing) over the next 5 years. This would not be taken to mean there was a 90% chance of the building remaining intact for 5 years. Indeed, if the building was 7 floors or more, then it would be more likely than not to suffer a collapse in this time.
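The arithmetic behind that claim can be sketched directly (0.9 being the assumed chance of each floor not collapsing):

# Chance that every floor of an n-storey block survives 5 years, if each
# floor independently has a 0.9 chance of not collapsing:
for floors in range(1, 9):
    print(floors, round(0.9 ** floors, 3))
# By 7 floors, 0.9 ** 7 is about 0.478 (less than 0.5), so a collapse
# somewhere in the building is more likely than not.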

Treating study components as if independent

The images below show the level of confidence in the overall study that would be predicted in this way (orange ovals are used to denote the predicted overall confidence ratings). The first image reflects the participant who found all the rated aspects of the study fully convincing.

Participant's overall level of confidence in a study's conclusions, in relation to the level of confidence suggested by treating their ratings of discrete aspects of the study as independent components of the study.

In that case, the 'prediction' obviously matches the overall rating given. However, where 'non-perfect' ratings were given, there is a clear pattern, at least in this small study sample:

Seven figures, one for each form with 'non-perfect' ratings: Participant's overall level of confidence in a study's conclusions (right top, red), in relation to the level of confidence suggested by treating the participant's ratings of discrete aspects of the study as independent components of the study (right bottom, orange).

If we make this assumption – that each 'link' (step in the argument from evidence) in the chain is independent, and a failure in one 'link' occurs without reference to the others – then in this small sample there seems to be a systematic tendency to be over-confident in the conclusions of research studies in relation to the confidence the reader has in their component parts (the grey arrows in the figures above).

Can the links in the argument be considered independent?

Is this a reasonable approach to take? Arguably, it is not sensible to treat the different ratings as fully independent, because the components of a study are not arbitrarily compiled: they report the work of one researcher or one team.

Consider the building analogy used above. What if the structural survey suggested that each floor of the building had a 10% chance of collapsing in the next 5 years because of the likelihood of an extreme storm event or major earth tremor putting a high level of stress on the structure? 1 If this was conjectured to be the main (and common) potential source of collapse, then it no longer makes sense to treat the collapse of different floors as completely independent potential events. Yet, unless the different floors were thought to have the same weaknesses at equivalent locations in the structure, making them susceptible to the same pattern of action of external stressors, this would not mean the overall risk of collapse was simply 10% either.

So, the probabilistic estimates in the figures above may be seen as a minimum level of confidence (or at least would be, if all of the steps in the process of reaching conclusions in the study – all of the links in the argument chain – were being rated). However, these predictions may offer heuristic value in leading one to question the tendency of participants to give overall ratings so much greater than these predictions (the grey arrows in the figures above). If one has reservations about so many aspects of a study, should one have high confidence in its conclusions?

Of the five participants here (i.e., excluding the participant giving perfect ratings), four rated their overall confidence in a study's conclusions HIGHER than their rating of at least one of the component 'links' in the argument for that conclusion. That seems somewhat irrational.

The data provided by this small sample seem to suggest a tendency to make an overall judgement by looking within the range, perhaps for the mean, of the component judgements.
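The contrast between these two ways of combining ratings can be sketched as follows (again using hypothetical ratings in the 0.7 – 0.9 range typical of this sample, not any participant's actual data):

from math import prod
from statistics import mean

# Hypothetical component ratings:
ratings = [0.7, 0.8, 0.9, 0.8, 0.9]

print(round(mean(ratings), 2))  # 0.82 - close to the overall ratings given
print(round(prod(ratings), 2))  # 0.36 - the 'independent links' prediction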

This was only a small sample, but if the findings here were found to be widespread (and perhaps not just among students), then this might suggest a tendency to overestimate the validity of study conclusions relative to one's critical evaluation of their component parts. That would be something worth noting, and perhaps addressing, in research training programmes.

An(other) analogy with a motor race

Another analogy which might be useful here is with a team competing in a motor race, such as a Formula 1 Grand Prix. Consider the team of the leading car as the race approaches its final stages. The car has a modest lead over its competitors, such that, barring something unexpected, and even though the following cars are on 'fresher rubber' and therefore have some performance advantage at this stage of the race, the leading car is expected to win. However, if the car was called in for a pit stop (to change tyres), it would lose places and so would be unlikely to win.

The engineering team, using telemetry from the car and their computer models, estimate that the tyres will be getting close to a critical state by the final laps. They estimate that the chances of the car completing the race on those tyres without a catastrophic failure are:

Tyre          Likelihood of finishing the race
Front left    0.7
Front right   0.8
Back left     0.8
Back right    0.9
A hypothetical racing team's estimates of the chance of each of their car's tyres lasting the race

But, of course, generally speaking a car needs all four tyres functioning to complete a race. 2 If potential tyre failures are treated as completely independent, then this suggests there is only about a 40% chance of finishing the race on those tyres (0.7 × 0.8 × 0.8 × 0.9 ≈ 0.40). However, the driver could be told to drive more cautiously (keep off the kerbs at corners, brake earlier and more gently) in ways that might similarly influence whether each of the tyres failed – or the car might come across debris from a damaged car, carbon fibre shards, that would increase the risk of damage to any of the (already deteriorating) tyres.

Sensibly, one might assume that the best estimate of the probability of the car getting to the end of the race on those tyres could not be higher than the estimate for the most vulnerable tyre (0.7 of not failing), but was likely somewhat more than the product obtained if potential tyre failures were truly independent events (c. 0.4):

0.4 < confidence in completing race without tyre failure < 0.7
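That range can be sketched directly from the team's (hypothetical) estimates, taking the product as the lower bound (fully independent failures) and the weakest tyre as the upper bound (one dominant common cause):

from math import prod

# The hypothetical team's estimates of each tyre lasting the race:
tyres = {"front left": 0.7, "front right": 0.8,
         "back left": 0.8, "back right": 0.9}

lower = prod(tyres.values())  # fully independent failures
upper = min(tyres.values())   # one dominant common cause
print(round(lower, 2), upper)  # 0.4 0.7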

I suspect something similar is the case when evaluating research. Perhaps there are common factors within a research study which suggest that our overall confidence in a study need not be as low as the simple product of the confidence levels in the component parts: yet rating overall confidence as high as the weakest component (as participant XD4 did) seems optimistic, and rating overall confidence above the rating of the 'weakest link' (as participants CU1, CU2, CD3 and XD5 did) seems illogical.

Thus this small-scale evaluation study raises the question of whether readers tend to put too much confidence in the conclusions of a research study (perhaps as some kind of common cognitive trait). This seems like the basis of a potentially useful conversation to be included as part of researcher training.

You can read about critical reading of research studies here.

You can read about CREST (the Critical Reading of Empirical Studies Tool) here.

You can read about the CREST evaluation, CRESTe, here.

Notes:

1 This is not a very realistic scenario, both because each floor has to support those above it and so the floors are not structurally identical, and because events such as earthquakes will not apply stresses in identical ways to all floors of the building. However, I think this offers a useful analogy as a thought experiment.

2 I remember Lewis Hamilton demonstrating the possibility of completing, and winning, a race with three tyres! One tyre had completely disintegrated, but he continued at fairly high speed (reports suggested up to 130 km/h on corners and 230 km/h on the straight sections). I suspect this apparent miracle was due in part to Hamilton's exceptional skill, but also in part to the design of racing cars, with their wings providing high downforce – which perhaps kept the chassis fairly level despite one wheel not functioning properly? It is a response to tyre failure not recommended for road cars.