Testing for initial equivalence

A topic in research methodology

One methodology (general research strategy) is the experiment.


If we are testing for equivalence, how much difference should we ignore?
(Bell curve image by mcmurryjulie from Pixabay)

When it is not possible to undertake a true experiment by randomly assigning learners / teachers / classes / et cetera to conditions, researchers often seek to demonstrate that the groups experiencing the two conditions are initially equivalent.

When quasi-experimental approaches are employed, statistical tests may also be used to check for significant differences between groups that may already exist prior to any 'treatment' being applied, and which could invalidate any differences found 'after' the experimental treatment has been carried out.

So in a paper in the Journal of Science Education and Technology, Çokadar and Yılmaz (2010) reported a study designed to explore "the effect of creative drama-based instruction on seventh graders' science achievements in the ecology and matter cycles unit …" (p. 80). The authors report statistics relating to the achievement test scores in two classes, an experimental group (taught through drama-based instruction) and a control group (taught through lectures and class discussion), at pre-test (before teaching) and in a post-test (afterwards). The mean scores in the two groups are shown below…

Taber, 2013, p.84
Group            At pre-test                  At post-test               Pre- to post-test change
Control          7.63                         16.86                      significant difference
Experimental     7.91                         19.60                      significant difference
Between groups   non-significant difference   significant difference

Results from the Çokadar and Yılmaz (2010) study

"We can see from [the table] that the experimental group achieved more (on average) in the post-test than the control group, but that they also achieved more in the pre-test and so might be considered to be starting from a stronger knowledge base. Çokadar and Yılmaz present the outcomes of statistical tests to tell us that in both groups achieve- ment was significantly better after teaching (confirming what most teachers would infer from simple inspection of the change in mean scores). Readers are also told that the difference between the two groups after teaching was statistically significant whereas the difference before teaching did not reach statistical significance. In this example, most teachers would find this reasonably convincing – after all, the difference in mean scores at post-test is noticeably greater than the small difference at pre-test.

However, statistics only tell us what is likely by chance. The small difference in pre-test scores could conceivably mean that more students in the experimental group had some key understanding that was important for learning more about the topic. That seems unlikely, but we cannot completely rule out such a possibility. Of course, average test scores are just that: it would be quite possible that students in a class with a lower average score on a test were in a better position to progress if the profile across different test items was very different in the two classes – not all knowledge is equally central for further learning. As always, teaching and learning are complex matters. A better grasp of central concepts is more important than knowing many discrete facts. Although we might find Çokadar and Yılmaz's results… convincing, we might not always be so readily convinced that a non-significant difference in pre-test scores should be ignored in explaining a significant difference in post-test scores (see Figure)."

Taber, 2013, p.84

The figure below illustrates schematically a hypothetical example in which, using the arbitrary cut-off (p<0.05), a statistically significant outcome on a post-test is seen as a positive result because the groups were initially considered 'equivalent' (p≥0.05) at pre-test: yet in practice two groups whose performances were already quite different have simply become slightly more different over time (as might be expected even if the teaching input had been identical).
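To make this concrete, here is a minimal sketch in Python (the class sizes, means and standard deviation are invented for illustration, not taken from any of the studies discussed here) showing how a substantial pre-test difference can fall short of the conventional significance threshold while an only slightly larger post-test difference crosses it:

```python
# Hypothetical numbers: two classes of 30 with a common SD of 2.5 marks.
# A 1.0-mark gap at pre-test is 'non-significant', but a 1.4-mark gap
# at post-test is 'significant' -- even though the two groups made almost
# identical gains (9.3 vs 9.7 marks).
from scipy.stats import ttest_ind_from_stats

n, sd = 30, 2.5

pre = ttest_ind_from_stats(7.6, sd, n, 8.6, sd, n)     # means 7.6 vs 8.6
post = ttest_ind_from_stats(16.9, sd, n, 18.3, sd, n)  # means 16.9 vs 18.3

print(f"pre-test:  p = {pre.pvalue:.3f}")   # ~0.13 -> deemed 'equivalent'
print(f"post-test: p = {post.pvalue:.3f}")  # ~0.03 -> deemed a positive result
```

On these invented figures the verdict flips from 'equivalent' to 'significant effect' even though the gap between the classes barely changed.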



Visual representation of learning gains in two groups of learners in a hypothetical quasi-experiment (after Figure 4.3, Taber, 2013, p.85)

Equivalent scores on pre-test?

The example below is taken from a paper exploring the use of P-O-E (predict-observe-explain) learning activities in school, using a (very questionable) experimental design (Kibirige, Osodo & Tlala, 2014).


"The achievement of the [experimental group – school A] and [control group – school B] from pre-test results were not significantly different which suggest that the two groups had similar understanding of concepts" (Kibirige et al. 2014, p.305).
Pre-test results for an item with no statistically significant difference between groups (offered as evidence of 'similar' levels of initial understanding in the two groups)

I have drawn a chart to reflect the scores on one of a number of pre-test measures where the researchers found 'equivalence'. Their statistical analysis suggested that the outcomes in the two groups were similar enough to be considered equivalent as p>0.05 (which actually means only that this difference between groups is not something very unlikely to occur by chance).

Read: Quasi-experiment or crazy experiment? Trustworthy research findings are conditional on getting a lot of things right

Yet visual inspection makes it clear that this is a very obscure use of the term 'equivalent'. I wonder how many teachers who obtained these outcomes on the same test given to two of their classes would consider that the two classes had 'equivalent' levels of outcome?

Is it enough to look for statistically significant differences? (No!)

If we are seeking to establish that two groups are sufficiently equivalent, we might use a statistical test to check that they are not statistically significantly different on some measure, but this is a very weak test of equivalence.

This is a bit like

  • commissioning a committee to test evidence of sainthood to justify beatification, and receiving a report which states that the individual concerned has never been convicted of any serious violent crime, or
  • seeking a criterion for desperate poverty, and being told we can exclude anyone with a net worth of over a billion dollars, or
  • considering an appeal against a decision to refuse academic tenure (a permanent appointment) where the only evidence available is that the candidate has not yet been awarded a Nobel prize.

It is a start, but only a very limited start.
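One reason the test is so weak (a sketch, again with invented numbers): whether a given difference reaches statistical significance depends heavily on sample size, so a 'non-significant' pre-test difference may simply reflect small classes rather than genuine equivalence:

```python
# The same 1.0-mark mean difference (SD = 2.5 marks) moves from
# 'non-significant' to 'significant' purely as the (hypothetical)
# group sizes grow.
from scipy.stats import ttest_ind_from_stats

for n in (15, 30, 60, 120):
    res = ttest_ind_from_stats(7.6, 2.5, n, 8.6, 2.5, n)
    print(f"n = {n:3d} per group: p = {res.pvalue:.3f}")
# n =  15 per group: p ~ 0.28
# n =  30 per group: p ~ 0.13
# n =  60 per group: p ~ 0.03
# n = 120 per group: p ~ 0.002
```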


Some situations where stronger evidence would be useful

Although statistical tests can offer some guidance on what counts as equivalent, they need to be interpreted differently than when looking for a statistically significant difference in the outcomes of the experiment (see Figure 2). An initial difference which is substantial, but statistically non-significant, may be sufficient to explain outcome differences that do reach statistical significance… If statistical tests are applied to the starting conditions using the usual p < 0.05 criterion then they will only flag up differences between the two groups which are very unlikely to be due to chance. However, what should be looked for is evidence of close similarity, rather than the absence of evidence of improbable differences. (One might say that testing for equivalence pre-intervention, and for experimental effects post-intervention, involve looking at different tails of a distribution.) Two classes with differences between them that are at a level quite unlikely to occur by chance are certainly not equivalent (at least in the sense that the word is generally employed).

Taber, 2019, pp.86-7

Figure 2 from Taber, 2019: Evaluations of equivalence between different groups should be more rigorous than simply excluding differences reaching statistical significance.
A better criterion?

There does not seem to be an agreed criterion for what alternative p value might best be used, so I have suggested, as an 'Aunt Sally' figure, that rather than admitting anything where p≥0.05, it would make more sense to look for p>0.5; that is, to treat groups as equivalent only where any differences found are more likely to be chance effects than not.
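Applied to the invented pre-test numbers used above, the two criteria give different verdicts (a sketch of the suggestion, not an established convention):

```python
# Usual criterion: 'equivalent' if p >= 0.05 (no significant difference).
# Suggested stricter criterion: 'equivalent' only if p > 0.5 (the observed
# difference is more likely than not to be a chance effect).
from scipy.stats import ttest_ind_from_stats

p = ttest_ind_from_stats(7.6, 2.5, 30, 8.6, 2.5, 30).pvalue
print(f"p = {p:.3f}")                                    # ~0.13
print("equivalent if p >= 0.05 is enough:", p >= 0.05)   # True
print("equivalent under p > 0.5:", p > 0.5)              # False
```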


Example from the literature

"All participants were given a pre-screening test…all 120 participants scored 10 and below (i.e., with an average of 3.66 out of 30 marks)"

Shamsulbahri & Zulkiply, 2021

In this study, students in all three conditions took a pre-test. But the authors do not report any comparison of performance between groups (and do not use the pre-test scores as a covariate in their analysis of group differences at post-test). Read: 'Shock result: more study time leads to higher test scores'
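For illustration, one standard way of making use of pre-test scores (which, as noted, the authors did not do) is to enter them as a covariate when comparing post-test scores, an ANCOVA-style model. This is a sketch with invented data and my own variable names, not the authors' analysis:

```python
# Regress post-test scores on pre-test scores plus group membership:
# the group coefficient then estimates the difference between groups
# adjusted for where each student started. (Hypothetical data.)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "group": ["control"] * 4 + ["experimental"] * 4,
    "pre":   [6, 7, 8, 9, 7, 8, 9, 10],
    "post":  [15, 16, 18, 19, 18, 19, 21, 22],
})

model = smf.ols("post ~ pre + C(group)", data=df).fit()
print(model.params.round(2))  # includes the pre-test-adjusted group effect
```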


Sources cited:

My introduction to educational research:

Taber, K. S. (2013). Classroom-based Research and Evidence-based Practice: An introduction (2nd ed.). London: Sage.