Experimental pot calls the research kettle black

Do not enquire as I do, enquire as I tell you


Keith S. Taber


Sotakova, Ganajova and Babincakova (2020) rightly criticised experiments into enquiry-based science teaching on the grounds that such studies often used control groups where the teaching methods had "not been clearly defined".

So, how did they respond to this challenge?

Consider a school science experiment where students report comparing the rates of reaction of 1 cm strips of magnesium ribbon dropped into:
(a) 100 ml of hydrochloric acid of 0.2 mol/dm³ concentration at a temperature of 28 °C; and
(b) some unspecified liquid.


This is a bit like someone who wants to check they are not diabetic, but – being worried they are – dips the test strip in a glass of tap water rather than their urine sample.


Basic premises of scientific enquiry and reporting are that

  • when carrying out an experiment one should carefully manage the conditions (which is easier in laboratory research than in educational enquiry) and
  • one should offer detailed reports of the work carried out.

In science there is an ideal that a research report should be detailed enough to allow other competent researchers to repeat the original study and verify the results reported. That repeating and checking of existing work is referred to as replication.

Replication in science

In practice, replication is more problematic than this ideal suggests, for both principled and pragmatic reasons.

It is difficult to communicate tacit knowledge

It has been found that when a researcher develops some new technique, the official report in the literature is often inadequate to allow researchers elsewhere to repeat the work based only on the published account. The sociologist of science Harry Collins (1992) has explored how there may be minor (but critical) details about the setting-up of apparatus or laboratory procedures that the original researchers did not feel were significant enough to report – or even that the researchers had not been explicitly aware of. Replication may require scientists to physically visit each other's laboratories to learn new techniques.

This should not be surprising, as the chemist and philosopher Michael Polanyi (1962/1969) long ago argued that science relied on tacit knowledge (sometimes known as implicit knowledge) – a kind of 'green fingers' of the laboratory, where people learn ways of doing things more as 'muscle memory' than as formal procedural rules.

Novel knowledge claims are valued

The other problem with replication is that there is little to be gained for scientists by repeating other people's work if they believe it is sound, as journals put a premium on research papers that claim to report original work. Even if it proves possible to publish a true replication (at best, in a less prestigious journal), the replication study will just be an 'also ran' in the scientific race.


Copies need not apply!

Scientific kudos and rewards go to those who produce novel work: originality is a common criterion used when evaluating reports submitted to research journals

(Image by Tom from Pixabay)


Historical studies (Shapin & Schaffer, 2011) show that what actually tends to happen is that scientists – deliberately – do not exactly replicate published studies, but rather make adjustments to produce a modified version of the reported experiment. A scientist's mindset is not to confirm, but to seek a new, publishable, result:

  • they say it works for tin, so let's try manganese?
  • they did it in frogs, so let's see if it works in toads?
  • will we still get that effect closer to the boiling point?
  • the outcome in broad spectrum light has been reported, but might monochromatic light of some particular frequency be more efficient?
  • they used glucose, so let's try fructose?

This extends (or finds the limits of) the range of application of scientific ideas, and allows the researchers to seek publication of new claims.

I have argued that the same logic is needed in experimental studies of teaching approaches, but this requires researchers to detail the context of their studies rather better than many do (e.g., not just 'twelve year olds in a public school in country X'):

"When there is a series of studies testing the same innovation, it is most useful if collectively they sample in a way that offers maximum information about the potential range of effectiveness of the innovation. There are clearly many factors that may be relevant. It may be useful for replication studies of effective innovations to take place with groups of different socio-economic status, or in different countries with different curriculum contexts, or indeed in countries with different cultural norms (and perhaps very different class sizes; different access to laboratory facilities) and languages of instruction …It may be useful to test the range of effectiveness of some innovations in terms of the ages of students, or across a range of quite different science topics. Such decisions should be based on theoretical considerations.

…If all existing studies report positive outcomes, then it is most useful to select new samples that are as different as possible from those already tested…When existing studies suggest the innovation is effective in some contexts but not others, then the characteristics of samples/context of published studies can be used to guide the selection of new samples/contexts (perhaps those judged as offering intermediate cases) that can help illuminate the boundaries of the range of effectiveness of the innovation."

Taber, 2019, pp.104-105

When scientists do relish replication

The exception, that tests the 'scientists do not simply replicate' rule, is when it is suspected that a research finding is wrong. Then, an attempt at replication might be used to show a published account is flawed.

For example, when 'cold fusion' was announced with much fanfare (ahead of the peer reviewed publications reporting the research) many scientists simply thought it was highly unlikely that atomic energy generation was going to be possible in fairly standard glassware (not unlike the beakers and flasks used in school science) at room temperature, and so there was a challenge to find out what the original researchers had got wrong.

"When it was claimed that power could be generated by 'cold fusion', scientists did not simply accept this, but went about trying it for themselves…Over a period of time, a (near) consensus developed that, when sufficient precautions were made to measure energy inputs and outputs accurately, there was no basis for considering a new revolutionary means of power generation had been discovered.

Taber, 2020, p.18

Of course, one failed replication might just mean the second team did not quite do the experiment correctly, so it may take a series of failed replications to make the point. In this situation, being the first failed replication of many (so being first to correct the record in the literature) may bring prestige – but this also invites the risk of being the only failed replication (so, perhaps, being judged a poorly executed replication) if subsequently other researchers confirm the findings of the original study!

So, a single attempt at replication is neither enough to definitively verify nor to reject a published result. What all this does show is that the simple notion that there are crucial or critical experiments in science which once reported immediately 'prove' something for all time is a naïve oversimplification of how science works.

Experiments in education

Experiments are often the best way to test ideas about natural phenomena. They tend to be much less useful in education as there are often many potentially relevant variables that usually cannot be measured, let alone controlled, even if they can be identified.

  • Without proper control, you do not have a meaningful experiment.
  • Without a detailed account of the different treatments, and so how the comparison condition is different from the experimental condition, you do not have a useful scientific report, but little more than an anecdote.

Challenges of experimental work in classrooms

Despite this, the research literature includes a vast number of educational studies claiming to be experiments to test this innovation or that (Taber, 2019). Some are very informative. But many are so flawed in design or execution that their conclusions rely more on the researchers' expectations than on a logical chain of argument from robust evidence. They often use poorly managed experimental conditions to find differences in learning outcomes between groups of students that are not initially equivalent. 1 (Poorly managed? Because there are severe – practical and ethical – limits on the variables you can control in a school or college classroom.)

Read about expectancy effects in research

Statistical tests are then used which would be informative had there been a genuinely controlled experiment with identical starting points and only the variable of interest differing between the two conditions. Results are claimed by ignoring the inconvenient fact that the statistical tests used, strictly, do not apply in the actual conditions studied! Worse than this, occasionally the researchers think they should have got a positive result and so claim one even when the statistical tests suggest otherwise (e.g., read 'Falsifying research conclusions')! In order to try and force a result, a supposed innovation may be compared with control conditions that have been deliberately framed to ensure the learners in that condition are not taught well!

Read about unethical control conditions

A common problem is that it is not possible to randomise students to conditions, so only classes are assigned to treatments randomly. As there are usually only a few classes in each condition (indeed, often only one class in each condition) there are not enough 'units of analysis' to validly use statistical tests. A common solution to this common problem is…to do the tests anyway, as if there had been randomisation of learners. 2 The computer that crunches the numbers follows a program that has been written on the assumption that researchers will not cheat, so it churns out statistical results and (often) reports significant outcomes due to a misuse of the tests. 3
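To make the problem concrete, here is a minimal simulation sketch in Python – with invented numbers (two classes per condition, twenty-five pupils per class), not modelled on any particular study – of what happens when pupil scores are analysed as independent units even though whole classes were assigned to conditions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented setup: 2 classes per condition, 25 pupils per class, and NO
# real treatment effect. Classes differ anyway (teacher, intake,
# timetable), modelled here as a random class-level shift in scores.
n_sims = 2000
false_positives = 0
for _ in range(n_sims):
    groups = {0: [], 1: []}
    for condition in (0, 1):
        for _ in range(2):  # two classes in this condition
            class_effect = rng.normal(0, 5)  # between-class variation
            pupils = rng.normal(50 + class_effect, 10, size=25)
            groups[condition].append(pupils)
    treat = np.concatenate(groups[0])
    control = np.concatenate(groups[1])
    # The misuse: a t-test on 50 vs 50 pupil scores, as if each pupil
    # were an independent unit of analysis
    _, p = stats.ttest_ind(treat, control)
    false_positives += p < 0.05

print(f"'Significant' differences found with no real effect: "
      f"{false_positives / n_sims:.0%}")
```

With clustering like this, well above the nominal 5% of such 'experiments' report a 'significant' difference despite there being nothing to find.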

This is a bit like someone who wants to check they are not diabetic, but being worried they are, dips the test strip in a glass of tap water rather than their urine sample. They cannot blame the technology for getting it wrong if they do not follow the proper procedures.

I have been trying to make a fuss about these issues for some time, because a lot of the results presented in the educational literature are based upon experimental studies that, at best, do not report the research in enough detail, and often, when there is enough detail to be scrutinised, fall well short of valid experiments.

I have a hunch that many people with scientific training are so convinced of the superiority of the experimental method, that they tacitly assume it is better to do invalid experiments into teaching, than adopt other approaches which (whilst not as inherently convincing as a well-designed and executed experiment) can actually offer useful insights in the complex and messy context of classrooms. 4

Read: why do natural scientists tend to make poor social scientists?

So, it is uplifting when I read work which seems to reflect my concerns about the reliance on experiments in those situations where good experiments are not feasible. In that regard, I was reading a paper reporting a study into enquiry-based teaching (Sotakova, Ganajova & Babincakova, 2020) where the authors made the very valid criticism:

"The ambiguous results of research comparing IBSE [enquiry-based science education] with other teaching methods may result from the fact that often, [sic] teaching methods used in the control groups have not been clearly defined, merely referred to as "traditional teaching methods" with no further specification, or there has been no control group at all."

Sotakova, Ganajova & Babincakova, 2020, p.500

Quite right!


The pot calling the kettle black

idiom "that means people should not criticise someone else for a fault that they have themselves" 5 (https://dictionary.cambridge.org/dictionary/english/pot-calling-the-kettle-black)

(Images by OpenClipart-Vectors from Pixabay)


Now, I do not want to appear to be the pot calling the kettle black myself, so before proceeding I should acknowledge that I was part of a major funded research project exploring a teaching innovation in lower secondary science and maths teaching. Despite a large grant, the need to enrol enough classes to randomise to treatments for statistical testing meant that we had very limited opportunities to observe, and so detail, the teaching in the control condition. That condition was basically the teachers doing their normal teaching, whilst the teachers of the experimental classes were asked to follow a particular scheme of work.


Results from a randomised trial showing the range of within-condition outcomes (After Figure 5, Taber, 2019)

In the event, the electricity module I was working on produced mean outcomes almost identical to those in the control condition (see the figure). The spread of outcomes was large in both conditions – so, clearly, there were significant differences between individual classes that influenced learning: but these differences were even more extreme in the condition where the teachers were supposed to be teaching the same content, in the same order, with the same materials and activities, than in the control condition where teachers were free to do whatever they thought best!

The main thing I learned from this experience is that experiments into teaching are highly problematic.

Anyway, Sotakova, Ganajova and Babincakova were quite right to point out that experiments with poorly defined control conditions are inadequate. Consider a school science experiment designed by students who report comparing the rates of reaction of 1 cm strips of magnesium ribbon dropped into

  • (a) 100 ml of hydrochloric acid of 0.2 mol/dm³ concentration at a temperature of 28 °C; and
  • (b) some unspecified liquid.

A science teacher might be disappointed with the students concerned, given the limited informativeness of such an experiment – yet highly qualified science education researchers often report analogous experiments where some highly specified teaching is compared with instruction that is not detailed at all.

The pot decides to follow the example of the kettle

So, what did Sotakova and colleagues do?

"Pre-test and post-test two-group design was employed in the research…Within a specified period of time, an experimental intervention was performed within the experimental group while the control group remained unaffected. The teaching method as an independent variable was manipulated to identify its effect on the dependent variable (in this case, knowledge and skills). Both groups were tested using the same methods before and after the experiment…both groups proceeded to revise the 'Changes in chemical reactions' thematic unit in the course of 10 lessons"

Sotakova, Ganajova & Babincakova, 2020, pp.501, 505.

In the experimental condition, enquiry-based methods were used in five distinct activities as a revision approach (an example activity is detailed in the paper). What about the control conditions?

"…in the control group IBSE was not used at all…In the control group, teachers revised the topic using methods of their choice, e.g. questions & answers, oral and written revision, textbook studying, demonstration experiments, laboratory work."

Sotakova, Ganajova & Babincakova, 2020, pp.502, 505

So, the 'control' condition involved the particular teachers in that condition doing as they wished. The only control seems to be that they were asked not to use enquiry. Otherwise, anything went – and that anything was not necessarily typical of what other teachers might have done. 6

This might have involved any of a number of different activities, such as

  • questions and answers
  • oral and written revision
  • textbook studying
  • demonstration experiments
  • laboratory work

or combinations of them. Call me picky (or a blackened pot), but did these authors not complain that

"The ambiguous results of research comparing IBSE [enquiry-based science education] with other teaching methods may result from the fact that often…teaching methods used in the control groups have not been clearly defined…"

Sotakova, Ganajova & Babincakova, 2020, p.500

Hm.


Work cited

Notes:

1 A very common approach is to use a pre-test to check for significant differences between classes before the intervention. Where differences between groups do not reach the usual criterion for being statistically significant (probability, p<0.05) the groups are declared 'equivalent'. That is, a negative result in a test for unlikely differences is treated inappropriately as an indicator of equivalence (Taber, 2019).

Read about testing for initial equivalence
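
A minimal sketch in Python of why this inference is unsafe, assuming invented figures (classes of 25 students, and a genuine half-standard-deviation gap between the groups – not data from any actual study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented numbers: two classes of 25 whose populations really differ
# by half a standard deviation on the pre-test (means 50 vs 55, sd 10).
n_sims = 5000
declared_equivalent = 0
for _ in range(n_sims):
    class_a = rng.normal(50, 10, size=25)
    class_b = rng.normal(55, 10, size=25)  # genuinely NOT equivalent
    _, p = stats.ttest_ind(class_a, class_b)
    declared_equivalent += p >= 0.05  # non-significant -> 'equivalent'

print(f"Runs where a real difference went undetected: "
      f"{declared_equivalent / n_sims:.0%}")
```

In most such runs the test fails to reach p<0.05, so two groups with a real, educationally meaningful difference would usually be declared 'equivalent'.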


2 So, for example, a valid procedure may be to enter the mean class scores on some instrument as data; but what are actually entered are the individual students' scores, as though the students can be treated as independent units rather than members of a treatment class.

Some statistical tests lead to a number (the statistic) which is then compared with the critical value that reaches statistical significance as listed in a table. The number in the table selected depends on the number of 'degrees of freedom' in the experimental design. Often that should be determined by the number of classes involved in the experiment – but if instead the number of learners is used, a much smaller value of the calculated statistic will seem to reach significance.
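
As a hypothetical illustration (assuming a two-tailed t-test at p<0.05, with two classes, or fifty pupils, per condition – numbers invented for the example), the difference in critical values can be computed directly in Python:

```python
from scipy import stats

# Critical |t| for a two-tailed test at p < 0.05, under two choices of
# 'unit of analysis' (illustrative numbers, not from any actual study)
df_by_class = 2 + 2 - 2     # two classes per condition -> df = 2
df_by_pupil = 50 + 50 - 2   # fifty pupils per condition -> df = 98

for label, df in (("classes", df_by_class), ("pupils", df_by_pupil)):
    critical_t = stats.t.ppf(0.975, df)  # upper 2.5% point
    print(f"units = {label:7s}  df = {df:2d}  critical t = {critical_t:.2f}")
# units = classes  df =  2  critical t = 4.30
# units = pupils   df = 98  critical t = 1.98
```

Treating pupils as the units more than halves the bar the calculated statistic has to clear.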


3 Some of these studies would surely have given positive outcomes even if they had been able to randomise students to conditions or used a robust test for initial equivalence – but we cannot use that as a justification for ignoring the flaws in the experiment. That would be like claiming a laboratory result was obtained with dilute acid when actually concentrated acid was used – and then justifying the claim by arguing that the same result might have occurred with dilute acid.


4 Consider, for example, a case study that involves researchers observing teaching, interviewing students and teachers, documenting classroom activities, recording classroom dialogue, collecting samples of student work, and so on. This type of enquiry can offer a good deal of insight into the quality of teaching and learning in the class and into the processes at work during instruction (and so whether specific outcomes seem to be causally linked to features of the innovation being tested).

Critics of so-called qualitative methods quite rightly point out that such approaches cannot actually show that any one approach is better than another – only experiments can do that. Ideally, we need both types of study, as they complement each other by offering different kinds of information.

The problem with many experiments reported in the education literature is that because of the inherent challenges of setting up genuinely fair testing in educational contexts they are not comparing like with like, and often it is not even clear what the comparison is with! Probably this can only be avoided in very large scale (and so expensive) studies where enough different classrooms can be randomly assigned to each condition to allow statistics to be used.

Why do researchers keep undertaking small scale experimental studies that often lack proper initial equivalence between conditions, and that often have inadequate control of variables? I suggest they will continue to do so as long as research journals continue to publish the studies (and allow them to claim definitive conclusions) regardless of their problems.


5 At a time when cooking was done on open fires, using wood that produced much smoke, the idiom was likely easily understood. In an age of ceramic hobs and electric kettles the saying has become anachronistic.

From the perspective of thermal physics, black cooking pots (rather than shiny reflective surfaces) may be a sensible choice.


6 So, the experimental treatment was being compared with the current standard practice of the teachers assigned to the control condition. It would not matter so much that this varies between teachers, nor that we do not know what that practice is, if we could be confident that the teachers in the control condition were (or were very probably) a representative sample of the wider population of teachers – such as a sufficiently large number of teachers randomly chosen from the wider population (Taber, 2019). Then we would at least know whether the enquiry based approach was an improvement on current common practice.

All we actually know is how the experimental condition fared in comparison with the unknown practices of a small number of teachers who may or may not have been representative of the wider population.

