Science, superstition, or confidence trick

Do science educators have too much faith in the experiment?

On this page you can view, read, or download a seminar paper:

Taber, Keith S. (2023) Science, superstition, or confidence trick. Do science educators have too much faith in the experiment? Seminar paper presented to the Science Education Group at University College London's Institute of Education, 3rd May 2023. DOI: 10.13140/RG.2.2.17510.29768

(A list of my publications can be found here.)

Download: Science, superstition, or confidence trick: Do science educators have too much faith in the experiment?


View the presentation:


Abstract


A rigorous experiment is rightly considered an especially informative research tool. But doing rigorous experiments in education is very challenging. A poorly designed experiment may tell us very little. Yet the literature includes a vast number of experimental studies in science education. Here I make an argument that:
Often these are very small scale studies with unrepresentative populations;
Often the extent of control of variables would not pass for 'fair testing' in a school laboratory exercise;
There is a sleight of hand commonly used to (mis)apply statistical tests to treat samples as much larger than they are;
Sometimes authors draw conclusions contrary to their results; and
Sometimes inappropriate (unethical) control conditions are imposed on learners for the sake of research.
I will pose, and reflect on, the question of why so many of these flawed studies get undertaken in science education and published in peer-reviewed journals.


The text of the talk:


The thesis

The thesis I am advancing is that there is a phenomenon to be explained. That phenomenon is that the science education research literature includes a very large number of studies that are experimental in nature, but which do not meet the criteria for being valid experiments. Many of these are published in regional or national journals, but there are plenty in more prestigious journals. Perhaps everybody involved – authors, peer reviewers, editors – recognises the problem with these studies, and sees their findings as purely indicative – but often that is not the impression given by their framing and phrasing.

I argue that

  • Experimental method is often the appropriate approach in seeking new knowledge in the natural sciences;
  • Experimental method is much more challenging in the social sciences;
  • Valid experiments are (at the least) very rare in small-scale studies of teaching and learning;
  • Yet, the science education literature includes a vast number of such studies, which do not support robust conclusions.

This needs explaining! My focus here is on experiments that involve a small number of teachers and classes…

Indeed, the prototype of this type of study involves

  • one class in the experimental treatment condition
  • another class being the comparison class

Sometimes both classes are taught by the same teacher, sometimes not.

Sometimes the classes are not even from the same school.

Other studies may have just two or three classes in each condition. A good many published studies are of this kind.


Key issues

There is a host of difficulties in doing experimental studies into classroom teaching, and I have published an account of a number of these in a review for Studies in Science Education.

Here I wish to focus on, and illustrate, a few themes.


Control of variables

Now a good experiment involves three classes of variable:

  • the thing we are deliberately changing (the independent variable),
  • the thing we allow to vary to see if, and if so how, it changes (the dependent variable), and
  • everything else which could have an effect, and so is not allowed to change (the control variables).

Doing an experiment could be seen as, in large part, controlling all the other things that could affect our results and confuse the possible dependency of the dependent variable upon our planned intervention.

A poor experiment has a fourth class of variable. These are all the things that should be in the controlled category, but which we do not control. Confounding variables confound our study because we logically need to caveat the conclusions with an 'unless this was due to something else'.
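
To make the point concrete, here is a minimal sketch (with invented numbers, not data from any study) of how an uncontrolled confounding variable can manufacture an apparent treatment effect where none exists; the 'teacher effect' is purely hypothetical.

```python
# A minimal sketch with invented numbers: the intervention has no effect at all,
# but an uncontrolled teacher effect (a confounding variable) still produces an
# apparently 'significant' difference between two intact classes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_treatment_effect = 0.0   # the intervention itself does nothing
teacher_effect = 6.0          # but the experimental class happens to have a stronger teacher

control = rng.normal(50, 10, 30)                                                 # class taught by teacher A
experimental = rng.normal(50 + true_treatment_effect + teacher_effect, 10, 30)   # class taught by teacher B

t, p = stats.ttest_ind(experimental, control)
print(f"t = {t:.2f}, p = {p:.3f}")   # likely 'significant', produced entirely by the confound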

Now, even in the natural sciences, there can be confounding variables, because it is never possible to control everything that we might imagine could be considered a variable, so, on theoretical grounds, we dismiss such possibilities as the relevance of a researcher's hair colour or whether they use their left or right eye to look into the microscope. We use existing theoretical knowledge to judge what we can ignore. Even here, it sometimes transpires there was a pertinent variable that influences results, but which had been assumed to be irrelevant, or was even unknown to the researchers.

In the social sciences, control of variables is very much more challenging.

For one thing, we usually work in naturally occurring social contexts, rather than with isolated systems set up for laboratory manipulation.

Another major difference is that physicists and chemists do not have to consider what their research materials think about them, or expect to happen in their experiments. A piece of alloy, or a solution of an oxidising agent, has no preconceptions about the outcome of the experiment, and does not have an attitude to the researcher. You do not need to develop rapport with your test tubes.

Indeed, if you have an active imagination, you can likely think of feasible scenarios in which any number of things could affect the outcomes of an educational experiment, such as student learning.

I do not think it is an exaggeration to argue that in the case of an educational experiment:

  • we often cannot identify all the variables which might have an effect;
  • even when we can identify them, we often do not know how to meaningfully measure them;
  • even if we can measure them, we may not have a way of holding them at a constant value.

That need not prevent experimental studies where the samples are large enough, and representative enough of a population, for statistics to tell us whether any outcomes are unlikely to be due to such chance factors. This is a point I will return to.

It is a problem, though, in any studies with small, unrepresentative, samples. And most of the published experiments in the educational literature have small, unrepresentative, samples.


As one example of the kinds of issues that can arise, I was impressed to read this comment in a research paper.

"The ambiguous results of research comparing IBSE [enquiry-based science education] with other teaching methods may result from the fact that often teaching methods used in the control groups have not been clearly defined, merely referred to as 'traditional teaching methods' with no further specification, or there has been no control group at all."

The authors had noticed that in many experimental studies the experimental treatment is well defined, but the control condition is anything but controlled. It is actually laissez-faire, anything goes, as long as the teacher avoids the approach being taken in the experimental condition.

This seemed a fair point, so I read on to see how they managed this issue in their own study:

"The teaching method as an independent variable was manipulated to identify its effect on the dependent variable (in this case, knowledge and skills)…

In the control group, teachers revised the topic using methods of their choice, e.g. questions & answers, oral and written revision, textbook studying, demonstration experiments, laboratory work."

It seems that raising an issue which undermines the ability to draw clear conclusions from a study does not impose a requirement to address the issue in your own study!



Now here I present the results I found in a published research study. As you see, three classes were taught the topic of chemical qualitative analysis for different lengths of time. Afterwards, the average performance of the students in these classes was found to vary.

I wonder what you think we might be able to conclude from this study?

Now, as you may have guessed, I was not giving you all the details. Here I reveal more.

  • One class had three lessons in the lab.
  • But that class was outperformed by a class that had five lessons of paper-based learning activities.
  • The third class spent three lessons in the lab and also had the five lessons of text-based activities.

It perhaps seems reasonable to conclude that the class that had both kinds of activity learnt the most. But what about comparing the first two classes?

I would mischievously suggest five lessons can lead to more learning than three, but the authors thought this was evidence of the superiority of the text-based learning approach. Presumably, the peer-reviewers and editor were sufficiently convinced.

But I was not convinced.



The two classes were taught by different teachers. The two classes were drawn from different schools. They were considered to be similar schools, but even so. For that matter, the assessment tool was a paper-based test, so how do we know that if a laboratory-based assessment had been used, the results would not have been very different? After all, to actually do qualitative analysis, you need to work with real samples and reagents in a laboratory.

You might feel that I am only stating the obvious. But if it is so obvious, how does such work get published without strong caveats, and sometimes even in the more prestigious journals? After all, we expect 14 year olds to do better than this. Arguably, science educators are teaching skills they then fail to display in their own work.


One of the variables not controlled in that study was the teacher. Perhaps the two different teachers were very similar in all relevant characteristics, but that is not very likely.

Sometimes the 'teacher variable' is 'controlled' by asking the same teacher to teach differently in two different conditions. That is, the teacher is asked to use two different teaching approaches in different classes. Jig-saw learning here, but computer simulations there. Or, sadly, more likely, enquiry-based teaching here, and dictation of notes there. More on that choice later. Or, a teacher is asked to trial a new curriculum module whilst teaching a parallel class according to the established scheme of work.

This assumes, at least implicitly, the teacher will have the same competence and confidence when switching to do something different, even when it is novel to them. The same teacher, teaching different classes, is assumed to have controlled that variable, but I do not find that very convincing. One of the issues is teacher beliefs, which much research shows often have an effect on outcomes.

If the teacher is persuaded the experimental treatment is an improvement, or is entirely unconvinced by it, then that may be enough to make a difference. Even if the teacher simply lacks confidence in their competence to teach in a different way, then this may make a difference.

It is issues such as this that have led to medical studies adopting double blind conditions in drug trials, so that neither the patients nor the clinicians administering treatment know whether a tablet or injection actually contains the substance being tested or not. Of course, if one takes double blind protocols too far, one might run into ethical issues.

I was astonished to read this description of studies where a novel angina treatment was tested by doing sham surgeries alongside genuine interventions.

"In the late 1950s and early 60s two different surgical teams…did double-blind trials of a ligation procedure – the closing of a duct or tube using a clip – for very ill patients suffering from severe angina, a condition in which pain radiates from the chest to the outer extremities as a result of poor blood supply to the heart. The surgeons were not told until they arrived in the operating theatre which patients were to receive a real ligation and which were not. All the patients, whether or not they were getting the procedure, had their chest cracked open and their heart lifted out. But only half the patients actually had their arteries rerouted so that their blood could more efficiently bathe its pump …"

Reports based on the accounts of patients and their doctors had previously claimed that angina symptoms were relieved somewhat by re-routing some of the blood in the chest through closing off some vessels. However, the experimental studies found that, actually, those given this procedure showed no more improvement than patients wheeled into theatre for a sham procedure.

I was less astonished when I sourced the original research papers, as it seems the sham surgery only involved some superficial incisions made under local anaesthetic. In any case, it is much harder to blind learners and, especially, teachers to the educational treatment they are assigned to.


Generalisation

Social kinds are also different from natural kinds, in that all pure samples of copper or all E. coli specimens have much in common – but not all schools, or all classes, or all teachers, or all lessons, have so much in common. This creates a very big issue in educational research: what works in one classroom in one school does not always work in another school, or even when adopted by another teacher in the same school.



Even experiments that are designed to allow us to generalise to populations only tell us that what was found to be most effective in the research is more likely than not to be effective in the wider population. But there is diversity in those populations. We know from some of the few very large studies undertaken in schooling that what is most effective overall is not most effective in every context; what is found to be least effective overall can have been the most effective approach with some classes.

Knowing what most often works best is still useful. But to do this kind of work well we need not only to work at scale, but to randomise our sample to conditions, and to either use a random sample of the population or be confident that our sample is representative of the diversity of the population.

But what are the populations?

A paper title in the natural sciences might refer to a class of star; a superconductor with a specific composition; a variant of SARS-CoV-2, or some such natural kind. If we look at the titles of papers in science education, we find these papers seem to be about broad groups – sometimes national groups, but sometimes they are apparently about 'children' or 'adolescents' or 'primary school teachers' quite generally!

Of course, the participants in such studies can seldom be considered statistically representative of such broad groups.

My point is that we – as authors as well as readers – fall into the trap of generalising, at least implicitly,

'We studied what (some) 14 year old Australian students knew about natural selection, so now we know what 14 year old Australian students know about natural selection.'

Again, we would not accept this kind of sloppy thinking from school children.


Units of analysis

A major issue with many small scale studies is the identification of the unit of analysis to use in statistical testing.

In a true experiment we randomise the so-called units of analysis to the treatments, the conditions. Sometimes this is possible in education. Perhaps we enrol fifty schools and assign each randomly to an experimental or control condition. If we can consider the schools to be experiencing the assigned treatment independently of each other, then this seems fair enough.

But many experiments in education are undertaken with learners as the unit of analysis, and often these are not individually assigned to treatments, but are members of pre-established classes. Of course, there are very good reasons why schools would not be happy with researchers coming in and breaking up established classes to randomise students.

Perhaps the logic here should be that as we cannot meet the requirements for an experiment, we should do a different kind of study. Often, instead, the logic adopted is that as we cannot meet the requirements for a valid experiment, we are justified in carrying on regardless and just ignoring that requirement.

If a manufacturer of pickled onions was selling jars of vinegar as pickled onions, it is likely their customers would not be prepared to accept this just because the company was having difficulty sourcing onions. Yet, readers of research papers are assumed to be less demanding of rigour. Science education researchers often sell jars of vinegar labelled as pickled onions.

Now, I do not want to give the impression that I do not think there might be circumstances when treating students within a class as independent units would be appropriate. One has to consider the overall research design and purpose.



In this hypothetical case, a teacher has read a lot of material about mindfulness, student anxiety, relaxation techniques, and the like; and suspects that asking students to do two half-hours of meditation each week would be just as beneficial for their science learning as asking them to do subject-based homework. Being a science teacher, she tests this idea by randomly assigning students to either a homework condition or a meditation condition, and at the end of term compares test scores.

Randomisation does not ensure matched groups; rather, it just avoids any systematic bias. So, how does the teacher decide if the difference in the profiles of scores in the two conditions is just down to chance effects of who ended up in each condition? Inferential statistics will allow her to see if any difference in performance is likely to just be down to chance.

Of course, even if there is a very low probability value, and it is concluded there is a significant difference, strictly this only applies to this class with this teacher, and perhaps even this topic […when taught at this time of year, in this classroom…]? It may be more generally applicable, but we should not assume that.

Here it is reasonable to assume we can treat the learners as independent units of analysis even though they are from the same class. Here, in effect, the class is the population of interest. Yes, the students will influence each other in class, but they are all in the same class so those in both conditions are exposed to the same influences. If the students are off doing homework or meditating individually, and do not collude on the end of term test, it seems reasonable to assume we have fifteen units of analysis in each condition. That is still a small sample size, but at least it is more than unity.

In that circumstance, it is not a problem that all the students are taught together in class and no doubt, at least one might hope, interact during lessons.
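
For anyone who wants to see the bare bones of this hypothetical analysis, a sketch follows; the scores, group sizes, and the use of a simple t-test are my own illustrative assumptions, not a prescription.

```python
# A sketch of the hypothetical within-class study: thirty students randomly
# assigned to homework or meditation, end-of-term scores compared with a t-test.
# All numbers are invented purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores = rng.normal(60, 12, 30)          # end-of-term test scores for the whole class
to_homework = rng.permutation(30) < 15   # random assignment: fifteen students to each condition

homework = scores[to_homework]
meditation = scores[~to_homework]

t, p = stats.ttest_ind(homework, meditation)
print(f"homework mean = {homework.mean():.1f}, meditation mean = {meditation.mean():.1f}")
print(f"t = {t:.2f}, p = {p:.3f}")  # is the gap larger than chance assignment alone could produce?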



But, if we were comparing between two classes, and one class was assigned to the homework condition and another to the meditation condition, then it surely does matter. In this situation, we would need to consider the class as the unit of analysis with one mean outcome score. But, of course, if there are only two classes in the experiment we are not going to find any difference in outcomes that can be statistically significant. The only way we can get positive results here, is by pretending that we have a lot of independent outcomes scores in each condition.

But, that seems like cheating. "Can't you see the onions in the jar?"
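
A sketch of the contrast (again with invented numbers) may help: treated as thirty independent learners per condition, the statistics have plenty to work with; treated as one class mean per condition, there is nothing left to test.

```python
# Invented numbers: two intact classes, one assigned to each condition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
class_A = rng.normal(62, 10, 30)   # the class assigned to homework
class_B = rng.normal(58, 10, 30)   # the class assigned to meditation

# The common (questionable) analysis: students treated as independent units.
print(stats.ttest_ind(class_A, class_B))

# The class-as-unit analysis: one mean per condition, no within-condition
# variation to estimate, so no inferential test can be run.
print(class_A.mean(), class_B.mean())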


Perhaps you feel I am wrong.

I suspect that, like me, you may have taught a range of different classes over the years.

  • My own experience is that a class has its own character that is emergent, and is not just an aggregate of the characters in the class.
  • My own experience is that parallel classes, or successive cohorts, that are nominally equivalent can actually be very different.
  • My own experience is that the same class can be experienced by different teachers as quite different.
  • When working in school I sometimes found that one or two students in a class could have a disproportionate influence on the class environment and progress – and that could be either for better or worse.

But, perhaps your experiences have been different. If you can honestly say that you feel the attitudes, progress, and learning of students in classes are not influenced by the rest of the class, and occur completely independently of the others in the lessons, then, yes, it is fair to treat the learner as the unit of analysis.

I thought an analogy might be testing a drug intended to help blood supply to the extremities, where blood circulation is measured for each digit. That seems sensible if the scores are to be aggregated to give an overall score for the patient. But it would not make sense to consider blood supply to different digits to be independent when it is all part of the same circulatory system, being pumped around by the same heart!


Initial equivalence

A very common procedure used in experimental studies is testing for initial equivalence between groups. This is especially important when there is no random assignment, since if there are systematic differences between groups then any difference in final outcomes may just reflect differences at the start.

Even if researchers showed there was no difference between groups at the outset, this does not negate concerns about students in different classes being influenced by the class context and not learning independently. But leaving that aside, there is a conceptual problem with testing for initial equivalence, because if researchers were really checking for equivalence between groups at the start of an experiment then they would very rarely find complete equivalence.



So, imagine two classes that we have given what we consider the most relevant pre-test in relation to the study outcomes to be measured at post-test. Actually these are randomly generated numbers, so there is no systematic bias in the scores assigned to the students in the conditions.

The average score in one class is about 48, but in the other it is about 54.

48 is not strictly equivalent to 54. If you were appointed to a job on an annual salary of £54 000, but were only paid £48 000, you would probably not accept the argument that these two figures are equivalent.

But one is seldom going to get precisely the same scores on a pre-test (even with random numbers, as we see here!). So, the question becomes: how close is close enough to be treated as, in effect, equivalent? And this is where I think the most common approach is seriously flawed.
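
The random-number scenario is easy to reproduce; the sketch below (my own illustration, not the data from any study) shows that two sets of purely random scores practically never give identical class means, and yet the conventional significance test will usually declare 'no significant difference'.

```python
# Illustration only: random scores for two 'classes' of 25 students each.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
class_1 = rng.integers(0, 101, 25)   # random marks out of 100
class_2 = rng.integers(0, 101, 25)

print(f"means: {class_1.mean():.1f} vs {class_2.mean():.1f}")         # almost never exactly equal
print(f"t-test p = {stats.ttest_ind(class_1, class_2).pvalue:.2f}")   # usually > 0.05, so 'equivalent'?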



Here is a real example. In this study the experimental group out-performed the control group at the end of the experiment. And pre-tests were used, so we can compare between pre- and post-intervention, as well as between the two conditions.

Statistical tests tell us that the experimental group did significantly better on the post-test than it did on the pre-test. But by itself that's not very informative, as even fairly ineffective teaching is likely to produce some learning. And indeed the control group also showed significant increases between the two tests, so both conditions seem to bring about learning.

However, the statistics also tell us that the experimental group out-performed the control group at post-test by a difference that was statistically significant.

That means this difference was unlikely to be due to chance effects. But what if the two groups were starting from different bases? This is discounted because there is a significant difference between groups at the end of the experiment, when there was no such difference at the outset.

This seems logical, and perhaps the numbers here look quite convincing.



But I think, in general, there is a logical problem here. Let's consider a hypothetical marginal case. Here there is a small measured difference before the experiment, and also a small measured difference after the experiment.

  • The difference at pre-test just failed to reach significance.
  • The difference at post test just reached significance.
  • Conclusion: the experimental intervention made a difference.

But, surely, initial differences can be magnified in subsequent teaching. It is a common phenomenon that differences between learners tend to increase over time. In this hypothetical case, can we really be confident that initial differences were not a factor?

Of course, we might argue that there are more sophisticated statistical approaches which look at how factors co-vary in a study. Indeed, there are, but my target here is the simple test for equivalence that is commonly used in published studies to supposedly establish a level playing field.

The common approach is to test whether there is a very unlikely difference between the pre-test measures in the different conditions.

This means that differences which are unlikely to be due to chance effects, but which are not so unlikely as to give a p value below 0.05, are deemed 'equivalent'.


I do not think this is a sensible test for equivalence. It is a weak test, and, indeed, inadequate.

  • Imagine you were looking for a test to decide if someone should be considered a Saint. How high would you set the bar?
  • What if you wanted to identify those who should be considered to be in poverty? Certainly, not having a great deal of money would be a relevant criterion, but perhaps not quite exclusive enough.
  • Maybe you were on a committee that was considering rejecting the permanent re-appointment of a colleague on probation as their research record was not strong enough. But you should not have unreasonable expectations.


I think these are similar situations, in the sense that the criteria being adopted are certainly relevant, and perhaps even necessary, but are by no means sufficient.

In a sense, we are looking at the wrong end of the distribution. We should be asking how probable we need measured differences to be, not simply excluding the most improbable.
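
One way of asking the question from the right end, which the talk does not pursue but which I sketch here purely for illustration, is a formal equivalence test such as the 'two one-sided tests' (TOST) procedure: instead of merely failing to find a difference, we ask whether the groups can be shown to differ by less than some margin we would accept as trivial. The margin of five marks below is an arbitrary assumption.

```python
# A hedged sketch of a TOST equivalence test (margin chosen arbitrarily for illustration).
import numpy as np
from scipy import stats

def tost_ind(a, b, margin):
    """Two one-sided t-tests: can we show the mean difference lies within ±margin?"""
    p_lower = stats.ttest_ind(a, b - margin, alternative='greater').pvalue  # H1: difference > -margin
    p_upper = stats.ttest_ind(a, b + margin, alternative='less').pvalue     # H1: difference < +margin
    return max(p_lower, p_upper)   # equivalence is only supported if this is small

rng = np.random.default_rng(11)
group_1 = rng.normal(48, 15, 30)
group_2 = rng.normal(54, 15, 30)

print("conventional test p:", round(stats.ttest_ind(group_1, group_2).pvalue, 3))  # may well be > 0.05: 'no significant difference'
print("equivalence (TOST) p:", round(tost_ind(group_1, group_2, margin=5), 3))     # likely > 0.05 too: equivalence NOT demonstrated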

Here is another real example.



The study set out to compare two teaching conditions. I struggle to understand why researchers think it is acceptable to require teachers to lecture school children, but I'll come back to that later.

One thing we might notice is the very different gender composition of the classes. Is that relevant? Perhaps it should not be. In some cultural contexts, though, it might be. The two groups are in different schools, which I think is really troubling in these kinds of studies, as schools vary so much, and in so many ways.

But a pre-test was used, and the researchers claimed that on none of the items did differences between groups reach significance. This was seen to assure equivalence. From the data given, I prepared this chart showing the performance on one of the items at pre-test.



We are told that there was 'no significant difference'. Certainly, in both groups most students got the question wrong. But if this is an equivalent performance in the two classes, then the word 'equivalent' means something very different to its normal sense. Surely, despite the lack of statistical significance, one of these classes is better placed to build on their level of prior learning than the other?

This does not persuade me of equivalence. This is not an isolated case. This technique is very widely used.
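
The counts in the sketch below are hypothetical (the actual figures are in the paper concerned), but they show how easily a marked difference in item facility can fail to reach 'significance' in class-sized groups.

```python
# Hypothetical counts for one pre-test item in two classes of 25 students each.
from scipy import stats

class_A = [7, 18]   # 7 correct, 18 incorrect: facility 28%
class_B = [2, 23]   # 2 correct, 23 incorrect: facility 8%

odds_ratio, p = stats.fisher_exact([class_A, class_B])
print(f"facility 28% vs 8%, Fisher exact p = {p:.2f}")   # well above 0.05, so 'no significant difference'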


Ethics of control conditions

Another major concern I have with some published studies is that they seem to be examples of rhetorical research. That is, the study is done to demonstrate what the researchers already expect, indeed believe, and not in a spirit of open-ended enquiry.

I have noticed that many school science practicals undertaken to demonstrate well-established scientific findings are often incorrectly referred to as 'experiments', but surely professional science educators know that genuine experiments have uncertain outcomes?


Of course, if one includes more dubious journals in one's purview, one can find extreme examples where no doubt the researcher was entirely convinced by their research, but it seems unlikely there was ever any serious peer review or editorial evaluation beyond checking the publication fee had been submitted. Sadly, there are now a large number of predatory journals out there.

I've detailed a number of examples of both honest and dishonest nonsense published in such journals on my website, by which I mean both work which has been submitted in good faith, but which should never have been published; as well as things that presumably even the authors knew were complete nonsense.



I can only assume that the European Journal of Education and Pedagogy does not use serious peer review, as the paper I have summarised here does not logically show that school grades can be improved by exorcism, though I strongly suspect the author was genuine enough about the work. That is one of the key points for rigorous peer review – we can have blind spots in our own work.


It is very easy for qualitative [i.e., interpretivist] studies to become rhetorical. We look for something. We find evidence. If we believe that enquiry and argumentation are essential to good science teaching; and so devise an observation schedule based on indicators of enquiry and argumentation; and then observe the lessons of good science teachers, we are likely to find indicators of enquiry and argumentation. The question is, so what?



If instead, we believed that hesitation and contradiction were essential to good science teaching; and so devised an observation schedule to spot incidents of hesitation and contradiction; and then observed the lessons of good science teachers, I suspect we could also find evidence of some level of hesitation and contradiction. So, what?

That's one reason why quantitative, experimental studies have more kudos. Simple forms of qualitative studies are good at identifying potentially interesting points, but not suitable for testing substantive hypotheses.

But there is a genre of rhetorical experiments as well.





These papers usually start by making very strong claims about the merits of constructivism and some particular pedagogy associated with it. These papers will offer strong, convincing, theoretical arguments why such teaching is so much more effective than what may be labelled 'traditional' teaching (though by now I would have hoped constructivist-minded teaching would be traditional, but anyway.)

They then report a whole raft of prior studies which have shown the superiority of the focal pedagogy in a wide range of contexts – across geographical locations, age ranges, topics, languages of instruction, school cultures, etc.

But, they tell us, no one has yet tested the effectiveness of this pedagogy with, say, 13-14 year-olds studying the specific topic of the periodic table, in rural schools in South Cambridgeshire.

Of course, from everything that has been reported, there is every reason to think this pedagogy will be effective in such a context. No sense of Karl Popper's notion of scientists testing bold conjectures, here: we only test dead-cert hypotheses.

But to make absolutely sure that the experiment works, we compare our preferred pedagogy with a teacher lecturing a class. There may be some 'discussion' in the sense of the teacher answering questions, but we do not allow group-work or any real dialogue, practical work, or access to digital technologies; and we restrict written work to exercises at the lower end of Bloom's taxonomy: recall, comprehension, and some application. The teacher teaches as if any educational thinking after about 1870 never happened. That is the comparison condition for our theoretically-sound, widely-proven, pedagogy.

I find this utterly unethical.


It is unethical in at least two senses.

  • Any research which inconveniences people for no good reason is pointless, and so a waste of resources. A demonstration posing as an experiment falls into this category.
  • Asking control teachers to deliberately avoid good practice in their classroom, so as to ensure a positive study outcome, is also clearly wrong.



And here I have seen quite a few of these studies, including some published in good journals. How come reviewers do not spot that this is not genuine enquiry, and it is not an acceptable way to treat study participants? I can only assume we are all blinded by the assumed superiority of the experiment in science.


Falsifying conclusions

My final issue also reflects this idea, in that sometimes authors are so convinced about their expectations that they fail to observe the logic of their experiment.

The significance level chosen to determine whether differences between outcomes are statistically significant is arbitrary. Nearly always, we use a cut-off of 0.05: but there is nothing magical about that value. I suspect aliens with different numbers of fingers to us may have chosen a different critical value.

Whatever critical value we choose, the outcome is only an indicator. We are always subject to false positives where results are found significant, but unbeknown to us are actually due to chance effects; and false negatives, where results are found to be non-significant despite there being some kind of causal effect too small to reach significance in small samples.
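
A small simulation (purely illustrative, with made-up effect sizes and class-sized samples) shows both failure modes at work.

```python
# Illustration only: false positives when there is no real effect, and false
# negatives when a real but modest effect is tested with small samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
runs, n = 2000, 15   # 15 learners per condition

false_pos = np.mean([
    stats.ttest_ind(rng.normal(50, 10, n), rng.normal(50, 10, n)).pvalue < 0.05
    for _ in range(runs)
])
false_neg = np.mean([
    stats.ttest_ind(rng.normal(55, 10, n), rng.normal(50, 10, n)).pvalue >= 0.05
    for _ in range(runs)
])

print(f"false positive rate (no real effect): ~{false_pos:.2f}")          # close to the 0.05 cut-off
print(f"false negative rate (real but modest effect): ~{false_neg:.2f}")  # typically well over a half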

But, if we are going to use inferential stats, then the conclusions of our experiments, certainly subject to appropriate caveats, should be determined by the outcomes of the analysis undertaken.

I am going to briefly refer to two papers, one published in each of the two most important chemistry education journals.

A paper in Chemistry Education Research and Practice, reported positive outcomes from a study into a learning approach, and made a point of suggesting this refuted a previous study which had not found a significant effect. The whole question of replication is worthy of a talk of its own, but I am merely going to comment here that it was questionable whether the two studies were similar enough for one to be able to refute the other.



The study presented a range of results, comparing pre-test to post-test, and between the experimental and comparison groups. The two groups were considered equivalent at pretest, for the familiar reason that the differences were not so different as to reach statistical significance. My interpretation of a p value of 0.384 is that the differences between the two groups probably were NOT due to chance effects, but I think I have already said enough about that.

But surely that is not really important in this study, as there was not a significant difference between the two groups after the intervention. Any differences in scores between the two groups were not sufficiently unlikely as to be considered statistically significant.

So, in terms of the experimental design set out before collecting data, this is a negative result.



There is certainly nothing wrong with reporting a negative result, and indeed there is a strong belief that the research literature is distorted by a bias towards authors submitting, and journals preferentially publishing, positive outcomes. But these authors are claiming to refute a study that did not find significant differences by having carried out another study that ALSO did not find significant differences. Here the negative result is simply ignored when presenting the study conclusions as being positive outcomes.

Sadly, having noticed this, I felt it necessary to write up a comment – which to be fair to the journal, was published. Again, we would not accept this kind of sloppy work from school children.

For my final example I turn to the other top chemistry education-specific journal, the Journal of Chemical Education, published by the American Chemical Society, whose journals are, according to the American Chemical Society itself at least, 'most trusted'.



This is based on a figure showing results reported in a study in the Journal of Chemical Education. After 2017, the researchers modified their medicinal chemistry course, by implementing what they called 'Student-Centred Team-Based Learning Teaching Method', and as you can see student results improved thereafter.

"…our results suggest that the SCTBL method is an effective way to improve teaching quality and student achievement."

You will notice I have failed to include any numbers on this figure, which is disingenuous of me, because the original did include the numbers. And with the numbers we see that we are viewing a graph with a truncated axis. That is a perfectly valid technique used to emphasise differences, but I wondered if here they might have over-emphasised the differences? They are just focusing on a small range of values. And if I present the whole graph as it might have been drawn, the change seems less impressive.
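
To see the effect for yourself, here is a small sketch with made-up yearly means (not the paper's data) plotting the same values on a truncated axis and on the full percentage scale.

```python
# Made-up course averages, plotted twice: truncated axis versus full 0-100 scale.
import matplotlib.pyplot as plt

years = [2016, 2017, 2018, 2019, 2020]
means = [79.2, 79.5, 80.1, 80.4, 80.6]   # invented values for illustration

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
for ax, (lo, hi), title in [(ax1, (79, 81), "Truncated axis"), (ax2, (0, 100), "Full scale")]:
    ax.plot(years, means, marker="o")
    ax.set_ylim(lo, hi)
    ax.set_title(title)
plt.tight_layout()
plt.show()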



Perhaps some of you are used to marking student work, and perhaps you feel your marking is entirely objective and very precise. But, given how cohorts shift from year to year, I really do not think average course scores to two decimal places are meaningful.

We would not accept this from school children. So, I have reworked their results [2017: 80%; 2018: 80%; 2019: 80%; 2020: 80%]. But even if I am being too fussy, and you think this level of precision can be justified, we might wonder how such small differences led to a statistically significant outcome.

Of course, they did not. The authors did the analysis, and reported there was no significant difference between the scores before and after the new approach was implemented. Yet, these authors felt they could ignore this analysis and reach a positive conclusion.



Presumably the peer reviewers, and the editor, thought this was fine. I think this goes completely against proper scientific practice.

This is leaving aside that such a comparison only makes sense if we think subsequent cohorts can be considered equivalent. The authors did not even use a weak test of equivalence of the kind I have discussed earlier, because they did not use any test of equivalence. They just assumed that the 40 students admitted to the course each year could be treated as equivalent.

It was also clear from details in the paper that there had been some necessary modification in the assessment process in moving to the new teaching approach. This may have only affected a minor component of the final score, but that seems relevant when they think they are measuring to a hundredth of a percentage point. I just cannot see how peer reviewers thought this was okay.

So, I wrote to the editor about it. I was told my comments could be considered for publication if I submitted them as a formal submission. So, I did.

Apparently, the editor initially asked three reviewers to read my submission. Two thought it should be published. One thought some changes were needed before it could be published. Now, I have been a journal editor, so I am pretty clear how I would respond to that review profile as an editor. But this editor was unsure.

The editor then asked a fourth person, who thought the comment should be rejected. And so it was.

I was told I could prepare a resubmission, but that if I did so it would be better to focus on general issues and not the specific paper I wanted to critique. That, of course, would have been a completely different article. I must admit to having been shocked by this outcome.


In conclusion

To conclude: We are scientists, or we like to think we are, and scientists do experiments.

I know from my time working in science teacher preparation that, generally, science graduates tend to think experiment is the method of choice when we ask them to undertake small-scale enquiry into their teaching.

  • Perhaps our scientific training so promotes the merits of experimental procedures that science educators have an implicit bias that is strong enough to overcome any concerns about features that invalidate so many educational experiments of this kind.
  • Perhaps our scientific education leads us to think that the world can be organised into natural kinds such that one copper wire of a certain gauge is assumed to be able to stand for any other copper wire of the same dimensions; and so one teacher, or one class of fifteen-year-olds, can stand for any other?
  • Perhaps the scientific mindset that objectifies the natural world is so strong that we see people as experimental subjects that respond to our treatments without regard to the inter-subjective nature of our interactions with them and what they might think of us and our experiments?
  • Perhaps this is also why we tend to see classes as arrays of individuals and forget that people in groups interact and influence each other.

I would suggest that when a valid experiment is possible, it is usually to be preferred. But if we cannot do a valid experiment, we should not do an experiment at all.

An invalid experiment is not scientific, and is a waste of valuable resource – including researcher time and participant goodwill.

An invalid experiment carried out by a researcher undermines their claim to be a competent scientist.

So, what does the science education community's propensity for publishing invalid experiments in its journals say about us collectively?


Thank you.


Download: Science, superstition, or confidence trick: Do science educators have too much faith in the experiment?