Misconceptions of change

It may be difficult to know what counts as an alternative conception in some topics – and sometimes research does not make it any clearer


Keith S. Taber


If a reader actually thought the researchers themselves held these alternative conceptions then one could have little confidence in their ability to distinguish between the scientific and alternative conceptions of others

I recently published an article here where I talked in some detail about some aspects of a study (Tarhan, Ayyıldız, Ogunc & Sesen, 2013) published in the journal Research in Science and Technological Education. Despite having a somewhat dodgy title 1, this is a well respected journal published by a serious publisher (Routledge/Taylor & Francis). I read the paper because I was interested in the pedagogy being discussed (jigsaw learning), but what prompted me to then write about it was the experimental design: setting up a comparison between a well-tested active learning approach and lecture-based teaching. A teacher experienced in active learning techniques taught a control group of twelve-year-old pupils through a 'traditional' teaching approach (giving the children notes, setting them questions…) as a comparison condition for a teaching approach based on engaging group-work.

The topic studied by these sixth-grade, elementary school students was physical and chemical changes.

I did not discuss the outcomes of the study in that post as my focus there was on the study as possibly being an example of rhetorical research (i.e., a demonstration set up to produce a particular outcome, rather than an open-ended experiment to genuinely test a hypothesis), and I was concerned that the control conditions involved deliberately providing sub-optimal, indeed sub-standard, teaching to the learners assigned to the comparison condition.

Read 'Didactic control conditions. Another ethically questionable science education experiment?'

Identifying alternative conceptions

The researchers actually tested the outcome of their experiment in two ways (as well as asking students in the experimental condition about their perceptions of the lessons), a post-test taken by all students, and "ten-minute semi-structured individual interviews" with a sample of students from each condition.

Analysis of the post-test allowed the researchers to identify the presence of students' alternative conceptions ('misconceptions'2) related to chemical and physical change, and the identified conceptions are reported in the study. Interviewees were purposively selected,

"Ten-minute semi-structured individual interviews were carried out with seven students from the experimental group and 10 students from the control group to identify students' understanding of physical and chemical changes by acquiring more information about students' unclear responses to [the post-test]. Students were selected from those who gave incorrect, partially correct and no answers to the items in the test. During the interviews, researchers asked the students to explain the reasons for their answers to the items."

Tarhan et al., 2013, p.188

I was interested to read about the alternative conceptions they had found for several reasons:

  1. I have done research into student thinking, and have written a lot about alternative conceptions, so the general topic interests me;
  2. More specifically, it is interesting to compare what researchers find in different educational contexts, as this gives some insight into the origins and developments of such conceptions;
  3. Also, I think the 'chemical and physical changes' distinction is actually a very problematic topic to teach. (Read about a free classroom resource to explore learners' ideas about physical and chemical changes.)

In this post I am going to question whether the authors' claims in their research report about some of the alternative conceptions they reported finding are convincing. First, however, I should explain the second point here.

Cultural variations in alternative conceptions

Some alternative conceptions seem fairly universal, being identified in populations all around the world. These may primarily be responses to common experiences of the natural world. An obvious example relates to Newton's first law (the law of inertia): we learn from very early experience, before we even have language to talk about our experiences, that objects that we push, throw, kick, toss, pull… soon come to a stop. They do not move off in a straight line and continue indefinitely at a constant speed.

Of course, that experience is not actually contrary to Newton's first law (as various forces are acting on the objects concerned), but it presents a consistent pattern (objects initially move off, but soon slow and stop) that becomes part of our intuitions about the world, and so makes learning the scientific law seem counter-intuitive, and so more difficult to accept and apply when taught in school.

Read about the challenge of learning Newton's first law

By contrast, no one has ever tested Newton's first law directly by seeing what happens under the ideal conditions under which it would apply (see 'Poincaré, inertia, and a common misconception').

Other alternative conceptions may be less universal: some may be, partially at least, due to an aspect of local cultural context (e.g. folk knowledge, local traditions), the language of instruction, the curriculum or teaching scheme, or even a particular teacher's personal way of presenting material.

So, to the extent that there are some experiences that are universal for all humans, due to commonalities in the environment (e.g., to date at least, all members of the species have been born into an environment with a virtually constant gravitational field and a nitrogen-rich atmosphere of about 1 atmosphere pressure {i.e., c. 10⁵ Pa} and about 21% oxygen content), there is a tendency for people everywhere (on earth) to develop the same alternative conceptions.

And, conversely, to the extent that people in different institutional, social, and cultural contexts have contrasting experiences, we would expect some variations in the levels of incidence of some alternative conceptions across populations.

"Some common ideas elicited from children are spread, at least in part, through informal learning in everyday "life-world" contexts. Through such processes youngsters are inducted into the beliefs of their culture. Ideas that are common in a culture will not usually contradict everyday experience, but clearly beliefs may develop and be disseminated without matching formal scientific knowledge. …

Where life-world beliefs are relevant to school science – perhaps contradicting scientific principles, perhaps apparently offering an explanation of some science taught in school; perhaps appearing to provide familiar examples of taught principles – then it is quite possible, indeed likely, that such prior beliefs will interfere with the learning of school science. …

Different common beliefs will be found among different cultural groups, and therefore it is likely that the same scientific concepts will be interpreted differently among different cultural groups as they will be interpreted through different existing conceptual frameworks."

Taber, 2012a, pp.5-6

As a trivial example, the National Curriculum for primary age children in England erroneously describes some materials that are mixtures as being substances. These errors have persisted for some years, as the government department does not think them important enough to merit the effort of correction. Assuming many primary school teachers (who are usually not science specialists, though some are of course) trust the flawed information in the official curriculum, we might expect more secondary school students in England, than in other comparable populations, to later demonstrate alternative conceptions in relation to the critical concept of a chemical substance.

"This suggests that studies from different contexts (e.g., different countries, different cultures, different languages of instruction, and different curriculum organisations) should be encouraged for what they can tell us about the relative importance of educational variables in encouraging, avoiding, overcoming, or redirecting various types of ideas students are known to develop."

Taber, 2012a, p.9

The centrality of language

Language of instruction may sometimes be important. Words that supposedly are translated from one language to another may actually have different nuances and associations. (In English, it is clearly an alternative conception to think the chemical elements still exist in a compound, but the meaning of the French élément chimique seems to include the 'essence' of an element that does continue into the compound.)

Research in different educational contexts can in principle help unravel some of this: in principle, as it requires the various researchers to detail aspects of the teaching and cultural contexts from which they report, as well as the students' ideas (Taber, 2012a).

Chemical and physical change

Teaching about chemical and physical change is a traditional topic in school science and chemistry courses. It is one of those dichotomies that is understandably introduced in simple terms, and so offers a simplification that may need to be 'unlearnt' later:

[a change is] chemical change or physical change

[an element is] metal or non-metal

[a chemical bond is] ionic bonding or covalent bonding

There are some common distinctions often made to support this discrimination into two types of change:


Table 1.2 from Teaching Secondary Chemistry (2nd ed) (Taber, 2012b)

However, a little thought suggests that such criteria are not especially useful in supporting school students making observations, and indeed some of these criteria simply do not stand up to close examination. 2

"the distinction between chemical and physical changes is a rather messy one, with no clear criteria to help students understand the difference"

Taber, 2012b, p.33


So, I was especially interested to know what Tarhan and colleagues had found.

Methodological 'small print'

In reading any study, a consideration of the findings has to be tempered by an understanding of how the data were collected and analysed. Writing-up research reports for journals can be especially challenging as referees and editors may well criticise missing details they feel should be reported, yet often journals impose word-limits on articles.

Currently (2023) this particular journal tells potential authors that "A typical paper for this journal should be between 7000 and 8000 words" which is a little more generous than some other journals. However, Tarhan and colleagues do not fully report all aspects of their study. This may in part be because they need quite a lot of space to describe the experimental teaching scheme (six different jigsaw learning activities).

Whatever the reason:

  • the authors do not provide a copy of the post-test which elicited the responses that were the basis of the identified alternative conceptions;
  • nor do they explain how the analysis to identify conceptions was undertaken – to show how student responses were classified; and
  • similarly, there are no quotations from the interview dialogue to illustrate how the researchers interpreted student comments.

Data analysis is the process of researchers interpreting data so they become evidence for their findings, and generally research journals expect the process to be detailed – but here the reader is simply told,

"Students' understanding of physical and chemical changes was identified according to the post-test and the individual interviews after the process."

Tarhan et al., 2013, p.189

'Misconceptions'

In their paper, Tarhan and colleagues use the term 'misconception' which is often considered a synonym for 'alternative conception'. Commonly, conceptions are referred to as alternative if they are judged to be inconsistent with canonical concepts.

Read about alternative conceptions

Although the term 'misconception' is used 32 times in the paper (not counting instances in the reference list), the term is not explained in the text, presumably because it is assumed that all those working in science education know (and agree) what it means. This is not at all unusual. I once wrote about another study:

"[The] qualities of misconceptions are largely assumed by the author and are implicit in what is written…It could be argued that research reports of this type suggest the reported studies may themselves be under-theorised, as rather well-defined technical procedures are used to investigate foci that are themselves only vaguely characterised, and so the technical procedures are themselves largely operationalised without explicit rationale."

Taber, 2013, p.22

Unfortunately, in Tarhan and colleagues' study the technical procedures by which data were analysed to identify 'misconceptions' are even less well defined, leaving the reader with limited grounds for confidence that what is reported is worthy of being described as student conceptions – and not just errors or guesses made on the test. Our thinking is private, and never available directly to others, and, so, can only be interpreted from the presentations we make to represent our conceptions in a public (shared) space. Sometimes we mis-speak, or we mis-write (so that our words do not accurately represent our thoughts). Sometimes our intended meanings may be misinterpreted (Taber, 2013).

Perhaps the researchers felt that this process of identifying conceptions from students' texts and utterances was unproblematic – perhaps the assignments seemed so obvious to the researchers that they did not need to exemplify and justify their analytical method. This is unfortunate. There might also be another factor here.

Lost and found in translation?

The study was carried out in Turkey. The paper is in English, and this includes the reported alternative conceptions. The study was carried out "in a public elementary school" (not an international school, for example). Although English is often taught as a foreign language in Turkish schools, the language of instruction, not unreasonably, is Turkish.

So, it seems either

  • the data were collected in (what, for the children, would have been) 'L2' – a second language, or
  • a study carried out (questions asked; answers given) in Turkish has been reported in English, translating where necessary from one language to another.

This issue is not discussed at all in the paper – there is no mention of either the Turkish or English language, nor of anything being translated.

Yet the authors are not oblivious to the significance of language issues in learning. They report how one variant of Jigsaw teaching had "been designed specifically to increase interaction among students of differing language proficiencies in bilingual classrooms" (p.186) and how the research literature reports that sometimes children's ideas reflect "the incorrect use of terms in everyday language" (p.198). However, they did not feel it was necessary to report either that

  1. data had been collected from elementary school children in a second language, or
  2. data had been translated for the purposes of reporting in an English language journal.

It seems reasonable to assume they would have appreciated the importance of mentioning option 1, and so it seems much more likely (although readers of the study should not have to guess) that the reporting in English involved translation. Yet translation is never a simple algorithmic process, but rather always a matter of interpretation (another stage in analysis), so it would be better if authors always acknowledged this – and offered some basis for readers to judge that the translations made were of high quality (Taber, 2018).

Read about guidelines for detailing translation in research reports

It is a general principle that the research community should adopt, surely, that whenever material reported in a research paper has been translated from another language (a) this is reported and (b) evidence of the accuracy and reliability of the translation is offered (Taber, 2018).

I make this point here, as some of the alternative conceptions reported by the authors are a little mystifying, and this may(?) be because their wording has been 'degraded' (and obscured) by imperfect translation.

An alternative conception of combustion?

For example, here are two of the learning objectives from one of the learning activities:

"The students were expected to be able to:

…comment on whether the wood has similar intensive properties before and after combustion

…indicate the combustion reactions in examples of several physical and chemical changes"

Tarhan et al., 2013, p.193

The wording of the first of these examples seems to imply that when wood is burnt, the product is still…wood. That is nonsense, but possibly this is simply a mistranslation of something that made perfect sense in Turkish. (The problem is that a reader can only speculate on whether this is the case, and research reports should be precise and explicit.)

The second learning objective quoted here implies that some combustion reactions are physical changes (or, at least, combustion reactions are components of some physical changes).

Combustion reactions are a class of chemical reactions. 'Chemical reaction' is synonymous with 'chemical change'. So, there are (if you will excuse the double negative) no examples of combustion reactions that are not chemical reactions and which would be said to occur in physical changes. So, this is mystifying, as it is not at all clear what the children were actually being taught unless one assumes the researchers themselves have very serious misconceptions about the chemistry they are teaching.
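Put formally (this is simply a restatement of the argument above, not anything from the paper itself):

$$\text{combustion reactions} \subseteq \text{chemical changes}, \qquad \text{chemical changes} \cap \text{physical changes} = \varnothing$$

$$\therefore \quad \text{combustion reactions} \cap \text{physical changes} = \varnothing$$

That is, within this classification there is simply no room for a combustion reaction that features in a 'physical change'.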

If a reader actually thought that the researchers themselves held these alternative conceptions

  • the product of combustion of wood is still wood
  • some combustion reactions are (or occur as part of) physical changes

then one could have little confidence in their ability to distinguish between the scientific and alternative conceptions of others. (A reader might also ask why the journal referees and editor did not request corrections here before publication – I certainly wondered about this.)

There are other statements the authors make in describing the teaching which are not entirely clear (e.g., "give the order of the changes in matter during combustion reactions", p.194), and this suggests a degree of scepticism is needed in not simply accepting the reported alternative conceptions at face value. This does not negate their interest, but does undermine the paper's authority somewhat.

One of the misconceptions reported in the study is that some students thought that "there is a flame in all combustion reaction". This led me to reflect on whether I could think of any combustion reactions that did not involve a flame – and I must confess none readily came to mind. Perhaps I also have this alternative conception – but it seems a harsh judgement on elementary school learners unless they had actually been taught about combustion reactions without flames (if, indeed, there are such things).


The study reported that some 12 year olds held the 'misconception' that "there is a flame in all combustion reaction[s]".

[Image by Susanne Jutzeler, Schweiz, from Pixabay]


Failing to control variables?

Another objective was for students to "comprehend that temperature has an effect on chemical reaction rate by considering the decay of fruit at room temperature, and the change in color [colour] from green to yellow of fallen leaves in autumn" (p.193). As presented, this is somewhat obscure.

Presumably it is not meant to be a comparison between:

  • the rate of decay of fruit at room temperature, and
  • the rate of change in colour of fallen leaves in autumn

Explaining that temperature has an effect on chemical reaction rate?

Clearly, even if the change of colour of leaves takes place at a different temperature to room temperature, one cannot compare totally different processes at different temperatures and draw any conclusions about how "temperature has an effect on chemical reaction rate". (Presumably, 'control of variables' is taught in the Turkish science curriculum.)

So, one assumes these are two different examples…

But that does not help matters too much. The "decay of fruit at room temperature" (or, indeed, any other process studied at a single temperature) cannot offer any indication of how "temperature has an effect on chemical reaction rate". The change of colours in leaves of deciduous trees (which usually begins before they fall) is triggered by environmental conditions such as change in day length and temperature. This is part of a very complex system involving a range of pigments, whilst the water content of the leaf decreases (once the supply of water through the tree's vascular system is cut off), and it is not clear how much detail these twelve year olds were taught…but it is certainly not a simple matter of a reaction changing rate according to temperature.

Evaluating conceptions

Tarhan and colleagues report their identified alternative conceptions ('misconceptions') under a series of headings. These are reported in their table 4 (p.195). A reader certainly finds some of the entries in this table easy to interpret: they clearly seem to reflect ideas contrary to the canonical science one would expect to be reflected in the curriculum and teaching. Other statements are less obviously evidence of alternative conceptions as they do not immediately seem necessarily at odds with scientific accounts (e.g., associating combustion reactions with flames).

Other reported misconceptions are harder to evaluate. School science is in effect a set of models and representations of scientific accounts that often simplify the actual current state of scientific knowledge. Unless we know exactly what has been taught it is not entirely clear if students' ideas are credit-worthy or erroneous in the specific context of their curriculum.

Moreover, as the paper does not report the data and its analysis, but simply the outcome of the analysis, readers do not know on what basis judgements have been made to assign learners as having one of the listed misconceptions.


Changes of state are chemical changes

A few students from the lecture-based teaching condition were identified as 'having' the misconception that 'changes of state are chemical changes'. This seems a pretty serious error at the end of a teaching sequence on chemical and physical changes.

However, this raises a common issue in terms of reports of alternative conceptions – what exactly does it mean to say that a student has a conception that 'changes of state are chemical changes'? A conception is a feature of someone's thinking – but that encompasses a vast range of potential possibilities from a fleeting notion that is soon forgotten ('I wonder if s orbitals are so-called because they are spherical?') to an on-going commitment to an extensive framework of ideas that a life is lived by (Buddhism, Roman Catholicism, Liberalism, Hedonism, Marxism…).


A person's conceptions can vary along a range of characteristics (Figure from Taber, 2014)


The statement that 'Changes of state are chemical changes' is unlikely to be the basis of anyone's personal creed. It could simply be a confusion of terms. Perhaps a student had a decent understanding of the essential distinction between chemical and physical changes but got the terms mixed up (or was thinking that 'changes of state' meant 'chemical reaction'). That is certainly a serious error that needs correcting, but in terms of understanding of the science, would seem to be less worrying than a deeper conceptual problem.

In their commentary, the authors note of these children:

"They thought that if ice was heated up water formed, and if water was heated steam formed, so new matter was formed and chemical changes occurred".

Tarhan et al., 2013, p.197

It is not clear if this was an explanation the learners gave for thinking "changes of state are chemical changes", or whether "changes of state are chemical changes" was the researchers' gloss on children commenting that "if ice was heated up water formed, and if water was heated steam formed, so new matter was formed and chemical changes occurred".

That a range of students are said to have precisely the same train of thought leads a reader (or, at least, certainly one with experience of undertaking research of this kind) to ask if these are open-ended responses produced by the children, or the selection by the children of one of a number of options offered by the researchers (as pointed out above, the data analysis is not discussed in detail in the paper). That makes a difference in how much weight we might give to the prevalence of the response (putting a tick by the most likely looking option requires less commitment to, and appreciation of, an idea than setting it out yourself in your own personally composed text), illustrating why it is important that research journals should require researchers to give full accounts of their instrumentation and analysis.

Because density of matter changes during changes of state, its identity also changes, and so it is a chemical change

Thirteen of the children (all in the lecture-based teaching condition) were considered to have the conception "Because density of matter changes during changes of state, its identity also changes, and so it is a chemical change". This is clearly a much more specific conception (than 'changes of state are chemical changes') which can be analysed into three components:

  • a change of state is a chemical change, AND
  • we know this because such changes involve a change in identity, AND
  • we know that because a change of state leads to a change in density

Tarhan and colleagues claim this conception was "first determined in this study" (p.195).

The specificity is intriguing here – if so many students explicitly and individually built this argument for themselves then this is an especially interesting finding. Unfortunately, the paper does not give enough detail of the methodology for a reader to know if this was the case. Again, if students were just agreeing with an argument offered as an option on the assessment instrument then it is of note, but less significant (as in such cases students might agree with the statement simply because one component resonated – or they may even be guessing rather than leaving an item unanswered). Again this does not completely negate the finding, but it leaves its status very unclear.

Taken together these first two claimed results seem inconsistent – as at least 13 students seem to think "Changes of state are chemical changes". That is, all those who thought that "Because density of matter changes during changes of state, its identity also changes, and so it is a chemical change" would seem to have thought that "Changes of state are chemical changes" (see the Venn diagram below). Yet, we are also told that only five students held the less specific and seemingly subsuming conception "changes of state are chemical changes".


If 13 students think that changes of state are chemical changes because a change of density implies a change of identity, what does it mean that only 5 students think that changes of state are chemical changes?
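The counting problem can be put more formally. If holding the specific conception (call it A) entails holding the general one (B), then the reported frequencies should satisfy a simple constraint – which here they do not:

$$A \subseteq B \;\Rightarrow\; |A| \le |B|, \qquad \text{yet we are told } |A| = 13 > 5 = |B|$$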

This looks like an error, but perhaps is just a lack of sufficient detail to make the findings clear. Alternatively, perhaps this indicates some failure in translating material accurately into English.

The changes in the pure matters are physical changes

Six children in the lecture-based teaching condition and one in the jigsaw learning condition were reported as holding the conception that "The changes in the pure matters are physical changes". The authors do not explain what they mean here by "pure matters" (sic, presumably 'matter'?). The only place this term is used in the paper is in relation to this conception (p.195, p.197).

The only other reference to 'pure' was in one of the learning objectives for the teaching:

  • explain the changes of state of water depending on temperature and pressure; give various examples for other pure substances (p.191)

If "pure matter" means a pure sample of a substance, then changes in pure substances are all physical – by definition a chemical changes leads to a different substance/different substances. That would explain why this conception was "first determined [as a misconception] in this study", p.195, as it is not actually a misconception)". So, it does not seem clear precisely why the researchers feel these children have got something wrong here. Again, perhaps this is a failure of translation rather than a failure in the original study?

Changes in shape?

Tarhan and colleagues report two conceptions under the subheading of 'changes in shape'. They seem to be thinking here more of grain size than shape as such. (Another translation issue?) One reported misconception is that if cube sugar is granulated, sugar particles become small [smaller?].


Is it really a misconception to think that "If cube sugar is granulated, sugar particles become small"?

(Image by Bruno /Germany from Pixabay)


Tarhan and colleagues reported that two children in the experimental condition, and 13 in the control condition thought that "If cube sugar is granulated, sugar particles become small". Sugar cubes are made of granules of sugar weakly joined together – they can easily be crumbled into the separate grains. The grains are clearly smaller than the cubes. So, what is important here is what is meant/understood* by the children by the term 'particles'.

(* If this phrasing was produced by the children, then we want to know what they meant by it. If, however, the children were agreeing with a phrase presented to them by researchers, then we wish to know how they understood it.)

If this means quanticle-level particles, molecules, then it is clearly an alternative conception – each grain contains vast numbers of molecules, and the molecules are unchanged by the breaking up of the cubes. If, however, particles here refers to the cubes and grains**, then it is a fair reflection of what happens: one quite large particle of sugar is broken up into many much smaller particles. The ambiguity of the (English) word 'particles' in such contexts is well recognised.

(** That is, if the children used the word 'particles' – did they mean the cubes/grains as particles of sugar? If however the phrasing was produced by the researchers and presented to the children, and if the researchers meant 'particles' to mean 'molecules'; did the children appreciate that intention, or did they understand 'particles' to refer to the cubes and grains?)

However, as no detail is given on the actual data collected (e.g., are these the children's own words? was this based on an open response?), or how it was analysed (and, as I suspect, this all occurred in Turkish), the reader has no way to check this interpretation of the data.

What kind of change is dissolving?

Tarhan and colleagues report a number of 'misconceptions' under the heading of 'molecular solubility'. Two of these are:

  • "The solvation processes are always chemical changes"
  • "The solvation processes are always physical changes"

This reflects a problem of teaching about physical and chemical changes. Dissolving is normally seen as a physical change: there is no new chemical substance formed and dissolving is usually fairly readily reversed. However, as bonds are broken and formed it also has some resemblance to chemical change.2

In dissolving common salt in water, strong ionic bonds are disrupted and the ions are strongly solvated. Yet the usual convention is still to consider this a physical change – the original substance, the salt, can be readily recovered by evaporation of the solvent. A solution is considered a kind of mixture. In any case, as Tarhan and colleagues refer to 'molecular' solubility (strictly solubility refers to substances, not molecules, but still) they were, presumably, only dealing with examples of the dissolving of substances with discrete molecules.
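For illustration (my own formulation, not one taken from the paper), the dissolving of salt in water is conventionally represented as

$$\mathrm{NaCl(s)} \;\rightarrow\; \mathrm{Na^{+}(aq)} + \mathrm{Cl^{-}(aq)}$$

– strong ionic bonds are disrupted and new ion–solvent attractions formed, yet, as no new substance results, this is still conventionally classed as a physical change.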

Taking these two conceptions together, it seems that Tarhan and colleagues think that dissolving is sometimes a physical change, and sometimes a chemical change. Presumably they have some criterion or criteria to distinguish those examples of dissolving they consider physical changes from those they consider chemical changes. A reader can only speculate how a learner observing some solute dissolve in a solvent is expected to distinguish these cases. The researchers do not explain what was taught to the students, so it is difficult to appreciate quite what the students supposedly got wrong here.

Sugar is invisible in the water, because new matter is formed

The idea that learners think that new matter is formed on dissolving would indeed be an alternative conception. The canonical view is that new matter is only formed in very high energy processes – such as in the big bang. In both chemical and physical processes studied in the school laboratory there may be transformations of matter, but no new matter.

This seems a rather extreme 'misconception' for the learners to hold. However, a reader might wonder if the students actually suggested that a new substance was formed, and this has been mistranslated. (The Turkish word 'madde' seems to mean either matter or substance.) If these students thought that a new type of substance was formed then this would be an alternative conception (and it would be interesting to know why this led to sugar being invisible – unless they were simply arguing that different appearance implied different substance).

While sugar is dissolving in the water, water damages the structure of sugar and sugar splits off

Whether this is a genuine alternative conception or just imprecise use of language is not clear. It seems reasonable to suggest that while sugar is dissolving in the water, the process breaks up the structure of solid sugar and sugar molecules split off – so some more detail would be useful here. Again, if there has been translation from Turkish this may have lost some of the nuance of the original phrasing through translation into English.

The phrasing reflects an alternative conception that in chemical reactions one reactant is an active agent (here the water doing the damaging) and the other the patient, that is passive and acted upon (here the sugar being damaged) – rather than seeing the reaction as an interaction between two species (Taber & García Franco, 2010) – but there is no suggestion in their paper that this is the issue Tarhan and colleagues are highlighting here.

When sugar dissolves in water, it reacts with water and disappears from sight

If the children thought that dissolving was a chemical reaction then this is an alternative conception – the sugar does indeed disappear from sight, but there has been no reaction.

Again, we might ask if this was actually a misunderstanding (misconception), or imprecise use of language. The sugar does 'react' with the water in the everyday sense of 'reaction'. But this is not a chemical reaction, so this terminology should be avoided in this context.

Even in science, 'reaction' means something different in chemistry and physics: in the sense of Newtonian physics, during dissolving, when a water molecule attracts a sugar molecule ('action') there will be an equal and oppositely directed reaction as the sugar molecule attracts the water molecule. This is Newton's third law, which applies to quanticles as much as to planets. If a water molecule and a sugar molecule collide, the force applied by the sugar molecule on the water molecule is equal to the force applied by the water molecule on the sugar molecule.
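In symbols, for the attraction between a water molecule (w) and a sugar molecule (s), the third law simply states

$$\vec{F}_{\mathrm{w\;on\;s}} = -\,\vec{F}_{\mathrm{s\;on\;w}}$$

– equal in magnitude, opposite in direction: a 'reaction' with no chemical connotation whatsoever.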

Read about learning difficulties with Newton's third law

So, 'sugar reacts with water' could be

  • a misunderstanding of dissolving (a genuine alternative conception);
  • a misuse of the chemical term 'reaction'; or
  • a use of the everyday term 'reaction' in a context where this should be avoided as it can be misunderstood

These are somewhat different problems for a teacher to address.

Molecules split off in physical changes and atoms split off in chemical changes

Ten of the children are said to have demonstrated the 'misconception' that molecules split off in physical changes and atoms split off in chemical changes. The authors claim that this misconception has not been reported in previous studies. But is this really a misconception? It may be a simplistic, and imprecise, statement – but I think when I was teaching youngsters of this age I would have been happy to find they held this notion – which at least seems to reflect an ability to imagine and visualise processes at the molecular level.

In dissolving or melting/boiling of simple molecular substances, molecules do indeed 'split off' in a sense, and in at least some chemical changes we can posit mechanisms that, in simple terms at least, involve atoms 'splitting off' from molecules.
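The contrast can be illustrated with standard examples (these are mine, not taken from the paper): in boiling, intact molecules 'split off' from the liquid, whereas in a chemical change atoms are regrouped into new molecules:

$$\mathrm{H_2O(l)} \rightarrow \mathrm{H_2O(g)} \qquad \text{(physical change: molecules separate unchanged)}$$

$$2\,\mathrm{HI(g)} \rightarrow \mathrm{H_2(g)} + \mathrm{I_2(g)} \qquad \text{(chemical change: atoms regroup)}$$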

So, again, this is another example of how this study is tantalising, without being very informative. The reader is not clear in what sense this is viewed as wrong, or how the conception was detected. (Again, for ten different students to specifically think that 'molecules split off in physical changes and atoms split off in chemical changes' makes one wonder if they volunteered this, or simply agreed with the statement when it was presented to them.)

In conclusion

The main thrust of Tarhan and colleagues' study was to report on an innovation using jigsaw learning (which, unfortunately, was compared with a form of pedagogy widely considered unsuitable for young children, offering a limited basis for judging the effectiveness of the innovation). As part of the study they collected data to evaluate learning in the two conditions, and used this to identify misconceptions students demonstrated after being taught about physical and chemical changes. The researchers provide a long list of identified misconceptions – but it is not always obvious why these are considered misconceptions, nor what the desired responses matching teaching models were.

The researchers do not describe their data collection and analysis instruments and protocols in sufficient detail for readers to appreciate what they mean by their results – in particular, what it means to 'have' a misconception: for example, to give a definitive statement in an interview, or just to select the response on a test that looked most promising at the time. Clearly we give much more weight to a notion that a learner presents in their own words as an explanation for some phenomenon, than to the selection of one option from a menu of statements presented to them, which comes with no indication of their confidence in the selection made.

Of particular concern: either the children were asked questions in a second language in which they may not have been sufficiently fluent to fully understand the questions or compose clear responses; or none of the misconceptions reported is presented in its original form, all having been translated by someone (unspecified) of uncertain ability as a translator. (A suitably qualified translator would need to have high competence in both languages and a strong familiarity with the subject matter being translated.)

In the circumstances, Tarhan and colleagues' reported misconceptions are little more than intriguing. In science, the outcome of a study is only informative in the context of understanding exactly how the data were obtained, and how they have been processed. Without that, readers are asked to take a researcher's conclusions on faith, rather than be persuaded of them by a logical chain of argument.


p.s. For anyone who did not know, but wondered: s orbitals are not so-called because they are spherical: the designation derives from a label ('sharp') that was applied to some lines in atomic spectra.


Work cited

Notes


1 To my reading, the publication title 'Research in Science and Technological Education' seems to suggest the journal has two distinct and somewhat disconnected foci, that is:

Research in ( Science ) and ( Technological Education )

And it would be better (that is, more consistently) titled as

Research in Science and Technology Education

{Research in ( Science and Technology ) Education}

or

Research in Scientific and Technological Education

{Research in ( Scientific and Technological ) Education}

but, hey, I know I am pedantic.


2 The table (Table 1.2 in the source) was followed by the following text:

"The first criterion listed is the most fundamental and is generally clear cut as long as the substances present before and after the change are known. If a new substance has been produced, it will almost certainly have different melting and boiling temperatures than the original substance.

The other [criteria] are much more dubious. Some chemical changes involve a great deal of energy being released, such as the example of burning magnesium in air, or even require a considerable energy input, such as the example of the electrolysis of water. However, other reactions may not obviously involve large energy transfers, for example when the enthalpy and entropy changes more or less cancel each other out…. The rusting of iron is a chemical reaction, but usually occurs so slowly that it is not apparent whether the process involves much energy transfer ….

Generally speaking, physical changes are more readily reversible than chemical changes. However, again this is not a very definitive criterion. The idea that chemical reactions tend to either 'go' or not is a useful approximation, but there are many examples of reactions that can be readily reversed…. In principle, all reactions involve equilibria of forward and reverse reactions, and can be reversed by changing the conditions sufficiently. When hydrogen and oxygen are exploded, it takes a pedant to claim that there is also a process of water molecules being converted into oxygen and hydrogen molecules as the reaction proceeds, which means the reaction will continue for ever. Technically such a claim may be true, but for all practical purposes the explosion reflects a reaction that very quickly goes to completion.

One technique that can be used to separate iodine from sand is to warm the mixture gently in an evaporating basin, over which is placed an upturned beaker or funnel. The iodine will sublime – turn to vapour – before recondensing on the cold glass, separated from the sand. The same technique may be used if ammonium chloride is mixed with the sand. In both cases the separation is achieved because sand (which has a high melting temperature) is mixed with another substance in the solid state that is readily changed into a vapour by warming, and then readily recovered as a solid sample when the vapour is in contact with a colder surface. There are then reversible changes involved in both cases:

solid iodine ➝ iodine vapour

ammonium chloride ➝ ammonia + hydrogen chloride

In the first case, the process involves only changes of state: evaporation and condensation – collectively called sublimation. However the second case involves one substance (a salt) changing to two other substances. To a student seeing these changes demonstrated, there would be little basis to infer one is (usually considered as) a chemical change, but not the other. …

The final criterion in Table 1.2 concerns whether bonds are broken and made during a change, and this can only be meaningful for students once they have learnt about particle models of the submicroscopic structure of matter… In a chemical change, there will be the breaking of bonds that hold together the reactants and the formation of new bonds in the products. However, we have to be careful here what we mean by 'bond' …

When ice melts and water boils, 'intermolecular' forces between molecules are disrupted and this includes the breaking of hydrogen 'bonds'. However, when people talk about bond breaking in the context of chemical and physical changes, they tend to mean strong chemical bonds such as covalent, ionic and metallic bonds…

Yet even this is not clear cut. When metals evaporate or are boiled, metallic bonds are broken, although the vapour is not normally considered a different substance. When elements such as carbon and phosphorus undergo phase changes relating to allotropy, there is breaking, and forming, of bonds, which might suggest these changes are chemical and that the different forms of the same elements should be considered different substances. …

A particularly tricky case occurs when we dissolve materials to form solutions, especially materials with ionic bonding…. Dissolving tends to involve small energy changes, and to be readily reversible, and is generally considered a physical change. However, to dissolve an ionic compound such as sodium chloride (table salt), the strong ionic bonds between the sodium and chloride ions have to be overcome (and new bonds must form between the ions and solvent molecules). This would seem to suggest that dissolving can be a chemical change according to the criterion of bond breaking and formation (Table 1.2)."

(Taber, 2012b, pp.31-33)

How to avoid birds of prey

…by taking refuge in the neutral zone


Keith S. Taber


Fact is said to be stranger than (science) fiction

Regular viewers of Star Trek may be under the impression that it is dangerous to enter the neutral zone between the territories claimed by the United Federation of Planets and that of the Romulan Empire in case any incursion results in an attack by a Romulan Bird of Prey.


A bird of prey (with its prey?)
(Image by Thomas Marrone, used by permission – full-size version at the source site here)


However, back here on earth, it may be that entering the neutral zone is actually a way of avoiding an attack by a bird of prey.


A bird of prey (with its prey). Run rabbit, run rabbit…into the neutral zone
(Image by Ralph from Pixabay)

At least, according to the biologist Jakob von Uexküll:

"All the more remarkable is the observation that a neutral zone insinuates itself between the nest and the hunting ground of many raptors, a zone in which they seize no prey at all. Ornithologists must be correct in their assumption that this organisation of the environment was made by Nature in order to keep the raptors from seizing their own young. If, as they say, the nestling becomes a branchling and spends its days hopping from branch to branch near the parental nest, it would easily be in danger of being seized by mistake by its own parents. In this way, it can spend its days free of danger in the neutral zone of the protected area. The protected area is sought out by many songbirds as a nesting and incubation site where they can raise their young free of danger under the protection of the big predator."

Uexküll, 1934/2010

This is a very vivid presentation, but is phrased in a manner I thought deserved a little interrogation. It should, however, be pointed out that this extract is from the English edition of a book translated from the original German text (which itself was originally published almost a century ago).

A text with two authors?

Translation is a process of converting a text from one natural language to another, but every language is somewhat unique regarding its range of words and word meanings. That is, words that are often considered equivalent in different languages may have somewhat different ranges of application in those languages, and different nuances. Sometimes there is no precise translation for a word, and a single word in one language may have several near-equivalents in another (Taber, 2018). Translation therefore involves interpretation and creative choices.

So, translation is a skilled art form, and not simply something that can be done well by algorithmically applying suggestions in a bilingual dictionary. A good translation of an academic text not only requires someone fluent in both languages, but also someone having a sufficient understanding of the topic to translate in the best way to convey the intended meaning rather than simply using the most directly equivalent words. A sequence of the most equivalent individual words may not give the best translation of a sentence, and indeed when translating idioms may lead to a translation with no obvious meaning in the target language. It is worth bearing in mind that any translated text has (in effect) two authors, and reflects choices made by the translator as well as the original author.

Read about the challenges of translation in research writing

I am certainly not suggesting there is anything wrong with the translation of Uexküll's text, but it should be borne in mind that I am commenting on the English language version of the text.

A neutral zone insinuates itself

No it does not.

The language here is surely metaphorical, as it implies a deliberate action by the neutral zone. This seems to anthropomorphise the zone as if it is a human-like actor.

Read about anthropomorphism

The zone is a space. Moreover, it is not a space that is in any way discontinuous with the other space surrounding it – it is a human conception of a region of space with imagined boundaries. The zone is not a sentient agent, so it cannot insinuate itself.

Ornithologists must be correct

Science develops theoretical knowledge which is tested against empirical evidence, but is always (strictly) provisional in that it should be open to revisiting in the light of further evidence. Claims made in scientific discourse should therefore be suitably tentative. Perhaps:

  • ornithologists seem to be correct in suggesting…, or
  • it seems likely that ornithologists were correct when they suggested…or even
  • at present our best understanding reflects the suggestions made by ornithologists that…

Yet a statement that ornithologists must be correct implies a level of certainty and absoluteness that seems inconsistent with a scientific claim.

Read about certainty in accounts of science

The environment was made by Nature in order to…

This phrasing seems to personify Nature as if 'she' is a person. Moreover, this (…in order to…) suggests a purpose in nature. This kind of teleological claim is often considered inappropriate in science as it suggests natural events occur according to some pre-existing plan rather than unfolding according to natural laws. 1 If we consider something happens to achieve a purpose we seem to not need to look for a mechanism in terms of (for example) forces (or entropy or natural selection or…).

Read about personification of nature

Read about teleology in science

Being seized by mistake

We can understand that it would decrease the biological fitness of a raptor to indiscriminately treat its own offspring as potential food. There are situations when animals do eat their young, but clearly any species whose members committed considerable resources to raising a small number of young (e.g., nest building, egg incubation) and yet were also regular consumers of those young would be at a disadvantage when it came to long-term survival.

So, in terms of what increases a species' fitness, avoiding eating your own children would help. If seeking a good 'strategy' to have descendants, then, eating offspring would be a 'mistake'. But the scientific account is not that species, or individual members of a species, seek to deliberately adopt a strategy to have generations of descendants: rather behaviour that tends to lead to descendants is self-selecting.

Just because humans can reflect upon 'our children's children's children', we cannot assume that other species have even the vaguest notions of descendants. (And the state of the world – pollution, deforestation, habitat destruction, nuclear arsenals, soil degradation, unsustainable use of resources, etcetera – strongly suggests that even humans who can conceptualise and potentially care about their descendants have real trouble making that the basis for rational action.)


Even members of the very rare species capable of conceptualising a future for their offspring struggle to develop strategies taking the well-being of future generations into account.
(Image: cover art for 'To our children's children's children' {The Moody Blues}).


Natural selection is sometimes dismissed as merely a tautology, as it seems to be a theory that explains the flourishing of some species (and not others) in terms of their having the qualities to flourish! But this is to examine the wrong level of explanation. Natural selection explains in general terms why, in a particular environment, competing species will tend to survive and leave offspring to different extents. (Then, within that general framework, specific arguments have to be made about why particular features or behaviours contribute to differential fitness in that ecological context.)

Particular evolved behaviours may be labelled as 'strategies' by analogy with human strategies, but this is purely a metaphor: the animal is following instincts, or sometimes learned behaviours, but is not generally following a consciously considered plan intended to lead to some desired outcome in the longer term.

But a reader is likely to read about a nestling being "in danger of being seized by mistake by its own parents" as the birds themselves making a mistake – which implies they have a deliberate plan to catch food, while excluding their own offspring from the food category, and so intend to avoid treating their offspring as prey. That is, it is implied that birds of prey are looking to avoid eating their own, but get it wrong.

Yet, surely, birds are behaving instinctively, and not conceptualising their hunting as a means of acquiring nutrition, where they should discriminate between admissible prey and young relatives. Again this seems to be anthropomorphism, as it treats non-human animals as if they have mental experiences and thought processes akin to humans: "I did not mean to eat my child, I just failed to recognise her, and so made a mistake".

The protected area is sought out

Similarly, the songbirds also behave instinctively. They surely do not 'seek out' the 'protected' area around the nest of a bird of prey. There must be a sense in which they 'learn' (over many generations, perhaps) that they need not fear the raptors when they are near their own nests but it seems unlikely a songbird conceptualises any of this in a way that allows them to deliberately (that is, with deliberation) seek out the neutral zone.

In terms of natural selection, a songbird that has no fear of raptors, and so does not seek to avoid or hide or flee from them, would likely be at a disadvantage, and so tend to leave fewer offspring. Similarly, a songbird that usually avoided birds of prey, but nested in the neutral zone, would have a fitness advantage if other predators (small cats, say) kept clear of the area. The bird would not have to think "hey, I know raptors are generally a hazard, but I'll be okay here as I'm close enough to be in the zone where they do not hunt", as long as the behaviour was heritable (and there was initially variation in the extent to which individuals behaved that way) – as natural selection would automatically lead to it becoming common behaviour.

(In principle, the bird could be responding to some cue in the environment that was a reliable but indirect indicator they were near a raptor nesting site. For example, perhaps building a nest very close to a location where there is a regular depositing of small bones on the ground gives an advantage, so this behaviour increases fitness and so is 'selected'.)

Under the protection of the big predator

Why are the songbirds under the protection of the raptors? Perhaps because other potential predators do not come into the neutral zone as they are vulnerable when approaching this area, even if they would be safe once inside. Again, if this is so, it surely does not reflect a conscious conceptualisation of the neutral zone.

For example, a cat that preys on small birds would experience a different 'Umwelt' from the bird. A small songbird with a nest where it has young experiences the surrounding space differently to a cat (already a larger animal, so experiencing the world at a different scale) that ranges over a substantial territory. Perhaps the songbird perceives the neutral zone as a distinct space, whereas to the cat it is simply an undistinguished part of a wider area where the raptors are regularly seen.

Or, perhaps, for the smaller predator, the area around the neutral zone offers too little cover to risk venturing into the zone. (Again, this does not mean a conscious thinking process along the lines "I'd be safe once I was over there, but I'm not sure I'd make it there as I could easily be seen moving between here and there", but could just be an inherited tendency to keep under cover.)

The birds of prey themselves will not take the songbirds, so the smaller birds are protected from them in the zone, but if this is simply an evolved mechanism that prevents accidental 'infanticide' this can hardly be considered as other birds being under the protection of the birds of prey. Perhaps the birds of prey do scare away other predators – but, if so, this is in no sense a desired outcome of a deliberate policy adopted by the birds of prey because they want to protect their more vulnerable neighbours.

One could understand how the birds of prey might hypothetically have evolved behaviour of not preying on smaller birds (which might include their own offspring) near their nest, but would still attack smaller predators that might threaten their own chicks. In that scenario 2, the birds of prey might have indeed protected nearby songbirds from potential predators (even if only incidentally), but this does not apply if, as Uexküll suggests, "they seize no prey at all" in the neutral zone.

Again, the phrase 'under the protection of the big predator' seems to anthropomorphise the situation, treating the birds of prey as if they are acting deliberately to protect songbirds, and so it needs to be understood metaphorically.

Does language matter?

Uexküll's phrasing offers an engaging narrative which aids in the communication of the idea of the neutral zone to his readers. (He is skilled in making the unfamiliar familiar.) It is easier to understand an abstract idea if it seems to reflect a clear purpose or it can be understood in terms of human ways of thinking and acting, for example:

  • it is important to keep your children safe
  • it is good to look out for your neighbours

But we know that science learners readily tend to accept explanations that are teleological and/or anthropomorphic, and that sometimes (at least) this acts as an impediment to learning the scientific accounts based on natural principles and mechanisms.

Therefore it is useful for science teachers in particular to be alert to such language so they can at least check that learners are seeing beyond the metaphor and not mistaking a good story for a scientific account.


Work cited:

  • Uexküll, J. von (1934/2010). A Foray into the Worlds of Animals and Humans, with A Theory of Meaning (J. D. O'Neil, Trans.). University of Minnesota Press.

Notes:

1 Many people, including some scientists, do believe the world is unfolding according to a pre-ordained plan or scheme. This would normally be considered a matter of religious faith or at least a metaphysical commitment.

The usual stance taken in science ('methodological naturalism'), however, is that scientific explanations must be based on scientific principles, concepts, laws, theories, etcetera, and must not call upon any supernatural causes or explanations. This need not exclude a religious faith in some creator with a plan for the world, as long as the creator is seen to have set up the world to unfold through natural laws and mechanisms. That is, faith-based and scientific accounts and explanations may be considered to work at different levels and to be complementary.

Read more about the relationship between science and religion


2 That this does not seem to be the case might reflect how a flying bird perceives prey – if it has simply evolved to swoop upon and take any object in a certain size range {that we might explain as small enough to be taken, but not so small as not to repay the effort} that matches a certain class of movement pattern {that we might interpret as moving under its own direction and so being animate} then the option of avoiding smaller birds but taking other prey would not be available.

After all, studies show parent birds will try to feed even the most simple representations of a hatchling's open beak – suggesting they do not perceive the difference between their own children and crude models of an open bird mouth.


The general form of a chick's open mouth (as shown by these hatchlings) is enough to trigger feeding behaviour in adult birds.
(Image by Tania Van den Berghen from Pixabay )

Uexküll himself reported that,

"…a very young wild duck was brought to me; it followed me every step. I had the impression that it was my boots that attracted it so, since it also ran occasionally after a black dachshund. I concluded from this that a black moving object was sufficient to replace the image of its mother…"

Uexküll, 1934/2010

(A year later, Lorenz would publish his classic work on imprinting, which reported detailed studies of the same phenomenon.)


A drafted man is like a draft horse because…

A case of analogy in scientific discovery


Keith S. Taber


How is a drafted man like a draft horse (beyond both having been required to give service)?

"The phthisical soldier is to his messmates
what
the glandered horse is to its yoke fellow"

Jean-Antoine Villemin quoted by Goetz, 2013

Analogy in science

I have discussed many examples of analogies in these pages. Often, these are analogies intended to help communicate scientific ideas – to introduce some scientific concept by suggesting it is similar to something already familiar. However, analogy is important in the practice of science itself – not just when teaching about or communicating science to the general public. Scientific discoveries are often made by analogical thinking – perhaps this as-yet-unexplained phenomenon is a bit like that other well-conceptualised phenomenon?

Analogies are more than just similes (simply suggesting that X is like Y; say, that the brain is like a telephone exchange 1) because they are based on an explicit structural mapping. That is, the relationships within one conceptual structure are mapped onto parallel relationships within the other.

So,

  • to say that the atom is a tiny solar system would just be a metaphor, and
  • to simply state that the atom is like a tiny solar system would be a simile;
  • but to say that the atom is like a tiny solar system because both consist of a more massive central body orbited by much less massive bodies would be an analogy. 2
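One way to make the idea of an explicit structural mapping concrete is to write out the paired roles directly. The sketch below is purely my own illustration of the atom/solar-system example, not a formal model from the analogy literature:

```python
# An analogy represented as an explicit structural mapping: each role is
# filled by one element of the familiar domain (solar system) and the
# corresponding element of the target domain (a simple model of the atom).
atom_solar_system = {
    "massive central body": ("sun", "nucleus"),
    "much less massive orbiting bodies": ("planets", "electrons"),
    "central attractive force": ("gravity", "electrostatic attraction"),
}

for role, (familiar, target) in atom_solar_system.items():
    print(f"{familiar} is to the solar system as {target} is to the atom "
          f"(shared role: {role})")
```

(Note 2 below points out where this particular mapping breaks down.)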

Read about analogies in science

A medical science analogy

Thomas Goetz describes how, in the nineteenth century, Jean-Antoine Villemin suspected that the disease known as phthisis (tuberculosis, 'T.B.') was passed between people, and that this tended to occur when people were living in crowded conditions. Villemin was an army surgeon and the disease was very common among soldiers, even though they tended to be drawn from younger, healthier members of the population. (This phenomenon continued into the twentieth century long after the cause of the infection was understood. 3)


Heavy horses: it is not just the workload of draught horses that risks their health 4
(Image by Daniel Borker from Pixabay)


Villemin knew that a horse disease, glanders, was often found to spread among horses that were yoked closely together to work in teams, and he suspected something similar was occurring among the enlisted men due to their living and working in close quarters.

"…Jean-Antoine Villemin, a French army surgeon…in the 1860s conducted a series of experiments testing whether tuberculosis could be transmitted form one animal to another. Villemin's interest began when he observed how tuberculosis seemed to affect young men who moved to the city, even though they were previously healthy in their rural homes. He compared the effect to how glanders, a horse disease, seemed to spread when a team [of horses] was yoked together. "The phthisical soldier is to his messmates what the glandered horse is to its yoke fellow", Villemin conjectured."

Goetz, 2013, p.104

To a modern reader this seems an unremarkable suggestion, but that would be an ahistorical evaluation. Glanders is an infectious disease, and so is tuberculosis, so being in close contact with an infected conspecific is clearly a risk factor for being infected. Yet, when Villemin was practising medicine it was not accepted that tuberculosis was infectious, and infectious agents such as bacteria and viruses had not been identified.

Before the identification of the bacterium Mycobacterium tuberculosis as the infectious agent, there was no specific test to demarcate tuberculosis from other diseases. This mattered as although T.B. tends to especially affect the pulmonary system, it can cause a wide range of problems for an infected person. Scrofula, causing swollen lymph nodes, was historically seen as quite distinct from consumption, recognised by bloody coughing, but these are now both recognised as the results of Mycobacterium tuberculosis infection (when the bacterium moves from the lungs into the lymphatic system it leads to the symptoms of scrofula). The bacterium can spread through the bloodstream to cause systemic disease. However, a person may be infected with the bacterium for years before becoming ill. Before the advent of 'germ theory', and the ability to identify specific 'germs', the modern account of tuberculosis as a complex condition with diverse symptoms caused by a single infectious agent was not at all obvious.

The contexts of discovery and justification

Although the analogy with glanders was suggestive to Villemin, this was just the formation of a hypothesis: that T.B. could be passed from one person to another via some form of material transfer during close contact. The context of discovery was the recognition of an analogy, but the context of justification needed to be the laboratory.

Sacrifices for medical science

The basic method for testing the hypothesis consisted of taking diseased animals (today we would say infected, but that was not yet accepted), excising diseased material from their bodies, or taking samples of tissue from diseased people, and introducing it into the bodies of healthy animals. If the healthy animals quickly showed signs of disease, when similar controls remained healthy, it seemed likely that the transfer of material from the diseased animal was the cause.

Although the microbes responsible for T.B. and similar diseases had not been found, autopsy showed irregularities in diseased bodies. The immune system acts to localise the infection and contain it within tissue nodules, or granulomas, known as 'tubercles'. These tubercles are large enough to be detected and recognised post-mortem.

It was therefore possible to harvest diseased material and introduce it into healthy animals:

"If one shaves a narrow area on the ear of a rabbit or at the groin or on the chest under the elbow of a dog, and then creates a subcutaneous wound so small and so shallow that it does not yield the slightest drop of blood, and then one introduces into this wound, such that it cannot escape, a pinhead-sized packet of tuberculous material obtained from a man, a cow or a rabbit that has already been rendered tuberculous; or if, alternatively, one uses a Pravaz [hypodermic] syringe to instil, under the skin of the animal, a few droplets of sputum from a patient with phthisis…"

Villemin, 1868/2015, p.256

Villemin reports that the tiny wound quickly heals, and then the introduced material cannot be felt beneath the site of introduction. However after a few days:

"a slight swelling is observed, accompanied in some cases by redness and warmth, and one observes the progressive development of a local tubercle of a size between that of a hemp seed and that of a cobnut. When they reach a certain volume, these tubercles generally ulcerate. In some cases, there is an inflammatory reaction…"

Villemin, 1868/2015, p.256

Despite these signs, the animals remain in reasonable health – for a while,

"Only after 15, 20 or 30 days does it become evident that they are losing weight, and have lost their appetite, gaiety and vivacity of movement. Some, after going into decline for a certain period, regain some weight. Others gradually weaken, falling into the doldrums, often suffering from debilitating diarrhoea, finally succumbing to their illness in a state of emaciation."

Villemin, 1868/2015, p.256

In the doldrums

The doldrums refers to oceanic waters within about five degrees of the equator where there are often 'lulls' or calms with no substantial winds. Sailing ships relied on winds to make progress, and ships that were in the doldrums might be becalmed for extended periods, and so unable to make progress, leaving crews listless and frustrated – and possibly running out of essential supplies.

"Down dropt the breeze, the sails dropt down, 'Twas sad as sad could be; And we did speak only to break The silence of the sea! 

All in a hot and copper sky, The bloody Sun, at noon, Right up above the mast did stand, No bigger than the Moon. 

Day after day, day after day, We stuck, nor breath nor motion; As idle as a painted ship Upon a painted ocean. 

Water, water, every where, And all the boards did shrink; Water, water, every where, Nor any drop to drink." 

Extract from The Rime of the Ancient Mariner, 1834, Samuel Taylor Coleridge

So, the inoculated animals 'fell into the doldrums', metaphorically speaking.

Read about metaphors in science


Under a hot and copper sky
(Image by Youssef Jheir from Pixabay)

The needs of the many are outweighed by the needs of humans

It was widely considered entirely acceptable to sacrifice the lives and well-being of animals in this way, to generate knowledge that it was hoped might help reduce human suffering. 'Animal rights' had not become a mainstream cause (even if animals had occasionally been subject to legal prosecution, and sometimes found guilty, in European courts – suggesting they had responsibilities if not rights).

Similar experiments were carried out by Robert Koch in his own investigations of T.B. and other diseases soon after. Indeed, Goetz notes that when Koch was working on anthrax in 1875,

"As Koch's experiments went on, his backyard menagerie began to thin out; his daughter, Getrud, grew concerned that she was losing all her pets."
p.27

Goetz, 2013, p.27

"Let us hope that daddy can draw conclusions from his experiments soon…"
(Image by Adina Voicu from Pixabay )

Although animals are still used in medical research today, there is much more concern about their welfare, and researchers are expected to avoid causing the suffering or death of any more animals than is considered strictly necessary. 5 Wherever possible, alternatives to animal experimentation are preferred.

Inadmissible analogies?

One of the arguments made against animal studies is that, as different species are by definition different in their anatomy and physiology, non-human animals are imperfect models for human disease processes. One argument that Villemin faced was that his inoculations between animals were most successful in rabbits, when, it was claimed, rabbits were widely tubercular in the normal population. In other words, it was suggested that Villemin only found evidence of disease in his inoculated test animals because they probably already had the disease anyway.

That suggests the need for some sort of experimental control, and Villemin reported that

"…despite routine sequestration and the tortures that the vivisectionists force them to endure, rabbits are almost never tuberculous. I have explored more than a hundred lungs from these rodents from markets and I found none to be tuberculous."

Villemin, 1868/2015, p.257

Indirect evidence

Villemin had made an analogy from disease transfer between horses to disease transfer between humans. His experiments did not directly test disease transfer between humans – as that would have been considered unethical ("absolutely forbidden") even at a time when animal (i.e., non-human animal) research was not widely questioned:

I believe that I have experimentally demonstrated that phthisis, like syphilis and glanders, is communicable by inoculation. It can be inoculated from humans to certain animals, and from these animals to others of the same species. Can it be inoculated between humans? It is absolutely forbidden for us to provide experimental proof of this, but all the evidence is in favour of an affirmative response.

Villemin, 1868/2015, p.265

So, Villemin did not demonstrate that T.B. could be transferred between people, but only that analogous transfers occurred. So, in a sense, the context of justification, as well as the context of discovery, relied on analogies. Despite this, the indirect evidence was strong, and Villemin's failure to persuade most of the wider scientific community likely reflected the general paradigmatic beliefs of the time: that disease was caused by hereditary weakness, or by broad environmental conditions, rather than by minute amounts of material transferred between bodies.


Mycobacterium tuberculosis – the infectious agent in tuberculosis – could only be detected once suitable microscopes were available – Koch published his discovery of the bacterium in 1882.

(source: Wikipedia Commons)


Koch was able to be more persuasive because he was also able to actually identify a microbe present in diseased bodies, as well as to show that inoculation led to the microbe being found in the inoculated animal. That shift in thinking required the acceptance of a different kind of analogy: that the presence, or absence, of a bacterium in the tissues mapped onto being infected with, or free from, a disease.

Mycobacterium tuberculosis – a microscopic 'germ', only visible under the microscope – maps onto tuberculosis – a widespread and often fatal disease of people and other mammals:

  • present in tissues ↕︎ infected
  • absent in tissues ↕︎ not infected

In a sense, diagnosis through microbiological methods relies on a kind of analogy

Sources cited:
  • Daniel, T. M. (2015). Jean-Antoine Villemin and the infectious nature of tuberculosis. The International Journal of Tuberculosis and Lung Disease, 19(3), 267-268. https://doi.org/10.5588/ijtld.06.0636
  • Frith, J. (2014). History of Tuberculosis. Part 1 – Phthisis, consumption and the White Plague. Journal of Military and Veterans' Health, 22(2), 29-35.
  • Goetz, T. (2013). The Remedy. Robert Koch, Arthur Conan Doyle, and the quest to cure tuberculosis. Gotham Books.
  • Surget, A. (2022). Being between Scylla and Charybdis: designing animal studies in neurosciences and psychiatry – too ethical to be ethical? In Seminar series: Berlin-Bordeaux Working Group Translating Validity in Psychiatric Research.
  • Taber, K. S. (2013). Upper Secondary Students' Understanding of the Basic Physical Interactions in Analogous Atomic and Solar Systems. Research in Science Education, 43(4), 1377-1406. doi:10.1007/s11165-012-9312-3
  • Villemin, J. A. (1868/2015). On the virulence and specificity of tuberculosis [De la virulence et de la spécificité de la tuberculose]. The International Journal of Tuberculosis and Lung Disease, 19(3), 256-266. https://doi.org/10.5588/ijtld.06.0636-v

Notes

1 As analogies link to what is familiar, they tend to reflect cultural contexts. At one time the mind was referred to as being like a slate. The once-common comparison of the brain to a telephone exchange has now been largely displaced by the comparison to a computer.


2 Whilst this is a common teaching analogy, it is also problematic if it is taught without considering the negative aspects of the analogy (e.g., electrons repel each other, unlike planets; planets vary in mass, etc.), and if the target concept is not clearly presented as one (simplified) model of atomic structure. See Taber, 2013.


3 "During both World War I and World War II in the US Army, tuberculosis was the leading cause of discharge [i.e., from the service]. Annual incidence of tuberculosis in the military of Western countries is very low, however in the last several decades microepidemics have occurred in small close knit units on US and British Naval warships and land based units deployed overseas. Living and working in close quarters and overseas deployment to tuberculosis-endemic areas of the world such as Afghanistan, Iraq and South-East Asia remain significant risk factors for tuberculosis infection in military personnel, particularly multidrug resistant tuberculosis."

Frith, 2014, p.29

4 Some horses have been bred to be fast runners, and others to be capable of pulling heavy loads. (That is, some have been artificially selected to be like sprinters or cyclists, and others to be like weightlifters or shot-putters.) The latter are variously called draft (U.S. spelling) / draught (British spelling) horses, dray horses, carthorses, work horses or heavy horses. When a load was too heavy to be moved by a single horse, several would be harnessed together into a team – providing more power. Ironically, the term 'horsepower' was popularised by James Watt – whose name has since been given to the modern international (S.I.) unit of power – in marketing his steam engines. According to the Institute of Physics,

Whilst the peak mechanical power of a single horse can reach up to 15 horsepower, it is estimated that a typical horse can only sustain an output of 1 horsepower (746 W) for three hours and, if working for an eight-hour day, a horse might output only three quarters of one horsepower. 

https://spark.iop.org/why-one-horsepower-more-power-one-horse
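As a quick arithmetical check on the figures quoted above (taking 1 horsepower as 746 W, as in the quotation):

```python
# Back-of-envelope check of the IOP figures quoted above.
HP_IN_WATTS = 746

peak_output = 15 * HP_IN_WATTS        # quoted peak of a single horse
sustained_3_hours = 1 * HP_IN_WATTS   # sustainable for about three hours
eight_hour_day = 0.75 * HP_IN_WATTS   # typical over a working day

print(f"Peak: {peak_output / 1000:.1f} kW")       # ~11.2 kW
print(f"Sustained (3 h): {sustained_3_hours} W")  # 746 W
print(f"Eight-hour day: {eight_hour_day:.0f} W")  # ~560 W

# Energy actually delivered over an eight-hour working day:
print(f"Day's work: {eight_hour_day * 8 / 1000:.1f} kWh")  # ~4.5 kWh
```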

5 Alexandre Surget (Associate Professor at University of Tours, France) has even argued that the guidelines adopted in animal experiments are sometimes counter-productive as they encourage experiments with too few animals, and consequently too little statistical power, to support robust conclusions – in effect sacrificing animals without reasonable expectations of securing sound knowledge (Surget, 2022).

Any research that makes demands of resources and the input of others, but which is designed in such a way that it is unlikely to produce reliable new knowledge, can be considered unethical.
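Surget's point about statistical power can be illustrated with a standard sample-size calculation. The sketch below uses the common normal-approximation formula for comparing two group means; the effect sizes are assumed values, chosen only to show how many animals are needed before a study has a reasonable chance of detecting a real effect:

```python
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate animals needed per group for a two-group comparison
    (normal-approximation formula for a two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# For a moderate (assumed) standardised effect of d = 0.5, roughly 63
# animals per group are needed for 80% power; even a large effect
# (d = 0.8) needs about 25 per group. A study run with only a handful
# of animals per group would have little chance of detecting anything.
print(round(n_per_group(0.5)))  # ~63
print(round(n_per_group(0.8)))  # ~25
```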

Read about research ethics


Out of the womb of darkness

Medical ethics in 20th Century movies


Keith S. Taber


The hero of the film, Dr Holden, is presented as a scientist. Here he is trying to collect some data.
(still from 'The Night of the Demon')

"The Night of the Demon" is a 1957 British film about an American professor who visits England to investigate a supposed satanic cult. It was just shown on English television. It was considered as a horror film at the time of its release, although the short scenes that actually feature a (supposedly real? merely imagined? *) monster are laughable today (think Star Trek's Gorn in the original series, and consider if it is believable as anything other than an actor wearing a lizard suit – and you get the level of horror involved). [*Apparently the director, Jacques Tourneur, never intended a demon to be shown, but the film's producer decided to add footage showing the monster in the opening scenes, potentially undermining the whole point of the film: but giving the publicity department something they could work with. 6]


A real scary demon (in 1959) and a convincing alien (in 1967)?
(stills from 'The Night of the Demon' and ' Star Trek' episode 'Arena')

The film's protagonist is a psychologist, Dr. John Holden, who dismisses stories of demons and witchcraft and the like, and has made a career studying people's beliefs about such superstitions. Dr Holden's visit to Britain was timed to coincide with a conference at which he was to present; it also coincided, by chance, with the death of one of his colleagues (who had been subject to a hex for investigating the cult).


'Night of the Demon' (Dir.  Jacques Tourneur) movie poster: Sabre Film Production.
[As was common at the time, although the film was in monochrome, the publicity was coloured. Whether the colour painting of the monster looks even less scary than the version in the film itself is a moot point.]

The film works much better as a kind of psychological thriller examining the power of beliefs than as horror. (Director: 1 – Producer: 0.) That, if we believe something enough, it can have real effects is well acknowledged – but this does not need a supernatural explanation. People can be 'scared' to death by what they imagine, and how they respond to their fears. Researchers expecting a positive outcome from their research are likely to inadvertently behave in ways that lead to this very result: thus the use of double-blind studies in medical trials, so that the researchers do not know which patients are receiving which treatment.

Read about expectancy effects in research

While the modern viewer will find little suspense in the film, I did, metaphorically at least, 'recoil with shock' from one moment of 'horror'. At the conference a patient (Rand Hobart) is wheeled in on a trolley – someone suspected of having committed a murder associated with the cult, whom the authorities had allowed to be questioned by the researchers…at the conference.


"The authorities have lent me this suspected murderer for the benefit of dramatic effect and for plot development purposes"
(still from 'The Night of the Demon').

A variety of movie posters were produced for the film 6 – arguably this one reflects the genuinely horrific aspect of the story. To a modern viewer this might also appear the most honest representation of the film, as the demon given prominence in some versions of the poster barely features in the film.

Holden's British colleague, Professor O'Brien, explains to the delegates,

"For a period of time this man has been as you see him here. He fails to respond to any normal stimulation. His experience, whatever it was, which we hope here to discover, has left him in a state of absolute catatonic immobility. When I first investigated this case, the problem of how to hypnotise an unresponsive person was the major one. Now the proceedings may be somewhat dramatic, but they are necessary. The only way of bringing his mind out of the womb of darkness into which it has retreated to protect itself, is by therapeutic shock, electrical or chemical. For our purposes we are today using pentothal [? 1] and later methylamphetamine."

Introducing a demonstration of non-consensual use of drugs on a prisoner/patient

"Okay, we'll give him a barbiturate, then we'll hypnotise him, then a stimulant, and if that does not kill him, surely he will simply, calmly and rationally, tell us what so traumatised him that he has completely withdrawn into his subconscious."
(Still from 'The Night of the Demon')


After an injection, Hobart comes out of his catatonic state, becomes aware of his surroundings, and panics.

The dignity of the accused: Hobart is forced out of his 'state of absolute catatonic immobility' to discover he is an exhibit at a scientific conference.
(Still from 'The Night of the Demon'.)

He is physically restrained, and examined by Holden (supposedly the 'hero' of the piece), who then hypnotises him.



He is then given an injection of methylamphetamine before being questioned by O'Brien and Holden. He becomes agitated (what, after being forcibly given 'speed'?), breaks free, and leaps, out of a conveniently placed window, to his death.

Now, of course, this is all just fiction – a story. No one is really drugged, and Hobart is played by an actor who is unharmed. (I can be fairly sure of that, as the part was played by Brian Wilde, who much later turned up alive and well as prison officer 'Mr Barrowclough' in BBC's Ronnie Barker vehicle 'Porridge'.)


The magic of the movies – people do not stay dead, and there are no professional misconduct charges brought against our hero.
(Stills from 'The Night of the Demon' and from BBC series 'Porridge'. 3)

Yet this is not some fantastical film (the Gorn's distant cousin aside) but played for realism. Would a psychiatric patient and murder suspect have been released to be paraded and demonstrated at a conference on the paranormal in 1957? I expect not. Would the presenters have been allowed to drug Hobart without his consent?

Read about voluntary, informed, consent

An adult cannot normally be medicated without their consent unless they are considered to lack the ability to make responsible decisions for themselves. Today, it might be possible to give a patient drugs without consent if they have been sectioned under the Mental Health Act (1983) and the action is considered necessary for their safety or for the safety of others. Hobart was certainly not an immediate threat to anyone before he was brought out of his trance.

However, even if this enforced use of drugs was sanctioned, this would not be done in a public place with dozens of onlookers. 4 And it would not be done (in the U.K. at least!) simply to question someone about a crime. 5 Presumably, the makers of the film either thought that this scene reflected something quite reasonable, or, at least, that the cinema-going public would find this sufficiently feasible to suspend disbelief. If this fictitious episode did not reflect acceptable ethical standards at the time, it would seem to tell us something about public perceptions of the attitude of those in authority (whether the actual authorities who were meant to have a duty of care to a person under arrest, or those designated with professional roles and academic titles) to human rights.

Today, however, professionals such as researchers, doctors, and even teachers, are prepared for their work with a strong emphasis on professional ethics. In medical care, the interest of the patient themselves comes first. In research, informants are voluntary participants in our studies, who offer us the gift of data, and are not subjects of our enquiries to be treated simply as available material for our work.

Yet, actually, this is largely a modern perspective that has developed in recent decades. Sadly, there are many real stories, even in living memory, of professionals deciding that people (and this usually meant people with less standing or power in their society) should be drugged, or shocked, or operated on, without their consent and even against their explicit wishes – for what was seen as their own good, or even for some judged greater good – in circumstances that would be totally unacceptable in most countries these days.

So, although this is not really a horror film by today's measures, I hope any other researchers (or medical practitioners) who were watching the film shared my own reaction to this scene: 'no, they cannot do that!'

At least, they could not do that today.

Read about research ethics


Notes

1 This sounds to me like 'pentatyl', but I could not find any reference to a therapeutic drug of that name. Fentanyl is a powerful analgesic, which, like amphetamines, is abused recreationally – but it was only introduced into practice the year after the film was made. The reference was most likely to sodium thiopental, known as pentothal, and much used (in movies and television, at least) as a truth serum. 5 As it is a barbiturate, and so is used in anaesthesia, it does not seem an obvious drug of choice for waking someone from a catatonic state.


2 The script is based loosely on a 1911 M. R. James short story, 'Casting the Runes' that does not include the episode discussed.


3 I have flipped this image (as can be seen from the newspaper) to put Wilde (playing alongside Ronnie Barker, standing, and Richard Beckinsale) on the right hand side of the picture.


4 Which is not to claim that such a public demonstration would have been unlikely at another time and place. Execution was still used in the U.K. until 1964 (during my lifetime), although by that time being found guilty of vagrancy (being unemployed and hanging around {unfortunate pun unintended}) for the second time was no longer a capital offence. However, after 1868 executions were no longer carried out in public.

It was not unknown for the corpses of executed criminals to be subject to public dissection in Renaissance [sic, ironically] Europe.


5 Fiction, of course, has myriad scenes where 'truth drugs' are used to obtain secrets from prisoners – but usually those carrying out the torture are the 'bad guys', either criminals or agents of what is represented in the story as an enemy or dystopian state.


6 Some variations on a theme. (For some reason, for its slightly cut U.S. release 'The Night of the Demon' was called 'The Curse of the Demon'.) The various representations of the demon and the prominence given to it seem odd to a modern viewer given how little the demon actually features in the film.

The references to actually seeing demons and monsters from hell on the screen, "the most terrifying story ever told", and "scenes of terror never before imagined" raise the question of whether the copywriters were expected to watch a film before producing their copy.

Passive learners in unethical control conditions

When 'direct instruction' just becomes poor instruction


Keith S. Taber


An experiment that has been set up to ensure the control condition fails, and so compares an innovation with a substandard teaching condition, can – at best – only show the innovation is not as bad as the substandard teaching

One of the things which angers me when I read research papers is finding examples of what I think of as 'rhetorical research' that use unethical control conditions (Taber, 2019). That is, educational research which sets up one group of students to be taught in a way that clearly disadvantages them, in order to ensure the success of an experimental teaching approach,

"I am suggesting that some of the experimental studies reported in the literature are rhetorical in the … sense that the researchers clearly expect to demonstrate a well- established effect, albeit in a specific context where it has not previously been demonstrated. The general form of the question 'will this much-tested teaching approach also work here' is clearly set up expecting the answer 'yes'. Indeed, control conditions may be chosen to give the experiment the best possible chance of producing a positive outcome for the experimental treatment."

Taber, 2019, p.108

This irks me for two reasons. The first, obviously, is that researchers have been prepared to (ab)use learners as 'data fodder' and subject them to poor learning contexts in order to have the best chance of getting positive results for the innovation supposedly being 'tested'. However, it also annoys me as this is inherently a poor research design (and so a poor use of resources), as it severely limits what can be found out. An experiment that compares an innovation with a substandard teaching condition can, at best, show the innovation is not as ineffective as the substandard teaching in the control condition; but it cannot tell us if the innovation is at least as effective as existing good practice.

This irritation is compounded when the work I am reading is not some amateur report thrown together for a predatory journal, but an otherwise serious study published in a good research outlet. That was certainly the case for a paper I read today in Research in Science Education (the journal of the Australasian Science Education Research Association) on problem-based learning (Tarhan, Ayar-Kayali, Urek & Acar, 2008).

Rhetorical studies?

Genuine research is undertaken to find something out. The researchers in this enquiry claim:

"This research study aims to examine the effectiveness of a [sic] problem-based learning [PbBL] on 9th grade students' understanding of intermolecular forces (dipole- dipole forces, London dispersion forces and hydrogen bonding)."

Tarhan, et al., 2008, p.285

But they choose to compare PbBL with a teaching approach that they expect to be ineffective. Here the researchers might have asked "how does teaching year 9 students about intermolecular forces through problem-based learning compare with current good practice?" After all, even if PbBL worked quite well, if it is not quite as effective as the way teachers are currently teaching the topic then, all other things being equal, there is no reason to shift to it; whereas if it outperforms even our best current approaches, then there is a reason to recommend it to teachers and roll out associated professional development opportunities.


Problem-based learning (third column) uses a problem (i.e., a task which cannot be solved simply by recalling prior learning or employing an algorithmic routine) as the focus and motivation for learning about a topic

Of course, that over-simplifies the situation, as in education, 'all other things' never are equal (every school, class, teacher…is unique). An approach that works best on average will not work best everywhere. But knowing what works best on average (that is, taken across the diverse range of teaching and learning contexts) is certainly a very useful starting point when teachers want to consider what might work best in their own classrooms.

Rhetorical research is poor research, as it is set up (deliberately or inadvertently) to demonstrate a particular outcome, and, so, has built-in bias. In the case of experimental studies, this often means choosing an ineffective instructional approach for the comparison class. Why else would researchers select a control condition they know is not suitable for bringing about the educational outcomes they are testing for?

Problem-Based Learning in a 9th Grade Chemistry Class

Tarhan and colleagues' study was undertaken in one school with 78 students divided into two groups. One group was taught through a sequence based on problem-based learning that involved students undertaking research in groups, gently supported and steered by the teacher. The approach allowed student dialogue, which is believed to be valuable in learning, and motivated students to be actively engaged in enquiry. When such an approach is well judged, it has the potential to count as 'scaffolding' of learning. This seems a very worthwhile innovation – well worth developing and evaluating.

Of course, work in one school cannot be assumed to generalise elsewhere, and small-scale experimental work of this kind is open to major threats to validity, such as expectancy effects and researcher bias – but this is unfortunately always true of these kinds of studies (which are often all educational researchers are resourced to carry out). Finding out what works best in some educational context at least potentially contributes to building up an overall picture (Taber, 2019). 1

Why is this rhetorical research?

I consider this rhetorical research because of the claims the authors make at the start of the study:

"Research in science education therefore has focused on applying active learning techniques, which ensure the affective construction of knowledge, prevent the formation of alternate conceptions, and remedy existing alternate conceptions…Other studies suggest that active learning methods increase learning achievement by requiring students to play a more active role in the learning process…According to active learning principles, which emphasise constructivism, students must engage in researching, reasoning, critical thinking, decision making, analysis and synthesis during construction of their knowledge."

Tarhan, et al., 2008, pp.285-286

If they genuinely believed that, then to test the effectiveness of their PbBL activity, Tarhan and colleagues needed to compare it with some other teaching condition that they are confident can "ensure the affective construction of knowledge, prevent the formation of alternate conceptions, and remedy existing alternate conceptions… requir[e] students to play a more active role in the learning process…[and] engage in researching, reasoning, critical thinking, decision making, analysis and synthesis during construction of their knowledge." A failure to do that means that the 'experiment' has been biased – it has been set up to ensure the control condition fails.

Unethical research?

"In most educational research experiments of [this] type…potential harm is likely to be limited to subjecting students (and teachers) to conditions where teaching may be less effective, and perhaps demotivating. This may happen in experimental treatments with genuine innovations (given the nature of research). It can also potentially occur in control conditions if students are subjected to teaching inputs of low effectiveness when better alternatives were available. This may be judged only a modest level of harm, but – given that the whole purpose of experiments to test teaching innovations is to facilitate improvements in teaching effectiveness – this possibility should be taken seriously."

Taber, 2019, p.94

The same teacher taught both classes: "Both of the groups were taught by the same chemistry teacher, who was experienced in active learning and PbBL" (p.288). This would seem to reduce the 'teacher effect' – outcomes being affected because the teacher of one class is more effective than the teacher of another. (Reduce, rather than eliminate, as different teachers have different styles, skills, and varied expertise: so, most teachers are more suited to, and competent in, some teaching approaches than others.)

So, this teacher was certainly capable of teaching in the ways that Tarhan and colleagues claim as necessary for effective learning ("active learning techniques"). However, the control condition sets up the opposite of active learning, so-called passive learning:

"In this study, the control group was taught the same topics as the experimental group using a teacher-centred traditional didactic lecture format. Teaching strategies were dependent on teacher expression and question-answer format. However, students were passive participants during the lessons and they only listened and took notes as the teacher lectured on the content.

The lesson was begun with teacher explanation about polar and nonpolar covalent bonding. She defined formation of dipole-dipole forces between polar molecules. She explained that because of the difference in electronegativities between the H and Cl atoms for HCl molecule is 0.9, they are polar molecules and there are dipole-dipole forces between HCl molecules. She also stated that the intermolecular dipole-dipole forces are weaker than intramolecular bonds such as covalent and ionic bonding. She gave the example of vaporisation and decomposition of HCl. She explained that while 16 kJ/mol of energy is needed to overcome the intermolecular attraction between HCl molecules in liquid HCl during vaporisation process of HCl, 431 kJ/mol of energy is required to break the covalent bond between the H and Cl atoms in the HCl molecule. In the other lesson, the teacher reminded the students of dipole-dipole forces and then considered London dispersion forces as weak intermolecular forces that arise from the attractive force between instantaneous dipole in nonpolar molecules. She gave the examples of F2, Cl2, Br2, I2 and said that because the differences in electronegativity for these examples are zero, these molecules are non-polar and had intermolecular London dispersion forces. The effects of molecular size and mass on the strengths of London dispersion forces were discussed on the same examples. She compared the strengths of dipole-dipole forces and London dispersion forces by explaining the differences in melting and boiling points for polar (MgO, HCl and NO) and non-polar molecules (F2, Cl2, Br2, and I2). The teacher classified London dispersion forces and dipole-dipole as van der Waals forces, and indicated that there are both London dispersion forces and dipole-dipole forces between polar molecules and only London dispersion forces between nonpolar molecules. In the last lesson, teacher called attention to the differences in boiling points of H2O and H2S and defined hydrogen bonds as the other intermolecular forces besides dipole-dipole and London dispersion forces. Strengths of hydrogen bonds depending on molecular properties were explained and compared in HF, NH3 and H2O. She gave some examples of intermolecular forces in daily life. The lesson was concluded with a comparison of intermolecular forces with each other and intramolecular forces."

Tarhan, et al., 2008, p.293
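(As an aside: the chemical logic rehearsed in this lecture script amounts to a simple decision rule, which might be paraphrased as in the sketch below. This is my own summary with illustrative values, not material from the study, and it glosses over complications such as ionic bonding at large electronegativity differences.)

```python
# A paraphrase of the decision rule implicit in the lecture script:
# all molecules experience London dispersion forces; a nonzero
# electronegativity difference (e.g. 0.9 for H-Cl) makes a simple
# molecule polar, adding dipole-dipole forces; H bonded to N, O or F
# (as in HF, NH3, H2O) adds hydrogen bonding.
def intermolecular_forces(delta_electronegativity, h_bonded_to_n_o_f=False):
    forces = ["London dispersion"]
    if delta_electronegativity > 0:
        forces.append("dipole-dipole")
    if h_bonded_to_n_o_f:
        forces.append("hydrogen bonding")
    return forces

print(intermolecular_forces(0.0))                          # Cl2: dispersion only
print(intermolecular_forces(0.9))                          # HCl: + dipole-dipole
print(intermolecular_forces(1.8, h_bonded_to_n_o_f=True))  # HF: + hydrogen bonding
```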

Lecturing is not ideal for teaching university students. It is generally not suitable for teaching school children (and it is not consistent with what is expected in Turkish schools).

This was a lost opportunity to seriously evaluate teaching through PbBL by comparing it with teaching that followed the national policy recommendations. Moreover, it was a dereliction of educators' duty never to deliberately disadvantage learners. It is reasonable to experiment with children's learning when you feel there is a good chance of positive outcomes: it is not acceptable to deliberately set up learners to fail (e.g., by organising 'passive' learning when you claim to believe effective learning activities are necessarily 'active').

Isn't this 'direct instruction'?

Now, perhaps the account of the teaching given by Tarhan and colleagues might seem to fit the label of 'direct teaching'. Whilst Tarhan et al. claim constructivist teaching is clearly necessary for effective learning, there are some educators who claim that constructivist approaches are inferior, and a more direct approach, 'direct instruction', is more likely to lead to learning gains.

This has been a lively debate, but often the various commentators use terminology differently and argue across each other (Taber, 2010). The proponents of direct instruction often criticise teaching that expects learners to take nearly all the responsibility for learning, with minimal teacher support. I would also criticise that (except perhaps in the case of graduate research students once they have demonstrated their competence, including knowing when to seek supervisory guidance). That is quite unlike genuine constructivist teaching which is optimally guided (Taber, 2011): where the teacher manages activities, constantly monitors learner progress, and intervenes with various forms of direction and support as needed. Tarhan and colleagues' description of their problem-based learning experimental condition appears to have had this kind of guidance:

"The teacher visited each group briefly, and steered students appropriately by using some guiding questions and encouraging them to generate their hypothesis. The teacher also stimulated the students to gain more information on topics such as the polar structure of molecules, differences in electronegativity, electron number, atom size and the relationship between these parameters and melting-boiling points…The teacher encouraged students to discuss the differences in melting and boiling points for polar and non-polar molecules. The students came up with [their] research questions under the guidance of the teacher…"

Tarhan, et al., 2008, pp.290-291

By contrast, descriptions of effective direct instruction do involve tightly planned teaching with carefully scripted teacher moves of the kind quoted in the account, above, of the control condition. (But any wise teacher knows that lessons can only be scripted as a provisional plan: the teacher has to constantly check the learners are making sense of teaching as intended, and must be prepared to change pace, repeat sections, re-order or substitute activities, invent new analogies and examples, and so forth.)

However, this instruction is not simply a one-way transfer of information, but rather a teacher-led process that engages students in active learning to process the material being introduced by the teacher. If this is done by breaking the material into manageable learning quanta, each of which students engage with in dialogic learning activities before proceeding to the next, then this is constructivist teaching (even if it may also be considered by some as 'direct instruction'!)


Effective teaching moves between teacher input and student activities and is not just the teacher communicating information to the learners.

By contrast, the lecture format adopted by Tarhan's team was based on the teacher offering a multi-step argument (delivered over several lessons) and asking the learners to follow and retain an extensive presentation.

"The lesson was begun with teacher explanation …

She defined …

She explained…

She also stated…

She gave the example …

She explained that …

the teacher reminded the students …

She gave the examples of …

She compared…

The teacher classified …

and indicated that …

[the] teacher called attention to …

She gave some examples of …"

Tarhan, et al., 2008, p.293

This is a description of the transmission of information through a communication channel: not an account of teaching which engages with students' thinking and guides them to new understandings.

Ethical review

Despite the paper having been published in a major journal, Research in Science Education, there seems to be no mention of the study design having been through any kind of institutional ethical review before the research began. Moreover, there is no reference to the learners, or their parents/guardians, having been asked for, or having given, voluntary, informed, consent, as is usually required in research with human participants. Indeed, Tarhan and colleagues refer to the children as the 'subjects' of their research, not participants in their study.

Perhaps ethical review was not expected in the national context (at least, in 2008). Certainly, it is difficult to imagine how voluntary, informed, consent would be obtained if parents were to be informed that half of the learners would be deliberately subject to a teaching approach the researchers claim lacks any of the features "students must engage in…during construction of their knowledge".

PbBL is better than…deliberately teaching in a way designed to limit learning

Tarhan and colleagues, unsurprisingly, report that on a post-test the students who were taught through PbBL out-performed those students who were lectured at. It would have been very surprising (and so potentially more interesting, and, perhaps, even useful, research!) had they found anything else, given the way the research was biased.

So, to summarise:

  1. At the outset of the paper it is reported that it is already established that effective learning requires students to engage in active learning tasks.
  2. Students in the experimental conditions undertook learning through a PbBL sequence designed to engage them in active learning.
  3. Students in the control condition were subject to a sequence of lecturing inputs designed to ensure they were passive.
  4. Students in the active learning condition outperformed the students in the passive learning condition.

Which I suggest can be considered both rhetorical research, and unethical.


The study can be considered both rhetorical and unfair to the learners assigned to be in the control group

Read about rhetorical experiments

Read about unethical control conditions


Work cited:

Note:

1 There is a major issue which is often ignored in studies of this type (where a pedagogical innovation is trialled in a single school area, school or classroom). Finding that problem-based learning (or whatever) is effective in one school, when teaching one topic to one year group, does not allow us to generalise to other classrooms, schools, countries, educational levels, topics and disciplines.

Indeed, as every school, every teacher, every class, etc., is unique in some ways, it might be argued that one only really finds out if an approach will work well 'here' by trying it out 'here' – and whether it is universally applicable by trying it everywhere. Clearly academic researchers cannot carry out such a programme, but individual teachers and departments can try out promising approaches for themselves (i.e., context-directed research, such as 'action research').

We might ask if there is any point in researchers carrying out studies of the type discussed in this article – where they start by saying an approach has been widely demonstrated, and then test it in what seems an arbitrarily chosen (or, more likely, convenient) curriculum and classroom context – given that we cannot generalise from individual studies, and it is not viable to test every possible context.

However, there are some sensible guidelines for how a series of such studies into the same type of pedagogic innovation in different contexts can be more useful, by (a) helping to determine the range of contexts where an approach is effective (through what we might call 'incremental generalisation'), and (b) documenting the research contexts in sufficient detail to support readers in making judgements about the degree of similarity with their own teaching contexts (Taber, 2019).

Read about replication studies

Read about incremental generalisation

Falsifying research conclusions

You do not need to falsify your results if you are happy to draw conclusions contrary to the outcome of your data analysis.


Keith S. Taber


Li and colleagues claim that their innovation is successful in improving teaching quality and student learning: but their own data analysis does not support this.

I recently read a research study to evaluate a teaching innovation where the authors

  • presented their results,
  • reported the statistical test they had used to analyse their results,
  • acknowledged that the outcome of their experiment was negative (not statistically significant), then
  • stated their findings as having obtained a positive outcome, and
  • concluded their paper by arguing they had demonstrated their teaching innovation was effective.

Li, Ouyang, Xu and Zhang's (2022) paper in the Journal of Chemical Education contravenes the scientific norm that your conclusions should be consistent with the outcome of your data analysis.
(Magnified portions of this scheme are presented below)

And this was not in a paper in one of those predatory journals that I have criticised so often here – this was a study in a well regarded journal published by a learned scientific society!

The legal analogy

I have suggested (Taber, 2013) that writing up research can be understood in terms of a number of metaphoric roles: researchers need to

  • tell the story of their research;
  • teach readers about the unfamiliar aspects of their work;
  • make a case for the knowledge claims they make.

Three metaphors for writing-up research

All three aspects are important in making a paper accessible and useful to readers, but arguably the most important aspect is the 'legal' analogy: a research paper is an argument to make a claim for new public knowledge. A paper that does not make its case does not add anything of substance to the literature.

Imagine a criminal case where the prosecution seeks to make its argument at a pre-trial hearing:

"The police found fingerprints and D.N.A. evidence at the scene, which they believe were from the accused."

"Were these traces sent for forensic analysis?"

"Of course. The laboratory undertook the standard tests to identify who left these traces."

"And what did these analyses reveal?"

"Well according to the current standards that are widely accepted in the field, the laboratory was unable to find a definite match between the material collected at the scene, and fingerprints and a D.N.A. sample provided by the defendant."

"And what did the police conclude from these findings?"

"The police concluded that the fingerprints and D.N.A. evidence show that the accused was at the scene of the crime."

It seems unlikely that such a scenario has ever played out, at least in any democratic country where there is an independent judiciary, as the prosecution would be open to ridicule and it is quite likely the judge would have some comments about wasting court time. What would seem even more remarkable, however, would be if the judge decided on the basis of this presentation that there was a prima facie case to answer that should proceed to a full jury trial.

Yet in educational research, it seems parallel logic can be persuasive enough to get a paper published in a good peer-reviewed journal.

Testing an educational innovation

The paper was entitled 'Implementation of the Student-Centered Team-Based Learning Teaching Method in a Medicinal Chemistry Curriculum' (Li, Ouyang, Xu & Zhang, 2022), and it was published in the Journal of Chemical Education. 'J.Chem.Ed.' is a well-established, highly respected periodical that takes peer review seriously. It is published by a learned scientific society – the American Chemical Society.

That a study published in such a prestige outlet should have such a serious and obvious flaw is worrying. Of course, no matter how good editorial and peer review standards are, it is inevitable that sometimes work with serious flaws will get published, and it is easy to pick out the odd problematic paper and ignore the vast majority of quality work being published. But, I did think this was a blatant problem that should have been spotted.

Indeed, because I have a lot of respect for the Journal of Chemical Education I decided not to blog about it ("but that is what you are doing…?"; yes, but stick with me) and to take time to write a detailed letter to the journal setting out the problem in the hope this would be acknowledged and the published paper would not stand unchallenged in the literature. The journal declined to publish my letter although the referees seemed to generally accept the critique. This suggests to me that this was not just an isolated case of something slipping through – but a failure to appreciate the need for robust scientific standards in publishing educational research.

Read the letter submitted to the Journal of Chemical Education

A flawed paper does not imply worthless research

I am certainly not suggesting that there is no merit in Li, Ouyang, Xu and Zhang's work. Nor am I arguing that their work was not worth publishing in the journal. My argument is that Li and colleagues' paper draws an invalid conclusion, and makes misleading statements inconsistent with the research data presented, and that it should not have been published in this form. These problems are pretty obvious, and should (I felt) have been spotted in peer review. The authors should have been asked to address these issues, and to follow normal scientific standards and norms, such that their conclusions follow from, rather than contradict, their results.

That is my take. Please read my reasoning below (and the original study if you have access to J.Chem.Ed.) and make up your own mind.

Li, Ouyang, Xu and Zhang report an innovation in a university course. They consider this to have been a successful innovation, and it may well have great merits. The core problem is that Li and colleagues claim that their innovation is successful in improving teaching quality and student learning, when their own data analysis does not support this.

The evidence for a successful innovation

There is much material in the paper on the nature of the innovation, and there is evidence about student responses to it. Here, I am only concerned with the failure of the paper to offer a logical chain of argument to support their knowledge claim that the teaching innovation improved student achievement.

There are (to my reading – please judge for yourself if you can access the paper) some slight ambiguities in some parts of the description of the collection and analysis of achievement data (see note 5 below), but the key indicator relied on by Li, Ouyang, Xu and Zhang is the average score achieved by students in four teaching groups, three of which experienced the teaching innovation (these are denoted collectively as 'the experimental group') and one of which did not (denoted as 'the control group', although there is no control of variables in the study 1). Each class comprised 40 students.

The study is not published open access, so I cannot reproduce the copyright figures from the paper here, but below I have drawn a graph of these key data:


Key results from Li et al., 2022: these data were the basis for claiming an effective teaching innovation.


It is on the basis of this set of results that Li and colleagues claim that "the average score showed a constant upward trend, and a steady increase was found". Surely, anyone interrogating these data might pause to wonder whether that is the most authentic description of the pattern of scores year on year.

Does anyone teaching in a university really think that assessment methods are good enough to produce average class scores that are meaningful to 3 or 4 significant figures? To a more reasonable level of precision – the nearest percentage point (which is presumably what these numbers are; that is not made explicit) – the results were:


Cohort | Average class score
2017 | 80
2018 | 80
2019 | 80
2020 | 80
Average class scores (2 s.f.) year on year

When presented to a realistic level of precision, the obvious pattern is…no substantive change year on year!
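
To make the point concrete, here is a minimal sketch (in Python) of what happens when the reported figures – a 2017 average of 79.8, then increases of 0.11, 0.32 and 0.54 points in subsequent years, as quoted in note 5 below – are expressed to two significant figures:

```python
# Average class scores as reported by Li et al. (2022): a 2017 baseline of
# 79.8, then gains of 0.11, 0.32 and 0.54 points over 2017 in later years.
scores = {2017: 79.8, 2018: 79.8 + 0.11, 2019: 79.8 + 0.32, 2020: 79.8 + 0.54}

for year, score in scores.items():
    # Two significant figures is arguably still generous for a class average
    print(year, f"{score:.4g}", "->", f"{score:.2g}")
```

Every cohort comes out at 80.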

A truncated graph

In their paper, Li and colleagues do present a graph comparing the average results in 2017 with (not 2018, but) 2019 and 2020 – somewhat similar to the one I have drawn here, which should have made it very clear how little the scores varied between cohorts. However, Li and colleagues did not include on their axis the full range of possible scores, but rather only a small portion of that range – from 79.4 to 80.4.

This is a perfectly valid procedure often used in science, and it is quite explicitly done (the x-axis is clearly marked), but it does give a visual impression of a large spread of scores, which could be quite misleading. In effect, their Figure 4b includes just a sliver of my graph above, as shown below. If one takes the portion of the image below that is not greyed out, and stretches it to cover the full extent of the x-axis of a graph, that is what is presented in the published account.


In the paper in J.Chem.Ed., Li and colleagues (2022) truncate the scale on their average score axis to expand 1% of the full range (approximated above in the area not shaded over) into a whole graph as their Figure 4b. This gives a visual impression of widely varying scores (to anyone who does not read the axis labels).
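
For readers who want to reproduce the visual effect, here is an illustrative sketch using matplotlib. The cohort averages are those reported by Li and colleagues; the plotting details (a vertical bar chart, my choice of axis limits) are my own assumptions, not a reconstruction of their Figure 4b:

```python
import matplotlib.pyplot as plt

years = [2017, 2018, 2019, 2020]
averages = [79.8, 79.91, 80.12, 80.34]  # as reported by Li et al. (2022)

fig, (truncated, full) = plt.subplots(1, 2, figsize=(8, 3))

truncated.bar(years, averages)
truncated.set_ylim(79.4, 80.4)   # roughly the truncated range of Figure 4b
truncated.set_title("Truncated axis: an 'improvement trend'")

full.bar(years, averages)
full.set_ylim(0, 100)            # the full range of possible scores
full.set_title("Full axis: no substantive change")

for axis in (truncated, full):
    axis.set_xlabel("Cohort")
    axis.set_ylabel("Average class score")
    axis.set_xticks(years)

plt.tight_layout()
plt.show()
```

The same four numbers produce either a dramatic staircase or four near-identical bars, depending solely on the axis limits.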


What might have caused those small variations?

If anyone does think that differences of a few tenths of a percentage point in average class scores are notable, and that this demonstrates increasing student achievement, then we might ask: what caused them?

Li and colleagues seem to be convinced that the change in teaching approach caused the (very modest) increase in scores year on year. That would be possible. (Indeed, Li et al seem to be arguing that the very, very modest shift from 2017 to subsequent years was due to the change of teaching approach; but the not-quite-so-modest shifts from 2018 to 2019 to 2020 are due to developing teacher competence!) However, drawing that conclusion requires making a ceteris paribus assumption: that all other things are equal. That is, that any other relevant variables have been controlled.

Read about confounding variables

Another possibility, however, is simply that each year the teaching team is more familiar with the science, and has had more experience teaching it to groups at this level. That is quite reasonable, and could explain why there might be a modest increase in student outcomes on a course year on year.

Non-equivalent groups of students?

However, a big assumption here is that each of the year groups can be considered to be intrinsically the same at the start of the course (and to have equivalent relevant experiences outside the focal course during the programme). Often in quasi-experimental studies (where randomisation to conditions is not possible 1) a pre-test is used to check for equivalence prior to the innovation: after all, if students are starting from different levels of background knowledge and understanding then they are likely to score differently at the end of a course – and no further explanation of any measured differences in course achievement need be sought.

Read about testing for initial equivalence

In experiments, you randomly assign the units of analysis (e.g., students) to the conditions, which gives some basis for at least comparing any differences in outcomes with the variations likely by chance. But this was not a true experiment as there was no randomisation – the comparisons are between successive year groups.

In Li and colleagues' study, the 40 students taking the class in 2017 are implicitly assumed equivalent to the 40 students taking the class in each of the years 2018-2020: but no evidence is presented to support this assumption. 3

Yet anyone who has taught the same course over a period of time knows that even when a course is unchanged and the entrance requirements stable, there are naturally variations from one year to the next. That is one of the challenges of educational research (Taber, 2019): you never can "take two identical students…two identical classes…two identical teachers…two identical institutions".

Novelty or expectation effects?

We would also have to ignore any difference introduced by the general effect of there being an innovation beyond the nature of the specific innovation (Taber, 2019). That is, students might be more attentive and motivated simply because this course does things differently to their other current courses and past courses. (Perhaps not, but it cannot be ruled out.)

The researchers were likely enthusiastic about, and had high expectations for, the innovation (so high, it seems, that this biased their interpretation of the data and blinded them to the obvious problems with their argument) – and much research shows that high expectations, in their own right, often influence outcomes.

Read about expectancy effects in studies

Equivalent examination questions and marking?

We also have to assume the assessment was entirely equivalent across the four years. 4 The scores were based on aggregating a number of components:

"The course score was calculated on a percentage basis: attendance (5%), preclass preview (10%), in-class group presentation (10%), postclass mind map (5%), unit tests (10%), midterm examination (20%), and final examination (40%)."

Li, et al, 2022, p.1858

This raises questions about the marking and the examinations:

  • Are the same test and examination questions used each year (that is not usually the case as students can acquire copies of past papers)?
  • If not, how were these instruments standardised to ensure they were not more difficult in some years than others?
  • How reliable is the marking? (Reliable meaning the same scores/mark would be assigned to the same work on a different occasion.)

These various issues do not appear to have been considered.

Change of assessment methodology?

The description above of how the students' course scores were calculated raises another problem. The 2017 cohort were taught by "direct instruction". This is not explained, as the authors presumably think we all know exactly what that is: I imagine lectures. By comparison, in the innovation (2018-2020 cohorts):

"The preclass stage of the SCTBL strategy is the distribution of the group preview task; each student in the group is responsible for a task point. The completion of the preview task stimulates students' learning motivation. The in-class stage is a team presentation (typically PowerPoint (PPT)), which promotes students' understanding of knowledge points. The postclass stage is the assignment of team homework and consolidation of knowledge points using a mind map. Mind maps allow an orderly sorting and summarization of the knowledge gathered in the class; they are conducive to connecting knowledge systems and play an important role in consolidating class knowledge."

Li, et al, 2022, p.1856, emphasis added.

Now the assessment of the preview tasks, the in-class group presentations, and the mind maps all contributed to the overall student scores (10%, 10%, 5% respectively). But these are parts of the innovative teaching strategy – they are (presumably) not part of 'direct instruction'. So, the description of how the student class scores were derived only applies to 2018-2020, and the methodology used in 2017 must have been different. (This is not discussed in the paper.) 5

A quarter of the score for the 'experimental' groups came from assessment components that could not have been part of the assessment regime applied to the 2017 cohort. At the very least, the tests and examinations must have been weighted more heavily in the 'control' group students' overall scores. This makes it very unlikely the scores can be meaningfully compared directly between 2017 and subsequent years: if the authors think otherwise, they should have presented persuasive evidence of equivalence.
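
We can spell out the arithmetic (the component weightings are those quoted above; how the 2017 scores were actually composed is not stated, so the rescaling at the end is just one possibility):

```python
# Components of the course score as quoted from Li et al. (2022), p.1858:
components = {
    "attendance": 5, "preclass preview": 10, "group presentation": 10,
    "mind map": 5, "unit tests": 10, "midterm examination": 20,
    "final examination": 40,
}

# The preview, presentation and mind map belong to the innovation itself,
# so presumably could not have contributed to the 2017 ('control') scores:
innovation_only = {"preclass preview", "group presentation", "mind map"}
remaining = {k: v for k, v in components.items() if k not in innovation_only}

print(sum(components.values()))  # 100 for the 'experimental' cohorts
print(sum(remaining.values()))   # 75: a quarter of the score unaccounted for

# However the 2017 scores were built up, the remaining components must have
# carried proportionately more weight - e.g., if simply rescaled to 100:
total = sum(remaining.values())
for name, weight in remaining.items():
    print(f"{name}: {100 * weight / total:.0f}%")
```

On that (hypothetical) rescaling, the final examination alone would carry 53% of the 2017 score, against 40% for the later cohorts.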


Li and colleagues want to convince us that variations in average course scores can be assumed to be due to a change in teaching approach – even though there are other confounding variables.

So, groups that we cannot assume are equivalent were assessed in ways that we cannot assume to be equivalent, and obtained nearly identical average levels of achievement. Despite that, Li and colleagues want to persuade us that the very modest differences in average scores between the 'control' and 'experimental' groups (differences actually larger between different 'experimental' cohorts than between the 'control' group and the succeeding 'experimental' cohort) are large enough to be significant and to demonstrate that their teaching innovation improves student achievement.

Statistical inference

So, even if we thought shifts of less than one percentage point in average class achievement were telling, there are no good reasons to assume they are down to the innovation rather than some other factor. But Li and colleagues use statistical tests to tell them whether differences between the 'control' and 'experimental' conditions are significant. They find – just what anyone looking at the graph above would expect – "there is no significant difference in average score" (p.1860).

The scientific convention in using such tests is that the choice of test and confidence level (e.g., a probability of p<0.05 to be taken as significant) is determined in advance, and the researchers accept the outcomes of the analysis. There is a kind of contract involved – a decision to use a statistical test (chosen in advance as being a valid way of deciding the outcome of an experiment) is seen as a commitment to accept its outcomes. 2 This is a form of honesty in scientific work. Just as it is not acceptable to fabricate data, nor is it acceptable to ignore experimental outcomes when drawing conclusions from research.

Special pleading is allowed in mitigation (e.g., "although our results were non-significant, we think this was due to the small sample sizes, and suggest that further research should be undertaken with larger groups {and we are happy to do this if someone gives us a grant}"), but the scientist is not allowed to simply set aside the results of the analysis.
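
In computational terms, the 'contract' is straightforward: the significance level is fixed before the data are examined, and the conclusion is whatever the comparison yields. A minimal sketch, using invented per-student scores (the study's raw data are not published):

```python
from scipy import stats

# Hypothetical per-student scores for two conditions (invented numbers):
control      = [78, 82, 75, 90, 66, 85, 79, 88, 74, 81]
experimental = [80, 84, 77, 91, 69, 83, 82, 86, 75, 79]

alpha = 0.05  # chosen in advance, as part of the 'contract'

t_statistic, p_value = stats.ttest_ind(experimental, control)

if p_value < alpha:
    print(f"p = {p_value:.3f}: a significant difference was found")
else:
    # The honest conclusion: the analysis gives no warrant for claiming
    # that the conditions differ.
    print(f"p = {p_value:.3f}: no significant difference demonstrated")
```

Whatever the print statement says is what the conclusions section has to say too.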


Li and colleagues found no significant difference between the two conditions, yet that did not stop them claiming, and the Journal of Chemical Education publishing, a conclusion that the new teaching approach improved student achievement!

Yet setting aside the results of their analysis is what Li and colleagues do. They carry out an analysis, then simply ignore the findings, and conclude the opposite:

"To conclude, our results suggest that the SCTBL method is an effective way to improve teaching quality and student achievement."

Li, et al, 2022, p.1861

It was this complete disregard of scientific values, rather than the more common failure to appreciate that they were not comparing like with like, that I found really shocking – and led to me writing a formal letter to the journal. Not so much surprise that researchers might do this (I know how intoxicating research can be, and how easy it is to become convinced in one's ideas) but that the peer reviewers for the Journal of Chemical Education did not make the firmest recommendation to the editor that this manuscript could NOT be published until it was corrected so that the conclusion was consistent with the findings.

This seems a very stark failure of peer review, and allows a paper to appear in the literature that presents a conclusion totally unsupported by the evidence available and the analysis undertaken. This also means that Li, Ouyang, Xu and Zhang now have a publication on their academic records that any careful reader can see is critically flawed – something that could have been avoided had peer reviewers:

  • used their common sense to appreciate that variations in class average scores from year to year between 79.8 and 80.3 could not possibly be seen as sufficient to indicate a difference in the effectiveness of teaching approaches;
  • recommended that the authors follow the usual scientific norms and adopt the reasonable scholarly value position that the conclusion of your research should follow from, and not contradict, the results of your data analysis.



Notes

1 Strictly the 2017 cohort has the role of a comparison group, but NOT a control group as there was no randomisation or control of variables, so this was not a true experiment (but a 'quasi-experiment'). However, for clarity, I am here using the original authors' term 'control group'.

Read about experimental research design


2 Some journals are now asking researchers to submit their research designs and protocols to peer review BEFORE starting the research. This prevents wasted effort on work that is flawed in design. Journals will publish a report of the research carried out according to an accepted design – as long as the researchers have kept to their research plans (or only made changes deemed necessary and acceptable by the journal). This prevents researchers seeking to change features of the research because it is not giving the expected findings and means that negative results as well as positive results do get published.


3 'Implicitly' assumed as nowhere do the authors state that they think the classes all start as equivalent – but if they do not assume this then their argument has no logic.

Without this assumption, their argument is like claiming that growing conditions for tree development are better at the front of a house than at the back because on average the trees at the front are taller – even though fast-growing mature trees were planted at the front and slow-growing saplings at the back.


4 From my days working with new teachers, a common rookie mistake was assuming that one could tell a teaching innovation was successful because students achieved an average score of 63% on the (say, acids) module taught by the new method when the same class only averaged 46% on the previous (say, electromagnetism) module. Graduate scientists would look at me with genuine surprise when I asked how they knew the two tests were of comparable difficulty!

Read about why natural scientists tend to make poor social scientists


5 In my (rejected) letter to the Journal of Chemical Education I acknowledged some ambiguity in the paper's discussion of the results. Li and colleagues write:

"The average scores of undergraduates majoring in pharmaceutical engineering in the control group and the experimental group were calculated, and the results are shown in Figure 4b. Statistical significance testing was conducted on the exam scores year to year. The average score for the pharmaceutical engineering class was 79.8 points in 2017 (control group). When SCTBL was implemented for the first time in 2018, there was a slight improvement in the average score (i.e., an increase of 0.11 points, not shown in Figure 4b). However, by 2019 and 2020, the average score increased by 0.32 points and 0.54 points, respectively, with an obvious improvement trend. We used a t test to test whether the SCTBL method can create any significant difference in grades among control groups and the experimental group. The calculation results are shown as follows: t1 = 0.0663, t2 = 0.1930, t3 =0.3279 (t1 <t2 <t3 <t𝛼, t𝛼 =2.024, p>0.05), indicating that there is no significant difference in average score. After three years of continuous implementation of SCTBL, the average score showed a constant upward trend, and a steady increase was found. The SCTBL method brought about improvement in the class average, which provides evidence for its effectiveness in medicinal chemistry."

Li, et al, 2022, p.1858-1860, emphasis added

This appears to refer to three distinct measures:

  • average scores (produced by weighted summations of various assessment components, as discussed above)
  • exam scores (perhaps just the "midterm examination…and final examination", or perhaps just the final examination?)
  • grades

Formal grades are not discussed in the paper (the word is only used in this one place), although the authors do refer to categorising students into descriptive classes ('levels') according to scores on 'assessments', and may see these as grades:

"Assessments have been divided into five levels: disqualified (below 60), qualified (60-69), medium (70-79), good (80-89), and excellent (90 and above)."

Li, et al, 2022, p.1856, emphasis added

In the longer extract above, the reference to testing difference in "grades" is followed by reporting the outcome of the test for "average score":

"We used a t test to test …grades …The calculation results … there is no significant difference in average score"

As Student's t-test was used, it seems unlikely that the assignment of students to grades could have been tested. That would surely have needed something like the Chi-squared statistic to test categorical data – looking for an association between (i) the distributions of the number of students in the different cells 'disqualified', 'qualified', 'medium', 'good' and 'excellent'; and (ii) treatment group.
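
For illustration, a sketch of what such a test might look like – the grade-band counts here are invented, as the paper does not report the distributions:

```python
from scipy.stats import chi2_contingency

# Hypothetical numbers of students per grade band in each condition
# (each row sums to a class of 40; the real distributions are not reported):
#               disqualified  qualified  medium  good  excellent
control      = [          2,         6,     14,   13,         5]
experimental = [          1,         5,     13,   15,         6]

chi2, p_value, dof, expected = chi2_contingency([control, experimental])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```

A test of this kind would ask whether the distribution of students across the five bands is associated with the treatment group.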

Presumably, then, the statistical testing was applied to the average course scores shown in the graph above. This also makes sense because the classification into descriptive classes loses some of the detail in the data, and there is no obvious reason why the researchers would deliberately choose to test 'reduced' data rather than the full data set with the greatest resolution.
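
As a small check on the quoted numbers: the critical value tα = 2.024 cited by Li and colleagues corresponds to a two-tailed test at α = 0.05 with 38 degrees of freedom (the paper does not state the degrees of freedom used):

```python
from scipy import stats

# The paper quotes a critical value of t_alpha = 2.024 (p > 0.05).
# That matches a two-tailed test at alpha = 0.05 with 38 degrees of freedom:
print(stats.t.ppf(1 - 0.05 / 2, df=38))   # ~2.024

# Comparing two independent classes of 40 students each would usually give
# df = 40 + 40 - 2 = 78, with a slightly smaller critical value:
print(stats.t.ppf(1 - 0.05 / 2, df=78))   # ~1.991
```

Either way, the reported statistics (0.0663, 0.1930 and 0.3279) fall far below the critical value, so the non-significant outcome is not in doubt.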


Reflecting the population

Sampling an "exceedingly large number of students"


Keith S. Taber


the key to sampling a population is identifying a representative sample

Obtaining a representative sample of a population can be challenging
(Image by Gerd Altmann from Pixabay)


Many studies in education are 'about' an identified population (students taking A level Physics examinations; chemistry teachers in German secondary schools; children transferring from primary to secondary school in Scotland; undergraduates majoring in STEM subjects in Australia…).

Read about populations of interest in research

But, in practice, most studies only collect data from a sample of the population of interest.

Sampling the population

One of the key challenges in social research is sampling. Obtaining a sample is usually not that difficult. However, the logic of research often runs something along these lines:

  • 1. Aim – to find out about a population.
  • 2. As it is impractical to collect data from the whole population, collect data from a sample.
  • 3. Analyse data collected from the sample.
  • 4. Draw inferences about the population from the analysis of data collected from the sample.

For example, if one wished to do research into the views of school teachers in England, and there are, say, 600 000 of them, it is unlikely anyone could undertake research that collected and analysed data from all of them and produce results in a short enough period for the findings to still be valid (unless they were prepared to employ a research team of thousands!). But perhaps one could collect data from a sample that would be informative about the population.

This can be a reasonable approach (and, indeed, is a very common approach in research in areas like education) but relies on the assumption that what is true of the sample can be generalised to the population.

That clearly depends on the sample being representative of the larger population (at least in those ways which are pertinent to the research).


When a study (here, an experiment) collects data from a sample drawn at random from a wider population, then the findings of the experiment can be assumed to apply (on average) to the population. (Figure from Taber, 2019.)

In practice, unless a population of interest is quite modest in size (e.g., teachers in one school; post-graduate students in one university department; registered members of a society) it is usually simply not feasible to obtain a random sample.

For example, if we were interested in secondary school students in England, and we had a sample of secondary students from England that (a) reflected the age profile of the population; (b) reflected the gender profile of the population; but (c) were all drawn from one secondary school, this is unlikely to be a representative sample.

  • If we do have a representative sample, then the likely error in generalising from sample to population can be calculated, and can be reduced by having a larger sample (a quick sketch of the calculation follows this list);
  • If we do not have a representative sample, then there is no way of knowing how well the findings from the sample reflect the wider population and increasing sample size does not really help; and, for that matter,
  • If we do not know whether we have a representative sample, then, again, there is no way of knowing how well the findings from the sample reflect the wider population and increasing sample size does not really help.
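
To illustrate that first bullet: for a simple random sample, the margin of error on an estimated proportion can be computed directly, and it shrinks with the square root of the sample size. A minimal sketch (the 60% figure and the sample sizes are invented for illustration):

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion from a random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Suppose 60% of a random sample of teachers agree with some statement:
for n in (100, 400, 1600):
    print(f"n = {n}: 60% +/- {margin_of_error(0.6, n):.1%}")
```

Quadrupling the sample size halves the margin of error – but, as the other two bullets say, no such calculation rescues a sample that is not (or may not be) representative.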

So, the key to sampling a population is identifying a representative sample.

Read about sampling a population

If we know that only a small number of factors are relevant to the research then we may (if we are able to characterise members of the population on these criteria) be able to design a sample which is representative based on those features which are important.

If the relevant factors for a study were teaching subject; years of teaching experience; teacher gender, then we would want to build a sample that fitted the population profile accordingly, so, maybe, 3% female maths teachers with 10+ years of teaching experience, et cetera. We would need suitable demographic information about the population to inform the building of the sample.

We can then randomly select from those members of the population with the right characteristics within the different 'cells'.

However, if we do not know exactly what specific features might be relevant to characterise a population in a particular research project, the best we might be able to do is to employ a randomly chosen sample, which at least allows the measurement error to be estimated.
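
A sketch of how such a quota-style sample might be built in practice – the population records, the three characterising features, and the 1% sampling fraction are all invented for illustration:

```python
import random
from collections import defaultdict

random.seed(42)

# A hypothetical population of teachers, each characterised on the three
# features assumed relevant (teaching subject, experience band, gender):
population = [
    {"id": i,
     "subject": random.choice(["maths", "science", "English"]),
     "experience": random.choice(["0-9 years", "10+ years"]),
     "gender": random.choice(["female", "male"])}
    for i in range(10_000)
]

# Group the population into 'cells' sharing the same profile...
cells = defaultdict(list)
for teacher in population:
    key = (teacher["subject"], teacher["experience"], teacher["gender"])
    cells[key].append(teacher)

# ...then randomly select within each cell, in proportion to the cell's
# share of the population (here a 1% overall sample):
sample = []
for profile, members in cells.items():
    quota = round(len(members) * 0.01)
    sample.extend(random.sample(members, quota))

print(f"{len(sample)} teachers sampled from {len(population)}")
```

The sample then matches the population profile on those three features by construction – but only on those features.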

Labs for exceedingly large numbers of students

Leopold and Smith (2020) were interested in the use of collaborative group work in a "general chemistry, problem-based lab course" at a United States university, where students worked in fixed groups of three or four throughout the course. As well as using group work for more principled reasons, "group work is also utilized as a way to manage exceedingly large numbers of students and efficiently allocate limited time, space, and equipment" (p.1). They tell readers that

"the case we examine here is a general chemistry, problem-based lab course that enrols approximately 3500 students each academic year"

Leopold & Smith, 2020, p.5

Although they recognised a wide range of potential benefits of collaborative work, these depend upon students being able to work effectively in groups, which requires skills that cannot be taken for granted. Leopold and Smith report how structured support was put in place to help students diagnose impediments to the effective working of their groups – and they investigated this in their study.

The data collected were of two types. There was a course evaluation at the end of the year taken by all the students in the cohort, "795 students enrolled [in] the general chemistry I lab course during the spring 2019 semester" (p.7). However, they also collected data from a sample of student groups during the course, in terms of responses to group tasks designed to help them think about and develop their group work.

Population and sample

As the focus of their research was a specific course, the population of interest was the cohort of undergraduates taking the course. Given the large number of students involved, they collected qualitative data from a sample of the groups.

Units of analysis

The course evaluation questions sought individual learners' views, so for those data the unit of analysis was the individual student. However, the groups were tasked with working as a group to improve their effectiveness in collaborative learning. So, in Leopold and Smith's sample of groups, the unit of analysis was the group. Some data were received from individual group members, and other data were submitted as group responses: but the analysis was on the basis of responses from within the specific groups in the sample.

A stratified sample

Leopold and Smith explained that

"We applied a stratified random sampling scheme in order to account for variations across lab sections such as implementation fidelity and instructor approach so as to gain as representative a sample as possible. We stratified by individual instructors teaching the course which included undergraduate teaching assistants (TAs), graduate TAs, and teaching specialists. One student group from each instructor's lab sections was randomly selected. During spring 2019, we had 19 unique instructors teaching the course therefore we selected 19 groups, for a total of 76 students."

Leopold & Smith, 2020, p.7

The paper does not report how the random selection was made – how it was decided which group would be chosen for each instructor. As any competent scientist ought to be able to make a random selection quite easily in this situation, this is perhaps not a serious omission. I mention it only because, sadly, not all authors who report having used randomisation can explain how they randomised when asked (Taber, 2013).
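
For instance, a minimal sketch of how such a selection might be done – the instructor labels and group identifiers here are invented:

```python
import random

# Hypothetical roster: each instructor's lab sections contain several
# student groups (identifiers invented for illustration):
groups_by_instructor = {
    "instructor_01": ["A1", "A2", "A3", "A4"],
    "instructor_02": ["B1", "B2", "B3"],
    "instructor_03": ["C1", "C2", "C3", "C4", "C5"],
    # ...one entry for each of the 19 instructors
}

# The stratification: one group selected at random per instructor.
selected = {instructor: random.choice(groups)
            for instructor, groups in groups_by_instructor.items()}
print(selected)
```

Anything along these lines (or drawing lots, or a table of random numbers) would do; the point is simply that the selection within each stratum is left to chance.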

Was the sample representative?

Leopold and Smith found that, based on their sample, student groups could diagnose impediments to effective group working, and could often put in place effective strategies to increase their effectiveness.

We might wonder if the sample was representative of the wider population. If the groups were randomly selected in the way claimed then one would expect this would probably be the case – only 'probably', as that is the best randomisation and statistics can do – we can never know for certain that a random sample is representative, only that it is unlikely to be especially unrepresentative!

The only way to know for sure that a sample is genuinely representative of the population of interest in relation to the specific focus of a study, would be to collect data from the whole population and check the sample data matches the population data.* But, of course, if it was feasible to collect data from everyone in the population, there would be no need to sample in the first place.

However, because the end-of-course evaluation was taken by all students in the cohort (the study population), Leopold and Smith were able to see if those students in the sample responded in ways that were generally in line with the population as a whole. The two figures reproduced here seem to suggest they did!


Figure 1 from Leopold & Smith, 2020, p.10, which is published with a Creative Commons Attribution (CC BY) license allowing reproduction.

Figure 2 from Leopold & Smith, 2020, p.10, which is published with a Creative Commons Attribution (CC BY) license allowing reproduction.

There is clearly a pretty good match here. However, it is important not to over-interpret these data. The questions in the evaluation related to the overall experience of group working, whereas the qualitative data analysed from the sample related to the more specific issues of diagnosing and addressing issues in the working of groups. These are related matters, but not identical, and we cannot assume that the very strong similarity between sample and population outcomes in the survey demonstrates (or proves!) that the analysis of data from the sample is also so closely representative of what would have been obtained if all the groups had been included in the data collection.


 | Experiences of learning through group-work | Learning to work more effectively in groups
Sample | patterns in data closely reflected population responses | data only collected from a sample of groups
Population | all invited to provide feedback | [it seems reasonable to assume results from the sample are likely to apply to the cohort as a whole]
The similarity of the feedback given by students in the sample of groups to the overall cohort responses suggests that the sample was broadly representative of the overall population in terms of developing group-work skills and practices

It might well have been, but we cannot know for sure. (* The only way to know for sure that a sample is genuinely representative of the population of interest in relation to the specific focus of a study, would be …)

However, the way the sample so strongly reflected the population in relation to the evaluation data shows that, in that (related, if not identical) respect at least, the sample was strongly representative, and that is very likely to give readers confidence in the sampling procedure used. If this had been my study, I would have been pretty pleased with this, at least strongly suggestive, circumstantial evidence of the representativeness of the sampling of the student groups.



Didactic control conditions

Another ethically questionable science education experiment?


Keith S. Taber


This seems to be a rhetorical experiment, where an educational treatment that is already known to be effective is 'tested' to demonstrate that it is more effective than suboptimal teaching – by asking a teacher to constrain her teaching of the students assigned to an unethical comparison condition

one group of students were deliberately disadvantaged by asking an experienced and skilled teacher to teach in a way all concerned knew was sub-optimal so as to provide a low base line that would be outperformed by the intervention, simply to replicate a much demonstrated finding

In a scientific experiment, an intervention is made into the natural state of affairs to see if it produces a hypothesised change. A key idea in experimental research is control of variables: in the ideal experiment only one thing is changed. In the control condition all relevant variables are fixed so that there is a fair test between the experimental treatment and the control.

Although there are many published experimental studies in education, such research can rarely claim to have fully controlled all potentially relevant variables: there are (nearly always, always?) confounding factors that simply cannot be controlled.

Read about confounding variables

Experimental research in education, then, (nearly always, always?) requires some compromising of the pure experimental method.

Where those compromises are substantial, we might ask if experiment was the wrong choice of methodology: even if a good experiment is often the best way to test an idea, a bad experiment may be less informative than, for example, a good case study.

That is primarily a methodological matter, but testing educational innovations and using control conditions in educational studies also raises ethical issues. After all, an experiment means experimenting with real learners' educational experiences. This can certainly sometimes be justified – but there is (or should be) an ethical imperative:

  • researchers should never ask learners to participate in a study condition they have good reason to expect will damage their opportunities to learn.

If researchers want to test a genuinely innovative teaching approach or learning resource, then they have to be confident it has a reasonable chance of being effective before asking learners to participate in a study where they will be subjected to an untested teaching input.

It is equally the case that students assigned to a control condition should never be deliberately subjected to inferior teaching simply in order to help make a strong contrast with an experimental approach being tested. Yet, reading some studies leads to a strong impression that some researchers do seek to constrain teaching to a control group to help bias studies towards the innovation being tested (Taber, 2019). That is, such studies are not genuinely objective, open-minded investigations to test a hypothesis, but 'rhetorical' studies set up to confirm and demonstrate the researchers' prior assumptions. We might say these studies do not reflect true scientific values.


A general scheme for a 'rhetorical experiment'

Read about rhetorical experiments


I have raised this issue in the research literature (Taber, 2019), so when I read experimental studies in education I am minded to check that any control condition has been set up with a concern to ensure that the interests of all study participants (in both experimental and control conditions) have been properly considered.

Jigsaw cooperative learning in elementary science: physical and chemical changes

I was reading a study called "A jigsaw cooperative learning application in elementary science and technology lessons: physical and chemical changes" (Tarhan, Ayyıldız, Ogunc & Sesen, 2013) published in a respectable research journal (Research in Science & Technological Education).

Tarhan and colleagues adopted a common type of research design, and the journal referees and editor presumably were happy with the design of their study. However, I think the science education community should collectively be more critical about the setting up of control conditions which require students to be deliberately taught in ways that are considered to be less effective (Taber, 2019).


Jigsaw learning involves students working in co-operative groups, and in undertaking peer-teaching

Jigsaw learning is a pedagogic technique which can be seen as a constructivist, student-centred, dialogic, form of 'active learning'. It is based on collaborative groupwork and includes an element of peer-tutoring. In this paper the technique is described as "jigsaw cooperative learning", and the article authors explain that "cooperative learning is an active learning approach in which students work together in small groups to complete an assigned task" (p.185).

Read about jigsaw learning

Random assignment

The study used an experimental design, to compare learning outcomes in two classes taught the same topic in two different ways. Many studies that compare two classes are problematic because whole extant classes are assigned to conditions, which means that the unit of analysis should be the class (experimental condition, n=1; control condition, n=1). Yet, despite this, such studies commonly analyse results as if each learner were an independent unit of analysis (e.g., experimental condition, n=c.30; control condition, n=c.30), which is necessary to obtain statistical results, but unfortunately means that inferences drawn from those statistics are invalid (Taber, 2019). Such studies offer examples of where there seems little point doing an experiment badly, as the very design makes it intrinsically impossible to obtain a valid statistically significant outcome.


Experimental designs may be categorised as true experiments, quasi-experiments and natural experiments (Taber, 2019).

Tarhan and colleagues, however, randomly assign the learners to the two conditions so can genuinely claim that in their study they have a true experiment: for their study, experimental condition, n=30; control condition, n=31.

Initial equivalence between groups

Assigning students in this way also helped ensure the two groups started from a similar base. Often such experimental studies use a pre-test to compare the groups before teaching. However, often the researchers look for a statistical difference between the groups which does not reach statistical significance (Taber, 2019). That is, if a statistical test shows p≥0.05 (in effect, the initial difference between the groups is not very unlikely to occur by chance) this is taken as evidence of equivalence. That is like saying we will consider two teachers to be of 'equivalent' height as long as there is no more than 30 cm difference in their height!

In effect

'not very different'

is being seen as a synonym for

'near enough the same'


Some analogies for how equivalence is determined in some studies: read about testing for initial equivalence

However, the pretest in Tarhan and colleagues' study found that the difference between the two groups' performances was at a level likely to occur by chance (not simply something more than 5% of the time, but) 87% of the time. This is a much more convincing basis for seeing the two groups as initially similar.
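
To see why a high p-value is more reassuring here, consider a sketch with invented pre-test scores for two classes of ten (not the study's data): the class means differ by only 0.1 marks, and the test returns a p-value in the region of the 0.87 reported.

```python
from scipy import stats

# Invented pre-test scores for two randomly assigned classes:
experimental = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]   # mean 13.5
control      = [13, 14, 12, 15, 13, 15, 12, 14, 14, 14]   # mean 13.6

t, p = stats.ttest_ind(experimental, control)
print(f"t = {t:.3f}, p = {p:.2f}")
# A p-value this large means the observed difference between the class means
# is tiny relative to the variation within the classes - a far stronger hint
# of initial similarity than merely scraping past p = 0.05.
```

(Such a test still cannot prove equivalence, but a near-zero standardised difference is clearly better circumstantial evidence of it than a p-value of, say, 0.06.)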

So, there are two ways in which the Tarhan et al. study seemed better thought-through than many small scale experiments in teaching I have read.

Comparing two conditions

The research was carried out with "sixth grade students in a public elementary school in Izmir, Turkey" (p.184). The focus was learning about physical and chemical changes.

The experimental condition

At the outset of the study, the authors suggest it is already known that

  • "Jigsaw enhances cooperative learning" (p.185)"
  • "Jigsaw promotes positive attitudes and interests, develops communication skills between students, and increases learning achievement in chemistry" (p.186)
  • "the jigsaw technique has the potential to improve students' attitude towards science"
  • development of "students' understanding of chemical equilibrium in a first year general chemistry course [was more successful] in the jigsaw class…than …in the individual learning class"

It seems the approach being tested was already demonstrated to be effective in a range of contexts. Based on the existing research, then, we could already expect well-implemented jigsaw learning to be effective in facilitating student learning.

Similarly, the authors tell the readers that the broader category of cooperative learning has been well established as successful,

"The benefits of cooperative learning have been well documented as being

higher academic achievement,

higher level of reasoning and critical thinking skills,

deeper understanding of learned material,

better attention and less disruptive behavior in class,

more motivation to learn and achieve,

positive attitudes to subject matter,

higher self-esteem and

higher social skills."

Tarhan et al., 2013, p.185

What is there not to like here? So, what was this highly effective teaching approach compared with?

What is being compared?

Tarhan and colleagues tell readers that:

"The experimental group was taught via jigsaw cooperative learning activities developed by the researchers and the control group was taught using the traditional science and technology curriculum."

Tarhan et al., 2013, p.189
A different curriculum?

This seems an unhelpful statement as it does not seem to compare like with like:


condition | curriculum | pedagogy
experimental | ? | jigsaw cooperative learning activities developed by the researchers
control | traditional science and technology curriculum | ?
A genuine experiment would look to control variables, so would not simultaneously vary both curriculum and pedagogy

The study uses a common test to compare learning in the two conditions, so the study only makes sense as an experimental test of jigsaw learning if the same curriculum is being followed in both conditions. Otherwise, there is no prima facie reason to think that the post-test is equally fair in testing what has been taught in the two conditions. 1

The control condition

The paper includes an account of the control condition which seems to make it clear that both groups were taught "the same content", which is helpful as to have done otherwise would have seriously undermined the study.

The control group was instructed via a teacher-centered didactic lecture format. Throughout the lesson, the same science and technology teacher presented the same content as for the experimental group to achieve the same learning objectives, which were taught via detailed instruction in the experimental group. This instruction included lectures, discussions and problem solving. During this process, the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes. The instruction was accomplished in the same amount of time as for the experimental group.

Tarhan et al., 2013, p.194

So, it seems:


condition | curriculum | pedagogy
experimental | [by inference: "traditional science and technology curriculum"] | jigsaw cooperative learning activities developed by the researchers
control | traditional science and technology curriculum [the same content as for the experimental group to achieve the same learning objectives] | teacher-centred didactic lecture format: instructor explained the subject and asked questions
 | controlled variable | independent variable
An experiment relies on control of variables, and would not simultaneously vary both curriculum and pedagogy

The statement is helpful, but might be considered ambiguous, as "this instruction included lectures, discussions and problem solving" seems to relate to what had been "taught via detailed instruction in the experimental group".

But this seems incongruent with the wider textual context. The experimental group were taught by a jigsaw learning technique – not lectures, discussions and problem solving. Nor, for that matter, were the experimental group taught via 'detailed instruction', if this means the teacher presenting the curriculum content. So, this phrasing seems unhelpfully confusing (to me, at least – presumably the journal referees and editor thought it was clear enough).

So, this probably means the "lectures, discussions and problem solving" were part of the control condition where "the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes".

'Lectures' certainly fit with that description.

However, genuine 'discussion' work is a dialogic teaching method and would not seem to fit within a "teacher-centered didactic lecture format". But perhaps 'discussion' simply refers to how the "teacher used the blackboard and asked some questions" that members of the class were invited to answer?

Read about dialogic teaching

Writing-up research is a bit like teaching, in that in presenting to a particular audience one works with a mental model of what that audience already knows and understands, and how they use specific terms – and this model is never likely to be perfectly accurate:

  • when teaching, the learners tend to let you know this, whereas,
  • when writing, this kind of immediate feedback is lacking.

Similarly, problem-solving would not seem to fit within a "teacher-centered didactic lecture format". 'Problem-solving' engages high-level cognitive and metacognitive skills, because a 'problem' is a task that students are not able to respond to simply by recalling what they have been told and applying learnt algorithms. Problem-solving requires planning and applying strategies to test out ideas and synthesise knowledge. Yet teachers and textbooks commonly refer to simple questions that merely test recall and comprehension, or the direct application of learnt techniques, as 'problems', when they are better understood as 'exercises', as they do not pose authentic problems.

The imprecise use of terms that may be understood differently across diverse contexts is characteristic of educational discourse, so Tarhan and colleagues may have simply used the labels that are normally applied in the context where they are working. It should also be noted that as the researchers are based in Turkey they are presumably finding the best English translations they can for the terms used locally.

Read about the challenges of translation in research writing

So, it seems we have:


Experimental condition | in one of the conditions? | Control condition
Jigsaw learning (set out in some detail in the paper) – an example of cooperative learning, an active learning approach in which students work together in small groups | detailed instruction? discussions (= teacher questioning?) problem solving? (= practice exercises?) | teacher-centred didactic lecture format… the teacher used the blackboard and asked some questions… a regular textbook… the instructor explained the subject, the students listened and took notes
The independent variable – teaching methodology

The teacher variable

One of the major problems with some educational experiments comparing different teaching approaches is the confound of the teacher. If

  • class A is taught through approach 'a' by teacher 1, and
  • class B is taught through approach 'b' by teacher 2

then even if there is a good case that class A and class B start off as 'equivalent' in terms of readiness to learn about the focal topic, any differences in study outcomes could be as much down to the different teachers (and we all know that different teachers are not equivalent!) as to the different teaching methodologies.

At first sight this is easily solved by having the same teacher teach both classes (as in the study discussed here). That certainly seems to help. But, a little thought suggests it is not a foolproof approach (Taber, 2019).

Teachers inevitably have better rapport with some classes than others (even when those classes are shown to be technically 'equivalent'), simply because that is the nature of how diverse personalities interact. 3 Even the most professional teachers find they prefer teaching some classes to others, enjoy the teaching more, and seem to get better results (even when the classes are supposed to be equivalent).

In an experiment, there is no reason why the teacher would work better with a class assigned the experimental condition; it might just as well be the control condition. However, this is still a confound, and there is no obvious solution to it, except having multiple classes and teachers in each condition, such that the statistics can offer a guide as to whether outcomes are sufficiently unlikely to have arisen by chance to reasonably discount these types of effect.

Different teachers also have different styles, approaches and skill sets – so the same teacher will not be equally suited to every teaching approach and pedagogy. Again, this does not necessarily advantage the experimental condition; but, again, it is something that can only be addressed by having a diverse range of teachers in each condition (Taber, 2019).

So, although we might expect having the same teacher teach both classes is the preferred approach, the same teacher is not exactly the same teacher in different classes or teaching in different ways.

And what do participants expect will happen?

Moreover, expectancy effects can be very influential in education. Expecting something to work, or not work, has been shown to have real effects on outcomes. It may not be true, as some motivational gurus like to pretend, that we can all of us achieve anything if only we believe: but we are more likely to be successful when we believe we can succeed. When confident, we tend to be more motivated, less easily deterred, and (given the human capacity for perceiving with confirmation bias) more likely to judge we are making good progress. So, any research design which communicates to teachers and students (directly, or through the teacher's or researcher's enthusiasm) an expectation of success in some innovation is more likely to lead to success. This is a potential confound that is not even readily addressed by having large numbers of classes and teachers (Taber, 2019)!

Read about expectancy effects

The authors report that

Before implementation of the study, all students and their families were informed about the aims of the study and the privacy of their personal information. Permission for their children attend the study was obtained from all families.

Tarhan et al., 2013, p.194

This is as it should be. School children are not data-fodder for researchers, and they should always be asked for, and give, voluntary informed consent when recruited to join a research project. However, researchers need to be open and honest about their work, whilst also being careful about how they present their research aims. We can imagine a possible form of invitation,

We would like to invite you to be part of a study where some of you will be subject to traditional learning through a teacher-centred didactic lecture format, where the teacher will give you notes and ask you questions, and some of you will learn by a different approach that has been shown to enhance learning, promote positive attitudes and interests, develop communication skills, increase achievement, support higher levels of reasoning and critical thinking skills, lead to deeper understanding of learned material…

An honest, but unhelpful, briefing for students and parents

If this was how the researchers understood the background to their study, then this would be a fair and honest briefing. Yet, this would clearly set up strong expectations in the student groups!

A suitable teacher

Tarhan and colleagues report that

"A teacher experienced in active learning was trained in how to implement the instruction based on jigsaw cooperative learning. The teacher and researchers discussed the instructional plans before implementing the activities."

Tarhan et al., 2013, p.189

So, the teacher who taught both classes – using jigsaw cooperative learning in one class and a teacher-centred didactic lecture approach in the other – was "experienced in active learning". So, it seems that

  • the researchers were already convinced that active learning approaches were far superior to teaching via a lecture approach
  • the teacher had experience in teaching though more engaging, effective student-centred active learning approaches

despite this, a control condition was set up that required the teacher to, in effect, de-skill, and teach in a way the researchers were well aware research suggested was inferior, for the sake of carrying out an experiment to demonstrate in a specific context what had already been well demonstrated elsewhere.

In other words, it seems that one group of students were deliberately disadvantaged by asking an experienced and skilled teacher to teach in a way all concerned knew was sub-optimal, so as to provide a low base line that would be outperformed by the intervention, simply to replicate a much demonstrated finding. When seen in that way, this is surely unethical research.

The researchers may not have been consciously conceptualising their design in those terms, but it is hard to see this as a fair test of the jigsaw learning approach – it can show it is better than suboptimal teaching, but does not offer a comparison with an example of the kind of teaching that is recommended in the national context where the research took place.

Unethical, but not unusual

I am not seeking to pick out Tarhan and colleagues in particular for designing an unethical study, because they are not unique in adopting this approach (Taber, 2019): indeed, they are following a common formula (an experimental 'paradigm' in the sense the term is used in psychology).

Tarhan and colleagues have produced a study that is interesting and informative, and which seems well planned, and strongly motivated when considered as part of a tradition of such studies. Clearly, the referees and journal editor were not minded to question the procedure. The problem is that, as a science education community, we have allowed this tradition to continue, such that a form of study that was originally genuinely open-ended (in that it examined under-researched teaching approaches of untested efficacy) has not been modified as published study after published study has slowly turned those untested teaching approaches into well-researched and repeatedly demonstrated ones.

So much so, that such studies are now in danger of simply being rhetorical research – where (as in this case) the authors tell readers at the outset that it is already known that what they are going to test is widely shown to be effective good practice. Rhetorical research is set up to produce an expected result, and so is not authentic research. A real experiment tests a genuine hypothesis, rather than demonstrating a commonplace. A question researchers might ask themselves could be:

'how surprised would I be if this leads to a negative outcome'?

If the answer is

'that would be very surprising'

then they should consider modifying their research so it is likely to be more than minimally informative.

Finding out that jigsaw learning achieved learning objectives better/as well as/not so well as, say, P-O-E (predict-observe-explain) activities might be worth knowing: that it is better than deliberately constrained teaching does not tell us very much that is not obvious.

I do think this type of research design is highly questionable and takes unfair advantage of students. It fails to meet my suggested guideline that

  • researchers should never ask learners to participate in a study condition they have good reason to expect will damage their opportunities to learn

The problem of generalisation

Of course, one fair response is that, despite all the claims of the superiority of constructivist, active, cooperative (etc.) learning approaches, the diversity of educational contexts means we cannot simply generalise from an experiment in one context and assume the results apply elsewhere.

Read about generalising from research

That is, the research literature shows us that jigsaw learning is an effective teaching approach, but we cannot be certain it will be effective in the particular context of teaching about chemical and physical changes to sixth grade students in a public elementary school in Izmir, Turkey.

Strictly, that is true! But we should ask:

do we not know this because

  1. research shows great variation in whether jigsaw learning is effective or not, as this differs according to contexts and conditions; or
  2. although jigsaw learning has consistently been shown to be effective in many different contexts, no one has yet tested it in the specific case of teaching about chemical and physical changes to sixth grade students in a public elementary school in Izmir, Turkey?

It seems clear from the paper that the researchers are presenting the second case (in which case the study would actually have been of more interest and importance if it had been found that, in this context, jigsaw learning was not effective).

Given there are very good reasons to expect a positive outcome, there seems no need to 'stack the odds' by using deliberately detrimental control conditions.

Even had situation 1 applied, it seems of limited value to know that jigsaw learning is more effective (in teaching about chemical and physical changes to sixth grade students in a public elementary school in Izmir, Turkey) than an approach we already recognise is suboptimal.

An ethical alternative

This does not mean that there is no value in research that explores well-established teaching approaches in new contexts. However, unless the context is very different from those where the approach has already been widely demonstrated, there is little value in comparing it with approaches that are known to be sub-optimal (which in Turkey, a country where constructivist 'reform' teaching approaches are supposed to be the expected standard, often seem to be labelled 'traditional').

Detailed case studies of the implementation of a reform pedagogy in new contexts that collect rich 'process' data to explore challenges to implementation and to identify especially effective specific practices would surely be more informative? 4

If researchers do feel the need to do experiments, then rather than comparing known-to-be-effective approaches with suboptimal approaches, hoping to demonstrate what everyone already knows, why not use comparison conditions that really test the innovation? Of course jigsaw learning outperformed lecturing in an elementary school – but how might it have compared with another constructivist approach?

I have described the constructivist science teacher as a kind of learning doctor. Like medical doctors, our first tenet should be to do no harm. So, if researchers want to set up experimental comparisons, they have a duty to try to set up two different approaches that they believe are likely to benefit the learners (whichever condition they are assigned to):

  • not one condition that advantages one group of students
  • and another which deliberately disadvantages another group of students for the benefit of a 'positive' research outcome.

If you already know the outcome then it is not genuine research – and you need a better research question.


Work cited:

Note:

1 Imagine teaching one class about acids by jigsaw learning, and teaching another about the nervous system by some other pedagogy – and then comparing the pedagogies by administering a test – about acids! The class in the jigsaw condition might well do better, without it being reasonable to assume this reflects more effective pedagogy.

So, I am tempted to read this reference to the 'traditional science and technology curriculum' (see the research question quoted below) as simply a drafting/typographical error that has been missed, and suspect the authors intended to refer to something like the traditional approach to teaching the science and technology curriculum. Otherwise, the experiment is fatally flawed.

Yet, one purpose of the study was to find out

"Does jigsaw cooperative learning instruction contribute to a better conceptual understanding of 'physical and chemical changes' in sixth grade students compared to the traditional science and technology curriculum?"

Tarhan et al., 2013, p.187

This reads as if the researchers felt the curriculum was not sufficiently matched to what they saw as the most important learning objectives in the topic of physical and chemical changes, so they undertook some curriculum development, as well as designing a teaching unit accordingly, to be taught through jigsaw learning pedagogy. If so, the experiment is testing

traditional curriculum x traditional pedagogy

vs.

reformed curriculum x innovative pedagogy

making it impossible to disentangle the two components.

This suggests the researchers were testing the combination of curriculum and pedagogy, and doing so with a test biased towards the experimental condition. This seems illogical, but I have actually worked on a project where we faced a similar dilemma. In the epiSTEMe project we designed innovative teaching units for lower secondary science and maths. In both physics units we incorporated innovative aspects into the curriculum.

  • In the forces unit, material on proportionality was introduced, with examples (car stopping distances) not normally taught at that grade level (Y7);
  • In the electricity unit, the normal physics content was embedded in an approach designed to teach aspects of the nature of science.

In the forces unit, the end-of-topic test covered material that was included in the project-designed units but unlikely to be taught in the control classes. There was evidence that, on average, students in the project classes did better on the test.

In the electricity unit, the nature of science objectives were not tested, as these would not necessarily have been included in teaching the control classes. On average, there was very little difference in learning about electrical circuits in the two conditions. There was, however, a very wide range of class performances – oddly, just as wide in the experimental condition (where all classes had a common scheme of work, common activities, and common learning materials) as in the control condition, where teachers taught the topic in their customary ways.


2 It could be read either as


Reading 1:

Control: "The control group was instructed via a teacher-centered didactic lecture format. Throughout the lesson, the same science and technology teacher presented the same content as for the experimental group to achieve the same learning objectives, which were taught via detailed instruction in the experimental group. […] During this process, the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes. The instruction was accomplished in the same amount of time as for the experimental group."

Experimental: "…detailed instruction in the experimental group. This instruction included lectures, discussions and problem solving."

or

Reading 2:

Control: "The control group was instructed via a teacher-centered didactic lecture format. Throughout the lesson, the same science and technology teacher presented the same content as for the experimental group to achieve the same learning objectives, which were taught via detailed instruction in the experimental group. This [sic] instruction included lectures, discussions and problem solving. During this process, the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes. The instruction was accomplished in the same amount of time as for the experimental group."

Experimental: "…detailed instruction in the experimental group."

In either reading, the question remains: what was 'this instruction' which included lectures, discussions and problem solving?

3 A class, of course, is not a person, but a collection of people, so perhaps does not have a 'personality' as such. However, for teachers, classes do take on something akin to a personality.

This is not just an impression. It was pointed out above that if a researcher wants to treat each learner as a unit of analysis (necessary to use inferential statistics when only working with a small number of classes) then learners, not intact classes, should be assigned to conditions. However, even a newly formed class will soon develop something akin to a personality. This will certainly be influenced by the individual learners present, but it develops through the history of their evolving mutual interactions, and is not just a function of the sum of their individual characteristics.

So, even when a class is formed by random assignment of learners at the start of a study, it is still strictly questionable whether these students should be seen as independent units for analysis (Taber, 2019).
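To see why this matters for analysis, here is a minimal simulation – my own illustration, not anything from the studies discussed, and with arbitrary numbers. If every student in a class shares some common 'class effect', then a test that treats the students as independent units will report 'significant' differences between two untreated classes far more often than the nominal 5% error rate:

```python
# A hypothetical sketch: one intact class per condition and NO real
# treatment effect, but a shared 'class effect' (class 'personality',
# teacher, timetable slot...) that influences every student in a class.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trials, class_size = 5000, 30
false_positives = 0
for _ in range(trials):
    class_effect_a = rng.normal(0, 1)  # influence shared by all of class A
    class_effect_b = rng.normal(0, 1)  # influence shared by all of class B
    scores_a = class_effect_a + rng.normal(0, 1, class_size)
    scores_b = class_effect_b + rng.normal(0, 1, class_size)
    # Treating the 30 students in each class as independent units:
    if stats.ttest_ind(scores_a, scores_b).pvalue < 0.05:
        false_positives += 1

print(false_positives / trials)  # far above the nominal 0.05
```

With these (arbitrary) settings, well over half of the comparisons come out 'statistically significant' despite there being no treatment effect at all – which is why intact classes are so questionable as sources of independent units for analysis.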


4 I suspect that science educators have a justified high regard for experimental method in the natural sciences, which sometimes blinkers us to its limitations in social contexts where there are myriad interacting variables and limited controls.

Read: Why do natural scientists tend to make poor social scientists?


Delusions of educational impact

A 'peer-reviewed' study claims to improve academic performance by purifying the souls of students suffering from hallucinations


Keith S. Taber


The research design is completely inadequate…the whole paper is confused…the methodology seems incongruous…there is an inconsistency…nowhere is the population of interest actually identified…No explanation of the discrepancy is provided…results of this analysis are not reported…the 'interview' technique used in the study is highly inadequate…There is a conceptual problem here…neither the validity nor reliability can be judged…the statistic could not apply…the result is not reported…approach is completely inappropriate…these tables are not consistent…the evidence is inconclusive…no evidence to demonstrate the assumed mechanism…totally unsupported claims…confusion of recommendations with findings…unwarranted generalisation…the analysis that is provided is useless…the research design is simply inadequate…no control condition…such a conclusion is irresponsible

Some issues missed in peer review for a paper in the European Journal of Education and Pedagogy

An invitation to publish without regard to quality?

I received an email from an open-access journal called the European Journal of Education and Pedagogy, with the subject heading 'Publish Fast and Pay Less', which immediately triggered the thought "another predatory journal?" Predatory journals publish submissions for a fee, but do not offer the editorial and production standards expected of serious research journals. In particular, they publish material which clearly falls short of rigorous research, despite usually claiming to engage in peer review.

A peer reviewed journal?

Checking out the website, I found the usual assurances that the journal used rigorous peer review:

"The process of reviewing is considered critical to establishing a reliable body of research and knowledge. The review process aims to make authors meet the standards of their discipline, and of science in general.

We use a double-blind system for peer-reviewing; both reviewers and authors' identities remain anonymous to each other. The paper will be peer-reviewed by two or three experts; one is an editorial staff and the other two are external reviewers."

https://www.ej-edu.org/index.php/ejedu/about

Peer review is critical to the scientific process. Work is only published in (serious) research journals when it has been scrutinised by experts in the relevant field, and any issues raised responded to in terms of revisions sufficient to satisfy the editor.

I could not find who the editor(-in-chief) was, but the 'editorial team' of European Journal of Education and Pedagogy were listed as

  • Bea Tomsic Amon, University of Ljubljana, Slovenia
  • Chunfang Zhou, University of Southern Denmark, Denmark
  • Gabriel Julien, University of Sheffield, UK
  • Intakhab Khan, King Abdulaziz University, Saudi Arabia
  • Mustafa Kayıhan Erbaş, Aksaray University, Turkey
  • Panagiotis J. Stamatis, University of the Aegean, Greece

I decided to look up the editor listed as based in the UK, where I am also based, but I could not find a web presence for him at the University of Sheffield. Using the ORCID (Open Researcher and Contributor ID) provided on the journal website, I found that his ORCID biography places him at the University of the West Indies and makes no mention of Sheffield.

If the European Journal of Education and Pedagogy is organised like a serious research journal, then each submission is handled by one of this editorial team. However, the reference to "editorial staff" might well imply that, like some other predatory journals I have been approached by (e.g., Are you still with us, Doctor Wu?), the editorial work is actually carried out by office staff, not qualified experts in the field.

That would certainly help explain the publication, in this 'peer-reviewed research journal', of the first paper that piqued my interest enough to motivate me to access and read the text.


The Effects of Using the Tazkiyatun Nafs Module on the Academic Achievement of Students with Hallucinations

The abstract of the paper published in what claims to be a peer-reviewed research journal

The paper initially attracted my attention because it seemed to be about the treatment of a medical condition, so I wondered what it was doing in an education journal. Yet the paper also seemed to be about an intervention to improve academic performance. As I read the paper, I found a number of flaws and issues (some very obvious, some quite serious) that should have been spotted by any qualified reviewer or editor, and which should have indicated that possible publication be deferred until these matters were satisfactorily addressed.

This is especially worrying as this paper makes claims relating to the effective treatment of a symptom of potentially serious, even critical, medical conditions through religious education ("a spiritual approach", p.50): claims that might encourage sufferers to defer seeking medical diagnosis and treatment. Moreover, these are claims that are not supported by any evidence presented in this paper – a paper the editor of the European Journal of Education and Pedagogy decided was suitable for publication.


An overview of what is demonstrated, and what is claimed, in the study.

Limitations of peer review

Peer review is not a perfect process: it relies on busy human beings spending time on additional (unpaid) work, and it is only effective if suitable experts can be found that fit with, and are prepared to review, a submission. It is also generally more challenging in the social sciences than in the natural sciences. 1

That said, one sometimes finds papers published in predatory journals where one would expect any intelligent person with a basic education to notice problems without needing any specialist knowledge at all. The study I discuss here is a case in point.

Purpose of the study

Under the heading 'research objectives', the reader is told,

"In general, this journal [article?] attempts to review the construction and testing of Tazkiyatun Nafs [a Soul Purification intervention] to overcome the problem of hallucinatory disorders in student learning in secondary schools. The general objective of this study is to identify the symptoms of hallucinations caused by subtle beings such as jinn and devils among students who are the cause of disruption in learning as well as find solutions to these problems.

Meanwhile, the specific objective of this study is to determine the effect of the use of Tazkiyatun Nafs module on the academic achievement of students with hallucinations.

To achieve the aims and objectives of the study, the researcher will get answers to the following research questions [sic]:

Is it possible to determine the effect of the use of the Tazkiyatun Nafs module on the academic achievement of students with hallucinations?"

Awang, 2022, p.42

I think I can save readers a lot of time regarding the research question by suggesting that, in this study at least, the answer is no – if only because the research design is completely inadequate to answer the research question. (I should point out that the author comes to the opposite conclusion: e.g., "the approach taken in this study using the Tazkiyatun Nafs module is very suitable for overcoming the problem of this hallucinatory disorder", p.49.)

Indeed, the whole paper is confused in terms of what it is setting out to do, what it actually reports, and what might be concluded. As one example, the general objective of identifying "the symptoms of hallucinations caused by subtle beings such as jinn and devils" (but surely, the hallucinations are the symptoms here?) seems to have been forgotten, or, at least, does not seem to be addressed in the paper. 2


The study assumes that hallucinations are caused by subtle beings such as jinn and devils possessing the students.
(Image by Tünde from Pixabay)

Methodology

So, this seems to be an intervention study.

  • Some students suffer from hallucinations.
  • This is detrimental to their education.
  • It is hypothesised that the hallucinations are caused by supernatural spirits ("subtle beings that lead to hallucinations"), so, a soul purification module might counter this detriment;
  • if so, sufferers engaging with the soul purification module should improve their academic performance;
  • and so the effect of the module is being tested in the study.

Thus we have a kind of experimental study?

No, not according to the author. Indeed, the study only reports data from a small number of unrepresentative individuals with no controls,

"The study design is a case study design that is a qualitative study in nature. This study uses a case study design that is a study that will apply treatment to the study subject to determine the effectiveness of the use of the planned modules and study variables measured many times to obtain accurate and original study results. This study was conducted on hallucination disorders [students suffering from hallucination disorders?] to determine the effectiveness of the Tazkiyatun Nafs module in terms of aspects of student academic achievement."

Awang, 2022, p.42

Case study?

So, the author sees this as a case study. Research methodologies are better understood as clusters of similar approaches rather than unitary categories – but case study is generally seen as naturalistic, rather than involving an intervention by an external researcher. So, case study seems incongruous here. Case study involves the detailed exploration of an instance (of something of interest – a lesson, a school, a course of study, a textbook, …) reported with 'thick description'.

Read about the characteristics of case study research

The case is usually a complex phenomenon which is embedded within a context from which it cannot readily be untangled (for example, a lesson always takes place within a wider context of a teacher working over time with a class on a course of study, within a curricular, and institutional, and wider cultural, context, all of which influence the nature of the specific lesson). So, due to the complex and embedded nature of cases, they are all unique.

"a case study is a study that is full of thoroughness and complex to know and understand an issue or case studied…this case study is used to gain a deep understanding of an issue or situation in depth and to understand the situation of the people who experience it"

Awang, 2022, p.42

A case is usually selected either because that case is of special importance to the researcher (an intrinsic case study – e.g., I studied this school because it is the one I was working in) or because we hope this (unique) case can tell us something about similar (but certainly not identical) other (also unique) cases. In the latter case [sic], an instrumental case study, we are always limited by the extent to which we might expect to be able to generalise beyond the case.

This limited generalisation might suggest we should not work with a single case, but rather look for a suitably representative sample of all cases: but we sometimes choose case study because the complexity of the phenomena suggests we need extensive, detailed data collection and analysis to understand the complexity and subtlety of any case. That is, the compromise we choose is to look at one case in depth, because that will at least give us insight into the case, whereas a survey of many cases will inevitably be too superficial to offer any useful insights.

So how does Awang select the case for this case study?

"This study is a case study of hallucinatory disorders. Therefore, the technique of purposive sampling (purposive sampling [sic]) is chosen so that the selection of the sample can really give a true picture of the information to be explored ….

Among the important steps in a research study is the identification of populations and samples. The large group in which the sample is selected is termed the population. A sample is a small number of the population identified and made the respondents of the study. A case or sample of n = 1 was once used to define a patient with a disease, an object or concept, a jury decision, a community, or a country, a case study involves the collection of data from only one research participant…"

Awang, 2022, p.42

Of course, a case study of "a community, or a country" – or of a school, or a lesson, or a professional development programme, or a school leadership team, or a homework policy, or an enrichment activity, or … – would almost certainly be inadequate if it was limited to "the collection of data from only one research participant"!

I do not think this study actually is "a case study of hallucinatory disorders [sic]". Leaving aside the shift from singular ("a case study") to plural ("disorders"), the research does not investigate a/some hallucinatory disorders, but the effect of a soul purification module on academic performance. (Actually, spoiler alert 😉, it does not actually investigate the effect of a soul purification module on academic performance either, but the author seems to think it does.)

If this is a case study, there should be the selection of a case, not a sample. Sometimes we do sample within a case in case study research, but only from those identified as part of the case. (For example, if the case was a year group in a school, we may not have the resources to interact in depth with several hundred different students.) Perhaps this is pedantry, as the reader likely knows what Awang meant by 'sample' in the paper – but semantics is important in research writing: a sample is chosen to represent a population, whereas the choice of case study is an acknowledgement that generalisation back to a population is not being claimed.

However, if "among the important steps in a research study is the identification of populations" then it is odd that nowhere in the paper is the population of interest actually specified!

Things slip our minds. Perhaps Awang intended to define the population, forgot, and then missed this when checking the text – but, hey, that is just the kind of thing the reviewers and editor are meant to notice! Otherwise, this looks very like including material from standard research texts to pay lip-service to the idea that research design needs to be principled, but without really appreciating what the phrases used actually mean. This impression is also given by the descriptions of how data (for example, from interviews) were analysed – which are not reflected at all in the results section of the paper. (I am not accusing Awang of this; but, because the poor standard of peer review meant the question was never raised, the author is left vulnerable to such an evaluation.)

The only one research participant?

So, what do we know about the "case or sample of n = 1", the "only one research participant" in this study?

"The actual respondents in this case study related to hallucinatory disorders were five high school students. The supportive respondents in the case study related to hallucination disorders were five counseling teachers and five parents or guardians of students who were the actual respondents."

Awang, 2022, p.42

It is certainly not impossible that a case could comprise a group of five people – as long as those five make up a naturally bounded group, that is, a group that a reasonable person would recognise as existing as a coherent entity because they clearly had something in common (they were in the same school class, for example; they were attending the same group therapy session, perhaps; they were a friendship group; they were members of the same extended family diagnosed with hallucinatory disorders… something!). There is no indication here of how these five make up a case.

The identification of the participants as a case might have made sense had the participants collectively undertaken the module as a group, but the reader is told: "This study is in the form of a case study. Each practice and activity in the module are done individually" (p.50). Another justification could have been if the module had been offered in one school, and these five participants were the students enrolled in the programme at that time; but as "analysis of the respondents' academic performance was conducted after the academic data of all respondents were obtained from the respective respondent's school" (p.45), it seems they did not attend a single school.

The results tables and reports in the text refer to "respondent 1" to "respondent 4". In case study, an approach which recognises the individuality and inherent value of the particular case, we would usually assign assumed names to research participants, not numbers. But if we are going to use numbers, should there not be a respondent 5?

The other one research participant?

It seems that there is something odd here.

Both the passage above, and the abstract refer to five respondents. The results report on four. So what is going on? No explanation of the discrepancy is provided. Perhaps:

  • There only ever were four participants, and the author made a mistake in counting.
  • There only ever were four participants, and the author made a typographical mistake (well, strictly, six typographical mistakes) in drafting the paper, and then missed this in checking the manuscript.
  • There were five respondents and the author forgot to include data on respondent 5 purely by accident.
  • There were five respondents, but the author decided not to report on the fifth deliberately for a reason that is not revealed (perhaps the results did not fit with the desired outcome?)

The significant point is not that there is an inconsistency but that this error was missed by peer reviewers and the editor – if there ever was any genuine peer review. This is the kind of mistake that a school child could spot – so, how is it possible that 'expert reviewers' and 'editorial staff' either did not notice it, or did not think it important enough to query?

Research instruments

Another section of the paper reports the instrumentation used in the study.

"The research instruments for this study were Takziyatun Nafs modules, interview questions, and academic document analysis. All these instruments were prepared by the researcher and tested for validity and reliability before being administered to the selected study sample [sic, case?]."

Awang, 2022, p.42

Of course, it is important to test instruments for validity and reliability (or perhaps authenticity and trustworthiness when collecting qualitative data). But it is also important

  • to tell the reader how you did this
  • to report the outcomes

which seems to be missing (apart from in regard to part of the implemented module – see below). That is, the reader of a research study wants evidence, not simply promises. Simply telling readers you did this is a bit like meeting a stranger who tells you that you can trust them because they (say that they) are honest.

Later the reader is told that

"Semi- structured interview questions will be [sic, not 'were'?] developed and validated for the purpose of identifying the causes and effects of hallucinations among these secondary school students…

…this interview process will be [sic, not 'was'] conducted continuously [sic!] with respondents to get a clear and specific picture of the problem of hallucinations and to find the best solution to overcome this disorder using Islamic medical approaches that have been planned in this study

Awang, 2022, pp.43-44

At the very least, this seems to confuse the plan for the research with a report of what was done. (But again, apparently, the reviewers and editorial staff did not think this needed addressing.) This is also confusing as it is not clear how this aspect of the study relates to the intervention. Were the interviews carried out before the intervention, to help inform the design of the modules? (Presumably not, as the interview questions had already been "tested for validity and reliability before being administered to the selected study sample".) Perhaps there are clear and simple answers to such questions – but the reader will not know, because the reviewers and editor did not seem to feel they needed to be posed.

If "Interviews are the main research instrument in this study" (p.43), then one would expect to see examples of the interview schedules – but these are not presented. The paper reports a complex process for analysing interview data, but this is not reflected in the findings reported. The readers is told that the six stage process leads to the identifications and refinement of main and sub-categories. Yet, these categories are not reported in the paper. (But, again, peer reviewers and the editor did not apparently raise this as something to be corrected.) More generally "data  analysis  used  thematic  analysis  methods" (p.44), so why is there no analysis presented in terms of themes? The results of this analysis are simply not reported.

The reader is told that

"This  interview  method…aims to determine the respondents' perspectives, as well as look  at  the  respondents'  thoughts  on  their  views  on  the issues studied in this study."

Awang, 2022, p.44

But there is no discussion of participants' perspectives and views in the findings of the study. 2 Did the peer reviewers and editor not think this needed addressing before publication?

Even more significantly, in a qualitative study where interviews are supposedly the main research instrument, one would expect to see extracts from the interviews presented as part of the findings to support and exemplify claims being made: yet, there are none. (Did this not strike the peer reviewers and editor as odd: presumably they are familiar with the norms of qualitative research?)

The only quotation from the qualitative data (in this 'qualitative' study) I can find appears in the implications section of the paper:

"Are you aware of the importance of education to you? Realize. Is that lesson really important? Important. The success of the student depends on the lessons in school right or not? That's right"

Respondent 3: Awang, 2022, p.49

This seems a little bizarre, if we accept this is, as reported, an utterance from one of the students, Respondent 3. It becomes more sensible if this is actually condensed dialogue:

"Are you aware of the importance of education to you?"

"Realize."

"Is that lesson really important?"

"Important."

"The success of the student depends on the lessons in school right or not?"

"That's right"

It seems the peer review process did not lead to a suggestion that this material be formatted according to the norms for presenting dialogue in scholarly texts, by indicating turns. In any case, if that is typical of the 'interview' technique used in the study, then it is highly inadequate, as clearly the interviewer is leading the respondent, and this is more an example of indoctrination than open-ended enquiry.

Random sampling of data

Completely incongruous with the description of the purposeful selection of the participants for a case study is the account of how the assessment data was selected for analysis:

"The  process  of  analysis  of  student  achievement documents is carried out randomly by taking the results of current  examinations  that  have  passed  such  as the  initial examination of the current year or the year before which is closest  to  the  time  of  the  study."

Awang, 2022, p.44

Did the peer reviewers or editor not question the use of the term random here? It is unclear what is meant by 'random' in this context, but clearly if the analysis really was based on randomly selected data, that would undermine the results.

Validating the soul purification module

There is also a conceptual problem here. The Tazkiyatun Nafs modules are the intervention materials (part of what is being studied) – so they cannot also be research instruments (used to study them). Surely, if the Tazkiyatun Nafs modules had been shown to be valid and reliable before carrying out the reported study, as suggested here, then the study would not be needed to evaluate their effectiveness. But, presumably, expert peer reviewers (if there really were any) did not see an issue here.

The reliability of the intervention module

The Tazkiyatun Nafs modules had three components, and the author reports that the second of the three was subjected to tests of validity and reliability. It seems that Awang thinks this demonstrates the validity and reliability of the complete intervention,

"The second part of this module will go through [sic] the process of obtaining the validity and reliability of the module. Proses [sic] to obtain this validity, a questionnaire was constructed to test the validity of this module. The appointed specialists are psychologists, modern physicians (psychiatrists), religious specialists, and alternative medicine specialists. The validity of the module is identified from the aspects of content, sessions, and activities of the Tazkiyatun Nafs module. While to obtain the value of the reliability coefficient, Cronbach's alpha coefficient method was used. To obtain this Cronbach's alpha coefficient, a pilot test was conducted on 50 students who were randomly selected to test the reliability of this module to be conducted."

Awang, 2022, pp.43-44

Now, to unpack this, it may be helpful to briefly outline what the intervention involved (and, as the paper is open access, anyone can access and read the full details in the report).


From the MGM film 'A Night at the Opera' (1935): "The introduction of the module will elaborate on the introduction, rationale, and objectives of this module introduced"

The description does not start off very helpfully ("The introduction of the module will elaborate on the introduction, rationale, and objectives of this module introduced" (p.43) put me in mind of the Marx brothers: "The party of the first part shall be known in this contract as the party of the first part"), but some key points are,

"the Tazkiyatun Nafs module was constructed to purify the heart of each respondent leading to the healing of hallucinatory disorders. This liver purification process is done in stages…

"the process of cleansing the patient's soul will be done …all the subtle beings in the patient will be expelled and cleaned and the remnants of the subtle beings in the patient will be removed and washed…

The second process is the process of strengthening and the process of purification of the soul or heart of the patient …All the mazmumah (evil qualities) that are in the heart must be discarded…

The third process is the process of enrichment and the process of distillation of the heart and the practices performed. In this process, there will be an evaluation of the practices performed by the patient as well as the process to ensure that the patient is always clean from all the disturbances and disturbances [sic] of subtle beings to ensure that students will always be healthy and clean from such disturbances…

Awang, 2022, p.45, p.43

Quite how this process of exorcising and distilling and cleansing will occur is not entirely clear (and if the soul is equated with the heart, how is the liver involved?), but it seems to involve reflection and prayer and contemplation of scripture – certainly a very personal and therapeutic process.

And yet its validity and reliability were tested by giving a questionnaire to 50 students randomly selected (from the unspecified population, presumably)? No information is given on how a random selection was made (Taber, 2013) – which allows a reader to be very sceptical that this actually was a random sample from the (un?)identified population, and not just an arbitrary sample of 50 students. (So, that is twice the word 'random' is used in the paper when it seems inappropriate.)

It hardly matters here, as clearly neither the validity nor the reliability of a spiritual therapy can be judged from a questionnaire (especially when administered to people who have never undertaken the therapy). In any case, the "reliability coefficient" obtained from an administration of a questionnaire ONLY applies to that sample on that occasion. So, the statistic could not apply to the four participants in the study. And, in any event, the result is not reported, so the reader has no idea what the value of Cronbach's alpha was (but then, this was described as a qualitative study!).

Moreover, Cronbach's alpha only indicates the internal coherence of the items on a scale (Taber, 2019): so, it only indicates whether the set of questions included in the questionnaire seem to be accessing the same underlying construct in motivating the responses of those surveyed across the set of items. It gives no information about the reliability of the instrument (i.e., whether it would give the same results on another occasion).
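For readers unfamiliar with the statistic, the standard formulation makes the point clear (this is my gloss – the paper itself presents no formula). For k items, where σ²_Yi is the variance of responses to item i and σ²_X is the variance of the total scores,

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

Everything in the calculation derives from a single administration to a single sample: nothing in it speaks to whether the instrument would give similar results with other respondents, or on another occasion.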

This approach to testing validity and reliability is then completely inappropriate and unhelpful. So, even if the outcomes of the testing had been reported (and they are not) they would not offer any relevant evidence. Yet it seems that peer reviewers and editor did not think to question why this section was included in the paper.

Ethical issues

A study of this kind raises ethical issues. It may well be that the research was carried out in an entirely proper and ethical manner, but it is usual in studies with human participants ('human subjects') to make this clear in the published report (Taber, 2014b). A standard issue is whether the participants gave voluntary, informed, consent. This would mean that they were given sufficient information about the study at the outset to be able to decide if they wished to participate, and were under no undue pressure to do so. The 'respondents' were school students: if they were considered minors in the research context (and oddly for a 'case study' such basic details as age and gender are not reported) then parental permission would also be needed, again subject to sufficient briefing and no duress.

However, in this specific research there are also further issues due to the nature of the study. The participants were subject to medical disorders, so how did the researcher obtain information about, and access to, the students without medical confidentiality being broken? Who were the 'gatekeepers' who provided access to the children and their personal data? The researcher also obtained assessment data "from the class teacher or from the Student Affairs section of the student's school" (p.44), so it is important to know that students (and parents/guardians) consented to this. Again, peer review does not seem to have identified this as an issue to address before publication.

There is also a major underlying question about the ethics of a study which recognises that these students were (or could be, as details are not provided) suffering from serious medical conditions, yet employs religious education as a treatment ("This method of treatment is to help respondents who suffer from hallucinations caused by demons or subtle beings", p.44). Part of the theoretical framework underpinning the study is the assumption that what is being addressed is "the problem of hallucinations caused by the presence of ethereal beings…" (p.43); however, it is also acknowledged that,

"Hallucinatory disorders in learning that will be emphasized in this study are due to several problems that have been identified in several schools in Malaysia. Such disorders are psychological, environmental, cultural, and sociological disorders. Psychological disorders such as hallucinatory disorders can lead to a more critical effect of bringing a person prone to Schizophrenia. Psychological disorders such as emotional disorders and psychiatric disorders. …Among the causes of emotional disorders among students are the school environment, events in the family, family influence, peer influence, teacher actions, and others."

Awang, 2022, p.41

There seem to be three ways of understanding this apparent discrepancy, which I might gloss:

  1. there are many causes of conditions that involve hallucinations, including, but not only, possession by evil or mischievous spirits;
  2. the conditions that lead to young people having hallucinations may be understood at two complementary levels, at a spiritual level in terms of a need for inner cleansing and exorcising of subtle beings, and in terms of organic disease or conditions triggered by, for example, social and psychological factors;
  3. in the introduction the author has relied on various academic sources to discuss the nature of the phenomenon of students having hallucinations, but he actually has a working assumption that is completely different: hallucinations are due to the presence of jinn or other spirits.

I do not think it is clear which of these positions is being taken by the study's author.

  1. In the first case it would be necessary to identify which causes are present in potential respondents and only recruit those suffering possession for this study (which does not seem to have been done);
  2. In the second case, spiritual treatment would need to complement medical intervention (which would completely undermine the validity of the study as medical treatments for the underlying causes of hallucinations are likely to be the cause of hallucinations ceasing, not the tested intervention);
  3. The third position is clearly problematic in terms of academic scholarship as it is either completely incompetent or deliberately disregards academic norms that require the design of a study to reflect the conceptual framework set out to motivate it.

So, was this tested intervention implemented instead of or alongside formal medical intervention?

  • If it was alongside medical treatment, then that raises a major confound for the study.
  • Yet it would clearly be unacceptable to deny sufferers indicated medical treatment in order to test an educational intervention that is in effect a form of exorcism.

Again, it may be there are simple and adequate responses to these questions (although here I really cannot see what they might be), but unfortunately it seems the journal referees and editor did not think to ask for them.  

Findings


Results tables presented in Awang, 2022 (p.45) [Published with a creative commons licence allowing reproduction]: "Based on the findings stated in Table I show that serial respondents experienced a decline in academic achievement while they face the problem of hallucinations. In contrast to Table II which shows an improvement in students' academic achievement after hallucinatory disorders can be resolved." If we assume that columns in the second table have been mislabelled, then it seems the school performance of these four students suffered while they were suffering hallucinations, but improved once they recovered. From this, we can infer…?

The key findings presented concern academic performance at school. Core results are presented in tables I and II. Unfortunately these tables are not consistent as they report contradictory results for the academic performance of students before and during periods when they had hallucinations.

They can be made consistent if the reader assumes that two of the columns in Table II are mislabelled: that the column labelled 'before disruption' actually reports performance 'during disruption', and that the column actually labelled 'during disruption' is something else. For the results to tell a coherent story, and agree with the author's interpretation, this 'something else' presumably should be 'after disruption'.

This is a very unfortunate error – and moreover one that is obvious to any careful reader. (So, why was it not obvious to the referees and editor?)

As well as looking at these overall scores, other assessment data are presented separately for each of respondent 1 – respondent 4. These sections comprise presentations of information about grades and class positions, mixed with claims about the effects of the intervention. These claims are not based on any evidence, and in many cases are conclusions about 'respondents' in general, even though they are placed in sections considering the academic assessment data of individual respondents. So, there are a number of problems with these claims:

  • they are of the nature of conclusions, but appear in the section presenting the findings;
  • they are about the specific effects of the intervention that the author assumes has influenced academic performance, not the data analysed in these sections;
  • they are completely unsubstantiated as no data or analysis is offered to support them;
  • often they make claims about 'respondents' in general, although as part of the consideration of data from individual learners.

Despite this, the paper passed peer-review and editorial scrutiny.

Rhetorical research?

This paper seems to be an example of a kind of 'rhetorical research' where a researcher is so convinced about their pre-existing theoretical commitments that they simply assume they have demonstrated them. Here the assumptions seem to be:

  1. Recovering from suffering hallucinations will increase student performance
  2. Hallucinations are caused by jinn and devils
  3. A spiritual intervention will expel jinn and devils
  4. So, a spiritual intervention will cure hallucinations
  5. So, a spiritual intervention will increase student performance

The researcher provided a spiritual intervention, and the student performance increased, so it is assumed that the scheme is demonstrated. The data presented are certainly consistent with the scheme, but do not in themselves support it. Awang provides evidence that student performance improved in four individuals after they had received the intervention – but no evidence is offered to demonstrate the assumed mechanism.

A gardener might think that complimenting seedlings will cause them to grow. Perhaps she praises her seedlings every day, and they do indeed grow. Are we persuaded about the efficacy of her method, or might we suspect another cause at work? Would the peer-reviewers and editor of the European Journal of Education and Pedagogy be persuaded this demonstrated that compliments cause plant growth? On the evidence of this paper, perhaps they would.

This is what Awang tells readers about the analysis undertaken:

"Each student respondent involved in this study [sic, presumably not, rather the researcher] will use the analysis of the respondent's performance to determine the effect of hallucination disorders on student achievement in secondary school is accurate.

The elements compared in this analysis are as follows: a) difference in mean percentage of achievement by subject, b) difference in grade achievement by subject and c) difference in the grade of overall student achievement. All academic results of the respondents will be analyzed as well as get the mean of the difference between the performance before, during, and after the respondents experience hallucinations.

These results will be used as research material to determine the accuracy of the use of the Tazkiyatun Nafs Module in solving the problem of hallucinations in school and can improve student achievement in academic school."

Awang, 2022, p.45

There is clearly a large jump between the analysis outlined in the second paragraph here, and testing the study hypotheses as set out in the final paragraph. But the author does not seem to notice this (and more worryingly, nor do the journal's reviewers and editor).

So, interleaved into the account of findings discussing "mean percentage of achievement by subject… difference in grade achievement by subject… difference in the grade of overall student achievement" are totally unsupported claims. Here is an example for Respondent 1:

"Based on the findings of the respondent's achievement in the  grade  for  Respondent  1  while  facing  the  problem  of hallucinations  shows  that  there  is  not  much  decrease  or deterioration  of  the  respondent's  grade.  There  were  only  4 subjects who experienced a decline in grade between before and  during  hallucination  disorder.  The  subjects  that experienced  decline  were  English,  Geography,  CBC, and Civics.  Yet  there  is  one  subject  that  shows  a  very  critical grade change the Civics subject. The decline occurred from grade A to grade E. This shows that Civics education needs to be given serious attention in overcoming this problem of decline. Subjects experiencing this grade drop were subjects involving  emotion,  language,  as  well  as  psychomotor fitness.  In  the  context  of  psychology,  unstable  emotional development  leads  to  a  decline  in the psychomotor  and emotional development of respondents.

After  the  use  of  the  Tazkiyatun  Nafs  module  in overcoming  this  problem,  hallucinatory  disorders  can  be overcome.  This  situation  indicates  the  development  of  the respondents  during  and  after  experiencing  hallucinations after  practicing  the  Tazkiyatun  Nafs  module.  The  process that takes place in the Tzkiyatun Nafs module can help the respondent  to  stabilize  his  emotions  and  psyche  for  the better. From the above findings there were 5 subjects who experienced excellent improvement in grades. The increase occurred in English, Malay, Geography, and Civics subjects. The best improvement is in the subject of Civic education from grade E to grade B. The improvement in this language subject  shows  that  the  respondents'  emotions  have stabilized.  This  situation  is  very  positive  and  needs  to  be continued for other subjects so that respondents continue to excel in academic achievement in school.""

Awang, 2022, p.45 (emphasis added)

The material which I show here as underlined is interjected completely gratuitously. It does not logically fit in the sequence. It is not part of the analysis of school performance. It is not based on any evidence presented in this section. Indeed, nor is it based on any evidence presented anywhere else in the paper!

This pattern is repeated in discussing other aspects of respondents' school performance. Although there is mention of other factors which seem especially pertinent to the dip in school grades ("this was due to the absence of the respondents to school during the day the test was conducted", p.46; "it was an increase from before with no marks due to non-attendance at school", p.46), the discussion of grades is interspersed with (repetitive) claims about the effects of the intervention for which no evidence is offered.


Differences in Respondents' Grade Achievement by Subject:

  • Respondent 1: "After the use of the Tazkiyatun Nafs module in overcoming this problem, hallucinatory disorders can be overcome. This situation indicates the development of the respondents during and after experiencing hallucinations after practicing the Tazkiyatun Nafs module. The process that takes place in the Tzkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better." (p.45)
  • Respondent 2: "After the use of the Tazkiyatun Nafs module as a soul purification module, showing the development of the respondents during and after experiencing hallucination disorders is very good. The process that takes place in the Tzkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better." (p.46)
  • Respondent 3: "The process that takes place in the Tazkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better" (p.46)
  • Respondent 4: "The process that takes place in the Tazkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better." (p.46)

Differences in Respondent Grades according to Overall Academic Achievement:

  • Respondent 1: "Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module… In general, the use of Tazkiyatun Nafs module successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (pp.46-7)
  • Respondent 2: "Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module. … This excellence also shows that the respondents have recovered from hallucinations after practicing the methods found in the Tazkiayatun Nafs module that has been introduced. In general, the use of the Tazkiyatun Nafs module successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (p.47)
  • Respondent 3: "Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module… In general, the use of the Tazkiyatun Nafs module successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (p.47)
  • Respondent 4: "Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module… In general, the use of the Tazkiyatun Nafs module has successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (p.47)
Unsupported claims made within the findings sections reporting analyses of individual students' academic grades. Note (a) how these statements, included in the analysis of individual school performance data from four separate participants (in a case study – a methodology that recognises and values diversity and individuality), are very similar across the participants; and (b) how claims about 'respondents' (plural) are included in the reports of findings from individual students.

Awang summarises what he claims the analysis of 'differences in respondents' grade achievement by subject' shows:

"The use of the Tazkiyatun Nafs module in this study helped the students improve their respective achievement grades. Therefore, this soul purification module should be practiced by every student to help them in stabilizing their soul and emotions and stay away from all the disturbances of the subtle beings that lead to hallucinations"

Awang, 2022, p.46

And, on the next page, Awang summarises what he claims the analysis of 'differences in respondent grades according to overall academic achievement' shows:

"The use of the Tazkiyatun Nafs module in this study helped the students improve their respective overall academic achievement. Therefore, this soul purification module should be practiced by every student to help them in stabilizing the soul and emotions as well as to stay away from all the disturbances of the subtle beings that lead to hallucination disorder."

Awang, 2022, p.47

So, the analysis of grades is said to demonstrate the value of the intervention, and indeed Awang considers this is reason to extend the intervention beyond the four participants, not just to others suffering hallucinations, but to "every student". The peer review process seems not to have raised queries about

  • the unsupported claims,
  • the confusion of recommendations with findings (it is normal to keep to results in a findings section), nor
  • the unwarranted generalisation from four hallucination sufferers to all students, whether healthy or not.

Interpreting the results

There seem to be two stories that can be told about the results:

When the four students suffered hallucinations, this led to a deterioration in their school performance. Later, once they had recovered from the episodes of hallucinations, their school performance improved.  

Narrative 1

Now narrative 1 relies on a very substantial implied assumption – which is that the numbers presented as school performance are comparable over time. So, a control would be useful: such as what happened to the performance scores of other students in the same classes over the same time period. It seems likely they would not have shown the same dip – unless the dip was related to something other than hallucinations – such as the well-recognised dip after long school holidays, or some cultural distraction (a major sports tournament; fasting during Ramadan; political unrest; a pandemic…). Without such a control the evidence is suggestive (after all, being ill, and missing school as a result, is likely to lead to a dip in school performance, so the findings are not surprising), but inconclusive.

Intriguingly, the author tells readers that "student achievement statistics from the beginning of the year to the middle of the current [sic, published in 2022] year in secondary schools in Northern Peninsular Malaysia that have been surveyed by researchers show a decline (Sabri, 2015 [sic])" (p.42), but this is not considered in relation to the findings of the study.

When the four students suffered hallucinations, this led to a deterioration in their school performance. Later, as a result of undergoing the soul purification module, their school performance improved.  

Narrative 2

Clearly narrative 2 suffers from the same limitation as narrative 1. However, it also demands an extra step in making an inference. I could re-write this narrative:

When the four students suffered hallucinations, this led to a deterioration in their school performance. Later, once they had recovered from the episodes of hallucinations, their school performance improved. 
AND
the recovery was due to engagement with the soul purification module.

Narrative 2'.

That is, even if we accept narrative 1 as likely, to accept narrative 2 we would also need to be convinced that:

  • a) sufferers from medical conditions leading to hallucinations do not suffer periodic attacks with periods of remission in between; or
  • b) episodes of hallucinations cannot be due to one-off events (emotional trauma, T.I.A. [transient ischaemic attack, or 'mini-stroke'], …) that resolve naturally in time; or
  • c) sufferers from medical conditions leading to hallucinations do not find they resolve due to maturation; or
  • d) the four participants in this study did not undertake any change in lifestyle (getting more sleep, ceasing eating strange fungi found in the woods) unrelated to the intervention that might have influenced the onset of hallucinations; or
  • e) the four participants in this study did not receive any medical treatment independent of the intervention (e.g., prescribed medication to treat migraine episodes) that might have influenced the onset of hallucinations

Despite this study being supposedly a case study (where the expectation is there should be 'thick description' of the case and its context), there is no information to help us exclude such options. We do not know the medical diagnoses of the conditions causing the participants' hallucinations, or anything about their lives or any medical treatment that may have been administered. Without such information, the analysis that is provided is useless for answering the research question.

In effect, regardless of all the other issues raised, the key problem is that the research design is simply inadequate to test the research question. But it seems the referees and editor did not notice this shortcoming.

Alleged implications of the research

After presenting his results Awang draws various implications, and makes a number of claims about what had been found in the study:

  • "After the students went through the treatment session by using the Tazkiayatun Nafsmodule to treat hallucinations, it showed a positive effect on the student respondents. All this was certified by the expert, the student's parents as well as the  counselor's  teacher." (p.48)
  • "Based on these findings, shows that hallucinations are very disturbing to humans and the appropriate method for now to solve this problem is to use the Tazkiyatun Nafs Module." (p.48)
  • "…the use of the Tazkiyatun Nafs module while the  respondent  is  suffering  from  hallucination  disorder  is very  appropriate…is very helpful to the respondents in restoring their minds and psyche to be calmer and healthier. These changes allow  students  to  focus  on  their  studies  as  well  as  allow them to improve their academic performance better." (p.48)
  • "The use of the Tazkiyatun Nafs Module in this study has led to very positive changes there are attitudes and traits of students  who  face  hallucinations  before.  All  the  negative traits  like  irritability, loneliness,  depression,etc.  can  be overcome  completely." (p.49)
  • "The personality development of students is getting better and perfect with the implementation of the Tazkiaytun Nafs module in their lives." (p.49)
  • "Results  indicate that  students  who  suffer  from  this hallucination  disorder are in  a  state  of  high  depression, inactivity, fatigue, weakness and pain,and insufficient sleep." (p.49)
  • "According  to  the  findings  of  this study,  the  history  of  this  hallucination  disorder  started in primary  school  and  when  a  person  is  in  adolescence,  then this  disorder  becomes  stronger  and  can  cause  various diseases  and  have  various  effects  on  a  person who  is disturbed." (p.50)

Given the range of interview data that Awang claims to have collected and analysed, at least some of the claims here are possibly supported by the data. However, none of this data and analysis is available to the reader. 2 These claims are not supported by any evidence presented in the paper. Yet peer reviewers and the editor who read the manuscript seem to feel it is entirely acceptable to publish such claims in a research paper, and not present any evidence whatsoever.

Summing up

In summary: as far as these four students were concerned (but not, perhaps, the fifth participant?), there did seem to be a relationship between periods of experiencing hallucinations and lower school performance (perhaps explained by such factors as "absenteeism to school during the day the test was conducted", p.46),

"the performance shown by students who face chronic hallucinations is also declining and  declining.  This  is  all  due  to  the  actions  of  students leaving the teacher's learning and teaching sessions as well as  not  attending  school  when  this  hallucinatory  disorder strikes.  This  illness or  disorder  comes  to  the  student suddenly  and  periodically.  Each  time  this  hallucination  disease strikes the student causes the student to have to take school  holidays  for  a  few  days  due  to  pain  or  depression"

Awang, 2022, p.42

However,

  • these four students do not represent any wider population;
  • there is no information about the specific nature, frequency, intensity, etcetera, of the hallucinations or diagnoses in these individuals;
  • there was no statistical test of significance of changes; and
  • there was no control condition to see if performance dips were experienced by others not experiencing hallucinations at the same time.

Once they had recovered from the hallucinations (and it is not clear on what basis that judgement was made) their scores improved.

The author would like us to believe that the relief from the hallucinations was due to the intervention, but this seems to be (quite literally) an act of faith 3 as no actual research evidence is offered to show that the soul purification module actually had any effect. It is of course possible the module did have an effect (whether for the conjectured or other reasons – such as simply offering troubled children some extra study time in a calm and safe environment and special attention – or because of an expectancy effect if the students were told by trusted authority figures that the intervention would lead to the purification of their hearts and the healing of their hallucinatory disorder) but the study, as reported, offers no strong grounds to assume it did have such an effect.

An irresponsible journal

As hallucinations are often symptoms of organic disease affecting blood supply to the brain, there is a major question of whether treating the condition by religious instruction is ethically sound. For example, hallucinations may indicate a tumour growing in the brain. And if the module was only ever offered as a complement to proper medical attention, then a reader may reasonably suspect that any improvement in the condition (and consequent increased engagement in academic work) was entirely unrelated to the module being evaluated.

Indeed, a published research study that claims that soul purification is a suitable treatment for medical conditions presenting with hallucinations is potentially dangerous as it could lead to serious organic disease going untreated. If Awang's recommendations were widely taken up in Malaysia such that students with serious organic conditions were only treated for their hallucinations by soul purification rather than with medication or by surgery it would likely lead to preventable deaths. For a research journal to publish a paper with such a conclusion, where any qualified reviewer or editor could easily see the conclusion is not warranted, is irresponsible.

As the journal website points out,

"The process of reviewing is considered critical to establishing a reliable body of research and knowledge. The review process aims to make authors meet the standards of their discipline, and of science in general."

https://www.ej-edu.org/index.php/ejedu/about

So, why did the European Journal of Education and Pedagogy not subject this submission to meaningful review to help the author of this study meet the standards of the discipline, and of science in general?



Notes:

1 In the natural sciences, there are recognised traditions ('paradigms', 'disciplinary matrices') in any active, mature field at any time. In general (and of course, there will be exceptions):

  • at any historical time, there is a common theoretical perspective underpinning work in a research programme, aligned with specific ontological and epistemological commitments;
  • at any historical time, there is a strong alignment between the active theories in a research programme and the acceptable instrumentation, methodology and analytical conventions.

Put more succinctly: in a mature research field, there is generally broad agreement on how a phenomenon is to be understood, how to go about investigating it, and how to interpret data as research evidence.

This is generally not the case in educational research – which is, in part at least, due to the complexity, and so the multi-layered nature, of the phenomena studied (Taber, 2014a): phenomena such as classroom teaching. So, in reviewing educational papers, it is sometimes necessary to find different experts to look at the theoretical and the methodological aspects of the same submission.


2 The paper is very strange in that the introductory sections and the conclusions and implications sections have a very broad scope, but the actual research results are restricted to a very limited focus: analysis of school test scores and grades.

It is as if (and it could well be that) a dissertation with a number of evidential strands has been reduced to a paper drawing on only one aspect of the research evidence, but with material from other sections of the dissertation retained unchanged from the original broader study.


3 Readers are told that

"All  these  acts depend on the sincerity of the medical researcher or fortune-teller seeking the help of Allah S.W.T to ensure that these methods and means are successful. All success is obtained by the permission of Allah alone"

Awang, 2022, p.43


A case study of educational innovation?

Design and Assessment of an Online Prelab Model in General Chemistry


Keith S. Taber


Case study is meant to be naturalistic – whereas innovation sounds like an intervention. But interventions can be the focus of naturalistic enquiry.

One of the downsides of having spent years teaching research methods is that one cannot help but notice how so much published research departs from the ideal models one offers to students. (Which might be seen as a polite way of saying authors often seem to get key things wrong.) I used to teach that how one labelled one's research was less important than how well one explained it. That is, different people would have somewhat different takes on what is, or is not, grounded theory, case study or action research, but as long as an author explained what they had done, and could adequately justify why, the choice of label for the methodology was of secondary importance.

A science teacher can appreciate this: a student who tells the teacher they are doing a distillation when they are actually carrying out reflux, but clearly explains what they are doing and why, will still be understood (even if the error should be pointed out). On the other hand, if a student has the right label but an alternative conception, this is likely to be a more problematic 'bug' in the teaching-learning system. 1

That said, each type of research strategy has its own particular weaknesses and strengths so describing something as an experiment, or a case study, if it did not actually share the essential characteristics of that strategy, can mislead the reader – and sometimes even mislead the authors such that invalid conclusions are drawn.

A 'case study', that really is a case study

I made reference above to action research, grounded theory, and case study – three methodologies which are commonly name-checked in education research. There are a vast number of papers in the literature with one of these terms in the title, and a good many of them do not report work that clearly fits the claimed approach! 2


The case study was published in the Journal of the Research Center for Educational Technology

So, I was pleased to read an interesting example of a 'case study' that I felt really was a case study (Llorens-Molina, 2009). 'Design and assessment of an online prelab model in general chemistry: A case study' lives up to its title – although I suspect some other authors might have been tempted to describe this research differently.

Is it a bird, is it a plane; no it's…

Llorens-Molina's study included an experimental aspect. A cohort of learners was divided into two groups to allow the researcher to compare two different educational treatments; then measurements were made to compare outcomes quantitatively. That might sound like an experiment. Moreover, this study reported an attempt to innovate in a teaching situation, which gives the work a flavour of action research. Despite this, I agree with Llorens-Molina that the work is best characterised as a case study.

Read about experiments

Read about action research


A case study focuses on 'one instance' from among many


What is a case study?

A case study is an in-depth examination of one instance: one example – of something for which there are many examples. The focus of a case study might be one learner, one teacher, one group of students working together on a task, one class, one school, one course, one examination paper, one text book, one laboratory session, one lesson, one enrichment programme… So, there is great variety in what kind of entity a case study is a study of, but what case studies have in common is they each focus in detail on that one instance.

Read about case study methodology


Characteristics of case study

Case studies are naturalistic studies, which means they are studies of things as they are, not attempts to change things. The case has to be bounded (a reader of a case study learns what is in the case and what is not) but tends to be embedded in a wider context that impacts upon it. That is, the case is entangled in a context from which it could not easily be extracted and still be the same case. (Imagine moving a teacher with her class from their school to have their lesson in a university where it could be observed by researchers – it would not be 'the same lesson' as would have occurred in situ).

The case study is reported in detail, often in a narrative form (not just statistical summaries) – what is sometimes called 'thick description'. Usually several 'slices' of data are collected – often different kinds of data – and often there is a process of 'triangulation' to check the consistency of the account presented in relation to the different slices of data available. Although case studies can include analysis of quantitative data, they are usually seen as interpretive as the richness of data available usually reflects complexity and invites nuance.



Design and Assessment of an Online Prelab Model in General Chemistry

Llorens-Molina's study explored the use of prelabs that are "used to introduce and contextualize laboratory work in learning chemistry" (p.15), and in particular "an alternative prelab model, which consists of an audiovisual tutorial associated with an online test" (p.15).

An innovation

The research investigated an innovation in teaching practice,

"In our habitual practice, a previous lecture at the beginning of each laboratory session, focused almost exclusively on the operational issues, was used. From our teaching experience, we can state that this sort of introductory activity contributes to a "cookbook" way to carry out the laboratory tasks. Furthermore, the lecture takes up valuable time (about half an hour) of each ordinary two-hour session. Given this set-up, the main goal of this research was to design and assess an alternative prelab model, which was designed to enhance the abilities and skills related to an inquiry-type learning environment. Likewise, it would have to allow us to save a significant amount of time in laboratory sessions due to its online nature….

a prelab activity developed …consists of two parts…a digital video recording about a brief tutorial lecture, supported by a slide presentation…[followed by ] an online multiple choice test"

Llorens-Molina, 2009, pp.16-17

Not action research?

The reference to shifting "our habitual practice" indicates this study reports practitioner research. Practitioner studies, such as this, that test a new innovation are often labelled by authors as 'action research'. (Indeed, sometimes, the fact that research is carried out by practitioners looking to improve their own practice is seen as sufficient for action research: when actually this is a necessary, but not a sufficient condition.)

Genuine action research aims at improving practice, not simply seeing if a specific innovation is working. This means action research has an open-ended design, and is cyclical – with iterations of an innovation tested and the outcomes used as feedback to inform changes in the innovation. (Despite this, a surprising number of published studies labelled as action research lack any cyclic element, simply reporting one iteration of an innovation.) Llorens-Molina's study does not have a cyclic design, so would not be well characterised as action research.

An experimental design?

Llorens-Molina reports that the study was motivated by three hypotheses (p.16):

  • "Substituting an initial lecture by an online prelab to save time during laboratory sessions will not have negative repercussions in final examination marks.
  • The suggested online prelab model will improve student autonomy and prerequisite knowledge levels during laboratory work. This can be checked by analyzing the types and quantity of SGQ [student generated questions].
  • Student self-perceptions about prelab activities will be more favourable than those of usual lecture methods."

To test these hypotheses the student cohort was divided into two groups, to be split between the customary and innovative approach. This seems very much like an experiment.

It may be useful here to distinguish between two levels of research design – methodology (akin to strategy) and techniques (akin to tactics). In research design, a methodology is chosen to meet the overall aims of the study, and then one or more research techniques are selected consistent with that methodology (Taber, 2013). Experimental techniques may be included in a range of methodologies, but experiment as an overall methodology has some specific features.

Read about Research design

In a true experiment there is random assignment to conditions, and often there is an intention to generalise results to a wider population considered to be sampled in the study. Llorens-Molina reports that although inferential statistics were used to test the hypotheses, there was no intention to offer statistical generalisation beyond the case. The cohort of students was not assumed to be a sample representing some wider population (such as, say, undergraduates on chemistry courses in Spain) – and, indeed, clearly such an assumption would not have been justified.

Case study is naturalistic – but an innovation is an intervention in practice…

Case study is said to be naturalistic research – it is a method used to understand and explore things as they are, not to bring about change. Yet, here the focus is an innovation. That seems a contradiction. It would be a contradiction if the study was being carried out by external researchers who had asked the teaching team to change practice for the benefits of their study. However, here it is useful to separate out the two roles of teacher and researcher.

This is a situation that I commonly faced when advising graduates preparing for school teaching who were required to carry out a classroom-based study into an aspect of their school placement practice context as part of their university qualification (the Post-Graduate Certificate in Education, P.G.C.E.). Many of these graduates were unfamiliar with research into social phenomena. Science graduates often brought a model of what worked in the laboratory to their thinking about their projects – and had a tendency to think that transferring the experimental approach to classrooms (where there are usually a large number of potentially relevant variables, many of which cannot be controlled) would be straightforward.

Read 'Why do natural scientists tend to make poor social scientists?'

The Cambridge P.G.C.E. teaching team put into place a range of supports to introduce graduates preparing for teaching to the kinds of education research useful for teachers who want to evaluate and improve their own teaching. This included a book written to introduce classroom-based research that drew heavily on analysis of published studies (Taber, 2007; 2013). Part of our advice was that those new to this kind of enquiry might want to consider action research and case study as suitable options for their small-scale projects.


Useful strategies for the novice practitioner-researcher (Figure: diagram used in working with graduates preparing for teaching, from Taber, 2010)

Simplistically, action research might be considered best suited to a project to test an innovation or address a problem (e.g., evaluating a new teaching resource; responding to behavioural issues), and case study best suited to an exploratory study (e.g., what do Y9 students understand about photosynthesis?; what is the nature of peer dialogue during laboratory working in this class?). However, it was often difficult for the graduates to carry out authentic action research as the constraints of the school-based placements seldom allowed them to test successive iterations of the same intervention until they found something like an optimal specification.

Yet, they often were in a good position to undertake a detailed study of one iteration, collecting a range of different data, and so producing a detailed evaluation. That sounds like a case study.

Case study is supposed to be naturalistic – whereas innovation sounds like an intervention. But some interventions in practice can be considered the focus of naturalistic enquiry. My argument was that when a teacher changes the way they do something to try and solve a problem, or simply to find a better way to work, that is a 'natural' part of professional practice. The teacher-researcher, as researcher, is exploring something the fully professional teacher does as a matter of course – seeking to develop practice. After all, our graduates were being asked to undertake research to give them the skills expected to meet professional teaching standards, which

"clearly requires the teacher to have both the procedural knowledge to undertake small-scale classroom enquiry, and 'conceptual frameworks' for thinking about teaching and learning that can provide the basis for evaluating their teaching. In other words, the professional teacher needs both the ability to do her own research and knowledge of what existing research suggests"

Taber, 2013, p.8

So, the research is on something that is naturally occurring in the classroom context, rather than an intervention imported into the context in order to answer an external researcher's questions. A case study of an intervention introduced by practitioners themselves can be naturalistic – even if the person implementing the change is the researcher as well as the teacher.


If a teacher-researcher (qua researcher) wishes to enquire into an innovation introduced by the teacher-researcher (qua teacher) then this can be considered as naturalistic enquiry


The case and the context

In Llorens-Molina's study, the case was a sequence of laboratory activities carried out by a cohort of undergraduates undertaking a course of General and Organic Chemistry as part of an Agricultural Engineering programme. So, the case was bounded (the laboratory part of one taught course) and embedded in a wider context – a degree programme in a specific institution in Spain: the Polytechnic University of Valencia.

The primary purpose of the study was to find out about the specific innovation in the particular course that provided the case. This was then what is known as an intrinsic case study. (When a case is studied primarily as an example of a class of cases, rather than primarily for its own interest, it is called an instrumental case study).

Llorens-Molina recognised that what was found in this specific case, in its particular context, could not be assumed to apply more widely. There can be no statistical generalisation to other courses elsewhere. In case study, the intention is to offer sufficient detail of the case for readers to make judgements of the likely relevance to other contexts of interest (so-called 'reader generalisation').

The published report gives a good deal of information about the course, as well as much information about how data was collected and, equally important, analysed.

Different slices of data

Case study often uses a range of data sources to develop a rounded picture of the case. In this study the identification of three specific hypotheses (less usual in case studies, which often have more open-ended research questions) led to the collection of three different types of data.

  • Students were assessed on each of six laboratory activities. A comparison was made between the prelab condition and the existing approach.
  • Questions asked by students in the laboratories were recorded and analysed to see if the quality/nature of such questions was different in the two conditions. A sophisticated approach was developed to analyse the questions.
  • Students were asked to rate the prelabs through responding to items on a questionnaire.

This approach allowed the author to go beyond simply reporting whether hypotheses were supported by the analysis, to offer a more nuanced discussion around each feature. Such nuance is not only more informative to the reader of a case study, but reflects how the researcher, as practitioner, has an ongoing commitment to further develop practice and not see the study as an end in itself.

Avoiding the 'equivalence' and the 'misuse of control groups' problems

I particularly appreciate a feature of the research design that many educational studies that claim to be experiments could benefit from. To test his hypotheses Llorens-Molina employed two conditions or treatments, the innovation and a comparison condition, and divided the cohort: "A group with 21 students was split into two subgroups, with 10 and 11 in each one, respectively". Llorens-Molina does not suggest this was based on random assignment, which is necessary for a 'true' experiment.

In many such quasi-experiments (where randomisation to condition is not carried out, and is indeed often not possible) the researchers seek to offer evidence of equivalence before the treatments occur. After all, if the two subgroups are different in terms of past subject attainment or motivation or some other relevant factor (or, indeed, if there is no information to allow a judgement regarding whether this is the case or not), no inferences about an intervention can be drawn from any measured differences. (Although that does not always stop researchers from making such claims regardless: e.g., see Lack of control in educational research.)

Another problem is that if learners are participating in research but are assigned to a control or comparison condition then it could be asked if they are just being used as 'data fodder', and would that be fair to them? This is especially so in those cases (so, not this one) where researchers require that the comparison condition is educationally deficient – many published studies report a control condition where schools students have effectively been lectured to, and no discussion work, group work, practical work, digital resources, et cetera, have been allowed, in order to ensure a stark contrast with whatever supposedly innovative pedagogy or resource is being evaluated (Taber, 2019).

These issues are addressed in research designs which have a compensatory structure – in effect the groups switch between being the experimental and comparison condition – as here:

"Both groups carried out the alternative prelab and the previous lecture (traditional practice), alternately. In this way, each subgroup carried out the same number of laboratory activities with either a prelab and previous lecture"

Llorens-Molina, 2009, p.19

This is good practice both from methodological and ethical considerations.
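
To make the counterbalancing concrete, here is a minimal sketch in Python. The activity labels are invented; only the alternation pattern, applied across the six laboratory activities mentioned above, comes from the study:

```python
# A minimal sketch of a compensatory (counterbalanced) allocation, assuming
# the six laboratory activities reported in the study; labels are invented.
activities = [f"lab activity {i}" for i in range(1, 7)]

# Alternate which subgroup receives the online prelab for each activity,
# so each subgroup experiences both treatments equally often.
for i, activity in enumerate(activities):
    if i % 2 == 0:
        a_treatment, b_treatment = "online prelab", "previous lecture"
    else:
        a_treatment, b_treatment = "previous lecture", "online prelab"
    print(f"{activity}: subgroup A -> {a_treatment} | subgroup B -> {b_treatment}")
```

Because each subgroup serves as its own comparison across the sequence, any stable difference between the subgroups affects both conditions equally.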


The study used a compensatory design which avoids the need to ensure both groups are equivalent at the start, and does not disadvantage one group. (Figure from Llorens-Molina, 2009, p.22 – published under a creative commons Attribution-NonCommercial-NoDerivs 3.0 United States license allowing redistribution with attribution)

A case of case study

Do I think this is a model case study that perfectly exemplifies all the claimed characteristics of the methodology? No, and very few studies do. Real research projects, often undertaken in complex contexts with limited resources and intractable constraints, seldom fit such ideal models.

However, unlike some studies labelled as case studies, this study has an explicit bounded case and has been carried out in the spirit of case study that highlights and values the intrinsic worth of individual cases. There is a good deal of detail about aspects of the case. It is in essence a case study, and (unlike what sometimes seems to be the case [sic]) not just called a case study for want of a methodological label. Most educational research studies examine one particular case of something – but (and I do not think this is always appreciated) that does not automatically make them case studies. Because it has been both conceptualised and operationalised as a case study, Llorens-Molina's study is a coherent piece of research.

Given how, in these pages, I have often been motivated to call out studies I have read that I consider have major problems – major enough to be sufficient to undermine the argument for the claimed conclusions of the research – I wanted to recognise a piece of research that I felt offered much to admire.



Notes:

1 I am using language here reflecting a perspective on teaching as being based on a model (whether explicit or not) in the teacher's mind of the learners' current knowledge and understanding and how this will respond to teaching. That expects a great deal of the teacher, so there are often bugs in the system (e.g., the teacher over-estimates prior knowledge) that need to be addressed. This is why being a teacher involves being something of a 'learning doctor'.

Read about the learning doctor perspective on teaching


2 I used to teach sessions introducing each of these methodologies when I taught on an Educational Research course. One of the class activities was to examine published papers claiming the focal methodology, asking students to see if studies matched the supposed characteristics of the strategy. This was a course with students undertaking a very diverse range of research projects, and I encouraged them to apply the analysis to papers selected because they were of particular interest and relevance to their own work. Many examples selected by students proved to offer a poor match between the claimed methodology and the actual research design of the study!

Lack of control in educational research

Getting that sinking feeling on reading published studies


Keith S. Taber


this is like finding that, after a period of watering plant A, it is taller than plant B – when you did not think to check how tall the two plants were before you started watering plant A

Research on prelabs

I was looking for studies which explored the effectiveness of 'prelabs', activities which students are given before entering the laboratory to make sure they are prepared for practical work, and can therefore use their time effectively in the lab. There is much research suggesting that students often learn little from science practical work, in part because of cognitive overload – that is, learners can be so occupied with dealing with the apparatus and materials they have little capacity left to think about the purpose and significance of the work. 1


Okay, so is THIS the pipette?
(Image by PublicDomainPictures from Pixabay)

Approaching a practical work session having already spent time engaging with its purpose and associated theories/models, and already having become familiar with the processes to be followed, should mean students enter the laboratory much better prepared to use their time efficiently, and much better informed to reflect on the wider theoretical context of the work.

I found a Swedish paper (Winberg & Berg, 2007) reporting a pair of studies that tested this idea by using a simulation as a prelab activity for undergraduates about to engage with an acid-base titration. The researchers tested this innovation by comparisons between students who completed the prelab before the titration, and those who did not.

The work used two basic measures:

  • types (sophistication) of questions asked by students during the lab. session
  • elicitation of knowledge in interviews after the laboratory activity

The authors found some differences (between those who had completed the prelab and those who had not) in the sophistication of the questions students asked, and in the quality of the knowledge elicited. They used inferential statistics to suggest at least some of the differences found were statistically significant. From my reading of the paper, these claims were not justified.

A peer reviewed journal (no, really, this time)

This is a paper in a well respected journal (not one of the predatory journals I have often discussed on this site). The Journal of Research in Science Teaching is published by Wiley (a major respected publisher of academic material) and is the official journal of NARST (which used to stand for the National Association for Research in Science Teaching – where 'national' referred to the USA 2). This is a journal that does take peer review very seriously.

The paper is well-written and well-structured. Winberg and Berg set out a conceptual framework for the research that includes a discussion of previous relevant studies. They adopt a theoretical framework based on Perry's model of intellectual development (Taber, 2020). There is considerable detail of how data was collected and analysed. This account is well-argued. (But, you, dear reader, can surely sense a 'but' coming.)

Experimental research into experimental work?

The authors do not seem to explicitly describe their research as an experiment as such (as opposed to adopting some other kind of research strategy such as survey or case study), but the word 'experiment' and variations of it appear in the paper.

For one thing, the authors refer to students' practical work as being experiments,

"Laboratory exercises, especially in higher education contexts, often involve training in several different manipulative skills as well as a high information flow, such as from manuals, instructors, output from the experimental equipment, and so forth. If students do not have prior experiences that help them to sort out significant information or reduce the cognitive effort required to understand what is happening in the experiment, they tend to rely on working strategies that help them simply to cope with the situation; for example, focusing only on issues that are of immediate importance to obtain data for later analysis and reflective thought…"

Winberg & Berg, 2007

Now, some student practical work is experimental, where a student is actively looking to see what happens when they manipulate some variable to test a hypothesis. This type of practical work is sometimes labelled enquiry (or inquiry in US spelling). But a lot of school and university laboratory work is undertaken to learn techniques, or (probably more often) to support the learning of taught theory – where it is usually important the learners know what is meant to happen before they begin the laboratory activity.

Winberg and Berg refer to the 'laboratory exercise' as 'the experiment' as though any laboratory work counts as an experiment. In Winberg and Berg's research, students were asked about their "own [titration] experiment", despite the prelab material involving a simulation of the titration process, in advance of which "the theoretical concepts, ideas, and procedures addressed in the simulation exercise had been treated mainly quantitatively during the preceding 1-week instructional sequence". So, the laboratory titration exercise does not seem to be an experiment in the scientific sense of the term.

School children commonly describe all practical work in the lab as 'doing experiments'. It cannot help students learn what an experiment really is when the word 'experiment' has two quite distinct meanings in the science classroom:

  • experiment (technical sense) = an empirical test of a hypothesis involving the careful control of variables, and observation of the effect on a specified (hypothesised) dependent variable of changing the variable specified as the independent variable
  • experiment (casual sense) = absolutely any practical activity carried out with laboratory equipment

We might describe this second meaning as an alternative conception of 'experiment', a way of understanding that is inconsistent with the scientific meaning. (Just as there are common alternative conceptions of other 'nature of science' concepts such as 'theory').

I would imagine Winberg and Berg were well aware of what an experiment is, although their casual use of language might suggest a lack of rigour in thinking with the term. They refer to having "both control and experiment groups" in their studies, and refer to "the experimental chronology" of their research design. So, they certainly seem to think of their work as a kind of experiment.

Experimental design

In a true experiment, a sample is randomly drawn from a population of interest (say, first year undergraduate chemistry students; or, perhaps, first year undergraduate chemistry students attending Swedish Universities, or… 3) and assigned randomly to the conditions being compared. Providing a genuine form of random assignment is used, then inferential statistical tests can guide on whether any differences found between groups at the end of an experiment should be considered statistically significant. 4

"Statistics can only indicate how likely a measured result would occur by chance (as randomisation of units of analysis to different treatments can only make uneven group composition unlikely, not impossible)…Randomisation cannot ensure equivalence between groups (even if it makes any imbalance just as likely to advantage either condition)"

Taber, 2019, p.73

Inferential statistics can be used to test for statistical significance in experiments – as long as the 'units of analysis' (e.g., students) are randomly assigned to the experimental and control conditions.
(Figure from Taber, 2019)

That is, if there are differences that the statistical tests suggest are very unlikely to occur by chance, then they are very unlikely to be due to an initial difference between the groups in the two conditions – as long as the groups were the result of random assignment. But that is a very important proviso.
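
A minimal simulation may make the proviso vivid. This is a sketch with invented numbers (not a re-analysis of Winberg and Berg's data), showing how a t-test can flag a 'significant' difference that reflects nothing but how the groups were formed:

```python
# Invented numbers only: a demonstration of why random assignment matters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A cohort whose prior attainment varies from student to student.
cohort = rng.normal(loc=50, scale=10, size=60)

# Non-random split: suppose timetabling channels stronger students together.
ordered = np.sort(cohort)
weaker, stronger = ordered[:30], ordered[30:]
print(stats.ttest_ind(stronger, weaker).pvalue)  # tiny p-value, with no intervention at all

# Random assignment: any baseline imbalance is now a matter of chance,
# which is exactly the kind of chance the p-value quantifies.
rng.shuffle(cohort)
print(stats.ttest_ind(cohort[:30], cohort[30:]).pvalue)  # typically unremarkable
```

In the first comparison the 'significant' result is baked in before any treatment is applied, which is precisely the trap discussed below.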

There are two aspects to this need for randomisation:

  • to be able to suggest any differences found reflect the effects of the intervention, then there should be random assignment to the two (or more) conditions
  • to be able to suggest the results reflect what would probably be found in a wider population, the sample should be randomly selected from the population of interest 3

Studies in education seldom meet the requirements for being true experiments
(Figure from Taber, 2019)

In education, it is not always possible to use random assignment, so true experiments are then not possible. However, so-called 'quasi-experiments' may be possible where differences between the outcomes in different conditions may be understood as informative, as long as there is good reason to believe that even without random assignment, the groups assigned to the different conditions are equivalent.

In this specific research, that would mean having good reason to believe that without the intervention (the prelab):

  • students in both groups would have asked overall equivalent (in terms of the analysis undertaken in this study) questions in the lab.;
  • students in both groups would have been judged as displaying overall equivalent subject knowledge.

Often in research where a true experiment is not possible some kind of pre-testing is used to make a case for equivalence between groups.
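
For instance – as a sketch with invented scores, not data from this study – researchers working with intact groups might report pre-test means and a standardised mean difference to argue that the groups started from a similar baseline:

```python
# Invented pre-test scores for two intact groups (illustrative only).
import numpy as np

pre_a = np.array([62.0, 55, 71, 48, 66, 59, 63, 70, 52, 61])
pre_b = np.array([60.0, 58, 69, 50, 64, 57, 65, 68, 54, 60, 63])

# A standardised mean difference (Cohen's d) on the pre-test indicates how
# far apart the groups started, before any treatment could have an effect.
n_a, n_b = len(pre_a), len(pre_b)
pooled_sd = np.sqrt(((n_a - 1) * pre_a.var(ddof=1) + (n_b - 1) * pre_b.var(ddof=1))
                    / (n_a + n_b - 2))
d = (pre_a.mean() - pre_b.mean()) / pooled_sd
print(f"baseline Cohen's d = {d:.2f}")  # values near zero suggest comparable groups
```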

Two control groups that were out of control

In Winberg and Berg's research there were two studies where comparisons were made between 'experimental' and 'control' conditions:

Study | Experimental | Control
Study 1 | n=78: first-year students, following completion of their first chemistry course in 2001 | n=97: students who had been interviewed by the researchers during the same course in the previous year
Study 2 | n=21 (of 58 in cohort) | n=37 (of 58 in the same cohort)

In the first study, a comparison was made between the cohort where the innovation was introduced and a cohort from the previous year. All other things being equal, it seems likely these two cohorts were fairly similar. But in education, all things are seldom equal, so there is no assurance they were similar enough to be considered equivalent.

In the second study

"Students were divided into treatment (n = 21) and control (n = 37) groups. Distribution of students between the treatment and control groups was not controlled by the researchers".

Winberg & Berg, 2007

So, some factor(s) external to the researchers divided the cohort into two groups – and the reader is told nothing about the basis for this, nor even whether the two groups were assigned to the treatments randomly.5 The authors report that the cohort "comprised prospective molecular biologists (31%), biologists (51%), geologists (7%), and students who did not follow any specific program (11%)", and so it is possible the division into two uneven-sized groups was based on timetabling constraints, with students attending chemistry lab sessions according to their availability based on specialism. But that is just a guess. (It is usually better when the reader of a research report is not left to speculate about procedures and constraints.)

What is important for a reader to note is that in these studies:

  • the researchers were not able to assign learners to conditions randomly;
  • nor were the researchers able to offer any evidence of equivalence between groups (such as near identical pre-test scores);
  • so, the requirements for inferring significance from statistical tests were not met;
  • so, claims in the paper about finding statistically significant differences between conditions cannot therefore be justified given the research design;
  • and therefore the conclusions presented in the paper are strictly not valid.

If students are not randomly assigned to conditions, then any statistically unlikely difference found at the end of an experiment cannot be assumed to be likely to be due to intervention, rather than some systematic initial difference between the groups.
(Figure adapted from Taber, 2019)


This is a shame, because this is in many ways an interesting paper, and much thought and care seems to have been taken about the collection and analysis of meaningful data. Yet drawing conclusions from statistical tests comparing groups that might never have been similar in the first place is like finding that careful use of a vernier scale shows that, after a period of watering plant A, plant A is taller than plant B – having been very careful to make sure plant A was watered regularly with carefully controlled volumes, while plant B was not watered at all – when you did not think to check how tall the two plants were before you started watering plant A.

In such a scenario we might be tempted to assume plant A has actually become taller because it had been watered; but that is just applying what we had conjectured should be the case, and we would be mistaking our expectations for experimental evidence.


Notes:

1 The part of the brain where we can consciously manipulate ideas is called the working memory (WM). Research suggests that WM has a very limited capacity, in the sense that people can only hold in mind a very small number of different things at once. (These 'things', however, are somewhat subjective – a complex idea that is treated as a single 'thing' in the WM of an expert can overload a novice.) This limit to WM is considered to be one of the most substantial constraints on effective classroom learning. This is also, then, one of the key research findings informing the design of effective teaching.

Read about working memory

Read about key ideas for teaching in accordance with learning theory

How fat is your memory? – read about a chemical analogy for working memory


2 The organisation has seemingly spotted that the USA is only one part of the world, and now describes itself as a global organisation for improving science education through research.


3 There is no reason why an experiment cannot be carried out on a very specific population, such as first-year undergraduate chemistry students attending a specific Swedish university, say, Umeå University. However, if researchers intend their study to have results generalisable beyond their specific research contexts (say, to first-year undergraduate chemistry students attending any Swedish university) then it is important to have a representative sample of that population.

Read about populations of interest in research

Read about generalisation from research studies


4 It might be assumed that scientists and researchers know what is meant by random, and how to undertake random assignment. Sadly, the literature suggests that in practice the term 'randomly' is sometimes used in research reports to mean something like 'arbitrarily' (Taber, 2013), which falls short of being random.

Read about randomisation in research


5 Arguably, even if the two groups were assigned randomly, there is only one 'unit of analysis' in each condition, as they were assigned as groups. That is, for statistical purposes, the two groups have size n=1 and n=1, which would not allow statistical significance to be found: e.g., see 'Quasi-experiment or crazy experiment?'

What causes the clouds in your coffee?

Of liars, paradoxes, and vanity


Keith S. Taber


the song works wonderfully as a kind of paradox as in a sense
the song can only be about someone whom it is not about

Are your dreams no more than clouds in your coffee?
(Image by kyuubicreeper from Pixabay)

In a popular song of the early 1970s, singer-songwriter Carly Simon reflected on having had some "clouds in my coffee", which is an intriguing reference. If this was meant as an objective observation, then it seems to invite some interpretation. What kinds of things are clouds and coffee such that clouds can be observed in coffee?

Solutions, suspensions and supersaturation

In everyday life clouds are usually observed in the sky, and are due to myriad tiny water droplets. The air always naturally contains some water vapour, and the amount depends on the conditions – air just over a hot sea is likely to have a high 'moisture' content due to the rate of evaporation. If very moist air cools then it may become supersaturated with water vapour, in which case any suitable 'nuclei' will facilitate condensation. (These nuclei may be dust particles for example – but ions can also act as condensation nuclei.)

Today everyone is taught at school about the water cycle which is so essential for life on this planet, by which water is recycled through repeated evaporation/transpiration and condensation and precipitation. (Sadly, in Isaac Newton's day the school curriculum was mostly limited to learning maths and Latin, which was unfortunate – as if he had been taught about the water cycle he might not have felt the need to posit an extraterrestrial explanation for how the seas do not dry up with all that evaporation.)


Newton had a suggestion for how the earth's seas did not dry up
(Images by 1980supra and Gordon Johnson from Pixabay)

Clouds may occur on a smaller scale, such as in cloud chambers used to detect the traces left by alpha or beta radiation. Here, material soaked with a suitable volatile liquid, such as ethanol, is placed in a chamber so that the air becomes saturated, and then, where it cools, supersaturated. An alpha or beta source will emit fast-moving particles that transfer momentum by colliding with molecules in the air, often ionising them. As the alpha or beta particle moves through the chamber it leaves behind it a 'trace' in terms of a trail of ions – in a cloud chamber the alcohol or other supersaturated vapour condenses around these ions giving a visible trail – somewhat like the vapour trails left by jets that are often still visible when the plane is too far away to be seen.


The atmosphere – nature's own cloud chamber


So, what is coffee? I think that depends on how you make it. Assuming you take your coffee black, then if you serve it in a glass, and hold the glass up to the light, it may seem to be transparent. That is, it has a brown colour, but you can see through it to what is behind. If so, that is a solution with various substances in the coffee dissolved in the solvent (hot water). Perhaps you cannot see through your coffee, and if you try shining a torch or laser pen at it you see the beam lighting up its route through the coffee? If so, as well as dissolved material, it also contains suspended particles that are too large to be in solution. You can test this – as long as you do not mind not drinking your coffee. Given enough time, if the glass is undisturbed, the suspended particles will form a sediment at the bottom, and you will be left with a clear solution above. (But your coffee will now be cold.)

Coffee is made in various ways, and whether your coffee is a solution or has both dissolved solute and suspended particles will depend on how finely the coffee solids are filtered in preparing the drink. If you take milk or something similar in your coffee, then you definitely have some suspended particles of fat or oil in there.

So, how are we to understand how clouds can form in coffee? If one had hot coffee which was purely a solution (finely filtered), and was very strong coffee, then perhaps the solution would be saturated with respect to some of the solutes – containing the most that could be dissolved at that temperature. If the coffee cooled, then perhaps it would become a supersaturated solution, and, if suitable nuclei were present (so perhaps not too fine a filter, allowing a few suspended particles?), 'clouds' of precipitating coffee solids would be seen in the solution?

Song-writing as representing a poetic truth

Now, dear reader, you are probably suspecting that I am being an over-literal scientist here, as clearly Carly Simon was writing a song and not making laboratory observations. Surely, it is obvious, that the clouds in her coffee were metaphorical clouds? She is representing how she felt – as she mused over her coffee – she was sad or melancholy or at least reflective.

When released as a single, the record, 'You're so vain', was a big hit in many parts of the world, no doubt in part because it was a very catchy song, but perhaps also in part because of ongoing speculation about WHOM it was Ms. Simon was accusing of vanity. Over the years she has suggested the song is about a composite of three men, and she has acknowledged one of them (the actor Warren Beatty) but speculation has continued. Perhaps if it was released today, a song that includes the line "You had me several years ago when I was still quite naive" might be viewed as reporting something darker than just a failed love affair? But what especially appeals to me about the song is its sense of paradox.


The album including the hit song 'You're so vain' proclaimed 'No secrets' but the precise target(s) of the song have remained a matter of speculation


The liar paradox

There is a famous paradox which was said to have bemused and puzzled some ancients. Imagine meeting someone who tells you:

All Cretans are liars.
I am a Cretan.

I mean no disrespect to the people of Crete, but this is how I understand the paradox was originally framed. We could substitute Venusians or politicians or whatever. A modern version could be

All members of the Bullingdon Club are liars.
I am a member of the Bullingdon Club.

This is supposed to present a paradox. Either the first statement is true – in which case the second is not (someone speaking the truth cannot be one of the liars). Or the second is true – in which case the first is not (a Cretan's claim that all Cretans are liars would itself have to be a lie).

If (and see below) we accept this is a paradox then it has a simple solution. As well as saying things they think are true, and things they think are false, people are also capable of saying things that do not make sense – even to themselves! Not all texts can be considered to have truth value. There is then no paradox, just a lack of consistency!

After all, we can say all kinds of things that do not relate to possible situations, as this pair of mutually inconsistent statements illustrates (see the working after the list):

  • Gas sample A contains 2g of hydrogen at a lower temperature and higher pressure than gas sample B
  • Gas sample B contains 2g of hydrogen that occupies a smaller volume than gas sample A
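To see why those two statements cannot both hold, treat the hydrogen as an ideal gas (a simplifying assumption, but a reasonable one here). Both samples contain the same amount of gas, \(n\), since each holds 2g of hydrogen, so

\[
V = \frac{nRT}{p}\,, \qquad T_A < T_B \ \text{and} \ p_A > p_B \;\Rightarrow\; V_A = \frac{nRT_A}{p_A} < \frac{nRT_B}{p_B} = V_B\,.
\]

Sample A must therefore occupy the smaller volume – and the second statement asserts precisely the opposite.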

Oh, how much easier (if less interesting) life would be if there were a law of nature that meant we could not say or write things that were untrue or not physically possible! Scholars would simply need to sit down and start writing. Anything they were able to produce would be true, and we would not need the expense of CERN and all those other laboratories!

Applying hermeneutics

Now, even though what people say or write need not make good sense, one should be careful about dismissing an apparently nonsensical statement too easily. I know from working with science students – who may hold various alternative conceptions and alternative conceptual frameworks – that they often say things that do not seem to make sense. Certainly, sometimes, this may be because they are confused, or are guessing an answer to a teacher's question without fully thinking it through.

But sometimes what they say makes good sense from their perspective. We only find this out by engaging them in conversation, when it may transpire from the wider context of their talk that they are using a term in a somewhat non-canonical way, or have a different way of dividing up the world, or limit certain principles to too restricted a set of contexts (or apply principles beyond their valid range of application), et cetera. That is, we apply a hermeneutic approach, seeking to understand a statement in terms of the wider 'text'.

Whilst, from a canonical scientific perspective, the student has still got some of the science wrong, it is much more likely a teacher can shift their thinking towards the target knowledge in the curriculum if she recognises that the student's thinking has coherence for the student, and engages with that way of thinking (for example, exploring its limitations, or pointing out where it has absurd or clearly incorrect implications), than if she simply dismisses it as 'wrong'. This, of course, is the basis of the constructivist approach to science teaching.

Read about constructivist pedagogy

Liars, and effective liars

However, even if we take the Cretan's couplet as a paradox, it is not very convincing. A liar is someone who tells lies – not someone who only ever tells lies. A 'good liar' (if that is not an oxymoron – I mean someone good at lying, someone able to use lying to their advantage) presumably does this by being truthful enough of the time that people do not suspect when they are lying. Someone who announced themselves on the telephone with…

"Hi, I'm John. I am a fish. I eat oak trees for breakfast. I am four thousands years old. I used to be Napoleon Bonaparte. I can hold my breath for months at a time. I levitate when I sleep. I am England's greatest goalscorer, even though, as a fish, I do not have any feet. I am phoning from your bank because we are concerned about some suspicious activity on your account, so would like to just check with you on some recent transactions to make sure you authorised them. First of all, because we take customer privacy and security very seriously, I need to be sure who I am talking to, so would you mind giving me your full name, postcode, account number and password."

A very unconvincing scammer

…would be unlikely to be believed. Much better to start with something that is clearly true if you want to sneak in a lie without it being noticed. (The recent demise of a UK Prime Minister perhaps offers an example of how, when you already have a reputation for not telling the truth, people are more likely to suspect, scrutinise and check your claims, and, so, detect dishonest statements.)

Reductio ad absurdum

So, an improvement on the Cretan liar paradox is the card which has a statement on each side:

  • the statement on the other side of this card is true
  • the statement on the other side of this card is a lie

This removes the need to understand 'liar' as someone who only ever tells lies.

If the statement on the first side is correct, then the statement on the other side is true, which means the statement on the first side was a lie, so not correct.

But if the statement on the first side is indeed a lie (as we are informed by the statement on the other side), then the statement on the second side is not true, which means the statement on the first side was not a lie, and is true.

Either way, whichever statement we begin by accepting, we find it is contradicted later. This reflects the method of 'reductio ad absurdum', a technique used to refute a premise by showing it leads to a contradiction.
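One way to set the card out symbolically (writing \(A\) for the statement on the first side and \(B\) for the statement on the other side):

\[
A \leftrightarrow B \quad \text{(\(A\) asserts that \(B\) is true)}
\]
\[
B \leftrightarrow \neg A \quad \text{(\(B\) asserts that \(A\) is a lie)}
\]
\[
\text{so} \quad A \leftrightarrow \neg A\,,
\]

and no assignment of 'true' or 'false' to \(A\) can satisfy that.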

Imagine we wanted to demonstrate that atoms can be divided. Let us posit that atoms are indivisible. This would lead us to conclude that there are no discrete subatomic particles. Yet electrons, alpha particles, neutrons and protons have all been shown to be subatomic particles. Therefore our premise (atoms are indivisible) must be false.
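The argument has the standard reductio shape,

\[
P \Rightarrow Q\,; \qquad \neg Q\,; \qquad \therefore \neg P\,,
\]

where \(P\) is 'atoms are indivisible' and \(Q\) is 'there are no discrete subatomic particles'.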

An even simpler version of the liar paradox is the statement:

  • this statement is a lie

The statement claims to be a lie; but if it is a lie, that means the truth is contrary to what it claims, so (as it claims to be a lie) it is true. But if it is true, then the statement must be correct; so, as it claims to be a lie, it is a lie. So, if true, it is a lie. But then if it is a lie…
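In the same notation as the card above, the statement \(L\) asserts its own falsity,

\[
L \leftrightarrow \neg L\,,
\]

which no assignment of truth values can satisfy.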

Clearly we have self-contradiction. Again, there is no real mystery here – it is simply a clever statement that is neither true nor false, but lacks coherent sense. What is a mystery is who 'is so vain'?

Do the vain think themselves vain?

The hook of the song is the chorus:

You're so vain
You probably think this song is about you
You're so vain (you're so vain)
I bet you think this song is about you
Don't you don't you?

This seems a nice reflection of the Cretan paradox. Carly's ex-lover would have to be very vain to think she would be so obsessed with him that she would write a song about him. So, if he thinks the song is about him, he is indeed 'so vain'.

Except, of course, the song may actually be about him. If an ex-lover whom the song is about thinks it is about him, then is that vanity? Surely not. It is not vanity for someone to acknowledge, say, being a Nobel prize winner, if she is indeed a Nobel laureate. Vanity is thinking you should have won the Nobel that was given to someone else!

The song contains some specific biographical details, such as:

Well I hear you went up to Saratoga
And your horse naturally won
Then you flew your Lear jet up to Nova Scotia
To see the total eclipse of the sun

So, someone hearing the song who had been a lover of Ms. Simon several years earlier, and had been up to Saratoga to watch a horse race which his own horse had won, and had flown his own Lear jet to Nova Scotia to see the total eclipse, surely would have good grounds for feeling this could well be him.

In particular, we might think, if they recognised themselves as being vain! But this is what makes the song delicious lyrically, as surely a vain person does not recognise themselves as vain?

So, if someone thinks the song is about them when it is not, they are vain enough to think an ex-lover would write a song about them. BUT that is not someone the song is actually about – so not the person being accused of being 'so vain'.

If the person who is being written about does not think it is them, then they are presumably not so vain. If they do recognise themselves, then they are justified in doing so – so that is not really evidence of vanity, either!

So, the song works wonderfully as a kind of paradox, as in a sense the song can only be about someone whom it is not about! Did Carly Simon realise that when she wrote the song? I assume so. Does this contribute to its continuing popularity? Perhaps. If you, dear reader, know this song, do you too appreciate this aspect of it? Or, perhaps, most people just sing along with the catchy tune and let the lyrics flow? They are poetry after all, not formal knowledge claims.

Explaining the clouds in the coffee

So, were the clouds in the coffee just meant as a metaphor for how Carly was feeling about the plans she had had during her time with her ex-lover?

Well you said that we made such a pretty pair and that you would never leave
But you gave away the things you loved
And one of them was me
I had some dreams they were clouds in my coffee clouds in my coffee and
You're so vain
You probably think this song is about you

On a number of websites Ms. Simon is quoted as explaining (in 2001) that

"'Clouds in my coffee' are the confusing aspects of life and love. That which you can't see through, and yet seems alluring…until. Like a mirage that turns into a dry patch. Perhaps there is something in the bottom of the coffee cup that you could read if you could (like tea leaves or coffee grinds)"

Carly Simon quoted on a range of websites

However, Carly has also explained she took the line from a comment her friend and pianist Billy Mernit made when they were served coffee on a plane – "As I got my coffee, there were clouds outside the window of the airplane and you could see the reflection in the cup of coffee. Billy said to me, 'Look at the clouds in your coffee. That's like a Truffaut shot!'."

Mernit recalls on his blog that he had actually compared the image to a scene from a Godard film: "what I had talked about was a Godard shot, namely the overhead close-up of a coffee cup from [the film] 2 or 3 Things I Know About Her."


A still from the Jean-Luc Godard film '2 or 3 Things I Know About Her' – clouds? I see galaxies!

Clearly Carly [sic] may have been in a reflective mood, but the clouds that appeared to be in her coffee were due to a different kind of reflection. So, it seems there was a sound physical interpretation, after all.