Can we be sure that fun in the sun alters water chemistry?

Minimalist sampling and experimental variables


Keith S. Taber


Dirty water

I was reading the latest edition of Education in Chemistry and came across an article entitled "Fun in the sun alters water chemistry. How swimming and tubing are linked to concerning rises in water contaminants" (Notman, 2023). This was not an article about teaching, but a report of some recent chemistry research summarised for teachers. [Teaching materials relating to this article can be downloaded from the RSC website.]

I have to admit to not having understood what 'tubing' was (I plead 'age') apart from its everyday sense of referring collectively to tubes, such as those that connect Bunsen burners to gas supplies, and was intrigued by what kinds of tubes were contaminating the water.

The research basically reported on the presence of higher levels of contaminants in the same body of water at Clear Creek, Colorado on a public holiday when many people used the water for recreational pursuits (perhaps even for 'tubing'?) than on a more typical day.

This seems logical enough: more people in the water; more opportunities for various substances to enter the water from them. I have my own special chemical sensor which supports this finding. I go swimming in the local hotel pool, and even though people are supposed to shower before entering the pool: not everyone does (or at least, not effectively). Sometimes one can 'taste' 1 the change when someone gets in the water without washing off perfume or scented soap residue. Indeed, occasionally the water 'tastes' 1 different after people enter the pool area wearing strong perfume, even if they do not use the pool or come into direct contact with the water!

The scientists reported finding various substances they assumed were being excreted 2 by the people using the water – substances such as antihistamines and cocaine – as well as indicators of various sunscreens and cosmetics. (They also found higher levels of "microbes associated with humans", although this was not reported in Education in Chemistry.)


I'm not sure why I bother having a shower BEFORE I go for a swim in there… (Image by sandid from Pixabay)


It makes sense – but is there a convincing case?

Now this all seems very reasonable, as the results fit into a narrative that seems theoretically feasible: a large number of people entering the fresh water of Clear Creek are likely to pollute it sufficiently (if not to rename it Turbid Creek) for detection with the advanced analytical tools available to the modern chemist (including "an inductively coupled plasma mass spectrometer and a liquid chromatography high resolution mass spectrometer").

However, reading on, I was surprised to learn that the sampling in this study was decidedly dodgy.

"The scientists collected water samples during a busy US public holiday in September 2022 and on a quiet weekday afterwards."

I am not sure how this (natural) experiment would rate as a design for a school science investigation. I would certainly have been very critical if any educational research study I had been asked to evaluate relied on sampling like this. Even if large numbers of samples were taken from various places in the water over an extended period during these two days, this procedure has a major flaw: the level of control of other possibly relevant factors is minimal.

Read about control in experimental research

The independent variable is whether the samples were collected on a public holiday when there was much use of the water for leisure, or on a day with much less leisure use. The dependent variables measured were levels of substances in the water that would not be considered part of the pristine natural composition of river water. A reasonable hypothesis is that there would be more contamination when more people were using the water, and that was exactly what was found. But is this enough to draw any strong conclusions?

Considering the counterfactual

A useful test is to ask whether we would have been convinced that people do not contaminate the water had the analysis shown no significant difference between water samples on the two days. That is, to examine a 'counterfactual' situation (one that is not the case, but might have been).

In this counterfactual scenario, would similar levels of detected contaminants be enough to convince us the hypothesis was misguided – or might we look to see if there was some other factor which might explain this unexpected (given how reasonable the hypothesis seems) result and rescue our hypothesis?

Had pollutant levels been equally high on both days, might we have sought ('ad hoc') to explain that through other factors:

  • Maybe it was sunnier on the second day with high U.V. levels which led to more breakdown of organic debris in the river?
  • Perhaps there was a spill of material up-river 3 which masked any effect of the swimmers (and, er, tubers?)
  • Perhaps rainfall between the two sampling dates had increased the flow of the river and raised its level, washing more material into the water?
  • Perhaps the wind direction was different and material was being blown in from nearby agricultural land on the second day.
  • Perhaps the water temperature was different?
  • Perhaps a local industry owner tends to illegally discharge waste into the river when the plant is operating on normal working days?
  • Perhaps spawning season had just started for some species, or some species was emerging from a larval state on the river bed and disturbing the debris on the bottom?
  • Perhaps passing migratory birds were taking the opportunity to land in the water for some respite, and washing off parasites as well as dust.
  • Perhaps a beaver's dam had burst upstream 3 ?
  • Perhaps (for any panspermia fans among readers) an asteroid covered with organic residues had landed in the river?
  • Or…

But: if we might consider some of those factors to potentially explain a lack of effect we were expecting, then we should equally consider them as possible alternative causes for an effect we predicted.

  • Maybe it was sunnier on the first day with high U.V. levels which led to more breakdown of organic debris in the river?
  • Perhaps a local industry owner tends to illegally discharge waste into the river on public holidays because the work force are off site and there will be no one to report this?
  • … etc.

Lack of control of confounding variables

Now, in environmental research, as in research into teaching, we cannot control conditions in the way we can in a laboratory. We cannot ensure that the temperature, wind direction, and biota activity in a river are the same. Indeed, one thing about any natural environment that we can be fairly sure of is that biological activity (and so the substances released by such activity) varies seasonally, and according to changing weather conditions, and in different ways for different species.

So, as in educational research, there are often potentially confounding variables which can undermine our experiments:

In quasi-experiments or natural experiments, a more complex design than simply comparing outcome measures is needed. …this means identifying and measuring any relevant variables. …Often…there are other variables which it is recognised could have an effect, other than the dependent variable: 'confounding' variables.

Taber, 2019, p.85 [Download this article]

independent variable: class of day (busy holiday versus quiet working day)

dependent variables: concentrations of substances and organisms considered to indicate contamination

confounding variables: anything that might feasibly influence the concentrations of substances and organisms considered to indicate contamination – other than the class of day
In a controlled experiment any potential confounding variables are held at fixed levels, but in 'natural experiments' this is not possible

Read about confounding variables in research

Sufficient sampling?

The best we can do to mitigate the lack of control is rigorous sampling. If water samples from a range of days when there was a high level of leisure activity were compared with samples from a range of days when there was a low level of leisure activity, this would be more convincing than just one day from each category. Especially so if these were randomly selected days. It is still possible that factors such as wind direction and water temperature could bias findings, but it becomes less likely – and with random sampling of days it is possible to estimate how likely such chance factors are to have an effect. Then we can at least apply models that suggest whether observed differences in outcomes exceed the level likely due to chance effects.
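The logic of that last step can be illustrated with a minimal sketch (the concentration figures are entirely invented, and this is not the researchers' actual analysis): given several randomly chosen days in each category, a simple permutation test asks how often relabelling the days at random would produce a difference as large as the one observed.

```python
import random

random.seed(1)

# Hypothetical contaminant concentrations (arbitrary units) for randomly
# selected days in each category -- invented numbers, purely illustrative.
busy_days = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5, 13.5, 16.2, 14.4, 15.8]
quiet_days = [11.9, 12.4, 13.1, 11.5, 12.8, 12.0, 13.3, 11.7, 12.5, 12.2]

observed = (sum(busy_days) / len(busy_days)
            - sum(quiet_days) / len(quiet_days))

# Permutation test: if the class of day made no difference, relabelling
# the days at random should often yield a gap as large as the observed one.
pooled = busy_days + quiet_days
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:10]) / 10 - sum(pooled[10:]) / 10
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed:.2f}")
print(f"one-sided p-value: {p_value:.4f}")
```

With only one day sampled in each condition, no such calculation is possible: a single pair of values gives us nothing with which to gauge chance variation.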

Read about sampling in research

I would like to think that any educational study that had this limitation would be questioned in peer review. The Education in Chemistry article cited the original research, although I could not immediately find this. The work does not seem to have been published in a research journal (at least, not yet) but was presented at a conference, and is discussed in a video published by the American Chemical Society on YouTube.

"With Labor Day approaching, many people are preparing to go tubing and swimming at local streams and rivers. These delightful summertime activities seem innocuous, but do they have an impact on these waterways? Today, scientists report preliminary [sic] results from the first holistic study of this question 4, which shows that recreation can alter the chemical and microbial fingerprint of streams, but the environmental and health ramifications are not yet known."

American Chemical Society Meeting Newsroom, 2023

In the video, Noor Hamdan, of Johns Hopkins University, reports that "we are thinking of collecting more samples and doing some more statistical analysis to really, really make sure that humans are significantly impacting a stream".

This seems very wise, as it is only too easy to be satisfied with very limited data when it seems to fit with your expectations. Indeed, that is one of the everyday ways of thinking that science challenges by requiring more rigorous levels of argument and evidence. In the meantime, Noor Hamdan suggests people using the water should use mineral-based rather than organic-based sunscreens, and she "recommend[s] not peeing in rivers". No, I am fairly sure 'tubing' is not meant as a euphemism for that. 5


Work cited:

Notes:


1 Perhaps more correctly, smell, though it is perceived as tasting – most of the flavour we taste in food is due to volatile substances evaporating in the mouth cavity and diffusing to be detected in the nose lining.


2 The largest organ of excretion for humans is the skin. The main mechanism for excreting the detected contaminating substances into the water (if perhaps not the only pertinent one, according to the researchers) was sweating. Physical exertion (such as swimming) tends to be associated with higher levels of sweating. We do not notice ourselves sweating when the sweat evaporates as fast as it is released – nor, of course, when we are immersed in water.


One of those irregular verbs?

I perspire.

You sweat.

She excretes through her skin.

(Image by Sugar from Pixabay)


3 The video suggests that sampling took place both upriver and downriver of the Creek which would offer some level of control for the effect of completely independent influxes into the water – unless they occurred between the sampling points.


4 There seem to be plenty of studies of the effects of water quality on leisure use of waterways: but not on the effects of the recreational use of waterways on their quality.


5 Just in case any readers were also ignorant about this, it apparently refers to using tyre inner tubes (or similar) as flotation devices. This suggests a new line of research. People who float around in inner tubes will tend to sweat less than those actively swimming – but are potentially harmful substances leached from the inner tubes themselves?


Join an email discussion list for those teaching chemistry


Educational experiments – making the best of an unsuitable tool?

Can small-scale experimental investigations of teaching carried out in a couple of arbitrary classrooms really tell us anything about how to teach well?


Keith S. Taber


Undertaking valid educational experiments involves (often, insurmountable) challenges, but perhaps this grid (shown larger below) might be useful for researchers who do want to do genuinely informative experimental studies into teaching?


Applying experimental method to educational questions is a bit like trying to use a precision jeweller's screwdriver to open a tin of paint: you may get the tin open eventually, but you will probably have deformed the tool in the process whilst making something of a mess of the job.


In recent years I seem to have developed something of a religious fervour about educational research studies of the kind that claim to be experimental evaluations of pedagogies, classroom practices, teaching resources, and the like. I think this all started when, having previously largely undertaken interpretive studies (for example, interviewing learners to find out what they knew and understood about science topics) I became part of a team looking to develop, and experimentally evaluate, classroom pedagogy (i.e., the epiSTEMe project).

As a former school science teacher, I had taught learners about the basis of experimental method (e.g., control of variables) and I had read quite a number of educational research studies based on 'experiments', so I was pretty familiar with the challenges of doing experiments in education. But being part of a project which looked to actually carry out such a study made a real impact on me in this regard. Well, that should not be surprising: there is a difference between watching the European Cup Final on the TV, and actually playing in the match, just as reading a review of a concert in the music press is not going to impact you as much as being on stage performing.

Let me be quite clear: the experimental method is of supreme value in the natural sciences; and, even if not all natural science proceeds that way, it deserves to be an important focus of the science curriculum. Even in science, the experimental strategy has its limitations. 1 But experiment is without doubt a precious and powerful tool in physics and chemistry that has helped us learn a great deal about the natural world. (In biology, too, but even here there are additional complications due to the variations within populations of individuals of a single 'kind'.)

But transferring experimental method from the laboratory to the classroom to test hypotheses about teaching is far from straightforward. Most of the published experimental studies drawing conclusions about matters such as effective pedagogy need to be read with substantive and sometimes extensive provisos and caveats; and many of them are simply invalid – they are bad experiments (Taber, 2019). 2

The experiment is a tool that has been designed, and refined, to help us answer questions when:

  • we are dealing with non-sentient entities that are indifferent to outcomes;
  • we are investigating samples or specimens of natural kinds;
  • we can identify all the relevant variables;
  • we can measure the variables of interest;
  • we can control all other variables which could have an effect.

These points simply do not usually apply to classrooms and other learning contexts. 3 (This is clearly so, even if educational researchers often either do not appreciate these differences, or simply pretend they can ignore them.)

Applying experimental method to educational questions is a bit like trying to use a precision jeweller's screwdriver to open a tin of paint: you may get the tin open eventually, but you will probably have deformed the tool in the process whilst making something of a mess of the job.

The reason why experiments are to be preferred to interpretive ('qualitative') studies is that supposedly experiments can lead to definite conclusions (by testing hypotheses), whereas studies that rely on the interpretation of data (such as classroom observations, interviews, analysis of classroom talk, etc.) are at best suggestive. This would be a fair point when an experimental study genuinely met the control-of-variables requirements for being a true experiment – although often, even then, to draw generalisable conclusions that apply to a wide population one has to be confident one is working with a random or representative sample, and use inferential statistics which can only offer a probabilistic conclusion.

My creed…researchers should prefer to undertake competent work

My proselytising about this issue is based on having come to think that:

  • most educational experiments do not fully control relevant variables, so are invalid;
  • educational experiments are usually subject to expectancy effects that can influence outcomes;
  • many (perhaps most) educational experiments have too few independent units of analysis to allow the valid use of inferential statistics;
  • most large-scale educational experiments cannot assure that samples are fully representative of populations, so strictly cannot be generalised;
  • many experiments are rhetorical studies that deliberately compare a condition (supposedly being tested but actually) assumed to be effective with a teaching condition known to fall short of good teaching practice;
  • an invalid experiment tells us nothing that we can rely upon;
  • a detailed case study of a learning context which offers rich description of teaching and learning potentially offers useful insights;
  • given a choice between undertaking a competent study of a kind that can offer useful insights, and undertaking a bad experiment which cannot provide valid conclusions, researchers should prefer to undertake competent work;
  • what makes work scientific is not the choice of methodology per se, but the adoption of a design that fits the research constraints and offers a genuine opportunity for useful learning.

However, experiments seem very popular in education, and often seem to be the methodology of choice for researchers into pedagogy in science education.

Read: Why do natural scientists tend to make poor social scientists?

This fondness of experiments will no doubt continue, so here are some thoughts on how to best draw useful implications from them.

A guide to using experiments to inform education

It seems there are two very important dimensions that can be used to characterise experimental research into teaching – relating to the scale and focus of the research.


Two dimensions used to characterise experimental studies of teaching


Scale of studies

A large-scale study has a large number of 'units of analysis'. So, for example, if the research was testing out the value of using, say, augmented reality in teaching about predator-prey relationships, then in such a study there would need to be a large number of teaching-learning 'units' in the augmented learning condition and a similarly large number of teaching-learning 'units' in the comparison condition. What a unit actually is would vary from study to study. Here a unit might be a sequence of three lessons where a teacher teaches the topic to a class of 15-16 year-old learners (either with, or without, the use of augmented reality).

For units of analysis to be analysed statistically they need to be independent from each other – so different students learning together from the same teacher in the same classroom at the same time are clearly not learning independently of each other. (This seems obvious – but in many published studies this inconvenient fact is ignored as it is 'unhelpful' if researchers wish to use inferential statistics but are only working with a small number of classes. 4)

Read about units of analysis in research

So, a study which compared teaching and learning in two intact classes can usually only be considered to have one unit of analysis in each condition (making statistical tests completely irrelevant 5, though this does not stop them often being applied anyway). There are a great many small-scale studies in the literature where there are only one or a few units in each condition.
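Why does this matter so much? A minimal simulation (all numbers invented) makes the danger concrete: suppose there is no treatment effect at all, but each intact class has its own shared influence (its teacher, room, time of day). Treating the individual students as independent units then produces spurious 'significant' differences far more often than the nominal 5% of the time.

```python
import random
import statistics

random.seed(42)

def two_intact_classes(n_students=30, class_sd=1.0, student_sd=1.0):
    """Two intact classes with NO real treatment effect, but each class
    gets a shared random influence (teacher, room, time of day...)."""
    classes = []
    for _ in range(2):
        class_effect = random.gauss(0, class_sd)  # shared within the class
        classes.append([class_effect + random.gauss(0, student_sd)
                        for _ in range(n_students)])
    return classes

false_positives = 0
trials = 2000
for _ in range(trials):
    a, b = two_intact_classes()
    diff = statistics.mean(a) - statistics.mean(b)
    # Naive standard error that (wrongly) treats all 60 students as
    # independent units of analysis:
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    if abs(diff) / se > 1.96:  # naive 'significance' at the 5% level
        false_positives += 1

print(f"false positive rate: {false_positives / trials:.0%}")
```

With class-level and student-level variation of similar size, the naive test 'detects' a difference in well over half the runs, despite there being nothing to detect – which is why the class, not the student, is the appropriate unit of analysis here.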

Focus of study

The other dimension shown in the figure concerns the focus of a study. By the focus, I mean whether the researchers are interested in teaching and learning in some specific local context, or want to find out about some general population.

Read about what is meant by population in research

Studies may be carried out in a very specific context (e.g., one school; one university programme) or across a wide range of contexts. That seems to simply relate to the scale of the study, just discussed. But by focus I mean whether the research question of interest concerns just a particular teaching and learning context (which may be quite appropriate when practitioner-researchers explore their own professional contexts, for example), or is meant to help us learn about a more general situation.


local focus: Why does school X get such outstanding science examination scores?
general focus: Is there a relationship between teaching pedagogy employed and science examination results in English schools?

local focus: Will jig-saw learning be a productive way to teach my A level class about the properties of the transition elements?
general focus: Is jig-saw learning an effective pedagogy for use in A level chemistry classes?
Some hypothetical research questions relating either to a specific teaching context, or a wider population. (n.b. The research literature includes a great many studies that claim to explore general research questions by collecting data in a single specific context.)

If that seems a subtle distinction between two quite similar dimensions, then it is worth noting that the research literature contains a great many studies that take place in one context (small-scale studies) but which claim (implicitly or explicitly) to be of general relevance. So, many authors, peer reviewers, and editors clearly seem to think one can generalise from such small-scale studies.

Generalisation

Generalisation is the ability to draw general conclusions from specific instances. Natural science does this all the time. If this sample of table salt has the formula NaCl, then all samples of table salt do; if the resistance of this copper wire goes up when the wire is heated the same will be found with other specimens as well. This usually works well when dealing with things we think are 'natural kinds' – that is where all the examples (all samples of NaCl, all pure copper wires) have the same essence.

Read about generalisation in research

Education deals with teachers, classes, lessons, schools…social kinds that lack that kind of equivalence across examples. You can swap any two electrons in a structure and it will make absolutely no difference. Does anyone think you can swap the teachers between two classes and safely assume it will not have an effect?

So, by focus I mean whether the point of the research is to find out about the research context in its own right (context-directed research) or to learn something that applies to a general category of phenomena (theory-directed research).

These two dimensions, then, lead to a model with four quadrants.

Large-scale research to learn about the general case

In the top-right quadrant is research which focuses on the general situation and is larger-scale. In principle 6 this type of research can address a question such as 'is this pedagogy (teaching resource, etc.) generally effective in this population', as long as

  • the samples are representative of the wider population of interest, and
  • those sampled are randomly assigned to conditions, and
  • the number of units supports statistical analysis.

The sleight of hand employed in many studies is to select a convenience sample (two classes of thirteen-year-old students at my local school) yet to claim the research is about, and so offers conclusions about, a wider population (thirteen-year-old learners).

Read about some examples of samples used to investigate populations


When an experiment tests a sample drawn at random from a wider population, then the findings of the experiment can be assumed to (probably) apply (on average) to the population. (Taber, 2019)

Even when a population is properly sampled, it is important not to assume that something which has been found to be generally effective in a population will be effective throughout the population. Schools, classes, courses, learners, topics, etc. vary. If it has been found that, say, teaching the reactivity series through enquiry generally works in the population of English classes of 13-14 year-old students, then a teacher of an English class of 13-14 year-old students might sensibly think this is an approach to adopt, but cannot assume it will be effective in her classroom, with a particular group of students.
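A toy calculation (with wholly invented figures) illustrates the gap between 'effective on average' and 'effective everywhere': even when an intervention helps on average, class-to-class variation can mean it backfires in a substantial minority of contexts.

```python
import random

random.seed(7)

# Invented scenario: the effect of an intervention varies from class to
# class, with an average benefit of +0.3 but a spread (sd 0.5) across
# contexts -- these parameters are made up for illustration only.
effects = [random.gauss(0.3, 0.5) for _ in range(1000)]

mean_effect = sum(effects) / len(effects)
backfires = sum(e < 0 for e in effects) / len(effects)

print(f"average effect across contexts: {mean_effect:+.2f}")
print(f"share of contexts where the effect is negative: {backfires:.0%}")
```

Under these made-up parameters, roughly a quarter of contexts see a negative effect even though the population average is clearly positive – exactly the situation in which a teacher should monitor, and be ready to abandon, a 'generally effective' approach.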

To implement something that has been shown to generally work might be considered research-based teaching, as long as the approach is dropped or modified if indications are it is not proving effective in this particular context. That is, there is nothing (please note, UK Department for Education, and Ofsted) 'research-based' about continuing with a recommended approach in the face of direct empirical evidence that it is not working in your classroom.

Large-scale research to learn about the range of effectiveness

However, even large-scale studies where there are genuinely sufficient units of analysis for statistical analysis may not logically support the kinds of generalisation in the top-right quadrant. For that, researchers need either a random sample of the full population (seldom viable given people and institutions must have a choice to participate or not 7), or a sample which is known to be representative of the population in terms of the relevant characteristics – which means knowing a lot about

  • (i) the population,
  • (ii) the sample, and
  • (iii) which variables might be relevant!

Imagine you wanted to undertake a survey of physics teachers in some national context, and you knew you could not reach all that population so you needed to survey a sample. How could you possibly know that the teachers in your sample were representative of the wider population on whatever variables might potentially be pertinent to the survey (level of qualification?; years of experience?; degree subject?; type of school/college taught in?; gender?…)

But perhaps a large-scale study that attracts a diverse enough sample may still be very useful if it collects sufficient data about the individual units of analysis, and so can begin to look at patterns in how specific local conditions relate to teaching effectiveness. That is, even if the sample cannot be considered representative enough for statistical generalisation to the population, such a study might be able to offer some insights into whether an approach seems to work well in mixed-ability classes, or top sets, or girls' schools, or in areas of high social deprivation, or…

In practice, there are very few experimental research studies which are large-scale, in the sense of having enough different teachers/classes as units of analysis to sit in either of these quadrants of the chart. Educational research is rarely funded at a level that makes this possible. Most researchers are constrained by the available resources to only work with a small number of accessible classes or schools.

So, what use are such studies for producing generalisable results?

Small-scale research to incrementally extend the range of effectiveness

A single small-scale study can contribute to a research programme to explore the range of application of an innovation as if it was part of a large-scale study with a diverse sample. But this means such studies need to be explicitly conceptualised and planned as part of such a programme.

At the moment it is common for research papers to say something like

"…lots of research studies, from all over the place, report that asking students to

(i) first copy science texts omitting all the vowels, and then

(ii) then re-constitute them in full, working from the reduced text and writing it out adding vowels that produce viable words and sentences,

is an effective way of supporting the learning of science concepts; but no one has yet reported testing this pedagogic method when twelve year old students are studying the topic of acids in South Cambridgeshire in a teaching laboratory with moveable stools and West-facing windows.

In this ground-breaking study, we report an experiment to see if this constructivist, active-learning, teaching approach leads to greater science learning among twelve year old students studying the topic of acids in South Cambridgeshire in a teaching laboratory with moveable stools and West-facing windows…"

Over time, the research literature becomes populated with studies of enquiry-based science education, jig-saw learning, use of virtual reality, etc., and these tend to refer to a range of national contexts, variously aged students, diverse science topics, and so on – but this all tends to be piecemeal. A coordinated programme of research could lead to researchers both (a) giving rich descriptions of the contexts used, and (b) selecting contexts strategically to build up a picture across ranges of contexts.

"When there is a series of studies testing the same innovation, it is most useful if collectively they sample in a way that offers maximum information about the potential range of effectiveness of the innovation. There are clearly many factors that may be relevant. It may be useful for replication studies of effective innovations to take place with groups of different socio-economic status, or in different countries with different curriculum contexts, or indeed in countries with different cultural norms (and perhaps very different class sizes; different access to laboratory facilities) and languages of instruction …. It may be useful to test the range of effectiveness of some innovations in terms of the ages of students, or across a range of quite different science topics. Such decisions should be based on theoretical considerations.

Given the large number of potentially relevant variables, there will be a great many combinations of possible sets of replication conditions. A large number of replications giving similar results within a small region of this 'phase space' means each new study adds little to the field. If all existing studies report positive outcomes, then it is most useful to select new samples that are as different as possible from those already tested. …

When existing studies suggest the innovation is effective in some contexts but not others, then the characteristics of samples/context of published studies can be used to guide the selection of new samples/contexts (perhaps those judged as offering intermediate cases) that can help illuminate the boundaries of the range of effectiveness of the innovation."

Taber, 2019
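The strategic selection described in the quotation could, in principle, be made quite mechanical. The sketch below (contexts and features wholly invented, and the dissimilarity measure deliberately crude) picks as the next replication site the candidate context that is furthest, in 'max-min' terms, from those already studied:

```python
# Contexts already reported in a (hypothetical) literature, and candidate
# contexts available to a new research team -- all features are invented.
studied = [
    {"country": "UK", "age": 13, "topic": "acids", "class_size": "small"},
    {"country": "UK", "age": 14, "topic": "acids", "class_size": "small"},
]
candidates = [
    {"country": "UK", "age": 13, "topic": "acids", "class_size": "large"},
    {"country": "Kenya", "age": 16, "topic": "electrolysis",
     "class_size": "large"},
    {"country": "UK", "age": 13, "topic": "bonding", "class_size": "small"},
]

def dissimilarity(c1, c2):
    """Crude measure: the number of features on which two contexts differ."""
    return sum(c1[k] != c2[k] for k in c1)

def most_novel(candidates, studied):
    """Greedy 'max-min' choice: the candidate whose nearest already-studied
    context is as far away as possible."""
    return max(candidates,
               key=lambda c: min(dissimilarity(c, s) for s in studied))

print(most_novel(candidates, studied)["country"])  # -> Kenya
```

Here the Kenyan electrolysis class differs from every studied context on all four features, so it adds more information to the 'phase space' than another minor variation on the UK acids studies. Real decisions would, of course, weight the features on theoretical grounds, as the quotation notes.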

Not that the research programme would be co-ordinated by a central agency or authority, but by each contributing researcher/research team (i) taking into account the 'state of play' at the start of their research; (ii) making strategic decisions accordingly when selecting contexts for their own work; (iii) reporting the context in enough detail to allow later researchers to see how that study fits into the ongoing programme.

This has to be a more scientific approach than simply picking a convenient context where researchers expect something to work well; undertaking a small-scale local experiment (perhaps setting up a substandard control condition to be sure of a positive outcome); and then reporting along the lines of "this widely demonstrated effective pedagogy works here too" – or, if it does not, perhaps putting the study aside without publication. As the philosopher of science, Karl Popper, reminded us, science proceeds through the testing of bold conjectures: an 'experiment' where you already know the outcome is actually a demonstration. Demonstrations are useful in teaching, but do not contribute to research. What can contribute is an experiment in a context where there is reason to be unsure if an innovation will be an improvement or not, and where the comparison reflects good teaching practice to offer a meaningful test.

Small-scale research to inform local practice

Now, I would be the first to admit that I am not optimistic that such an approach will be developed by researchers; and even if it is, it will take time for useful patterns to arise that offer genuine insights into the range of convenience of different pedagogies.

Does this mean that small-scale studies in a single context are really a waste of research resources, and an unmerited inconvenience for those working in such contexts?

Well, I have time for studies in my final (bottom left) quadrant. Given that schools, classrooms, teachers and classes all vary considerably, and that what works well in a highly selective boys-only fee-paying school with a class size of 16 may not be as effective in a co-educational class of 32 mixed-ability students in an under-resourced school in an area of social deprivation (and vice versa, of course!), there is often value in testing out ideas (even recommended 'research-based' ones) in specific contexts to inform practice in that context. These are likely to be genuine experiments, as the investigators are really motivated to find out what can improve practice in that context.

Often such experiments will not get published,

  • perhaps because the researchers are teachers with higher priorities than writing for publication;
  • perhaps because it is assumed such local studies are not generalisable (but they could sometimes be moved into the previous category if suitably conceptualised and reported);
  • perhaps because the investigators have not sought the permissions required for publication (part of the ethics of research) – permissions that are usually not needed when teachers trial innovations to improve practice as part of their professional work;
  • perhaps because it has been judged inappropriate to set up control conditions which are not expected to be of benefit to those being asked to participate;
  • but also because when trying out something new in a classroom, one needs to be open to making ad hoc modifications to, or even abandoning, an innovation if it seems to be having a deleterious effect.

Evaluation of effectiveness here usually comes down to professional judgement, which might, in part, rely on the researcher's close (and partially tacit) familiarity with the research context – rather than statistical testing, which assumes a large random sample of a population, and which is too often used to invalidly generalise small, non-random, local results to that population.

I am here describing 'action research', which is highly useful for informing local practice, but which is not ideally suited for formal reporting in academic journals.

Read about action research

So, I suspect there may be an irony here.

There may be a great many small-scale experiments undertaken in schools and colleges which inform good teaching practice in their contexts, without ever being widely reported; whilst there are a great many similar scale, often 'forced' experiments, carried out by visiting researchers with little personal stake in the research context, reporting the general effectiveness of teaching approaches, based on misuse of statistics. I wonder which approach best reflects the true spirit of science?

Source cited:


Notes:

1 For example:

Even in the natural sciences, we can never be absolutely sure that we have controlled all relevant variables (after all, if we already knew for sure which variables were relevant, we would not need to do the research). But usually existing theory gives us a pretty good idea what we need to control.

Experiments are never a simple test of the specified hypothesis, as the experiment is likely to depend upon the theory of instrumentation and the quality of instruments. Consider an extreme case such as the discovery of the Higgs boson at CERN: the conclusions relied on complex theory that informed the design of the apparatus, and very challenging precision engineering, as well as complex mathematical models for interpreting data, and corresponding computer software specifically programmed to carry out that analysis.

The experimental results are a test of a hypothesis (e.g., that a certain particle would be found at events below some calculated energy level) subject to the provisos that

  • the theory of the instrument and its design is correct; and
  • the materials of the apparatus (an apparatus as complex and extensive as a small city) have no serious flaws; and
  • the construction of the instrumentation precisely matches the specifications; and
  • the modelling of how the detectors will function (including the decay in their performance over time) is accurate; and
  • the analytical techniques designed to interpret the signals are valid; and
  • the programming of the computers carries out the analysis as intended.

It almost requires an act of faith to have confidence in all this (and I am confident there is no one scientist anywhere in the world who has a good enough understanding of, and familiarity with, all these aspects of the experiment to be able to give assurances on all these areas!).


CREST {Critical Reading of Empirical Studies} evaluation form: when you read a research study, do you consider the cumulative effects of doubts you may have about different aspects of the work?

I would hope at least that as professional scientists and engineers they might be a little more aware of this complex chain of argumentation needed to support robust conclusions than many students – for students often seem to be overconfident in the overall value of research conclusions given any doubts they may have about aspects of the work reported.

Read about the Critical Reading of Empirical Studies Tool


Galileo Galilei was one of the first people to apply the telescope to study the night sky (image by Dorothe from Pixabay)


A historical example is Galileo's observations of astronomical phenomena such as Jovian moons (he spotted the four largest: Io, Europa, Ganymede and Callisto) and the irregular surface of the moon. Some of his contemporaries rejected these findings on the basis that they were made using an apparatus, the newfangled telescope, that they did not trust. Whilst this is now widely seen as being arrogant and/or ignorant, arguably if you did not understand how a telescope could magnify, and you did not trust the quality of the lenses not to produce distortions, then it was quite reasonable to be sceptical of findings which were counter to a theory of the 'heavens' that had been generally accepted for many centuries.


2 I have discussed a number of examples on this site. For example:

Falsifying research conclusions: You do not need to falsify your results if you are happy to draw conclusions contrary to the outcome of your data analysis.

Why ask teachers to 'transmit' knowledge…if you believe that "knowledge is constructed in the minds of students"?

Shock result: more study time leads to higher test scores (But 'all other things' are seldom equal)

Experimental pot calls the research kettle black: Do not enquire as I do, enquire as I tell you

Lack of control in educational research: Getting that sinking feeling on reading published studies


3 For a detailed discussion of these and other challenges of doing educational experiments, see Taber, 2019.


4 Consider these two situations.

A researcher wants to find out if a new textbook, 'Science for the modern age', leads to more learning among the Grade 10 students she teaches than the traditional book, 'Principles of the natural world'. Imagine there are fifty Grade 10 students already divided into two classes. The teacher flips a coin and randomly assigns one of the classes to the innovative book, the other being assigned the traditional book by default. We will assume she has a suitable test to assess each student's learning at the end of the experiment.

The teacher teaches the two classes the same curriculum through the same scheme of work. She presents a mini-lecture to a class, then sets them some questions to discuss using the textbook. At the end of the (three-part!) lesson, she leads a class discussion drawing on students' suggested answers.

Being a science teacher, who believes in replication, she decides to repeat the exercise the following year. Unfortunately there is a pandemic, and all the students are sent into lock-down at home. So, the teacher assigns the fifty students by lot into two groups, and emails one group the traditional book, and the other the innovative text. She teaches all the students online as one cohort: each lesson giving them a mini-lecture, then setting them some reading from their (assigned) book, and a set of questions to work through using the text, asking them to upload their individual answers for her to see.

With regard to experimental method, in the first cohort she has only two independent units of analysis – so she may note that the average outcome scores are higher in one group, but cannot read too much into that. However, in the second year, the fifty students can be considered to be learning independently, and as they have been randomly assigned to conditions, she can treat the assessment scores as being from 25 units of analysis in each condition (and so may sensibly apply statistics to see if there is a statistically significant difference in outcomes).
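The contrast between the two years can be made concrete in a short sketch (in Python; the class names, roster and seed are all invented purely for illustration). Whatever is randomised – intact classes or individual students – sets how many independent observations the analysis really has:

```python
# A minimal sketch of the two assignment procedures in the example above.
# Year 1: a coin flip randomises intact classes, so there are only two
# independent units of analysis, however many students sit the test.
# Year 2: students learning at home are assigned individually by lot,
# giving fifty independent units.
import random

random.seed(2023)  # arbitrary seed, so the illustration is reproducible

# Year 1: whole classes are the units of assignment
classes = ["Class A", "Class B"]
innovative_class = random.choice(classes)  # the coin flip
units_year1 = len(classes)

# Year 2: individual students are the units of assignment
students = [f"student_{i:02d}" for i in range(50)]
random.shuffle(students)
innovative_group, traditional_group = students[:25], students[25:]
units_year2 = len(students)

print(units_year1, units_year2)  # → 2 50
```

However the outcome scores are later analysed, it is the randomised unit – class or student – that determines the effective sample size.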


5 Inferential statistical tests are usually used to see if the difference in outcomes across conditions is 'significant'. Perhaps the average score in a class with an innovation is 5.6, compared with an average score in the control class of 5.1. The average score is higher in the experimental condition, but is the difference enough to matter?

Well, actually, if the question is whether the difference is big enough to be likely to make a difference in practice, then researchers should calculate the 'effect size', which will suggest whether the difference found should be considered small, moderate or large. This should ideally be calculated regardless of whether inferential statistics are being used or not.
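For the 5.6 versus 5.1 example above, the most common effect size measure is Cohen's d: the difference between means expressed in units of pooled standard deviation. The text gives only the means, so in this sketch the class sizes and standard deviations are assumed purely for illustration:

```python
# Cohen's d for the 5.6 vs 5.1 example in the text. The group sizes (25)
# and standard deviations (1.2) are assumed for illustration; by convention
# d of about 0.2 is 'small', about 0.5 'moderate', about 0.8 'large'.
from math import sqrt

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Difference between means in units of pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)

d = cohens_d(5.6, 1.2, 25, 5.1, 1.2, 25)
print(round(d, 2))  # → 0.42, a small-to-moderate effect
```

Note that this calculation needs no inferential machinery: it describes the size of the difference whether or not any significance test would be valid.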

Inferential statistical tests are often used to see if the result is generalisable to the wider population – but, as suggested above, this is strictly only valid if the population of interest has been randomly sampled – which virtually never happens in educational studies, as it is usually not feasible.

Often researchers will still do the calculation, based on the sets of outcome scores in the two conditions, to see if they can claim a statistically significant difference – but the test will only suggest how likely or unlikely the difference between the outcomes is, if the units of analysis have been randomly assigned to the conditions. So, if there are 50 learners each randomly assigned to the experimental or control condition, this makes sense. That is sometimes the case, but nearly always the researchers work with existing classes and do not have the option of randomly mixing the students up. [See the example in the previous note 4.] In such a situation, the stats are not informative. (That does not stop them often being reported in published accounts as if they are useful.)
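The logic at stake can be shown with a small randomisation test (a sketch; the scores are invented for illustration). When individuals really have been assigned by lot, one can ask how often a chance re-shuffling of the very same scores would produce a gap at least as large as the one observed – which is exactly what such a test estimates, and why it tells us nothing when intact classes were the real units of assignment:

```python
# A sketch of what the inferential test is estimating when learners are
# individually randomised to conditions: how often would a chance
# re-assignment of the same scores produce a gap at least as large as
# the one observed? (Hypothetical scores, ten per condition.)
import random

random.seed(7)  # arbitrary seed, so the illustration is reproducible

experimental = [6, 7, 5, 8, 6, 7, 6, 5, 7, 8]  # hypothetical outcome scores
control      = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5]

obs_gap = sum(experimental) - sum(control)  # total-score gap: 65 - 51 = 14
pooled = experimental + control

extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)           # one chance re-assignment to conditions
    a, b = pooled[:10], pooled[10:]
    if sum(a) - sum(b) >= obs_gap:   # gap at least as large as observed?
        extreme += 1

p_value = extreme / trials  # small p: the gap is unlikely under chance alone
print(obs_gap / 10, p_value)  # mean gap of 1.4 marks, and the estimated p
```

The estimate is only meaningful because the simulation mirrors the actual assignment procedure; if whole classes were assigned, there would be just two units to shuffle, and no amount of within-class data would change that.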


6 That is, if it is possible to address such complications as participant expectations, and equitable teacher familiarity with the different conditions to which they are assigned (Taber, 2019).

Read about expectancy effects


7 A usual ethical expectation is that participants voluntarily (without duress) offer informed consent to participate.

Read about voluntary informed consent


Is your heart in the research?

Someone else's research, that is


Keith S. Taber


Imagine you have a painful and debilitating illness. Your specialist tells you there is no conventional treatment known to help. However, there is a new – experimental – procedure: a surgery that may offer relief. But it has not yet been fully tested. If you are prepared to sign up for a study to evaluate this new procedure, then you can undergo surgery.

You are put under and wheeled into the operating theatre. Whilst you experience – rather, do not experience – the deep, sleepless rest of anaesthesia, the surgeon saws through your breastbone, prises open your ribcage with a retractor (hopefully avoiding breaking any ribs), reaches in, and gently lifts up your heart.

The surgeon pauses, perhaps counts to five, then carefully replaces your heart between the lungs. The ribcage is closed, and you are sewn up without any actual medical intervention. You had been randomly assigned to the control group.


How can we test whether surgical interventions are really effective without blind controls?

Is it right to carry out sham operations on sick people just for the sake of research?

Where is the balance of interests?

(Image from Pixabay)


Research ethics

A key aspect of planning, executing and reviewing research is ethical scrutiny. Planning, obviously, needs to take into account ethical considerations and guidelines. But even the best laid plans 'of mice and men' (or, of, say, people investigating mice) may not allow for all eventualities (after all, if we knew what was going to happen for sure in a study, it would not be research – and it would be unethical to spend precious public resources on the study), so the ethical imperative does not stop once we have got approval and permissions. And even then, we may find that we cannot fully mitigate unexpected eventualities – which is something to be reported and discussed to help inform future research.

Read about research ethics

When preparing students setting out on research, instruction about research ethics is vital. It is possible to teach about rules, and policies, and guidelines and procedures – but real research contexts are often complex, and ethical thinking cannot be algorithmic or a matter of adopting slogans and following heuristics. In my teaching I would include discussion of past cases of research studies that raised ethical questions for students to discuss and consider.

One might think that as research ethics is so important, it would be difficult to find many published studies which were not exemplars of good practice – but attitudes to, and guidance on, ethics have developed over time, and there are many past studies which, if not clearly unethical in today's terms, at least present problematic cases. (That is without the 'doublethink' that allows some contemporary researchers to, in a single paper, both claim active learning methods should be studied because it is known that passive learning activities are not effective, yet then report how they required teachers to instruct classes through passive learning to act as control groups.)

Indeed, ethical decision-making may not always be straight-forward – as it often means balancing different considerations, and at a point where any hoped-for potential benefits of the research must remain uncertain.

Pretending to operate on ill patients

I recently came across an example of a medical study which I thought raised some serious questions, and which I might well have included in my teaching of research ethics as a case for discussion, had I known about it before I retired.

The research apparently involved surgeons opening up a patient's ribcage (not a trivial procedure), and lifting out the person's heart in order to carry out a surgical intervention…or not,

"In the late 1950s and early 60s two different surgical teams, one in Kansas City and one in Seattle, did double-blind trials of a ligation procedure – the closing of a duct or tube using a clip – for very ill patients suffering from severe angina, a condition in which pain radiates from the chest to the outer extremities as a result of poor blood supply to the heart. The surgeons were not told until they arrived in the operating theatre which patients were to receive a real ligation and which were not. All the patients, whether or not they were getting the procedure, had their chest cracked open and their heart lifted out. But only half the patients actually had their arteries rerouted so that their blood could more efficiently bathe its pump …"

Slater, 2018

The quote is taken from a book by Lauren Slater which sets out a history of drug use in psychiatry. Slater is a psychotherapist who has written a number of books about aspects of mental health conditions and treatments.

Fair testing

In order to make a fair experiment, the double-blind procedure sought to treat the treatment and control groups the same in all respects, apart from the actual procedure of ligation of selected blood vessels that comprised the mooted intervention. The patients did not know (at least, in one of the studies) that they might not have the real operation. Their physicians were not told who was getting the treatment. Even the surgeons only found out who was in each group when the patient arrived in theatre.

It was necessary for those in the control group to think they were having an intervention, and to undergo the sham surgery, so that they formed a fair comparison with those who got the ligation.

Read about control of variables

It was necessary to have double-blind study (neither the patients themselves, nor the physicians looking after them, were told which patients were, and which were not, getting the treatment), because there is a great deal of research which shows that people's beliefs and expectations make substantial differences to outcomes. This is a real problem in educational research when researchers want to test classroom practices such as new teaching schemes or resources or innovative pedagogies (Taber, 2019). The teacher almost certainly knows whether she is teaching the experimental or control group, and usually the students have a pretty good idea. (If every previous lesson has been based on teacher presentations and note-taking, and suddenly they are doing group discussion work and making videos, they are likely to notice.)

Read about expectancy effects

It was important to undertake a study, because there was not clear objective evidence to show whether the new procedure actually improved patient outcomes (or possibly even made matters worse). Doctors reported seeing treated patients do better – but could only guess how they might have done without surgery. Without proper studies, many thousands of people might ultimately undergo an ineffective surgery, with all the associated risks and costs, without getting any benefit.

Simply comparing treated patients with matched untreated patients would not do the job, as there can be a strong placebo effect of believing one is getting a treatment. (It is likely that at least some alternative therapies largely work because a practitioner with good social skills spends time engaging with the patient and their concerns, and the client expects a positive outcome.)

If any positive effects of heart surgery were due to the placebo effect, then perhaps a highly coloured sugar pill prescribed with confidence by a physician could have the same effect without operating theatres, surgical teams, hospital stays… (For that matter, a faith healer who pretended to operate without actually breaking the skin, and revealed a piece of material {perhaps concealed in a pocket or sleeve} presented as an extracted mass of diseased tissue or a foreign body, would be just as effective if the patient believed in the procedure.)

So, I understood the logic here.

Do no harm

All the same – this seemed an extreme intervention. Even today, anaesthesia is not very well understood in detail: it involves giving a patient drugs that could kill them in carefully controlled sub-lethal doses – when how much would actually be lethal (and what would be insufficient to fully sedate) varies from person to person. There are always risks involved.


"All the patients, whether or not they were getting the procedure had their chest cracked open and their heart lifted out."

(Image by Starllyte from Pixabay)


Open heart surgery exposes someone to infection risks. Cracking open the chest is a big deal. It can take two months for the disrupted tissues to heal. Did the research really require opening up the chest and lifting the heart for the control group?

Could this really ever have been considered ethical?

I might have been much more cynical had I not known of other, hm, questionable medical studies. I recall hearing a BBC radio documentary in the 1990s about American physicians who deliberately gave patients radioactive materials without their knowledge, just to explore the effects. Perhaps most infamously, there was the Tuskegee syphilis study, where United States medical authorities followed the development of the disease over decades without revealing the full nature of the study, or trying to treat any of those infected. Compared with these violations, the angina surgery research seemed tame.

But do not believe everything you read…

According to the notes at the back of Slater's book, her reference was another secondary source (Moerman, 2002) – that is someone writing about what the research reports said, not those actual 'primary' accounts in the research journals.

So, I looked on-line for the original accounts. I found a 1959 study, by a team from the University of Washington School of Medicine. They explained that:

"Considerable relief of symptoms has been reported for patient with angina pectoris subjected to bilateral ligation of the internal mammary arteries. The physiologic basis for the relief of angina afforded by this rather simple operation is not clear."

Cobb, Thomas, Dillard, Merendino & Bruce, 1959

It was not clear why clamping these blood vessels in the chest should make a substantial difference to blood flow to the heart muscles – despite various studies which had subjected a range of dogs (who were not complaining of the symptoms of angina, and did not need any surgery) to surgical interventions followed by invasive procedures in order to measure any modifications in blood flow (Blair, Roth & Zintel, 1960).

Would you like your aorta clamped, and the blood drained from the left side of your heart, for the sake of a research study?

That raises another ethical issue – the extent of pain, suffering and morbidity it is fair to inflict on non-human animals (which are never perfect models for human anatomy and physiology) to progress human medicine. Some studies explored the details of blood circulation in dogs. Would you like your aorta clamped, and the blood drained from the left side of your heart, for the sake of a research study? Moreover, in order to test the effectiveness of the ligation procedure, in some studies healthy dogs had to have the blood supply to their heart muscles disrupted to give them similarly compromised heart function to the human angina sufferers. 1

But, hang on a moment. I think I passed over something rather important in that last quote: "this rather simple operation"?

"Considerable relief of symptoms has been reported for patient with angina pectoris subjected to bilateral ligation of the internal mammary arteries. The physiologic basis for the relief of angina afforded by this rather simple operation is not clear."

Cobb and colleagues' account of the procedure contradicted one of my assumptions,

 At the time of operation, which was performed under local anesthesia [anaesthesia], the surgeon was handed a randomly selected envelope, which contained a card instructing him whether or not to ligate the internal mammary arteries after they had been isolated.

Cobb et al, 1959

It seems my inference that the procedure was carried out under general anaesthetic was wrong. Never assume! Surgery under local anaesthetic is not a trivial enterprise, but carries much less risk than general anaesthetic.

Yet, surely, even back then, no surgeon was going to open up the chest and handle the heart under a local anaesthetic? Cobb and colleagues wrote:

"The surgical procedures commonly used in the therapy of coronary-artery disease have previously been "major" operations utilizing thoracotomy and accompanied by some morbidity and a definite mortality. … With the advent of internal-mammary-artery ligation and its alleged benefit, a unique opportunity for applying the principles of a double-blind evaluation to a surgical procedure has been afforded."

Cobb, Thomas, Dillard, Merendino & Bruce, 1959

So, the researchers were arguing that, previously, surgical interventions for this condition were major operations that did involve opening up the chest (thorax) – thoracotomy – where sham surgery would not have been ethical; but the new procedure they were testing – "this rather simple operation" was different.

Effects of internal-mammary-artery ligation on 17 patients with angina pectoris were evaluated by a double-blind technic. Eight patients had their internal mammary arteries ligated; 9 had skin incisions only. 

Cobb et al, 1959

They describe "a 'placebo' procedure consisting of parasternal skin incisions" – that is, some cuts were made into the skin next to the breast bone. Skin incisions are somewhat short of open heart surgery.

The description given by the Kansas team (from the Departments of Medicine and Surgery, University of Kansas Medical Center, Kansas City) also differs from Slater's third-hand account in this important way:

"The patients were operated on under local anesthesia. The surgeon, by random sampling, selected those in whom bilateral internal mammary artery and vein ligation (second interspace) was to be carried out and those in whom a sham procedure was to be performed. The sham procedure consisted of a similar skin incision with exposure of the internal mammary vessels, but without ligation."

Dimond, Kittle & Crocket, 1960

This description of the surgery seemed quite different from that offered by Slater.

These teams seemed to be reporting a procedure that could be carried out without exposing the lungs or the heart and opening their protective covers ("in this technique…the pericardium and pleura are not entered or disturbed", Glover, et al, 1957), and which could be superficially forged by making a few cuts into the skin.


"The performance of bilateral division of the internal mammary arteries as compared to other surgical procedures for cardiac disease is safe, simple and innocuous in capable hands."

Glover, Kitchell, Kyle, Davila & Trout, 1958

The surgery involved making cuts into the skin of the chest to access, and close off, arteries taking blood to (more superficial) chest areas in the hope it would allow more to flow to the heart muscles; the sham surgery, the placebo, involved making similar incisions, but without proceeding to change the pattern of arterial blood flow.

The sham surgery did not require general anaesthesia and involved relatively superficial wounds – and offered a research technique that did not need to cause suffering to, and the sacrifice of, perfectly healthy dogs. So, that's all ethical then?

The first hand research reports at least give a different impression of the balance of costs and potential benefits to stakeholders than I had originally drawn from Lauren Slater's account.

Getting consent for sham surgery

A key requirement for ethical research with human participants is that they offer voluntary informed consent. Unlike dogs, humans can assent to research procedures, and it is generally considered that research should not be undertaken without such consent.

Read about voluntary informed consent

Of course, there is nuance and complication. The kind of research where investigators drop large-denomination notes to test the honesty of passers-by – where the 'participants' are in a public place and will not be identified or identifiable – is not usually seen as needing such consent (which would clearly undermine any possibility of getting authentic results). But is it acceptable to observe people using public toilets without their knowledge and consent (as was described in one published study I used as a teaching example)?

The extent to which a lay person can fully understand the logic and procedures explained to them when seeking consent can vary. The extent to which most participants would need, or even want, to know full details of the study can vary. When children of various ages are involved, the extent to which consent can be given on their behalf by a parent or teacher raises interesting questions.


"I'm looking for volunteers to have a procedure designed to make it look like you've had surgery"

Image by mohamed_hassan from Pixabay


There is much nuance, and there are many complications – and this is an area to which researchers need to give very careful consideration.

  • How many ill patients would volunteer for sham surgery to help someone else's research?
  • Would that answer change, if the procedure being tested would later be offered to them?
  • What about volunteering for a study where you have a 50-50 chance of getting the real surgery or the placebo treatment?

In Cobb's study, the participants had all volunteered – but we might wonder if the extent of the information they were given amounted to what was required for informed consent,

The subjects were informed of the fact that this procedure had not been proved to be of value, and yet many were aware of the enthusiastic report published in the Reader's Digest. The patients were told only that they were participating in an evaluation of this operation; they were not informed of the double-blind nature of the study.

Cobb et al, 1959

So, it seems the patients thought they were having an operation that had been mooted to help angina sufferers – and indeed some of them were, but others just got taken into surgery to get a few wounds that suggested something more substantive had been done.

Was that ethical? (I doubt it would be allowed anywhere today.)

The outcome of these studies was that although the patients getting the ligation surgery did appear to get relief from their angina – so did those just getting the skin incisions. The placebo seemed just as good as the re-plumbing.

In hindsight, does this make the studies more worthwhile and seem more ethical? This research has probably prevented a great many people having an operation to have some of their vascular system blocked when that does not seem to make any difference to angina. Does that advance in medical knowledge justify the deceit involved in leading people to think they would get an experimental surgical treatment when they might just get an experimental control treatment?


Ethical principles and guidelines can help us judge the merits of a study

Coda – what did the middle man have to say?

I wondered how a relatively minor sham procedure under local anaesthetic became characterised as "the patients, whether or not they were getting the procedure had their chest cracked open and their heart lifted out" – a description which gave a vivid impression of a major intervention.


The heart is pretty well integrated into the body – how easy is it to lift an intact, fully connected, working heart out of position?

Image by HANSUAN FABREGAS from Pixabay


I wondered to what extent it would even be possible to lift the heart out from the chest whilst it remained connected with the major vessels passing the blood it was pumping, and the nerves supplying it, and the vessels supplying blood to its own muscles (the ones that were considered compromised enough to make the treatment being tested worth considering). Some sources I found on-line referred to the heart being 'lifted' during open-heart procedures to give the surgeon access to specific sites: but that did not mean taking the heart out of the body. Having the heart 'lifted out' seemed more akin to Aztec sacrificial rites than medical treatment.

Although all surgery involves some risk, the actual procedure being investigated seemed of a relatively routine nature. I actually attended a 'minor' operation which involved cutting into the chest when my late wife was prepared for kidney dialysis. Usually a site for venous access is prepared in the arm well in advance, but it was decided my wife needed to be put on dialysis urgently. A temporary hole was cut into her neck to allow the surgeon to connect a tube (a central venous catheter) to a vein, and another hole into her chest so that the catheter would exit in her chest, where the tap could be kept sterile, bandaged to the chest. This was clearly not considered a high-risk operation (which is not to say I think I could have coped with having this done to me!) as I was asked by the doctors to stay in the room with my wife during the procedure, and I did not need to 'scrub' or 'gown up'.

Bilateral internal mammary artery ligation seemed a procedure on that kind of level, accessing blood vessels through incisions made in the skin. However, if Lauren Slater had read up on some of the earlier procedures that did require opening the chest, or if she had read the papers describing how the dogs were investigated to trace blood flow through connected vessels, measure changes in flow, and prepare them for induced heart conditions, I could appreciate the potential for confusion. Yet she did not cite the primary research, but rather Daniel Moerman, an Emeritus Professor of Anthropology at the University of Michigan-Dearborn, who has written a book about placebo treatments in medicine.

Moerman does write about bilateral internal mammary artery ligation, and about the two sham surgery studies I found in my search. He describes the operation:

"It was quite simple, and since the arteries were not deep in the body, could be performed under local anaesthetic."

Moerman, 2002

He also refers to the subjective reports on one of the patients assigned to the placebo condition in one of the studies, who claimed to feel much better immediately after the procedure:

"This patient's arteries were not ligated…But he did have two scars on his chest…"

Moerman, 2002

But nobody cracked open his chest, and no one handled his heart.

There are still ethical issues here, but understanding the true (almost superficial) nature of the sham surgery clearly changes the balance of concerns. If there is a moral to this article, it is perhaps the importance of being fully informed before reaching judgement about the ethics of a research study.


Work cited:
  • Blair, C. R., Roth, R. F., & Zintel, H. A. (1960). Measurement of coronary artery blood-flow following experimental ligation of the internal mammary artery. Annals of Surgery, 152(2), 325.
  • Cobb, L. A., Thomas, G. I., Dillard, D. H., Merendino, K. A., & Bruce, R. A. (1959). An evaluation of internal-mammary-artery ligation by a double-blind technic. New England Journal of Medicine, 260(22), 1115-1118.
  • Dimond, E. G., Kittle, C. F., & Crockett, J. E. (1960). Comparison of internal mammary artery ligation and sham operation for angina pectoris. The American Journal of Cardiology, 5(4), 483-486.
  • Glover, R. P., Davila, J. C., Kyle, R. H., Beard, J. C., Trout, R. G., & Kitchell, J. R. (1957). Ligation of the internal mammary arteries as a means of increasing blood supply to the myocardium. Journal of Thoracic Surgery, 34(5), 661-678. https://doi.org/10.1016/S0096-5588(20)30315-9
  • Glover, R. P., Kitchell, J. R., Kyle, R. H., Davila, J. C., & Trout, R. G. (1958). Experiences with myocardial revascularization by division of the internal mammary arteries. Diseases of the Chest, 33(6), 637-657. https://doi.org/10.1378/chest.33.6.637
  • Moerman, D. E. (2002). Meaning, Medicine, and the "Placebo Effect". Cambridge: Cambridge University Press.
  • Slater, L. (2018). The Drugs that Changed our Minds: The history of psychiatry in ten treatments. London: Simon & Schuster.
  • Taber, K. S. (2019). Experimental research into teaching innovations: responding to methodological and ethical challenges. Studies in Science Education, 55(1), 69-119. doi:10.1080/03057267.2019.1658058 [Download this paper.]


Note:

1 To find out if the ligation procedure protected a dog required stressing the blood supply to the heart itself,

"An attempt has been made to evaluate the degree of protection preliminary ligation of the internal mammary artery may afford the experimental animal when subjected to the production of sudden, acute myocardial infarction by ligation of the anterior descending coronary artery at its origin. …

It was hoped that survival in the control group would approximate 30 per cent so that infarct size could be compared with that of the "protected" group of animals. The "protected" group of dogs were treated in the same manner but in these the internal mammary arteries were ligated immediately before, at 24 hours, and at 48 hours before ligation of the anterior descending coronary.

In 14 control dogs, the anterior descending coronary artery with the aforementioned branch to the anterolateral aspect of the left ventricle was ligated. Nine of these animals went into ventricular fibrillation and died within 5 to 20 minutes. Attempts to resuscitate them by defibrillation and massage were to no avail. Four others died within 24 hours. One dog lived 2 weeks and died in pulmonary edema."

Glover, Davila, Kyle, Beard, Trout & Kitchell, 1957

Pulmonary oedema involves a build-up of fluid in the lungs that restricts gaseous exchange and prevents effective breathing. The dog that survived longest (if it was kept conscious) would have experienced death as if by slow suffocation or drowning.

Why ask teachers to 'transmit' knowledge…

…if you believe that "knowledge is constructed in the minds of students"?


Keith S. Taber


While the students in the experimental treatment undertook open-ended enquiry, the learners in the control condition undertook practical work to demonstrate what they had already been told was the case – a rhetorical exercise that reflected the research study they were participating in


A team of researchers chose to compare a teaching approach they believed met the requirements for good science instruction, and which they knew had already been demonstrated to be effective pedagogy in other studies, with teaching they believed was not suitable for bringing about conceptual change.
(Ironically, they chose a research design more akin to the laboratory activities in the substandard control condition than to the open-ended enquiry that was part of the pedagogy they considered effective!)

An imaginary conversation 1 with a team of science education researchers.

When we critically read a research paper, we interrogate the design of the study, and the argument for new knowledge claims that are being made. Authors of research papers need to anticipate the kinds of questions readers (editors, reviewers, and the wider readership on publication) will be asking as they try to decide if they find the study convincing.

Read about writing-up research

In effect, there is an asynchronous conversation.

Here I engage in 'an asynchronous conversation' with the authors of a research paper I was interrogating:

What was your study about?

"This study investigated the effect of the Science Writing Heuristic (SWH) approach on grade 9 students' understanding of chemical change and mixture concepts [in] a Turkish public high school."

Kingir, Geban & Gunel, 2013

I understand this research was set up as a quasi-experiment – what were the conditions being compared?

"Students in the treatment group were instructed by the SWH approach, while those in the comparison group were instructed with traditionally designed chemistry instruction."

Kingir, Geban & Gunel, 2013

Constructivism

Can you tell me about the theoretical perspective informing this study?

"Constructivism is increasingly influential in guiding student learning around the world. However, as knowledge is constructed in the minds of students, some of their commonsense ideas are personal, stable, and not congruent with the scientifically accepted conceptions… Students' misconceptions [a.k.a. alternative conceptions] and learning difficulties constitute a major barrier for their learning in various chemistry topics"

Kingir, Geban & Gunel, 2013

Read about constructivist pedagogy

Read about alternative conceptions

'Traditional' teaching versus 'constructivist' teaching

So, what does this suggest about so-called traditional teaching?

"Since prior learning is an active agent for student learning, science educators have been focused on changing these misconceptions with scientifically acceptable ideas. In traditional science teaching, it is difficult for the learners to change their misconceptions…According to the conceptual change approach, learning is the interaction between prior knowledge and new information. The process of learning depends on the degree of the integration of prior knowledge with the new information.2"

Kingir, Geban & Gunel, 2013

And does the Science Writing Heuristic Approach contrast to that?

"The Science Writing Heuristic (SWH) approach can be used to promote students' acquisition of scientific concepts. The SWH approach is grounded on the constructivist philosophy because it encourages students to use guided inquiry laboratory activities and collaborative group work to actively negotiate and construct knowledge. The SWH approach successfully integrates inquiry activities, collaborative group work, meaning making via argumentation, and writing-to-learn strategies…

The negotiation activities are the central part of the SWH because learning occurs through the negotiation of ideas. Students negotiate meaning from experimental data and observations through collaboration within and between groups. Moreover, the student template involves the structure of argumentation known as question, claim, and evidence. …Reflective writing scaffolds the integration of new ideas with prior learning. Students focus on how their ideas changed through negotiation and reflective writing, which helps them confront their misconceptions and construct scientifically accepted conceptions"

Kingir, Geban & Gunel, 2013

What is already known about SWH pedagogy?

It seems like the SWH approach should be effective at supporting student learning. So, has this not already been tested?

"There are many international studies investigating the effectiveness of the SWH approach over the traditional approach … [one team] found that student-written reports had evidence of their science learning, metacognitive thinking, and self-reflection. Students presented reasons and arguments in the meaning-making process, and students' self-reflections illustrated the presence of conceptual change about the science concepts.

[another team] asserted that using the SWH laboratory report format in lieu of a traditional laboratory report format was effective on acquisition of scientific conceptions, elimination of misconceptions, and learning difficulties in chemical equilibrium.

[Another team] found that SWH activities led to greater understanding of grade 6 science concepts when compared to traditional activities. The studies conducted at the postsecondary level showed similar results as studies conducted at the elementary level…

[In two studies] it was demonstrated that the SWH approach can be effective on students' acquisition of chemistry concepts. SWH facilitates conceptual change through a set of argument-based inquiry activities. Students negotiate meaning and construct knowledge, reflect on their own understandings through writing, and share and compare their personal meanings with others in a social context"

Kingir, Geban & Gunel, 2013

What was the point of another experimental test of SWH?

So, it seems that from a theoretical point of view, so-called traditional teaching is likely to be ineffective in bringing about conceptual learning in science, whilst a constructivist approach based on the Science Writing Heuristic is likely to support such learning. Moreover, you are aware of a range of existing studies which suggest that in practice the Science Writing Heuristic is indeed an effective basis for science teaching.

So, what was the point of your study?

"The present study aimed to investigate the effect of the SWH approach compared to traditional chemistry instruction on grade 9 students' understanding of chemical change and mixture concepts."

Kingir, Geban & Gunel, 2013

Okay, I would certainly accept that just because a teaching approach has been found effective with one age group, or in one topic, or in one cultural context, we cannot assume those findings can be generalised and will necessarily apply in other teaching contexts (Taber, 2019).

Read about generalisation from studies

What happened in the experimental condition?

So, what happened in the two classes taught in the experimental condition?

"The teacher asked students to form their own small groups (n=5) and introduced to them the SWH approach …they were asked to suggest a beginning question…, write a claim, and support that claim with evidence…

they shared their questions, claims, and evidence in order to construct a group question, claim, and evidence. …each group, in turn, explained their written arguments to the entire class. … the rest of the class asked them questions or refuted something they claimed or argued. …the teacher summarized [and then] engaged students in a discussion about questions, claims, and evidence in order to make students aware of the meaning of those words. The appropriateness of students' evidence for their claims, and the relations among questions, claims, and evidence were also discussed in the classroom…

The teacher then engaged students in a discussion about …chemical change. First, the teacher attempted to elicit students' prior understanding about chemical change through questioning…The teacher asked students to write down what they wanted to learn about chemical change, to share those items within their group, and to prepare an investigation question with a possible test and procedure for the next class. While students constructed their own questions and planned their testing procedure, the teacher circulated through the groups and facilitated students' thinking through questioning…

Each group presented their questions to the class. The teacher and the rest of the class evaluated the quality of the question in relation to the big idea …The groups' procedures were discussed and revised prior to the actual laboratory investigation…each group tested their own questions experimentally…The teacher asked each student to write a claim about what they thought happened, and support that claim with the evidence. The teacher circulated through the classroom, served as a resource person, and asked …questions

…students negotiated their individual claims and evidence within their groups, and constructed group claims and evidence… each group…presented … to the rest of the class."

Kingir, Geban & Gunel, 2013
What happened in the control condition?

Okay, I can see that the experimental groups experienced the kind of learning activities that both educational theory and previous research suggests are likely to engage them and develop their thinking.

So, what did you set up to compare with the Science Writing Heuristic Approach as a fair test of its effectiveness as a pedagogy?

"In the comparison group, the teacher mainly used lecture and discussion[3] methods while teaching chemical change and mixture concepts. The chemistry textbook was the primary source of knowledge in this group. Students were required to read the related topic from the textbook prior to each lesson….The teacher announced the goals of the lesson in advance, wrote the key concepts on the board, and explained each concept by giving examples. During the transmission of knowledge, the teacher and frequently used the board to write chemical formula[e] and equations and draw some figures. In order to ensure that all of the students understood the concepts in the same way, the teacher asked questions…[that] contributed to the creation of a discussion[3] between teacher and students. Then, the teacher summarized the concepts under consideration and prompted students to take notes. Toward the end of the class session, the teacher wrote some algorithmic problems [sic 4] on the board and asked students to solve those problems individually….the teacher asked a student to come to the board and solve a problem…

The …nature of their laboratory activities was traditional … to verify what students learned in the classroom. Prior to the laboratory session, students were asked to read the procedures of the laboratory experiment in their textbook. At the laboratory, the teacher explained the purpose and procedures of the experiment, and then requested the students to follow the step-by-step instructions for the experiment. Working in groups (n=5), all the students conducted the same experiment in their textbook under the direct control of the teacher. …

The students were asked to record their observations and data. They were not required to reason about the data in a deeper manner. In addition, the teacher asked each group to respond to the questions about the experiment included in their textbook. When students failed to answer those questions, the teacher answered them directly without giving any hint to the students. At the end of the laboratory activity, students were asked to write a laboratory report in traditional format, including purpose, procedure, observations and data, results, and discussion. The teacher asked questions and helped students during the activity to facilitate their connection of laboratory activity with what they learned in the classroom."

Kingir, Geban & Gunel, 2013

The teacher variable

Often in small scale research studies in education, a different teacher teaches each group and so the 'teacher variable' confounds the experiment (Taber, 2019). Here, however, you avoid that problem 5, as you had a sample of four classes, and two different teachers were involved, each teaching one class in each condition?

"In order to facilitate the proper instruction of the SWH approach in the treatment group, the teachers were given training sessions about its implementation prior to the study. The teachers were familiar with the traditional instruction. One of the teachers was teaching chemistry for 20 years, while the other was teaching chemistry for 22 years at a high school. The researcher also asked the teachers to teach the comparison group students in the same way they taught before and not to do things specified for the treatment group."

Kingir, Geban & Gunel, 2013

Was this research ethical?

As this is an imaginary conversation, not all of the questions I might like to ask are actually addressed in the paper. In particular, I would love to know how the authors would justify that their study was ethical, considering that the control condition they set up deliberately excluded features of pedagogy that they themselves claim are necessary to support effective science learning:

"In traditional science teaching, it is difficult for the learners to change their misconceptions"

The authors believe that "learning occurs through the negotiation of ideas", and their experimental condition provides plenty of opportunity for that. The control condition is designed to avoid the explicit elicitation of learners' ideas, dialogic talk, or peer interactions when reading, listening, writing notes or undertaking exercises. If the authors' beliefs are correct (and they are broadly consistent with a wide consensus across the global science education research community), then the teaching in the comparison condition is not suitable for facilitating conceptual learning.

Even if we think it is conceivable that highly experienced teachers, working in a national context where constructivist teaching has long been official education policy, had somehow previously managed to only teach in an ineffective way: was it ethical to ask these teachers to teach one of their classes poorly even after providing them with professional development enabling them to adopt a more engaging approach better aligned with our understanding of how science can be effectively taught?

Read about unethical control conditions

Given that the authors already believed that –

  • "Students' misconceptions and learning difficulties constitute a major barrier for their learning in various chemistry topics"
  • "knowledge is constructed in the minds of students"
  • "The process of learning depends on the degree of the integration of prior knowledge with the new information"
  • "learning occurs through the negotiation of ideas"
  • "The SWH approach successfully integrates inquiry activities, collaborative group work, meaning making" – A range of previous studies have shown that SWH effectively supports student learning

– why did they not test the SWH approach against existing good practice, rather than implement a control pedagogy they knew should not be effective, so setting up two classes of learners (who do not seem to have been asked to consent to being part of the research) to fail?

Read about the expectation for voluntary informed consent

Why not set up a genuinely informative test of the SWH pedagogy, rather than setting up conditions for manufacturing a foregone conclusion?


When it has already been widely established that a pedagogy is more effective than standard practice, there is little point further testing it against what is believed to be ineffective instruction.

Read about level of control in experiments


How can it be ethical to ask teachers to teach in a way that is expected to be ineffective?

  • transmission of knowledge
  • follow the step-by-step instructions
  • not required to reason in a deeper manner
  • individual working

A rhetorical experiment?

Is this not just a 'rhetorical' experiment engineered to produce a desired outcome (a demonstration), rather than an open-ended enquiry (a genuine experiment)?

A rhetorical experiment is not designed to produce substantially new knowledge: but rather to create the conditions for a 'positive' result (Figure 8 from Taber, 2019).

Read about rhetorical experiments


A technical question

Any study of a teaching innovation requires the commitment of resources and some disruption of teaching. Therefore any research study which has inherent design faults that will prevent it producing informative outcomes can be seen as a misuse of resources, and an unproductive disruption of school activities, and so, if only in that sense, unethical.

As the research was undertaken with "four intact classes", is it possible to apply any statistical tests that can offer meaningful results when there are only two units of analysis in each condition? [That is, I think not.]

The researchers claim to have 117 degrees of freedom when applying statistical tests to draw conclusions. They seem to assume that each of the 122 children can be considered to be a separate unit of analysis. But is it reasonable to assume that c.30 children taught together in the same intact class by the same teacher (and working in groups for at least part of the time) are independently experiencing the (experimental or control) treatment?

Surely, the students within a class influence each other's learning (especially during group-work), so the outcomes of statistical tests that rely on treating each learner as an independent unit of analysis are invalid (Taber, 2019). This is especially so in the experimental treatment where dialogue (and "the negotiation of ideas") through group-work, discussion, and argumentation were core parts of the instruction.
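The point about non-independence can be made concrete with the standard 'design effect' adjustment used for cluster sampling. Here is a minimal sketch in Python; the intraclass correlation (ICC) of 0.2 is my illustrative assumption, not a figure from the study:

```python
def design_effect(cluster_size, icc):
    """Kish design effect: the factor by which variance is inflated
    when members of a cluster (here, an intact class) are correlated
    rather than independent."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Nominal sample size deflated by the design effect."""
    return n / design_effect(cluster_size, icc)

# 122 students in 4 intact classes, so about 30.5 per class.
# An ICC of 0.2 (assumed here for illustration) is not unusual
# for classroom-based outcome measures.
print(round(design_effect(30.5, 0.2), 2))               # 6.9
print(round(effective_sample_size(122, 30.5, 0.2), 1))  # 17.7
```

On these (assumed) numbers, the 122 students carry the statistical information of fewer than 20 independent learners – nowhere near enough to support the 117 degrees of freedom claimed.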

Read about units of analysis

Sources cited:

  • Ausubel, D. P. (1968). Educational Psychology: A cognitive view. Holt, Rinehart & Winston.
  • Kingir, S., Geban, O., & Gunel, M. (2013). Using the Science Writing Heuristic Approach to Enhance Student Understanding in Chemical Change and Mixture. Research in Science Education, 43(4), 1645-1663. https://doi.org/10.1007/s11165-012-9326-x
  • Taber, K. S. (2019). Experimental research into teaching innovations: responding to methodological and ethical challenges. Studies in Science Education, 55(1), 69-119. doi:10.1080/03057267.2019.1658058 [Download]

Notes:

1 I have used direct quotes from the published report in Research in Science Education (but I have omitted citations to other papers), with some emphasis added. Please refer to the full report of the study for further details. I have attempted to extract relevant points from the paper to develop an argument here. I have not deliberately distorted the published account by selection and/or omission, but clearly am only reproducing small extracts. I would recommend readers might access the original study in order to make up their own minds.


2 The next statement is "If individuals know little about the subject matter, new information is easily embedded in their cognitive structure (assimilation)." This is counter to the common thinking that learning about an unfamiliar topic is more difficult, and learning is made meaningful when it can be related to prior knowledge (Ausubel, 1968).

Read about making the unfamiliar familiar


3 The term 'discussion' might suggest an open-ended exchange of ideas and views. This would be a dialogic technique typical of constructivist approaches. From the wider context it seems likely that something more teacher-directed and closed was meant here – but this is an interpretation which goes beyond the description available in the original text.

Read about dialogic learning


4 Researchers into problem-solving consider that a problem has to require a learner to do more than simply recall and apply previously learned knowledge and techniques – so an 'algorithmic problem' might be considered an oxymoron. However, it is common for teachers to refer to algorithmic exercises as 'problems' even though they do not require going beyond application of existing learning.


5 This design does avoid the criticism that one of the teachers might simply have been more effective at teaching the topic to this age group, as both teachers teach in both conditions.

This does not entirely remove potential confounds as teachers interact differently with different classes, and with only four teacher-class combinations it could well be that there is better rapport in the two classes in one or other condition. It is very hard to see how this can be addressed (except by having a large enough sample of classes to allow inferential statistics to be used rigorously – which is not feasible in small scale studies).

A potentially more serious issue is 'expectancy' effects. There is much research in education and other social contexts to show that people's beliefs and expectations influence outcomes of studies – and this can make a substantial difference. If the two teachers were unconvinced by the newfangled and progressive approach being tested, then this could undermine their ability to effectively teach that way.

On the other hand, although it is implied that these teachers normally teach in the 'traditional' way, constructivist approaches are actually recommended in Turkey: they are officially sanctioned and widely taught in teacher education and development courses. If the teachers accepted the arguments for believing the SWH was likely to be more effective at bringing about conceptual learning than the methods they were asked to adopt in the comparison classes, that would further undermine that treatment as a fair control condition.

Read about expectancy effects in research

Again, there is very little researchers can do about this issue as they cannot ensure that teachers participating in research studies are equally confident in the effectiveness of different treatments (and why should they be – the researchers are obviously expecting a substantive difference*), and this is a major problem in studies into teaching innovations (Taber, 2019).

* This is clear from their paper. Is it likely that they would have communicated this to the teachers? "The teachers were given training sessions about [SWH's] implementation prior to the study." Presumably, even if somehow these experienced teachers had previously managed to completely avoid or ignore years of government policy and guidance intending to persuade them of the value of constructivist approaches, the researchers could not have offered effective "training sessions" without explaining the rationales of the overall approach, and for the specific features of the SWH that they wanted teachers to adopt.


Shock result: more study time leads to higher test scores

(But 'all other things' are seldom equal)


Keith S. Taber


I came across an interesting journal article that reported a quasi-experimental study where different groups of students studied the same topic for different periods of time. One group was given 3 half-hour lessons, another group 5 half-hour lessons, and the third group 8 half-hour lessons. Then they were tested on the topic they had been studying. The researchers found that the average group performance was substantially different across the different conditions. This was tested statistically, but the results were clear enough to be quite impressive when presented visually (as I have below).


Results from a quasi-experiment: it seems more study time can lead to higher achievement

These results seem pretty clear cut. If this research could be replicated in diverse contexts then the findings could have great significance.

  • Is your manager trying to cut course hours to save budget?
  • Does your school want you to teach 'triple science' in a curriculum slot intended for 'double science'?
  • Does your child say they have done enough homework?

Research evidence suggests that, ceteris paribus, learners achieve more by spending more time studying.

Ceteris paribus?

That is ceteris paribus (no, it is not a newly discovered species of whale): all other things being equal. But of course, in the real world they seldom – if ever – are.

If you wondered about the motivation for a study designed to see whether more teaching led to more learning (hardly what Karl Popper would have classed as a suitable 'bold conjecture' on which to base productive research), then I should confess I am being disingenuous. The information I give above is based on the published research, but offers a rather different take on the study from that offered by the authors themselves.

An 'alternative interpretation' one might say.

How useful are DARTs as learning activities?

I came across this study when looking to see if there was any research on the effectiveness of DARTs in chemistry teaching. DARTs are directed activities related to text – that is, text-based exercises designed to require learners to engage with content rather than just copy or read it. They have long been recommended, but I was not sure I had seen any published research on their use in science classrooms.

Read about using DARTs in teaching

Shamsulbahri and Zulkiply (2021) undertook a study that "examined the effect of Directed Activity Related to Texts (DARTs) and gender on student achievement in qualitative analysis in chemistry" (p.157). They considered their study to be a quasi-experiment.

An experiment…

Experiment is the favoured methodology in many areas of natural science, and, indeed, the double-blind experiment is sometimes seen as the gold standard methodology in medicine – and, when possible, in the social sciences. This includes education, and certainly in science education the literature reports many, many educational experiments. However, doing experiments well in education is very tricky and many published studies have major methodological problems (Taber, 2019).

Read about experiments in education

…requires control of variables

As we teach in school science, fair testing requires careful control of variables.

So, if I suggest there are some issues that prevent a reader from being entirely confident in the conclusions that Shamsulbahri and Zulkiply reach in their paper, it should be borne in mind that I think it is almost impossible to do a rigorously 'fair' small-scale experiment in education. By small-scale, I mean the kind of study that involves a few classes of learners, as opposed to studies that can enrol a large number of classes and randomly assign them to conditions. Even large-scale randomised studies are usually compromised by factors that simply cannot be controlled in educational contexts (Taber, 2019), and small-scale studies are subject to additional, often (I would argue) insurmountable, 'challenges'.

The study is available on the web, open access, and the paper goes into a good deal of detail about the background to, and aspects of, the study. Here, I am focusing on a few points that relate to my wider concerns about the merits of experimental research into teaching, and there is much of potential interest in the paper that I am ignoring as not directly relevant to my specific argument here. In particular, the authors describe the different forms of DART they used in the study. As, inevitably (considering my stance on the intrinsic problems of small-scale experiments in education), the tone of this piece is critical, I would recommend readers to access the full paper and make up your own minds.

Not a predatory journal

I was not familiar with the journal in which this paper was published – the Malaysian Journal of Learning and Instruction. It describes itself as "a peer reviewed interdisciplinary journal with an international advisory board". It is an open access journal that charges authors for publication. However, the publication fees are modest (US$25 if authors are from countries that are members of The Association of Southeast Asian Nations, and US$50 otherwise). This is an order of magnitude less than is typical for some of the open-access journals that I have criticised here as being predatory – those which do not engage in meaningful peer review, and will publish some very low quality material as long as a fee is paid. 25 dollars seems a reasonable charge for the costs involved in publishing work, unlike the hefty fees charged by many of the less scrupulous journals.

Shamsulbahri and Zulkiply seem, then, to have published in a well-motivated journal and their paper has passed peer review. But this peer thinks that, like most small scale experiments into teaching, it is very hard to draw any solid conclusions from this work.

What do the authors conclude?

Shamsulbahri and Zulkiply argue that their study shows the value of DARTs activities in learning. I approach this work with a bias, as I also think DARTs can be very useful. I used different kinds of DARTs extensively in my teaching of 14-16 year olds when I worked in schools.

The authors claim their study,

"provides experimental evidence in support of the claim that the DARTs method has been beneficial as a pedagogical approach as it helps to enhance qualitative analysis learning in chemistry…

The present study however, has shown that the DARTs method facilitated better learning of the qualitative analysis component of chemistry when it was combined with the experimental method. Using the DARTs method only results in better learning of qualitative analysis component in chemistry, as compared with using the Experimental method only."

Shamsulbahri & Zulkiply, 2021

Yet, despite my bias, which leads me to suspect they are right, I do not think we can infer this much from their quasi-experiment.

I am going to separate out three claims in the quote above:

  1. the DARTs method has been beneficial as a pedagogical approach as it helps to enhance qualitative analysis learning in chemistry
  2. the DARTs method facilitated better learning of the qualitative analysis component of chemistry when it was combined with the [laboratory] 1 method
  3. the DARTs method [by itself] results in better learning of qualitative analysis component in chemistry, as compared with using the [laboratory] method only.

I am going to suggest that there are two weak claims here and one strong claim. The weak claims are reasonably well supported (but only as long as they are read strictly as presented and not assumed to extend beyond the study) but the strong claim is not.

Limitations of the experiment

I suggest there are several major limitations of this research design.

What population is represented in the study?

In a true experiment researchers would nominate the population of interest (say, for example, 14-16 year old school learners in Malaysia), and then randomly select participants from this population, who would be randomly assigned to the different conditions being compared. Random selection and assignment cannot ensure that the groupings of participants are equivalent, nor that the samples genuinely represent the population; by chance it could happen that, say, all the most studious students are assigned to one condition and all the lazy students to another – but that is very unlikely. Random selection and assignment mean that there is a strong statistical case for thinking the outcomes of the experiment probably represent (more or less) what would have happened on a larger scale had it been possible to include the whole population in the experiment.
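That statistical case can be illustrated with a small simulation. This is only a sketch: the population of 1,000 students, the 'studiousness' scores, and the group sizes are all invented for illustration, not taken from the study.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 1,000 students with varying 'studiousness' scores
# (mean 50, standard deviation 10 - arbitrary illustrative numbers).
population = [random.gauss(50, 10) for _ in range(1000)]

# Randomly select 80 students and randomly assign 40 to each of two conditions.
sample = random.sample(population, 80)
random.shuffle(sample)
group_a, group_b = sample[:40], sample[40:]

# By chance, the group means differ a little in any one assignment...
diff = statistics.mean(group_a) - statistics.mean(group_b)
print(f"difference in group means for one assignment: {diff:.2f}")

# ...but repeating the random assignment many times shows the differences
# centre on zero: large imbalances are possible, yet very unlikely.
diffs = []
for _ in range(2000):
    random.shuffle(sample)
    diffs.append(statistics.mean(sample[:40]) - statistics.mean(sample[40:]))
print(f"mean difference over 2000 random assignments: {statistics.mean(diffs):.2f}")
```

The point of the sketch is that randomisation does not guarantee equivalent groups on any single occasion, but it does make systematic bias in assignment improbable, which is what licenses the usual statistical inferences.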

Read about sampling in research

Obviously, researchers in small-scale experiments are very unlikely to be able to access full populations to sample. Shamsulbahri and Zulkiply did not – and it would be unreasonable to criticise them for this. But this does raise the question of whether what happens in their samples will reflect what would happen with other groups of students. Shamsulbahri and Zulkiply acknowledge their sample cannot be considered typical,

"One limitation of the present study would be the sample used; the participants were all from two local fully residential schools, which were schools for students with high academic performance."

Shamsulbahri & Zulkiply, 2021

So, we have to be careful about generalising from what happened in this specific experiment to what we might expect with different groups of learners. In that regard, two of the claims from the paper that I have highlighted (i.e., the weaker claims) do not directly imply these results can be generalised:

  1. the DARTs method has been beneficial as a pedagogical approach…
  2. the DARTs method facilitated better learning of the qualitative analysis component of chemistry when it was combined with the [laboratory] method

These are claims about what was found in the study – not inferences about what would happen in other circumstances.

Read about randomisation in studies

Equivalence at pretest?

When it is not possible to randomly assign participants to the different conditions then there is always the possibility that whatever process has been used to assign conditions to groups produces a bias. (An extreme case would be in a school that used setting, that is assigning students to teaching groups according to achievement, if one set was assigned to one condition, and another set to a different condition.)

In quasi-experiments on teaching it is usual to pre-test students and to present analysis to show that at the start of the experiment the groups 'are equivalent'. Of course, it is very unlikely two different classes would prove to be entirely equivalent on a pre-test, so often a judgement is made that the test results are sufficiently similar across the conditions. In practice, in many published studies, authors settle for the very weak (and inadequate) test of not finding differences so great that they would be very unlikely to occur by chance (Taber, 2019)!
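Why is 'no statistically significant difference' such a weak test of equivalence? A small simulation can make the point. The numbers here are invented (two classes of 40; a true pre-test difference of 0.4 standard deviations; a t-statistic threshold of 2 as a rough stand-in for p < 0.05) and do not come from the study.

```python
import random
import statistics

random.seed(2)

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

# Two classes of 40 whose true attainment differs by 0.4 standard deviations
# at pre-test - a real, educationally meaningful difference.
detected = 0
trials = 2000
for _ in range(trials):
    class_a = [random.gauss(50.0, 10.0) for _ in range(40)]
    class_b = [random.gauss(46.0, 10.0) for _ in range(40)]  # genuinely lower
    if abs(welch_t(class_a, class_b)) > 2.0:  # roughly p < 0.05
        detected += 1

rate = detected / trials
print(f"real pre-test difference flagged as 'significant' in {rate:.0%} of trials")
```

In the trials where the difference is not flagged, the two classes would pass the weak 'no significant difference' check despite genuinely unequal starting points: absence of statistical significance is not evidence of equivalence, especially with samples this small.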

Read about testing for equivalence

Shamsulbahri and Zulkiply did pretest all participants as a screening process to exclude any students who already had good subject knowledge in the topic (qualitative chemical analysis),

"Before the experimental manipulation began, all participants were given a pre-screening test (i.e., the Cation assessment test) with the intention of selecting only the most qualified participants, that is, those who had a low-level of knowledge on the topic….The participants who scored ten or below (out of a total mark of 30) were selected for the actual experimental manipulation. As it turned out, all 120 participants scored 10 and below (i.e., with an average of 3.66 out of 30 marks), which was the requirement that had been set, and thus they were selected for the actual experimental manipulation."

Shamsulbahri & Zulkiply, 2021

But the researchers do not report the mean results for the groups in the three conditions (laboratory 1; DARTs; {laboratory+DARTs}) or give any indication of how similar (or not) these were. Nor do these scores seem to have been included as a variable in the analysis of results. The authors seem to be assuming that, as no students scored more than one-third marks in the pre-test, any differences between groups at pre-test can be ignored. (This seems to suggest that scoring 30% or 0% can be considered the same level of prior knowledge in terms of the potential influence on further learning and subsequent post-test scores.) That does not seem a sound assumption.

"It is important to note that there was no issue of pre-test treatment interaction in the context of the present study. This has improved the external validity of the study, since all of the participants were given a pre-screening test before they got involved in the actual experimental manipulation, i.e., in one of the three instructional methods. Therefore, any differences observed in the participants' performance in the post-test later were due to the effect of the instructional method used in the experimental manipulation."

Shamsulbahri & Zulkiply, 2021 (emphasis added)

There seems to be a flaw in the logic here, as the authors seem to be equating demonstrating an absence of high scorers at pre-test with there being no differences between groups which might have influenced learning. 2

Units of analysis

In any research study, researchers need to be clear regarding what their 'unit of analysis' should be. In this case the extreme options seem to be:

  • 120 units of analysis: 40 students in each of three conditions
  • 3 units of analysis: one teaching group in each condition

The key question is whether individual learners can be considered as being subject to the treatment conditions independently of others assigned to the same condition.

"During the study phase, student participants from the three groups were instructed by their respective chemistry teachers to learn in pairs…"

Shamsulbahri & Zulkiply, 2021

There is a strong argument that when a group of students attend class together, and are taught together, and interact with each other during class, they strictly should not be considered as learning independently of each other. Anyone who has taught parallel classes that are supposedly equivalent will know that classes take on their own personalities as groups, and the behaviour and learning of individual students is influenced by the particular class ethos.

Read about units of analysis

So, rigorous research into class teaching pedagogy should not treat the individual learners as units of analysis – yet it often does. The reason is obvious – it is only possible to do statistical testing when the sample size is large enough, and in small scale educational experiments the sample size is never going to be large enough unless one…hm…pretends/imagines/considers/judges/assumes/hopes?, that each learner is independently subject to the assigned treatment without being substantially influenced by others in that condition.

So, Shamsulbahri and Zulkiply treated their participants as independent units of analysis and based on this find a statistically significant effect of treatment:

'laboratory' vs. 'DARTs' vs. 'laboratory+DARTs'.

That is questionable – but what if, for argument's sake, we accept this assumption that within a class of 40 students the learners can be considered not to influence each other (even their learning partner?) or the classroom more generally sufficiently to make a difference to others in the class?
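The consequence of wrongly treating clustered learners as independent units can be shown in a simulation. Everything here is hypothetical: a 'class ethos' effect of 5 marks and student-level noise of 10 marks are invented numbers, and there is no true treatment effect at all.

```python
import random
import statistics

random.seed(3)

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

# One intact class of 40 per condition; no true treatment effect.
# Each class simply has its own 'class ethos' effect on attainment.
false_positives = 0
trials = 2000
for _ in range(trials):
    ethos_a = random.gauss(0, 5)   # class-level effect (SD 5 marks)
    ethos_b = random.gauss(0, 5)
    class_a = [random.gauss(50 + ethos_a, 10) for _ in range(40)]
    class_b = [random.gauss(50 + ethos_b, 10) for _ in range(40)]
    if abs(welch_t(class_a, class_b)) > 2.0:  # student-level 'significance'
        false_positives += 1

fp_rate = false_positives / trials
print(f"spurious 'significant' treatment effects: {fp_rate:.0%}")
```

Analysed at the student level, the test repeatedly 'detects' the difference in class ethos and misattributes it to the (non-existent) treatment, at a rate far above the nominal 5% – which is precisely why intact classes, not individual students, are arguably the proper units of analysis here.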

A confounding variable?

Perhaps a more serious problem with the research design is that there is insufficient control of potentially relevant variables. In order to compare 'laboratory' vs. 'DARTs' vs. 'laboratory+DARTs', the only relevant difference between the three treatment conditions should be whether the students learn by laboratory activity, DARTs, or both. There should not be any other differences between the groups in the different treatments that might reasonably be expected to influence the outcomes.

Read about confounding variables

But the description of how groups were set up suggests this was not the case:

"….the researchers conducted a briefing session on the aims and experimental details of the study for the school's [schools'?] chemistry teachers…the researchers demonstrated and then guided the school's chemistry teachers in terms of the appropriate procedures to implement the DARTs instructional method (i.e., using the DARTs handout sheets)…The researcher also explained to the school's chemistry teachers the way to implement the combined method …

Participants were then classified into three groups: control group (experimental method), first treatment group (DARTs method) and second treatment group (Combination of experiment and DARTs method). There was an equal number of participants for each group (i.e., 40 participants) as well as gender distribution (i.e., 20 females and 20 males in each group). The control group consisted of the participants from School A, while both treatment groups consisted of participants from School B"


Shamsulbahri & Zulkiply, 2021

Several different teachers seem to have been involved in teaching the classes, and even if it is not entirely clear how the teaching was divided up, it is clear that the group that only undertook the laboratory activities was from a different school from those in the other two conditions.

If we think one teacher can be replaced by another without changing learning outcomes, and that schools are interchangeable such that we would expect exactly the same outcomes if we swapped a class of students from one school for a class from another school, then these variables are unimportant. If, however, we think the teacher doing the teaching and the school from which learners are sampled could reasonably make a difference to the learning achieved, then these are confounding variables which have not been properly controlled.

In my own experience, I do not think different teachers become equivalent even when they are briefed to teach in the same way, and I do not think we can assume schools are equivalent when providing students to participate in learning. These differences, then, undermine our ability to attribute any differences in outcomes to the differences in pedagogy (that "any differences observed…were due to the effect of the instructional method used").

Another confounding variable

And then I come back to my starting point. Learners did not just experience different forms of pedagogy but also different amounts of teaching. The difference between 3 lessons and 5 lessons might in itself be a factor (that is, even if the pedagogy employed in those lessons had been the same), as might the difference between 5 lessons and 8 lessons. So, time spent studying must be seen as a likely confounding variable. Indeed, it is not just the amount of time, but also the number of lessons, as the brain processes learning between classes and what is learnt in one lesson can be reinforced when reviewed in the next. (So we could not just assume, for example, that students automatically learn the same amount from, say, two 60 min. classes and four 30 min. classes covering the same material.)

What can we conclude?

As with many experiments in science teaching, we can accept the results of Shamsulbahri and Zulkiply's study, in terms of what they found in the specific study context, but still not be able to draw strong conclusions of wider significance.

Is the DARTs method beneficial as a pedagogical approach?

I expect the answer to this question is yes, but we need to be careful in drawing this conclusion from the experiment. Certainly the two groups which undertook the DARTs activities outperformed the group which did not. Yet that group was drawn from a different school and taught by a different teacher or teachers. That could have explained why there was less learning. (I am not claiming this is so – the point is we have no way of knowing as different variables are conflated.) In any case, the two groups that did undertake the DARTs activity were both given more lessons and spent substantially longer studying the topic they were tested on, than the class that did not. We simply cannot make a fair comparison here with any confidence.

Did the DARTs method facilitate better learning when it was combined with laboratory work?

There is a stronger comparison here. We still do not know if the two groups were taught by the same teacher/teachers (which could make a difference) or indeed whether the two groups started from a very similar level of prior knowledge. But, at least the two groups were from the same school, and both experienced the same DARTs based instruction. Greater learning was achieved when students undertook laboratory work as well as undertaking DARTs activities compared with students who only undertook the DARTs activity.

The 'combined' group still had more teaching than the DARTs group, but that does not matter here in drawing a logical conclusion, because the question being explored is of the form 'does additional teaching input provide additional value?' (Taber, 2019). The question here is not whether one type of pedagogy is better than the other, but simply whether also undertaking practical work adds something over just doing the paper-based learning activities.

Read about levels of control in experimental design

As the sample of learners was not representative of any specific wider population, we cannot assume this result would generalise beyond the participants in the study, although we might reasonably expect this result would be found elsewhere. But that is because we might already assume that learning about a practical activity (qualitative chemical analysis) will be enhanced by adding some laboratory-based study!

Does DARTs pedagogy produce more learning about qualitative analysis than laboratory activities?

Shamsulbahri and Zulkiply's third claim was bolder because it was framed as a generalisation: instruction through DARTs produces more learning about qualitative analysis than laboratory-based instruction. That seems quite a stretch from what the study clearly shows us.

What the research does show us with confidence is that a group of 40 students in one school taught by a particular teacher/teaching team with 5 lessons of a specific set of DARTs activities, performed better on a specific assessment instrument than a different group of 40 students in another school taught by a different teacher/teaching team through three lessons of laboratory work following a specific scheme of practical activities.


a group of 40 students … performed better on a specific assessment instrument than … a different group of 40 students:

  • in one school / in another school
  • taught by a particular teacher/teaching team / taught by a different teacher/teaching team
  • with 5 lessons / through 3 lessons
  • of a specific set of DARTs activities / of laboratory work following a specific scheme of practical activities

Confounded variables

Test instrument bias?

Even if we thought the post-test used by Shamsulbahri and Zulkiply was perfectly valid as an assessment of topic knowledge, we might be concerned by knowing that learning is situated in a context – we recall better in a context similar to that in which we learned.


How can we best assess students' learning about qualitative analysis?


So:

  • should we be concerned that the form of assessment, a paper-based instrument, is closer in nature to the DARTs learning experience than the laboratory learning experience?

and, if so,

  • might this suggest a bias in the measurement instrument towards one treatment (i.e., DARTs)?

and, if so,

  • might a laboratory-based assessment have favoured the group that did the laboratory based learning over the DARTs group, and led to different outcomes?

and, if so,

  • which approach to assessment has more ecological validity in this case: which type of assessment activity is a more authentic way of testing learning about a laboratory-based activity like qualitative chemical analysis?

A representation of my understanding of the experimental design

Can we generalise?

As always with small scale experiments into teaching, we have to judge the extent to which the specifics of the study might prevent us from generalising the findings – to be able to assume they would generally apply elsewhere.3 Here, we are left to ask to what extent we can

  • ignore any undisclosed difference between the groups in levels of prior learning;
  • ignore any difference between the schools and their populations;
  • ignore any differences in teacher(s) (competence, confidence, teaching style, rapport with classes, etc.);
  • ignore any idiosyncrasies in the DARTs scheme of instruction;
  • ignore any idiosyncrasies in the scheme of laboratory instruction;
  • ignore any idiosyncrasies (and potential biases) in the assessment instrument and its marking scheme and their application;

And, if we decide we can put aside any concerns about any of those matters, we can safely assume that (in learning this topic at this level)

  • 5 sessions of learning by DARTs is more effective than 3 sessions of laboratory learning.

Then we only have to decide if that is because

  • (i) DARTs activities teach more about this topic at this level than laboratory activities, or
  • (ii) whether some or all of the difference in learning outcomes is simply because 150 minutes of study (broken into five blocks) has more effect than 90 minutes of study (broken into three blocks).

What do you think?


Work cited:

Notes:

1 The authors refer to the conditions as

  • Experimental control group
  • DARTs
  • combination of Experiment + DARTs

I am referring to the first group as 'laboratory', both because it is not clear the students were doing any experiments (that is, testing hypotheses), as the practical activity was learning to undertake standard analytical tests, and, secondly, to avoid confusion (between the educational experiment and the laboratory practicals).


2 I think the reference to "no issue of pre-test treatment interaction" is probably meant to suggest that, as all students took the same pre-test, it will have had the same effect on all participants. But this not only ignores the potential effect of any differences in prior knowledge reflected in the pre-test scores that might influence subsequent learning, but also overlooks that the effect of taking the pre-test cannot be assumed to be neutral if for some learners it merely told them they knew nothing about the topic, whilst for others it activated, and so reinforced, some prior knowledge in the subject. In principle, the interaction between prior knowledge and taking the pre-test could have influenced learning at both cognitive and affective levels: that is, both in terms of consolidation of prior learning and cuing for the new learning; and in terms of a learner's confidence in, and attitude towards, learning the topic.


3 Even when we do have a representative sample of a population to test, we can only infer that the outcomes of an experiment reflect what will be most likely for members (schools, learners, classes, teachers…) of the wider population. Individual differences are such that we can never say that what most probably is the case will always be the case.


When an experiment tests a sample drawn at random from a wider population, then the findings of the experiment can be assumed to apply (on average) to the population. (Source: after Taber, 2019).

Experimental pot calls the research kettle black

Do not enquire as I do, enquire as I tell you


Keith S. Taber


Sotakova, Ganajova and Babincakova (2020) rightly criticised experiments into enquiry-based science teaching on the grounds that such studies often used control groups where the teaching methods had "not been clearly defined".

So, how did they respond to this challenge?

Consider a school science experiment where students report comparing the rates of reaction of 1 cm strips of magnesium ribbon dropped into:
(a) 100 ml of hydrochloric acid of 0.2 mol/dm3 concentration at a temperature of 28 ˚C; and
(b) some unspecified liquid.


This is a bit like someone who wants to check they are not diabetic, but – being worried they are – dips the test strip in a glass of tap water rather than their urine sample.


Basic premises of scientific enquiry and reporting are that

  • when carrying out an experiment one should carefully manage the conditions (which is easier in laboratory research than in educational enquiry) and
  • one should offer detailed reports of the work carried out.

In science there is an ideal that a research report should be detailed enough to allow other competent researchers to repeat the original study and verify the results reported. That repeating and checking of existing work is referred to as replication.

Replication in science

In practice, replication is more problematic for both principled and pragmatic reasons.

It is difficult to communicate tacit knowledge

It has been found that when a researcher develops some new technique, the official report in the literature is often inadequate to allow researchers elsewhere to repeat the work based only on the published account. The sociologist of science, Harry Collins (1992), has explored how there may be minor (but critical) details about the setting-up of apparatus or laboratory procedures that the original researchers did not feel were significant enough to report – or even that the researchers had not been explicitly aware of. Replication may require scientists to physically visit each other's laboratories to learn new techniques.

This should not be surprising, as the chemist and philosopher Michael Polanyi (1962/1969) long ago argued that science relied on tacit knowledge (sometimes known as implicit knowledge) – a kind of green fingers of the laboratory where people learn ways of doing things more as a kind of 'muscle memory' than formal procedural rules.

Novel knowledge claims are valued

The other problem with replication is that there is little to be gained for scientists by repeating other people's work if they believe it is sound, as journals put a premium on research papers that claim to report original work. Even if it proves possible to publish a true replication (at best, in a less prestigious journal), the replication study will just be an 'also ran' in the scientific race.


Copies need not apply!

Scientific kudos and rewards go to those who produce novel work: originality is a common criterion used when evaluating reports submitted to research journals.

(Image by Tom from Pixabay)


Historical studies (Shapin & Schaffer, 2011) show that what actually tends to happen is that scientists – deliberately – do not exactly replicate published studies, but rather make adjustments to produce a modified version of the reported experiment. A scientist's mindset is not to confirm, but to seek a new, publishable, result,

  • they say it works for tin, so let's try manganese?
  • they did it in frogs, let's see if it works in toads?
  • will we still get that effect closer to the boiling point?
  • the outcome in broad spectrum light has been reported, but might monochromatic light of some particular frequency be more efficient?
  • they used glucose, we can try fructose

This extends (or finds the limits of) the range of application of scientific ideas, and allows the researchers to seek publication of new claims.

I have argued that the same logic is needed in experimental studies of teaching approaches, but this requires researchers detailing the context of their studies rather better than many do (e.g., not just 'twelve year olds in a public school in country X'),

"When there is a series of studies testing the same innovation, it is most useful if collectively they sample in a way that offers maximum information about the potential range of effectiveness of the innovation. There are clearly many factors that may be relevant. It may be useful for replication studies of effective innovations to take place with groups of different socio-economic status, or in different countries with different curriculum contexts, or indeed in countries with different cultural norms (and perhaps very different class sizes; different access to laboratory facilities) and languages of instruction …It may be useful to test the range of effectiveness of some innovations in terms of the ages of students, or across a range of quite different science topics. Such decisions should be based on theoretical considerations.

…If all existing studies report positive outcomes, then it is most useful to select new samples that are as different as possible from those already tested…When existing studies suggest the innovation is effective in some contexts but not others, then the characteristics of samples/context of published studies can be used to guide the selection of new samples/contexts (perhaps those judged as offering intermediate cases) that can help illuminate the boundaries of the range of effectiveness of the innovation."

Taber, 2019, pp.104-105

When scientists do relish replication

The exception, that tests the 'scientists do not simply replicate' rule, is when it is suspected that a research finding is wrong. Then, an attempt at replication might be used to show a published account is flawed.

For example, when 'cold fusion' was announced with much fanfare (ahead of the peer reviewed publications reporting the research), many scientists simply thought it was highly unlikely that atomic energy generation was going to be possible in fairly standard glassware (not that unlike the beakers and flasks used in school science) at room temperature, and so there was a challenge to find out what the original researchers had got wrong.

"When it was claimed that power could be generated by 'cold fusion', scientists did not simply accept this, but went about trying it for themselves…Over a period of time, a (near) consensus developed that, when sufficient precautions were made to measure energy inputs and outputs accurately, there was no basis for considering a new revolutionary means of power generation had been discovered.

Taber, 2020, p.18

Of course, one failed replication might just mean the second team did not quite do the experiment correctly, so it may take a series of failed replications to make the point. In this situation, being the first failed replication of many (so being first to correct the record in the literature) may bring prestige – but this also invites the risk of being the only failed replication (so, perhaps, being judged a poorly executed replication) if subsequently other researchers confirm the findings of the original study!

So, a single attempt at replication is neither enough to definitively verify nor to refute a published result. What all this does show is that the simple notion that there are crucial or critical experiments in science which, once reported, immediately 'prove' something for all time is a naïve oversimplification of how science works.

Experiments in education

Experiments are often the best way to test ideas about natural phenomena. They tend to be much less useful in education as there are often many potentially relevant variables that usually cannot be measured, let alone controlled, even if they can be identified.

  • Without proper control, you do not have a meaningful experiment.
  • Without a detailed account of the different treatments, and so how the comparison condition is different from the experimental condition, you do not have a useful scientific report, but little more than an anecdote.

Challenges of experimental work in classrooms

Despite this, the research literature includes a vast number of educational studies claiming to be experiments to test this innovation or that (Taber, 2019). Some are very informative. But many are so flawed in design or execution that their conclusions rely more on the researchers' expectations than a logical chain of argument from robust evidence. They often use poorly managed experimental conditions to find differences in learning outcomes between groups of students that are initially not equivalent. 1 (Poorly managed?: because there are severe – practical and ethical – limits on the variables you can control in a school or college classroom.)

Read about expectancy effects in research

Statistical tests are then applied that would be informative had there been a genuinely controlled experiment, with identical starting points and only the variable of interest differing between the two conditions. Results are claimed by ignoring the inconvenient fact that the statistical tests used do not, strictly, apply in the conditions actually studied! Worse, occasionally researchers believe they should have obtained a positive result, and so claim one even when the statistical tests suggest otherwise (e.g., read 'Falsifying research conclusions')! And, in order to try to force a result, a supposed innovation may be compared with control conditions that have been deliberately framed to ensure the learners in that condition are not taught well!

Read about unethical control conditions

A common problem is that it is not possible to randomise students to conditions, so only classes are assigned to treatments randomly. As there are usually only a few classes in each condition (indeed, often only one class in each condition) there are not enough 'units of analysis' to validly use statistical tests. A common solution to this common problem is…to do the tests anyway, as if there had been randomisation of learners. 2 The computer that crunches the numbers follows a program written on the assumption that researchers will not cheat, so it churns out statistical results and (often) reports significant outcomes due to a misuse of the tests. 3

This is a bit like someone who wants to check they are not diabetic, but being worried they are, dips the test strip in a glass of tap water rather than their urine sample. They cannot blame the technology for getting it wrong if they do not follow the proper procedures.
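The unit-of-analysis problem can be sketched numerically. The scores below are invented for illustration: two classes per condition, four students per class, with systematic differences between classes within each condition. Running a pooled two-sample t-test on the individual students (as if they were independent units) yields an apparently 'significant' result; running the same test on the class means – the legitimate units of analysis when whole classes were assigned to conditions – does not come close to the (much larger) critical value for two degrees of freedom.

```python
import math
import statistics

def pooled_t(a, b):
    """Two-sample t statistic using the pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Invented scores: two classes per condition, four students per class.
a1, a2 = [10, 11, 12, 13], [13, 14, 15, 16]   # 'control' classes
b1, b2 = [13, 14, 15, 16], [16, 17, 18, 19]   # 'experimental' classes

# Treating students as independent units: df = 14, critical t (p < 0.05) ~ 2.145
t_students = pooled_t(a1 + a2, b1 + b2)

# Treating class means as the units of analysis: df = 2, critical t ~ 4.303
t_classes = pooled_t(
    [statistics.mean(a1), statistics.mean(a2)],
    [statistics.mean(b1), statistics.mean(b2)],
)

print(round(t_students, 3))  # 3.0   -> exceeds 2.145, so claimed 'significant'
print(round(t_classes, 3))   # 1.414 -> well below 4.303, not significant
```

The data and the 'treatments' here are entirely hypothetical; the point is only that the same raw scores give a 'significant' result when the wrong units of analysis are entered into the test.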

I have been trying to make a fuss about these issues for some time, because a lot of the results presented in the educational literature are based upon experimental studies that, at best, do not report the research in enough detail, and often, when there is enough detail to be scrutinised, fall well short of valid experiments.

I have a hunch that many people with scientific training are so convinced of the superiority of the experimental method, that they tacitly assume it is better to do invalid experiments into teaching, than adopt other approaches which (whilst not as inherently convincing as a well-designed and executed experiment) can actually offer useful insights in the complex and messy context of classrooms. 4

Read: why do natural scientists tend to make poor social scientists?

So, it is uplifting when I read work which seems to reflect my concerns about the reliance on experiments in those situations where good experiments are not feasible. In that regard, I was reading a paper reporting a study into enquiry-based teaching (Sotakova, Ganajova & Babincakova, 2020) where the authors made the very valid criticism:

"The ambiguous results of research comparing IBSE [enquiry-based science education] with other teaching methods may result from the fact that often, [sic] teaching methods used in the control groups have not been clearly defined, merely referred to as "traditional teaching methods" with no further specification, or there has been no control group at all."

Sotakova, Ganajova & Babincakova, 2020, p.500

Quite right!


The pot calling the kettle black

idiom "that means people should not criticise someone else for a fault that they have themselves" 5 (https://dictionary.cambridge.org/dictionary/english/pot-calling-the-kettle-black)

(Images by OpenClipart-Vectors from Pixabay)


Now, I do not want to appear to be the pot calling the kettle black myself, so before proceeding I should acknowledge that I was part of a major funded research project exploring a teaching innovation in lower secondary science and maths teaching. Despite a large grant, the need to enrol a sufficient number of classes to randomise to treatments to allow statistical testing meant that we had very limited opportunities to observe, and so detail, the teaching in the control condition, which was basically the teachers doing their normal teaching, whilst the teachers of the experimental classes were asked to follow a particular scheme of work.


Results from a randomised trial showing the range of within-condition outcomes (After Figure 5, Taber, 2019)

In the event, the electricity module I was working on produced mean outcomes almost identical to those of the control condition (see the figure). The spread of outcomes was large in both conditions – so, clearly, there were significant differences between individual classes that influenced learning: but these differences were even more extreme in the condition where the teachers were supposed to be teaching the same content, in the same order, with the same materials and activities, than in the control condition where teachers were free to do whatever they thought best!

The main thing I learned from this experience is that experiments into teaching are highly problematic.

Anyway, Sotakova, Ganajova and Babincakova were quite right to point out that experiments with poorly defined control conditions are inadequate. Consider a school science experiment designed by students who report comparing the rates of reaction of 1 cm strips of magnesium ribbon dropped into

  • (a) 100 ml of hydrochloric acid of 0.2 mol/dm3 concentration at a temperature of 28 ˚C; and
  • (b) some unspecified liquid.

A science teacher might be disappointed with the students concerned, given the limited informativeness of such an experiment – yet highly qualified science education researchers often report analogous experiments where some highly specified teaching is compared with instruction that is not detailed at all.

The pot decides to follow the example of the kettle

So, what did Sotakova and colleagues do?

"Pre-test and post-test two-group design was employed in the research…Within a specified period of time, an experimental intervention was performed within the experimental group while the control group remained unaffected. The teaching method as an independent variable was manipulated to identify its effect on the dependent variable (in this case, knowledge and skills). Both groups were tested using the same methods before and after the experiment…both groups proceeded to revise the 'Changes in chemical reactions' thematic unit in the course of 10 lessons"

Sotakova, Ganajova & Babincakova, 2020, pp.501, 505.

In the experimental condition, enquiry-based methods were used in five distinct activities as a revision approach (an example activity is detailed in the paper). What about the control conditions?

"…in the control group IBSE was not used at all…In the control group, teachers revised the topic using methods of their choice, e.g. questions & answers, oral and written revision, textbook studying, demonstration experiments, laboratory work."

Sotakova, Ganajova & Babincakova, 2020, pp.502, 505

So, the 'control' condition involved the particular teachers in that condition doing as they wished. The only control seems to be that they were asked not to use enquiry. Otherwise, anything went – and that anything was not necessarily typical of what other teachers might have done. 6

This might have involved any of a number of different activities, such as

  • questions and answers
  • oral and written revision
  • textbook studying
  • demonstration experiments
  • laboratory work

or combinations of them. Call me picky (or a blackened pot), but did these authors not complain that

"The ambiguous results of research comparing IBSE [enquiry-based science education] with other teaching methods may result from the fact that often…teaching methods used in the control groups have not been clearly defined…"

Sotakova, Ganajova & Babincakova, 2020, p.500

Hm.


Work cited

Notes:

1 A very common approach is to use a pre-test to check for significant differences between classes before the intervention. Where differences between groups do not reach the usual criterion for statistical significance (probability, p<0.05) the groups are declared 'equivalent'. That is, a negative result in a test designed to detect unlikely differences is inappropriately treated as an indicator of equivalence (Taber, 2019).
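A minimal sketch of why a non-significant pre-test does not establish equivalence. The pre-test scores below are invented: one class really does start a full point ahead of the other, but with only ten students per class the t statistic falls short of the critical value, so the usual procedure would declare the groups 'equivalent' – and any post-test advantage for the second class could then be misattributed to the intervention.

```python
import math
import statistics

def pooled_t(a, b):
    """Two-sample t statistic using the pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Invented pre-test scores: the second class starts a full point ahead.
pre_a = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14]
pre_b = [x + 1 for x in pre_a]

t = pooled_t(pre_a, pre_b)
print(round(t, 3))  # 1.225 -> below the critical value of ~2.101 (df = 18, p < 0.05)
```

A test of this size simply lacks the power to detect a real one-point head start; failing to find a difference is not the same as demonstrating there is none.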

Read about testing for initial equivalence


2 So, for example, a valid procedure might be to enter the mean class scores on some instrument as data; but what are actually entered are the individual students' scores, as though the students can be treated as independent units rather than members of a treatment class.

Some statistical tests lead to a number (the statistic) which is then compared with the critical value that reaches statistical significance, as listed in a table. The number selected from the table depends on the number of 'degrees of freedom' in the experimental design. Often that should be determined by the number of classes involved in the experiment – but if instead the number of learners is used, a much smaller value of the calculated statistic will seem to reach significance.


3 Some of these studies would surely have given positive outcomes even if they had been able to randomise students to conditions or used a robust test for initial equivalence – but we cannot use that as a justification for ignoring the flaws in the experiment. That would be like claiming a laboratory result was obtained with dilute acid when actually concentrated acid was used – and then justifying the claim by arguing that the same result might have occurred with dilute acid.


4 Consider, for example, a case study that involves researchers in observing teaching, interviewing students and teachers, documenting classroom activities, recording classroom dialogue, collecting samples of student work, etc. This type of enquiry can offer a good deal of insight into the quality of teaching and learning in the class and the processes at work during instruction (and so whether specific outcomes seem to be causally linked to features of the innovation being tested).

Critics of so-called qualitative methods quite rightly point out that such approaches cannot actually show any one approach is better than others – only experiments can do that. Ideally, we need both types of study, as they complement each other by offering different kinds of information.

The problem with many experiments reported in the education literature is that because of the inherent challenges of setting up genuinely fair testing in educational contexts they are not comparing like with like, and often it is not even clear what the comparison is with! Probably this can only be avoided in very large scale (and so expensive) studies where enough different classrooms can be randomly assigned to each condition to allow statistics to be used.

Why do researchers keep undertaking small scale experimental studies that often lack proper initial equivalence between conditions, and that often have inadequate control of variables? I suggest they will continue to do so as long as research journals continue to publish the studies (and allow them to claim definitive conclusions) regardless of their problems.


5 At a time when cooking was done on open fires, using wood that produced much smoke, the idiom was likely easily understood. In an age of ceramic hobs and electric kettles the saying has become anachronistic.

From the perspective of thermal physics, black cooking pots (rather than shiny reflective surfaces) may be a sensible choice.


6 So, the experimental treatment was being compared with the current standard practice of the teachers assigned to the control condition. It would not matter so much that this varies between teachers, nor that we do not know what that practice is, if we could be confident that the teachers in the control condition were (or were very probably) a representative sample of the wider population of teachers – such as a sufficiently large number of teachers randomly chosen from the wider population (Taber, 2019). Then we would at least know whether the enquiry-based approach was an improvement on current common practice.

All we actually know is how the experimental condition fared in comparison with the unknown practices of a small number of teachers who may or may not have been representative of the wider population.

Out of the womb of darkness

Medical ethics in 20th Century movies


Keith S. Taber


The hero of the film, Dr Holden, is presented as a scientist. Here he is trying to collect some data.
(still from 'The Night of the Demon')

"The Night of the Demon" is a 1957 British film about an American professor who visits England to investigate a supposed satanic cult. It was just shown on English television. It was considered as a horror film at the time of its release, although the short scenes that actually feature a (supposedly real? merely imagined? *) monster are laughable today (think Star Trek's Gorn in the original series, and consider if it is believable as anything other than an actor wearing a lizard suit – and you get the level of horror involved). [*Apparently the director, Jacques Tourneur, never intended a demon to be shown, but the film's producer decided to add footage showing the monster in the opening scenes, potentially undermining the whole point of the film: but giving the publicity department something they could work with. 6]


A real scary demon (in 1959) and a convincing alien (in 1967)?
(stills from 'The Night of the Demon' and ' Star Trek' episode 'Arena')
[Move the slider to see more of each image.]

The film's protagonist is a psychologist, Dr. John Holden, who dismisses stories of demons and witchcraft and the like, and has made a career studying people's beliefs about such superstitions. Dr Holden's visit to Britain was timed to coincide with a conference at which he was to present; it also coincided, by chance, with the death of one of his colleagues (who had been subject to a hex for investigating the cult).


'Night of the Demon' (Dir.  Jacques Tourneur) movie poster: Sabre Film Production.
[As was common at the time, although the film was in monochrome, the publicity was coloured. Whether the colour painting of the monster looks even less scary than the version in the film itself is a moot point.]

The film works much better as a kind of psychological thriller examining the power of beliefs, than as horror. (Director: 1 – Producer, 0.) That, if we believe something enough, it can have real effects is well acknowledged – but this does not need a supernatural explanation. People can be 'scared' to death by what they imagine, and how they respond to their fears. Researchers expecting a positive outcome from their research are likely to inadvertently behave in ways that lead to this very result: thus the use of double-blind studies in medical trials, so that the researchers do not know which patients are receiving which treatment.

Read about expectancy effects in research

While the modern viewer will find little of suspense in the film, I did metaphorically at least 'recoil with shock' from one moment of 'horror'. At the conference a patient (Rand Hobart) is wheeled in on a trolley – someone suspected of having committed a murder associated with the cult, whom the authorities had allowed to be questioned by the researchers…at the conference.


"The authorities have lent me this suspected murderer for the benefit of dramatic effect and for plot development purposes"
(still from 'The Night of the Demon').

A variety of movie posters were produced for the film 6 – arguably this one reflects the genuinely horrific aspect of the story. To a modern viewer this might also appear the most honest representation of the film, as the demon given prominence in some versions of the poster barely features in the film.

Holden's British colleague, Professor O'Brien, explains to the delegates,

"For a period of time this man has been as you see him here. He fails to respond to any normal stimulation. His experience, whatever it was, which we hope here to discover, has left him in a state of absolute catatonic immobility. When I first investigated this case, the problem of how to hypnotise an unresponsive person was the major one. Now the proceedings may be somewhat dramatic, but they are necessary. The only way of bringing his mind out of the womb of darkness into which it has retreated to protect itself, is by therapeutic shock, electrical or chemical. For our purposes we are today using pentothal [? 1] and later methylamphetamine."

Introducing a demonstration of non-consensual use of drugs on a prisoner/patient

"Okay, we'll give him a barbiturate, then we'll hypnotise him, then a stimulant, and if that does not kill him, surely he will simply, calmly and rationally, tell us what so traumatised him that he has completely withdrawn into his subconscious."
(Still from 'The Night of the Demon')


After an injection, Hobart comes out of his catatonic state, becomes aware of his surroundings, and panics.

The dignity of the accused: Hobart is forced out of his 'state of absolute catatonic immobility' to discover he is an exhibit at a scientific conference.
(Still from 'The Night of the Demon'.)

He is physically restrained, and examined by Holden (supposedly the 'hero' of the piece), who then hypnotises him.



He is then given an injection of methylamphetamine before being questioned by O'Brien and Holden. He becomes agitated (what, after being forcibly given 'speed'?), breaks free, and leaps, out of a conveniently placed window, to his death.

Now, of course, this is all just fiction – a story. No one is really drugged, and Hobart is played by an actor who is unharmed. (I can be fairly sure of that as the part was played by Brian Wilde who much later turned up alive and well as prison officer 'Mr Barrowclough' in BBC's Ronnie Barker vehicle 'Porridge'.)


The magic of the movies – people do not stay dead, and there are no professional misconduct charges brought against our hero.
(Stills from 'The Night of the Demon' and from BBC series 'Porridge'.3 )
[Move the slider to see more of each image.]

Yet this is not some fantastical film (the Gorn's distant cousin aside) but played for realism. Would a psychiatric patient and murder suspect have been released to be paraded and demonstrated at a conference on the paranormal in 1957? I expect not. Would the presenters have been allowed to drug Hobart without his consent?

Read about voluntary, informed, consent

An adult cannot normally be medicated without their consent unless they are considered to lack the ability to make responsible decisions for themselves. Today, it might be possible to give a patient drugs without consent if they have been sectioned under the Mental Health Act (1983) and it was considered the action was necessary for their safety or for the safety of others. Hobart was certainly not an immediate threat to anyone before he was brought out of his trance.

However, even if this enforced use of drugs was sanctioned, this would not be done in a public place with dozens of onlookers. 4 And it would not be done (in the U.K. at least!) simply to question someone about a crime.5 Presumably, the makers of the film either thought that this scene reflected something quite reasonable, or, at least, that the cinema-going public would find this sufficiently feasible to suspend disbelief. If this fictitious episode did not reflect acceptable ethical standards at the time, it would seem to tell us something about public perceptions of the attitude of those in authority (whether the actual authorities who were meant to have a duty of care to a person under arrest, or those designated with professional roles and academic titles) to human rights.

Today, however, professionals such as researchers, doctors, and even teachers, are prepared for their work with a strong emphasis on professional ethics. In medical care, the interest of the patient themselves comes first. In research, informants are voluntary participants in our studies, who offer us the gift of data, and are not subjects of our enquiries to be treated simply as available material for our work.

Yet, actually, this is largely a modern perspective that has developed in recent decades, and sadly there are many real stories, even in living memory, of professionals deciding that people (and this usually meant people with less standing or power in their society) should be drugged, or shocked, or operated on, without their consent and even against their explicit wishes; for what is seen as their own, or even what is judged as some greater, good; in circumstances where it would be totally unacceptable in most countries these days.

So, although this is not really a horror film by today's measures, I hope any other researchers (or medical practitioners) who were watching the film shared my own reaction to this scene: 'no, they cannot do that!'

At least, they could not do that today.

Read about research ethics


Notes

1 This sounds to me like 'pentatyl', but I could not find any reference to a therapeutic drug of that name. Fentanyl is a powerful anti-pain drug, which like amphetamines is abused for recreational use – but was only introduced into practice the year after the film was made. It was most likely referring to sodium thiopental, known as pentothal, and much used (in movies and television, at least) as a truth serum. 5 As it is a barbiturate, and so is used in anaesthesia, it does not seem an obvious drug of choice to wake someone from a catatonic state.


2 The script is based loosely on a 1911 M. R. James short story, 'Casting the Runes' that does not include the episode discussed.


3 I have flipped this image (as can be seen from the newspaper) to put Wilde (playing alongside Ronnie Barker, standing, and Richard Beckinsale), on the right hand side of the picture.


4 Which is not to claim that such a public demonstration would have been unlikely at another time and place. Execution was still used in the U.K. until 1964 (during my lifetime), although by that time being found guilty of vagrancy (being unemployed and hanging around {unfortunate pun unintended}) for the second time was no longer a capital offence. However, after 1868 executions were no longer carried out in public.

It was not unknown for the corpses of executed criminals to be subject to public dissection in Renaissance [sic, ironically] Europe.


5 Fiction, of course, has myriad scenes where 'truth drugs' are used to obtain secrets from prisoners – but usually those carrying out the torture are the 'bad guys', either criminals or agents of what is represented in the story as an enemy or dystopian state.


6 Some variations on a theme. (For some reason, for its slightly cut U.S. release 'The Night of the Demon' was called 'The Curse of the Demon'.) The various representations of the demon and the prominence given to it seem odd to a modern viewer given how little the demon actually features in the film.

The references to actually seeing demons and monsters from hell on the screen, "the most terrifying story ever told", and "scenes of terror never before imagined" raise the question of whether the copywriters were expected to watch a film before producing their copy.

Passive learners in unethical control conditions

When 'direct instruction' just becomes poor instruction


Keith S. Taber


An experiment that has been set up to ensure the control condition fails, and so compares an innovation with a substandard teaching condition, can – at best – only show the innovation is not as bad as the substandard teaching

One of the things which angers me when I read research papers is examples of what I think of as 'rhetorical research' that use unethical control conditions (Taber, 2019). That is, educational research which sets up one group of students to be taught in a way that clearly disadvantages them, in order to ensure the success of an experimental teaching approach.

"I am suggesting that some of the experimental studies reported in the literature are rhetorical in the … sense that the researchers clearly expect to demonstrate a well- established effect, albeit in a specific context where it has not previously been demonstrated. The general form of the question 'will this much-tested teaching approach also work here' is clearly set up expecting the answer 'yes'. Indeed, control conditions may be chosen to give the experiment the best possible chance of producing a positive outcome for the experimental treatment."

Taber, 2019, p.108

This irks me for two reasons. The first, obviously, is that researchers have been prepared to (ab)use learners as 'data fodder' and subject them to poor learning contexts in order to have the best chance of getting positive results for the innovation supposedly being 'tested'. However, it also annoys me as this is inherently a poor research design (and so a poor use of resources) as it severely limits what can be found out. An experiment that compares an innovation with a substandard teaching condition can, at best, show the innovation is not as ineffective as the substandard teaching in the control condition; but it cannot tell us if the innovation is at least as effective as existing good practice.

This irritation is compounded when the work I am reading is not some amateur report thrown together for a predatory journal, but an otherwise serious study published in a good research outlet. That was certainly the case for a paper I read today in Research in Science Education (the journal of the Australasian Science Education Research Association) on problem-based learning (Tarhan, Ayar-Kayali, Urek & Acar, 2008).

Rhetorical studies?

Genuine research is undertaken to find something out. The researchers in this enquiry claim:

"This research study aims to examine the effectiveness of a [sic] problem-based learning [PbBL] on 9th grade students' understanding of intermolecular forces (dipole- dipole forces, London dispersion forces and hydrogen bonding)."

Tarhan, et al., 2008, p.285

But they choose to compare PbBL with a teaching approach that they expect to be ineffective. Here the researchers might have asked "how does teaching year 9 students about intermolecular forces through problem-based learning compare with current good practice?" After all, even if PbBL worked quite well, if it is not quite as effective as the way teachers are currently teaching the topic then, all other things being equal, there is no reason to shift to it; whereas if it outperforms even our best current approaches, then there is a reason to recommend it to teachers and roll out associated professional development opportunities.


Problem-based learning (third column) uses a problem (i.e., a task which cannot be solved simply by recalling prior learning or employing an algorithmic routine) as the focus and motivation for learning about a topic

Of course, that over-simplifies the situation, as in education, 'all other things' never are equal (every school, class, teacher…is unique). An approach that works best on average will not work best everywhere. But knowing what works best on average (that is, taken across the diverse range of teaching and learning contexts) is certainly a very useful starting point when teachers want to consider what might work best in their own classrooms.

Rhetorical research is poor research, as it is set up (deliberately or inadvertently) to demonstrate a particular outcome, and, so, has built-in bias. In the case of experimental studies, this often means choosing an ineffective instructional approach for the comparison class. Why else would researchers select a control condition they know is not suitable for bringing about the educational outcomes they are testing for?

Problem-Based Learning in a 9th Grade Chemistry Class

Tarhan and colleagues' study was undertaken in one school with 78 students divided into two groups. One group was taught through a sequence based on problem-based learning that involved students undertaking research in groups, gently supported and steered by the teacher. The approach allowed student dialogue, which is believed to be valuable in learning, and motivated students to be actively engaged in enquiry. When such an approach is well judged it has the potential to count as 'scaffolding' of learning. This seems a very worthwhile innovation – well worth developing and evaluating.

Of course, work in one school cannot be assumed to generalise elsewhere, and small-scale experimental work of this kind is open to major threats to validity, such as expectancy effects and researcher bias – but this is unfortunately always true of these kinds of studies (which are often all educational researchers are resourced to carry out). Finding out what works best in some educational context at least potentially contributes to building up an overall picture (Taber, 2019). 1

Why is this rhetorical research?

I consider this rhetoric research because of the claims the authors make at the start of the study:

"Research in science education therefore has focused on applying active learning techniques, which ensure the affective construction of knowledge, prevent the formation of alternate conceptions, and remedy existing alternate conceptions…Other studies suggest that active learning methods increase learning achievement by requiring students to play a more active role in the learning process…According to active learning principles, which emphasise constructivism, students must engage in researching, reasoning, critical thinking, decision making, analysis and synthesis during construction of their knowledge."

Tarhan, et al., 2008, pp.285-286

If they genuinely believed that, then to test the effectiveness of their PbBL activity, Tarhan and colleagues needed to compare it with some other teaching condition that they are confident can "ensure the affective construction of knowledge, prevent the formation of alternate conceptions, and remedy existing alternate conceptions… requir[e] students to play a more active role in the learning process…[and] engage in researching, reasoning, critical thinking, decision making, analysis and synthesis during construction of their knowledge." A failure to do that means that the 'experiment' has been biased – it has been set up to ensure the control condition fails.

Unethical research?

"In most educational research experiments of [this] type…potential harm is likely to be limited to subjecting students (and teachers) to conditions where teaching may be less effective, and perhaps demotivating. This may happen in experimental treatments with genuine innovations (given the nature of research). It can also potentially occur in control conditions if students are subjected to teaching inputs of low effectiveness when better alternatives were available. This may be judged only a modest level of harm, but – given that the whole purpose of experiments to test teaching innovations is to facilitate improvements in teaching effectiveness – this possibility should be taken seriously."

Taber, 2019, p.94

The same teacher taught both classes: "Both of the groups were taught by the same chemistry teacher, who was experienced in active learning and PbBL" (p.288). This would seem to reduce the 'teacher effect' – outcomes being affected because the teacher of one class is more effective than that of another. (Reduce, rather than eliminate, as different teachers have different styles, skills, and varied expertise: so, most teachers are more suited to, and competent in, some teaching approaches than others.)

So, this teacher was certainly capable of teaching in the ways that Tarhan and colleagues claim as necessary for effective learning ("active learning techniques"). However, the control condition sets up the opposite of active learning, so-called passive learning:

"In this study, the control group was taught the same topics as the experimental group using a teacher-centred traditional didactic lecture format. Teaching strategies were dependent on teacher expression and question-answer format. However, students were passive participants during the lessons and they only listened and took notes as the teacher lectured on the content.

The lesson was begun with teacher explanation about polar and nonpolar covalent bonding. She defined formation of dipole-dipole forces between polar molecules. She explained that because of the difference in electronegativities between the H and Cl atoms for HCl molecule is 0.9, they are polar molecules and there are dipole-dipole forces between HCl molecules. She also stated that the intermolecular dipole-dipole forces are weaker than intramolecular bonds such as covalent and ionic bonding. She gave the example of vaporisation and decomposition of HCl. She explained that while 16 kJ/mol of energy is needed to overcome the intermolecular attraction between HCl molecules in liquid HCl during vaporisation process of HCl, 431 kJ/mol of energy is required to break the covalent bond between the H and Cl atoms in the HCl molecule. In the other lesson, the teacher reminded the students of dipole-dipole forces and then considered London dispersion forces as weak intermolecular forces that arise from the attractive force between instantaneous dipole in nonpolar molecules. She gave the examples of F2, Cl2, Br2, I2 and said that because the differences in electronegativity for these examples are zero, these molecules are non-polar and had intermolecular London dispersion forces. The effects of molecular size and mass on the strengths of London dispersion forces were discussed on the same examples. She compared the strengths of dipole-dipole forces and London dispersion forces by explaining the differences in melting and boiling points for polar (MgO, HCl and NO) and non-polar molecules (F2, Cl2, Br2, and I2). The teacher classified London dispersion forces and dipole- dipole as van der Waals forces, and indicated that there are both London dispersion forces and dipole-dipole forces between polar molecules and only London dispersion forces between nonpolar molecules. 
In the last lesson, teacher called attention to the differences in boiling points of H2O and H2S and defined hydrogen bonds as the other intermolecular forces besides dipole-dipole and London dispersion forces. Strengths of hydrogen bonds depending on molecular properties were explained and compared in HF, NH3 and H2O. She gave some examples of intermolecular forces in daily life. The lesson was concluded with a comparison of intermolecular forces with each other and intramolecular forces."

Tarhan, et al., 2008, p.293

Lecturing is not ideal for teaching university students. It is generally not suitable for teaching school children (and it is not consistent with what is expected in Turkish schools).

This was a lost opportunity to seriously evaluate teaching through PbBL by comparing it with teaching that followed the national policy recommendations. Moreover, it was a dereliction of educators' duty never to deliberately disadvantage learners. It is reasonable to experiment with children's learning when you feel there is a good chance of positive outcomes: it is not acceptable to deliberately set up learners to fail (e.g., by organising 'passive' learning when you claim to believe effective learning activities are necessarily 'active').

Isn't this 'direct instruction'?

Now, perhaps the account of the teaching given by Tarhan and colleagues might seem to fit the label of 'direct instruction'. Whilst Tarhan et al. claim constructivist teaching is clearly necessary for effective learning, there are some educators who claim that constructivist approaches are inferior, and that a more direct approach, 'direct instruction', is more likely to lead to learning gains.

This has been a lively debate, but often the various commentators use terminology differently and argue past each other (Taber, 2010). The proponents of direct instruction often criticise teaching that expects learners to take nearly all the responsibility for learning, with minimal teacher support. I would also criticise that (except perhaps in the case of graduate research students once they have demonstrated their competence, including knowing when to seek supervisory guidance). That is quite unlike genuine constructivist teaching, which is optimally guided (Taber, 2011): the teacher manages activities, constantly monitors learner progress, and intervenes with various forms of direction and support as needed. Tarhan and colleagues' description of their problem-based learning experimental condition appears to have included this kind of guidance:

"The teacher visited each group briefly, and steered students appropriately by using some guiding questions and encouraging them to generate their hypothesis. The teacher also stimulated the students to gain more information on topics such as the polar structure of molecules, differences in electronegativity, electron number, atom size and the relationship between these parameters and melting-boiling points…The teacher encouraged students to discuss the differences in melting and boiling points for polar and non-polar molecules. The students came up with [their] research questions under the guidance of the teacher…"

Tarhan, et al., 2008, pp.290-291

By contrast, descriptions of effective direct instruction do involve tightly planned teaching with carefully scripted teacher moves of the kind quoted in the account, above, of the control condition. (But any wise teacher knows that lessons can only be scripted as a provisional plan: the teacher has to constantly check the learners are making sense of teaching as intended, and must be prepared to change pace, repeat sections, re-order or substitute activities, invent new analogies and examples, and so forth.)

However, this instruction is not simply a one-way transfer of information, but rather a teacher-led process that engages students in active learning to process the material being introduced by the teacher. If this is done by breaking the material into manageable learning quanta, each of which students engage with in dialogic learning activities before proceeding to the next, then this is constructivist teaching (even if it may also be considered by some as 'direct instruction'!)


Effective teaching moves between teacher input and student activities and is not just the teacher communicating information to the learners.

By contrast, the lecture format adopted by Tarhan's team was based on the teacher offering a multi-step argument (delivered over several lessons) and asking the learners to follow and retain an extensive presentation.

"The lesson was begun with teacher explanation …

She defined …

She explained…

She also stated…

She gave the example …

She explained that …

the teacher reminded the students …

She gave the examples of …

She compared…

The teacher classified …

and indicated that …

[the] teacher called attention to …

She gave some examples of …"

Tarhan, et al., 2008, p.293

This is a description of the transmission of information through a communication channel: not an account of teaching which engages with students' thinking and guides them to new understandings.

Ethical review

Despite the paper having been published in a major journal, Research in Science Education, there seems to be no mention of the study design having been through any kind of institutional ethical review before the research began. Moreover, there is no reference to the learners or their parents/guardians having been asked for, or having given, voluntary, informed consent, as is usually required in research with human participants. Indeed, Tarhan and colleagues refer to the children as the 'subjects' of their research, not participants in their study.

Perhaps ethical review was not expected in the national context (at least, in 2008). Certainly, it is difficult to imagine how voluntary, informed consent could have been obtained if parents were to be informed that half of the learners would be deliberately subjected to a teaching approach the researchers claim lacks any of the features "students must engage in…during construction of their knowledge".

PbBL is better than…deliberately teaching in a way designed to limit learning

Tarhan and colleagues, unsurprisingly, report that on a post-test the students who were taught through PbBL outperformed those students who were lectured at. It would have been very surprising (and so potentially more interesting, and, perhaps, even useful, research!) had they found anything else, given the way the research was biased.

So, to summarise:

  1. At the outset of the paper it is reported that it is already established that effective learning requires students to engage in active learning tasks.
  2. Students in the experimental conditions undertook learning through a PbBL sequence designed to engage them in active learning.
  3. Students in the control condition were subject to a sequence of lecturing inputs designed to ensure they were passive.
  4. Students in the active learning condition outperformed the students in the passive learning condition.

Which, I suggest, can be considered both rhetorical research and unethical.


The study can be considered both rhetorical and unfair to the learners assigned to be in the control group

Read about rhetorical experiments

Read about unethical control conditions



Note:

1 There is a major issue which is often ignored in studies of this type (where a pedagogical innovation is trialled in a single school area, school or classroom). Finding that problem-based learning (or whatever) is effective in one school when teaching one topic to one year group does not allow us to generalise to other classrooms, schools, countries, educational levels, topics and disciplines.

Indeed, as every school, every teacher, every class, etc., is unique in some ways, it might be argued that one only really finds out if an approach will work well 'here' by trying it out 'here' – and whether it is universally applicable by trying it everywhere. Clearly academic researchers cannot carry out such a programme, but individual teachers and departments can try out promising approaches for themselves (i.e., context-directed research, such as 'action research').

We might ask if there is any point in researchers carrying out studies of the type discussed in this article, where they start by saying an approach has been widely demonstrated, and then test it in what seems an arbitrarily chosen (or, more likely, convenient) curriculum and classroom context, given that we cannot generalise from individual studies, and it is not viable to test every possible context.

However, there are some sensible guidelines for how a series of such studies into the same type of pedagogic innovation in different contexts can be made more useful by (a) helping determine the range of contexts where an approach is effective (through what we might call 'incremental generalisation'), and (b) documenting the research contexts in sufficient detail to support readers in making judgements about the degree of similarity with their own teaching context (Taber, 2019).

Read about replication studies

Read about incremental generalisation

Falsifying research conclusions

You do not need to falsify your results if you are happy to draw conclusions contrary to the outcome of your data analysis.


Keith S. Taber


Li and colleagues claim that their innovation is successful in improving teaching quality and student learning: but their own data analysis does not support this.

I recently read a research study to evaluate a teaching innovation where the authors

  • presented their results,
  • reported the statistical test they had used to analyse their results,
  • acknowledged that the outcome of their experiment was negative (not statistically significant), then
  • stated their findings as having obtained a positive outcome, and
  • concluded their paper by arguing they had demonstrated their teaching innovation was effective.

Li, Ouyang, Xu and Zhang's (2022) paper in the Journal of Chemical Education contravenes the scientific norm that your conclusions should be consistent with the outcome of your data analysis.

And this was not in a paper in one of those predatory journals that I have criticised so often here – this was a study in a well regarded journal published by a learned scientific society!

The legal analogy

I have suggested (Taber, 2013) that writing up research can be understood in terms of a number of metaphoric roles: researchers need to

  • tell the story of their research;
  • teach readers about the unfamiliar aspects of their work;
  • make a case for the knowledge claims they make.

Three metaphors for writing-up research

All three aspects are important in making a paper accessible and useful to readers, but arguably the most important aspect is the 'legal' analogy: a research paper is an argument to make a claim for new public knowledge. A paper that does not make its case does not add anything of substance to the literature.

Imagine a criminal case where the prosecution seeks to make its argument at a pre-trial hearing:

"The police found fingerprints and D.N.A. evidence at the scene, which they believe were from the accused."

"Were these traces sent for forensic analysis?"

"Of course. The laboratory undertook the standard tests to identify who left these traces."

"And what did these analyses reveal?"

"Well according to the current standards that are widely accepted in the field, the laboratory was unable to find a definite match between the material collected at the scene, and fingerprints and a D.N.A. sample provided by the defendant."

"And what did the police conclude from these findings?"

"The police concluded that the fingerprints and D.N.A. evidence show that the accused was at the scene of the crime."

It seems unlikely that such a scenario has ever played out, at least in any democratic country where there is an independent judiciary, as the prosecution would be open to ridicule and it is quite likely the judge would have some comments about wasting court time. What would seem even more remarkable, however, would be if the judge decided on the basis of this presentation that there was a prima facie case to answer that should proceed to a full jury trial.

Yet in educational research, it seems parallel logic can be persuasive enough to get a paper published in a good peer-reviewed journal.

Testing an educational innovation

The paper was entitled 'Implementation of the Student-Centered Team-Based Learning Teaching Method in a Medicinal Chemistry Curriculum' (Li, Ouyang, Xu & Zhang, 2022), and it was published in the Journal of Chemical Education. 'J.Chem.Ed.' is a well-established, highly respected periodical that takes peer review seriously. It is published by a learned scientific society – the American Chemical Society.

That a study published in such a prestigious outlet should have such a serious and obvious flaw is worrying. Of course, no matter how good editorial and peer review standards are, it is inevitable that sometimes work with serious flaws will get published, and it is easy to pick out the odd problematic paper while ignoring the vast majority of quality work being published. But I did think this was a blatant problem that should have been spotted.

Indeed, because I have a lot of respect for the Journal of Chemical Education I decided not to blog about it ("but that is what you are doing…?"; yes, but stick with me) and to take time to write a detailed letter to the journal setting out the problem in the hope this would be acknowledged and the published paper would not stand unchallenged in the literature. The journal declined to publish my letter although the referees seemed to generally accept the critique. This suggests to me that this was not just an isolated case of something slipping through – but a failure to appreciate the need for robust scientific standards in publishing educational research.

Read the letter submitted to the Journal of Chemical Education

A flawed paper does not imply worthless research

I am certainly not suggesting that there is no merit in Li, Ouyang, Xu and Zhang's work. Nor am I arguing that their work was not worth publishing in the journal. My argument is that Li and colleagues' paper draws an invalid conclusion, and makes misleading statements inconsistent with the research data presented, and that it should not have been published in this form. These problems are pretty obvious, and should (I felt) have been spotted in peer review. The authors should have been asked to address these issues, and to follow normal scientific standards and norms such that their conclusions follow from, rather than contradict, their results.

That is my take. Please read my reasoning below (and the original study if you have access to J.Chem.Ed.) and make up your own mind.

Li, Ouyang, Xu and Zhang report an innovation in a university course. They consider this to have been a successful innovation, and it may well have great merits. The core problem is that Li and colleagues claim that their innovation is successful in improving teaching quality and student learning: when their own data analysis does not support this.

The evidence for a successful innovation

There is much material in the paper on the nature of the innovation, and there is evidence about student responses to it. Here, I am only concerned with the failure of the paper to offer a logical chain of argument to support their knowledge claim that the teaching innovation improved student achievement.

There are (to my reading – please judge for yourself if you can access the paper) some slight ambiguities in some parts of the description of the collection and analysis of achievement data (see note 5 below), but the key indicator relied on by Li, Ouyang, Xu and Zhang is the average score achieved by students in four teaching groups, three of which experienced the teaching innovation (these are denoted collectively as 'the experimental group') and one group which did not (denoted as 'the control group', although there is no control of variables in the study 1). Each class comprised 40 students.

The study is not published open access, so I cannot reproduce the copyright figures from the paper here, but below I have drawn a graph of these key data:


Key results from Li et al, 2022: this data was the basis for claiming an effective teaching innovation.


It is on the basis of this set of results that Li and colleagues claim that "the average score showed a constant upward trend, and a steady increase was found". Surely, anyone interrogating these data might pause to wonder whether that is the most authentic description of the pattern of scores year on year.

Does anyone teaching in a university really think that assessment methods are good enough to produce average class scores that are meaningful to 3 or 4 significant figures? To a more reasonable level of precision, the nearest percentage point (which is presumably what these numbers are – that is not made explicit), the results were:


Cohort    Average class score
2017      80
2018      80
2019      80
2020      80

Average class scores (2 s.f.) year on year

When presented to a realistic level of precision, the obvious pattern is…no substantive change year on year!
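The rounding can be checked in a few lines of Python, using the averages that can be reconstructed from the figures reported in the paper (79.8 in 2017, then increases of 0.11, 0.32 and 0.54 points; see the quotation in note 5 below):

```python
# Average class scores as reported by Li et al. (2022):
# 79.8 in 2017, then increases of 0.11, 0.32 and 0.54 points.
base = 79.8
increments = {2017: 0.00, 2018: 0.11, 2019: 0.32, 2020: 0.54}

averages = {year: base + inc for year, inc in increments.items()}

# To the nearest whole percentage point (2 significant figures here),
# every cohort average is identical.
rounded = {year: round(avg) for year, avg in averages.items()}
print(rounded)  # every value is 80
```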

A truncated graph

In their paper, Li and colleagues do present a graph to compare the average results in 2017 with (not 2018, but) 2019 and 2020, somewhat similar to the one I have reproduced here, which should have made it very clear how little the scores varied between cohorts. However, Li and colleagues did not include on their axis the full range of possible scores, but rather only a small portion of that range – from 79.4 to 80.4.

This is a perfectly valid procedure often used in science, and it is quite explicitly done (the x-axis is clearly marked), but it does give a visual impression of a large spread of scores, which could be quite misleading. In effect, their Figure 4b includes just a sliver of my graph above, as shown below. If one takes the portion of the image below that is not greyed out, and stretches it to cover the full extent of the x axis of a graph, that is what is presented in the published account.


In the paper in J.Chem.Ed., Li and colleagues (2022) truncate the scale on their average score axis to expand 1% of the full range (approximated above in the area not shaded over) into a whole graph as their Figure 4b. This gives a visual impression of widely varying scores (to anyone who does not read the axis labels).
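The arithmetic behind that '1%' is trivial, but worth making explicit: the truncated axis shows one hundredth of the possible score range, so any variation is visually magnified roughly a hundredfold. A minimal check:

```python
# Axis range shown in Li et al.'s Figure 4b versus the full score scale
shown_range = 80.4 - 79.4   # range displayed on the truncated axis
full_range = 100.0          # full range of possible percentage scores

fraction_shown = round(shown_range / full_range, 4)
magnification = round(full_range / shown_range)

print(fraction_shown)  # 0.01 – only 1% of the scale is displayed
print(magnification)   # 100 – differences appear ~100 times larger
```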


What might have caused those small variations?

If anyone does think that differences of a few tenths of a percent in average class scores are notable, and that this demonstrates increasing student achievement, then we might ask what caused those differences.

Li and colleagues seem to be convinced that the change in teaching approach caused the (very modest) increase in scores year on year. That would be possible. (Indeed, Li et al seem to be arguing that the very, very modest shift from 2017 to subsequent years was due to the change of teaching approach; but the not-quite-so-modest shifts from 2018 to 2019 to 2020 are due to developing teacher competence!) However, drawing that conclusion requires making a ceteris paribus assumption: that all other things are equal. That is, that any other relevant variables have been controlled.

Read about confounding variables

Another possibility however is simply that each year the teaching team are more familiar with the science, and have had more experience teaching it to groups at this level. That is quite reasonable and could explain why there might be a modest increase in student outcomes on a course year on year.

Non-equivalent groups of students?

However, a big assumption here is that each of the year groups can be considered to be intrinsically the same at the start of the course (and to have equivalent relevant experiences outside the focal course during the programme). Often in quasi-experimental studies (where randomisation to conditions is not possible 1) a pre-test is used to check for equivalence prior to the innovation: after all, if students are starting from different levels of background knowledge and understanding then they are likely to score differently at the end of a course – and no further explanation of any measured differences in course achievement need be sought.

Read about testing for initial equivalence

In experiments, you randomly assign the units of analysis (e.g., students) to the conditions, which gives some basis for at least comparing any differences in outcomes with the variations likely by chance. But this was not a true experiment as there was no randomisation – the comparisons are between successive year groups.

In Li and colleagues' study, the 40 students taking the class in 2017 are implicitly assumed equivalent to the 40 students taking the class in each of the years 2018-2020: but no evidence is presented to support this assumption. 3

Yet anyone who has taught the same course over a period of time knows that even when a course is unchanged and the entrance requirements stable, there are naturally variations from one year to the next. That is one of the challenges of educational research (Taber, 2019): you never can "take two identical students…two identical classes…two identical teachers…two identical institutions".

Novelty or expectation effects?

We would also have to ignore any difference introduced by the general effect of there being an innovation beyond the nature of the specific innovation (Taber, 2019). That is, students might be more attentive and motivated simply because this course does things differently to their other current courses and past courses. (Perhaps not, but it cannot be ruled out.)

The researchers are likely enthusiastic for, and had high expectations for, the innovation (so high that it seems to have biased their interpretation of the data and blinded them to the obvious problems with their argument) and much research shows that high expectation, in its own right, often influences outcomes.

Read about expectancy effects in studies

Equivalent examination questions and marking?

We also have to assume the assessment was entirely equivalent across the four years. 4 The scores were based on aggregating a number of components:

"The course score was calculated on a percentage basis: attendance (5%), preclass preview (10%), in-class group presentation (10%), postclass mind map (5%), unit tests (10%), midterm examination (20%), and final examination (40%)."

Li, et al, 2022, p.1858
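The stated weights can be encoded directly (the marks used below are purely hypothetical, for illustration). Doing so also makes plain a point developed further on: the three components tied to the SCTBL activities (preview, presentation, mind map) together account for a quarter of the total score:

```python
# Component weights as stated by Li et al. (2022, p.1858)
weights = {
    "attendance": 0.05,
    "preclass preview": 0.10,
    "in-class group presentation": 0.10,
    "postclass mind map": 0.05,
    "unit tests": 0.10,
    "midterm examination": 0.20,
    "final examination": 0.40,
}

def course_score(marks):
    """Weighted course score from per-component marks out of 100."""
    return sum(weights[k] * marks[k] for k in weights)

# The weights cover the whole score...
assert abs(sum(weights.values()) - 1.0) < 1e-9

# ...and the components specific to the SCTBL innovation make up 25% of it
innovation_share = (weights["preclass preview"]
                    + weights["in-class group presentation"]
                    + weights["postclass mind map"])
print(round(innovation_share, 2))  # 0.25

# Hypothetical marks, for illustration only
print(round(course_score({k: 80 for k in weights}), 2))  # 80.0
```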

This raises questions about the marking and the examinations:

  • Are the same test and examination questions used each year (that is not usually the case as students can acquire copies of past papers)?
  • If not, how were these instruments standardised to ensure they were not more difficult in some years than others?
  • How reliable is the marking? (Reliable meaning the same scores/mark would be assigned to the same work on a different occasion.)

These various issues do not appear to have been considered.

Change of assessment methodology?

The description above of how the students' course scores were calculated raises another problem. The 2017 cohort were taught by "direct instruction". This is not explained, as the authors presumably think we all know exactly what that is: I imagine lectures. By comparison, in the innovation (2018-2020 cohorts):

"The preclass stage of the SCTBL strategy is the distribution of the group preview task; each student in the group is responsible for a task point. The completion of the preview task stimulates students' learning motivation. The in-class stage is a team presentation (typically PowerPoint (PPT)), which promotes students' understanding of knowledge points. The postclass stage is the assignment of team homework and consolidation of knowledge points using a mind map. Mind maps allow an orderly sorting and summarization of the knowledge gathered in the class; they are conducive to connecting knowledge systems and play an important role in consolidating class knowledge."

Li, et al, 2022, p.1856, emphasis added.

Now the assessment of the preview tasks, the in-class group presentations, and the mind maps all contributed to the overall student scores (10%, 10%, 5% respectively). But these are parts of the innovative teaching strategy – they are (presumably) not part of 'direct instruction'. So, the description of how the student class scores were derived only applies to 2018-2020, and the methodology used in 2017 must have been different. (This is not discussed in the paper.) 5

A quarter of the score for the 'experimental' groups came from assessment components that could not have been part of the assessment regime applied to the 2017 cohort. At the very least, the tests and examinations must have been more heavily weighted in the 'control' group students' overall scores. This makes it very unlikely the scores can be meaningfully compared directly from 2017 to subsequent years: if the authors think otherwise, they should have presented persuasive evidence of equivalence.


Li and colleagues want to convince us that variations in average course scores can be assumed to be due to a change in teaching approach – even though there are other confounding variables.

So, groups that we cannot assume are equivalent were assessed in ways that we cannot assume to be equivalent, and obtained nearly identical average levels of achievement. Despite that, Li and colleagues want to persuade us that the very modest differences in average scores between the 'control' and 'experimental' groups (differences which are actually larger between successive 'experimental' cohorts than between the 'control' group and the first 'experimental' cohort) are large enough to be significant and demonstrate that their teaching innovation improves student achievement.

Statistical inference

So, even if we thought shifts of less than a 1% average in class achievement were telling, there are no good reasons to assume they are down to the innovation rather than some other factor. But Li and colleagues use statistical tests to tell them whether differences between the 'control' and 'experimental' conditions are significant. They find – just what anyone looking at the graph above would expect – "there is no significant difference in average score" (p.1860).

The scientific convention in using such tests is that the choice of test, and confidence level (e.g., a probability of p<0.05 to be taken as significant), is determined in advance, and the researchers accept the outcomes of the analysis. There is a kind of contract involved – a decision to use a statistical test (chosen in advance as being a valid way of deciding the outcome of an experiment) is seen as a commitment to accept its outcomes. 2 This is a form of honesty in scientific work. Just as it is not acceptable to fabricate data, nor is it acceptable to ignore experimental outcomes when drawing conclusions from research.

Special pleading is allowed in mitigation (e.g., "although our results were non-significant, we think this was due to the small sample sizes, and suggest that further research should be undertaken with larger groups {and we are happy to do this if someone gives us a grant}"), but the scientist is not allowed to simply set aside the results of the analysis.
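Applying that pre-agreed decision rule to the t statistics Li and colleagues themselves report (quoted in note 5 below) is entirely mechanical: every value falls well below the critical value they cite, so each comparison is non-significant:

```python
# t statistics reported by Li et al. (2022) for each experimental cohort
# compared with the 2017 'control' cohort, and the critical value they cite
t_values = {"2017 vs 2018": 0.0663, "2017 vs 2019": 0.1930, "2017 vs 2020": 0.3279}
t_critical = 2.024  # t_alpha for p < 0.05, as given in the paper

for comparison, t in t_values.items():
    verdict = "significant" if abs(t) > t_critical else "not significant"
    print(f"{comparison}: t = {t} -> {verdict}")  # all are "not significant"
```

Under the convention described above, none of these results licenses the conclusion that the innovation raised achievement.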


Li and colleagues found no significant difference between the two conditions, yet that did not stop them claiming, and the Journal of Chemical Education publishing, a conclusion that the new teaching approach improved student achievement!

Yet setting aside the results of their analysis is what Li and colleagues do. They carry out an analysis, then simply ignore the findings, and conclude the opposite:

"To conclude, our results suggest that the SCTBL method is an effective way to improve teaching quality and student achievement."

Li, et al, 2022, p.1861

It was this complete disregard of scientific values, rather than the more common failure to appreciate that they were not comparing like with like, that I found really shocking – and led to me writing a formal letter to the journal. Not so much surprise that researchers might do this (I know how intoxicating research can be, and how easy it is to become convinced of one's ideas), but that the peer reviewers for the Journal of Chemical Education did not make the firmest recommendation to the editor that this manuscript could NOT be published until it was corrected so that the conclusion was consistent with the findings.

This seems a very stark failure of peer review, and allows a paper to appear in the literature that presents a conclusion totally unsupported by the evidence available and the analysis undertaken. This also means that Li, Ouyang, Xu and Zhang now have a publication on their academic records that any careful reader can see is critically flawed – something that could have been avoided had peer reviewers:

  • used their common sense to appreciate that variations in class average scores from year to year between 79.8 and 80.3 could not possibly be seen as sufficient to indicate a difference in the effectiveness of teaching approaches;
  • recommended that the authors follow the usual scientific norms and adopt the reasonable scholarly value position that the conclusion of your research should follow from, and not contradict, the results of your data analysis.



Notes

1 Strictly the 2017 cohort has the role of a comparison group, but NOT a control group as there was no randomisation or control of variables, so this was not a true experiment (but a 'quasi-experiment'). However, for clarity, I am here using the original authors' term 'control group'.

Read about experimental research design


2 Some journals are now asking researchers to submit their research designs and protocols to peer review BEFORE starting the research. This prevents wasted effort on work that is flawed in design. Journals will publish a report of the research carried out according to an accepted design – as long as the researchers have kept to their research plans (or only made changes deemed necessary and acceptable by the journal). This prevents researchers seeking to change features of the research because it is not giving the expected findings and means that negative results as well as positive results do get published.


3 'Implicitly' assumed as nowhere do the authors state that they think the classes all start as equivalent – but if they do not assume this then their argument has no logic.

Without this assumption, their argument is like claiming that growing conditions for tree development are better at the front of a house than at the back because on average the trees at the front are taller – even though fast-growing mature trees were planted at the front and slow-growing saplings at the back.


4 From my days working with new teachers, a common rookie mistake was assuming that one could tell a teaching innovation was successful because students achieved an average score of 63% on the (say, acids) module taught by the new method when the same class only averaged 46% on the previous (say, electromagnetism) module. Graduate scientists would look at me with genuine surprise when I asked how they knew the two tests were of comparable difficulty!

Read about why natural scientists tend to make poor social scientists


5 In my (rejected) letter to the Journal of Chemical Education I acknowledged some ambiguity in the paper's discussion of the results. Li and colleagues write:

"The average scores of undergraduates majoring in pharmaceutical engineering in the control group and the experimental group were calculated, and the results are shown in Figure 4b. Statistical significance testing was conducted on the exam scores year to year. The average score for the pharmaceutical engineering class was 79.8 points in 2017 (control group). When SCTBL was implemented for the first time in 2018, there was a slight improvement in the average score (i.e., an increase of 0.11 points, not shown in Figure 4b). However, by 2019 and 2020, the average score increased by 0.32 points and 0.54 points, respectively, with an obvious improvement trend. We used a t test to test whether the SCTBL method can create any significant difference in grades among control groups and the experimental group. The calculation results are shown as follows: t1 = 0.0663, t2 = 0.1930, t3 =0.3279 (t1 <t2 <t3 <t𝛼, t𝛼 =2.024, p>0.05), indicating that there is no significant difference in average score. After three years of continuous implementation of SCTBL, the average score showed a constant upward trend, and a steady increase was found. The SCTBL method brought about improvement in the class average, which provides evidence for its effectiveness in medicinal chemistry."

Li et al., 2022, pp. 1858-1860, emphasis added

This appears to refer to three distinct measures:

  • average scores (produced by weighted summations of various assessment components as discussed above)
  • exam scores (perhaps just the "midterm examination…and final examination", or perhaps just the final examination?)
  • grades

Formal grades are not discussed in the paper (the word is only used in this one place), although the authors do refer to categorising students into descriptive classes ('levels') according to scores on 'assessments', and may see these as grades:

"Assessments have been divided into five levels: disqualified (below 60), qualified (60-69), medium (70-79), good (80-89), and excellent (90 and above)."

Li et al., 2022, p. 1856, emphasis added

In the longer extract above, the reference to testing difference in "grades" is followed by reporting the outcome of the test for "average score":

"We used a t test to test …grades …The calculation results … there is no significant difference in average score"

As Student's t-test was used, it seems unlikely that the assignment of students to grades could have been tested. That would surely have needed something like the Chi-squared statistic to test categorical data – looking for an association between (i) the distributions of the number of students in the different cells 'disqualified', 'qualified', 'medium', 'good' and 'excellent'; and (ii) treatment group.
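For illustration, such a categorical comparison might look like the following sketch. The counts are invented purely to show the shape of the test; they are not data from Li et al.

```python
# Hypothetical sketch: testing for an association between cohort and
# grade band with a chi-squared test. The counts below are INVENTED for
# illustration -- they are NOT data from Li et al.
from scipy.stats import chi2_contingency

# Rows: cohorts; columns: the paper's five bands
# ('disqualified', 'qualified', 'medium', 'good', 'excellent')
counts = [
    [2, 8, 15, 12, 3],   # invented 2017 (control) distribution
    [1, 7, 14, 14, 4],   # invented 2018 (experimental) distribution
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi-squared = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```

Note that such a test compares whole distributions across bands, not a single pair of means – quite a different analysis from Student's t-test.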

Presumably, then, the statistical testing was applied to the average course scores shown in the graph above. This also makes sense because the classification into descriptive bands loses some of the detail in the data, and there is no obvious reason why the researchers would deliberately choose to test 'reduced' data rather than the full data set with the greatest resolution.


Didactic control conditions

Another ethically questionable science education experiment?


Keith S. Taber


This seems to be a rhetorical experiment, where an educational treatment that is already known to be effective is 'tested' to demonstrate that it is more effective than suboptimal teaching – by asking a teacher to constrain her teaching for the students assigned to an unethical comparison condition

one group of students were deliberately disadvantaged by asking an experienced and skilled teacher to teach in a way all concerned knew was sub-optimal so as to provide a low base line that would be outperformed by the intervention, simply to replicate a much demonstrated finding

In a scientific experiment, an intervention is made into the natural state of affairs to see if it produces a hypothesised change. A key idea in experimental research is control of variables: in the ideal experiment only one thing is changed. In the control condition all relevant variables are fixed so that there is a fair test between the experimental treatment and the control.

Although there are many published experimental studies in education, such research can rarely claim to have fully controlled all potentially relevant variables: there are (nearly always, always?) confounding factors that simply cannot be controlled.

Read about confounding variables

Experimental research in education, then, (nearly always, always?) requires some compromising of the pure experimental method.

Where those compromises are substantial, we might ask whether an experiment was the wrong choice of methodology: even if a good experiment is often the best way to test an idea, a bad experiment may be less informative than, for example, a good case study.

That is primarily a methodological matter, but testing educational innovations and using control conditions in educational studies also raises ethical issues. After all, an experiment means experimenting with real learners' educational experiences. This can certainly sometimes be justified – but there is (or should be) an ethical imperative:

  • researchers should never ask learners to participate in a study condition they have good reason to expect will damage their opportunities to learn.

If researchers want to test a genuinely innovative teaching approach or learning resource, then they have to be confident it has a reasonable chance of being effective before asking learners to participate in a study where they will be subjected to an untested teaching input.

It is equally the case that students assigned to a control condition should never be deliberately subjected to inferior teaching simply in order to help make a strong contrast with an experimental approach being tested. Yet, reading some studies leads to a strong impression that some researchers do seek to constrain teaching to a control group to help bias studies towards the innovation being tested (Taber, 2019). That is, such studies are not genuinely objective, open-minded investigations to test a hypothesis, but 'rhetorical' studies set up to confirm and demonstrate the researchers' prior assumptions. We might say these studies do not reflect true scientific values.


A general scheme for a 'rhetorical experiment'

Read about rhetorical experiments


I have raised this issue in the research literature (Taber, 2019), so when I read experimental studies in education I am minded to check that any control condition has been set up with a concern to ensure that the interests of all study participants (in both experimental and control conditions) have been properly considered.

Jigsaw cooperative learning in elementary science: physical and chemical changes

I was reading a study called "A jigsaw cooperative learning application in elementary science and technology lessons: physical and chemical changes" (Tarhan, Ayyıldız, Ogunc & Sesen, 2013) published in a respectable research journal (Research in Science & Technological Education).

Tarhan and colleagues adopted a common type of research design, and the journal referees and editor presumably were happy with the design of their study. However, I think the science education community should collectively be more critical about the setting up of control conditions which require students to be deliberately taught in ways that are considered to be less effective (Taber, 2019).


Jigsaw learning involves students working in co-operative groups and undertaking peer-teaching

Jigsaw learning is a pedagogic technique which can be seen as a constructivist, student-centred, dialogic, form of 'active learning'. It is based on collaborative groupwork and includes an element of peer-tutoring. In this paper the technique is described as "jigsaw cooperative learning", and the article authors explain that "cooperative learning is an active learning approach in which students work together in small groups to complete an assigned task" (p.185).

Read about jigsaw learning

Random assignment

The study used an experimental design to compare learning outcomes between two classes taught the same topic in two different ways. Many studies that compare two classes are problematic because whole extant classes are assigned to conditions, which means that the unit of analysis should be the class (experimental condition, n=1; control condition, n=1). Despite this, such studies commonly analyse results as if each learner were an independent unit of analysis (e.g., experimental condition, n=c.30; control condition, n=c.30). That is necessary to obtain statistical results, but unfortunately means that inferences drawn from those statistics are invalid (Taber, 2019). Such studies offer examples of where there seems little point doing an experiment badly, as the very design makes it intrinsically impossible to obtain a valid statistically significant outcome.
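The unit-of-analysis point can be made concrete. In this hedged sketch (all scores are invented), analysing students as independent units yields a p-value, but analysing at the arguably correct unit, the class, leaves one data point per condition and no valid test at all:

```python
# Hypothetical illustration of the unit-of-analysis problem when whole
# intact classes are assigned to conditions. All scores are INVENTED.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
class_a = rng.normal(65, 10, 30)  # one intact class, 'experimental' condition
class_b = rng.normal(60, 10, 31)  # one intact class, 'control' condition

# Treating each student as an independent unit: n=30 vs n=31.
t, p = ttest_ind(class_a, class_b)
print(f"student-level: t = {t:.2f}, p = {p:.3f}")

# Treating the class as the unit: one mean per condition (n=1 vs n=1),
# so there is no within-condition variance and the test returns nan --
# the design cannot support a significance test at this level.
t1, p1 = ttest_ind([class_a.mean()], [class_b.mean()])
print(f"class-level: t = {t1}, p = {p1}")
```

The student-level test 'works' numerically, but its independence assumption is violated when students share a class and a teacher; the class-level analysis, which respects the design, simply cannot be run with one class per condition.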


Experimental designs may be categorised as true experiments, quasi-experiments and natural experiments (Taber, 2019).

Tarhan and colleagues, however, randomly assign the learners to the two conditions so can genuinely claim that in their study they have a true experiment: for their study, experimental condition, n=30; control condition, n=31.

Initial equivalence between groups

Assigning students in this way also helped ensure the two groups started from a similar base. Often such experimental studies use a pre-test to compare the groups before teaching. However, researchers frequently just look for a statistical difference between the groups that does not reach statistical significance (Taber, 2019). That is, if a statistical test shows p≥0.05 (in effect, the initial difference between the groups is not very unlikely to have occurred by chance), this is taken as evidence of equivalence. That is like saying we will consider two teachers to be of 'equivalent' height as long as there is no more than 30 cm difference between them!

In effect

'not very different'

is being seen as a synonym for

'near enough the same'


Some analogies for how equivalence is determined in some studies: read about testing for initial equivalence

However, the pretest in Tarhan and colleagues' study found that the difference between the two groups' performances was at a level likely to occur by chance (not merely something more than 5%, but) 87% of the time. This is a much more convincing basis for seeing the two groups as initially similar.
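A formal equivalence test asks a different question from a significance test: not 'is there evidence of a difference?' but 'is the difference demonstrably within a tolerable margin?'. Here is a minimal sketch of the two one-sided tests (TOST) procedure; the pre-test scores and the 3-point margin are invented, not taken from the paper.

```python
# Hypothetical sketch of a two one-sided tests (TOST) equivalence check.
# The pre-test scores and the 3-point equivalence margin are INVENTED --
# they are not data from Tarhan et al.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
pre_exp = rng.normal(50, 8, 30)   # invented experimental-group pre-test scores
pre_ctrl = rng.normal(50, 8, 31)  # invented control-group pre-test scores

margin = 3.0  # assumed equivalence margin, in score points

# Reject both H0a (true difference <= -margin) and H0b (true difference
# >= +margin) to conclude the groups are equivalent within the margin.
_, p_lower = ttest_ind(pre_exp + margin, pre_ctrl, alternative='greater')
_, p_upper = ttest_ind(pre_exp - margin, pre_ctrl, alternative='less')

p_tost = max(p_lower, p_upper)
print(f"TOST p = {p_tost:.3f}; equivalent within +/-{margin}? {p_tost < 0.05}")
```

With TOST, the burden of proof is on showing similarity, which is the logically appropriate direction when claiming that groups start from the same baseline.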

So, there are two ways in which the Tarhan et al. study seemed better thought-through than many small scale experiments in teaching I have read.

Comparing two conditions

The research was carried out with "sixth grade students in a public elementary school in Izmir, Turkey" (p.184). The focus was learning about physical and chemical changes.

The experimental condition

At the outset of the study, the authors suggest it is already known that

  • "Jigsaw enhances cooperative learning" (p.185)
  • "Jigsaw promotes positive attitudes and interests, develops communication skills between students, and increases learning achievement in chemistry" (p.186)
  • "the jigsaw technique has the potential to improve students' attitude towards science"
  • development of "students' understanding of chemical equilibrium in a first year general chemistry course [was more successful] in the jigsaw class…than …in the individual learning class"

It seems the approach being tested was already demonstrated to be effective in a range of contexts. Based on the existing research, then, we could already expect well-implemented jigsaw learning to be effective in facilitating student learning.

Similarly, the authors tell the readers that the broader category of cooperative learning has been well established as successful,

"The benefits of cooperative learning have been well documented as being

higher academic achievement,

higher level of reasoning and critical thinking skills,

deeper understanding of learned material,

better attention and less disruptive behavior in class,

more motivation to learn and achieve,

positive attitudes to subject matter,

higher self-esteem and

higher social skills."

Tarhan et al., 2013, p.185

What is there not to like here? So, what was this highly effective teaching approach compared with?

What is being compared?

Tarhan and colleagues tell readers that:

"The experimental group was taught via jigsaw cooperative learning activities developed by the researchers and the control group was taught using the traditional science and technology curriculum."

Tarhan et al., 2013, p.189
A different curriculum?

This seems an unhelpful statement as it does not seem to compare like with like:


condition    | curriculum                                    | pedagogy
experimental | ?                                             | jigsaw cooperative learning activities developed by the researchers
control      | traditional science and technology curriculum | ?
A genuine experiment would look to control variables, so would not simultaneously vary both curriculum and pedagogy

The study uses a common test to compare learning in the two conditions, so the study only makes sense as an experimental test of jigsaw learning if the same curriculum is being followed in both conditions. Otherwise, there is no prima facie reason to think that the post-test is equally fair in testing what has been taught in the two conditions. 1

The control condition

The paper includes an account of the control condition which seems to make it clear that both groups were taught "the same content", which is helpful as to have done otherwise would have seriously undermined the study.

The control group was instructed via a teacher-centered didactic lecture format. Throughout the lesson, the same science and technology teacher presented the same content as for the experimental group to achieve the same learning objectives, which were taught via detailed instruction in the experimental group. This instruction included lectures, discussions and problem solving. During this process, the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes. The instruction was accomplished in the same amount of time as for the experimental group.

Tarhan et al., 2013, p.194

So, it seems:


condition    | curriculum (controlled variable)              | pedagogy (independent variable)
experimental | [by inference: "traditional science and technology curriculum"] | jigsaw cooperative learning activities developed by the researchers
control      | traditional science and technology curriculum [the same content as for the experimental group to achieve the same learning objectives] | teacher-centred didactic lecture format: instructor explained the subject and asked questions
An experiment relies on control of variables and would not simultaneously vary both curriculum and pedagogy

The statement is helpful, but might be considered ambiguous as "this instruction which included lectures, discussions and problem solving" seems to relate to what had been "taught via detailed instruction in the experimental group".

But this seems incongruent with the wider textual context. The experimental group were taught by a jigsaw learning technique – not lectures, discussions and problem solving. Yet, for that matter, the experimental group were not taught via 'detailed instruction' if this means the teacher presenting the curriculum content. So, this phrasing seems unhelpfully confusing (to me, at least – presumably, the journal referees and editor thought this was clear enough.)

So, this probably means the "lectures, discussions and problem solving" were part of the control condition where "the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes".

'Lectures' certainly fit with that description.

However, genuine 'discussion' work is a dialogic teaching method and would not seem to fit within a "teacher-centered didactic lecture format". But perhaps 'discussion' simply refers to how the "teacher used the blackboard and asked some questions" that members of the class were invited to answer?

Read about dialogic teaching

Writing-up research is a bit like teaching in that, in presenting to a particular audience, one works with a mental model of what that audience already knows and understands, and how they use specific terms; and this model is never likely to be perfectly accurate:

  • when teaching, the learners tend to let you know this, whereas,
  • when writing, this kind of immediate feedback is lacking.

Similarly, problem-solving would not seem to fit within a "teacher-centered didactic lecture format". 'Problem-solving' engages high-level cognitive and metacognitive skills because a 'problem' is a task that students are not able to respond to simply by recalling what they have been told and applying learnt algorithms. Problem-solving requires planning and applying strategies to test out ideas and synthesise knowledge. Yet teachers and textbooks commonly refer to questions that merely test recall and comprehension, or direct application of learnt techniques, as 'problems', when they are better understood as 'exercises' since they do not pose authentic problems.

The imprecise use of terms that may be understood differently across diverse contexts is characteristic of educational discourse, so Tarhan and colleagues may have simply used the labels that are normally applied in the context where they are working. It should also be noted that as the researchers are based in Turkey they are presumably finding the best English translations they can for the terms used locally.

Read about the challenges of translation in research writing

So, it seems we have:


Experimental condition | in one of the conditions? | Control condition
Jigsaw learning (set out in some detail in the paper) – an example of cooperative learning – an active learning approach in which students work together in small groups | detailed instruction? discussions (= teacher questioning?) problem solving? (= practice exercises?) | teacher-centred didactic lecture format … the teacher used the blackboard and asked some questions … a regular textbook … the instructor explained the subject, the students listened and took notes
The independent variable – teaching methodology

The teacher variable

One of the major problems with some educational experiments comparing different teaching approaches is the confound of the teacher. If

  • class A is taught through approach 'a' by teacher 1, and
  • class B is taught through approach 'b' by teacher 2

then, even if there is a good case that class A and class B start off as 'equivalent' in terms of readiness to learn about the focal topic, any differences in study outcomes could be as much down to the different teachers (and we all know that different teachers are not equivalent!) as to the different teaching methodologies.

At first sight this is easily solved by having the same teacher teach both classes (as in the study discussed here). That certainly seems to help. But, a little thought suggests it is not a foolproof approach (Taber, 2019).

Teachers inevitably have better rapport with some classes than others (even when those classes are shown to be technically 'equivalent') simply because that is the nature of how diverse personalities interact. 3 Even the most professional teachers find they prefer to teach some classes than others, enjoy the teaching more, and seem to get better results (even when the classes are supposed to be equivalent).

In an experiment, there is no reason why the teacher would work better with the class assigned the experimental condition; it might just as well be the control condition. However, this is still a confound, and there is no obvious solution to it, except having multiple classes and teachers in each condition, such that the statistics can offer a guide to whether outcomes are sufficiently unlikely for us to reasonably discount these types of effect.

Different teachers also have different styles, approaches and skill sets – so the same teacher will not be equally suited to every teaching approach and pedagogy. Again, this does not necessarily advantage the experimental condition, but, again, it is something that can only be addressed by having a diverse range of teachers in each condition (Taber, 2019).

So, although we might expect having the same teacher teach both classes is the preferred approach, the same teacher is not exactly the same teacher in different classes or teaching in different ways.

And what do participants expect will happen?

Moreover, expectancy effects can be very influential in education. Expecting something to work, or not work, has been shown to have real effects on outcomes. It may not be true, as some motivational gurus like to pretend, that we can all of us achieve anything if only we believe: but we are more likely to be successful when we believe we can succeed. When confident, we tend to be more motivated, less easily deterred, and (given the human capacity for perceiving with confirmation bias) more likely to judge we are making good progress. So, any research design which communicates to teachers and students (directly, or through the teacher's or researcher's enthusiasm) an expectation of success in some innovation is more likely to lead to success. This is a potential confound that is not even readily addressed by having large numbers of classes and teachers (Taber, 2019)!

Read about expectancy effects

The authors report that

Before implementation of the study, all students and their families were informed about the aims of the study and the privacy of their personal information. Permission for their children attend the study was obtained from all families.

Tarhan et al., 2013, p.194

This is as it should be. School children are not data-fodder for researchers, and they should always be asked for, and give, voluntary informed consent when recruited to join a research project. However, researchers need to be open and honest about their work, whilst also being careful about how they present their research aims. We can imagine a possible form of invitation,

We would like to invite you to be part of a study where some of you will be subject to traditional learning through a teacher-centred didactic lecture format where the teacher will give you notes and ask you questions, and some of you will learn by a different approach that has been shown to enhance learning, promote positive attitudes and interests, develop communication skills, increase achievement, support higher levels of reasoning and critical thinking skills, lead to deeper understanding of learned material…

An honest, but unhelpful, briefing for students and parents

If this was how the researchers understood the background to their study, then this would be a fair and honest briefing. Yet, this would clearly set up strong expectations in the student groups!

A suitable teacher

Tarhan and colleagues report that

"A teacher experienced in active learning was trained in how to implement the instruction based on jigsaw cooperative learning. The teacher and researchers discussed the instructional plans before implementing the activities."

Tarhan et al., 2013, p.189

So, the teacher who taught both classes, using jigsaw cooperative learning in one class and a teacher-centred didactic lecture approach in the other, was "experienced in active learning". So, it seems that

  • the researchers were already convinced that active learning approaches were far superior to teaching via a lecture approach
  • the teacher had experience in teaching though more engaging, effective student-centred active learning approaches

despite this, a control condition was set up that required the teacher, in effect, to de-skill and teach in a way the researchers were well aware research suggested was inferior, for the sake of carrying out an experiment to demonstrate in a specific context what had already been well demonstrated elsewhere.

In other words, it seems that one group of students were deliberately disadvantaged by asking an experienced and skilled teacher to teach in a way all concerned knew was sub-optimal, so as to provide a low base line that would be outperformed by the intervention, simply to replicate a much demonstrated finding. When seen in that way, this is surely unethical research.

The researchers may not have been consciously conceptualising their design in those terms, but it is hard to see this as a fair test of the jigsaw learning approach – it can show it is better than suboptimal teaching, but does not offer a comparison with an example of the kind of teaching that is recommended in the national context where the research took place.

Unethical, but not unusual

I am not seeking to pick out Tarhan and colleagues in particular for designing an unethical study, because they are not unique in adopting this approach (Taber, 2019): indeed, they are following a common formula (an experimental 'paradigm' in the sense the term is used in psychology).

Tarhan and colleagues have produced a study that is interesting and informative, which seems well planned, and strongly motivated when considered as part of a tradition of such studies. Clearly, the referees and journal editor were not minded to question the procedure. The problem is that, as a science education community, we have allowed this tradition to continue, such that a form of study that was originally genuinely open-ended (in that it examined under-researched teaching approaches of untested efficacy) has not been modified as published study after published study has slowly turned those untested teaching approaches into well-researched and repeatedly demonstrated approaches.

So much so, that such studies are now in danger of simply being rhetorical research – where (as in this case) the authors tell readers at the outset that it is already known that what they are going to test is widely shown to be effective good practice. Rhetorical research is set up to produce an expected result, and so is not authentic research. A real experiment tests a genuine hypothesis rather than demonstrates a commonplace. A question researchers might ask themselves could be

'how surprised would I be if this leads to a negative outcome'?

If the answer is

'that would be very surprising'

then they should consider modifying their research so it is likely to be more than minimally informative.

Finding out that jigsaw learning achieved learning objectives better/as well as/not so well as, say, P-O-E (predict-observe-explain) activities might be worth knowing: that it is better than deliberately constrained teaching does not tell us very much that is not obvious.

I do think this type of research design is highly questionable and takes unfair advantage of students. It fails to meet my suggested guideline that

  • researchers should never ask learners to participate in a study condition they have good reason to expect will damage their opportunities to learn

The problem of generalisation

Of course, one fair response is that despite all the claims of the superiority of constructivist, active, cooperative (etc.) learning approaches, the diversity of educational contexts means we cannot simply generalise from an experiment in one context and assume the results apply elsewhere.

Read about generalising from research

That is, the research literature shows us that jigsaw learning is an effective teaching approach, but we cannot be certain it will be effective in the particular context of teaching about chemical and physical changes to sixth grade students in a public elementary school in Izmir, Turkey.

Strictly that is true! But we should ask:

do we not know this because

  1. research shows a great variation in whether jigsaw learning is effective or not as it differs according to contexts and conditions
  2. although jigsaw learning has consistently been shown to be effective in many different contexts, no one has yet tested it in the specific case of teaching about chemical and physical changes to sixth grade students in a public elementary school in Izmir, Turkey

It seems clear from the paper that the researchers are presenting the second case (in which case the study would actually have been of more interest and importance if it had been found that in this context jigsaw learning was not effective).

Given there are very good reasons to expect a positive outcome, there seems no need to 'stack the odds' by using deliberately detrimental control conditions.

Even had situation 1 applied, it seems of limited value to know that jigsaw learning is more effective (in teaching about chemical and physical changes to sixth grade students in a public elementary school in Izmir, Turkey) than an approach we already recognise is suboptimal.

An ethical alternative

This does not mean that there is no value in research that explores well-established teaching approaches in new contexts. However, unless the context is very different from where the approach has already been widely demonstrated, there is little value in comparing it with approaches that are known to be sub-optimal (which in Turkey, a country where constructivist 'reform' teaching approaches are supposed to be the expected standard, seem to often be labelled as 'traditional').

Detailed case studies of the implementation of a reform pedagogy in new contexts that collect rich 'process' data to explore challenges to implementation and to identify especially effective specific practices would surely be more informative? 4

If researchers do feel the need to do experiments, then rather than comparing known-to-be-effective approaches with suboptimal approaches, hoping to demonstrate what everyone already knows, why not use comparison conditions that really test the innovation? Of course jigsaw learning outperformed lecturing in an elementary school – but how might it have compared with another constructivist approach?

I have described the constructivist science teacher as a kind of learning doctor. Like medical doctors, our first tenet should be to do no harm. So, if researchers want to set up experimental comparisons, they have a duty to try to set up two different approaches that they believe are likely to benefit the learners (whichever condition they are assigned to):

  • not one condition that advantages one group of students
  • and another which deliberately disadvantages another group of students for the benefit of a 'positive' research outcome.

If you already know the outcome then it is not genuine research – and you need a better research question.


Work cited:

Notes:

1 Imagine teaching one class about acids by jigsaw learning, and teaching another about the nervous system by some other pedagogy – and then comparing the pedagogies by administering a test – about acids! The class in the jigsaw condition might well do better, without it being reasonable to assume this reflects more effective pedagogy.

So, I am tempted to read this as simply a drafting/typographical error that has been missed, and suspect the authors intended to refer to something like the traditional approach to teaching the science and technology curriculum. Otherwise the experiment is fatally flawed.

Yet, one purpose of the study was to find out

"Does jigsaw cooperative learning instruction contribute to a better conceptual understanding of 'physical and chemical changes' in sixth grade students compared to the traditional science and technology curriculum?"

Tarhan et al., 2013, p.187

This reads as if the researchers felt the curriculum was not sufficiently matched to what they felt were the most important learning objectives in the topic of physical and chemical changes, so they have undertaken some curriculum development, as well as designed a teaching unit accordingly, to be taught by jigsaw learning pedagogy. If so, the experiment is testing:

traditional curriculum × traditional pedagogy

vs.

reformed curriculum × innovative pedagogy

making it impossible to disentangle the two components.

This suggests the researchers are testing the combination of curriculum and pedagogy, and doing so with a test biased towards the experimental condition. This seems illogical, but I have actually worked on a project where we faced a similar dilemma. In the epiSTEMe project we designed innovative teaching units for lower secondary science and maths. In both physics units we incorporated innovative aspects into the curriculum.

  • In the forces unit material on proportionality was introduced, with examples (car stopping distance) normally not taught at that grade level (Y7);
  • In the electricity unit the normal physics content was embedded in an approach designed to teach aspects of the nature of science.

In the forces unit, the end-of-topic test covered material that was included in the project-designed units, but unlikely to be taught in the control classes. There was evidence that on average students in the project classes did better on the test.

In the electricity unit, the nature of science objectives were not tested as these would not necessarily have been included in teaching control classes. On average, there was very little difference in learning about electrical circuits in the two conditions. There was however a very wide range of class performances – oddly just as wide in the experimental condition (where all classes had a common scheme of work, common activities, and common learning materials) as in the control condition where teachers taught the topic in their customary ways.


2 It could be read either as


Reading 1:

Control: "The control group was instructed via a teacher-centered didactic lecture format. Throughout the lesson, the same science and technology teacher presented the same content as for the experimental group to achieve the same learning objectives, which were taught via detailed instruction in the experimental group. During this process, the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes. The instruction was accomplished in the same amount of time as for the experimental group."

Experimental: "…detailed instruction in the experimental group. This instruction included lectures, discussions and problem solving."

What was 'this instruction' which included lectures, discussions and problem solving?

or


Reading 2:

Control: "The control group was instructed via a teacher-centered didactic lecture format. Throughout the lesson, the same science and technology teacher presented the same content as for the experimental group to achieve the same learning objectives, which were taught via detailed instruction in the experimental group. This [sic] instruction included lectures, discussions and problem solving. During this process, the teacher used the blackboard and asked some questions related to the subject. Students also used a regular textbook. While the instructor explained the subject, the students listened to her and took notes. The instruction was accomplished in the same amount of time as for the experimental group."

Experimental: "…detailed instruction in the experimental group."

What was 'this instruction' which included lectures, discussions and problem solving?

3 A class, of course, is not a person, but a collection of people, so perhaps does not have a 'personality' as such. However, for teachers, classes do take on something akin to a personality.

This is not just an impression. It was pointed out above that if a researcher wants to treat each learner as a unit of analysis (necessary to use inferential statistics when only working with a small number of classes) then learners, not intact classes, should be assigned to conditions. However, even a newly formed class will soon develop something akin to a personality. This will certainly be influenced by the individual learners present, but develops through the history of their evolving mutual interactions and is not just a function of the sum of their individual characteristics.

So, even when a class is formed by random assignment of learners at the start of a study, it is still strictly questionable whether these students should be seen as independent units for analysis (Taber, 2019).


4 I suspect that science educators have a justified high regard for experimental method in the natural sciences, which sometimes blinkers us to its limitations in social contexts where there are myriad interacting variables and limited controls.

Read: Why do natural scientists tend to make poor social scientists?


A case study of educational innovation?

Design and Assessment of an Online Prelab Model in General Chemistry


Keith S. Taber


Case study is meant to be naturalistic – whereas innovation sounds like an intervention. But interventions can be the focus of naturalistic enquiry.

One of the downsides of having spent years teaching research methods is that one cannot help but notice how so much published research departs from the ideal models one offers to students. (Which might be seen as a polite way of saying authors often seem to get key things wrong.) I used to teach that how one labelled one's research was less important than how well one explained it. That is, different people would have somewhat different takes on what is, or is not, grounded theory, case study or action research, but as long as an author explained what they had done, and could adequately justify why, the choice of label for the methodology was of secondary importance.

A science teacher can appreciate this: a student who tells the teacher they are doing a distillation when they are actually carrying out reflux, but clearly explains what they are doing and why, will still be understood (even if the error should be pointed out). On the other hand, if a student has the right label but an alternative conception, this is likely to be a more problematic 'bug' in the teaching-learning system. 1

That said, each type of research strategy has its own particular weaknesses and strengths so describing something as an experiment, or a case study, if it did not actually share the essential characteristics of that strategy, can mislead the reader – and sometimes even mislead the authors such that invalid conclusions are drawn.

A 'case study', that really is a case study

I made reference above to action research, grounded theory, and case study – three methodologies which are commonly name-checked in education research. There are a vast number of papers in the literature with one of these terms in the title, and a good many of them do not report work that clearly fits the claimed approach! 2


The case study was published in the Journal for the Research Center for Educational Technology

So, I was pleased to read an interesting example of a 'case study' that I felt really was a case study: 'Design and assessment of an online prelab model in general chemistry: A case study' (Llorens-Molina, 2009). Although, I suspect some other authors might have been tempted to describe this research differently.

Is it a bird, is it a plane; no it's…

Llorens-Molina's study included an experimental aspect. A cohort of learners was divided into two groups to allow the researcher to compare two different educational treatments; then, measurements were made to compare outcomes quantitatively. That might sound like an experiment. Moreover, this study reported an attempt to innovate in a teaching situation, which gives the work a flavour of action research. Despite this, I agree with Llorens-Molina that the work is best characterised as a case study.

Read about experiments

Read about action research


A case study focuses on 'one instance' from among many


What is a case study?

A case study is an in-depth examination of one instance: one example – of something for which there are many examples. The focus of a case study might be one learner, one teacher, one group of students working together on a task, one class, one school, one course, one examination paper, one text book, one laboratory session, one lesson, one enrichment programme… So, there is great variety in what kind of entity a case study is a study of, but what case studies have in common is they each focus in detail on that one instance.

Read about case study methodology


Characteristics of case study

Case studies are naturalistic studies, which means they are studies of things as they are, not attempts to change things. The case has to be bounded (a reader of a case study learns what is in the case and what is not) but tends to be embedded in a wider context that impacts upon it. That is, the case is entangled in a context from which it could not easily be extracted and still be the same case. (Imagine moving a teacher with her class from their school to have their lesson in a university where it could be observed by researchers – it would not be 'the same lesson' as would have occurred in situ).

The case study is reported in detail, often in a narrative form (not just statistical summaries) – what is sometimes called 'thick description'. Usually several 'slices' of data are collected – often different kinds of data – and often there is a process of 'triangulation' to check the consistency of the account presented in relation to the different slices of data available. Although case studies can include analysis of quantitative data, they are usually seen as interpretive as the richness of data available usually reflects complexity and invites nuance.



Design and Assessment of an Online Prelab Model in General Chemistry

Llorens-Molina's study explored the use of prelabs that are "used to introduce and contextualize laboratory work in learning chemistry" (p.15), and in particular "an alternative prelab model, which consists of an audiovisual tutorial associated with an online test" (p.15).

An innovation

The research investigated an innovation in teaching practice,

"In our habitual practice, a previous lecture at the beginning of each laboratory session, focused almost exclusively on the operational issues, was used. From our teaching experience, we can state that this sort of introductory activity contributes to a "cookbook" way to carry out the laboratory tasks. Furthermore, the lecture takes up valuable time (about half an hour) of each ordinary two-hour session. Given this set-up, the main goal of this research was to design and assess an alternative prelab model, which was designed to enhance the abilities and skills related to an inquiry-type learning environment. Likewise, it would have to allow us to save a significant amount of time in laboratory sessions due to its online nature….

a prelab activity developed …consists of two parts…a digital video recording about a brief tutorial lecture, supported by a slide presentation…[followed by] an online multiple choice test"

Llorens-Molina, 2009, p.16-17

Not action research?

The reference to shifting "our habitual practice" indicates this study reports practitioner research. Practitioner studies, such as this, that test an innovation are often labelled by their authors as 'action research'. (Indeed, sometimes the mere fact that research is carried out by practitioners looking to improve their own practice is taken as sufficient to call it action research – when actually this is a necessary, but not a sufficient, condition.)

Genuine action research aims at improving practice, not simply seeing if a specific innovation is working. This means action research has an open-ended design, and is cyclical – with iterations of an innovation tested and the outcomes used as feedback to inform changes in the innovation. (Despite this, a surprising number of published studies labelled as action research lack any cyclic element, simply reporting one iteration of an innovation.) Llorens-Molina's study does not have a cyclic design, so would not be well-characterised as action research.

An experimental design?

Llorens-Molina reports that the study was motivated by three hypotheses (p.16):

  • "Substituting an initial lecture by an online prelab to save time during laboratory sessions will not have negative repercussions in final examination marks.
  • The suggested online prelab model will improve student autonomy and prerequisite knowledge levels during laboratory work. This can be checked by analyzing the types and quantity of SGQ [student generated questions].
  • Student self-perceptions about prelab activities will be more favourable than those of usual lecture methods."

To test these hypotheses the student cohort was divided into two groups, to be split between the customary and innovative approach. This seems very much like an experiment.

It may be useful here to draw a distinction between two levels of research design – methodology (akin to strategy) and techniques (akin to tactics). In research design, a methodology is chosen to meet the overall aims of the study, and then one or more research techniques are selected consistent with that methodology (Taber, 2013). Experimental techniques may be included in a range of methodologies, but experiment as an overall methodology has some specific features.

Read about Research design

In a true experiment there is random assignment to conditions, and often there is an intention to generalise results to a wider population considered to be sampled in the study. Llorens-Molina reports that although inferential statistics were used to test the hypotheses, there was no intention to offer statistical generalisation beyond the case. The cohort of students was not assumed to be a sample representing some wider population (such as, say, undergraduates on chemistry courses in Spain) – and, indeed, clearly such an assumption would not have been justified.

Case study is naturalistic – but an innovation is an intervention in practice…

Case study is said to be naturalistic research – it is a method used to understand and explore things as they are, not to bring about change. Yet, here the focus is an innovation. That seems a contradiction. It would be a contradiction if the study was being carried out by external researchers who had asked the teaching team to change practice for the benefits of their study. However, here it is useful to separate out the two roles of teacher and researcher.

This is a situation that I commonly faced when advising graduates preparing for school teaching who were required to carry out a classroom-based study into an aspect of their school placement practice context as part of their university qualification (the Post-Graduate Certificate in Education, P.G.C.E.). Many of these graduates were unfamiliar with research into social phenomena. Science graduates often brought a model of what worked in the laboratory to their thinking about their projects – and had a tendency to think that transferring the experimental approach to classrooms (where there are usually a large number of potentially relevant variables, many of which cannot be controlled) would be straightforward.

Read 'Why do natural scientists tend to make poor social scientists?'

The Cambridge P.G.C.E. teaching team put into place a range of supports to introduce graduates preparing for teaching to the kinds of education research useful for teachers who want to evaluate and improve their own teaching. This included a book written to introduce classroom-based research that drew heavily on analysis of published studies (Taber, 2007; 2013). Part of our advice was that those new to this kind of enquiry might want to consider action research and case study as suitable options for their small-scale projects.


Useful strategies for the novice practitioner-researcher (Figure: diagram used in working with graduates preparing for teaching, from Taber, 2010)

Simplistically, action research might be considered best suited to a project to test an innovation or address a problem (e.g., evaluating a new teaching resource; responding to behavioural issues), and case study best suited to an exploratory study (e.g., what do Y9 students understand about photosynthesis?; what is the nature of peer dialogue during laboratory working in this class?) However, it was often difficult for the graduates to carry out authentic action research as the constraints of the school-based placements seldom allowed them to test successive iterations of the same intervention until they found something like an optimal specification.

Yet, they often were in a good position to undertake a detailed study of one iteration, collecting a range of different data, and so producing a detailed evaluation. That sounds like a case study.

Case study is supposed to be naturalistic – whereas innovation sounds like an intervention. But some interventions in practice can be considered the focus of naturalistic enquiry. My argument was that when a teacher changes the way they do something to try to solve a problem, or simply to find a better way to work, that is a 'natural' part of professional practice. The teacher-researcher, as researcher, is exploring something the fully professional teacher does as a matter of course – seeking to develop practice. After all, our graduates were being asked to undertake research to give them the skills expected to meet professional teaching standards, which

"clearly requires the teacher to have both the procedural knowledge to undertake small-scale classroom enquiry, and 'conceptual frameworks' for thinking about teaching and learning that can provide the basis for evaluating their teaching. In other words, the professional teacher needs both the ability to do her own research and knowledge of what existing research suggests"

Taber, 2013, p.8

So, the research is on something that is naturally occurring in the classroom context, rather than an intervention imported into the context in order to answer an external researcher's questions. A case study of an intervention introduced by practitioners themselves can be naturalistic – even if the person implementing the change is the researcher as well as the teacher.


If a teacher-researcher (qua researcher) wishes to enquire into an innovation introduced by the teacher-researcher (qua teacher) then this can be considered as naturalistic enquiry


The case and the context

In Llorens-Molina's study, the case was a sequence of laboratory activities carried out by a cohort of undergraduates undertaking a course of General and Organic Chemistry as part of an Agricultural Engineering programme. So, the case was bounded (the laboratory part of one taught course) and embedded in a wider context – a degree programme in a specific institution in Spain: the Polytechnic University of Valencia.

The primary purpose of the study was to find out about the specific innovation in the particular course that provided the case. This was then what is known as an intrinsic case study. (When a case is studied primarily as an example of a class of cases, rather than primarily for its own interest, it is called an instrumental case study).

Llorens-Molina recognised that what was found in this specific case, in its particular context, could not be assumed to apply more widely. There can be no statistical generalisation to other courses elsewhere. In case study, the intention is to offer sufficient detail of the case for readers to make judgements of its likely relevance to other contexts of interest (so-called 'reader generalisation').

The published report gives a good deal of information about the course, as well as much information about how data was collected and, equally important, analysed.

Different slices of data

Case study often uses a range of data sources to develop a rounded picture of the case. In this study the identification of three specific hypotheses (less usual in case studies, which often have more open-ended research questions) led to the collection of three different types of data.

  • Students were assessed on each of six laboratory activities. A comparison was made between the prelab condition and the existing approach.
  • Questions asked by students in the laboratories were recorded and analysed to see if the quality/nature of such questions was different in the two conditions. A sophisticated approach was developed to analyse the questions.
  • Students were asked to rate the prelabs through responding to items on a questionnaire.

This approach allowed the author to go beyond simply reporting whether hypotheses were supported by the analysis, to offer a more nuanced discussion around each feature. Such nuance is not only more informative to the reader of a case study, but reflects how the researcher, as practitioner, has an ongoing commitment to further develop practice and not see the study as an end in itself.

Avoiding the 'equivalence' and the 'misuse of control groups' problems

I particularly appreciate a feature of the research design that many educational studies that claim to be experiments could benefit from. To test his hypotheses Llorens-Molina employed two conditions or treatments, the innovation and a comparison condition, and divided the cohort: "A group with 21 students was split into two subgroups, with 10 and 11 in each one, respectively". Llorens-Molina does not suggest this was based on random assignment, which is necessary for a 'true' experiment.

In many such quasi-experiments (where randomisation to condition is not carried out, and is indeed often not possible) the researchers seek to offer evidence of equivalence before the treatments occur. After all, if the two subgroups are different in terms of past subject attainment or motivation or some other relevant factor (or, indeed, if there is no information to allow a judgement regarding whether this is the case or not), no inferences about an intervention can be drawn from any measured differences. (Although that does not always stop researchers from making such claims regardless: e.g., see Lack of control in educational research.)

Another problem is that if learners are participating in research but are assigned to a control or comparison condition then it could be asked whether they are just being used as 'data fodder', and whether that is fair to them. This is especially so in those cases (so, not this one) where researchers require that the comparison condition is educationally deficient – many published studies report a control condition where school students have effectively been lectured to, and no discussion work, group work, practical work, digital resources, et cetera, have been allowed, in order to ensure a stark contrast with whatever supposedly innovative pedagogy or resource is being evaluated (Taber, 2019).

These issues are addressed in research designs which have a compensatory structure – in effect the groups switch between being the experimental and comparison condition – as here:

"Both groups carried out the alternative prelab and the previous lecture (traditional practice), alternately. In this way, each subgroup carried out the same number of laboratory activities with either a prelab and previous lecture"

Llorens-Molina, 2009, p.19

This is good practice both from methodological and ethical considerations.
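The counterbalancing that Llorens-Molina describes can be sketched in a few lines of code. This is a hypothetical illustration, not from the paper: the subgroup labels and the function name are my own, with the six laboratory activities and the two alternating conditions taken from the study's description.

```python
# Sketch of a compensatory (crossover) design: two subgroups alternate
# between the innovative online prelab and the traditional pre-lab
# lecture across six laboratory activities, in opposite phase, so each
# subgroup experiences each condition three times.

def crossover_schedule(n_activities, conditions=("prelab", "lecture")):
    """Return {subgroup: [condition for each activity]} with the two
    subgroups alternating conditions in opposite phase."""
    schedule = {"subgroup_A": [], "subgroup_B": []}
    for i in range(n_activities):
        schedule["subgroup_A"].append(conditions[i % 2])
        schedule["subgroup_B"].append(conditions[(i + 1) % 2])
    return schedule

schedule = crossover_schedule(6)
for group, plan in schedule.items():
    print(group, plan)

# Neither subgroup is systematically disadvantaged: each meets each
# condition equally often, and any pre-existing difference between the
# subgroups is balanced across the two conditions.
```

Because each subgroup serves as both 'experimental' and 'comparison' group, the design sidesteps the need to demonstrate that the two subgroups were equivalent at the outset.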


The study used a compensatory design which avoids the need to ensure both groups are equivalent at the start, and does not disadvantage one group. (Figure from Llorens-Molina, 2009, p.22 – published under a creative commons Attribution-NonCommercial-NoDerivs 3.0 United States license allowing redistribution with attribution)

A case of case study

Do I think this is a model case study that perfectly exemplifies all the claimed characteristics of the methodology? No, and very few studies do. Real research projects, often undertaken in complex contexts with limited resources and intractable constraints, seldom fit such ideal models.

However, unlike some studies labelled as case studies, this study has an explicit bounded case and has been carried out in the spirit of case study that highlights and values the intrinsic worth of individual cases. There is a good deal of detail about aspects of the case. It is in essence a case study, and (unlike what sometimes seems to be the case [sic]) not just called a case study for want of a methodological label. Most educational research studies examine one particular case of something – but (and I do not think this is always appreciated) that does not automatically make them case studies. Because it has been both conceptualised and operationalised as a case study, Llorens-Molina's study is a coherent piece of research.

Given how, in these pages, I have often been motivated to call out studies I have read that I consider have major problems – major enough to be sufficient to undermine the argument for the claimed conclusions of the research – I wanted to recognise a piece of research that I felt offered much to admire.


Work cited:

Notes:

1 I am using language here reflecting a perspective on teaching as being based on a model (whether explicit or not) in the teacher's mind of the learners' current knowledge and understanding and how this will respond to teaching. That expects a great deal of the teacher, so there are often bugs in the system (e.g., the teacher over-estimates prior knowledge) that need to be addressed. This is why being a teacher involves being something of a 'learning doctor'.

Read about the learning doctor perspective on teaching


2 I used to teach sessions introducing each of these methodologies when I taught on an Educational Research course. One of the class activities was to examine published papers claiming the focal methodology, asking students to see if studies matched the supposed characteristics of the strategy. This was a course with students undertaking a very diverse range of research projects, and I encouraged them to apply the analysis to papers selected because they were of particular interest and relevance to their own work. Many examples selected by students proved to offer a poor match between claimed methodology and the actual research design of their study!

Lack of control in educational research

Getting that sinking feeling on reading published studies


Keith S. Taber


this is like finding that, after a period of watering plant A, it is taller than plant B – when you did not think to check how tall the two plants were before you started watering plant A

Research on prelabs

I was looking for studies which explored the effectiveness of 'prelabs', activities which students are given before entering the laboratory to make sure they are prepared for practical work, and can therefore use their time effectively in the lab. There is much research suggesting that students often learn little from science practical work, in part because of cognitive overload – that is, learners can be so occupied with dealing with the apparatus and materials they have little capacity left to think about the purpose and significance of the work. 1


Okay, so is THIS the pipette?
(Image by PublicDomainPictures from Pixabay)

Approaching a practical work session having already spent time engaging with its purpose and associated theories/models, and already having become familiar with the processes to be followed, should mean students enter the laboratory much better prepared to use their time efficiently, and much better informed to reflect on the wider theoretical context of the work.

I found a Swedish paper (Winberg & Berg, 2007) reporting a pair of studies that tested this idea by using a simulation as a prelab activity for undergraduates about to engage with an acid-base titration. The researchers tested this innovation by comparisons between students who completed the prelab before the titration, and those who did not.

The work used two basic measures:

  • types (sophistication) of questions asked by students during the lab session
  • elicitation of knowledge in interviews after the laboratory activity

The authors found some differences (between those who had completed the prelab and those who had not) in the sophistication of the questions students asked, and in the quality of the knowledge elicited. They used inferential statistics to suggest at least some of the differences found were statistically significant. From my reading of the paper, these claims were not justified.

A peer reviewed journal (no, really, this time)

This is a paper in a well respected journal (not one of the predatory journals I have often discussed on this site). The Journal of Research in Science Teaching is published by Wiley (a major respected publisher of academic material) and is the official journal of NARST (which used to stand for the National Association for Research in Science Teaching – where 'national' referred to the USA 2). This is a journal that does take peer review very seriously.

The paper is well-written and well-structured. Winberg and Berg set out a conceptual framework for the research that includes a discussion of previous relevant studies. They adopt a theoretical framework based on Perry's model of intellectual development (Taber, 2020). There is considerable detail of how data was collected and analysed. This account is well-argued. (But, you, dear reader, can surely sense a 'but' coming.)

Experimental research into experimental work?

The authors do not seem to explicitly describe their research as an experiment as such (as opposed to adopting some other kind of research strategy such as survey or case study), but the word 'experiment' and variations of it appear in the paper.

For one thing, the authors refer to students' practical work as being experiments,

"Laboratory exercises, especially in higher education contexts, often involve training in several different manipulative skills as well as a high information flow, such as from manuals, instructors, output from the experimental equipment, and so forth. If students do not have prior experiences that help them to sort out significant information or reduce the cognitive effort required to understand what is happening in the experiment, they tend to rely on working strategies that help them simply to cope with the situation; for example, focusing only on issues that are of immediate importance to obtain data for later analysis and reflective thought…"

Winberg & Berg, 2007

Now, some student practical work is experimental, where a student is actively looking to see what happens when they manipulate some variable to test a hypothesis. This type of practical work is sometimes labelled enquiry (or inquiry in US spelling). But a lot of school and university laboratory work is undertaken to learn techniques, or (probably more often) to support the learning of taught theory – where it is usually important the learners know what is meant to happen before they begin the laboratory activity.

Winberg and Berg refer to the 'laboratory exercise' as 'the experiment' as though any laboratory work counts as an experiment. In Winberg and Berg's research, students were asked about their "own [titration] experiment", despite the prelab material involving a simulation of the titration process, in advance of which "the theoretical concepts, ideas, and procedures addressed in the simulation exercise had been treated mainly quantitatively during the preceding 1-week instructional sequence". So, the laboratory titration exercise does not seem to be an experiment in the scientific sense of the term.

School children commonly describe all practical work in the lab as 'doing experiments'. It cannot help students learn what an experiment really is when the word 'experiment' has two quite distinct meanings in the science classroom:

  • experiment(technical) = an empirical test of a hypothesis involving the careful control of variables and observation of the effect on a specified (hypothesised) dependent variable of changing the variable specified as the independent variable
  • experiment(casual) = absolutely any practical activity carried out with laboratory equipment

We might describe this second meaning as an alternative conception of 'experiment', a way of understanding that is inconsistent with the scientific meaning. (Just as there are common alternative conceptions of other 'nature of science' concepts such as 'theory').

I would imagine Winberg and Berg were well aware of what an experiment is, although their casual use of language might suggest a lack of rigour in thinking with the term. They refer to having "both control and experiment groups" in their studies, and refer to "the experimental chronology" of their research design. So, they certainly seem to think of their work as a kind of experiment.

Experimental design

In a true experiment, a sample is randomly drawn from a population of interest (say, first year undergraduate chemistry students; or, perhaps, first year undergraduate chemistry students attending Swedish Universities, or… 3) and assigned randomly to the conditions being compared. Providing a genuine form of random assignment is used, inferential statistical tests can indicate whether any differences found between groups at the end of an experiment should be considered statistically significant. 4

"Statistics can only indicate how likely a measured result would occur by chance (as randomisation of units of analysis to different treatments can only make uneven group composition unlikely, not impossible)…Randomisation cannot ensure equivalence between groups (even if it makes any imbalance just as likely to advantage either condition)"

Taber, 2019, p.73

Inferential statistics can be used to test for statistical significance in experiments – as long as the 'units of analysis' (e.g., students) are randomly assigned to the experimental and control conditions.
(Figure from Taber, 2019)

That is, if there are differences that the statistical tests suggest are very unlikely to have occurred by chance, then they are very unlikely to be due to an initial difference between the groups in the two conditions – as long as the groups were the result of random assignment. But that is a very important proviso.

There are two aspects to this need for randomisation:

  • to be able to suggest any differences found reflect the effects of the intervention, then there should be random assignment to the two (or more) conditions
  • to be able to suggest the results reflect what would probably be found in a wider population, the sample should be randomly selected from the population of interest 3
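The distinction between these two kinds of randomisation can be sketched in a few lines of Python (the roster and all the numbers here are invented purely for illustration):

```python
import random

# A hypothetical roster standing in for the population of interest.
population = [f"student_{i}" for i in range(500)]

random.seed(42)  # only so this illustrative sketch is reproducible

# Random *sampling*: draw the sample from the population, so that
# findings can be generalised back to that population.
sample = random.sample(population, 60)

# Random *assignment*: shuffle the sample and split it between the
# conditions, so any initial difference between the groups is a matter
# of chance rather than of some systematic factor.
random.shuffle(sample)
treatment, control = sample[:30], sample[30:]
```

Only when the second step is genuinely random do the usual inferential tests licence the inference that an unlikely difference at the end reflects the intervention rather than the group composition.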

Studies in education seldom meet the requirements for being true experiments
(Figure from Taber, 2019)

In education, it is not always possible to use random assignment, so true experiments are then not possible. However, so-called 'quasi-experiments' may be possible where differences between the outcomes in different conditions may be understood as informative, as long as there is good reason to believe that even without random assignment, the groups assigned to the different conditions are equivalent.

In this specific research, that would mean having good reason to believe that without the intervention (the prelab):

  • students in both groups would have asked overall equivalent (in terms of the analysis undertaken in this study) questions in the lab;
  • students in both groups would have been judged as displaying overall equivalent subject knowledge.

Often in research where a true experiment is not possible some kind of pre-testing is used to make a case for equivalence between groups.
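A minimal sketch (in Python, with invented pre-test scores) of the kind of descriptive comparison that might be offered as a case for equivalence between intact groups:

```python
import statistics

# Invented pre-test scores for two intact (non-randomised) groups.
group_1 = [12, 15, 14, 13, 16, 14, 15, 13]
group_2 = [14, 13, 15, 12, 14, 16, 13, 14]

# A case for 'equivalence' rests on the groups looking alike on a
# relevant measure *before* the intervention: similar means and spreads.
m1, m2 = statistics.mean(group_1), statistics.mean(group_2)
s1, s2 = statistics.stdev(group_1), statistics.stdev(group_2)

print(f"group 1: mean={m1:.2f}, sd={s1:.2f}")
print(f"group 2: mean={m2:.2f}, sd={s2:.2f}")
```

Of course, similar pre-test scores only make a case for equivalence on what was measured; they cannot rule out differences on unmeasured factors.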

Two control groups that were out of control

In Winberg and Berg's research there were two studies where comparisons were made between 'experimental' and 'control' conditions.

| Study | Experimental | Control |
| --- | --- | --- |
| Study 1 | n=78: first-year students, following completion of their first chemistry course in 2001 | n=97: students who had been interviewed by the researchers during the same course in the previous year |
| Study 2 | n=21 (of 58 in cohort) | n=37 (of 58 in same cohort) |

In the first study, a comparison was made between the cohort where the innovation was introduced and a cohort from the previous year. All other things being equal, it seems likely these two cohorts were fairly similar. But in education all things are seldom equal, so there is no assurance they were similar enough to be considered equivalent.

In the second study:

"Students were divided into treatment (n = 21) and control (n = 37) groups. Distribution of students between the treatment and control groups was not controlled by the researchers".

Winberg & Berg, 2007

So, some factor(s) external to the researchers divided the cohort into two groups – and the reader is told nothing about the basis for this, nor even if the two groups were assigned to the treatments randomly.5 The authors report that the cohort "comprised prospective molecular biologists (31%), biologists (51%), geologists (7%), and students who did not follow any specific program (11%)", and so it is possible the division into two uneven sized groups was based on timetabling constraints, with students attending chemistry lab sessions according to their availability based on specialism. But that is just a guess. (It is usually better when the reader of a research report is not left to speculate about procedures and constraints.)

What is important for a reader to note is that in these studies:

  • the researchers were not able to assign learners to conditions randomly;
  • nor were the researchers able to offer any evidence of equivalence between groups (such as near identical pre-test scores);
  • so, the requirements for inferring significance from statistical tests were not met;
  • so, claims in the paper about finding statistically significant differences between conditions cannot be justified given the research design;
  • and therefore the conclusions presented in the paper are strictly not valid.

If students are not randomly assigned to conditions, then any statistically unlikely difference found at the end of an experiment cannot be assumed to be likely to be due to intervention, rather than some systematic initial difference between the groups.
(Figure adapted from Taber, 2019)


This is a shame, because this is in many ways an interesting paper, and much thought and care seems to have been taken about the collection and analysis of meaningful data. Yet, drawing conclusions from statistical tests comparing groups that might never have been similar in the first place is like finding that careful use of a vernier scale shows that after a period of watering plant A, plant A is taller than plant B – having been very careful to make sure plant A was watered regularly with carefully controlled volumes, while plant B was not watered at all – when you did not think to check how tall the two plants were before you started watering plant A.

In such a scenario we might be tempted to assume plant A has actually become taller because it had been watered; but that is just applying what we had conjectured should be the case, and we would be mistaking our expectations for experimental evidence.
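The plant-watering scenario is easily simulated in a few lines of Python (all the heights are invented): if group A simply starts taller, the whole post-test difference is accounted for by the unmeasured pre-test difference, and the 'watering' explains none of it.

```python
import random

random.seed(1)  # only so this illustrative sketch is reproducible

# Group A starts systematically taller -- an initial difference
# the design never measured (heights in cm).
group_a_pre = [random.gauss(30, 2) for _ in range(20)]
group_b_pre = [random.gauss(20, 2) for _ in range(20)]

# Both groups grow by the same amount: the 'watering' adds nothing extra.
growth = 5
group_a_post = [h + growth for h in group_a_pre]
group_b_post = [h + growth for h in group_b_pre]

def mean(xs):
    return sum(xs) / len(xs)

pre_difference = mean(group_a_pre) - mean(group_b_pre)
post_difference = mean(group_a_post) - mean(group_b_post)

# The post-test difference is entirely explained by the pre-test one.
print(f"pre: {pre_difference:.2f} cm, post: {post_difference:.2f} cm")
```

A significance test on the post-test heights alone would happily declare the 'watered' group taller – which is true, but tells us nothing about the effect of watering.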

Work cited:

Notes:

1 The part of the brain where we can consciously manipulate ideas is called the working memory (WM). Research suggests that WM has a very limited capacity in the sense that people can only hold in mind a very small number of different things at once. (These 'things' however are somewhat subjective – a complex idea that is treated as a single 'thing' in the WM of an expert can overload a novice.) This limit to WM capacity is considered to be one of the most substantial constraints on effective classroom learning. This is also, then, one of the key research findings informing the design of effective teaching.

Read about working memory

Read about key ideas for teaching in accordance with learning theory

How fat is your memory? – read about a chemical analogy for working memory


2 The organisation has seemingly spotted that the USA is only one part of the world, and now describes itself as a global organisation for improving science education through research.


3 There is no reason why an experiment cannot be carried out on a very specific population, such as first year undergraduate chemistry students attending a specific Swedish University such as, say, Umeå University. However, if researchers intend their study to have results generalisable beyond their specific research contexts (say, to first year undergraduate chemistry students attending any Swedish University) then it is important to have a representative sample of that population.

Read about populations of interest in research

Read about generalisation from research studies


4 It might be assumed that scientists and researchers know what is meant by random, and how to undertake random assignment. Sadly, the literature suggests that in practice the term 'randomly' is sometimes used in research reports to mean something like 'arbitrarily' (Taber, 2013), which falls short of being random.

Read about randomisation in research


5 Arguably, even if the two groups were assigned randomly, there is only one 'unit of analysis' in each condition, as they were assigned as groups. That is, for statistical purposes, the two groups have size n=1 and n=1, which would not allow statistical significance to be found: e.g., see 'Quasi-experiment or crazy experiment?'