randomisation – Science-Education-Research

Can we be sure that fun in the sun alters water chemistry?

Minimalist sampling and experimental variables

Keith S. Taber

Dirty water

I was reading the latest edition of Education in Chemistry and came across an article entitled "Fun in the sun alters water chemistry. How swimming and tubing are linked to concerning rises in water contaminants" (Notman, 2023). This was not an article about teaching, but a report of some recent chemistry research summarised for teachers. [Teaching materials relating to this article can be downloaded from the RSC website.]

I have to admit to not having understood what 'tubing' was (I plead 'age') apart from its everyday sense of referring collectively to tubes, such as those that connect Bunsen burners to gas supplies, and was intrigued by what kinds of tubes were contaminating the water.

The research basically reported on the presence of higher levels of contaminants in the same body of water at Clear Creek, Colorado on a public holiday when many people used the water for recreational pursuits (perhaps even for 'tubing'?) than on a more typical day.

This seems logical enough: more people in the water; more opportunities for various substances to enter the water from them. I have my own special chemical sensor which supports this finding. I go swimming in the local hotel pool, and even though people are supposed to shower before entering the pool, not everyone does (or at least, not effectively). Sometimes one can 'taste' ¹ the change when someone gets in the water without washing off perfume or scented soap residue. Indeed, occasionally the water 'tastes' ¹ different after people enter the pool area wearing strong perfume, even if they do not use the pool or come into direct contact with the water!

The scientists reported finding various substances they assumed were being excreted ² by the people using the water – substances such as antihistamines and cocaine – as well as indicators of various sunscreens and cosmetics. (They also found higher levels of "microbes associated with humans", although this was not reported in Education in Chemistry.)

I'm not sure why I bother having a shower BEFORE I go for a swim in there… (Image by sandid from Pixabay)

It makes sense – but is there a convincing case?

Now this all seems very reasonable, as the results fit into a narrative that seems theoretically feasible: a large number of people entering the fresh water of Clear Creek are likely to pollute it sufficiently (if not to rename it Turbid Creek) for detection with the advanced analytical tools available to the modern chemist (including "an inductively coupled plasma mass spectrometer and a liquid chromatography high resolution mass spectrometer").

However, reading on, I was surprised to learn that the sampling in this study was decidedly dodgy.

"The scientists collected water samples during a busy US public holiday in September 2022 and on a quiet weekday afterwards."

I am not sure how this (natural) experiment would rate as a design for a school science investigation. I would certainly have been very critical if any educational research study I had been asked to evaluate relied on sampling like this. Even if large numbers of samples were taken from various places in the water over an extended period during these two days, this procedure has a major flaw. This is because the level of control of other possibly relevant factors is minimal.

Read about control in experimental research

The independent variable is whether the samples were collected on a public holiday when there was much use of the water for leisure, or on a day with much less leisure use. The dependent variables measured were levels of substances in the water that would not be considered part of the pristine natural composition of river water. A reasonable hypothesis is that there would be more contamination when more people were using the water, and that was exactly what was found. But is this enough to draw any strong conclusions?

Considering the counterfactual

A useful test is to ask whether we would have been convinced that people do not contaminate the water had the analysis shown there was no significant difference in water samples on the two days? That is to examine a 'counterfactual' situation (one that is not the case, but might have been).

In this counterfactual scenario, would similar levels of detected contaminants be enough to convince us the hypothesis was misguided – or might we look to see if there was some other factor which might explain this unexpected (given how reasonable the hypothesis seems) result and rescue our hypothesis?

Had pollutant levels been equally high on both days, might we have sought ('ad hoc') to explain that through other factors:

Maybe it was sunnier on the second day with high U.V. levels which led to more breakdown of organic debris in the river?
Perhaps there was a spill of material up-river ³ which masked any effect of the swimmers (and, er, tubers?)
Perhaps rainfall between the two sampling dates had increased the flow of the river and raised its level, washing more material into the water?
Perhaps the wind direction was different and material was being blown in from nearby agricultural land on the second day.
Perhaps the water temperature was different?
Perhaps a local industry owner tends to illegally discharge waste into the river when the plant is operating on normal working days?
Perhaps spawning season had just started for some species, or some species was emerging from a larval state on the river bed and disturbing the debris on the bottom?
Perhaps passing migratory birds were taking the opportunity to land in the water for some respite, and washing off parasites as well as dust.
Perhaps a beaver's dam had burst upstream ³?
Perhaps (for any panspermia fans among readers) an asteroid covered with organic residues had landed in the river?
Or…

But: if we might consider some of those factors to potentially explain a lack of effect we were expecting, then we should equally consider them as possible alternative causes for an effect we predicted.

Maybe it was sunnier on the first day with high U.V. levels which led to more breakdown of organic debris in the river?
…
Perhaps a local industry owner tends to illegally discharge waste into the river on public holidays because the work force are off site and there will be no one to report this?
… etc.

Lack of control of confounding variables

Now, in environmental research, as in research into teaching, we cannot control conditions in the way we can in a laboratory. We cannot ensure the temperature and wind direction and biota activity in a river are the same. Indeed, one thing about any natural environment that we can be fairly sure of is that biological activity (and so the substances released by such activity) varies seasonally, and according to changing weather conditions, and in different ways for different species.

So, as in educational research, there are often potentially confounding variables which can undermine our experiments:

In quasi-experiments or natural experiments, a more complex design than simply comparing outcome measures is needed. …this means identifying and measuring any relevant variables. …Often…there are other variables which it is recognised could have an effect, other than the dependent variable: 'confounding' variables.
Taber, 2019, p.85 [Download this article]

independent variable	class of day (busy holiday versus quiet working day)
dependent variables	concentrations of substances and organisms considered to indicate contamination
confounding variables	*anything* that might feasibly influence the level of concentrations of substances and organisms considered to indicate contamination – other than the class of day

In a controlled experiment any potential confounding variables are held at fixed levels, but in 'natural experiments' this is not possible

Read about confounding variables in research

Sufficient sampling?

The best we can do to mitigate for the lack of control is rigorous sampling. If water samples from a range of days when there was a high level of leisure activity, and a range of days when there was a low level of leisure activity, were compared, this would be more convincing than just one day from each category. Especially so if these were randomly selected days. It is still possible that factors such as wind direction and water temperature could bias findings, but it becomes less likely – and with random sampling of days it is possible to estimate how likely such chance factors are to have an effect. Then we can at least apply models that suggest whether observed differences in outcomes exceed the level likely due to chance effects.
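As a rough illustration of what such a model might look like, here is a minimal sketch of a permutation test in Python. It is not the researchers' analysis: the function name, the number of sampled days, and the concentration values are all made up for illustration. The idea is simply to ask how often a random relabelling of days as 'busy' or 'quiet' would give a difference in means at least as large as the one observed.

```python
import random
from statistics import mean


def permutation_p_value(busy, quiet, n_permutations=5_000, seed=0):
    """One-sided permutation test on the difference in means.

    Returns the observed difference (busy minus quiet) and the fraction of
    random relabellings of days that give a difference at least as large.
    """
    rng = random.Random(seed)
    observed = mean(busy) - mean(quiet)
    pooled = list(busy) + list(quiet)
    n_busy = len(busy)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the 'busy'/'quiet' labels
        diff = mean(pooled[:n_busy]) - mean(pooled[n_busy:])
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_permutations


# HYPOTHETICAL contaminant concentrations (arbitrary units), one per sampled day.
# These numbers are invented for illustration; they are not data from the study.
busy_days = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
quiet_days = [3.9, 4.2, 4.0, 4.6, 3.8, 4.1]

observed, p = permutation_p_value(busy_days, quiet_days)
print(f"several days per category: difference = {observed:.2f}, p = {p:.4f}")

# With one day per category there are only two possible labellings,
# so the test can never give a p-value much below one half.
_, p_single = permutation_p_value([5.1], [3.9])
print(f"one day per category: p = {p_single:.2f}")
```

The second call is the point of the sketch: with a single busy day and a single quiet day, no statistical model can separate an effect of recreation from an effect of whatever else happened to differ between those two days.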

Read about sampling in research

I would like to think that any educational study that had this limitation would be questioned in peer review. The Education in Chemistry article cited the original research, although I could not immediately find this. The work does not seem to have been published in a research journal (at least, not yet) but was presented at a conference, and is discussed in a video published by the American Chemical Society on YouTube.

"With Labor Day approaching, many people are preparing to go tubing and swimming at local streams and rivers. These delightful summertime activities seem innocuous, but do they have an impact on these waterways? Today, scientists report preliminary [sic] results from the first holistic study of this question ⁴, which shows that recreation can alter the chemical and microbial fingerprint of streams, but the environmental and health ramifications are not yet known."
American Chemical Society Meeting Newsroom, 2023

In the video, Noor Hamdan, of Johns Hopkins University, reports that "we are thinking of collecting more samples and doing some more statistical analysis to really, really make sure that humans are significantly impacting a stream".

This seems very wise, as it is only too easy to be satisfied with very limited data when it seems to fit with your expectations. Indeed that is one of the everyday ways of thinking that science challenges by requiring more rigorous levels of argument and evidence. In the meantime, Noor Hamdan suggests people using the water should use mineral-based rather than organic-based sunscreens, and she "recommend[s] not peeing in rivers". No, I am fairly sure 'tubing' is not meant as a euphemism for that. ⁵

Work cited:

American Chemical Society Meeting Newsroom (2023). Tubing and swimming change the chemistry and microbiome of stream. https://www.youtube.com/watch?v=4fLArTDRYuE&list=PL-qHxGvFeZV3ftwffkiRifq6E0CvXexwU&index=14
Notman, N. (2023). Fun in the sun alters water chemistry. How swimming and tubing are linked to concerning rises in water contaminants [Science research news]. Education in Chemistry, 60(6), 8.
Taber, K. S. (2019). Experimental research into teaching innovations: responding to methodological and ethical challenges. Studies in Science Education, 55(1), 69-119. https://doi.org/10.1080/03057267.2019.1658058 [Download this article]

Notes:

¹ Perhaps more correctly, smell, though it is perceived as tasting – most of the flavour we taste in food is due to volatile substances evaporating in the mouth cavity and diffusing to be detected in the nose lining.

² The largest organ of excretion for humans is the skin. The main mechanism for excreting the detected contaminating substances into the water (if perhaps not the only pertinent one, according to the researchers) was sweating. Physical exertion (such as swimming) tends to be associated with higher levels of sweating. We do not notice ourselves sweating when the sweat evaporates as fast as it is released – nor, of course, when we are immersed in water.

One of those irregular verbs?

I perspire.

You sweat.

She excretes through her skin

(Image by Sugar from Pixabay)

³ The video suggests that sampling took place both upriver and downriver of the Creek which would offer some level of control for the effect of completely independent influxes into the water – unless they occurred between the sampling points.

⁴ There seem to be plenty of studies of the effects of water quality on leisure use of waterways: but not on the effects of the recreational use of waterways on their quality.

⁵ Just in case any readers were also ignorant about this, it apparently refers to using tyre inner tubes (or similar) as floatation devices. This suggests a new line of research. People who float around in inner tubes will tend to sweat less than those actively swimming – but are potentially harmful substances leached from the inner tubes themselves?


Is your heart in the research?

Someone else's research, that is

Keith S. Taber

Imagine you have a painful and debilitating illness. Your specialist tells you there is no conventional treatment known to help. However, there is a new – experimental – procedure: a surgery that may offer relief. But it has not yet been fully tested. If you are prepared to sign up for a study to evaluate this new procedure, then you can undergo surgery.

You are put under and wheeled into the operating theatre. Whilst you experience – rather, do not experience – the deep, sleepless rest of anaesthesia, the surgeon saws through your breastbone, prises open your ribcage with a retractor (hopefully avoiding breaking any ribs), reaches in, and gently lifts up your heart.

The surgeon pauses, perhaps counts to five, then carefully replaces your heart between the lungs. The ribcage is closed, and you are sewn up without any actual medical intervention. You had been randomly assigned to the control group.

How can we test whether surgical interventions are really effective without blind controls?

Is it right to carry out sham operations on sick people just for the sake of research?

Where is the balance of interests?

(Image from Pixabay)

Research ethics

A key aspect of planning, executing and reviewing research is ethical scrutiny. Planning, obviously, needs to take into account ethical considerations and guidelines. But even the best laid plans 'of mice and men' (or, of, say, people investigating mice) may not allow for all eventualities (after all, if we knew what was going to happen for sure in a study, it would not be research – and it would be unethical to spend precious public resources on the study), so the ethical imperative does not stop once we have got approval and permissions. And even then, we may find that we cannot fully mitigate for unexpected eventualities – which is something to be reported and discussed to help inform future research.

Read about research ethics

When preparing students setting out on research, instruction about research ethics is vital. It is possible to teach about rules, and policies, and guidelines and procedures – but real research contexts are often complex, and ethical thinking cannot be algorithmic or a matter of adopting slogans and following heuristics. In my teaching I would include discussion of past cases of research studies that raised ethical questions for students to discuss and consider.

One might think that as research ethics is so important, it would be difficult to find many published studies which were not exemplars of good practice – but attitudes to, and guidance on, ethics have developed over time, and there are many past studies which, if not clearly unethical in today's terms, at least present problematic cases. (That is without the 'doublethink' that allows some contemporary researchers to, in a single paper, both claim active learning methods should be studied because it is known that passive learning activities are not effective, yet then report how they required teachers to instruct classes through passive learning to act as control groups.)

Indeed, ethical decision-making may not always be straight-forward – as it often means balancing different considerations, and at a point where any hoped-for potential benefits of the research must remain uncertain.

Pretending to operate on ill patients

I recently came across an example of a medical study which I thought raised some serious questions, and which I might well have included in my teaching of research ethics as a case for discussion, had I known about it before I retired.

The research apparently involved surgeons opening up a patient's ribcage (not a trivial procedure), and lifting out the person's heart in order to carry out a surgical intervention…or not,

"In the late 1950s and early 60s two different surgical teams, one in Kansas City and one in Seattle, did double-blind trials of a ligation procedure – the closing of a duct or tube using a clip – for very ill patients suffering from severe angina, a condition in which pain radiates from the chest to the outer extremities as a result of poor blood supply to the heart. The surgeons were not told until they arrived in the operating theatre which patients were to receive a real ligation and which were not. All the patients, whether or not they were getting the procedure, had their chest cracked open and their heart lifted out. But only half the patients actually had their arteries rerouted so that their blood could more efficiently bathe its pump …"
Slater, 2018

The quote is taken from a book by Lauren Slater which sets out a history of drug use in psychiatry. Slater is a psychotherapist who has written a number of books about aspects of mental health conditions and treatments.

Fair testing

In order to make a fair experiment, the double-blind procedure sought to treat the treatment and control group the same in all respects, apart from the actual procedure of ligation of selected blood vessels that comprised the mooted intervention. The patients did not know (at least, in one of the studies) they might not have the real operation. Their physicians were not told who was getting the treatment. Even the surgeons only found out who was in each group when the patient arrived in theatre.

It was necessary for those in the control group to think they were having an intervention, and to undergo the sham surgery, so that they formed a fair comparison with those who got the ligation.

Read about control of variables

It was necessary to have a double-blind study (neither the patients themselves, nor the physicians looking after them, were told which patients were, and which were not, getting the treatment), because there is a great deal of research which shows that people's beliefs and expectations make substantial differences to outcomes. This is a real problem in educational research when researchers want to test classroom practices such as new teaching schemes or resources or innovative pedagogies (Taber, 2019). The teacher almost certainly knows whether she is teaching the experimental or control group, and usually the students have a pretty good idea. (If every previous lesson has been based on teacher presentations and note-taking, and suddenly they are doing group discussion work and making videos, they are likely to notice.)

Read about expectancy effects

It was important to undertake a study, because there was not clear objective evidence to show whether the new procedure actually improved patient outcomes (or possibly even made matters worse). Doctors reported seeing treated patients do better – but could only guess how they might have done without surgery. Without proper studies, many thousands of people might ultimately undergo an ineffective surgery, with all the associated risks and costs, without getting any benefit.

Simply comparing treated patients with matched untreated patients would not do the job, as there can be a strong placebo effect of believing one is getting a treatment. (It is likely that at least some alternative therapies largely work because a practitioner with good social skills spends time engaging with the patient and their concerns, and the client expects a positive outcome.)

If any positive effects of heart surgery were due to the placebo effect, then perhaps a highly coloured sugar pill prescribed with confidence by a physician could have the same effect without operating theatres, surgical teams, hospital stays… (For that matter, a faith healer who pretended to operate without actually breaking the skin, and revealed a piece of material {perhaps concealed in a pocket or sleeve} presented as an extracted mass of diseased tissue or a foreign body, would be just as effective if the patient believed in the procedure.)

So, I understood the logic here.

Do no harm

All the same – this seemed an extreme intervention. Even today, anaesthesia is not very well understood in detail: it involves giving a patient drugs that could kill them in carefully controlled sub-lethal doses – when how much would actually be lethal (and what would be insufficient to fully sedate) varies from person to person. There are always risks involved.

"All the patients, whether or not they were getting the procedure had their chest cracked open and their heart lifted out."

(Image by Starllyte from Pixabay)

Open heart surgery exposes someone to infection risks. Cracking open the chest is a big deal. It can take two months for the disrupted tissues to heal. Did the research really require opening up the chest and lifting the heart for the control group?

Could this really ever have been considered ethical?

I might have been much more cynical had I not known of other, hm, questionable medical studies. I recall hearing a BBC radio documentary in the 1990s about American physicians who deliberately gave patients radioactive materials without their knowledge, just to explore the effects. Perhaps most infamously there was the Tuskegee Syphilis study where United States medical authorities followed the development of disease over decades without revealing the full nature of the study, or trying to treat any of those infected. Compared with these violations, the angina surgery research seemed tame.

But do not believe everything you read…

According to the notes at the back of Slater's book, her reference was another secondary source (Moerman, 2002) – that is someone writing about what the research reports said, not those actual 'primary' accounts in the research journals.

So, I looked on-line for the original accounts. I found a 1959 study, by a team from the University of Washington School of Medicine. They explained that:

"Considerable relief of symptoms has been reported for patient with angina pectoris subjected to bilateral ligation of the internal mammary arteries. The physiologic basis for the relief of angina afforded by this rather simple operation is not clear."
Cobb, Thomas, Dillard, Merendino & Bruce, 1959

It was not clear why clamping these blood vessels in the chest should make a substantial difference to blood flow to the heart muscles – despite various studies which had subjected a range of dogs (who were not complaining of the symptoms of angina, and did not need any surgery) to surgical interventions followed by invasive procedures in order to measure any modifications in blood flow (Blair, Roth & Zintel, 1960).

Would you like your aorta clamped, and the blood drained from the left side of your heart, for the sake of a research study?

That raises another ethical issue – the extent of pain and suffering and morbidity it is fair to inflict on non-human animals (which are never perfect models for human anatomy and physiology) to progress human medicine. Some studies explored the details of blood circulation in dogs. Would you like your aorta clamped, and the blood drained from the left side of your heart, for the sake of a research study? Moreover, in order to test the effectiveness of the ligation procedure, in some studies healthy dogs had to have the blood supply to the heart muscles disrupted to give them compromised heart function similar to that of the human angina sufferers. ¹

But, hang on a moment. I think I passed over something rather important in that last quote: "this rather simple operation"?

"Considerable relief of symptoms has been reported for patient with angina pectoris subjected to bilateral ligation of the internal mammary arteries. The physiologic basis for the relief of angina afforded by this rather simple operation is not clear."

Cobb and colleagues' account of the procedure contradicted one of my assumptions,

At the time of operation, which was performed under local anesthesia [anaesthesia], the surgeon was handed a randomly selected envelope, which contained a card instructing him whether or not to ligate the internal mammary arteries after they had been isolated.
Cobb et al, 1959
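The envelope procedure described in that quote can be sketched in a few lines of Python. This is only an illustration of the general idea of concealed random allocation: Cobb and colleagues do not say how their envelopes were prepared, so the function name, the fixed split into two groups, and the patient labels here are all assumptions.

```python
import random


def prepare_envelopes(n_patients, seed=None):
    """Return a shuffled list of instruction cards, one per patient.

    ASSUMPTION: the cards are split as evenly as possible between
    'ligate' and 'sham'; the original paper does not specify this.
    """
    rng = random.Random(seed)
    n_ligate = n_patients // 2
    cards = ["ligate"] * n_ligate + ["sham"] * (n_patients - n_ligate)
    rng.shuffle(cards)  # the order of the envelopes is left to chance
    return cards


# Hypothetical run for 17 patients (the seed is arbitrary, fixed for repeatability).
envelopes = prepare_envelopes(17, seed=1959)

# The surgeon opens one envelope per operation. Patients and the physicians
# assessing them would see only the identifiers, never the cards.
allocation = {f"patient_{i + 1:02d}": card for i, card in enumerate(envelopes)}
blinded_view = sorted(allocation)

print(blinded_view[:3])
print(envelopes.count("ligate"), "ligate;", envelopes.count("sham"), "sham")
```

The design point is that nobody chooses who gets the real procedure: neither the surgeon's judgement of who might benefit most, nor the patient's enthusiasm, can influence which group a patient ends up in.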

It seems my inference that the procedure was carried out under general anaesthetic was wrong. Never assume! Surgery under local anaesthetic is not a trivial enterprise, but carries much less risk than general anaesthetic.

Yet, surely, even back then, no surgeon was going to open up the chest and handle the heart under a local anaesthetic? Cobb and colleagues wrote:

"The surgical procedures commonly used in the therapy of coronary-artery disease have previously been "major" operations utilizing thoracotomy and accompanied by some morbidity and a definite mortality. … With the advent of internal-mammary-artery ligation and its alleged benefit, a unique opportunity for applying the principles of a double-blind evaluation to a surgical procedure has been afforded
Cobb, Thomas, Dillard, Merendino & Bruce, 1959

So, the researchers were arguing that, previously, surgical interventions for this condition were major operations that did involve opening up the chest (thorax) – thoracotomy – where sham surgery would not have been ethical; but the new procedure they were testing – "this rather simple operation" was different.

Effects of internal-mammary-artery ligation on 17 patients with angina pectoris were evaluated by a double-blind technic. Eight patients had their internal mammary arteries ligated; 9 had skin incisions only.
Cobb et al, 1959

They describe "a 'placebo' procedure consisting of parasternal skin incisions" – that is, some cuts were made into the skin next to the breastbone. Skin incisions are somewhat short of open heart surgery.

The description given by the Kansas team (from the Departments of Medicine and Surgery, University of Kansas Medical Center, Kansas City) also differs from Slater's third-hand account in this important way:

"The patients were operated on under local anesthesia. The surgeon, by random sampling, selected those in whom bilateral internal mammary artery and vein ligation (second interspace) was to be carried out and those in whom a sham procedure was to be performed. The sham procedure consisted of a similar skin incision with exposure of the internal mammary vessels, but without ligation."
Dimond, Kittle & Crocket, 1960

This description of the surgery seemed quite different from that offered by Slater.

These teams seemed to be reporting a procedure that could be carried out without exposing the lungs or the heart and opening their protective covers ("in this technique…the pericardium and pleura are not entered or disturbed", Glover, et al, 1957), and which could be superficially forged by making a few cuts into the skin.

"The performance of bilateral division of the internal mammary arteries as compared to other surgical procedures for cardiac disease is safe, simple and innocuous in capable hands."
Glover, Kitchell, Kyle, Davila & Trout, 1958

The surgery involved making cuts into the skin of the chest to access, and close off, arteries taking blood to (more superficial) chest areas in the hope it would allow more to flow to the heart muscles; the sham surgery, the placebo, involved making similar incisions, but without proceeding to change the pattern of arterial blood flow.

The sham surgery did not require general anaesthesia and involved relatively superficial wounds – and offered a research technique that did not need to cause suffering to, and the sacrifice of, perfectly healthy dogs. So, that's all ethical then?

The first-hand research reports at least give a different impression of the balance of costs and potential benefits to stakeholders than I had originally drawn from Lauren Slater's account.

Getting consent for sham surgery

A key requirement for ethical research with human participants is voluntary informed consent. Unlike dogs, humans can assent to research procedures, and it is generally considered that research should not be undertaken without such consent.

Read about voluntary informed consent

Of course, there is nuance and complication. The kind of research where investigators drop large denomination notes to test the honesty of passers-by – where the 'participants' are in a public place and will not be identified or identifiable – is not usually seen as needing such consent (which would clearly undermine any possibility of getting authentic results). But is it acceptable to observe people using public toilets without their knowledge and consent (as was described in one published study I used as a teaching example)?

The extent to which a lay person can fully understand the logic and procedures explained to them when seeking consent can vary. The extent to which most participants would need, or even want, to know full details of the study can vary. When children of various ages are involved, the extent to which consent can be given on their behalf by a parent or teacher raises interesting questions.

"I'm looking for volunteers to have a procedure designed to make it look like you've had surgery"

Image by mohamed_hassan from Pixabay

There is much nuance and many complications – and this is an area to which researchers need to give very careful consideration.

How many ill patients would volunteer for sham surgery to help someone else's research?
Would that answer change, if the procedure being tested would later be offered to them?
What about volunteering for a study where you have a 50-50 chance of getting the real surgery or the placebo treatment?

In Cobb's study, the participants had all volunteered – but we might wonder if the extent of the information they were given amounted to what was required for informed consent,

The subjects were informed of the fact that this procedure had not been proved to be of value, and yet many were aware of the enthusiastic report published in the Reader's Digest. The patients were told only that they were participating in an evaluation of this operation; they were not informed of the double-blind nature of the study.
Cobb et al, 1959

So, it seems the patients thought they were having an operation that had been mooted to help angina sufferers – and indeed some of them were, but others just got taken into surgery to get a few wounds that suggested something more substantive had been done.

Was that ethical? (I doubt it would be allowed anywhere today.)

The outcome of these studies was that although the patients getting the ligation surgery did appear to get relief from their angina – so did those just getting the skin incisions. The placebo seemed just as good as the re-plumbing.

In hindsight, does this make the studies more worthwhile and seem more ethical? This research has probably prevented a great many people having an operation to have some of their vascular system blocked when that does not seem to make any difference to angina. Does that advance in medical knowledge justify the deceit involved in leading people to think they would get an experimental surgical treatment when they might just get an experimental control treatment?

Ethical principles and guidelines can help us judge the merits of a study

Coda – what did the middle man have to say?

I wondered how a relatively minor sham procedure under local anaesthetic became characterised as "the patients, whether or not they were getting the procedure had their chest cracked open and their heart lifted out" – a description which gave a vivid impression of a major intervention.

The heart is pretty well integrated into the body – how easy is it to lift an intact, fully connected, working heart out of position?

Image by HANSUAN FABREGAS from Pixabay

I wondered to what extent it would even be possible to lift the heart out from the chest whilst it remained connected with the major vessels passing the blood it was pumping, and the nerves supplying it, and the vessels supplying blood to its own muscles (the ones that were considered compromised enough to make the treatment being tested worth considering). Some sources I found on-line referred to the heart being 'lifted' during open-heart procedures to give the surgeon access to specific sites: but that did not mean taking the heart out of the body. Having the heart 'lifted out' seemed more akin to Aztec sacrificial rites than medical treatment.

Although all surgery involves some risk, the actual procedure being investigated seemed of a relatively routine nature. I actually attended a 'minor' operation which involved cutting into the chest when my late wife was prepared for kidney dialysis. Usually a site for venous access is prepared in the arm well in advance, but it was decided my wife needed to be put on dialysis urgently. A temporary hole was cut into her neck to allow the surgeon to connect a tube (a central venous catheter) to a vein, and another hole into her chest so that the catheter would exit in her chest, where the tap could be kept sterile, bandaged to the chest. This was clearly not considered a high risk operation (which is not to say I think I could have coped with having this done to me!) as I was asked by the doctors to stay in the room with my wife during the procedure, and I did not need to 'scrub' or 'gown up'.

Bilateral internal mammary artery ligation seemed a procedure on that kind of level, accessing blood vessels through incisions made in the skin. However, if Lauren Slater had read up on some of the earlier procedures that did require opening the chest, or if she had read the papers describing how the dogs were investigated to trace blood flow through connected vessels, measure changes in flow, and prepare them for induced heart conditions, I could appreciate the potential for confusion. Yet she did not cite the primary research, but rather Daniel Moerman, an Emeritus Professor of Anthropology at University of Michigan-Dearborn, who has written a book about placebo treatments in medicine.

Moerman does write about the bilateral internal mammary artery ligation, and the two sham surgery studies I found in my search. Moerman describes the operation:

"It was quite simple, and since the arteries were not deep in the body, could be performed under local anaesthetic."
Moerman, 2002

He also refers to the subjective reports of one of the patients assigned to the placebo condition in one of the studies, who claimed to feel much better immediately after the procedure:

"This patient's arteries were not ligated…But he did have two scars on his chest…"
Moerman, 2002

But nobody cracked open his chest, and no one handled his heart.

There are still ethical issues here, but understanding the true (almost superficial) nature of the sham surgery clearly changes the balance of concerns. If there is a moral to this article, it is perhaps the importance of being fully informed before reaching judgement about the ethics of a research study.

Work cited:

Blair, C. R., Roth, R. F., & Zintel, H. A. (1960). Measurement of coronary artery blood-flow following experimental ligation of the internal mammary artery. Annals of Surgery, 152(2), 325.
Cobb, L. A., Thomas, G. I., Dillard, D. H., Merendino, K. A., & Bruce, R. A. (1959). An evaluation of internal-mammary-artery ligation by a double-blind technic. New England Journal of Medicine, 260(22), 1115-1118.
Dimond, E. G., Kittle, C. F., & Crockett, J. E. (1960). Comparison of internal mammary artery ligation and sham operation for angina pectoris. The American Journal of Cardiology, 5(4), 483-486.
Glover, R. P., Davila, J. C., Kyle, R. H., Beard, J. C., Trout, R. G., & Kitchell, J. R. (1957). Ligation of the internal mammary arteries as a means of increasing blood supply to the myocardium. Journal of Thoracic Surgery, 34(5), 661-678. https://doi.org/10.1016/S0096-5588(20)30315-9
Glover, R. P., Kitchell, J. R., Kyle, R. H., Davila, J. C., & Trout, R. G. (1958). Experiences with Myocardial Revascularization By Division of the Internal Mammary Arteries. Diseases of the Chest, 33(6), 637-657. https://doi.org/10.1378/chest.33.6.637
Moerman, D. E. (2002). Meaning, Medicine, and the "Placebo Effect". Cambridge: Cambridge University Press.
Slater, L. (2018). The Drugs that Changed our Minds: The history of psychiatry in ten treatments. London: Simon & Schuster.
Taber, K. S. (2019). Experimental research into teaching innovations: responding to methodological and ethical challenges. Studies in Science Education, 55(1), 69-119. doi:10.1080/03057267.2019.1658058

Note:

¹ To find out if the ligation procedure protected a dog required stressing the blood supply to the heart itself:

"An attempt has been made to evaluate the degree of protection preliminary ligation of the internal mammary artery may afford the experimental animal when subjected to the production of sudden, acute myocardial infarction by ligation of the anterior descending coronary artery at its origin. …

It was hoped that survival in the control group would approximate 30 per cent so that infarct size could be compared with that of the "protected" group of animals. The "protected" group of dogs were treated in the same manner but in these the internal mammary arteries were ligated immediately before, at 24 hours, and at 48 hours before ligation of the anterior descending coronary.

In 14 control dogs, the anterior descending coronary artery with the aforementioned branch to the anterolateral aspect of the left ventricle was ligated. Nine of these animals went into ventricular fibrillation and died within 5 to 20 minutes. Attempts to resuscitate them by defibrillation and massage were to no avail. Four others died within 24 hours. One dog lived 2 weeks and died in pulmonary edema."
Glover, Davila, Kyle, Beard, Trout & Kitchell, 1957

Pulmonary oedema involves fluid build up in the lungs that restricts gaseous exchange and prevents effective breathing. The dog that survived longest (if it was kept conscious) will have experienced death as if by slow suffocation or drowning.

Reflecting the population

Sampling an "exceedingly large number of students"

Keith S. Taber

the key to sampling a population is identifying a representative sample

Obtaining a representative sample of a population can be challenging
(Image by Gerd Altmann from Pixabay)

Many studies in education are 'about' an identified population (students taking A level Physics examinations; chemistry teachers in German secondary schools; children transferring from primary to secondary school in Scotland; undergraduates majoring in STEM subjects in Australia…).

Read about populations of interest in research

But, in practice, most studies only collect data from a sample of the population of interest.

Sampling the population

One of the key challenges in social research is sampling. Obtaining a sample is usually not that difficult. However, often the logic of research is something along the lines:

1. Aim – to find out about a population.
2. As it is impractical to collect data from the whole population, collect data from a sample.
3. Analyse data collected from the sample.
4. Draw inferences about the population from the analysis of data collected from the sample.

For example, if one wished to do research into the views of school teachers in England and there are, say, 600 000 of them, it is unlikely anyone could undertake research that collected and analysed data from all of them and produce results in a short enough period for the findings to still be valid (unless they were prepared to employ a research team of thousands!). But perhaps one could collect data from a sample that would be informative about the population.

This can be a reasonable approach (and, indeed, is a very common approach in research in areas like education) but relies on the assumption that what is true of the sample, can be generalised to the population.

That clearly depends on the sample being representative of the larger population (at least in those ways which are pertinent to the research).

When a study (as here in the figure an experiment) collects data from a sample drawn at random from a wider population, then the findings of the experiment can be assumed to apply (on average) to the population. (Figure from Taber, 2019.) In practice, unless a population of interest is quite modest in size (e.g., teachers in one school; post-graduate students in one university department; registered members of a society) it is usually simply not feasible to obtain a random sample.

For example, if we were interested in secondary school students in England, and we had a sample of secondary students from England that (a) reflected the age profile of the population; (b) reflected the gender profile of the population; but (c) were all drawn from one secondary school, this is unlikely to be a representative sample.

If we do have a representative sample, then the likely error in generalising from sample to population can be calculated (and can be reduced by having a larger sample);
If we do not have a representative sample, then there is no way of knowing how well the findings from the sample reflect the wider population and increasing sample size does not really help; and, for that matter,
If we do not know whether we have a representative sample, then, again, there is no way of knowing how well the findings from the sample reflect the wider population and increasing sample size does not really help.

So, the key to sampling a population is identifying a representative sample.
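The first of these points can be made concrete with a minimal sketch of how the likely error shrinks as a representative sample grows. It assumes a simple random sample and a yes/no survey item, and it uses the usual normal-approximation margin of error for a proportion. The function name and the figures (half of respondents agreeing, samples of 100 to 10 000) are illustrative assumptions, not values taken from any study discussed here.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample drawn from a much larger population
    (normal approximation; z=1.96 corresponds to roughly 95% confidence).
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical example: 50% of the sampled teachers agree with a statement.
for n in (100, 400, 1600, 10000):
    print(n, round(margin_of_error(0.5, n), 3))
```

Under these assumptions, quadrupling the sample size halves the margin of error. The calculation says nothing useful, however, if the sample was not representative in the first place.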

Read about sampling a population

If we know that only a small number of factors are relevant to the research then we may (if we are able to characterise members of the population on these criteria) be able to design a sample which is representative based on those features which are important.

If the relevant factors for a study were teaching subject; years of teaching experience; teacher gender, then we would want to build a sample that fitted the population profile accordingly, so, maybe, 3% female maths teachers with 10+ years of teaching experience, et cetera. We would need suitable demographic information about the population to inform the building of the sample.

We can then randomly select from those members of the population with the right characteristics within the different 'cells'.
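As a rough illustration of building such a sample, the sketch below groups a population into 'cells' defined by the relevant characteristics and then selects randomly within each cell, in proportion to the cell's share of the population. The function name, the teacher records, and the category labels are all hypothetical, invented purely for the example.

```python
import random
from collections import defaultdict

def stratified_sample(population, keys, sample_size, seed=None):
    """Draw a proportionally allocated stratified random sample.

    population  : list of dicts, one per member of the population
    keys        : the characteristics that define the 'cells'
    sample_size : intended total size of the sample
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for member in population:
        cells[tuple(member[k] for k in keys)].append(member)

    sample = []
    for members in cells.values():
        # Each cell contributes in proportion to its share of the population
        # (rounding means the total may differ slightly from sample_size).
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical population of 1000 teachers, invented for illustration only.
rng = random.Random(0)
teachers = [
    {
        "subject": rng.choice(["maths", "science", "English"]),
        "gender": rng.choice(["female", "male"]),
        "experience": rng.choice(["<10 years", "10+ years"]),
    }
    for _ in range(1000)
]
sample = stratified_sample(
    teachers, ["subject", "gender", "experience"], 100, seed=1
)
print(len(sample))
```

The design only works if suitable demographic information is available for every member of the population, which is exactly the requirement noted above.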

However, if we do not know exactly what specific features might be relevant to characterise a population in a particular research project, the best we might be able to do is to employ a randomly chosen sample which at least allows the measurement error to be estimated.

Labs for exceedingly large numbers of students

Leopold and Smith (2020) were interested in the use of collaborative group work in a "general chemistry, problem-based lab course" at a United States university, where students worked in fixed groups of three or four throughout the course. As well as using group work for more principled reasons, "group work is also utilized as a way to manage exceedingly large numbers of students and efficiently allocate limited time, space, and equipment" (p.1). They tell readers that

"the case we examine here is a general chemistry, problem-based lab course that enrols approximately 3500 students each academic year"
Leopold & Smith, 2020, p.5

Although they recognised a wide range of potential benefits of collaborative work, these depend upon students being able to work effectively in groups, which requires skills that cannot be taken for granted. Leopold and Smith report how structured support was put in place that helped students diagnose impediments to the effective work of their groups – and they investigated this in their study.

The data collected was of two types. There was a course evaluation at the end of the year taken by all the students in the cohort, "795 students enrolled [in] the general chemistry I lab course during the spring 2019 semester" (p.7). However, they also collected data from a sample of student groups during the course, in terms of responses to group tasks designed to help them think about and develop their group work.

Population and sample

As the focus of their research was a specific course, the population of interest was the cohort of undergraduates taking the course. Given the large number of students involved, they collected qualitative data from a sample of the groups.

Units of analysis

The course evaluation questions sought individual learners' views so for that data the unit of analysis was the individual student. However, the groups were tasked with working as a group to improve their effectiveness in collaborative learning. So, in Leopold and Smith's sample of groups, the unit of analysis was the group. Some data were received from individual group members, and other data were submitted as group responses: but the analysis was on the basis of responses from within the specific groups in the sample.

A stratified sample

Leopold and Smith explained that

"We applied a stratified random sampling scheme in order to account for variations across lab sections such as implementation fidelity and instructor approach so as to gain as representative a sample as possible. We stratified by individual instructors teaching the course which included undergraduate teaching assistants (TAs), graduate TAs, and teaching specialists. One student group from each instructor's lab sections was randomly selected. During spring 2019, we had 19 unique instructors teaching the course therefore we selected 19 groups, for a total of 76 students."
Leopold & Smith, 2020, p.7

The paper does not report how the random selection was made – how it was decided which group would be selected for each instructor. As any competent scientist ought to be able to make a random selection quite easily in this situation, this is perhaps not a serious omission. I mention this because sadly not all authors who report having used randomisation can support this when asked how (Taber, 2013).
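To show how simple such a selection can be, here is one possible sketch. It is not the authors' procedure, which the paper does not describe. The instructor and group labels, the number of groups per instructor (ten), and the seed are all hypothetical. Only the figure of 19 instructors comes from the paper.

```python
import random

def select_one_group_per_instructor(groups_by_instructor, seed=None):
    """Randomly pick one student group from each instructor's lab sections.

    Recording the seed makes the selection reproducible, so the authors
    could later show exactly how the selection was made.
    """
    rng = random.Random(seed)
    return {
        instructor: rng.choice(groups)
        for instructor, groups in sorted(groups_by_instructor.items())
    }

# Hypothetical labels: 19 instructors, each assumed to teach 10 groups.
groups_by_instructor = {
    f"instructor_{i:02d}": [f"instructor_{i:02d}-group_{g}" for g in range(1, 11)]
    for i in range(1, 20)
}
selected = select_one_group_per_instructor(groups_by_instructor, seed=2019)
print(len(selected))  # one group for each of the 19 instructors
```

Reporting even this much (the method and the seed) would let a reader check that the selection was random.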

Was the sample representative?

Leopold and Smith found that, based on their sample, student groups could diagnose impediments to effective group working, and could often put in place effective strategies to increase their effectiveness.

We might wonder if the sample was representative of the wider population. If the groups were randomly selected in the way claimed then one would expect this would probably be the case – only 'probably', as that is the best randomisation and statistics can do – we can never know for certain that a random sample is representative, only that it is unlikely to be especially unrepresentative!
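The 'only probably' can be illustrated with a small simulation. The sketch below repeatedly draws random samples from a made-up cohort and counts how often the sample misrepresents the cohort by more than a chosen tolerance. The cohort size (795) and sample size (76) echo the numbers reported in the paper, but the 60% figure and the 15-point tolerance are invented. The simulation also samples individual students, a simplification of the group-based sampling Leopold and Smith describe.

```python
import random

def share_of_unrepresentative_samples(population, sample_size, tolerance,
                                      trials=2000, seed=0):
    """Estimate how often a random sample misses the population proportion
    by more than `tolerance` (population is a list of 0/1 values)."""
    rng = random.Random(seed)
    true_share = sum(population) / len(population)
    misses = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # without replacement
        if abs(sum(sample) / sample_size - true_share) > tolerance:
            misses += 1
    return misses / trials

# Hypothetical cohort: 795 students, 60% of whom hold some view (1 = yes).
cohort = [1] * 477 + [0] * 318
print(share_of_unrepresentative_samples(cohort, sample_size=76, tolerance=0.15))
```

In a run like this most random samples land close to the true figure, while an occasional one may not. That is the sense in which a random sample is unlikely, but never guaranteed, to be representative.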

The only way to know for sure that a sample is genuinely representative of the population of interest in relation to the specific focus of a study, would be to collect data from the whole population and check the sample data matches the population data.* But, of course, if it was feasible to collect data from everyone in the population, there would be no need to sample in the first place.

However, because the end of course evaluation was taken by all students in the cohort (the study population) Leopold and Smith were able to see if those students in the sample responded in ways that were generally in line with the population as a whole. The two figures reproduced here seem to suggest they did!

Figure 1 from Leopold & Smith, 2020, p.10, which is published with a Creative Commons Attribution (CC BY) license allowing reproduction.

Figure 2 from Leopold & Smith, 2020, p.10, which is published with a Creative Commons Attribution (CC BY) license allowing reproduction.

There is clearly a pretty good match here. However, it is important to not over-interpret this data. The questions in the evaluation related to the overall experience of group working, whereas the qualitative data analysed from the sample related to the more specific issues of diagnosing and addressing issues in the working of groups. These are related matters but not identical, and we cannot assume that the very strong similarity between sample and population outcomes in the survey demonstrates (or proves!) that the analysis of data from the sample is also so closely representative of what would have been obtained if all the groups had been included in the data collection.

	Experiences of learning through group-work	Learning to work more effectively in groups
Sample	patterns in data closely reflected population responses	data only collected from a sample of groups
Population	all invited to provide feedback	[it seems reasonable to assume results from sample are likely to apply to the cohort as a whole]

The similarity of the feedback given by students in the sample of groups to the overall cohort responses suggests that the sample was broadly representative of the overall population in terms of developing group-work skills and practices.

It might well have been, but we cannot know for sure. (* The only way to know for sure that a sample is genuinely representative of the population of interest in relation to the specific focus of a study, would be …)

However, the way the sample so strongly reflected the population in relation to the evaluation data, shows that in that (related if not identical) respect at least the sample is strongly representative, and that is very likely to give readers confidence in the sampling procedure used. If this had been my study I would have been pretty pleased with this, at least strongly suggestive, circumstantial evidence of the representativeness of the sampling of the student groups.

Work cited:

Leopold, H., & Smith, A. (2020). Implementing Reflective Group Work Activities in a Large Chemistry Lab to Support Collaborative Learning. Education Sciences, 10(1), 7. https://www.mdpi.com/2227-7102/10/1/7
Taber, K. S. (2013). Non-random thoughts about research. Chemistry Education Research and Practice, 14(4), 359-362. doi:10.1039/c3rp90009f
Taber, K. S. (2019). Experimental research into teaching innovations: responding to methodological and ethical challenges. Studies in Science Education, 55(1), 69-119. doi:10.1080/03057267.2019.1658058

Delusions of educational impact

A 'peer-reviewed' study claims to improve academic performance by purifying the souls of students suffering from hallucinations

Keith S. Taber

The research design is completely inadequate…the whole paper is confused…the methodology seems incongruous…there is an inconsistency…nowhere is the population of interest actually identified…No explanation of the discrepancy is provided…results of this analysis are not reported…the 'interview' technique used in the study is highly inadequate…There is a conceptual problem here…neither the validity nor reliability can be judged…the statistic could not apply…the result is not reported…approach is completely inappropriate…these tables are not consistent…the evidence is inconclusive…no evidence to demonstrate the assumed mechanism…totally unsupported claims…confusion of recommendations with findings…unwarranted generalisation…the analysis that is provided is useless…the research design is simply inadequate…no control condition…such a conclusion is irresponsible
Some issues missed in peer review for a paper in the European Journal of Education and Pedagogy

An invitation to publish without regard to quality?

I received an email from an open-access journal called the European Journal of Education and Pedagogy, with the subject heading 'Publish Fast and Pay Less' which immediately triggered the thought "another predatory journal?" Predatory journals publish submissions for a fee, but do not offer the editorial and production standards expected of serious research journals. In particular, they publish material which clearly falls short of rigorous research despite usually claiming to engage in peer review.

A peer reviewed journal?

Checking out the website, I found the usual assurances that the journal used rigorous peer review:

"The process of reviewing is considered critical to establishing a reliable body of research and knowledge. The review process aims to make authors meet the standards of their discipline, and of science in general.

We use a double-blind system for peer-reviewing; both reviewers and authors' identities remain anonymous to each other. The paper will be peer-reviewed by two or three experts; one is an editorial staff and the other two are external reviewers."
https://www.ej-edu.org/index.php/ejedu/about

Peer review is critical to the scientific process. Work is only published in (serious) research journals when it has been scrutinised by experts in the relevant field, and any issues raised responded to in terms of revisions sufficient to satisfy the editor.

I could not find who the editor(-in-chief) was, but the 'editorial team' of European Journal of Education and Pedagogy were listed as

Bea Tomsic Amon, University of Ljubljana, Slovenia
Chunfang Zhou, University of Southern Denmark, Denmark
Gabriel Julien, University of Sheffield, UK
Intakhab Khan, King Abdulaziz University, Saudi Arabia
Mustafa Kayıhan Erbaş, Aksaray University, Turkey
Panagiotis J. Stamatis, University of the Aegean, Greece

I decided to look up the editor based in England where I am also based but could not find a web presence for him at the University of Sheffield. Using the ORCID (Open Researcher and Contributor ID) provided on the journal website I found his ORCID biography places him at the University of the West Indies and makes no mention of Sheffield.

If the European Journal of Education and Pedagogy is organised like a serious research journal, then each submission is handled by one of this editorial team. However the reference to "editorial staff" might well imply that, like some other predatory journals I have been approached by (e.g., Are you still with us, Doctor Wu?), the editorial work is actually carried out by office staff, not qualified experts in the field.

That would certainly help explain the publication, in this 'peer-reviewed research journal', of the first paper that piqued my interest enough to motivate me to access and read the text.

The Effects of Using the Tazkiyatun Nafs Module on the Academic Achievement of Students with Hallucinations

The abstract of the paper published in what claims to be a peer-reviewed research journal

The paper initially attracted my attention because it seemed to be about treatment of a medical condition, so I wondered what it was doing in an education journal. Yet, the paper seemed to also be about an intervention to improve academic performance. As I read the paper, I found a number of flaws and issues (some very obvious, some quite serious) that should have been spotted by any qualified reviewer or editor, and which should have indicated that possible publication should have been deferred until these matters were satisfactorily addressed.

This is especially worrying as this paper makes claims relating to the effective treatment of a symptom of potentially serious, even critical, medical conditions through religious education ("a spiritual approach", p.50): claims that might encourage sufferers to defer seeking medical diagnosis and treatment. Moreover, these are claims that are not supported by any evidence presented in this paper that the editor of the European Journal of Education and Pedagogy decided was suitable for publication.

An overview of what is demonstrated, and what is claimed, in the study.

Limitations of peer review

Peer review is not a perfect process: it relies on busy human beings spending time on additional (unpaid) work, and it is only effective if suitable experts can be found that fit with, and are prepared to review, a submission. It is also generally more challenging in the social sciences than in the natural sciences. ¹

That said, one sometimes finds papers published in predatory journals where one would expect any intelligent person with a basic education to notice problems without needing any specialist knowledge at all. The study I discuss here is a case in point.

Purpose of the study

Under the heading 'research objectives', the reader is told,

"In general, this journal [article?] attempts to review the construction and testing of Tazkiyatun Nafs [a Soul Purification intervention] to overcome the problem of hallucinatory disorders in student learning in secondary schools. The general objective of this study is to identify the symptoms of hallucinations caused by subtle beings such as jinn and devils among students who are the cause of disruption in learning as well as find solutions to these problems.

Meanwhile, the specific objective of this study is to determine the effect of the use of Tazkiyatun Nafs module on the academic achievement of students with hallucinations.

To achieve the aims and objectives of the study, the researcher will get answers to the following research questions [sic]:

Is it possible to determine the effect of the use of the Tazkiyatun Nafs module on the academic achievement of students with hallucinations?"
Awang, 2022, p.42

I think I can save readers a lot of time regarding the research question by suggesting that, in this study, at least, the answer is no – if only because the research design is completely inadequate to answer the research question. (I should point out that the author comes to the opposite conclusion: e.g., "the approach taken in this study using the Tazkiyatun Nafs module is very suitable for overcoming the problem of this hallucinatory disorder", p.49.)

Indeed, the whole paper is confused in terms of what it is setting out to do, what it actually reports, and what might be concluded. As one example, the general objective of identifying "the symptoms of hallucinations caused by subtle beings such as jinn and devils" (but surely, the hallucinations are the symptoms here?) seems to have been forgotten, or, at least, does not seem to be addressed in the paper. ²

The study assumes that hallucinations are caused by subtle beings such as jinn and devils possessing the students.
(Image by Tünde from Pixabay)

Methodology

So, this seems to be an intervention study.

Some students suffer from hallucinations.
This is detrimental to their education.
It is hypothesised that the hallucinations are caused by supernatural spirits ("subtle beings that lead to hallucinations"), so, a soul purification module might counter this detriment;
if so, sufferers engaging with the soul purification module should improve their academic performance;
and so the effect of the module is being tested in the study.

Thus we have a kind of experimental study?

No, not according to the author. Indeed, the study only reports data from a small number of unrepresentative individuals with no controls,

"The study design is a case study design that is a qualitative study in nature. This study uses a case study design that is a study that will apply treatment to the study subject to determine the effectiveness of the use of the planned modules and study variables measured many times to obtain accurate and original study results. This study was conducted on hallucination disorders [students suffering from hallucination disorders?] to determine the effectiveness of the Tazkiyatun Nafs module in terms of aspects of student academic achievement."
Awang, 2022, p.42

Case study?

So, the author sees this as a case study. Research methodologies are better understood as clusters of similar approaches rather than unitary categories – but case study is generally seen as naturalistic, rather than involving an intervention by an external researcher. So, case study seems incongruous here. Case study involves the detailed exploration of an instance (of something of interest – a lesson, a school, a course of study, a textbook, …) reported with 'thick description'.

Read about the characteristics of case study research

The case is usually a complex phenomenon which is embedded within a context from which it cannot readily be untangled (for example, a lesson always takes place within a wider context of a teacher working over time with a class on a course of study, within a curricular, and institutional, and wider cultural, context, all of which influence the nature of the specific lesson). So, due to the complex and embedded nature of cases, they are all unique.

"a case study is a study that is full of thoroughness and complex to know and understand an issue or case studied…this case study is used to gain a deep understanding of an issue or situation in depth and to understand the situation of the people who experience it"
Awang, 2022, p.42

A case is usually selected either because that case is of special importance to the researcher (an intrinsic case study – e.g., I studied this school because it is the one I was working in) or because we hope this (unique) case can tell us something about similar (but certainly not identical) other (also unique) cases. In the latter case [sic], an instrumental case study, we are always limited by the extent we might expect to be able to generalise beyond the case.

This limited generalisation might suggest we should not work with a single case, but rather look for a suitably representative sample of all cases: but we sometimes choose case study because the complexity of the phenomena suggests we need to use extensive, detailed data collection and analyses to understand the complexity and subtlety of any case. That is (i.e., the compromise we choose is), we decide we will look at one case in depth because that will at least give us insight into the case, whereas a survey of many cases will inevitably be too superficial to offer any useful insights.

So how does Awang select the case for this case study?

"This study is a case study of hallucinatory disorders. Therefore, the technique of purposive sampling (purposive sampling [sic]) is chosen so that the selection of the sample can really give a true picture of the information to be explored ….

Among the important steps in a research study is the identification of populations and samples. The large group in which the sample is selected is termed the population. A sample is a small number of the population identified and made the respondents of the study. A case or sample of n = 1 was once used to define a patient with a disease, an object or concept, a jury decision, a community, or a country, a case study involves the collection of data from only one research participant…"
Awang, 2022, p.42

Of course, a case study of "a community, or a country" – or of a school, or a lesson, or a professional development programme, or a school leadership team, or a homework policy, or an enrichment activity, or … – would almost certainly be inadequate if it was limited to "the collection of data from only one research participant"!

I do not think this study actually is "a case study of hallucinatory disorders [sic]". Leaving aside the shift from singular ("a case study") to plural ("disorders"), the research does not investigate a/some hallucinatory disorders, but the effect of a soul purification module on academic performance. (Actually, spoiler alert 😉, it does not actually investigate the effect of a soul purification module on academic performance either, but the author seems to think it does.)

If this is a case study, there should be the selection of a case, not a sample. Sometimes we do sample within a case in case study, but only from those identified as part of the case. (For example, if the case was a year group in a school, we may not have resources to interact in depth with several hundred different students.) Perhaps this is pedantry as the reader likely knows what Awang meant by 'sample' in the paper – but semantics is important in research writing: a sample is chosen to represent a population, whereas the choice of case study is an acknowledgement that generalisation back to a population is not being claimed.

However, if "among the important steps in a research study is the identification of populations" then it is odd that nowhere in the paper is the population of interest actually specified!

Things slip our minds. Perhaps Awang intended to define the population, forgot, and then missed this when checking the text – but, hey, that is just the kind of thing the reviewers and editor are meant to notice! Otherwise this looks very like including material from standard research texts to pay lip-service to the idea that research-design needs to be principled, but without really appreciating what the phrases used actually mean. This impression is also given by the descriptions of how data (for example, from interviews) were analysed – but which are not reflected at all in the results section of the paper. (I am not accusing Awang of this, but because of the poor standard of peer review not raising the question, the author is left vulnerable to such an evaluation.)

The only one research participant?

So, what do we know about the "case or sample of n = 1 ", the "only one research participant" in this study?

"The actual respondents in this case study related to hallucinatory disorders were five high school students. The supportive respondents in the case study related to hallucination disorders were five counseling teachers and five parents or guardians of students who were the actual respondents."
Awang, 2022, p.42

It is certainly not impossible that a case could comprise a group of five people – as long as those five make up a naturally bounded group – that is, a group that a reasonable person would recognise as existing as a coherent entity as they clearly had something in common (they were in the same school class, for example; they were attending the same group therapy session, perhaps; they were a friendship group; they were members of the same extended family diagnosed with hallucinatory disorders…something!) There is no indication here of how these five make up a case.

The identification of the participants as a case might have made sense had the participants collectively undertaken the module as a group, but the reader is told: "This study is in the form of a case study. Each practice and activity in the module are done individually" (p.50). Another justification could have been if the module had been offered in one school, and these five participants were the students enrolled in the programme at that time but as "analysis of the respondents' academic performance was conducted after the academic data of all respondents were obtained from the respective respondent's school" (p.45) it seems they did not attend a single school.

The results tables and reports in the text refer to "respondent 1" to "respondent 4". In case study, an approach which recognises the individuality and inherent value of the particular case, we would usually assign assumed names to research participants, not numbers. But if we are going to use numbers, should there not be a respondent 5?

The other one research participant?

It seems that there is something odd here.

Both the passage above, and the abstract refer to five respondents. The results report on four. So what is going on? No explanation of the discrepancy is provided. Perhaps:

There only ever were four participants, and the author made a mistake in counting.
There only ever were four participants, and the author made a typographical mistake (well, strictly, six typographical mistakes) in drafting the paper, and then missed this in checking the manuscript.
There were five respondents and the author forgot to include data on respondent 5 purely by accident.
There were five respondents, but the author decided not to report on the fifth deliberately for a reason that is not revealed (perhaps the results did not fit with the desired outcome?)

The significant point is not that there is an inconsistency but that this error was missed by peer reviewers and the editor – if there ever was any genuine peer review. This is the kind of mistake that a school child could spot – so, how is it possible that 'expert reviewers' and 'editorial staff' either did not notice it, or did not think it important enough to query?

Research instruments

Another section of the paper reports the instrumentation used in the study.

"The research instruments for this study were Takziyatun Nafs modules, interview questions, and academic document analysis. All these instruments were prepared by the researcher and tested for validity and reliability before being administered to the selected study sample [sic, case?]."
Awang, 2022, p.42

Of course, it is important to test instruments for validity and reliability (or perhaps authenticity and trustworthiness when collecting qualitative data). But it is also important

to tell the reader how you did this
to report the outcomes

which seems to be missing (apart from in regard to part of the implemented module – see below). That is, the reader of a research study wants evidence not simply promises. Simply telling readers you did this is a bit like meeting a stranger who tells you that you can trust them because they (i.e., say that they) are honest.

Later the reader is told that

"Semi- structured interview questions will be [sic, not 'were'?] developed and validated for the purpose of identifying the causes and effects of hallucinations among these secondary school students…

…this interview process will be [sic, not 'was'] conducted continuously [sic!] with respondents to get a clear and specific picture of the problem of hallucinations and to find the best solution to overcome this disorder using Islamic medical approaches that have been planned in this study
Awang, 2022, pp.43-44

At the very least, this seems to confuse the plan for the research with a report of what was done. (But again, apparently, the reviewers and editorial staff did not think this needed addressing.) This is also confusing as it is not clear how this aspect of the study relates to the intervention. Were the interviews carried out before the intervention to help inform the design of the modules? (Presumably not, as they had already been "tested for validity and reliability before being administered to the selected study sample".) Perhaps there are clear and simple answers to such questions – but the reader will not know because the reviewers and editor did not seem to feel they needed to be posed.

If "Interviews are the main research instrument in this study" (p.43), then one would expect to see examples of the interview schedules – but these are not presented. The paper reports a complex process for analysing interview data, but this is not reflected in the findings reported. The reader is told that the six-stage process leads to the identification and refinement of main and sub-categories. Yet, these categories are not reported in the paper. (But, again, peer reviewers and the editor did not apparently raise this as something to be corrected.) More generally "data analysis used thematic analysis methods" (p.44), so why is there no analysis presented in terms of themes? The results of this analysis are simply not reported.

The reader is told that

"This interview method…aims to determine the respondents' perspectives, as well as look at the respondents' thoughts on their views on the issues studied in this study."
Awang, 2022, p.44

But there is no discussion of participants' perspectives and views in the findings of the study. ² Did the peer reviewers and editor not think this needed addressing before publication?

Even more significantly, in a qualitative study where interviews are supposedly the main research instrument, one would expect to see extracts from the interviews presented as part of the findings to support and exemplify claims being made: yet, there are none. (Did this not strike the peer reviewers and editor as odd: presumably they are familiar with the norms of qualitative research?)

The only quotation from the qualitative data (in this 'qualitative' study) I can find appears in the implications section of the paper:

"Are you aware of the importance of education to you? Realize. Is that lesson really important? Important. The success of the student depends on the lessons in school right or not? That's right"
Respondent 3: Awang, 2022, p.49

This seems a little bizarre, if we accept this is, as reported, an utterance from one of the students, Respondent 3. It becomes more sensible if this is actually condensed dialogue:

"Are you aware of the importance of education to you?"

"Realize."

"Is that lesson really important?"

"Important."

"The success of the student depends on the lessons in school right or not?"

"That's right"

It seems the peer review process did not lead to suggesting that the material should be formatted according to the norms for presenting dialogue in scholarly texts by indicating turns. In any case, if that is typical of the 'interview' technique used in the study then it is highly inadequate, as clearly the interviewer is leading the respondent, and this is more an example of indoctrination than open-ended enquiry.

Random sampling of data

Completely incongruous with the description of the purposeful selection of the participants for a case study is the account of how the assessment data was selected for analysis:

"The process of analysis of student achievement documents is carried out randomly by taking the results of current examinations that have passed such as the initial examination of the current year or the year before which is closest to the time of the study."
Awang, 2022, p.44

Did the peer reviewers or editor not question the use of the term random here? It is unclear what is meant by 'random' here, but clearly if the analysis was based on randomly selected data that would undermine the results.

Validating the soul purification module

There is also a conceptual problem here. The Takziyatun Nafs modules are the intervention materials (part of what is being studied) – so they cannot also be research instruments (used to study them). Surely, if the Takziyatun Nafs modules had been shown to be valid and reliable before carrying out the reported study, as suggested here, then the study would not be needed to evaluate their effectiveness. But, presumably, expert peer reviewers (if there really were any) did not see an issue here.

The reliability of the intervention module

The Takziyatun Nafs modules had three components, and the author reports the second of the three was subjected to tests of validity and reliability. It seems that Awang thinks that this demonstrates the validity and reliability of the complete intervention,

"The second part of this module will go through [sic] the process of obtaining the validity and reliability of the module. Proses [sic] to obtain this validity, a questionnaire was constructed to test the validity of this module. The appointed specialists are psychologists, modern physicians (psychiatrists), religious specialists, and alternative medicine specialists. The validity of the module is identified from the aspects of content, sessions, and activities of the Tazkiyatun Nafs module. While to obtain the value of the reliability coefficient, Cronbach's alpha coefficient method was used. To obtain this Cronbach's alpha coefficient, a pilot test was conducted on 50 students who were randomly selected to test the reliability of this module to be conducted."
Awang, 2022, pp.43-44

Now to unpack this, it may be helpful to briefly outline what the intervention involved (as the paper is open access anyone can access and read the full details in the report).

From the MGM film 'A Night at the Opera' (1935): "The introduction of the module will elaborate on the introduction, rationale, and objectives of this module introduced"

The description does not start off very helpfully ("The introduction of the module will elaborate on the introduction, rationale, and objectives of this module introduced" (p.43) put me in mind of the Marx brothers: "The party of the first part shall be known in this contract as the party of the first part"), but some key points are,

"the Tazkiyatun Nafs module was constructed to purify the heart of each respondent leading to the healing of hallucinatory disorders. This liver purification process is done in stages…

"the process of cleansing the patient's soul will be done …all the subtle beings in the patient will be expelled and cleaned and the remnants of the subtle beings in the patient will be removed and washed…

The second process is the process of strengthening and the process of purification of the soul or heart of the patient …All the mazmumah (evil qualities) that are in the heart must be discarded…

The third process is the process of enrichment and the process of distillation of the heart and the practices performed. In this process, there will be an evaluation of the practices performed by the patient as well as the process to ensure that the patient is always clean from all the disturbances and disturbances [sic] of subtle beings to ensure that students will always be healthy and clean from such disturbances…
Awang, 2022, p.45, p.43

Quite how this process of exorcising and distilling and cleansing will occur is not entirely clear (and if the soul is equated with the heart, how is the liver involved?), but it seems to involve reflection and prayer and contemplation of scripture – certainly a very personal and therapeutic process.

And yet its validity and reliability were tested by giving a questionnaire to 50 students randomly selected (from the unspecified population, presumably)? No information is given on how a random selection was made (Taber, 2013) – which allows a reader to be very sceptical that this actually was a random sample from the (un?)identified population, and not just an arbitrary sample of 50 students. (So, that is twice the word 'random' is used in the paper when it seems inappropriate.)

It hardly matters here, as clearly neither the validity nor the reliability of a spiritual therapy can be judged from a questionnaire (especially when administered to people who have never undertaken the therapy). In any case, the "reliability coefficient" obtained from an administration of a questionnaire ONLY applies to that sample on that occasion. So, the statistic could not apply to the four participants in the study. And, in any case, the result is not reported, so the reader has no idea what the value of Cronbach's alpha was (but then, this was described as a qualitative study!)

Moreover, Cronbach's alpha only indicates the internal coherence of the items on a scale (Taber, 2019): so, it only indicates whether the set of questions included in the questionnaire seem to be accessing the same underlying construct in motivating the responses of those surveyed across the set of items. It gives no information about the reliability of the instrument (i.e., whether it would give the same results on another occasion).
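To make the point concrete, here is a minimal sketch of how Cronbach's alpha is usually calculated, using the standard formula α = (k/(k−1)) × (1 − Σ item variances / variance of total scores). The function name and the two small response sets are invented purely for illustration; they are not data from Awang's paper, which does not report any. Note that the only inputs are the item scores from a single administration to a single sample: nothing in the calculation refers to a second occasion, or to whether an intervention works.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one administration of a multi-item scale.

    scores: one row per respondent, each row holding that respondent's
    score on each of the k items (hypothetical layout for this sketch).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Invented responses: every respondent answers all three items identically,
# so the items are behaving as one coherent scale in this sample.
coherent_items = [
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3],
    [4, 4, 4],
]

# Invented responses: the same item scores shuffled between respondents,
# so the items no longer vary together.
incoherent_items = [
    [1, 3, 2],
    [2, 1, 4],
    [3, 4, 1],
    [4, 2, 3],
]

alpha_coherent = cronbach_alpha(coherent_items)
alpha_incoherent = cronbach_alpha(incoherent_items)
print(alpha_coherent, alpha_incoherent)
```

The first set gives a much higher alpha than the second, although each item has exactly the same spread of scores in both. The statistic reflects only how far the items vary together in these particular responses, which is why it is specific to the sample and occasion concerned.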

This approach to testing validity and reliability is then completely inappropriate and unhelpful. So, even if the outcomes of the testing had been reported (and they are not) they would not offer any relevant evidence. Yet it seems that peer reviewers and editor did not think to question why this section was included in the paper.

Ethical issues

A study of this kind raises ethical issues. It may well be that the research was carried out in an entirely proper and ethical manner, but it is usual in studies with human participants ('human subjects') to make this clear in the published report (Taber, 2014b). A standard issue is whether the participants gave voluntary, informed, consent. This would mean that they were given sufficient information about the study at the outset to be able to decide if they wished to participate, and were under no undue pressure to do so. The 'respondents' were school students: if they were considered minors in the research context (and oddly for a 'case study' such basic details as age and gender are not reported) then parental permission would also be needed, again subject to sufficient briefing and no duress.

However, in this specific research there are also further issues due to the nature of the study. The participants were subject to medical disorders, so how did the researcher obtain information about, and access to, the students without medical confidentiality being broken? Who were the 'gatekeepers' who provided access to the children and their personal data? The researcher also obtained assessment data "from the class teacher or from the Student Affairs section of the student's school" (p.44), so it is important to know that students (and parents/guardians) consented to this. Again, peer review does not seem to have identified this as an issue to address before publication.

There is also the major underlying question about the ethics of a study when recognising that these students were (or could be, as details are not provided) suffering from serious medical conditions, but employing religious education as a treatment ("This method of treatment is to help respondents who suffer from hallucinations caused by demons or subtle beings", p.44). Part of the theoretical framework underpinning the study is the assumption that what is being addressed is "the problem of hallucinations caused by the presence of ethereal beings…" (p.43) yet it is also acknowledged that,

"Hallucinatory disorders in learning that will be emphasized in this study are due to several problems that have been identified in several schools in Malaysia. Such disorders are psychological, environmental, cultural, and sociological disorders. Psychological disorders such as hallucinatory disorders can lead to a more critical effect of bringing a person prone to Schizophrenia. Psychological disorders such as emotional disorders and psychiatric disorders. …Among the causes of emotional disorders among students are the school environment, events in the family, family influence, peer influence, teacher actions, and others."
Awang, 2022, p.41

There seem to be three ways of understanding this apparent discrepancy, which I might gloss:

there are many causes of conditions that involve hallucinations, including, but not only, possession by evil or mischievous spirits;
the conditions that lead to young people having hallucinations may be understood at two complementary levels, at a spiritual level in terms of a need for inner cleansing and exorcising of subtle beings, and in terms of organic disease or conditions triggered by, for example, social and psychological factors;
in the introduction the author has relied on various academic sources to discuss the nature of the phenomenon of students having hallucinations, but he actually has a working assumption that is completely different: hallucinations are due to the presence of jinn or other spirits.

I do not think it is clear which of these positions is being taken by the study's author.

In the first case it would be necessary to identify which causes are present in potential respondents and only recruit those suffering possession for this study (which does not seem to have been done);
In the second case, spiritual treatment would need to complement medical intervention (which would completely undermine the validity of the study as medical treatments for the underlying causes of hallucinations are likely to be the cause of hallucinations ceasing, not the tested intervention);
The third position is clearly problematic in terms of academic scholarship as it is either completely incompetent or deliberately disregards academic norms that require the design of a study to reflect the conceptual framework set out to motivate it.

So, was this tested intervention implemented instead of or alongside formal medical intervention?

If it was alongside medical treatment, then that raises a major confound for the study.
Yet it would clearly be unacceptable to deny sufferers indicated medical treatment in order to test an educational intervention that is in effect a form of exorcism.

Again, it may be there are simple and adequate responses to these questions (although here I really cannot see what they might be), but unfortunately it seems the journal referees and editor did not think to ask for them.

Findings

The key findings presented concern academic performance at school. Core results are presented in tables I and II. Unfortunately these tables are not consistent as they report contradictory results for the academic performance of students before and during periods when they had hallucinations.

They can be made consistent if the reader assumes that two of the columns in table II are mislabelled. If the reader assumes that the column labelled 'before disruption' actually reports the performance 'during disruption' and that the column actually labelled 'during disruption' is something else, then they become consistent. For the results to tell a coherent story and agree with the author's interpretation this 'something else' presumably should be 'after disruption'.

This is a very unfortunate error – and moreover one that is obvious to any careful reader. (So, why was it not obvious to the referees and editor?)

As well as looking at these overall scores, other assessment data is presented separately for each of respondent 1 – respondent 4. These sections comprise presentations of information about grades and class positions, mixed with claims about the effects of the intervention. These claims are not based on any evidence and in many cases are conclusions about 'respondents' in general although they are placed in sections considering the academic assessment data of individual respondents. So, there are a number of problems with these claims:

they are of the nature of conclusions, but appear in the section presenting the findings;
they are about the specific effects of the intervention that the author assumes has influenced academic performance, not the data analysed in these sections;
they are completely unsubstantiated as no data or analysis is offered to support them;
often they make claims about 'respondents' in general, although as part of the consideration of data from individual learners.

Despite this, the paper passed peer-review and editorial scrutiny.

Rhetorical research?

This paper seems to be an example of a kind of 'rhetorical research' where a researcher is so convinced about their pre-existent theoretical commitments that they simply assume they have demonstrated them. Here the assumptions seem to be:

Recovering from suffering hallucinations will increase student performance
Hallucinations are caused by jinn and devils
A spiritual intervention will expel jinn and devils
So, a spiritual intervention will cure hallucinations
So, a spiritual intervention will increase student performance

The researcher provided a spiritual intervention, and the student performance increased, so it is assumed that the scheme is demonstrated. The data presented is certainly consistent with the assumption, but does not in itself support this scheme without evidence. Awang provides evidence that student performance improved in four individuals after they had received the intervention – but there is no evidence offered to demonstrate the assumed mechanism.

A gardener might think that complimenting seedlings will cause them to grow. Perhaps she praises her seedlings every day, and they do indeed grow. Are we persuaded about the efficacy of her method, or might we suspect another cause at work? Would the peer-reviewers and editor of the European Journal of Education and Pedagogy be persuaded this demonstrated that compliments cause plant growth? On the evidence of this paper, perhaps they would.

This is what Awang tells readers about the analysis undertaken:

"Each student respondent involved in this study [sic, presumably not, rather the researcher] will use the analysis of the respondent's performance to determine the effect of hallucination disorders on student achievement in secondary school is accurate.

The elements compared in this analysis are as follows: a) difference in mean percentage of achievement by subject, b) difference in grade achievement by subject and c) difference in the grade of overall student achievement. All academic results of the respondents will be analyzed as well as get the mean of the difference between the performance before, during, and after the respondents experience hallucinations.

These results will be used as research material to determine the accuracy of the use of the Tazkiyatun Nafs Module in solving the problem of hallucinations in school and can improve student achievement in academic school."
Awang, 2022, p.45

There is clearly a large jump between the analysis outlined in the second paragraph here, and testing the study hypotheses as set out in the final paragraph. But the author does not seem to notice this (and more worryingly, nor do the journal's reviewers and editor).

So interleaved into the account of findings discussing "mean percentage of achievement by subject…difference in grade achievement by subject…difference in the grade of overall student achievement" are totally unsupported claims. Here is an example for Respondent 1:

"Based on the findings of the respondent's achievement in the grade for Respondent 1 while facing the problem of hallucinations shows that there is not much decrease or deterioration of the respondent's grade. There were only 4 subjects who experienced a decline in grade between before and during hallucination disorder. The subjects that experienced decline were English, Geography, CBC, and Civics. Yet there is one subject that shows a very critical grade change the Civics subject. The decline occurred from grade A to grade E. This shows that Civics education needs to be given serious attention in overcoming this problem of decline. Subjects experiencing this grade drop were subjects involving emotion, language, as well as psychomotor fitness. In the context of psychology, unstable emotional development leads to a decline in the psychomotor and emotional development of respondents.

After the use of the Tazkiyatun Nafs module in overcoming this problem, hallucinatory disorders can be overcome. This situation indicates the development of the respondents during and after experiencing hallucinations after practicing the Tazkiyatun Nafs module. The process that takes place in the Tzkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better. From the above findings there were 5 subjects who experienced excellent improvement in grades. The increase occurred in English, Malay, Geography, and Civics subjects. The best improvement is in the subject of Civic education from grade E to grade B. The improvement in this language subject shows that the respondents' emotions have stabilized. This situation is very positive and needs to be continued for other subjects so that respondents continue to excel in academic achievement in school."
Awang, 2022, p.45 (emphasis added)

The material which I show here as underlined is interjected completely gratuitously. It does not logically fit in the sequence. It is not part of the analysis of school performance. It is not based on any evidence presented in this section. Indeed, nor is it based on any evidence presented anywhere else in the paper!

This pattern is repeated in discussing other aspects of respondents' school performance. Although there is mention of other factors which seem especially pertinent to the dip in school grades ("this was due to the absence of the respondents to school during the day the test was conducted", p.46; "it was an increase from before with no marks due to non-attendance at school", p.46) the discussion of grades is interspersed with (repetitive) claims about the effects of the intervention for which no evidence is offered.

	Respondent 1	Respondent 2	Respondent 3	Respondent 4
§: Differences in Respondents' Grade Achievement by Subject	"After the use of the Tazkiyatun Nafs module in overcoming this problem, hallucinatory disorders can be overcome. This situation indicates the development of the respondents during and after experiencing hallucinations after practicing the Tazkiyatun Nafs module. The process that takes place in the Tzkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better." (p.45)	"After the use of the Tazkiyatun Nafs module as a soul purification module, showing the development of the respondents during and after experiencing hallucination disorders is very good. The process that takes place in the Tzkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better." (p.46)	"*The process that takes place in the Tazkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better*" (p.46)	"*The process that takes place in the Tazkiyatun Nafs module can help the respondent to stabilize his emotions and psyche for the better*." (p.46)
§:Differences in Respondent Grades according to Overall Academic Achievement	"Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module…In general, the use of Tazkiyatun Nafs module successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (pp.46-7)	"Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module. … This excellence also shows that the respondents have recovered from hallucinations after practicing the methods found in the Tazkiayatun Nafs module that has been introduced. In general, the use of the Tazkiyatun Nafs module successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (p.47)	"Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module…In general, the use of the Tazkiyatun Nafs module successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (p.47)	"Based on the findings of the study after the hallucination disorder was overcome showed that the development of the respondents was very positive after going through the treatment process using the Tazkiyatun Nafs module…In general, the use of the Tazkiyatun Nafs module has successfully changed the learning lifestyle and achievement of the respondents from poor condition to good and excellent achievement." (p.47)

Unsupported claims made within findings sections reporting analyses of individual student academic grades: note (a) how these statements included in the analysis of individual school performance data from four separate participants (in a case study – a methodology that recognises and values diversity and individuality) are very similar across the participants; (b) claims about 'respondents' (plural) are included in the reports of findings from individual students.

Awang summarises what he claims the analysis of 'differences in respondents' grade achievement by subject' shows:

"The use of the Tazkiyatun Nafs module in this study helped the students improve their respective achievement grades. Therefore, this soul purification module should be practiced by every student to help them in stabilizing their soul and emotions and stay away from all the disturbances of the subtle beings that lead to hallucinations"
Awang, 2022, p.46

And, on the next page, Awang summarises what he claims the analysis of 'differences in respondent grades according to overall academic achievement' shows:

"The use of the Tazkiyatun Nafs module in this study helped the students improve their respective overall academic achievement. Therefore, this soul purification module should be practiced by every student to help them in stabilizing the soul and emotions as well as to stay away from all the disturbances of the subtle beings that lead to hallucination disorder."
Awang, 2022, p.47

So, the analysis of grades is said to demonstrate the value of the intervention, and indeed Awang considers this is reason to extend the intervention beyond the four participants, not just to others suffering hallucinations, but to "every student". The peer review process seems not to have raised queries about

the unsupported claims,
the confusion of recommendations with findings (it is normal to keep to results in a findings section), nor
the unwarranted generalisation from four hallucination sufferers to all students whether healthy or not.

Interpreting the results

There seem to be two stories that can be told about the results:

When the four students suffered hallucinations, this led to a deterioration in their school performance. Later, once they had recovered from the episodes of hallucinations, their school performance improved.
Narrative 1

Now narrative 1 relies on a very substantial implied assumption – which is that the numbers presented as school performance are comparable over time. So, a control would be useful: such as what happened to the performance scores of other students in the same classes over the same time period. It seems likely they would not have shown the same dip – unless the dip was related to something other than hallucinations – such as the well-recognised dip after long school holidays, or some cultural distraction (a major sports tournament; fasting during Ramadan; political unrest; a pandemic…). Without such a control the evidence is suggestive (after all, being ill, and missing school as a result, is likely to lead to a dip in school performance, so the findings are not surprising), but inconclusive.
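To make the point about a control concrete, here is a minimal sketch of the comparison that is missing. All the scores are invented for illustration (they are not data from Awang's paper), and the function names are my own. The idea is simply to subtract the change seen in classmates over the same period from the change seen in the four students, so that any dip shared by the whole class is not attributed to the hallucinations.

```python
from statistics import mean

def mean_change(before, after):
    """Mean of the per-student change (after minus before)."""
    return mean(a - b for b, a in zip(before, after))

def difference_in_differences(case_before, case_after, peer_before, peer_after):
    """Change in the case students' scores beyond the change seen in classmates."""
    return mean_change(case_before, case_after) - mean_change(peer_before, peer_after)

# Hypothetical scores, invented purely for illustration (NOT data from Awang, 2022).
case_before = [70, 65, 80, 75]        # the four students, before the episode
case_during = [50, 45, 62, 55]        # the same students, during the episode
peer_before = [68, 72, 60, 77, 81]    # classmates, same tests, same dates
peer_during = [66, 71, 57, 76, 80]

dip_beyond_peers = difference_in_differences(
    case_before, case_during, peer_before, peer_during
)
print(dip_beyond_peers)
```

With these made-up numbers the four students drop by about 18 points more than their classmates do. Had the classmates shown a similar dip, the difference would have been close to zero, and narrative 1 would have lost much of its support.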

Intriguingly, the author tells readers that "student achievement statistics from the beginning of the year to the middle of the current [sic, published in 2022] year in secondary schools in Northern Peninsular Malaysia that have been surveyed by researchers show a decline (Sabri, 2015 [sic])" (p.42), but this is not considered in relation to the findings of the study.

When the four students suffered hallucinations, this led to a deterioration in their school performance. Later, as a result of undergoing the soul purification module, their school performance improved.
Narrative 2

Clearly narrative 2 suffers from the same limitation as narrative 1. However, it also demands an extra step in making an inference. I could re-write this narrative:

When the four students suffered hallucinations, this led to a deterioration in their school performance. Later, once they had recovered from the episodes of hallucinations, their school performance improved.
AND
the recovery was due to engagement with the soul purification module.
Narrative 2'.

That is, even if we accept narrative 1 as likely, to accept narrative 2 we would also need to be convinced that:

a) sufferers from medical conditions leading to hallucinations do not suffer periodic attacks with periods of remission in between; or
b) episodes of hallucinations cannot be due to one-off events (emotional trauma, T.I.A. {transient ischaemic attack or mini-strokes},…) that resolve naturally in time; or
c) sufferers from medical conditions leading to hallucinations do not find they resolve due to maturation; or
d) the four participants in this study did not undertake any change in life-style (getting more sleep, ceasing eating strange fungi found in the woods) unrelated to the intervention that might have influenced the onset of hallucinations; or
e) the four participants in this study did not receive any medical treatment independent of the intervention (e.g., prescribed medication to treat migraine episodes) that might have influenced the onset of hallucinations

Despite this study being supposedly a case study (where the expectation is there should be 'thick description' of the case and its context), there is no information to help us exclude such options. We do not know the medical diagnoses of the conditions causing the participants' hallucinations, or anything about their lives or any medical treatment that may have been administered. Without such information, the analysis that is provided is useless for answering the research question.

In effect, regardless of all the other issues raised, the key problem is that the research design is simply inadequate to test the research question. But it seems the referees and editor did not notice this shortcoming.

Alleged implications of the research

After presenting his results Awang draws various implications, and makes a number of claims about what had been found in the study:

"After the students went through the treatment session by using the Tazkiayatun Nafsmodule to treat hallucinations, it showed a positive effect on the student respondents. All this was certified by the expert, the student's parents as well as the counselor's teacher." (p.48)
"Based on these findings, shows that hallucinations are very disturbing to humans and the appropriate method for now to solve this problem is to use the Tazkiyatun Nafs Module." (p.48)
"…the use of the Tazkiyatun Nafs module while the respondent is suffering from hallucination disorder is very appropriate…is very helpful to the respondents in restoring their minds and psyche to be calmer and healthier. These changes allow students to focus on their studies as well as allow them to improve their academic performance better." (p.48)
"The use of the Tazkiyatun Nafs Module in this study has led to very positive changes there are attitudes and traits of students who face hallucinations before. All the negative traits like irritability, loneliness, depression,etc. can be overcome completely." (p.49)
"The personality development of students is getting better and perfect with the implementation of the Tazkiaytun Nafs module in their lives." (p.49)
"Results indicate that students who suffer from this hallucination disorder are in a state of high depression, inactivity, fatigue, weakness and pain,and insufficient sleep." (p.49)
"According to the findings of this study, the history of this hallucination disorder started in primary school and when a person is in adolescence, then this disorder becomes stronger and can cause various diseases and have various effects on a person who is disturbed." (p.50)

Given the range of interview data that Awang claims to have collected and analysed, at least some of the claims here are possibly supported by the data. However, none of this data and analysis is available to the reader. ² These claims are not supported by any evidence presented in the paper. Yet peer reviewers and the editor who read the manuscript seem to feel it is entirely acceptable to publish such claims in a research paper, and not present any evidence whatsoever.

Summing up

In summary: as far as these four students were concerned (but not perhaps the fifth participant?), there did seem to be a relationship between periods of experiencing hallucinations and lower school performance (perhaps explained by such factors as "absenteeism to school during the day the test was conducted" p.46),

"the performance shown by students who face chronic hallucinations is also declining and declining. This is all due to the actions of students leaving the teacher's learning and teaching sessions as well as not attending school when this hallucinatory disorder strikes. This illness or disorder comes to the student suddenly and periodically. Each time this hallucination disease strikes the student causes the student to have to take school holidays for a few days due to pain or depression"
Awang, 2022, p.42

However,

these four students do not represent any wider population;
there is no information about the specific nature, frequency, intensity, etcetera, of the hallucinations or diagnoses in these individuals;
there was no statistical test of significance of changes; and
there was no control condition to see if performance dips were experienced by others not experiencing hallucinations at the same time.

Once they had recovered from the hallucinations (and it is not clear on what basis that judgement was made) their scores improved.

The author would like us to believe that the relief from the hallucinations was due to the intervention, but this seems to be (quite literally) an act of faith ³ as no actual research evidence is offered to show that the soul purification module actually had any effect. It is of course possible the module did have an effect (whether for the conjectured or other reasons – such as simply offering troubled children some extra study time in a calm and safe environment and special attention – or because of an expectancy effect if the students were told by trusted authority figures that the intervention would lead to the purification of their hearts and the healing of their hallucinatory disorder) but the study, as reported, offers no strong grounds to assume it did have such an effect.

An irresponsible journal

As hallucinations are often symptoms of organic disease affecting blood supply to the brain, there is a major question of whether treating the condition by religious instruction is ethically sound. For example, hallucinations may indicate a tumour growing in the brain. Yet, if the module was only a complement to proper medical attention, a reader may prefer to suspect that any improvement in the condition (and consequent increased engagement in academic work) may have been entirely unrelated to the module being evaluated.

Indeed, a published research study that claims that soul purification is a suitable treatment for medical conditions presenting with hallucinations is potentially dangerous as it could lead to serious organic disease going untreated. If Awang's recommendations were widely taken up in Malaysia such that students with serious organic conditions were only treated for their hallucinations by soul purification rather than with medication or by surgery it would likely lead to preventable deaths. For a research journal to publish a paper with such a conclusion, where any qualified reviewer or editor could easily see the conclusion is not warranted, is irresponsible.

As the journal website points out,

"The process of reviewing is considered critical to establishing a reliable body of research and knowledge. The review process aims to make authors meet the standards of their discipline, and of science in general."
https://www.ej-edu.org/index.php/ejedu/about

So, why did the European Journal of Education and Pedagogy not subject this submission to meaningful review to help the author of this study meet the standards of the discipline, and of science in general?

Work cited:

Awang, S. B. (2022). Hallucination Disorders: The Effects of Using the Tazkiyatun Nafs Module on the Academic Achievement of Students with Hallucinations. European Journal of Education and Pedagogy, 3(4), 41-51.
Taber, K. S. (2013). Non-random thoughts about research. Chemistry Education Research and Practice, 14(4), 359-362. doi: 10.1039/c3rp90009f. [Free access]
Taber, K. S. (2014a). Methodological issues in science education research: a perspective from the philosophy of science. In M. R. Matthews (Ed.), International Handbook of Research in History, Philosophy and Science Teaching (Vol. 3, pp. 1839-1893): Springer Netherlands. (Download the author's manuscript version of the chapter.)
Taber, K. S. (2014b). Ethical considerations of chemistry education research involving "human subjects". Chemistry Education Research and Practice, 15(2), 109-113. [Free access]
Taber, K. S. (2018). The Use of Cronbach's Alpha When Developing and Reporting Research Instruments in Science Education. Research in Science Education, 48, 1273-1296. doi:10.1007/s11165-016-9602-2 [Open access]

Notes:

¹ In mature fields in the natural sciences there are recognised traditions ('paradigms', 'disciplinary matrices') at any given time. In general (and of course, there will be exceptions):

at any historical time, there is a common theoretical perspective underpinning work in a research programme, aligned with specific ontological and epistemological commitments;
at any historical time, there is a strong alignment between the active theories in a research programme and the acceptable instrumentation, methodology and analytical conventions.

Put more succinctly, in a mature research field, there is generally broad agreement on how a phenomenon is to be understood; and how to go about investigating it, and how to interpret data as research evidence.

This is generally not the case in educational research – which is in part at least due to the complexity and, so, multi-layered nature, of the phenomena studied (Taber, 2014a): phenomena such as classroom teaching. So, in reviewing educational papers, it is sometimes necessary to find different experts to look at the theoretical and the methodological aspects of the same submission.

² The paper is very strange in that the introductory sections and the conclusions and implications sections have a very broad scope, but the actual research results are restricted to a very limited focus: analysis of school test scores and grades.

It is as if (and it could well be that) a dissertation with a number of evidential strands has been reduced to a paper drawing upon only one aspect of the research evidence, but with material from other sections of the dissertation being unchanged from the original broader study.

³ Readers are told that

"All these acts depend on the sincerity of the medical researcher or fortune-teller seeking the help of Allah S.W.T to ensure that these methods and means are successful. All success is obtained by the permission of Allah alone"
Awang, 2022, p.43

Lack of control in educational research

Getting that sinking feeling on reading published studies

Keith S. Taber

this is like finding that, after a period of watering plant A, it is taller than plant B – when you did not think to check how tall the two plants were before you started watering plant A

Research on prelabs

I was looking for studies which explored the effectiveness of 'prelabs', activities which students are given before entering the laboratory to make sure they are prepared for practical work, and can therefore use their time effectively in the lab. There is much research suggesting that students often learn little from science practical work, in part because of cognitive overload – that is, learners can be so occupied with dealing with the apparatus and materials they have little capacity left to think about the purpose and significance of the work. ¹

Okay, so is *THIS* the pipette?
(Image by PublicDomainPictures from Pixabay)

Approaching a practical work session having already spent time engaging with its purpose and associated theories/models, and already having become familiar with the processes to be followed, should mean students enter the laboratory much better prepared to use their time efficiently, and much better informed to reflect on the wider theoretical context of the work.

I found a Swedish paper (Winberg & Berg, 2007) reporting a pair of studies that tested this idea by using a simulation as a prelab activity for undergraduates about to engage with an acid-base titration. The researchers tested this innovation by comparisons between students who completed the prelab before the titration, and those who did not.

The work used two basic measures:

types (sophistication) of questions asked by students during the lab. session
elicitation of knowledge in interviews after the laboratory activity

The authors found some differences (between those who had completed the prelab and those who had not) in the sophistication of the questions students asked, and in the quality of the knowledge elicited. They used inferential statistics to suggest at least some of the differences found were statistically significant. From my reading of the paper, these claims were not justified.

A peer reviewed journal (no, really, this time)

This is a paper in a well respected journal (not one of the predatory journals I have often discussed on this site). The Journal of Research in Science Teaching is published by Wiley (a major respected publisher of academic material) and is the official journal of NARST (which used to stand for the National Association for Research in Science Teaching – where 'national' referred to the USA ²). This is a journal that does take peer review very seriously.

The paper is well-written and well-structured. Winberg and Berg set out a conceptual framework for the research that includes a discussion of previous relevant studies. They adopt a theoretical framework based on Perry's model of intellectual development (Taber, 2020). There is considerable detail of how data was collected and analysed. This account is well-argued. (But, you, dear reader, can surely sense a 'but' coming.)

Experimental research into experimental work?

The authors do not seem to explicitly describe their research as an experiment as such (as opposed to adopting some other kind of research strategy such as survey or case study), but the word 'experiment' and variations of it appear in the paper.

For one thing, the authors refer to students' practical work as being experiments,

"Laboratory exercises, especially in higher education contexts, often involve training in several different manipulative skills as well as a high information flow, such as from manuals, instructors, output from the experimental equipment, and so forth. If students do not have prior experiences that help them to sort out significant information or reduce the cognitive effort required to understand what is happening in the experiment, they tend to rely on working strategies that help them simply to cope with the situation; for example, focusing only on issues that are of immediate importance to obtain data for later analysis and reflective thought…"
Winberg & Berg, 2007

Now, some student practical work is experimental, where a student is actively looking to see what happens when they manipulate some variable to test a hypothesis. This type of practical work is sometimes labelled enquiry (or inquiry in US spelling). A lot of school and university laboratory work, however, is undertaken to learn techniques, or (probably more often) to support the learning of taught theory – where it is usually important the learners know what is meant to happen before they begin the laboratory activity.

Winberg and Berg refer to the 'laboratory exercise' as 'the experiment' as though any laboratory work counts as an experiment. In Winberg and Berg's research, students were asked about their "own [titration] experiment", despite the prelab material involving a simulation of the titration process, in advance of which "the theoretical concepts, ideas, and procedures addressed in the simulation exercise had been treated mainly quantitatively during the preceding 1-week instructional sequence". So, the laboratory titration exercise does not seem to be an experiment in the scientific sense of the term.

School children commonly describe all practical work in the lab as 'doing experiments'. It cannot help students learn what an experiment really is when the word 'experiment' has two quite distinct meanings in the science classroom:

experiment_(technical) = an empirical test of a hypothesis involving the careful control of variables and observation of the effect on a specified (hypothesised as) dependent variable of changing the variable specified as the independent variable
experiment_(casual) = absolutely any practical activity carried out with laboratory equipment

We might describe this second meaning as an alternative conception of 'experiment', a way of understanding that is inconsistent with the scientific meaning. (Just as there are common alternative conceptions of other 'nature of science' concepts such as 'theory').

I would imagine Winberg and Berg were well aware of what an experiment is, although their casual use of language might suggest a lack of rigour in thinking with the term. They refer to having "both control and experiment groups" in their studies, and refer to "the experimental chronology" of their research design. So, they certainly seem to think of their work as a kind of experiment.

Experimental design

In a true experiment, a sample is randomly drawn from a population of interest (say, first year undergraduate chemistry students; or, perhaps, first year undergraduate chemistry students attending Swedish Universities, or… ³) and assigned randomly to the conditions being compared. Providing a genuine form of random assignment is used, then inferential statistical tests can guide on whether any differences found between groups at the end of an experiment should be considered statistically significant. ⁴

"Statistics can only indicate how likely a measured result would occur by chance (as randomisation of units of analysis to different treatments can only make uneven group composition unlikely, not impossible)…Randomisation cannot ensure equivalence between groups (even if it makes any imbalance just as likely to advantage either condition)"
Taber, 2019, p.73

Inferential statistics can be used to test for statistical significance in experiments – as long as the 'units of analysis' (e.g., students) are randomly assigned to the experimental and control conditions.
(Figure from Taber, 2019)

That is, if there are differences that the statistical tests suggest are very unlikely to happen by chance, then they are very unlikely to be due to an initial difference between the groups in the two conditions, as long as the groups were the result of random assignment. But that is a very important proviso.

There are two aspects to this need for randomisation:

to be able to suggest any differences found reflect the effects of the intervention, then there should be random assignment to the two (or more) conditions
to be able to suggest the results reflect what would probably be found in a wider population, the sample should be randomly selected from the population of interest ³
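The two aspects are separate steps, and it may help to see them set out side by side. The following is only a sketch under stated assumptions: the population of 500 student labels is hypothetical, the helper names are my own, and the fixed seed is there only so the example is repeatable.

```python
import random

def draw_random_sample(population, sample_size, rng):
    """Random selection: every member of the population has an equal chance of inclusion."""
    return rng.sample(population, sample_size)

def assign_randomly(units, rng):
    """Random assignment: shuffle the sampled units, then split them into two conditions."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(2007)  # fixed seed only so that the sketch is repeatable

# Hypothetical population of interest (labels invented for illustration).
population = [f"student_{i:03d}" for i in range(500)]

sample = draw_random_sample(population, 58, rng)   # supports generalisation
treatment, control = assign_randomly(sample, rng)  # supports causal inference

print(len(treatment), len(control))
```

Note that neither step looks at anything about the students themselves. That is the point: a chance process decides who is studied and who receives the intervention, not a teacher, a timetable, or the students' own preferences.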

Studies in education seldom meet the requirements for being true experiments
(Figure from Taber, 2019)

In education, it is not always possible to use random assignment, so true experiments are then not possible. However, so-called 'quasi-experiments' may be possible where differences between the outcomes in different conditions may be understood as informative, as long as there is good reason to believe that even without random assignment, the groups assigned to the different conditions are equivalent.

In this specific research, that would mean having good reason to believe that without the intervention (the prelab):

students in both groups would have asked overall equivalent (in terms of the analysis undertaken in this study) questions in the lab.;
students in both groups would have been judged as displaying overall equivalent subject knowledge.

Often in research where a true experiment is not possible some kind of pre-testing is used to make a case for equivalence between groups.

Two control groups that were out of control

In Winberg and Berg's research there were two studies where comparisons were made between 'experimental' and 'control' conditions:

Study	Experimental	Control
Study 1	n=78: first-year students, following completion of their first chemistry course in 2001	n=97: students who had been interviewed by the researchers during the same course in the previous year
Study 2	n=21 (of 58 in cohort)	n=37 (of 58 in same cohort)

In the first study, a comparison was made between the cohort where the innovation was introduced and a cohort from the previous year. All other things being equal, it seems likely these two cohorts were fairly similar. But in education all things are seldom equal, so there is no assurance they were similar enough to be considered equivalent.

In the second study

"Students were divided into treatment (n = 21) and control (n = 37) groups. Distribution of students between the treatment and control groups was not controlled by the researchers".
Winberg & Berg, 2007

So, some factor(s) external to the researchers divided the cohort into two groups – and the reader is told nothing about the basis for this, nor even if the two groups were assigned to the treatments randomly.⁵ The authors report that the cohort "comprised prospective molecular biologists (31%), biologists (51%), geologists (7%), and students who did not follow any specific program (11%)", and so it is possible the division into two uneven sized groups was based on timetabling constraints with students attending chemistry lab sessions according to their availability based on specialism. But that is just a guess. (It is usually better when the reader of a research report is not left to speculate about procedures and constraints.)

What is important for a reader to note is that in these studies:

the researchers were not able to assign learners to conditions randomly;
nor were the researchers able to offer any evidence of equivalence between groups (such as near identical pre-test scores);
so, the requirements for inferring significance from statistical tests were not met;
so, claims in the paper about finding statistically significant differences between conditions cannot therefore be justified given the research design;
and therefore the conclusions presented in the paper are strictly not valid.

If students are not randomly assigned to conditions, then any statistically unlikely difference found at the end of an experiment cannot be assumed to be likely to be due to intervention, rather than some systematic initial difference between the groups.
(Figure adapted from Taber, 2019)

This is a shame, because this is in many ways an interesting paper, and much thought and care seems to have been taken about the collection and analysis of meaningful data. Yet, drawing conclusions from statistical tests comparing groups that might never have been similar in the first case is like finding that careful use of a vernier scale shows that after a period of watering plant A, plant A is taller than plant B – having been very careful to make sure plant A was watered regularly with carefully controlled volumes, while plant B was not watered at all – when you did not think to check how tall the two plants were before you started watering plant A.

In such a scenario we might be tempted to assume plant A has actually become taller because it had been watered; but that is just applying what we had conjectured should be the case, and we would be mistaking our expectations for experimental evidence.
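The same point can be made with a small simulation. This is a sketch of my own, not a re-analysis of Winberg and Berg's data: the 58 scores are randomly generated, the 'prelab' is given no effect at all, and the non-random grouping is modelled as an extreme case in which whatever divided the cohort happened to track attainment perfectly.

```python
import random
from statistics import mean

def gap(group_a, group_b):
    """Difference between the two group means on the outcome measure."""
    return mean(group_a) - mean(group_b)

rng = random.Random(44)  # fixed seed only so that the sketch is repeatable

# Hypothetical cohort of 58. Each number is the score a student would obtain
# on the outcome measure. The intervention has NO effect in this sketch, so a
# student's score is the same whichever group they end up in.
scores = [rng.gauss(50, 10) for _ in range(58)]

# Non-random grouping: an external factor that tracks attainment
# (an extreme case, modelled here by ranking the cohort).
ranked = sorted(scores)
systematic_gap = gap(ranked[37:], ranked[:37])  # 21 'treated' vs 37 'controls'

# Random assignment of the same students to groups of the same sizes.
shuffled = scores[:]
rng.shuffle(shuffled)
random_gap = gap(shuffled[37:], shuffled[:37])

print(f"gap with systematic grouping: {systematic_gap:.1f}")
print(f"gap with random assignment:   {random_gap:.1f}")
```

The systematic grouping produces a gap in favour of the 'treated' group even though nothing was done to them, and no random split of the same students can produce a larger one. A statistical test applied to the first comparison would be measuring the grouping, not the intervention.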

Work cited:

Taber, K. S. (2013). Non-random thoughts about research. Chemistry Education Research and Practice, 14(4), 359-362. doi:10.1039/c3rp90009f
Taber, K. S. (2019). Experimental research into teaching innovations: responding to methodological and ethical challenges. Studies in Science Education, 55(1), 69-119. doi:10.1080/03057267.2019.1658058 [Download the manuscript version of this paper].
Taber, K. S. (2020). Developing intellectual sophistication and scientific thinking – The schemes of William G. Perry and Deanna Kuhn. In B. Akpan & T. Kennedy (Eds.), Science Education in Theory and Practice: An introductory guide to learning theory (pp. 209-223). Springer.
Winberg, T. M., & Berg, C. A. R. (2007). Students' cognitive focus during a chemistry laboratory exercise: Effects of a computer-simulated prelab. Journal of Research in Science Teaching, 44(8), 1108-1133. https://doi.org/10.1002/tea.20217

Notes:

¹ The part of the brain where we can consciously mentipulate ideas is called the working memory (WM). Research suggests that WM has a very limited capacity in the sense that people can only hold in mind a very small number of different things at once. (These 'things' however are somewhat subjective – a complex idea that is treated as a single 'thing' in the WM of an expert can overload a novice.) This limit to WM is considered to be one of the most substantial constraints on effective classroom learning. This is also, then, one of the key research findings informing the design of effective teaching.

Read about working memory

Read about key ideas for teaching in accordance with learning theory

How fat is your memory? – read about a chemical analogy for working memory

² The organisation has seemingly spotted that the USA is only one part of the world, and now describes itself as a global organisation for improving science education through research.

³ There is no reason why an experiment cannot be carried out on a very specific population, such as first year undergraduate chemistry students attending a specific Swedish University such as, say, Umeå University. However, if researchers intend their study to have results generalisable beyond their specific research contexts (say, to first year undergraduate chemistry students attending any Swedish University) then it is important to have a representative sample of that population.

Read about populations of interest in research

Read about generalisation from research studies

⁴ It might be assumed that scientists and researchers know what is meant by random, and how to undertake random assignment. Sadly, the literature suggests that in practice the term 'randomly' is sometimes used in research reports to mean something like 'arbitrarily' (Taber, 2013), which falls short of being random.

Read about randomisation in research

⁵ Arguably, even if the two groups were assigned randomly, there is only one 'unit of analysis' in each condition, as they were assigned as groups. That is, for statistical purposes, the two groups have size n=1 and n=1, which would not allow statistical significance to be found: e.g., see 'Quasi-experiment or crazy experiment?'
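One way to see why n=1 and n=1 cannot give a significant result is to think in terms of an exact permutation test. This is a framing I am bringing in for illustration; it is not the test reported in the paper. Under random assignment every possible split of the units into the two conditions is equally likely, so the observed split can never be rarer than one in the number of possible splits. The function name below is my own.

```python
from math import comb

def smallest_one_sided_p(n_treatment, n_control):
    """Smallest p-value attainable in a one-sided exact permutation test.

    Every split of the units into the two conditions is equally likely under
    random assignment, so the most extreme outcome has probability
    1 / (number of possible splits). Ties can only make the p-value larger.
    """
    return 1 / comb(n_treatment + n_control, n_treatment)

# Two intact groups treated as the units of analysis.
print(smallest_one_sided_p(1, 1))

# 58 students assigned individually (21 to treatment, 37 to control).
print(smallest_one_sided_p(21, 37))
```

With one unit in each condition there are only two possible splits, so the smallest attainable p-value is 0.5, far above the conventional 0.05 threshold however large the difference between the groups. With 58 individually assigned students the number of possible splits is enormous, and a very small p-value becomes attainable.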

Assessing Chemistry Laboratory Equipment Availability and Practice

Comparative education on a local scale?

Keith S. Taber

I have just read a paper in a research journal which compares the level of chemistry laboratory equipment and 'practice' in two schools in the "west Gojjam Administrative zone" (which according to a quick web-search is in the Amhara Region in Ethiopia). According to Yesgat and Yibeltal (2021),

"From the analysis of Chemistry laboratory equipment availability and laboratory practice in both … secondary school and … secondary school were found in very low level and much far less than the average availability of chemistry laboratory equipment and status of laboratory practice. From the data analysis average chemistry laboratory equipment availability and status of laboratory practice of … secondary school is better than that of Jiga secondary school."
Yesgat and Yibeltal, 2021: abstract [I was tempted to omit the school names in this posting as I was not convinced the schools had been treated reasonably, but the schools are named in the very title of the article]

Now that would seem to be something that could clearly be of interest to teachers, pupils, parents and education administrators in those two particular schools, but it raises the question that can be posed in relation to any research: 'so what?' The findings might be a useful outcome of enquiry in its own context, but what generalisable knowledge does this offer that justifies its place in the research literature? Why should anyone outside of West Gojjam care?

The authors tell us,

"There are two secondary schools (Damot and Jiga) with having different approach of teaching chemistry in practical approach"
Yesgat and Yibeltal, 2021: 96

So, this suggests a possible motivation.

If these two approaches reflect approaches that are common in schools more widely, and
if these two schools can be considered representative of schools that adopt these two approaches, and
if 'Chemistry Laboratory Equipment Availability and Practice' can be considered to be related to (a factor influencing? an effect of?) these different approaches, and
if the study validly and reliably measures 'Chemistry Laboratory Equipment Availability and Practice', and
if substantive differences are found between the schools

then the findings might well be of wider interest. As always in research, the importance we give to findings depends upon a whole logical chain of connections that collectively make an argument.

Spoiler alert!

At the end of the paper, I was none the wiser what these 'different approaches' actually were.

A predatory journal

I have been reading some papers in a journal that I believed, on the basis of its misleading title and website details, was an example of a poor-quality 'predatory journal'. That is, a journal which encourages submissions simply to be able to charge a publication fee (currently $1519, according to the website), without doing the proper job of editorial scrutiny. I wanted to test this initial evaluation by looking at the quality of some of the work published.

Although the journal is called the Journal of Chemistry: Education Research and Practice (not to be confused, even if the publishers would like it to be, with the well-established journal Chemistry Education Research and Practice) only a few of the papers published are actually education studies. One of the articles that IS on an educational topic is called 'Assessment of Chemistry Laboratory Equipment Availability and Practice: A Comparative Study Between Damot and Jiga Secondary Schools' (Yesgat & Yibeltal, 2021).

Comparative education?

Yesgat and Yibeltal imply that their study falls in the field of comparative education. ¹ They inform readers that ²,

"One purpose of comparative education is to stimulate critical reflection about our educational system, its success and failures, strengths and weaknesses. This critical reflection facilitates self-evaluation of our work and is the basis for determining appropriate courses of action. Another purpose of comparative education is to expose us to educational innovations and systems that have positive outcomes. Most compartivest states [sic] that comparative education has four main purposes. These are:
To describe educational systems, processes or outcomes
To assist in development of educational institutions and practices
To highlight the relationship between education and society
To establish generalized statements about education that are valid in more than one country"
Yesgat & Yibeltal, 2021: 95-96

Comparative education studies look to characterise (national) education systems in relation to their social/cultural contexts (Image by Gerd Altmann from Pixabay)

Of course, like any social construct, 'comparative education' is open to interpretation and debate: for example, "that comparative education brings together data about two or more national systems of education, and comparing and contrasting those data" has been characterised as "a naive and obvious answer to the question of what constitutes comparative education" (Turner, 2019, p.100).

There is then some room for discussion over whether particular research outputs should count as 'comparative education' studies or not. Many comparative education studies do not actually compare two educational systems, but rather report in detail from a single system (making possible subsequent comparisons based across several such studies). These educational systems are usually understood as national systems, although there may be a good case to explore regional differences within a nation if regions have autonomous education systems and these can be understood in terms of broader regional differences.

Yet, studying one aspect of education within one curriculum subject at two schools in one educational administrative area of one region of one country cannot be understood as comparative education without doing excessive violence to the notion. This work does not characterise an educational system at national, regional or even local level.

My best assumption is that as the study is comparing something (in this case an aspect of chemistry education in two different schools) the authors feel that makes it 'comparative education', by which account of course any educational experiment (comparing some innovation with some kind of comparison condition) would automatically be a comparative education study. We all make errors sometimes, assuming terms have broader or different meanings than their actual conventional usage – and may indeed continue to misuse a term till someone points this out to us.

This article was published in what claims to be a peer reviewed research journal, so the paper was supposedly evaluated by expert reviewers who would have provided the editor with a report on strengths and weaknesses of the manuscript, and highlighted areas that would need to be addressed before possible publication. Such a reviewer would surely have reported that 'this work is not comparative education, so the paragraph on comparative education should either be removed, or authors should contextualise it to explain why it is relevant to their study'.

The weak links in the chain

A research report makes certain claims that derive from a chain of argument. To be convinced about the conclusions you have to be convinced about all the links in the chain, such as:

sampling (were the right people asked?)
methodology (is the right type of research design used to answer the research question?)
instrumentation (is the data collection instrument valid and reliable?)
analysis (have appropriate analytical techniques been carried out?)

These considerations cannot be averaged: if, for example, a data collection instrument does not measure what it is said to measure, then it does not matter how good the sample, or how careful the analysis, the study is undermined and no convincing logical claims can be built. No matter how skilled I am in using a tape measure, I will not be able to obtain accurate weights with it.

Sampling

The authors report the make up of their sample – all the chemistry teachers in each school (13 in one, 11 in the other), plus ten students from each of grades 9, 10 and 11 in each school. They report that "… 30 natural science students from Damot secondary school have been selected randomly. With the same technique … 30 natural sciences students from Jiga secondary school were selected".

Random selection is a useful safeguard against bias in a sample, but it is helpful if the technique for randomisation is briefly reported, to assure readers that 'random' is not being used as a synonym for 'arbitrary' and that the technique applied was adequate (Taber, 2013b).

A random selection across a pooled sample is unlikely to lead to equal representation in each subgroup (From Taber, 2013a)

Actually, if 30 students had been chosen at random from the population of students taking natural sciences in one of the schools, it would be extremely unlikely they would be evenly spread, 10 from each year group. Presumably, the authors made random selections within these grade levels (which would be eminently sensible, but is not quite what they report).
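To give a feel for how unlikely an even split is, here is a minimal Python sketch. The enrolment figures are my own assumption (the paper does not report how many natural science students each grade contains), and the function name and student labels are purely illustrative. The first part works out the chance that a simple random sample of 30 lands exactly 10 in each of three equally sized grades; the second part shows the stratified alternative the authors presumably used, drawing 10 at random within each grade.

```python
import random
from math import comb


def prob_even_split(grade_sizes, per_grade):
    """Chance that a simple random sample drawn from the pooled grades
    contains exactly `per_grade` students from every grade."""
    sample_size = per_grade * len(grade_sizes)
    ways_even = 1
    for size in grade_sizes:
        ways_even *= comb(size, per_grade)  # ways to pick per_grade from this grade
    return ways_even / comb(sum(grade_sizes), sample_size)


# ASSUMPTION: 100 students in each of grades 9, 10 and 11 (not from the paper).
p = prob_even_split([100, 100, 100], 10)
print(f"P(exactly 10 from each grade) = {p:.3f}")

# Stratified random selection: a separate random draw within each grade.
rng = random.Random(2021)  # fixed seed so the selection can be audited
rosters = {g: [f"grade{g}-student{i:03d}" for i in range(1, 101)] for g in (9, 10, 11)}
stratified = {g: rng.sample(names, 10) for g, names in rosters.items()}
```

Under these assumed enrolments the pooled approach gives a 10/10/10 split only a few times in a hundred, whereas the stratified draw guarantees it while still leaving the choice of individuals to chance.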

Read about the criterion for randomness in research

Data collection

To collect data the authors constructed a questionnaire with Likert-type items.

"…questionnaire was used as data collecting instruments. Closed ended questionnaires with 23 items from which 8 items for availability of laboratory equipment and 15 items for laboratory practice were set in the form of "Likert" rating scale with four options (4=strongly agree, 3=agree, 2=disagree and 1=strongly disagree)"
Yesgat & Yibeltal, 2021: 96

These categories were further broken down (Yesgat & Yibeltal, 2021: 96): "8 items of availability of equipment were again sub grouped in to

physical facility (4 items),
chemical availability (2 items), and
laboratory apparatus (2 items)

whereas 15 items of laboratory practice were further categorized as

before actual laboratory (4 items),
during actual laboratory practice (6 items) and
after actual laboratory (5 items)"

Internal coherence

So, there were two basic constructs, each broken down into three sub-constructs. This instrument was piloted,

"And to assure the reliability of the questionnaire a pilot study on a [sic] non-sampled teachers and students were conducted and Cronbach's Alpha was applied to measure the coefficient of internal consistency. A reliability coefficient of 0.71 was obtained and considered high enough for the instruments to be used for this research"
Yesgat & Yibeltal, 2021: 96

Running a pilot study can be very useful as it can highlight issues about items. However, although simply asking people to complete a questionnaire might highlight items people could not make any sense of, it may not be as useful as interviewing them about how they understood items to check that respondents understand items in the same way as researchers.

The authors cite the value of Cronbach's alpha to demonstrate their instrument has internal consistency. However, they seem to be quoting the value obtained in the pilot study, whereas the statistic strictly applies to a particular administration of an instrument (so the value from the main study is more relevant to the results reported).

More problematic, the authors appear to cite a value of alpha from across all 23 items (n.b., the value of alpha tends to increase as the number of items increases, so what is considered an acceptable value needs to allow for the number of items included) when these are actually two distinct scales: 'availability of laboratory equipment' and 'laboratory practice'. Alpha should be quoted separately for each scale – values across distinct scales are not useful (Taber, 2018). ³
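As a rough illustration of what quoting alpha 'separately for each scale' involves, here is a short Python sketch using the standard formula for Cronbach's alpha (the number of items, the sum of the item variances, and the variance of respondents' total scores). The ratings are invented for the example, and I have used two 'availability' items and three 'practice' items rather than the 8 and 15 in the actual questionnaire, simply to keep it short.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for ONE scale.

    `responses` is a list of respondents, each a list of that
    respondent's ratings on the items of the scale.
    """
    k = len(responses[0])  # number of items in the scale
    item_variances = [variance([r[i] for r in responses]) for i in range(k)]
    total_variance = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# INVENTED ratings: columns 0-1 stand in for 'availability' items,
# columns 2-4 for 'practice' items.
rows = [
    [1, 2, 3, 3, 4],
    [2, 2, 1, 2, 1],
    [3, 4, 4, 3, 3],
    [1, 1, 2, 2, 2],
    [4, 3, 2, 1, 2],
    [2, 3, 3, 4, 4],
]

alpha_availability = cronbach_alpha([r[:2] for r in rows])
alpha_practice = cronbach_alpha([r[2:] for r in rows])
print(f"availability: {alpha_availability:.2f}, practice: {alpha_practice:.2f}")
```

The point of the sketch is only the slicing: each scale gets its own alpha, calculated from its own items, rather than one figure across everything in the questionnaire.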

Do the items have face validity?

The items in the questionnaire are reported in appendices (pp.102-103), so I have tabulated them here, so readers can consider

(a) whether they feel these items reflect the constructs of 'availability of equipment' and 'laboratory practice';
(b) whether the items are phrased in a clear way for both teachers and students (the authors report "conceptually the same questionnaires with different forms were prepared" (p.101) but if this means different wording for teachers than students this is not elaborated – teachers were also asked demographic questions about their educational level); and
(c) whether they are all reasonable things to expect both teachers and students to be able to rate.

'Availability of equipment' items	'Laboratory practice' items
Structured and well- equipped laboratory room	You test the experiments before your work with students
Availability of electric system in laboratory room	You give laboratory manuals to student before practical work
Availability of water system in laboratory room	You group and arrange students before they are coming to laboratory room
Availability of laboratory chemicals are available [sic]	You set up apparatus and arrange chemicals for activities
No interruption due to lack of lab equipment	You follow and supervise students when they perform activities
Isolated bench to each student during laboratory activities	You work with the lab technician during performing activity
Chemicals are arranged in a logical order.	You are interested to perform activities?
Laboratory apparatus are arranged in a logical order	You check appropriate accomplishment of your students' work
	Check your students' interpretation, conclusion and recommendations
	Give feedbacks to all your students work
	Check whether the lab report is individual work or group
	There is a time table to teachers to conduct laboratory activities.
	Wear safety goggles, eye goggles, and other safety equipment in doing so
	Work again if your experiment is failed
	Active participant during laboratory activity

Items teachers and students were asked to rate on a four point scale (strongly agree / agree / disagree / strongly disagree)

Perceptions

One obvious limitation of this study is that it relies on reported perceptions.

One way to find out about the availability of laboratory equipment might be to visit teaching laboratories and survey them with an observation schedule – and perhaps even make a photographic record. The questionnaire assumes that teacher and student perceptions are accurate and that honest reports would be given (might teachers have had an interest in offering a particular impression of their work?)

Sometimes researchers are actually interested in impressions (e.g., for some purposes whether a student considers themselves a good chemistry student may be more relevant than an objective assessment), and sometimes researchers have no direct access to a focus of interest and must rely on other people's reports. Here it might be suggested that a survey by questionnaire is not really the best way to, for example, "evaluate laboratory equipment facilities for carrying out practical activities" (p.96).

Findings

The authors describe their main findings as,

"Chemistry laboratory equipment availability in both Damot secondary school and Jiga secondary school were found in very low level and much far less than the average availability of chemistry laboratory equipment. This finding supported by the analysis of one sample t-values and as it indicated the average availability of laboratory equipment are very much less than the test value and the p-value which is less than 0.05 indicating the presence of significant difference between the actual availability of equipment to the expected test value (2.5).
Chemistry laboratory practice in both Damot secondary school and Jiga secondary school were found in very low level and much far less than the average chemistry laboratory practice. This finding supported by the analysis of one sample t-values and as it indicated the average chemistry laboratory practice are very much less than the test value and the p-value which is less than 0.05 indicating the presence of significant difference between the actual chemistry laboratory practice to the expected test value."
Yesgat & Yibeltal, 2021: 101 (emphasis added)

This is the basis for the claim in the abstract that "From the analysis of Chemistry laboratory equipment availability and laboratory practice in both Damot secondary school and Jiga secondary school were found in very low level and much far less than the average availability of chemistry laboratory equipment and status of laboratory practice."

'The average …': what is the standard?

But this raises a key question – how do the authors know what the "the average availability of chemistry laboratory equipment and status of laboratory practice" is, if they have only used their questionnaire in two schools (which are both found to be below average)?

Yesgat & Yibeltal have run a comparison between the average ratings they get from the two schools on their two scales and the 'average test value' rating of 2.5. As far as I can see, this is not an empirical value at all. It seems the authors have just assumed that if people are asked to use a four point scale – 1, 2, 3, 4 – then the average rating will be…2.5. Of course, that is a completely arbitrary assumption. (Consider the question – 'how much would you like to be beaten and robbed today?': would the average response be likely to be the nominal mid-point of a ratings scale?) Perhaps if a much wider survey had been undertaken the actual average rating would have been 1.9 or 2.7 or …
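The dependence of the result on the chosen test value can be sketched in a few lines of Python. The ratings below are made up for illustration (they are not the authors' data), and the calculation is simply the textbook one-sample t statistic: the difference between the sample mean and the test value, divided by the standard error of the mean. It sets aside, for the moment, whether such ratings should be averaged at all.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(ratings, test_value):
    """Textbook one-sample t statistic against an assumed reference value."""
    standard_error = stdev(ratings) / sqrt(len(ratings))
    return (mean(ratings) - test_value) / standard_error


# MADE-UP ratings on a 1-4 scale (mean = 2.0), for illustration only.
ratings = [2, 1, 2, 3, 2, 1, 2, 2, 3, 2]

for assumed_average in (2.5, 2.0, 1.9):
    t = one_sample_t(ratings, assumed_average)
    print(f"test value {assumed_average}: t = {t:+.2f}")
```

The same ratings come out as 'below average', 'exactly average' or 'above average' depending purely on which reference value is assumed, which is why an empirically grounded benchmark matters.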

That is even assuming that 'average' is a meaningful concept here. A four point Likert scale is an ordinal scale ('agree' is always less agreement than 'strongly agree' and more than 'disagree') but not an interval scale (that is, it cannot be assumed that the perceived 'agreement' gap (i) from 'strongly disagree' to 'disagree' is the same for each respondent and the same as that (ii) from 'disagree' to 'agree' and (iii) from 'agree' to 'strongly agree'). Strictly, Likert scale ratings cannot be averaged (better being presented as bar charts showing frequencies of response) – so although the authors carry out a great deal of analysis, much of this is, strictly, invalid.
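A small Python sketch may help show why frequencies are more informative than a mean here. The two sets of responses are invented: in one, respondents are split between the extremes; in the other, everyone sits near the middle. Both have the same arithmetic mean.

```python
from collections import Counter
from statistics import mean


def frequencies(ratings):
    """Count of responses at each point of the four point scale."""
    counts = Counter(ratings)
    return {point: counts.get(point, 0) for point in (1, 2, 3, 4)}


# INVENTED response sets, 20 respondents each.
polarised = [1] * 10 + [4] * 10   # half strongly disagree, half strongly agree
lukewarm = [2] * 10 + [3] * 10    # half disagree, half agree

for label, ratings in (("polarised", polarised), ("lukewarm", lukewarm)):
    print(label, "mean =", mean(ratings))
    for point, count in frequencies(ratings).items():
        print(f"  {point}: {'#' * count}")
```

Both groups 'average' 2.5, yet the simple text bar charts show two very different patterns of response that the mean conceals.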

So what has been found out from this study?

I would very much like to know what peer reviewers made of this study. Expert reviewers would surely have identified some very serious weaknesses in the study and would have been expected to have recommended some quite major revisions even if they thought it might eventually be publishable in a research journal.

An editor is expected to take on board referee evaluations and ask authors to make such revisions as are needed to persuade the editor the submission is ready for publication. It is the job of the editor of a research journal, supported by the peer reviewers, to

a) ensure work of insufficient quality is not published

b) help authors strengthen their paper to correct errors and address weaknesses

Sometimes this process takes some time, with a number of cycles of revision and review. Here, however, the editor was able to move to a decision to publish in 5 days.

The study reflects a substantive amount of work by the authors. Yet, it is hard to see how this study, at least as reported in this journal, makes a substantive contribution to public knowledge. The study finds that one school has somewhat higher survey ratings on an instrument that has not been fully validated than another school, and is based on a pooling of student and teacher perceptions, and which guesses that both rate lower than a hypothetical 'average' school. The two schools were supposed to represent a "different approach[es] of teaching chemistry in practical approach" – but even if that is the case, the authors have not shared with their readers what these different approaches are meant to be. So, there would be no possibility of generalising from the schools to 'approach[es] of teaching chemistry', even if that was logically justifiable. And comparative education it is not.

This study, at least as published, does not seem to offer useful new knowledge to the chemistry education community that could support teaching practice or further research. Even in the very specific context of the two specific schools it is not clear what can be done with the findings which simply reflect back to the informants what they have told the researchers, without exploring the reasons behind the ratings (how do different teachers and students understand what counts as 'Chemicals are arranged in a logical order') or the values the participants are bringing to the study (is 'Check whether the lab report is individual work or group' meant to imply that it is seen as important to ensure that students work cooperatively or to ensure they work independently or …?)

If there is a problem highlighted here by the "very low levels" (based on a completely arbitrary interpretation of the scales) there is no indication of whether this is due to resourcing of the schools, teacher preparation, levels of technician support, teacher attitudes or pedagogic commitments, timetabling problems, …

This seems to be a study which has highlighted two schools, invited teachers and students to complete a dubious questionnaire, and simply used this to arbitrarily characterise the practical chemistry education in the schools as very poor, without contextualising any challenges or offering any advice on how to address the issues.

Work cited:

Mbozi, E. H. (2017). Comparative Education. Nairobi, Kenya: African Virtual University.
Taber, K. S. (2013a). Classroom-based Research and Evidence-based Practice: An introduction (2nd ed.). London: Sage.
Taber, K. S. (2013b). Non-random thoughts about research. Chemistry Education Research and Practice, 14(4), 359-362
Taber, K. S. (2018). The Use of Cronbach's Alpha When Developing and Reporting Research Instruments in Science Education. Research in Science Education, 48, 1273-1296. doi:10.1007/s11165-016-9602-2
Turner, D. A. (2019). What Is Comparative Education? In A. W. Wiseman (Ed.), Annual Review of Comparative and International Education 2018 (Vol. 37, pp. 99-114): Emerald Publishing Limited.
Yesgat, D., & Yibeltal, J. (2021). Assessment of Chemistry Laboratory Equipment Availability and Practice: A Comparative Study Between Damot and Jiga Secondary Schools. Journal of Chemistry: Education Research and Practice, 5(2), 95-103.

Notes:

¹ 'Imply' as Yesgat and Yibeltal do not actually state that they have carried out comparative education. However, if they do not think so, then the paragraph on comparative education in their introduction has no clear relationship with the rest of the study and is not more than a gratuitous reference, like suddenly mentioning Nottingham Forest's European Cup triumphs or noting a preferred flavour of tea.

² This seemed an intriguing segment of the text as it was largely written in a more sophisticated form of English than the rest of the paper, apart from the odd reference to "Most compartivest [comparative education specialists?] states…" which seemed to stand out from the rest of the segment. Yesgat and Yibeltal do not present this as a quote, but cite a source informing their text (their reference [4]: Joubish, 2009). However, their text is very similar to that in another publication:

Quote from Mbozi, 2017, p.21	Quote from Yesgat and Yibeltal, 2021, pp.95-96
"One purpose of comparative education is to stimulate critical reflection about our educational system, its success and failures, strengths and weaknesses.	"One purpose of comparative education is to stimulate critical reflection about our educational system, its success and failures, strengths and weaknesses.
This critical reflection facilitates self-evaluation of our work and is the basis for determining appropriate courses of action.	This critical reflection facilitates self-evaluation of our work and is the basis for determining appropriate courses of action.
Another purpose of comparative education is to expose us to educational innovations and systems that have positive outcomes.	Another purpose of comparative education is to expose us to educational innovations and systems that have positive outcomes.
The exposure facilitates our adoption of best practices.
Some purposes of comparative education were not covered in your exercise above.
Purposes of comparative education suggested by two authors Noah (1985) and Kidd (1975) are presented below to broaden your understanding of the purposes of comparative education.
Noah, (1985) states that comparative education has four main purposes [4] and these are:	Most compartivest states that comparative education has four main purposes. These are:
1. To describe educational systems, processes or outcomes	• To describe educational systems, processes or outcomes
2. To assist in development of educational institutions and practices	• To assist in development of educational institutions and practices
3. To highlight the relationship between education and society	• To highlight the relationship between education and society
4. To establish generalized statements about education, that are valid in more than one country."	• To establish generalized statements about education that are valid in more than one country"

Comparing text (broken into sentences to aid comparison) from two sources

³ There are more sophisticated techniques which can be used to check whether items do 'cluster' as expected for a particular sample of respondents.

⁴ As suggested above, researchers can pilot instruments with interviews or 'think aloud' protocols to check if items are understood as intended. Asking assumed experts to read through and check 'face validity' is of itself quite a limited process, but can be a useful initial screen to identify items of dubious relevance.

Nothing random about a proper scientific evaluation?

Keith S. Taber

I heard about an experiment comparing home-based working with office-based working on the radio today (BBC Radio 4 – Positive Thinking: Curing Our Productivity Problem, https://www.bbc.co.uk/sounds/play/m000kgsb). This was a randomised controlled trial (RCT). An RCT is, it was explained, "a proper scientific evaluation". The RCT is indeed considered to be the rigorous way of testing an idea in the social sciences (see Experimental research into teaching innovations: responding to methodological and ethical challenges).

Randomisation in RCTs

As the name suggests, a key element of an RCT is randomisation. This can occur at two levels. Firstly, research often involves selecting a sample from a larger population, and ideally one selects the sample at random from the population (so every member of the wider population has exactly the same chance of being selected for the sample), so that it can be assumed that what is found with the sample is likely to reflect what would have occurred had the entire population been participating in the experiment. This can be very difficult to organise.

More critically though, it is most important that the people in the sample each have an equal chance of being assigned to each of the conditions. So, in the simplest case there will be two conditions (e.g., here working at home most workdays vs. working in the office each workday) and each person will be assigned in such a way that they have just as much chance of being in one condition as anyone else. We do not put the cleverest, the most industrious, the tallest, the fastest, the funniest, etcetera, in one group – rather, we randomise.

If we are randomising, there should be no systematic difference between the people in each condition. That is, we should not be able to use any kind of algorithm to predict who will be in each condition because assignments are made randomly – in effect, according to 'chance'. So, if we examine the composition of the two groups, there is unlikely to be any systematic pattern that distinguishes the two groups.

Two groups – with elements not selected at random (Image by hjrivas from Pixabay)

Now some scientists might suspect that nothing happens by chance – that if we could know the precise position and momentum of every particle in the universe (contra Heisenberg) … perhaps even that probabilistic effects found in quantum mechanics follow patterns due to hidden variables we have not yet uncovered…

How can we randomise?

Even if that is not so, it is clear that many ways we use to randomise may be deterministic at some level (when we throw a die, how it lands depends upon physical factors that could in principle, even if not easily in practice, be controlled) but that does not matter if that level is far enough from human comprehension or manipulation. We seek, at least, a quasi-randomisation (we throw dice; we mix up numbered balls in a bag, and then remove them one at a time 'blind'; we flip a coin for each name as we go down a list, until we have a full group for one condition; we consult a table of 'random' numbers; whatever), that is in effect random in the sense that the researchers could never know in advance who would end up in each condition.

When I was a journal editor it became clear to me that claims of randomisation reported in submitted research reports are often actually false, even if inadvertently so (see: Non-random thoughts about research). A common 'give away' here is when you ask the authors of a report how they carried out the randomisation. If they are completely at a loss to answer, beyond repeating 'we chose randomly', then it is quite likely not truly random.

To randomise, one needs to adopt a technique: if one has not adopted a randomisation technique, then one has used a non-random method of assignment. Asking the more confident, more willing, more experienced, less conservative, etc., teacher to teach the innovation condition is not random. For that matter, asking the first teacher one meets in the staffroom is arbitrary and not really random, even if it may feel as if it is.
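By way of illustration, here is one such technique sketched in Python: the software equivalent of mixing numbered balls in a bag and drawing them out 'blind'. The function name, the condition labels and the volunteer names are all hypothetical, and the pseudo-random generator is itself deterministic at some level, in the same sense as the die discussed above.

```python
import random


def randomly_assign(participants, conditions=("home", "office"), seed=None):
    """Shuffle the participants, then deal them out to the conditions in turn."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy, so the original list is untouched
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for position, person in enumerate(shuffled):
        groups[conditions[position % len(conditions)]].append(person)
    return groups


# HYPOTHETICAL volunteers.
volunteers = ["Ana", "Ben", "Chen", "Dee", "Eli", "Fay", "Gus", "Hui"]
groups = randomly_assign(volunteers, seed=42)
print(groups)
```

A researcher who assigned participants this way could answer the 'how did you randomise?' question in a sentence, and reporting the seed would let anyone else reproduce the allocation.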

…they were randomised, by even and odd birthdates…

The study I was hearing about on the radio was the work of Stanford Professor Nick Bloom, who explained how the 'randomisation' occurred:

"…for those volunteers, they were randomised, by even and odd birth dates, so anyone with an even birth date, if you were born on like the 2nd, the 4th, the 6th, the 8th, etcetera, of the month, you get to work at home for four out of five days a week, for the next nine months, and if you are odd like, I'm the 5th May, you had to stay in the office for the next nine months…"
Professor Nick Bloom interviewed on Positive Thinking: Curing Our Productivity Problem

So, by my definition, that is not randomisation at all – it is totally deterministic. I would necessarily have been in the working at home condition, with zero possibility of being in the office working condition. If this had been random there would have been a 50:50 chance of Prof. Bloom and myself being assigned to the same condition – but with the non-random, systematic assignment used it was certain that we would have ended up in different conditions. So, this was an RCT without randomisation, but rather a completely systematic assignment to conditions.
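The contrast can be made concrete with a short Python sketch. The function names are mine, and the independent coin flip per person is just one assumed way of assigning at random. The birth-date rule is a fixed function of the day of the month, so anyone could predict the allocation in advance; the coin-flip simulation estimates how often two given people would share a condition if assignment were left to chance.

```python
import random


def assign_by_birth_day(day_of_month):
    """The rule described in the interview: even days at home, odd days in the office."""
    return "home" if day_of_month % 2 == 0 else "office"


def chance_of_sharing_condition(trials=100_000, seed=0):
    """Estimate how often two people land together under independent coin flips."""
    rng = random.Random(seed)
    together = 0
    for _ in range(trials):
        first = rng.choice(("home", "office"))
        second = rng.choice(("home", "office"))
        together += first == second
    return together / trials


# Someone born on the 5th is always 'office'; someone born on an even day
# (the 2nd is used here purely as an example) is always 'home'.
print(assign_by_birth_day(5), assign_by_birth_day(2))
print(f"coin flips: together in about {chance_of_sharing_condition():.0%} of trials")
```

Under the birth-date rule the first line of output never varies, however many times the study is 'run'; under the coin flips the two people end up together in roughly half the trials.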

This raises some questions.

Is it likely that a professor of economics does not understand randomisation?
Does it really matter?

Interestingly, I see from Prof. Bloom's website that one "area of [his] research is on the causes and consequences of uncertainty", so I suspect he actually understands randomisation very well. Presumably, Prof. Bloom knows that strictly there was no randomisation in this experiment, but is confident that it does not matter here.

Had Prof. Bloom assigned the volunteers to conditions depending on whether they were born before or after midnight on the 31st December 1989, this clearly would have introduced a major confounding variable. Had he assigned those born in March to August to one condition and those born in September to February to the other, say, this might have been considered to undermine the research as it is quite conceivable that the time of year people were gestated, and born, and had to survive the first months of life, might well be a factor that makes a difference to work effectiveness later.

Even if we had no strong evidence to believe this would be so, any systematic difference where we might conjecture some mechanism that could have an effect has to be considered a potential confound that undermines confidence in the results of an RCT. Any difference found could be due to something other (e.g., greater thriving of Summer babies) than the intended difference in conditions; any failure to find an effect might mean that a real effect (e.g., home working being more efficient than office working) is being masked by the confounding variable (e.g., season of birth).

It does not seem conceivable that even and odd birth dates could have any effect (and this assignment is much easier to organise than actually going through the process of randomisation when dealing with a large number of study participants). So, in practice, it probably does not matter here. It seems very unlikely this could undermine Prof. Bloom's conclusions. Yet, in principle, we randomise in part because we are not sure which variables will, or will not, be relevant, and so we seek to avoid any systematic basis for assigning participants to conditions. And given the liberties I've seen some other researchers take when they think they are making random choices, my instinct is to want to see an RCT where there is actual randomisation.