Quasi-experiment or crazy experiment?

Trustworthy research findings are conditional on getting a lot of things right


Keith S. Taber


A good many experimental educational research studies that compare treatments across two classes or two schools are subject to potentially conflating variables that invalidate study findings and make any consequent conclusions and recommendations untrustworthy.

I was looking for research into the effectiveness of P-O-E (predict-observe-explain) pedagogy, a teaching technique that is believed to help challenge learners' alternative conceptions and support conceptual change.

Read about the predict-observe-explain approach



One of the papers I came across reported identifying, and then using P-O-E to respond to, students' alternative conceptions. The authors reported that

The pre-test revealed a number of misconceptions held by learners in both groups: learners believed that salts 'disappear' when dissolved in water (37% of the responses in the 80% from the pre-test) and that salt 'melts' when dissolved in water (27% of the responses in the 80% from the pre-test).

Kibirige, Osodo & Tlala, 2014, p.302

The references to "in the 80%" did not seem to be explained anywhere. Perhaps only 80% of students responded to the open-ended questions included as part of the assessment instrument (discussed below), so the authors gave the incidence as a proportion of those responding? Ideally, research reports are explicit about such matters avoiding the need for readers to speculate.

The authors concluded from their research that

"This study revealed that the use of POE strategy has a positive effect on learners' misconceptions about dissolved salts. As a result of this strategy, learners were able to overcome their initial misconceptions and improved on their performance….The implication of these results is that science educators, curriculum developers, and textbook writers should work together to include elements of POE in the curriculum as a model for conceptual change in teaching science in schools."

Kibirige, Osodo & Tlala, 2014, p.305

This seemed pretty positive. As P-O-E is an approach which is consistent with 'constructivist' thinking that recognises the importance of engaging with learners' existing thinking I am probably biased towards accepting such conclusions. I would expect techniques such as P-O-E, when applied carefully in suitable curriculum contexts, to be effective.

Read about constructivist pedagogy

Yet I also have a background in teaching research methods and in acting as a journal editor and reviewer – so I am not going to trust the conclusion of a research study without having a look at the research design.


All research findings are subject to caveats and provisos: good practice in research writing is for the authors to discuss them – but often they are left unmentioned for readers to spot. (Read about drawing conclusions from studies)


Kibirige and colleagues describe their study as a quasi-experiment.

Experimental research into teaching approaches

If one wants to see if a teaching approach is effective, then it seems obvious that one needs to do an experiment. If we can experimentally compare different teaching approaches we can find out which are more effective.

An experiment allows us to make a fair comparison by 'control of variables'.

Read about experimental research

Put very simply, the approach might be:

  • Identify a representative sample of an identified population
  • Randomly assign learners in the sample to either an experimental condition or a control condition
  • Set up two conditions that are alike in all relevant ways, apart from the independent variable of interest
  • After the treatments, apply a valid instrument to measure learning outcomes
  • Use inferential statistics to see if any difference in outcomes across the two conditions reaches statistical significance
  • If it does, conclude that
    • the effect is likely to due to the difference in treatments
    • and will apply, on average, to the population that has been sampled

Now, I expect anyone reading this who has worked in schools, and certainly anyone with experience in social research (such as research into teaching and learning), will immediately recognise that in practice it is very difficult to actually set up an experiment into teaching which fits this description.

Nearly always (if indeed not always!) experiments to test teaching approaches fall short of this ideal model to some extent. This does not mean such studies can not be useful – especially where there are many of them with compensatory strengths and weaknesses offering similar findings (Taber, 2019a)- but one needs to ask how closely published studies fit the ideal of a good experiment. Work in high quality journals is often expected to offer readers guidance on this, but readers should check for themselves to see if they find a study convincing.

So, how convincing do I find this study by Kibirige and colleagues?

The sample and the population

If one wishes a study to be informative about a population (say, chemistry teachers in the UK; or 11-12 year-olds in state schools in Western Australia; or pharmacy undergraduates in the EU; or whatever) then it is important to either include the full population in the study (which is usually only feasible when the population is a very limited one, such as graduate students in a single university department) or to ensure the sample is representative.

Read about populations of interest in research

Read about sampling a population

Kibirige and colleagues refer to their participants as a sample

"The sample consisted of 93 Grade 10 Physical Sciences learners from two neighbouring schools (coded as A and B) in a rural setting in Moutse West circuit in Limpopo Province, South Africa. The ages of the learners ranged from 16 to 20 years…The learners were purposively sampled."

Kibirige, Osodo & Tlala, 2014, p.302

Purposive sampling means selecting participants according to some specific criteria, rather than sampling a population randomly. It is not entirely clear precisely what the authors mean by this here – which characteristics they selected for. Also, there is no statement of the population being sampled – so the reader is left to guess what population the sample is a sample of. Perhaps "Grade 10 Physical Sciences" students – but, if so, universally, or in South Africa, or just within Limpopo Province, or indeed just the Moutse West circuit? Strictly the notion of a sample is meaningless without reference to the population being sampled.

A quasi-experiment

A key notion in experimental research is the unit of analysis

"An experiment may, for example, be comparing outcomes between different learners, different classes, different year groups, or different schools…It is important at the outset of an experimental study to clarify what the unit of analysis is, and this should be explicit in research reports so that readers are aware what is being compared."

Taber, 2019a, p.72

In a true experiment the 'units of analysis' (which in different studies may be learners, teachers, classes, schools, exam. papers, lessons, textbook chapters, etc.) are randomly assigned to conditions. Random assignment allows inferential statistics to be used to directly compare measures made in the different conditions to determine whether outcomes are statistically significant. Random assignment is a way of making systematic differences between groups unlikely (and so allows the use of inferential statistics to draw meaningful conclusions).

Random assignment is sometimes possible in educational research, but often researchers are only able to work with existing groupings.

Kibirige, Osodo & Tlala describe their approach as using a quasi-experimental design as they could not assign learners to groups, but only compare between learners in two schools. This is important, as means that the 'units of analysis' are not the individual learners, but the groups: in this study one group of students in one school (n=1) is being compared with another group of students in a different school (n=1).

The authors do not make it clear whether they assigned the schools to the two teaching conditions randomly – or whether some other criterion was used. For example, if they chose school A to be the experimental school because they knew the chemistry teacher in the school was highly skilled, always looking to improve her teaching, and open to new approaches; whereas the chemistry teacher in school B had a reputation for wishing to avoid doing more than was needed to be judged competent – that would immediately invalidate the study.

Compensating for not using random assignment

When it is not possible to randomly assign learners to treatments, researchers can (a) use statistics that take into account measurements on each group made before, as well as after, the treatments (that is, a pre-test – post-test design); (b) offer evidence to persuade readers that the groups are equivalent before the experiment. Kibirige, Osodo and Tlala seek to use both of these steps.

Do the groups start as equivalent?

Kibirige, Osodo and Tlala present evidence from the pre-test to suggest that the learners in the two groups are starting at about the same level. In practice, pre-tests seldom lead to identical outcomes for different groups. It is therefore common to use inferential statistics to test for whether there is a statistically significant difference between pre-test scores in the groups. That could be reasonable, if there was an agreed criterion for deciding just how close scores should be to be seen as equivalent. In practice, many researchers only check that the differences do not reach statistical significance at the level of probability <0.05: that it they look to see if there are strong differences, and, if not, declare this is (or implicitly treat this as) equivalence!

This is clearly an inadequate measure of equivalence as it will only filter out cases where there is a difference so large it is found to be very unlikely to be a chance effect.


If we want to make sure groups start as 'equivalent', we cannot simply look to exclude the most blatant differences. (Original image by mcmurryjulie from Pixabay)

See 'Testing for initial equivalence'


We can see this in the Kibirige and colleagues' study where the researchers list mean scores and standard deviations for each question on the pre-test. They report that:

"The results (Table 1) reveal that there was no significant difference between the pre-test achievement scores of the CG [control group] and EG [experimental group] for questions (Appendix 2). The p value for these questions was greater than 0.05."

Kibirige, Osodo & Tlala, 2014, p.302

Now this paper is published "licensed under Creative Commons Attribution 3.0 License" which means I am free to copy from it here.



According to the results table, several of the items (1.2, 1.4, 2.6) did lead to statistically significantly different response patterns in the two groups.

Most of these questions (1.1-1.4; 2.1-2.8; discussed below) are objective questions, so although no marking scheme was included in the paper, it seems they were marked as correct or incorrect.

So, let's take as an example question 2.5 where readers are told that there was no statistically significant difference in the responses of the two groups. The mean score in the control group was 0.41, and in the experimental group was 0.27. Now, the paper reports that:

"Forty nine (49) learners (31 males and 18 females) were from school A and acted as the experimental group (EG) whereas the control group (CG) consisted of 44 learners (18 males and 26 females) from school B."

Kibirige, Osodo & Tlala, 2014, p.302

So, according to my maths,


Correct responsesIncorrect responses
School A (49 students)(0.27 ➾) 1336
School B (44 students)(0.41 ➾) 1826
pre-test results for an item with no statistically significant difference between groups

"The achievement of the EG and CG from pre-test results were not significantly different which suggest that the two groups had similar understanding of concepts" (p.305).
Pre-test results for an item with no statistically significant difference between groups (offered as evidence of 'similar' levels of initial understanding in the two groups)

While, technically, there may have been no statistically significant difference here, I think inspection is sufficient to suggest this does not mean the two groups were initially equivalent in terms of performance on this item.


Data that is normally distributed falls on a 'bell-shaped' curve

(Image by mcmurryjulie from Pixabay)


Inspection of this graphic also highlights something else. Student's t-test (used by the authors to produce the results in their table 1), is a parametric test. That means it can only be used when the data fit certain criteria. The data sample should be randomly selected (not true here) and normally distributed. A normal distribution means data is distributed in a bell-shaped Gaussian curve (as in the image in the blue circle above).If Kibirige, Osodo & Tlala were applying the t-test to data distributed as in my graphic above (a binary distribution where answers were either right or wrong) then the test was invalid.

So, to summarise, the authors suggest there "was no significant difference between the pre-test achievement scores of the CG and EG for questions", although sometimes there was (according to their table); and they used the wrong test to check for this; and in any case lack of statistical significance is not a sufficient test for equivalence.

I should note that the journal does claim to use peer review to evaluate submissions to see if they are ready for publication!

Comparing learning gains between the two groups

At one level equivalence might not be so important, as the authors used an ANCOVA (Analysis of Covariance) test which tests for difference at post-test taking into account the pre-test. Yet this test also has assumptions that need to be tested for and met, but here seem to have just been assumed.

However, to return to an even more substantive point I made earlier, as the learners were not randomly assigned to the two different conditions /treatments, what should be compared are the two school-based groups (i.e., the unit of analysis should be the school group) but that (i.e., a sample of 1 class, rather than 40+ learners, in each condition) would not facilitate using inferential statistics to make a comparison. So, although the authors conclude

"that the achievement of the EG [taking n=49] after treatment (mean 34. 07 ± 15. 12 SD) was higher than the CG [taking n =44] (mean 20. 87 ± 12. 31 SD). These means were significantly different"

Kibirige, Osodo & Tlala, 2014, p.303

the statistics are testing the outcomes as if 49 units independently experienced one teaching approach and 44 independently experienced another. Now, I do not claim to be a statistics expert, and I am aware that most researchers only have a limited appreciation of how and why stats. tests work. For most readers, then, a more convincing argument may be made by focussing on the control of variables.

Controlling variables in educational experiments

The ability to control variables is a key feature of laboratory science, and is critical to experimental tests. Control of variables, even identification of relevant variables, is much more challenging outside of a laboratory in social contexts – such as schools.

In the case of Kibirige, Osodo & Tlala's study, we can set out the overall experimental design as follows


Independent
variable
Teaching approach:
predict-observe-explain (experimental)
– lectures (comparison condition)
Dependent
variable
Learning gains
Controlled
variable(s)
Anything other than teaching approach which might make a difference to student learning
Variables in Kibirige, Osodo & Tlala's study

The researchers set up the two teaching conditions, measure learning gains, and need to make sure any other factors which might have an effect on learning outcomes, so called confounding variables, are controlled so the same in both conditions.

Read about confounding variables in research

Of course, we cannot be sure what might act as a confounding variable, so in practice we may miss something which we do not recognise is having an effect. Here are some possibilities based on my own (now dimly recalled) experience of teaching in school.

The room may make a difference. Some rooms are

  • spacious,
  • airy,
  • well illuminated,
  • well equipped,
  • away from noisy distractions
  • arranged so everyone can see the front, and the teacher can easily move around the room

Some rooms have

  • comfortable seating,
  • a well positioned board,
  • good acoustics

Others, not so.

The timetable might make a difference. Anyone who has ever taught the same class of students at different times in the week might (will?) have noticed that a Tuesday morning lesson and a Friday afternoon lesson are not always equally productive.

Class size may make a difference (here 49 versus 44).

Could gender composition make a difference? Perhaps it was just me, but I seem to recall that classes of mainly female adolescents had a different nature than classes of mainly male adolescents. (And perhaps the way I experienced those classes would have been different if I had been a female teacher?) Kibirige, Osodo and Tlala report the sex of the students, but assuming that can be taken as a proxy for gender, the gender ratios were somewhat different in the two classes.


The gender make up of the classes was quite different: might that influence learning?

School differences

A potentially major conflating variable is school. In this study the researchers report that the schools were "neighbouring" and that

Having been drawn from the same geographical set up, the learners were of the same socio-cultural practices.

Kibirige, Osodo & Tlala, 2014, p.302

That clearly makes more sense than choosing two schools from different places with different demographics. But anyone who has worked in schools will know that two neighbouring schools serving much the same community can still be very different. Different ethos, different norms, and often different levels of outcome. Schools A and B may be very similar (but the reader has no way to know), but when comparing between groups in different schools it is clear that school could be a key factor in group outcome.

The teacher effect

Similar points can be made about teachers – they are all different! Does ANY teacher really believe that one can swap one teacher for another without making a difference? Kibirige, Osodo and Tlala do not tell readers anything about the teachers, but as students were taught in their own schools the default assumption must be that they were taught by their assigned class teachers.

Teachers vary in terms of

  • skill,
  • experience,
  • confidence,
  • enthusiasm,
  • subject knowledge,
  • empathy levels,
  • insight into their students,
  • rapport with classes,
  • beliefs about teaching and learning,
  • teaching style,
  • disciplinary approach
  • expectations of students

The same teacher may perform at different levels with different classes (preferring to work with different grade levels, or simply getting on/not getting on with particular classes). Teachers may have uneven performance across topics. Teachers differentially engage with and excel in different teaching approaches. (Even if the same teacher had taught both groups we could not assume they were equally skilful in both teaching conditions.)

Teacher variable is likely to be a major difference between groups.

Meta-effects

Another conflating factor is the very fact of the research itself. Students may welcome a different approach because it is novel and a change from the usual diet (or alternatively they may be nervous about things being done differently) – but such 'novelty' effects would disappear once the new way of doing things became established as normal. In which case, it would be an effect of the research itself and not of what is being researched.

Perhaps even more powerful are expectancy effects. If researchers expect an innovation to improve matters, then these expectations get communicated to those involved in the research and can themselves have an affect. Expectancy effects are so well demonstrated that in medical research double-blind protocols are used so that neither patients nor health professionals they directly engage with in the study know who is getting which treatment.

Read about expectancy effects in research

So, we might revise the table above:


Independent
variable
Teaching approach:
predict-observe-explain (experimental)
– lectures (comparison condition)
Dependent
variable
Learning gains
Potentially conflating
variables
School effect
Teacher effect
Class size
Gender composition of teaching groups
Relative novelty of the two teaching approaches
Variables in Kibirige, Osodo & Tlala's study

Now, of course, these problems are not unique to this particular study. The only way to respond to teacher and school effects of this kind is to do large scale studies, and randomly assign a large enough number of schools and teachers to the different conditions so that it becomes very unlikely there will be systematic differences between treatment groups.

A good many experimental educational research studies that compare treatments across two classes or two schools are subject to potentially conflating variables that invalidate study findings and make any consequent conclusions and recommendations untrustworthy (Taber, 2019a). Strangely, often this does not seem to preclude publication in research journals. 1

Advice on controls in scientific investigations:

I can probably do no better than to share some advice given to both researchers, and readers of research papers, in an immunology textbook from 1910:

"I cannot impress upon you strongly enough never to operate without the necessary controls. You will thus protect yourself against grave errors and faulty diagnoses, to which even the most competent investigator may be liable if he [or she] fails to carry out adequate controls. This applies above all when you perform independent scientific investigations or seek to assess them. Work done without the controls necessary to eliminate all possible errors, even unlikely ones, permits no scientific conclusions.

I have made it a rule, and would advise you to do the same, to look at the controls listed before you read any new scientific papers… If the controls are inadequate, the value of the work will be very poor, irrespective of its substance, because none of the data, although they may be correct, are necessarily so."

Julius Citron

The comparison condition

It seems clear that in this study there is no strict 'control' of variables, and the 'control' group is better considered just a comparison group. The authors tell us that:

"the control group (CG) taught using traditional methods…

the CG used the traditional lecture method"

Kibirige, Osodo & Tlala, 2014, pp.300, 302

This is not further explained, but if this really was teaching by 'lecturing' then that is not a suitable approach for teaching school age learners.

This raises two issues.

There is a lot of evidence that a range of active learning approaches (discussion work, laboratory work, various kinds of group work) engages and motivates students more than whole lessons spent listening to a teacher. Therefore any approach which basically involves a mixture of students doing things, discussing things, engaging with manipulatives and resources as well as listening to a teacher, tends to be superior to just being lectured. Good science teaching normally involves lessons sequenced into a series of connected episodes involving different types of student activity (Taber, 2019b). Teacher presentations of the target scientific account are very important, but tend to be effective when embedded in a dialogic approach that allows students to explore their own thinking and takes into account their starting points.

So, comparing P-O-E with lectures (if they really were lectures) may not tell researchers much about P-O-E specifically, as a teaching approach. A better test would compare P-O-E with some other approach known to be engaging.

"Many published studies argue that the innovation being tested has the potential to be more effective than current standard teaching practice, and seek to demonstrate this by comparing an innovative treatment with existing practice that is not seen as especially effective. This seems logical where the likely effectiveness of the innovation being tested is genuinely uncertain, and the 'standard' provision is the only available comparison. However, often these studies are carried out in contexts where the advantages of a range of innovative approaches have already been well demonstrated, in which case it would be more informative to test the innovation that is the focus of the study against some other approach already shown to be effective."

Taber, 2019a, p.93

The second issue is more ethical than methodological. Sometimes in published studies (and I am not claiming I know this happened here, as the paper says so little about the comparison condition) researchers seem to deliberately set up a comparison condition they have good reason to expect is not effective: such as asking a teacher to lecture and not include practical work or discussion work or use of digital learning technologies and so forth. Potentially the researchers are asking the teacher of the 'control' group to teach less effectively than normally to bias the experiment towards their preferred outcome (Taber, 2019a).

This is not only a failure to do good science, but also an abuse of those learners being deliberately subjected to poor teaching. Perhaps in this study the class in School B was habitually taught by being lectured at, so the comparison condition was just what would have occurred in the absence of the research, but this is always a worry when studies report comparison conditions that seem to deliberately disadvantage students. (This paper does not seem to report anything about obtaining voluntary informed consent from participants, nor indeed about how access to the schools was negotiated. )

"In most educational research experiments of the type discussed in this article, potential harm is likely to be limited to subjecting students (and teachers) to conditions where teaching may be less effective, and perhaps demotivating…It can also potentially occur in control conditions if students are subjected to teaching inputs of low effectiveness when better alternatives were available. This may be judged only a modest level of harm, but – given that the whole purpose of experiments to test teaching innovations is to facilitate improvements in teaching effectiveness – this possibility should be taken seriously."

Taber, 2019a, p.94

Validity of measurements

Even leaving aside all the concerns expressed above, the results of a study of this kind depends upon valid measurements. Assessment items must test what they claim to test, and their analysis should be subject to quality control (and preferably blind to which condition a script being analysed derives form). Kibirige, Osodo and Tlala append the test they used in the study (Appendix 2, pp.309-310), which is very helpful in allowing readers to judge at least its face validity. Unfortunately, they do not include a mark/analysis scheme to show what they considered responses worthy of credit.

"The [Achievement Test] consisted of three questions. Question one consisted of five statements which learners had to classify as either true or false. Question two consisted of nine [sic, actually eight] multiple questions which were used as a diagnostic tool in the design of the teaching and learning materials in addressing misconceptions based on prior knowledge. Question three had two open-ended questions to reveal learners' views on how salts dissolve in water (Appendix 1 [sic, 2])."

Kibirige, Osodo & Tlala, 2014, p.302

"Question one consisted of five statements which learners had to classify as either true or false."

Question 1 is fairly straightforward.

1.2: Strictly all salts do dissolve in water to some extent. I expect that students were taught that some salts are insoluble. Often in teaching we start with simple dichotomous models (metal-non metal; ionic-covalent; soluble-insoluble; reversible – irreversible) and then develop these to more continuous accounts that recognise difference of degree. It is possible here then that a student who had learnt that all salts are soluble to some extent might have been disadvantaged by giving the 'wrong' ('True') response…

…although[sic] , actually, there is perhaps no excuse for answering 'True' ('All salts can dissolve in water') here as a later question begins "3.2. Some salts does [sic] not dissolve in water. In your own view what happens when a salt do [sic] not dissolve in water".

Despite the test actually telling students the answer to this item, it seems only 55% of the experimental group, and 23% of the control group obtained the correct answer on the post test – precisely the same proportions as on the pre-test!



1.4: Seems to be 'False' as the ions exist in the salt and are not formed when it goes into solution. However, I am not sure if that nuance of wording is intended in the question.

Question 2 gets more interesting.


"Question two consisted of nine multiple questions" (seven shown here)

I immediately got stuck on question 2.2 which asked which formula (singular, not 'formula/formulae', note) represented a salt. Surely, they are all salts?

I had the same problem on 2.4 which seemed to offer three salts that could be formed by reacting acid with base. Were students allowed to give multiple responses? Did they have to give all the correct options to score?

Again, 2.5 offered three salts which could all be made by direct reaction of 'some substances'. (As a student I might have answered A assuming the teacher meant to ask about direct combination of the elements?)

At least in 2.6 there only seemed to be two correct responses to choose between.

Any student unsure of the correct answer in 2.7 might have taken guidance from the charges as shown in the equation given in question 2.8 (although indicated as 2.9).

How I wished they had provided the mark scheme.



The final question in this section asked students to select one of three diagrams to show what happens when a 'mixture' of H2O and NaCl in a closed container 'react'. (In chemistry, we do not usually consider salt dissolving as a reaction.)

Diagram B seemed to show ion pairs in solution (but why the different form of representation?) Option C did not look convincing as the chloride ions had altogether vanished from the scene and sodium seemed to have formed multiple bonds with oxygen and hydrogens.

So, by a process of elimination, the answer is surely A.

  • But components seem to be labelled Na and Cl (not as ions).
  • And the image does not seem to represent a solution as there is much too much space between the species present.
  • And in salt solution there are many water molecules between solvated ions – missing here.
  • And the figure seems to show two water molecules have broken up, not to give hydrogen and hydroxide ions, but lone oxygen (atoms, ions?)
  • And why is the chlorine shown to be so much larger in solution than it was in the salt? (If this is meant to be an atom, it should be smaller than the ion, not larger. The real mystery is why the chloride ions are shown so much smaller than smaller sodium ions before salvation occurs when chloride ions have about double the radii of sodium ions.)

So diagram A is incredible, but still not quite as crazy an option as B and C.

This is all despite

"For face validity, three Physical Sciences experts (two Physical Sciences educators and one researcher) examined the instruments with specific reference to Mpofu's (2006) criteria: suitability of the language used to the targeted group; structure and clarity of the questions; and checked if the content was relevant to what would be measured. For reliability, the instruments were piloted over a period of two weeks. Grade 10 learners of a school which was not part of the sample was used. Any questions that were not clear were changed to reduce ambiguity."

Kibirige, Osodo & Tlala, 2014, p.302

One wonders what the less clear, more ambiguous, versions of the test items were.

Reducing 'misconceptions'

The final question was (or, perhaps better, questions were) open-ended.



I assume (again, it would be good for authors of research reports to make such things explicit) these were the questions that led to claims about the identified alternative conceptions at pre-test.

"The pre-test revealed a number of misconceptions held by learners in both groups: learners believed that salts 'disappear' when dissolved in water (37% of the responses in the 80% from the pre-test) and that salt 'melts' when dissolved in water (27% of the responses in the 80% from the pre-test)."

Kibirige, Osodo & Tlala, 2014, p.302

As the first two (sets of) questions only admit objective scoring, it seems that this data can only have come from responses to Q3. This means that the authors cannot be sure how students are using terms. 'Melt' is often used in an everyday, metaphorical, sense of 'melting away'. This use of language should be addressed, but it may not be a conceptual error

As the first two (sets of) questions only admit objective scoring, it seems that this data can only have come from responses to Q3. This means that the authors cannot be sure how students are using terms. 'Melt' is often used in an everyday, metaphorical, sense of 'melting away'. This use of language should be addressed, but it may not (for at least some of these learners) be a conceptual error as much as poor use of terminology. .

To say that salts disappear when they dissolve does not seem to me a misconception: they do. To disappear means to no longer be visible, and that's a fair description of the phenomenon of salt dissolving. The authors may assume that if learners use the term 'disappear' they mean the salt is no longer present, but literally they are only claiming it is not directly visible.

Unfortunately, the authors tell us nothing about how they analysed the data collected form their test, so the reader has no basis for knowing how they interpreted student responded to arrive at their findings. The authors do tell us, however, that:

"the intervention had a positive effect on the understanding of concepts dealing with dissolving of salts. This improved achievement was due to the impact of POE strategy which reduced learners' misconceptions regarding dissolving of salts"

Kibirige, Osodo & Tlala, 2014, p.305

Yet, oddly, they offer no specific basis for this claim – no figures to show the level at which "learners believed that salts 'disappear' when dissolved in water …and that salt 'melts' when dissolved in water" in either group at the post-test.


'disappear' misconception'melt' misconception
pre-test:
experimental group
not reportednot reported
pre-test:
comparison group
not reportednot reported
pre-test:
total
(0.37 x 0.8 x 93 =)
24.5 (!?)
(0.27 x 0.8 x 93 =)
20
post-test:
experimental group
not reportednot reported
post-test:
comparison group
not reportednot reported
post-test:
total
not reportednot reported
Data presented about the numbers of learners considered to hold specific misconceptions said to have been 'reduced' in the experimental condition

It seems journal referees and the editor did not feel some important information was missing here that should be added before publication.

In conclusion

Experiments require control of variables. Experiments require random assignment to conditions. Quasi-experiments, where random assignment is not possible, are inherently weaker studies than true experiments.

Control of variables in educational contexts is often almost impossible.

Studies that compare different teaching approaches using two different classes each taught by a different teacher (and perhaps not even in the same school) can never be considered fair comparisons able to offer generalisable conclusions about the relative merits of the approaches. Such 'experiments' have no value as research studies. 1

Such 'experiments' are like comparing the solubility of two salts by (a) dropping a solid lump of 10g of one salt into some cold water, and (b) stirring a finely powdered 35g sample of the other salt into hot propanol; and watching to see which seems to dissolve better.

Only large scale studies that encompass a wide range of different teachers/schools/classrooms in each condition are likely to produce results that are generalisable.

The use of inferential statistical tests is only worthwhile when the conditions for those statistical tests are met. Sometimes tests are said to be robust to modest deviations from such acquirements as normality. But applying tests to data that do not come close to fitting the conditions of the test is pointless.

Any research is only as trustworthy as the validity of its measurements. If one does not trust the measuring instrument or the analysis of measurement data then one cannot trust the findings and conclusions.


The results of a research study depend on an extended chain of argumentation, where any broken link invalidates the whole chain. (From 'Critical reading of research')

So, although the website for the Mediterranean Journal of Social Science claims "All articles submitted …undergo to a rigorous double blinded peer review process", I think the peer reviewers for this article were either very generous, very ignorant, or simply very lazy. That may seem harsh, but peer review is meant to help authors improve submissions till they are worthy of appearing in the literature, and here peer review has failed, and the authors (and readers of the journal) have been let down by the reviewers and the editor who ultimately decided this study was publishable in this form.

If I asked a graduate student (or indeed an undergraduate student) to evaluate this paper, I would expect to see a response something along these sorts of lines:


Applying the 'Critical Reading of Empirical Studies Tool' to 'The effect of predict-observe-explain strategy on learners' misconceptions about dissolved salts'

I still think P-O-E is a very valuable part of the science teacher's repertoire – but this paper can not contribute anything to support to that view.

Work cited:

Note

1 A lot of these invalid experiments get submitted to research journals, scrutinised by editors and journal referees, and then get published without any acknowledgement of how they fall short of meeting the conditions for a valid experiment. (See, for example, examples discussed in Taber 2019a.) It is as if the mystique of experiment is so great that even studies with invalid conclusions are considered worth publishing as long as the authors did an experiment.

Author: Keith

Former school and college science teacher, teacher educator, research supervisor, and research methods lecturer. Emeritus Professor of Science Education at the University of Cambridge.

Leave a Reply

Your email address will not be published. Required fields are marked *