Educational experiments – making the best of an unsuitable tool?

Can small-scale experimental investigations of teaching, carried out in a couple of arbitrary classrooms, really tell us anything about how to teach well?


Keith S. Taber


Undertaking valid educational experiments involves (often insurmountable) challenges, but perhaps this grid (shown larger below) might be useful for researchers who do want to carry out genuinely informative experimental studies into teaching?




In recent years I seem to have developed something of a religious fervour about educational research studies of the kind that claim to be experimental evaluations of pedagogies, classroom practices, teaching resources, and the like. I think this all started when, having previously largely undertaken interpretive studies (for example, interviewing learners to find out what they knew and understood about science topics) I became part of a team looking to develop, and experimentally evaluate, classroom pedagogy (i.e., the epiSTEMe project).

As a former school science teacher, I had taught learners about the basis of experimental method (e.g., control of variables) and I had read quite a number of educational research studies based on 'experiments', so I was pretty familiar with the challenges of doing experiments in education. But being part of a project which looked to actually carry out such a study made a real impact on me in this regard. Well, that should not be surprising: there is a difference between watching the European Cup Final on the TV, and actually playing in the match, just as reading a review of a concert in the music press is not going to impact you as much as being on stage performing.

Let me be quite clear: the experimental method is of supreme value in the natural sciences; and, even if not all natural science proceeds that way, it deserves to be an important focus of the science curriculum. Even in science, the experimental strategy has its limitations. 1 But experiment is without doubt a precious and powerful tool in physics and chemistry that has helped us learn a great deal about the natural world. (In biology, too, but even here there are additional complications due to the variations within populations of individuals of a single 'kind'.)

But transferring experimental method from the laboratory to the classroom to test hypotheses about teaching is far from straightforward. Most of the published experimental studies drawing conclusions about matters such as effective pedagogy need to be read with substantive, and sometimes extensive, provisos and caveats; and many of them are simply invalid – they are bad experiments (Taber, 2019). 2

The experiment is a tool that has been designed, and refined, to help us answer questions when:

  • we are dealing with non-sentient entities that are indifferent to outcomes;
  • we are investigating samples or specimens of natural kinds;
  • we can identify all the relevant variables;
  • we can measure the variables of interest;
  • we can control all other variables which could have an effect.

These points simply do not usually apply to classrooms and other learning contexts. 3 (This is clearly so, even if educational researchers often either do not appreciate these differences, or simply pretend they can ignore them.)

Applying experimental method to educational questions is a bit like trying to use a precision jeweller's screwdriver to open a tin of paint: you may get the tin open eventually, but you will probably have deformed the tool in the process whilst making something of a mess of the job.

The reason why experiments are to be preferred to interpretive ('qualitative') studies is that, supposedly, experiments can lead to definite conclusions (by testing hypotheses), whereas studies that rely on the interpretation of data (such as classroom observations, interviews, analysis of classroom talk, etc.) are at best suggestive. This would be a fair point when an experimental study genuinely met the control-of-variables requirements for being a true experiment – although often, even then, to draw generalisable conclusions that apply to a wide population one has to be confident one is working with a random or representative sample, and use inferential statistics, which can only offer a probabilistic conclusion.

My creed…researchers should prefer to undertake competent work

My proselytising about this issue is based on having come to think that:

  • most educational experiments do not fully control relevant variables, so are invalid;
  • educational experiments are usually subject to expectancy effects that can influence outcomes;
  • many (perhaps most) educational experiments have too few independent units of analysis to allow the valid use of inferential statistics;
  • most large-scale educational experiments cannot ensure that samples are fully representative of populations, so strictly cannot be generalised;
  • many experiments are rhetorical studies that deliberately compare a condition (supposedly being tested but actually) assumed to be effective with a teaching condition known to fall short of good teaching practice;
  • an invalid experiment tells us nothing that we can rely upon;
  • a detailed case study of a learning context which offers rich description of teaching and learning potentially offers useful insights;
  • given a choice between undertaking a competent study of a kind that can offer useful insights, and undertaking a bad experiment which cannot provide valid conclusions, researchers should prefer to undertake competent work;
  • what makes work scientific is not the choice of methodology per se, but the adoption of a design that fits the research constraints and offers a genuine opportunity for useful learning.

However, experiments seem very popular in education, and often seem to be the methodology of choice for researchers into pedagogy in science education.

Read: Why do natural scientists tend to make poor social scientists?

This fondness for experiments will no doubt continue, so here are some thoughts on how best to draw useful implications from them.

A guide to using experiments to inform education

It seems there are two very important dimensions that can be used to characterise experimental research into teaching – relating to the scale and focus of the research.


Two dimensions used to characterise experimental studies of teaching


Scale of studies

A large-scale study has a large number 'units of analysis'. So, for example, if the research was testing out the value of using, say, augmented reality in teaching about predator-prey relationships, then in such a study there would need to be a large number of teaching-learning 'units' in the augmented learning condition and a similarly large number of teaching-learning 'units' in the comparison condition. What a unit actually is would vary from study to study. Here a unit might be a sequence of three lessons where a teacher teaches the topic to a class of 15-16 year-old learners (either with, or without, the use of augmented reality).

For units of analysis to be analysed statistically they need to be independent from each other – so different students learning together from the same teacher in the same classroom at the same time are clearly not learning independently of each other. (This seems obvious – but in many published studies this inconvenient fact is ignored as it is 'unhelpful' if researchers wish to use inferential statistics but are only working with a small number of classes. 4)

Read about units of analysis in research

So, a study which compared teaching and learning in two intact classes can usually only be considered to have one unit of analysis in each condition (making statistical tests completely irrelevant 5, though this does not stop them often being applied anyway). There are a great many small-scale studies in the literature where there are only one or a few units in each condition.

Focus of study

The other dimension shown in the figure concerns the focus of a study. By the focus, I mean whether the researchers are interested in teaching and learning in some specific local context, or want to find out about some general population.

Read about what is meant by population in research

Studies may be carried out in a very specific context (e.g., one school; one university programme) or across a wide range of contexts. That seems to simply relate to the scale of the study, just discussed. But by focus I mean whether the research question of interest concerns just a particular teaching and learning context (which may be quite appropriate when practitioner-researchers explore their own professional contexts, for example), or is meant to help us learn about a more general situation.


Local focus: Why does school X get such outstanding science examination scores?
General focus: Is there a relationship between teaching pedagogy employed and science examination results in English schools?

Local focus: Will jig-saw learning be a productive way to teach my A level class about the properties of the transition elements?
General focus: Is jig-saw learning an effective pedagogy for use in A level chemistry classes?
Some hypothetical research questions relating either to a specific teaching context, or a wider population. (n.b. The research literature includes a great many studies that claim to explore general research questions by collecting data in a single specific context.)

If that seems a subtle distinction between two quite similar dimensions, then it is worth noting that the research literature contains a great many studies that take place in one context (small-scale studies) but which claim (implicitly or explicitly) to be of general relevance. So, many authors, peer reviewers, and editors clearly seem to think one can generalise from such small-scale studies.

Generalisation

Generalisation is the ability to draw general conclusions from specific instances. Natural science does this all the time. If this sample of table salt has the formula NaCl, then all samples of table salt do; if the resistance of this copper wire goes up when the wire is heated the same will be found with other specimens as well. This usually works well when dealing with things we think are 'natural kinds' – that is where all the examples (all samples of NaCl, all pure copper wires) have the same essence.

Read about generalisation in research

Education deals with teachers, classes, lessons, schools…social kinds that lack that kind of equivalence across examples. You can swap any two electrons in a structure and it will make absolutely no difference. Does anyone think you can swap the teachers between two classes and safely assume it will not have an effect?

So, by focus I mean whether the point of the research is to find out about the research context in its own right (context-directed research) or to learn something that applies to a general category of phenomena (theory-directed research).

These two dimensions, then, lead to a model with four quadrants.

Large-scale research to learn about the general case

In the top-right quadrant is research which focuses on the general situation and is larger-scale. In principle 6 this type of research can address a question such as 'is this pedagogy (teaching resource, etc.) generally effective in this population', as long as

  • the samples are representative of the wider population of interest, and
  • those sampled are randomly assigned to conditions, and
  • the number of units supports statistical analysis.

The sleight of hand employed in many studies is to select a convenience sample (two classes of thirteen-year-old students at my local school) yet to claim the research is about, and so offers conclusions about, a wider population (thirteen-year-old learners).

Read about some examples of samples used to investigate populations


When an experiment tests a sample drawn at random from a wider population, then the findings of the experiment can be assumed to (probably) apply (on average) to the population. (Taber, 2019)

Even when a population is properly sampled, it is important not to assume that something which has been found to be generally effective in a population will be effective throughout the population. Schools, classes, courses, learners, topics, etc. vary. If it has been found that, say, teaching the reactivity series through enquiry generally works in the population of English classes of 13-14 year-old students, then a teacher of an English class of 13-14 year-old students might sensibly think this is an approach to adopt, but cannot assume it will be effective in her classroom, with a particular group of students.

To implement something that has been shown to generally work might be considered research-based teaching, as long as the approach is dropped or modified if indications are it is not proving effective in this particular context. That is, there is nothing (please note, UK Department for Education, and Ofsted) 'research-based' about continuing with a recommended approach in the face of direct empirical evidence that it is not working in your classroom.

Large-scale research to learn about the range of effectiveness

However, even large-scale studies where there are genuinely sufficient units of analysis for statistical analysis may not logically support the kinds of generalisation in the top-right quadrant. For that, researchers need either a random sample of the full population (seldom viable, given that people and institutions must have a choice to participate or not 7), or a sample which is known to be representative of the population in terms of the relevant characteristics – which means knowing a lot about

  • (i) the population,
  • (ii) the sample, and
  • (iii) which variables might be relevant!

Imagine you wanted to undertake a survey of physics teachers in some national context, and you knew you could not reach all that population so you needed to survey a sample. How could you possibly know that the teachers in your sample were representative of the wider population on whatever variables might potentially be pertinent to the survey (level of qualification?; years of experience?; degree subject?; type of school/college taught in?; gender?…)

But perhaps a large-scale study that attracts a diverse enough sample may still be very useful if it collects sufficient data about the individual units of analysis, and so can begin to look at patterns in how specific local conditions relate to teaching effectiveness. That is, even if the sample cannot be considered representative enough for statistical generalisation to the population, such a study might be able to offer some insights into whether an approach seems to work well in mixed-ability classes, or top sets, or girls' schools, or in areas of high social deprivation, or…
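To make that concrete, here is a minimal sketch (in Python, with entirely invented records and context labels – not data from any real study) of the kind of pattern-spotting such a diverse sample might allow:

```python
# Entirely invented records, for illustration only: looking for patterns in
# outcomes across context types, rather than quoting one pooled 'population' result.
import pandas as pd

# Each row stands for one independent unit (e.g., one class taught in one condition);
# the context labels and learning-gain figures are made up for the sake of the sketch.
units = pd.DataFrame([
    {"context": "mixed-ability class", "condition": "innovation", "gain": 0.42},
    {"context": "mixed-ability class", "condition": "comparison", "gain": 0.31},
    {"context": "selective school",    "condition": "innovation", "gain": 0.55},
    {"context": "selective school",    "condition": "comparison", "gain": 0.52},
    {"context": "high deprivation",    "condition": "innovation", "gain": 0.28},
    {"context": "high deprivation",    "condition": "comparison", "gain": 0.12},
])

# Mean learning gain by context and condition: where does the innovation seem to help most?
print(units.groupby(["context", "condition"])["gain"].mean())
```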

In practice, there are very few experimental research studies which are large-scale, in the sense of having enough different teachers/classes as units of analysis to sit in either of these quadrants of the chart. Educational research is rarely funded at a level that makes this possible. Most researchers are constrained by the available resources to only work with a small number of accessible classes or schools.

So, what use are such studies for producing generalisable results?

Small-scale research to incrementally extend the range of effectiveness

A single small-scale study can contribute to a research programme to explore the range of application of an innovation as if it was part of a large-scale study with a diverse sample. But this means such studies need to be explicitly conceptualised and planned as part of such a programme.

At the moment it is common for research papers to say something like

"…lots of research studies, from all over the place, report that asking students to

(i) first copy science texts omitting all the vowels, and then

(ii) re-constitute them in full by working from the reduced text, writing it out and adding vowels to produce viable words and sentences,

is an effective way of supporting the learning of science concepts; but no one has yet reported testing this pedagogic method when twelve-year-old students are studying the topic of acids in South Cambridgeshire in a teaching laboratory with moveable stools and West-facing windows.

In this ground-breaking study, we report an experiment to see if this constructivist, active-learning, teaching approach leads to greater science learning among twelve-year-old students studying the topic of acids in South Cambridgeshire in a teaching laboratory with moveable stools and West-facing windows…"

Over time, the research literature becomes populated with studies of enquiry-based science education, jig-saw learning, use of virtual reality, etc., etc., and these tend to refer to a range of national contexts, variously aged students, diverse science topics, and so on; but this all tends to be piecemeal. A coordinated programme of research could lead to researchers both (a) giving rich descriptions of the contexts used, and (b) selecting contexts strategically to build up a picture across ranges of contexts:

"When there is a series of studies testing the same innovation, it is most useful if collectively they sample in a way that offers maximum information about the potential range of effectiveness of the innovation.There are clearly many factors that may be relevant. It may be useful for replication studies of effective innovations to take place with groups of different socio-economic status, or in different countries with different curriculum contexts, or indeed in countries with different cultural norms (and perhaps very different class sizes; different access to laboratory facilities) and languages of instruction …. It may be useful to test the range of effectiveness of some innovations in terms of the ages of students, or across a range of quite different science topics. Such decisions should be based on theoretical considerations.

Given the large number of potentially relevant variables, there will be a great many combinations of possible sets of replication conditions. A large number of replications giving similar results within a small region of this 'phase space' means each new study adds little to the field. If all existing studies report positive outcomes, then it is most useful to select new samples that are as different as possible from those already tested. …

When existing studies suggest the innovation is effective in some contexts but not others, then the characteristics of samples/context of published studies can be used to guide the selection of new samples/contexts (perhaps those judged as offering intermediate cases) that can help illuminate the boundaries of the range of effectiveness of the innovation."

Taber, 2019

Not that the research programme would be co-ordinated by a central agency or authority, but by each contributing researcher/research team (i) taking into account the 'state of play' at the start of their research; (ii) making strategic decisions accordingly when selecting contexts for their own work; (iii) reporting the context in enough detail to allow later researchers to see how that study fits into the ongoing programme.

This has to be a more scientific approach than simply picking a convenient context where researchers expect something to work well; undertaking a small-scale local experiment (perhaps setting up a substandard control condition to be sure of a positive outcome); and then reporting along the lines of "this widely demonstrated effective pedagogy works here too", or, if it does not, perhaps putting the study aside without publication. As the philosopher of science Karl Popper reminded us, science proceeds through the testing of bold conjectures: an 'experiment' where you already know the outcome is actually a demonstration. Demonstrations are useful in teaching, but do not contribute to research. What can contribute is an experiment in a context where there is reason to be unsure whether an innovation will be an improvement or not, and where the comparison reflects good teaching practice to offer a meaningful test.

Small-scale research to inform local practice

Now, I would be the first to admit that I am not optimistic that such an approach will be developed by researchers; and even if it is, it will take time for useful patterns to arise that offer genuine insights into the range of convenience of different pedagogies.

Does this mean that small-scale studies in a single context are really a waste of research resources and an unmerited inconvenience for those working in such contexts?

Well, I have time for studies in my final (bottom left) quadrant. Given that schools and classrooms and teachers and classes all vary considerably, and that what works well in a highly selective boys-only fee-paying school with a class size of 16 may not be as effective in a co-educational class of 32 mixed-ability students in an under-resourced school in an area of social deprivation (and vice versa, of course!), there is often value in testing out ideas (even recommended 'research-based' ones) in specific contexts to inform practice in that context. These are likely to be genuine experiments, as the investigators are really motivated to find out what can improve practice in that context.

Often such experiments will not get published,

  • perhaps because the researchers are teachers with higher priorities than writing for publication;
  • perhaps because it is assumed such local studies are not generalisable (but they could sometimes be moved into the previous category if suitably conceptualised and reported);
  • perhaps because the investigators have not sought permissions for publication (part of the ethics of research) – permissions that are usually not necessary for teachers seeking innovations to improve practice as part of their professional work;
  • perhaps because it has been decided inappropriate to set up control conditions which are not expected to be of benefit to those being asked to participate;
  • but also because when trying out something new in a classroom, one needs to be open to make ad hoc modifications to, or even abandon, an innovation if it seems to be having a deleterious effect.

Evaluation of effectiveness here usually comes down to professional judgement, which might in part rely on the researcher's close (and partially tacit) familiarity with the research context – rather than statistical testing (which assumes a large, random sample of a population) being used to generalise, invalidly, from small, non-random, local results to that population.

I am here describing 'action research', which is highly useful for informing local practice, but which is not ideally suited for formal reporting in academic journals.

Read about action research

So, I suspect there may be an irony here.

There may be a great many small-scale experiments undertaken in schools and colleges which inform good teaching practice in their contexts, without ever being widely reported; whilst there are a great many similar-scale, often 'forced', experiments, carried out by visiting researchers with little personal stake in the research context, which report on the general effectiveness of teaching approaches based on a misuse of statistics. I wonder which approach best reflects the true spirit of science?

Source cited:


Notes:

1 For example:

Even in the natural sciences, we can never be absolutely sure that we have controlled all relevant variables (after all, if we already knew for sure which variables were relevant, we would not need to do the research). But usually existing theory gives us a pretty good idea what we need to control.

Experiments are never a simple test of the specified hypothesis, as the experiment is likely to depend upon the theory of instrumentation and the quality of instruments. Consider an extreme case such as the discovery of the Higgs boson at CERN: the conclusions relied on complex theory that informed the design of the apparatus, and very challenging precision engineering, as well as complex mathematical models for interpreting data, and corresponding computer software specifically programmed to carry out that analysis.

The experimental results are a test of a hypothesis (e.g., that a certain particle would be found at events below some calculated energy level) subject to the provisos that

  • the theory of the instrument and its design is correct; and
  • the materials of the apparatus (an apparatus as complex and extensive as a small city) have no serious flaws; and
  • the construction of the instrumentation precisely matches the specifications; and
  • the modelling of how the detectors will function (including their decay in performance over time) is accurate; and
  • the analytical techniques designed to interpret the signals are valid; and
  • the programming of the computers carries out the analysis as intended.

It almost requires an act of faith to have confidence in all this (and I am confident there is no one scientist anywhere in the world who has a good enough understanding of, and familiarity with, all these aspects of the experiment to be able to give assurances on all these areas!)


CREST {Critical Reading of Empirical Studies} evaluation form: when you read a research study, do you consider the cumulative effects of doubts you may have about different aspects of the work?

I would hope, at least, that professional scientists and engineers might be a little more aware than many students of the complex chain of argumentation needed to support robust conclusions – for students often seem to be overconfident in the overall value of research conclusions, given any doubts they may have about particular aspects of the work reported.

Read about the Critical Reading of Empirical Studies Tool



Galileo Galilei was one of the first people to apply the telescope to study the night sky (image by Dorothe from Pixabay)


A historical example is Galileo's observations of astronomical phenomena such as the Jovian moons (he spotted the four largest: Io, Europa, Ganymede and Callisto) and the irregular surface of the moon. Some of his contemporaries rejected these findings on the basis that they were made using an apparatus, the new-fangled telescope, that they did not trust. Whilst this is now widely seen as being arrogant and/or ignorant, arguably if you did not understand how a telescope could magnify, and you did not trust the quality of the lenses not to produce distortions, then it was quite reasonable to be sceptical of findings which ran counter to a theory of the 'heavens' that had been generally accepted for many centuries.


2 I have discussed a number of examples on this site. For example:

Falsifying research conclusions: You do not need to falsify your results if you are happy to draw conclusions contrary to the outcome of your data analysis.

Why ask teachers to 'transmit' knowledge…if you believe that "knowledge is constructed in the minds of students"?

Shock result: more study time leads to higher test scores (But 'all other things' are seldom equal)

Experimental pot calls the research kettle black: Do not enquire as I do, enquire as I tell you

Lack of control in educational research: Getting that sinking feeling on reading published studies


3 For a detailed discussion of these and other challenges of doing educational experiments, see Taber, 2019.


4 Consider these two situations.

A researcher wants to find out if a new textbook 'Science for the modern age' leads to more learning among the Grade 10 students she teaches than the traditional book 'Principles of the natural world'. Imagine there are fifty Grade 10 students already divided into two classes. The teacher flips a coin and randomly assigns one of the classes to the innovative book, the other being assigned the traditional book by default. We will assume she has a suitable test to assess each student's learning at the end of the experiment.

The teacher teaches the two classes the same curriculum by the same scheme of work. She presents a mini-lecture to a class, then sets them some questions to discuss using the textbook. At the end of the (three-part!) lesson, she leads a class discussion drawing on students' suggested answers.

Being a science teacher, who believes in replication, she decides to repeat the exercise the following year. Unfortunately there is a pandemic, and all the students are sent into lockdown at home. So, the teacher assigns the fifty students by lot into two groups, and emails one group the traditional book, and the other the innovative text. She teaches all the students online as one cohort: each lesson giving them a mini-lecture, then setting them some reading from their (assigned) book, and a set of questions to work through using the text, asking them to upload their individual answers for her to see.

With regard to experimental method, in the first cohort she has only two independent units of analysis – so she may note that the average outcome scores are higher in one group, but cannot read too much into that. However, in the second year, the fifty students can be considered to be learning independently, and as they have been randomly assigned to conditions, she can treat the assessment scores as being from 25 units of analysis in each condition (and so may sensibly apply statistics to see if there is a statistically significant difference in outcomes).
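As a purely illustrative sketch (in Python, with invented test scores rather than data from any real study), this is the kind of analysis that only becomes defensible in the second year, when each of the fifty students can be treated as an independent, randomly assigned unit:

```python
# Hypothetical illustration only: 25 independently taught, randomly assigned students
# per condition, so an independent-samples t-test is at least arguable.
from scipy import stats

# Invented end-of-topic test scores (out of 10) for the two randomly assigned groups
innovative_text = [7, 6, 8, 5, 7, 9, 6, 7, 8, 6, 7, 5, 8, 7, 6, 9, 7, 8, 6, 7, 5, 8, 7, 6, 7]
traditional_text = [6, 5, 7, 6, 5, 8, 6, 6, 7, 5, 6, 4, 7, 6, 5, 8, 6, 7, 5, 6, 4, 7, 6, 5, 6]

t_stat, p_value = stats.ttest_ind(innovative_text, traditional_text)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# In the first cohort, by contrast, there are only two independent units (one intact
# class per condition), so the honest summary is simply a comparison of two class means.
```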


5 Inferential statistical tests are usually used to see if the difference in outcomes across conditions is 'significant'. Perhaps the average score in a class with an innovation is 5.6, compared with an average score in the control class of 5.1. The average score is higher in the experimental condition, but is the difference enough to matter?

Well, actually, if the question is whether the difference is big enough to be likely to make a difference in practice, then researchers should calculate the 'effect size', which will suggest whether the difference found should be considered small, moderate or large. This should ideally be calculated regardless of whether inferential statistics are being used or not.
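As a hedged illustration of the point (only the means 5.6 and 5.1 come from the scenario above; the standard deviations and group sizes are assumptions invented for the example), Cohen's d can be calculated whether or not any significance test is run:

```python
# Illustrative only: Cohen's d for the hypothetical 5.6 vs. 5.1 comparison above.
# The standard deviations and group sizes are assumed values for the sake of the example.
import math

mean_exp, sd_exp, n_exp = 5.6, 1.4, 25      # experimental condition (assumed sd and n)
mean_ctrl, sd_ctrl, n_ctrl = 5.1, 1.5, 25   # control condition (assumed sd and n)

# Pooled standard deviation across the two groups
pooled_sd = math.sqrt(((n_exp - 1) * sd_exp**2 + (n_ctrl - 1) * sd_ctrl**2)
                      / (n_exp + n_ctrl - 2))

cohens_d = (mean_exp - mean_ctrl) / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")  # about 0.34 here: small-to-moderate by the usual rules of thumb
```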

Inferential statistical tests are often used to see if the result is generalisable to the wider population – but, as suggested above, this is strictly only valid if the population of interest has been randomly sampled – which virtually never happens in educational studies, as it is usually not feasible.

Often researchers will still do the calculation, based on the sets of outcome scores in the two conditions, to see if they can claim a statistically significant difference – but the test only indicates how likely or unlikely a difference of that size would be to arise by chance, and only if the units of analysis have been randomly assigned to the conditions. So, if there are 50 learners each randomly assigned to the experimental or control condition, this makes sense. That is sometimes the case, but nearly always the researchers work with existing classes and do not have the option of randomly mixing the students up. [See the example in the previous note 4.] In such a situation, the stats are not informative. (That does not stop them often being reported in published accounts as if they are useful.)


6 That is, if it is possible to address such complications as participant expectations, and equitable teacher-familiarity with the different conditions they are assigned to (Taber, 2019).

Read about expectancy effects


7 A usual ethical expectation is that participants voluntarily (without duress) offer informed consent to participate.

Read about voluntary informed consent


Shock result: more study time leads to higher test scores

(But 'all other things' are seldom equal)


Keith S. Taber


I came across an interesting journal article that reported a quasi-experimental study where different groups of students studied the same topic for different periods of time. One group was given 3 half-hour lessons, another group 5 half-hour lessons, and the third group 8 half-hour lessons. Then they were tested on the topic they had been studying. The researchers found that the average group performance was substantially different across the different conditions. This was tested statistically, but the results were clear enough to be quite impressive when presented visually (as I have below).


Results from a quasi-experiment: it seems more study time can lead to higher achievement

These results seem pretty clear cut. If this research could be replicated in diverse contexts then the findings could have great significance.

  • Is your manager trying to cut course hours to save budget?
  • Does your school want you to teach 'triple science' in a curriculum slot intended for 'double science'?
  • Does your child say they have done enough homework?

Research evidence suggests that, ceteris paribus, learners achieve more by spending more time studying.

Ceteris paribus?

That is ceteris paribus (no, it is not a newly discovered species of whale): all other things being equal. But of course, in the real world they seldom – if ever – are.

If you wondered about the motivation for a study designed to see whether more teaching led to more learning (hardly what Karl Popper would have classed as a suitable 'bold conjecture' on which to base productive research), then I should confess I am being disingenuous. The information I give above is based on the published research, but offers a rather different take on the study from that offered by the authors themselves.

An 'alternative interpretation' one might say.

How useful are DARTs as learning activities?

I came across this study when looking to see if there was any research on the effectiveness of DARTs in chemistry teaching. DARTs are directed activities related to text – that is text-based exercises designed to require learners to engage with content rather than just copy or read it. They have long been recommended, but I was not sure I had seen any published research on their use in science classrooms.

Read about using DARTs in teaching

Shamsulbahri and Zulkiply (2021) undertook a study that "examined the effect of Directed Activity Related to Texts (DARTs) and gender on student achievement in qualitative analysis in chemistry" (p.157). They considered their study to be a quasi-experiment.

An experiment…

Experiment is the favoured methodology in many areas of natural science, and, indeed, the double blind experiment is sometimes seen as the gold standard methodology in medicine – and when possible in the social sciences. This includes education, and certainly in science education the literature reports many, many educational experiments. However, doing experiments well in education is very tricky and many published studies have major methodological problems (Taber, 2019).

Read about experiments in education

…requires control of variables

As we teach in school science, fair testing requires careful control of variables.

So, if I suggest there are some issues that prevent a reader from being entirely confident in the conclusions that Shamsulbahri and Zulkiply reach in their paper, it should be borne in mind that I think it is almost impossible to do a rigorously 'fair' small-scale experiment in education. By small-scale, I mean the kind of study that involves a few classes of learners, as opposed to studies that can enrol a large number of classes and randomly assign them to conditions. Even large-scale randomised studies are usually compromised by factors that simply cannot be controlled in educational contexts (Taber, 2019), and small-scale studies are subject to additional, often (I would argue) insurmountable, 'challenges'.

The study is available on the web, open access, and the paper goes into a good deal of detail about the background to, and aspects of, the study. Here, I am focusing on a few points that relate to my wider concerns about the merits of experimental research into teaching, and there is much of potential interest in the paper that I am ignoring as not directly relevant to my specific argument here. In particular, the authors describe the different forms of DART they used in the study. As, inevitably (considering my stance on the intrinsic problems of small-scale experiments in education), the tone of this piece is critical, I would recommend readers to access the full paper and make up their own minds.

Not a predatory journal

I was not familiar with the journal in which this paper was published – the Malaysian Journal of Learning and Instruction. It describes itself as "a peer reviewed interdisciplinary journal with an international advisory board". It is an open access journal that charges authors for publication. However, the publication fees are modest (US$25 if authors are from countries that are members of The Association of Southeast Asian Nations, and US$50 otherwise). This is an order of magnitude less than is typical for some of the open-access journals that I have criticised here as being predatory – those which do not engage in meaningful peer review, and will publish some very low quality material as long as a fee is paid. 25 dollars seems a reasonable charge for the costs involved in publishing work, unlike the hefty fees charged by many of the less scrupulous journals.

Shamsulbahri and Zulkiply seem, then, to have published in a well-motivated journal and their paper has passed peer review. But this peer thinks that, like most small scale experiments into teaching, it is very hard to draw any solid conclusions from this work.

What do the authors conclude?

Shamsulbahri and Zulkiply argue that their study shows the value of DARTs activities in learning. I approach this work with a bias, as I also think DARTs can be very useful. I used different kinds of DARTs extensively in my teaching with 14-16 year olds when I worked in schools.

The authors claim their study,

"provides experimental evidence in support of the claim that the DARTs method has been beneficial as a pedagogical approach as it helps to enhance qualitative analysis learning in chemistry…

The present study however, has shown that the DARTs method facilitated better learning of the qualitative analysis component of chemistry when it was combined with the experimental method. Using the DARTs method only results in better learning of qualitative analysis component in chemistry, as compared with using the Experimental method only."

Shamsulbahri & Zulkiply, 2021

Yet, despite my bias, which leads me to suspect they are right, I do not think we can infer this much from their quasi-experiment.

I am going to separate out three claims in the quote above:

  1. the DARTs method has been beneficial as a pedagogical approach as it helps to enhance qualitative analysis learning in chemistry
  2. the DARTs method facilitated better learning of the qualitative analysis component of chemistry when it was combined with the [laboratory1] method
  3. the DARTs method [by itself] results in better learning of qualitative analysis component in chemistry, as compared with using the [laboratory] method only.

I am going to suggest that there are two weak claims here and one strong claim. The weak claims are reasonably well supported (but only as long as they are read strictly as presented and not assumed to extend beyond the study) but the strong claim is not.

Limitations of the experiment

I suggest there are several major limitations of this research design.

What population is represented in the study?

In a true experiment researchers would nominate the population of interest (say, for example, 14-16 year old school learners in Malaysia), and then randomly select participants from this population, who would be randomly assigned to the different conditions being compared. Random selection and assignment cannot ensure that the groupings of participants are equivalent, nor that the samples genuinely represent the population; as by chance it could happen that, say, the most studious students are assigned to one condition and all the lazy students to another – but that is very unlikely. Random selection and assignment means that there is a strong statistical case to think the outcomes of the experiment probably represent (more or less) what would have happened on a larger scale had it been possible to include the whole population in the experiment.
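A minimal sketch of that ideal procedure (with a hypothetical sampling frame and made-up numbers throughout – in practice researchers almost never have access to a list of the whole population) might look like this:

```python
# Hypothetical sketch of a 'true experiment' design: random selection from a nominated
# population, followed by random assignment to the three conditions being compared.
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Imagined sampling frame: an identifier for every eligible learner in the population
population = [f"student_{i:05d}" for i in range(50_000)]

sample = random.sample(population, 120)   # random selection of participants
random.shuffle(sample)                    # then random assignment to conditions...
conditions = {
    "laboratory": sample[0:40],
    "DARTs": sample[40:80],
    "laboratory+DARTs": sample[80:120],
}

for name, group in conditions.items():
    print(name, len(group), "participants")
```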

Read about sampling in research

Obviously, researchers in small-scale experiments are very unlikely to be able to access full populations to sample. Shamsulbahri and Zulkiply did not – and it would be unreasonable to criticise them for this. But this does raise the question of whether what happens in their samples will reflect what would happen with other groups of students. Shamsulbahri and Zulkiply acknowledge their sample cannot be considered typical,

"One limitation of the present study would be the sample used; the participants were all from two local fully residential schools, which were schools for students with high academic performance."

Shamsulbahri & Zulkiply, 2021

So, we have to be careful about generalising from what happened in this specific experiment to what we might expect with different groups of learners. In that regard, two of the claims from the paper that I have highlighted (i.e., the weaker claims) do not directly imply these results can be generalised:

  1. the DARTs method has been beneficial as a pedagogical approach…
  2. the DARTs method facilitated better learning of the qualitative analysis component of chemistry when it was combined with the [laboratory] method

These are claims about what was found in the study – not inferences about what would happen in other circumstances.

Read about randomisation in studies

Equivalence at pretest?

When it is not possible to randomly assign participants to the different conditions, there is always the possibility that whatever process has been used to assign conditions to groups produces a bias. (An extreme case would be a school that used setting – that is, assigning students to teaching groups according to achievement – if one set was assigned to one condition and another set to a different condition.)

In quasi-experiments on teaching it is usual to pre-test students and to present analysis to show that at the start of the experiment the groups 'are equivalent'. Of course, it is very unlikely two different classes would prove to be entirely equivalent on a pre-test, so often a judgement is made that the test results are sufficiently similar across the conditions. In practice, in many published studies, authors settle for the very weak (and inadequate) test of not finding differences so great that they would be very unlikely to occur by chance (Taber, 2019)!

Read about testing for equivalence

Shamsulbahri and Zulkiply did pretest all participants as a screening process to exclude any students who already had good subject knowledge in the topic (qualitative chemical analysis),

"Before the experimental manipulation began, all participants were given a pre-screening test (i.e., the Cation assessment test) with the intention of selecting only the most qualified participants, that is, those who had a low-level of knowledge on the topic….The participants who scored ten or below (out of a total mark of 30) were selected for the actual experimental manipulation. As it turned out, all 120 participants scored 10 and below (i.e., with an average of 3.66 out of 30 marks), which was the requirement that had been set, and thus they were selected for the actual experimental manipulation."

Shamsulbahri & Zulkiply, 2021

But the researchers do not report the mean results for the groups in the three conditions (laboratory 1; DARTs; laboratory+DARTs) or give any indication of how similar (or not) these were. Nor do these scores seem to have been included as a variable in the analysis of results. The authors seem to be assuming that as no students scored more than one-third of the marks in the pre-test, any differences between groups at pre-test can be ignored. (This seems to suggest that scoring 30% or 0% can be considered the same level of prior knowledge in terms of the potential influence on further learning and subsequent post-test scores.) That does not seem a sound assumption.

"It is important to note that there was no issue of pre-test treatment interaction in the context of the present study. This has improved the external validity of the study, since all of the participants were given a pre-screening test before they got involved in the actual experimental manipulation, i.e., in one of the three instructional methods. Therefore, any differences observed in the participants' performance in the post-test later were due to the effect of the instructional method used in the experimental manipulation."

Shamsulbahri & Zulkiply, 2021 (emphasis added)

There seems to be a flaw in the logic here, as the authors seem to be equating demonstrating an absence of high scorers at pre-test with there being no differences between groups which might have influenced learning. 2

Units of analysis

In any research study, researchers need to be clear regarding what their 'unit of analysis' should be. In this case the extreme options seem to be:

  • 120 units of analysis: 40 students in each of three conditions
  • 3 units of analysis: one teaching group in each condition

The key question is whether individual learners can be considered as being subject to the treatment conditions independently of others assigned to the same condition.

"During the study phase, student participants from the three groups were instructed by their respective chemistry teachers to learn in pairs…"

Shamsulbahri & Zulkiply, 2021

There is a strong argument that when a group of students attend class together, and are taught together, and interact with each other during class, they strictly should not be considered as learning independently of each other. Anyone who has taught parallel classes that are supposedly equivalent will know that classes take on their own personalities as groups, and the behaviour and learning of individual students is influenced by the particular class ethos.

Read about units of analysis

So, rigorous research into class teaching pedagogy should not treat the individual learners as units of analysis – yet it often does. The reason is obvious – it is only possible to do statistical testing when the sample size is large enough, and in small-scale educational experiments the sample size is never going to be large enough unless one…hm…pretends/imagines/considers/judges/assumes/hopes? that each learner is independently subject to the assigned treatment, without being substantially influenced by others in that condition.

So, Shamsulbahri and Zulkiply treated their participants as independent units of analysis and based on this find a statistically significant effect of treatment:

'laboratory' vs. 'DARTs' vs. 'laboratory+DARTs'.

That is questionable – but what if, for argument's sake, we accept this assumption that within a class of 40 students the learners can be considered not to influence each other (even their learning partner?) or the classroom more generally sufficiently to make a difference to others in the class?

A confounding variable?

Perhaps a more serious problem with the research design is that there is insufficient control of potentially relevant variables. In order to compare 'laboratory' vs. 'DARTs' vs. 'laboratory+DARTs', the only relevant difference between the three treatment conditions should be whether the students learn by laboratory activity, DARTs, or both. There should not be any other differences between the groups in the different treatments that might reasonably be expected to influence the outcomes.

Read about confounding variables

But the description of how groups were set up suggests this was not the case:

"….the researchers conducted a briefing session on the aims and experimental details of the study for the school's [schools'?] chemistry teachers…the researchers demonstrated and then guided the school's chemistry teachers in terms of the appropriate procedures to implement the DARTs instructional method (i.e., using the DARTs handout sheets)…The researcher also explained to the school's chemistry teachers the way to implement the combined method …

Participants were then classified into three groups: control group (experimental method), first treatment group (DARTs method) and second treatment group (Combination of experiment and DARTs method). There was an equal number of participants for each group (i.e., 40 participants) as well as gender distribution (i.e., 20 females and 20 males in each group). The control group consisted of the participants from School A, while both treatment groups consisted of participants from School B"


Shamsulbahri & Zulkiply, 2021

Several different teachers seem to have been involved in teaching the classes, and even if it is not entirely clear how the teaching was divided up, it is clear that the group that only undertook the laboratory activities was from a different school than those in the other two conditions.

If we think one teacher can be replaced by another without changing learning outcomes, and that schools are interchangeable such that we would expect exactly the same outcomes if we swapped a class of students from one school for a class from another school, then these variables are unimportant. If, however, we think the teacher doing the teaching and the school from which learners are sampled could reasonably make a difference to the learning achieved, then these are confounding variables which have not been properly controlled.

In my own experience, I do not think different teachers become equivalent even when they are briefed to teach in the same way, and I do not think we can assume schools are equivalent when providing students to participate in learning. These differences, then, undermine our ability to attribute any differences in outcomes to the differences in pedagogy (that "any differences observed…were due to the effect of the instructional method used").

Another confounding variable

And then I come back to my starting point. Learners did not just experience different forms of pedagogy but also different amounts of teaching. The difference between 3 lessons and 5 lessons might in itself be a factor (that is, even if the pedagogy employed in those lessons had been the same), as might the difference between 5 lessons and 8 lessons. So, time spent studying must be seen as a likely confounding variable. Indeed, it is not just the amount of time, but also the number of lessons, as the brain processes learning between classes and what is learnt in one lesson can be reinforced when reviewed in the next. (So we could not just assume, for example, that students automatically learn the same amount from, say, two 60 min. classes and four 30 min. classes covering the same material.)

What can we conclude?

As with many experiments in science teaching, we can accept the results of Shamsulbahri and Zulkiply's study, in terms of what they found in the specific study context, but still not be able to draw strong conclusions of wider significance.

Is the DARTs method beneficial as a pedagogical approach?

I expect the answer to this question is yes, but we need to be careful in drawing this conclusion from the experiment. Certainly the two groups which undertook the DARTs activities outperformed the group which did not. Yet that group was drawn from a different school and taught by a different teacher or teachers. That could have explained why there was less learning. (I am not claiming this is so – the point is we have no way of knowing as different variables are conflated.) In any case, the two groups that did undertake the DARTs activity were both given more lessons and spent substantially longer studying the topic they were tested on, than the class that did not. We simply cannot make a fair comparison here with any confidence.

Did the DARTs method facilitate better learning when it was combined with laboratory work?

There is a stronger comparison here. We still do not know if the two groups were taught by the same teacher/teachers (which could make a difference) or indeed whether the two groups started from a very similar level of prior knowledge. But, at least the two groups were from the same school, and both experienced the same DARTs based instruction. Greater learning was achieved when students undertook laboratory work as well as undertaking DARTs activities compared with students who only undertook the DARTs activity.

The 'combined' group still had more teaching than the DARTs group, but that does not matter here in drawing a logical conclusion because the question being explored is of the form 'does additional teaching input provide additional value?' (Taber, 2019). The question here is not whether one type of pedagogy is better than the other, but simply whether also undertaking practical works adds something over just doing the paper based learning activities.

Read about levels of control in experimental design

As the sample of learners was not representative of any specific wider population, we cannot assume this result would generalise beyond the participants in the study, although we might reasonably expect this result would be found elsewhere. But that is because we might already assume that learning about a practical activity (qualitative chemical analysis) will be enhanced by adding some laboratory-based study!

Does DARTs pedagogy produce more learning about qualitative analysis than laboratory activities?

Shamsulbahri and Zulkiply's third claim was bolder because it was framed as a generalisation: instruction through DARTs produces more learning about qualitative analysis than laboratory-based instruction. That seems quite a stretch from what the study clearly shows us.

What the research does show us with confidence is that a group of 40 students in one school taught by a particular teacher/teaching team with 5 lessons of a specific set of DARTs activities, performed better on a specific assessment instrument than a different group of 40 students in another school taught by a different teacher/teaching team through three lessons of laboratory work following a specific scheme of practical activities.


a group of 40 students … performed better on a specific assessment instrument than … a different group of 40 students:

  • in one school vs. in another school
  • taught by a particular teacher/teaching team vs. taught by a different teacher/teaching team
  • with 5 lessons vs. through 3 lessons
  • of a specific set of DARTs activities vs. of laboratory work following a specific scheme of practical activities

Confounded variables

Test instrument bias?

Even if we thought the post-test used by Shamsulbahri and Zulkiply was perfectly valid as an assessment of topic knowledge, we might be concerned by knowing that learning is situated in a context – we recall better in a context similar to that in which we learned.


How can we best assess students' learning about qualitative analysis?


So:

  • should we be concerned that the form of assessment, a paper-based instrument, is closer in nature to the DARTs learning experience than the laboratory learning experience?

and, if so,

  • might this suggest a bias in the measurement instrument towards one treatment (i.e., DARTs)?

and, if so,

  • might a laboratory-based assessment have favoured the group that did the laboratory based learning over the DARTs group, and led to different outcomes?

and, if so,

  • which approach to assessment has more ecological validity in this case: which type of assessment activity is a more authentic way of testing learning about a laboratory-based activity like qualitative chemical analysis?

A representation of my understanding of the experimental design

Can we generalise?

As always with small-scale experiments into teaching, we have to judge the extent to which the specifics of the study might prevent us from generalising the findings – that is, from being able to assume they would generally apply elsewhere. 3 Here, we are left to ask to what extent we can

  • ignore any undisclosed difference between the groups in levels of prior learning;
  • ignore any difference between the schools and their populations;
  • ignore any differences in teacher(s) (competence, confidence, teaching style, rapport with classes, etc.);
  • ignore any idiosyncrasies in the DARTs scheme of instruction;
  • ignore any idiosyncrasies in the scheme of laboratory instruction;
  • ignore any idiosyncrasies (and potential biases) in the assessment instrument and its marking scheme and their application;

And, if we decide we can put aside any concerns about any of those matters, we can safely assume that (in learning this topic at this level)

  • 5 sessions of learning by DARTs is more effective than 3 sessions of laboratory learning.

Then we only have to decide if that is because

  • (i) DARTs activities teach more about this topic at this level than laboratory activities, or
  • (ii) some or all of the difference in learning outcomes is simply because 150 minutes of study (broken into five blocks) has more effect than 90 minutes of study (broken into three blocks).

What do you think?


Work cited:

Notes:

1 The authors refer to the conditions as

  • Experimental control group
  • DARTs
  • combination of Experiment + DARTs

I am referring to the first group as 'laboratory', firstly because it is not clear the students were doing any experiments (that is, testing hypotheses), as the practical activity was learning to undertake standard analytical tests; and, secondly, to avoid confusion (between the educational experiment and the laboratory practicals).


2 I think the reference to "no issue of pre-test treatment interaction" is probably meant to suggest that as all students took the same pre-test it will have had the same effect on all participants. But this not only ignores the potential effect of any differences in prior knowledge reflected in the pre-test scores that might influence subsequent learning, but also the effect of taking the pre-test cannot be assumed to be neutral if for some learners it merely told them they knew nothing about the topic, whilst for others it activated and so reinforced some prior knowledge in the subject. In principle, the interaction between prior knowledge and taking the pretest could have influenced learning at both cognitive and affective levels: that is, both in terms of consolidation of prior learning and cuing for the new learning; and in terms of a learner's confidence in, and attitude towards, learning the topic.


3 Even when we do have a representative sample of a population to test, we can only infer that the outcomes of an experiment reflect what will be most likely for members (schools, learners, classes, teachers…) of the wider population. Individual differences are such that we can never say that what most probably is the case will always be the case.


When an experiment tests a sample drawn at random from a wider population, then the findings of the experiment can be assumed to apply (on average) to the population. (Source: after Taber, 2019).

A case study of educational innovation?

Design and Assessment of an Online Prelab Model in General Chemistry


Keith S. Taber


Case study is meant to be naturalistic – whereas innovation sounds like an intervention. But interventions can be the focus of naturalistic enquiry.

One of the downsides of having spent years teaching research methods is that one cannot help but notice how so much published research departs from the ideal models one offers to students. (Which might be seen as a polite way of saying authors often seem to get key things wrong.) I used to teach that how one labelled one's research was less important than how well one explained it. That is, different people would have somewhat different takes on what is, or is not, grounded theory, case study or action research, but as long as an author explained what they had done, and could adequately justify why, the choice of label for the methodology was of secondary importance.

A science teacher can appreciate this: a student who tells the teacher they are doing a distillation when they are actually carrying out reflux, but clearly explains what they are doing and why, will still be understood (even if the error should be pointed out). On the other hand, if a student has the right label but an alternative conception, this is likely to be a more problematic 'bug' in the teaching-learning system. 1

That said, each type of research strategy has its own particular weaknesses and strengths so describing something as an experiment, or a case study, if it did not actually share the essential characteristics of that strategy, can mislead the reader – and sometimes even mislead the authors such that invalid conclusions are drawn.

A 'case study', that really is a case study

I made reference above to action research, grounded theory, and case study – three methodologies which are commonly name-checked in education research. There are a vast number of papers in the literature with one of these terms in the title, and a good many of them do not report work that clearly fits the claimed approach! 2


The case study was published in the Journal for the Research Center for Educational Technology

So, I was pleased to read 'Design and assessment of an online prelab model in general chemistry: A case study' (Llorens-Molina, 2009), which I felt really was a case study – although I suspect some other authors might have been tempted to describe this research differently.

Is it a bird, is it a plane; no it's…

Llorens-Molina's study included an experimental aspect. A cohort of learners was divided into two groups to allow the researcher to compare two different educational treatments; then, measurements were made to compare outcomes quantitatively. That might sound like an experiment. Moreover, this study reported an attempt to innovate in a teaching situation, which gives the work a flavour of action research. Despite this, I agree with Llorens-Molina that the work is best characterised as a case study.

Read about experiments

Read about action research


A case study focuses on 'one instance' from among many


What is a case study?

A case study is an in-depth examination of one instance: one example of something for which there are many examples. The focus of a case study might be one learner, one teacher, one group of students working together on a task, one class, one school, one course, one examination paper, one textbook, one laboratory session, one lesson, one enrichment programme… So, there is great variety in what kind of entity a case study is a study of, but what case studies have in common is that they each focus in detail on that one instance.

Read about case study methodology


Characteristics of case study


Case studies are naturalistic studies, which means they are studies of things as they are, not attempts to change things. The case has to be bounded (a reader of a case study learns what is in the case and what is not) but tends to be embedded in a wider context that impacts upon it. That is, the case is entangled in a context from which it could not easily be extracted and still be the same case. (Imagine moving a teacher with her class from their school to have their lesson in a university where it could be observed by researchers – it would not be 'the same lesson' as would have occurred in situ).

The case study is reported in detail, often in a narrative form (not just statistical summaries) – what is sometimes called 'thick description'. Usually several 'slices' of data are collected – often different kinds of data – and often there is a process of 'triangulation' to check the consistency of the account presented in relation to the different slices of data available. Although case studies can include analysis of quantitative data, they are usually seen as interpretive as the richness of data available usually reflects complexity and invites nuance.



Design and Assessment of an Online Prelab Model in General Chemistry

Llorens-Molina's study explored the use of prelabs that are "used to introduce and contextualize laboratory work in learning chemistry" (p.15), and in particular "an alternative prelab model, which consists of an audiovisual tutorial associated with an online test" (p.15).

An innovation

The research investigated an innovation in teaching practice,

"In our habitual practice, a previous lecture at the beginning of each laboratory session, focused almost exclusively on the operational issues, was used. From our teaching experience, we can state that this sort of introductory activity contributes to a "cookbook" way to carry out the laboratory tasks. Furthermore, the lecture takes up valuable time (about half an hour) of each ordinary two-hour session. Given this set-up, the main goal of this research was to design and assess an alternative prelab model, which was designed to enhance the abilities and skills related to an inquiry-type learning environment. Likewise, it would have to allow us to save a significant amount of time in laboratory sessions due to its online nature….

a prelab activity developed …consists of two parts…a digital video recording about a brief tutorial lecture, supported by a slide presentation…[followed by ] an online multiple choice test"

Llorens-Molina, 2009, pp.16-17
Not action research?

The reference to shifting "our habitual practice" indicates this study reports practitioner research. Practitioner studies, such as this, that test a new innovation are often labelled by authors as 'action research'. (Indeed, sometimes the fact that research is carried out by practitioners looking to improve their own practice is seen as sufficient to make it action research: when actually this is a necessary, but not a sufficient, condition.)

Genuine action research aims at improving practice, not simply seeing if a specific innovation is working. This means action research has an open-ended design, and is cyclical – with iterations of an innovation tested and the outcomes used as feedback to inform changes in the innovation. (Despite this, a surprising number of published studies labelled as action research lack any cyclic element, simply reporting one iteration of an innovation.) Llorens-Molina's study does not have a cyclic design, so would not be well-characterised as action research.

An experimental design?

Llorens-Molina reports that the study was motivated by three hypotheses (p.16):

  • "Substituting an initial lecture by an online prelab to save time during laboratory sessions will not have negative repercussions in final examination marks.
  • The suggested online prelab model will improve student autonomy and prerequisite knowledge levels during laboratory work. This can be checked by analyzing the types and quantity of SGQ [student generated questions].
  • Student self-perceptions about prelab activities will be more favourable than those of usual lecture methods."

To test these hypotheses the student cohort was divided into two groups, to be split between the customary and innovative approach. This seems very much like an experiment.

It may be useful here to draw a distinction between two levels of research design – methodology (akin to strategy) and techniques (akin to tactics). In research design, a methodology is chosen to meet the overall aims of the study, and then one or more research techniques are selected consistent with that methodology (Taber, 2013). Experimental techniques may be included in a range of methodologies, but experiment as an overall methodology has some specific features.

Read about Research design

In a true experiment there is random assignment to conditions, and often there is an intention to generalise results to a wider population considered to be sampled in the study. Llorens-Molina reports that although inferential statistics were used to test the hypotheses, there was no intention to offer statistical generalisation beyond the case. The cohort of students was not assumed to be a sample representing some wider population (such as, say, undergraduates on chemistry courses in Spain) – and, indeed, clearly such an assumption would not have been justified.

Case study is naturalistic – but an innovation is an intervention in practice…

Case study is said to be naturalistic research – it is a method used to understand and explore things as they are, not to bring about change. Yet, here the focus is an innovation. That seems a contradiction. It would be a contradiction if the study were being carried out by external researchers who had asked the teaching team to change practice for the benefit of their study. However, here it is useful to separate out the two roles of teacher and researcher.

This is a situation that I commonly faced when advising graduates preparing for school teaching who were required to carry out a classroom-based study into an aspect of their school placement practice context as part of their university qualification (the Post-Graduate Certificate in Education, P.G.C.E.). Many of these graduates were unfamiliar with research into social phenomena. Science graduates often brought a model of what worked in the laboratory to their thinking about their projects – and had a tendency to think that transferring the experimental approach to classrooms (where there are usually a large number of potentially relevant variables, many of which cannot be controlled) would be straightforward.

Read 'Why do natural scientists tend to make poor social scientists?'

The Cambridge P.G.C.E. teaching team put in place a range of supports to introduce graduates preparing for teaching to the kinds of education research useful for teachers who want to evaluate and improve their own teaching. This included a book written to introduce classroom-based research that drew heavily on analysis of published studies (Taber, 2007; 2013). Part of our advice was that those new to this kind of enquiry might want to consider action research and case study as suitable options for their small-scale projects.


Useful strategies for the novice practitioner-researcher (Figure: diagram used in working with graduates preparing for teaching, from Taber, 2010)

Simplistically, action research might be considered best suited to a project to test an innovation or address a problem (e.g., evaluating a new teaching resource; responding to behavioural issues), and case study best suited to an exploratory study (e.g., what do Y9 students understand about photosynthesis?; what is the nature of peer dialogue during laboratory working in this class?). However, it was often difficult for the graduates to carry out authentic action research, as the constraints of the school-based placements seldom allowed them to test successive iterations of the same intervention until they found something like an optimal specification.

Yet, they often were in a good position to undertake a detailed study of one iteration, collecting a range of different data, and so producing a detailed evaluation. That sounds like a case study.

Case study is supposed to be naturalistic – whereas innovation sounds like an intervention. But some interventions in practice can be considered the focus of naturalistic enquiry. My argument was that when a teacher changes the way they do something to try and solve a problem, or simply to find a better way to work, that is a 'natural' part of professional practice. The teacher-researcher, as researcher, is exploring something the fully professional teacher does as a matter of course – seeking to develop practice. After all, our graduates were being asked to undertake research to give them the skills expected to meet professional teaching standards, which

"clearly requires the teacher to have both the procedural knowledge to undertake small-scale classroom enquiry, and 'conceptual frameworks' for thinking about teaching and learning that can provide the basis for evaluating their teaching. In other words, the professional teacher needs both the ability to do her own research and knowledge of what existing research suggests"

Taber, 2013, p.8

So, the research is on something that is naturally occurring in the classroom context, rather than an intervention imported into the context in order to answer an external researcher's questions. A case study of an intervention introduced by practitioners themselves can be naturalistic – even if the person implementing the change is the researcher as well as the teacher.


If a teacher-researcher (qua researcher) wishes to enquire into an innovation introduced by the teacher-researcher (qua teacher) then this can be considered as naturalistic enquiry


The case and the context

In Llorens-Molina's study, the case was a sequence of laboratory activities carried out by a cohort of undergraduates undertaking a course of General and Organic Chemistry as part of an Agricultural Engineering programme. So, the case was bounded (the laboratory part of one taught course) and embedded in a wider context – a degree programme in a specific institution in Spain: the Polytechnic University of Valencia.

The primary purpose of the study was to find out about the specific innovation in the particular course that provided the case. This was then what is known as an intrinsic case study. (When a case is studied primarily as an example of a class of cases, rather than primarily for its own interest, it is called an instrumental case study).

Llorens-Molina recognised that what was found in this specific case, in its particular context, could not be assumed to apply more widely. There can be no statistical generalisation to other courses elsewhere. In case study, the intention is to offer sufficient detail of the case for readers to make judgements of its likely relevance to other contexts of interest (so-called 'reader generalisation').

The published report gives a good deal of information about the course as well as much information about how data was collected, and equally important, analysed.

Different slices of data

Case study often uses a range of data sources to develop a rounded picture of the case. In this study the identification of three specific hypotheses (less usual in case studies, which often have more open-ended research questions) led to the collection of three different types of data.

  • Students were assessed on each of six laboratory activities. A comparison was made between the prelab condition and the existing approach.
  • Questions asked by students in the laboratories were recorded and analysed to see if the quality/nature of such questions was different in the two conditions. A sophisticated approach was developed to analyse the questions.
  • Students were asked to rate the prelabs through responding to items on a questionnaire.

This approach allowed the author to go beyond simply reporting whether hypotheses were supported by the analysis, to offer a more nuanced discussion around each feature. Such nuance is not only more informative to the reader of a case study, but reflects how the researcher, as practitioner, has an ongoing commitment to further develop practice and not see the study as an end in itself.

Avoiding the 'equivalence' and the 'misuse of control groups' problems

I particularly appreciate a feature of the research design that many educational studies that claim to be experiments could benefit from. To test his hypotheses Llorens-Molina employed two conditions or treatments, the innovation and a comparison condition, and divided the cohort: "A group with 21 students was split into two subgroups, with 10 and 11 in each one, respectively". Llorens-Molina does not suggest this was based on random assignment, which is necessary for a 'true' experiment.

In many such quasi-experiments (where randomisation to condition is not carried out, and is indeed often not possible) the researchers seek to offer evidence of equivalence before the treatments occur. After all, if the two subgroups are different in terms of past subject attainment or motivation or some other relevant factor (or, indeed, if there is no information to allow a judgement regarding whether this is the case or not), no inferences about an intervention can be drawn from any measured differences. (Although that does not always stop researchers from making such claims regardless: e.g., see Lack of control in educational research.)

Another problem is that if learners are participating in research but are assigned to a control or comparison condition then it could be asked if they are just being used as 'data fodder', and would that be fair to them? This is especially so in those cases (so, not this one) where researchers require that the comparison condition is educationally deficient – many published studies report a control condition where schools students have effectively been lectured to, and no discussion work, group work, practical work, digital resources, et cetera, have been allowed, in order to ensure a stark contrast with whatever supposedly innovative pedagogy or resource is being evaluated (Taber, 2019).

These issues are addressed in research designs which have a compensatory structure – in effect the groups switch between being the experimental and comparison condition – as here:

"Both groups carried out the alternative prelab and the previous lecture (traditional practice), alternately. In this way, each subgroup carried out the same number of laboratory activities with either a prelab and previous lecture"

Llorens-Molina, 2009, p.19

This is good practice both from methodological and ethical considerations.


The study used a compensatory design which avoids the need to ensure both groups are equivalent at the start, and does not disadvantage one group. (Figure from Llorens-Molina, 2009, p.22 – published under a creative commons Attribution-NonCommercial-NoDerivs 3.0 United States license allowing redistribution with attribution)
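To make the compensatory structure concrete, here is a minimal sketch of my reading of the design (the session labels and the strict alternation are illustrative assumptions, not taken from the paper): two subgroups swap between the prelab and the traditional lecture across six laboratory activities, so each subgroup experiences each condition three times.

```python
# Illustrative sketch of a compensatory (crossover) allocation
conditions = ("online prelab", "traditional lecture")

schedule = {}
for session in range(1, 7):  # six laboratory activities
    first, second = conditions if session % 2 else tuple(reversed(conditions))
    schedule[f"lab {session}"] = {"subgroup A": first, "subgroup B": second}

for lab, allocation in schedule.items():
    print(lab, allocation)
# Each subgroup gets three prelab sessions and three lecture sessions, so no
# group is left in a less favourable condition throughout, and the comparison
# does not rest on a claim that the two subgroups were equivalent at the start.
```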

A case of case study

Do I think this is a model case study that perfectly exemplifies all the claimed characteristics of the methodology? No, and very few studies do. Real research projects, often undertaken in complex contexts with limited resources and intractable constraints, seldom fit such ideal models.

However, unlike some studies labelled as case studies, this study has an explicit bounded case and has been carried out in the spirit of case study that highlights and values the intrinsic worth of individual cases. There is a good deal of detail about aspects of the case. It is in essence a case study, and (unlike what sometimes seems to be the case [sic]) not just called a case study for want of a methodological label. Most educational research studies examine one particular case of something – but (and I do not think this is always appreciated) that does not automatically make them case studies. Because it has been both conceptualised and operationalised as a case study, Llorens-Molina's study is a coherent piece of research.

Given how, in these pages, I have often been motivated to call out studies I have read that I consider have major problems – major enough to be sufficient to undermine the argument for the claimed conclusions of the research – I wanted to recognise a piece of research that I felt offered much to admire.


Work cited:

Notes:

1 I am using language here reflecting a perspective on teaching as being based on a model (whether explicit or not) in the teacher's mind of the learners' current knowledge and understanding and how this will respond to teaching. That expects a great deal of the teacher, so there are often bugs in the system (e.g., the teacher over-estimates prior knowledge) that need to be addressed. This is why being a teacher involves being something of a 'learning doctor'.

Read about the learning doctor perspective on teaching


2 I used to teach sessions introducing each of these methodologies when I taught on an Educational Research course. One of the class activities was to examine published papers claiming the focal methodology, asking students to see if studies matched the supposed characteristics of the strategy. This was a course with students undertaking a very diverse range of research projects, and I encouraged them to apply the analysis to papers selected because they were of particular interest and relevance to their own work. Many examples selected by students proved to offer a poor match between claimed methodology and the actual research design of their study!

Assessing Chemistry Laboratory Equipment Availability and Practice

Comparative education on a local scale?

Keith S. Taber

Image by Mostafa Elturkey from Pixabay 

I have just read a paper in a research journal which compares the level of chemistry laboratory equipment and 'practice' in two schools in the "west Gojjam Administrative zone" (which according to a quick web-search is in the Amhara Region in Ethiopia). According to Yesgat and Yibeltal (2021),

"From the analysis of Chemistry laboratory equipment availability and laboratory practice in both … secondary school and … secondary school were found in very low level and much far less than the average availability of chemistry laboratory equipment and status of laboratory practice. From the data analysis average chemistry laboratory equipment availability and status of laboratory practice of … secondary school is better than that of Jiga secondary school."

Yesgat and Yibeltal, 2021: abstract [I was tempted to omit the school names in this posting as I was not convinced the schools had been treated reasonably, but the schools are named in the very title of the article]

Now that would seem to be something that could clearly be of interest to teachers, pupils, parents and education administrators in those two particular schools, but it raises the question that can be posed in relation to any research: 'so what?' The findings might be a useful outcome of enquiry in its own context, but what generalisable knowledge does this offer that justifies its place in the research literature? Why should anyone outside of West Gojjam care?

The authors tell us,

"There are two secondary schools (Damot and Jiga) with having different approach of teaching chemistry in practical approach"

Yesgat and Yibeltal, 2021: 96

So, this suggests a possible motivation.

  • If these two approaches reflect approaches that are common in schools more widely, and
  • if these two schools can be considered representative of schools that adopt these two approaches, and
  • if 'Chemistry Laboratory Equipment Availability and Practice' can be considered to be related to (a factor influencing? an effect of?) these different approaches, and
  • if the study validly and reliably measures 'Chemistry Laboratory Equipment Availability and Practice', and
  • if substantive differences are found between the schools

then the findings might well be of wider interest. As always in research, the importance we give to findings depends upon a whole logical chain of connections that collectively make an argument.

Spoiler alert!

At the end of the paper, I was none the wiser what these 'different approaches' actually were.

A predatory journal

I have been reading some papers in a journal that I believed, on the basis of its misleading title and website details, was an example of a poor-quality 'predatory journal'. That is, a journal which encourages submissions simply to be able to charge a publication fee (currently $1519, according to the website), without doing the proper job of editorial scrutiny. I wanted to test this initial evaluation by looking at the quality of some of the work published.

Although the journal is called the Journal of Chemistry: Education Research and Practice (not to be confused, even if the publishers would like it to be, with the well-established journal Chemistry Education Research and Practice) only a few of the papers published are actually education studies. One of the articles that IS on an educational topic is called 'Assessment of Chemistry Laboratory Equipment Availability and Practice: A Comparative Study Between Damot and Jiga Secondary Schools' (Yesgat & Yibeltal, 2021).

Comparative education?

Yesgat and Yibeltal imply that their study falls in the field of comparative education. 1 They inform readers that 2,

"One purpose of comparative education is to stimulate critical reflection about our educational system, its success and failures, strengths and weaknesses. This critical reflection facilitates self-evaluation of our work and is the basis for determining appropriate courses of action. Another purpose of comparative education is to expose us to educational innovations and systems that have positive outcomes. Most compartivest states [sic] that comparative education has four main purposes. These are:

To describe educational systems, processes or outcomes

To assist in development of educational institutions and practices

To highlight the relationship between education and society

To establish generalized statements about education that are valid in more than one country

Yesgat & Yibeltal, 2021: 95-96
Comparative education studies look to characterise (national) education systems in relation to their social/cultural contexts (Image by Gerd Altmann from Pixabay)

Of course, like any social construct, 'comparative education' is open to interpretation and debate: for example, the idea "that comparative education brings together data about two or more national systems of education, and comparing and contrasting those data" has been characterised as "a naive and obvious answer to the question of what constitutes comparative education" (Turner, 2019, p.100).

There is then some room for discussion over whether particular research outputs should count as 'comparative education' studies or not. Many comparative education studies do not actually compare two educational systems, but rather report in detail from a single system (making possible subsequent comparisons based across several such studies). These educational systems are usually understood as national systems, although there may be a good case to explore regional differences within a nation if regions have autonomous education systems and these can be understood in terms of broader regional differences.

Yet, studying one aspect of education within one curriculum subject at two schools in one educational administrative area of one region of one country cannot be understood as comparative education without doing excessive violence to the notion. This work does not characterise an educational system at national, regional or even local level.

My best assumption is that as the study is comparing something (in this case an aspect of chemistry education in two different schools) the authors feel that makes it 'comparative education', by which account of course any educational experiment (comparing some innovation with some kind of comparison condition) would automatically be a comparative education study. We all make errors sometimes, assuming terms have broader or different meanings than their actual conventional usage – and may indeed continue to misuse a term till someone points this out to us.

This article was published in what claims to be a peer reviewed research journal, so the paper was supposedly evaluated by expert reviewers who would have provided the editor with a report on strengths and weaknesses of the manuscript, and highlighted areas that would need to be addressed before possible publication. Such a reviewer would surely have reported that 'this work is not comparative education, so the paragraph on comparative education should either be removed, or authors should contextualise it to explain why it is relevant to their study'.

The weak links in the chain

A research report makes certain claims that derive from a chain of argument. To be convinced about the conclusions you have to be convinced about all the links in the chain, such as:

  • sampling (were the right people asked?)
  • methodology (is the right type of research design used to answer the research question?)
  • instrumentation (is the data collection instrument valid and reliable?)
  • analysis (have appropriate analytical techniques been carried out?)

These considerations cannot be averaged: if, for example, a data collection instrument does not measure what it is said to measure, then it does not matter how good the sample, or how careful the analysis, the study is undermined and no convincing logical claims can be built. No matter how skilled I am in using a tape measure, I will not be able to obtain accurate weights with it.

Sampling

The authors report the make-up of their sample – all the chemistry teachers in each school (13 in one, 11 in the other), plus ten students from each of grades 9, 10 and 11 in each school. They report that "… 30 natural science students from Damot secondary school have been selected randomly. With the same technique … 30 natural sciences students from Jiga secondary school were selected".

Random selection is useful to know there is no bias in a sample, but it is helpful if the technique for randomisation is briefly reported to assure readers that 'random' is not being used as a synonym for 'arbitrary' and that the technique applied was adequate (Taber, 2013b).

A random selection across a pooled sample is unlikely to lead to equal representation in each subgroup (From Taber, 2013a)

Actually, if 30 students had been chosen at random from the population of students taking natural sciences in one of the schools, it would be extremely unlikely they would be evenly spread, 10 from each year group. Presumably, the authors made random selections within these grade levels (which would be eminently sensible, but is not quite what they report).
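To illustrate the point with a toy simulation (a minimal sketch using invented numbers, not the authors' procedure): drawing 30 students at random from a pooled list spanning three grades will rarely give exactly ten per grade, whereas drawing ten at random within each grade does so by construction.

```python
import random
from collections import Counter

random.seed(1)  # for a repeatable illustration

# Hypothetical pooled population: 60 students in each of grades 9, 10 and 11
population = [(grade, i) for grade in (9, 10, 11) for i in range(60)]

# Simple random sample of 30 drawn from the pooled list
pooled_sample = random.sample(population, 30)
print(Counter(grade for grade, _ in pooled_sample))
# typically uneven, e.g. something like {9: 12, 10: 9, 11: 9}

# Stratified random sample: 10 drawn at random within each grade
by_grade = {g: [s for s in population if s[0] == g] for g in (9, 10, 11)}
stratified_sample = [s for g in (9, 10, 11) for s in random.sample(by_grade[g], 10)]
print(Counter(grade for grade, _ in stratified_sample))
# always {9: 10, 10: 10, 11: 10}
```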

Read about the criterion for randomness in research

Data collection

To collect data the authors constructed a questionnaire with Likert-type items.

"…questionnaire was used as data collecting instruments. Closed ended questionnaires with 23 items from which 8 items for availability of laboratory equipment and 15 items for laboratory practice were set in the form of "Likert" rating scale with four options (4=strongly agree, 3=agree, 2=disagree and 1=strongly disagree)"

Yesgat & Yibeltal, 2021: 96

These categories were further broken down (Yesgat & Yibeltal, 2021: 96): "8 items of availability of equipment were again sub grouped in to

  • physical facility (4 items),
  • chemical availability (2 items), and
  • laboratory apparatus (2 items)

whereas 15 items of laboratory practice were further categorized as

  • before actual laboratory (4 items),
  • during actual laboratory practice (6 items) and
  • after actual laboratory (5 items)

Internal coherence

So, there were two basic constructs, each broken down into three sub-constructs. This instrument was piloted,

"And to assure the reliability of the questionnaire a pilot study on a [sic] non-sampled teachers and students were conducted and Cronbach's Alpha was applied to measure the coefficient of internal consistency. A reliability coefficient of 0.71 was obtained and considered high enough for the instruments to be used for this research"

Yesgat & Yibeltal, 2021: 96

Running a pilot study can be very useful as it can highlight issues with items. However, simply asking people to complete a questionnaire might only reveal items that respondents could make no sense of at all; it is not as useful as interviewing them about how they understood the items, to check that respondents read them in the same way as the researchers.

The authors cite the value of Cronbach's alpha to demonstrate their instrument has internal consistency. However, they seem to be quoting the value obtained in the pilot study, where the statistic strictly applies to a particular administration of an instrument (so the value from the main study is more relevant to the results reported).

More problematic, the authors appear to cite a value of alpha from across all 23 items (n.b., the value of alpha tends to increase as the number of items increases, so what is considered an acceptable value needs to allow for the number of items included) when these are actually two distinct scales: 'availability of laboratory equipment' and 'laboratory practice'. Alpha should be quoted separately for each scale – values across distinct scales are not useful (Taber, 2018). 3
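For readers unfamiliar with the statistic, here is a minimal sketch (not the authors' analysis) of how alpha might be computed separately for each scale, assuming the responses are held in a respondents-by-items array; the column split shown is purely illustrative.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) array of ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                              # number of items in the scale
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of respondents' summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# responses: one row per respondent, one column per questionnaire item
# (illustrative split: first 8 columns = 'availability', last 15 = 'practice')
# alpha_availability = cronbach_alpha(responses[:, :8])
# alpha_practice     = cronbach_alpha(responses[:, 8:])
# A single alpha computed over all 23 columns would mix the two constructs.
```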

Do the items have face validity?

The items in the questionnaire are reported in appendices (pp.102-103), so I have tabulated them here so that readers can consider

  • (a) whether they feel these items reflect the constructs of 'availability of equipment' and 'laboratory practice';
  • (b) whether the items are phrased in a clear way for both teachers and students (the authors report that "conceptually the same questionnaires with different forms were prepared" (p.101), but if this means different wording for teachers than for students, this is not elaborated – teachers were also asked demographic questions about their educational level); and
  • (c) whether they are all reasonable things to expect both teachers and students to be able to rate.
'Availability of equipment' items:
  • Structured and well-equipped laboratory room
  • Availability of electric system in laboratory room
  • Availability of water system in laboratory room
  • Availability of laboratory chemicals are available [sic]
  • No interruption due to lack of lab equipment
  • Isolated bench to each student during laboratory activities
  • Chemicals are arranged in a logical order.
  • Laboratory apparatus are arranged in a logical order

'Laboratory practice' items:
  • You test the experiments before your work with students
  • You give laboratory manuals to student before practical work
  • You group and arrange students before they are coming to laboratory room
  • You set up apparatus and arrange chemicals for activities
  • You follow and supervise students when they perform activities
  • You work with the lab technician during performing activity
  • You are interested to perform activities?
  • You check appropriate accomplishment of your students' work
  • Check your students' interpretation, conclusion and recommendations
  • Give feedbacks to all your students work
  • Check whether the lab report is individual work or group
  • There is a time table to teachers to conduct laboratory activities.
  • Wear safety goggles, eye goggles, and other safety equipment in doing so
  • Work again if your experiment is failed
  • Active participant during laboratory activity

Items teachers and students were asked to rate on a four point scale (agree / strongly agree / disagree / strongly disagree)

Perceptions

One obvious limitation of this study is that it relies on reported perceptions.

One way to find out about the availability of laboratory equipment might be to visit teaching laboratories and survey them with an observation schedule – and perhaps even make a photographic record. The questionnaire assumes that teacher and student perceptions are accurate and that honest reports would be given (might teachers have had an interest in offering a particular impression of their work?)

Sometimes researchers are actually interested in impressions (e.g., for some purposes whether a student considers themselves a good chemistry student may be more relevant than an objective assessment), and sometimes researchers have no direct access to a focus of interest and must rely on other people's reports. Here it might be suggested that a survey by questionnaire is not really the best way to, for example, "evaluate laboratory equipment facilities for carrying out practical activities" (p.96).

Findings

The authors describe their main findings as,

"Chemistry laboratory equipment availability in both Damot secondary school and Jiga secondary school were found in very low level and much far less than the average availability of chemistry laboratory equipment. This finding supported by the analysis of one sample t-values and as it indicated the average availability of laboratory equipment are very much less than the test value and the p-value which is less than 0.05 indicating the presence of significant difference between the actual availability of equipment to the expected test value (2.5).

Chemistry laboratory practice in both Damot secondary school and Jiga secondary school were found in very low level and much far less than the average chemistry laboratory practice. This finding supported by the analysis of one sample t-values and as it indicated the average chemistry laboratory practice are very much less than the test value and the p-value which is less than 0.05 indicating the presence of significant difference between the actual chemistry laboratory practice to the expected test value."

Yesgat & Yibeltal, 2021: 101 (emphasis added)

This is the basis for the claim in the abstract that "From the analysis of Chemistry laboratory equipment availability and laboratory practice in both Damot secondary school and Jiga secondary school were found in very low level and much far less than the average availability of chemistry laboratory equipment and status of laboratory practice."

'The average …': what is the standard?

But this raises a key question – how do the authors know what the "the average availability of chemistry laboratory equipment and status of laboratory practice" is, if they have only used their questionnaire in two schools (which are both found to be below average)?

Yesgat & Yibeltal have run a comparison between the average ratings they get from the two schools on their two scales and the 'average test value' rating of 2.5. As far as I can see, this is not an empirical value at all. It seems the authors have just assumed that if people are asked to use a four point scale – 1, 2, 3, 4 – then the average rating will be…2.5. Of course, that is a completely arbitrary assumption. (Consider the question – 'how much would you like to be beaten and robbed today?': would the average response be likely to be the nominal mid-point of the rating scale?) Perhaps if a much wider survey had been undertaken the actual average rating would have been 1.9 or 2.7 or …
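To see how much the reported 'significance' hinges on that assumed benchmark, here is a toy sketch (the ratings are invented for illustration only, not taken from the study): the same set of mean ratings can appear significantly 'below average' against one assumed test value and unremarkable against another.

```python
from scipy import stats

# Invented mean ratings for illustration only - not data from the study
ratings = [2.1, 1.8, 2.4, 2.0, 2.3, 1.9, 2.2, 2.0]

for assumed_average in (2.5, 2.0):
    t, p = stats.ttest_1samp(ratings, popmean=assumed_average)
    print(f"test value {assumed_average}: t = {t:.2f}, p = {p:.3f}")

# Against 2.5 these ratings look 'significantly below average'; against 2.0 they
# do not - yet neither benchmark is an empirically established average.
```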

That is even assuming that 'average' is a meaningful concept here. A four point Likert scale is an ordinal scale ('agree' is always less agreement than 'strongly agree' and more than 'disagree') but not a ratio scale (that is, it cannot be assumed that the perceived 'agreement' gap (i) from 'strongly disagree' to 'disagree' is the same for each respondent and the same as that (ii) from 'disagree' to 'agree' and (iii) from 'agree' to 'strongly agree'). Strictly, Likert scale ratings cannot be averaged (better being presented as bar charts showing frequencies of response) – so although the authors carry out a great deal of analysis, much of this is, strictly, invalid.
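Here is a minimal sketch of the kind of descriptive presentation suggested above (with invented responses to a single hypothetical item): counting how many respondents chose each point on the scale respects the ordinal nature of the data without treating the codes 1-4 as measurements that can be meaningfully averaged.

```python
from collections import Counter

# Invented responses to one Likert item, coded as in the paper:
# 1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree
responses = [1, 2, 2, 3, 1, 2, 4, 2, 3, 1, 2, 2]

counts = Counter(responses)
labels = {1: "strongly disagree", 2: "disagree", 3: "agree", 4: "strongly agree"}
for code in (1, 2, 3, 4):
    n = counts.get(code, 0)
    print(f"{labels[code]:<18} {'#' * n} ({n})")
# Reporting frequencies (or a bar chart) shows the distribution of responses
# rather than an 'average agreement' that assumes equal gaps between categories.
```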

So what has been found out from this study?

I would very much like to know what peer reviewers made of this study. Expert reviewers would surely have identified some very serious weaknesses in the study and would have been expected to have recommended some quite major revisions even if they thought it might eventually be publishable in a research journal.

An editor is expected to take on board referee evaluations and ask authors to make such revisions as are needed to persuade the editor the submission is ready for publication. It is the job of the editor of a research journal, supported by the peer reviewers, to

a) ensure work of insufficient quality is not published

b) help authors strengthen their paper to correct errors and address weaknesses

Sometimes this process takes some time, with a number of cycles of revision and review. Here, however, the editor was able to move to a decision to publish in 5 days.

The study reflects a substantive amount of work by the authors. Yet, it is hard to see how this study, at least as reported in this journal, makes a substantive contribution to public knowledge. The study finds that one school gets somewhat higher ratings than another on an instrument that has not been fully validated, based on a pooling of student and teacher perceptions, and guesses that both schools rate lower than a hypothetical 'average' school. The two schools were supposed to represent "different approach[es] of teaching chemistry in practical approach" – but even if that is the case, the authors have not shared with their readers what these different approaches are meant to be. So, there would be no possibility of generalising from the schools to 'approach[es] of teaching chemistry', even if that were logically justifiable. And comparative education it is not.

This study, at least as published, does not seem to offer useful new knowledge to the chemistry education community that could support teaching practice or further research. Even in the very specific context of the two specific schools it is not clear what can be done with the findings which simply reflect back to the informants what they have told the researchers, without exploring the reasons behind the ratings (how do different teachers and students understand what counts as 'Chemicals are arranged in a logical order') or the values the participants are bringing to the study (is 'Check whether the lab report is individual work or group' meant to imply that it is seen as important to ensure that students work cooperatively or to ensure they work independently or …?)

If there is a problem highlighted here by the "very low levels" (based on a completely arbitrary interpretation of the scales) there is no indication of whether this is due to resourcing of the schools, teacher preparation, levels of technician support, teacher attitudes or pedagogic commitments, timetabling problems, …

This seems to be a study which has highlighted two schools, invited teachers and students to complete a dubious questionnaire, and simply used this to arbitrarily characterise the practical chemistry education in the schools as very poor, without contextualising any challenges or offering any advice on how to address the issues.

Work cited:
Notes:

1 'Imply' as Yesgat and Yibeltal do not actually state that they have carried out comparative education. However, if they do not think so, then the paragraph on comparative education in their introduction has no clear relationship with the rest of the study and is not more than a gratuitous reference, like suddenly mentioning Nottingham Forest's European Cup triumphs or noting a preferred flavour of tea.


2 This seemed an intriguing segment of the text as it was largely written in a more sophisticated form of English than the rest of the paper, apart from the odd reference to "Most compartivest [comparative education specialists?] states…" which seemed to stand out from the rest of the segment. Yesgat and Yibeltal do not present this as a quote, but cite a source informing their text (their reference [4] :Joubish, 2009). However, their text is very similar to that in another publication:

Quote from Mbozi, 2017, p.21, paired with the corresponding text from Yesgat and Yibeltal, 2021, pp.95-96:

Mbozi: "One purpose of comparative education is to stimulate critical reflection about our educational system, its success and failures, strengths and weaknesses."
Yesgat and Yibeltal: "One purpose of comparative education is to stimulate critical reflection about our educational system, its success and failures, strengths and weaknesses."

Mbozi: "This critical reflection facilitates self-evaluation of our work and is the basis for determining appropriate courses of action."
Yesgat and Yibeltal: "This critical reflection facilitates self-evaluation of our work and is the basis for determining appropriate courses of action."

Mbozi: "Another purpose of comparative education is to expose us to educational innovations and systems that have positive outcomes."
Yesgat and Yibeltal: "Another purpose of comparative education is to expose us to educational innovations and systems that have positive outcomes."

Mbozi (no counterpart in Yesgat and Yibeltal): "The exposure facilitates our adoption of best practices. Some purposes of comparative education were not covered in your exercise above. Purposes of comparative education suggested by two authors Noah (1985) and Kidd (1975) are presented below to broaden your understanding of the purposes of comparative education."

Mbozi: "Noah, (1985) states that comparative education has four main purposes [4] and these are:"
Yesgat and Yibeltal: "Most compartivest states that comparative education has four main purposes. These are:"

Mbozi: "1. To describe educational systems, processes or outcomes"
Yesgat and Yibeltal: "To describe educational systems, processes or outcomes"

Mbozi: "2. To assist in development of educational institutions and practices"
Yesgat and Yibeltal: "To assist in development of educational institutions and practices"

Mbozi: "3. To highlight the relationship between education and society"
Yesgat and Yibeltal: "To highlight the relationship between education and society"

Mbozi: "4. To establish generalized statements about education, that are valid in more than one country."
Yesgat and Yibeltal: "To establish generalized statements about education that are valid in more than one country"
Comparing text (broken into sentences to aid comparison) from two sources

3 There are more sophisticated techniques which can be used to check whether items do 'cluster' as expected for a particular sample of respondents.


4 As suggested above, researchers can pilot instruments with interviews or 'think aloud' protocols to check if items are understood as intended. Asking assumed experts to read through and check 'face validity' is of itself quite a limited process, but can be a useful initial screen to identify items of dubious relevance.

Not motivating a research hypothesis

A 100% survey return that represents 73% (or 70%, or perhaps 48%) of the population

Keith S. Taber

…the study seems to have looked for a lack of significant difference regarding a variable which was not thought to have any relevance…

This is like hypothesising…that the amount of alkali needed to neutralise a certain amount of acid will not depend on the eye colour of the researcher; experimentally confirming this is the case; and then seeking to publish the results as a new contribution to knowledge.

…as if a newspaper headline was 'Earthquake latest' and then the related news story was simply that, as usual, no earthquakes had been reported.

Structuring a research report

A research report tends to have a particular kind of structure. The first section sets out background to the study to be described. Authors offer an account of the current state of the relevant field – what can be called a conceptual framework.

In the natural sciences it may be that in some specialised fields there is a common, accepted way of understanding that field (e.g., the nature of important entities, the relevant variables to focus on). This has been described as working within an established scientific 'paradigm'. 1 However, social phenomena (such as classroom teaching) may be of such complexity that a full account requires exploration at multiple levels, with a range of analytical foci (Taber, 2008). 2 Therefore the report may indicate which particular theoretical perspective (e.g., personal constructivism, activity theory, Gestalt psychology, etc.) has informed the study.

This usually leads to one or more research questions, or even specific hypotheses, that are seen to be motivated by the state of the field as reflected in the authors' conceptual framework.

Next, the research design is explained: the choice of methodology (overall research strategy), the population being studied and how it was sampled, the methods of data collection and development of instruments, and choice of analytical techniques.

All of this is usually expected before any discussion (leaving aside a short statement as part of the abstract) of the data collected, results of analysis, conclusions and implications of the study for further research or practice.

There is a logic to designing research. (Image after Taber, 2014).

A predatory journal

I have been reading some papers in a journal that I believed, on the basis of its misleading title and website details, was an example of a poor-quality 'predatory journal'. That is, a journal which encourages submissions simply to be able to charge a publication fee (currently $1519, according to the website), without doing the proper job of editorial scrutiny. I wanted to test this initial evaluation by looking at the quality of some of the work published.

Although the journal is called the Journal of Chemistry: Education Research and Practice (not to be confused, even if the publishers would like it to be, with the well-established journal Chemistry Education Research and Practice) only a few of the papers published are actually education studies. One of the articles that IS on an educational topic is called 'Students' Perception of Chemistry Teachers' Characteristics of Interest, Attitude and Subject Mastery in the Teaching of Chemistry in Senior Secondary Schools' (Igwe, 2017).

A research article

The work of a genuine academic journal

A key problem with predatory journals is that because their focus is on generating income they do not provide the service to the community expected of genuine research journals (which inevitably involves rejecting submissions, and delaying publication till work is up to standard). In particular, the research journal acts as a gatekeeper to ensure nonsense or seriously flawed work is not published as science. It does this in two ways.

Discriminating between high quality and poor quality studies

Work that is clearly not up to standard (as judged by experts in the field) is rejected. One might think that in an ideal world no one is going to send work that has no merit to a research journal. In reality we cannot expect authors to always be able to take a balanced and critical view of their own work, even if we would like to think that research training should help them develop this capacity.

This assumes researchers are trained, of course. Many people carrying out educational research in science teaching contexts are only trained as natural scientists – and those trained as researchers in natural science often approach the social sciences with significant biases and blind-spots when carrying out research with people. (Watch or read 'Why do natural scientists tend to make poor social scientists?')

Also, anyone can submit work to a research journal – be they genius, expert, amateur, or 'crank'. Work is meant to be judged on its merits, not by the reputation or qualifications of the author.

De-bugging research reports – helping authors improve their work

The other important function of journal review is to identify weaknesses and errors and gaps in reports of work that may have merit, but where these limitations make the report unsuitable for publication as submitted. Expert reviewers will highlight these issues, and editors will ensure authors respond to the issues raised before possible publication. This process relies on fallible humans, and in the case of reviewers usually unpaid volunteers, but is seen as important for quality control – even if it is not a perfect system. 3

This improvement process is a 'win' all round:

  • the quality of what is published is assured so that (at least most) published studies make a meaningful contribution to knowledge;
  • the journal is seen in a good light because of the quality of the research it publishes; and
  • the authors can be genuinely proud of their publications which can bring them prestige and potentially have impact.

If a predatory journal which claims (i) to have academic editors making decisions and (ii) to use peer review does not rigorously follow proper processes, and so publishes (a) nonsense as scholarship, and (b) work with major problems, then it lets down the community and the authors – if not those making money from the deceit.

The editor took just over a fortnight to arrange any peer review, and come to a decision that the research report was ready for publication

Students' perceptions of chemistry teachers' characteristics

There is much of merit in this particular research study. Dr Iheanyi O. Igwe explains why there might be a concern about the quality of chemistry teaching in the research context, and draws upon a range of prior literature. Information about the population (the public secondary schools II chemistry students in Abakaliki Education Zone of Ebonyi State) and the sample is provided – including how the sample, of 300 students at 10 schools, was selected.

There is however an unfortunate error in characterising the population:

"the chemistry students' population in the zone was four hundred and ten (431)"

Igwe, 2017, p.8

This seems to be a simple typographic error, but the reader cannot be sure if this should read

  • "…four hundred and ten (410)" or
  • "…four hundred and thirty one (431)".

Or perhaps neither, as the abstract tells readers

"From a total population of six hundred and thirty (630) senior secondary II students, a sample of three hundred (300) students was used for the study selected by stratified random sampling technique."

Igwe, 2017, abstract

Whether the sample is 300/410 or 300/431 or even 300/630 does not fundamentally change the study, but one does wonder how these inconsistencies were not spotted by the editor, or a peer reviewer, or someone in the production department. (At least, one might wonder about this if one had not seen much more serious failures to spot errors in this journal.) A reader could wonder whether the presence of such obvious errors may indicate a lack of care that might suggest the possibility of other errors that a reader is not in a position to spot. (For example, if questionnaire responses had not been tallied correctly in compiling results, then this would not be apparent to anyone who did not have access to the raw data to repeat the analysis.) The author seems to have been let down here.

A multi-scale instrument

The final questionnaire contained 5 items on each of three scales:

  • students' perception of teachers' interest in the teaching of chemistry;
  • students' perception of teachers' attitude towards the teaching of chemistry;
  • students' perception of teachers' mastery of the subject in the teaching of chemistry

Igwe informs readers that,

"the final instrument was tested for reliability for internal consistency through the Cronbach Alpha statistic. The reliability index for the questionnaire was obtained as 0.88 which showed that the instrument was of high internal consistency and therefore reliable and could be used for the study"

Igwe, 2017, p.4

This statistic is actually not very useful information, as one would want to know about the internal consistency within each of the scales – an overall value across scales is not informative (conceptually, it is not clear how it should be interpreted – perhaps that the three scales are largely eliciting much the same underlying factor?) (Taber, 2018). 4

There are times when aggregate information is not very informative (Image by Syaibatul Hamdi from Pixabay )

Again, one might have hoped that expert reviewers would have asked the author to quote the separate alpha values for the three scales as it is these which are actually informative.
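For readers unfamiliar with how this is typically done, here is a minimal sketch of computing alpha separately for each five-item scale. It assumes, purely for illustration, that the item responses sit in a data file with invented column names – this is not Igwe's data or analysis, just the standard alpha formula applied per scale.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of item-response columns (one row per respondent)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data file and column names: five items per scale.
responses = pd.read_csv("questionnaire_responses.csv")
scales = {
    "interest": [f"interest_{i}" for i in range(1, 6)],
    "attitude": [f"attitude_{i}" for i in range(1, 6)],
    "mastery":  [f"mastery_{i}" for i in range(1, 6)],
}

# Report alpha separately for each five-item scale – which is what a reader needs –
# rather than a single value computed across all fifteen items.
for name, cols in scales.items():
    print(name, round(cronbach_alpha(responses[cols]), 2))
```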

The paper also offers a detailed account of the analysis of the data, and an in-depth discussion of the findings and potential implications. This is a serious study that clearly reflects a lot of work by the researcher. (We might hope that could be taken for granted when discussing work published in a 'research journal', but sadly that is not so in some predatory journals.) There are limitations of course. All research has to stop somewhere, and resources and, in particular, access opportunities are often very limited. One of these limitations is the wider relevance of the population sampled.

But do the results apply in Belo Horizonte?

This is the generalisation issue. The study concerns the situation in one administrative zone within a relatively small state in South East Nigeria. How do we know it has anything useful to tell us about elsewhere in Nigeria, let alone about the situation in Mexico or Vietnam or Estonia? Even within Ebonyi State, the Abakaliki Education Zone (that is, the area of the state capital) may well be atypical – perhaps the best qualified and most enthusiastic teachers tend to work in the capital? Perhaps there would have been different findings in a more rural area?

Yet this is a limitation that applies to a good deal of educational research. This goes back to the complexity of educational phenomena. What you find out about an electron or an oxidising agent studied in Abakaliki should apply in Cambridge, Cambridgeshire, or equally in Cambridge, Massachusetts. That cannot be claimed about what you may find out about a teacher in Abakaliki, or a student, a class, a school, a University…

Misleading study titles?

Educational research studies often have titles that are, strictly speaking, misleading – or at least promise a lot more than they deliver. This may in part be authors making unwarranted assumptions, or it may be journal editors wanting to avoid unwieldy titles.

"This situation has inadvertently led to production of half backed graduate Chemistry educators."

Igwe, 2017, p.2

The title of this study does suggest that the study concerns perceptions of Chemistry Teachers' Characteristics …in Senior Secondary Schools, when we cannot assume that chemistry teachers in the Abakaliki Education Zone of Ebonyi State can stand for chemistry teachers more widely. Indeed, some of the issues raised as motivating the need for the study are clearly not issues that would apply in all other educational contexts – that is, the 'situation' which is said to be responsible for the "production of half backed [half-baked?] graduate Chemistry educators" in Nigeria will not apply everywhere. Whilst the title could be read as promising more general findings than were possible in the study, Igwe's abstract is quite explicit about the specific population sampled.

A limited focus?

Another obvious limitation is that whilst pupils' perceptions of their teachers are very important, they do not offer a full picture. Pupils may feel the need to give positive reviews, or may have idealistic conceptions. Indeed, assuming that voluntary, informed consent was given (which would mean that students knew they could decline to take part in the research without fear of sanctions), it is of note that every one of the 30 students targeted in each of the ten schools agreed to complete the survey,

"The 300 copies of the instrument were distributed to the respondents who completed them for retrieval on the spot to avoid loss and may be some element of bias from the respondents. The administration and collection were done by the researcher and five trained research assistants. Maximum return was made of the instrument."

Igwe, 2017, p.4

To get a 100% return on a survey is pretty rare, and if normal ethical procedures were followed (with the voluntary nature of the activity made clear) then this suggests these students were highly motivated to appease adults working in the education system.

But we might ask how student perceptions of teacher characteristics actually relate to the teachers' characteristics themselves.

For example, observations of the chemistry classes taught by these teachers could possibly give a very different impression of those teachers than that offered by the student ratings in the survey. (Another chemistry teacher may well be able to distinguish teacher confidence or bravado from subject mastery when a learner is not well placed to do so.) Teacher self-reports could also offer a different account of their 'Interest, Attitude and Subject Mastery', as could evaluations by their school managers. Arguably, a study that collected data from multiple sources would offer the possibility of 'triangulating' between sources.

However, Igwe is explicit about the limited focus of the study, and other complementary strands of research could be carried out to follow up on it. So, although the specific choice of focus is a limitation, this does not negate the potential value of the study.

Research questions

Although I recognise a serious and well-motivated study, there is one aspect of Igwe's study which seemed rather bizarre. The study has three research questions (which are well-reflected in the title of the study) and a hypothesis which I suspect will surprise some readers.

That is not a good thing. At least, I always taught research students that unlike in a thriller or 'who done it?' story, where a surprise may engage and amuse a reader, a research report or thesis is best written to avoid such surprises. The research report is an argument that needs to flow through the account – if a reader is surprised at something the researcher reports doing then the author has probably forgotten to properly introduce or explain something earlier in the report.

Here are the research questions and the hypothesis:

"Research Questions

The following research questions guided the study, thus:

How do students perceive teachers' interest in the teaching of chemistry?

How do students perceive teachers' attitude towards the teaching of chemistry?

How do students perceive teachers' mastery of the subjects in the teaching of chemistry?

Hypotheses
The following null hypothesis was tested at 0.05 alpha levels, thus:
HO1 There is no significant difference in the mean ratings of male and female students on their perception of chemistry teachers' characteristics in the teaching of chemistry."

Igwe, 2017, p.3

A surprising hypothesis?

A hypothesis – now where did that come from?

Now, I am certainly not criticising a researcher for looking for gender differences in research. (That would be hypocritical as I looked for such differences in my own M.Sc. thesis, and published on gender differences in teacher-student interactions in physics classes, gender differences in students' interests in different science topics on starting secondary school, and links between pupil perceptions of (i) science-relatedness and (ii) gender-appropriateness of careers.)

There might often be good reasons in studies to look for gender differences. But these reasons should be stated up-front. As part of the conceptual framework motivating the study, researchers should explain that, based on their informal observations, or on anecdotal evidence, or (better) drawing upon explicit theoretical considerations, or informed by the findings of other related studies – or whatever the reason might be – there are good reasons to check for gender differences.

The flow of research (Underlying image from Taber, 2013) The arrows can be read as 'inform(s)'.

Perhaps Igwe had such reasons, but there seems to be no mention of 'gender' as a relevant variable prior to the presentation of the hypothesis: not even a concerning dream, or signs in the patterns of tea leaves. 5 To some extent, this is reinforced by the choice of the null hypothesis – that no such difference will be found. Even if it makes no substantive difference to a study whether a hypothesis is framed in terms of there being a difference or not, psychologically the study seems to have looked for a lack of significant difference regarding a variable which was not thought to have any relevance.
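For readers who have not met this kind of analysis, the test implied by HO1 – comparing male and female students' mean ratings at the 0.05 level – would commonly be carried out as an independent-samples t-test (the paper's own test output is not reproduced here). The sketch below is purely illustrative, using randomly generated ratings rather than Igwe's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative made-up ratings (1-5 scale) for two groups of respondents;
# these are NOT data from the study being discussed.
male_ratings = rng.integers(1, 6, size=150).astype(float)
female_ratings = rng.integers(1, 6, size=150).astype(float)

# Independent-samples t-test of the null hypothesis of no difference in means,
# judged at the 0.05 significance level.
t_stat, p_value = stats.ttest_ind(male_ratings, female_ratings)

if p_value < 0.05:
    print(f"Reject H0: t = {t_stat:.2f}, p = {p_value:.3f}")
else:
    print(f"Retain H0 (no significant difference): t = {t_stat:.2f}, p = {p_value:.3f}")
```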

Misuse of statistics

It is important for researchers not to test for effects that are not motivated in their studies. Statistical significance tells a researcher something is unlikely to happen just by chance – but it still might. Just as someone buying a lottery ticket is unlikely to win the lottery – but they might. Logically a small proportion of all the positive statistical results in the literature are 'false positives' because unlikely things do happen by chance – just not that often. 6 The researcher should not (metaphorically!) go round buying up lots of lottery tickets, and then seeing an occasional win as something more than chance.

No alarms and no surprises

And what was found?

"From the result of analysis … the null hypothesis is accepted which means that there is no significant difference in the mean ratings of male and female students in their perception of chemistry teachers' characteristics (interest, attitude and subject mastery) in the teaching of chemistry."

Igwe, 2017, p.6

This is like hypothesising, without any motivation, that the amount of alkali needed to neutralise a certain amount of acid will not depend on the eye colour of the researcher; experimentally confirming this is the case; and then seeking to publish the results as a new contribution to knowledge.

Why did Igwe look for a gender difference (or, more strictly, look for no gender difference)?

  • A genuine relevant motivation missing from the paper?
  • An imperative to test for something (anything)?
  • Advice that journals are more likely to publish studies using statistical testing?
  • Noticing that a lot of studies do test for gender differences (whether there seems a good reason to do so or not)?

This seems to be an obvious point for peer reviewers and the editor to raise: asking the author either (a) to explain why it makes sense to test for gender differences in this study, or (b) to drop the hypothesis from the paper. It seems they did not notice this, and readers are simply left to wonder – just as you would if a newspaper headline was 'Earthquake latest' and then the related news story was simply that, as usual, no earthquakes had been reported.

Work cited:


Footnotes:

1 The term 'paradigm' became widely used in this sense after Kuhn's (1970) work, although he later acknowledged criticisms of the ambiguous way he had used the term – referring both to the standard examples ('paradigms' in a narrower sense) through which newcomers learn to work in a field, and to the wider set of shared norms and values that develop in an established field, which he later termed the 'disciplinary matrix'. In psychology research, 'paradigm' may be used in the more specific sense of an established research design/protocol.


2 There are at least three ways of explaining why a lot of research in the social sciences seems more chaotic and less structured to outsiders than most research in the natural sciences.

  • a) Ontology. Perhaps the things studied in the natural sciences really exist, and some of those in the social sciences are epiphenomena and do not reflect fundamental, 'real', things. There may be some of that sometimes, but, if so, I think it is a matter of degree (after all, scientists have not been beyond studying the ether or phlogiston), and largely a consequence of the third option (c).
  • b) The social sciences are not as mature as many areas of the natural sciences and so are still 'pre-paradigmatic'. I am sure there is sometimes an element of this: any new field will take time to focus in on reliable and productive ways of making sense of its domain.
  • c) The complexity of the phenomena. Social phenomena are inherently more complex, often involving feedback loops between participants' behaviours and feelings and beliefs (including about the research, the researcher, etc.)

Whilst (a) and (b) may sometimes be pertinent, I think (c) is often especially relevant to this question.


3 An alternative approach that has gained some credence is to allow authors to publish, but then invite reader reviews which will also be published – and so allowing a public conversation to develop so readers can see the original work, criticism, responses to those criticisms, and so forth, and make their own judgements. To date this has only become common practice in a few fields.

Another approach for empirical work is for authors to submit research designs to journals for peer review – once a design has been accepted by the journal, the journal agrees to publish the resulting study as long as the agreed protocol has been followed. (This is seen as helping to avoid the distorting bias in the literature towards 'positive' results as studies with 'negative' results may seem less interesting and so less likely to be accepted in prestige journals.) Again, this is not the norm (yet) in most fields.


4 The statistic has a maximum value of 1, which would indicate that the items were all equivalent, so 0.88 seems a high value, till we note that a high value of alpha is a common artefact of including a large number of items.
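To see why, note that under the simplifying assumption of roughly equal item variances, the 'standardised' form of alpha depends only on the number of items (k) and the mean inter-item correlation (r). The short sketch below, using an invented value for r, shows alpha creeping towards 1 simply as items are added, even though the items are no more closely related.

```python
# Standardised Cronbach's alpha: depends only on the number of items (k)
# and the mean inter-item correlation (r), assuming similar item variances.
def standardised_alpha(k: int, r: float) -> float:
    return (k * r) / (1 + (k - 1) * r)

# An invented, fixed mean inter-item correlation of 0.33:
for k in (5, 10, 15, 30):
    print(f"{k:2d} items -> alpha = {standardised_alpha(k, 0.33):.2f}")
# 5 items -> 0.71; 15 items -> 0.88; 30 items -> 0.94
```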

However, playing Devil's advocate, I might suggest that the high overall value of alpha could indicate that the three scales

  • students' perception of teachers' interest in the teaching of chemistry;
  • students' perception of teachers' attitude towards the teaching of chemistry;
  • students' perception of teachers' mastery of the subject in the teaching of chemistry

are all tapping into a single underlying factor that might be something like

  • my view of whether my chemistry teacher is a good teacher

or even

  • how much I like my chemistry teacher

5 Actually, the distinction made is between male and female students – it is not clear what question students were asked to determine 'gender', whether other response options were available, or whether students could decline to respond to this item.


6 Our intuition might be that only a small proportion of reported positive results are false positives, because, of course, positive results reflect things unlikely to happen by chance. However if, as is widely believed in many fields, there is a bias to reporting positive results, this can distort the picture.

Imagine someone looking for factors that influence classroom learning. Consider that 50 variables are identified to test, such as teacher eye colour, classroom wall colour, type of classroom window frames, what the teacher has for breakfast, the day of the week that the teacher was born, the number of letters in the teacher's forename, the gender of the student who sits nearest the fire extinguisher, and various other variables which are not theoretically motivated to be considered likely to have an effect. With a significance level of p[robability] ≤ 0.05 it is likely that there will be a very small number of positive findings JUST BY CHANCE. That is, if you look across enough unlikely events, it is likely some of them will happen. There is unlikely to be a thunderstorm on any particular day. Yet there will likely be a thunderstorm some day in the next year. If a report is written and published which ONLY discusses a positive finding then the true statistical context is missing, and a likely situation is presented as unlikely to be due to chance.
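A minimal simulation makes this concrete: with 50 comparisons on variables that genuinely have no effect, tested at p ≤ 0.05, around 50 × 0.05 = 2.5 'significant' results are expected by chance alone. (The data below are randomly generated; nothing here refers to any real study.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_variables, n_students = 50, 60

false_positives = 0
for _ in range(n_variables):
    # Two groups drawn from the SAME distribution: any 'effect' is pure chance.
    group_a = rng.normal(loc=50, scale=10, size=n_students)
    group_b = rng.normal(loc=50, scale=10, size=n_students)
    _, p = stats.ttest_ind(group_a, group_b)
    if p <= 0.05:
        false_positives += 1

# Expect roughly 50 * 0.05 = 2.5 'significant' results even though nothing is going on.
print(f"'Significant' findings among {n_variables} null comparisons: {false_positives}")
```

If only the two or three 'hits' were then written up, a reader would have no way of knowing that they arose from fifty such fishing trips.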


Those flipping, confounding variables!

Keith S. Taber

Alternative interpretations and a study on flipped learning

Image by Please Don't sell My Artwork AS IS from Pixabay

Flipping learning

I was reading about a study of 'flipped learning'. Put very simply, the assumption behind flipped learning is that usually teaching follows a pattern of (a) class time spent with the teacher lecturing, followed by (b) students working through examples largely in their own time. This is a pattern that was (and perhaps still is) often found in Universities in subjects that largely teach through lecture courses.

The flipped learning approach switches the use of class time to 'active' learning activities, such as working through exercises, by having students undertake some study before class. That is, students learn about what would have been presented in the lecture by reading texts, watching videos, interacting with on-line learning resources, and so forth, BEFORE coming to class. The logic is that the teacher's input is more useful when students are being challenged to apply the new ideas than as a means of presenting information.

That is clearly a quick gloss, and much more could be said about the rationale, the assumptions behind the approach, and its implementation.

(Read more about flipped learning)

However, in simple terms, the mode of instruction for two stages of the learning process

  • being informed of scientific ideas (through a lecture)
  • applying those ideas (in unsupported private study)

are 'flipped' to

  • being informed of scientific ideas (through accessing learning resources)
  • applying those ideas (in a context where help and feedback is provided)

Testing pedagogy

So much for the intention, but does it work? That is where research comes in. If we want to test a hypothesis, such as 'students will learn more if learning is flipped' (or 'students will enjoy their studies more if learning is flipped', or 'more students will opt to study the subject further if learning is flipped', or whatever) then it would seem an experiment is called for.

In principle, experiments allow us to see if changing some factor (say, the sequence of activities in a course module) will change some variable (say, student scores on a test). The experiment is often the go-to methodology in natural sciences: modify one variable, and measure any change in another hypothesised to be affected by it, whilst keeping everything else that could conceivably have an influence constant. Even in science, however, it is seldom that simple, and experiments can never actually 'prove' our hypothesis is correct (or false).

(Read more about the scientific method)

In education, running experiments is even more challenging (Taber, 2019). Learners, classes, teachers, courses, schools, universities are not 'natural kinds'. That is, the kind of comparability you can expect between two copper sulphate crystals of a given mass, or two specimens of copper wire of given dimensions, does not apply: it can matter a lot whether you are testing this student or that student, or whether the class is taught by one teacher or another.

People respond to conditions differently from inanimate objects – if testing the conductivity of a sample of a salt solution of a given concentration, it should not matter if it is Monday morning or Thursday afternoon, or whether it is windy outside, or which team lost last night's match, or even whether the researcher is respectful or rude to the sample. Clearly when testing the motivation or learning of students, such things could influence measurements. Moreover, a sample of gas neither knows nor cares what you are expecting to happen when you compress it, but people can be influenced by the expectations of researchers (the so-called expectancy effect – also known as the Pygmalion effect).

(Read about experimental research into teaching innovations)

Flipping the fundamentals of analytic chemistry

In the study, by Ponikwer and Patel, researchers flipped part of a module on the fundamentals of analytical chemistry, which was part of a BSc honours degree in biomedical science. The module was divided into three parts:

  1. absorbance and emission spectroscopy
  2. chromatography and electrophoresis
  3. mass spectroscopy and nuclear magnetic resonance spectroscopy

Students were taught the first topics by the usual lectures, then the topics of chromatography and electrophoresis were taught 'flipped', before the final topics were taught through the usual lectures. This pattern was repeated over three successive years.

[Figure 1 in the paper offers a useful graphical representation of the study design. If I had been prepared to pay SpringerNature a fee, I would have been allowed to reproduce it here.*]

The authors of the study considered the innovation a success:

This study suggests that flipped learning can be an effective model for teaching analytical chemistry in single topics and potentially entire modules. This approach provides the means for students to take active responsibility in their learning, which they can do at their own pace, and to conduct problem-solving activities within the classroom environment, which underpins the discipline of analytical chemistry. (Ponikwer & Patel,  2018: p.2268)

Confounding variables

Confounding variables are other factors which might vary between conditions and have an effect.

Read about confounding variables

Ponikwer and Patel were aware that one needs to be careful in interpreting the data collected in such a study. For example, it is not especially helpful simply to consider how well students did on the examination questions at the end of term to see if students did as well, or better, on the flipped topics than on the other topics taught. Clearly students might find some topics, or indeed some questions, more difficult than others regardless of how they studied. Ponikwer and Patel reported that on average students did significantly better on questions from the flipped elements, but included important caveats:

"This improved performance could be due to the flipped learning approach enhancing student learning, but may also be due to other factors, such as students finding the topic of chromatography more interesting or easier than spectroscopy, or that the format of flipped learning made students feel more positive about the subject area compared with those subject areas that were delivered traditionally." (Ponikwer & Patel,  2018: p.2267)

Whilst acknowledging such alternative explanations for their findings might seem to undermine their results, it is good science to be explicit about such caveats. Looking for (and reporting) alternative explanations is a key part of the scientific attitude.

This good scientific practice is also clear where the authors discuss how attendance patterns varied over the course. The authors report that the attendance at the start of the flipped segment was similar to what had come before, but then attendance increased slightly during the flipped learning section of the course. They point out that this shift was "not significant"; that is, the statistics suggested it could not be ruled out as a chance effect.

However, Ponikwer and Patel do report a statistically "significant reduction in the attendance at the non-flipped lectures delivered after the flipped sessions" (p.2265) – that is, once students had experienced the flipped learning, on average they tended to attend normal lectures less later in their course. The authors suggest this could be a positive reaction to how they experienced the flipped learning, but again they point out that there were confounding variables, and other interpretations could not be ruled out:

"This change in attendance may be due to increased engagement in the flipped learning module; however, it could also reflect a perception that a more exciting approach of lecturing or content is to be delivered. The enhanced level of engagement may also be because students could feel left behind in the problem-solving workshop sessions. The reduction in attendance after the flipped lecture may be due to students deciding to focus on assessments, feeling that they may have met the threshold attendance requirement" (Ponikwer & Patel,  2018: p.2265).

So, with these students, taking this particular course, in this particular university, having this sequence of topics based on some traditional and some flipped learning, there is some evidence of flipped learning better engaging students and leading to improved learning – but subject to a wide range of caveats which allow various alternative explanations of the findings.

(Read about caveats to research conclusions)

Pointless experiments?

Given the difficulties of interpreting experiments in education, one may wonder if there is any point in carrying out experiments on teaching and learning. On the other hand, for the lecturing staff on the course, it would seem strange to get these results and dismiss them (it has not been proved that flipped learning has positive effects, but the results are at least suggestive, and we can only base our action on the available evidence).

Moreover, Ponikwer and Patel collected other data, such as students' perceptions of the advantages and challenges of the flipped learning approach – data that can complement their statistical tests, and also inform potential modifications of the implementation of flipped learning for future iterations of the course.

(Read about the use of multiple research techniques in studies)

Is generalisation possible?

What does this tell us about the use of flipped learning elsewhere? Studies taking place in a single unique teaching and learning context do not automatically tell us what would have been the case elsewhere – with different lecturing staff, a different demographic of students, or when learning about marine ecology or general relativity. Such studies are best seen as context-directed, as being most relevant to where they are carried out.

However, again, even if research cannot be formally generalised, that does not mean that it cannot be informative to those working elsewhere who may apply a form of 'reader generalisation' to decide either:

a) that teaching and learning context seems very similar to ours: it might be worth trying that here;

or

b) that is a very different teaching and learning context to ours: it may not be worth the effort and disruption to try that out here based on the findings in such a different context.

(Read about generalisation)

This requires studies to give details of the teaching and learning context where they were carried out (so-called 'thick description'). Clearly the more similar a study context is to one's own teaching context, and the wider the range of teaching and learning contexts where a particular pedagogy or teaching approach has been shown to have positive outcomes, the more reason there is to feel it is worth trying something out in one's own classroom.

I have argued that:

"What are [common in the educational research literature] are individual small-scale experiments that cannot be considered to offer highly generalisable results. Despite this, where these individual studies are seen as being akin to case studies (and reported in sufficient detail) they can collectively build up a useful account of the range of application of tested innovations. That is, some inherent limitations of small-scale experimental studies can be mitigated across series of studies, but this is most effective when individual studies offer thick description of teaching contexts and when contexts for 'replication' studies are selected to best complement previous studies." (Taber, 2019: 106)

In that regard, studies like that of Ponikwer and Patel can be considered not as 'proof' of the effectiveness of flipped learning, but as part of a cumulative evidence base for the value of trying out the approach in various teaching situations.

Why I have not included the original figure showing the study design

* I had hoped to include in this post a copy of the figure in the paper showing the study design. The paper is not published open access, and so the copyright in the 'design' (that is, the design of the figure **, not the study!) means that it cannot legally be reproduced without permission. I sought permission to reproduce the figure here through the publisher's (SpringerNature) online permissions request system, explaining that this was to be used in an academic scholar's personal blog.

Springer granted permission for reuse, but subject to a fee of £53.83.

As copyright holder/managers they are perfectly entitled to do that. However, I had assumed that they would offer free use for a non-commercial purpose that offers free publicity to their publication. I have other uses for my pension, so I refer readers interested in seeing the figure to the original paper.

** Under the conventions associated with copyright law the reproduction of short extracts of an academic paper for the purposes of criticism and review is normally considered 'fair use' and exempt from copyright restrictions. However, any figure (or table) is treated as a discrete artistic design and cannot be copied from a work in copyright without permission.

(Read about copyright and scholarly works)

 

Work cited:

A case of hybrid research design?

When is "a case study" not a case study? Perhaps when it is (nearly) an experiment?

Keith S. Taber

I read this interesting study exploring learners' shifting conceptions of the particulate nature of gases.

Mamombe, C., Mathabathe, K. C., & Gaigher, E. (2020). The influence of an inquiry-based approach on grade four learners' understanding of the particulate nature of matter in the gaseous phase: a case study. EURASIA Journal of Mathematics, Science and Technology Education, 16(1), 1-11. doi:10.29333/ejmste/110391

Key features:

  • Science curriculum context: the particulate nature of matter in the gaseous phase
  • Educational context: Grade 4 students in South Africa
  • Pedagogic context: Teacher-initiated inquiry approach (compared to a 'lecture' condition/treatment)
  • Methodology: "qualitative pre-test/post-test case study design" – or possibly a quasi-experiment?
  • Population/sample: the sample comprised 116 students from four grade four classes, two from each of two schools

This study offers some interesting data, providing evidence of how students represent their conceptions of the particulate nature of gases. What most intrigued me about the study was its research design, which seemed to reflect an unusual hybrid of quite distinct methodologies.

In this post I look at whether the study is indeed a case study as the authors suggest, or perhaps a kind of experiment. I also make some comments about the teaching model of the states of matter presented to the learners, and raise the question of whether the comparison condition (lecturing 8-9 year old children about an abstract scientific model) is appropriate, and indeed ethical.

Learners' conceptions of the particulate nature of matter

This paper is well worth reading for anyone who is not familiar with existing research (such as that cited in the paper) describing how children make sense of the particulate nature of matter, something that many find counter-intuitive. As a taster for this, I reproduce here two figures from the paper (which is published open access under a Creative Commons license* that allows sharing and adaptation of copyright material with due acknowledgement).

Figures © 2020 by the authors of the cited paper *

Conceptions are internal, and only directly available to the epistemic subject, the person holding the conception. (Indeed, some conceptions may be considered implicit, and so not even available to direct introspection.) In research, participants are asked to represent their understandings in the external 'public space' – often in talk, here by drawing (Taber, 2013). The drawings have to be interpreted by the researchers (during data analysis). In this study the researchers also collected data from group work during learning (in the enquiry condition) and by interviewing students.

What kind of research design is this?

Mamombe and colleagues describe their study as "a qualitative pre-test/post-test case study design with qualitative content analysis to provide more insight into learners' ideas of matter in the gaseous phase" (p. 3), yet it has many features of an experimental study.

The study was

"conducted to explore the influence of inquiry-based education in eliciting learners' understanding of the particulate nature of matter in the gaseous phase"

p.1

The experiment compared two pedagogical treatments:

  • "inquiry-based teaching…teacher-guided inquiry method" (p.3) guided by "inquiry-based instruction as conceptualized in the 5Es instructional model" (p.5)
  • "direct instruction…the lecture method" (p.3)

These pedagogic approaches were described:

"In the inquiry lessons learners were given a lot of materials and equipment to work with in various activities to determine answers to the questions about matter in the gaseous phase. The learners in the inquiry lessons made use of their observations and made their own representations of air in different contexts."

"the teacher gave probing questions to learners who worked in groups and constructed different models of their conceptions of matter in the gaseous phase. The learners engaged in discussion and asked the teacher many questions during their group activities. Each group of learners reported their understanding of matter in the gaseous phase to the class"

p.5, p.1

"In the lecture lessons learners did not do any activities. They were taught in a lecturing style and given all the notes and all the necessary drawings.

In the lecture classes the learners were exposed to lecture method which constituted mainly of the teacher telling the learners all they needed to know about the topic PNM [particulate nature of matter]. …During the lecture classes the learners wrote a lot of notes and copied a lot of drawings. Learners were instructed to paste some of the drawings in their books."

pp.5-6

The authors report that,

"The learners were given clear and neat drawings which represent particles in the gaseous, liquid and solid states…The following drawing was copied by learners from the chalkboard."

p.6
Figure used to teach learners in the 'lecture' condition. Figure © 2020 by the authors of the cited paper *
A teaching model of the states of matter

This figure shows increasing separation between particles moving from solid to liquid to gas. It is not a canonical figure: in reality the spacing in a liquid is not substantially greater than in a solid (indeed, in ice floating on water the spacing is greater in the solid), whereas the much larger difference in spacing between the liquid and gaseous states is under-represented.

Such figures do not show the very important dynamic aspect: that in a solid particles can usually only oscillate around a fixed position (a very low rate of diffusion notwithstanding), whereas in a liquid particles can move around, but movement is restricted by the close arrangement of (and intermolecular forces between) the particles, and in a gas there is a significant mean free path between collisions where particles move with virtually constant velocity. A static figure like this, then, does not show the critical differences in particle interactions which are core to the basic scientific model.

Perhaps even more significantly, Figure 2 suggests there is the same level of order in the three states, whereas the difference in ordering between a solid and a liquid is much more significant than any change in particle spacing.

In teaching, choices have to be made about how to represent science (through teaching models) to learners who are usually not ready to take on board the full details and complexity of scientific knowledge. Here, Figure 2 represents a teaching model where it has been decided to emphasise one aspect of the scientific model (particle spacing) by distorting the canonical model, and to neglect other key features of the basic scientific account (particle movement and arrangement).

External teachers taught the classes

The teaching was undertaken by two university lecturers

"Two experienced teachers who are university lecturers and well experienced in teacher education taught the two classes during the intervention. Each experienced teacher taught using the lecture method in one school and using the teacher-guided inquiry method in the other school."

p.3

So, in each school there was one class taught by each approach (enquiry/lecture) by a different visiting teacher, and the teachers 'swapped' the teaching approaches between schools (a sensible measure to balance possible differences between the skills/styles of the two teachers).

The research design included a class in each treatment in each of two schools

An experiment; or a case study?

Although the study compared progression in learning across two teaching treatments using an analysis of learner diagrams, the study also included interviews, as well as learners' "notes during class activities" (which one would expect would be fairly uniform within each class in the 'lecture' treatment).

The outcome

The authors do not consider their study to be an experiment, despite setting up two conditions for teaching, and comparing outcomes between the two conditions, and drawing conclusions accordingly:

"The results of the inquiry classes of the current study revealed a considerable improvement in the learners' drawings…The results of the lecture group were however, contrary to those of the inquiry group. Most learners in the lecture group showed continuous model in their post-intervention results just as they did before the intervention…only a slight improvement was observed in the drawings of the lecture group as compared to their pre-intervention results"

pp.8-9

These statements can be read in two ways – either

  • a description of events (it just happened that with these particular classes the researchers found better outcomes in the enquiry condition), or
  • as the basis for a generalised inference.

An experiment would be designed to test a hypothesis (this study does not seem to have an explicit hypothesis, nor explicit research questions). Participants would be assigned randomly to conditions (Taber, 2019), or, at least, classes would be randomly assigned (although then strictly each class should be considered as a single unit of analysis offering much less basis for statistical comparisons). No information is given in the paper on how it was decided which classes would be taught by which treatment.
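For comparison, random allocation of the classes would have been straightforward both to do and to report. The sketch below shows one way the four classes could have been randomly assigned, one class per school to each condition (mirroring the counterbalanced design described above); the class labels are invented, and nothing here reflects how the researchers actually allocated classes.

```python
import random

# Invented labels for the four grade-four classes (two per school).
schools = {
    "School A": ["class A1", "class A2"],
    "School B": ["class B1", "class B2"],
}

assignment = {}
for school, classes in schools.items():
    # Within each school, randomly allocate one class to each condition.
    random.shuffle(classes)
    assignment[classes[0]] = "inquiry"
    assignment[classes[1]] = "lecture"

print(assignment)  # e.g. {'class A2': 'inquiry', 'class A1': 'lecture', ...}
```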

Representativeness

A study could be carried out with the participation of a complete population of interest (e.g., all of the science teachers in one secondary school), but more commonly a sample is selected from a population of interest. In a true experiment, the sample has to be selected randomly from the population (Taber, 2019) which is seldom possible in educational studies.

The study investigated a sample of 'grade four learners'

In Mamombe and colleagues' study the sample is described. However, there is no explicit reference to the population from which the sample is drawn. Yet the use of the term 'sample' (rather than just, say, 'participants') implies that they did have a population in mind.

The aim of the study is given as "to explore the influence of inquiry-based education in eliciting learners' understanding of the particulate nature of matter in the gaseous phase" (p.1), which could be considered to imply that the population is 'learners'. The title of the paper could be taken to suggest the population of interest is more specific: "grade four learners". However, the authors make no attempt to argue that their sample is representative of any particular population, and therefore have no basis for statistical generalisation beyond the sample (whether to learners, or to grade four learners, or to grade four learners in RSA, or to grade four learners in farm schools in RSA, or…).

Indeed only descriptive statistics are presented: there is no attempt to use tests of statistical significance to infer whether the difference in outcomes between conditions found in the sample would probably have also been found in the wider population.

(That is, inferential statistics are commonly used to suggest 'we found a statistically significant better outcome in one condition in our sample, so in the hypothetical situation that we had been able to include the entire population in our study we would probably have found better mean outcomes in that same condition'.)

This may be one reason why Mamombe and colleagues do not consider their study to be an experiment. The authors acknowledge limitations in their study (as there always are in any study) including that "the sample was limited to two schools and two science education specialists as instructors; the results should therefore not be generalized" (p.9).

Yet, of course, if the results cannot be generalised beyond these four classes in two schools, this undermines the usefulness of the study (and the grounds for the recommendations the authors make for teaching based on their findings in the specific research contexts).

If considered as an experiment, the study suffers from other inherent limitations (Taber, 2019). There were likely novelty effects, and even though there was no explicit hypothesis, it is clear that the authors expected enquiry to be a productive approach, so expectancy effects may have been operating.

Analytical framework

In an experiment it is important to have an objective means to measure outcomes, and this should be determined before data are collected. (Read about 'Analysis' in research studies.) In this study methods used in previous published work were adopted, and the authors tell us that "A coding scheme was developed based on the findings of previous research…and used during the coding process in the current research" (p.6).

But they then go on to report,

"Learners' drawings during the pre-test and post-test, their notes during class activities and their responses during interviews were all analysed using the coding scheme developed. This study used a combination of deductive and inductive content analysis where new conceptions were allowed to emerge from the data in addition to the ones previously identified in the literature"

p.6

An emerging analytical frame is perfectly appropriate in 'discovery' research where a pre-determined conceptualisation of how data is to be understood is not employed. However in 'confirmatory' research, testing a specific idea, the analysis is operationalised prior to collecting data. The use of qualitative data does not exclude a hypothesis-testing, confirmatory study, as qualitative data can be analysed quantitatively (as is done in this study), but using codes that link back to a hypothesis being tested, rather than emergent codes. (Read about 'Approaches to qualitative data analysis'.)

Much of Mamombe and colleagues' description of their work aligns with an exploratory, discovery approach to enquiry, yet the gist of the study is to compare student representations in relation to a model of correct/acceptable or alternative conceptions to test the relative effectiveness of two pedagogic treatments (i.e., an experiment). That is a 'nomothetic' approach that assumes standard categories of response.

Overall, the authors' account of how they collected and analysed their data seems to suggest a hybrid approach, with elements of both a confirmatory approach (suitable for an experiment) and elements of a discovery approach (more suitable for case study). It might seem this is a kind of mixed methods study with both confirmatory/nomothetic and discovery/idiographic aspects – responding to two different types of research question in the same study.

Yet there do not actually seem (**) to be two complementary strands to the research (one exploring the richness of students' ideas, the other comparing variables – i.e., type of teaching versus degree of learning), but rather an attempt to hybridise distinct approaches based on incongruent fundamental (paradigmatic) assumptions about research. (** Having explicit research questions stated in the paper could have clarified this issue for a reader.)

So, do we have a case study?

Mamombe and colleagues may have chosen to frame their study as a kind of case study because of the issues raised above in regard to considering it an experiment. However, it is hard to see how it qualifies as a case study (even if the editor and peer reviewers of the EURASIA Journal of Mathematics, Science and Technology Education presumably felt this description was appropriate).

Mamombe and colleagues do use multiple data sources, which is a common feature of case study. However, in other ways the study does not meet the usual criteria for case study. (Read more about 'Case study'.)

For one thing, case study is naturalistic. The method is used to study a complex phenomenon (e.g., a teacher teaching a class) that is embedded in a wider context (e.g., a particular school, timetable, cultural context, etc.) such that it cannot be excised for clinical examination (e.g., moving the lesson to a university campus for easy observation) without changing it. Here, there was an intervention, imposed from the outside, with external agents acting as the class teachers.

Even more fundamentally – what is the 'case'?

A case has to have a recognisable ('natural') boundary, albeit one that has some permeability in relation to its context. A classroom, class, year group, teacher, school, school district, etcetera, can be the subject of a case study. Two different classes in one school, combined with two other classes from another school, does not seem to make a bounded case.

In case study, the case has to be defined (not so in this study); and it should be clear it is a naturally occurring unit (not so here); and the case report should provide 'thick description' (not provided here) of the case in its context. Mamombe and colleagues' study is simply not a case study as usually understood: not a "qualitative pre-test/post-test case study design" or any other kind of case study.

That kind of mislabelling does not in itself invalidate research – but it may indicate some confusion in the basic paradigmatic underpinnings of a study. That seems to be the case [sic] here, as suggested above.

Suitability of the comparison condition: lecturing

A final issue of note about the methodology in this study is the nature of one of the two conditions used as a pedagogic treatment. In a true experiment, this condition (against which the enquiry condition was contrasted) would be referred to as the control condition. In a quasi-experiment (where randomisation of participants to conditions is not carried out) this would usually be referred to as the comparison condition.

At one point Mamombe and colleagues refer to this pedagogic treatment as 'direct instruction' (p.3), although this term has become ambiguous as it has been shown to mean quite different things to different authors. This is also referred to in the paper as the lecture condition.

Is the comparison condition ethical?

Parental consent was given for students contributing data for analysis in the study, but parents would likely trust the professional judgement of the researchers to ensure their children were taught appropriately. Readers are informed that "the learners whose parents had not given consent also participated in all the activities together with the rest of the class" (p.3) so it seems some children in the lecture treatment were subject to the inferior teaching approach despite this lack of consent, as they were studying "a prescribed topic in the syllabus of the learners" (p.3).

I have been very critical of a certain kind of 'rhetorical' research (Taber, 2019) report which

  • begins by extolling the virtues of some kind of active / learner-centred / progressive / constructivist pedagogy; explaining why it would be expected to provide effective teaching; and citing numerous studies that show its proven superiority across diverse teaching contexts;
  • then compares this with passive modes of learning, based on the teacher talking and giving students notes to copy, which is often characterised as 'traditional' but is said to be ineffective in supporting student learning;
  • then describes how authors set up an experiment to test the (superior) pedagogy in some specific context, using as a comparison condition the very passive learning approach they have already criticised as being ineffective as supporting learning.

My argument is that such research is unethical

  • It is not genuine science as the researchers are not testing a genuine hypothesis, but rather looking to demonstrate something they are already convinced of (which does not mean they could not be wrong, but in research we are trying to develop new knowledge).
  • It is not a proper test of the effectiveness of the progressive pedagogy as it is being compared against a teaching approach the authors have already established is sub-standard.

Most critically, young people are subjected to teaching that the researchers already believe they know will disadvantage them, just for the sake of their 'research', to generate data for reporting in a research journal. Sadly, such rhetorical studies are still often accepted for publication despite their methodological weaknesses and ethical flaws.

I am not suggesting that Mamombe, Mathabathe and Gaigher have carried out such a rhetorical study (i.e., one that poses a pseudo-question where from the outset only one outcome is considered feasible). They do not make strong criticisms of the lecturing approach, and even note that it produces some learning in their study:

"Similar to the inquiry group, the drawings of the learners were also clearer and easier to classify after teaching"

"although the inquiry method was more effective than the lecture method in eliciting improved particulate conception and reducing continuous conception, there was also improvement in the lecture group"

p.9, p.10

I have no experience of the South African education context, so I do not know what is typical pedagogy in primary schools there, nor the range of teaching approaches that grade 4 students there might normally experience (in the absence of external interventions such as reported in this study).

It is for the "two experienced teachers who are university lecturers and well experienced in teacher education" (p.3) to have judged whether a lecture approach based on teacher telling, children making notes and copying drawings, but with no student activities, can be considered an effective way of teaching 8-9 year old children a highly counter-intuitive, abstract, science topic. If they consider this good teaching practice (i.e., if it is the kind of approach they would recommend in their teacher education roles) then it is quite reasonable for them to have employed this comparison condition.

However, if these experienced teachers and teacher educators, and the researchers designing the study, considered that this was poor pedagogy, then there is a real question for them to address as to why they thought it was appropriate to implement it, rather than compare the enquiry condition with an alternative teaching approach that they would have expected to be effective.

Sources cited:

* Material reproduced from Mamombe, Mathabathe & Gaigher, 2020 is © 2020 licensee Modestum Ltd., UK. That article is an open access article distributed under the terms and conditions of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) [This post, excepting that material, is © 2020, Keith S. Taber.]

An introduction to research in education:

Taber, K. S. (2013). Classroom-based Research and Evidence-based Practice: An introduction (2nd ed.). London: Sage.