Statistical testing in research

A topic in research methodology


Statistics is a branch of mathematics concerned with collecting, analysing and interpreting data, drawing on the theory of probability.

Use of statistical tests

Statistical tests are used in 'confirmatory research' – research designed to test hypotheses (rather than exploratory research) – to help draw conclusions from research studies. The term quantitative research is sometimes used to describe studies that use statistical tests.

Read about confirmatory research

Most commonly, statistical tests are used to check how likely it is that the results obtained in a study could have arisen by chance.

Consider a silly example:

A magician tells you that if you drink his magic potion then the next person you see when you leave the house will be an Australian. You test this claim by drinking the potion, leaving the house, and identifying the first person you see – an Australian!

Should you be impressed? Perhaps, if you live in Oslo or Toronto or Lima; but probably less so if you live in Sydney or Canberra! Perhaps, if you live in Perth, Scotland; but less so if you live in Perth, WA.

To decide whether there is good reason to suspect the potion worked, you would need some basis for knowing how likely the outcome was. After all, if I predict that the next living person you talk to will have a functioning heart, and this turns out to be the case, you have little reason to be impressed by my prescience!

Commonly, then, using statistics in research requires two operations:

  • a calculation using the statistical test (applying a mathematical formula – though these days a computer is usually used to do the calculation) to find the value of the statistic;
  • a comparison of that statistic with some critical value that has been chosen as the cut-off for being impressed by the unlikeliness of the result.

The critical value will depend upon the sample sizes – that is, the number of 'units of analysis' (see below) included in the analysis. A technical term used here is 'degrees of freedom' – which represents the number of independent pieces of information able to vary in a data sample.

For example, when using the chi-squared (χ²) test for association, the critical value that χ² has to reach for significance varies according to the number of degrees of freedom (df) of the data. For a 2×2 contingency table, df = (2−1)×(2−1) = 1, and χ² must exceed about 3.84 to be significant at the p = 0.05 level; with more categories (and so more df) the critical value is higher.
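
As an illustration (a minimal sketch in Python using the scipy library – not part of the original discussion), these critical values can be computed rather than looked up in a table:

```python
# Sketch: how the chi-squared critical value for significance at
# p = 0.05 grows with the degrees of freedom. Requires scipy.
from scipy.stats import chi2

for df in [1, 2, 3, 5, 10]:
    # ppf is the inverse cumulative distribution function: the value
    # chi-squared must exceed to be significant at the 0.05 level
    critical = chi2.ppf(0.95, df)
    print(f"df = {df:2d}: critical value = {critical:.2f}")
```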

Confidence level

The confidence level is the level of probability that you decide will act as the cut-off for an event being unlikely enough to count as significant. (In formal statistical terminology, this cut-off probability is usually called the significance level, α.)

Most commonly, a p (probability) value of 0.05 is used as the cut-off in the social sciences. That is a likelihood of 1 in 20, or 5%. There is no reason this particular value has to be used – it is just a convention – and sometimes other values are adopted. When using 0.05 as the confidence level, if the statistical test gives an outcome which would only be likely to occur by chance less than one time in twenty, then this is considered 'statistically significant'.
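
In computational terms, the convention amounts to a simple comparison. A minimal sketch (the scores below are invented purely for illustration):

```python
# Sketch of the decision rule: compare a test's p value with the
# conventional 0.05 cut-off. All scores below are invented.
from scipy.stats import ttest_ind

group_a = [52, 61, 58, 64, 55, 60, 57, 63]
group_b = [48, 50, 55, 47, 53, 49, 52, 51]

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("'Statistically significant' at the 0.05 level")
else:
    print("Not significant at the 0.05 level")
```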

The choice of confidence level reflects a balance between trying to reduce two types of error – false positives (where a result that actually arose by chance is taken to indicate a genuine effect) and false negatives (where genuine effects are discarded because the results are not considered unlikely enough).

There is no perfect choice of confidence level: any cut-off will sometimes lead to misleading judgements about statistical significance (figure from Taber, 2019)

There may be circumstances where it is decided that avoiding one type of error is more important than avoiding the other. For example, in medical screening it may be decided that more false positives (inviting some healthy people back for further investigation) can be tolerated if that limits the frequency of false negatives (notifying ill people that they are clear of signs of disease):
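
The false positive side of this balance is easy to demonstrate by simulation. In the sketch below (invented data; any seed and sample sizes would do), both groups are drawn from the same population, so every 'significant' result is a false positive – and about 5% of tests come out that way, just as the 0.05 convention predicts:

```python
# Sketch: when there is NO real effect, about 1 in 20 tests will still
# come out 'significant' at p < 0.05 - these are false positives.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups drawn from the SAME population: any 'difference'
    # between them is due to chance alone
    a = rng.normal(loc=50, scale=10, size=30)
    b = rng.normal(loc=50, scale=10, size=30)
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / n_experiments:.3f}")  # ~0.05
```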

Read: Shortlisting for disease. False positives on screening tests can be understood in relation to job applications

Chance events

Unlikely events do happen. The chances on a specific day of a particular person being struck by lightning; or being involved in a car crash; or winning a national lottery; or being robbed in the street, are all low (e.g., p ≪ 0.05). But, of course, all these things will sometimes happen. (You are VERY unlikely to win the lottery, but someone will.) So, statistical significance is never enough to prove an outcome is not due to chance: it just makes that unlikely.

And such calculations rely on good data.

There are about 30 million Australians, from among about 8 000 million people on the planet. If I only know that you live somewhere on planet Earth, then by chance the next person you meet is unlikely to be Australian (p = 30/8 000 ≈ 0.0038 < 0.05); but of course, if you are actually in Australia, this 'unlikely' event becomes probable. The point is that if the investigator did not know you were in Australia (and so was using data for the whole planet), then this might seem like a statistically significant outcome.
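
The same numbers also show how an event that is 'unlikely' on any one occasion becomes likely over many occasions. A quick sketch using the (rounded) figures above:

```python
# Sketch: an event unlikely on one trial (p ~ 0.0038 per encounter)
# becomes likely given enough independent trials. Figures as in text.
p = 30e6 / 8e9  # probability a randomly chosen earthling is Australian
print(f"p = {p:.4f}")  # 0.0038 - well under the 0.05 convention

for n in [1, 100, 500, 1000]:
    # probability of meeting at least one Australian in n random encounters
    at_least_one = 1 - (1 - p) ** n
    print(f"n = {n:4d}: P(at least one Australian) = {at_least_one:.3f}")
```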

Choosing the right statistic

When using statistical tests it is important to select the right test for the circumstances of a particular study.

  • Some tests (parametric tests) are only strictly valid when the data collected have an approximately 'normal' (Gaussian bell-curve) distribution.
  • Some tests only work with data that are measured on a continuous scale (like measuring height);
  • others can cope with data that are discontinuous (like test scores /20) or ordinal (such as students ranked according to height or test score);
  • yet others can be used even with categorical data where there is no inherent order (such as science students vs. humanities students).

Results are invalid if a test is used inappropriately (though a computer will often still churn out an answer, so the researcher has to make this judgement).
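
To make the point concrete, here is a minimal sketch of one such judgement – choosing between a parametric and a non-parametric test of difference. The scores are invented, and in practice test selection involves more considerations than a single normality check:

```python
# Sketch: choose a parametric test (t-test) if both samples look
# roughly normal, otherwise a non-parametric one (Mann-Whitney U).
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

group_a = [12, 15, 14, 10, 13, 16, 11, 14, 15, 12]
group_b = [9, 11, 10, 8, 12, 9, 10, 7, 11, 10]

# Shapiro-Wilk tests the null hypothesis that a sample comes from
# a normal distribution (a large p gives no reason to doubt it)
normal_a = shapiro(group_a).pvalue > 0.05
normal_b = shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    result = ttest_ind(group_a, group_b)
    print(f"t-test: p = {result.pvalue:.4f}")
else:
    result = mannwhitneyu(group_a, group_b)
    print(f"Mann-Whitney U: p = {result.pvalue:.4f}")
```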

It is also important that the conditions of a study meet the logic behind the tests. For example:

Tests of equivalence

In so-called 'natural experiments' researchers have to work with existing units of analysis that reflect different conditions. Often a 'test of equivalence' will be carried out (e.g. by using pre-tests) to show that different groups are equivalent on the relevant measures before they experience the different conditions being compared.

However, often statistical tests of difference are used for this – that is, researchers look for differences between the groups so unlikely that p < 0.05, and treat the absence of such a difference as showing equivalence. This is a misapplication of the tests, as the researchers need to show that the different groups are very similar – not merely that they cannot be shown to be different!
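
There are tests designed for this purpose, such as the 'two one-sided tests' (TOST) procedure. A minimal sketch using the statsmodels library (the pre-test scores and the equivalence margin of ±3 marks are invented for illustration; in a real study the margin would need to be justified in advance):

```python
# Sketch: TOST equivalence test - a SMALL p value here is evidence
# FOR equivalence (the group difference lies within +/- 3 marks).
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

pretest_group_1 = np.array([54, 58, 61, 55, 59, 57, 60, 56, 58, 62])
pretest_group_2 = np.array([55, 57, 60, 56, 58, 59, 61, 54, 57, 60])

p_value, lower_test, upper_test = ttost_ind(
    pretest_group_1, pretest_group_2, low=-3.0, upp=3.0)

print(f"TOST p = {p_value:.4f}")
if p_value < 0.05:
    print("Groups may be treated as equivalent (within the chosen margin)")
else:
    print("Equivalence has not been demonstrated")
```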

Read about testing for equivalence

Unitary constructs

Cronbach's alpha (𝛂) is a statistic used to check for the degree of internal consistency in scales that measure attitudes and similar constructs from a particular sample at a specific time. (It is often said to measure a form of 'reliability' but this is not the case.)

Read Why write about Cronbach's alpha?

It is useful when there is an instrument with several items intended to elicit the same construct (e.g., self-efficacy in physics). A unitary construct is considered to be a discrete, single feature of someone's responses to the world. Something like 'attitude to school' is unlikely to operate as a unitary construct, as a pupil's attitude to school will likely have several distinct aspects (they may enjoy being with their peers, but dislike studying; they may look forward to chemistry, but dread French, etc.).

Yet, often Cronbach's alpha is calculated for scales that would seem to include items eliciting quite different constructs. So, sometimes the statistic is used to look at test results when the test includes items that are meant to test different things (such as a test that covered heat transfer, the solar system, acids and bases, and food-webs!)

Moreover, often researchers calculate a value of alpha for an instrument which consists of several scales: that is, the researchers have decided there are several different aspects to what they are measuring so have constructed different scales to capture each. Yet they then calculate an 'overall' value of 𝛂 for the multi-scale instrument!
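
For reference, alpha has a simple formula: for k items, α = k/(k−1) × (1 − Σ item variances / variance of total scores), computed for a particular sample on a particular occasion. A minimal sketch (the response matrix is invented):

```python
# Sketch: Cronbach's alpha computed from its standard formula,
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # items: 2-D array, one row per respondent, one column per item
    k = items.shape[1]                         # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five respondents answering a four-item Likert-type scale (invented)
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```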

Randomisation

In experimental studies, the application of statistical tests can tell us how unlikely an outcome would be by chance – but this logic requires the 'units of analysis' (e.g., classes) to be assigned to the experimental or control condition randomly.

If I only know that you live somewhere on planet Earth, then by chance the next RANDOM person you meet is unlikely to be Australian; but if you choose to go to Australia to undertake the exercise, we can no longer consider the process random!

If randomisation is not undertaken (or not undertaken rigorously), then an unlikely result may arise that has nothing to do with the experimental condition, but which reflects a systematic difference due to who was in each condition. (Suppose a researcher is given a list of 40 students, and assigns the first 20 to one condition and the other 20 to the other condition. That would be fine if the names had already been randomised – but what if, unbeknown to the researcher, the list provided was in order of student scores in the most recent examination!)
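
Randomising such a list properly is trivial to do. A minimal sketch (placeholder names; the seed is fixed only so the example is reproducible):

```python
# Sketch: randomly assigning 40 students to two conditions, so that
# any systematic ordering of the supplied list (alphabetical, by score,
# by enrolment date...) cannot bias the groups.
import random

students = [f"student_{i:02d}" for i in range(1, 41)]  # the list as supplied

rng = random.Random(42)   # fixed seed only to make the sketch reproducible
shuffled = students[:]    # copy, leaving the original list untouched
rng.shuffle(shuffled)

experimental = shuffled[:20]
control = shuffled[20:]
print("Experimental:", experimental[:5], "...")
print("Control:     ", control[:5], "...")
```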

That is an extreme example, but many lists of students held by institutions are arranged in some systematic way – e.g., alphabetically, or in order of enrolment. The order may seem to have nothing to do with what an experiment is testing, but we can never be sure so randomisation is needed.

(Perhaps the order in which students are enrolled in a primary school is correlated with age, so more mature students appear near the top of the list? Perhaps the enrolment list of a university course includes students who were admitted through 'clearing' at the bottom.)

There are many published educational studies where the statistics used would only be informative had there been initial randomisation, but where:

  • there is no statement that randomisation has been undertaken
  • it is claimed randomisation has been undertaken, but this is contradicted by other information given (e.g. it was ensured that both groups had the same gender balance!)
  • the assignment to conditions is described, and is clearly not random

Even where randomisation is reported, there is often no detail of how this was done (so we have to assume that the researchers have used a valid method).

Read about randomisation

Units of analysis

When undertaking statistical tests one needs to know the 'grain size' of what is being compared – the unit of analysis. In educational studies this could be different students, different teachers, different classes, different year groups, different schools and so forth.

In experimental work in education, researchers often seek to compare how learners respond to different teaching approaches or resources. For a true experiment, the 'units of analysis' need to be randomly assigned to the conditions. Yet researchers are not usually allowed to break up existing classes to redistribute students, and so have to work with existing intact classes. So consider a school with four classes of 25 pupils each that are participating in research to compare (say) learning by laboratory practical work with learning using virtual reality simulations.

If the researchers are able to randomly assign each pupil to a condition then they will have 50 students in each condition: n = 50, 50.

If, however, the researchers have to work with the existing classes, then they can only assign the four classes: two to each condition: n = 2, 2.

In the former case the student outcomes from each pupil can be considered separately when doing statistical testing. (Actually, there is an argument that as students in the same class can influence each other strongly, even here they are not strictly independent units in the research.) In the latter case, the outcomes need to be examined at the class level – and the aggregate class results used in the statistical calculations.

In practice, researchers often work with intact classes, but then (invalidly) treat the data as though every student has individually been randomly assigned to a condition. That is, the data is treated as if it has many more degrees of freedom than it actually has.
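
The difference the choice of unit makes can be seen in a minimal sketch (all scores invented; the class-level comparison with n = 2, 2 has very little statistical power, which is precisely the point):

```python
# Sketch: with intact classes, aggregate to class means before testing.
# Pooling pupils as if individually randomised inflates the apparent
# degrees of freedom.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=7)
# Post-test scores for four intact classes of 25 pupils each (invented)
lab_1 = rng.normal(62, 8, 25)
lab_2 = rng.normal(60, 8, 25)
vr_1 = rng.normal(66, 8, 25)
vr_2 = rng.normal(64, 8, 25)

# Invalid here (but common): pooling pupils across intact classes
pooled = ttest_ind(np.concatenate([lab_1, lab_2]),
                   np.concatenate([vr_1, vr_2]))
print(f"Pupil-level (invalid here): p = {pooled.pvalue:.4f}")

# Valid unit of analysis: the class mean (n = 2 per condition)
by_class = ttest_ind([lab_1.mean(), lab_2.mean()],
                     [vr_1.mean(), vr_2.mean()])
print(f"Class-level: p = {by_class.pvalue:.4f}")
```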

Read about units of analysis


Work cited

My introduction to educational research:

Taber, K. S. (2013). Classroom-based Research and Evidence-based Practice: An introduction (2nd ed.). London: Sage.