Hypothesis testing for the mean is a statistical procedure used to assess whether there is enough evidence to support a claim about the population mean. It involves comparing sample data against a null hypothesis \((H_0)\) and an alternative hypothesis \((H_1\) or \(H_a)\) to determine whether the difference between the sample mean and a hypothesized population mean is significant.
The mean and standard deviation for IQ scores in the general population are \(\mu = 100\) and \(\sigma = 15\). Suppose we believe that, in general, Year 12 mathematics students score higher on IQ tests than members of the general population. To investigate, we select a random sample of 100 Year 12 mathematics students and determine their mean IQ to be \(103.6\). This is \(3.6\) points higher than the mean IQ of people in general.
Is it reasonable to conclude that Year \(12\) mathematics students score higher on IQ tests than the general public? We already know that sample means will vary from sample to sample, and we would not expect the mean of an individual sample to have exactly the same value as the mean of the population from which it is drawn.
One explanation is that Year 12 mathematics students perform no better on IQ tests than members of the general public, and the difference between the mean score of the sample, \(\bar{x} = 103.6\), and that of the general population, \(\mu = 100\), is due to sampling variability.
Another explanation is that Year 12 mathematics students actually do better than average on IQ tests, and a sample mean of \(\bar{x} = 103.6\) is consistent with this explanation.
Hypothesis testing is concerned with deciding which of the two explanations is more likely, which we do on the basis of probability.
A hypothesis test can be likened to a trial in a court of law. We begin with a hypothesis that we wish to find evidence to support. In a court, as a prosecutor, your intention is to show that the person is guilty. However, the starting point in the trial is that the person is innocent. It is up to the prosecutor to provide enough evidence to show that this assumption is untenable.
The assumption of innocence in hypothesis-testing terms is called the null hypothesis, denoted by \(H_0\). If we can collect evidence to show that the null hypothesis is untenable, we can conclude that there is support for an alternative hypothesis, denoted by \(H_1\).
In this IQ example, our hypothesis is that Year 12 mathematics students perform better than the general population on IQ tests. To test this with a hypothesis test, we start by assuming the opposite: we assume that Year 12 mathematics students perform no better on IQ tests than members of the general public. In statistical terms, we are saying that the distribution of IQ scores for these students is the same as for the general public.
For the general public, we know that IQ is normally distributed with a mean of \( \mu = 100 \) and a standard deviation of \( \sigma = 15 \). The null hypothesis is that the students are drawn from a population in which the mean is \( \mu = 100 \). We express this null hypothesis symbolically as \( H_0 : \mu = 100 \).
The null hypothesis, \( H_0 \), says that the sample is drawn from a population which has the same mean as before (i.e. the population mean has not changed). Under the null hypothesis, any difference between the values of a sample statistic and the population parameter is explained by sample-to-sample variation.
In this case, we are hypothesizing that the mean IQ of Year 12 mathematics students is higher than that of the general population – that the sample comes from a population with mean \( \mu > 100 \). We express this alternative hypothesis symbolically as \( H_1 : \mu > 100 \).
The alternative hypothesis, \( H_1 \), says that the population mean has changed. That is, while there will always be some sampling variability, the observed difference is so large that it is more likely that the sample has been drawn from a population with a different mean.
Note: Hypotheses are always expressed in terms of population parameters.
How do we decide between the two hypotheses? Both in a court of law and in statistical hypothesis testing, evidence is collected and then weighed so that a decision can be made. In the courtroom, the jury functions as the decision maker, weighing the evidence to reach a verdict of guilty (the alternative hypothesis) or not guilty (the null hypothesis). In hypothesis testing, the evidence is contained in the sample data.
To help us make our decision, we generally summarize the data into a single statistic, called the test statistic. There are many test statistics that can be used. If we are testing a hypothesis about a population mean \(\mu\), then the obvious test statistic is the sample mean \(\overline{x}\).
If we find that the sample mean observed is very unlikely to have been obtained from a sample drawn from the hypothesized population, this will cause us to doubt the credibility of that hypothesized population mean. The statistical tool we use to determine the likelihood of this value of a test statistic is the distribution of sample means.
The p-value is the probability of observing a value of the sample statistic as extreme as or more extreme than the one observed, assuming that the null hypothesis is true.
Consider again the hypothesis that the mean IQ of Year 12 mathematics students is higher than that of the general population.
We have hypotheses:

\[ H_0 : \mu = 100 \]
\[ H_1 : \mu > 100 \]

and the mean of a sample of size 100 is \( \overline{x} = 103.6 \). Thus we can write:

p-value \( = \text{Pr}(\bar{X} \geq 103.6 \mid \mu = 100) \)
To get a picture as to how much we could reasonably expect the sample mean to vary from sample to sample, we can use simulation. The following dotplot shows the values of \( \overline{x} \) obtained from 100 samples (each of size 100) taken from a normal distribution with mean \( \mu = 100 \) and standard deviation \( \sigma = 15 \).
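A simulation of this kind can be sketched in Python using only the standard library (the seed value is arbitrary and chosen here just for reproducibility):

```python
import random
import statistics

random.seed(1)  # arbitrary seed, so the run is reproducible

# Draw 100 samples, each of size 100, from a normal distribution with
# mean 100 and standard deviation 15, recording the mean of each sample.
sample_means = [
    statistics.mean(random.gauss(100, 15) for _ in range(100))
    for _ in range(100)
]

# Under the null hypothesis, the sample means cluster around 100 with
# standard deviation sigma/sqrt(n) = 15/10 = 1.5.
print(min(sample_means), max(sample_means))
print(statistics.mean(sample_means))
```

Plotting these values as a dotplot (or histogram) gives the picture described above: the simulated means concentrate near 100, with almost all falling within a few multiples of 1.5 on either side.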
If \( X \) is a normally distributed random variable with mean \( \mu \) and standard deviation \( \sigma \), then the distribution of the sample mean \( \bar{X} \) will also be normal, with mean \( E(\bar{X}) = \mu \) and standard deviation \( \text{sd}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \), where \( n \) is the sample size.
Thus, if the null hypothesis is true, then \( \bar{X} \) is normally distributed with \( E(\bar{X}) = \mu = 100 \) and \( \text{sd}(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{100}} = 1.5 \)
Therefore
\( \text{p-value} = \text{Pr}(\bar{X} \geq 103.6 | \mu = 100) = \text{Pr}\left(Z \geq \frac{103.6 - 100}{1.5}\right) = \text{Pr}(Z \geq 2.4) = 0.0082 \)
Thus, the p-value tells us that, if the mean IQ of Year 12 mathematics students is 100, then the likelihood of observing a sample mean as high as or higher than 103.6 is extremely small, only 0.0082.
Consider again our IQ example. The more unlikely it is that the sample we observed could be drawn from a population with a mean IQ of 100, the more convinced we are that the sample must come from a population with a higher IQ.
In general, the smaller the p-value, the less plausible it is that the sample was drawn from a population with the mean specified by the null hypothesis, and thus the stronger the evidence against the null hypothesis.
How small does the p-value have to be to provide convincing evidence against the null hypothesis? The following table gives some conventions.
| p-value | Conclusion |
|---|---|
| p-value > 0.05 | insufficient evidence against \(H_0\) |
| p-value < 0.05 (5%) | good evidence against \(H_0\) |
| p-value < 0.01 (1%) | strong evidence against \(H_0\) |
| p-value < 0.001 (0.1%) | very strong evidence against \(H_0\) |
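These conventions can be encoded in a small helper function (the function name is ours, not a standard one):

```python
def evidence_against_h0(p):
    """Conventional wording for the strength of evidence given a p-value."""
    if p < 0.001:
        return "very strong evidence against H0"
    if p < 0.01:
        return "strong evidence against H0"
    if p < 0.05:
        return "good evidence against H0"
    return "insufficient evidence against H0"

print(evidence_against_h0(0.0082))  # strong evidence against H0
```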
For our IQ example, we interpret the p-value of 0.0082 as strong evidence against the null hypothesis and in support of our hypothesis that Year 12 mathematics students perform better than the general population on IQ tests.
The significance level of a test, \(\alpha\), specifies the condition for rejecting the null hypothesis: we reject \(H_0\) if the p-value is less than \(\alpha\).
The most commonly used value for the significance level is 0.05 (5%), although 0.01 (1%) and 0.001 (0.1%) are sometimes used.
This fixed significance level approach to hypothesis testing is commonly used.
The hypothesis test for a mean of a sample drawn from a normally distributed population with known standard deviation is called a z-test.
The central limit theorem tells us that, if the sample size is large enough, then the distribution of the sample mean of any random variable is approximately normal. Thus, a z-test can be used even when the distribution of the random variable is not known, provided the sample size is large enough. (For most distributions, a sample size of 30 is sufficient.)
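The central limit theorem can be illustrated by simulation. The sketch below (standard library only; the seed is arbitrary) draws samples of size 30 from a right-skewed exponential distribution with mean 1 and checks that the sample means behave approximately normally, with mean near 1 and standard deviation near \(1/\sqrt{30} \approx 0.18\):

```python
import random
import statistics

random.seed(2)  # arbitrary seed, so the run is reproducible

# Draw 2000 samples of size n = 30 from an exponential distribution
# with mean 1 (a clearly non-normal, right-skewed distribution).
n = 30
means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(2000)
]

# By the central limit theorem, the sample means are approximately
# normal: centred near 1, with sd close to 1/sqrt(30) ~ 0.18.
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 2))
```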
We have so far considered only situations where we had a good idea of the direction in which the mean might have changed. That is, we considered only that the mean IQ of Year 12 mathematics students might be higher than that of the general population, or that the fuel consumption of the new model car might be lower than that of the previous model. These are examples of directional hypotheses. When we translate these hypotheses into testable alternative hypotheses, we say that our sample has come from a population with mean more than 100 (for the IQ example) or less than 13.7 (for the fuel-consumption example).
The presence of a ‘less than’ sign (<) or a ‘greater than’ sign (>) in the alternative hypothesis indicates that we are dealing with a directional hypothesis. Only values of the sample mean more than 100 (for the IQ example) or less than 13.7 (for the fuel-consumption example) will lend support to the alternative hypothesis.
Now suppose that we do not know whether the fuel consumption of our new model car has increased or decreased. In this case, we would hypothesize that the fuel consumption is different for the new model (a non-directional hypothesis). We have to allow for the possibility of the sample mean being less than or greater than 13.7 litres per 100 km. We express this symbolically by using a ‘not equal to’ sign (≠) in the alternative hypothesis:
\(H_1 : \mu \neq 13.7\)
The presence of the ‘not equal to’ sign (≠) in the alternative hypothesis indicates that we are dealing with a non-directional hypothesis. A sample mean either greater than 13.7 or less than 13.7 could provide evidence to support this hypothesis.
The directionality of the alternative hypothesis H1 determines how the p-value is calculated. For the directional hypothesis
\( H_1 : \mu > 13.7 \)
only a sample mean considerably greater than 13.7 will lend support to this hypothesis. Thus, in calculating the p-value, we only consider values in the upper tail of the normal curve.
For the directional hypothesis
\( H_1 : \mu < 13.7 \)
only a sample mean considerably less than 13.7 will lend support to this hypothesis. Thus, in calculating the p-value, we only consider values in the lower tail of the normal curve.
Because the p-values for directional tests are given by an area in just one tail of the curve, these tests are commonly called one-tail tests.
For the non-directional hypothesis \( H_1 : \mu \neq 13.7 \), a sample mean that is either considerably less than 13.7 or considerably greater than 13.7 will lend support to this hypothesis. Thus, in calculating the p-value, we need to consider values in both tails of the normal curve.
Because the p-values for non-directional tests are given by an area in both tails of the curve, these tests are commonly called two-tail tests.
p-value (two-tail test) = 2 × p-value (one-tail test)
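This relationship can be checked directly in Python (the function names are ours, for illustration):

```python
from statistics import NormalDist

def one_tail_p(x_bar, mu0, sigma, n):
    """p-value for the directional hypothesis H1: mu > mu0 (upper tail)."""
    z = (x_bar - mu0) / (sigma / n ** 0.5)
    return 1 - NormalDist().cdf(z)

def two_tail_p(x_bar, mu0, sigma, n):
    """p-value for the non-directional hypothesis H1: mu != mu0 (both tails)."""
    z = abs(x_bar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(z))

# IQ example values: the two-tail p-value is double the one-tail value
p1 = one_tail_p(103.6, 100, 15, 100)
p2 = two_tail_p(103.6, 100, 15, 100)
print(round(p1, 4), round(p2, 4))  # 0.0082 0.0164
```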
We established in Section 16A that a 95% confidence interval for the population mean \( \mu \) is given by
\[ \left( \overline{x} - 1.9600 \frac{\sigma}{\sqrt{n}}, \overline{x} + 1.9600 \frac{\sigma}{\sqrt{n}} \right) \]
There is a close relationship between confidence intervals and two-tail hypothesis tests. To explain this, we will use the following basic fact about intervals of the real number line:
\[ a \in (b - c, b + c) \Leftrightarrow |a - b| < c \Leftrightarrow b \in (a - c, a + c) \]
Now suppose that we are testing the hypotheses
\[ H_0 : \mu = \mu_0 \]
\[ H_1 : \mu \neq \mu_0 \]
Then we have
\[ \mu_0 \in \left( \overline{x} - 1.9600 \frac{\sigma}{\sqrt{n}}, \overline{x} + 1.9600 \frac{\sigma}{\sqrt{n}} \right) \Leftrightarrow \overline{x} \in \left( \mu_0 - 1.9600 \frac{\sigma}{\sqrt{n}}, \mu_0 + 1.9600 \frac{\sigma}{\sqrt{n}} \right) \]
Hence, the 95% confidence interval does not contain \( \mu_0 \) if and only if we should reject the null hypothesis at the 5% level of significance.
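This equivalence can be verified numerically. The sketch below (function names are ours, assuming the IQ example's parameters) checks that the two-tail test decision at the 5% level agrees with whether the 95% confidence interval excludes \(\mu_0\):

```python
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.975)  # ~1.9600, the 95% critical value

def rejects_at_5pct(x_bar, mu0, sigma, n):
    """Two-tail z-test decision: reject H0 when the p-value < 0.05."""
    z = abs(x_bar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(z)) < 0.05

def ci_excludes(x_bar, mu0, sigma, n):
    """True when mu0 lies outside the 95% confidence interval for mu."""
    half = z95 * sigma / n ** 0.5
    return not (x_bar - half < mu0 < x_bar + half)

# The two decisions agree for any observed sample mean (IQ example setup)
for x_bar in [100.5, 102.0, 103.6, 97.0]:
    assert rejects_at_5pct(x_bar, 100, 15, 100) == ci_excludes(x_bar, 100, 15, 100)
print("confidence interval and two-tail test agree")
```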
The p-value for a two-tail test can be defined as:
\[ \text{p-value} = \text{Pr}(|\bar{X} - \mu_0| \geq |\overline{x} - \mu_0|) = \text{Pr}\left(|Z| \geq \frac{|\overline{x} - \mu_0|}{\sigma/\sqrt{n}}\right) \]
where \( \overline{x} \) is the observed value of the sample mean, \( \mu_0 \) is the population mean under the null hypothesis, \( \sigma \) is the population standard deviation, \( n \) is the sample size, and \( Z \) is the standard normal random variable.