Estimation and sampling are fundamental concepts in statistics, crucial for making inferences about a population based on a subset of data. These concepts are widely used in various fields, including economics, medicine, social sciences, and quality control, where it's often impractical or impossible to collect data from an entire population.
Sampling is the process of selecting a subset of individuals, items, or observations from a larger population. The goal is to gather data from this subset, called a sample, to make inferences about the population as a whole. Sampling is essential when it is too costly, time-consuming, or impossible to collect data from every member of the population.
Estimation refers to the process of inferring the value of a population parameter based on sample data. Since it is often impractical to measure the entire population, estimation provides a means to approximate population parameters, such as the mean, variance, or proportion.
The set of all eligible members of a group which we intend to study is called a population. For example, if we are interested in the IQ scores of the Year 12 students at ABC Secondary College, then this group of students could be considered a population; we could collect and analyse all the IQ scores for these students. However, if we are interested in the IQ scores of all Year 12 students across Australia, then that larger group becomes the population.
Often, dealing with an entire population is not practical.
Nevertheless, we often wish to make statements about a property of a population when data about the entire population is unavailable. The solution is to select a subset of the population – called a sample – in the hope that what we find out about the sample is also true about the population it comes from. Dealing with a sample is generally quicker and cheaper than dealing with the whole population, and a well-chosen sample will give much useful information about this population. How to select the sample then becomes a very important issue.
Suppose we are interested in investigating the effect of sustained computer use on the eyesight of a group of university students. To do this we go into a lecture theatre containing the students and select all the students sitting in the front two rows as our sample. This sample may be quite inappropriate, as students who already have problems with their eyesight are more likely to be sitting at the front, and so the sample may not be typical of the population. To make valid conclusions about the population from the sample, we would like the sample to have a similar nature to the population.
While there are many sophisticated methods of selecting samples, the general principle of sample selection is that the method of choosing the sample should not favour or disfavour any subgroup of the population. Since it is not always obvious if the method of selection will favour a subgroup or not, we try to choose the sample so that every member of the population has an equal chance of being in the sample. In this way, all subgroups have a chance of being represented. The way we do this is to choose the sample at random.
A sample of size n is called a simple random sample if it is selected from the population in such a way that every subset of size n has an equal chance of being chosen as the sample. In particular, every member of the population must have an equal chance of being included in the sample.
To choose a sample from the group of university students, we could put the name of every student in a hat and then draw out, one at a time, the names of the students who will be in the sample.
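The draw-from-a-hat procedure can be mimicked in software. The following is a minimal sketch in Python, assuming a hypothetical list of 20 placeholder student identifiers (they are not data from the text). It relies on `random.sample`, which draws without replacement so that every subset of the requested size is equally likely.

```python
import random

# Hypothetical population: 20 placeholder student identifiers.
population = [f"student_{i:02d}" for i in range(1, 21)]

def simple_random_sample(members, n, seed=None):
    """Draw n distinct members so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(members, n)

sample = simple_random_sample(population, 6, seed=1)
print(sample)
```

Fixing the seed is only a convenience for reproducing a particular draw; omitting it gives a fresh random sample on each run.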
Choosing the sample in an appropriate manner is critical in order to obtain usable results.
In order to make valid conclusions about a population from a sample, we would like the sample chosen to be representative of the population as a whole. This means that all the different subgroups present in the population appear in the sample in proportions similar to those in the population.
Suppose that our population of interest is the class of students from Example 2, and suppose further that we are particularly interested in the proportion of female students in the class. This is called the population proportion and is generally denoted by p. The population proportion p is constant for a particular population.
Population proportion (p) = (number in population with attribute) / (population size)
In this class, there are 10 females, so the proportion of female students in the class is:
p = 10 / 20 = 1 / 2
Now consider the proportion of female students in the sample chosen:
Sue, Georgia, Miller, Matt, Tom, David
The proportion of females in the sample may be calculated by dividing the number of females in the sample by the sample size. In this case, there are two females in the sample, so the proportion of female students in the sample is:
Sample proportion (p̂) = 2 / 6 = 1 / 3
This value is called the sample proportion and is denoted by p̂ (we say ‘p hat’).
Sample proportion (p̂) = (number in sample with attribute) / (sample size)
Note that different symbols are used for the sample proportion and the population proportion, so that we don’t confuse them. In this particular case, p̂ = 1 / 3, which is not the same as the population proportion p = 1 / 2.
This does not mean there is a problem. In fact, each time a sample is selected, the number of females in the sample will vary. Sometimes the sample proportion p̂ will be 1 / 2, and sometimes it will not.
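To see this variation concretely, the sketch below draws several simple random samples of size 6 from a class of 20 students, 10 of whom are female, and computes the sample proportion each time. Encoding the students as `"F"` and `"M"`, the helper name `sample_proportion`, and the choice of five repetitions are assumptions made for illustration.

```python
import random
from fractions import Fraction

# Class from the text: 20 students, 10 of them female.
# Encoding each student as "F" or "M" is an assumption made for illustration.
population = ["F"] * 10 + ["M"] * 10

def sample_proportion(sample, attribute="F"):
    """Number in the sample with the attribute divided by the sample size."""
    return Fraction(sum(1 for s in sample if s == attribute), len(sample))

rng = random.Random(0)  # fixed seed so the run is reproducible
proportions = [sample_proportion(rng.sample(population, 6)) for _ in range(5)]

p = sample_proportion(population)  # population proportion, 1/2
print("p =", p)
print("p-hat over five samples:", [str(x) for x in proportions])
```

The population proportion stays fixed at 1/2, while the printed sample proportions generally differ from one draw to the next.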
We have seen that the sample proportion varies from sample to sample. We can use our knowledge of probability to further develop our understanding of the sample proportion.
Suppose we have a bag containing six blue balls and four red balls, and from the bag, we take a sample of size 4. We are interested in the proportion of blue balls in the sample. We know that the population proportion is equal to:
Population proportion = 6⁄10 = 3⁄5
That is, p = 0.6
The probabilities associated with the possible values of the sample proportion p̂ can be calculated either by direct consideration of the sample outcomes or by using our knowledge of selections. Recall that:
C(n, x) = n!⁄(x!(n - x)!)
This is the number of different ways to select x objects from n objects.
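As a worked instance of this counting rule, let \( X \) denote the number of blue balls in the sample (a symbol introduced here for illustration). A sample with exactly 2 blue balls is formed by choosing 2 of the 6 blue balls and 2 of the 4 red balls, out of all the ways of choosing 4 of the 10 balls:

\[ \text{Pr}(X = 2) = \frac{\binom{6}{2}\binom{4}{2}}{\binom{10}{4}} = \frac{15 \times 6}{210} = \frac{90}{210} \]

The other probabilities follow in the same way by changing the number of blue balls chosen.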
The following table gives the probability of obtaining each possible sample proportion \(\hat{p}\) when selecting a random sample of size 4 from the bag.
Number of blue balls in the sample (x) | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Proportion of blue balls in the sample, \(\hat{p}\) | 0 | \(\frac{1}{4}\) | \(\frac{1}{2}\) | \(\frac{3}{4}\) | 1 |
Probability | \(\frac{1}{210}\) | \(\frac{24}{210}\) | \(\frac{90}{210}\) | \(\frac{80}{210}\) | \(\frac{15}{210}\) |
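We can see from the table that we can consider the sample proportion as a random variable, \(\hat{P}\), and we can write, for example, \(\text{Pr}(\hat{P} = 0) = \frac{1}{210}\) and \(\text{Pr}(\hat{P} = \frac{1}{2}) = \frac{90}{210}\).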
The possible values of p̂ and their associated probabilities together form a probability distribution for the random variable P̂, which can be summarised as follows:
p̂ | 0 | 1/4 | 1/2 | 3/4 | 1 |
---|---|---|---|---|---|
Pr(P̂ = p̂) | 1/210 | 24/210 | 90/210 | 80/210 | 15/210 |
The distribution of a statistic which is calculated from a sample (such as the sample proportion) has a special name – it is called a sampling distribution.
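The sampling distribution for the bag of six blue and four red balls can be reproduced by counting selections. The following is a minimal sketch using exact fractions; the function name `sampling_distribution` and its parameters are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from math import comb

def sampling_distribution(blue, red, n):
    """Map each possible sample proportion of blue balls to its exact probability."""
    total = comb(blue + red, n)  # number of equally likely samples of size n
    dist = {}
    for x in range(max(0, n - red), min(n, blue) + 1):
        ways = comb(blue, x) * comb(red, n - x)  # samples with exactly x blue balls
        dist[Fraction(x, n)] = Fraction(ways, total)
    return dist

dist = sampling_distribution(blue=6, red=4, n=4)
for p_hat, prob in dist.items():
    print(f"p-hat = {p_hat}: probability = {prob}")
```

`Fraction` reduces each probability to lowest terms, so a value such as 90/210 is printed in its simplified form; the probabilities still sum to exactly 1.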
Generally, when we select a sample it is from a population which is too large or too difficult to enumerate or even count – populations such as all the people in Australia, or all the cows in Texas, or all the people who will ever have asthma. When the population is so large, we assume that the probability of observing the attribute we are interested in remains constant with each selection, irrespective of prior selections for the sample.
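A short numerical check can suggest why this assumption is reasonable. The sketch below compares the exact probability of 2 successes in a sample of 4 drawn without replacement against the probability obtained when each selection is treated as having the same chance of success. The population sizes (10, 1,000 and 100,000) and the proportion 0.6 are hypothetical values chosen for illustration.

```python
from math import comb

def without_replacement(pop_size, p, n, x):
    """Exact probability of x successes in n draws without replacement."""
    successes = round(pop_size * p)  # number in the population with the attribute
    return comb(successes, x) * comb(pop_size - successes, n - x) / comb(pop_size, n)

def constant_probability(p, n, x):
    """Probability of x successes when every draw succeeds with probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical population sizes, all with proportion p = 0.6 and sample size 4.
for pop_size in (10, 1_000, 100_000):
    exact = without_replacement(pop_size, 0.6, 4, 2)
    approx = constant_probability(0.6, 4, 2)
    print(pop_size, round(exact, 4), round(approx, 4))
```

As the population grows, removing a few members for the sample barely changes the proportion that remains, so the two calculations move closer together.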
Suppose we know that 70% of all 17-year-olds in Australia attend school. That is, \( p = 0.7 \). We will assume that this probability remains constant for all selections for the sample.
Now consider selecting a random sample of size 4 from the population of all 17-year-olds in Australia. This time we can use our knowledge of binomial distributions to calculate the associated probability for each possible value of the sample proportion \( \hat{p} \), using the probability function:
\[ \text{Pr}(X = x) = \binom{4}{x} \cdot 0.7^x \cdot 0.3^{4-x} \quad \text{where} \quad x = 0, 1, 2, 3, 4 \]
The following table gives the probability of obtaining each possible sample proportion \( \hat{p} \) when selecting a random sample of four 17-year-olds.
Number at school in the sample (x) | Proportion at school in the sample, \( \hat{p} \) | Probability |
---|---|---|
0 | 0 | 0.0081 |
1 | 0.25 | 0.0756 |
2 | 0.5 | 0.2646 |
3 | 0.75 | 0.4116 |
4 | 1 | 0.2401 |
Once again, we can summarise the sampling distribution of the sample proportion as follows:
\( \hat{p} \) | 0 | 0.25 | 0.5 | 0.75 | 1 |
---|---|---|---|---|---|
Pr(\( \hat{P} = \hat{p} \)) | 0.0081 | 0.0756 | 0.2646 | 0.4116 | 0.2401 |
The population that the sample of size \( n = 4 \) is being taken from is such that each item selected has a probability \( p = 0.7 \) of success. Thus we can define the random variable \( \hat{P} = \frac{X}{4} \) where \( X \) is a binomial random variable with parameters \( n = 4 \) and \( p = 0.7 \). To emphasise this we can write:
x | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( \hat{p} = \frac{x}{4} \) | 0 | 0.25 | 0.5 | 0.75 | 1 |
Pr(\( \hat{P} = \hat{p} \)) = Pr(X = x) | 0.0081 | 0.0756 | 0.2646 | 0.4116 | 0.2401 |
Note: The probabilities for the sample proportions, \( \hat{p} \), correspond to the probabilities for the numbers of successes, \( x \).
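The binomial calculation above can be checked with a few lines of Python. This is a minimal sketch, assuming \( n = 4 \) and \( p = 0.7 \) as in the example; the function name `binomial_sampling_distribution` is an illustrative choice.

```python
from math import comb

def binomial_sampling_distribution(n, p):
    """Return (x, p_hat, probability) rows for X ~ Binomial(n, p), with p_hat = x / n."""
    rows = []
    for x in range(n + 1):
        prob = comb(n, x) * p**x * (1 - p) ** (n - x)
        rows.append((x, x / n, prob))
    return rows

rows = binomial_sampling_distribution(n=4, p=0.7)
for x, p_hat, prob in rows:
    print(f"x = {x}, p-hat = {p_hat:.2f}, Pr = {prob:.4f}")
```

Each row pairs a number of successes with the corresponding sample proportion, which makes explicit that both share the same probability.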