Parameters versus Statistics:
A parameter is a number that describes the population.
A statistic is a number that describes the sample.
Examples of parameters we’ve talked about: median of the population, mean of the population $\mu$, variance of the population $\sigma^2$, standard deviation of the population $\sigma$.
Examples of statistics: median of a sample, mean of a sample $\bar{x}$, variance of a sample $s^2$, standard deviation of a sample $s$.
Since the population is what we really care about, we would like to know the parameters. But in nearly every case we have to settle for data from a sample, and hence we calculate statistics. Hopefully these statistics will give us some idea about the parameters.
Formula for : = (n = sample size, recall we used N for population size).
If the sample is a random sample from the population, then $\bar{x}$ is an unbiased estimator of $\mu$, $s^2$ is an unbiased estimator of $\sigma^2$, and $s$ is an unbiased estimator of $\sigma$ (actually this last claim is not quite true, but for large n the bias is very small; it surprises many people that the sample variance is an unbiased estimator of the population variance but that the same is not true when you take square roots and deal with standard deviations). Unbiased means that there is no tendency to overestimate or underestimate.
Example: Sulfur compounds such as dimethyl sulfide (DMS) are sometimes present in wine. DMS has a bad odor, so winemakers would like to know the odor threshold (the lowest concentration the human nose can detect). Different people have different thresholds. We are curious what the mean threshold $\mu$ is for all adults. To estimate $\mu$ we start by presenting people with wine with different levels of DMS to find their thresholds. We can then calculate the sample mean $\bar{x}$ as an estimate for $\mu$.
Here are the odor thresholds of 10 randomly chosen people measured in micrograms per liter.
28 40 28 33 20 31 29 27 17 21
What is your best guess for $\mu$?
Do you think this guess is exact?
If we took another 10 subjects would our guess change?
Would this guess be exact?
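As a quick check on the arithmetic, here is a minimal Python sketch that computes the sample mean, sample variance, and sample standard deviation for the 10 thresholds listed above. The variable names are made up for illustration; the only inputs are the data from the notes.

```python
import statistics

# Odor thresholds (micrograms per liter) of the 10 sampled people, from the notes.
thresholds = [28, 40, 28, 33, 20, 31, 29, 27, 17, 21]

n = len(thresholds)
x_bar = sum(thresholds) / n                   # sample mean
s_squared = statistics.variance(thresholds)   # sample variance (divides by n - 1)
s = statistics.stdev(thresholds)              # sample standard deviation

print(f"n = {n}, sample mean = {x_bar:.1f}")
print(f"sample variance = {s_squared:.1f}, sample standard deviation = {s:.2f}")
```

The sample mean is only an estimate: a different set of 10 people would almost certainly give a different number.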
If our guesses are not exact and vary from sample to sample, how can we trust any guess? It turns out that if we keep taking larger and larger samples, it becomes more and more likely that the guess is close to the correct answer. Note that here the guess is the sample mean and the correct answer is the population mean.
The following graph illustrates the DMS example with a sample of 10,000. The x-axis has the number in the sample and the y-axis has the sample mean. It is reported that the population mean threshold is 25. Do we know for sure that the population mean is 25?
Why do you think the population mean is reported to be 25?
Law of large numbers: Draw observations at random from any population with mean $\mu$. As the number of observations goes up, the sample mean $\bar{x}$ of the observed values is more likely to be close to the population mean $\mu$.
We saw a similar example of this with coin tossing earlier.
There are two reasons that the graph above is close to the dotted line (population mean): luck, and the fact that the sample size is big. At the beginning, say around n=15, it's more due to luck; towards the end, say around n=5000, it's more due to the large n.
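The behavior in the graph can be imitated with a short simulation. The sketch below is only an illustration under stated assumptions: it supposes thresholds are normally distributed with mean 25 and standard deviation 7 (the standard deviation and the normal shape are assumptions made here for the sketch), and it tracks the sample mean as subjects are added one at a time.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN = 25          # reported population mean threshold
POP_SD = 7             # assumed population standard deviation
NUM_SUBJECTS = 10_000  # total number of simulated subjects

running_means = []
total = 0.0
for i in range(1, NUM_SUBJECTS + 1):
    total += random.gauss(POP_MEAN, POP_SD)  # one simulated threshold
    running_means.append(total / i)          # sample mean of the first i subjects

# Sample mean after selected numbers of subjects.
for size in (15, 100, 1000, 5000, 10_000):
    print(f"n = {size:>6}: sample mean = {running_means[size - 1]:.2f}")
```

Changing the seed changes the early values a lot and the late values very little, which is the luck-versus-large-n distinction described above.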
We can never find the exact probability of tossing a particular coin and getting heads, or find the exact mean odor threshold of all adults. However, we can make our estimate so reliable that people will report that they actually do know such things.
Example: Some businesses, such as insurance companies and gambling casinos, rely on the law of large numbers. With just a few customers it is hard to say what profit they will make, but with many customers the law of large numbers kicks in and they can be very sure of making a lot of money.
We are curious what would happen if we repeated the process of finding the sample mean over and over again. We would like to know what would happen on average and also how variable the results would be. By looking at the variability we can see how much we can trust the one estimate we got from our one sample. It is not practical to repeat this sampling process over and over. Suppose, for example, you took 10 repeated samples of size 10 to see how the sample means varied. If you combined these samples you would have a sample of size 100, which by the law of large numbers would be a better sample than a sample of size 10. Instead, we can do a computer simulation. The results we see in the simulations can be proven with mathematics.
Let’s return to the DMS example. Let’s assume that odor thresholds of adults follow a normal distribution with mean 25 and standard deviation 7. Let’s have a computer repeatedly (1000 times) find a random sample of size 10 and find the sample mean. To see how the answers are distributed we will make a graph called a histogram. We will make numeric categories on the x-axis going from 20.5 to 21.5 and 21.5 to 22.5 and so forth and the y-axis will tell us how many sample means were in each category.
The graph is given below.
What shape does this graph appear to have?
What about the mean?
What about the standard deviation?
Anything really stand out as unusual?
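The simulation described above can be sketched in Python as follows. It uses the same assumptions as the notes (normal population, mean 25, standard deviation 7, 1000 samples of size 10); the seed, the variable names, and the text-based histogram are choices made for this sketch, so the exact counts will differ from the graph in the notes.

```python
import math
import random
import statistics
from collections import Counter

random.seed(2)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 25, 7   # population assumed in the notes
SAMPLE_SIZE = 10
REPETITIONS = 1000

sample_means = []
for _ in range(REPETITIONS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

# Categories run from k - 0.5 to k + 0.5 (for example 20.5 to 21.5), labelled by k.
counts = Counter(math.floor(m + 0.5) for m in sample_means)

# Text histogram: one star per 5 sample means in the category.
for k in sorted(counts):
    print(f"{k - 0.5:5.1f} to {k + 0.5:5.1f} | {'*' * (counts[k] // 5)}")

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print("mean of the sample means:", round(mean_of_means, 2))
print("standard deviation of the sample means:", round(sd_of_means, 2))
```

The printed mean and standard deviation can be compared with the population values 25 and 7 when answering the questions above.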
Sampling Distribution of a statistic: the distribution of values of the statistic in all possible samples of the same size from the same population.
In any distribution there are 4 things we are interested in, namely, shape, center, spread, and any data that seem to not fit.
The key sampling distribution for us is the sampling distribution of the sample mean $\bar{x}$. This is the distribution of all sample means in all possible samples of the same size from the same population.
There is a website where you can actually have the computer do things like this it is
CENTRAL LIMIT THEOREM (CLT) and related ideas.
1. $\mu_{\bar{x}} = \mu$
2. $\sigma_{\bar{x}} = \sigma / \sqrt{n}$ (assuming the sampling is done with replacement)
3. If the original data, X, is normal, then so are the sample means.
4. Even if the original data isn’t normal, as the sample size n increases, the sample means get closer to normal; usually n of around 30 to 50 is good enough.
Rule 3 is hopefully not too surprising: if you have data that is bell shaped and find all sample means, those means will be bell shaped too. (We saw this with the DMS example.)
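Rule 4 can be explored with a small simulation. The sketch below is an illustration only: it assumes a strongly right-skewed population (an exponential distribution, chosen here purely as an example of non-normal data) and measures how lopsided the sample means are using skewness, which is near 0 for a bell-shaped distribution. The helper names are made up for this sketch.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

def skewness(values):
    """Skewness of a list of numbers: near 0 for a symmetric, bell-shaped set."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

def simulate_sample_means(n, repetitions=5000):
    """Means of `repetitions` samples of size n from an exponential population."""
    means = []
    for _ in range(repetitions):
        sample = [random.expovariate(1.0) for _ in range(n)]
        means.append(sum(sample) / n)
    return means

# Skewness of the sample means for a few sample sizes.
skew_by_n = {n: skewness(simulate_sample_means(n)) for n in (1, 5, 30)}
for n, sk in skew_by_n.items():
    print(f"n = {n:>2}: skewness of the sample means = {sk:.2f}")
```

With n = 1 the "sample means" are just the original skewed data; as n grows the skewness shrinks toward 0, which is what rule 4 leads us to expect.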
Discussion of Rules 1 and 2: Suppose a population of students takes an exam and the mean is 80 and the standard deviation is 10. Now imagine finding all possible classes (samples) of size 30 students and finding all these class averages. What do you expect the mean of all these class averages to be? Which will vary more? a) the original student scores, b) the class averages, or c) they will vary the same. Here are two collections of numbers; one is 5 students, and the other is 5 class averages. Which do you think are the 5 class averages? Collection 1: 98, 81, 66, 75, 83 Collection 2: 82, 82, 79, 81, 80. You picked the ________(1st or 2nd) set of numbers for the averages because you would expect them to vary ________ (less/more). So hopefully rule 1 isn’t too surprising now. Hopefully you would also agree that it makes sense that $\sigma_{\bar{x}} < \sigma$. The fact that rule 2 is exactly what it is isn’t obvious. But note that the standard deviation of the averages is less than that of the individuals, and for bigger sample sizes (n) the standard deviation gets smaller, which gels with the law of large numbers.
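The exam example can be checked with a simulation. This is a minimal sketch, taking rule 2 to be the standard result that the standard deviation of the sample means equals the population standard deviation divided by the square root of n. It also assumes the exam scores are normally distributed, which the notes do not state; the number of simulated classes is an arbitrary choice.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the run is repeatable

MU, SIGMA = 80, 10     # exam mean and standard deviation from the notes
CLASS_SIZE = 30
NUM_CLASSES = 2000     # number of simulated classes (an arbitrary choice)

class_averages = []
for _ in range(NUM_CLASSES):
    # Scores assumed normal for this sketch.
    scores = [random.gauss(MU, SIGMA) for _ in range(CLASS_SIZE)]
    class_averages.append(sum(scores) / CLASS_SIZE)

predicted_sd = SIGMA / math.sqrt(CLASS_SIZE)    # rule 2
observed_mean = statistics.mean(class_averages)
observed_sd = statistics.stdev(class_averages)

print(f"mean of class averages: {observed_mean:.2f} (rule 1 predicts {MU})")
print(f"sd of class averages:   {observed_sd:.2f} (rule 2 predicts {predicted_sd:.2f})")
```

The class averages vary far less than the individual scores, as in Collection 2 above.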
We should note that the formula $\sigma_{\bar{x}} = \sigma / \sqrt{n}$ is only perfectly true if you do your sampling with replacement, i.e., with the possibility of having repeats in your sample. This is not how most people would want to choose a sample. With a small population this could be a problem. However, most populations are very large, and it is unlikely that you would have repeats show up even if you did sampling with replacement. So the formula is extremely close to exact for large populations. For example, if the population were all trees in Colorado and you picked a sample of size 60 by picking a tree at random, then picking another tree at random from all the trees including the one you got on the first pick, and so forth, do you think you would get any repeats? Also, for the tree example the number of samples of size 60 is larger than the number of atomic particles in the known universe. This really shows that $\mu$ and $\mu_{\bar{x}}$, as well as $\sigma$ and $\sigma_{\bar{x}}$, are very different concepts. It turns out that $\mu_{\bar{x}} = \mu$. You can also see that $\sigma_{\bar{x}} = \sigma / \sqrt{n}$ is a powerful formula, as it would be impossible to figure out $\sigma_{\bar{x}}$ by finding all the sample means, since there are way too many.
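For a very small population we can actually list every possible sample and see the effect of replacement directly. The sketch below uses a made-up population of five numbers and samples of size 2 (both hypothetical, chosen only to keep the list of samples short), and compares the standard deviation of all sample means with the population standard deviation divided by the square root of n.

```python
import itertools
import math
import statistics

population = [1, 2, 3, 4, 5]   # tiny hypothetical population
n = 2                          # sample size

mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # population standard deviation

# Every ordered sample of size n, repeats allowed (with replacement).
means_with = [sum(s) / n for s in itertools.product(population, repeat=n)]
# Every ordered sample of size n, no repeats (without replacement).
means_without = [sum(s) / n for s in itertools.permutations(population, n)]

sd_with = statistics.pstdev(means_with)
sd_without = statistics.pstdev(means_without)

print(f"population mean = {mu}, mean of all sample means = {statistics.mean(means_with)}")
print(f"sigma / sqrt(n)           = {sigma / math.sqrt(n):.4f}")
print(f"sd of means, with repl.   = {sd_with:.4f}")
print(f"sd of means, without repl.= {sd_without:.4f}")
```

With replacement the match is exact; without replacement the standard deviation of the sample means comes out somewhat smaller. The gap matters here because the population has only five members, and it becomes negligible when the population is huge compared with the sample.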
The website mentioned earlier :
also illustrates what the graphs on the previous page illustrate.
Note the triangle pattern that we saw before in the example about adding two random numbers between 0 and 1. Also note that for the n=30 cases, you might not think them bell shaped, because they are skinny, but the skinniness is due to the standard deviation getting smaller, i.e. rule 2 above.
Explain in simple terms why, when sampling, it makes sense that the standard deviation of the sample means is smaller than the standard deviation of the individual data values.
When you sample you will probably get a variety of numbers, some high, some low, and some in the middle, and the highs and lows will tend to cancel each other out.
Explain the difference between $\sigma$, $\sigma_{\bar{x}}$, and $s$. Two of these should be close; which two?