VII. STATISTICAL INFERENCES BASED ON LARGE SAMPLES
In the previous chapter, we discussed in considerable detail the process of Statistical Inference in normal populations. We have seen that the Normal Distribution provides a good model for many empirical phenomena. However, one also needs to recognize that many populations about which one wants to make inferences are not normal. These include continuous variables, such as income and firm size, and discrete variables, such as the number of voters favoring a tax increase. How does one proceed in such cases? There are two general approaches. One is to develop the necessary sampling theory for each population. The second is to use large samples and rely on a result which says that the sample mean or sample proportion is approximately normally distributed, provided the sample size is sufficiently large. That result is called the Central Limit Theorem.
Before proceeding, we need to explain briefly the Central Limit Theorem in order to give us an intuitive idea of its plausibility. If one understands the way in which the Normal Distribution arises, then the Central Limit Theorem is not quite as surprising as it might otherwise seem. The key point is that one way in which a normal distribution can arise is as the sum of a large number of independent small components.
The Central Limit Theorem can be stated as follows:
In sampling from a population with mean mu and standard deviation sigma, the distribution of the quantity
(X-bar - mu)/(sigma/sqrt(n))
converges to the standard normal distribution as the sample size becomes large.
In trying to understand the Central Limit Theorem, it is important to remember that all that is assumed about the population being sampled is that its mean and standard deviation exist. The proof of the Theorem is beyond the scope of our discussion. One can demonstrate the Central Limit Theorem by means of simulation experiments.
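A simulation experiment of this kind can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the population is taken to be lognormal with parameters 0 and 1 (chosen because it is strongly skewed), and the function names are hypothetical. It draws many samples, standardizes each sample mean as in the statement of the Theorem, and reports the share of standardized means that fall below zero. For a standard normal quantity that share is .5, so the reported share should move towards .5 as n grows.

```python
import math
import random

def standardized_means(n, reps, seed=0):
    """Return `reps` values of (X-bar - mu)/(sigma/sqrt(n)) for samples
    of size n drawn from a lognormal(0, 1) population (an assumption)."""
    rng = random.Random(seed)
    # Mean and standard deviation of the lognormal(0, 1) population.
    mu = math.exp(0.5)
    sigma = math.sqrt((math.e - 1.0) * math.e)
    results = []
    for _ in range(reps):
        x_bar = sum(rng.lognormvariate(0.0, 1.0) for _ in range(n)) / n
        results.append((x_bar - mu) / (sigma / math.sqrt(n)))
    return results

def share_below(values, cutoff):
    """Fraction of the values that lie strictly below the cutoff."""
    return sum(v < cutoff for v in values) / len(values)

for n in (2, 30, 200):
    z_values = standardized_means(n, reps=2000)
    print(n, round(share_below(z_values, 0.0), 3))
```

With a population this skewed, the share is well above .5 for very small samples and drifts towards .5 only gradually, which is a reminder that "sufficiently large" depends on how far the population is from normal.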
In this chapter, two kinds of problems are discussed:
1. Inference about the means of non-normal populations
2. Inference about proportions in the case of discrete populations.
Statistical Inference about the Mean of a Non-Normal Population
Consider the following problem. A firm caters to the upscale part of the hamburger market. It is trying to decide whether or not to build a store in a new area. It obviously would like to know the mean income in the area. The firm takes a random sample of size 100. The sample mean and standard deviation are:
x-bar = $16,402 s = $13,232
How could one develop a confidence interval for the population mean? We don't know the distribution of the population. We do, however, believe that it is non-symmetric. In fact, our suspicion is that it is lognormal. That is, the logarithm of income is normally distributed. One way of proceeding is to argue that since the sample size is large, the random quantity
1) (X-bar - Mu)/(s/sqrt(n))
is approximately standard normal. On that basis, one can develop an approximate confidence interval using the z-table. For example, a 100(1-alpha)% approximate confidence interval for the population mean is:
2) x-bar +/- z(1-alpha/2) (s/sqrt(n))
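In the present example:
x-bar = 16,402 s = 13,232 n = 100
The standard error is 13,232/sqrt(100) = 1,323.2 and, with z(.975) = 1.96, the margin of error is 1.96 (1,323.2) = 2,593. Thus, an approximate 95% confidence interval is: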
[ 13,809, 18,995 ]
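The arithmetic behind this interval can be checked with a short Python sketch. The function name is hypothetical, and the z-value is taken from the standard library's NormalDist rather than from a printed z-table, so it is 1.95996 rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, s, n, confidence=0.95):
    """Large-sample interval x-bar +/- z(1-alpha/2) (s/sqrt(n))."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Income example: x-bar = 16,402, s = 13,232, n = 100.
low, high = mean_confidence_interval(16402, 13232, 100)
print(round(low), round(high))  # 13809 18995
```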
Suppose that the firm has found, on the basis of experience, that a store is not profitable unless the mean income of households in the community is $14,000. So, it wants to make sure that the mean income is above $14,000. Its working hypothesis is that the mean is not above $14,000. The firm takes this as its null hypothesis. Thus we have:
H0: Mu < or = $14,000
H1: Mu is > $14,000
From our discussion of estimation, we know that an appropriate test statistic is:
z= (X-bar - Mu)/(s/sqrt(n))
The computed z-score, with Mu set to the hypothesized value of 14,000, is: z = (16,402 - 14,000)/1,323.2 = 1.8153
The p-value, with z rounded to 1.82 for the z-table, is P(Z > 1.82) = .0344
Thus, the decision made by the company will depend on the level of significance that it chooses. For example, if alpha = .01, the firm will conclude that the evidence in the sample is not sufficient for it to reject the hypothesis that the mean income is not high enough to justify building a store. But if the level of significance were alpha = .05, it would reject the null hypothesis and build the store.
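The test can be reproduced with the following Python sketch, in which the function name is hypothetical. Because the code does not round z to two decimals before looking up the tail area, it gives a p-value of about .035 rather than the table value .0344. The decisions at alpha = .01 and alpha = .05 are unaffected.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test(x_bar, mu0, s, n):
    """Large-sample test of H0: Mu <= mu0 against H1: Mu > mu0.
    Returns the z statistic and the upper-tail p-value."""
    z = (x_bar - mu0) / (s / sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value

# Income example: x-bar = 16,402, s = 13,232, n = 100, Mu0 = 14,000.
z, p_value = upper_tail_z_test(16402, 14000, 13232, 100)
print(round(z, 4), round(p_value, 4))
for alpha in (0.01, 0.05):
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(alpha, decision)
```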
1. A firm wishes to estimate the mean lifetime of a particular brand of brake linings. It installs the linings on a random sample of 64 from a fleet of cars. It finds that the average lifetime for the sample was 47,000 miles while the standard deviation was 5,000 miles. Find an approximate 95% confidence interval.
2. A university is interested in estimating the average expenditure on air travel per business trip by its employees. It draws a random sample of size 100 from its travel voucher file. The sample mean is $345 and the standard deviation is $55. Find a 99% confidence interval for mean expenditure on air travel per trip.
3. Family Cents has decided that it will locate in Valleydale if it can be pretty sure that the mean income level in the place is less than $14,000. It commissions a study in which a random sample of size 100 is drawn. The sample mean is $13,800 and the sample standard deviation is $1,000. The company's choice of alpha level is .001 . Should the company locate in Valleydale?
Inference about a Population Proportion Using a Large Sample
In many problems, one is interested in the proportion or percentage of a population having a particular characteristic, for example, the proportion of the population that likes a new product or the proportion that favors the consolidation of schools in a county. Can you think of other examples? How about the proportion of defective items in a lot, the proportion of females in a certain profession, or the percentage of fatal auto accidents in which the person was wearing a seat belt?
In the above problems, one is dealing in each case with a discrete population. One is usually interested in the population proportion. How does one make inferences about the population proportion? One way is to develop the sampling theory for the relevant discrete population. That approach is pursued in more advanced courses. Our approach is to use large samples and then rely on the Central Limit Theorem.
Suppose one is interested in the proportion of males aged sixteen and over who favor the Seat Belt law. A random sample of size n = 300 is taken. One hundred are in favor of the law. The observation is coded zero if the person says he is opposed to the seat belt law and it is coded one if he says he is in favor. Thus, the observations are zeros and ones. If we let Xi be the i-th observation, then the number of persons in favor is Sum(Xi), the number of ones. The mean of the sample is the sample proportion,
p-hat = Sum(Xi)/n
What is a plausible estimator for the population proportion, which is denoted by p? On the basis of our previous discussion, it should be the sample proportion p-hat. That is, p-hat = 100/300 in the present example. What about a confidence interval for the population proportion? We should know by now that the confidence interval will be of the form p-hat, plus or minus a margin of error. The margin of error is the product of a quantity that depends on the degree of confidence required and the standard error of the sample proportion p-hat.
In order to proceed, we have to find out about the sampling distribution of p-hat, the sample proportion. Recall from our discussion of the Binomial Distribution that the expected number of successes is np and the variance of the number of successes is np(1-p).
Mu(sum(Xi)) = np
Var(sum(Xi)) = np(1-p)
So, the expected proportion of successes, Mu(sum(Xi)/n) is p and the variance of the same is p(1-p)/n. In other words:
Mu (p-hat) = p
Sigma (p-hat) = sqrt (p (1-p)/n)
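These two results can be checked by simulation. The Python sketch below is an illustration only: the population proportion p = .3 is an assumed value chosen for the experiment, and the function name is hypothetical. It draws many samples of zeros and ones, computes p-hat for each, and compares the mean and standard deviation of the simulated p-hat values with p and sqrt(p(1-p)/n).

```python
import random
from math import sqrt
from statistics import mean, pstdev

def simulate_sample_proportions(p, n, reps, seed=0):
    """Return `reps` sample proportions, each from n zero-one draws
    in which a one occurs with probability p."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(reps)]

p, n = 0.3, 300  # assumed values for the experiment
p_hats = simulate_sample_proportions(p, n, reps=4000)
print("mean of p-hat:", round(mean(p_hats), 4), "theory:", p)
print("s.d. of p-hat:", round(pstdev(p_hats), 4),
      "theory:", round(sqrt(p * (1 - p) / n), 4))
```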
Applying the Central Limit Theorem to the present problem tells us that for large n, the quantity
Z= (p-hat - p)/sqrt(p(1-p)/n)
has an approximate Standard Normal Distribution. Unfortunately, this result is not enough to give us a computable confidence interval. Why? Because p, the unknown population proportion, appears in the standard deviation of the sample proportion. All is not lost, however. It can also be shown that the quantity
Z = (p-hat - p)/sqrt(p-hat(1-p-hat)/n)
is approximately standard normal for a sufficiently large sample size. How large does the sample size need to be in order for the approximation to be a good one? The conventional view is that one needs np > 5 and n(1-p) > 5.
On the basis of the foregoing discussion, an approximate 95% confidence interval for the proportion of males over age 16 who favor the Seat Belt law is:
1/3 +/- 1.96 sqrt((1/3)(2/3)/300)
[ .2800 , .3867 ]
In general, an approximate 100(1-Alpha)% confidence interval for the population proportion is:
p-hat +/- Z(1-alpha/2)sqrt(p-hat(1-p-hat)/n)
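The general formula translates directly into a short Python sketch. The function name is hypothetical, and the z-value comes from the standard library's NormalDist rather than a printed table. Applied to the seat belt sample (100 in favor out of 300), it agrees with the interval given above to about three decimal places.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Large-sample interval
    p-hat +/- Z(1-alpha/2) sqrt(p-hat(1-p-hat)/n)."""
    p_hat = successes / n
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Seat belt example: 100 of 300 in favor.
low, high = proportion_confidence_interval(100, 300)
print(round(low, 4), round(high, 4))
```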
We turn now to the matter of testing hypotheses about population proportions. Consider the following problem. A local insurance broker wants to increase the amount of homeowner's insurance he writes. He decides to spend some money on advertising but he is not sure which medium to use for the purpose. The local radio station has the lowest prices but he is not sure that it reaches a large enough proportion of homeowners. He decides that he will advertise on radio if he can be pretty sure that at least 40% of the homeowners listen to the station.
The broker chooses a level of significance of .01. A random sample of size 100 is obtained through telephone interview. Fifty of the 100 persons in the sample listen to the local radio station. The null and alternative hypotheses are
H0: p = .4
H1: p > .4
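The sample proportion is p-hat = 50/100 = .5. Using the hypothesized value p = .4 in the standard error, the computed value of Z is (.5 - .4)/sqrt((.4)(.6)/100) = 2.0412. The associated p-value is .0207. Since the p-value exceeds the level of significance of .01, the null hypothesis is not rejected.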
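The calculation for the broker's problem can be sketched in Python as follows. The function name is hypothetical. Note that the standard error uses the hypothesized proportion .4 rather than p-hat, because the test statistic is evaluated under the null hypothesis.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_proportion_test(successes, n, p0):
    """Large-sample test of H0: p = p0 against H1: p > p0.
    Returns the Z statistic and the upper-tail p-value."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1.0 - p0) / n)  # computed under H0
    z = (p_hat - p0) / standard_error
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value

# Radio example: 50 listeners in a sample of 100, p0 = .4, alpha = .01.
z, p_value = upper_tail_proportion_test(50, 100, 0.4)
print(round(z, 4), round(p_value, 4))
print("reject H0" if p_value < 0.01 else "do not reject H0")
```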
4. A recent national survey indicated that the default rate on student loans is 15 percent. The President of Local U feels that the rate at his school is well below 15 percent. He is due to speak to the State Bankers Association and wants to be sure that his belief is supported by the evidence. He instructs his statistician to draw a sample of size 100 from the loan files of those who have left Local U in the past eight years. It is found that six have defaulted while 94 have not. The President chooses a one percent level of significance. What should he tell the Bankers?
5. A large firm has decided that it will offer direct deposit of paychecks to local banks if it can be reasonably sure that more than 30 percent, that is .3, of its employees will take advantage of the option. The personnel department draws a random sample of 200 employees and finds that 75 will take advantage of the plan. What should the firm decide? Explain your reasoning carefully.
6. A JRJ subsidiary wishes to introduce a new decaffeinated soft drink, which is designed to appeal to the 55-64 age group. They take a random sample of 576 persons in the target age group and find that 288 of them like the taste of the new drink. Find a 95 percent confidence interval for the proportion of the target population that likes the new drink.
If the company had wished to be 99 percent confident that the sample proportion was within .03 of the population proportion, what is the minimum sample size they should have taken?
7. A salesperson for an industrial product believes that the probability of obtaining an order when she calls on a customer is at least .4. It is also believed that the probability of an order at one stop is independent of the outcomes at other customers. The salesperson visited 200 customers in the past year. She has received 60 orders. Should she modify her belief that the probability of receiving an order is at least .4?
8. A company believes that thirty percent of the consumers in a particular region have a favorable attitude towards its product. It takes a random sample of size 200. Given the firm's beliefs, find the probabilities of the following events:
a) at least sixty in the sample have a favorable attitude
b) fewer than fifty in the sample like the product.
9. A cereal company has conducted a consumer test of its new version of fruit and fiber. It takes a sample of size 1,600. It finds that 400 say that their attitude towards the product is favorable. Should the firm market the product if its policy is to market only if it is reasonably sure that 20% favor the product?
10. Mills General is trying to decide if they should introduce a new high sugar fruit and fibre cereal, which is designed to appeal to the 15-24 age group. They wish to estimate the proportion of the target age group who like the taste of the new cereal. They want to be 95% confident that the estimate is within .03 of the true value.
Find the minimum size of sample necessary for this purpose.
11. The suburb of Valleydale is a potential location for a new store for Upscale Clothing. Before making a decision, the firm's manager wants to have a good estimate of the mean family income in the area. In particular, he wants to estimate mean family income within two thousand dollars and with 99% confidence. On the basis of some earlier studies, the manager believes that the standard deviation of income in the area is five thousand dollars.
What is the minimum sample size necessary to obtain such an estimate?
If the manager wished to have the estimate of the mean within one thousand dollars of the true value, what would be the required minimum sample size?