Question 2. Is the primary outcome variable a measurement variable (a.k.a. interval or continuous) or is the primary outcome variable a categoric variable? Choosing a primary outcome variable is analogous to choosing dessert in a restaurant. You may have many favorites, but when the server comes to take your order you have to settle on one, although you can probably procure a taste of everyone else’s dessert. Everyone else’s desserts are analogous to secondary outcome variables. You will be able to assess them too, but your main assessment – what determines whether the research question was answered, or whether the dessert was a success – depends on a single variable.
A measurement variable is one where the characteristic is assessed on a scale with many possible values representing an underlying continuum (e.g. age, height, blood pressure, pain (on a visual analogue scale)). It involves a measuring process and usually requires some sort of “instrumentation” (e.g. ruler, stopwatch, biochemical analysis, psychometric tool). Measurement variables are usually summarized with a mean or median.
A categoric variable involves classification of subjects into one of a number of categories on the basis of a characteristic. There can be two categories only (binary variable), multiple categories where order does not matter (nominal variable) or multiple categories where order does matter (ordinal variable). Categoric variables are usually summarized with proportions.
Sample Size Estimation for Descriptive Studies
If the answer to Question 1 in the previous section was “descriptive study”, you’re in the right place. No further questions need to be answered. This section derives the following formulas.
(Sample size is based on the margin of error, E, in confidence intervals.)
For a proportion: $N = 1 / E^2$, for P near 0.5
For a proportion: $N = (2 / E)^2 P(1-P)$, for P near 0 or 1
For a mean: $N = (2S / E)^2$, where S is the standard deviation of the variable
A confidence interval is a range of likely or plausible values of the population characteristic of interest. For example, a sample survey can be used to give a range of values that the true population proportion or population mean is expected to lie within. The intervals can be constructed to provide greater or lesser levels of confidence; however, the usual choice is 95% (with 90% and 99% useful in certain situations). For more information on confidence intervals, see the Modules entitled Analysis I and II.
Confidence intervals usually take the form:
(Point estimate) ± (Margin of error)
The point estimate is a value computed from the sample; for example, the sample mean or sample proportion.
The margin of error (or “plus or minus number”) is a value computed from a variety of components – the level of confidence (e.g. 95%), the variability in the outcome variable, and the sample size.
Confidence intervals are used to estimate sample sizes as follows.
>>> When interest is in a population mean (i.e. the primary outcome variable is measurement/continuous), the total number of subjects required (N) is:
$N = 4 z_\alpha^2 S^2 / W^2$
where S is the standard deviation of the variable, W is the width of the confidence interval (equal to twice the “margin of error”), and zα is a value from the normal distribution related to and representing the confidence level (equal to 1.96 for 95% confidence).
The table in Appendix 6.D (p 90, Hulley) provides the sample size for common values of W/S and three choices of confidence level.
The formula can be rewritten as:
$N = (z_\alpha S / E)^2$
where E is the “margin of error” (half the width, W).
As an approximation, for 95% confidence, use the value of 2 for zα (instead of 1.96) – remember that this is an approximation, after all! Then the formula is very concise and easily remembered:
$N = (2S / E)^2$
That is “twice the standard deviation over the margin of error, all squared”.
Where does the value of S come from? There are a number of sources, including previously published research or a pilot study. When these sources fail, as in the case of brand-new research, with a new instrument or a new population under study, a rough approximation can be made using the six-sigma rule for bell-shaped distributions; the standard deviation is approximately the range (maximum minus minimum) divided by six.
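The calculation for a mean can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example values (scores spanning 40 to 100, a margin of error of 2 points) are hypothetical and not from this module, and the exact z value is taken from Python's standard-library normal distribution rather than rounded to 2.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence: float) -> float:
    """Two-sided normal critical value, e.g. about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_for_mean(s: float, e: float, confidence: float = 0.95) -> int:
    """N = (z * S / E)^2, rounded up to a whole subject."""
    z = z_for_confidence(confidence)
    return math.ceil((z * s / e) ** 2)


def sd_from_range(minimum: float, maximum: float) -> float:
    """Rough guess at S using the range-divided-by-six rule from the text."""
    return (maximum - minimum) / 6


# Hypothetical planning values (assumptions for illustration only).
s_guess = sd_from_range(40, 100)               # rough standard deviation
n_exact = sample_size_for_mean(s_guess, e=2)   # uses z of about 1.96
n_quick = math.ceil((2 * s_guess / 2) ** 2)    # shortcut with z = 2

print(f"S = {s_guess}, N (z = 1.96) = {n_exact}, N (z = 2) = {n_quick}")
```

The shortcut with z = 2 gives a slightly larger N than the exact value, which errs on the safe side.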
Here is an online calculator that allows you to determine the sample size for a measurement (continuous) variable in a single-sample study: http://www.surveysystem.com/sscalc.htm
>>> When interest is in a population proportion (i.e. the primary outcome variable is categoric – specifically, binary), the total number of subjects required (N) is:
$N = 4 z_\alpha^2 P(1-P) / W^2$
where P is the expected proportion who have the characteristic of interest, W is the width of the confidence interval (equal to twice the “margin of error”), and zα is a value from the normal distribution related to and representing the confidence level (equal to 1.96 for 95% confidence).
Note that this formula looks like the one for measurement data except that $S^2$ has been replaced by P(1-P).
The table in Appendix 6.E (p 91, Hulley) provides the sample size for common choices of P and W, and three choices of confidence level.
The formula can be rewritten as:
$N = (z_\alpha / E)^2 P(1-P)$
where E is the “margin of error” (half the width, W).
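As a rough sketch, the formula for a proportion can be evaluated at several confidence levels. The function name and the planning values (an expected proportion of 30% and a margin of error of 5 percentage points) are assumptions chosen for illustration; they do not come from this module or from the Hulley tables.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p: float, e: float, confidence: float = 0.95) -> int:
    """N = (z / E)^2 * P * (1 - P), rounded up to a whole subject."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z / e) ** 2 * p * (1 - p))


# Hypothetical planning values (assumptions for illustration only).
for conf in (0.90, 0.95, 0.99):
    n = sample_size_for_proportion(p=0.30, e=0.05, confidence=conf)
    print(f"{conf:.0%} confidence: N = {n}")
```

Asking for more confidence at the same margin of error raises the required sample size.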
As an approximation, for 95% confidence, use the value of 2 for zα (instead of 1.96) – remember that this is an approximation, after all! Also, use the most conservative value of P, which is 0.5, because it makes P(1-P) as large as possible (0.25). Then $(2 / E)^2 \times 0.25 = 1 / E^2$, so the formula is very concise and easily remembered:
$N = 1 / E^2$
That is, “one over the square of the margin of error”.
This formula can also be easily rearranged to get: $E = 1 / \sqrt{N}$
That is, the margin of error is one over the square root of the sample size.
For example, if the sample size is 100, the margin of error is 10%; for a sample size of 400 the margin of error is 5%; and for a sample size of 1000, the margin of error is 3%.
Note that doubling the sample size from 1000 to 2000 only reduces the margin of error from about 3% to about 2%, not much improvement in precision for double the effort. That explains why so many national opinion polls are about 1000 in size.
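The sample sizes quoted above can be checked with a short Python sketch of the shortcut, which assumes 95% confidence with z rounded to 2 and P = 0.5. The function name is hypothetical.

```python
import math


def approx_margin_of_error(n: int) -> float:
    """E = 1 / sqrt(N): the 95% shortcut with z = 2 and P = 0.5."""
    return 1 / math.sqrt(n)


for n in (100, 400, 1000, 2000):
    print(f"N = {n:>4}: margin of error about {approx_margin_of_error(n):.1%}")
```

Because the margin of error shrinks with the square root of N, halving it requires four times as many subjects.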
If the expected proportion is more than half, then plan the sample size based on the proportion expected NOT to have the characteristic. That is, switch the roles of P and 1-P.
If P is very close to 0 or 1 (i.e. the characteristic of interest is rare or happens most of the time), the sample size formula $N = 1 / E^2$ is not appropriate. Instead you need to use the fuller version seen earlier:
$N = (2 / E)^2 P(1-P)$
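To see how much the two versions can differ, here is a small Python comparison. The function names and the example (a characteristic expected in 5% of subjects, with a margin of error of 2 percentage points) are hypothetical, and both formulas use the z = 2 approximation.

```python
import math


def n_full(p: float, e: float) -> int:
    """Fuller version: N = (2 / E)^2 * P * (1 - P)."""
    raw = (2 / e) ** 2 * p * (1 - p)
    # Round first so floating-point fuzz (e.g. 475.0000000001) does not add a subject.
    return math.ceil(round(raw, 6))


def n_shortcut(e: float) -> int:
    """Shortcut: N = 1 / E^2, which assumes P = 0.5."""
    return math.ceil(round(1 / e ** 2, 6))


# Hypothetical rare characteristic (assumption for illustration only).
p, e = 0.05, 0.02
print(f"Fuller formula: N = {n_full(p, e)}")
print(f"Shortcut:       N = {n_shortcut(e)}")
```

With these values the shortcut asks for several times more subjects than the fuller formula, because it assumes the largest possible value of P(1-P).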
[Note for the obsessive-compulsives among you: These formulas assume that the population is “infinite” (i.e. very large) in comparison to the sample. There is a finite population correction factor that will come into play when the final confidence intervals are being constructed. But it can be safely ignored in calculating sample size for a survey.]
A confidence interval for the mean should be based on at least twelve observations. The width of a confidence interval, which depends on the estimate of variability and on the sample size, decreases rapidly until 12 observations are reached and then decreases less rapidly.
Exercise: Go to http://www.surveysystem.com/sscalc.htm where there is a calculator used for this type of calculation. Use this calculator to determine the sample size for a survey where you will determine the proportion of people in Belltown who eat doughnuts. You want your estimate of the true proportion to be accurate to 6%. (Note that, unfortunately, this calculator calls this the “confidence interval”, a poor choice of terminology – it should be called the margin of error.) The population of Belltown is 65,000. Try out the calculator for other choices of population size; what would be the required sample size if the population of Belltown were only 1500? 15,000? 150,000? 1,500,000? What would you conclude about the role of the population size in these sample size calculations?
Sample Size Estimation for Comparative Studies
If the answer to Question 1 in the previous section was “comparative study”, you’re in the right place. This section will present:
A review of hypothesis testing
Baseline information – Questions to be answered for comparative studies
Question 1: What is an acceptable significance level (alpha)?
Question 2: How large a power is needed?
Question 3: How large is the variability in the effect of interest?
Question 4: What is the smallest detectable effect of interest?
Calculating the sample size for comparative studies
A statistical “test” always challenges some hypothesis. A new treatment is investigated by testing the hypothesis that the treatment has no effect. A comparative study tests the hypothesis that two groups under different treatments exhibit no difference in response. We describe the results as “significant” or “positive” when such a challenge has been successful and the tested hypothesis overthrown (i.e. the null hypothesis is rejected).
“Significance” refers to the events and data that were actually observed, but which had small probability (P-value) according to the null hypothesis (so the null hypothesis is rejected as being incompatible with the data).
Before proceeding to sample size estimation we need to review the basic concepts of hypothesis testing.
Review of Basic Concepts of Hypothesis Testing
Hypothesis testing requires, first of all, and not surprisingly, hypotheses! That is, two competing claims about a parameter or parameters (characteristics of a population). In the context of sample size estimation the parameters are usually the mean or proportion of the key outcome variable of interest. The null hypothesis is the status quo hypothesis, the position of no difference, no effect, or no change. The alternative hypothesis is often referred to as the research hypothesis. It represents a difference between groups, a real effect, and an abandonment of the status quo.
A hypothesis test culminates with a conclusion about which of the two hypotheses is supported by the available data. The conclusion can either be correct or incorrect. And statisticians, who have their ignorance better organized than ordinary mortals, have classified the ways in which the conclusion can be correct or incorrect. Errors in the conclusion are imaginatively called either Type I or Type II.
A Type I error occurs when the null hypothesis is rejected, but in fact the null hypothesis is actually true. That is, the conclusion is that there is a significant difference when in fact there really isn’t. A Type I error can be thought of as a “false positive”.
A Type II error occurs when the null hypothesis is accepted, but in fact the null hypothesis is actually false. That is, the conclusion is that there is no difference when in fact there really is a difference. A Type II error can be thought of as a “false negative”.
Next we define alpha (α) as the probability of making a Type I error. It is also known as the significance level. Usually α is set at 0.05 (keeping it consistent with 1 – α, or 0.95 or 95%, in the context of confidence intervals).
And we define beta (β) as the probability of making a Type II error. Although β doesn’t have another name, 1 – β does. It is known as power.
Power is the probability of correctly rejecting the null hypothesis; for example, concluding that there was a difference when, in fact, there really was one!
Sample size calculations are often called power calculations, which tells you how crucial the concept of power is to the whole exercise.
Aside: A Type III error has been referred to as getting the right answer to the wrong question or to a question nobody asked!
The following two-by-two table summarizes the previous concepts and quantities.
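                             Null hypothesis true         Null hypothesis false
Null hypothesis accepted     Correct decision (1 – α)     Type II error (β)
Null hypothesis rejected     Type I error (α)             Correct decision (1 – β, power)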
A useful analogy is to our Western legal system. In our system a defendant is “innocent until proven guilty”. The null hypothesis is “not guilty”; the alternative hypothesis is “guilty”. The onus is on the investigator (i.e. the prosecution) to present the evidence to convince the judge or jury to abandon the null hypothesis in favour of the alternative. If the data are convincingly more consistent with the alternative hypothesis, the judge or jury (barring legal technicalities and theatrics) must conclude that the defendant is guilty.
The conclusion, whichever way it goes, may be the right one or the wrong one. Convicting the guilty or acquitting the innocent are correct decisions. However, convicting an innocent person is a Type I error, while acquitting a guilty person is a Type II error. Neither of these errors is desirable (in this case a Type I error is the worse of the two, but there are other situations where a Type II error is the worse). We would rather not make any errors. Notice, however, the problem that this presents. In order not to make ANY Type I errors we would have to acquit everyone, which would lead to a high rate of Type II errors. In order not to make ANY Type II errors we would have to convict everyone, including rather a lot of innocent people along the way. Hypothesis testing, therefore, tries to keep both error rates under control, and this is accomplished by collecting more and more evidence (what a non-legal researcher would call data).
Sample size estimation concerns ensuring enough data so as to keep the probabilities of Type I and Type II errors (α and β) at suitable levels.
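The link between sample size, α, and β can be illustrated with a small Python sketch. It rests on assumptions that are not part of this module: a two-sided, one-sample z-test with a known standard deviation, a hypothetical true shift of 5 units, and a standard deviation of 10. The function name is also hypothetical. It is meant only to show the pattern, not to serve as a sample size method for comparative studies.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_one_sample_z(effect: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a two-sided one-sample z-test with known SD."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sd  # true effect in standard-error units
    # Probability that the test statistic falls in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical values (assumptions for illustration only).
for n in (10, 20, 32, 50):
    power = power_one_sample_z(effect=5, sd=10, n=n)
    print(f"n = {n:>2}: power = {power:.2f}, beta = {1 - power:.2f}")
```

With α held at 0.05, β falls as n grows. When the true effect is zero, the same function returns α, the probability of a Type I error.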