Chapter 1: Statistics: Part 1
Section 1.1: Statistical Basics
Data are all around us. Researchers collect data on the effectiveness of a medication for lowering cholesterol. Pollsters report on the percentage of Americans who support gun control. Economists report on the average salary of college graduates. There are many other areas where data are collected. In order to be able to understand data and how to summarize it, we need to understand statistics.
Suppose you want to know the average net worth of a current U.S. Senator. There are 100 Senators, so it is not that hard to collect all 100 values, and then summarize the data. If instead you want to find the average net worth of all current Senators and Representatives in the U.S. Congress, there are only 535 members of Congress (100 Senators and 435 Representatives). So even though it will be a little more work, it is not that difficult to find the average net worth of all members. Now suppose you want to find the average net worth of everyone in the United States. This would be very difficult, if not impossible. It would take a great deal of time and money to collect the information in a timely manner before all of the values have changed. So instead of getting the net worth of every American, we have to figure out an easier way to find this information. The net worth is what you want to measure, and is called a variable. The net worth of every American is called the population. What we need to do is collect a smaller part of the population, called a sample. In order to see how this works, let’s formalize the definitions.
Variable: Any characteristic that is measured from an object or individual.
Population: A set of measurements or observations from all objects under study
Sample: A set of measurements or observations from some objects under study (a subset of a population)
Example 1.1.1: Stating Populations and Samples
Determine the population and sample for each situation.

A researcher wants to determine the length of the lifecycle of a bark beetle. In order to do this, he breeds 1000 bark beetles and measures the length of time from birth to death for each bark beetle.
Population: The set of lengths of lifecycle of all bark beetles
Sample: The set of lengths of lifecycle of 1000 bark beetles

The National Rifle Association wants to know what percent of Americans support the right to bear arms. They ask 2500 Americans whether they support the right to bear arms.
Population: The set of responses from all Americans to the question, “Do you support the right to bear arms?”
Sample: The set of responses from 2500 Americans to the question, “Do you support the right to bear arms?”

The Pew Research Center asked 1000 mothers in the U.S. what their highest attained education level was.
Population: The set of highest education levels of all mothers in the U.S.
Sample: The set of highest education levels of 1000 mothers in the U.S.
It is very important that you understand what you are trying to measure before you actually measure it. Also, please note that the population is a set of measurements or observations, and not a set of people. If you say the population is all Americans, then you have only given part of the story. More important is what you are measuring from all Americans. The question is, do you want to measure their race, their eye color, their income, their education level, the number of children they have, or other variables? Therefore, it is very important to state what you measured or observed, and from whom or what the measurements or observations were taken.
Once you know what you want to measure or observe, and the source from which you want to take measurements or observations, you need to collect the data.
A data set is a collection of values called data points or data values. N represents the number of data points in a population, while n represents the number of data points in a sample. A data value that is much higher or lower than all of the other data values is called an outlier. Sometimes outliers are just unusual data values that are very interesting and should be studied further, and sometimes they are mistakes. You will need to figure out which is which.
In order to collect the data, we have to understand the types of variables we can collect. There are actually two different types of variables. One is called qualitative and the other is called quantitative.
Qualitative (Categorical) Variable: A variable that represents a characteristic. Qualitative variables are not inherently numbers, and so they cannot be added, multiplied, or averaged, but they can be represented graphically with graphs such as a bar graph.
Examples: gender, hair color, race, nationality, religion, course grade, year in college, etc.
Quantitative (Numerical) Variable: A variable that represents a measurable quantity. Quantitative variables are inherently numbers, and so they can be added, multiplied, averaged, and displayed graphically.
Examples: Height, weight, number of cats owned, score of a football game, etc.
Quantitative variables can be further subdivided into other categories – continuous and discrete.
Continuous Variable: A variable that can take on an uncountable number of values in a range. In other words, the variable can be any number in a range of values. Continuous variables are usually things that are measured.
Examples: Height, weight, foot size, time to take a test, length, etc.
Discrete Variable: A variable that can take on only specific values in a range. Discrete variables are usually things that you count.
Examples: IQ, shoe size, family size, number of cats owned, score in a football game, etc.
Example 1.1.2: Determining Variable Types
Determine whether each variable is quantitative or qualitative. If it is quantitative, then also determine if it is continuous or discrete.

Length of race
Quantitative and continuous, since this variable is a number and can take on any value in an interval.

Opinion of a person about the President
Qualitative, since this variable is not a number.

House color in a neighborhood
Qualitative, since this variable is not a number.

Number of houses that are in foreclosure in a state
Quantitative and discrete, since this variable is a number but can only be certain values in an interval.

Weight of a baby at birth
Quantitative and continuous, since this variable is a number and can take on any value in an interval.

Highest education level of a mother
Qualitative, since the variable is not a number.
Section 1.2: Random Sampling
Now that you know that you have to take samples in order to gather data, the next question is how best to gather a sample. There are many ways to take samples. Not all of them will result in a representative sample. Also, just because a sample is large does not mean it is a good sample. As an example, you can take a sample involving one million people to find out if they feel there should be more gun control, but if you only ask members of the National Rifle Association (NRA) or the Coalition to Stop Gun Violence, then you may get biased results. You need to make sure that you ask a cross-section of individuals. Let’s look at the types of samples that can be taken. Do realize that no sample is perfect, and any sample may fail to represent the population.
Census: An attempt to gather measurements or observations from all of the objects in the entire population.
A true census is very difficult to do in many cases. However, for certain populations, like the net worth of the members of the U.S. Senate, it may be relatively easy to perform a census. We should be able to find out the net worth of each and every member of the Senate since there are only 100 members. But, when our government tries to conduct the national census every 10 years, you can believe that it is impossible for them to gather data on each and every American.
The best way to find a sample that is representative of the population is to use a random sample. There are several different types of random sampling. Though it depends on the task at hand, the best method is often simple random sampling which occurs when you randomly choose a subset from the entire population.
Simple Random Sample: Every sample of size n has the same chance of being chosen, and every individual in the population has the same chance of being in the sample.
An example of a simple random sample is to put all of the names of the students in your class into a hat, and then randomly select five names out of the hat.
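The hat example can be imitated on a computer. The following is only a minimal sketch: the roster names are made up for illustration, and the fixed seed is an assumption added so that the draw can be repeated.

```python
import random

# Hypothetical class roster; the names are invented for this illustration.
roster = ["Ana", "Ben", "Chloe", "Dev", "Elena", "Farid",
          "Grace", "Hiro", "Imani", "Jonas", "Kira", "Luis"]

rng = random.Random(2024)  # fixed seed (an assumption) so the draw is repeatable

# Draw five names without replacement, like pulling slips out of a hat.
chosen = rng.sample(roster, 5)
print(chosen)
```

Because `sample` draws without replacement, no student can be picked twice, and every group of five students is equally likely to be chosen.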
Stratified Sampling: This is a method of sampling that divides a population into different groups, called strata, and then takes random samples inside each stratum.
An example where stratified sampling is appropriate is if a university wants to find out how much time their students spend studying each week; but they also want to know if different majors spend more time studying than others. They could divide the student body into the different majors (strata), and then randomly pick a number of people in each major to ask them how much time they spend studying. The number of people asked in each major (stratum) does not have to be the same.
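A stratified sample of students by major might be sketched as follows. The majors, student IDs, and sample sizes are all hypothetical, chosen only to show that each stratum is sampled separately and that the counts need not match.

```python
import random

rng = random.Random(7)  # fixed seed (an assumption) for repeatability

# Hypothetical student IDs grouped by major; each major is one stratum.
strata = {
    "Biology": [f"BIO-{i}" for i in range(1, 41)],
    "History": [f"HIS-{i}" for i in range(1, 26)],
    "Engineering": [f"ENG-{i}" for i in range(1, 61)],
}

# How many students to ask in each stratum; the counts need not be equal.
sample_sizes = {"Biology": 8, "History": 5, "Engineering": 12}

# Take a separate simple random sample inside every stratum.
stratified_sample = {
    major: rng.sample(students, sample_sizes[major])
    for major, students in strata.items()
}

for major, students in stratified_sample.items():
    print(major, len(students))
```

Every major appears in the result, which is what separates this method from a cluster sample, where only some groups are used.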
Systematic Sampling: This method is where you pick every kth individual, where k is some whole number. This is used often in quality control on assembly lines.
For example, a car manufacturer needs to make sure that the cars coming off the assembly line are free of defects. They do not want to test every car, so they test every 100th car. This way they can periodically see if there is a problem in the manufacturing process. This makes for an easier method to keep track of testing and is still a random sample.
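Picking every kth individual is easy to express in code. In this sketch the function name `systematic_sample` and the serial numbers are hypothetical. The optional random starting point is an assumption added here for illustration; the text itself only describes taking every kth item.

```python
import random

def systematic_sample(items, k, start=None, rng=random):
    """Return every k-th item, beginning at 0-based index `start`."""
    if k <= 0:
        raise ValueError("k must be a positive whole number")
    if start is None:
        # Assumption: choose the starting point at random among the first k items.
        start = rng.randrange(k)
    return items[start::k]

cars = list(range(1, 1001))  # hypothetical serial numbers 1 through 1000

# Test every 100th car, beginning with car number 100 (index 99).
tested = systematic_sample(cars, k=100, start=99)
print(tested)  # [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
```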
Cluster Sampling: This method is like stratified sampling, but instead of dividing the individuals into strata, and then randomly picking individuals from each stratum, a cluster sample separates the individuals into groups, randomly selects which groups they will use, and then takes a census of every individual in the chosen groups.
Cluster sampling is very useful in geographic studies such as the opinions of people in a state or measuring the diameter at breast height of trees in a national forest. In both situations, a cluster sample reduces the traveling distances that occur in a simple random sample. For example, suppose that the Gallup Poll needs to perform a public opinion poll of all registered voters in Colorado. In order to select a good sample using simple random sampling, the Gallup Poll would have to have all the names of all the registered voters in Colorado, and then randomly select a subset of these names. This may be very difficult to do. So, they will use a cluster sample instead. Start by dividing the state of Colorado up into categories or groups geographically. Randomly select some of these groups. Now ask all registered voters in each of the chosen groups. This makes the job of the pollsters much easier, because they will not have to travel over every inch of the state to get their sample but it is still a random sample.
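The two steps of a cluster sample, randomly choosing groups and then surveying everyone in them, can be sketched as below. The region names, voter labels, and group sizes are hypothetical and far smaller than any real poll.

```python
import random

rng = random.Random(11)  # fixed seed (an assumption) for repeatability

# Hypothetical geographic regions (clusters), each with five registered voters.
regions = {
    f"Region {r}": [f"voter-{r}-{i}" for i in range(1, 6)]
    for r in range(1, 9)
}

# Step 1: randomly select which clusters will be used.
chosen_regions = rng.sample(sorted(regions), 3)

# Step 2: take a census of every individual in the chosen clusters.
cluster_sample = [voter for region in chosen_regions for voter in regions[region]]

print(chosen_regions)
print(len(cluster_sample))  # 15, since each of the 3 regions has 5 voters
```

Note that randomness enters only in Step 1. Inside a chosen region nobody is left out.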
Quota Sampling: This is when the researchers deliberately try to form a good sample by creating a cross-section of the population under study.
For example, suppose that the population under study is the political affiliations of all the people in a small town. Now, suppose that the residents of the town are 70% Caucasian, 25% African American, and 5% Native American. Further, the residents of the town are 51% female and 49% male. Also, we know information about the religious affiliations of the townspeople. The residents of the town are 55% Protestant, 25% Catholic, 10% Jewish, and 10% Muslim. Now, if a researcher is going to poll the people of this town about their political affiliation, the researcher should gather a sample that is representative of the entire population. If the researcher uses quota sampling, then the researcher would try to artificially create a cross-section of the town by insisting that his sample should be 70% Caucasian, 25% African American, and 5% Native American. Also, the researcher would want his sample to be 51% female and 49% male. Also, the researcher would want his sample to be 55% Protestant, 25% Catholic, 10% Jewish, and 10% Muslim. This sounds like an admirable attempt to create a good sample, but this method has major problems with selection bias.
The main concern here is when does the researcher stop profiling the people that he will survey? So far, the researcher has cross-sectioned the residents of the town by race, gender, and religion, but are those the only differences between individuals? What about socioeconomic status, age, education, involvement in the community, etc.? These are all influences on the political affiliation of individuals. Thus, the problem with quota sampling is that to do it right, you have to take into account all the differences among the people in the town. If you cross-section the town down to every possible difference among people, you end up with single individuals, so you would have to survey the whole town to get an accurate result. The whole point of creating a sample is so that you do not have to survey the entire population, so what is the point of quota sampling?
Note: The Gallup Poll did use quota sampling in the past, but does not use it anymore.
Convenience Sampling: As the name of this sampling technique implies, the basis of convenience sampling is to use whatever method is easy and convenient for the investigator. This type of sampling technique creates a situation where a random sample is not achieved. Therefore, the sample will be biased since the sample is not representative of the entire population.

For example, suppose you stand outside the Democratic National Convention in order to survey people exiting the convention about their political views. This may be a convenient way to gather data, but the sample will not be representative of the entire population.
Of all of the sampling types, a random sample is the best type. Sometimes, it may be difficult to collect a perfect random sample since getting a list of all of the individuals to randomly choose from may be hard to do.
Example 1.2.1: Which Type of Sample?
Determine if the sample type is simple random sample, stratified sample, systematic sample, cluster sample, quota sample, or convenience sample.

A researcher wants to determine the different species of trees that are in the Coconino National Forest. She divides the forest using a grid system. She then randomly picks 20 different sections and records the species of every tree in each of the chosen sections.
This is a cluster sample, since she randomly selected some of the groups, and all individuals in the chosen groups were surveyed.

A pollster stands in front of an organic foods grocery store and asks people leaving the store how concerned they are about pesticides in their food.
This is a convenience sample, since the person is just standing out in front of one store. Most likely the people leaving an organic food grocery store are concerned about pesticides in their food, so the sample would be biased.

The Pew Research Center wants to determine the education level of mothers. They randomly ask mothers to say if they had some high school, graduated high school, some college, graduated from college, or an advanced degree.
This is a simple random sample, since the individuals were picked randomly.

Penn State wants to determine the salaries of their graduates in the majors of agricultural sciences, business, engineering, and education. They randomly ask 50 graduates of agricultural sciences, 100 graduates of business, 200 graduates of engineering, and 75 graduates of education what their salaries are.
This is a stratified sample, since all groups were used, and then random samples were taken inside each group.

In order for the Ford Motor Company to ensure quality of their cars, they test every 130th car coming off the assembly line of their Ohio Assembly Plant in Avon Lake, OH.
This is a systematic sample since they picked every 130th car.

A town council wants to know the opinion of their residents on a new regional plan. The town is 45% Caucasian, 25% African American, 20% Asian, and 10% Native American. It also is 55% Christian, 25% Jewish, 12% Islamic, and 8% Atheist. In addition, 8% of the town did not graduate from high school, 12% have graduated from high school but never went to college, 16% have had some college, 45% have obtained bachelor’s degree, and 19% have obtained a postgraduate degree. So the town council decides that the sample of residents will be taken so that it mirrors these breakdowns.
This is a quota sample, since they tried to pick people who fit into these subcategories.
Section 1.3: Clinical Studies
Now that you know how to collect a sample, you next need to learn how to conduct a study. We will discuss the basics of studies, both observational studies and experiments.
Observational Study: This is where data is collected from just observing what is happening. There is no treatment or activity being controlled in any way. Observational studies are commonly conducted using surveys, though you can also collect data by just watching what is happening such as observing the types of trees in a forest.
Survey: Surveys are used for gathering data to create a sample. There are many different kinds of surveys, but overall, a survey is a method used to ask people questions when interested in the responses. Examples of surveys are Internet and T.V. surveys, customer satisfaction surveys at stores or restaurants, new product surveys, phone surveys, and mail surveys. The majority of surveys are some type of public opinion poll.

Experiment: This is an activity where the researcher controls some aspect of the study and then records what happens. An example of this is giving a plant a new fertilizer, and then watching what happens to the plant. Another example is giving a cancer patient a new medication, and monitoring whether the medication stops the cancer from growing. There are many ways to do an experiment, but a clinical study is one of the more popular ways, so we will look at the aspects of this.
Clinical Study: This is a method of collecting data for a sample and then comparing that to data collected for another sample where one sample has been given some sort of treatment and the other sample has not been given that treatment (control). Note: There are occasions when you can have two treatments, and no control. In this case you are trying to determine which treatment is better.
Example 1.3.1: Clinical Study Examples
Here are examples of clinical studies.

A researcher may want to study whether or not smoking increases a person's chances of heart disease.

A researcher may want to study whether a new antidepressant drug will work better than an old antidepressant drug.

A researcher may want to study whether taking folic acid before pregnancy will decrease the risk of birth defects.
Clinical Study Terminology:
Treatment Group: This is the group of individuals who are given some sort of treatment. The word treatment here does not necessarily mean medical treatment. The treatment is the cause, which may produce an effect that the researcher is interested in.
Control Group: This is the group of individuals who are not given the treatment.
Sometimes, they may be given some old treatment, or sometimes they will not be given anything at all. Other times, they may be given a placebo (see below).
Example 1.3.2: Treatment/Control Group Examples
Determine the treatment group, control group, treatment, and control for each clinical study in Example 1.3.1.

A researcher may want to study whether or not smoking increases a person's chances of heart disease.
The treatment group is the people in the study who smoke and the treatment is smoking. The control group is the people in the study who do not smoke and the control is not smoking.

A researcher may want to study whether a new antidepressant drug will work better than an old antidepressant drug.
The treatment group is the people in the study who take the new antidepressant drug and the treatment is taking the new antidepressant drug. The control group is the people in the study who take the old antidepressant drug and the control is taking the old antidepressant drug. Note: In this case the control group is given some treatment since you should not give a person with depression a nontreatment.

A researcher may want to study whether taking folic acid before pregnancy will decrease the risk of birth defects.
The treatment group is the women who take folic acid before pregnancy and the treatment is taking folic acid. The control group is the women who do not take folic acid before pregnancy and the control is not taking the folic acid.
Note: In this case, you may choose to do an observational study of women who did or did not take folic acid during pregnancy so that you are not inducing women to avoid folic acid during pregnancy which could be harmful to their baby.
Confounding Variables: These are other possible causes that may produce the effect of interest rather than the treatment under study. Researchers minimize the effect of confounding variables by comparing the results from the treatment group versus the control group.
Controlled Study: Any clinical study where the researchers compare the results of a treatment group versus a control group.
Placebo: A placebo is sometimes used on the control group in a study to mimic the treatment that the treatment group is receiving. The idea is that if a placebo is used, then the people in the control group and in the treatment group will all think that they are receiving the treatment. However, the control group is merely receiving something that looks like the treatment, but should have no effect on the outcome. An example of a placebo could be a sugar pill if the treatment is a drug in pill form.
Example 1.3.3: Placebo Examples
For each situation in Example 1.3.1, identify if a placebo is necessary to use.

A researcher may want to study whether or not smoking increases a person's chances of heart disease.
In this example, it is impossible to use a placebo. The treatment group is comprised of people who smoke and the control group is comprised of people who do not smoke. There is no way to get the control group to think that they are smoking as well as the treatment group.

A researcher may want to study whether a new antidepressant drug will work better than an old antidepressant drug.
In this example, a placebo is not needed since we are comparing the results of two different antidepressant drugs.

A researcher may want to study whether taking folic acid before pregnancy will decrease the risk of birth defects.
In this example, the control group could be given a sugar pill instead of folic acid. The women in the control group may think that they are taking folic acid, so any psychological effect on a person's health can be measured. This way, when we compare the results of taking folic acid versus taking a sugar pill, we can see if there were any dramatic differences in the results.
Blind Study: Usually, when a placebo is used in a study, the people in the study will not know if they received the treatment or the placebo until the study is completed. In other words, the people in the study do not know if they are in the treatment group or in the control group. This type of study is called a blind study. Note: When researchers use a placebo in a blind study, the people in the study are told ahead of time that they may be getting the actual treatment, or they may be getting the placebo.
Double-Blind Study: Sometimes when researchers are conducting a very extensive study using many healthcare workers, the researchers will not tell the people in the study or the healthcare workers which patients will receive the treatment and which patients will receive the placebo. In other words, the healthcare workers who are administering the treatment or placebo to the people in the study do not know which people are in the treatment group and which people are in the control group. This type of study is called a double-blind study.
Randomized Controlled Study: Any clinical study in which the treatment group and the control group are selected randomly from the population.
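One common way to form the two groups at random is to shuffle the list of participants and split it. The sketch below assumes 20 hypothetical participants who have already been recruited and an even split between groups; neither detail comes from the definition above.

```python
import random

rng = random.Random(3)  # fixed seed (an assumption) for repeatability

# Hypothetical participant IDs P01 through P20.
participants = [f"P{i:02d}" for i in range(1, 21)]

shuffled = participants[:]  # copy, so the original list keeps its order
rng.shuffle(shuffled)       # put the participants in a random order

half = len(shuffled) // 2
treatment_group = shuffled[:half]  # first half receives the treatment
control_group = shuffled[half:]    # second half serves as the control

print("Treatment:", sorted(treatment_group))
print("Control:  ", sorted(control_group))
```

Since chance alone decides who lands in which group, neither the researcher nor the participants can steer particular people toward the treatment.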
Whether you are doing an observational study or an experiment, you need to figure out what to do with the data. You will have many data values that you collected, and it sometimes helps to calculate numbers from these data values. What we call these numbers depends on whether they come from the population or from the sample.
Parameter: A numerical value calculated from a population
Statistic: A numerical value calculated from a sample, and used to estimate the parameter
Some examples of parameters that can be estimated from statistics are the percentage of people who strongly agree to a question and mean net worth of all Americans. The statistic would be the percentage of people asked who strongly agree to a question, and the mean net worth of a certain number of Americans.
Notation for Parameter and Statistics:
Parameters are usually denoted with Greek letters. This is not to make you learn a new alphabet. It is because there just are not enough letters in our alphabet. Also, if you see a letter you do not know, then you know that the letter represents a parameter. Examples of letters that are used are μ (mu), σ (sigma), ρ (rho), and p (yes, this is our letter, because there is not a good choice in the Greek alphabet).
Statistics are usually denoted with our alphabet, and in some cases we try to use a letter that would be equivalent to the Greek letter. Examples of letters that are used are x̄ (x-bar), s, r, and p̂ (p-hat, since we already used p for the parameter).
In addition, N is used to denote the size of the population and n is used to denote the size of the sample.
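To make the distinction concrete, the sketch below computes a mean twice: once from every value in a small population (a parameter) and once from a sample drawn from it (a statistic). The net worth values are randomly generated and purely hypothetical, and the variable names `mu` and `x_bar` are chosen here to echo the notation.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed (an assumption) for repeatability

# Hypothetical population of N = 100 net worths, in thousands of dollars.
population = [rng.randint(50, 5000) for _ in range(100)]
N = len(population)

# Parameter: calculated from every value in the population.
mu = statistics.mean(population)

# Statistic: calculated from a simple random sample of size n.
n = 10
sample = rng.sample(population, n)
x_bar = statistics.mean(sample)

print(f"N = {N}, population mean (mu)   = {mu:.1f}")
print(f"n = {n},  sample mean (x-bar)    = {x_bar:.1f}")
```

The two printed means will generally be close but not equal, which is why the statistic is described as an estimate of the parameter.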
Sampling Error: This is the difference between a parameter and a statistic. There will always be some error between the two since a statistic is an estimate of a parameter. Sampling error is attributed to chance error and sample bias.
Chance Error: The error inherent in taking information from a sample instead of from the whole population. This comes from the fact that two different samples from the same population will likely give two different statistics.

Sample Bias: The error from using a sample that does not represent the population. To avoid this, use some sort of random sample.
Sampling Rate: The fraction of the total population that is in the sample. This can be denoted by n/N.
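Chance error and the sampling rate can both be seen in a small simulation. This sketch assumes a hypothetical population in which exactly 30% strongly agree with a question, and it reports each error as the statistic minus the parameter; the sign convention is a choice made here, not something the definitions above specify.

```python
import random

rng = random.Random(5)  # fixed seed (an assumption) for repeatability

# Hypothetical population: 1 means "strongly agree", 0 means anything else.
population = [1] * 300 + [0] * 700
N = len(population)
p = sum(population) / N  # parameter: the population proportion

n = 50
sampling_rate = n / N    # fraction of the population that is in the sample

# Two different simple random samples from the same population.
sample_a = rng.sample(population, n)
sample_b = rng.sample(population, n)

p_hat_a = sum(sample_a) / n  # statistic from the first sample
p_hat_b = sum(sample_b) / n  # statistic from the second sample

print("sampling rate n/N:", sampling_rate)
print("p =", p, " p-hat A =", p_hat_a, " p-hat B =", p_hat_b)
print("errors:", p_hat_a - p, p_hat_b - p)
```

Both samples are random, so any gap between a p-hat value and p in this run comes from chance error rather than sample bias.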

Section 1.4: Should You Believe a Statistical Study?
Now we have looked at the basics of a statistical study, but how do you make sure that you conduct a good statistical study? You need to use the following guidelines.
Guidelines for Conducting a Statistical Study:

State the goal of your study precisely. Make sure you understand what you actually want to know before you collect any data. Determine exactly what you would like to learn about.

State the population you want to study and state the population parameter(s) of interest.

Choose a sampling method. A simple random sample is the best type of sample, though sometimes a stratified or cluster sample may be better depending on the question you are asking.

Collect the data for the sample and summarize these data by finding sample statistics.

Use the sample statistics to make inferences about the population parameters.

Draw conclusions: Determine what you learned and whether you achieved your goal.
The mistake that most people make when doing a statistical study is to collect the data, and then look at the data to see what questions can be answered. This is actually backwards. So, make sure you know what question you want to answer before you collect any data.
Even if you do not conduct your own study, you will be looking at studies that other people have conducted. Every day you hear and see statistics on the news, in newspapers and magazines, on the Internet and other places. Some of these statistics may be legitimate and beneficial, but some may be inaccurate and misleading. Here are some steps to follow when evaluating whether or not a statistical study is believable.
Steps for Determining whether a Statistical Study is Believable:
1. Are the population, goal of the study, and type of study clearly stated?
You should be able to answer the following questions when reading about a statistical study:

Does the study have a clear goal? What is it?

Is the population defined clearly? What is it?

Is the type of study used clear and appropriate?
2. Is the source of the study identified? Are there any concerns with the source?
A study may not have been conducted fairly if those who funded the study are biased.
Example 1.4.1: Source of Study 1
Suppose a study is conducted to find out the percentage of United States college professors that belong to the Libertarian party. If this study was paid for by the Libertarian party, or another political party, then there may have been bias involved with conducting the study. Usually an independent group is a good source for conducting political studies.
Example 1.4.2: Source of Study 2
There was once a full-page ad in many of the newspapers around the U.S. that said that global warming was not happening. The ad gave some reasons why it was not happening based on studies conducted. At the bottom of the page, in small print, were the words that the study and ad were paid for by the oil and gas industry. So, the study may have been a good study, but since it was funded by the industry that would benefit from the results, you should question the validity of the results.
3. Are there any confounding variables that could skew the results of the study?
Confounding variables are other possible causes that may produce the effect of interest besides the variable under study. In a scientific experiment researchers may be able to minimize the effect of confounding variables by comparing the results from a treatment group versus a control group.
Example 1.4.3: Confounding Variable
A study was done to show that microwave ovens were dangerous. The study involved plants, where one plant was given tap water and one plant was given water that was boiled in the microwave oven. The plant given the water that was boiled died. So the conclusion was that microwaving water caused damage to the water and thus caused the plant to die. However, it could easily have been the fact that boiling water was poured onto the plant that caused the plant to die.
4. Could there be any bias from the sampling method that was used?
Sometimes researchers will take a sample from the population and the results may be biased.
Selection Bias: This occurs when the sample chosen from the population is not representative of the population.
