Lecture 6: Probability Theory
I. Basic Notions of Probability
Classical, empirical, and subjective probability: All of these in some way relate to the frequency with which you expect an event to occur. Frequency simply means the fraction of the time that a particular event happens (in a defined type of situation).
Classical probability applies to situations where you know all the possible outcomes, such as when throwing dice or flipping a coin. Our assumption in such cases is that we know the underlying distribution (e.g., we know that every side of a die is equally likely, we know we have a “fair” coin).
Empirical probability applies when we have to estimate a frequency based on actual observations. This is what we do when we don’t actually know the underlying distribution and have to estimate it. (If you weren’t sure if a coin was fair or a die was weighted, you’d have to use empirical probability to try to find out.)
Subjective probability applies in situations that cannot really be repeated, so we have to imagine repeating them to make sense of the notion of “frequency.” For example, “What is the probability Jack and Diane will eventually get married?” This is a unique situation; Jack and Diane are unique individuals. But we might imagine repeating the Jack-and-Diane relationship (or Jack-and-Diane-like-relationships) many, many times to see how often marriage resulted.
Basics of probability. We let P(A) = the probability (that is, the true frequency) of some event defined by A.
P(A) = 1 means the event occurs with certainty; P(A) = 0 means that it will certainly not occur. P(A) is always in the interval [0, 1].
The probability of all events put together must add up to 1, so long as we don’t double-count by including events that overlap. Another way of putting this: something must happen, even if it’s the absence of anything else. E.g., say we’re interested in the kinds of pet a randomly chosen household will have. Either they will have a cat, or they’ll have a dog, or they’ll have a fish, or they’ll have some other kind of pet, or they’ll have no pet at all. If you add all these together (being sure not to double-count the chance of having more than one kind of pet), the probabilities add up to one. Either they have a pet of some kind or they don’t.
The complement of event A is denoted ~A (or in the book A’). This is the event that A does not occur. P(~A) = 1 – P(A), so P(A) + P(~A) = 1.
II. Intersections and Unions of Events
The intersection of two events is when two events both happen. What is the probability that a randomly chosen voter is both African-American and a Democrat? If we let A = voter is African-American and B = voter is Democrat, the intersection is denoted A ∩ B = voter is both African-American and Democrat. P(A ∩ B) = the probability that a randomly chosen voter is both African-American and Democrat.
The union of two events is when one or the other happens, or both. What is the probability that a randomly chosen voter is either African-American or a Democrat? (Note that, by convention, we interpret this “or” not to exclude voters who are both African-American and Democrat.) Using the same A and B as above, the union is denoted A U B = voter is either African-American or Democrat. P(A U B) = the probability that a randomly chosen voter is African-American or Democrat (or both).
Think of intersection as “and,” union as “or.”
Probability of the union can be found using this formula:
P(A U B) = P(A) + P(B) – P(A ∩ B)
Why do we need to subtract P(A ∩ B)? To keep us from double-counting. P(A), the probability of African-American, includes African-American Democrats. P(B), the probability of Democrat, also includes African-American Democrats. But we should only be including African-American Democrats once, not twice.
If the two events are mutually exclusive, meaning they cannot happen at the same time, then the formula simplifies to P(A U B) = P(A) + P(B).
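Consider this breakdown of 100 voters (I made up these numbers, so don’t take them seriously):
                        Democrat   Republican   Total
African-American           12          4          16
Not African-American       40         44          84
Total                      52         48         100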
Given these numbers,
P(A) = 16/100 = 0.16
P(B) = 52/100 = 0.52
P(A ∩ B) = 12/100 = 0.12
P(A U B) = P(A) + P(B) – P(A ∩ B) = 0.16 + 0.52 – 0.12 = 0.56
(in other words, all the Democrats plus the African-American Republicans)
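The union calculation can be checked by counting. Here is a minimal Python sketch, assuming the made-up figures from the example (100 voters, 16 African-American, 52 Democrats, 12 who are both). The variable names are illustrative only.

```python
from fractions import Fraction

# Made-up counts from the example: 100 voters in total.
total = 100
n_a = 16      # African-American voters (event A)
n_b = 52      # Democrats (event B)
n_both = 12   # African-American Democrats (A and B)

p_a = Fraction(n_a, total)
p_b = Fraction(n_b, total)
p_both = Fraction(n_both, total)

# Addition rule: subtract the overlap so it is counted only once.
p_union = p_a + p_b - p_both

# Cross-check by direct counting: all Democrats plus the
# African-American voters who are not Democrats.
n_union = n_b + (n_a - n_both)

print(float(p_union))   # 0.56
print(n_union)          # 56
```

Both routes give 56 voters out of 100, which is why subtracting P(A ∩ B) is the right correction.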
III. Conditional Probabilities
A conditional probability tells you the probability of some event given that you already know another event has occurred.
For instance, what is the probability a voter is a Democrat given that he is African-American? This is designated P(B|A). Or what is the probability a voter is African-American given that he is a Democrat? This is designated P(A|B).
The formula for conditional probability is:
P(A|B) = P(A ∩ B)/P(B)
So if we’re interested in P(A|B) for our definitions above, that is, the probability a voter is African-American given that he is a Democrat, the answer is
P(A|B) = 0.12 / 0.52 = 0.23
If you think about it, all we’ve really done here is treat the “given” event as the whole population. In this case, Democrat is the given event, so our relevant population is the 52 Democrats. Of those Democrats, how many are African-American? 12. And 12 as a fraction of 52 is 0.23.
What about P(B|A)?
P(B|A) = P(A ∩ B)/P(A) = 0.12 / 0.16 = 0.75
Again, we’ve essentially treated the “given” event – this time that the voter is African-American – as the relevant population. There are 16 African-Americans, and 12 of them are Democrats, for a fraction of 0.75.
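The two conditional probabilities can be reproduced in a few lines. This is a sketch, assuming the same made-up voter figures. The helper name `conditional` is hypothetical, not something from the text.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """P(X | Y) = P(X and Y) / P(Y), assuming P(Y) > 0."""
    if p_given == 0:
        raise ValueError("cannot condition on an event with probability zero")
    return p_joint / p_given

p_a = Fraction(16, 100)         # P(African-American)
p_b = Fraction(52, 100)         # P(Democrat)
p_a_and_b = Fraction(12, 100)   # P(both)

p_a_given_b = conditional(p_a_and_b, p_b)   # 12 out of the 52 Democrats
p_b_given_a = conditional(p_a_and_b, p_a)   # 12 out of the 16 African-Americans

print(round(float(p_a_given_b), 2))  # 0.23
print(float(p_b_given_a))            # 0.75
```

The joint probability is the same in both cases; only the "given" population in the denominator changes.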
IV. Multiplication Rule
Events are called independent if the occurrence of one does not affect the probability of the other. That is, P(A|B) = P(A).
A good example of independent events is rolls of dice or flips of a coin. Even if you’ve just thrown heads 10 times in a row, the probability of heads is still the same as it ever was (that is, one-half if this is a fair coin). The gambler’s fallacy is the mistaken belief that events are dependent when they’re really not; gamblers often speak of “hot streaks” as though they are likely to continue.
The multiplication rule says that if A and B are independent events, then you can multiply the probabilities together to get the joint probability (the probability of the intersection). That is,
P(A ∩ B) = P(A) ∙ P(B)
Example: What is the probability of rolling snake eyes? This is the probability of die #1 coming up “1” and die #2 also coming up “1.” P(snake eyes) = (1/6)(1/6) = 1/36.
This rule actually follows directly from the rule for conditional probabilities, once we include independence:
P(A) = P(A|B) = P(A ∩ B)/P(B)
And now just multiply through by P(B) to get the multiplication rule.
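One way to convince yourself of the multiplication rule is to list every outcome of two dice and count. The sketch below assumes two fair six-sided dice. The helper `prob` is a hypothetical name for "favorable outcomes over all outcomes."

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
outcomes = list(product(faces, repeat=2))   # 36 equally likely (die1, die2) pairs

def prob(event):
    """Classical probability: favorable outcomes / all outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

p_first_one = prob(lambda o: o[0] == 1)      # die #1 shows 1
p_second_one = prob(lambda o: o[1] == 1)     # die #2 shows 1
p_snake_eyes = prob(lambda o: o == (1, 1))   # both show 1

print(p_first_one, p_second_one, p_snake_eyes)      # 1/6 1/6 1/36
print(p_snake_eyes == p_first_one * p_second_one)   # True
```

The joint probability found by counting matches the product of the two individual probabilities, as independence requires.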
V. Bayes’ Rule
Bayes’ Rule is a formula for finding a conditional probability P(B|A) given information about P(A|B). It’s really not useful in the example above, where you have all the information you need to calculate P(B|A) directly. But what if you didn’t have all that information?
Let’s say you want to know the likelihood that a voter is a Democrat, given that he’s African-American. Instead of the information given in the table above, suppose you only know the following: 23% of Democrats are African-American, 8.3% of Republicans are African-American, and Democrats constitute 52% of the population.
Bayes’ Rule says:
P(B|A) = P(A|B)∙P(B) / P(A) = P(A|B)∙P(B) / [P(A|B)∙P(B) + P(A|~B)∙P(~B)]
(This is actually two equivalent formulas; use the one that’s more convenient.)
The idea is this: The numerator is the likelihood of having both events (in this case, African-American and Democrat) occur. The bottom is the likelihood of event A occurring, and it can happen in two ways: either with B or without B. So we’re saying: of all the times that A occurs, how many of those times involve both A and B occurring?
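In this case, with A = voter is African-American and B = voter is Democrat,
P(B|A) = (0.23)(0.52) / [(0.23)(0.52) + (0.083)(0.48)] = 0.1196 / 0.1594 = 0.75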
Notice that this is just what we found from the initial data, but we were able to find it using more limited information.
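The voter calculation can be written as a small function. This is a sketch of the expanded form of Bayes’ Rule (the denominator split into "A with B" and "A without B"), using the three figures given above. The function and argument names are illustrative.

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """Return P(B | A) using the expanded form of Bayes' Rule."""
    p_not_b = 1 - p_b
    numerator = p_a_given_b * p_b                         # P(A and B)
    denominator = numerator + p_a_given_not_b * p_not_b   # P(A), both ways it can happen
    return numerator / denominator

# 23% of Democrats and 8.3% of Republicans are African-American;
# Democrats are 52% of the population (made-up numbers from the example).
p_dem_given_aa = bayes(p_a_given_b=0.23, p_b=0.52, p_a_given_not_b=0.083)
print(round(p_dem_given_aa, 2))   # 0.75
```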
Medical Testing. Probably the most important applications of Bayes’ Rule involve medical testing, such as for disease or drug use. Suppose a school implements a new random drug testing program. Students’ names are picked randomly from the registration records, and the selected students have to take a drug test. It is known that 5% of all students use drugs. It is also known that the drug test’s false positive rate is 1% (a false positive means indicating drug use when it did not in fact occur). Its false negative rate is 2% (a false negative means indicating no drug use even though it did occur). If a student tests positive for drugs, what is the probability that he actually did use drugs? Most people would say, “99 percent.” But that answer is wrong!
Let B = student used drugs
A = test was positive
We want to know P(B|A).
We know P(B) = 0.05, P(~B) = 0.95, P(A|B) = 0.98, and P(A|~B) = 0.01. So by Bayes’ Rule,
P(B|A) = (0.98)(0.05) / [(0.98)(0.05) + (0.01)(0.95)] = 0.049 / 0.0585 = 0.84
That is, the chance a randomly selected student who tested positive for drugs actually used drugs is only about 84%, not the 99% that some people naively assume.
The result is even more dramatic with tests for diseases, such as HIV, where the fraction of the public that has the disease (that is, P(B)) is very small. The probability that a randomly tested person (that is, a person getting tested with no particular reason to think he’s been exposed) actually has the disease can be as low as 50%, or even less.
To see the logic, suppose we have a population of 10,000 people. It is known that 1 in 200 people (that is, 50 people total) have the disease. And suppose the test for the disease has 98% accuracy (false positive and false negative rates of 2%). Of the 50 people with the disease, (0.98)(50) = 49 will test positive. Of the 9,950 people without the disease, (0.02)(9950) = 199 will test positive. That’s a total of 248 positive test results, but only 49 of those people actually have the disease. That’s only 19.8%. The remaining 80.2% of those who tested positive are disease-free.
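Both testing examples follow the same pattern, so they can share one function. This is a sketch under the stated rates. The name `positive_predictive_value` is my label for P(condition | positive test) and does not appear in the text.

```python
def positive_predictive_value(prevalence, false_positive_rate, false_negative_rate):
    """P(has condition | tested positive), by Bayes' Rule."""
    true_positive_rate = 1 - false_negative_rate
    true_pos = true_positive_rate * prevalence            # has it and tests positive
    false_pos = false_positive_rate * (1 - prevalence)    # lacks it but tests positive
    return true_pos / (true_pos + false_pos)

# Drug test: 5% of students use drugs, 1% false positives, 2% false negatives.
drug = positive_predictive_value(0.05, 0.01, 0.02)

# Disease: 1 in 200 people have it, 2% false positives, 2% false negatives.
disease = positive_predictive_value(1 / 200, 0.02, 0.02)

print(round(drug, 3))      # 0.838
print(round(disease, 3))   # 0.198
```

Lowering the prevalence while holding the error rates fixed is what drives the second result down so sharply: the false positives come from a much larger group.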
The Monty Hall Problem. You’re on a game show. Monty Hall, the host, presents you with three doors. Behind one there’s a bag of gold; behind the other two there are goats. You pick door A. Before you open it, the host opens door B to reveal there’s a goat behind it. Then he offers you a chance to switch to door C. Should you?
To answer properly, you should understand that the host knows where the gold is, and he’ll never open the door on the gold. Does that change your answer?
We want to find P(A has gold | host shows B has goat).
We know P(A has gold) = 1/3, P(A has goat) = 2/3
P(host shows B has goat | A has gold) = 1/2 (because he randomly opens B or C)
P(host shows B has goat | A has goat) = 1/2 (because he always opens whichever remaining door has the other goat, and there’s a 1/2 chance B is that door)
Numerator = P(host shows B has goat | A has gold)∙P(A has gold) = (1/2)(1/3)
Denominator = P(host shows B has goat | A has gold)∙P(A has gold) + P(host shows B has goat | A has goat)∙P(A has goat)
= (1/2)(1/3) + (1/2)(2/3) = 1/6 + 1/3 = 1/2
So by Bayes’ Rule, P(A has gold | host shows B has goat) = (1/2)(1/3)/(1/2) = 1/3.
And if there’s only a 1/3 chance the gold is behind A, and the host has already revealed that it’s not behind B, then there’s a 2/3 chance it’s behind C. You’re better off switching!
Here’s the logic. One-third of the time, you will have guessed correctly. Suppose you adopt a policy of always sticking with your original choice. Then obviously, you will win 1/3 of the time. Can you improve on that? Suppose you adopt the alternative policy of switching. That means 1/3 of the time you’ll lose because you guessed correctly to begin with. The other 2/3 of the time, you’ll switch to one of the other two doors, one of which must have the gold. And since Monty Hall has already eliminated the other door that has the goat, you’ll be switching to the door with gold. Thus, you win 2/3 of the time.
The key here is realizing that Monty Hall’s action reveals information. The result would be different if Monty Hall chose randomly between doors B and C (and therefore sometimes revealed the gold, causing you to lose immediately).
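The difference between an informed host and a random one can be worked out exactly by enumerating where the gold might be. The sketch below assumes you always pick door A and conditions on the host opening door B and showing a goat. The function names and the `knows` flag are illustrative, not from the text.

```python
from fractions import Fraction

doors = ["A", "B", "C"]
pick = "A"   # the contestant's initial choice

def host_choices(gold, knows):
    """Return {door_opened: probability} for the host, given where the gold is."""
    others = [d for d in doors if d != pick]
    if knows:
        # Informed host never opens the gold door; if both other doors
        # hide goats, he picks between them at random.
        allowed = [d for d in others if d != gold]
    else:
        # Variant: host opens one of the other two doors at random.
        allowed = others
    return {d: Fraction(1, len(allowed)) for d in allowed}

def posterior_given_b_goat(knows):
    """P(gold location | host opened B and it showed a goat)."""
    joint = {}
    for gold in doors:
        p_open_b = host_choices(gold, knows).get("B", Fraction(0))
        if gold == "B":
            p_open_b = Fraction(0)   # opening B would show gold, not a goat
        joint[gold] = Fraction(1, 3) * p_open_b
    total = sum(joint.values())
    return {gold: p / total for gold, p in joint.items()}

informed = posterior_given_b_goat(knows=True)
random_host = posterior_given_b_goat(knows=False)
print(informed["A"], informed["C"])         # 1/3 2/3
print(random_host["A"], random_host["C"])   # 1/2 1/2
```

With an informed host, switching wins 2/3 of the time. With a host who opens a door at random and happens to show a goat, the two remaining doors are equally likely, so his action carries no extra information about door C.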
VI. Probability Distributions
A probability distribution is a specification of all different possible values of a random variable along with a measure of the frequency for each of those values.
There are two kinds of random variables and thus two kinds of probability distributions: discrete and continuous.
A discrete variable is one that can take only a countable number of different values. (Countable is not the same as finite. Countable means that you could use the numbers 1, 2, 3, etc. to designate the possible outcomes.) Examples of discrete variables would be the value of a die roll (there are exactly 6 possible outcomes), the outcome of a coin flip (2 possible outcomes), number of marriages (always a whole number 0, 1, 2, etc.; it’s impossible to have a fraction of a marriage), etc.
A continuous variable is one that is measured on a continuous (unbroken) number scale, and therefore can take on an uncountable and infinite number of possible values. Examples include height, weight, time, and so on. If these variables are treated as discrete sometimes (e.g., we don’t report heights down to infinitely small units, but in discrete numbers of inches), it is because we can’t measure with infinite precision and because it’s often convenient to round off the results.
With discrete variables, we can reasonably assign a probability value to each possible outcome. We can represent it with a function such as the following, which is for the roll of a die:
P(x) = 1/6 for x = 1, 2, 3, 4, 5, 6
To show that countable does not mean finite, consider the probability distribution for the following process: flip a coin until it comes up heads, and let x = the flip on which the first heads came up. The probability of x = 1 is ½; the probability of x = 2 is ¼; etc.; and there is no maximum possible value of x. We would write this like so:
P(x) = (1/2)^x for x = 1, 2, 3, …
For discrete probability distributions, the value of the function P(x) can be interpreted as the probability of the value x.
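The two discrete distributions just described can be written as functions and checked to see that their probabilities add up to 1. This is a sketch assuming a fair die and a fair coin, so P(x) = 1/6 for the die and P(x) = (1/2)^x for the first-heads process. The function names are illustrative.

```python
from fractions import Fraction

def die_pmf(x):
    """P(x) for a fair six-sided die."""
    return Fraction(1, 6) if x in range(1, 7) else Fraction(0)

def first_heads_pmf(x):
    """P(first heads occurs on flip x) for a fair coin: (1/2)**x."""
    return Fraction(1, 2) ** x if x >= 1 else Fraction(0)

# The six die probabilities sum to exactly 1.
print(sum(die_pmf(x) for x in range(1, 7)))   # 1

# The first-heads probabilities never stop, but their running total
# approaches 1: after 20 terms only (1/2)**20 is left over.
partial = sum(first_heads_pmf(x) for x in range(1, 21))
print(1 - partial)   # 1/1048576
```

The second distribution has infinitely many possible values, yet each value still has its own positive probability, which is what makes it discrete rather than continuous.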
But for continuous probability distributions, we cannot assign a probability to each possible value. Why not? Because there is an uncountably infinite number of possible values. The probability of any given (and specifically defined) value is zero. What we really want to know is the probability of the value falling within a certain interval. For example, what’s the probability of an American man being 6’2”? The probability of anyone being exactly 6’2”, and not the tiniest fraction taller or shorter, is zero. But we can talk about the probability of a man being between 6’1.5” and 6’2.5”, and that probability is greater than zero.
We represent a continuous probability distribution with a probability density function (pdf) such as the following:
f(x) = 1/10 for 0 ≤ x ≤ 10, and f(x) = 0 otherwise
This function defines a uniform distribution over the interval [0,10]. Every value in the range from 0 to 10 can occur (and not just 0, 1, 2, etc., but all the fractional values in between). We cannot interpret f(x) as the probability of the value x, because there are more than 10 possible values of x, so the probabilities would add up to more than 1. And that would clearly be wrong anyway, because the chance of (say) x = 2 is not 1/10.
What f(x) does do for us is allow us to find the probability of intervals. We do this by looking at the area underneath the curve defined by f(x). [Draw graph of this function: a horizontal line at f(x) = 1/10, going from x = 0 to x = 10.] Note that the total area underneath this function is 1. This makes sense, because all probabilities must add up to 1, and no value can fall outside the interval [0,10]. Note also that the area under the curve for any interval with a length of one, such as [0,1] or [1,2] or [3.5,4.5] is equal to 1/10.
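For the uniform distribution, the area under the curve is just a rectangle, so interval probabilities are easy to compute. This is a sketch assuming the uniform density on [0, 10] described above. The function name and its `low`/`high` parameters are illustrative.

```python
def uniform_interval_prob(a, b, low=0.0, high=10.0):
    """Area under the uniform pdf between a and b (a rectangle)."""
    height = 1.0 / (high - low)   # f(x) inside [low, high]
    left = max(a, low)            # clip the interval to where f(x) > 0
    right = min(b, high)
    width = max(0.0, right - left)
    return width * height

print(uniform_interval_prob(0, 10))      # 1.0  (the whole range)
print(uniform_interval_prob(3.5, 4.5))   # 0.1  (any interval of length one)
print(uniform_interval_prob(2, 2))       # 0.0  (a single point has no width)
```

The last line is the continuous-variable point in miniature: a single exact value has zero width, so it has zero probability.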
There are many different continuous distributions; the uniform distribution is a very simple one. (Also, we could have defined our uniform distribution over any interval we wanted, such as [0,1] or [-1,1] or [50,100] or whatever.) We will be concerned, for the most part, with just one: the normal distribution. The normal distribution has a pdf that looks like this:
f(x) = [1 / (σ√(2π))] ∙ e^(–(x – μ)² / (2σ²))
This is the first and last time we’ll be looking at this formula. The important thing to note is that, except for x, everything else in there is a parameter – that is, a fixed number. Pi and e are just irrational numbers that happen to turn up a lot in the world. Mu (μ) and sigma (σ), as you know, are the mean and standard deviation of a population. Note that these can take on many different values, depending on the population you’re talking about.
If you graphed this function, you’d get the famous bell curve. [Draw it, with μ marked as the center of the distribution.] Just as with the uniform distribution, the value of f(x) doesn’t have any important meaning. What does matter is the area underneath the curve for any given interval. The area under the whole curve (that is, in the interval [-∞, ∞]) is equal to 1, just as with the uniform distribution. The area under the left half is ½; the area under the right half is ½.
Knowing the standard deviation lets us get even more information. It turns out that the area in the interval [μ – σ, μ + σ] is equal to approximately 0.68, or just over 2/3. That is, the probability of x falling within one standard deviation of the mean is about 2/3. It also turns out the area in the interval [μ – 2σ, μ + 2σ] is equal to approximately 0.95, meaning the probability of x falling within two standard deviations of the mean is about 19/20 (and the probability of falling outside that range is about 1/20).
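These areas can be checked numerically by adding up thin slices under the bell curve. The sketch below assumes the standard normal pdf formula, f(x) = [1 / (σ√(2π))] ∙ e^(–(x – μ)² / (2σ²)), and uses the trapezoid rule. The values μ = 100 and σ = 15 are arbitrary illustrations, and any other choice gives the same areas.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def area(a, b, mu, sigma, steps=10_000):
    """Approximate the area under the pdf on [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

mu, sigma = 100.0, 15.0   # illustrative values only
within_1 = area(mu - sigma, mu + sigma, mu, sigma)
within_2 = area(mu - 2 * sigma, mu + 2 * sigma, mu, sigma)
left_half = area(mu - 10 * sigma, mu, mu, sigma)   # 10 sigma stands in for minus infinity

print(round(within_1, 2), round(within_2, 2), round(left_half, 2))   # 0.68 0.95 0.5
```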
The standard normal distribution is the normal distribution with mean of zero and standard deviation of 1. If you plugged those into the pdf above, you’d get:
f(z) = [1 / √(2π)] ∙ e^(–z² / 2)
(We changed x to z because, for historical reasons, we happen to call the standard normal variable z instead of x.) Again, this is the first and last time we’ll see this formula. The important part is that in any statistics book, you’ll find a table that summarizes lots of information about the area underneath the standard normal bell curve. It’s Table 3, on p. 348-9, in our text.
And it turns out we can use the information in that table for any normal distribution by making a simple conversion. If you have a variable x that is normally distributed with mean μ and standard deviation σ, you can convert any value of that variable into an equivalent value of a standard normal variable using the following:
z = (x – μ)/σ
This is called a z-score, and it can be interpreted as the number of standard deviations the value x is from the mean.
For each value of z, Table 3 gives the area under the bell curve and to the left of z. In other words, it gives the area under the curve in the interval [-∞, z]. [Draw picture for z = 1.] The table tells us the area under the curve to the left of z = 1.00 is 0.8413. We can easily find the area to the right of z by taking one minus the area given in the table. Thus, the area to the right of z = 1 is 1 – 0.8413 = 0.1587.
Because the standard normal distribution is symmetrical, the area to the left of any value z is equal to the area to the right of –z, and the area to the right of any value z is equal to the area to the left of –z. Thus, since the area to the right of z = 1 is 0.1587, the area to the left of z = –1 is 0.1587 as well.
And we can easily find the area between any two values of z by subtracting the area to the left of the lower value from the area to the left of the higher value. If we wanted the area between z = –1 and z = 1, we’d note that the area to the left of z = -1 is 0.1587 (as shown above), and the area to the left of z = 1 is 0.8413. Subtract the former from the latter to get 0.8413 – 0.1587 = 0.6826.
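The table lookups above can be reproduced with Python’s standard library, which is handy for checking your reading of the table. This sketch uses `statistics.NormalDist` (available in Python 3.8 and later), whose `cdf` method returns the area to the left of a value, which is the same quantity the table reports.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

left_of_1 = std_normal.cdf(1.0)          # area to the left of z = 1
right_of_1 = 1.0 - left_of_1             # area to the right of z = 1
left_of_minus_1 = std_normal.cdf(-1.0)   # by symmetry, equals right_of_1
between = left_of_1 - left_of_minus_1    # area between z = -1 and z = 1

print(round(left_of_1, 4))         # 0.8413
print(round(right_of_1, 4))        # 0.1587
print(round(left_of_minus_1, 4))   # 0.1587
print(round(between, 4))           # 0.6827
```

The last figure is 0.6827 rather than 0.6826 only because the code subtracts unrounded areas, while the hand calculation subtracts two values already rounded to four decimals.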
We can do the same thing for any interval, not just symmetrical ones.
Example: The mean IQ is 100, and the standard deviation is 15. Assuming IQ is normally distributed, what fraction of people have an IQ between 120 and 145? First, convert both of these into z-scores: (120 – 100)/15 = 1.33 and (145 – 100)/15 = 3.00. The area to the left of 1.33, as given in Table 3, is 0.9082. The area to the left of 3.00 is 0.9987. Subtract 0.9082 from 0.9987 to get 0.0905, or about 9%.
Example: Prof. Nerdberger’s class grades are normally distributed, with a mean of 65 and a standard deviation of 17. (Yes, he sometimes gives scores above 100.) What fraction of students get B’s, if the B range is 80 to 90? Convert these to z-scores: (90 – 65)/17 = 1.47; (80 – 65)/17 = 0.88. The area to the left of z = 0.88 is 0.8106; the area to the left of z = 1.47 is 0.9292. Subtracting, we get 0.9292 – 0.8106 = 0.1186, or almost 12%.
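The two examples above follow the same recipe: convert to z-scores, look up two areas, subtract. Here is a sketch of that recipe, assuming normally distributed values and rounding z to two decimals the way a printed table forces you to. The function names are illustrative.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

def fraction_between(low, high, mu, sigma):
    """Share of a normal population falling between low and high."""
    std = NormalDist()   # standard normal: mean 0, standard deviation 1
    z_low = round(z_score(low, mu, sigma), 2)     # rounded as in a table lookup
    z_high = round(z_score(high, mu, sigma), 2)
    return std.cdf(z_high) - std.cdf(z_low)

iq = fraction_between(120, 145, mu=100, sigma=15)
grades = fraction_between(80, 90, mu=65, sigma=17)

print(round(iq, 2))       # 0.09
print(round(grades, 2))   # 0.12
```

Small differences in the third or fourth decimal place from the hand calculations are expected, since the table values are themselves rounded.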
You can use this kind of information to solve some economic problems.
Example: You run a sandwich shop that also sells bowls of soup. Your price for a bowl of soup is $5 (assume you’ve already set this optimally). You’ve discovered that the number of bowls of soup requested by customers per day is normally distributed with mean of 20 and standard deviation of 5. You fix all of your soup at the beginning of the day, and you currently put in enough ingredients for 20 bowls’ worth. It costs you $1 to add another bowl’s worth. Should you increase your number of bowls prepared, and if so, by how much? We will use marginal analysis: compare the marginal cost (MC) of preparing another bowl with the expected marginal revenue (MR) from selling it. If MR > MC, prepare the bowl.
So, if you prepare a 21st bowl’s worth, what is the chance it will be sold? 21 means a z-score of z = 0.20. Table 3 gives a value of 0.58, meaning there’s a 58% chance you’ll sell less than or equal to 21, or a 42% chance you’ll sell 21 or more. Because of the fact that 21 is included in both (less than or equal to 21, 21 or more), we need to think a little more carefully.
The problem is that we’re using a normal distribution, which is for continuous variables, to approximate a discrete variable. We can do this by thinking of the number of actual bowls (the discrete variable) as the rounded-off form of a continuous variable. If you think about the non-existent bowl 20.5, you get a z-score of 0.10, and the table tells us there’s a 54% chance of less than 20.5 – that is, 20 or fewer bowls; and thus there’s a 46% chance of getting more than 20.5 – that is, 21 or more bowls. Your expected MR from preparing bowl #21 is 0.46($5) = $2.30, which exceeds the MC of $1. What about the 22nd bowl? Using z = 0.30 (bowl 21.5), we find a 1 – 0.62 or 38% chance of selling at least 22 bowls; MR = 0.38($5) = $1.90 > $1, so you prepare it. The table below summarizes the rest of the calculations.
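Bowl   Cutoff   z      P(x > cutoff)   Expected MR   MR > MC?
21     20.5     0.10   0.46            $2.30         yes
22     21.5     0.30   0.38            $1.90         yes
23     22.5     0.50   0.31            $1.55         yes
24     23.5     0.70   0.24            $1.20         yes
25     24.5     0.90   0.18            $0.90         no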
So you’d want to prepare 24 bowls’ worth and no more. (Unless, perhaps, you were worried that you’d lose some angry customers forever by running out of soup. How often would you turn away a customer? About 18% of the time, which is the chance that demand exceeds 24.5, or z = 0.90.)
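The marginal analysis can be sketched as a loop that keeps adding bowls while the expected marginal revenue exceeds the marginal cost. This assumes the figures from the example (demand normal with mean 20 and standard deviation 5, price $5, marginal cost $1) and the half-bowl cutoff described above. The function names are illustrative.

```python
from statistics import NormalDist

demand = NormalDist(mu=20, sigma=5)   # bowls requested per day, per the example
price, marginal_cost = 5.0, 1.0

def prob_sell(bowl):
    """Chance that bowl number `bowl` is sold, using the half-bowl cutoff."""
    return 1.0 - demand.cdf(bowl - 0.5)

def expected_mr(bowl):
    """Expected marginal revenue from preparing bowl number `bowl`."""
    return price * prob_sell(bowl)

bowls = 20   # the number currently prepared
while expected_mr(bowls + 1) > marginal_cost:
    bowls += 1

for b in range(21, 26):
    print(b, round(prob_sell(b), 2), round(expected_mr(b), 2))
print("prepare", bowls)   # prepare 24
```

Because the code does not round the probabilities to two decimals first, its expected MR figures can differ by a cent or so from hand calculations done with a printed table. The stopping point is the same.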
What if your number of soup-requesters per day were not normally distributed? Then this approach wouldn’t work quite as well. What could you do? Maybe transform your number of bowls using a natural log…