- The Rain in Seattle problem
Suppose you are interviewing for a data science job and you are asked this question (from glassdoor.com):
You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?
My first thought is that my friends are biased toward truth, so getting three "yes" answers suggests there is a greater than 50% chance that it is raining in Seattle. If they were more likely to lie than tell the truth, the three "yes" answers would make me think it probably isn't raining. But even though they tend toward truth, it is possible that they are all lying. So I can't be 100% sure it is raining in Seattle. My answer should be a number between 50% and 100%, so I find 2/3 an attractive option.
But this seems to assume that there is a 50% chance of rain in Seattle before I call any friends. Does that matter?
Suppose instead I want to know if it is raining frogs in Seattle. I saw something on TV saying it is raining frogs in Seattle, but didn't notice if it was an ad or a news item. Normally I would rate it very unlikely that it is raining frogs in Seattle, say one in a million chance (since I have heard stories of frogs falling out of the sky). I want to know if I should pack a helmet, so I call my three friends to ask about the frogs. They all say "yes" it is raining frogs. This increases my estimate of the chances that it is in fact raining frogs in Seattle. But I still think it's very unlikely. It's more likely that all three friends have lied to me. The odds are one in three that one friend would lie, so the odds are 1 in 27 that all three friends would lie. Perhaps my updated estimate of raining frogs should be about 27 in a million, still very unlikely.
This seems reasonable. If I had enough 2/3 reliable friends to call and they all reported raining frogs, eventually I would be convinced that it's probably true.
This shows that I do need a prior estimate of rain in Seattle before I can update based on my friends' report. I believe it rains a lot in Seattle, but a 50% chance of rain on a random day seems a little high outside a rain forest. Suppose I put my estimate at 30% before I call my friends.
When the first friend says it is raining, there is a 2/3 chance the information is reliable. If the information were perfectly reliable, it would banish all doubt. If the information were 50% reliable (i.e. completely unreliable) I would stick with my prior estimate of 30%. If I could remember Bayes' Theorem I could probably come up with a good updated estimate.
I have time before my trip, so I look up the formula, planning to apply it to update my expectations after calling the first friend, Albert, and hearing him say "yes." Bayes' formula will give me the probability of one event given observation of another event. So I define two events:
A: It is raining in Seattle.
B: Albert says "yes."
The formula says the probability that it is raining, given Albert's "yes," is P(B|A) × P(A) / P(B).
P(B|A) is the probability that Albert says "yes" given that it is raining. The problem states that this is 2/3.
P(A) is the probability that it is raining. My estimate is 30%.
P(B) is the probability that Albert says "yes."
Now I am confused. What Albert says depends on whether it is raining. How can I compute the probability of what he says if I don't know whether it is raining?
So I change my definition of event B to "Albert reports honestly." This is the 2/3 chance given in the problem that my friend will be honest. The formula then gives me (2/3) × 30% / (2/3), resulting in the original 30%. It doesn't matter how Albert responds, because the 2/3 probability of Albert responding honestly cancels out in the numerator and denominator. This can't be right.
At this point I read the rest of the blog entry, which gives a definite solution but leaves me unclear as to where I steered wrong. Perhaps I should be using odds rather than probabilities. It also seems like I should incorporate the "yes" response in Event B somehow. b_b probably knows.
I can actually recommend a solid book by the author of the article that shows how to apply Bayes' theorem to data science using Python: Think Bayes. It's even linked in the sidebar next to the text, but I figured that some extra attention to it could be of use to some people. Edit: By the way, most problems that deal with Bayes' theorem are great exercises for practicing Monte Carlo methods, one of the more important tools in modern physics and modelling in general. Just bear in mind that the approach is not without pitfalls, so it should be applied with care.
The four-part Probability is Hard series convinced me of the utility of using a simulator to map out the probability space. Sometimes it seems like overkill to actually write and execute the code, but with these tricky problems it's good to be thorough. Let's see if I can talk my way through how a simulator would work.
I simulate a million days in Seattle. They break down as follows:
300,000 rainy days
700,000 sunny days
Each day, I call Albert.
On 200,000 rainy days, Albert is honest and says "yes it is raining."
On 100,000 rainy days, Albert lies and says "no it is not raining."
On 466,667 sunny days, Albert is honest and says "no it is not raining."
On 233,333 sunny days, Albert lies and says "yes it is raining."
In this problem Albert says yes. So it is actually raining in 200,000 out of 433,333 cases in which I hear yes, about 46% of the time.
Next I call Betty. Ignoring the cases in which Albert said "no" I find four cases:
Of the 200,000 rainy days on which Albert said "yes," Betty will be honest and say "yes" on 133,333 days.
Of the 200,000 rainy days on which Albert said "yes," Betty will lie and say "no" on 66,667 days.
Of the 233,333 sunny days on which Albert said "yes," Betty will be honest and say "no" on 155,556 days.
Of the 233,333 sunny days on which Albert said "yes," Betty will lie and say "yes" on 77,778 days.
In this problem Betty says yes. So it is actually raining in 133,333 days out of 211,111 days in which I hear a second yes, about 63% of the time.
Next I call Charlie. Ignoring the cases in which Betty said "no" I find four cases:
Of the 133,333 rainy days on which Albert and Betty said "yes," Charlie will be honest and say "yes" on 88,889 days.
Of the 133,333 rainy days on which Albert and Betty said "yes," Charlie will lie and say "no" on 44,444 days.
Of the 77,778 sunny days on which Albert and Betty said "yes," Charlie will be honest and say "no" on 51,852 days.
Of the 77,778 sunny days on which Albert and Betty said "yes," Charlie will lie and say "yes" on 25,926 days.
In this problem Charlie says yes. So it is actually raining in 88,889 days out of 114,815 in which I hear yes three times, about 77% of the time.
This does not match the given answer of 47%, but my prior estimate of rain (30%) is higher than the 10% used in the article. If my arithmetic is correct, this tedious approach seems like a reliable way to get a result that is comprehensible. I am still not sure how to use the formula to get an answer; the article throws in a "Bayes factor" which makes sense but seems like a shortcut.
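The bookkeeping in the walk-through above can be sketched in a few lines of Python. This is only an illustration of the same expected-count reasoning, not code from the article: the function name `yes_streak` and its parameters are made up for this example, and it assumes every friend is truthful with probability 2/3 and that only the days on which every friend so far said "yes" are kept.

```python
from fractions import Fraction

def yes_streak(prior, truth=Fraction(2, 3), calls=3, days=1_000_000):
    """Expected rainy and sunny day counts that survive each consecutive "yes"."""
    rainy = days * prior
    sunny = days * (1 - prior)
    rows = []
    for call in range(1, calls + 1):
        rainy *= truth        # rainy days on which this friend honestly says "yes"
        sunny *= 1 - truth    # sunny days on which this friend lies and says "yes"
        rows.append((call, rainy, sunny, rainy / (rainy + sunny)))
    return rows

# Assumed prior of 30%, as in the walk-through above.
rows = yes_streak(Fraction(3, 10))
for call, rainy, sunny, p_rain in rows:
    print(f"call {call}: {round(rainy):,} rainy, {round(sunny):,} sunny, "
          f"P(rain) = {float(p_rain):.3f}")
```

Using `Fraction` keeps the counts exact, so the rounding only happens when printing. With a 30% prior the third row works out to exactly 24/31, which is the "about 77%" found by hand.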
Perhaps it's shoddy coding on my part, but I got a different result as well.
Sample size for simulation: 1000000
It was actually raining for 111801 days out of a 1000000
In following the third person was saying the opposite:
Albert and Betty said yes 82200
Albert and Charlie said yes 81833
Betty and Charlie said yes 82640
Albert and Betty said yes and it was true 49596
Albert and Charlie said yes and it was true 49250
Betty and Charlie said yes and it was true 49631
Albert and Betty said yes and it was a lie 16646
Albert and Charlie said yes and it was a lie 16300
Betty and Charlie said yes and it was a lie 16681
It was raining and everyone said true 32950
It wasn't raining but everyone lied 32658
Albert
Times said 'yes' truthfully: 74377
Times said 'no' truthfully: 592897
Times said 'yes' and lied: 295302
Times said 'no' and lied: 37424
Betty
Times said 'yes' truthfully: 74422
Times said 'no' truthfully: 592248
Times said 'yes' and lied: 295951
Times said 'no' and lied: 37379
Charlie
Times said 'yes' truthfully: 74326
Times said 'no' truthfully: 592276
Times said 'yes' and lied: 295923
Times said 'no' and lied: 37475
I have used probabilities from the article. One out of three chance for a lie, one out of nine chance of rain being in Seattle. I can share my code if you would like to go for "maybe by spotting a problem with someone else's code I'll get some extra insight" type of exercise. Beware of sleepy Python though ;).
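For comparison, here is a minimal Monte Carlo sketch of the same setup. It is not the code that produced the numbers above (that code was not posted); the function name `simulate`, its parameters, and the fixed seed are all assumptions made for this illustration. The prior is a parameter, so either the article's 10% or the "one out of nine" used above can be plugged in.

```python
import random

def simulate(prior=0.1, truth=2/3, friends=3, days=200_000, seed=1):
    """Estimate P(rain | every friend says "yes") by brute-force sampling."""
    rng = random.Random(seed)
    all_yes = 0
    rain_and_all_yes = 0
    for _ in range(days):
        raining = rng.random() < prior
        # An honest friend reports the weather; a lying friend reports the opposite,
        # so the answer is "yes" exactly when `raining` and "is honest" agree.
        answers = [raining == (rng.random() < truth) for _ in range(friends)]
        if all(answers):
            all_yes += 1
            rain_and_all_yes += raining  # True counts as 1
    return rain_and_all_yes / all_yes

estimate = simulate()
print(f"P(rain | three yes) is roughly {estimate:.3f}")
```

Only the days on which all three friends say "yes" enter the final ratio; every other day is discarded, which is the conditioning step that is easy to get wrong. With a 10% prior the estimate should land near 8/17, about 0.47, give or take sampling noise.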
"one out of nine chance of rain being in Seattle"
They say it rains about 10% of the time, and "A base rate of 10% corresponds to prior odds of 1:9." I found this notation a bit confusing, but I think it corresponds to a one-in-ten chance of rain. For each rainy day, you get nine sunny days. The difference between probabilities expressed as a value between 0 and 1 and these "odds" (if that's what 1:9 is called) is likely part of my confusion in following the article.
It was actually Professor Brian who demonstrated how useful a simulation can be, while analyzing this bizarre problem:
Say you know a family has two children, and further that at least one of them is a girl named Florida. What is the probability that they have two girls?
But a simulator, based on a random number generator, seems like a good way to check our work. I still trust my numbers as long as I feel like I know what I am doing.
Say we start with a convenient number of days:
270 days
243 sunny
27 rainy
Each friend will be expected to give the same ratio of answers. We call Albert first and he responds:
True "no" on 2/3 of 243 sunny days = 162 days
False "yes" on 1/3 of 243 sunny days = 81 days
True "yes" on 2/3 of 27 rainy days = 18 days
False "no" on 1/3 of 27 rainy days = 9 days
So Albert says "yes" on 99 days, and on 81 days it is sunny and on 18 days it is raining. We increase our expectation of rain from 10% to 18/99, about 18%.
We expect the same ratio of responses from Betty:
True "no" on 2/3 of 81 sunny days = 54 days
False "yes" on 1/3 of 81 sunny days = 27 days
True "yes" on 2/3 of 18 rainy days = 12 days
False "no" on 1/3 of 18 rainy days = 6 days
So Betty says "yes" on 39 days, and on 27 days it is sunny and on 12 days it is raining. We increase our expectation of rain from 18/99 to 12/39, about 31%.
We expect the same ratio of responses from Charlie:
True "no" on 2/3 of 27 sunny days = 18 days
False "yes" on 1/3 of 27 sunny days = 9 days
True "yes" on 2/3 of 12 rainy days = 8 days
False "no" on 1/3 of 12 rainy days = 4 days
So Charlie says "yes" on 17 days, and on 9 days it is sunny and on 8 days it is raining. We increase our expectation of rain from 12/39 to 8/17, about 47%.
This seems very clear and agrees with the given answer. But I still haven't used Bayes' theorem.
Using odds instead of percentages makes this problem simple. Our estimate of rain in Seattle is 1:9 (one rainy day for every nine sunny days). Our friends tell us the truth with odds of 2:1 (two truthful reports for each false report). So after calling one friend and hearing "yes it's raining" we multiply 1:9 by 2:1 and get odds of 2:9 as our new expectation of rain. That's 2/11, about 18%, so we are more confident of rain. The second friend says "yes" and we multiply 2:9 by another 2:1 to get 4:9, or 4/13, which is about 31%. The third friend says "yes" and we multiply 4:9 by 2:1 to get 8:9, which is 8/17, the final answer of about 47%.
I have found a way to force Bayes in, but don't know if it's correct. Replacing the A's and B's with more descriptive terms, Bayes' Theorem is
odds of rain, given a "yes" = odds of "yes," given rain × odds of rain / odds of "yes"
So, after hearing the first "yes" we have
odds of rain, given a "yes" = (2:1, the odds we will hear a "yes" on a rainy day) × (1:9, our prior estimate for rain) / (1, our certainty of hearing "yes" because we just talked with Albert and he definitely said "yes")
This gives us 2:9 divided by 1, which is the desired 2/11 result. That denominator seems forced and flaky, though, and I am not sure what to do if Albert says "no." Perhaps the odds of rain given a "no" are 1:2 × 1:9 / 1 (again because we definitely heard "no") for a total of 1:18, or about 5% chance of rain. Odds are good b_b will be able to straighten this out.
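One way to check both the 8/17 result and the guess for a "no" answer is to compute them twice, once with odds and once with the probability form of Bayes' theorem. The sketch below is an illustration under stated assumptions rather than anything from the article: the function names are invented, the 2:1 factor is treated as the ratio P("yes" | rain) / P("yes" | no rain) = (2/3) / (1/3), and the denominator P(B) in the probability form is taken from the law of total probability, P("yes") = P("yes" | rain) × P(rain) + P("yes" | no rain) × P(no rain).

```python
from fractions import Fraction

def update_odds(prior_odds, likelihood_ratio):
    """Odds form: posterior odds = prior odds x likelihood ratio (the Bayes factor)."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds written as a single ratio (1:9 -> 1/9) to a probability."""
    return odds / (1 + odds)

def bayes_probability(prior, p_answer_if_rain, p_answer_if_dry):
    """Probability form; the denominator is the total probability of the answer."""
    p_answer = p_answer_if_rain * prior + p_answer_if_dry * (1 - prior)
    return p_answer_if_rain * prior / p_answer

truth, lie = Fraction(2, 3), Fraction(1, 3)

# Three "yes" answers: each one multiplies the 1:9 prior odds by 2.
odds = Fraction(1, 9)
for _ in range(3):
    odds = update_odds(odds, truth / lie)
three_yes = odds_to_probability(odds)

# A single "no": the factor is P("no" | rain) / P("no" | no rain) = 1/2.
one_no = odds_to_probability(update_odds(Fraction(1, 9), lie / truth))
one_no_check = bayes_probability(Fraction(1, 10), lie, truth)

print(three_yes, one_no, one_no_check)
```

Under these assumptions the odds route and the probability route agree: three "yes" answers give 8/17, and a single "no" gives 1/19, which is the 1:18 odds (about 5%) guessed above. In the odds form there is no separate denominator to worry about, because P("yes") divides both the rain and the no-rain side and cancels out of the ratio.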