Suppose you are interviewing for a data science job and you are asked this question (from glassdoor.com):

*You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?*

wasoxygen:

My first thought is that my friends are biased toward truth, so getting three "yes" answers suggests there is a greater than 50% chance that it is raining in Seattle. If they were more likely to lie than tell the truth, the three "yes" answers would make me think it probably isn't raining.

But even though they tend toward truth, it is possible that they are all lying. So I can't be 100% sure it is raining in Seattle. My answer should be a number between 50% and 100%, so I find 2/3 an attractive option.

But this seems to assume that there is a 50% chance of rain in Seattle before I call any friends. Does that matter?

Suppose instead I want to know if it is raining frogs in Seattle. I saw something on TV saying it is raining frogs in Seattle, but didn't notice if it was an ad or a news item. Normally I would rate it very unlikely that it is raining frogs in Seattle, say a one-in-a-million chance (not zero, since I have heard stories of frogs falling out of the sky). I want to know if I should pack a helmet, so I call my three friends to ask about the frogs. They all say "yes," it is raining frogs.

This increases my estimate of the chances that it is in fact raining frogs in Seattle. But I still think it's very unlikely. It's more likely that all three friends have lied to me. The chance that any one friend would lie is one in three, so the chance that all three would lie is 1 in 27. Perhaps my updated estimate of raining frogs should be about 27 in a million, still very unlikely. This seems reasonable. If I had enough 2/3-reliable friends to call and they all reported raining frogs, eventually I would be convinced that it's probably true.
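As a rough cross-check on that guess, here is a sketch of what a full Bayes update would give. It assumes the one-in-a-million prior from above, friends who answer independently, and a 2/3 chance of truth for each. The helper name `posterior_all_yes` is made up for illustration. Under those assumptions the estimate lands nearer 8 in a million than 27, because three "yes" answers are 8 times likelier when frogs are falling than when they are not.

```python
from fractions import Fraction

def posterior_all_yes(prior, truth_prob, n_friends):
    """P(event | all n friends say "yes"), assuming independent friends
    who each tell the truth with probability truth_prob."""
    yes_if_true = truth_prob ** n_friends          # every friend is honest
    yes_if_false = (1 - truth_prob) ** n_friends   # every friend is lying
    numerator = yes_if_true * prior
    return numerator / (numerator + yes_if_false * (1 - prior))

prior_frogs = Fraction(1, 1_000_000)   # the one-in-a-million guess from the text
p_truth = Fraction(2, 3)               # reliability stated in the problem

all_lie = (1 - p_truth) ** 3           # chance that all three friends lie: 1/27
frogs = posterior_all_yes(prior_frogs, p_truth, 3)
print(all_lie, frogs, float(frogs))

# How many unanimous friends before frogs become more likely than not?
n_needed = 1
while posterior_all_yes(prior_frogs, p_truth, n_needed) <= Fraction(1, 2):
    n_needed += 1
print(n_needed)
```

With these inputs, each extra "yes" doubles the odds, so it takes about 20 unanimous friends to push a one-in-a-million prior past 50%.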

This shows that I do need a prior estimate of rain in Seattle before I can update based on my friends' report. I believe it rains a lot in Seattle, but a 50% chance of rain on a random day seems a little high outside a rain forest. Suppose I put my estimate at 30% before I call my friends.

When the first friend says it is raining, there is a 2/3 chance the information is reliable. If the information were perfectly reliable, it would banish all doubt. If the information were 50% reliable (i.e. completely unreliable) I would stick with my prior estimate of 30%. If I could remember Bayes' Theorem I could probably come up with a good updated estimate.

I have time before my trip, so I look up the formula, planning to apply it to update my expectations after calling the first friend, Albert, and hearing him say "yes."

Bayes' formula will give me the probability of one event given observation of another event. So I define two events:

**A**: It is raining in Seattle.

**B**: Albert says "yes."

The formula says the probability that it is raining, given Albert's "yes," is P(A|B) = P(B|A) × P(A) / P(B).

P(B|A) is the probability that Albert says "yes" given that it is raining. The problem states that this is 2/3.

P(A) is the probability that it is raining. My estimate is 30%.

P(B) is the probability that Albert says "yes." Now I am confused. What Albert says depends on whether it is raining. How can I compute the probability of what he says if I don't know whether it is raining?

So I change my definition of event **B** to "Albert reports honestly." This is the 2/3 chance given in the problem that my friend will be honest.

The formula then gives me (2/3) × 30% / (2/3), resulting in the original 30%. It doesn't matter how Albert responds, because the 2/3 probability of Albert responding honestly cancels out in the numerator and denominator. This can't be right.

At this point I read the rest of the blog entry, which gives a definite solution but leaves me unclear about where I went wrong. Perhaps I should be using odds rather than probabilities. It also seems like I should incorporate the "yes" response in event **B** somehow. b_b probably knows.
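One possible way out, sketched here under the assumptions already stated (a 30% prior, independent friends, 2/3 truthfulness): keep **B** as 'Albert says "yes"' and get P(B) by splitting it over the two weather cases, P(B) = P(B|A) × P(A) + P(B|not A) × P(not A). The redefined event "Albert reports honestly" has the same 2/3 probability rain or shine, which is why it cancels and tells me nothing. The variable names below are illustrative only.

```python
from fractions import Fraction

p_rain = Fraction(3, 10)    # prior from the text: 30%
p_truth = Fraction(2, 3)    # each friend tells the truth with probability 2/3

# B = 'Albert says "yes"'. Split P(B) over the two cases for the weather.
p_yes_given_rain = p_truth        # an honest "yes" when it is raining
p_yes_given_dry = 1 - p_truth     # a lying "yes" when it is not raining
p_yes = p_yes_given_rain * p_rain + p_yes_given_dry * (1 - p_rain)

# Bayes' formula for one friend: P(A|B) = P(B|A) * P(A) / P(B)
after_one = p_yes_given_rain * p_rain / p_yes

# Three independent "yes" answers: repeat the same update,
# feeding each posterior back in as the next prior.
posterior = p_rain
for _ in range(3):
    evidence = p_truth * posterior + (1 - p_truth) * (1 - posterior)
    posterior = p_truth * posterior / evidence

print(p_yes, after_one, posterior, float(posterior))
```

Under a 30% prior this works out to 6/13 after Albert alone and 24/31 (about 77%) after all three friends. Swapping in a 50% prior gives 8/9 instead, so the answer does depend on the prior.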

posted by wasoxygen: 300 days ago