Wow, (4) sounds really cool! Putting it on my zotero reading list. I don't understand your walk-through just yet, though:

"Then when you look at the parameters of this model after training, you find that very high-frequency expressions' concentration parameter is quite low, so that most expressions become very polarized toward their generative preferred order, a very few become polarized in the opposite direction, and nothing in between."

The generative part of the model learns from the actual occurrence of "bread and butter" vs "butter and bread". Is the binomial now drawn per usage instance of that phrase, per user, or per context in some other way? How would you even encounter a situation where the polarity of the beta distribution is flipped? If each binomial is drawn iid from the beta, how would they all gravitate toward the opposite polarity? And isn't the beta learned from the actual text, so shouldn't it learn to place a high weight on the final correct polarity? Maybe I just need to sit down with the statistics for a while to understand.

(2) is a really neat example of how we develop a consensus on language through recursive modeling! I don't know that much about pragmatics or this sort of maxim, coming from a computational background, but I did get really excited by a NIPS paper a couple years ago on a model of consensus-building in language learning, also from Goodman's lab.
So the way this works is actually a bit more complicated than that. The background is that a bunch of people in the past have tried to account for binomial ordering preferences through constraints like "short words come before long words", "words with more general meanings come first", etc. This model formalizes that. The authors selected a set of the most agreed-upon constraints and coded a large number of binomials (in their preferred order) for whether they follow each constraint or not (or whether the constraint doesn't apply). Then they used a logistic regression model to learn weights for each of the constraints. So for a new binomial, you can spit out a number between 0 and 1 based on the weights of the various constraints.

Yeah, this was hard for me to understand at first too. (I actually know the author of the paper, and I had to have her explain this to me in person :)) The way it works is that you have some estimate of preference from the first part of the model. Let's say that the estimate is 0.7. This will be the mean of the beta distribution you're drawing from. However, the concentration parameter varies with the overall frequency of the binomial in the corpus. Look at the red line in this graph: https://upload.wikimedia.org/wikipedia/commons/f/f3/Beta_distribution_pdf.svg See how it's U-shaped, so that you mostly end up drawing from the extremes and not much in between? That's how they found the very frequent binomials to behave -- they get polarized one way or the other, but nowhere in between (although it's not exactly the function shown here). And the binomials that are really infrequent are more like the orange line -- you mostly end up drawing near the mean, and they don't get polarized one way or the other. I hope this helped somewhat -- I'm not totally clear on the implementation details, but I think I got the conceptual bits.

Yes, they do amazing work!!! There's been a trend in some fields of linguistics lately to do more computational work, and I love it. I think Noah Goodman has a PhD in math, actually...we need more interdisciplinary people like him. That reminds me, there was another paper at this year's CogSci on joint inference of word & concept that is somewhat similar to Goodman's work: Mollica & Piantadosi, "Towards semantically rich and recursive word learning models". (Side note, it's really awesome to have someone on this site with similar interests :D)
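In case it helps, here's a rough Python sketch of how I picture the two stages fitting together -- this is just my own reconstruction, not the authors' code, and the weights, the frequency-to-concentration mapping, and the per-instance coin flip at the end are all guesses on my part:

```python
# Rough sketch of the two-stage model as I picture it (not the authors' code;
# the weights, the frequency->concentration mapping, and the final
# per-instance coin flip are my own guesses).
import numpy as np
from scipy.special import expit          # logistic function
from scipy.stats import beta, bernoulli

rng = np.random.default_rng(0)

# Stage 1: abstract preference from weighted ordering constraints.
# Each binomial is coded for whether its preferred order follows each
# constraint (+1), violates it (-1), or the constraint doesn't apply (0).
constraint_weights = {"short_before_long": 1.2, "general_before_specific": 0.6}
bread_and_butter = {"short_before_long": 0.0, "general_before_specific": 1.0}

score = sum(constraint_weights[c] * v for c, v in bread_and_butter.items())
mu = expit(score)  # abstract preference for the "bread and butter" order, in (0, 1)

# Stage 2: each binomial gets its own ordering preference, drawn from a Beta
# whose mean is the abstract preference mu and whose concentration nu shrinks
# as corpus frequency grows. Low nu -> U-shaped Beta (the red line in that
# Wikipedia plot) -> frequent binomials end up polarized near 0 or 1, usually
# on mu's side but occasionally flipped. High nu -> draws stay near mu
# (the orange-line behaviour for infrequent binomials).
def binomial_preference(mu, corpus_freq, k=50.0):
    nu = k / (1.0 + corpus_freq)           # toy mapping, not the paper's
    a, b = mu * nu, (1.0 - mu) * nu        # Beta(a, b) has mean mu, concentration nu
    return beta.rvs(a, b, random_state=rng)

pi = binomial_preference(mu, corpus_freq=5000.0)  # a very frequent binomial
# Each usage instance would then be a coin flip with that binomial's preference:
instances = bernoulli.rvs(pi, size=10, random_state=rng)
print(round(mu, 2), round(float(pi), 2), instances)
```

If that reading is right, it would also answer your "per usage instance or per user" question: the beta draw happens once per binomial, and each individual usage is then a flip of that binomial's own coin. A U-shaped beta is also how you'd occasionally get a flipped polarity -- a single draw can land near the pole opposite the mean, which would match the "very few become polarized in the opposite direction" pattern. But again, I could be off on the implementation details.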