Run it down this thread: 1) I'm a toxic mutherfucker 2) Because I cuss a lot 3) and whatever you say in an indoor voice, it isn't toxic. I'm not a machine learning guy, but I don't know that I'd describe the problem as "strictness" so much as "lack of context." It scored 18%, by the way; 13% without point 1 and 11% without point 2.
Paper. They're just looking at small windows of the text, building (very sparse) vectors along the lines of "1.0 if this sequence of n words/characters appeared in the text, 0.0 if not" and doing some voodoo with them. This is the sort of classifier marketing firms use to guess whether Twitter feels positively or negatively about something. You're not going to be able to do fine-grained classification of short texts that way and, unsurprisingly, "toxicity" looks a lot like vehemence.
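To make the "sparse n-gram vector" idea concrete, here's a minimal sketch of that kind of featurization. All names and the toy corpus are illustrative, not from the paper; real systems would use a learned weighting on top of features like these.

```python
# Sketch of n-gram "bag of features": a sparse 0/1 vector indicating
# which character n-grams appear in a text. Illustrative only.

def char_ngrams(text, n=3):
    """All character n-grams of length n in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def featurize(text, vocab):
    """1.0 if the n-gram appeared in the text, 0.0 if not."""
    grams = char_ngrams(text.lower())
    return [1.0 if g in grams else 0.0 for g in vocab]

# Build a tiny vocabulary from example texts (made up here), then
# featurize a new one against it.
corpus = ["you are awful", "have a nice day"]
vocab = sorted(set().union(*(char_ngrams(t.lower()) for t in corpus)))
vec = featurize("you are nice", vocab)
```

Note that nothing in such a vector captures word order beyond the window, negation, or tone; a strongly-worded but benign sentence lights up many of the same features as an abusive one, which is why vehemence and "toxicity" end up looking alike to the classifier.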