Couple things they should have gotten into but didn't:
1) Hearing and vision are terribly lossy codecs in humans. Roughly 80% of the information you think you "see" and "hear" is synthesized by your brain based on, as they say, prior information. This is the source of nearly all optical illusions and why we see faces in the moon, etc. Every misinterpreted lyric you've ever heard ("there's a bathroom on the right" in CCR's "Bad Moon Rising," etc.) is due to this as well. When exposed to new information we pattern-match against what we've seen or heard before. Thus, ghosts, phantom ringtones, etc.
2) They're kinda cheating on the "scrambling." They left most of the consonants in place (I'd be curious to see their algorithm - it sounds like a pitch/amplitude transform which preserves sibilance, plosives and fricatives to a greater extent than vowels), which aids native speakers of Germanic tongues. There's a chance that people who grew up speaking any of the Asiatic languages would not attach to the clip the same way, because intelligibility in Asiatic languages requires the development of different structures in Broca's area than the structures necessary for Germanic languages.
3) The first time through you don't attach to the consonants because you have no context for them. If, however, you listen to the clip over and over again you will get close eventually, at least if you have training. I heard "next stop" before I knew what I was listening to, but I do this sort of thing for a living.