There are 44100 samples per second for the audio and 25 frames per second for the video, so each frame covers 44100 / 25 = 1764 samples, which is exactly 42 x 42. That is why the image is 42 x 42 pixels (the fuzziness you see is caused by your computer scaling it up from the postage-stamp size of the original). The visual and the audio are literal mappings of each other: blue = left channel, red = right channel.
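For anyone who wants to play with the idea, here is a rough Python sketch of how one frame could be built from one frame's worth of stereo samples. Only the numbers (44100, 25, 42 x 42) and the blue = left / red = right mapping come from the description above. The function names, the row-by-row pixel order, the empty green channel, and the scaling of samples from [-1.0, 1.0] to 0..255 are my own assumptions for illustration, not necessarily how the original was made.

```python
SAMPLE_RATE = 44100  # audio samples per second (from the description)
FRAME_RATE = 25      # video frames per second (from the description)

SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE  # 1764 samples per video frame
SIDE = 42                                      # 42 * 42 == 1764


def to_byte(sample):
    """Map a sample in [-1.0, 1.0] to 0..255 (assumed scaling)."""
    clamped = max(-1.0, min(1.0, sample))
    return round((clamped + 1.0) * 127.5)


def frame_from_samples(left, right):
    """Turn one frame's worth of stereo samples into a 42 x 42 RGB grid.

    Each pixel is (red, green, blue): red carries the right channel,
    blue carries the left channel, and green is left at 0 (assumption).
    Pixels are filled row by row (assumption).
    """
    if len(left) != SAMPLES_PER_FRAME or len(right) != SAMPLES_PER_FRAME:
        raise ValueError("expected exactly %d samples per channel" % SAMPLES_PER_FRAME)
    pixels = [(to_byte(r), 0, to_byte(l)) for l, r in zip(left, right)]
    return [pixels[row * SIDE:(row + 1) * SIDE] for row in range(SIDE)]


if __name__ == "__main__":
    # Full-scale left channel, minimum right channel: a pure blue frame.
    frame = frame_from_samples([1.0] * SAMPLES_PER_FRAME, [-1.0] * SAMPLES_PER_FRAME)
    print(len(frame), "x", len(frame[0]), "first pixel:", frame[0][0])
```

Feeding it consecutive 1764-sample slices of a track would give one image per video frame, 25 per second.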
TL;DR: You are seeing the literal data that you are hearing.
It also looks nice through red-blue 3D glasses.