I am writing a Smooth Streaming client application. On the server side (IIS 7 with Media Services extensions), I have a bunch of ISMV and ISMA files encoded using Expression Encoder pro 4 with the "H.264 IIS Smooth Streaming iPhone WiFi" preset. In a nutshell, it uses the "H.264 baseline" video codec, and the AAC-LC audio codec.
On the client side however is where I'm having problems, specifically with the audio chunks. While I have been able to make sense of the H.264 video stream (it is essentially a sequence of raw NAL units prefixed by their length, without the NAL unit "start code" 0, 0, 0, 1), I still haven't been able to crack what is inside the AAC LC audio stream, i.e. what comes in the "mdat" (Media Data Box) atom. It is most definitely not an MP4 container, but what is it then?
I am pasting below the first 128 (number chosen arbitrarily) bytes of one AAC-LC fragment (MDAT portion only) obtained from the server, in case anyone can figure it out from there.
unsigned char data[128] = {
0x21, 0x09, 0x0A, 0xBF, 0xBF, 0xFF, 0xFF, 0xD5, 0xB1, 0x8D, 0xC4, 0xA1,
0x18, 0x0D, 0x25, 0xC9, 0x2E, 0x49, 0x2E, 0x10, 0x88, 0x91, 0x10, 0x01,
0x13, 0x23, 0x2C, 0x36, 0x25, 0x60, 0x6B, 0x94, 0x8C, 0x74, 0xD7, 0x4A,
0x95, 0xD3, 0x03, 0x91, 0x5B, 0x76, 0xDE, 0x27, 0xC5, 0xB2, 0x4C, 0xCF,
0xEB, 0x3E, 0xDD, 0xFF, 0x22, 0xAF, 0xC3, 0xF8, 0x60, 0x36, 0x49, 0xBC,
0xAE, 0x4D, 0x10, 0x31, 0xC6, 0x28, 0x2A, 0xEB, 0xCA, 0x94, 0x51, 0xD8,
0x61, 0x1B, 0xC6, 0x2A, 0x91, 0x71, 0xE4, 0x8C, 0xF8, 0x19, 0x2C, 0xDE,
0x71, 0xBB, 0xE3, 0xBD, 0x36, 0xB4, 0x45, 0x37, 0x02, 0x61, 0x48, 0x8E,
0x19, 0x80, 0xD5, 0x24, 0x97, 0x24, 0x92, 0x44, 0x08, 0x89, 0x12, 0x00,
0xB3, 0xF8, 0x1E, 0xE2, 0xBD, 0xCD, 0x4E, 0xF7, 0xA9, 0xE2, 0x0E, 0xD8,
0xEA, 0xFA, 0xCF, 0xDB, 0x4E, 0x69, 0x6F, 0xEE
};
After a long research and this tip I received on the IIS forums, I was able to figure it out. Basically this is a raw AAC stream, which needs to be wrapped with headers before it can be played back. The simplest and most common header format appears to be ADTS, which consists in adding a 7-byte header in front of each sample.