I use AES-128 in CTR mode for encryption, implemented for different clients (Android/Java and iOS/ObjC). The 16-byte IV used when encrypting a packet is formatted like this:
<11 byte nonce> | <4 byte packet counter> | 0
The packet counter (included in each sent packet) is increased by one for every packet sent. The last byte is used as a block counter, so that packets with fewer than 256 blocks always get a unique counter value. I assumed that CTR mode specified that the counter should be increased by 1 for each block, treating the last 8 bytes as a big endian counter, or that this was at least a de facto standard. This also seems to be how the Sun crypto implementation behaves.
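To make the layout concrete, here is a minimal Java sketch of how such an IV could be assembled. The class and method names (`PacketIv`, `buildIv`) are made up for illustration and are not taken from my actual code:

```java
import java.nio.ByteBuffer;

final class PacketIv {
    // Hypothetical helper: <11 byte nonce> | <4 byte packet counter> | <1 byte block counter = 0>
    static byte[] buildIv(byte[] nonce11, int packetCounter) {
        if (nonce11.length != 11) {
            throw new IllegalArgumentException("nonce must be exactly 11 bytes");
        }
        ByteBuffer iv = ByteBuffer.allocate(16); // ByteBuffer is big endian by default
        iv.put(nonce11);          // bytes 0..10
        iv.putInt(packetCounter); // bytes 11..14, most significant byte first
        iv.put((byte) 0);         // byte 15, the block counter starts at zero
        return iv.array();
    }
}
```

The result would be passed to the cipher as an `IvParameterSpec`.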
I was a bit surprised when the corresponding iOS implementation (using CommonCryptor, iOS 5.1) failed to decode every block after the first when decoding a packet. It seems that CommonCryptor defines the counter in some other way. A CommonCryptor can be created in both big endian and little endian mode, but some vague comments in the CommonCryptor code indicate that this is not (or at least has not been) fully supported:
http://www.opensource.apple.com/source/CommonCrypto/CommonCrypto-60026/Source/API/CommonCryptor.c
/* corecrypto only implements CTR_BE. No use of CTR_LE was found so we're marking
this as unimplemented for now. Also in Lion this was defined in reverse order.
See <rdar://problem/10306112> */
By decoding block by block, each time setting the IV as specified above, it works nicely.
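The actual workaround is on iOS, but the idea can be sketched in Java for comparison. This assumes the IV layout above (byte 15 is the block counter) and uses hypothetical names (`BlockwiseCtr`, `oneShot`, `blockwise`); it is only an illustration, not my production code:

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class BlockwiseCtr {
    private static final int BLOCK = 16;

    // Whole packet in one call; the provider increments the counter itself.
    static byte[] oneShot(byte[] key, byte[] packetIv, byte[] data) {
        try {
            Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
            ctr.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(packetIv));
            return ctr.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Workaround: re-initialise per block and write the block index into byte 15.
    static byte[] blockwise(byte[] key, byte[] packetIv, byte[] data) {
        if (data.length > 256 * BLOCK) {
            throw new IllegalArgumentException("packet exceeds 256 blocks");
        }
        try {
            Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
            SecretKeySpec aesKey = new SecretKeySpec(key, "AES");
            byte[] iv = packetIv.clone();
            byte[] out = new byte[data.length];
            for (int off = 0, block = 0; off < data.length; off += BLOCK, block++) {
                iv[15] = (byte) block; // one-byte block counter
                ctr.init(Cipher.DECRYPT_MODE, aesKey, new IvParameterSpec(iv));
                int len = Math.min(BLOCK, data.length - off);
                ctr.doFinal(data, off, len, out, off);
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the Sun provider both methods should give the same output for the same key, IV and data, which is the behaviour I expected from CommonCryptor as well.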
My question: is there a "right" way of implementing the counter/IV handling in CTR mode when decoding multiple blocks in a single go, or should I expect interoperability problems when using different crypto libraries? Is CommonCrypto bugged in this regard, or is it just a matter of CTR mode being implemented differently?
The definition of the counter is (loosely) specified in NIST Special Publication 800-38A, Appendix B. Note that NIST only specifies how to use CTR mode with regard to security; it does not define one standard algorithm for the counter.
To answer your question directly: whatever you do, you should expect the counter to be incremented by one for each block. The counter should represent a 128-bit big endian integer according to the NIST specifications. It may be that only the least significant (rightmost) bits are incremented, but that will usually not make a difference unless the incremented part passes the value 2^32 - 1 or 2^64 - 1.
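As an illustration of that counter definition, the following Java sketch builds CTR by hand from AES in ECB mode, incrementing the whole 16-byte block as one big endian integer, and offers a helper to compare against the provider's own `AES/CTR/NoPadding`. The class and method names (`ManualCtr`, `increment`, `crypt`, `jceCtr`) are made up for this example:

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class ManualCtr {
    // Adds 1 to the block, treated as a 128-bit big endian integer (wraps at 2^128).
    static void increment(byte[] counter) {
        for (int i = counter.length - 1; i >= 0; i--) {
            if (++counter[i] != 0) {
                return; // no carry into the next more significant byte
            }
        }
    }

    // CTR by hand: keystream block = AES(key, counter), then XOR with the data.
    static byte[] crypt(byte[] key, byte[] iv, byte[] data) {
        try {
            Cipher ecb = Cipher.getInstance("AES/ECB/NoPadding");
            ecb.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            byte[] counter = iv.clone();
            byte[] out = new byte[data.length];
            for (int off = 0; off < data.length; off += 16) {
                byte[] keystream = ecb.doFinal(counter);
                int len = Math.min(16, data.length - off);
                for (int i = 0; i < len; i++) {
                    out[off + i] = (byte) (data[off + i] ^ keystream[i]);
                }
                increment(counter);
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The provider's own CTR mode, for comparison.
    static byte[] jceCtr(byte[] key, byte[] iv, byte[] data) {
        try {
            Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
            ctr.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return ctr.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If `crypt` and `jceCtr` agree for an IV whose last bytes are 0xFF (so that a carry occurs), the provider increments across byte boundaries in big endian order. Running the same inputs through the other platform's library is a quick interoperability test.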
For the sake of compatibility you could decide to use the first (leftmost) 12 bytes as a random nonce and leave the remaining 4 bytes at zero, then let the CTR implementation do the increments. With a 96-bit (12-byte) random nonce at the start there is no need for a packet counter.
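A minimal Java sketch of that layout, assuming a fresh random nonce per message; `NonceIv` and `newIv` are hypothetical names used only for this example:

```java
import java.security.SecureRandom;

final class NonceIv {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns <12 random bytes> | <4 zero bytes>; the CTR implementation increments the rest.
    static byte[] newIv() {
        byte[] nonce = new byte[12];
        RANDOM.nextBytes(nonce);
        byte[] iv = new byte[16]; // Java zero-initialises arrays, so bytes 12..15 stay 0
        System.arraycopy(nonce, 0, iv, 0, nonce.length);
        return iv;
    }
}
```

The receiver needs the nonce to rebuild the IV, so the 12 nonce bytes would be sent along with the ciphertext in place of the packet counter.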
You are however limited to 2^32 * 16 bytes of plaintext per message before the counter uses up all the available bits. It is implementation specific whether the counter then returns to zero or carries over into the nonce, so you may want to limit yourself to messages of 2^36 = 68,719,476,736 bytes = ~68 GB (yes that's base 10, Giga means 1,000,000,000).
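The arithmetic behind that limit, written as a small Java guard. `CtrLimit` and `fits` are hypothetical names, and the limit assumes the 4-byte counter layout described above:

```java
final class CtrLimit {
    // 2^32 counter values, each covering one 16-byte AES block.
    static final long MAX_BYTES = (1L << 32) * 16L;

    // True if a message of this length never exhausts the 32-bit counter.
    static boolean fits(long messageLength) {
        return messageLength >= 0 && messageLength <= MAX_BYTES;
    }
}
```

Checking the length before encrypting avoids depending on what a particular library does when the counter overflows.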
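In case this is still incompatible (test!) then use only the initial 8 bytes as nonce and leave the last 8 bytes for the counter. Unfortunately that does mean that you need to limit the number of messages because of the birthday problem: with a 64-bit random nonce, a repeated nonce becomes likely after roughly 2^32 messages under the same key.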