I am in the process of learning HLA Assembly from the book Art of Assembly Language, 2nd Edition. I just started learning about the shr and shl instructions, and I would like to know whether shifting by a larger amount takes more time than shifting by a smaller amount, e.g. shr(1, dest) vs. shr(7, dest).
I'm sorry if the syntax for the instructions is wrong.
http://agner.org/optimize/ has instruction timings for x86 CPUs, and microarch guides.
Shift and rotate with an immediate (compile-time-constant) count are single cycle latency on recent AMD and Intel.
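To make that concrete, here is a rough C sketch of the two cases from the question. The function names are made up for illustration. An optimizing compiler will typically turn each function into a single SHR with an immediate count, so on the CPUs mentioned above neither should be slower than the other; check your own compiler's asm output to confirm.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Shift right by a compile-time-constant count of 1
   (roughly the C equivalent of shr(1, dest) in HLA). */
uint32_t shift_right_by_1(uint32_t x)
{
    return x >> 1;
}

/* Shift right by a compile-time-constant count of 7
   (roughly the C equivalent of shr(7, dest) in HLA). */
uint32_t shift_right_by_7(uint32_t x)
{
    return x >> 7;
}
```

Only the immediate encoded in the instruction differs between the two, not the kind of instruction.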
Rotate-through-carry (RCL / RCR) by any count other than 1 is slow, but probably constant-time. (Data-dependent timing makes out-of-order execution dependency tracking even trickier, so I think they just take the maximum.)
Another strange thing: apparently IvyBridge / Haswell take an extra uop for the short-form ROL / ROR rotate-by-1 opcode, so throughput is half compared to the normal opcode with an imm8 count of 1.
re: HLA: C and C++ compilers have good support for intrinsics now (functions that turn into inline instructions). There's not as much of a use-case for HLA anymore, I think I remember reading. According to some source I can't remember (sorry >.<), these days you might as well just learn normal asm. A lot of the time, you can get speedups from using vector instructions (or bit-manipulation, like popcount) through intrinsics in C/C++.
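As a small sketch of what using an intrinsic looks like, assuming GCC or Clang: `__builtin_popcount` is a builtin in those compilers (other compilers spell it differently), and the function names below are made up for illustration. When the target supports it (for example with `-mpopcnt`), the builtin typically compiles down to a single POPCNT instruction; the loop version is a portable fallback for comparison.

```c
#include <stdint.h>

/* Assumption: GCC or Clang. __builtin_popcount is a compiler builtin,
   not standard C. Function names here are hypothetical. */
unsigned popcount_builtin(uint32_t x)
{
    return (unsigned)__builtin_popcount(x);
}

/* Portable fallback: clear the lowest set bit until none remain,
   counting one iteration per set bit. */
unsigned popcount_loop(uint32_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit */
        count++;
    }
    return count;
}
```

Both functions return the same result; the difference is only in what the compiler can emit for them.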
If you're having fun learning HLA, and think it's useful, then best of luck to you, though.