Which arithmetic instructions are the slowest and the fastest on IA-32 and IA-64? Is there any ranking? Are there benchmarks?
Generally speaking, the slowest are the square-root and division instructions, especially in the scalar floating-point pipeline.
For IA-32 and Intel 64 specifically, you might want to look at the Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, which has cycle counts (latency and throughput) for each instruction on different processors in Appendix C. Note that IA-64 (Itanium) is a different architecture with its own documentation. In the manual you'll see that the approximate SIMD instructions (such as RCPPS and RSQRTPS) perform much better than square root and division, at the cost of reduced precision, and they operate on 4 single-precision elements at a time. If you need more precision for the square root or reciprocal square root, you'll have to refine the result manually with an extra Newton-Raphson step.
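To make the Newton-Raphson idea concrete, here is a minimal sketch in C using SSE intrinsics. It assumes the fast estimate comes from `_mm_rsqrt_ps` (the RSQRTPS instruction); the function name `rsqrt4_refined` is made up for illustration, and the scalar fallback exists only so the sketch still compiles on non-x86 targets. One step of y1 = y0 * (1.5 - 0.5 * x * y0 * y0) roughly doubles the number of correct bits in the estimate.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#include <math.h>
#endif

/* Hypothetical helper: computes 1/sqrt(x[i]) for 4 floats.
   Starts from a fast low-precision estimate y0 and applies one
   Newton-Raphson step: y1 = y0 * (1.5 - 0.5 * x * y0 * y0). */
static void rsqrt4_refined(const float x[4], float out[4])
{
#ifdef HAVE_SSE
    __m128 vx     = _mm_loadu_ps(x);
    __m128 y0     = _mm_rsqrt_ps(vx);                    /* approximate 1/sqrt(x) */
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), vx);   /* 0.5 * x */
    __m128 t      = _mm_mul_ps(_mm_mul_ps(half_x, y0), y0); /* 0.5 * x * y0^2 */
    __m128 y1     = _mm_mul_ps(y0, _mm_sub_ps(_mm_set1_ps(1.5f), t));
    _mm_storeu_ps(out, y1);
#else
    /* Assumption: no SSE available, so use the library sqrtf as the
       starting value; the refinement step is then essentially a no-op. */
    for (int i = 0; i < 4; ++i) {
        float y0 = 1.0f / sqrtf(x[i]);
        out[i] = y0 * (1.5f - 0.5f * x[i] * y0 * y0);
    }
#endif
}
```

If you need the square root itself, you can multiply the refined result by x. Beware that an input of 0 produces NaN in the refinement step (0 times infinity), so special-case it if zeros can occur. Whether this actually beats SQRTPS or DIVPS depends on the processor, so check the cycle counts or measure.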