Huh, ideone shows it slower. In g++ 4.9.3 in my VM (ubuntu werewolf) I get A: 830 B: 460. In either case why is one faster than the other? I assumed A is being push by reference and B is by value but I don't see why it would do that. I tried using operator<(A a, ...
and that didn't help.
struct A {
u32 first, second;
A(int f, int s):first(f),second(s) {}
bool operator<(u32 b) const { return first < b; }
};
//bool operator<(const A&a, u32 b) { return a.first < b; }
struct B {
u64 v;
B(int f, int s) { v = ((long)f << 32) | s; }
bool operator<(u32 b) const {
return (v >> 32) < b;
}
};
u32 get_value(deque<A>& d, u32 v) {
auto p = lower_bound(d.begin(), d.end(), v);
if (p != d.end())
return p->second;
else
return UINT_MAX;
}
u32 get_value(deque<B>& d, u32 v) {
auto p = lower_bound(d.begin(), d.end(), v);
if (p != d.end())
return p->v & 0xFFFFFFFF;
else
return UINT_MAX;
}
int main(int argc, char *argv[]) {
{
deque<A> d;
struct timeval s, e;
gettimeofday(&s, 0);
for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
d.emplace_back(A(i, i ^ 92142));
long v = 0;
for (int i = 0; i < 10000; ++i)
v += get_value(d, i * 3 + 1);
gettimeofday(&e, 0);
auto sec = e.tv_sec - s.tv_sec;
auto usc = e.tv_usec - s.tv_usec;
printf("A %ld\n", v);
printf("Time: %lu\n", sec * 1000 + usc / 1000);
}
{
deque<B> d;
struct timeval s, e;
gettimeofday(&s, 0);
for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
d.emplace_back(B(i, i ^ 92142));
long v = 0;
for (int i = 0; i < 10000; ++i)
v += get_value(d, i * 3 + 1);
gettimeofday(&e, 0);
auto sec = e.tv_sec - s.tv_sec;
auto usc = e.tv_usec - s.tv_usec;
printf("A %ld\n", v);
printf("Time: %lu\n", sec * 1000 + usc / 1000);
}
}
I think you're getting a speedup with struct B { u64 v; }
because you're using it with compile-time-constant args. The compiler can combine both values into a 64-bit store.
I made a simple non-member function to get asm output produced for the non-inline case of operater<
for A and B. As godbolt shows, the asm for struct A::operator<
is significantly more efficient. B::operator< really does do a 64bit load, then shift. I commented the asm for the benefit of people that don't know x86 very well.
bool comp_A(const struct A &lhs, u32 rhs) { return lhs < rhs; }
bool comp_B(const struct B &lhs, u32 rhs) { return lhs < rhs; }
comp_A(A const&, unsigned int):
cmpl %esi, (%rdi) # compare with lhs->first as a memory operand
setb %al
ret
comp_B(B const&, unsigned int):
movq (%rdi), %rax # load lhs->v
movl %esi, %esi # zero-extend the low 32b of rhs
shrq $32, %rax # shift the high32 of v down to the low 32
cmpq %rsi, %rax
setb %al # al=1 if rax "bigger than" rsi, else 0.
ret
x86 has fast hardware support for loading any size integer from RAM into a 32 or 64-bit temporary (in a register). movzx
and movsx
zero or sign extend, and are as cheap as regular loads with mov
. 32-bit ops always zero the high 32 of the dest register, so false dependencies on old values aren't a problem. (They are for 16 and 8-bit ops, which is why movzx
to a 32-bit temporary is a good plan. Compilers will hopefully use movzx
loads even when you use uint8_t
local variables, but sometimes that ends up requiring extra instructions to truncate or redo zero-extension after working with an 8-bit local.)
I haven't looked at the asm of your actual functions, as it's quite large. I'd suggest using version A in general, though. It may be that GCC isn't copying it around with 64-bit moves, maybe because it might not be aligned? x86 doesn't require alignment (except for non-AVX (legacy-SSE) 16-byte memory operands). However, IIRC it's only since about 2008 or 2009 CPUs (Intel's Nehalem) that unaligned loads/stores had no penalty as long as they didn't cross a cache line. GCC may still be reluctant to use them, since the potential gain is smaller than the potential downside on old CPUs with slow unaligned access. alignof(struct of 2 u32 members)
is only 4, smaller than the size, so it's not guaranteed to be naturally aligned.
You might get GCC to give you the best of both worlds with a union.
union A { u64 v; u32 first_last[2]; };
This might induce the compiler to use 64bit moves when copying it around, but still do 32bit loads of A.first_last[0]
and not need to shift when you're accessing individual fields. It also gives the whole union an alignof
== 8, like if you'd make the first 32-bit member of a struct alignas(8)
.