Hi, suppose you have two different independent 64-bit binary matrices A
and T
(T
is another matrix that is stored in transposed form, using the transposed version of matrix allows during multiplication to operate on T
's rows rather than columns which is super cool for binary arithmetic) and you want to multiply these matrices the only thing is that matrix multiplication result is truncated to 64-bits and if you yield to a value greater that 1
in some specific matrix cell the resulting matrix cell will contain 1
otherwise 0
A T
00000001 01111101
01010100 01100101
10010111 00010100
10110000 00011000 <-- This matrix is transposed
11000100 00111110
10000011 10101111
11110101 11000100
10100000 01100010
Binary and traditional multiplication results:
Binary Traditional
11000100 11000100
11111111 32212121
11111111 32213421
11111111 21112211
11101111 22101231
11001111 11001311
11111111 54213432
11001111 11001211
How do you multiply these matrices in a way described above in most efficient matter?
I was trying to take advantage of binary and
(i.e. &
operator) instead of performing multiplication on separate bits, in that case I had to prepare data for multiplication:
ulong u;
u = T & 0xFF;
u = (u << 00) + (u << 08) + (u << 16) + (u << 24)
+ (u << 32) + (u << 40) + (u << 48) + (u << 56);
now by performing binary and
over two integers A
and u
it would yield to the following:
A u R C
00000001 01111101 00000001 1
01010100 01111101 01010100 3
10010111 01111101 00010101 3
10110000 01111101 00110000 2
11000100 01111101 01000100 2
10000011 01111101 00000001 1
11110101 01111101 01110101 5
10100000 01111101 00100000 1
In the example above R
contains result of multiplication of A
bits to u
bits and to obtain the final value we must sum
all bits in a row. Notice that column C
contains values equal to ones found in first column of resulting Traditional
matrix multiplication above. The problem is that during this step I have to operate on a separate bits which I think is sub-optimal approach, I've read through http://graphics.stanford.edu/~seander/bithacks.html looking for a way to do that on parallel but no luck, if anyone has any idea on how to "flatten" and "merge" the values located in R
column into resulting 64-bit matrix, I would appreciate if you drop me several lines,
Thank you,
With big thank you to David Eisenstat, the final algorithm would then look like:
var A = ...;
var T = ...; // T == transpose(t), t is original matrix, algorithm works with transposed matrix
var D = 0x8040201008040201UL;
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
The following piece of code:
public static void Main (string[] args){
ulong U;
var Random = new Xor128 ();
var timer = DateTime.Now;
var A = Random.As<IUniformRandom<UInt64>>().Evaluate();
var T = Random.As<IUniformRandom<UInt64>>().Evaluate();
var steps = 10000000;
for (var i = 0; i < steps; i++) {
ulong r = 0;
var d = 0x8040201008040201UL;
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
}
Console.WriteLine (DateTime.Now - timer);
var m1 = new Int32[8,8];
var m2 = new Int32[8,8];
var m3 = new Int32[8,8];
for (int row = 0; row < 8; row++) {
for (int col = 0; col < 8; col++) {
m1 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
m2 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
m3 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
}
}
timer = DateTime.Now;
for (int i = 0; i < steps; i++) {
for (int row = 0; row < 8; row++) {
for (int col = 0; col < 8; col++) {
var sum = 0;
for (int temp = 0; temp < 8; temp++) {
sum += m1 [row, temp] * m2 [temp, row];
}
m3 [row, col] = sum;
}
}
}
Console.WriteLine (DateTime.Now - timer);
}
Shows me the following results:
00:00:02.4035870
00:00:57.5147150
And that's a 23x performance improvement under Mac OS X / Mono, thanks everyone
I'm not sure about most efficient, but here's something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for 7 more iterations.
// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255; // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;