Say I have three 2D matrices: Data, indices, and shiftedData.
I want to map every sub-block of Data into shiftedData according to the linear indices defined in the matrix indices. In Matlab, I might do the following:
shiftedData = zeros(tSize*TXSize,xSize);
for nT = 0:TXSize-1
    shiftedData((nT*tSize+1):((nT+1)*tSize),:) = ...
        extractData(Data((nT*tSize+1):((nT+1)*tSize),:), ...
        indices(:,(nT*xSize+1):((nT+1)*xSize))+1,xSize,tSize);
end

function [shiftedData] = extractData(Data,indices,xSize,tSize)
    shiftedData(indices(:)) = Data(:);
end
I am trying to implement something similar in Eigen. I've written four different approaches for the analogous extractData function, shown in pseudocode below along with the loop that calls them:
Eigen::MatrixXcd shiftedData = Eigen::MatrixXcd::Zero(tSize*TXSize,xSize);
for (int nT = 0; nT < TXSize; nT++){
    extractData(shiftedData(Eigen::seq(nT*tSize,(nT+1)*tSize-1),Eigen::all),
                Data(Eigen::seq(nT*tSize,(nT+1)*tSize-1),Eigen::all),
                indices, xSize, tSize);
}
void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                 const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                 const Eigen::Ref<const Eigen::MatrixXi>& indices,
                 const long int& xSize, const long int& tSize) {
    // Approach 1
    for (int nt = 0; nt < tSize; nt++){
        for (int nx = 0; nx < xSize; nx++){
            int row = indices(nt,nx) % tSize;
            int col = indices(nt,nx) / tSize;
            shiftedData(row,col) = Data(nt,nx);
        }
    }

    // Approach 2
    for (int i = 0; i < tSize*xSize; i++) {
        int nt = i % tSize;
        int nx = i / tSize;
        int row = indices(nt,nx) % tSize;
        int col = indices(nt,nx) / tSize;
        shiftedData(row,col) = Data(nt,nx);
    }

    // Approach 3
    Eigen::Map<const Eigen::VectorXcd> linData(Data.data(), xSize*tSize);
    Eigen::Map<const Eigen::VectorXi> linIDX(indices.data(), xSize*tSize);
    for (int i = 0; i < xSize*tSize; i++){
        int col = linIDX(i) / tSize;
        int row = linIDX(i) % tSize;
        shiftedData(row,col) = linData(i);
    }

    // Approach 4
    Eigen::Map<const Eigen::VectorXcd> linData(Data.data(), xSize*tSize);
    Eigen::Map<const Eigen::VectorXi> linIDX(indices.data(), xSize*tSize);
    Eigen::Map<Eigen::VectorXcd> linView(shiftedData.data(), xSize*tSize);
    linView(linIDX) = linData;
}
Approaches 1 and 2 generate the correct output. However, approaches 3 and 4 do not. I believe the problem has something to do with the fact that I am passing non-contiguous blocks of shiftedData and Data to extractData. How can I properly account for this and perform something more analogous to the Matlab example in Eigen?
I'll focus on improving your first approach, as it seems the most straightforward. As you suspected, Map cannot work on the non-contiguous blocks. You can use reshaped() but that's just going to hide the division and modulo inside the Eigen method. Let's try some different methods.
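For illustration, the reshaped() route would look roughly like this (a sketch only, assuming Eigen 3.4, 0-based indices, and full contiguous matrices rather than the Ref blocks from the question):

// Sketch: linear scatter through reshaped() views, mirroring MATLAB's
// shiftedData(indices(:)) = Data(:). The linear-to-(row, col) division
// and modulo still happen inside Eigen, just out of sight.
shiftedData.reshaped()(indices.reshaped()) = Data.reshaped();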
First, some minor cleanup. I'm going to ignore the outer loop over TXSize entirely and focus on the extractData function. There are minor issues with the interface:
- const long int& is pointless. Creating a reference to a simple integer just risks losing performance (potential risk of redundant reloads because of aliasing). Just pass them by value.
- long int itself is not the best type for this. From your linked CodeReview I know that you operate on Windows, which means long int is basically the same as int. Prefer a type that has the correct size on all platforms. When you use Eigen, use Eigen::Index, which is std::ptrdiff_t by default and is probably what you expect to get when you type long int.
- You don't need to pass xSize and tSize at all; you can query the .rows() and .cols() methods on indices or Data. Dropping those explicit arguments means two fewer potential sources of error. A cleaned-up signature is sketched below.

Back to the main course. For testing, I use the dimensions from your CodeReview page, meaning tSize = 2688; xSize = 192.
For approach 1, your loop is
for (int nt = 0; nt < tSize; nt++){
    for (int nx = 0; nx < xSize; nx++){
        ...
        ... = indices(nt,nx) ...;
        ... = Data(nt,nx);
    }
}
Notice how you iterate over columns in the inner loop and rows in the outer loop. Eigen uses column-major storage by default, which makes this order sub-optimal due to poor data locality.
Assuming that all entries in indices are unique, we can swap the order of the loop headers. On my system this gives a 50% improvement.
void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                 const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                 const Eigen::Ref<const Eigen::MatrixXi>& indices,
                 Eigen::Index xSize, Eigen::Index tSize) {
    for (Eigen::Index nx = 0; nx < xSize; nx++){
        for (Eigen::Index nt = 0; nt < tSize; nt++){
            Eigen::Index row = indices(nt,nx) % tSize;
            Eigen::Index col = indices(nt,nx) / tSize;
            shiftedData(row,col) = Data(nt,nx);
        }
    }
}
You reuse the same indices for multiple matrices, which means you perform the same integer divisions repeatedly. Integer division is very slow, with a throughput of 6-14 cycles per instruction on many CPUs. It may be worth caching the split row/column indices.
void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                 const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                 const Eigen::Ref<const Eigen::Array2Xi>& indices,
                 Eigen::Index xSize, Eigen::Index tSize) {
    for (Eigen::Index nx = 0; nx < xSize; nx++){
        for (Eigen::Index nt = 0; nt < tSize; nt++){
            Eigen::Index flatIdx = nx * tSize + nt;
            const auto outIndex = indices.col(flatIdx);
            shiftedData(outIndex.x(), outIndex.y()) = Data(nt, nx);
        }
    }
}
...
// Split each linear index into its (row, col) pair once, up front
Eigen::Array2Xi splitIndices(2, xSize * tSize);
for(Eigen::Index i = 0; i < xSize; ++i) {
    for(Eigen::Index j = 0; j < tSize; ++j) {
        auto out = splitIndices.col(i * tSize + j);
        const int flatIdx = indices(j, i);
        out.x() = flatIdx % tSize;
        out.y() = flatIdx / tSize;
    }
}
for(int i = 0; i < TXSize; ++i)
    extractData(shiftedData, Data, splitIndices, xSize, tSize);
However, on my system I don't see a benefit. I assume the cost of the random memory accesses on shiftedData dominates.
In general, CPUs are better at handling random memory loads than random stores. We can invert the indices, turning the scatter into a gather, and combine this with the cached index computation above.
void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                 const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                 const Eigen::Ref<const Eigen::Array2Xi>& indices,
                 Eigen::Index xSize, Eigen::Index tSize) {
    for (Eigen::Index nx = 0; nx < xSize; nx++){
        for (Eigen::Index nt = 0; nt < tSize; nt++){
            Eigen::Index flatIdx = nx * tSize + nt;
            const auto inIndex = indices.col(flatIdx);
            shiftedData(nt, nx) = Data(inIndex.x(), inIndex.y());
        }
    }
}
...
// Invert the mapping: for each output position, record where its input lives
Eigen::Array2Xi inverseIndices(2, xSize * tSize);
for(Eigen::Index i = 0; i < xSize; ++i) {
    for(Eigen::Index j = 0; j < tSize; ++j) {
        int flatIdx = indices(j, i);
        Eigen::Index row = flatIdx % tSize;
        Eigen::Index col = flatIdx / tSize;
        auto out = inverseIndices.col(col * tSize + row);
        out.x() = j;
        out.y() = i;
    }
}
for(int i = 0; i < TXSize; ++i)
    extractData(shiftedData, Data, inverseIndices, xSize, tSize);
On my particular platform (Tiger Lake) I see only a minimal positive effect. However, when I reduce the matrix to a size that fits comfortably into the L2 cache (128 x 192), performance improves by 18% over the cached indices above, and the cached indices in turn have a 7% advantage over not caching the division.
If the index pattern works out for it, Eigen also implements multiplication with a PermutationMatrix or Transpositions, but these can only shuffle full rows or columns at once.
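To illustrate, a tiny sketch of that route, assuming (hypothetically) that your index pattern happens to move whole rows at once; rowTargets is an invented name for a permutation of 0..tSize-1:

// Hypothetical: when the shuffle moves entire rows, it can be expressed
// as a permutation-matrix product instead of an element-wise scatter.
Eigen::PermutationMatrix<Eigen::Dynamic> perm(tSize);
perm.indices() = rowTargets; // rowTargets: VectorXi, invented for this sketch
shiftedData = perm * Data;   // multiplying on the left shuffles full rows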