I am working on integrating the ARM ComputeLibrary into a project.
It's not an API whose semantics I'm familiar with, but I'm working my way through the docs and examples.
At the moment, I am trying to copy the contents of a `std::vector` into a `CLTensor`, and then run the ARMCL GEMM operation on it.
I've been building an MWE, shown below, with the aim of getting matrix multiplication working.
To get the input data from a standard C++ `std::vector`, or from a `std::ifstream`, I am trying an iterator-based approach, modelled on this example shown in the docs.
However, I keep getting a segfault.
There is an example of sgemm using `CLTensor` in the source, which I'm also drawing inspiration from. However, it gets its input data from Numpy arrays, so it isn't relevant for this step.
I'm not sure whether `CLTensor` and `Tensor` have disjoint methods in ARMCL, but I believe they share the common interface `ITensor`. Still, I haven't been able to find an equivalent example that uses `CLTensor` instead of `Tensor` for this iterator-based method.
You can see the code I'm working with below, which fails on line 64 (the `*reinterpret_cast...` line). I'm not entirely sure what operations it performs, but my guess is that the ARMCL iterator `input_it` is incremented n * m times, with each iteration setting the value of the `CLTensor` at that address to the corresponding input value; `reinterpret_cast` is presumably just there to make the types play nicely together?
I reckon my Iterator and Window objects are okay, but I can't be sure.
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/CL/CLFunctions.h"
#include "arm_compute/runtime/CL/CLScheduler.h"
#include "arm_compute/runtime/CL/CLTuner.h"
#include "utils/Utils.h"
namespace armcl = arm_compute;
namespace armcl_utils = arm_compute::utils;
int main(int argc, char *argv[])
{
int n = 3;
int m = 2;
int p = 4;
std::vector<float> src_a = {2, 1,
6, 4,
2, 3};
std::vector<float> src_b = {5, 2, 1, 6,
3, 7, 4, 1};
std::vector<float> c_targets = {13, 11, 6, 13,
42, 40, 22, 40,
19, 25, 14, 15};
// Provides global access to a CL context and command queue.
armcl::CLTuner tuner{};
armcl::CLScheduler::get().default_init(&tuner);
armcl::CLTensor a{}, b{}, c{};
float alpha = 1;
float beta = 0;
// Initialize the tensors dimensions and type:
const armcl::TensorShape shape_a(m, n);
const armcl::TensorShape shape_b(p, m);
const armcl::TensorShape shape_c(p, n);
a.allocator()->init(armcl::TensorInfo(shape_a, 1, armcl::DataType::F32));
b.allocator()->init(armcl::TensorInfo(shape_b, 1, armcl::DataType::F32));
c.allocator()->init(armcl::TensorInfo(shape_c, 1, armcl::DataType::F32));
// configure sgemm
armcl::CLGEMM sgemm{};
sgemm.configure(&a, &b, nullptr, &c, alpha, beta);
// // Allocate the input / output tensors:
a.allocator()->allocate();
b.allocator()->allocate();
c.allocator()->allocate();
// // Fill the input tensor:
// // Simplest way: create an iterator to iterate through each element of the input tensor:
armcl::Window input_window;
armcl::Iterator input_it(&a, input_window);
input_window.use_tensor_dimensions(shape_a);
std::cout << " Dimensions of the input's iterator:\n";
std::cout << " X = [start=" << input_window.x().start() << ", end=" << input_window.x().end() << ", step=" << input_window.x().step() << "]\n";
std::cout << " Y = [start=" << input_window.y().start() << ", end=" << input_window.y().end() << ", step=" << input_window.y().step() << "]\n";
// // Iterate through the elements of src_data and copy them one by one to the input tensor:
execute_window_loop(input_window, [&](const armcl::Coordinates & id)
{
std::cout << "Setting item [" << id.x() << "," << id.y() << "]\n";
*reinterpret_cast<float *>(input_it.ptr()) = src_a[id.y() * m + id.x()]; //
},
input_it);
// armcl_utils::init_sgemm_output(dst, src0, src1, armcl::DataType::F32);
// Configure function
// Allocate all the images
// src0.allocator()->import_memory(armcl::Memory(&a));
//src0.allocator()->allocate();
//src1.allocator()->allocate();
// dst.allocator()->allocate();
// armcl_utils::fill_random_tensor(src0, -1.f, 1.f);
// armcl_utils::fill_random_tensor(src1, -1.f, 1.f);
// Dummy run for CLTuner
//sgemm.run();
std::vector<float> lin_c(n * p);
return 0;
}
The part you've missed (which, admittedly, could be better explained in the documentation!) is that you need to map/unmap OpenCL buffers in order to make them accessible to the CPU.
If you look inside `fill_random_tensor` (which is what's used in the cl_sgemm example), you'll see a call to `tensor.map()`.
So if you `map()` your buffer before creating your iterator, then I believe it should work:
a.map();
armcl::Iterator input_it(&a, input_window);
execute_window_loop(...);
a.unmap(); // Don't forget to unmap the buffer before using it on the GPU
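Putting it together, the fill for `a` in your MWE might look something like the sketch below (untested here, and using the same names as your code; note the window is configured *before* the `Iterator` is constructed, matching the ordering in the docs' example, and the readback of `c` is the same map/unmap dance in reverse):

```cpp
// Fill `a` from src_a: map first so the CPU can touch the CL buffer.
a.map();
armcl::Window input_window;
input_window.use_tensor_dimensions(shape_a); // configure before making the Iterator
armcl::Iterator input_it(&a, input_window);
armcl::execute_window_loop(input_window, [&](const armcl::Coordinates &id)
{
    *reinterpret_cast<float *>(input_it.ptr()) = src_a[id.y() * m + id.x()];
},
input_it);
a.unmap(); // hand the buffer back to the GPU

// ... fill `b` the same way, then:
sgemm.run();

// Read the result back into lin_c: same pattern on `c`.
c.map();
armcl::Window out_window;
out_window.use_tensor_dimensions(shape_c);
armcl::Iterator out_it(&c, out_window);
armcl::execute_window_loop(out_window, [&](const armcl::Coordinates &id)
{
    lin_c[id.y() * p + id.x()] = *reinterpret_cast<float *>(out_it.ptr());
},
out_it);
c.unmap();
```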
Hope this helps.