c++ cblas intel-mkl linkage

MKL and OpenBLAS interactions - a question about linking


I'm using a binary (R) that dynamically links to a generic version of BLAS; in a lot of cases this is OpenBLAS.

Now, inside R, I'm dynamically loading another shared library (libtorch.so) essentially using dlopen(). Turns out libtorch statically links to MKL BLAS.

My understanding of static and dynamic linking is that this shouldn't be a problem: since libtorch is statically linked to MKL, code inside libtorch should always prefer its own symbols over similarly named symbols that might be dynamically loaded.

Indeed, this seems to be the usual behavior. For instance, if I take BLAS and LibTorch out of the picture, I can compile an executable that links to a shared library libA, implementing e.g. print(), and to another shared library libB that is statically linked to libA. When calling code from libB, it will correctly call the definitions from its own version of libA.

But that doesn't happen with libtorch/MKL and OpenBLAS. If I compile an executable that dynamically links to both libtorch and OpenBLAS, then libtorch starts using OpenBLAS routines instead of the statically linked MKL ones.

For instance:

#0  0x00007ffff5537da0 in sgemm_ () from /lib/x86_64-linux-gnu/libopenblas.so.0
#1  0x00007fffde5385d6 in at::native::cpublas::gemm(at::native::TransposeType, at::native::TransposeType, long, long, long, float, float const*, long, float const*, long, float, float*, long) () from /home/rstudio/data/torch/build-lantern/libtorch/lib/libtorch_cpu.so
#2  0x00007fffde67c139 in at::native::addmm_impl_cpu_(at::Tensor&, at::Tensor const&, at::Tensor, at::Tensor, c10::Scalar const&, c10::Scalar const&) () from /home/rstudio/data/torch/build-lantern/libtorch/lib/libtorch_cpu.so
#3  0x00007fffde67d475 in at::native::structured_mm_out_cpu::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) ()
   from /home/rstudio/data/torch/build-lantern/libtorch/lib/libtorch_cpu.so
#4  0x00007fffdf42309b in at::(anonymous namespace)::wrapper_CPU_mm(at::Tensor const&, at::Tensor const&) ()
   from /home/rstudio/data/torch/build-lantern/libtorch/lib/libtorch_cpu.so
#5  0x00007fffdf423123 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CPU_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /home/rstudio/data/torch/build-lantern/libtorch/lib/libtorch_cpu.so
#6  0x00007fffdf1eaa70 in at::_ops::mm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from /home/rstudio/data/torch/build-lantern/libtorch/lib/libtorch_cpu.so

This happens even though libtorch_cpu.so includes its own version of sgemm_, e.g.:

nm libtorch/lib/libtorch_cpu.so | grep "T sgemm_"
0000000006c531b0 T sgemm_
0000000006c53870 T sgemm_64
0000000006c53870 T sgemm_64_

My question is: under what circumstances can symbols from a dynamically loaded library take precedence over those of a statically linked library? I'm surely missing something important here, and any advice will be extremely helpful.

Reproducible example:
#include <torch/torch.h>
#include <iostream>
#include <cblas.h>

extern "C" void execute () {
  for (auto i = 1; i < 10; i++) {
    torch::Tensor tensor = torch::randn({2000, 2000});
    auto k = tensor.mm(tensor);  
  }
}

int main() {
  
  int m = 3; // rows of A
  int n = 3; // cols of A
  
  // Matrix A (m x n) in row-major order
  double A[] = {1.0, 2.0, 3.0,
                4.0, 5.0, 6.0,
                7.0, 8.0, 9.0};
  
  // Vector x (size n)
  double x[] = {1.0, 1.0, 1.0};
  
  // Result vector y (size m), initially zero
  double y[] = {0.0, 0.0, 0.0};
  
  // Scalar multipliers
  double alpha = 1.0, beta = 0.0;
  
  // Perform y = alpha * A * x + beta * y
  cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n, alpha, A, n, x, 1, beta, y, 1);

  execute();
  
  return 0;
}

With a CMakeLists.txt

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)

find_package(Torch REQUIRED)
find_package(BLAS)

add_executable(example example.cpp)
target_link_libraries(example "${TORCH_LIBRARIES}" "${BLAS_LIBRARIES}")
set_property(TARGET example PROPERTY CXX_STANDARD 17)

LibTorch can be obtained from the PyTorch website, which provides a direct download link.

To run

mkdir build && cd build
cmake .. -DCMAKE_PREFIX_PATH=<path to libtorch>
cmake --build .

Solution

  • My question is: under what circumstances can symbols from a dynamically loaded library take precedence over those of a statically linked library? I'm surely missing something important here, and any advice will be extremely helpful.

    I don't think that's what is happening in your case. Indeed, I don't think it can happen -- static linking resolves symbols at link time, not run time.

    Instead, I think there is a confusion about the nature of the LibTorch shared library you're working with. My examination shows that it embeds many MKL routines, and provides both normal and dynamic symbols for them. That does not imply that references to those symbols from elsewhere in the library are pre-resolved. In fact, it suggests that the LibTorch DSO was built against MKL DSOs, which rules out static linking in the sense you seem to have in mind. LibTorch also provides sgemm and sgemm_ as (relocatable) dynamic symbols, so you can expect that references to these, even from within that same DSO, will be resolved by the dynamic linker.

    Dynamic linking is complicated and sometimes unintuitive. My usual reference for how it works with ELF is Ulrich Drepper's paper How to Write Shared Libraries, which is somewhat more approachable than the ELF specifications themselves. As it pertains to your problem, the detail to consider is the lookup scope for dynamic symbol resolution, which Drepper discusses in his section 1.5.4.

    For any given symbol lookup, the lookup scope can be regarded as an ordered sequence of loaded shared objects to be searched. The order of SOs within is affected by several factors, but the main one is the sequence of the DT_NEEDED entries in each shared object. To a first approximation, the executable is first, then its own directly DT_NEEDED objects, then their DT_NEEDED objects, etc, in breadth-first order.
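The breadth-first ordering described above can be modeled in a few lines. This is a toy sketch, not how the dynamic linker is actually implemented: the Object struct, the function names, and the dependency layout (including the assumption that libtorch_cpu.so is only an indirect dependency, reached through libtorch.so) are hypothetical and chosen purely for illustration.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Toy model of a loaded object: the names it lists as DT_NEEDED (in order)
// and the dynamic symbols it defines. All names here are hypothetical.
struct Object {
    std::vector<std::string> needed;
    std::set<std::string> defines;
};

using ObjectMap = std::map<std::string, Object>;

// Build the lookup scope: the executable first, then its dependencies in
// breadth-first order, with each object appearing only once.
std::vector<std::string> lookup_scope(const ObjectMap& objects,
                                      const std::string& executable) {
    std::vector<std::string> scope;
    std::set<std::string> seen{executable};
    std::queue<std::string> pending;
    pending.push(executable);
    while (!pending.empty()) {
        std::string name = pending.front();
        pending.pop();
        scope.push_back(name);
        auto it = objects.find(name);
        if (it == objects.end()) continue;
        for (const auto& dep : it->second.needed) {
            if (seen.insert(dep).second) pending.push(dep);
        }
    }
    return scope;
}

// Return the first object in the scope that defines `symbol`, or "" if none.
std::string resolve(const ObjectMap& objects,
                    const std::vector<std::string>& scope,
                    const std::string& symbol) {
    for (const auto& name : scope) {
        auto it = objects.find(name);
        if (it != objects.end() && it->second.defines.count(symbol)) return name;
    }
    return "";
}

// Assumed layout for illustration only (not verified against the real
// libraries): the executable needs libtorch.so and libopenblas.so.0, and
// libtorch.so in turn needs libtorch_cpu.so.
ObjectMap example_objects() {
    ObjectMap objects;
    objects["example"].needed = {"libtorch.so", "libopenblas.so.0"};
    objects["libtorch.so"].needed = {"libtorch_cpu.so"};
    objects["libtorch_cpu.so"].defines = {"sgemm_"};
    objects["libopenblas.so.0"].defines = {"sgemm_", "cblas_dgemv"};
    return objects;
}
```

Calling lookup_scope(example_objects(), "example") and then resolve(..., "sgemm_") on the result shows how, in this model, a direct dependency of the executable is searched before an indirect one, even when the indirect one belongs to a library listed earlier on the link line.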

    If you want to observe a shared library providing a symbol itself but having its own references to the symbol resolve to one provided by a different object, then the easiest way to achieve it is for the executable to provide the same symbol. Example:

    main.c

    #include <stdio.h>
    
    int share(void);
    
    int do_something(void) {
        printf("in main\n");
        return 0;
    }
    
    int main(void) {
        share();
        return 0;
    }
    

    shared.c

    #include <stdio.h>
    
    int do_something(void) {
        printf("in shared\n");
        return 0;
    }
    
    int share(void) {
        return do_something();
    }
    

    Makefile

    CFLAGS= -fpic
    LIBS = -L. -ldemo
    MAIN_OBJS = main.o
    
    all: prog libdemo.so
    
    prog: $(MAIN_OBJS) libdemo.so
        $(CC) -o $@ $(MAIN_OBJS) $(LIBS)
    
    libdemo.so: shared.o
        $(CC) -o $@ -shared $^
    

    Commands:

    $ make
    cc -fpic   -c -o main.o main.c
    cc -fpic   -c -o shared.o shared.c
    cc -o libdemo.so -shared shared.o
    cc -o prog main.o -L. -ldemo
    $ LD_LIBRARY_PATH=$(pwd) ./prog 
    in main
    $
    

    Note that in this case, both the main program, prog, and the shared library, libdemo.so, provide a function do_something(). When the function share() in the shared library is called, it is the do_something() from the main program that it calls.

    Something similar can be achieved when the symbol in question is defined by different shared libraries, but not by the main executable. Whichever shared library occurs first in the breadth-first traversal of the program's direct and indirect dependencies is the one that will (normally) provide the definition to all DSOs involved. And usually, that's what you want -- the same function definition is used by all calls anywhere within the overall dynamically-linked program.
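To see which object wins for a given name in a running process, one option is to ask the dynamic linker directly. The sketch below uses dlsym() with RTLD_DEFAULT followed by dladdr(), both of which are glibc extensions also available on several other platforms. The helper name providing_object is hypothetical. It reports what the default global search finds, which may differ from what a particular library binds to if that library was built or loaded with options that alter its lookup scope.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes RTLD_DEFAULT and dladdr() on glibc
#endif
#include <dlfcn.h>
#include <string>

// Look `symbol` up using the default global search order and return the path
// of the object that provides the definition found there. Returns an empty
// string if the symbol is not found or its origin cannot be determined.
// (Hypothetical helper, for illustration only.)
std::string providing_object(const char* symbol) {
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == nullptr) return "";
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr) return "";
    return info.dli_fname;
}
```

In the question's example program, printing providing_object("sgemm_") from inside execute() would be one way to check which library's definition comes first in the global scope. On older glibc versions the program may need to be linked with -ldl.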

    There are additional considerations that I haven't covered, and dlopen() in particular brings several of them, but they just add possible modulations and exceptions to the same general picture I've presented. They don't fundamentally change dynamic name resolution behavior.