multithreadingvalgrinddata-racethread-sanitizer

Why do Valgrind-based Helgrind thread error detector and Google ThreadSanitizer (TSan) report different data race detection results?


1.Background

① The C++ code in Listing 1 below aims at handling data race for the s variable that the Produce function (running on the main thread) and Consume function (running on the t thread created at the beginning of the main function) may run into.

Listing 1:

// a.cpp file
#include <chrono>                                               //01
#include <condition_variable>                                   //02
#include <iostream>                                             //03
#include <mutex>                                                //04
#include <thread>                                               //05
                                                                //06
std::string s = "";                                             //07
std::mutex m;                                                   //08
std::condition_variable cv;                                     //09
bool stop = false;                                              //10
                                                                //11
void Consume() {                                                //12
  while (true) {                                                //13
    std::unique_lock<std::mutex> ul{m};                         //14
    cv.wait(ul, []() { return stop || s != ""; });              //15
    if (stop) break;                                            //16
    std::cout << "Consume \"" << s << "\"\n";                   //17
    s = "";                                                     //18
    m.unlock();                                                 //19
    cv.notify_all();                                            //20
    m.lock();                                                   //21
  }                                                             //22
}                                                               //23
                                                                //24
void Produce(std::string s1) {                                  //25
  std::unique_lock<std::mutex> ul{m};                           //26
  cv.wait(ul, []() { return s == ""; });                        //27 
  s = s1;                                                       //28
  m.unlock();                                                   //29
  cv.notify_all();                                              //30
  m.lock();                                                     //31
}                                                               //32
                                                                //33
void Stop() {                                                   //34
  std::unique_lock<std::mutex> ul{m};                           //35
  cv.wait(ul, []() { return s == ""; });                        //36   
  stop = true;                                                  //37
  m.unlock();                                                   //38
  cv.notify_all();                                              //39
  m.lock();                                                     //40
}                                                               //41
                                                                //42
int main() {                                                    //43
  auto t = std::thread(Consume);                                //44
  Produce("A");                                                 //45
  Produce("B");                                                 //46
  Produce("C");                                                 //47
  Stop();                                                       //48
  t.join();                                                     //49
  return 0;                                                     //50
}                                                               //51 

No data race is detected by Google ThreadSanitizer (aka TSan) when compiling above C++ code with Ubuntu clang++ version 14.0.0-1ubuntu1.1, as shown in Listing 2 below.

Listing 2:

$ clang++ -fsanitize=thread -g -O1 -Wall a.cpp          #01
$ ./a.out                                               #02
Consume "A"                                             #03
Consume "B"                                             #04
Consume "C"                                             #05 
$                                                       #06

2.Concern

Possible data race is detected by Valgrind-based Helgrind thread error detector when compiling above C++ code with Ubuntu clang++ version 14.0.0-1ubuntu1.1, as shown in Listing 3 below.

Listing 3:

$ clang++ -g -O1 -Wall a.cpp                                                           //01
$ valgrind --tool=helgrind ./a.out                                                     //02
==<PID>== Helgrind, a thread error detector                                            //03
==<PID>== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.              //04
==<PID>== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info           //05
==<PID>== Command: ./a.out                                                             //06
==<PID>==                                                                              //07
==<PID>== ---Thread-Announcement------------------------------------------             //08
==<PID>==                                                                              //09
==<PID>== Thread #1 is the program's root thread                                       //10
==<PID>==                                                                              //11
==<PID>== ---Thread-Announcement------------------------------------------             //12
==<PID>==                                                                              //13
==<PID>== Thread #2 was created                                                        //14
==<PID>==  at 0x53729F3: clone                      (clone.S         : 76)             //15
==<PID>==  by 0x53738EE: __clone_internal           (clone-internal.c: 83)             //16
==<PID>==  by 0x52E16D8: create_thread              (pthread_create.c:295)             //17
==<PID>==  by 0x52E21FF: pthread_create@@GLIBC_2.34 (pthread_create.c:828)             //18
==<PID>==  by 0x4E13585: pthread_create_WRK         (hg_intercepts.c :445)             //19
==<PID>==  by 0x4E14A8C: pthread_create@*           (hg_intercepts.c :478)             //20
==<PID>==  by 0x50FD328: std::thread::_M_start_thread(                                 //21
                           std::unique_ptr<                                            //22
                             std::thread::_State,                                      //23
                             std::default_delete<std::thread::_State>                  //24
                           >,                                                          //25
                           void (*)()                                                  //26
                         )                                                             //27
                         (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30)            //28
==<PID>==  by 0x10A691: thread<void (&)(), void>    (std_thread.h:143)                 //22
==<PID>==  by 0x10A691: main (???:43)                                                  //23
==<PID>==                                                                              //24
==<PID>== ----------------------------------------------------------------             //25
==<PID>==                                                                              //26
==<PID>== Possible data race during read of size 4 at 0x10D110 by thread #1            //27
==<PID>== Locks held: none                                                             //28
==<PID>==  at 0x52E4F4A: pthread_mutex_lock@@GLIBC_2.2.5 (pthread_mutex_lock.c: 94)    //29
==<PID>==  by 0x4E1009A: mutex_lock_WRK                  (hg_intercepts.c     :937)    //30
==<PID>==  by 0x4E14E73: pthread_mutex_lock              (hg_intercepts.c     :960)    //31
==<PID>==  by 0x10A4D4 : __gthread_mutex_lock            (gthr-default.h      :749)    //32
==<PID>==  by 0x10A4D4 : lock                            (std_mutex.h         :100)    //33
==<PID>==  by 0x10A4D4 : lock                            (unique_lock.h       :139)    //34
==<PID>==  by 0x10A4D4 : unique_lock                     (unique_lock.h       : 69)    //35
==<PID>==  by 0x10A4D4 : Produce(                                                      //36
                           std::__cxx11::basic_string<                                 //37
                             char,                                                     //38
                             std::char_traits<char>,                                   //39
                             std::allocator<char>                                      //40
                           >                                                           //41
                         )                                (???                  :26)   //42
==<PID>==  by 0x10A6C5 : main (a.cpp:44)                                               //43
==<PID>==                                                                              //44
==<PID>== This conflicts with a previous write of size 4 by thread #2                  //45
==<PID>== Locks held: none                                                             //46
==<PID>==  at 0x52E696C: __pthread_mutex_unlock_usercnt  (pthread_mutex_unlock.c:  62) //47
==<PID>==  by 0x52E08AD: __pthread_cond_wait_common      (pthread_cond_wait.c   : 419) //48
==<PID>==  by 0x52E08AD: pthread_cond_wait@@GLIBC_2.3.2  (pthread_cond_wait.c   : 627) //49
==<PID>==  by 0x4E1390B: pthread_cond_wait_WRK           (hg_intercepts.c       :1291) //50
==<PID>==  by 0x4E14EAA: pthread_cond_wait@*             (hg_intercepts.c       :1318) //51
==<PID>==  by 0x10A3A9 : wait<(lambda at a.cpp:15:17)>   (condition_variable    : 103) //52
==<PID>==  by 0x10A3A9 : Consume()                       (???                   :  15) //53
==<PID>==  by 0x10A833 : __invoke_impl<void, void (*)()> (invoke.h              :  61) //54
==<PID>==  by 0x10A833 : __invoke<void (*)()>            (invoke.h              :  96) //55
==<PID>==  by 0x10A833 : _M_invoke<0UL>                  (std_thread.h          : 259) //56
==<PID>==  by 0x10A833 : operator()                      (std_thread.h          : 266) //57
==<PID>==  by 0x10A833 : std::thread::_State_impl<                                     //58
                           std::thread::_Invoker<                                      //59
                             std::tuple<void (*)()>                                    //60
                           >                                                           //61
                         >::_M_run()                     (std_thread.h          : 211) //62
==<PID>==  by 0x50FD252: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30)        //63
==<PID>==  by 0x4E13779: mythread_wrapper                (hg_intercepts.c       : 406) //64
==<PID>==                                                                              //65
==<PID>==  Address 0x10d110 is 8 bytes inside data symbol "m"                          //66
==<PID>==                                                                              //67
==<PID>== ----------------------------------------------------------------             //68

<-- Text Truncated -->

==<PID>== Use --history-level=approx or =none to gain increased speed, at
==<PID>== the cost of reduced accuracy of conflicting-access information
==<PID>== For lists of detected and suppressed errors, rerun with: -s
==<PID>== ERROR SUMMARY: 71 errors from 36 contexts (suppressed: 0 from 0)
$

3.Question

Which of Google TSan or Valgrind Helgrind reports is correct, and why?


Solution

  • These are cases where the C++ standard library is using pthreads in a way that looks like an error (at least in the eyes of the authors of DRD and Helgrind at the time that they were written).

    See the answer to this question for an explanation and example of what DRD and Helgrind are expecting: understanding of pthread_cond_wait() and pthread_cond_signal().

    As is often the case it's tricky to "fix" this kind of thing without throwing out the baby with the bathwater.

    What I could do is add an option like

       --ignore-dubious=yes|no  Ignore dubious usage of pthread_cond_{signal|broadcast}
                                where the signaller is not holding the mutex [no]
    

    In the meantime you can suppress this kind of error.

    {
       dubious DRD condition signal/broadcast
       drd:CondRaceErr
       fun:pthread_cond_*
    }
    {
       dubious Helgrind condition signal/broadcast
       Helgrind:Dubious
       fun:pthread_cond*
    }