Tags: mariadb, doparallel, rmariadb

Failed Connections to MariaDB When Running a Process in Parallel


I am trying to process about 30k codes/records from a database in parallel in R using the doParallel and foreach packages. I have done this in other applications using an SQLite database and a SQL Server database. The database I am trying to use now is MariaDB, running locally on my Windows 11 computer. However, while looping through the records, about 3,500 of the attempts to connect to the database fail. The failure happens in the `dbConnect` call. This is the error message:

Failed to connect: Can't connect to server on 'localhost' (10048)

library(data.table)
library(foreach)
library(RMariaDB)
library(doParallel)

cl <- makeCluster(6)
registerDoParallel(cl)
results <- foreach(i = seq_along(cds), .packages = c("data.table", "RMariaDB"), .errorhandling = "pass") %dopar% {

  dbCon <- dbConnect(RMariaDB::MariaDB(), user=dbuser, password=dbpwd, dbname=databasename)

  #run queries, do calculations, etc.
  
  dbDisconnect(dbCon)
  return(dtset)
}
stopCluster(cl)

I can add a tryCatch that pauses for a few seconds after a failure before attempting again, and with that the loop completes without errors.

library(data.table)
library(foreach)
library(RMariaDB)
library(doParallel)

cl <- makeCluster(6)
registerDoParallel(cl)
results <- foreach(i = seq_along(cds), .packages = c("data.table", "RMariaDB"), .errorhandling = "pass") %dopar% {

  j <- 0
  cont <- TRUE
  while (cont) {
    fRet <- tryCatch({
      dbCon <- dbConnect(RMariaDB::MariaDB(), user = dbuser, password = dbpwd, dbname = databasename)
    },
    error = function(err) err)
    if (inherits(fRet, "error") && j < 5) {
      # back off a little longer on each failure, up to 5 retries
      Sys.sleep(1 + 3 * j)
      j <- j + 1
    } else {
      cont <- FALSE
    }
  }
  #run queries, do calculations, etc.
  
  dbDisconnect(dbCon)
  return(dtset)
}
stopCluster(cl)

The tryCatch workaround is fine temporarily (though it slows down execution), but there has to be a better way. Is there a setting that needs to be adjusted on the database?

  1. max_connections is set to 150, so I don't think that's the issue. (A different error message is returned when I exceed the maximum number of connections.)
  2. I have also tried raising thread_pool_size from 32 to 64, and that didn't help.
  3. I have a 16-core CPU (32 threads) and 128 GB of RAM, so I doubt hardware is the issue. I get the same number of errors whether I run it with 24 workers or with as few as 2 or 4. If I run the code using "%do%" instead of "%dopar%", so that it runs sequentially on a single core, it completes without errors.

Any help would be appreciated!


Solution

  • The error 10048 (WSAEADDRINUSE, "Address already in use") is a strong indication that you have exhausted the TCP ephemeral ports; by default Windows draws client ports from the dynamic range 49152–65535, so only about 16k are available.

    netstat -ano
    netstat -ano | findstr TIME_WAIT
    

    will give you an overview, and my bet is that there is a huge number of connections in the TIME_WAIT state. This happens when you rapidly create and tear down TCP connections.

    Try using a connection pool, or simply reuse your connections instead. If you can't, try using named-pipe connections; and if that is not possible either, decrease TcpTimedWaitDelay (a DWORD value, in seconds, under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters) as described here, or widen the dynamic port range with `netsh int ipv4 set dynamicport tcp`.
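
    A minimal sketch of the reuse approach, assuming the same `dbuser`, `dbpwd`, `databasename` and `cds` objects as in the question: open one connection per worker up front and close it at the end, so each of the ~30k tasks reuses an existing connection instead of opening (and discarding) its own. It needs a live MariaDB server to run.

    ```r
    library(data.table)
    library(foreach)
    library(doParallel)  # also loads the parallel package

    cl <- makeCluster(6)
    registerDoParallel(cl)

    # Make the credentials visible on the workers, then open ONE
    # connection per worker instead of one per task.
    clusterExport(cl, c("dbuser", "dbpwd", "databasename"))
    clusterEvalQ(cl, {
      library(RMariaDB)
      dbCon <- dbConnect(RMariaDB::MariaDB(),
                         user = dbuser, password = dbpwd, dbname = databasename)
      NULL  # don't ship the connection object back to the master
    })

    results <- foreach(i = seq_along(cds), .packages = "data.table",
                       .errorhandling = "pass") %dopar% {
      # dbCon already exists in this worker's global environment;
      # run queries, do calculations, etc. against it here.
    }

    # Close each worker's connection before shutting the cluster down.
    clusterEvalQ(cl, dbDisconnect(dbCon))
    stopCluster(cl)
    ```

    With six workers this keeps exactly six connections open for the whole run, so no ports churn through TIME_WAIT.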