
How to run the psych package in parallel?


I'm using the psych package to compute tetrachoric correlations for a very large dataset, comprising 1000 variables and 288,059 cases.

The data can be downloaded here:

https://www.dropbox.com/s/iqwgdywqfjvlkku/data.csv.zip?dl=0 (4MB)

My code looks like the following:

library(psych)
library(tidyverse)

temp = read.csv("~/Temp/data.csv", sep=",")

tetravalues = tetrachoric(temp, delete=FALSE)

tetraframe = tetravalues$rho

write.csv(tetraframe, file="~/Temp/output.csv")

This part of the code has now been running for 8 hours and still hasn't finished:

tetravalues = tetrachoric(temp, delete=FALSE)

According to the psych package manual (tetrachoric):

This is a computationally intensive function which can be speeded up considerably by using multiple cores and using the parallel package. The number of cores to use when doing polychoric or tetrachoric may be specified using the options command. The greatest step up in speed is going from 1 cores to 2. This is about a 50% savings. Going to 4 cores seems to have about at 66% savings, and 8 a 75% savings. The number of parallel processes defaults to 2 but can be modified by using the options command: options("mc.cores"=4) will set the number of cores to 4.

My laptop has 10 cores.

I'm new to R, and I haven't been able to figure out how to run my code in parallel.

Any ideas are appreciated.


Solution

  • You basically provided the answer to your question yourself.

    You can adjust the number of cores in the code below. If you want to keep using your laptop for other things while the computation is running, I would not set the number of cores to the maximum.

    Here is a quick intro about parallel computing in R.
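
    The advice about leaving some cores free can be sketched as follows. This is only a minimal illustration: the "keep two cores free" policy and the names n_detected and n_workers are my own assumptions, not something the psych manual prescribes. detectCores() comes from the base parallel package and reports logical cores, so the number may be higher than the physical core count.

    ```r
    library(parallel)

    # Logical cores reported by the OS; can be NA on some platforms.
    n_detected <- detectCores()

    # Hypothetical policy: keep two cores free for other work,
    # but never drop below one worker.
    n_workers <- if (is.na(n_detected)) 1L else max(1L, n_detected - 2L)

    # Per the manual excerpt quoted in the question, this is the option
    # that controls how many cores tetrachoric() uses.
    options(mc.cores = n_workers)
    getOption("mc.cores")
    ```

    On a 10-core laptop this would typically end up requesting 8 workers, assuming detectCores() reports 10.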

    library(psych)
    library(tidyverse)
    
    # Pick the number of cores here; set this before calling tetrachoric().
    options("mc.cores"=4)
    
    temp = read.csv("~/Temp/data.csv", sep=",")
    
    tetravalues = tetrachoric(temp, delete=FALSE)
    
    tetraframe = tetravalues$rho
    
    write.csv(tetraframe, file="~/Temp/output.csv")
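
    Before committing to another multi-hour run, it may be worth timing tetrachoric() on a small problem with a few different core settings. The sketch below uses simulated binary data as a stand-in for the real file, so the numbers it prints say nothing about your dataset; the helper time_with_cores and the sizes (2000 cases, 20 variables) are hypothetical choices for illustration. I am also assuming a Unix-like system (macOS or Linux), since I have not checked how the package behaves on Windows.

    ```r
    library(psych)

    set.seed(1)
    # Simulated stand-in for the real data: 2000 cases, 20 binary variables.
    sim <- as.data.frame(
      matrix(rbinom(2000 * 20, size = 1, prob = 0.5), ncol = 20)
    )

    # Hypothetical helper: time one tetrachoric() call with a given core count,
    # then restore whatever mc.cores setting was in place before.
    time_with_cores <- function(n_cores, data) {
      old <- options(mc.cores = n_cores)
      on.exit(options(old))
      system.time(tetrachoric(data, delete = FALSE))[["elapsed"]]
    }

    timings <- sapply(c(1, 2, 4), time_with_cores, data = sim)
    names(timings) <- c("1 core", "2 cores", "4 cores")
    print(timings)  # elapsed seconds per setting
    ```

    To try this on the real data, replace sim with a column subset such as temp[, 1:50]. The work grows with the number of variable pairs: 50 variables give choose(50, 2) = 1225 pairs, while all 1000 give choose(1000, 2) = 499500, roughly 408 times as many. Multiplying the subset timing by that factor gives a rough estimate for the full run, not a guarantee.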