r concatenation large-data

Splitting concatenated data in a large dataset: looking for improvements in computation time


I am working with a large dataframe in which one variable concatenates several pieces of information. I need to extract these values for my analysis. This could be done with the substr() function, but the observations vary in length, which makes that less than ideal. I have some working code, but given the size of my dataset it takes quite a while to run, and I was wondering whether there is a better way.

For example, an 8-digit data point of "97110770" contains the information cong = 97, memberid = 11077, chamber = 0.

But some values are 9 digits long, like "114123411", which would have cong = 114, memberid = 12341, chamber = 1.

To remedy this, I thought to add a "0" to the front of the shorter values so they would all be the same length (the number of digits after the leading value is consistent, so it works properly) and came up with this code:

(Dataframe: data, Vector of interest: speakerid)

library(dplyr)  # provides select()

# Accumulators for the extracted pieces (substr() returns character)
congv <- vector(mode = "character")
memberidv <- vector(mode = "character")
chamberv <- vector(mode = "character")

for (i in seq_along(data$speakerid)) {
  # Left-pad 8-digit IDs with "0" so every ID has 9 characters
  if (nchar(data$speakerid[i]) == 8) {
    data$speakerid[i] <- paste0("0", data$speakerid[i])
  }

  # Pull out the three fixed-width fields
  congv <- append(congv, substr(data$speakerid[i], 1, 3))
  memberidv <- append(memberidv, substr(data$speakerid[i], 4, 8))
  chamberv <- append(chamberv, substr(data$speakerid[i], 9, 9))
}

# Attach the new columns and keep only the ones needed
data <- cbind(data, cong = congv, memberid = memberidv, chamber = chamberv)
data <- select(data, "memberid", "count", "chamber", "cong")

rm(list = c("congv", "memberidv", "chamberv"))

This works, but with over 100k observations it takes a lot longer than I would like, and I might need to include more data later. Is there a more efficient way of doing this, perhaps one that avoids the for loop with an if inside it?

If not, I'm fairly new to R and would appreciate any other tips you may have about my code, thanks!

Example dataframe:

speakerid = c(97110770, 114123411, 93123770, 112123410, 93123770)
count = c(12, 3, 4, 0, 4)
data = data.frame(speakerid, count)

Solution

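  • Here's an alternative that doesn't use any strings at all. It treats speakerid as a number and splits it with integer division (%/%) and modulo (%%), both of which work on the whole column at once: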

    library(data.table)
    # %/% 1e6 keeps the leading digits, %% 10 keeps the last digit
    setDT(data)[, {congv = speakerid %/% 1e6
      .(memberid = (speakerid - congv * 1e6) %/% 10, count, chamberv = speakerid %% 10, congv)}]
    

    Output

       memberid count chamberv congv
          <num> <num>    <num> <num>
    1:    11077    12        0    97
    2:    12341     3        1   114
    3:    12377     4        0    93
    4:    12341     0        0   112
    5:    12377     4        0    93
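
If you would rather stay close to the original string approach and not depend on data.table, the padding and the substr() calls can probably be done on the whole column at once, since sprintf() and substr() are both vectorized in base R. This is only a minimal sketch, assuming speakerid is still numeric, holds whole numbers, and never has more than 9 digits; the names `padded` and `result` are just illustrative.

```r
# Example data from the question
speakerid <- c(97110770, 114123411, 93123770, 112123410, 93123770)
count <- c(12, 3, 4, 0, 4)
data <- data.frame(speakerid, count)

# Left-pad every ID to 9 digits in a single vectorized call
# (assumption: IDs are whole numbers with at most 9 digits)
padded <- sprintf("%09.0f", data$speakerid)

# substr() works on the whole vector, so no loop or append() is needed;
# as.integer() turns "097" into 97 and so on
result <- data.frame(
  memberid = as.integer(substr(padded, 4, 8)),
  count    = data$count,
  chamber  = as.integer(substr(padded, 9, 9)),
  cong     = as.integer(substr(padded, 1, 3))
)

print(result)
```

Drop the as.integer() calls if you want to keep the pieces as zero-padded strings, as in the original loop.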