arrays, r, lapply, n-dimensional

Creating a customisable n-dimensional array


This is two questions in one; if I should be splitting them, please let me know.

I have a spreadsheet of HR data that I'm going to cut into various cross-sections. Each row currently represents an employee in a particular reporting year (so over a three-year period an employee appears three times, and a column records which year the row refers to), along with a series of other characteristics. I've also added a field showing how many FTEs that employee represents for the period, which is that employee's exposure to risk.

What I'm trying to do, for the sake of marrying it up with other data, is create an n-dimensional array where each cell holds the total exposure to risk for that combination of dimension values. In the example I'm using, the dimensions are Year, Company [there are a couple], Age Band, Gender, Division, Tenure band.

To do so, among other code, I've written the following:

FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
FactorDim <- lapply(length,mapply(unique,HR[FactorNames]))
Names <- lapply(HR[FactorNames], function(x)sort(unique(x)))

 Index <- 1
 for (Ten in 1:FactorDim[6]){
   for (Job in 1:FactorDim[5])  {
     for (Sex in 1:FactorDim[4]) {
       for (Age in 1:FactorDim[3]) {
         for (Co in 1:FactorDim[2]) {
           for (Year in 1:FactorDim[1]) {
             ExpList[Index] = sum(subset(HR,
                                         HR$FY == Names[1,Year],
                                         HR$`HR Business` == Names[2, Co],
                                         HR$`Age Band` == Names[3, Age],
                                         HR$Gender == Names[4, Sex],
                                         HR$`Classification Level 1` ==  Names[5,Job],
                                         HR$`Tenure Band` == Names[6,Ten],
                                         select=Exposure),
                                  na.rm=TRUE)
             Index <- Index + 1
           }
         }
       }
     }
   }
 }

There are two main issues.

  1. Names <- lapply(HR[FactorNames], function(x)sort(unique(x))) is incorrect: lapply(HR[FactorNames], function(x)sort(unique(x))) seems to return the unique values as a single combined element rather than as a vector. As a result, the body of my for loops throws Error in Names[1, Year] : incorrect number of dimensions.
  2. There's no way that my nested for loops are even close to the optimal way to fill my array, and I was wondering if anyone knows what is.

What would you recommend?


Solution

  • I made up some data

    # make fake data
    set.seed(1)  # fix the random draws so the example is reproducible
    FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
    d <- as.data.frame(lapply(FactorNames, function(x) paste(x, sample(1:3, 6, replace = TRUE))))
    names(d) <- FactorNames
    d$Name <- c('z','y','x','w','v','z')
    d$Exposure <- randu[1:6, 1]  # randu is a built-in dataset of pseudo-random numbers
    

    From what I understand, your for loops are meant to produce something like the sum_val column below: the sum of all Exposure values for each combination of Name and all the factors.

    # get sum
    library(dplyr) # %>% pipe, group_by, and summarize
    d %>% 
     group_by(Name, FY, `HR Business`, `Age Band`, Gender, `Classification Level 1`, `Tenure Band`) %>% 
     summarize(sum_val = sum(Exposure))
    

    To make an n-dimensional array instead, look at acast from the reshape2 package, with a formula like factor1 ~ factor2 ~ factor3, using one ~ between each pair of dimensions.

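    # lazy way to write out each of the factors:
    # wrap every name in backticks so names containing spaces survive as.formula()
    quoteFN <- sprintf('`%s`', c('Name', FactorNames))
    concatFN <- paste(quoteFN, collapse = " ~ ")
    
    # collapse into array, summing Exposure within each cell
    out <- reshape2::acast(d, as.formula(concatFN), value.var = 'Exposure', fun.aggregate = sum)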
    
    # what does it look like
    dimnames(out)
    dim(out)
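
For completeness, here is a minimal base-R sketch that touches both issues from the question without extra packages. It is not the approach above, just an alternative under stated assumptions: the HR data frame below is a made-up stand-in with only three of the six factors, and ExpArray and first_year are hypothetical names. On the first issue, lapply returns a list, so an element is reached with [[ ]] followed by [ ] rather than with matrix-style indexing. On the second, a single tapply call over the factor columns can stand in for the nested loops.

```r
# Toy stand-in for the HR sheet (hypothetical values; three factors only)
HR <- data.frame(
  FY            = c(2015, 2015, 2016, 2016, 2016),
  `HR Business` = c("A", "B", "A", "A", "B"),
  Gender        = c("F", "M", "F", "F", "M"),
  Exposure      = c(1.0, 0.5, 0.8, 0.2, NA),
  check.names   = FALSE   # keep "HR Business" rather than "HR.Business"
)
FactorNames <- c("FY", "HR Business", "Gender")

# Issue 1: lapply() returns a list, so use [[ ]] to pick the factor
# and [ ] to pick a level within it
Names      <- lapply(HR[FactorNames], function(x) sort(unique(x)))
FactorDim  <- lengths(Names)      # number of distinct levels per factor
first_year <- Names[[1]][1]       # instead of Names[1, 1]

# Issue 2: one tapply() call sums Exposure for every combination of levels
ExpArray <- tapply(HR$Exposure, HR[FactorNames], sum, na.rm = TRUE)
ExpArray[is.na(ExpArray)] <- 0    # combinations with no rows come back as NA

dim(ExpArray)                     # one extent per factor
dimnames(ExpArray)                # level labels, named by factor
ExpArray["2016", "A", "F"]        # total exposure for one cell
```

Cells are addressed by level label, so the array lines up with other data through dimnames rather than through a running index. With the real data, FactorNames would simply list all six columns.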