rtidyrreshape2spreaddcast

Spread or dcast and fill in counts


Probably a basic question.

I have a key - value data.frame(df below):

features <- paste0("f",1:5)
set.seed(1)
ids <- paste0("id",1:10)

df <- do.call(rbind,lapply(ids,function(i){
  data.frame(id = i, feature = sample(features,3,replace = F))
}))

And I want to tidyr::spread or reshape2::dcast it, so that the rows are id' the columns are feature, but the values are the sum of featuresfor eachid`.

A simple:

reshape2::dcast(df, id ~ feature)

Doesn't achieve that. It just fills in with features and NAs

adding fun.aggregate = sum to the command above results with an error:

> reshape2::dcast(df, id ~ feature, fun.aggregate = sum)
Using feature as value column: use value.var to override.
Error in .fun(.value[0], ...) : invalid 'type' (character) of argument

And tidyr::spread also results with an error:

tidyr::spread(df, key = id, value = feature)

Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 30 rows:

Any idea?


Solution

  • I think you want to count the features and not sum them. Try with the function length.

    tidyr::pivot_wider(df, names_from = feature, 
                values_from = feature, values_fn = length, values_fill = 0)
    

    Or with dcast.

    library(data.table)
    dcast(setDT(df), id~feature, value.var = 'feature', fun.aggregate = length)
    

    In base R, using table(df) would give the same output.

    table(df)
    
    #     feature
    #id     f1 f2 f3 f4 f5
    #  id1   1  0  1  1  0
    #  id10  1  0  1  1  0
    #  id2   1  1  0  0  1
    #  id3   0  1  1  1  0
    #  id4   1  0  1  0  1
    #  id5   1  1  0  0  1
    #  id6   1  1  1  0  0
    #  id7   1  0  0  1  1
    #  id8   1  1  0  0  1
    #  id9   0  1  0  1  1