rdata.tabledcast

How do I use dcast and keep data from multiple years?


I have a .csv table with world bank development data from the years 2016-2020. In the beginning I only used data from 2020, however I had to add years as for some of the variables the data was not available. The .csv table contained countrycodes and series names and the values for each year. So for example

Country.Code Series.Name
Arg Gini index
Arg Population

I used the dcast function like this:

ControlM <- dcast(Control, Country.Code ~ Series.Name)

This worked and I had a table with only one row with the Country code and then columns with the values, but it only kept the values for 2020. Then I tried to specify, which columns it should keep by adding:

ControlM <- dcast(
  Control, 
  Country.Code ~ Series.Name, 
  value.var = c("2016", "2017", "2018", "2019", "2020")
)

However, I get the following Error:

Error in if (!(value.var %in% names(data))) { : 
  the condition has length > 1

I tried different solutions I found here. For example: Error using dcast with multiple value.var and converted the data frame to a data table first and setting value.var = NULL. The error message stays the same.

I would be really glad, if someone had a suggestion for me.

Edit: Thank you for your quick response! Here is the ouput for dput(head(Contro)):

structure(list(Country.Name = c("Argentina", "Argentina", "Argentina", 
"Argentina", "Argentina", "Armenia"), Country.Code = c("ARG", 
"ARG", "ARG", "ARG", "ARG", "ARM"), Series.Name = c("Gini index", 
"Trade (% of GDP)", "Population density (people per sq. km of land area)", 
"Population, total", "Educational attainment, at least completed post-secondary, population 25+, total (%) (cumulative)", 
"Gini index"), Series.Code = c("SI.POV.GINI", "NE.TRD.GNFS.ZS", 
"EN.POP.DNST", "SP.POP.TOTL", "SE.SEC.CUAT.PO.ZS", "SI.POV.GINI"
), `2016` = c("42", "26.0938878488799", "15.9281350828921", "43590368", 
"..", "32.5"), `2017` = c("41.1", "25.2896011376779", "16.0941907925267", 
"44044811", "..", "33.6"), `2018` = c("41.3", "30.7625359549926", 
"16.2585100979651", "44494502", "..", "34.4"), `2019` = c("42.9", 
"32.6306150458499", "16.4208266190179", "44938712", "..", "29.9"
), `2020` = c("42.3", "30.2197998857878", "16.580892611147", 
"45376763", "..", "25.2")), row.names = c(NA, -6L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x00000193c0bd5930>)

The problem is that I want the series names as column names. For example:

Country.Code Gini Index
ARG 50
ARM 50

as opposed to the example above. The problem is that the dcast function by default only keeps the data for 2020, but I need the data for 2016-2020. I hope that helps to clarify my problem. Oh and as to which packages I used I installed both weeks ago (data.table and reshape2) and am not sure, which one I used here. I am sorry, still really struggling with R :D


Solution

  • Your code was basically correct - the main problem was the function being masked by another package. If I run your code with reshape2::dcast() I get the error you posted. However, with data.table::dcast() it runs.

    If you can avoid it, it is advisable not to use reshape2 (i.e. not to call library(reshape2)) if you are using data.table. The former is a tidyverse package and the latter is not. Sometimes they play well together, sometimes they do not - as you have found out.

    In your case, the other thing I would do before casting to wide is clean the series names. "Educational attainment, at least completed post-secondary, population 25+, total (%) (cumulative)" is going to be awkward as a column name.

    Control[, series_clean := gsub(
        "(.*?)\\s.+", 
        "\\L\\1",
        Series.Name,
        perl = TRUE)
    ][, series_clean := sub("population,", "pop", series_clean)
    ][, series_clean := sub("population", "pop_density", series_clean)]
    
    Control[, unique(series_clean)]
    # [1] "gini"        "trade"       "pop_density" "pop"         "educational"
    

    If you just want to make the Gini Coefficient wide, you can do:

    ControlM <- data.table::dcast(
        Control[series_clean == "gini"],
        Country.Code ~ series_clean,
        value.var = c("2016", "2017", "2018", "2019", "2020")
    )
    

    If you want to do it for the whole data frame you can just remove the subset ([series_clean=="gini"]) from the dcast() call.

    Your data is in character rather than numeric format. You will want to change that as well:

    
    cols_to_make_numeric <- grep(
        "^\\d{4}",
        names(ControlM),
        value = TRUE
    )
    ControlM[, (cols_to_make_numeric) :=
        lapply(.SD, as.numeric),
    .SDcols = cols_to_make_numeric
    ]
    # Key: <Country.Code>
    #    Country.Code 2016_gini 2017_gini 2018_gini 2019_gini 2020_gini
    #          <char>     <num>     <num>     <num>     <num>     <num>
    # 1:          ARG      42.0      41.1      41.3      42.9      42.3
    # 2:          ARM      32.5      33.6      34.4      29.9      25.2