rdataframe

Finding a value of a dataset in different ones


I need some help with this problem, which I am unable to resolve.

I have a dataframe called Base, which contains columns such as ID_1, organization_1, and year_beginning1. In the organization_1 column, I will have names like School X, School Y, and School Z, which correspond to dataframes of these organizations. In these dataframes, such as the one for School X, I also have columns ID and Start Year of the Program.

I want R to look up these dataframes based on the organization_1 column and, using the ID column, retrieve the number in the Start Year of the Program column and insert it into the Year_beginning1 column.

The code I tred:

    map <- list(
  "School X" = SchoolX,
  "School Y" = SchoolY,
  "School Z" = SchoolZ
)

    for (i in 1:nrow(Base)) {
  nome_org <- Base$Organization1[i] 
  df_org <- map[[name_org]] 
  
  if (!is.null(df_org)) {  
    id_base <- Base$ID_1[i]  

    Year_beginning <- df_org$`Start year of the program`[df_org$ID == id_base]

    if (length(Year_beginning) > 0) {
      Base$Year_beginning1[i] <- Year_beginning
    }
  }
}

Solution

  • From reading your description, this is what I suspect your data may look like:

    Base <- data.frame(ID_1=1:3, org_1=c("School X", "School Y", "School Z"), yearbeg_1=2000+1:3)
    Base
    #   ID_1    org_1 yearbeg_1
    # 1    1 School X      2001
    # 2    2 School Y      2002
    # 3    3 School Z      2003
    School_X <- data.frame(ID=c(1L, 5L), startyear=1998+c(1,5))
    School_Y <- data.frame(ID=c(2L, 9L), startyear=1998+c(2,9))
    School_Z <- data.frame(ID=c(3L, 10L), startyear=1998+c(3,10))
    

    I think this can easily be a "JOIN" operation, but to do it more easily we need to have the schools combined into one frame, with its org as a column.

    I'll use dplyr here, it is easy to read/understand, is well documented, and very heavily used. However, this can mostly be done in data.table (about as easily, though it has a steeper learning curve), or in base R with a little more work and care.

    Also, I'm going to assume for this example that your Base references "School X" (space) and the other frame is named something like School_X (underscore); in the call to bind_rows, the name of the argument "School X"= is intended to match what is found in Base.

    library(dplyr) # bind_rows, mutate, left_join
    Schools <- bind_rows(
      "School X" = School_X,
      "School Y" = School_Y,
      "School Z" = School_Z,
      .id="org")
    Schools
    #        org ID startyear
    # 1 School X  1      1999
    # 2 School X  5      2003
    # 3 School Y  2      2000
    # 4 School Y  9      2007
    # 5 School Z  3      2001
    # 6 School Z 10      2008
    

    From here, it's a simple "JOIN" operation. The "left" join ensures that all rows on the LHS (Base here) are preserved, and any non-matching rows in Schools will not be included in the output (notice that each of my schools have IDs that are not desired, and those IDs are not included in this output).

    left_join(Base, Schools, join_by(org_1 == org, ID_1 == ID))
    #   ID_1    org_1 yearbeg_1 startyear
    # 1    1 School X      2001      1999
    # 2    2 School Y      2002      2000
    # 3    3 School Z      2003      2001
    

    For more information on JOINs in general, see