I have five data frames in R, each with annual crime data for the years 2018-2022. There are 40 columns and thousands of observations. Each observation is one county in the U.S. 39 columns have crime counts for their respective county (different crime types), and 1 column has a 5-digit "FIPS" code for the county. Each data frame is of a different length, indicating that the same set of counties did not submit crime data for each year.
I need to determine which observations/counties submitted data for all five years, and filter each of the data frames to include only those observations. I would, in short, like each data frame to have only the observations with FIPS codes that are contained in all of the other data frames. I do not want to merge these data frames.
Thanks for any advice.
I tried setdiff(df1$FIPS , df2$FIPS), and then subset(df1, FIPS!= . . .) but with five data frames and dozens of relevant FIPS codes, this brute force method proved unworkable.
I tried filter(ifelse()), but this didn't seem to be a solution.
I tried intersect(), but that requires that values in every column be identical, not just one column. Is there an intersect() type of function but keyed to just one value in one column, not every value in every column? Or some other solution?
I also used this code for the 2022 data frame, which is not efficient at all:
CrimeAll <- merge(Crime2022 , Crime2021 , by = "FIPS")
CrimeAll2 <- merge(CrimeAll , Crime2020 , by = "FIPS")
CrimeAll3 <- merge(CrimeAll2 , Crime2019 , by ="FIPS")
CrimeAll4 <- merge(CrimeAll3 , Crime2018 , by = "FIPS")
setdiff(Crime2022$FIPS , CrimeAll4$FIPS)
Crime2022_1 <- subset(Crime2022 , FIPS!="01003" & FIPS!="01009" & FIPS!="01015" & FIPS!="01021"& FIPS!= "01027"& FIPS!= "01039"& FIPS!= "01045" & FIPS!="01049"& FIPS!= "01051" & FIPS!="01053"& FIPS!= "01055" & FIPS!="01059" & FIPS!="01069" & FIPS!="01071" & FIPS!="01081" & FIPS!="01083"& FIPS!= "01089" & FIPS!="01095" & FIPS!="01109" & FIPS!="01115" & FIPS!="01117" & FIPS!="01119" & FIPS!="01121" & FIPS!="01123" & FIPS!="01125" & FIPS!="01127" & FIPS!="01133" & FIPS!="04003"& FIPS!= "05059"& FIPS!= "05099" & FIPS!="08061"& FIPS!= "16073"& FIPS!= "18039" & FIPS!="18057" & FIPS!="19003" & FIPS!="19005" & FIPS!="19011" & FIPS!="19043" & FIPS!="19051" & FIPS!="19069"& FIPS!= "19091" & FIPS!="19107" & FIPS!="19115" & FIPS!="19119" & FIPS!="19133" & FIPS!="19151" & FIPS!="19177" & FIPS!="20043" & FIPS!="20063" & FIPS!="20101" & FIPS!="20107" & FIPS!="20147" & FIPS!="20183" & FIPS!="20197" & FIPS!="21063" & FIPS!="22001" & FIPS!="22035" & FIPS!="22047" & FIPS!="22055" & FIPS!="22097" & FIPS!="22111" & FIPS!="22117" & FIPS!="28073" & FIPS!="29015" & FIPS!="30001" & FIPS!="30045" & FIPS!="30069" & FIPS!="30075" & FIPS!="31073" & FIPS!="31161" & FIPS!="38087" & FIPS!="39059" & FIPS!="39067" & FIPS!="39125" & FIPS!="46009" & FIPS!="46015"& FIPS!= "46037" & FIPS!="46067" & FIPS!="46085" & FIPS!="46089" & FIPS!="46109" & FIPS!="48197" & FIPS!="48299" & FIPS!="49029" & FIPS!="53069" & FIPS!="54073" & FIPS!="54091" & FIPS!="54105" & FIPS!="55059")
Is there a better way?
Consider stacking all data frames with a year indicator, then assign with ave
a total count by county to later subset
to counties with 5 observations.
If needed, call split
to break out dataframe into a list of data frames by year. You lose no functionality with a data frame stored within a list instead of its own object. Heed @GregorThomas's good advice:
Don't ever create
d1
d2
d3
, ...,dn
in the first place. Create a listd
withn
elements.
# COMPILE ALL YEAR DATA FRAMES
CrimeAll <- do.call(
rbind,
lapply(
2018:2022, \(yr)
within(get(paste0("Crime", yr)), { year <- yr })
)
)
# ASSIGN AND SUBSET COUNTIES IN ALL 5 YEARS
CrimeAll <- within(
CrimeAll, { county_n <- ave(year, FIPS, FUN=length) }
) |> subset(
subset = county_n == 5, select = -county_n
)
# BUILD LIST OF YEAR SPLIT DATA FRAMES
CrimeList <- split(CrimeAll, CrimeAll$year)
CrimeList$`2018`
CrimeList$`2019`
...
CrimeList$`2022`