I am looking for a way to compare two different data frame column names (only certain ranges of columns) and keeping only the names that have the same shared similarity of column names between the two data frames. (I only want to compare each data frame from a certain point forward only though.
I would have a significantly larger number of columns, so i want to see what would work best in larger data sets.
Say i have DF1
ID | Location | Value2 | Value3 | Value4 |
---|---|---|---|---|
First | Park | 4 | 3 | 3 |
Second | House | 2 | 5 | 2 |
Second | Barn | 4 | 5 | 6 |
Third | Lake | 1 | 8 | 8 |
Third | Airport | 7 | 5 | 4 |
Fourth | Car | 4 | 5 | 1 |
And DF2
Name | Identifier | City | Value2 | Value3 |
---|---|---|---|---|
Jeff | ISE2 | Seattle | 1 | 6 |
George | GSY7 | Houston | 2 | 2 |
Carl | BHU1 | Miami | 3 | 7 |
Mike | POI0 | Los Angeles | 8 | 8 |
Linel | ANN1 | DC | 5 | 4 |
Pork | CNU4 | Portland | 5 | 1 |
In DF1, we will keep ID and Location column regardless, and for DF2, we keep Name, Identifier, and City regardless. My goal is to compare the other numerical columns to see if there is a match in the name. So in this case, the match is "Value2" and "Value3."
So the processed data frames would be:
New DF1
ID | Location | Value2 | Value3 |
---|---|---|---|
First | Park | 4 | 3 |
Second | House | 2 | 5 |
Second | Barn | 4 | 5 |
Third | Lake | 1 | 8 |
Third | Airport | 7 | 5 |
Fourth | Car | 4 | 5 |
And new DF2
Name | Identifier | City | Value2 | Value3 |
---|---|---|---|---|
Jeff | ISE2 | Seattle | 1 | 6 |
George | GSY7 | Houston | 2 | 2 |
Carl | BHU1 | Miami | 3 | 7 |
Mike | POI0 | Los Angeles | 8 | 8 |
Linel | ANN1 | DC | 5 | 4 |
Pork | CNU4 | Portland | 5 | 1 |
Is one way comparing the column names of each DF and compare it to the other? I would appreciate any help on this. Thank you!
ID <- c("First", "Second", "Second", "Third", "Third", "Fourth")
Location <- c("Park","House","Barn","Lake","Airport","Car")
Value2 <- c(4,2,4,1,7,4)
Value3 <- c(3,5,5,8,5,5)
Value4 <- c(3,2,6,8,4,1)
DF1 <- data.frame(ID, Location, Value2, Value3, Value4)
Name <- c("Jeff", "George", "Carl", "Mike", "Linel", "Pork")
Identifier <- c("ISE2","GSY2","BHU1","POI0","ANN1","CNU4")
City <- c("Seattle","Houston","Miami","Los Angeles","DC","Portland")
Value2 <- c(1,2,3,8,5,5)
Value3 <- c(6,2,7,8,4,1)
DF2 <- data.frame(Name, Identifier, City, Value2, Value3)
Here is a base R option
# Common column-name pattern across `data.frame`
pattern <- "Value"
common_nms <- intersect(
names(DF1)[grep(pattern, names(DF1))],
names(DF2)[grep(pattern, names(DF2))])
# Unique and and "pattern-shared" columns in `DF1`
DF1[c(names(DF1)[-grep(pattern, names(DF1))], common_nms)]
# ID Location Value2 Value3
#1 First Park 4 3
#2 Second House 2 5
#3 Second Barn 4 5
#4 Third Lake 1 8
#5 Third Airport 7 5
#6 Fourth Car 4 5
# Unique and and "pattern-shared" columns in `DF2`
DF2[c(names(DF2)[-grep(pattern, names(DF2))], common_nms)]
# Name Identifier City Value2 Value3
#1 Jeff ISE2 Seattle 1 6
#2 George GSY2 Houston 2 2
#3 Carl BHU1 Miami 3 7
#4 Mike POI0 Los Angeles 8 8
#5 Linel ANN1 DC 5 4
#6 Pork CNU4 Portland 5 1
The idea is to use a pattern
to filter column names and then determine common names through intersect
. Final column names are then given by column names that don't match the pattern
(the "unique" names) and the common column names from the intersect.