I use R and I am merging mutiple data sets horizontically (by column), where each data set has one same column. For example, consider
`df1=data.frame('name'=c('9504','07123','40355','59003'),
'age 16'=c(21,22,23,25),
'income 43_'=c(3,4,7,2))`
and
`df2=data.frame('name'=c('9504','40355','59003','07123'),
'new 21'=c(21,22,23,25),
'GDP 98'=c(3,4,7,2) )`
Here, the two data sets have the same 'name' column. I merge the two data sets horizontally and I have
`df=data.frame('name'=c('9504','07123','40355','59003'),
'age 16'=c(21,22,23,25),
'income 43'=c(3,4,7,2),
'name'=c('9504','40355','59003','07123'),
'new 21'=c(21,22,23,25),
'GDP 98'=c(3,4,7,2))`
Note that merging is not my problem. It is taken as given. Given that data frame, df, I want to arrange the values of the "name" columns (in an ascending or descrnding order) so as to have the same order in both 'name' columns and ofcourse to rearrange accordingly the other columns
For example, I want to have
`dfN1=data.frame('name'=c('07123','40355','59003','9504'),
'age 16'=c(22,23,25,21),
'income 43'=c(4,7,2,3),
'name'=c('07123','40355','59003','9504'),
'new 21'=c(25,22,23,21),
'GDP 98'=c(2,4,7,3))`
name age.16 income.43 name.1 new.21 GDP.98
1 07123 22 4 07123 25 2
2 40355 23 7 40355 22 4
3 59003 25 2 59003 23 7
4 9504 21 3 9504 21 3
I do not want to process df1 and df2 but directly df to get 'dfN1'
So I tried
`dfN<-df[with(df, order(name)),]`
name age.16 income.43 name.1 new.21 GDP.98
2 07123 22 4 40355 22 4
3 40355 23 7 59003 23 7
4 59003 25 2 07123 25 2
1 9504 21 3 9504 21 3
but as you can see only the first two columns are corrected properly but not the last two. So dfN is not the same as dfN1, which is what I want to achieve.
What can I do?
Note that in reality I merge many data frames before getting df. So my real df contains many times the word "name" and also I have many more variables between the "name" columns. So a more general code would be helpful.
Since your data sets share a "key" column, this is a natural place to use a join with df1 & df2:
library(dplyr)
df1 |>
left_join(df2) # this will join by the columns in common, i.e. "name"
#left_join(df2, join_by(name)) # good practice to be explicit if possible
Result
name age.16 income.43_ new.21 GDP.98
1 9504 21 3 21 3
2 07123 22 4 25 2
3 40355 23 7 22 4
4 59003 25 2 23 7
If you need to start from df
for some reason (I'd avoid it if possible), you could get to the same place with manual subsets of df
:
left_join(df[,1:3], df[,4:6], join_by(name == "name.1"))
Note that in reality I merge many data frames before getting df. So my real df contains many times the word "name" and also I have many more variables between the "name" columns. So a more general code would be helpful.
Again, this is more straightforward to do each time as a join, such that each added table will add variables, unified by key columns like "name." Keeping the data "tidy," where each row represents one observation, will make your analysis much simpler, robust, and reliable.
https://vita.had.co.nz/papers/tidy-data.pdf
The main pitfall to watch out for is to make sure your keys are unique; if they're not, a row could get matched to multiple rows, inadvertently growing the output each time. The multiple
parameter of left_join
gives options for avoiding this, or you could perform a summarization step to make sure the key columns are unique for each row in the tables to be joined.