[SOLVED] Subset of Columns based on the same column names

Subset of Columns based on the same column names

I have 2 data frames. In my dataset, a part of column name with leading zeros or trailing zeros should be ignored when matching columns, such as 009 = 9 or 22.0 == 22. And I want to subset the intersection of column names of two data frames.

As an example:

df1

No  009  237 038.1  22.0  010 
1     2    3     6     3    3
3     1    7     6     5    5
7     5   NA     9     0    6

df2

No    9  237  38.1   010  070  33.5
1     2    3     6     3    2     1
3     1    7     6     5    1     2
7     5   NA     9     0    9     6

The result should be like this:

result.df1

No  009  237 038.1  010
1     2    3     6    3
3     1    7     6    5
7     5   NA     9    6

result.df2

No    9  237  38.1   010
1     2    3     6     3
3     1    7     6     5
7     5   NA     9     0

How can I do it?

Solution

We can use trimws() to remove leading/trailing zero and dot (with the argument whitespace = "[\\.0]"), then use %in% to compare the two colnames and use that to index ([]) the original dataframe.

Save the trimws() results to new variables (clean_col_df1 and clean_col_df2) to improve readability of codes.

clean_col_df1 <- trimws(colnames(df1), whitespace = "[\\.0]")
clean_col_df2 <- trimws(colnames(df2), whitespace = "[\\.0]")

df1[, clean_col_df1 %in% clean_col_df2]
  No 009 237 038.1 010
1  1   2   3     6   3
2  3   1   7     6   5
3  7   5  NA     9   6

df2[, clean_col_df2 %in% clean_col_df1]
  No 9 237 38.1 010
1  1 2   3    6   3
2  3 1   7    6   5
3  7 5  NA    9   0