r dataframe

Efficient manner to compare and match structures between two data frames in R?


I'm writing a function that compares the structures of two data frames in R, in order to build a validation filter for uploading data into a table. I understand that when a data frame column holds both numbers and text strings, the entire column is stored as class "character".

Suppose we have this mixed data frame:

> df1
   col1 col2
1     1    4
2     2 five
3 three    6

where df1 is built via:

df1 <- data.frame(
  col1 = c("1", "2", "three"),
  col2 = c("4", "five", "6")
)

And we have another mixed data frame df2:

> df2
  col1 col2
1   11   14
2   12 fill
3 tree   16

df2 <- data.frame(
  col1 = c("11", "12", "tree"),
  col2 = c("14", "fill", "16")
)

I'd like to compare the structures of the two as if every element that could be converted to numeric had actually been converted, ignoring the actual values. Under that comparison, the structures of df1 and df2 match. Is there a way to run this type of comparison in R?

Continuing the example, suppose we have another data frame df3 that we want to compare with df1. Here the structures do not match, because df3[2,1] is a text string ("kats") while df1[2,1] holds an element ("2") that can be converted to numeric:

> df3
  col1 col2
1   11   14
2 kats fill
3 tree   16

df3 <- data.frame(
  col1 = c("11", "kats", "tree"),
  col2 = c("14", "fill", "16")
)

Solution

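  • You can check every cell for a non-digit character (the regex "\\D") and compare the resulting logical matrices. Note that this treats values such as "1.5" or "-2" as non-numeric, because "." and "-" are non-digit characters.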

    compare <- function(x, y) {
      identical(sapply(x, grepl, pattern = "\\D"),
                sapply(y, grepl, pattern = "\\D"))
    }
    
    compare(df1, df2)
    # [1] TRUE
    
    compare(df1, df3)
    # [1] FALSE
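
If the data may contain decimals, negative numbers, or scientific notation, one possible variant is to test each cell by actually attempting the conversion with as.numeric() instead of matching digits. This is only a sketch, not part of the original answer: the helper names is_numeric_like and compare_structure, and the extra data frame df4, are hypothetical and introduced here for illustration. It assumes that NA cells, and strings such as "NA" or "NaN", should count as non-numeric.

```r
# Example data from the question, plus a hypothetical df4 with a decimal
# and a negative number in the positions where df1 has integers.
df1 <- data.frame(col1 = c("1", "2", "three"), col2 = c("4", "five", "6"))
df2 <- data.frame(col1 = c("11", "12", "tree"), col2 = c("14", "fill", "16"))
df3 <- data.frame(col1 = c("11", "kats", "tree"), col2 = c("14", "fill", "16"))
df4 <- data.frame(col1 = c("1.5", "-2", "x"), col2 = c("4", "y", "6"))

# Hypothetical helper: logical mask that is TRUE where a cell converts
# to a number. as.character() keeps this working if columns are factors.
is_numeric_like <- function(df) {
  vapply(
    df,
    function(col) !is.na(suppressWarnings(as.numeric(as.character(col)))),
    logical(nrow(df))
  )
}

# Hypothetical helper: structures match when the dimensions, column names,
# and numeric/non-numeric masks are all identical.
compare_structure <- function(x, y) {
  identical(dim(x), dim(y)) &&
    identical(names(x), names(y)) &&
    identical(is_numeric_like(x), is_numeric_like(y))
}

compare_structure(df1, df2)  # TRUE
compare_structure(df1, df3)  # FALSE
compare_structure(df1, df4)  # TRUE
```

With the digit-only pattern, "1.5" and "-2" in df4 would be flagged as non-numeric, so df1 and df4 would not match. Whether strings like "1e3" or "Inf" should count as numeric depends on the upload rules, so the conversion-based test may need tightening for a real validation filter.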