rdataframeintersect

How to automatize comparison of values in two columns of two different data.frames in r?


I'm trying to automatize the following function for all columns of two data frames.

I have the data.frames position_sc and position_hit. Both have 18 columns (each column represents a subject ("PD.."). The rows represent the sentence numbers (sentence comprehension test with four conditions: sc, sf, lc, lf, total number of sentences 200) In position_sc the numbers in the rows represent the sentence numbers which had the condition sc (e.g. 3, 7, 19 --> these sentences were presented under condition sc) In position_hit the rows represent for which sentence of the 200 the subjects had a hit (e.g. 3, 22, 78 --> the subject gave the correct answer for these sentences). Now I want to get the sentences the subjects had correct (hit) in condition sc. So the sentence number are the same when they had a sc-sentence correct.

I want to have the number of equal values in position_hit$x and position_sc$x. I was able to do that for each column (=subject) seperately, but I can't figure out, how to automatize it over all columns (compare position_sc$PD02 with position_hit$PD02, position_sc$PD03 with position_hit$PD03, position_sc$PD04 with position_hit$PD04, position_sc$PD05 with position_hit$PD03, aso.

That works well:

test_single = length(intersect(position_sc$PD03, position_hit$PD03))

OUTPUT
[1] 40 

--> (40 numbers are equal --> 40 sentences in condition sc were a hit)

I tried something like that:

#Create a list containing both data.frames.
list_position_sc_hit = list(position_hit, position_sc)

#Try to automatize intersect() first:
test_all = sapply(list_positions_sc_hit, function(x) intersect(position_sc$x, position_hit$x))

But OUTPUT:
$hit
NULL

$sc
NULL

Here is an example how my script looks like: REPRODUCIBLE EXAMPLE

#CREATE 2 DATAFRAMES
#Dataframe 1
df.sc = data.frame (PD02 = c(1,3,7,9,12),
                  PD03 = c(8,9,10,11,13),
                  PD04 = c(1,2,3,24,36),
                  PD05 = c(1,9,20,23,26))
df.sc

#Dataframe 2
#Create list of 4
list.hit = list(PD02 = c(1,3,7,8,11,13,14,15,16,17),
                  PD03 = c(2,4,6,10,11,13,18,19,21),
                  PD04 = c(1,2,3,4,6),
                  PD05 = c(18,19,20,23,25,26))
#Convert list with different number of rows into data.frame
df.hit = as.data.frame(t(stri_list2matrix(list.hit)))
#Rows and columns are switched
df.hit = data.frame(t(df.hit))
#Name columns like in df.sc
names(df.hit)[1:4] = names(df.sc)[1:4]
#Convert all values in data.frame from chr to int
df.hit = sapply(df.hit, as.integer)
# After converting from chr to int the data.frame became an array. In order to reshape it I had to switch row and column twice (not sure why)
df.hit = data.frame(t(df.hit))
df.hit = data.frame(t(df.hit))
df.hit
#Do intersection and length counting for one subject (column)
matches.hit_sc_PD02 = length(intersect(df.sc$PD02, df.hit$PD02))
matches.hit_sc_PD02

OUTPUT: [3] --> three integers are the same in df.sc$PD02 and df.hit$PD02

#Combine both dataframes into list and rename dataframes
df.hit_sc = list(df.hit, df.sc)
names(df.hit_sc) = c("hit", "sc")
df.hit_sc
#Just try intersect() function for all variables (columns) with sapply(). I left length() out just to minimize complexity. 
matches.hit_sc_all = sapply(df.hit_sc, function(x) intersect(df.sc$x, df.hit$x))
matches.hit_sc_all

OUTPUT: $hit NULL

$sc NULL

I would be happy if anyone could help me.


Solution

  • If I'm understanding the question correctly, you want to test how many values are shared between two columns of the same name from two different data frames. Here is a simple solution:

    # assert that variable names are identical and in the same order
    stopifnot(names(df.sc) == names(df.hit))
    
    map2(df.sc, df.hit, ~ length(intersect(.x, .y)))
    

    This will return a list with one element per variable name:

    $PD02
    [1] 3
    
    $PD03
    [1] 3
    
    $PD04
    [1] 3
    
    $PD05
    [1] 3
    

    If you want to return a tibble (data.frame), which may be easier to work with, you can use map2_dfc() instead:

    map2_dfc(df.sc, df.hit, ~ length(intersect(.x, .y)))
    

    which returns:

    # A tibble: 1 x 4
       PD02  PD03  PD04  PD05
      <int> <int> <int> <int>
    1     3     3     3     3
    

    If this solves your problem I'd appreciate an upvote and accepting the solution. Thanks and hope this helps!

    Edit: the map* functions are in the purrr package from the tidyverse.