r, dataframe, text, duplicates, identify

How to identify duplicated data in a data.frame in R?


I have a data.frame - e.g.: data1.csv - (100 000 rows x 5 cols).

N - ID - DATE - TEXT - LANG

Next, I drew a sample of 3000 rows without calling set.seed:

num <- c(1:100000)
aleat <- sort(sample(num, 3000, replace = F))
data2 <- data1[aleat,c(1,4)]

Notice that col. 4 is TEXT.

data2.csv has since been processed by other programs, which added variables to the file.
Now data2 is a data.frame (3000 rows x 3 cols):

N - TEXT - CODE

data2$N = c(1:3000), so data1$N is different from data2$N.

Now I need to find those 3000 TEXT values (data2) in data1 so that I can associate them with the original variables, which I didn't need at first. Specifically, I need to associate ID with TEXT and CODE. Keeping the order is essential.

Note that the text is in Spanish, so it includes accented characters. I read both files with the fread function, using UTF-8 encoding for data1 and Latin-1 for data2. If I read data2 with UTF-8 encoding, R doesn't read it correctly. I suppose that is because another program has processed and saved it.

I've tried two ways:

1) for loops:

try1 <- matrix(0, nrow=3000, ncol= 5)
for (i in (1:3000)){
  for (j in (1:100000)){
    if ((data2[i,2] == data1[j,4]) == T){
      try1[j,] <- data1[j,]
    }
  }
}

#OR  

gg <- NULL
a <- NULL
for (j in 1:100000) {
  for (i in 1:3000) {
    if((data2[i,2]==data1[j,4]==T))
      a <- data1[j,]
    gg <- c(gg,a)
  }
}

Both loops failed. There is no error when I run them, but try1 and gg are still empty after running the loops.

2) duplicated function.

num <- c(1:103000)
text1 <- as.data.frame(data1[,4]); colnames(text1) <- "TEXT"
text2 <- as.data.frame(data2[,2]); colnames(text2) <- "TEXT"
text <- rbind(text1,text2)
data3 <- as.data.frame(cbind(num,text))
dup <- as.data.frame(data3[duplicated(data3$TEXT),])

I created the variable num in order to identify the row number in data1. This method doesn't work: it identifies only 2400 of the 3000 texts, and the order is not correct. I think that is because the remaining 600 are interleaved.


Solution

  • I think what you are looking for is a join. Try this:

    library(dplyr)
    data2 %>%
      left_join(data1 %>% select(-N), by = "TEXT")
    

    However, joining on a text field that contains special characters, and that has been processed and read in using different encodings, can lead to problems. If possible, I would suggest keeping a unique ID when processing the sample data with other programs and joining on that column instead.
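
If no shared ID is available, one possible workaround is to bring both TEXT columns to the same encoding before matching. The following is only a minimal sketch in base R: the toy data frames and their values are made up for illustration, and it assumes that data2 really is Latin-1 and that the texts differ in encoding only.

```r
# Toy stand-ins for the asker's data (all values here are made up).
data1 <- data.frame(
  N    = 1:5,
  ID   = c("a1", "b2", "c3", "d4", "e5"),
  TEXT = c("canci\u00f3n", "ni\u00f1o", "\u00e1rbol", "cami\u00f3n", "se\u00f1al"),
  stringsAsFactors = FALSE
)

# data2: a sample of data1 in its own order, with TEXT stored as Latin-1
# and a CODE column standing in for what the external program added.
data2 <- data.frame(
  N    = 1:3,
  TEXT = iconv(data1$TEXT[c(4, 1, 5)], from = "UTF-8", to = "latin1"),
  CODE = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Bring both TEXT columns to UTF-8 before comparing them.
data1$TEXT <- enc2utf8(data1$TEXT)
data2$TEXT <- iconv(data2$TEXT, from = "latin1", to = "UTF-8")

# match() gives, for each row of data2, the position of the first matching
# row in data1, so the result keeps data2's row order.
idx <- match(data2$TEXT, data1$TEXT)
data2$ID <- data1$ID[idx]

# Rows without a match (NA in idx) point to remaining text differences.
unmatched <- data2[is.na(idx), ]
```

Note that match() keeps only the first hit, whereas left_join() returns one row per matching combination, so if the same TEXT appears more than once in data1 the two approaches will give different results.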