I have a data.frame, data1 (read from data1.csv), with 100,000 rows and 5 columns:
N - ID - DATE - TEXT - LANG
Next, I drew a random sample of 3000 rows, without calling set.seed:
num <- c(1:100000)
aleat <- sort(sample(num, 3000, replace = F))
data2 <- data1[aleat,c(1,4)]
Notice that col. 4 is TEXT.
data2.csv has since been processed by other programs, which added variables to the file.
Now, data2 is a data.frame (3000 rows x 3 cols):
N - TEXT - CODE
data2$N = c(1:3000)
So data1$N is different from data2$N.
Now, I need to find those 3000 TEXT values (data2) in data1 in order to associate them with all the original variables, which I didn't need at first. In particular, I need to associate ID with TEXT and CODE. Keeping the order is essential.
Note that the text is in Spanish, so accented characters are included. I read both files with the fread function, using UTF-8 encoding for data1 and Latin-1 for data2. If I read data2 with UTF-8 encoding, R doesn't read it correctly. I suppose this is because another program processed and saved it.
I've tried two ways:
1) for loops:
try1 <- matrix(0, nrow=3000, ncol= 5)
for (i in (1:3000)){
for (j in (1:100000)){
if ((data2[i,2] == data1[j,4]) == T){
try1[j,] <- data1[j,]
}
}
}
#OR
gg <- NULL
a <- NULL
for (j in 1:100000) {
for (i in 1:3000) {
if((data2[i,2]==data1[j,4]==T))
a <- data1[j,]
gg <- c(gg,a)
}
}
Both loops failed. No error is raised when I run them, but try1 and gg are still empty after the loops finish.
2) The duplicated function:
num <- c(1:103000)
text1 <- as.data.frame(data1[,4]); colnames(text1) <- "TEXT"
text2 <- as.data.frame(data2[,2]); colnames(text2) <- "TEXT"
text <- rbind(text1,text2)
data3 <- as.data.frame(cbind(num,text))
dup <- as.data.frame(data3[duplicated(data3$TEXT),])
I created the variable num in order to identify the row number in data1. This method doesn't work either: it identifies only 2400 of the 3000 texts, and the order is not correct. I think this is because the remaining 600 are interleaved.
I think what you are looking for is a join. Try this:
library(dplyr)
data2 %>%
left_join(data1 %>% select(-N), by = "TEXT")
However, joining on a text field that contains special characters, and that has been processed and read in using different encodings, can lead to problems. If possible, I would suggest keeping a unique ID when processing the sample data with other programs and joining on that column instead.
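If you have to match on TEXT anyway, one option is to bring both columns to the same encoding first and then look the texts up with match(), which keeps the order of data2. The following is only a rough sketch in base R: the toy data frames and their values are made up for illustration, and I am assuming that data2 really is Latin-1 and that the other programs did not alter the text itself.

```r
# Toy stand-ins for data1 and data2 (hypothetical values, not the real files).
data1 <- data.frame(
  N    = 1:5,
  ID   = c("a1", "b2", "c3", "d4", "e5"),
  TEXT = c("ma\u00f1ana", "canci\u00f3n", "hola", "adi\u00f3s", "ni\u00f1o"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  N    = 1:3,
  TEXT = c("ni\u00f1o", "hola", "canci\u00f3n"),
  CODE = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Simulate data2 having been saved as Latin-1 by another program.
data2$TEXT <- iconv(data2$TEXT, from = "UTF-8", to = "latin1")

# Convert the Latin-1 text back to UTF-8 so both columns use one encoding.
data2$TEXT <- iconv(data2$TEXT, from = "latin1", to = "UTF-8")

# For each text in data2, find the row of its first occurrence in data1.
# The result has one entry per data2 row, in data2's order (NA = no match).
idx <- match(data2$TEXT, data1$TEXT)

# Attach the original ID without reordering data2.
result <- data2
result$ID <- data1$ID[idx]
print(result)
```

Any NA in idx marks a text that still does not match after the conversion, which may help you track down the rows that were changed along the way. Note that match() only returns the first hit, so if the same TEXT appears more than once in data1 you cannot tell which row the sample came from. That is another reason to carry a unique ID through the processing.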