I want to remove the rows in which the same word appears two or more times in a row, i.e. in consecutive columns. This is for a sequential pattern mining analysis.
I already tried the distinct() and duplicated() functions, but they only remove rows that are duplicated as a whole.
r_seq_5 <- r_seq_5[!duplicated(r_seq_5),] # remove duplicates
# Su Score result ROI next_roi third_roi four_roi five_roi
# 1 1 90 high Elsewhere Elsewhere Teacher Teacher Teacher
# 2 1 90 high Elsewhere Teacher Teacher Teacher Teacher
# 3 1 90 high Teacher Pen Teacher Elsewhere Smartboard
This is the table. It doesn't matter if Teacher appears two or three times in a row of the table, as long as the occurrences are not directly after each other.
The desired result is:
# 1 1 90 high Teacher Pen Teacher Elsewhere Smartboard
To do this, I found it convenient to turn the factors into numbers first, because comparing adjacent columns for matches seemed less arduous that way. For this I used a for loop and the qdap package: in each pass, the values that match a label are replaced with NA, and NAer() then fills those NA values with the label's index.
library(dplyr)
library(qdap)
df <- data.frame(Su = rep(1,3),
                 Score = rep(90,3),
                 ROI = c("A", "A", "B"),
                 NETX_ROI = c("A", "B", "C"),
                 third_roi = rep("B", 3),
                 four_roi = c("B", "B", "A"),
                 five_roi = c("B", "B", "D"))
df
> df
Su Score ROI NETX_ROI third_roi four_roi five_roi
1 1 90 A A B B B
2 1 90 A B B B B
3 1 90 B C B A D
df2 <- df
roi <- c("A", "B", "C", "D")
# A = Elsewhere
# B = Teacher
# C = Pen
# D = Smartboard
# Replace each label with its index in roi:
# mark the matching cells as NA, then fill those NAs with i
for (i in seq_along(roi)) {
  df2[df2 == roi[i]] <- NA
  df2 <- qdap::NAer(df2, i)
}
> df2
Su Score ROI NETX_ROI third_roi four_roi five_roi
1 1 90 1 1 2 2 2
2 1 90 1 2 2 2 2
3 1 90 2 3 2 1 4
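The same label-to-index mapping can probably be done without qdap. The following is only a sketch in base R using match(); the helper names roi_cols and codes are my own and not part of the original code, and it assumes every value in the ROI columns is listed in roi (anything else becomes NA).

```r
# Same toy data as above (base R only).
df <- data.frame(Su = rep(1, 3),
                 Score = rep(90, 3),
                 ROI = c("A", "A", "B"),
                 NETX_ROI = c("A", "B", "C"),
                 third_roi = rep("B", 3),
                 four_roi = c("B", "B", "A"),
                 five_roi = c("B", "B", "D"))
roi <- c("A", "B", "C", "D")

# Hypothetical helper name: the columns that hold ROI labels.
roi_cols <- c("ROI", "NETX_ROI", "third_roi", "four_roi", "five_roi")

# match() returns the position of each label in roi
# (NA for a label that is not listed in roi).
codes <- df
codes[roi_cols] <- lapply(df[roi_cols],
                          function(col) match(as.character(col), roi))
codes
```

Unlike the NA-based loop, this leaves Su and Score untouched and does not overwrite missing values that were already in the data.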
df2 <- df2 %>%
  dplyr::select(-c(Su, Score)) %>%
  as.matrix()
nn <- ncol(df2)
x <- matrix(nrow = nrow(df2), ncol = ncol(df2)-1)
# Compare each column with the next one:
# NA marks an adjacent repeat, 0 marks a change
for (i in seq_len(nn - 1)) {
  xx <- ifelse(df2[, i] == df2[, i + 1], NA, 0)
  x[, i] <- xx
}
> x
[,1] [,2] [,3] [,4]
[1,] NA 0 NA NA
[2,] 0 NA NA NA
[3,] 0 0 0 0
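The column-by-column loop above could also be written as a single vectorised comparison, and it should work on the labels directly, so the numeric recoding is optional. This is a sketch under the assumption that the ROI columns contain no missing values; the names roi_cols, same and keep are mine.

```r
# Same toy data as above (base R only).
df <- data.frame(Su = rep(1, 3),
                 Score = rep(90, 3),
                 ROI = c("A", "A", "B"),
                 NETX_ROI = c("A", "B", "C"),
                 third_roi = rep("B", 3),
                 four_roi = c("B", "B", "A"),
                 five_roi = c("B", "B", "D"))
roi_cols <- c("ROI", "NETX_ROI", "third_roi", "four_roi", "five_roi")

m <- as.matrix(df[roi_cols])

# Columns 2..k against columns 1..(k-1):
# TRUE wherever a value equals its left-hand neighbour.
same <- m[, -1, drop = FALSE] == m[, -ncol(m), drop = FALSE]

# Keep a row only if it has no adjacent repeat at all.
keep <- rowSums(same) == 0
keep
```

For this data, same has TRUE in the positions where x has NA, and keep is TRUE only for the third row.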
Finally, I removed the rows that contain NA.
dfx <- x %>%
  as.data.frame()
df_test <- df %>%
  dplyr::bind_cols(dfx) %>%
  na.omit() %>%
  dplyr::select(1:ncol(df))
df_test
> df_test
Su Score ROI NETX_ROI third_roi four_roi five_roi
3 1 90 B C B A D
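As an alternative to the steps above, the "same word directly after each other" condition can be read as "some run has length 2 or more", which is what rle() reports. The following is a sketch of that different technique, not the method used above; has_repeat() and its min_run argument are hypothetical names I introduce for illustration.

```r
# Same toy data as above (base R only).
df <- data.frame(Su = rep(1, 3),
                 Score = rep(90, 3),
                 ROI = c("A", "A", "B"),
                 NETX_ROI = c("A", "B", "C"),
                 third_roi = rep("B", 3),
                 four_roi = c("B", "B", "A"),
                 five_roi = c("B", "B", "D"))
roi_cols <- c("ROI", "NETX_ROI", "third_roi", "four_roi", "five_roi")

# Hypothetical helper: TRUE if x contains a run of at least
# min_run identical consecutive values.
has_repeat <- function(x, min_run = 2) {
  any(rle(as.character(x))$lengths >= min_run)
}

# apply() passes each row of the ROI columns as a character vector.
keep <- !apply(df[roi_cols], 1, has_repeat)
result <- df[keep, ]
result
```

On the toy data this keeps only row 3, the same row as df_test. Raising min_run to 3 would tolerate pairs and drop only rows with three or more identical values in a row.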