In my dataset, i have product descriptions that appear as:
and in other rows as
and in some rows, as just
Initially, my dataset had strings in the format:
and
Since I wanted just one instance of each product, I resolved this issue by using the following code:
df$lengths <- str_length(df$items)
df$new_items <- str_sub(df$items, 1, df$lengths/3)
Is there a way to solve the above problem by modifying this code?
df <-
structure(list(Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)",
"Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)",
"Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)",
"samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s(3+32)",
"Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)"
)), class = "data.frame", row.names = c(NA, -6L))
EDIT:
If the comma-separated strings do not always contain identical elements, more complex solutions are in order:
Data:
Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")
Solution 1: A regex solution based on negative character class, negative lookahead, and backreference -- basically, a one-liner:
library(dplyr)
library(stringr)
data.frame(Product_name) %>%
mutate(Product_name = str_extract_all(Product_name, "((?!\\s)[^,]+)(?!.*\\1)"))
Product_name
1 Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
2 Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
3 Samsung A32 (6+128)
4 samsung A02s (3+32)
5 Xiaomi Redmi 10 (6+128)
6 Redmi Note 10 Pro (6+128)
Solution 2: Based on tidyr
functionality
library(tidyr)
library(dplyr)
data.frame(Product_name) %>%
# create identifier:
mutate(row = row_number()) %>%
# separate rows into individual elements:
separate_rows(Product_name, sep = ", ") %>%
group_by(row) %>%
# remove duplicated elements:
filter(!duplicated(Product_name)) %>%
# put distinct elements back into the same row:
summarise(Product_name = toString(Product_name))
# A tibble: 6 x 2
row Product_name
<int> <chr>
1 1 Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
2 2 Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
3 3 Samsung A32 (6+128)
4 4 samsung A02s (3+32)
5 5 Xiaomi Redmi 10 (6+128)
6 6 Redmi Note 10 Pro (6+128)
Before edit:
This solution is based on the assumption that the comma-separated elements in the strings are always identical:
library(stringr)
str_extract(Product_name, "[^,]+")
[1] "Samsung Galaxy A03s (4+64)" "Samsung Galaxy A03s (3+32)"
[3] "Samsung A32 (6+128)" "samsung A02s (3+32)"
[5] "Xiaomi Redmi 10 (6+128)" "Redmi Note 10 Pro (6+128)"
Data:
Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")