r replace data-cleaning stringr market-basket-analysis

Stringr pattern matching: Is there a way to identify multiple product descriptions?

In my dataset, i have product descriptions that appear as:

Product A, Product A, Product A

and in other rows as

Product A, Product B, Product A, Product B

and in some rows, as just

Product A

Initially, my dataset had strings in the format:

Product A, Product B, Product A, Product B, Product A, Product B

and

Product A, Product A, Product A

Since I wanted just one instance of each product, I resolved this issue by using the following code:

df$lengths <- str_length(df$items)

df$new_items <- str_sub(df$items, 1, df$lengths/3)

Is there a way to solve the above problem by modifying this code?

df <-
structure(list(Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)", 
"Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)", 
"Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", 
"samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s(3+32)", 
"Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)"
)), class = "data.frame", row.names = c(NA, -6L))

Solution

EDIT:

If the comma-separated strings do not always contain identical elements, more complex solutions are in order:

Data:

Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")

Solution 1: A regex solution based on negative character class, negative lookahead, and backreference -- basically, a one-liner:

library(dplyr)
library(stringr)
data.frame(Product_name) %>%
  mutate(Product_name = str_extract_all(Product_name, "((?!\\s)[^,]+)(?!.*\\1)"))
                                             Product_name
1  Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
2  Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
3                                     Samsung A32 (6+128)
4                                     samsung A02s (3+32)
5                                 Xiaomi Redmi 10 (6+128)
6                               Redmi Note 10 Pro (6+128)

Solution 2: Based on tidyr functionality

library(tidyr)
library(dplyr)
data.frame(Product_name) %>%
  # create identifier:
  mutate(row = row_number()) %>%
  # separate rows into individual elements:
  separate_rows(Product_name, sep = ", ") %>%
  group_by(row) %>%
  # remove duplicated elements:
  filter(!duplicated(Product_name)) %>%
  # put distinct elements back into the same row:
  summarise(Product_name = toString(Product_name))
# A tibble: 6 x 2
    row Product_name                                          
  <int> <chr>                                                 
1     1 Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
2     2 Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
3     3 Samsung A32 (6+128)                                   
4     4 samsung A02s (3+32)                                   
5     5 Xiaomi Redmi 10 (6+128)                               
6     6 Redmi Note 10 Pro (6+128)

Before edit:

This solution is based on the assumption that the comma-separated elements in the strings are always identical:

library(stringr)
str_extract(Product_name, "[^,]+")
[1] "Samsung Galaxy A03s (4+64)" "Samsung Galaxy A03s (3+32)"
[3] "Samsung A32 (6+128)"        "samsung A02s (3+32)"       
[5] "Xiaomi Redmi 10 (6+128)"    "Redmi Note 10 Pro (6+128)"

Data:

Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")