[SOLVED] Using separate_wider_regex for to create a new row for each match in R

Using separate_wider_regex for to create a new row for each match in R

I have a dataframe with a long 'comment'-column that includes semi-structered information about the presence of species in several zones for different study sites. The pattern is typically something like the following:

Site	Comment
1	1: species A, species B 2a: species A, species C 4: species A
2	2a: species B 2b: species A 4: species C

Below is a small subset of my data. Please not that it's possible not not all 5 zones are listed. This is when a species was not found for that particular zone. This is not uncommon for Zone 1, but Zone 4 is almost always listed. In other words, I believe I'm dealing with unequal lengths for the matching groups.

I want to use seperate_wider_regex to split the column into two columns and multiple rows. With the end result something like this:

Site	Zone	Species
1	1	species A, species B
1	2a	species A, species C
1	4	species A
2	2a	species B
2	2b	species A
2	4	species C

I can then pivot the dataframe so that each zone gets it's own colomn with pivot_wider. I can match all the zones per row with the following regex: "[1-4][ab]?[:]?" The colon (:) is usually present for the species listings, but not always. I can also match everything between the two patterns, except for zone 4 because there's no pattern to end the match. I thought I could do something like this, but that also doesn't seem to work: ".+?(?=[1-4][ab]?:)|(.*)"

My ideal solution would be using packages from the Tidyverse. Here's a little subset of my data:

library(tidyr)

sample_df <- structure(list(site = 1:2, comment = c("2a: species A, species B 2b: species A, species C 3: species C, species D 4: species A, species B, species C", 
"2a: species A, species B, species D 2b: species B 3: species C 4: species C, species D, species E"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

Solution

You can try

sample_df %>%
  mutate(comment = str_split(comment, "\\s(?=\\S+:)")) %>%
  unnest(comment) %>%
  separate_wider_regex(comment, c(zone = "[^:]+", ":\\s+", species = ".*"))

which gives

# A tibble: 8 × 3
   site zone  species
  <int> <chr> <chr>
1     1 2a    species A, species B
2     1 2b    species A, species C
3     1 3     species C, species D
4     1 4     species A, species B, species C
5     2 2a    species A, species B, species D
6     2 2b    species B
7     2 3     species C
8     2 4     species C, species D, species E