rregextidyversedata-manipulation

Using separate_wider_regex for to create a new row for each match in R


I have a dataframe with a long 'comment'-column that includes semi-structered information about the presence of species in several zones for different study sites. The pattern is typically something like the following:

Site Comment
1 1: species A, species B 2a: species A, species C 4: species A
2 2a: species B 2b: species A 4: species C

Below is a small subset of my data. Please not that it's possible not not all 5 zones are listed. This is when a species was not found for that particular zone. This is not uncommon for Zone 1, but Zone 4 is almost always listed. In other words, I believe I'm dealing with unequal lengths for the matching groups.

I want to use seperate_wider_regex to split the column into two columns and multiple rows. With the end result something like this:

Site Zone Species
1 1 species A, species B
1 2a species A, species C
1 4 species A
2 2a species B
2 2b species A
2 4 species C

I can then pivot the dataframe so that each zone gets it's own colomn with pivot_wider. I can match all the zones per row with the following regex: "[1-4][ab]?[:]?" The colon (:) is usually present for the species listings, but not always. I can also match everything between the two patterns, except for zone 4 because there's no pattern to end the match. I thought I could do something like this, but that also doesn't seem to work: ".+?(?=[1-4][ab]?:)|(.*)"

My ideal solution would be using packages from the Tidyverse. Here's a little subset of my data:

library(tidyr)

sample_df <- structure(list(site = 1:2, comment = c("2a: species A, species B 2b: species A, species C 3: species C, species D 4: species A, species B, species C", 
"2a: species A, species B, species D 2b: species B 3: species C 4: species C, species D, species E"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

Solution

  • You can try

    sample_df %>%
      mutate(comment = str_split(comment, "\\s(?=\\S+:)")) %>%
      unnest(comment) %>%
      separate_wider_regex(comment, c(zone = "[^:]+", ":\\s+", species = ".*"))
    

    which gives

    # A tibble: 8 × 3
       site zone  species
      <int> <chr> <chr>
    1     1 2a    species A, species B
    2     1 2b    species A, species C
    3     1 3     species C, species D
    4     1 4     species A, species B, species C
    5     2 2a    species A, species B, species D
    6     2 2b    species B
    7     2 3     species C
    8     2 4     species C, species D, species E