I have a dataframe with a long 'comment'-column that includes semi-structered information about the presence of species in several zones for different study sites. The pattern is typically something like the following:
Site | Comment |
---|---|
1 | 1: species A, species B 2a: species A, species C 4: species A |
2 | 2a: species B 2b: species A 4: species C |
Below is a small subset of my data. Please not that it's possible not not all 5 zones are listed. This is when a species was not found for that particular zone. This is not uncommon for Zone 1, but Zone 4 is almost always listed. In other words, I believe I'm dealing with unequal lengths for the matching groups.
I want to use seperate_wider_regex to split the column into two columns and multiple rows. With the end result something like this:
Site | Zone | Species |
---|---|---|
1 | 1 | species A, species B |
1 | 2a | species A, species C |
1 | 4 | species A |
2 | 2a | species B |
2 | 2b | species A |
2 | 4 | species C |
I can then pivot the dataframe so that each zone gets it's own colomn with pivot_wider
. I can match all the zones per row with the following regex: "[1-4][ab]?[:]?"
The colon (:) is usually present for the species listings, but not always. I can also match everything between the two patterns, except for zone 4 because there's no pattern to end the match. I thought I could do something like this, but that also doesn't seem to work: ".+?(?=[1-4][ab]?:)|(.*)"
My ideal solution would be using packages from the Tidyverse. Here's a little subset of my data:
library(tidyr)
sample_df <- structure(list(site = 1:2, comment = c("2a: species A, species B 2b: species A, species C 3: species C, species D 4: species A, species B, species C",
"2a: species A, species B, species D 2b: species B 3: species C 4: species C, species D, species E"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
You can try
sample_df %>%
mutate(comment = str_split(comment, "\\s(?=\\S+:)")) %>%
unnest(comment) %>%
separate_wider_regex(comment, c(zone = "[^:]+", ":\\s+", species = ".*"))
which gives
# A tibble: 8 × 3
site zone species
<int> <chr> <chr>
1 1 2a species A, species B
2 1 2b species A, species C
3 1 3 species C, species D
4 1 4 species A, species B, species C
5 2 2a species A, species B, species D
6 2 2b species B
7 2 3 species C
8 2 4 species C, species D, species E