
How to create new columns in data.frame based on letter AND number character objects in a column in R


I have a data frame in which one column holds the chromosome followed by the base position, both in a single string. I filled the remaining columns, V2 through V5, with integers just to simulate a similar data.frame.

> test
             V1 V2 V3 V4 V5
1     I.1286480  9 17 25 33
2     I.1898932 10 18 26 34
3    I.11871397 11 19 27 35
4    II.1252994 12 20 28 36
5   II.18175911 13 21 29 37
6  III.10298347 14 22 30 38
7  IV.123478912 15 23 31 39
8 V.12837471234 16 24 32 40

The real data has other values in the following columns, and it is a huge data set with 115,000 rows. I want to make two new columns, one containing the Roman numerals (I, II, III, IV, V) and another containing the number that follows them. My trouble is that this is a vector of character objects, so I'm not sure how to parse the letters out from the numbers. I tried using StrPos from the DescTools package, but

> StrPos(test$V1, "I")
[1]  1  1  1  1  1  1  1 NA
> StrPos(test$V1, "I.")
[1]  1  1  1  1  1  1  1 NA

it returns a position for every string that contains an "I", not just for the objects with one instance of "I". I'm wondering whether substring would work, but then I have the problem that both the Roman numerals and the numbers following them vary in length. I know there must be a simple solution to this problem, but the only things I can think up are very long for loops and if statements. Help me, Stack Overflow, you're my only hope!


Solution

  • Using separate from tidyr, whose default separator splits on any run of non-alphanumeric characters (here, the dot):

    library(tidyr)
    separate(test, V1, into = c("chr", "pos"))
      chr         pos V2 V3 V4 V5
    1   I     1286480  9 17 25 33
    2   I     1898932 10 18 26 34
    3   I    11871397 11 19 27 35
    4  II     1252994 12 20 28 36
    5  II    18175911 13 21 29 37
    6 III    10298347 14 22 30 38
    7  IV   123478912 15 23 31 39
    8   V 12837471234 16 24 32 40
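
  • If adding a package is not an option, the same split can be sketched in base R with sub. This is only an illustration, assuming every value in V1 contains exactly one dot between the numeral and the position; the names chr, pos and result are made up for the example. Note that the largest position in the sample does not fit in R's 32-bit integer type, so the sketch converts with as.numeric rather than as.integer.

```r
# Rebuild the example data from the question (V2..V5 are filler integers).
test <- data.frame(
  V1 = c("I.1286480", "I.1898932", "I.11871397", "II.1252994",
         "II.18175911", "III.10298347", "IV.123478912", "V.12837471234"),
  V2 = 9:16, V3 = 17:24, V4 = 25:32, V5 = 33:40,
  stringsAsFactors = FALSE
)

# Drop the first literal dot and everything after it: what remains is the numeral.
chr <- sub("\\..*$", "", test$V1)

# Drop everything up to and including the first dot: what remains is the position.
# as.numeric() is used because 12837471234 exceeds .Machine$integer.max.
pos <- as.numeric(sub("^[^.]*\\.", "", test$V1))

# Put the two new columns in front of the untouched columns V2..V5.
result <- data.frame(chr = chr, pos = pos, test[-1], stringsAsFactors = FALSE)
```

    Both sub calls are vectorised, so no for or if loops are needed, and 115,000 rows should not be a problem.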