I have a data.table with many rows that look like this in R:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
NCBINCC GenBank gene 331 1008 . - . gene_id=UL1 protein_id=ABV71500.1
NCBINCC GenBank gene 1009 1120 . - . gene_id=UL4 protein_id=ABV71520
NCBINCC GenBank gene 1135 1200 . - . gene_id=UL6 protein_id=ABV71525
Is there a simple way to add quotes around the values that follow gene_id= and protein_id=, so that only the gene and protein identifiers are quoted, as in the following output:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
NCBINCC GenBank gene 331 1008 . - . gene_id="UL1" protein_id="ABV71500.1"
NCBINCC GenBank gene 1009 1120 . - . gene_id="UL4" protein_id="ABV71520"
NCBINCC GenBank gene 1135 1200 . - . gene_id="UL6" protein_id="ABV71525"
I have seen this answer for shell, but wanted to know if there was a way to also do it in R. Thank you kindly.
If you would rather not load a package, you can use sub inside lapply.
v <- c('V9', 'V10')
# Wrap everything after the '=' in double quotes (d is a plain data.frame here)
d[v] <- lapply(d[v], sub, pattern='=(.*)', replacement='="\\1"')
d
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 NCBINCC GenBank gene 331 1008 . - . gene_id="UL1" protein_id="ABV71500.1"
# 2 NCBINCC GenBank gene 1009 1120 . - . gene_id="UL4" protein_id="ABV71520"
# 3 NCBINCC GenBank gene 1135 1200 . - . gene_id="UL6" protein_id="ABV71525"
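The question mentions a data.table, and d[v] selects columns only on a plain data.frame, so here is a minimal sketch of the same substitution applied by reference. It assumes the data.table package is installed; the object name dt is just for illustration and is not from the original post.

```r
library(data.table)

# Same sample rows as in the Data section, converted to a data.table.
dt <- as.data.table(read.table(header = TRUE, stringsAsFactors = FALSE,
  text = 'V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
NCBINCC GenBank gene 331 1008 . - . gene_id=UL1 protein_id=ABV71500.1
NCBINCC GenBank gene 1009 1120 . - . gene_id=UL4 protein_id=ABV71520
NCBINCC GenBank gene 1135 1200 . - . gene_id=UL6 protein_id=ABV71525'))

v <- c("V9", "V10")

# .SD contains only the columns named in .SDcols; (v) := assigns the
# results back to those same columns in place.
dt[, (v) := lapply(.SD, sub, pattern = "=(.*)", replacement = "=\"\\1\""),
   .SDcols = v]
dt
```

The pattern and replacement are the same as in the base R version; only the column selection and assignment differ.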
Data
d <- read.table(header=TRUE, text='V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
NCBINCC GenBank gene 331 1008 . - . gene_id=UL1 protein_id=ABV71500.1
NCBINCC GenBank gene 1009 1120 . - . gene_id=UL4 protein_id=ABV71520
NCBINCC GenBank gene 1135 1200 . - . gene_id=UL6 protein_id=ABV71525')