To tidy part of a dataset I need to separate one column into several. These data takes the form similar to this:
set.seed(2133)
df <- data.frame(a = paste(sample(1:9, 5, replace=T), sample(c("", "%2", "%3"), 5, replace=T), sample(c("", "%3", "%4"), 5, replace=T), sep=""))
df
a
1 6
2 2%3%4
3 6%2
4 3%2
5 5%2%4
Tidyr's separate function wont do the job and the best idea I have is a series of ifelse statements, like this:
df$One <- ifelse(grepl("1", df$a) == T, 1, 0)
df$Two <- ifelse(grepl("2", df$a) == T, 1, 0)
a One Two
1 6 0 0
2 2%3%4 0 1
3 6%2 0 1
4 3%2 0 1
5 5%2%4 0 1
What's the best way to go about such tidying. I'm sure many people working with Open Data Kit (ODK) for data collection will come across this.
We can loop the patterns (i.e. 1, 2) with sapply
, get a logical matrix, coerce to binary by wrapping with +
and assign it as new columns in 'df'
df[c("One", "Two")] <- +(sapply(1:2, grepl, df$a))
df
# a One Two
#1 6 0 0
#2 2%3%4 0 1
#3 6%2 0 1
#4 3%2 0 1
#5 5%2%4 0 1
If this is to get the binary output for each unique value in 'a' column, we can split the strings in 'a' by the delimiter (%
), the output list
can be converted to data.frame
using stack
, transform
the 'values' column to factor
class by specifying the levels
and do the table
table(transform(stack(setNames(strsplit(as.character(df$a), "[%]"),
1:nrow(df))), values = factor(values, levels= 1:6))[2:1])
Or we can use a convenient function mtabulate
from qdapTools
after splitting.
library(qdapTools)
mtabulate(strsplit(as.character(df$a), "[%]"))