rtidyr

Tidy up datasets - converting subheaders into column


I received a txt data file that looks like this:

# test A response c
1   x1     1
2   x2     0
..       ..
324 x324   5
# test P response 8
1   x1     2
2   x2     1
..       ..
501 x501   4
# test 7 response t
1   x1     2
2   x2     1
..       ..
936 x936   4

It has two long columns separated by tab many subheaders (e.g. "test A response c"). Notice under each subheader, the number of rows is variable. The total number of rows is about 10,000. I want to tidy it up and drop all the subheaders, to as following:

x      y Test   Response  
x1     1 test A response c
x2     0 test A response c
..       ..
x324   5 test A response c
x1     2 test P response 8
x2     1 test P response 8
..       ..
x501   4 test P response 8
x1     2 test 7 response t
x2     1 test 7 response t
..       ..
x936   4 test 7 response t

What's the best way to do this?


Solution

  • One way to solve your problem

    library(tidyr)
    library(dplyr)
    
    df |> 
      mutate(Test = replace(x, !grepl("test", x), NA),
             Response = replace(y, !grepl("response", y), NA)) |> 
      fill(Test, Response) |> 
      filter(!grepl("test", x))
    
         x y   Test   Response
    1   x1 1 test A response c
    2   x2 0 test A response c
    3 x324 5 test A response c
    4   x1 2 test P response 8
    5   x2 1 test P response 8
    6 x501 4 test P response 8
    7   x1 2 test 7 response t
    8   x2 1 test 7 response t
    9 x936 4 test 7 response t
    
    data
    # read file
    df = read.delim("file.txt", header=FALSE, , col.names=c("x", "y"))
    # toy example:
    df = read.delim(text="test A    response c
    x1  1
    x2  0
    x324    5
    test P  response 8
    x1  2
    x2  1
    x501    4
    test 7  response t
    x1  2
    x2  1
    x936    4", header=FALSE, , col.names=c("x", "y"))
    
    df
            x          y
    1  test A response c
    2      x1          1
    3      x2          0
    4    x324          5
    5  test P response 8
    6      x1          2
    7      x2          1
    8    x501          4
    9  test 7 response t
    10     x1          2
    11     x2          1
    12   x936          4