Tags: r, regex, read.csv, ragged

Importing first three and last three fields from CSV with variable number of fields


I have a data set in CSV format. Unfortunately, each line contains a different number of commas, so the number of fields varies from line to line. I want to import only the first 3 and the last 3 fields of each line into R.

For example, given:

> line: "A","B","C","D",...,"X","Y","Z"

I want to end up with:

> line: "A","B","C","X","Y","Z"

I tried using grep with a regular expression to match the first 3 fields:

new_data <- grep("([^,]+)(,[^,]+){2}", dataset, value = TRUE)

This only returns every line in which the expression matches; the lines themselves are unchanged.

Can grep be used to remove the remaining fields from each line? If so, how can I drop the whole range in between, that is, every field after the third one and before the last three?

Do you know another method to solve this problem?


Solution

  • Using a combination of apply and head and tail:

    d2 <- data.frame(t(apply(d1, 1, function(x) c(head(x[x != ''],3), tail(x[x != ''],3)))))
    

    resulting in:

    > d2
      X1 X2 X3 X4 X5 X6
    1  a  b  c  x  y  z
    2  a  b  c  g  h  i
    3  a  b  c  t  u  v
    

    Using the data of @VarunM:

    d1 <- read.csv(text='a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
    a, b, c, d, e, f, g, h, i
    a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v', header = FALSE, fill = TRUE)
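
If the real file is too ragged for read.csv to guess the column count (with fill = TRUE it only inspects the first five lines), the same head/tail idea can be applied to the raw lines instead. The following is a minimal sketch, not a definitive implementation: `first_last` is a hypothetical helper name, the `lines` vector stands in for `readLines("your_file.csv")` (the file name is an assumption), and it assumes every line has at least 6 fields and that no quoted field contains a comma.

```r
# Hypothetical helper: keep the first n and the last n comma-separated
# fields of a single line of text.
first_last <- function(line, n = 3) {
  # Split on literal commas and strip the whitespace around each field.
  fields <- trimws(strsplit(line, ",", fixed = TRUE)[[1]])
  c(head(fields, n), tail(fields, n))
}

# Stand-in for readLines("your_file.csv"); same data as d1 above.
lines <- c(
  "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z",
  "a, b, c, d, e, f, g, h, i",
  "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v"
)

# One character vector of length 6 per line, bound row-wise into a data frame.
d3 <- as.data.frame(do.call(rbind, lapply(lines, first_last)),
                    stringsAsFactors = FALSE)
```

Because each line is split on its own, this does not depend on how many columns the longest line has. Lines with fewer than 6 fields would need separate handling, since head and tail would then overlap or return fewer than 3 values.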