r, zip, compression, gzip, vowpalwabbit

Write a gzip file from a data frame


I'm trying to write a data frame to a gzip file but am having problems.

Here's my code example:

df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))

gz1 <- gzfile("df1.gz","w" )
writeLines(df1)

Error in writeLines(df1) : invalid 'text' argument

Any suggestions?

EDIT: an example line of the character vector I'm trying to write is:

0 | var1:1.5 var2:.55 var7:1250

The class label (y-variable) is separated from the x-variables by " | ", each variable name is separated from its value by ":", and variables are separated from each other by spaces.

EDIT2: I apologize for the wording and format of the question, but here are the results.

Old method:

system.time(write(out1, file="out1.txt"))
#    user  system elapsed 
#   9.772  17.205  86.860 

New Method:

writeGzFile <- function() {
  gz1 <- gzfile("df1.gz", "w")
  write(out1, gz1)
  close(gz1)
}

system.time( writeGzFile())
#    user  system elapsed 
#   2.312   0.000   2.478 

Thank you all very much for helping me figure this out.


Solution

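  • writeLines expects a character vector, not a data frame, which is why the call fails with "invalid 'text' argument". The simplest way to write a data frame to a gzip file is to pass a gzfile connection to write.csv: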

    df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))
    gz1 <- gzfile("df1.gz", "w")
    write.csv(df1, gz1)
    close(gz1)
    

    This writes the data frame as a gzipped CSV. See also write.table and write.csv2 for alternative ways of writing the file out.
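
    As a rough illustration of the write.table alternative, the sketch below writes a tab-separated gzipped file and reads it back as a check. The temporary path and the choice of tab as separator are assumptions made for the example, not part of the original question.

    ```r
    # Sketch: write a tab-separated gzipped file, then read it back to check it.
    df1 <- data.frame(id = 1:10, var1 = runif(10), var2 = runif(10))

    path <- file.path(tempdir(), "df1.tsv.gz")  # hypothetical output path
    gz1 <- gzfile(path, "w")
    write.table(df1, gz1, sep = "\t", row.names = FALSE, quote = FALSE)
    close(gz1)

    # Read through a gzfile connection; read.table opens and closes it itself
    df2 <- read.table(gzfile(path), header = TRUE, sep = "\t")
    all.equal(df1, df2)
    ```

    write.table writes numbers with up to 15 significant digits, so the comparison uses all.equal (which allows a small tolerance) rather than identical.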

    EDIT: Based on the updates to the post about the desired format, I made the following helper (quickly thrown together, so it can probably be simplified a lot). Note that it uses the row index in place of the class label:

    # Format each row as "<row index> | name : value name : value ..."
    myser <- function(df) {
        dfNames <- names(df)
        sapply(seq_len(nrow(df)), function(rowIndex) {
            paste(rowIndex, '|',
                paste(sapply(seq_along(dfNames), function(element) {
                    # each column contributes three tokens: name, ':', value
                    c(dfNames[element], ':', df[rowIndex, element])
                }), collapse=' ')
            )
        })
    }
    

    The output looks like this:

    a <- data.frame(x=1:10,y=rnorm(10))
    writeLines(myser(a))
    # 1 | x : 1 y : -0.231340933021948
    # 2 | x : 2 y : 0.896777389870928
    # 3 | x : 3 y : -0.434875004781075
    # 4 | x : 4 y : -0.0269824962632977
    # 5 | x : 5 y : 0.67654540494899
    # 6 | x : 6 y : -1.96965253674725
    # 7 | x : 7 y : 0.0863177759402661
    # 8 | x : 8 y : -0.130116466571162
    # 9 | x : 9 y : 0.418337557610229
    # 10 | x : 10 y : -1.22890714891874
    

    All that remains is to pass the gzfile connection as the second argument to writeLines to get gzipped output in this format.
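
    Putting those pieces together, here is a minimal sketch that builds the lines in a vectorized way and writes them through a gzfile connection. The helper name to_vw_lines, the label_col argument, the sample data, and the temporary path are all hypothetical; it also writes name:value without surrounding spaces, to match the example line in the question, which differs from the helper above.

    ```r
    # Hypothetical helper: one line per row, "label | name:value name:value ..."
    to_vw_lines <- function(df, label_col) {
      feature_cols <- setdiff(names(df), label_col)
      # Build "name:value" strings column by column (vectorized over rows)
      pairs <- lapply(feature_cols, function(nm) paste0(nm, ":", df[[nm]]))
      # Join the per-column strings within each row, separated by spaces
      features <- do.call(paste, c(pairs, sep = " "))
      paste(df[[label_col]], "|", features)
    }

    df <- data.frame(y = c(0, 1), var1 = c(1.5, 2), var2 = c(0.55, 3))
    lines <- to_vw_lines(df, "y")
    # lines[1] is "0 | var1:1.5 var2:0.55"

    path <- file.path(tempdir(), "df.vw.gz")  # hypothetical output path
    gz <- gzfile(path, "w")
    writeLines(lines, gz)   # the connection is the second argument (con)
    close(gz)

    # Read the compressed file back to confirm the round trip
    readLines(gzfile(path))
    ```

    Working column by column avoids the row-by-row indexing of the helper above, which is likely to matter for large data frames, though I have not timed it here.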