Tags: r, apache-spark, sparklyr

reading a subset of columns with spark_read_parquet


I tried to read a subset of columns from a 'table' using spark_read_parquet,

temp <- spark_read_parquet(sc, name='mytable',columns=c("Col1","Col2"),
                                 path="/my/path/to/the/parquet/folder")

But I got the error:

Error: java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (54): .....

Is my syntax right? I tried googling for a real code example that uses the columns argument, but couldn't find one.

(And my apologies in advance... I don't really know how to give you a reproducible example involving Spark and the cloud.)


Solution

  • TL;DR This is not how columns works. When passed like this, it is used to rename the columns, so its length has to match the number of columns in the input.
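
    For illustration only, a minimal sketch of what columns actually does (the path and file here are hypothetical and assumed to contain exactly three columns): it renames the columns positionally, so in your case you would have to supply all 54 names, not just two.

    # hypothetical three-column parquet file; columns renames them positionally,
    # so its length must equal the number of columns in the file
    spark_read_parquet(
      sc, name = "renamed", path = "/tmp/three_col_parquet",
      columns = c("new_a", "new_b", "new_c")
    )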

    The way to use it is as follows (please note memory = FALSE; it is crucial here, otherwise the whole table is cached before the select):

    spark_read_parquet(
      sc, name = "mytable", path = "/tmp/foo", 
      memory = FALSE
    ) %>% select(Col1, Col2) 
    

    optionally followed by (so that only the selected columns get cached):

    ... %>% 
      sdf_persist()
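
    Putting it together, a hedged usage sketch (the path is a placeholder and Col1/Col2 are the names from your question): only the selected columns are persisted, and you can then pull a sample back into R.

    small_tbl <- spark_read_parquet(
      sc, name = "mytable", path = "/tmp/foo", memory = FALSE
    ) %>%
      select(Col1, Col2) %>%
      sdf_persist()

    # bring a few rows back into R to inspect the result
    small_tbl %>% head(10) %>% collect()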
    

    If you have a character vector, you can use rlang:

    library(rlang)
    
    cols <- c("Col1", "Col2")
    
    # parse_quosure() is deprecated in newer rlang versions; syms() does the same job here
    spark_read_parquet(sc, name = "mytable", path = "/tmp/foo", memory = FALSE) %>% 
      select(!!!syms(cols))
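
    Alternatively, assuming a reasonably recent dplyr/tidyselect, all_of() avoids the rlang splice entirely:

    library(dplyr)
    
    cols <- c("Col1", "Col2")
    
    # tidyselect helper: select the columns named in the character vector
    spark_read_parquet(sc, name = "mytable", path = "/tmp/foo", memory = FALSE) %>% 
      select(all_of(cols))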