r, apache-spark, sparkr

Convert a list of fields to a structType object (a SparkR schema)


I need to get the schema of a SparkR DataFrame both as a structType and as a list of fields, e.g.:

schema <- schema(output_count)

str(schema)
#List of 2
# $ jobj  :Class 'jobj' <environment: 0x563114ff5900> 
# $ fields:function ()  
# - attr(*, "class")= chr "structType"

fields <- schema$fields()

fields
#[[1]]
#StructField(name = "word", type = "StringType", nullable = TRUE)
#[[2]]
#StructField(name = "count", type = "StringType", nullable = TRUE)

I found that the SparkR API documents a method for this: https://spark.apache.org/docs/2.0.0/api/R/

but, as a beginner in SparkR, I am not sure how to use it.

My attempt:

schema <- schema(output_count)
str(schema)
#List of 2
# $ jobj  :Class 'jobj' <environment: 0x563114ff5900> 
# $ fields:function ()  
# - attr(*, "class")= chr "structType"

I am trying to get it as a structType.


Solution

  • If I understood correctly, the code below produces at least the type of output you described in the question.

    df <- SparkR::createDataFrame(iris)
    lapply(SparkR::dtypes(df), function(x) SparkR::structField(x[1], x[2]))
    

    The output is:

    [[1]] 
    StructField(name = "Sepal_Length", type = "DoubleType", nullable = TRUE)
    [[2]] 
    StructField(name = "Sepal_Width", type = "DoubleType", nullable = TRUE)
    [[3]] 
    StructField(name = "Petal_Length", type = "DoubleType", nullable = TRUE)
    [[4]] 
    StructField(name = "Petal_Width", type = "DoubleType", nullable = TRUE)
    [[5]] 
    StructField(name = "Species", type = "StringType", nullable = TRUE)
    

    If you then pass that list to SparkR::structType using do.call,

    do.call(SparkR::structType, lapply(SparkR::dtypes(df), function(x) SparkR::structField(x[1], x[2])))
    

    the output looks like this:

    StructType
    |-name = "Sepal_Length", type = "DoubleType", nullable = TRUE
    |-name = "Sepal_Width", type = "DoubleType", nullable = TRUE
    |-name = "Petal_Length", type = "DoubleType", nullable = TRUE
    |-name = "Petal_Width", type = "DoubleType", nullable = TRUE
    |-name = "Species", type = "StringType", nullable = TRUE
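
    The same do.call approach should also work on a list of structField objects you already have, such as the one returned by schema$fields() in the question, so the fields do not need to be rebuilt from dtypes. The following is a minimal sketch under the assumption that a local Spark installation is available and that sparkR.session() can start it; the variable names fields and rebuilt are only illustrative.

    ```r
    library(SparkR)
    sparkR.session()  # assumption: a local Spark installation is available

    df <- createDataFrame(iris)

    # Existing list of structField objects, taken from the DataFrame's schema
    fields <- schema(df)$fields()

    # structType() expects the fields as separate arguments,
    # so do.call() is used to spread the list over them
    rebuilt <- do.call(structType, fields)

    class(rebuilt)
    ```

    do.call is needed here because structType takes the fields as individual arguments rather than as a single list.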