apache-spark, hbase, parquet, google-cloud-bigtable, spark-avro

Spark - Wide/sparse dataframe persistence


I want to persist a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null) while storing only the non-null values (to avoid storage cost).

Note that I have already tried Parquet and Avro with a simple df.write statement. For a df of roughly 100 rows x 130k columns, Parquet performs worst (ca. 55 MB) compared with Avro (ca. 15 MB). To me this suggests that ALL null values are stored.

Thanks !


Solution

  • Spark to JSON / SparseVector (from thebluephantom)

    This example uses PySpark with pyspark.ml; translate it to Scala if needed.

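    %python
    from pyspark.sql.types import StructType, StructField, DoubleType
    from pyspark.ml.linalg import SparseVector, VectorUDT

    # `sc` and `spark` are the context and session provided by the notebook.
    # Each SparseVector stores its size plus only the explicitly set entries.
    temp_rdd = sc.parallelize([
        (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
        (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

    # Explicit schema so that the vector column is typed as VectorUDT.
    schema = StructType([
        StructField("label", DoubleType(), False),
        StructField("features", VectorUDT(), False)
    ])

    df = temp_rdd.toDF(schema)
    df.printSchema()
    df.write.json("/FileStore/V.json")

    # Read back with the same schema to restore the vector column.
    df2 = spark.read.schema(schema).json("/FileStore/V.json")
    df2.show()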
    

    On read, this returns:

    +-----+--------------------+
    |label|            features|
    +-----+--------------------+
    |  1.0|(4,[0,2],[-1.0,0.5])|
    |  0.0| (4,[1,3],[1.0,5.5])|
    +-----+--------------------+
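To apply the SparseVector approach to a wide, mostly-null row, the row first has to be reduced to its size and a mapping of populated positions, which is the form `SparseVector(size, {index: value})` takes in the example above. The following is only a minimal pure-Python sketch of that reduction; the helper name `row_to_sparse_parts` and the sample row are hypothetical and not part of the original answer. Keep in mind that a SparseVector reads missing entries back as 0.0 rather than null, so this only fits data where that distinction does not matter.

```python
from typing import Dict, Optional, Sequence, Tuple


def row_to_sparse_parts(row: Sequence[Optional[float]]) -> Tuple[int, Dict[int, float]]:
    """Return (size, {index: value}) keeping only the non-null entries of a row."""
    non_null = {i: float(v) for i, v in enumerate(row) if v is not None}
    return len(row), non_null


# Hypothetical wide row: 4 columns, two of them populated.
size, entries = row_to_sparse_parts([None, 1.0, None, 5.5])
print(size, entries)  # 4 {1: 1.0, 3: 5.5}
```

The resulting pair could then be passed on as `SparseVector(size, entries)` before writing the DataFrame.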
    

  • Spark to Avro / Avro2TF (from py-r)

    The Avro2TF library presented in this tutorial seems to be an interesting alternative that leverages Avro directly. With it, a sparse vector is encoded as two parallel arrays, one of indices and one of values:

    +---------------------+--------------------+
    |genreFeatures_indices|genreFeatures_values|
    +---------------------+--------------------+
    |     [2, 4, 1, 8, 11]|[1.0, 1.0, 1.0, 1...|
    |          [11, 10, 3]|     [1.0, 1.0, 1.0]|
    |            [2, 4, 8]|     [1.0, 1.0, 1.0]|
    |             [11, 10]|          [1.0, 1.0]|
    |               [4, 8]|          [1.0, 1.0]|
    |         [2, 4, 7, 3]|[1.0, 1.0, 1.0, 1.0]|
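The indices/values layout in the table above can be illustrated without any library. The sketch below is not the Avro2TF API; `encode` and `decode` are hypothetical helpers that show how a row with nulls maps to two parallel arrays and back. It assumes the total number of columns (`size`) is stored or known separately, since the two arrays alone do not carry it.

```python
from typing import List, Optional, Sequence, Tuple


def encode(row: Sequence[Optional[float]]) -> Tuple[List[int], List[float]]:
    """Split a row into parallel index/value lists, skipping nulls."""
    indices = [i for i, v in enumerate(row) if v is not None]
    values = [float(row[i]) for i in indices]
    return indices, values


def decode(size: int, indices: Sequence[int], values: Sequence[float]) -> List[Optional[float]]:
    """Rebuild the dense row; positions not listed in `indices` stay None."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    row: List[Optional[float]] = [None] * size
    for i, v in zip(indices, values):
        row[i] = v
    return row


indices, values = encode([None, 1.0, None, None, 1.0])
print(indices, values)             # [1, 4] [1.0, 1.0]
print(decode(5, indices, values))  # [None, 1.0, None, None, 1.0]
```

Because `decode` places each value by its index, the indices do not need to be sorted, which matches rows in the table such as `[2, 4, 1, 8, 11]`.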