Tags: python, apache-spark, pyspark

PySpark sampleBy using multiple columns


I want to carry out stratified sampling from a data frame on PySpark. There is a sampleBy(col, fractions, seed=None) function, but it seems to use only a single column as the strata. Is there any way to use multiple columns as the strata?


Solution

  • Based on the answer here, after converting it to Python, an answer might look like this:

    # create a DataFrame to use
    df = sc.parallelize([(1, 1234, 282), (1, 1396, 179), (2, 8620, 178), (3, 1620, 191), (3, 8820, 828)]).toDF(["ID", "X", "Y"])
    
    # use the first two columns (ID, X) as our key (the strata)
    # assign a sampling fraction to each key; here every stratum gets 30%,
    # but each key could be given its own fraction
    fractions = df.rdd.map(lambda x: (x[0], x[1])).distinct().map(lambda x: (x, 0.3)).collectAsMap()
    
    # key the underlying RDD by the same two columns
    kb = df.rdd.keyBy(lambda x: (x[0], x[1]))
    
    # sample by key (without replacement), drop the key, and rebuild a DataFrame
    # note: if the sample returns no rows, toDF raises `ValueError: RDD is empty`
    sampleddf = kb.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(df.columns)
    sampleddf.show()
    +---+----+---+
    | ID|   X|  Y|
    +---+----+---+
    |  1|1234|282|
    |  1|1396|179|
    |  3|1620|191|
    +---+----+---+
    # repeated runs: without a seed, each call draws a different sample
    kb.sampleByKey(False,fractions).map(lambda x: x[1]).toDF(df.columns).show()
    +---+----+---+
    | ID|   X|  Y|
    +---+----+---+
    |  2|8620|178|
    +---+----+---+
    
    
    kb.sampleByKey(False,fractions).map(lambda x: x[1]).toDF(df.columns).show()
    +---+----+---+
    | ID|   X|  Y|
    +---+----+---+
    |  1|1234|282|
    |  1|1396|179|
    +---+----+---+
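For intuition, sampleByKey amounts to an independent Bernoulli draw per record, with the keep-probability looked up from the fractions map by that record's key. A minimal pure-Python sketch of that idea (no Spark required; the function and variable names here are illustrative, not Spark's API):

```python
import random

def sample_by_key(rows, key_fn, fractions, seed=None):
    """Keep each row independently with probability fractions[key_fn(row)]."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key_fn(row), 0.0)]

rows = [(1, 1234, 282), (1, 1396, 179), (2, 8620, 178), (3, 1620, 191), (3, 8820, 828)]
# same composite (ID, X) key as the Spark example above
key_fn = lambda r: (r[0], r[1])
fractions = {key_fn(r): 0.3 for r in rows}

# like the Spark version, the sample size varies run to run unless seeded
sample = sample_by_key(rows, key_fn, fractions, seed=42)
```

This is also why the Spark outputs above differ between calls: each show() reruns the draws, and sampleByKey was called without its optional seed argument.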
    

    Is this the kind of thing you were looking for?