pysparksql-types

Create a PySpark schema for a tuple composed of a list and an array


Can someone please tell me what the correct PySpark schema is for the following tuple?

([['__label__positif', '__label__négatif', '__label__neutre']], array([[0.60312474, 0.24436191, 0.15254335]]))

Thank you in advance


Solution

  • Have a look at the commented code below, which flattens the tuple into a single row with six columns:

    import numpy as np
    from pyspark.sql import SparkSession
    
    # Reuse the active session (e.g. in the pyspark shell) or create a new one
    spark = SparkSession.builder.getOrCreate()
    
    # This is the object you got from the fastText model
    pred = ([['__label__positif', '__label__négatif', '__label__neutre']], np.array([[0.60312474, 0.24436191, 0.15254335]]))
    print(pred)
    
    # First, flatten the nested object into a single list with 6 elements
    pred = [item for part in pred for row in part for item in row]
    print(pred)
    
    # PySpark may not accept NumPy scalar types, so cast the NumPy floats to plain Python floats
    pred = [x.item() if isinstance(x, np.floating) else x for x in pred]
    print(pred)
    
    # createDataFrame expects a list of rows, so wrap the six values in one tuple
    rows = [tuple(pred)]
    
    columns = ['one', 'two', 'three', 'four', 'five', 'six']
    
    df = spark.createDataFrame(rows, columns)
    df.show()
    

    Output:

    ([['__label__positif', '__label__négatif', '__label__neutre']], array([[0.60312474, 0.24436191, 0.15254335]])) 
    ['__label__positif', '__label__négatif', '__label__neutre', 0.60312474, 0.24436191, 0.15254335] 
    ['__label__positif', '__label__négatif', '__label__neutre', 0.60312474, 0.24436191, 0.15254335] 
    +----------------+----------------+---------------+----------+----------+----------+ 
    |             one|             two|          three|      four|      five|       six| 
    +----------------+----------------+---------------+----------+----------+----------+ 
    |__label__positif|__label__négatif|__label__neutre|0.60312474|0.24436191|0.15254335| 
    +----------------+----------------+---------------+----------+----------+----------+
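
If you would rather keep the shape of the original tuple (one column holding the labels, one holding the probabilities) instead of flattening it into six columns, an explicit schema with two array columns is one option. The following is only a minimal sketch and not part of the code above: the column names `labels` and `probabilities` and both helper names are made up for illustration, and `spark` is assumed to be an active SparkSession.

```python
# Datatype string describing two array columns (column names are hypothetical).
SCHEMA = "labels: array<string>, probabilities: array<double>"


def prediction_to_row(pred):
    """Convert (nested label list, 2-D probability array) into one row of native Python types."""
    labels, probabilities = pred
    # Both parts carry one extra outer level for a single prediction, so take index 0.
    # float() turns NumPy scalars such as np.float64 into plain Python floats.
    return ([str(label) for label in labels[0]],
            [float(p) for p in probabilities[0]])


def prediction_to_dataframe(spark, pred):
    """Build a one-row DataFrame; `spark` is assumed to be an active SparkSession."""
    return spark.createDataFrame([prediction_to_row(pred)], schema=SCHEMA)
```

Calling `prediction_to_dataframe(spark, pred)` with the tuple from the question should then give a single row whose two columns have the types `array<string>` and `array<double>`.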