Can someone please tell me what the correct PySpark schema would be for the following tuple:
([['__label__positif', '__label__négatif', '__label__neutre']], array([[0.60312474, 0.24436191, 0.15254335]]))
Thank you in advance
Have a look at the commented code below:
import numpy as np
from pyspark.sql import SparkSession
# in the pyspark shell `spark` already exists; otherwise create (or reuse) a session
spark = SparkSession.builder.getOrCreate()
# this is the object you got from the fastText model
pred = ([['__label__positif', '__label__négatif', '__label__neutre']], np.array([[0.60312474, 0.24436191, 0.15254335]]))
print(pred)
# first, flatten the nested object into a flat list of 6 elements
pred = [item for sublist in pred for subsubiter in sublist for item in subsubiter]
print(pred)
# PySpark does not handle NumPy scalar types well, so cast the NumPy floats to Python floats
pred = [x.item() if isinstance(x, np.floating) else x for x in pred]
print(pred)
# one tuple is one row; the column types are inferred from the values
rows = [tuple(pred)]
columns = ['one', 'two', 'three', 'four', 'five', 'six']
df = spark.createDataFrame(rows, columns)
df.show()
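If you would rather declare the schema explicitly than rely on inference, here is a minimal sketch of one possible layout. It is a different shape from the six-column version: one row per label, with a string column `label` and a double column `probability`. The helper names `to_rows` and `to_dataframe` and the two column names are made up for illustration, and `spark` is assumed to be an existing SparkSession.

```python
def to_rows(pred):
    """Pair every label with its probability as plain Python (str, float) tuples."""
    labels, probs = pred
    # ndarray.tolist() yields nested Python floats; plain nested lists pass through
    probs = probs.tolist() if hasattr(probs, "tolist") else probs
    return [
        (label, float(p))
        for label_row, prob_row in zip(labels, probs)
        for label, p in zip(label_row, prob_row)
    ]


def to_dataframe(spark, pred):
    """Build a two-column DataFrame with an explicitly declared schema."""
    # imported here so that to_rows() stays usable without PySpark installed
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField("label", StringType(), nullable=False),
        StructField("probability", DoubleType(), nullable=False),
    ])
    return spark.createDataFrame(to_rows(pred), schema)
```

Calling `to_dataframe(spark, pred).show()` on the original tuple should give three rows, one per label. The long format is a design choice: it keeps working if the model returns a different number of labels, whereas the wide format needs one pair of columns per label.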
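Printed values of pred at each step (original, flattened, cast), followed by the six-column DataFrame:
([['__label__positif', '__label__négatif', '__label__neutre']], array([[0.60312474, 0.24436191, 0.15254335]]))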
['__label__positif', '__label__négatif', '__label__neutre', 0.60312474, 0.24436191, 0.15254335]
['__label__positif', '__label__négatif', '__label__neutre', 0.60312474, 0.24436191, 0.15254335]
| one| two| three| four| five| six|