csv pyspark number-formatting

PySpark: How to specify a column with comma as the decimal separator


I am working with PySpark and loading a csv file. I have a column with numbers in European format, meaning the comma is used as the decimal separator and the dot as the thousands separator.

For example: I have 2.416,67 instead of 2,416.67.

My data in the .csv file looks like this:

ID;    Revenue
21;    2.645,45
23;   31.147,05
...
55;    1.009,11

In pandas, such a file can easily be read by specifying the decimal=',' and thousands='.' options inside pd.read_csv().

Pandas code:

import pandas as pd
df = pd.read_csv("filepath/revenues.csv", sep=';', decimal=',', thousands='.')

I don't know how this can be done in PySpark.

PySpark code:

from pyspark.sql.types import StructType, StructField, FloatType, StringType
schema = StructType([
            StructField("ID", StringType(), True),
            StructField("Revenue", FloatType(), True)
                    ])
df = spark.read.csv("filepath/revenues.csv", sep=';', encoding='UTF-8', schema=schema, header=True)

Can anyone suggest how to load such a file in PySpark using the .csv() function shown above?


Solution

  • You won't be able to read it as a float because of the format of the data. You need to read it as a string, clean it up, and then cast it to float:

    from pyspark.sql.functions import col, regexp_replace

    # Read the file; the revenue column comes through as a string
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
    # Drop the thousands separator (the dot)
    df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))
    # Swap the decimal comma for a dot
    df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))
    # Cast the cleaned string to float
    df = df.withColumn('revenue', df['revenue'].cast("float"))
    

    You can probably just chain these all together too:

    df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
    df = (
             df
             .withColumn('revenue', regexp_replace('revenue', '\\.', ''))
             .withColumn('revenue', regexp_replace('revenue', ',', '.'))
             .withColumn('revenue', col('revenue').cast("float"))
         )
    

    Please note I haven't tested this, so there may be a typo or two in there.
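
    If you would rather keep the explicit schema from your question, a variant of the same idea (an untested sketch, reusing the file path and column names from the question) is to declare Revenue as a string in the schema, clean it, and cast it afterwards:

    from pyspark.sql.functions import col, regexp_replace
    from pyspark.sql.types import StructType, StructField, StringType, FloatType

    # Revenue is declared as a string because "2.645,45" cannot be parsed as a float directly
    schema = StructType([
        StructField("ID", StringType(), True),
        StructField("Revenue", StringType(), True),
    ])

    df = spark.read.csv("filepath/revenues.csv", sep=';', encoding='UTF-8',
                        schema=schema, header=True)

    # Strip the thousands separator, swap the decimal comma for a dot, then cast to float
    df = df.withColumn(
        "Revenue",
        regexp_replace(regexp_replace(col("Revenue"), "\\.", ""), ",", ".").cast(FloatType())
    )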