When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:
"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
I have found code that should work in this question, but when I execute it, all the entries are returned as NULL.
I use the following to create a custom schema:
from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType
customSchema = StructType([
    StructField("Customer", IntegerType(), True),
    StructField("TransDate", TimestampType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Cost", IntegerType(), True),
    StructField("TransKey", IntegerType(), True)])
and then read in the CSV with:
myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)
Which returns:
+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|TransKey|
+--------+---------+--------+----+--------+
| null| null| null|null| null|
+--------+---------+--------+----+--------+
Am I missing a crucial step? I suspect that the date column is the root of the problem. Note: I am running this in Google Colab.
Here you go!
"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
PATH_TO_FILE="file:///u/vikrant/LocalTestDateFile"
Loading the above file into a dataframe:
df = spark.read.format("com.databricks.spark.csv") \
        .option("mode", "DROPMALFORMED") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", ",") \
        .load(PATH_TO_FILE)
Your date will get loaded as a string column, but the moment you cast it straight to a date type, values in this format come back as NULL:
from pyspark.sql.functions import col
df = df.withColumn('TransDate', col('TransDate').cast('date'))
+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost| TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
| 149332| null| 1| 199.95| 107|127998739| 100000|
+--------+---------+--------+-----------+----+---------+--------+
So we need to change the date format from dd.MM.yyyy to yyyy-MM-dd.
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType
Python function to change the date format:
change_dateformat_func = udf(lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))
Call this function on your dataframe column now:
newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))
+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost| TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
| 149332|2005-11-15| 1| 199.95| 107|127998739| 100000|
+--------+----------+--------+-----------+----+---------+--------+
And below is the schema:
|-- Customer: integer (nullable = true)
|-- TransDate: date (nullable = true)
|-- Quantity: integer (nullable = true)
|-- PurchAmount: double (nullable = true)
|-- Cost: integer (nullable = true)
|-- TransID: integer (nullable = true)
|-- TransKey: integer (nullable = true)
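As a side note, on Spark 2.2+ the built-in to_date function also accepts a format string, so the same conversion can be done without a UDF (just a sketch of the idea, using the same dd.MM.yyyy pattern as above):
from pyspark.sql.functions import to_date, col
newdf = df.withColumn('TransDate', to_date(col('TransDate'), 'dd.MM.yyyy'))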
Let me know if it works for you.