I have a CSV that looks like this:
+-----------------+-----------------+-----------------+
| Column One | Column Two | Column Three |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
In plain text, it actually looks like this:
Column One,Column Two,Column Three
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
My spark.read
method looks like this:
val df = spark.read
.format("csv")
.schema(schema)
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
When multiLine
is set to true
, the df
loads as empty. It loads fine when multiLine
is set to false
, but I need multiLine
set to true
.
If I change the name of Column Three
to ColumnThree
, and also update that in the schema
object, then it works fine. It seems like multiLine
is being applied to the header row! I was hoping that wouldn't be the case when header
is also set to true
.
Any ideas how to get around this? Should I be using the univocity
parser instead of the default commons
?
UPDATE:
I don't know why that mocked data was working fine. Here's a closer representation of the data:
CSV (Just 1 header and 1 line of data...):
Digital ISBN,Print ISBN,Title,Price,File Name,Description,Book Cover File Name
97803453308,test,This is English,29.99,qwe_1.txt,test,test
Schema & the spark.read
method:
val df = spark.read
.format("csv")
.schema(StructType(Array(
StructField("Digital ISBN", StringType, true),
StructField("Print ISBN", StringType, true),
StructField("Title", StringType, true),
StructField("File Name", StringType, true),
StructField("Price", StringType, true),
StructField("Description", StringType, true),
StructField("Book Cover File Name", StringType, true)
)))
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
df.show()
result in spark-shell
:
+------------+----------+-----+---------+-----+-----------+--------------------+
|Digital ISBN|Print ISBN|Title|File Name|Price|Description|Book Cover File Name|
+------------+----------+-----+---------+-----+-----------+--------------------+
+------------+----------+-----+---------+-----+-----------+--------------------+
UDPATE 2:
I think I found "what's different". When I copy the data in the CSV and save it to another CSV, it works fine. But that original CSV (which was saved by Excel), fails... The CSV saved by Excel is 1290 bytes, while the CSV I created myself (which works fine) is 1292 bytes....
UPDATE 3:
I opened the two files mentioned in Update2 in vim
and noticed that the CSV saved by Excel had ^M
instead of new lines. All of my testing prior to this was flawed because it was always comparing a CSV originally saved by Excel vs a CSV created from Sublime... Sublime wasn't showing the difference. I'm sure there's a setting or package I can install to see that, because I use Sublime as my go-to one-off file editor...
Not sure if I should close this question since the title is misleading. Then again, there's gotta be some value to someone out there lol...
Since the question has a few up-votes, here's the resolution to the original problem as an answer...
Newlines in files saved in the Windows world contain both carriage return
and line feed
. Spark (running on Linux) sees this as a malformed row and drops it, because in its world, newlines are just line feed
.
Lessons: