apache-spark, apache-spark-sql, spark-csv

Spark SQLContext Query with header


I'm using SQLContext to read in a CSV file like this:

val csvContents = sqlContext.read.sql(
  "SELECT * FROM csv.`src/test/resources/afile.csv` WHERE firstcolumn=21")

But it prints the first column as `_c0` and treats the header row as data. How do I tell Spark to use the header while still using a SQL query? I've seen this solution:

 val df = spark.read
         .option("header", "true") // use the first row as column names
         .csv("file.csv")

But this doesn't allow me to do the SELECT query with the WHERE clause. Is there a way to specify a CSV header and do a SQL SELECT query?


Solution

  • It turns out the header wasn't being parsed correctly. The CSV file was tab-delimited, so I had to specify the delimiter explicitly:

    val csvContents = sqlContext.read
            .option("delimiter", "\t") // the file is tab-separated, not comma-separated
            .option("header", "true")  // use the first row as column names
            .csv(csvPath)
            .where("col_id=22")
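
If you still want actual SQL SELECT syntax rather than the DataFrame API, one option is to register the header-aware DataFrame as a temporary view and query that. A minimal sketch (the path, view name, and column name `firstcolumn` are placeholders taken from the question):

```scala
// Sketch: run a SQL query against a header-aware CSV via a temp view.
// Assumes a SparkSession named `spark` (Spark 2.0+).
val df = spark.read
  .option("delimiter", "\t") // tab-delimited, as in the fix above
  .option("header", "true")  // use the first row as column names
  .csv("src/test/resources/afile.csv")

// Register the DataFrame under a name SQL can reference.
df.createOrReplaceTempView("afile")

// Now the WHERE clause works against real column names, not _c0.
val csvContents = spark.sql("SELECT * FROM afile WHERE firstcolumn = 21")
```

This keeps the read options (delimiter, header) in one place while letting the rest of the code stay in plain SQL.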