Tags: azure-databricks, ucanaccess

UCanAccess keeps repeating column names as data


I was using UCanAccess 5.0.1 on Databricks 9.1 LTS (Spark 3.1.2, Scala 2.12), and for some reason, when I use the following code to read a single-record Access database table, it keeps returning the column names as the data itself (I've tried adding more records and got the same result).

The Access db table looks like this (2 records):

ID  Field1  Field2
2   key1    column2
3   key2    column2-1

The python code looks like this:

connectionProperties = {
  "driver" : "net.ucanaccess.jdbc.UcanaccessDriver"
}
 
url = "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb"
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)

And the result looks like this:

df.printSchema()
df.count()

root
 |-- Field1: string (nullable = true)
 |-- Field2: string (nullable = true)
 |-- ID: string (nullable = true)
Out[21]: 2

df.show()
+------+------+---+
|Field1|Field2| ID|
+------+------+---+
|Field1|Field2| ID|
|Field1|Field2| ID|
+------+------+---+

Any ideas or suggestions?


Solution

  • It turns out the problem is Spark's default JDBC identifier quoting, not the driver itself (see https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name): Spark's default dialect wraps column names in double quotes, and Access SQL treats a double-quoted name as a string literal, so every row comes back as the literal column names.

    I registered the following custom JDBC dialect, and now the UCanAccess driver works fine:

    %scala
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    // Custom dialect for UCanAccess: quote identifiers with backticks
    // instead of Spark's default double quotes, which Access SQL would
    // interpret as string literals.
    private case object UcanaccessDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:ucanaccess")
      override def quoteIdentifier(colName: String): String =
        colName.split('.').map(part => s"`$part`").mkString(".")
    }

    JdbcDialects.registerDialect(UcanaccessDialect)
    

    Then `display(df)` shows:

    | Field1 | Field2    | ID |
    |:-------|:----------|---:|
    | key1   | column2   | 2  |
    | key2   | column2-1 | 3  |
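The quoting difference at the heart of the fix can be sketched in plain Python (an illustrative sketch only, not Spark's actual implementation): the default dialect's double quotes make Access SQL select the literal strings `'Field1'`, `'Field2'`, `'ID'` for every row, while the custom dialect's backticks make it select the real columns.

```python
# Illustrative sketch of the two quoting strategies (hypothetical helper
# functions, not part of Spark or UCanAccess).

def default_quote(col_name: str) -> str:
    # Mimics Spark's default JdbcDialect: wrap the name in double quotes.
    # Access SQL reads "Field1" as the string literal 'Field1', so
    # SELECT "Field1" FROM Table1 returns the text Field1 for every row.
    return f'"{col_name}"'

def backtick_quote(col_name: str) -> str:
    # Mimics the custom dialect: backtick-quote each dotted part, so the
    # name is parsed as a genuine column identifier.
    return ".".join(f"`{part}`" for part in col_name.split("."))

cols = ["Field1", "Field2", "ID"]
print("SELECT " + ", ".join(default_quote(c) for c in cols) + " FROM Table1")
# broken query: selects three constant strings per row
print("SELECT " + ", ".join(backtick_quote(c) for c in cols) + " FROM Table1")
# fixed query: selects the actual columns
```

This is why the schema and row count looked right while the values were wrong: the query was valid, it just projected constants instead of columns.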