How to upload local parquet files to a remote Databricks table in a language agnostic way?


I have a dotnet project and I am generating a bunch of parquet files. I would like to upload these parquet files to Databricks. I would prefer to avoid introducing Python to this project. Is there a way to do this that does not require Python?

The Databricks documentation seems to cover only the UI/manual upload case.

I am hoping there is some sort of HTTP API I can invoke, or something similar. Could ODBC be useful?


Solution

  • It depends on how big the file(s) are, whether this is a one-time or ongoing load, what storage backs your tables, what kind of tables they are, etc.

    Since you mentioned .NET and ODBC: here is the Databricks ODBC driver that you can use.
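
    In case it is useful, below is a minimal sketch of running SQL over that driver with a DSN-less connection string. It uses pyodbc only for brevity; the same connection-string keys apply from .NET's System.Data.Odbc. The host, HTTP path and token are placeholders, not values from your setup.

    import pyodbc

    # Minimal sketch: execute SQL on a Databricks SQL warehouse / cluster over ODBC.
    # Driver name, host, HTTP path and token below are placeholders.
    conn = pyodbc.connect(
        "Driver=Simba Spark ODBC Driver;"   # name as registered by the Databricks ODBC driver install
        "Host=adb-1234567890123456.7.azuredatabricks.net;"
        "Port=443;"
        "SSL=1;"
        "ThriftTransport=2;"                # HTTP transport
        "HTTPPath=/sql/1.0/warehouses/abcdef1234567890;"
        "AuthMech=3;"                       # token auth: UID is the literal string 'token'
        "UID=token;"
        "PWD=<personal-access-token>;",
        autocommit=True,
    )
    cur = conn.cursor()
    cur.execute("SELECT current_catalog(), current_schema()")
    print(cur.fetchone())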

    Holistically you have two options:

    1. Pull.

    You'll have to make your "local" parquet file available to Databricks compute somehow, i.e. copy it to some storage/filesystem that is readable from your Databricks compute (e.g. cloud object storage or DBFS), and then have Databricks load it from there, for example with COPY INTO (see the sketch right after this list).

    2. Push.

    Let's say the "remote Databricks table" is a Delta table backed by S3 storage (say s3://bucket/path/to/delta/table1). In that case you can write to that storage location directly, without the data going through Databricks compute.
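
    For option 1 (Pull), a minimal sketch over plain HTTP follows. It is shown in Python only for brevity; these are ordinary REST calls you can make from .NET's HttpClient just as well. The workspace URL, token, paths, warehouse id and table name are placeholders, not values from the question.

    import base64
    import requests

    HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}

    # 1. Upload the local parquet file to DBFS. Note: the 'contents' field of
    #    /api/2.0/dbfs/put is base64-encoded and limited to about 1 MB; for
    #    bigger files use the create/add-block/close endpoints of the DBFS API.
    with open("/tmp/path/to/local/parquet/file1.parquet", "rb") as f:
        contents = base64.b64encode(f.read()).decode()
    requests.post(
        f"{HOST}/api/2.0/dbfs/put",
        headers=HEADERS,
        json={"path": "/tmp/uploads/file1.parquet", "contents": contents, "overwrite": True},
    ).raise_for_status()

    # 2. Ask Databricks to pull the file into the table, here via the SQL
    #    Statement Execution API against a SQL warehouse. The same COPY INTO
    #    statement could equally be run over the ODBC connection shown earlier.
    requests.post(
        f"{HOST}/api/2.0/sql/statements/",
        headers=HEADERS,
        json={
            "warehouse_id": "<warehouse-id>",
            "statement": "COPY INTO my_schema.table1 FROM 'dbfs:/tmp/uploads/' FILEFORMAT = PARQUET",
        },
    ).raise_for_status()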


    For the Push option (2), it is also possible to run a local Spark cluster and write to the Delta table using the Delta Lake library, without using Databricks at all. E.g.:

    $ pyspark --packages io.delta:delta-spark_2.12:3.0.0,org.apache.hadoop:hadoop-aws:3.3.4 \
              --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
              --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
    ......snip......
    io.delta#delta-spark_2.12 added as a dependency
    org.apache.hadoop#hadoop-aws added as a dependency
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/ '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
          /_/
    
    Using Python version 3.10.12 (main, Jun  8 2023 17:32:40)
    SparkSession available as 'spark'.
    >>> df = spark.read.parquet('/tmp/path/to/local/parquet/file1.parquet')
    >>> df.write.format('delta').save('s3a://bucket/path/to/delta/table1')
    >>>
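
    One follow-up note on the Push example above, under the assumption that the table is not yet registered in your workspace: writing the files to s3://bucket/path/to/delta/table1 creates/updates the Delta table at that location, but Databricks will only see it as a named table once the location is registered (and if the table already contains data, the local write would use .mode('append') instead of the plain save). A one-time registration, using the placeholder names from the example:

    # Run once on Databricks (notebook or SQL warehouse) to register the
    # existing Delta location as a table; names are the placeholders used above.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS my_schema.table1
        USING DELTA
        LOCATION 's3://bucket/path/to/delta/table1'
    """)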