I have a Delta table with 20 columns and around 5.23 million rows. Some of the columns have complex data types. I want to export the data from the table to .txt files using Python, with a header row, a tab (\t) field delimiter, and 50,000 rows of data in each file. I am pretty new to Databricks and Python, so please write the full code and not just the logic.
Thanks in advance.
I tried searching but didn't find a solution.
Does the 50,000-record count per file have to be exact? If not, 5.23e6 / 50,000 is roughly 105, so if we repartition the data into that many partitions, each output file will hold roughly 50,000 records:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

df = spark.read.format('delta').load('<path to table>')
record_count = df.count()

(
    df
    # Cast every column to string so complex data types serialize cleanly to text
    .select(*[f.col(c).cast(StringType()).alias(c) for c in df.columns])
    # ~105 partitions -> roughly 50,000 records per output file
    .repartition(int(record_count / 50000))
    .write
    .option('delimiter', '\t')
    .option('header', True)
    .csv('<write destination>')
)
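If each file must not exceed 50,000 rows, you can instead use the DataFrameWriter option maxRecordsPerFile (available since Spark 2.2), which caps the row count per output file regardless of partitioning. A minimal sketch, assuming the same placeholder paths as above:

# Alternative sketch: cap each output file at 50,000 rows with maxRecordsPerFile
# instead of relying on repartition() to spread rows evenly
(
    df
    .select(*[f.col(c).cast(StringType()).alias(c) for c in df.columns])  # stringify complex types
    .write
    .option('maxRecordsPerFile', 50000)  # no output file gets more than 50,000 rows
    .option('delimiter', '\t')
    .option('header', True)
    .csv('<write destination>')
)

Note this only guarantees an upper bound; the last file written from each partition may hold fewer than 50,000 rows.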
# Since you're on Databricks and you absolutely need the .txt extension,
# rename the part files that Spark wrote (skip _SUCCESS / _committed metadata files)
files = dbutils.fs.ls('<write destination>')
for element in files:
    if element.path.endswith('.csv'):
        dbutils.fs.mv(element.path, element.path.replace('.csv', '.txt'))
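To sanity-check the export, you can read the written files back with the same delimiter and header settings (Spark ignores the _SUCCESS metadata file, and the .txt extension doesn't matter when you point the reader at the directory). A small sketch assuming the same placeholder destination:

# Optional sanity check: read the exported files back and confirm the totals
check_df = (
    spark.read
    .option('delimiter', '\t')
    .option('header', True)
    .csv('<write destination>')
)
print(check_df.count())        # should match record_count
print(len(check_df.columns))   # should be 20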