apache-spark, pyspark, apache-spark-sql, databricks, parquet

Preserve parquet file names in PySpark


I am reading a parquet file with 2 partitions using Spark in order to apply some processing on it. Let's take this example:

├── Users_data
│   ├── region=eu
│   │   └── country=france
│   │       └── fr_default_players_results.parquet
│   └── region=na
│       └── country=us
│           └── us_default_players_results.parquet

Is there a way to preserve the same file names (in this case fr_default_players_results.parquet and us_default_players_results.parquet) when writing the parquet back with df.write()?


Solution

  • No, unfortunately you cannot choose file names with Spark because they are generated automatically. What you can do instead is create a column containing the original file names and partition by that column as well; this creates a directory named after each file, holding the files generated by Spark:

    from pyspark.sql.functions import input_file_name, regexp_extract

    df.withColumn("file_name", regexp_extract(input_file_name(), "[^/]*$", 0)) \
      .write.partitionBy("region", "country", "file_name").parquet("path/Users_data")
    

    This will create this tree:

    ├── Users_data
    │   ├── region=eu
    │   │   └── country=france
    │   │       └── file_name=fr_default_players_results.parquet
    │   │           └── part-00...c000.snappy.parquet
    │   └── region=na
    │       └── country=us
    │           └── file_name=us_default_players_results.parquet
    │               └── part-00...c000.snappy.parquet
    

    If you want to go further and actually rename the files, you can use the Hadoop FileSystem API to loop over the written files, move each one up to its parent path, rename it after the file_name=....parquet folder generated by Spark, and then delete those folders.
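
    Here is a minimal sketch of that cleanup step, assuming the data was written under path/Users_data with the extra file_name partition column as shown above, a SparkSession is available as spark, and each file_name=... folder contains a single part file; the paths and variable names are illustrative, not part of any Spark API:

    # Access the Hadoop FileSystem API through the JVM gateway exposed by SparkContext
    sc = spark.sparkContext
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

    root = Path("path/Users_data")

    # Walk the region=... / country=... / file_name=... directories
    for region in fs.listStatus(root):
        for country in fs.listStatus(region.getPath()):
            for fname_dir in fs.listStatus(country.getPath()):
                dir_name = fname_dir.getPath().getName()   # e.g. file_name=fr_default_players_results.parquet
                if not dir_name.startswith("file_name="):
                    continue
                original_name = dir_name.split("=", 1)[1]  # fr_default_players_results.parquet
                # Move the part file up to the country folder under the original name
                # (assumes a single part-... file per file_name=... folder)
                for part in fs.listStatus(fname_dir.getPath()):
                    if part.getPath().getName().startswith("part-"):
                        fs.rename(part.getPath(), Path(country.getPath(), original_name))
                # Remove the now-empty file_name=... folder
                fs.delete(fname_dir.getPath(), True)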