python, dataframe, scala, pyspark, pyspark-schema

Reading multiple files using pyspark with same columns but different ordering


Suppose I have two files.

file0.txt

field1,field2
1,2
1,2

file1.txt

field2,field1
2,1
2,1

Now, if I write:

spark.read.csv(["./file0.txt""./file1.txt"], sep=',', header=True, inferSchema=True).show()

Spark reads the following dataframe:

+------+------+
|field1|field2|
+------+------+
|     1|     2|
|     1|     2|
|     2|     1|
|     2|     1|
+------+------+

but it should have been:

+------+------+
|field1|field2|
+------+------+
|     1|     2|
|     1|     2|
|     1|     2|
|     1|     2|
+------+------+

I tried using inferSchema, but it did not help. Since I have a lot of files in the folder, I cannot hardcode the column ordering of the CSVs.


Solution

  • You can read the files one at a time and union them with unionByName, which matches columns by name rather than by position, so something like this:

    import glob
    
    path = 'test_data/'
    files = glob.glob(path + '*.txt')
    
    final_df = None
    for f in files:
        # Read each file on its own so that its header determines its column order.
        df = spark.read.csv(f, sep=',', header=True, inferSchema=True)
        # unionByName aligns columns by name, so differing orders are handled correctly.
        final_df = df if final_df is None else final_df.unionByName(df)
    

    output:

    +------+------+
    |field1|field2|
    +------+------+
    |     1|     2|
    |     1|     2|
    |     1|     2|
    |     1|     2|
    +------+------+
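
  • Alternatively, the per-file reads can be folded together with functools.reduce instead of the explicit loop. A minimal sketch, assuming the same test_data/ layout as above:

    import glob
    from functools import reduce
    
    files = glob.glob('test_data/*.txt')
    
    # Read every file separately so each header is honoured, then fold the
    # resulting dataframes together by column name.
    dfs = [spark.read.csv(f, sep=',', header=True, inferSchema=True) for f in files]
    final_df = reduce(lambda left, right: left.unionByName(right), dfs)
    final_df.show()

    On Spark 3.1+, unionByName also accepts allowMissingColumns=True, which helps if some files lack columns entirely.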