Suppose I have two files:
file0.txt
field1 | field2 |
---|---|
1 | 2 |
1 | 2 |
file1.txt
field2 | field1 |
---|---|
2 | 1 |
2 | 1 |
Now, if I write:
spark.read.csv(["./file0.txt", "./file1.txt"], sep=',', header=True, inferSchema=True).show()
Spark reads the following dataframe:
field1 | field2 |
---|---|
1 | 2 |
1 | 2 |
2 | 1 |
2 | 1 |
but it should have been:
field1 | field2 |
---|---|
1 | 2 |
1 | 2 |
1 | 2 |
1 | 2 |
I tried using inferSchema. As I have a lot of files in the folder, I cannot hardcode the column ordering of the CSVs.
You can read the files one at a time and union them with unionByName, which matches columns by name rather than by position. Something like this:
import glob

path = 'test_data/'
files = glob.glob(path + '*.txt')

for idx, f in enumerate(files):
    # Each file is read with its own header, so the column order
    # within each file does not matter.
    df = spark.read.csv(f, sep=',', header=True, inferSchema=True)
    if idx == 0:
        final_df = df
    else:
        # unionByName aligns columns by name, not by position.
        final_df = final_df.unionByName(df)
Output:
+------+------+
|field1|field2|
+------+------+
| 1| 2|
| 1| 2|
| 1| 2|
| 1| 2|
+------+------+
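If you prefer something more compact, the same idea can be written as a fold over the list of dataframes. This is just a sketch assuming the same spark session and the same test_data/ folder layout as above:

import glob
from functools import reduce

files = glob.glob('test_data/*.txt')

# Read each file independently; header=True lets Spark pick up each
# file's own column order.
dfs = [spark.read.csv(f, sep=',', header=True, inferSchema=True) for f in files]

# Fold the list into a single dataframe; unionByName aligns columns
# by name, so the differing order across files is handled correctly.
final_df = reduce(lambda a, b: a.unionByName(b), dfs)
final_df.show()

Note that unionByName raises an error if one of the files is missing a column entirely; on Spark 3.1+ you can pass allowMissingColumns=True to fill such gaps with nulls instead.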