pythoncsvfilemerge

How to merge two files in python


I have two tab delimited csv files (with headers) that I need to merge in python.

Also, in the merged file I want to add a column in the end to identify the files because though they have same format, they have different data that I need to separate later on. So, I want to add a column called 'source' on each line of output which is 0 for file1 and 1 for file2.

I have gone far as using the csv module but the writerow adds an additioal newline character between each line it writes and this code doesn't write anything from file2. What am I doing wrong here? Also, how do I add the extra column 'source' in the line object?

import os, csv

path1 = os.path.abspath("../data/file1.txt")
path2 = os.path.abspath("../data/file2.txt")
merged_path = os.path.abspath('../data/output.txt')

# merge the two files for further processing
merged_file = csv.writer(open(merged_path, 'a'), delimiter = '\t')

#file1
fg = csv.reader(open(path1, 'r'), delimiter = '\t')

for line in fg:
    if line[7] != '\N':
        merged_file.writerow(line) 

#file2
bg = csv.reader(open(path2, 'r'), delimiter = '\t')

for line in bg:
    if line[16] != '\N':
        merged_file.writerow(line) 

Solution

  • I prefer to use the dictWriter for this. Also, your code doesn't work because the csv library requires opening files in binary mode.

    import os, csv
    
    path1 = os.path.abspath("../data/file1.txt")
    path2 = os.path.abspath("../data/file2.txt")
    merged_path = os.path.abspath('../data/output.txt')
    
    #file1
    fg = csv.DictReader(open(path1, 'rb'), delimiter = '\t')
    
    fieldnames = fg.fieldnames
    fieldnames.append('source')
    # merge the two files for further processing
    merged_file = csv.DictWriter(open(merged_path, 'ab'), delimiter = '\t', fieldnames=fieldnames)
    merged_file.writeheader()
    
    for row in fg:
        row['source'] = os.path.basename(path1)
        merged_file.writerow(row)
    
    #file2
    bg = csv.DictReader(open(path2, 'rb'), delimiter = '\t')
    
    for row in bg:
        row['source'] = os.path.basename(path1)
        merged_file.writerow(row)