pythonpython-3.xcsv

replace CRLF characters in middle of row for unquoted csv file using python


I am receiving below sample data in a txt file from source. The file contains a header row which signifies how many columns are expected per row

ID~First name~last name~description~birth date<CR/LF>
1~Smith~John~Loves to skateboard~12/3/98<CR/LF>
2~Johnson~Mike~Eats worms~2/14/75<CR/LF>
3~Howard~Sally~Walks along the beach<CR/LF>
Likes bike rides<CR/LF>
Loves to cook~9/7/67<CR/LF>
4~Doe~Jane~Restores classic cars~7/4/74<CR/LF>

For line number 3 there are extra CR/LF characters inside the data itself causing it to span across multiple lines in the file. How to remove the additional CR/LF characters inside the data but not at the line ending using python.


Solution

  • I approached the solution using below points

    1.) read the first line of file (header) to know the total number of columns per each row
    2.) process the file row wise, count number of columns in each row and if it matches with the column count of header row ,then write the row to output file
    3.) if the column count of row does not match the header column count then store the values of current row to a deque after replacing the end of row characters.
    4.) once the deque reaches the length as same as the expected column count then write the row to output file and clear the deque

    PFB the working solution

    def file_processing():
       
        ANSI = "cp1252"
    
        with open(f'/tmp/sample.txt', 'r',encoding=ANSI, newline='',errors='ignore') as infile, open(f'/tmp/sample_new.txt', 'w', newline='',errors='ignore') as outfile:
            reader = csv.reader(infile,delimiter='~',lineterminator='\r\n')
            writer = csv.writer(outfile,delimiter='~',lineterminator='\r\n')
            ncol = len(next(reader))  #taking count from header
            print(f'number of columns in file are {ncol}')
            infile.seek(0)
    
           
            row_q = deque() #using deque to store the fields
    
            for row in reader:
            
                current_row_column_count = len(row)
    
                if current_row_column_count == ncol: #if no.of columns are same as header then write the row
                    print(f'number of columns in file are {ncol} and no.of columns in current row is {current_row_column_count} so writing the same')
                    writer.writerow(row)
                elif current_row_column_count != ncol: #if no.of columns are not equal to columns in header
                   
                    current_replaced_row = [field.replace('\n', ' ').replace('\r', ' ') for field in row]  #replacing <CR/LF>
                    print(f'current row being replaced is {current_replaced_row}')
    
                    if len(row_q) == 0:  #If deque is empty it means simply append the values to the queue
                        for v in current_replaced_row:
                            row_q.append(str(v))
                        continue
                    if len(row_q) != ncol: #if deque is not empty take the last field in deque and concat the first field from next row
                        last_val = row_q.pop()
                        first_val = current_replaced_row[0]
                        new_val = str(last_val) + str(first_val)
                        row_q.append(str(new_val))
                       
                        for val in current_replaced_row[1:]: #append the remaining fields from the row to the deque
                            row_q.append(str(val))
                       
                    print(row_q)
                   
                    if len(row_q) == ncol: #check if length of deque and no.of columns in header are same then write the values as a row in file
                        print(f'number of columns in file are {ncol} and no.of values in queue is {len(row_q)} so writing the same')
                        crr_string = '~'.join(row_q)
                        print(f'crr_string is {crr_string} ')
                        new_str_list = str(crr_string).split("~")
                        print(f'new_str_list is {new_str_list} ')
                        writer.writerow(new_str_list)
                        row_q.clear()