I am receiving below sample data in a txt file from source. The file contains a header row which signifies how many columns are expected per row
ID~First name~last name~description~birth date<CR/LF>
1~Smith~John~Loves to skateboard~12/3/98<CR/LF>
2~Johnson~Mike~Eats worms~2/14/75<CR/LF>
3~Howard~Sally~Walks along the beach<CR/LF>
Likes bike rides<CR/LF>
Loves to cook~9/7/67<CR/LF>
4~Doe~Jane~Restores classic cars~7/4/74<CR/LF>
For line number 3 there are extra CR/LF characters inside the data itself causing it to span across multiple lines in the file. How to remove the additional CR/LF characters inside the data but not at the line ending using python.
I approached the solution using below points
1.) read the first line of file (header) to know the total number of columns per each row
2.) process the file row wise, count number of columns in each row and if it matches with the column count of header row ,then write the row to output file
3.) if the column count of row does not match the header column count then store the values of current row to a deque after replacing the end of row characters.
4.) once the deque reaches the length as same as the expected column count then write the row to output file and clear the deque
PFB the working solution
def file_processing():
ANSI = "cp1252"
with open(f'/tmp/sample.txt', 'r',encoding=ANSI, newline='',errors='ignore') as infile, open(f'/tmp/sample_new.txt', 'w', newline='',errors='ignore') as outfile:
reader = csv.reader(infile,delimiter='~',lineterminator='\r\n')
writer = csv.writer(outfile,delimiter='~',lineterminator='\r\n')
ncol = len(next(reader)) #taking count from header
print(f'number of columns in file are {ncol}')
infile.seek(0)
row_q = deque() #using deque to store the fields
for row in reader:
current_row_column_count = len(row)
if current_row_column_count == ncol: #if no.of columns are same as header then write the row
print(f'number of columns in file are {ncol} and no.of columns in current row is {current_row_column_count} so writing the same')
writer.writerow(row)
elif current_row_column_count != ncol: #if no.of columns are not equal to columns in header
current_replaced_row = [field.replace('\n', ' ').replace('\r', ' ') for field in row] #replacing <CR/LF>
print(f'current row being replaced is {current_replaced_row}')
if len(row_q) == 0: #If deque is empty it means simply append the values to the queue
for v in current_replaced_row:
row_q.append(str(v))
continue
if len(row_q) != ncol: #if deque is not empty take the last field in deque and concat the first field from next row
last_val = row_q.pop()
first_val = current_replaced_row[0]
new_val = str(last_val) + str(first_val)
row_q.append(str(new_val))
for val in current_replaced_row[1:]: #append the remaining fields from the row to the deque
row_q.append(str(val))
print(row_q)
if len(row_q) == ncol: #check if length of deque and no.of columns in header are same then write the values as a row in file
print(f'number of columns in file are {ncol} and no.of values in queue is {len(row_q)} so writing the same')
crr_string = '~'.join(row_q)
print(f'crr_string is {crr_string} ')
new_str_list = str(crr_string).split("~")
print(f'new_str_list is {new_str_list} ')
writer.writerow(new_str_list)
row_q.clear()