I have a lot of .tsv files that I would like to read one by one, writing the last column of each into another file.
Here is my code:
import os
import csv

# path is defined earlier and points at the folder of .tsv files
for filename in os.listdir(path):
    with open(path+'/'+filename, 'r', encoding="utf8") as tsvin, open('temptweets.csv', 'a', encoding='utf-8') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        csvout = csv.writer(csvout)
        count = 0
        for row in tsvin:
            try:
                count = str(row[-1])
            except ValueError:
                pass  # w.e.
            if len(count) >= 0:
                csvout.writerow([count])
Most of it works perfectly. But the problem is that some lines get read as one: the row variable sometimes contains a few lines joined together, so not only is the last column written to the file, but also ALL the columns of the next line. It also stops after a few rows, and I can't tell why.
I have tried reading the files a few other ways (such as with pandas) but got the same result.
I have also tried opening the input file and viewing all characters (Notepad++), and all the lines (including the problematic ones) DO end with CR/LF.
I know there is something wrong with the input file itself (the file is given to me as-is), but I would like to know if there is any way to work around it.
It looks like your file might have multiline fields embedded in double quotes (it's hard to tell without seeing the data).
Try adding newline='' to your open() call (and maybe quotechar='"' to reader(), but that's probably the default).
From the docs: "If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly."
Or it could be the opposite, and maybe you need to turn off quoting (quoting=csv.QUOTE_NONE) to parse those files correctly.
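To see the difference, here is a small sketch with a made-up two-column sample whose last field is quoted and spans two lines (I'm using io.StringIO so it's self-contained; with a real file you'd pass newline='' to open() instead):

```python
import csv
import io

# Hypothetical sample: the last field is quoted and contains an embedded newline.
data = 'id\tmessage\n1\t"first line\nsecond line"\n'

# Default quoting: the embedded newline stays inside one field,
# so row[-1] is the whole quoted message.
rows = list(csv.reader(io.StringIO(data, newline=''), delimiter='\t'))
# rows[1] -> ['1', 'first line\nsecond line']

# quoting=csv.QUOTE_NONE: the quote characters are treated literally
# and the record is split at the newline into two short rows instead.
raw = list(csv.reader(io.StringIO(data, newline=''),
                      delimiter='\t', quoting=csv.QUOTE_NONE))
# raw[1] -> ['1', '"first line']; raw[2] -> ['second line"']
```

If your "joined" rows look like the first case when they shouldn't, the second variant is the one to try.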