I have a tsv that has some fields with some escaped double quotation marks (DQM). Those are fine, but the problem comes when the DQM delimiter comes after one of those and then when the tsv is opened it doesn't recognize the last DQM and so the following field gets included with the previous.
For example, the line:
"9812" "tt0167609" "tvSeries" "L'homme du \"Picardie\"" "L'homme du \"Picardie\"" 0 "1968" "\N" "13" "Drama"
when opened in a SS, places the first 3 fields fine. But the fourth field it includes all the way up to 1968
when what it should do is only put in the first "L'homme du \"Picardie\"
, the second of the same in the next field, and so on. And so the problem appears to be that it's not recognizing the "
after the \"
. I tried clicking different options when opening it in SS programs, but nothing fixes it.
Now I've found out that I can fix this before opening it in a SS program by replacing \""
with \"""
in a text editor, but I'd like to be able to do it in R when the file is generated.
I've tried several combinations of strings, but I just can't figure it out and I'm hoping someone can point me in the right direction. The following are just some of what I tried.
tv.Subset <- str_replace(tv.Subset, "\\\"\"", "\\\"\"\"")
- one of my first attempts, simply escape each character in the string
tv.Subset <- str_replace(tv.Subset, '\\\"\"', '\\\"\"\"')
- I wondered if single quotation marks for delimiters might be the trick
tv.Subset <- str_replace(tv.Subset, "\\\\\\"\\"", "\\\\\\"\\"\\"")
- I read that you need to do double backslashes to respect both R and regex
Thanks.
If I am understanding your question correctly, R is generating .tsv output files that you cannot open in spreadsheet programs due to the escaped double quotes in R's output.
If so, here is a reproducible example where the third row is your original data:
df <- data.frame(f1=c(1,27,9812,9999),
f2=c('tt12345','tt54321','tt0167609','tt98765'),
f3=rep("tvSeries",4),
f4=c("Some string","Another string","L'homme du \"Picardie\"","Yet another string"),
f5=c("Here is another string","Another string again","L'homme du \"Picardie\"","Also a string"),
f6=c(0,0,0,0),
f7=c(1966,1967,1968,1969),
f8=rep("\\N",4),
f9=c(11,12,13,14),
f10=rep('Drama',4))
write.table(df,'outfile.tsv')
Viewed in R, your data would look like this:
> print(df,row.names=F)
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
1 tt12345 tvSeries Some string Here is another string 0 1966 \\N 11 Drama
27 tt54321 tvSeries Another string Another string again 0 1967 \\N 12 Drama
9812 tt0167609 tvSeries L'homme du "Picardie" L'homme du "Picardie" 0 1968 \\N 13 Drama
9999 tt98765 tvSeries Yet another string Also a string 0 1969 \\N 14 Drama
And the command write.table(df,'outfile.tsv')
produces this output file, which some spreadsheet programs would fail to read:
"f1" "f2" "f3" "f4" "f5" "f6" "f7" "f8" "f9" "f10"
"1" 1 "tt12345" "tvSeries" "Some string" "Here is another string" 0 1966 "\N" 11 "Drama"
"2" 27 "tt54321" "tvSeries" "Another string" "Another string again" 0 1967 "\N" 12 "Drama"
"3" 9812 "tt0167609" "tvSeries" "L'homme du \"Picardie\"" "L'homme du \"Picardie\"" 0 1968 "\N" 13 "Drama"
"4" 9999 "tt98765" "tvSeries" "Yet another string" "Also a string" 0 1969 "\N" 14 "Drama"
The solution is not to change \" in R like you've been trying to, but instead to just not have R surround strings with quotes in your output. That way the only quotes in the output are the quotes that are part of a title, and spreadsheet programs will be able to treat them accordingly.
Here is the command you can use to save your data without the escaped quotes: write.table(df,'outfile.tsv',quote=F,sep='\t')
. This will do two things:
Your new output file will look like this:
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
1 1 tt12345 tvSeries Some string Here is another string 0 1966 \N 11 Drama
2 27 tt54321 tvSeries Another string Another string again 0 1967 \N 12 Drama
3 9812 tt0167609 tvSeries L'homme du "Picardie" L'homme du "Picardie" 0 1968 \N 13 Drama
4 9999 tt98765 tvSeries Yet another string Also a string 0 1969 \N 14 Drama
As long as the new output file uses tab characters, most spreadsheet programs will have no problem reading it like this.