I'm trying to read the following CSV file from partyfacts with readr.
The import reports parsing problems, but as far as I can tell the file itself is fine.
download.file("https://partyfacts.herokuapp.com/download/external-parties-csv/", "partyfacts-external-parties.csv")
df <- readr::read_csv("partyfacts-external-parties.csv", show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Let's see what we have:
nrow(problems(df))
86
problems(df)[1,]
# A tibble: 1 × 5
    row   col expected   actual     file
  <int> <int> <chr>      <chr>      <chr>
1 35519    15 17 columns 15 columns /home/raffaele/Downloads/external-parties.csv
But as far as I can see, nothing is wrong with that row.
Row 35519 is:
BIH,elecglob,292,SNSD,Alliance of Independent Social Democrats,Alliance of Independent Social Democrats,1998,2014,19.1,2006,,,2019-02-08 19:26:26.193233+00:00,2021-03-12 10:15:38.362019+00:00,30450,292,2019-02-08 19:26:26.296626+00:00
It correctly contains 17 columns, not 15.
The other 85 problems are of the same nature (fewer columns read than expected), and the same reasoning applies: the number of columns in the source file is correct.
EDIT: I copied the line quoted above from a text editor. Apparently the editor's line numbers do not match the row numbers reported by R.
The file is huge, so it's hard to examine. A way to diagnose problems like this is to make the file smaller by deleting lines that are fine. I did that, and obtained this file, keeping only the first two lines, the first line that showed an error, and one line after that (which also shows an error):
country,dataset_key,dataset_party_id,name_short,name,name_english,year_first,year_last,share,share_year,description,comment,created,modified,external_id,partyfacts_id,linked
ALB,manifesto,75721,DBSH,E Djatha e Bashkuar e Shqipërisë,United Albanian Right,1996,1997,5.0,1996,,,2013-01-01 18:18:05.413000+00:00,2023-06-05 10:39:57.075788+00:00,1914,674,2013-01-01 18:33:17.889000+00:00
BEN,gps,60,ABT,,Alliance pour un Benin triomphant,2011,2019,2.9,2015,,,2020-07-16 17:39:48.143406+00:00,2021-03-12 10:16:03.729055+00:00,47733
BEN,gps,64,AE,,Eclaireur,2011,2019,3.7,2015,,,2020-07-16 17:39:57.563352+00:00,2021-03-12 10:16:03.731436+00:00,48035
The third and fourth lines shown above were somewhere around line 35440 in the original file, and as you can see, they don't follow the same format as the previous line: the final two fields are missing.
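Rather than trimming the file by hand, one way to locate such lines is to count the fields on every line. The following is a minimal sketch using base R's count.fields(). The temporary file and the variable names are made up for illustration, and the sample rows are ones quoted in this thread:

```r
# Miniature stand-in for the real file: a 17-column header, one complete row,
# and two rows that stop after 15 fields.
sample_lines <- c(
  "country,dataset_key,dataset_party_id,name_short,name,name_english,year_first,year_last,share,share_year,description,comment,created,modified,external_id,partyfacts_id,linked",
  "BIH,elecglob,292,SNSD,Alliance of Independent Social Democrats,Alliance of Independent Social Democrats,1998,2014,19.1,2006,,,2019-02-08 19:26:26.193233+00:00,2021-03-12 10:15:38.362019+00:00,30450,292,2019-02-08 19:26:26.296626+00:00",
  "BEN,gps,60,ABT,,Alliance pour un Benin triomphant,2011,2019,2.9,2015,,,2020-07-16 17:39:48.143406+00:00,2021-03-12 10:16:03.729055+00:00,47733",
  "BEN,gps,64,AE,,Eclaireur,2011,2019,3.7,2015,,,2020-07-16 17:39:57.563352+00:00,2021-03-12 10:16:03.731436+00:00,48035"
)
path <- tempfile(fileext = ".csv")
writeLines(sample_lines, path)

# count.fields() returns the number of fields found on each line.
# Only the double quote is treated as a quoting character, so apostrophes
# in party names do not confuse the count.
n_fields <- count.fields(path, sep = ",", quote = "\"")

# Use the header line as the reference and report lines that differ from it.
expected <- n_fields[1]
short_lines <- which(n_fields != expected)
short_lines
```

On the real download, the same call with "partyfacts-external-parties.csv" in place of path should give the physical line numbers of the short rows. Note that count.fields() returns NA for lines that fall inside a quoted field spanning several lines, so those would need separate handling.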
read.csv() doesn't complain about this file, because it is documented to fill in missing fields with blanks unless you call it with fill = FALSE. When I do that I get an error.
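To illustrate the difference, here is a small sketch with a hypothetical three-column file (not the partyfacts data) whose last row is one field short. The error is caught with tryCatch() so that both calls can be compared in one run:

```r
# Hypothetical toy file: the last row has 2 fields instead of 3.
toy_lines <- c(
  "a,b,c",
  "1,2,3",
  "4,5"
)
toy_path <- tempfile(fileext = ".csv")
writeLines(toy_lines, toy_path)

# Default behaviour (fill = TRUE): the missing field is silently padded,
# so the short row is read without any warning.
filled <- read.csv(toy_path)
filled

# With fill = FALSE the short row is an error; keep the message
# instead of stopping the script.
strict <- tryCatch(
  read.csv(toy_path, fill = FALSE),
  error = function(e) conditionMessage(e)
)
strict
```

Here filled has NA in column c of its second row, while strict holds the error message rather than a data frame.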