r csv apache-arrow

read_csv_arrow unable to deal with extended Latin character set on Windows


I have a CSV file on Windows:

name, age
Siân, 34
Brónagh, 45
François, 87
Alan, 23
,

I try to read this into R using:

library(arrow)
df <- read_csv_arrow("people.csv")

It loads the table but converts the name column to arrow_binary.

dput output:

structure(list(name = structure(list(as.raw(c(0x53, 0x69, 0xe2, 0x6e)), 
                                            as.raw(c(0x42, 0x72, 0xf3, 0x6e, 0x61, 0x67,0x68)), 
                                            as.raw(c(0x46, 0x72, 0x61, 0x6e, 0xe7, 0x6f, 0x69, 0x73)), 
                                            as.raw(c(0x41, 0x6c, 0x61, 0x6e)), NULL), 
                                       class = c("arrow_binary", "vctrs_vctr", "list"))), 
                 row.names = c(NA, -5L), class = c("tbl_df","tbl", "data.frame"))

I've tried an explicit conversion of this column:

as.character(df$name)
> Can't convert `x` <arrow_binary> to <character>.
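
As a side note, the bytes in the dput output look like Latin-1 text, so decoding them by hand in base R is one possible workaround. The sketch below assumes Latin-1 (the file may equally be Windows-1252) and rebuilds the raw vectors from the dput output; decode_latin1 is a hypothetical helper name, not part of arrow.

```r
# Raw bytes copied from the dput output; NULL marks the missing name.
name_raw <- list(
  as.raw(c(0x53, 0x69, 0xe2, 0x6e)),
  as.raw(c(0x42, 0x72, 0xf3, 0x6e, 0x61, 0x67, 0x68)),
  as.raw(c(0x46, 0x72, 0x61, 0x6e, 0xe7, 0x6f, 0x69, 0x73)),
  as.raw(c(0x41, 0x6c, 0x61, 0x6e)),
  NULL
)

# Hypothetical helper: treat the bytes as Latin-1 (an assumption about
# the file) and re-encode them as UTF-8; NULL becomes NA.
decode_latin1 <- function(bytes) {
  if (is.null(bytes)) return(NA_character_)
  iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")
}

name_chr <- vapply(name_raw, decode_latin1, character(1))
```

On the real data frame, something like vapply(unclass(df$name), decode_latin1, character(1)) should behave the same way, assuming unclass() leaves a plain list of raw vectors.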

I've also tried to use Arrow's cast command, following this:

df %>% mutate(name = arrow::cast(name, string())) 

But it can't find cast:

> ! 'cast' is not an exported object from 'namespace:arrow'

Additionally, I've tried defining the data type in the read_csv_arrow call:

read_csv_arrow("people.csv",
               col_types = schema(name = arrow::string()))

but this gives:

> ! Invalid: In CSV column #1: Row #1: CSV conversion error to string: invalid UTF8 data
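
This error is consistent with the bytes shown in the dput output: 0xe2 is â in Latin-1, but in UTF-8 it starts a multi-byte sequence and must be followed by continuation bytes. A quick base R check, using only the bytes of the first name and assuming the file is Latin-1:

```r
# Bytes of the first name as stored in the file (from the dput output).
sian_bytes <- as.raw(c(0x53, 0x69, 0xe2, 0x6e))
sian_raw_string <- rawToChar(sian_bytes)

# Not valid UTF-8: 0xe2 is followed by 0x6e, which is not a continuation byte.
is_valid_before <- validUTF8(sian_raw_string)

# Re-encoding from Latin-1 (an assumption about the file) yields valid UTF-8.
sian_utf8 <- iconv(sian_raw_string, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(sian_utf8)

# In UTF-8 the character â occupies two bytes, c3 a2.
utf8_bytes <- charToRaw(sian_utf8)
```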

I would like to use UTF-16, but it doesn't appear to be a data type that Arrow accepts.


Solution

  • From the comments: Arrow assumes UTF-8 unless told otherwise, so pass the file's encoding through read_options:

    read_csv_arrow("people.csv", read_options = list(encoding = "latin1"))
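
    To try this without the original file, the sketch below writes a one-row Latin-1 CSV to a temporary path and reads it back with the same option. The temporary file and its single row are made up for illustration, and the encoding name "latin1" is assumed to be recognised by the local iconv.

```r
library(arrow)

# Write a tiny Latin-1 encoded CSV byte by byte (0xe2 is â in Latin-1).
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(
  c(charToRaw("name,age\nSi"), as.raw(0xe2), charToRaw("n,34\n")),
  con
)
close(con)

# Same call as in the solution, pointed at the temporary file.
df <- read_csv_arrow(path, read_options = list(encoding = "latin1"))

unlink(path)
```

    If the encoding matches the file, the name column should come back as an ordinary character column rather than arrow_binary.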