I'm working with zip codes, which of course can have leading zeros. I'm loading my data frame in R in a way that correctly preserves the leading zeros, but the upload step fails. Here's what I mean:
Here's my minimal.csv file:
zip,val
07030,10
10001,100
90210,1000
60602,10000
Here's the R code:
require("bigrquery")
filename <- "minimal.csv"
tablename <- "as_STRING"
ds <- bq_dataset(project = "myproject", dataset = "zips")
I am also correctly setting the type in my schema to expect them as strings.
# first pass
df <- read.csv(filename, stringsAsFactors=F)
# > df
# zip val
# 1 7030 10
# 2 10001 100
# 3 90210 1000
# 4 60602 10000
# uh oh! Let's fix it!
cols <- unlist(lapply(df, class))
cols[[1]] <- "character" # make zipcode a character
# then reload
df2 <- read.csv(filename, stringsAsFactors=F, colClasses=cols)
# > df2
# zip val
# 1 07030 10
# 2 10001 100
# 3 90210 1000
# 4 60602 10000
# much better! You can see my zips are now strings.
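(Aside: if the column layout is known up front, the same result can be had in a single pass by declaring the classes directly; a minimal sketch, assuming the columns are still named zip and val:)

# one-pass alternative: declare zip as character on the first read
df2 <- read.csv(filename, stringsAsFactors = FALSE,
                colClasses = c(zip = "character", val = "integer"))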
However, when I try to upload the strings, the bigrquery interface complains that I am uploading integers, which they are not. Here's the schema, which expects strings:
# create schema
bq_table_create(bq_table(ds, tablename), fields=df2) # using df2, which has strings
# now prove it got the strings right:
> bq_table_meta(bq_table(ds, tablename))$schema$fields
[[1]]
[[1]]$name
[1] "zip"
[[1]]$type
[1] "STRING" # GOOD, ZIP IS A STRING!
[[1]]$mode
[1] "NULLABLE"
[[2]]
[[2]]$name
[1] "val"
[[2]]$type
[1] "INTEGER"
[[2]]$mode
[1] "NULLABLE"
Now it's time to upload....
bq_table_upload(bq_table(ds, tablename), df2) # using df2, with STRINGS
Error: Invalid schema update. Field zip has changed type from STRING to INTEGER [invalid]
Huh? What is this invalid schema update, and how can I stop it from trying to change my strings (which the data contains and the schema declares) to integers (which my data does not contain and the schema does not declare)?
Is there a JavaScript serialization happening that's turning my strings back into integers?
That is because BigQuery auto-detects the schema when one is not specified. This can be solved by passing the fields argument explicitly, like this (see this similar question for more details):
bq_table_upload(
  bq_table(ds, tablename), df2,
  fields = list(bq_field("zip", "string"), bq_field("val", "integer"))
)
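If you would rather not hand-write the field list, it can also be built from the data frame itself; a sketch assuming as_bq_fields() infers the same types that bq_table_create() used when the table was made:

# derive the schema from df2 (zip is already character, so it maps to STRING)
bq_table_upload(bq_table(ds, tablename), df2, fields = as_bq_fields(df2))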
UPDATE:
Looking into the code, bq_table_upload() calls bq_perform_upload(), which takes the fields argument as the schema. At the end, it serializes the data frame as a JSON file and uploads it to BigQuery.
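A related workaround, as an untested sketch assuming the table already exists with the schema shown above: read the schema back from the table and pass it as fields, so the upload cannot drift from what was registered at creation time.

tbl <- bq_table(ds, tablename)
existing_schema <- bq_table_fields(tbl)  # the table's schema as a bq_fields object
bq_table_upload(tbl, df2, fields = existing_schema)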