rexcelgoogle-sheetsimportformatting

Why is my exported Google sheets adding trailing zero to character when read by read_excel in R?


I have a data sheet that I am entering in Google Sheets and importing into R as an .xslx file. One of the columns in the datasheet is a character variable representing the specimen numbers of each specimen (i.e., a specific institution-based code). However, when I open the spreadsheet using read_excel, a trailing zero is added to the end of each entry. Notably, this occurs despite the column being recognized as a character.

library(openxlsx)
openxlsx::read.xlsx("https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber.xlsx")[1:10,1]

## If code does not work download original and try this code
read.xlsx("tempnumber.xlsx", col_names=TRUE)[1:10,1]
# A tibble: 10 × 1
   Number  
   <chr>   
 1 112322.0
 2 63906.0 
 3 77315.0 
 4 77316.0 
 5 109692.0
 6 64046.0 
 7 166222.0
 8 115399.0
 9 103553.0
10 112349.0

Additionally, only specimens with a totally numeric specimen number have this trailing zero added. Specimen numbers that have a letter or other formatting in their specimen ID have their specimen number displayed correctly.

What is even stranger is if I open the file in Microsoft Excel and resave without editing it (beyond selecting 'Enable Editing' due to being downloaded), the variable reads correctly. So it must be something only present in the file when being exported from Google Sheets that gets overwritten when the file is saved in Excel.

openxlsx::read.xlsx("https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber2.xlsx")[1:10,1]

## If code does not work try downloading file and running the following.
read.xlsx("tempnumber2.xlsx",col_names=T)[1:10,1]
# A tibble: 10 × 1
   Number
   <chr> 
 1 112322
 2 63906 
 3 77315 
 4 77316 
 5 109692
 6 64046 
 7 166222
 8 115399
 9 103553
10 112349

I have no idea what could be causing this. This causes issues further downstream in the workflow when it is necessary to match specimen numbers across data frames. Google Sheets also does not appear to allow columns to be explicitly set as text versus numeric like Microsoft Excel. Because the dataset is constantly changing, I cannot resave the file each time in Excel every time I download the data. Does anyone know why this trailing zero is being added?


Solution

  • I'm not aware of a way to specify the imported column type for openxlsx::read.xlsx the way you can with readxl::read_excel. It seems like in this instance, there is a bug where read.xlsx converts the integer numeric values to a character value with one decimal. It would be best to find a way to fix this upstream, but as a workaround, we could re-do the conversion by converting values to numeric and back to character. This will produce NA for the values which were not actually numeric. dplyr::coalesce will return the first non-NA value, so we'll either get the re-converted numeric values or the original non-numeric values.

    library(dplyr)
    data.frame(tempnumber = openxlsx::read.xlsx(
      "https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber.xlsx")[568:592,1]) |>
      mutate(tempnumber2 = coalesce(as.character(as.numeric(tempnumber)), tempnumber))
    

    I picked a range of the data that includes a few types of values. We can see how this converts the values like 120999.0 to 120999 while preserving the non-numeric values.

    
        tempnumber tempnumber2
    1     120999.0      120999
    2      17811.0       17811
    3       M19081      M19081
    4     437888.0      437888
    5      12886.0       12886
    6      12730.0       12730
    7      83732.0       83732
    8      95920.0       95920
    9      17685.0       17685
    10     40242.0       40242
    11    V-003668    V-003668
    12       306.0         306
    13       292.0         292
    14       443.0         443
    15     64199.0       64199
    16     85298.0       85298
    17     14453.0       14453
    18     56442.0       56442
    19     85300.0       85300
    20     17627.0       17627
    21      8267.0        8267
    22     37960.0       37960
    23 57-X-10-153 57-X-10-153
    24  85-VII-3-1  85-VII-3-1