Why is my exported Google sheets adding trailing zero to character when read by read_excel in R?

I have a data sheet that I am entering in Google Sheets and importing into R as an .xslx file. One of the columns in the datasheet is a character variable representing the specimen numbers of each specimen (i.e., a specific institution-based code). However, when I open the spreadsheet using read_excel, a trailing zero is added to the end of each entry. Notably, this occurs despite the column being recognized as a character.

library(openxlsx)
openxlsx::read.xlsx("https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber.xlsx")[1:10,1]

## If code does not work download original and try this code
read.xlsx("tempnumber.xlsx", col_names=TRUE)[1:10,1]

# A tibble: 10 × 1
   Number  
   <chr>   
 1 112322.0
 2 63906.0 
 3 77315.0 
 4 77316.0 
 5 109692.0
 6 64046.0 
 7 166222.0
 8 115399.0
 9 103553.0
10 112349.0

Additionally, only specimens with a totally numeric specimen number have this trailing zero added. Specimen numbers that have a letter or other formatting in their specimen ID have their specimen number displayed correctly.

What is even stranger is if I open the file in Microsoft Excel and resave without editing it (beyond selecting 'Enable Editing' due to being downloaded), the variable reads correctly. So it must be something only present in the file when being exported from Google Sheets that gets overwritten when the file is saved in Excel.

openxlsx::read.xlsx("https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber2.xlsx")[1:10,1]

## If code does not work try downloading file and running the following.
read.xlsx("tempnumber2.xlsx",col_names=T)[1:10,1]

# A tibble: 10 × 1
   Number
   <chr> 
 1 112322
 2 63906 
 3 77315 
 4 77316 
 5 109692
 6 64046 
 7 166222
 8 115399
 9 103553
10 112349

I have no idea what could be causing this. This causes issues further downstream in the workflow when it is necessary to match specimen numbers across data frames. Google Sheets also does not appear to allow columns to be explicitly set as text versus numeric like Microsoft Excel. Because the dataset is constantly changing, I cannot resave the file each time in Excel every time I download the data. Does anyone know why this trailing zero is being added?

Solution

I'm not aware of a way to specify the imported column type for openxlsx::read.xlsx the way you can with readxl::read_excel. It seems like in this instance, there is a bug where read.xlsx converts the integer numeric values to a character value with one decimal. It would be best to find a way to fix this upstream, but as a workaround, we could re-do the conversion by converting values to numeric and back to character. This will produce NA for the values which were not actually numeric. dplyr::coalesce will return the first non-NA value, so we'll either get the re-converted numeric values or the original non-numeric values.

library(dplyr)
data.frame(tempnumber = openxlsx::read.xlsx(
  "https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber.xlsx")[568:592,1]) |>
  mutate(tempnumber2 = coalesce(as.character(as.numeric(tempnumber)), tempnumber))

I picked a range of the data that includes a few types of values. We can see how this converts the values like 120999.0 to 120999 while preserving the non-numeric values.


    tempnumber tempnumber2
1     120999.0      120999
2      17811.0       17811
3       M19081      M19081
4     437888.0      437888
5      12886.0       12886
6      12730.0       12730
7      83732.0       83732
8      95920.0       95920
9      17685.0       17685
10     40242.0       40242
11    V-003668    V-003668
12       306.0         306
13       292.0         292
14       443.0         443
15     64199.0       64199
16     85298.0       85298
17     14453.0       14453
18     56442.0       56442
19     85300.0       85300
20     17627.0       17627
21      8267.0        8267
22     37960.0       37960
23 57-X-10-153 57-X-10-153
24  85-VII-3-1  85-VII-3-1