I have a data sheet that I am entering in Google Sheets and importing into R as an .xslx file. One of the columns in the datasheet is a character variable representing the specimen numbers of each specimen (i.e., a specific institution-based code). However, when I open the spreadsheet using read_excel, a trailing zero is added to the end of each entry. Notably, this occurs despite the column being recognized as a character.
library(openxlsx)
openxlsx::read.xlsx("https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber.xlsx")[1:10,1]
## If code does not work download original and try this code
read.xlsx("tempnumber.xlsx", col_names=TRUE)[1:10,1]
# A tibble: 10 × 1
Number
<chr>
1 112322.0
2 63906.0
3 77315.0
4 77316.0
5 109692.0
6 64046.0
7 166222.0
8 115399.0
9 103553.0
10 112349.0
Additionally, only specimens with a totally numeric specimen number have this trailing zero added. Specimen numbers that have a letter or other formatting in their specimen ID have their specimen number displayed correctly.
What is even stranger is if I open the file in Microsoft Excel and resave without editing it (beyond selecting 'Enable Editing' due to being downloaded), the variable reads correctly. So it must be something only present in the file when being exported from Google Sheets that gets overwritten when the file is saved in Excel.
openxlsx::read.xlsx("https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber2.xlsx")[1:10,1]
## If code does not work try downloading file and running the following.
read.xlsx("tempnumber2.xlsx",col_names=T)[1:10,1]
# A tibble: 10 × 1
Number
<chr>
1 112322
2 63906
3 77315
4 77316
5 109692
6 64046
7 166222
8 115399
9 103553
10 112349
I have no idea what could be causing this. This causes issues further downstream in the workflow when it is necessary to match specimen numbers across data frames. Google Sheets also does not appear to allow columns to be explicitly set as text versus numeric like Microsoft Excel. Because the dataset is constantly changing, I cannot resave the file each time in Excel every time I download the data. Does anyone know why this trailing zero is being added?
I'm not aware of a way to specify the imported column type for openxlsx::read.xlsx the way you can with readxl::read_excel. It seems like in this instance, there is a bug where read.xlsx converts the integer numeric values to a character value with one decimal. It would be best to find a way to fix this upstream, but as a workaround, we could re-do the conversion by converting values to numeric and back to character. This will produce NA for the values which were not actually numeric. dplyr::coalesce will return the first non-NA value, so we'll either get the re-converted numeric values or the original non-numeric values.
library(dplyr)
data.frame(tempnumber = openxlsx::read.xlsx(
"https://raw.githubusercontent.com/megaraptor1/mydata/main/tempnumber.xlsx")[568:592,1]) |>
mutate(tempnumber2 = coalesce(as.character(as.numeric(tempnumber)), tempnumber))
I picked a range of the data that includes a few types of values. We can see how this converts the values like 120999.0 to 120999 while preserving the non-numeric values.
tempnumber tempnumber2
1 120999.0 120999
2 17811.0 17811
3 M19081 M19081
4 437888.0 437888
5 12886.0 12886
6 12730.0 12730
7 83732.0 83732
8 95920.0 95920
9 17685.0 17685
10 40242.0 40242
11 V-003668 V-003668
12 306.0 306
13 292.0 292
14 443.0 443
15 64199.0 64199
16 85298.0 85298
17 14453.0 14453
18 56442.0 56442
19 85300.0 85300
20 17627.0 17627
21 8267.0 8267
22 37960.0 37960
23 57-X-10-153 57-X-10-153
24 85-VII-3-1 85-VII-3-1