rreadxl

Wrong date format when using read_excel() to read an .xlsx file from Windows PC on macOS in R


I encountered an issue when using the read_excel() function from the readxl package in R to read an .xlsx file that was originally created on a Windows PC. When I open this file on my Mac, the date format in the resulting data frame is incorrect.

I can manually convert these numeric values to dates using something like as.Date(, origin = "1899-12-30"), but this only works for the rows where the dates are incorrect. Unfortunately, for the rows where the dates are already correct (like the first two rows), this approach leads to incorrect conversion.

Here’s a simplified example of the code I’m using:

> date_data
   `Company`                                    Date 
 **1 CHINA PETROLEUM & CHEMICAL CORPORATION       1998                 
 2 INDUSTRIAL AND COMMERCIAL BANK OF CHINA LIM… 2018     **            
 3 CHINA MERCHANTS BANK CO., LTD                39911                 
 4 LEGEND HOLDINGS CORPORATION                  42096                 
 5 CHINA MERCHANTS BANK CO., LTD.               42354                 
 6 CRRC CORPORATION LIMITED                     39559                 
 7 CHINA PACIFIC INSURANCE (GROUP) CO., LTD.    40154                 
 8 SHANGHAI PHARMACEUTICALS HOLDING CO., LTD.   40666                 
 9 CHINA CITIC BANK CORPORATION LIMITED         39176                 
10 CHINA NATIONAL BUILDING MATERIAL COMPANY LI… 38755                 

correct read the data even in a character format like the picture attachedenter image description here


Solution

  • It looks like you have two types of data there -- some rows like 1+2 where you have a plain number which refers to the year, and other rows which are dates.

    Excel saves dates as days since X (1904-01-01 on pre-2011 mac excel, or 1899-12-30 on windows excel), but formats them to show the date format you want, like "08/04/2009." So under the hood, those rows hold numbers like 39911, which is 109 years of days, so implies a 2009 date on windows excel.

    To convert all the rows to plain years, you might do something like:

    df <- data.frame(Company = letters[1:4],
                     Date = c(1998, 2018, 39911, 42096))
    
    df$Date2 = ifelse(df$Date < 2050, df$Date,
                      lubridate::year(as.Date(df$Date, 
                                      origin = "1899-12-30")))
    

    Then we get:

    > df
      Company  Date Date2
    1       a  1998  1998
    2       b  2018  2018
    3       c 39911  2009
    4       d 42096  2015