rmergedata.tablefwritefread

The data is changed after fwrite and fread


I am trying to store several columns of the data, and then merge them with the other data when necessary. However, the data could only be merged with other data when I created it at the start, and it fails after fwrite and fread the data.

region <- climatebound_rast |> as.data.table(xy = TRUE)
fwrite(region, "region.csv")
region2 <- fread("region.csv", stringsAsFactors = TRUE)

all.equal(region,region2)
[1] TRUE


TEST <- raster_to_table[Variable=="npp"&Model=="DLEM", 1:4]

# 1. Merge raster data with region data
merged_region <- merge(TEST, region, all.x = TRUE) |> 
  drop_na(region) 

merged_region

Key: <x, y>
              x       y  Model Variable region
          <num>   <num> <char>   <char> <fctr>
   1:  87.99277 49.0312   DLEM      npp    MNG
   2:  88.49277 48.5312   DLEM      npp    MNG
   3:  88.49277 49.0312   DLEM      npp    MNG
   4:  88.99277 48.0312   DLEM      npp    MNG
   5:  88.99277 48.5312   DLEM      npp    MNG
  ---                                         
5846: 161.49277 68.5312   DLEM      npp    RUS
5847: 161.99277 68.5312   DLEM      npp    RUS
5848: 161.99277 69.0312   DLEM      npp    RUS
5849: 162.49277 68.5312   DLEM      npp    RUS
5850: 162.49277 69.0312   DLEM      npp    RUS

merged_region2 <- merge(TEST, region2, all.x = TRUE) |> 
  drop_na(region) 

merged_region2

Key: <x, y>
Empty data.table (0 rows and 5 cols): x,y,Model,Variable,region



Solution

  • all.equal() is not the same as identical()

    What is happening here is that you are trying to join on a float which cannot be precisely represented and is altered by saving to csv. Here's a simple example:

    library(data.table)
    
    dat1 <- data.table(x = 1/3, label = "one third")
    dat2 <- data.table(x = 1/3)
    

    It's possible (though inadvisable - see below) to merge these:

    merge(dat1, dat2)
    # Key: <x>
    #            x     label
    #        <num>    <char>
    # 1: 0.3333333 one third
    

    However, let's write dat1 to a csv and then read it back in. Floating-point numbers are stored in binary format with finite precision, meaning decimal fractions which cannot be expressed as a power of 2 (like 1/3) cannot be represented exactly. When written to a csv, values which cannot be expressed precisely as decimals (e.g. 1/3) are rounded to a limited number of decimal places, and reading them back introduces tiny errors.

    fwrite(dat1, "./dat.csv")
    dat1_fread <- fread("./dat.csv")
    all.equal(dat1, dat1_fread) # TRUE
    identical(dat1, dat1_fread) # FALSE
    

    You can see that dat1 and dat1_fread are all.equal(), as this is a utility to compare R objects x and y testing ‘near equality’. However, they are not identical(). Joins require exact matches. And so we don't return any results if we join on x:

    merge(dat1_fread, dat2)
    # Empty data.table (0 rows and 2 cols): x,label
    

    You can see the issue if you print the values with high enough precision:

    sprintf("%.20f", dat1$x)
    # [1] "0.33333333333333331483"
    sprintf("%.20f", dat2$x)
    # [1] "0.33333333333333331483"
    sprintf("%.20f", dat1_fread$x)
    # [1] "0.33333333333333298176"
    

    Interestingly, the text value of x stored in dat.csv is 0.333333333333333, which is not the same as the value when it is read back in. This is because when fwrite() saves the csv, it rounds 1/3 to 15 significant digits. When fread() reads the csv, it sees a number which cannot be exactly represented in binary. It then represents it as 0.33333333333333298176.... This is a fundamental limitation of floating-point arithmetic, not an R issue.

    How to resolve this

    The best plan is not to join on a float. If you have to join on a float, you could round, if you can afford to lose some precision (which essentially you can because you can't rely on floats being stored with very high accuracy). If you don't want to round, don't save your file to csv but serialize to some binary format instead. For example, rds:

    saveRDS(dat1, "./dat.rds")
    dat1_rds <- readRDS("./dat.rds")
    merge(dat1_rds, dat2)
    # Key: <x>
    #            x     label
    #        <num>    <char>
    # 1: 0.3333333 one third