rdataframejoin

Table functions to match and merge based on different criteria


There are two tables, table_a and table_b. One table has monthly data while the other table has yearly data (although both tables have a 'year' column / both tables may have a different year range). Example:

           table_a                         table_b
     year   month   infltn               year   h_index
[1]  2010   May     4.0%            [1]  2011   102.0
[2]  2010   Jun     4.0%            [2]  2012   102.5
[3]  2010   Jul     4.0%            [3]  2013   103.0
[4]  2010   Aug     4.0%            [4]  2014   103.6
[5]  2010   Sep     4.0%            [5]  2015   104.1
[6]  2010   Oct     3.0%            [6]  2016   104.6
[7]  2010   Nov     3.0%            [7]  2017   105.1
[8]  2010   Dec     3.0%            [8]  2018   105.6
[9]  2011   Jan     4.0%            [9]  2019   106.2
[10] 2011   Feb     4.0%            [10] 2020   106.7
[11] 2011   Mar     4.0%            [11] 2021   107.2
[12] 2011   Apr     5.0%            [12] 2018   107.8
[13] 2011   May     5.0%            [13] 2019   108.3
[14] 2011   Jun     5.0%            [14] 2020   108.8
[15] 2011   Jul     5.0%            [15] 2021   109.4

The first part of the question/request: if creating a new_table_a (based on the yearly data of table_a) the data from table_b needs to be interpolated to populate the new table. As an example the column h_index in table_a has been interpolated into new_table_a as h_index2 (see below, new_table_a). Also, please note that in the example the data started in 2011 so prior to 2011 that h_index2 shows the character '-'.

               new_table_a              
     year   month   infltn  h_index2
[1]  2010   May       4%        -
[2]  2010   Jun       4%        -
[3]  2010   Jul       4%        -
[4]  2010   Aug       4%        -
[5]  2010   Sep       4%        -
[6]  2010   Oct       3%        -
[7]  2010   Nov       3%        -
[8]  2010   Dec       3%        -
[9]  2011   Jan       4%        102.0%
[10] 2011   Feb       4%        102.1%
[11] 2011   Mar       4%        102.1%
[12] 2011   Apr       5%        102.2%
[13] 2011   May       5%        102.2%
[14] 2011   Jun       5%        102.3%
[15] 2011   Jul       5%        102.3%

The second part of the question/request: if creating a new_table_b based on the monthly data of table_b, how do you average the calender months for the respective year so that you can populate new_table_b with an average (from infltn in table_a to infltn_avg in new_table_b). Example:

new_table_b             
     Year   h_index     infltn_avg
[1]  2011   102.0       4.0%
[2]  2012   102.5       4.0%
[3]  2013   103.0       4.0%
[4]  2014   103.6       4.0%
[5]  2015   104.1       4.0%
[6]  2016   104.6       3.0%
[7]  2017   105.1       3.0%
[8]  2018   105.6       3.0%
[9]  2019   106.2       4.0%
[10] 2020   106.7       4.0%
[11] 2021   107.2       4.0%
[12] 2018   107.8       5.0%
[13] 2019   108.3       5.0%
[14] 2020   108.8       5.0%
[15] 2021   109.4       5.0%

I have tried to code this using a range of loops and if's, which would seem rational in something like VBA, although it a bit clunky in R and I am sure that there is a more intelligent way of approaching it.


Solution

  • Here is a way to achieve this:


    table_b$month ="Jan"
    new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))
    new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)
    new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))
    new_table_b = new_table_a %>% group_by(year) %>% summarise(h_index2= min(h_index2), infltn_avg = mean(infltn))
    

    new_table_a

    year month infltn h_index h_index2
    1  2010   May      4      NA       NA
    2  2010   Jun      4      NA       NA
    3  2010   Jul      4      NA       NA
    4  2010   Aug      4      NA       NA
    5  2010   Sep      4      NA       NA
    6  2010   Oct      3      NA       NA
    7  2010   Nov      3      NA       NA
    8  2010   Dec      3      NA       NA
    9  2011   Jan      4   102.0 102.0000
    10 2011   Feb      4      NA 102.0417
    11 2011   Mar      4      NA 102.0833
    12 2011   Apr      5      NA 102.1250
    13 2011   May      5      NA 102.1667
    14 2011   Jun      5      NA 102.2083
    15 2011   Jul      5      NA 102.2500
    16 2011   Aug      5      NA 102.2917
    17 2011   Sep      5      NA 102.3333
    18 2011   Oct      5      NA 102.3750
    19 2011   Nov      5      NA 102.4167
    20 2011   Dec      5      NA 102.4583
    21 2012   Jan      5   102.5 102.5000
    

    new_table_b

    # A tibble: 6 x 3
       year h_index2 infltn_avg
      <int>    <dbl>      <dbl>
    1  2010      NA        3.62
    2  2011     102        4.75
    3  2012     102.       5   
    4  2013     103        5   
    5  2014     104.       6   
    6  2015     104.       6
    

    Details


    First, to facilitate the extrapolation of h_index we add a month column in the table_b set to "Jan"

    table_b$month ="Jan"
    

    Then we join the two data frames by year and month

    new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))
    

    For the extrapolation you can use na.approx from the zoo package. h_index2 is the extrapolation between 2 known h_index. Note: As this column is numeric it's best to keep the NA rather than adding a "-".

    library(zoo)
    new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)
    

    Now to generate new_table_b, we need the average percentage per year. For this we have to transform the infltn column into numeric value by removing the "%":

    new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))
    

    Finally, we just need to summarize the new_table_a to have the new_table_b. After grouping by year, we take the lowest value for min(h_index2) and do the average of monthly inflation mean(infltn)

    new_table_b = new_table_a %>% group_by(year) %>% summarize(h_index2= min(h_index2), infltn_avg = mean(infltn))
    

    Data


    table_a = read.table(text="year month infltn
    1  2010   May  4.00%
    2  2010   Jun  4.00%
    3  2010   Jul  4.00%
    4  2010   Aug  4.00%
    5  2010   Sep  4.00%
    6  2010   Oct  3.00%
    7  2010   Nov  3.00%
    8  2010   Dec  3.00%
    9  2011   Jan  4.00%
    10 2011   Feb  4.00%
    11 2011   Mar  4.00%
    12 2011   Apr  5.00%
    13 2011   May  5.00%
    14 2011   Jun  5.00%
    15 2011   Jul  5.00%
    16 2011   Aug  5.00%
    17 2011   Sep  5.00%
    18 2011   Oct  5.00%
    19 2011   Nov  5.00%
    20 2011   Dec  5.00%
    21 2012   Jan  5.00%
    22 2012   Feb  5.00%
    23 2012   Mar  5.00%
    24 2012   Apr  5.00%
    25 2012   May  5.00%
    26 2012   Jun  5.00%
    27 2012   Jul  5.00%
    28 2012   Aug  5.00%
    29 2012   Sep  5.00%
    30 2012   Oct  5.00%
    31 2012   Nov  5.00%
    32 2012   Dec  5.00%
    33 2013   Jan  5.00%
    34 2013   Feb  5.00%
    35 2013   Mar  5.00%
    36 2013   Apr  5.00%
    37 2013   May  5.00%
    38 2013   Jun  5.00%
    39 2013   Jul  5.00%
    40 2013   Aug  5.00%
    41 2013   Sep  5.00%
    42 2013   Oct  5.00%
    43 2013   Nov  5.00%
    44 2013   Dec  5.00%
    45 2014   Jan  6.00%
    46 2014   Feb  6.00%
    47 2014   Mar  6.00%
    48 2014   Apr  6.00%
    49 2014   May  6.00%
    50 2014   Jun  6.00%
    51 2014   Jul  6.00%
    52 2014   Aug  6.00%
    53 2014   Sep  6.00%
    54 2014   Oct  6.00%
    55 2014   Nov  6.00%
    56 2014   Dec  6.00%
    57 2015   Jan  6.00%", header= TRUE)
    table_b = read.table(text="year h_index
    1  2011   102.0
    2  2012   102.5
    3  2013   103.0
    4  2014   103.6
    5  2015   104.1
    6  2016   104.6
    7  2017   105.1
    8  2018   105.6
    9  2019   106.2
    10 2020   106.7
    11 2021   107.2
    12 2018   107.8
    13 2019   108.3
    14 2020   108.8
    15 2021   109.4", header = TRUE )