rdplyrdynamic-programmingmarkov-decision-process

Sequential value iteration in R


I am currently reading the Dynamic Programming & MDP of Ronald Howard. Particularly in page 29 he presents the toymaker example with two different policies 1 and 2.Each policy has a transition probability matrix and a reward matrix.


# Set up initial variables for Policy 1
P1 <- matrix(c(0.5, 0.5, 0.4, 0.6), nrow = 2, byrow = TRUE);rownames(P1)=c("1","2");P1
R1 <- matrix(c(9, 3, 3, -7), nrow = 2, byrow = TRUE);rownames(R1)=c("1","2");R1


# Set up initial variables for Policy 2
P2 <- matrix(c(0.8, 0.2, 0.7, 0.3), nrow = 2, byrow = TRUE);rownames(P2)=c("1","2");P2
R2 <- matrix(c(4, 4, 1, -19), nrow = 2, byrow = TRUE);rownames(R2)=c("1","2");R2


Now bind these 4 matrices according to their policies:

library(dplyr)

P = as_tibble(rbind(P1,P2))%>%
  mutate(policy = rep(c(1,2),2))%>%
  arrange(policy)%>%
  rename(p1 =V1,p2 = V2);P
  
R = as_tibble(rbind(R1,R2))%>%
  mutate(policy2 = rep(c(1,2),2))%>%
  arrange(policy2)%>%
  rename(r1 =V1,r2 = V2);R
cbind(P,R)%>%
  select(-policy2)%>%
  relocate(.,policy,.after = r2)%>%
  group_by(policy)

the result is :

# A tibble: 4 × 5
# Groups:   policy [2]
     p1    p2    r1    r2 policy
  <dbl> <dbl> <dbl> <dbl>  <dbl>
1   0.5   0.5     9     3      1
2   0.8   0.2     4     4      1
3   0.4   0.6     3    -7      2
4   0.7   0.3     1   -19      2

the q_{I}^k = \sum_{j=1}^{N}p_{ij}^k r_{ij}^k

for k =1,2, where k are the two policies.Sctually is the row sum of element by element P*R matrices for each policy.

I want to implement the sequential value iteration as described in his book.In general I want from this data frame to calculate the 3.3 equation as shown in the picture below and the ideal output to be the table as shown below.

enter image description here

How can I do it in R ?

Edit (additional note)

for one policy as described in page 19 is the following

 # Set up initial variables
> P = matrix(c(0.5, 0.5, 0.4, 0.6), nrow = 2, byrow = TRUE)
> R = matrix(c(9, 3, 3, -7), nrow = 2, byrow = TRUE)
> Q = rowSums(P*R)
> n = 5
> v = matrix(0, nrow = 2, ncol = n+1) # Initialize v(0) to be all zeros
> v[,1] = 0 # Initialize v(0) to be all zeros
> # total-value vector.
> # Calculate values for different n using matrix calculations
> for (i in 1:n) {
+   v[,i+1] = Q + P %*% v[,i]
+ }
> print(v)
     [,1] [,2] [,3]  [,4]   [,5]    [,6]
[1,]    0    6  7.5  8.55  9.555 10.5555
[2,]    0   -3 -2.4 -1.44 -0.444  0.5556

Solution

  • Using a bit of recursion. Though you could use for-loop or even reduce:

    q1 <- unname(rowSums(P1*R1))
    q2 <- unname(rowSums(P2*R2))
    
    
    v <- function(x, n=0){
      if(n == 0)  x
      else cbind(x, v(pmax(q1 + P1 %*% x, q2 + P2 %*% x), n-1))
    }
    
    v(c(0,0), 4)
      x              
    1 0  6  8.2 10.22
    2 0 -3 -1.7  0.23
    

    Using for-loop:

    n <- 5
    V <- matrix(0,2,n)
    
    for (i in seq.int(2, n)){
      V[,i] <- pmax(q1 + P1 %*% V[,i-1], q2 + P2 %*% V[,i-1])
    }
    V
    
        [,1] [,2] [,3]  [,4]   [,5]
    [1,]    0    6  8.2 10.22 12.222
    [2,]    0   -3 -1.7  0.23  2.223
    

    Edit:

    using tidyverse:

    library(tidyverse)
    seq_len(n) %>%
       set_names(str_c("v", .))%>%
       accumulate(~c(pmax(q1 + P1%*%.x, q2 + P2%*%.x)), .init = c(0,0))%>%
       bind_cols()
    # A tibble: 2 × 6
      .init    v1    v2     v3    v4    v5
      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
    1     0     6   8.2 10.2   12.2  14.2 
    2     0    -3  -1.7  0.230  2.22  4.22