rdataframeuniqueepl

Create data frame for each unique row in another data frame


For an assignment for my graduate program, I have been asked to extract data from datasets of English Premier League results (located here). I am very close to being done but need help on the last two outputs.

We must create a function that can receive two arguments, a date and a season. The function must return a data frame with the table of the respective season on that date. It must include wins, losses, home record, away record, etc. The only ones I have not managed to figure out are W/L streak and the results of the last 10 matches.

Here is an example of what the initial dataset looks like:

   e.Date           e.HomeTeam    e.AwayTeam   e.FTHG e.FTAG  e.FTR
   1  2015-08-08    Bournemouth    Aston Villa      0      1     A
   2  2015-08-08        Chelsea        Swansea      2      2     D
   3  2015-08-08        Everton        Watford      2      2     D
   4  2015-08-08      Leicester     Sunderland      4      2     H
   5  2015-08-08     Man United      Tottenham      1      0     H

My plan was to get Home and Away data sorted out for each club then merge them together before doing the analysis to find streak and last 10 results.

I manipulated the data to look like this:

          HomeTeam FTR       Date freq
 1         Arsenal   L 2015-08-09    1
 2         Arsenal   D 2015-08-24    1
 3         Arsenal   W 2015-09-12    1
 4     Aston Villa   L 2015-08-14    1
 5     Aston Villa   L 2015-09-19    1
 6     Aston Villa   D 2015-08-29    1

And now I'm kinda lost. My idea was to run some kind of loop (for? ddply? data.table?) to create a data frame for each club with their results in it and then loop again to do whatever calculations to get the desired variables (streak and last 10) and somehow push those back into the main data frame where I am housing all of the other outputs.

I don't want to be told the answer outright since it's important I learn this on my own. However, if someone could point me in the right direction that would be great. Thanks so much.


Solution

  • I created some dummy data just to demonstrate a few commands and maybe give you some ideas.

    set.seed(321)
    dat <- data.frame(team = sample(letters[1:3], 20, replace=TRUE), 
                      season = rep("season1", 20), 
                      time = rnorm(20), 
                      win_loss = sample(c("win", "loss"), 20, replace=TRUE))
    

    Problem 1. Find win/loss streak

    Take a look at the rle function example below

    # 1. find wl streak of team 'a'
    tmp <- dat[dat$team == "a", ]
    tmp <- tmp[order(tmp$time), ]
    > tmp
       team  season        time win_loss
    19    a season1 -1.12032742     loss
    14    a season1 -1.07223880     loss
    16    a season1  0.09500072     loss
    3     a season1  0.18832552     loss
    8     a season1  0.42033257     loss
    4     a season1  2.44325982      win
    
    # shows runs of 5 consecutive losses, then 1 consecutive win
    rle(tmp$win_loss == "win")
    Run Length Encoding
      lengths: int [1:2] 5 1
      values : logi [1:2] FALSE TRUE
    

    Here's a very helpful post on rle How can I count runs in a sequence?

    Problem 2. Last 3 results

    I reversed the order of time and then picked the top 3 results.

    # 2. find last 3 matches for team 'b'
    tmp <- dat[dat$team == "b", ]
    tmp <- tmp[rev(order(tmp$time)), ]
    > tmp[1:3, ]
       team  season      time win_loss
    11    b season1 0.9172555     loss
    9     b season1 0.5775845      win
    7     b season1 0.4560691     loss