
Preprocessing strings using stringr functions


I have a string that looks like this:

clean_text
[1] "01/04/2018   Japan   -   Ghana   7:1    04/04/2018   Turkey   -   Estonia   3:2    06/04/2018   USA   -   Mexico   4:1        France   -   Nigeria   8:0     07/04/2018   Turkey   -   Estonia   3:0    08/04/2018   USA   -   Mexico   6:2     09/04/2018   France   -   Canada   1:0     10/04/2018   Cuba   -   Nicaragua   4:2    12/04/2018   Cuba   -   Nicaragua   1:2    18/04/2018   St. Vincent/Grenadines   -   St. Lucia   0:1       St. Kitts & Nevis   -   Dominica   1:0       Cuba   -   Barbados   7:0    19/04/2018   Haiti   -   Virgin Islands   7:0    20/04/2018   St. Lucia   -   Dominica   0:0       St. Kitts & Nevis   -   St. Vincent/Grenadines   2:0       Jamaica    -   Barbados   3:2    21/04/2018   Virgin Islands   -   Haiti   0:14    22/04/2018   Dominica   -   St. Vincent/Grenadines   3:0       St. Kitts & Nevis   -   St. Lucia   0:1       Jamaica   -   Cuba   0:1    25/04/2018   Guyana   -   Grenada   0:0       Trinidad & Tobago   -   Suriname   7:0    27/04/2018   Suriname   -   Guyana   2:2       Antigua & Barbuda   -   Curaçao   2:1       Trinidad & Tobago   -   Grenada   8:1    29/04/2018   Grenada   -   Suriname   5:6       Trinidad & Tobago   -   Guyana   3:1    "

I want to preprocess it so that I get the team names as a list, such as Japan, Ghana, Turkey, Estonia, USA and so on, i.e. the names that are separated by ' - '.

I am trying the code:

pattern <- "[[:alpha:]][[:alpha:] -]*[[:alpha:]]"
matches <- str_extract_all(clean_text, pattern)[[1]]

which gives me the following result:

[1] "Japan   -   Ghana"          "Turkey   -   Estonia"
[3] "USA   -   Mexico"           "France   -   Nigeria"
[5] "Turkey   -   Estonia"       "USA   -   Mexico"
[7] "France   -   Canada"        "Cuba   -   Nicaragua"
[9] "Cuba   -   Nicaragua"       "St"
[11] "Vincent"                    "Grenadines   -   St"
[13] "Lucia"                      "St"
[15] "Kitts"                      "Nevis   -   Dominica"
[17] "Cuba   -   Barbados"        "Haiti   -   Virgin Islands"
[19] "St"                         "Lucia   -   Dominica"
[21] "St"                         "Kitts"
[23] "Nevis   -   St"             "Vincent"
[25] "Grenadines"                 "Jamaica    -   Barbados"
[27] "Virgin Islands   -   Haiti" "Dominica   -   St"
[29] "Vincent"                    "Grenadines"
[31] "St"                         "Kitts"
[33] "Nevis   -   St"             "Lucia"
[35] "Jamaica   -   Cuba"         "Guyana   -   Grenada"
[37] "Trinidad"                   "Tobago   -   Suriname"
[39] "Suriname   -   Guyana"      "Antigua"
[41] "Barbuda   -   Curaçao"      "Trinidad"
[43] "Tobago   -   Grenada"       "Grenada   -   Suriname"
[45] "Trinidad"                   "Tobago   -   Guyana"

This is wrong because it breaks the string wherever a '.', '/' or '&' appears. I only want the string to be split where ' - ' is present. What should I change in my code?


Solution

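  • Perhaps a more step-by-step approach is helpful here: split the string at each score, remove the date from each piece, then trim the surrounding whitespace: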

    library(stringr)
    
    s <- "01/04/2018   Japan   -   Ghana   7:1    04/04/2018   Turkey   -   Estonia   3:2    06/04/2018   USA   -   Mexico   4:1        France   -   Nigeria   8:0     07/04/2018   Turkey   -   Estonia   3:0    08/04/2018   USA   -   Mexico   6:2     09/04/2018   France   -   Canada   1:0     10/04/2018   Cuba   -   Nicaragua   4:2    12/04/2018   Cuba   -   Nicaragua   1:2    18/04/2018   St. Vincent/Grenadines   -   St. Lucia   0:1       St. Kitts & Nevis   -   Dominica   1:0       Cuba   -   Barbados   7:0    19/04/2018   Haiti   -   Virgin Islands   7:0    20/04/2018   St. Lucia   -   Dominica   0:0       St. Kitts & Nevis   -   St. Vincent/Grenadines   2:0       Jamaica    -   Barbados   3:2    21/04/2018   Virgin Islands   -   Haiti   0:14    22/04/2018   Dominica   -   St. Vincent/Grenadines   3:0       St. Kitts & Nevis   -   St. Lucia   0:1       Jamaica   -   Cuba   0:1    25/04/2018   Guyana   -   Grenada   0:0       Trinidad & Tobago   -   Suriname   7:0    27/04/2018   Suriname   -   Guyana   2:2       Antigua & Barbuda   -   Curaçao   2:1       Trinidad & Tobago   -   Grenada   8:1    29/04/2018   Grenada   -   Suriname   5:6       Trinidad & Tobago   -   Guyana   3:1"
    
    s |> 
      str_split_1("\\d+:\\d+") |> 
      str_remove("\\d{2}/\\d{2}/\\d{4}") |> 
      str_trim()
    #>  [1] "Japan   -   Ghana"                             
    #>  [2] "Turkey   -   Estonia"                          
    #>  [3] "USA   -   Mexico"                              
    #>  [4] "France   -   Nigeria"                          
    #>  [5] "Turkey   -   Estonia"                          
    #>  [6] "USA   -   Mexico"                              
    #>  [7] "France   -   Canada"                           
    #>  [8] "Cuba   -   Nicaragua"                          
    #>  [9] "Cuba   -   Nicaragua"                          
    #> [10] "St. Vincent/Grenadines   -   St. Lucia"        
    #> [11] "St. Kitts & Nevis   -   Dominica"              
    #> [12] "Cuba   -   Barbados"                           
    #> [13] "Haiti   -   Virgin Islands"                    
    #> [14] "St. Lucia   -   Dominica"                      
    #> [15] "St. Kitts & Nevis   -   St. Vincent/Grenadines"
    #> [16] "Jamaica    -   Barbados"                       
    #> [17] "Virgin Islands   -   Haiti"                    
    #> [18] "Dominica   -   St. Vincent/Grenadines"         
    #> [19] "St. Kitts & Nevis   -   St. Lucia"             
    #> [20] "Jamaica   -   Cuba"                            
    #> [21] "Guyana   -   Grenada"                          
    #> [22] "Trinidad & Tobago   -   Suriname"              
    #> [23] "Suriname   -   Guyana"                         
    #> [24] "Antigua & Barbuda   -   Curaçao"               
    #> [25] "Trinidad & Tobago   -   Grenada"               
    #> [26] "Grenada   -   Suriname"                        
    #> [27] "Trinidad & Tobago   -   Guyana"                
    #> [28] ""
    

    Created on 2023-03-17 with reprex v2.0.2
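
The result above still holds each fixture as one string and ends with an empty element, whereas the question asks for the individual team names. A minimal sketch of that last step follows, assuming no team name itself contains a hyphen surrounded by whitespace. The names `fixtures`, `teams` and `team_names` are illustrative and not part of the original answer, and a shortened sample of the string is used to keep the example small:

```r
library(stringr)

# Shortened sample taken from the string in the question.
s <- "18/04/2018   St. Vincent/Grenadines   -   St. Lucia   0:1       St. Kitts & Nevis   -   Dominica   1:0       Cuba   -   Barbados   7:0    19/04/2018   Haiti   -   Virgin Islands   7:0    "

fixtures <- s |>
  str_split_1("\\d+:\\d+") |>            # cut at every score such as "0:1"
  str_remove("\\d{2}/\\d{2}/\\d{4}") |>  # drop the date where one is present
  str_trim()                             # strip leading/trailing whitespace

# Drop the empty piece left over after the final score.
fixtures <- fixtures[fixtures != ""]

# Split each fixture on a hyphen surrounded by whitespace.
teams <- str_split(fixtures, "\\s+-\\s+")  # list with one character vector per match
team_names <- unlist(teams)                # flat vector: home, away, home, away, ...

print(team_names)
```

Requiring whitespace on both sides of the hyphen is what keeps '.', '/' and '&' inside the team names. Note that `str_split_1()` needs stringr 1.5.0 or later and the native pipe `|>` needs R 4.1 or later.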