Tags: r, web-scraping, rows, rvest

Scraping with rvest with a different number of rows in R


https://www.transfermarkt.de/alexander-bade/profil/spieler/31

Is it possible to scrape the whole transfer history table on this page in one go?


Solution

  • Here's a way to get the table in one go, with the flag URLs included (left blank when a cell has no flag):

    library(rvest)
    library(dplyr)
    
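    # Parse the page once; every later query reuses `pg`
    pg <- read_html("https://www.transfermarkt.de/alexander-bade/profil/spieler/31")
    # Data rows carry exactly this class; the heading row carries it plus a "--heading" modifier
    row_class    <- "tm-player-transfer-history-grid"
    header_class <- paste0(row_class, " ", row_class, "--heading")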
    
    dplyr::bind_cols(
      # Cell text: one row per transfer, with column names taken from the heading row
      pg %>% html_nodes(xpath = paste0("//div[@class='", row_class, "']")) %>%
             sapply(function(x) html_children(x) %>% 
                     sapply(function(x) html_text(x) %>% trimws())) %>%
             t() %>%
             as.data.frame() %>%
             setNames(pg %>% 
                html_nodes(xpath = paste0("//div[@class='", header_class, "']")) %>%
                html_children() %>%
                sapply(html_text)),
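      # Flag URLs: for every cell, the data-src of a direct <img> child, or "" if there is none
      pg %>% html_nodes(xpath =  paste0("//div[@class='", row_class, "']")) %>%
            lapply(function(x) {
              unlist(lapply(html_children(x), function(x) {
                a <- html_node(x, xpath = "./img")
                # A missing node has length 0, so cells without an image yield ""
                if(length(a) == 0) ""
                else html_attr(a, "data-src")
                }))}) %>%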
            do.call(rbind, .) %>%
            `[`(,3:4) %>%
            as.data.frame() %>%
            setNames(c("old_club_flag", "new_club_flag"))
    ) %>% as_tibble() %>% select(-7)  # drop the unnamed 7th column (auto-named ...7)
    #> New names:
    #> * `` -> ...7
    #> # A tibble: 11 x 8
    #>    Saison Datum     `Abgebender Ve~` `Aufnehmender ~` MW    Ablöse old_club_flag
    #>    <chr>  <chr>     <chr>            <chr>            <chr> <chr>  <chr>        
    #>  1 09/10  01.07.20~ Arm. Bielefeld   Karriereende     -     -      "https://tms~
    #>  2 08/09  01.09.20~ Vereinslos       Arm. Bielefeld   -     -      ""           
    #>  3 08/09  01.07.20~ Bor. Dortmund    Vereinslos       200 ~ -      "https://tms~
    #>  4 07/08  01.01.20~ SC Paderborn     Bor. Dortmund    400 ~ 100 T~ "https://tms~
    #>  5 07/08  01.07.20~ VfL Bochum       SC Paderborn     400 ~ ablös~ "https://tms~
    #>  6 06/07  01.07.20~ 1.FC Köln        VfL Bochum       400 ~ ablös~ "https://tms~
    #>  7 00/01  01.07.20~ Hamburger SV     1.FC Köln        -     125 T~ "https://tms~
    #>  8 98/99  01.07.19~ KFC Uerdingen    Hamburger SV     -     300 T~ "https://tms~
    #>  9 94/95  01.07.19~ 1.FC Köln        Bayer 05         -     500 T~ "https://tms~
    #> 10 91/92  01.07.19~ 1.FC Köln II     1.FC Köln        -     -      "https://tms~
    #> 11 88/89  01.07.19~ TeBe Berlin U19  1.FC Köln II     -     ablös~ "https://tms~
    #> # ... with 1 more variable: new_club_flag <chr>
    

    Created on 2022-05-13 by the reprex package (v2.0.1)
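
The same idea (one row per grid `div`, cell text plus an optional flag URL) can be tried offline on a small inline snippet. This is only a minimal sketch: the HTML below is a made-up stand-in that reuses the class name from the answer, the `example.com` image URLs and the `parse_row` helper are hypothetical, and it assumes rvest 1.0.0 or later for `minimal_html()` and `html_element()`.

```r
library(rvest)

row_class <- "tm-player-transfer-history-grid"

# Hypothetical markup: one heading row and two data rows, three cells each.
pg <- minimal_html(paste0('
  <div class="', row_class, ' ', row_class, '--heading">
    <div>Saison</div><div>Abgebender Verein</div><div>Aufnehmender Verein</div>
  </div>
  <div class="', row_class, '">
    <div>09/10</div>
    <div><img data-src="https://example.com/de.png"> Arm. Bielefeld</div>
    <div>Karriereende</div>
  </div>
  <div class="', row_class, '">
    <div>08/09</div>
    <div>Vereinslos</div>
    <div><img data-src="https://example.com/de.png"> Arm. Bielefeld</div>
  </div>'))

# The exact-match on @class selects the data rows and skips the heading row.
rows <- html_elements(pg, xpath = paste0("//div[@class='", row_class, "']"))

# Turn one grid row into a one-row data frame.
parse_row <- function(row) {
  cells <- html_children(row)
  # html_element() keeps one entry per cell; cells without a direct <img>
  # child give NA from html_attr(), which is replaced by "".
  flags <- html_attr(html_element(cells, xpath = "./img"), "data-src")
  flags[is.na(flags)] <- ""
  data.frame(
    season        = trimws(html_text(cells[1])),
    old_club      = trimws(html_text(cells[2])),
    new_club      = trimws(html_text(cells[3])),
    old_club_flag = flags[2],
    new_club_flag = flags[3]
  )
}

transfers <- do.call(rbind, lapply(rows, parse_row))
print(transfers)
```

Because `html_element()` returns exactly one result per input node, the flag columns stay aligned with the text columns even when some cells have no image, which is the property the answer relies on.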