I'm trying to extract tables from this site: https://www.dnb.com/business-directory/company-information.beverage_manufacturing.br.html
As you can see, the complete table has 14,387 rows and each page shows 50 results. My goal is to extract all results and put them into a single data.frame. Using rvest's read_html() directly doesn't work:
library(rvest)
page <- read_html(url_completa) #url_completa is the url I first indicated
Error in open.connection(x, "rb"): cannot open the connection
That's why I'm trying to handle this problem using RSelenium. Here's the code I'm using to open a browser. It works fine:
library(RSelenium)
rd <- rsDriver(browser = "firefox", port = 4567L,
               phantomver = NULL, chromever = NULL)
remdr <- rd[["client"]]
remdr$navigate(url_completa)
page_source <- remdr$getPageSource()[[1]]
Now, when I try to access the HTML content using html_nodes("table"), I get an empty xml_nodeset(0):
page_source %>%
  read_html() %>%
  html_nodes("table") -> result
result
{xml_nodeset (0)}
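For what it's worth, one way to check whether the rendered source contains a <table> element at all (and which elements it does contain) would be to tabulate the tag names, something along these lines:

page_source %>%
  read_html() %>%
  html_nodes("*") %>% # every element in the rendered document
  html_name() %>%     # just the tag names
  table() %>%
  sort(decreasing = TRUE)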
Exploring some other posts, I found that sometimes html_nodes("li") could work, but the output is not what I want:
page_source %>%
  read_html() %>%
  html_nodes("li") %>%
  html_text() -> result

head(result)
[1] "Enterprise"
[2] "Small Business"
[3] "Public Sector"
[4] "Search OptionsSite ContentBusiness DirectoryContact DirectoryMarketplace"
[5] "Site Content"
[6] "Business Directory"
So basically I'm thinking that the problem is the parameters I pass to html_nodes(), but I'm not very proficient in web scraping, and if I can't get the results even from the first page, I won't be able to extract the whole table.
Thanks in advance for any help and advice.
This is more of an outline of how it can be done. Please always respect the website's terms of use and make sure to use APIs where possible.
You can mimic the POST request the page makes under the hood every time a page of results is loaded; each request returns the company data for 50 companies per page. In R you can use httr2 to send that POST request with the required JSON body. Thanks to @margusl for pointing me towards req_perform_iterative() to perform iterative requests and resps_data() to build a data frame from the responses. We can then keep sending POST requests until we reach max_pages, here pulling the available data from 40 pages.
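If you first want to see what a single response looks like, a minimal one-page request could look like the sketch below. It reuses the same endpoint, headers and body fields as the full version that follows; resp_body_json() returns the parsed JSON so you can inspect its structure with str().

library(httr2)

# one-off request for page 1 of the listing, useful for inspecting the
# JSON structure before wiring up the iteration below
resp <- request("https://www.dnb.com/business-directory/api/companyinformation") |>
  req_method("POST") |>
  req_headers(
    "Accept" = "application/json, text/plain, */*",
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer" = "https://www.dnb.com/"
  ) |>
  req_body_json(list(
    pageNumber = 1,
    industryPath = "beverage_manufacturing",
    countryIsoTwoCode = "br"
  )) |>
  req_perform()

page1 <- resp_body_json(resp)
length(page1$companyInformationCompany) # expect around 50 companies per page
str(page1$companyInformationCompany[[1]], max.level = 2)

The full, iterative version: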
library(httr2)

get_comp_data_from_all_pages <- \(max_pages = 40) {
  # note: `%||%` below is available in base R >= 4.4 (otherwise use rlang::`%||%`)
  request("https://www.dnb.com/business-directory/api/companyinformation") |>
    req_method("POST") |>
    req_headers(
      "Accept" = "application/json, text/plain, */*",
      "Accept-Language" = "en,de-DE;q=0.9,de;q=0.8,en-US;q=0.7,zh-CN;q=0.6,zh;q=0.5",
      "Cache-Control" = "no-cache",
      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Referer" = "https://www.dnb.com/"
    ) |>
    req_body_json(list(
      pageNumber = 1,
      industryPath = "beverage_manufacturing",
      countryIsoTwoCode = "br"
    )) |>
    req_throttle(rate = 2) |> # throttle to avoid hammering the API
    req_perform_iterative(
      next_req = \(resp, req) {
        # look up the page number of the request that produced this response
        # and build the request for the next page
        page_num <- (resp$request$body$data$pageNumber %||% 1) + 1
        if (page_num > max_pages) return(NULL)
        companies <- resp_body_json(resp)[["companyInformationCompany"]]
        if (length(companies) == 0) return(NULL) # stop once a page comes back empty
        req_body_json(req, list(
          pageNumber = page_num,
          industryPath = "beverage_manufacturing",
          countryIsoTwoCode = "br"
        ))
      },
      max_reqs = max_pages
    ) |>
    resps_data(\(resp) {
      companies <- resp_body_json(resp)[["companyInformationCompany"]]
      if (length(companies) == 0) return(NULL)
      lapply(companies, rapply, f = c) |> # flatten each nested company record for binding
        dplyr::bind_rows()
    })
}
res <- get_comp_data_from_all_pages()
giving
res
# A tibble: 2,000 × 18
duns primaryName primaryNameForUrl primaryAddress.addre…¹ primaryAddress.addre…² primaryAddress.addre…³ primaryAddress.addre…⁴
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 04210dcfb196… AMBEV S/A ambev_s-a BR Brazil Brazil SAO PAULO
2 a83c292319f8… CERVEJARIA… cervejaria_petro… BR Brazil Brazil RIO DE JANEIRO
3 e55d63d6827b… HNK BR IND… hnk_br_industria… BR Brazil Brazil ITU
4 14250804cf5e… SPAIPA IND… spaipa_industria… BR Brazil Brazil CURITIBA
5 3c6cadd49a54… NORSA REFR… norsa_refrigeran… BR Brazil Brazil FORTALEZA
6 da30ec9a8425… AROSUCO AR… arosuco_aromas_e… BR Brazil Brazil MANAUS
7 c88c496b7ac4… REFRESCOS … refrescos_guarar… BR Brazil Brazil JABOATAO DOS GUARARAP…
8 5d5fe1872a35… CERVEJARIA… cervejaria_petro… BR Brazil Brazil ALAGOINHAS
9 33218fc0ae0c… CERVEJARIA… cervejaria_petro… BR Brazil Brazil ITAPISSUMA
10 50e7284963b0… CERVEJARIA… cervejaria_cidad… BR Brazil Brazil FRUTAL
It seems like the website serves duplicated data. You can see it if you fetch the full data set and then use distinct(): it only exposes 1,000 distinct observations. You can see where these duplicates occur if we add the page number to each row.
library(httr2)
library(purrr) # for map to unnest response list
library(dplyr) # for bind_rows

get_comp_data_from_all_pages <- \(max_pages = 40) {
  get_page_data <- \(page_num) {
    request_body <- list(
      pageNumber = page_num,
      industryPath = "beverage_manufacturing",
      countryIsoTwoCode = "br"
    )
    cat("Fetching page", page_num, "...\n")
    tryCatch({
      res <- request("https://www.dnb.com/business-directory/api/companyinformation") |>
        req_method("POST") |>
        req_headers(
          "Accept" = "application/json, text/plain, */*",
          "Accept-Language" = "en,de-DE;q=0.9,de;q=0.8,en-US;q=0.7,zh-CN;q=0.6,zh;q=0.5",
          "Cache-Control" = "no-cache",
          "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
          "Referer" = "https://www.dnb.com/"
        ) |>
        req_body_json(request_body) |>
        req_throttle(rate = 2) |>
        req_perform() |>
        resp_body_json()
      companies <- res[["companyInformationCompany"]]
      if (length(companies) == 0) {
        cat("No more data on page", page_num, "\n")
        return(NULL)
      }
      # keep selected fields and record which page each row came from
      companies |>
        purrr::map(~ {
          list(
            company_name = .x$primaryName,
            sales_revenue = .x$salesRevenue,
            country = .x$primaryAddress$addressCountry$countryName,
            city = .x$primaryAddress$addressLocality$name,
            region = .x$primaryAddress$addressRegion$name,
            postal_code = .x$primaryAddress$postalCode,
            street = .x$primaryAddress$streetAddress$line1,
            page = page_num
          )
        }) |>
        dplyr::bind_rows()
    }, error = function(e) {
      cat("Error on page", page_num, ":", e$message, "\n")
      return(NULL)
    })
  }

  # iterate function over pages
  map(1:max_pages, get_page_data) |>
    compact() |> # Remove NULL results
    bind_rows()
}
res <- get_comp_data_from_all_pages()
res2 <- res %>% select(-page) %>% distinct() # drop the page column first; only 1000 distinct rows remain
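To see where the duplicates sit, you can group on the identifying columns built above and list the pages each duplicated company appears on, for example:

res %>%
  group_by(company_name, city, street) %>%
  filter(n() > 1) %>% # keep only companies that appear more than once
  summarise(pages = paste(sort(unique(page)), collapse = ", "),
            n_rows = n(),
            .groups = "drop") %>%
  arrange(company_name)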
It makes me question their data quality and whether those 14,387 records really exist.