To search for keywords with text mining tools, I need to retrieve the abstract available at each of the URLs in the dat0 dataframe (the example URLs given below are taken from this website) and store it in a second column named "abstract". The desired output is shown further down as dat1.
The challenge is to open each URL, using a loop or other mapping, retrieve the text between the first occurrence of "Abstract" and the final "Issue Section" (excluding these start and end markers themselves), and store it in the second column of the dataframe. The first element of the abstract column in dat1 below shows the expected result for the first URL.
The loop/mapping is necessary because there are more than 700 URLs in the real dataframe, and therefore more than 700 abstracts to retrieve.
Thanks for your help.
Initial data:
dat0 <- structure(list(url = c("https://doi.org/10.1093/clinchem/hvae106.001",
"https://doi.org/10.1093/clinchem/hvae106.002", "https://doi.org/10.1093/clinchem/hvae106.003"
)), class = "data.frame", row.names = c(NA, -3L))
Desired output:
dat1 <- structure(list(url = c("https://doi.org/10.1093/clinchem/hvae106.001",
"https://doi.org/10.1093/clinchem/hvae106.002", "https://doi.org/10.1093/clinchem/hvae106.003"
), abstract = c("Background\nCardiovascular disease (CVD) remains the leading cause of death in developed countries. Elevated levels of the gut-microbe-associated metabolite trimethylamine-N-oxide (TMAO) have been associated with increased risk for CVD mortality in many large independent studies. In fact, large laboratory corporations such as Labcorp and Quest Diagnostic now offer TMAO diagnostic tests for the assessment of CVD risk and as a marker for disease-associated dysbiosis, using samples obtained from fasting patients. Although the strong association between TMAO levels and CVD risk has been established in fasting blood samples, here we are investigating the potential for a single meal to dynamically alter TMAO-related metabolites based on sex, diverse dietary substrates, and microbial influences.\n\nMethods\n35 healthy participants were randomized to one of four study groups with approximately 8-9 subjects in each arm. Half of the participants were randomly assigned to receive a three-day broad spectrum antibiotic regimen (ciprofloxacin, metronidazole, and vancomycin) while the others received no antibiotics. These two groups were further subdivided into a group consuming a highly processed meal (originating from local fast food restaurants) or alternatively a whole food meal (containing a variety of fruits, vegetables, and healthy fiber). Blood samples were taken at baseline (after an overnight fast), and then postprandially at 15 minutes, 30 minutes, 1 hour, 2 hours, 4 hours, and 6 hours after meal ingestion. Targeted plasma metabolites were quantified using a stable isotope dilution, liquid chromatography - tandem mass spectrometry (LC-MS/MS) method\n\nResults\nWhile plasma TMAO levels between food groups did not change significantly within the total cohort, there were clear individualized responses to either highly processed or whole food meals. 
Separation based on sex showed that TMAO levels were significantly reduced in females at 6 hours but remained steady in males for both food groups. Particularly, in the processed food group, TMAO levels were significantly lower in females than males at the 6-hour time point. Neither of these observations held true for TMAO precursors choline, carnitine, or betaine. However, plasma levels of a dietary precursor for TMAO, γ-butyrobetaine, showed clear diet-microbe-host interactions. For males in the processed food group, plasma γ-butyrobetaine levels were significantly increased in subjects on broad spectrum antibiotics.\n\nConclusions\nOur results show the postprandial levels of TMAO, and its nutrient precursors dynamically change in a diet-, microbe-, and sex-dependent manner. These findings provide new insights into the postprandial levels of TMAO-related metabolites and may inform precision nutritional approaches in those who could benefit from TMAO-lowering strategies.",
"Background\nHigh-sensitivity cardiac troponin (hs-cTn) plays a pivotal role in the early diagnosis of Acute Coronary Syndrome (ACS). With technological advancements, the sensitivity of troponin assays has significantly improved. The International Federation of Clinical Chemistry (IFCC) criteria for hs-cTn assays include a total imprecision (Coefficient of Variation, CV) ≤10% at the gender-specific 99th percentile value and detectable levels in ≥50% of a healthy population. This study aims to evaluate the performance of Zybio's hs-cTnI assay against these high-sensitivity standards\n\nMethods\nThis study involved 1661 individuals undergoing routine health checks at Chongqing Medical University Third Hospital. Exclusion criteria included recent medication use, abnormal NT-proBNP levels, significant cardiac silhouette changes in chest X-rays, age <20, incomplete data, pregnancy, or history of severe chronic diseases. Participants were classified into apparent healthy and healthy groups based on blood pressure, lipid profiles, and electrocardiogram results. Laboratory examinations included hs-cTn and NT-ProBNP concentrations using Zybio's chemiluminescence instrument EXI2400 and associated reagent kits, with other parameters measured using standard biochemical and high-performance liquid chromatography methods.\n\nResults\nOf the 1247 participants (556 males, 691 females) included in the final analysis, 1157 showed detectable hs-cTnI levels above the limit of detection (LOD), yielding an overall detection rate of 92.78%. Detection rates were 93.71% in males and 92.04% in females. The total imprecision (CV) of hs-cTnI at gender-specific 99th percentile values over 20 days was below 10%, meeting the high-sensitivity criteria. This finding was consistent across lower and higher concentration ranges.\n\nConclusions\nThe Zybio hs-cTnI assay demonstrated a high detection rate in a healthy population, with 92.78% detectability overall, 93.71% in males, and 92.04% in females. 
The assay met the high-sensitivity criteria of IFCC, with a total imprecision (CV) of less than 10% at the gender-specific 99th percentile levels. These results validate the utility of Zybio's hs-cTnI assay for clinical application in the early diagnosis of ACS.",
"Background\nIn July 2024, the Clinical Laboratory Improvement Act (CLIA) proficiency testing (PT) criteria will directly regulate Troponin I performance for the very first time. The new CLIA goal is 0.9 ng/mL or 30%, whichever is greater. CAP previously set a goal of 30% or 3 times the group standard deviation (SD), whichever is greater, a more permissive setting. Estimates of current instrument group performance from an international proficiency testing (PT) survey have shown none of the 5 major diagnostic instruments can achieve the biological minimum goal at a 6-Sigma level, while 4 of the 5 instruments perform at 3 Sigma or below. Performance of these platforms was assessed using the methodology introduced in 2006 by Westgard JO and Westgard SA. The DxI 9000 high sensitivity Troponin I assay was assessed to determine if it could achieve the new CLIA 2024 goals.\n\nMethods\nThe DxI 9000 high sensitivity Troponin I assay was assessed with three reagent lots, on both serum and Lithium Heparin (LiHep) samples, following Clinical Laboratory Standards Institute (CLSI) protocols EP05 and EP09 to estimate imprecision and bias. The new CLIA 2024 PT criteria supplied the allowable total error for the standard Sigma-metric calculation: Sigma-metric = (TEa - |bias|) / SD The Sigma-metric predicts not only future problems with PT, but also potential optimization of QC procedures, including fewer Westgard Rules, control levels, even reduced QC frequency which can lead to less cost, time, and materials.\n\nResults\nThe majority (91.7%) of data points across the analytical measuring range for DxI 9000’s high sensitivity Troponin I assay from both serum and LiHep samples achieved 6-Sigma performance. For serum, only 8.3% of samples achieved 5-sigma performance. For LiHep, only 4.2% of the performance was 4 Sigma. 
None of the performance was 3 Sigma or lower.\n\nConclusions\nThe superior precision observed on DxI 9000 high sensitivity Troponin I delivers overwhelming 6-sigma performance when assessed by CLIA’s 2024 goal. This assay is highly unlikely to face PT difficulties and can be optimized for reduced Westgard Rules, reduced control levels leading to a reduction in time, materials, and cost."
)), class = "data.frame", row.names = c(NA, -3L))
As r2evans notes, scraping a bunch of CloudFlare pages is going to be difficult and possibly against their terms of service, so an API is preferable. Fortunately, such APIs exist, for example the Crossref REST API.
The result for your first DOI can be obtained at https://api.crossref.org/works/10.1093/clinchem/hvae106.001. As you can see, the result is JSON, and the abstract field is slightly strangely formatted: it contains a bunch of HTML-like (JATS XML) tags such as <jats:title>Background</jats:title>. Fortunately, these tags are quite helpful for showing us where the abstract starts, and they can be easily removed with a function like this:
clean_abstracts <- function(abstracts) {
  # note: the `x = _` pipe placeholder needs R >= 4.2
  abstracts |>
    # remove anything before the first <jats:title> tag
    sub("^.*?<jats:title>", "", x = _) |>
    # remove the <jats:title> etc. tags
    gsub("<.*?>", "", x = _) |>
    # replace multiple newline/space combos with one newline
    gsub("\n+\\s*\n*", "\n", x = _) |>
    trimws()
}
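To see what each step contributes, the same three substitutions can be run one at a time on a short string. This is only a sketch: the sample text is made up to resemble the tagged abstract field and is not copied from a real response.

```r
# Made-up sample imitating the tagged abstract field (not real API output)
sample_abstract <- paste0(
  "<jats:title>Abstract</jats:title>\n   <jats:sec>\n      ",
  "<jats:title>Background</jats:title>\n      ",
  "<jats:p>Some text.</jats:p>\n   </jats:sec>"
)

# Step 1: drop everything up to and including the first <jats:title>
step1 <- sub("^.*?<jats:title>", "", sample_abstract)

# Step 2: strip all remaining tags
step2 <- gsub("<.*?>", "", step1)

# Step 3: collapse newline/whitespace runs into single newlines, then trim
step3 <- trimws(gsub("\n+\\s*\n*", "\n", step2))

# Print the cleaned text, one heading or paragraph per line
cat(step3, "\n")
```

Note that the first heading in the field is "Abstract" itself, so it survives the cleaning as the first line.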
Using this approach to clean the abstracts, we can write a function that obtains them by looping through the URLs (keeping in mind the rate limit of 50 requests per second):
get_abstracts <- function(dois, base_url = "https://api.crossref.org/works/") {
  # needs the httr package; %||% is in base R from 4.4.0 (otherwise use rlang's)
  # turn each DOI link into the matching API URL
  urls <- paste0(base_url, sub("https://doi.org/", "", dois, fixed = TRUE))
  results <- lapply(urls, \(url) {
    Sys.sleep(0.02) # stay under the rate limit of 50 requests per second
    httr::GET(url)
  })
  abstracts <- vapply(results, \(result) {
    # return NA if the request did not succeed
    if (httr::status_code(result) != 200) {
      return(NA_character_)
    }
    # return NA rather than NULL if no abstract
    httr::content(result)$message$abstract %||% NA_character_
  }, character(1))
  clean_abstracts(abstracts)
}
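The sub() call in get_abstracts assumes every entry starts with exactly https://doi.org/. If the real dataframe mixes forms (http://, dx.doi.org, or bare DOIs), a slightly more tolerant conversion might help. A sketch, where doi_to_api_url is a hypothetical helper name and the set of prefixes handled is an assumption about what the data might contain:

```r
# Hypothetical helper: turn DOI links (or bare DOIs) into Crossref API URLs
doi_to_api_url <- function(dois, base_url = "https://api.crossref.org/works/") {
  # trim whitespace, then strip an optional http(s)://(dx.)doi.org/ prefix
  bare <- sub("^https?://(dx\\.)?doi\\.org/", "", trimws(dois),
              ignore.case = TRUE)
  paste0(base_url, bare)
}

api_urls <- doi_to_api_url(c(
  "https://doi.org/10.1093/clinchem/hvae106.001",
  "http://dx.doi.org/10.1093/clinchem/hvae106.002",
  "10.1093/clinchem/hvae106.003"
))
print(api_urls)
```

All three inputs end up in the same https://api.crossref.org/works/<DOI> form that the function above requests.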
You can then apply get_abstracts() to your dataframe:
dat0 |>
  transform(abstract = get_abstracts(url))
Output:
  url                                          abstract
  <chr>                                        <chr>
1 https://doi.org/10.1093/clinchem/hvae106.001 "Abstract\nBackground\nCardiovascular disease (CVD) remains the leading cause of death in developed countries. Elevated le…"
2 https://doi.org/10.1093/clinchem/hvae106.002 "Abstract\nBackground\nHigh-sensitivity cardiac troponin (hs-cTn) plays a pivotal role in the early diagnosis of Acute Cor…"
3 https://doi.org/10.1093/clinchem/hvae106.003 "Abstract\nBackground\nIn July 2024, the Clinical Laboratory Improvement Act (CLIA) proficiency testing (PT) criteria will…"
Or the first abstract in full:
"Abstract\nBackground\nCardiovascular disease (CVD) remains the leading cause of death in developed countries. Elevated levels of the gut-microbe-associated metabolite trimethylamine-N-oxide (TMAO) have been associated with increased risk for CVD mortality in many large independent studies. In fact, large laboratory corporations such as Labcorp and Quest Diagnostic now offer TMAO diagnostic tests for the assessment of CVD risk and as a marker for disease-associated dysbiosis, using samples obtained from fasting patients. Although the strong association between TMAO levels and CVD risk has been established in fasting blood samples, here we are investigating the potential for a single meal to dynamically alter TMAO-related metabolites based on sex, diverse dietary substrates, and microbial influences.\nMethods\n35 healthy participants were randomized to one of four study groups with approximately 8-9 subjects in each arm. Half of the participants were randomly assigned to receive a three-day broad spectrum antibiotic regimen (ciprofloxacin, metronidazole, and vancomycin) while the others received no antibiotics. These two groups were further subdivided into a group consuming a highly processed meal (originating from local fast food restaurants) or alternatively a whole food meal (containing a variety of fruits, vegetables, and healthy fiber). Blood samples were taken at baseline (after an overnight fast), and then postprandially at 15 minutes, 30 minutes, 1 hour, 2 hours, 4 hours, and 6 hours after meal ingestion. Targeted plasma metabolites were quantified using a stable isotope dilution, liquid chromatography - tandem mass spectrometry (LC-MS/MS) method\nResults\nWhile plasma TMAO levels between food groups did not change significantly within the total cohort, there were clear individualized responses to either highly processed or whole food meals. 
Separation based on sex showed that TMAO levels were significantly reduced in females at 6 hours but remained steady in males for both food groups. Particularly, in the processed food group, TMAO levels were significantly lower in females than males at the 6-hour time point. Neither of these observations held true for TMAO precursors choline, carnitine, or betaine. However, plasma levels of a dietary precursor for TMAO, γ-butyrobetaine, showed clear diet-microbe-host interactions. For males in the processed food group, plasma γ-butyrobetaine levels were significantly increased in subjects on broad spectrum antibiotics.\nConclusions\nOur results show the postprandial levels of TMAO, and its nutrient precursors dynamically change in a diet-, microbe-, and sex-dependent manner. These findings provide new insights into the postprandial levels of TMAO-related metabolites and may inform precision nutritional approaches in those who could benefit from TMAO-lowering strategies."
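The cleaned text still begins with the "Abstract" heading, which the question asked to leave out. A possible follow-up step is sketched here with a hypothetical helper name (drop_abstract_heading), assuming the heading always appears as the first line of the cleaned text:

```r
# Hypothetical helper: remove a leading "Abstract" heading line if present
drop_abstract_heading <- function(x) {
  # text without that heading, and NA values, pass through unchanged
  sub("^Abstract\\s*\n", "", x)
}

out <- drop_abstract_heading(c(
  "Abstract\nBackground\nSome text.", # heading removed
  "Background only",                  # no heading: unchanged
  NA                                  # missing abstract: stays NA
))
print(out)
```

It could be wrapped around the call when building the column, e.g. transform(abstract = drop_abstract_heading(get_abstracts(url))).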
DOIs are quite complicated, and there are a bunch of ways to register them. Various types of resources can have a DOI, e.g. GitHub repos, which I wouldn't expect to be on Crossref and which, even if they were, wouldn't generally have an abstract. Crossref say that they hold metadata for approximately 150 million scholarly artifacts. That covers the three papers in your sample data and a selection of others I just tried, including some published in the last few weeks. However, I did find one paper that it couldn't resolve (although technically it's at the "accepted" rather than "published" stage, it has a preprint and a DOI). I'm sure you'll find some papers that this approach misses: some will be dead or incorrect DOIs, while others will simply be missing from Crossref. The answer from r2evans would be a good approach to working out which is which, and to mopping up.
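A first step in that mopping up could be to separate the rows that came back as NA, since get_abstracts returns NA when the request does not succeed or the record has no abstract. A small sketch on made-up data (the second DOI is a placeholder, not a real identifier):

```r
# Made-up result table: the second row stands in for a DOI that returned nothing
res <- data.frame(
  url = c("https://doi.org/10.1093/clinchem/hvae106.001",
          "https://doi.org/10.0000/placeholder"),
  abstract = c("Background\nSome text.", NA_character_),
  stringsAsFactors = FALSE
)

# URLs that still need another source or a manual check
missing_urls <- res$url[is.na(res$abstract)]

# Rows that are ready for text mining
found <- res[!is.na(res$abstract), ]

print(missing_urls)
```

The missing_urls vector can then be passed to whichever fallback is used for the papers Crossref does not cover.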