rtext-miningtext-manipulation

Organizing text blocks as rows in R


I have a text file containing details of about 1000 articles, each containing the same items (PMC, PMID..... AID, SO). I need to have every article details as a single row such that the item names are read as column names, and the respective content as row values:

PMC              PMID        ........      SO

PMC8882595       35237547    ........      Front ....69001
PMC8967719       35296856    ........      Child Yo... 040

The author names and affiliations columns (FAU, AU, AD) in between are repeated, and I just need to extract the country name from that content, if possible. Or just merge them otherwise.

I am really sorry about the poor phrasing of the question. Here is a glimpse of first two articles to describe it better.

PMC - PMC8882595
PMID- 35237547
IS  - 2296-2565 (Electronic)
VI  - 10
DP  - 2022
TI  - Poverty Vulnerability and Health Risk Action Path of Families of Rural Elderly
      With Chronic Diseases: Empirical Analysis of 1,852 Families in Central and
      Western China.
LID - 776901
AB  - Health poverty has become the most important cause of poverty and return to
      poverty. Understanding the health risk factors and action paths of poverty in
      families of rural elderly with chronic diseases is important to alleviate return 
      to poverty because of illness. This study selected families with at least one
      elderly member (over 60 years old) with chronic diseases (sample size was 1,852
      people and localities, the causes of poverty and returning to poverty, and the
      types of poverty vulnerabilities. The use efficiency of medical insurance should 
      be further improved, and the responsibility of medical insurance targeted poverty
      alleviation must be clarified.
FAU - Ma, Ying
AU  - Ma Y
AD  - Department of Health Management, School of Medicine and Health Management, Tongji
      Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Xiang, Qin
AU  - Xiang Q
AD  - Department of Health Management, School of Medicine and Health Management, Tongji
      Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Yan, Chaoyang
AU  - Yan C
AD  - Department of Health Management, School of Medicine and Health Management, Tongji
      Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Liao, Hui
AU  - Liao H
AD  - Department of Health Management, School of Medicine and Health Management, Tongji
      Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Wang, Jing
AU  - Wang J
AD  - Department of Health Management, School of Medicine and Health Management, Tongji
      Medical College, Huazhong University of Science and Technology, Wuhan, China
LA  - eng
PT  - Journal Article
DEP - 20220214
PHST- 2021/09/14 [received]
PHST- 2022/01/20 [accepted]
TA  - Front Public Health
JT  - Frontiers in Public Health
AID - 10.3389/fpubh.2022.776901 [doi]
SO  - Front Public Health. 2022 Feb 14;10:. doi:10.3389/fpubh.2022.776901.


    
    
PMC - PMC6107082
PMID- 30147212
IS  - 0190-7409 (Print)
VI  - 79
DP  - 2017 Aug
TI  - Family poverty and neighborhood poverty: Links with children’s school readiness
      before and after the Great Recession.
PG  - 368-84
AB  - This paper examines how neighborhood and family poverty predict children’s
      academic skills and classroom behavior at school entry, and whether associations 
      additional support.
FAU - Wolf, Sharon
AU  - Wolf S
AD  - Graduate School of Education, University of Pennsylvania, 3700 Walnut Street,
      Room 340, Philadelphia, PA 19104, United States
FAU - Magnuson, Katherine A.
AU  - Magnuson KA
AD  - Institute for Research on Poverty, University of Wisconsin – Madison, 1180
      Observatory Drive, Madison, WI 53706, United States
FAU - Kimbro, Rachel T.
AU  - Kimbro RT
AD  - Rice University, 6100 Main St., MS-28, Houston, TX 77005, United States
LA  - eng
PT  - Journal Article
DEP - 20170623
TA  - Child Youth Serv Rev
JT  - Children and youth services review
AID - 10.1016/j.childyouth.2017.06.040 [doi]
MID - NIHMS984695
SO  - Child Youth Serv Rev. 2017 Aug;79:368-84. Epub 2017 Jun 23
      doi:10.1016/j.childyouth.2017.06.040.

Solution

  • Here's an admittedly brute-force start for you.

    Up front, though, you will need to put some thought onto how you want to aggregate your fields. In general, all of the fields (e.g., "PMC", "PHST", "FAU") must recycling safely, with a length of exactly 1 or the same non-1 length from each group. In this data, in the first batch of data, you have PHST with length 2 and FAU with length 5. I know this is known based on the variable data, and that's fine, but that means you need to come up with a clean way to aggregate them. Some options might include list-columns or text-reduction using tokenization, pattern-matching, or whatever. You said you want to extract just the country from the multiple AU fields, that's fine, but ... that's not always easy given the free-text flow of those fields.

    I'm addressing everything except how to find country names and reduce those fields. I strongly encourage looking into list-columns (as opposed to my naive string-concatenation) and some other functions to do the aggregation. (Perhaps another question?)

    For now, my naive aggregator:

    agg_fun <- function(x, fld) {
      # do something different based on fld[1] such as extract a country
      paste(x, collapse = " ")
    }
    

    First, to read in that data, I'll use some bespoke code to turn it into a frame with which we can work a little more easily:

    txt <- readLines("~/StackOverflow/12121693/72706285.txt")
    tmp1 <- tmp2 <- character(0)
    leadsp <- grepl("^\\s", txt)
    tmp1[!leadsp] <- sub("^(\\S+)\\s*-.*", "\\1", txt[!leadsp])
    tmp2[!leadsp] <- sub("^\\S+\\s*-\\s*(.*)", "\\1", txt[!leadsp])
    tmp2[leadsp] <- sub("^ *", "", txt[leadsp])
    
    tmpdat <- tibble(field = tmp1, value = tmp2)
    tmpdat
    # # A tibble: 82 x 2
    #    field value                                                                         
    #    <chr> <chr>                                                                         
    #  1 PMC   PMC8882595                                                                    
    #  2 PMID  35237547                                                                      
    #  3 IS    2296-2565 (Electronic)                                                        
    #  4 VI    10                                                                            
    #  5 DP    2022                                                                          
    #  6 TI    Poverty Vulnerability and Health Risk Action Path of Families of Rural Elderly
    #  7 NA    With Chronic Diseases: Empirical Analysis of 1,852 Families in Central and    
    #  8 NA    Western China.                                                                
    #  9 LID   776901                                                                        
    # 10 AB    Health poverty has become the most important cause of poverty and return to   
    # # ... with 72 more rows
    

    With that, here's a dplyr/tidyr flow using that data:

    library(dplyr)
    library(tidyr) # pivot_wider
    tibble(field = tmp1, value = tmp2) %>%
      filter(nzchar(trimws(field)) | nzchar(trimws(value))) %>%
      group_by(grp = cumsum(!is.na(field))) %>%
      summarize(field = field[1], value = paste(value, collapse = " ")) %>%
      ungroup() %>%
      group_by(grp = cumsum(field == "PMC"), field) %>%
      summarize(value = agg_fun(value, field[1])) %>%
      pivot_wider(grp, names_from = "field", values_from = "value") %>%
      ungroup()
    # # A tibble: 2 x 22
    #     grp AB                                      AD                                      AID        AU        DEP    DP     FAU          IS       JT        LA    LID    PHST      PMC   PMID  PT     SO                TA     TI                         VI    MID   PG   
    #   <int> <chr>                                   <chr>                                   <chr>      <chr>     <chr>  <chr>  <chr>        <chr>    <chr>     <chr> <chr>  <chr>     <chr> <chr> <chr>  <chr>             <chr>  <chr>                      <chr> <chr> <chr>
    # 1     1 Health poverty has become the most imp~ Department of Health Management, Schoo~ 10.3389/f~ Ma Y Xia~ 20220~ 2022   Ma, Ying Xi~ 2296-25~ Frontier~ eng   776901 2021/09/~ PMC8~ 3523~ Journ~ "Front Public He~ Front~ Poverty Vulnerability and~ 10    NA    NA   
    # 2     2 This paper examines how neighborhood a~ Graduate School of Education, Universi~ 10.1016/j~ Wolf S M~ 20170~ 2017 ~ Wolf, Sharo~ 0190-74~ Children~ eng   NA     NA        PMC6~ 3014~ Journ~ "Child Youth Ser~ Child~ Family poverty and neighb~ 79    NIHM~ 368-~
    

    You can likely now discard grp (it was artifical), and if you look deeper into many of these strings, you'll realize some of them are quite long.