r, text-mining, stringr, tm

Is the tm package suitable for extracting scores from text data?


I have a lot of cognitive assessment data stored as txt files. Each file looks like this:

patient number xxxxxx
score A        (98) (95)ile%
score B        (100) (97)ile%
test C
   score D     (76)
   score E     (80)
(the real report is longer and less regular than this)

As the example shows, the scores are not laid out consistently. The report is easy to read but hard to analyze. I want to extract the scores of each test for each patient and create a table for further analysis. I've never used a text mining function or package in R before, so I'm wondering whether it's more appropriate to do this with a text mining package, or whether it's OK to just treat the whole report as one very long string. What is the difference? Thanks!

The actual report I'm dealing with looks like this (I've replaced all actual numbers with "x"):

 ORIENTATION                                   R.S.      %ile      N/D/B
     Temporal orientation-Error score          ( xx)  ( xx  )%ile ( x )
     Orientation to Personal Information       ( x )/8   ( xx  )%ile ( x )
     Orientation to Place                      ( x )/4   ( xx  )%ile ( x )

   WMS-III - Verbal Memory                       R.S.     %ile       N/D/B
     Verbal Paired
       Associates-I       Scale score ( -  )   ( -  )/32 ( -   )%ile ( - )
     Verbal Paired
       Associates-II      Scale score ( -  )   ( -  )/8  ( -   )%ile ( - )
     Word List Memory-I   Scale score ( x  )   ( xx )/48 ( x   )%ile ( x )
     Word List Memory-II  Scale score ( x  )   ( xx  )/12 ( x   )%ile ( x )
     Logical Memory-I     Scale score ( x  )   ( xx )/75 ( x  )%ile ( x )
     Logical Memory-II    Scale score ( x  )   ( xx )/50 ( x   )%ile ( x )
     Faces Memory-I      Scale score ( -  )  ( -  )/48   ( -   )%ile ( - )
     Faces Memory -II    Scale score ( -  )  ( -  )/48   ( -   )%ile ( - )
     Visual
       Reproduction-I    Scale score ( x  )  ( xx  )/104 ( x  )%ile ( x )
     Visual
       Reproduction-II   Scale score ( x  )  ( xx  )/104 ( x   )%ile ( x )
     Spatial Memory
      F:( x ) B:( x )    Scale score ( xx )  ( xx )/32   ( xx  )%ile ( N )
   LANGUAGE                                      R.S.      %ile      N/D/B
     Visual Naming                             ( xx )/60 ( xx  )%ile ( x )
     Object Naming  A+B                        ( -  )/16 ( -   )%ile ( - )
     Aural Comprehension                       ( xx )/18 ( xx  )%ile ( x )
     Semantic Association of Verbal Fluency    ( xx  )   ( xx  )%ile ( x )
     (                                         (       ) (     )%ile (   )
     (                                         (       ) (     )%ile (   )
    WCST-S    Number cards used               ( xx  )               (   )
              Number complete categories      ( x/x )   ( x   )%ile ( x )
              Number perseverative errors     ( xx  )   ( x  )%ile ( x )
              Number non-perseverative errors ( xx  )   ( xx  )%ile ( x )
    Trails Making Test-Part A           Time  ( xx  )   ( xx  )%ile ( x )
    Trails Making Test-Part B           Time  ( N/A )   (     )%ile (   )
    (                                         (       ) (     )%ile (   )
    (                                         (       ) (     )%ile (   )

  SPATIAL PERCEPTUAL FUNCTION                   R.S.      %ile      N/D/B
    Judgment of Line Orientation  Form( x )   ( x )/30 ( xx  )%ile ( x )
    3-D Block Construction-Model  Form(   )
                                       score (     )/29 (     )%ile (   )
                                       Time  (     )s   (     )%ile (   )
    MANUAL DEXTERITY                              R.S.      %ile      N/D/B
      Purdue Pegboard                 RH         (    )   (     )%ile (   )
                                      LH         (    )   (     )%ile (   )
                                      Both Hands (    )   (     )%ile (   )
IMPRESSION:
< xxxxxxxxxxxxxxxxxxxxxxxxxxx  >
    Age : ( xx  ) y/o
    Edu : ( xx  ) yrs
    Handedness : ( xx  )

  MINI-MENTAL EXAMINATION                      R.S.      %ile      N/D/B
    MMSE                                     ( xx )/30 (     )%ile ( x )
    
    
  PERSONALITY ASSESSMENT                        R.S.      %ile      N/D/B
    SCL-90R                                  ( x.xx )   (     )%ile ( N )
    Frontal Behavioral Inventory
                 Negative behavior score   ( xx )/36  (     )%ile ( N )
                     Disinhibition score   ( x  )/36  (     )%ile ( x )
                             Total score   ( xx )/72  (     )%ile ( x )
    BDI-II                                   (    )/63  (     )%ile (   )
    BAI                                      (    )/6   (     )%ile (   )



Solution

  • It's hard to say without seeing the actual file (or a similar example file), but my guess is that you could use regular expressions to pull out what you need. If you do convert that whole thing into one long string, you'll probably get something like this:

    library(stringr)
    
    string <- "patient number xxxxxx\nscore A        (98) (95)ile%\nscore B        (100) (97)ile%\ntest C\n   score D     (76)\n   score E     (80)"
    
    

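    You could then pull out the patient number with something like this:

    # Lookbehind: keep only what follows "patient number " on that line.
    # The trailing space in the lookbehind keeps it out of the result.
    patient_number <- str_extract(string, "(?<=patient number ).*")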
    
    

    And then the score names like this:

    score_name <- str_extract_all(string, "score [A-Z]") %>% unlist()
    
    
    

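    Then the actual scores:

    # Match " (NNN)" plus an optional trailing space. Consuming that space means
    # the percentile "(95)" right after it has no leading space left to match,
    # so only the first number on each line (the raw score) is kept.
    score <- str_extract_all(string, " \\([0-9]{1,3}\\) ?") %>% 
      unlist() %>% 
      str_squish() %>% 
      str_remove_all("[()]")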
    
    

    Then put it all together into a dataframe

    scores_df <- data.frame(patient_number,score_name, score)
    
    
    > scores_df
      patient_number score_name score
    1         xxxxxx    score A    98
    2         xxxxxx    score B   100
    3         xxxxxx    score D    76
    4         xxxxxx    score E    80
    

    Please consider editing your question to include a sample of your actual data and a more specific description of what you want to pull out. If you do that we'll be able to give you much better help than this random example :)
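
The steps above extract names and scores in two separate passes and handle a single report. As a rough sketch only, the same idea can be wrapped in a function that captures each score name together with the first number after it, then applied to several reports to build one table. The helper name `extract_scores` and the two sample reports are made up for illustration; the pattern assumes the simplified layout from the question, not the real report.

```r
library(stringr)

# Hypothetical helper (not part of the original answer): parse one report,
# given as a single string, into a data frame with one row per score.
extract_scores <- function(report) {
  # Patient number: the rest of the line after "patient number".
  patient_number <- str_trim(str_extract(report, "(?<=patient number).*"))

  # Capture each score name and the first parenthesised number after it,
  # so names and values cannot drift out of step.
  m <- str_match_all(report, "(score [A-Z])\\s+\\((\\d{1,3})\\)")[[1]]

  data.frame(
    patient_number = rep(patient_number, nrow(m)),
    score_name     = m[, 2],
    score          = as.integer(m[, 3]),
    stringsAsFactors = FALSE
  )
}

# Two made-up reports in the simplified layout from the question.
reports <- c(
  "patient number 000001\nscore A        (98) (95)ile%\nscore B        (100) (97)ile%\ntest C\n   score D     (76)\n   score E     (80)",
  "patient number 000002\nscore A        (55) (40)ile%\ntest C\n   score D     (61)"
)

# One data frame per report, stacked into a single table.
all_scores <- do.call(rbind, lapply(reports, extract_scores))
print(all_scores)
```

For files on disk, each report could first be read into one string with `paste(readLines(path), collapse = "\n")` and then passed to the helper. The real report layout (for example `( xx )/48`) would need a different pattern than the one assumed here.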