I have a lot of cognitive assessment data stored as txt files. Each file looks like this:
patient number xxxxxx
score A (98) (95)ile%
score B (100) (97)ile%
test C
score D (76)
score E (80)
(the real report is longer and less orderly than this)
As the example shows, the scores are not laid out consistently. The report is easy to read but hard to analyze. I want to extract the scores of each test for each patient and build a table for further analysis. I have never used text-mining functions or packages in R before, so I'm wondering whether it is more appropriate to use a text-mining package, or whether it is fine to treat the whole report as one very long string. What is the difference? Thanks!
The actual report I'm dealing with looks like this (I've converted all actual numbers to "X"):
ORIENTATION R.S. %ile N/D/B
Temporal orientation-Error score ( xx) ( xx )%ile ( x )
Orientation to Personal Information ( x )/8 ( xx )%ile ( x )
Orientation to Place ( x )/4 ( xx )%ile ( x )
WMS-III - Verbal Memory R.S. %ile N/D/B
Verbal Paired
Associates-I Scale score ( - ) ( - )/32 ( - )%ile ( - )
Verbal Paired
Associates-II Scale score ( - ) ( - )/8 ( - )%ile ( - )
Word List Memory-I Scale score ( x ) ( xx )/48 ( x )%ile ( x )
Word List Memory-II Scale score ( x ) ( xx )/12 ( x )%ile ( x )
Logical Memory-I Scale score ( x ) ( xx )/75 ( x )%ile ( x )
Logical Memory-II Scale score ( x ) ( xx )/50 ( x )%ile ( x )
Faces Memory-I Scale score ( - ) ( - )/48 ( - )%ile ( - )
Faces Memory -II Scale score ( - ) ( - )/48 ( - )%ile ( - )
Visual
Reproduction-I Scale score ( x ) ( xx )/104 ( x )%ile ( x )
Visual
Reproduction-II Scale score ( x ) ( xx )/104 ( x )%ile ( x )
Spatial Memory
F:( x ) B:( x ) Scale score ( xx ) ( xx )/32 ( xx )%ile ( N )
LANGUAGE R.S. %ile N/D/B
Visual Naming ( xx )/60 ( xx )%ile ( x )
Object Naming A+B ( - )/16 ( - )%ile ( - )
Aural Comprehension ( xx )/18 ( xx )%ile ( x )
Semantic Association of Verbal Fluency ( xx ) ( xx )%ile ( x )
( ( ) ( )%ile ( )
( ( ) ( )%ile ( )
WCST-S Number cards used ( xx ) ( )
Number complete categories ( x/x ) ( x )%ile ( x )
Number perseverative errors ( xx ) ( x )%ile ( x )
Number non-perseverative errors ( xx ) ( xx )%ile ( x )
Trails Making Test-Part A Time ( xx ) ( xx )%ile ( x )
Trails Making Test-Part B Time ( N/A ) ( )%ile ( )
( ( ) ( )%ile ( )
( ( ) ( )%ile ( )
SPATIAL PERCEPTUAL FUNCTION R.S. %ile N/D/B
Judgment of Line Orientation Form( x ) ( x )/30 ( xx )%ile ( x )
3-D Block Construction-Model Form( )
score ( )/29 ( )%ile ( )
Time ( )s ( )%ile ( )
MANUAL DEXTERITY R.S. %ile N/D/B
Purdue Pegboard RH ( ) ( )%ile ( )
LH ( ) ( )%ile ( )
Both Hands ( ) ( )%ile ( )
IMPRESSION:
< xxxxxxxxxxxxxxxxxxxxxxxxxxx >
Age : ( xx ) y/o
Edu : ( xx ) yrs
Handedness : ( xx )
MINI-MENTAL EXAMINATION R.S. %ile N/D/B
MMSE ( xx )/30 ( )%ile ( x )
PERSONALITY ASSESSMENT R.S. %ile N/D/B
SCL-90R ( x.xx ) ( )%ile ( N )
Frontal Behavioral Inventory Negative behavior score ( xx )/36 ( )%ile ( N )
Disinhibition score ( x )/36 ( )%ile ( x )
Total score ( xx )/72 ( )%ile ( x )
BDI-II ( )/63 ( )%ile ( )
BAI ( )/6 ( )%ile ( )
It's hard to say without seeing the actual file (or a similar example file), but my guess is that you could use regular expressions to pull out what you need. If you do convert that whole thing into one long string, you'll probably get something like this:
library(stringr)
string <- "patient number xxxxxx\nscore A (98) (95)ile%\nscore B (100) (97)ile%\ntest C\n score D (76)\n score E (80)"
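If the report lives in a file rather than in a hand-typed string, one way to get the same kind of single string is to read the lines and join them with newlines. This is only a minimal base-R sketch: the file name `patient_report.txt` and the variable `string_from_file` are made up for illustration, and the sample file is written to a temporary directory just so the snippet runs on its own.

```r
# Hypothetical file name; the file is created here only to keep the sketch runnable
path <- file.path(tempdir(), "patient_report.txt")
writeLines(
  c("patient number xxxxxx",
    "score A (98) (95)ile%",
    "score B (100) (97)ile%",
    "test C",
    " score D (76)",
    " score E (80)"),
  path
)

# Read every line of the file, then join the lines with "\n" into one string
string_from_file <- paste(readLines(path), collapse = "\n")
```

With real data you would skip the `writeLines()` step and point `path` at an existing report.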
You could then pull out patient numbers with something like this:
patient_number <- str_extract(string, "(?<=patient number ).*")
And then the score names like this:
score_name <- str_extract_all(string, "score [A-Z]") %>% unlist()
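Then the actual scores:
# Match a parenthesised 1-3 digit number that follows a space. The optional
# trailing space is consumed as part of the match, so the percentile "(95)"
# that comes right after it has no leading space left and is skipped.
score <- str_extract_all(string, " \\([0-9]{1,3}\\)( |(?=\\\n)|)") %>%
  unlist() %>%
  str_squish() %>%            # trim the surrounding spaces
  str_replace("\\(", "") %>%  # drop the opening parenthesis
  str_replace("\\)", "")      # drop the closing parenthesis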
Then put it all together into a data frame:
scores_df <- data.frame(patient_number, score_name, score)
scores_df
> scores_df
  patient_number score_name score
1         xxxxxx    score A    98
2         xxxxxx    score B   100
3         xxxxxx    score D    76
4         xxxxxx    score E    80
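Since there are many files, the steps above could be wrapped in a function and applied to each report in turn. The following is a rough sketch rather than a finished solution: `extract_scores` is a hypothetical helper name, the second report (patient `yyyyyy`) is a made-up placeholder, and the single regex with capture groups is a swapped-in alternative to the separate name and score extractions above.

```r
library(stringr)

# Hypothetical helper: turns one report string into a data frame of scores
extract_scores <- function(report) {
  # Capture the score label and the first parenthesised number after it;
  # the percentile in the second pair of parentheses is never captured
  m <- str_match_all(report, "(score [A-Z]) \\((\\d{1,3})\\)")[[1]]
  patient <- str_trim(str_extract(report, "(?<=patient number).*"))
  data.frame(
    patient_number = rep(patient, nrow(m)),  # zero rows if nothing matched
    score_name = m[, 2],
    score = as.integer(m[, 3]),
    stringsAsFactors = FALSE
  )
}

# Two example reports; the second one is invented purely for illustration
reports <- c(
  "patient number xxxxxx\nscore A (98) (95)ile%\nscore B (100) (97)ile%\ntest C\n score D (76)\n score E (80)",
  "patient number yyyyyy\nscore A (98) (95)ile%\n score E (80)"
)

# One data frame per report, stacked into a single long table
all_scores <- do.call(rbind, lapply(reports, extract_scores))
```

A long table like this (one row per patient and score) can later be reshaped to one row per patient if that suits the analysis better.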
Please consider editing your question to include a sample of your actual data and a more specific description of what you want to pull out. If you do that we'll be able to give you much better help than this random example :)
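For lines in the longer report excerpt that follow the shape `label ( raw )/max ( percentile )%ile ( flag )`, a line-by-line regex might be a starting point. This is a sketch under that assumption only: the column names `test`, `raw_score`, `max_score`, `percentile` and `flag` are my own labels, and lines with a different layout (for example the ones with an extra "Scale score" field, or labels split across two lines) will not match this pattern and would need their own handling.

```r
library(stringr)

# Three lines copied from the report excerpt, with the masked values kept as-is
report_lines <- c(
  "Orientation to Place ( x )/4 ( xx )%ile ( x )",
  "Visual Naming ( xx )/60 ( xx )%ile ( x )",
  "Object Naming A+B ( - )/16 ( - )%ile ( - )"
)

# One parenthesised field: optional spaces, captured content, optional spaces
paren <- "\\(\\s*([^\\(\\)]*?)\\s*\\)"

# label, raw score, "/maximum", percentile followed by "%ile", then the flag
pattern <- paste0("^(.*?)\\s*", paren, "/(\\d+)\\s*", paren, "%ile\\s*", paren)

# str_match() returns one row per line; non-matching lines become rows of NA
m <- str_match(report_lines, pattern)

parsed <- data.frame(
  test = m[, 2],
  raw_score = m[, 3],
  max_score = as.integer(m[, 4]),
  percentile = m[, 5],
  flag = m[, 6],
  stringsAsFactors = FALSE
)
```

The score columns are left as character here because the real values may be numbers, "-", "N/A" or empty; converting them to numeric is a separate cleaning step.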