Tags: r, tm, reproducible-research, data-scrubbing, anonymize

Anonymize names in paragraph variable by matching and replacement


I am analyzing a school's student report card database. My dataset consists of around 3000 records structured similarly to the example below. Each observation is one teacher's assessment of one student. Each observation contains a three-sentence narrative comment.

To share the results of my analysis, I would like to scrub mentions of student names from the comments and replace them with other names. In an ideal world I would also like to share an anonymized version of the database for the sake of reproducibility.

The inconsistent use of student names (first name vs. nickname vs. full name) and their unstructured placement within the comments make this quite tricky for an amateur like me. I tried treating the comments as documents in a corpus and writing a function that uses tm::removeWords, but it didn't work for me. Thanks in advance!

Example Data (dput of table here)

  Teacher Subject      Student.Name                                                         Comment
1   Black    Math    Richard (Dick) Dick is a terrible student-- why hasn't he been kicked out yet?
2   Black    Math Elizabeth (Betty)                       Betty procrastinates, but does good work.
3   Black    Math   Mary Grace (MG)                      As her teacher, I think MG is my favorite.
4   Brown English    Richard (Dick)                      Richard is terrible at turning in homework.
5   Brown English Elizabeth (Betty)                Elizabeth's work is interfering with her studies.
6   Brown English   Mary Grace (MG)                         Mary Grace should be a teacher someday.
7    Blue    P.E.    Richard (Dick)  Richard (Dick) kicked more field goals than any other student.
8    Blue    P.E. Elizabeth (Betty)    Elizabeth (Betty) needs to work to communicate on the field.
9    Blue    P.E.   Mary Grace (MG)             Mary Grace (MG) needs to stop insulting the teacher
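
Since the linked dput is not reproduced in this text, here is a minimal sketch that rebuilds the table above as a data frame. The object name `reports` is an assumption borrowed from the answer's code, and the columns are assumed to be character rather than factor.

```r
# Rebuild the example table shown above.
# `reports` is an assumed object name; character columns are an assumption too.
students <- c("Richard (Dick)", "Elizabeth (Betty)", "Mary Grace (MG)")

reports <- data.frame(
  Teacher      = rep(c("Black", "Brown", "Blue"), each = 3),
  Subject      = rep(c("Math", "English", "P.E."), each = 3),
  Student.Name = rep(students, times = 3),
  Comment      = c(
    "Dick is a terrible student-- why hasn't he been kicked out yet?",
    "Betty procrastinates, but does good work.",
    "As her teacher, I think MG is my favorite.",
    "Richard is terrible at turning in homework.",
    "Elizabeth's work is interfering with her studies.",
    "Mary Grace should be a teacher someday.",
    "Richard (Dick) kicked more field goals than any other student.",
    "Elizabeth (Betty) needs to work to communicate on the field.",
    "Mary Grace (MG) needs to stop insulting the teacher"
  ),
  stringsAsFactors = FALSE
)
```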

Desired Data

Teacher Subject Student.Name Comment
Black   Math    A            A is a terrible student-- why hasn't he been kicked out yet?
Black   Math    B            B procrastinates, but does good work.
Black   Math    C            As her teacher, I think C is my favorite.
Brown   English A            A is terrible at turning in homework
Brown   English B            B's work is interfering with her studies.
Brown   English C            C should be a teacher someday.
Blue    P.E.    A            A kicked more field goals than any other student.
Blue    P.E.    B            B needs to work to communicate on the field.
Blue    P.E.    C            C needs to stop insulting the teacher

N.B.

Four months ago, I asked a version of this question to no reply. I thought it would help to show my solution but perhaps the tm package is not widely used. So here's another shot.


Solution

  • I would use mgsub from the qdap package here. You could do something like the following. Take care that each student is always mapped to the same id; this version may be too specific to your example, because it assumes every student has exactly one nickname in parentheses:

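    student_names <- unique(as.character(reports$Student.Name))
    # One random id per student; call set.seed() first if the ids must be reproducible
    ids <- sample(100000, length(student_names))
    
    # Every form a name can take in a comment: the full "Name (Nickname)" string,
    # the nickname inside the parentheses, and the name without the nickname.
    # Assumes each Student.Name contains exactly one parenthesised nickname.
    tocheck <- c(
      student_names, 
      unlist(regmatches(student_names, gregexpr("(?<=\\().*?(?=\\))", student_names, perl = TRUE))),
      gsub("\\s*\\([^\\)]+\\)", "", student_names)
    )
    # tocheck holds three forms per student, hence rep(ids, 3)
    reports$Comment <- qdap::mgsub(tocheck, rep(ids, 3), reports$Comment)
    # Look each row's id up by name so the result does not depend on row order
    reports$Student.Name <- ids[match(reports$Student.Name, student_names)]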
    
      Student.Name                                                          Comment
    1        61034 61034 is a terrible student-- why hasn't he been kicked out yet?
    2        45005                        45005 procrastinates, but does good work.
    3        13699                    As her teacher, I think 13699 is my favorite.
    4        61034                         61034 is terrible at turning in homework
    5        45005                    45005's work is interfering with her studies.
    6        13699                               13699 should be a teacher someday.
    7        61034            61034 kicked more field goals than any other student.
    8        45005                 45005 needs to work to communicate on the field.
    9        13699                        13699 needs to stop insulting the teacher
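
If you would rather avoid the qdap dependency and get the letter labels from the desired output, the same idea can be sketched in base R. This is only an illustration under stated assumptions: `anonymize_reports` and `demo` are hypothetical names, every Student.Name is assumed to look like "Name (Nickname)", and letters are used only while there are at most 26 students. Unlike a plain fixed-string replacement, it requires a non-alphanumeric character on either side of a name, so a short nickname such as "MG" is not replaced inside a longer word.

```r
# Hypothetical helper: replace every form of each student's name with a label.
# Assumes Student.Name looks like "Name (Nickname)".
anonymize_reports <- function(reports) {
  full <- unique(as.character(reports$Student.Name))
  n <- length(full)
  # Letters as in the desired output; fall back to S1, S2, ... beyond 26 students
  labels <- if (n <= 26) LETTERS[seq_len(n)] else paste0("S", seq_len(n))

  nick <- sub("^.*\\((.*)\\).*$", "\\1", full)    # text inside the parentheses
  bare <- trimws(sub("\\s*\\(.*\\)", "", full))   # name without the nickname

  key <- data.frame(
    pattern = c(full, bare, nick),
    label   = rep(labels, 3),
    stringsAsFactors = FALSE
  )
  # Longest patterns first, so "Richard (Dick)" is handled before "Dick"
  key <- key[order(-nchar(key$pattern)), ]

  comment <- as.character(reports$Comment)
  for (i in seq_len(nrow(key))) {
    # \Q...\E quotes the name literally; the lookarounds reject matches
    # that sit inside a longer alphanumeric word
    rx <- paste0("(?<![[:alnum:]])\\Q", key$pattern[i], "\\E(?![[:alnum:]])")
    comment <- gsub(rx, key$label[i], comment, perl = TRUE)
  }

  reports$Comment <- comment
  reports$Student.Name <- labels[match(reports$Student.Name, full)]
  reports
}

# Three rows taken from the example data
demo <- data.frame(
  Student.Name = c("Richard (Dick)", "Elizabeth (Betty)", "Mary Grace (MG)"),
  Comment = c(
    "Dick is a terrible student-- why hasn't he been kicked out yet?",
    "Elizabeth's work is interfering with her studies.",
    "Mary Grace (MG) needs to stop insulting the teacher"
  ),
  stringsAsFactors = FALSE
)
out <- anonymize_reports(demo)
print(out)
```

Matching is case-sensitive and only covers the name forms present in Student.Name, so misspellings or other nicknames in the comments would still need a manual check before sharing the data.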