I am analyzing a school's student report card database. My dataset consists of around 3000 records structured similarly to the example below. Each observation is one teacher's assessment of one student. Each observation contains a three-sentence narrative comment.
To share the results of my analysis, I would like to scrub mentions of student names from the comments and replace them with other names. In an ideal world I would also like to share an anonymized version of the database for the sake of reproducibility.
The inconsistent use of student names (first name vs. nickname vs. full name) and their unstructured placement within the comments make this quite tricky for an amateur like me. My attempt at solving this problem was to treat the comments as documents in a corpus and write a function that uses tm::removeWords, but it didn't work for me. Thanks in advance!
Teacher Subject Student.Name Comment
1 Black Math Richard (Dick) Dick is a terrible student-- why hasn't he been kicked out yet?
2 Black Math Elizabeth (Betty) Betty procrastinates, but does good work.
3 Black Math Mary Grace (MG) As her teacher, I think MG is my favorite.
4 Brown English Richard (Dick) Richard is terrible at turning in homework.
5 Brown English Elizabeth (Betty) Elizabeth's work is interfering with her studies.
6 Brown English Mary Grace (MG) Mary Grace should be a teacher someday.
7 Blue P.E. Richard (Dick) Richard (Dick) kicked more field goals than any other student.
8 Blue P.E. Elizabeth (Betty) Elizabeth (Betty) needs to work to communicate on the field.
9 Blue P.E. Mary Grace (MG) Mary Grace (MG) needs to stop insulting the teacher
Desired output:
Teacher Subject Student.Name Comment
Black Math A A is a terrible student-- why hasn't he been kicked out yet?
Black Math B B procrastinates, but does good work.
Black Math C As her teacher, I think C is my favorite.
Brown English A A is terrible at turning in homework
Brown English B B's work is interfering with her studies.
Brown English C C should be a teacher someday.
Blue P.E. A A kicked more field goals than any other student.
Blue P.E. B B needs to work to communicate on the field.
Blue P.E. C C needs to stop insulting the teacher
Four months ago, I asked a version of this question and got no reply. I thought it would help to show my solution, but perhaps the tm package is not widely used. So here's another shot.
I would use mgsub from the qdap package here. You could do something like this (though take care that each student is mapped to the same ID everywhere; the approach below may be too specific to your example, in which every student has exactly one nickname in parentheses):
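# Unique student names as recorded, e.g. "Richard (Dick)", and one random ID per student
names <- unique(as.character(reports$Student.Name))
ids <- sample(100000, length(names))
# Variants to scrub: the full entry, the nickname in parentheses, the name without the nickname
tocheck <- c(
  names,
  unlist(regmatches(names, gregexpr("(?<=\\().*?(?=\\))", names, perl = TRUE))),
  gsub("\\s*\\([^\\)]+\\)", "", names)
)
# Look up each row's ID by name rather than relying on the row order of the example
reports$Student.Name <- ids[match(as.character(reports$Student.Name), names)]
# tocheck holds three variants per student in the same student order, hence rep(ids, 3);
# this only lines up if every student has exactly one nickname in parentheses
reports$Comment <- qdap::mgsub(tocheck, rep(ids, 3), reports$Comment)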
Student.Name Comment
1 61034 61034 is a terrible student-- why hasn't he been kicked out yet?
2 45005 45005 procrastinates, but does good work.
3 13699 As her teacher, I think 13699 is my favorite.
4 61034 61034 is terrible at turning in homework
5 45005 45005's work is interfering with her studies.
6 13699 13699 should be a teacher someday.
7 61034 61034 kicked more field goals than any other student.
8 45005 45005 needs to work to communicate on the field.
9 13699 13699 needs to stop insulting the teacher
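If you would rather not depend on qdap, the same idea can be sketched in base R. This is only a rough sketch under a few assumptions: the names follow the "Formal (Nickname)" pattern from your example, the helpers name_variants and scrub_comment are hypothetical names I made up for illustration, and the sequential IDs ("S1", "S2", ...) are my choice rather than anything from your data. It scrubs each comment using only that row's student, and it only matches whole words, so a short nickname should not be replaced inside a longer word.

```r
# Toy data mirroring the example in the question.
reports <- data.frame(
  Teacher = rep(c("Black", "Brown", "Blue"), each = 3),
  Subject = rep(c("Math", "English", "P.E."), each = 3),
  Student.Name = rep(c("Richard (Dick)", "Elizabeth (Betty)", "Mary Grace (MG)"), 3),
  Comment = c(
    "Dick is a terrible student-- why hasn't he been kicked out yet?",
    "Betty procrastinates, but does good work.",
    "As her teacher, I think MG is my favorite.",
    "Richard is terrible at turning in homework.",
    "Elizabeth's work is interfering with her studies.",
    "Mary Grace should be a teacher someday.",
    "Richard (Dick) kicked more field goals than any other student.",
    "Elizabeth (Betty) needs to work to communicate on the field.",
    "Mary Grace (MG) needs to stop insulting the teacher"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "Richard (Dick)" into its variants,
# longest first so the full form is replaced before its parts.
name_variants <- function(full) {
  nick <- unlist(regmatches(full, gregexpr("(?<=\\().*?(?=\\))", full, perl = TRUE)))
  formal <- trimws(gsub("\\s*\\([^)]*\\)", "", full))
  v <- unique(c(full, formal, nick))
  v[order(nchar(v), decreasing = TRUE)]
}

# Hypothetical helper: replace every variant of one student's name in one comment.
scrub_comment <- function(comment, full, id) {
  for (v in name_variants(full)) {
    # \Q...\E treats the variant literally; the lookarounds reject matches
    # that are glued to other letters or digits.
    pattern <- paste0("(?<![[:alnum:]])\\Q", v, "\\E(?![[:alnum:]])")
    comment <- gsub(pattern, id, comment, perl = TRUE)
  }
  comment
}

# Lookup table from recorded name to ID; keep this private if you share the data.
students <- unique(reports$Student.Name)
key <- setNames(paste0("S", seq_along(students)), students)

anon <- reports
anon$Comment <- mapply(scrub_comment, reports$Comment, reports$Student.Name,
                       key[reports$Student.Name], USE.NAMES = FALSE)
anon$Student.Name <- unname(key[reports$Student.Name])
```

With the toy data, anon$Comment[5] comes out as "S2's work is interfering with her studies." Note that this will not catch variants that are not derivable from the Student.Name column (a teacher writing just "Mary" for Mary Grace, or a misspelling), so it is still worth searching the scrubbed comments for leftovers before sharing anything.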