pythonspacyredaction

Using spacy to redact names from a column in a data frame


I have a data frame named "df1". This data frame has 12 columns. The last column in this data frame is called notes. I need to replace common names like "john, sally and richard" from this column and replace the values with xxxx or something similar. I have a working script that is creating this data frame from MS SQL. I have spent several hours and used various resources to try and get some code that works to do this but I have not been successful. I do not have to use Spacy, but I was told this is a good package to work with. Any help would be appreciated.


Solution

  • You need to use a solution like

    import spacy
    import pandas as pd
    
    # Test dataframe
    df = pd.DataFrame({'notes':["Peter A. Smith came to see Bart in Washington on Tuesday."]})
    
    print(df['notes'])
    # => 0    Peter A. Smith came to see   Bart     in   Washington on  Tuesday.
    ##        <<PERSON>>     came to see <<PERSON>> in     <<GPE>>  on <<DATE>>.
    
    nlp = spacy.load('en_core_web_trf')
    
    def redact_with_spacy(text: str) -> str:
        doc = nlp(text)
        newString = text
        for e in reversed(doc.ents):
            if e.label_ == "PERSON": # Only redact PERSON entities
                start = e.start_char
                end = start + len(e.text)
                newString = newString[:start] + "xxxx" + newString[end:]
        return newString
    
    df['notes'] = df['notes'].apply(redact_with_spacy)
    print(df['notes'])
    

    Output:

    0    xxxx came to see xxxx in Washington on Tuesday.
    

    Note you may adjust the "xxxx" in the redact_with_spacy function. E.g., you may replace the found entity with the same amount of xs if you use newString = newString[:start] + ("x" * len(e.text)) + newString[end:]. Or, to keep spaces, newString = newString[:start] + "".join(["x" if not x.isspace() else " " for x in e.text]) + newString[end:].