python pandas string punctuation

Remove punctuation from a pandas column but keep the original list-of-lists structure


I know how to do it for a single list in a cell, but I need to keep the structure when each cell holds a list of lists, as in [["I","need","to","remove","punctuations","."],[...],[...]] -> [["I","need","to","remove","punctuations"],[...],[...]]

Every method I know flattens the result into this -> ["I","need","to","remove","punctuations",...]

    data["clean_text"] = data["clean_text"].apply(lambda x: [', '.join([c for c in s if c not in string.punctuation]) for s in x])
    data["clean_text"] = data["clean_text"].str.replace(r'[^\w\s]+', '')
    ...

What's the best way to do that?


Solution

  • Following your approach, I would just add a nested list comprehension inside a helper function:

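    import string
    
    def clean_up(lst):
        # Filter each inner list on its own so the outer list-of-lists structure is kept.
        return [[w for w in sublist if w not in string.punctuation] for sublist in lst]
    
    # Build the new column row by row from the original "text" column.
    data["clean_text"] = [clean_up(x) for x in data["text"]]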
    

    Output:

    print(data) # -- with two different columns so we can see the difference
    
                                                                                                        text  \
    0  [[I, need, to, remove, punctuations, .], [This, is, another, list, with, commas, ,, and, periods, .]]   
    
                                                                                         clean_text  
    0  [[I, need, to, remove, punctuations], [This, is, another, list, with, commas, and, periods]]
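
One caveat with the helper above: `w not in string.punctuation` is a substring test against one long string, so multi-character tokens such as "..." or "?!" are kept, and an empty string would be dropped. If your tokens can look like that, a stricter check might help. The following is only a sketch under that assumption: `is_punct` and the sample `rows` are hypothetical names and data made up for illustration, and `rows` stands in for the values of the "text" column.

```python
import string

# Set of single ASCII punctuation characters, for fast membership tests.
PUNCT = set(string.punctuation)

def is_punct(token):
    # True only for non-empty tokens made up entirely of punctuation characters.
    return bool(token) and all(ch in PUNCT for ch in token)

def clean_up(lst):
    # Rebuild every inner list without punctuation tokens; the outer list is preserved.
    return [[w for w in sublist if not is_punct(w)] for sublist in lst]

# Hypothetical sample: one "cell" holding a list of lists.
rows = [
    [["I", "need", "to", "remove", "punctuations", "."],
     ["Wait", "...", "what", "?!"]],
]

cleaned = [clean_up(cell) for cell in rows]
print(cleaned[0])
# [['I', 'need', 'to', 'remove', 'punctuations'], ['Wait', 'what']]
```

It plugs into the DataFrame the same way as before, i.e. `data["clean_text"] = [clean_up(x) for x in data["text"]]`. Note that `string.punctuation` only covers ASCII, so characters like curly quotes would still get through.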