python html pandas google-cloud-translate

Convert HTML Characters To Strings in Pandas Dataframe


I want to convert HTML character entities to plain strings in a DataFrame.

I tried the code below, but the entities are not converted to strings.

import html
html.unescape(data)

Here is my DataFrame. How can I do this?

For reference, this data is the result of the Google Cloud Translation API.

ID  A1                                             A2                                                 A3
1   I don't know if it doesn't meet                Actually it was hard for me to understand that...  I don't know if it doesn't meet my exp...
2   NaN                                            NaN                                                NaN
3   I think it's a correct web design, at leas...  NaN                                                This item costs ¥400 or £4.



Solution

  • If you don't have any NaN's, you can simply use applymap() to have every cell processed by html.unescape.

    So if you find it acceptable to convert NaN's to empty strings, you can use:

    df.fillna("").applymap(html.unescape)
    

    If you want to preserve NaN's, then a good solution is to use stack() to turn columns into another level of the index, which will suppress NaN entries. Then you can use apply() (since it's a Series now, not a DataFrame) and later unstack() to get it back to its original format:

    df.stack().apply(html.unescape).unstack()
    

    But note that this last method will drop any rows or columns made up entirely of NaN's, which may or may not be acceptable to you.

    One more alternative is to use applymap() with a lambda that applies html.unescape only to the cells that are not NaN:

    df.applymap(lambda x: html.unescape(x) if pd.notnull(x) else x)
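For completeness, here is a minimal, self-contained sketch of the NaN-preserving approach. The sample frame is hypothetical (the entity-encoded strings are made up to resemble the question's data), and `unescape_cell` and `elementwise` are illustrative names, not part of pandas. It also assumes you may be on pandas 2.1 or later, where `DataFrame.map` is the preferred name for `applymap`, so it picks whichever one is available.

```python
import html

import pandas as pd

nan = float("nan")

# Hypothetical sample data: entity-encoded strings mixed with NaN,
# loosely modeled on the frame in the question.
df = pd.DataFrame(
    {
        "A1": ["I don&#39;t know", nan, "I think it&#39;s correct"],
        "A2": ["Tom &amp; Jerry", nan, nan],
        "A3": ["&lt;b&gt;bold&lt;/b&gt;", nan, "costs &yen;400 or &pound;4"],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)


def unescape_cell(value):
    """Unescape HTML entities in strings; return any other value unchanged."""
    return html.unescape(value) if isinstance(value, str) else value


# DataFrame.map was added in pandas 2.1 as the new name for applymap;
# fall back to applymap on older versions.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
clean = elementwise(unescape_cell)

print(clean)
```

Checking `isinstance(value, str)` rather than `pd.notnull(value)` is a design choice of this sketch: it leaves NaN cells alone in the same way, and it also skips numeric cells, which `html.unescape` would otherwise reject. Rows and columns that are entirely NaN are kept.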