Tags: string, dataframe, apache-spark, pyspark, diacritics

How to replace accented characters in PySpark?


I have a string column in a DataFrame whose values contain accented characters, like

'México', 'Albânia', 'Japão'

How can I replace the accented letters to get this:

'Mexico', 'Albania', 'Japao'

I tried many solutions available on Stack Overflow, like this:

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

But, to my disappointment, it returns

>>> strip_accents('México')
'M?xico'

Solution

  • You can use translate, which replaces each character of the matching string with the character at the same position in the replacement string:

    df = spark.createDataFrame(
        [
        ('1','Japão'),
        ('2','Irã'),
        ('3','São Paulo'),
        ('5','Canadá'),
        ('6','Tókio'),
        ('7','México'),
        ('8','Albânia')
        ],
        ["id", "Local"]
    )
    
    df.show(truncate=False)
    
    +---+---------+
    |id |Local    |
    +---+---------+
    |1  |Japão    |
    |2  |Irã      |
    |3  |São Paulo|
    |5  |Canadá   |
    |6  |Tókio    |
    |7  |México   |
    |8  |Albânia  |
    +---+---------+
    
    from pyspark.sql import functions as F
    
    df\
        .withColumn('Loc_norm', F.translate('Local',
                                            'ãâäöüẞáäčďéěíĺľňóôŕšťúůýžÃÂÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ',
                                            'aaaousaacdeeillnoorstuuyzAAAOUSAACDEEILLNOORSTUUYZ'))\
        .show(truncate=False)
    
    +---+---------+---------+
    |id |Local    |Loc_norm |
    +---+---------+---------+
    |1  |Japão    |Japao    |
    |2  |Irã      |Ira      |
    |3  |São Paulo|Sao Paulo|
    |5  |Canadá   |Canada   |
    |6  |Tókio    |Tokio    |
    |7  |México   |Mexico   |
    |8  |Albânia  |Albania  |
    +---+---------+---------+
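
  • Alternatively, if you would rather not enumerate every accented character (translate silently skips any accent missing from its matching string), a plain Python UDF built on the same unicodedata normalization as in the question handles accents generically. A minimal sketch, assuming Python 3 on the workers (strip_accents_udf is an illustrative name, not part of any library):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    import unicodedata

    @F.udf(returnType=StringType())
    def strip_accents_udf(s):
        # Decompose each character to NFD, then drop the combining
        # marks (Unicode category 'Mn'), leaving only the base letters.
        if s is None:
            return None
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')

    df.withColumn('Loc_norm', strip_accents_udf('Local')).show(truncate=False)

    Note that translate runs entirely in the JVM while a UDF round-trips each value through Python, so the translate version is usually faster when its character list is complete.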