google-cloud-platform google-cloud-dataprep

Dataprep - accents and special characters


How do I solve this problem with accents and special characters in Dataprep? I need this information to appear.


Thank you very much for your attention.


Solution

  • Dataprep has built-in transformations that let you remove or change special characters. For example, you can convert accented letters to unaccented ones with Remove accents in text, or replace unrecognised characters with another character using Replace text or patterns.

    Below are the steps to change a special character or accented letter.

    1. Create your flow.
    2. Add or import your data.
    3. Click Add a recipe, as described in the documentation. In your case you can do one or both of the following:

    First, if you have accented words, go to Search Transformations > Remove accents in text, then select the column that contains the accented words. The accented letters are replaced with their unaccented equivalents. A preview of your data is shown so you can check the transformation.
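To make the effect of removing accents concrete, here is a minimal Python sketch of the same idea outside Dataprep. It is not Dataprep's implementation; the function name `remove_accents` is hypothetical, and the approach (Unicode NFD decomposition followed by dropping combining marks) is an assumption about what such a transformation typically does.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Return text with accents stripped, e.g. 'não' becomes 'nao'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for value in ["Non rec. char É", "Non rec. char ç", "Accented word não"]:
        print(remove_accents(value))
```

Note that this changes the data itself: use it only if losing the accents is acceptable.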

    Second, if you have an unrecognised character, go to Search Transformations > Replace text or patterns, then select the column you want to transform. In Find, enter the letter or symbol between single quotes; in Replace with, enter the character that should take its place. Finally, preview your data to see the transformation.
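The Find / Replace with step amounts to a literal substitution applied to every value in a column. A rough Python equivalent, assuming the column is a plain list of strings (the helper name `replace_in_column` is hypothetical and not part of Dataprep), might look like this:

```python
from typing import List


def replace_in_column(values: List[str], find: str, replace_with: str) -> List[str]:
    """Replace every literal occurrence of `find` in each value of a column."""
    return [value.replace(find, replace_with) for value in values]


# Sample column taken from the test data in this answer.
column = ["Non rec. char É", "Non rec. char ç"]
cleaned = replace_in_column(column, "ç", "c")

if __name__ == "__main__":
    print(cleaned)
```

Only the character given in `find` is touched, so other special characters in the column are left as they are.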

    UPDATE: I was able to load a .csv file containing the mentioned characters into Dataprep. Below are my steps and sample data:

    The .csv file I used had the following content:

    Test
    Non rec. char É
    Non rec. char ç
    Accented word não
    

    On the Dataprep home page, click Import Data (top right corner), then Google Cloud Storage (left side of the screen). Find and select your file (test by importing a single file rather than parameterizing) and click the add (+) symbol. At this step you can already preview the characters; in my case they displayed normally. Finally, click Import & Wrangle and view your data. Using the data above, I was able to see the characters properly without any issues.
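If you want to reproduce this test, the sample file can be generated with a short Python script. This is only a sketch: the answer does not state which encoding the file used, so UTF-8 is an assumption here, and the file name `dataprep_sample.csv` is hypothetical. The script writes the rows and reads them back to confirm the characters survive the round trip before you upload the file to Cloud Storage.

```python
import csv
import os
import tempfile
from typing import List

# Rows copied from the sample data above (one column, header "Test").
SAMPLE_ROWS = ["Test", "Non rec. char É", "Non rec. char ç", "Accented word não"]


def write_sample_csv(path: str) -> None:
    """Write the sample rows as a single-column CSV, encoded as UTF-8 (assumed)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for row in SAMPLE_ROWS:
            writer.writerow([row])


def read_sample_csv(path: str) -> List[str]:
    """Read the single-column CSV back using the same encoding."""
    with open(path, encoding="utf-8", newline="") as handle:
        return [row[0] for row in csv.reader(handle)]


path = os.path.join(tempfile.gettempdir(), "dataprep_sample.csv")
write_sample_csv(path)
round_trip = read_sample_csv(path)

if __name__ == "__main__":
    print(round_trip)
```

If the characters look wrong after reading the file back with a given encoding, the file was probably saved with a different one.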