python string dataframe data-manipulation lowercase

How to convert Latin characters to lowercase in a data frame in Python?


I am trying to convert uppercase Latin characters in one column of a pandas data frame to lowercase using Python. The data comes from a CSV file that contains strings.

I have tried using .lower() and .casefold().

Input:

' LetÂ’s ALL look after the less capable in our village and ensure they stay healthy.'

Expected Output:

letâ’s all look after the less capable in our village and ensure they stay healthy.

Current Output:

letÂ’s all look after the less capable in our village and ensure they stay healthy.

Quote is a column in the CSV file. I want the content of the 'Quote' column to be in lowercase, so I tried:

import pandas as pd

df = pd.read_csv(data_file, encoding='latin-1')
df['Quote'] = df['Quote'].str.lower()  # lowercase every string in the column

but the output is still showing uppercase Latin characters.


Solution

  • First of all, I think your input text already contains errors caused by an encoding problem: "LetÂ’s" should probably be "Let’s".

    The "latin-1" encoding supports both the uppercase letter "Â" and the lowercase letter "â", and "Â".lower() does yield "â". However, I am not sure that your input text actually contains the letter "Â". More likely it contains some character that the encoding does not support (see my first point) and that is merely displayed as "Â".

    Note that the parts of your text that use the ' sign (Unicode U+0027) do not have this problem. Whatever symbol was used in the part that became "LetÂ’s" may simply not be supported by the encoding of your input text. A variety of different symbols are (incorrectly) used for contractions.

    What happens if you read the file with encoding "utf-8" instead? Unicode supports far more symbols than latin-1. Either way, keeping the incorrect word "letâ’s" in your text should not be your goal.
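
To illustrate the point about encodings, here is a minimal sketch. It assumes, without having seen the actual file, that the apostrophe is a typographic one (U+2019) stored as UTF-8 bytes. The sample string and the variable names are made up for the example. It decodes the same bytes once with latin-1 and once with utf-8 and lowercases both results.

```python
# Assumption: the file stores a typographic apostrophe (U+2019) as UTF-8 bytes.
raw = "Let\u2019s ALL look after".encode("utf-8")

wrong = raw.decode("latin-1")  # every byte becomes its own character
right = raw.decode("utf-8")    # the three apostrophe bytes become one character

# With latin-1 the apostrophe turns into an accented letter plus two
# invisible control characters; with utf-8 it stays intact.
print(repr(wrong.lower()))
print(repr(right.lower()))

# latin-1 does contain both letters (U+00C2 and U+00E2),
# and lower() maps the uppercase one to the lowercase one.
print("\u00c2".lower() == "\u00e2")
```

If this assumption matches your file, passing encoding='utf-8' to pd.read_csv should give you a clean "let’s" after .str.lower(). If utf-8 raises a decoding error instead, the file was probably written with yet another encoding, and that is the thing to find out first.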