
DtypeWarning: Columns have mixed types when loading a CSV in pandas


When loading a CSV file in pandas, I encountered the warning message below:

DtypeWarning: Columns have mixed types. Specify dtype option on import  
or set low_memory=False

Reading online, I found a few solutions.

The first is to set low_memory=False, but I understand that this is not good practice and doesn't really resolve the problem.

The second solution is to set a data type for each column (or for each column with mixed data types):

pd.read_csv(csv_path_name, dtype={'first_column': 'str', 'second_column': 'str'})

Again, from what I read, this is not the ideal solution for a big dataset.
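To make the second approach concrete, here is a minimal sketch on a small in-memory CSV. The column names (`code`, `amount`) and the data are made up for illustration; only the `dtype` argument of `pd.read_csv` comes from the snippet above.

```python
import io

import pandas as pd

# Hypothetical CSV: the 'code' column mixes numeric-looking and text values.
csv_text = "code,amount\n1,10\n2,20\nA3,30\n"

# Declaring 'code' as a string column means pandas does not have to guess
# its type; 'amount' is left to normal type inference.
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": "str"})

print(df["code"].tolist())
print(df.dtypes)
```

Only the columns named in `dtype` are forced to strings here, so the other columns keep whatever type pandas infers for them.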

The third solution is to create a converter function. To my understanding, this might be the most appropriate solution. I found code that works for me, but I am trying to better understand what exactly this function is doing:

def convert_dtype(x):
    if not x:
        return ''
    try:
        return str(x)
    except:
        return ''

df = pd.read_csv(csv_path_name, converters={'first_col':convert_dtype, 'second_col':convert_dtype, etc.... } )

Can someone please explain the function code to me?

Thanks


Solution

  • if not x checks whether x is falsy. read_csv converters normally receive each raw cell value as a string, so in practice this means x is an empty string. In that case the function returns '', an empty string.

    def convert_dtype(x):
        if not x:
            return ''
    
    

    try: return str(x) attempts to convert x to a string and return it.

        try:
            return str(x)
    

    If converting x to a string raises an exception, the except branch returns '' instead.

        except:
            return ''
    

    Basically, if a cell is empty from the start or can't be converted to a string, its content is discarded and replaced with an empty string. Whether this is a good approach depends on what you are trying to accomplish with your application. Either way, the column will contain only strings afterwards.
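As a rough illustration of the behaviour described above, the sketch below calls the function directly and then passes it to `read_csv` on a small in-memory CSV. The data is made up, and `except Exception` is used in place of the bare `except` as a slightly safer variant; the logic is otherwise the same as in the question.

```python
import io

import pandas as pd


def convert_dtype(x):
    # Empty (falsy) input becomes an empty string.
    if not x:
        return ''
    try:
        # Anything else is returned as a string.
        return str(x)
    except Exception:
        # Fallback if the conversion fails for some reason.
        return ''


# Direct calls show the two main paths through the function.
print(repr(convert_dtype('')))    # ''
print(repr(convert_dtype('42')))  # '42'

# Hypothetical CSV with empty cells in both columns.
csv_text = "first_col,second_col\n1,a\n,b\nx7,\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"first_col": convert_dtype, "second_col": convert_dtype},
)

# Both columns now hold only strings; empty cells are '' rather than NaN.
print(df["first_col"].tolist())
print(df["second_col"].tolist())
```

Note that a non-empty value such as '0' is still truthy as a string, so it is kept as '0' and not replaced by ''.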