When loading a CSV file in pandas, I encountered the warning message below:
DtypeWarning: Columns have mixed types. Specify dtype option on import or set low_memory=False
Reading online, I found a few solutions.
The first is to set low_memory=False, but I understand that this is not good practice and doesn't really resolve the problem.
The second solution is to set a data type for each column (or at least for each column with mixed data types):
pd.read_csv(csv_path_name, dtype={'first_column': 'str', 'second_column': 'str'})
Again, from what I have read, this is not the ideal solution for a big dataset.
The third solution is to create a converter function. To my understanding, this might be the most appropriate solution. I found code that works for me, but I am trying to better understand what exactly this function is doing:
def convert_dtype(x):
    if not x:
        return ''
    try:
        return str(x)
    except:
        return ''
df = pd.read_csv(csv_path_name, converters={'first_col':convert_dtype, 'second_col':convert_dtype, etc.... } )
Can someone please explain the function code to me?
Thanks
if not x checks whether x is an empty string (more precisely, whether x is falsy, which also covers values such as None and 0). If it is, the function returns '', a string without any content.
def convert_dtype(x):
    if not x:
        return ''
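As a quick check of which values take that first branch, here is a small sketch in plain Python. The sample inputs are made up for illustration, and the bare except from the question is narrowed to except Exception, which is my change rather than part of the original code.

```python
def convert_dtype(x):
    # Function from the question with indentation restored.
    # Assumption: "except Exception" replaces the original bare "except".
    if not x:
        return ''
    try:
        return str(x)
    except Exception:
        return ''

# Falsy inputs take the first branch and come back as an empty string.
assert convert_dtype('') == ''
assert convert_dtype(None) == ''
assert convert_dtype(0) == ''

# Non-empty strings are truthy, including '0' and a single space,
# so they fall through to str(x) and come back unchanged.
assert convert_dtype('0') == '0'
assert convert_dtype(' ') == ' '
assert convert_dtype('abc') == 'abc'
```

Note that the string '0' is kept, while the integer 0 would be replaced by ''.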
try: return str(x) tries to convert x to a string and return it.
    try:
        return str(x)
If converting x to a string fails, the function returns '' instead.
    except:
        return ''
Basically, if the content of a cell is empty from the start or can't be converted to a string, it is discarded and replaced with an empty string. I can't judge whether this is a good approach; that depends on what you are trying to accomplish with your application. Either way, the column will contain only strings afterwards.
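To see the overall effect, here is a minimal sketch that runs the converter on a small in-memory CSV. The data is invented for illustration, io.StringIO stands in for the real file path, and I am assuming the default C parser, which as far as I can tell passes each cell to the converter as a raw string.

```python
import io

import pandas as pd


def convert_dtype(x):
    # Function from the question; "except Exception" is my substitution
    # for the original bare "except".
    if not x:
        return ''
    try:
        return str(x)
    except Exception:
        return ''


# Hypothetical data: first_col mixes a number, text and an empty cell.
csv_text = "first_col,second_col\n1,a\nabc,2\n,c\n"

df = pd.read_csv(
    io.StringIO(csv_text),  # stand-in for csv_path_name
    converters={'first_col': convert_dtype, 'second_col': convert_dtype},
)

# Show the Python type of the cells and the converted values.
print(df['first_col'].map(type).unique())
print(df['first_col'].tolist())
```

Since the converter already receives strings, str(x) has nothing left to convert, so in practice the function's main effect is to keep empty cells as '' and to stop pandas from guessing a numeric type for the column.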