Tags: python, pandas, csv, encoding, qpython

Trying to convert kdb to csv using python, everything converts correctly except one column


I have converted a kdb query into a dataframe and then written that dataframe to a CSV file. This caused an encoding error, which I fixed easily by decoding to UTF-8. However, there is one column for which this did not work.

"nameFid" is the column which isn't working correctly; it appears in the CSV file as " b'STRING' "

I am running Python 3.7, any other information needed I will be happy to provide.

Here is my code which decodes the data in the dataframe I get from kdb

    def decode_columns(df):
        for ba in df.dtypes.keys():
            if df.dtypes[ba] == 'O':
                try:
                    df[ba] = df[ba].apply(lambda x: x.decode('UTF-8'))
                except Exception as e:
                    print(e)
        return df

This worked for every column except "nameFid"

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 6: invalid continuation byte

This is one error I get, but I thought this suggests that the data isn't encoded using UTF-8. Wouldn't that mean none of the columns would work?

When using the try except, it instead prints "'Series' object has no attribute 'decode'".

My goal is to remove the "b''" from the column values, which currently show

" b'STRING' "

I'm not sure what else I need to add. Let me know if you need anything.

Also sorry I am quite new to all of this.


Solution

  • Many encodings are partially compatible with one another. This is mostly due to the prevalence of ASCII, so a ton of encodings are backward compatible with ASCII but extend it differently. Hence if your other columns only contain ASCII characters (digits, plain letters, etc.), they will decode correctly under a lot of different encodings.

    The column that raises an error, however, contains some character outside the normal ASCII range, and thus the encoding starts to matter. If you don't know the encoding of the data, you can use chardet to try to guess it. Keep in mind that this is just a guess: decoding with the wrong encoding may not raise any error, yet it can still produce the wrong characters in the final text, so you should always confirm which encoding to use.

    That said, if you are on Linux, the standard file utility is often able to give you a rough guess of the encoding used; for more advanced use cases something like chardet is necessary.

    Once you have found the correct encoding (say you found it is latin-1), simply replace decode('utf-8') with decode('latin-1').
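
    To see why the ASCII-only columns decode fine while "nameFid" fails, here is a minimal demonstration in plain Python. The byte 0xdc is taken from the traceback above; latin-1 is only an example encoding, not necessarily the right one for your data:

```python
# ASCII-only bytes decode to the same text under UTF-8, Latin-1,
# cp1252, ... because those encodings all extend ASCII.
ascii_bytes = b"12345"
print(ascii_bytes.decode("utf-8"), ascii_bytes.decode("latin-1"))  # 12345 12345

# 0xdc is outside the ASCII range, so the encoding now matters:
# followed by a plain ASCII byte it is invalid UTF-8, but valid Latin-1.
raw = b"STRING\xdcX"
print(raw.decode("latin-1"))  # STRINGÜX
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xdc in position 6: ...
```

    This is exactly the situation in the question: the same bytes are legal in one encoding and illegal in another, and only the column containing a byte above 0x7f exposes the mismatch.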
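
    Putting it together, here is one way the decode loop from the question could be rewritten so that it skips values that are not bytes (which avoids the "'Series' object has no attribute 'decode'" message when a column holds already-decoded strings) and falls back to a second encoding when UTF-8 fails. The latin-1 fallback below is only an assumption for illustration; substitute whatever encoding chardet or file reports for your data:

```python
import pandas as pd

def decode_bytes_columns(df, fallback_encoding="latin-1"):
    """Decode bytes values in object columns, trying UTF-8 first and
    falling back to `fallback_encoding` (an assumed example) on failure."""
    def decode_cell(x):
        if not isinstance(x, bytes):
            return x  # leave str, NaN, numbers, ... untouched
        try:
            return x.decode("utf-8")
        except UnicodeDecodeError:
            return x.decode(fallback_encoding)

    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].apply(decode_cell)
    return df

# Example: one ASCII-only value and one containing the non-ASCII byte 0xfc.
df = pd.DataFrame({"nameFid": [b"plain", b"M\xfcller"], "num": [1, 2]})
df = decode_bytes_columns(df)
print(df["nameFid"].tolist())  # ['plain', 'Müller']
```

    Writing the CSV after this step should then show the plain strings instead of the b'STRING' byte-literal representation.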