pythonpandasfor-looputf-8gbk

how to start a for loop from a chosen row of pandas.df?


when processing a pandas.df with for loop.I usually meet up with errors. When the error has been removed, I will have to restart the for loop form the beginning of the dataframe. How can I start the for loop from the error position, getting rid of run it repeatedly. For example:

senti = []
for i in dfs['ssentence']:
   senti.append(get_baidu_senti(i))

in the code above, I'm trying to do the sentiment analysis through api and store them into a list.However, the api only input GBK format whereas my data are encoded in utf-8. So it usually meet up with errors like this:

UnicodeEncodeError: 'gbk' codec can't encode character '\u30fb' in position 14: illegal multibyte sequence

So I have to delete the specific items like'\u30fb' manually and restart the for loop. At this time, the list"senti" contains so many data already so I don't want to abandon them. What can I do to improve the for loop?


Solution

  • If you API requires encoding to GBK, then just encode to that codec using an error handler other than 'strict' (the default).

    'ignore' will drop any codepoints that can't be encoded to GBK:

    dfs['ssentence_encoded'] = dfs['ssentence'].str.encode('gbk', 'ignore')
    

    See the Error Handlers section of Python's codecs documentation.

    If you need to pass in strings, but only strings that can safely be encoded to GBK, then I'd create a translation map suitable for the str.translate() method:

    class InvalidForEncodingMap(dict):
        def __init__(self, encoding):
            self._encoding = encoding
            self._negative = set()
        def __missing__(self, codepoint):
            if codepoint in self._negative:
                raise LookupError(codepoint)
            if chr(codepoint).encode(self._encoding, 'ignore'):
                # can be mapped, record as a negative and raise
                self._negative.add(codepoint)
                raise LookupError(codepoint)
            # map to None to remove
            self[codepoint] = None
            return None
    
    only_gbk = InvalidForEncodingMap('gbk')
    dfs['ssentence_gbk_safe'] = dfs['sentence'].str.translate(only_gbk)
    

    The InvalidForEncodingMap class lazily creates entries as codepoints are looked up, so only codepoints that are actually present in your data are processed. I'd still keep the map instance around for re-use if you need to use it more than once, the cache it builds up can be reused that way.