Tags: python, arrays, list, divide-and-conquer, spss-files

Divide and Conquer Lists in Python (to read sav files using pyreadstat)


I am trying to read .sav files using pyreadstat in Python, but in some rare cases I get a UnicodeDecodeError because a string variable contains special characters.

To handle this, instead of loading the entire variable set, I plan to load only the variables that do not raise this error.

Below is the pseudo-code I have so far. It is not very efficient, since it checks each item of the list for errors with a try/except.

import pyreadstat

# Read only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
columns = meta.column_names  # All variable names
result = []
for var in columns:
    print(var)
    try:
        df, meta = pyreadstat.read_sav('Test.sav', usecols=[var])
        # If there is no error, we can keep this variable
        result.append(var)
    except Exception:
        pass
# Finally, load the sav file with only the non-erroring variables
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)

For a .sav file with 1000+ variables this takes a very long time. I was thinking a divide-and-conquer approach could do it faster, but I am not very good at implementing recursive algorithms. Could someone please help me with pseudo-code? It would be very helpful.

  1. Take the full list of variables and try to read the sav file
  2. If there is no error, store the output in result and then read the sav file
  3. If there is an error, split the list into 2 parts and run each part through step 1 again
  4. Repeat step 3 until every sub-list reads without error

With this second approach, 90% of my sav files would load on the first pass, so I think recursion is a good method.
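The recursive scheme described in the steps above can be sketched roughly as follows. This is an untested sketch: the function and parameter names are my own, and the `try_read` parameter is an addition that lets the recursion be exercised (and tested) without a real .sav file.

```python
def readable_columns(path, cols, try_read=None):
    """Return the subset of cols that can be read from path without error.

    try_read(subset) should attempt a read and raise on failure; by default
    it calls pyreadstat.read_sav. It is injectable so the recursion can be
    tested with a stub instead of a real file.
    """
    if try_read is None:
        def try_read(subset):
            import pyreadstat  # imported lazily; only needed for real files
            pyreadstat.read_sav(path, usecols=list(subset))

    def recurse(cols):
        if not cols:
            return []
        try:
            try_read(cols)
            return list(cols)          # whole chunk reads fine: keep it all
        except Exception:
            if len(cols) == 1:
                return []              # a single bad column: discard it
            mid = len(cols) // 2       # otherwise split in half and recurse
            return recurse(cols[:mid]) + recurse(cols[mid:])

    return recurse(list(cols))
```

With this, the final load would look something like: read the metadata, call `readable_columns('Test.sav', meta.column_names)`, then pass the result to `usecols`. If most files have no bad columns, the first `try_read` succeeds and only one extra read is paid.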

You can try to reproduce the issue with the sav file here


Solution

  • For this specific case I would suggest a different approach: you can pass an encoding argument to pyreadstat.read_sav to set the encoding manually. If you don't know which one it is, you can iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:

    import pyreadstat

    # here codes is a list with all the encodings in the link mentioned before
    for c in codes:
        try:
            df, meta = pyreadstat.read_sav("Test.sav", encoding=c)
            print(c)
            print(df.head())
        except Exception:
            pass
    

    I did this, and there were a few that may potentially make sense, assuming that the string is in a non-Latin alphabet. However, the most promising one is not in the list: encoding="UTF8" (the list contains UTF-8, with a dash, which fails). Using UTF8 (no dash) I get this:

    నేను గతంలో వాడిన బ
    

    which according to Google Translate means "I used to come b" in Telugu. Not sure if that fully makes sense, but it's a start.

    The advantage of this approach is that if you find the right encoding, you will not be losing data, and reading the data will be fast. The disadvantage is that you may not find the right encoding.

    If you cannot find the right encoding, you can still read the problematic columns very quickly this way, and discard them later in pandas by inspecting which character columns do not contain Latin characters. This will be much faster than the algorithm you were suggesting.
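That last inspection step could be sketched like this. The helper name `non_ascii_columns` is mine, and it uses "contains any non-ASCII character" as a rough proxy for "non-Latin", which may over-flag columns with accented Latin text:

```python
import pandas as pd

def non_ascii_columns(df):
    """Return the names of string columns containing any non-ASCII character."""
    flagged = []
    for col in df.select_dtypes(include='object').columns:
        s = df[col].dropna().astype(str)
        # match any character outside the 7-bit ASCII range
        if s.str.contains(r'[^\x00-\x7F]', regex=True).any():
            flagged.append(col)
    return flagged
```

After reading the file with a best-guess encoding, `df.drop(columns=non_ascii_columns(df))` would keep only the columns that decoded to plain ASCII.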