python json string list ijson

Parsing with ijson, lists become strings - make them nested lists of floats


I have large .GEOJSON files that I parse using ijson. One of the fields I load is the coordinates, which look like this, for example: "coordinates": [[[47335.8499999996, 6571361.68], [47336.2599999998, 6571360.54], [47336, 6571335.4]]]

I'm able to load this data, having changed its type from decimal.Decimal to float in the ijson object class. I use the following to parse the JSON file.

import ijson
import pandas as pd

class ReadJSON:

    def __init__(self, filename, name=""):
        self.name = name
        # datafolder is defined elsewhere in my script
        self.f = open(datafolder+filename)
        self.objects = ijson.items(self.f, 'features')

    def load_file(self):
        for obj in self.objects:
            final_list = list()
            for entry in obj:
                temp_list = list()
                col_names = list()
                for key in entry.keys():
                    for col in entry[key]:
                        temp_list.append(entry[key][col])
                        col_names.append(self.name+'.'+col)
                    final_list.append(temp_list)
                df = pd.DataFrame(final_list, columns=col_names)
        return df

Everything ends up where it should, but each list of coordinates is stored as a string. I need to be able to work with the individual points and their xy-coordinates. For example, I will have df_rivers, where df_rivers["coordinates"] contains such lists.

I have tried

temp_list = "[[[47335.8499999996, 6571361.68], [47336.2599999998, 6571360.54], [47336, 6571335.4]]]"
t_list = temp_list.split('],')

print(t_list[0])
out: [[[47335.8499999996, 6571361.68
type(t_list[0]) is str

point = t_list[0].split(',')
print(point[0])
out: [[[47335.8499999996
type(point[0]) is str

So I am able to access each point and coordinate, but it is quite cumbersome. In addition, point[1] suddenly went out of bounds partway through a temp_list. I have many of these lists, which are in reality much longer, and I need to be able to work with them easily.

I don't care whether the fix lies in the loading of the data or is applied afterwards to the whole column, as the script will rarely be run once finished. However, it will have to run through 153 files with up to 60000 rows, so efficiency would be nice.

I'm using Python 3.6.3


Solution

  • You can use ast.literal_eval to obtain the list object from the string. Here is a demo:

    >>> temp_list = "[[[47335.8499999996, 6571361.68], [47336.2599999998, 6571360.54], [47336, 6571335.4]]]"
    >>> import ast
    >>> li = ast.literal_eval(temp_list)
    >>> li
    [[[47335.8499999996, 6571361.68], [47336.2599999998, 6571360.54], [47336, 6571335.4]]]
    >>> type(li)
    <class 'list'>
    

    See the Python documentation for ast.literal_eval.
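To apply the same idea to many strings at once, something along these lines might work. This is only a sketch: raw_column is a made-up stand-in for the df_rivers["coordinates"] column, and parse_coordinates is a hypothetical helper name, not something from ijson or pandas. It also assumes every string contains nothing but nested lists of plain numbers. Since such strings are also valid JSON, json.loads should give the same structure and is usually faster, which may matter with that many files.

```python
import ast
import json

# Hypothetical sample data standing in for df_rivers["coordinates"].
raw_column = [
    "[[[47335.8499999996, 6571361.68], [47336.2599999998, 6571360.54], [47336, 6571335.4]]]",
    "[[[1.5, 2.5], [3.0, 4.0]]]",
]

def parse_coordinates(text):
    """Parse a coordinate string into nested lists of floats."""
    def to_float(node):
        # Recurse through the nested lists and convert every number to float,
        # so that integer-looking values such as 47336 become 47336.0.
        if isinstance(node, list):
            return [to_float(child) for child in node]
        return float(node)
    return to_float(ast.literal_eval(text))

# Convert every string in the column.
parsed = [parse_coordinates(text) for text in raw_column]

# Work with individual points: first ring of the first row.
ring = parsed[0][0]
first_point = ring[0]
xs = [x for x, y in ring]
ys = [y for x, y in ring]

# Alternative under the same assumption (strings are valid JSON).
same_via_json = [json.loads(text) for text in raw_column]
```

With pandas, the same helper could presumably be applied to the whole column via df_rivers["coordinates"].apply(parse_coordinates).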