Get x and y coordinates from a specifically formatted text file into an ordered dictionary in Python


I am trying to read a text file in a specific format, extract the coordinates from it, and store them in an ordered dict. Each set in the text file consists of a title line followed by one or more lines of x,y coordinates. Each coordinate line starts with . followed by \t (tab). One text file contains multiple such sets. My idea is to collect each set's x and y values into lists and append these to the ordered dict. In the end, the X and Y entries should each be a list of lists, with one inner list per set.

An illustration of what the text file looks like:

Freehand    green   2   2   0,0 289618  .   
.   104326.2,38323.8    104309.6,38307.2    104286.3,38287.3    104269.6,38270.6    104256.3,38254.0
.   104239.7,38237.4    104223.0,38220.7    104209.7,38204.1    104193.1,38194.1    104176.4,38187.5

Freehand    green   2   3   0,0 63980   .   
.   99803.4,37296.2 99826.7,37306.2 99843.3,37312.8 99860.0,37316.2 99876.6,37322.8

My code:

from collections import OrderedDict
import re

dict_roi = OrderedDict([
                ("title", []),
                ("X", []),
                ("Y", []) ])

with open(elements_file,"r") as f:

    try:
        # pattern to match to get coordinates
        pattern = re.compile(".\t\d+.*")

        # loop through lines and find title line and line with coordinates

        for i, line in enumerate(f):
            # get title line
            if line.startswith('Freehand'):
                dict_roi['title'].append(line) 

                # initiate empty list per set
                XX = []  
                YY = []

            # line with coordinates starts with .\t
            # if pattern matches and line starts with .\t, get the coordinates
            for match in re.finditer(pattern, line):
                if line.startswith('.\t'):
                    nln = "{}".format(line[2:].strip())
                    val = nln.split('{:6.1f}')

                    # data-massaging to get to the coordinates
                    for v in val:
                        coordinates_list = v.split("\t") 
                        for c in coordinates_list:
                            x, y = c.split(',')
                            print(x, y)
                            XX.append(float(x))
                            YY.append(float(y))

                        # this should append one list per set
                        dict_roi['X'].append(XX)
                        dict_roi['Y'].append(YY)


    except ValueError:
        print("Exiting")

    print(dict_roi)

Ideally, I would like to have an ordered dict which would give me something like:

('X', [[104326.2, 104309.6, 104286.3, 104269.6, 104256.3, 104239.7, 104223.0, 104209.7, 104193.1, 104176.4], 
[99803.4, 99826.7, 99843.3, 99860.0, 99876.6]])

('Y', [[38323.8, 38307.2, 38287.3, 38270.6, 38254.0, 38237.4, 38220.7, 38204.1, 38194.1, 38187.5], 
[37296.2, 37306.2, 37312.8, 37316.2, 37322.8]])])

But my output looks like this:

('X', [[104326.2, 104309.6, 104286.3, 104269.6, 104256.3, 104239.7, 104223.0, 104209.7, 104193.1, 104176.4], 
[104326.2, 104309.6, 104286.3, 104269.6, 104256.3, 104239.7, 104223.0, 104209.7, 104193.1, 104176.4], 
[99803.4, 99826.7, 99843.3, 99860.0, 99876.6]])

('Y', [[38323.8, 38307.2, 38287.3, 38270.6, 38254.0, 38237.4, 38220.7, 38204.1, 38194.1, 38187.5], 
[38323.8, 38307.2, 38287.3, 38270.6, 38254.0, 38237.4, 38220.7, 38204.1, 38194.1, 38187.5], 
[37296.2, 37306.2, 37312.8, 37316.2, 37322.8]])])

I get multiple copies of the lists for some of the sets. For example, here the X and Y lists from the first set are duplicated. It probably has something to do with clearing the lists after appending, or with where the empty lists XX and YY are created. I have tried multiple variations, but I always get either the output above or one list per line instead of one list per set in the ordered dict.

Does anyone have any idea how to restructure this code so that I get the output shown in the ideal case?


Solution

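  • I simplified it slightly by not using a regular expression.

    The duplicates in your output appear because XX and YY are appended to dict_roi once per coordinate line rather than once per set, so a set that spans two lines is stored twice.

    Instead of a regex, the coordinates on each line are stored in a flat list named coords.
    Each x is at an even index and each y at an odd index, so slicing this list with a step of 2 gives you the x and y values separately.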
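
To make the even/odd slicing concrete, here is a minimal sketch using the first three coordinate pairs from the sample. The names xs and ys are illustrative only.

```python
# Flattened coordinates from one line: x0, y0, x1, y1, x2, y2
coords = [104326.2, 38323.8, 104309.6, 38307.2, 104286.3, 38287.3]

xs = coords[0::2]  # elements at even indices (0, 2, 4) are the x values
ys = coords[1::2]  # elements at odd indices (1, 3, 5) are the y values

print(xs)  # [104326.2, 104309.6, 104286.3]
print(ys)  # [38323.8, 38307.2, 38287.3]
```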

    from collections import OrderedDict
    
    input_text = '''Freehand    green   2   2   0,0 289618  .   
    .   104326.2,38323.8    104309.6,38307.2    104286.3,38287.3    104269.6,38270.6    104256.3,38254.0
    .   104239.7,38237.4    104223.0,38220.7    104209.7,38204.1    104193.1,38194.1    104176.4,38187.5
    
    Freehand    green   2   3   0,0 63980   .   
    .   99803.4,37296.2 99826.7,37306.2 99843.3,37312.8 99860.0,37316.2 99876.6,37322.8'''
    
    
    dict_roi = OrderedDict([('title', []),
                            ('X', []),
                            ('Y', [])])
    
    lines = input_text.split('\n')
    
    Xs = []
    Ys = []
    
    for line in lines:
    
        # When a line contains a title
        if line.startswith('Freehand'):
            dict_roi['title'].append(line)
    
            if Xs and Ys:
                dict_roi['X'].append(Xs)
                dict_roi['Y'].append(Ys)
                Xs = []
                Ys = []
    
        # When a line is empty
        elif not line:
            continue
    
        # When a line contains coordinates
        else:
            line = line.replace('\n', '')
            line = line.replace('\t', ',')
            line = line.replace(' ', ',')
            coords = line.split(',')
            coords = [e for e in coords if e != '.' and e]
            coords = [float(c) for c in coords]
    
            # Xs are at even indices, Ys at odd indices
            Xs += coords[0::2]
            Ys += coords[1::2]
    
    dict_roi['X'].append(Xs)
    dict_roi['Y'].append(Ys)
    
    print(dict_roi)
    

    Output:

    OrderedDict([('title', ['Freehand    green   2   2   0,0 289618  .   ', 'Freehand    green   2   3   0,0 63980   .   ']),
                 ('X', [[104326.2, 104309.6, 104286.3, 104269.6, 104256.3, 104239.7, 104223.0, 104209.7, 104193.1, 104176.4], [99803.4, 99826.7, 99843.3, 99860.0, 99876.6]]),
                 ('Y', [[38323.8, 38307.2, 38287.3, 38270.6, 38254.0, 38237.4, 38220.7, 38204.1, 38194.1, 38187.5], [37296.2, 37306.2, 37312.8, 37316.2, 37322.8]])])
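
Since the question reads from a file rather than an inline string, the same idea can be wrapped in a function. This is only a sketch under stated assumptions: the name parse_rois is hypothetical, every title line is assumed to start with 'Freehand', and every coordinate line is assumed to start with '.' followed by whitespace, as in the sample. It creates a fresh pair of lists as soon as a title line is seen and fills them in place, so each set contributes exactly one list regardless of how many coordinate lines it spans.

```python
from collections import OrderedDict


def parse_rois(lines):
    """Collect titles and per-set X/Y lists from an iterable of lines.

    Hypothetical helper; assumes the 'Freehand' / '.' line format from the sample.
    """
    dict_roi = OrderedDict([('title', []), ('X', []), ('Y', [])])

    for line in lines:
        line = line.rstrip('\n')

        if line.startswith('Freehand'):
            # New set: record the title and start fresh lists for it.
            dict_roi['title'].append(line)
            dict_roi['X'].append([])
            dict_roi['Y'].append([])

        elif line.startswith('.') and dict_roi['X']:
            # Coordinate line: drop the leading '.', split on any whitespace.
            for pair in line[1:].split():
                x, y = pair.split(',')
                dict_roi['X'][-1].append(float(x))
                dict_roi['Y'][-1].append(float(y))

    return dict_roi


# Shortened sample data (tab-separated, as described in the question).
sample = [
    'Freehand\tgreen\t2\t2\t0,0\t289618\t.\t\n',
    '.\t104326.2,38323.8\t104309.6,38307.2\n',
    '.\t104286.3,38287.3\n',
    '\n',
    'Freehand\tgreen\t2\t3\t0,0\t63980\t.\t\n',
    '.\t99803.4,37296.2\t99826.7,37306.2\n',
]
result = parse_rois(sample)
print(result['X'])  # [[104326.2, 104309.6, 104286.3], [99803.4, 99826.7]]
print(result['Y'])  # [[38323.8, 38307.2, 38287.3], [37296.2, 37306.2]]
```

With a real file, the open file object can be passed directly, for example: with open(elements_file) as f: dict_roi = parse_rois(f).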