
Python: Each chunk of data consists of a variable number of lines; how do I determine the total number of chunks?


I have a very large file of data and each entry looks something like this:

    5 (this can be any number, call this line n)
Line 1
Line 2
Line 3
n lines, in this case 5, i.e. lines 4 - 8
Line 9
n lines, in this case again 5, i.e. lines 10-14
Line 15

Essentially, each entry starts with one line, followed by 3 lines + n lines + 1 line + n lines + 1 line.

The number n is an integer, but it can vary from entry to entry. Is there a way to figure out how many data entries I have in this file?

I already have code that loops over each entry if I know how many entries there are, but is there a way to figure out the number of entries in the first place?

Thanks!

edit: Here are two examples of a sample entry -

    5
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
A       -0.005364798      -0.022912843       0.017346957
B        0.527031905       0.603310150       0.560736787
B       -0.629466850      -0.628385741       0.628048126
B       -0.649090857       0.603667874      -0.726135880
B        0.683741908      -0.584386774      -0.700569743
    -17.862057
  -2.022841336      -1.477407454      -5.606136767
   2.521789668       2.889251770       2.572440406
  -0.401914888      -0.722582908       0.244151982
   0.806040926      -0.990697574       1.474733506
  -0.903074369       0.301436166       1.314862295
      0.016462

     7
 10.0 0.0 0.0
 0.0 10.0 0.0
 0.0 0.0 10.0
 A       -0.591644968      -0.645755982      -0.014245979
 B        1.198655655      -0.588872080      -0.025169784
 B       -1.460774580      -1.255848596       0.025804796
 B        0.321839745       2.199107994       0.050450166
 C        0.617684720      -1.389588077      -0.075897238
 C        0.493712792       1.349385956      -0.004249822
 D       -0.808145644       0.577304796       0.014326943
    -26.435922
   1.649465696      -2.945456091      -0.152209323
   0.531241391      -1.113956273      -0.135548573
  -0.529287352      -0.556746737      -0.061346528
  -2.152476371       6.326868481       0.441458459
  -1.633473432       3.325310912       0.291306019
   0.726490986      -8.268565793      -0.512575180
   1.408090505       3.232545501       0.128915126
      0.155658

The first number, an integer (5 or 7 in these examples), determines the number of lines that follow these three lines:

 10.0 0.0 0.0
 0.0 10.0 0.0
 0.0 0.0 10.0

It also determines the number of lines that follow the next single-value line, which in the first entry is -17.862057.

Each entry looks something like this. Basically, the goal would be to figure out how many entries there are total, utilizing the fact that the first integer gives an idea of how many total lines follow for the rest of the entry.


Solution

  • I've written this code to work with your given example. It doesn't know at the start how many entries there are, but it just keeps reading from the file until the file is exhausted, in order to pull each entry. I've saved your sample input in input.txt. I've now also modified the code to read the data in as floats.

    import pprint
    import functools
    
    #helper function for reading multiple lines
    def read_n(in_file, n):
        return [in_file.readline() for _ in range(n)]
    
    #read one line of floats
    def read_floats(line):
        return list(map(float, line.split()))
    
    #reads several lines of floats
    def float_lines(lines):
        return [read_floats(line) for line in lines]
    
    def parse_entry(in_file):
        #get n
        n = in_file.readline().strip()
    
        if n:
            n = int(n)
    
            #read 3, n, 1, n, 1 lines
            head = float_lines(read_n(in_file, 3))
            head_data = [(line[0], read_floats(line[1:])) for line in map(str.strip, read_n(in_file, n))]
            mid = float(in_file.readline().strip())
            tail_data = float_lines(read_n(in_file, n))
            tail = float(in_file.readline().strip())
    
            #readline to eat the empty line between entries
            in_file.readline()
    
            return n, head, head_data, mid, tail_data, tail
    
        #an empty n means the file is exhausted; None tells iter() to stop
        return None
    
    with open("input.txt", "r") as input_file:
        #apply parse_entry until it stops returning
        entries = list(iter(functools.partial(parse_entry, input_file), None))
    
    print(len(entries))
    pprint.pprint(entries)
    

    Which outputs:

    2
    [(5,
      [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]],
      [('A', [-0.005364798, -0.022912843, 0.017346957]),
       ('B', [0.527031905, 0.60331015, 0.560736787]),
       ('B', [-0.62946685, -0.628385741, 0.628048126]),
       ('B', [-0.649090857, 0.603667874, -0.72613588]),
       ('B', [0.683741908, -0.584386774, -0.700569743])],
      -17.862057,
      [[-2.022841336, -1.477407454, -5.606136767],
       [2.521789668, 2.88925177, 2.572440406],
       [-0.401914888, -0.722582908, 0.244151982],
       [0.806040926, -0.990697574, 1.474733506],
       [-0.903074369, 0.301436166, 1.314862295]],
      0.016462),
     (7,
      [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]],
      [('A', [-0.591644968, -0.645755982, -0.014245979]),
       ('B', [1.198655655, -0.58887208, -0.025169784]),
       ('B', [-1.46077458, -1.255848596, 0.025804796]),
       ('B', [0.321839745, 2.199107994, 0.050450166]),
       ('C', [0.61768472, -1.389588077, -0.075897238]),
       ('C', [0.493712792, 1.349385956, -0.004249822]),
       ('D', [-0.808145644, 0.577304796, 0.014326943])],
      -26.435922,
      [[1.649465696, -2.945456091, -0.152209323],
       [0.531241391, -1.113956273, -0.135548573],
       [-0.529287352, -0.556746737, -0.061346528],
       [-2.152476371, 6.326868481, 0.441458459],
       [-1.633473432, 3.325310912, 0.291306019],
       [0.726490986, -8.268565793, -0.51257518],
       [1.408090505, 3.232545501, 0.128915126]],
      0.155658)]
    

    This shows that the code found 2 entries and parsed their values as floats. I'm not entirely sure what the values represent, so I've kept the names generic. I've also preserved as much of each entry as I can in the list-and-tuple structure, because I'm not sure which parts are relevant, so the original file should be almost reconstructable from the entries in memory.
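    If the generic tuple positions become hard to follow, one option is to give the fields names. This is only a sketch: the class name Entry is hypothetical, the field names simply mirror the variables used in parse_entry, and the values below are made up for the demonstration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    # Field names mirror the variables in parse_entry; their real meaning is unknown.
    n: int
    head: list
    head_data: list
    mid: float
    tail_data: list
    tail: float


# Made-up values, only to show attribute access and unpacking.
entry = Entry(1, [[10.0, 0.0, 0.0]], [("A", [0.0, 0.0, 0.0])], -1.0, [[0.0, 0.0, 0.0]], 0.5)

print(entry.mid)        # access by name instead of by position
n, head, *rest = entry  # still unpacks like a plain tuple
print(n, len(rest))
```

    Because a NamedTuple is still a tuple, it could be built from the existing return value with Entry(*parse_entry(input_file)), provided the result is not None.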

    Regarding the lines starting with a character: each line is first passed through str.strip, because sometimes there is a space before the character. The line is then separated into line[0], the character, and line[1:], the rest of the string holding the data, which is parsed as normal. This assumes the label is always a single character.

    More on how I separate the characters from the floats:

    Take the following line:

     A       -0.005364798      -0.022912843       0.017346957
    

    This will be parsed by:

    head_data = [(line[0], read_floats(line[1:])) for line in map(str.strip, read_n(in_file, n))]
    

    However, if we're considering only this line, we can look at a smaller part of the expression. The first thing applied to the line is str.strip, from map(str.strip, ...). This removes leading and trailing whitespace so that the first character is guaranteed to be the letter. The line in memory is now:

    "A       -0.005364798      -0.022912843       0.017346957"
    

    The line is then separated into line[0] and read_floats(line[1:]). This is where the distinction between the label and the floats is made: the first character is split off, and the rest of the string is passed to read_floats. This uses slice notation, Python's syntax for taking a subsequence of a sequence such as a string or list. The slice 1: means 'from index 1 to the end of the string'. For clarity:

    line[0] == "A"
    line[1:] == "       -0.005364798      -0.022912843       0.017346957"
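    If the labels could ever be longer than one character (an assumption; the sample only shows single letters), splitting on whitespace is a possible alternative to slicing. A minimal sketch, where parse_labelled is a hypothetical helper name:

```python
def parse_labelled(line):
    # Split on any run of whitespace: the first token is the label,
    # the remaining tokens are the numeric columns.
    label, *values = line.split()
    return label, [float(v) for v in values]


print(parse_labelled(" A       -0.005364798      -0.022912843       0.017346957"))
# ('A', [-0.005364798, -0.022912843, 0.017346957])
```

    Since split() ignores leading and trailing whitespace, no separate str.strip step is needed in this variant.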
    

    for _ is a Python convention for repeating something without keeping track of which repetition it is, i.e. the code reads one line for each number in range(n), so it reads n lines, but it doesn't need to know the index of the current line. It could just as well say for i in range(n), except i would be unused, so the loop variable is named _ to indicate that its value is ignored.
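    To see read_n in isolation, here is a small sketch that uses io.StringIO as a stand-in for a real file (the sample text is made up). It also shows that asking for more lines than remain simply yields empty strings:

```python
import io


def read_n(in_file, n):
    # The loop variable is never used, so it is named _
    return [in_file.readline() for _ in range(n)]


sample = io.StringIO("a\nb\nc\nd\n")
first_three = read_n(sample, 3)
next_three = read_n(sample, 3)

print(first_three)  # ['a\n', 'b\n', 'c\n']
print(next_three)   # ['d\n', '', '']
```

    The second call suggests that a truncated entry would not raise an error in read_n itself; any failure would surface later, when an empty string is converted to a float.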

    if n: checks that the string n is not empty. When you call readline() on a file that has been exhausted, it returns an empty string, so instead of crashing at the end of the file, the program simply stops parsing entries. Since we don't know the number of entries in advance, we keep trying to read an n until there is none left to read. Note that a whitespace-only line also strips to an empty string, so a stray extra blank line between entries would stop parsing early.
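    The difference between a blank line and the end of the file can be checked directly. A short sketch, again using io.StringIO with made-up content in place of a real file:

```python
import io

f = io.StringIO("5\n\n")
first = f.readline()   # '5\n'
blank = f.readline()   # '\n': a blank line still contains its newline
at_eof = f.readline()  # '': nothing left to read

print(repr(first), repr(blank), repr(at_eof))
# After strip(), a blank line and end-of-file are indistinguishable.
print(blank.strip() == at_eof.strip() == "")
```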

    Regarding why the entries line looks so convoluted: parse_entry(input_file) only parses a single entry, and the rest is needed to parse all of them. functools.partial(parse_entry, input_file) creates a new function that takes no arguments and calls parse_entry(input_file). iter then keeps calling that function until it returns None. This is quite a useful trick: the two-argument form of iter takes any callable that needs no arguments plus a sentinel value, and yields the callable's return values until one equals the sentinel. A simpler, more commonly seen example is iter(sys.stdin.readline, "a\n"), which keeps reading lines from stdin until it hits a line containing only a.
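    The same partial-plus-iter pattern on a much smaller problem may make it easier to follow. In this sketch, next_number is a hypothetical stand-in for parse_entry, and io.StringIO replaces the real file:

```python
import functools
import io


def next_number(in_file):
    # Return the next line as an int, or None once the input is exhausted.
    line = in_file.readline().strip()
    if line:
        return int(line)
    return None


sample = io.StringIO("1\n2\n3\n")
reader = functools.partial(next_number, sample)  # reader() calls next_number(sample)
numbers = list(iter(reader, None))               # call reader() until it returns None

print(numbers)  # [1, 2, 3]
```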

    On tuples and tuple unpacking - you could do this:

    for n, head, head_data, mid, tail_data, tail in entries:
        print("n is {}".format(n))
        print("the first item of head_data is {}".format(head_data[0]))
    
        for i in tail_data:
            print("tail data item: {}".format(i))
    

    This results in the output:

    n is 5
    the first item of head_data is ('A', [-0.005364798, -0.022912843, 0.017346957])
    tail data item: [-2.022841336, -1.477407454, -5.606136767]
    tail data item: [2.521789668, 2.88925177, 2.572440406]
    tail data item: [-0.401914888, -0.722582908, 0.244151982]
    tail data item: [0.806040926, -0.990697574, 1.474733506]
    tail data item: [-0.903074369, 0.301436166, 1.314862295]
    n is 7
    the first item of head_data is ('A', [-0.591644968, -0.645755982, -0.014245979])
    tail data item: [1.649465696, -2.945456091, -0.152209323]
    tail data item: [0.531241391, -1.113956273, -0.135548573]
    tail data item: [-0.529287352, -0.556746737, -0.061346528]
    tail data item: [-2.152476371, 6.326868481, 0.441458459]
    tail data item: [-1.633473432, 3.325310912, 0.291306019]
    tail data item: [0.726490986, -8.268565793, -0.51257518]
    tail data item: [1.408090505, 3.232545501, 0.128915126]
    

    Hopefully this demonstrates how you might go about making use of the structure.
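    If only the number of entries is needed and the file is very large, it may be enough to read each n and skip the rest of the entry without parsing or storing anything. This is a sketch under the assumptions that every entry is complete and follows the 3 + n + 1 + n + 1 layout described in the question; count_entries and make_entry are hypothetical names, and the demo data is fabricated.

```python
import io


def count_entries(in_file):
    """Count entries by reading n and skipping the 3 + n + 1 + n + 1 lines after it."""
    count = 0
    for line in in_file:
        stripped = line.strip()
        if not stripped:
            continue  # tolerate blank lines between entries
        n = int(stripped)
        # Skip the rest of this entry without parsing it.
        for _ in range(3 + n + 1 + n + 1):
            next(in_file, None)
        count += 1
    return count


def make_entry(n):
    # Fabricates a structurally valid entry, purely for the demonstration.
    lines = ([str(n)] + ["0.0 0.0 0.0"] * 3 + ["A 0.0 0.0 0.0"] * n
             + ["-1.0"] + ["0.0 0.0 0.0"] * n + ["0.5"])
    return "\n".join(lines) + "\n"


sample = io.StringIO(make_entry(1) + "\n" + make_entry(2) + "\n")
print(count_entries(sample))  # 2
```

    With a real file this would be called as count_entries(input_file) inside the same with open("input.txt", "r") block used above. Only one line is held in memory at a time, and unlike the parsing version it skips extra blank lines instead of stopping at them.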