I have a very large file of data and each entry looks something like this:
5 (this can be any number, call this line n)
Line 1
Line 2
Line 3
n lines, in this case 5, i.e. lines 4 - 8
Line 9
n lines, in this case again 5, i.e. lines 10-14
Line 15
Essentially, each entry starts with one line, followed by 3 lines + n lines + 1 line + n lines + 1 line.
The number n is an integer, but it can vary from entry to entry. Is there a way to figure out how many data entries I have in this file?
I already have code that loops over each entry once I know how many entries there are, but is there a way to figure out the number of entries in the first place?
Thanks!
Edit: here are two sample entries:
5
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
A -0.005364798 -0.022912843 0.017346957
B 0.527031905 0.603310150 0.560736787
B -0.629466850 -0.628385741 0.628048126
B -0.649090857 0.603667874 -0.726135880
B 0.683741908 -0.584386774 -0.700569743
-17.862057
-2.022841336 -1.477407454 -5.606136767
2.521789668 2.889251770 2.572440406
-0.401914888 -0.722582908 0.244151982
0.806040926 -0.990697574 1.474733506
-0.903074369 0.301436166 1.314862295
0.016462
7
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
A -0.591644968 -0.645755982 -0.014245979
B 1.198655655 -0.588872080 -0.025169784
B -1.460774580 -1.255848596 0.025804796
B 0.321839745 2.199107994 0.050450166
C 0.617684720 -1.389588077 -0.075897238
C 0.493712792 1.349385956 -0.004249822
D -0.808145644 0.577304796 0.014326943
-26.435922
1.649465696 -2.945456091 -0.152209323
0.531241391 -1.113956273 -0.135548573
-0.529287352 -0.556746737 -0.061346528
-2.152476371 6.326868481 0.441458459
-1.633473432 3.325310912 0.291306019
0.726490986 -8.268565793 -0.512575180
1.408090505 3.232545501 0.128915126
0.155658
The first number, an integer (5 or 7 in these examples), determines the number of lines that follow these three lines:
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
It also determines the number of lines that follow the single line after that block, which in the first example is -17.862057.
Every entry has this layout. The goal is to figure out how many entries there are in total, using the fact that the first integer tells me how many lines the rest of the entry contains.
I've written this code to work with your given example. It doesn't know at the start how many entries there are; it just keeps reading from the file until the file is exhausted, pulling out one entry at a time. I've saved your sample input in input.txt. I've now also modified the code to read the data in as floats.
import pprint
import functools

# helper function for reading multiple lines
def read_n(in_file, n):
    return [in_file.readline() for _ in range(n)]

# read one line of floats
def read_floats(line):
    return list(map(float, line.split()))

# reads several lines of floats
def float_lines(lines):
    return [read_floats(line) for line in lines]

def parse_entry(in_file):
    # get n; an empty string here means the file is exhausted
    n = in_file.readline().strip()
    if n:
        n = int(n)
        # read 3, n, 1, n, 1 lines
        head = float_lines(read_n(in_file, 3))
        head_data = [(line[0], read_floats(line[1:])) for line in map(str.strip, read_n(in_file, n))]
        mid = float(in_file.readline().strip())
        tail_data = float_lines(read_n(in_file, n))
        tail = float(in_file.readline().strip())
        # readline to eat the empty line between entries
        # (this assumes exactly one blank line separates entries, as in input.txt)
        in_file.readline()
        return n, head, head_data, mid, tail_data, tail
    # no n could be read: fall through and return None

with open("input.txt", "r") as input_file:
    # apply parse_entry until it returns None
    entries = list(iter(functools.partial(parse_entry, input_file), None))

print(len(entries))
pprint.pprint(entries)
Which outputs:
2
[(5,
[[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]],
[('A', [-0.005364798, -0.022912843, 0.017346957]),
('B', [0.527031905, 0.60331015, 0.560736787]),
('B', [-0.62946685, -0.628385741, 0.628048126]),
('B', [-0.649090857, 0.603667874, -0.72613588]),
('B', [0.683741908, -0.584386774, -0.700569743])],
-17.862057,
[[-2.022841336, -1.477407454, -5.606136767],
[2.521789668, 2.88925177, 2.572440406],
[-0.401914888, -0.722582908, 0.244151982],
[0.806040926, -0.990697574, 1.474733506],
[-0.903074369, 0.301436166, 1.314862295]],
0.016462),
(7,
[[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]],
[('A', [-0.591644968, -0.645755982, -0.014245979]),
('B', [1.198655655, -0.58887208, -0.025169784]),
('B', [-1.46077458, -1.255848596, 0.025804796]),
('B', [0.321839745, 2.199107994, 0.050450166]),
('C', [0.61768472, -1.389588077, -0.075897238]),
('C', [0.493712792, 1.349385956, -0.004249822]),
('D', [-0.808145644, 0.577304796, 0.014326943])],
-26.435922,
[[1.649465696, -2.945456091, -0.152209323],
[0.531241391, -1.113956273, -0.135548573],
[-0.529287352, -0.556746737, -0.061346528],
[-2.152476371, 6.326868481, 0.441458459],
[-1.633473432, 3.325310912, 0.291306019],
[0.726490986, -8.268565793, -0.51257518],
[1.408090505, 3.232545501, 0.128915126]],
0.155658)]
This shows that it has found 2 entries and parsed them as floats. I'm not entirely sure what the entries represent, so I've kept the names generic. Note that I've preserved as much of each entry as I can in the list-and-tuple structure, because I'm not sure which bits are relevant either, so the original file should be almost reconstructable from the entries in memory.
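If all you need is the number of entries, you don't have to keep the parsed data in memory at all, which may matter for a very large file. Here is a minimal sketch of that idea. The name count_entries and the tiny sample are made up for illustration; it assumes the layout described in the question (n, then 3 + n + 1 + n + 1 lines) and skips blank lines between entries whether or not they are present.

```python
import io
from itertools import islice


def count_entries(in_file):
    """Count entries by reading each n and skipping the rest of the entry."""
    count = 0
    for line in in_file:
        line = line.strip()
        if not line:
            # tolerate blank separator lines between entries
            continue
        n = int(line)
        body_len = 3 + n + 1 + n + 1
        # consume the body lines without parsing them
        skipped = sum(1 for _ in islice(in_file, body_len))
        if skipped != body_len:
            raise ValueError("file ends in the middle of an entry")
        count += 1
    return count


# tiny made-up sample: two entries with n = 1 and n = 2
sample = io.StringIO(
    "1\n1 0 0\n0 1 0\n0 0 1\nA 0 0 0\n-1.0\n0 0 0\n0.5\n"
    "\n"
    "2\n1 0 0\n0 1 0\n0 0 1\nA 0 0 0\nB 1 1 1\n-2.0\n0 0 0\n1 1 1\n0.7\n"
)
print(count_entries(sample))
```

With a real file you would pass the open file object instead of the StringIO, e.g. inside a with open("input.txt") block.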
Regarding the lines starting with a character: this is handled by first applying str.strip to the line, as sometimes there is a space before the character. The line is then separated into line[0] and line[1:], which are the character and a slice of the string holding the data; the latter is then operated on as normal.
More on how I separate the characters from the floats:
Take the following line:
A -0.005364798 -0.022912843 0.017346957
This will be parsed by:
head_data = [(line[0], read_floats(line[1:])) for line in map(str.strip, read_n(in_file, n))]
However, if we're considering only this line, we can look at less of the expression. The first thing that happens to the line is str.strip, from map(str.strip, ...). This strips any leading and trailing whitespace (including the newline) to ensure the first character is the letter to be removed. This means the state of the line in memory is now:
"A -0.005364798 -0.022912843 0.017346957"
The line is then separated into line[0] and read_floats(line[1:]). This is where the distinction between the string and the floats is made: the letter is separated from the rest of the string, which is then passed to read_floats. This uses slice notation, a powerful syntax Python has for taking parts of sequences such as strings and lists. The slice 1: means 'slice from index 1 to the end of the string'. For clarity:
line[0] == "A"
line[1:] == " -0.005364798 -0.022912843 0.017346957"
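To make the strip-then-slice step concrete, here is a small sketch that applies it to one line by hand. The variable names are mine, and the second half shows an alternative using split(), which would also cope with labels longer than one character (an assumption about data I haven't seen, so treat it as optional).

```python
line = " A -0.005364798 -0.022912843 0.017346957\n"

# strip, then slice: first character is the label, the rest are floats
stripped = line.strip()
label = stripped[0]
values = [float(x) for x in stripped[1:].split()]
print(label, values)

# alternative: split on whitespace and unpack the first field as the label
label2, *fields = line.split()
values2 = [float(x) for x in fields]
print(label2, values2)
```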
for _ is a Python convention for when you just need to repeat something without keeping track of which repetition it is, i.e. it reads a line for each number in range(n), so it reads n lines, but it doesn't need to know which number the current line is. It could just as well say for i in range(n), except i would be unused, so the loop variable is called _ to indicate you don't need it.
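Here is read_n on its own, run against an in-memory file so you can see that it consumes exactly n lines and leaves the rest for the next read. The io.StringIO stand-in and the sample text are just for illustration.

```python
import io


def read_n(in_file, n):
    # the loop variable is never used, hence the name _
    return [in_file.readline() for _ in range(n)]


demo = io.StringIO("first\nsecond\nthird\nfourth\n")
first_two = read_n(demo, 2)
rest = demo.readline()
print(first_two)
print(repr(rest))
```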
if n: checks that the string n is not empty. This matters because calling readline() on a file that has been exhausted returns an empty string. So instead of crashing when it's done with the file, the function returns None and the program neatly stops parsing entries. Since we don't know the number of entries in advance, we keep trying to read an n until we can no longer read one, which is why the if statement is needed.
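A quick sketch of what readline() returns near the end of a file, again using an in-memory stand-in. One thing worth noticing: a blank line is "\n", which is truthy, but after strip() it becomes an empty string as well, so in the code above a blank line where an n is expected would look the same as the end of the file.

```python
import io

demo = io.StringIO("5\n\n")
a = demo.readline()  # the line holding n
b = demo.readline()  # a blank line
c = demo.readline()  # nothing left to read

print(repr(a), repr(b), repr(c))
print(bool(b), bool(b.strip()), bool(c))
```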
Regarding why entries looks so convoluted: parse_entry(input_file) would only parse a single entry, and the rest is what's needed to parse all of them. functools.partial(parse_entry, input_file) builds a new function that takes no arguments and calls parse_entry(input_file) each time it is called. iter then keeps calling that function until it returns None. This is quite a useful trick: iter can be given any zero-argument function plus a value to stop at, and it will keep yielding the function's return values until it hits the 'stop' value. A simpler, more often seen example might be iter(sys.stdin.readline, "a\n"). This would keep reading lines from stdin until it hit a line containing only a.
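The same iter-plus-partial pattern on a much smaller function may make it easier to see what is going on. next_number is a made-up stand-in for parse_entry: it returns an int per line and, like parse_entry, returns None once the file is exhausted.

```python
import functools
import io


def next_number(in_file):
    line = in_file.readline().strip()
    if line:
        return int(line)
    # no line could be read: fall through and return None


demo = io.StringIO("3\n1\n4\n")
# partial fixes the argument; iter calls the result until it returns None
numbers = list(iter(functools.partial(next_number, demo), None))
print(numbers)
```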
On tuples and tuple unpacking - you could do this:
for n, head, head_data, mid, tail_data, tail in entries:
    print("n is {}".format(n))
    print("the first item of head_data is {}".format(head_data[0]))
    for i in tail_data:
        print("tail data item: {}".format(i))
This results in the output:
n is 5
the first item of head_data is ('A', [-0.005364798, -0.022912843, 0.017346957])
tail data item: [-2.022841336, -1.477407454, -5.606136767]
tail data item: [2.521789668, 2.88925177, 2.572440406]
tail data item: [-0.401914888, -0.722582908, 0.244151982]
tail data item: [0.806040926, -0.990697574, 1.474733506]
tail data item: [-0.903074369, 0.301436166, 1.314862295]
n is 7
the first item of head_data is ('A', [-0.591644968, -0.645755982, -0.014245979])
tail data item: [1.649465696, -2.945456091, -0.152209323]
tail data item: [0.531241391, -1.113956273, -0.135548573]
tail data item: [-0.529287352, -0.556746737, -0.061346528]
tail data item: [-2.152476371, 6.326868481, 0.441458459]
tail data item: [-1.633473432, 3.325310912, 0.291306019]
tail data item: [0.726490986, -8.268565793, -0.51257518]
tail data item: [1.408090505, 3.232545501, 0.128915126]
Hopefully this demonstrates how you might go about making use of the structure.
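If positional unpacking gets unwieldy, one option is to wrap each tuple in a namedtuple so the fields can be accessed by name. This is only a sketch: Entry is a name I've made up, the field names simply mirror the generic ones used above, and the single entry here is invented data in the same shape.

```python
from collections import namedtuple

# field names mirror the order returned by parse_entry
Entry = namedtuple("Entry", ["n", "head", "head_data", "mid", "tail_data", "tail"])

# one made-up entry with n = 1, shaped like the parsed tuples
raw = (
    1,
    [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]],
    [("A", [0.1, 0.2, 0.3])],
    -1.5,
    [[0.0, 0.0, 0.0]],
    0.25,
)

entry = Entry(*raw)
print(entry.n, entry.head_data[0][0], entry.tail)

# a namedtuple is still a tuple, so plain unpacking keeps working
n, head, head_data, mid, tail_data, tail = entry
```

With the real data you could build the list with [Entry(*e) for e in entries].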