I'm writing a program to parse through some log files. If an error code is in a line, I need to print the previous 25 lines for analysis. I'd like to be able to repeat this with more or fewer lines depending on the individual error code (instead of 25 lines, 15 or 35).
with open(file, 'r') as input:
    for line in input:
        if "error code" in line:
            # print previous 25 lines
I know the equivalent command in Bash for what I need is grep "error code" -B 25 Filename | wc -l. I'm still new to Python and programming in general. I know I'm going to need a for loop, and I've tried using the range function, but I haven't had much luck because I don't know how to apply range to a file.
This is a perfect use case for a length-limited collections.deque:
from collections import deque

line_history = deque(maxlen=25)
with open(file) as input:
    for line in input:
        if "error code" in line:
            print(*line_history, line, sep='')
            # Clear history so if two errors are seen in close proximity,
            # we don't echo some lines twice
            line_history.clear()
        else:
            # When the deque reaches 25 lines, appending automatically
            # evicts the oldest line
            line_history.append(line)
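To address the follow-up about varying the context size per error code, one possible extension (the context_sizes mapping and the "app.log" filename are made up for illustration, not part of the original question) is to size the deque for the largest window and slice off only as much history as each code needs:

from collections import deque
from itertools import islice

# Hypothetical mapping: how many context lines each error code wants
context_sizes = {"error code 1": 15, "error code 2": 35}
max_context = max(context_sizes.values())

line_history = deque(maxlen=max_context)
with open("app.log") as input:  # placeholder filename
    for line in input:
        for code, size in context_sizes.items():
            if code in line:
                # Only echo the last `size` lines of history for this code
                start = max(len(line_history) - size, 0)
                print(*islice(line_history, start, None), line, sep='')
                line_history.clear()
                break
        else:
            # No error code matched; remember this line
            line_history.append(line)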
Complete explanation of why I chose this approach (skip if you don't really care):
This isn't solvable in a good/safe way using for/range, because indexing only makes sense if you load the whole file into memory; the file on disk has no idea where lines begin and end, so you can't just ask for "line #357 of the file" without reading it from the beginning to find lines 1 through 356. You'd either end up repeatedly rereading the file, or slurping the whole file into an in-memory sequence (e.g. a list or tuple) to make indexing meaningful.
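To make that concrete: even a tool like itertools.islice that looks index-based still has to consume every preceding line (the "app.log" filename below is a placeholder):

from itertools import islice

# Fetching "line 357" still reads lines 1-356 under the hood;
# islice just hides the scan, it doesn't avoid it.
with open("app.log") as f:
    line_357 = next(islice(f, 356, 357), None)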
For a log file, you have to assume it could be quite large (I regularly deal with multi-gigabyte log files), to the point where loading it into memory would exhaust main memory. So slurping is a bad idea, and rereading the file from scratch each time you hit an error is almost as bad (it's slow, but at least it's reliably slow?). The deque-based approach means your peak memory usage is based on the 27 longest lines in the file, rather than the total file size.
A naïve solution with nothing but built-ins could be as simple as:
with open(file) as input:
    lines = tuple(input)  # Slurps all lines from the file into memory

for i, line in enumerate(lines):
    if "error code" in line:
        print(*lines[max(i - 25, 0):i], line, sep='')
but like I said, this requires enough memory to hold your entire log file in memory at once, which is a bad thing to count on. It also repeats lines when two errors occur in close proximity, because unlike deque, you don't get an easy way to empty your recent memory; you'd have to manually track the index of the last print to restrict your slice (as sketched below).
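A minimal sketch of that bookkeeping, assuming the same slurped lines tuple (the last_printed variable is my own illustration, not part of the original suggestion):

with open(file) as input:
    lines = tuple(input)  # Still slurps the whole file

last_printed = -1  # Index of the last line we echoed
for i, line in enumerate(lines):
    if "error code" in line:
        # Start the slice after the last printed line so nothing repeats
        start = max(i - 25, last_printed + 1, 0)
        print(*lines[start:i], line, sep='')
        last_printed = i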
Note that even then, I didn't use range; range is a crutch a lot of people coming from C backgrounds rely on, but it's usually the wrong way to solve a problem in Python. In cases where an index is needed (it usually isn't), you usually need the value too, so enumerate-based solutions are superior; most of the time you don't need an index at all, so direct iteration (or paired iteration with zip or the like) is the correct solution.
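For anyone new to these idioms, a quick comparison with made-up lists just for demonstration:

words = ["alpha", "beta", "gamma"]
codes = [101, 102, 103]

# C-style indexing: works, but unidiomatic in Python
for i in range(len(words)):
    print(i, words[i])

# enumerate yields the index and the value together
for i, word in enumerate(words):
    print(i, word)

# zip pairs parallel sequences with no index at all
for word, code in zip(words, codes):
    print(word, code)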