python, regex, perl, seek

seek to regex in a large file using python


I am trying to seek to the token ':path,' in a file, then read all the digits that follow it (there may be any number of them) as an integer (so for ':path,123' I seek to just past the ',' and read the integer 123). Then I read the chars between the current seek position and pos+123 (storing them in a list or whatever). Then I seek to the next match for ':path,' and repeat the process.

I would like a function a bit like:

def fregseek(FILE, current_seek, regex):
    ...
    value_found = ?  # result of reading next N chars after :path,[0-9]+
    ...
    return next_start_seek, value_found

There may be any number of matches for ':path,' in a line, and the string ':path,' may even occur within the N payload chars specified after the ','. I have written a messy bunch of rubbish which reads in each line, then for each line chomps off the first N chars indicated by the match, then continues processing the string until it is all eaten up, then reads the next line, and so on.

This is horrible. I do not want to have to slurp up all the lines of a potentially huge file when all I really need to do is seek (especially since newlines are irrelevant here, so an extra processing step just because lines are easy to pull from files is ridiculous).

So, there it is, that is my problem that I would like to solve. I need to seek to a match, read a value, continue from the end of that value looking for the next match and so on until the file is exhausted.

If anybody can help me with this I will be happy to hear from them :)

I would like to avoid non-standard libraries if possible. I would also like the shortest code, but that is the least of my concerns (speed and memory consumption are the important factors; I just don't want 50 extra lines of code to bootstrap some library with a small function in it that I could rip out, if only I knew what it was).

I would prefer Python code; however, if Perl beats Python in this regard I will use Perl instead. I am also open to clever sed/awk/bash scripts etc., as long as they are not horribly slower.

Thanks very much in advance.


Solution

  • If you don't need regexes, you can do this with just find and slicing.

    Either way, the trivial solution is to read the whole file into memory, and find and slice the resulting str/bytes object.

    But that doesn't work if you can't (or don't want to) read the whole file into memory.

    Fortunately, if you can count on the fact that your files are << 2GB or you only need to work in 64-bit Python, and you're on a reasonable platform (POSIX, modern Windows, etc.), you can mmap the file into memory instead. The mmap object has a subset of the same methods that strings have, so you can just pretend you have a string, just as if you'd read the whole file into memory, but you can count on the Python implementation and OS to make it just work with reasonable efficiency.
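For instance, here's a minimal sketch of that string-like interface (the file name is made up purely for the demo):

```python
import mmap

# write a tiny sample file (hypothetical name, just for demonstration)
with open('sample.bin', 'wb') as f:
    f.write(b'BLAH:path,3abc')

with open('sample.bin', 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # the mmap object supports bytes-style methods: find, slicing, len()
    pos = m.find(b':path,')   # bytes pattern, since the file is binary
    token = m[pos:pos + 6]    # slicing copies out an ordinary bytes object
    m.close()
```

Because slices copy out plain bytes objects, they remain valid even after the map is closed.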

    Depending on your version of Python, re may not be able to scan an mmap as if it were a string, it may work but be slow, or it may work just fine. So, you might as well try that first, and if it doesn't throw an exception or go much slower than you expected, you're done:

    import mmap
    import re

    def findpaths(fname):
        with open(fname, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # the file is mapped as binary, so the pattern must be bytes
            for match in re.finditer(rb':path,([0-9]+)', m):
                yield m[match.end():match.end() + int(match.group(1))]
    

    (This is the same as BrtH's answer, just using an mmap instead of a string, and restructured into a generator instead of a list; of course you could do the latter to his version too, just by replacing his square brackets with parentheses.) Note that this version does not skip over the payload chars, so a ':path,' that happens to occur inside a payload will wrongly start a new match.
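For instance, run against the test data from later in this answer (the file name is made up, and the generator is repeated here so the snippet stands alone):

```python
import mmap
import re

def findpaths(fname):
    # same generator as above, with a bytes pattern for the binary mmap
    with open(fname, 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for match in re.finditer(rb':path,([0-9]+)', m):
            yield m[match.end():match.end() + int(match.group(1))]

# hypothetical file name, filled with the test data used below
with open('paths.txt', 'wb') as f:
    f.write(b'BLAH:path,3abcBLAH:path,10abcdefghijklmn'
            b'BLAH:path,3abc:path,0:path,3abc')

results = list(findpaths('paths.txt'))
# -> [b'abc', b'abcdefghij', b'abc', b'', b'abc']
```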

    If you're using an older (or non-CPython?) version of Python whose re can't (efficiently) search an mmap, it's a bit more complicated:

    import mmap

    def nextdigits(m, start):
        # take bytes from `start` for as long as they are ASCII digits
        end = start
        while end < len(m) and m[end:end + 1].isdigit():
            end += 1
        return m[start:end]

    def findpaths(fname):
        with open(fname, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            i = 0
            while True:
                n = m.find(b':path,', i)         # mmap.find wants bytes
                if n == -1:
                    return
                countstr = nextdigits(m, n + 6)  # digits right after ':path,'
                count = int(countstr)            # int() accepts digit bytes
                n += 6 + len(countstr)           # n is now the payload start
                yield m[n:n + count]
                i = n + count                    # resume just past the payload
    

    This probably isn't the fastest way to write the nextdigits function. I'm not sure that will actually matter (time it and see), but if it does, another possibility is to slice out m[n+6:n+A_BIG_ENOUGH_NUMBER] and run a regex over the slice, or… On the other hand, if that's your bottleneck, you might get far more benefit from switching to an implementation with a JIT (PyPy, Jython, or IronPython)…
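A sketch of that slice-and-regex variant (A_BIG_ENOUGH_NUMBER is whatever bound you know the digit run can't exceed; 20 here is an arbitrary assumption):

```python
import re

A_BIG_ENOUGH_NUMBER = 20  # assumed upper bound on the digit count

def nextdigits(m, start):
    # slice a bounded chunk out of the mmap (or any bytes-like object)
    # and let the regex engine grab the leading digit run; '*' means a
    # non-digit first byte simply gives back empty bytes
    return re.match(rb'[0-9]*', m[start:start + A_BIG_ENOUGH_NUMBER]).group(0)

# demonstrated on plain bytes; a slice of an mmap behaves the same way
digits = nextdigits(b':path,123abc', 6)
# -> b'123'
```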

    For my tests, I split things up: findpaths takes a string-like object, and the caller does the with open and mmap bits and just passes m into findpaths; I didn't do that here, purely for brevity.

    Anyway, I've tested both versions against the following data:

    BLAH:path,3abcBLAH:path,10abcdefghijklmnBLAH:path,3abc:path,0:path,3abc
    

    And the output was:

    abc
    abcdefghij
    abc
    
    abc
    

    (The blank line is the empty value from ':path,0'.) I think that's correct?

    If my earlier version caused it to spin at 100% CPU, my guess would be that I didn't increment i properly in the loop; that's the most common reason you get that behavior in a tight parsing loop. Anyway, if you can reproduce that with the current version, please post the data.