I'm running into a bit of trouble with some code. Please bear in mind that I'm a terrible programmer, so my solution probably isn't very elegant (and is likely the reason why I'm running out of memory - I have 4 gigabytes of RAM and the script slowly fills it).
Here's the problem. I've got about 3,500 files in a directory. Each file consists of a single line with no spaces that can be relatively short or very long (the smallest file is 200 bytes; the largest is 1.3 megabytes). What I am trying to do is find common substrings of a set length (13 characters in the code below) between these files, two at a time. I compare them in pairs because I'm not looking for a substring common to all of the files, but for any shared substring of that length between each pair, continuing until every combination of two files has been checked.
I use a suffix tree module that wraps a C implementation (over here). First I build a list of all the files in the directory, then I take combinations of two so that every pairing is covered, passing each pair to the suffix tree and looking for sequences that are common substrings.
However, I don't really understand why the script is slowly running out of memory. I'm hoping there's an amendment we can make to the code so that it releases the memory it no longer needs. Obviously 3,500 files will take a long time to process, but I hope it can be done without incrementally filling 4 gigabytes of memory. Any help would be greatly appreciated! Here is the code I've got so far:
from suffix_tree import GeneralisedSuffixTree
from itertools import combinations
import glob, hashlib, os

alist = open('tmp.adjlist', 'w')

def read_f(f):
    # each file holds a single line, so read it in as one string
    fh = open(f, "r")
    s = fh.read()
    fh.close()
    return s

def read_gs(a, b):
    s1 = read_f(a)
    s2 = read_f(b)
    print str(a) + ":" + str(hashlib.md5(s1).hexdigest()) + " --- " + str(b) + ":" + str(hashlib.md5(s2).hexdigest())
    return [s1, s2]

def build_tree(s):
    hlist = []
    stree = GeneralisedSuffixTree(s)
    for shared in stree.sharedSubstrings(13):
        for seq, start, stop in shared:
            hlist.append(hashlib.md5(stree.sequences[seq]).hexdigest())
    hlist = list(set(hlist))  # drop duplicate hashes for this pair
    for h in hlist:
        alist.write(str(h) + " ")
    alist.write('\n')

glist = []
for g in glob.glob("*.g"):
    glist.append(g)

for a, b in list(combinations(glist, 2)):
    s = read_gs(a, b)
    build_tree(s)

alist.close()
os.system("uniq tmp.adjlist network.adjlist && rm tmp.adjlist")
UPDATE #1
Here's the updated code. I added the suggestions Pyrce made. However, after jogojapan identified the memory leak in the C code, and given that it's way outside of my expertise, I ended up going with a much slower approach. If anyone is knowledgeable in this area, I'd be really curious to see how to modify the C code to fix the memory leak or add the missing deallocation function, as I think a C suffix tree binding for Python is very valuable. It will probably take a few days to run the data through this script without a suffix tree, so I'm definitely open to seeing if anyone has a creative fix!
from itertools import combinations
import glob, hashlib, os

def read_f(f):
    # each file holds a single line, so read it in as one string
    with open(f, "r") as openf:
        return openf.read()

def read_gs(a, b):
    s1 = read_f(a)
    s2 = read_f(b)
    print str(a) + ":" + str(hashlib.md5(s1).hexdigest()) + " --- " + str(b) + ":" + str(hashlib.md5(s2).hexdigest())
    return [s1, s2]

def lcs(S1, S2):
    # dynamic-programming longest common substring:
    # M[x][y] holds the length of the common suffix of S1[:x] and S2[:y].
    # NB: only the previous row of M is ever read, so two rows would
    # suffice if the full matrix gets too large for the bigger files.
    M = [[0] * (1 + len(S2)) for i in xrange(1 + len(S1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(S1)):
        for y in xrange(1, 1 + len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y] > longest:
                    longest = M[x][y]
                    x_longest = x
            else:
                M[x][y] = 0
    return S1[x_longest - longest: x_longest]

glist = glob.glob("*.g")

for a, b in combinations(glist, 2):
    s = read_gs(a, b)
    p = lcs(s[0], s[1])
    if len(p) >= 13:  # implies p != ""
        with open("tmp.adjlist", "a") as openf:
            openf.write(hashlib.md5(s[1]).hexdigest() + " " + hashlib.md5(s[0]).hexdigest() + "\n")

os.system("uniq tmp.adjlist network.adjlist && rm tmp.adjlist")
I am reasonably confident that there is a memory leak inside the suffix tree package you use.
Evidence 1: I ran python inside valgrind and then executed this simple Python script:
from suffix_tree import SuffixTree
t = SuffixTree("mississippi")
t = None
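For reference, the invocation would be along these lines (the script name is hypothetical; CPython ships a suppressions file, Misc/valgrind-python.supp, that silences pymalloc false positives):

valgrind --leak-check=full --suppressions=Misc/valgrind-python.supp python leak_test.py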
Valgrind reported this leak:
==8800== 1,413 (32 direct, 1,381 indirect) bytes in 1 blocks are definitely lost in loss record 1,265 of 1,374
==8800== at 0x4A0884D: malloc (vg_replace_malloc.c:263)
==8800== by 0xBE70AEC: make_helper (suffix_tree.c:193)
==8800== by 0xBE704B2: SuffixTree_init (python_bindings.c:240)
==8800== by 0x3F98C9867B: ??? (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98C49A7D: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98CD71D6: PyEval_CallObjectWithKeywords (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98C5EB45: ??? (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98C49A7D: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98CD93F2: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98CDDB2E: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98C6D7B5: ??? (in /usr/lib64/libpython2.7.so.1.0)
==8800== by 0x3F98C49A7D: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
Evidence 2: When you look at the code in suffix_tree.c and python_bindings.c, you'll find the make_helper() function, which allocates memory for a suffix tree (using malloc), but there is not a single free anywhere in the code. I looked specifically at the allocation and deallocation functions defined for the Python types in python_bindings.c, but could not find any for the tree object. There is one for the node objects, but that only deallocates the Python wrapper part of the object, not the underlying structure in C.
Evidence 3: The data type definition for the Python object in python_bindings.c has a telling comment attached to it:
/* FIXME: deallocation of this guy! */
static PyTypeObject SuffixTreeType = {
    PyObject_HEAD_INIT(NULL)
    .tp_name = "_suffix_tree.SuffixTree",
    /* ... */
    .tp_new = SuffixTree_new,
};
Suggestions: You may want to contact the authors of the package and make them aware of the problem. According to information on the webpage, they are already working on a related problem: ideally there should be a cyclic dependency between the tree itself and the node objects it contains. That is probably the reason why the program does not currently perform any deallocation whatsoever.
Since you probably don't need the node-to-tree dependency for your purposes, you could also apply your own fix by adding a deallocation function to the tree object definition in python_bindings.c, as sketched below.
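A minimal sketch of what such a fix could look like, assuming the wrapper struct stores its C tree in a member named tree and that suffix_tree.c gains a cleanup counterpart to make_helper(); free_helper(), the member name, and the SUFFIX_TREE type are all hypothetical:

/* hypothetical cleanup counterpart to make_helper() in suffix_tree.c;
   it must free every allocation that make_helper() performed */
void free_helper(SUFFIX_TREE *tree);

static void
SuffixTree_dealloc(SuffixTree *self)
{
    if (self->tree != NULL)
        free_helper(self->tree);               /* release the C-side tree */
    Py_TYPE(self)->tp_free((PyObject *)self);  /* then the Python wrapper */
}

static PyTypeObject SuffixTreeType = {
    PyObject_HEAD_INIT(NULL)
    .tp_name = "_suffix_tree.SuffixTree",
    /* ... */
    .tp_dealloc = (destructor)SuffixTree_dealloc,
    .tp_new = SuffixTree_new,
};

The tp_dealloc slot is the standard CPython hook that runs when an object's reference count drops to zero, so each tree would then release its C allocations as soon as Python discards it.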