python, python-2.7, pandas, nlp, treetagger

Optimizing function computation in a pandas column?


Let's assume that I have the following pandas dataframe:

id |opinion
1  |Hi how are you?
...
n-1|Hello!

I would like to create a new pandas POS-tagged column like this:

id|     opinion   |POS-tagged_opinions
1 |Hi how are you?|hi\tUH\thi
                  how\tWRB\thow
                  are\tVBP\tbe
                  you\tPP\tyou
                  ?\tSENT\t?

.....

n-1|     Hello    |Hello\tUH\tHello
                   !\tSENT\t!

From the documentation and a tutorial, I tried several approaches. In particular:

df.apply(postag_cell, axis=1)

and

df['content'].map(postag_cell)

To that end, I created this POS-tagging cell function:

import pandas as pd

df = pd.read_csv('/Users/user/Desktop/data2.csv', sep='|')
print df.head()


def postag_cell(pandas_cell):
    import pprint   # For proper print of sequences.
    import treetaggerwrapper
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    #2) tag your text.
    y = [i.decode('UTF-8') if isinstance(i, basestring) else i for i in [pandas_cell]]
    tags = tagger.tag_text(y)
    #3) use the tags list... (list of string output from TreeTagger).
    return tags



#df.apply(postag_cell, axis=1)

#df['content'].map(postag_cell)

df['POS-tagged_opinions'] = df['content'].apply(postag_cell)

print df.head()

The above function returns the following:

user:~/PycharmProjects/misc_tests$ time python tagging\ with\ pandas.py



id|     opinion   |POS-tagged_opinions
1 |Hi how are you?|[hi\tUH\thi
                  how\tWRB\thow
                  are\tVBP\tbe
                  you\tPP\tyou
                  ?\tSENT\t?]

.....

n-1|     Hello    |Hello\tUH\tHello
                   !\tSENT\t!

--- 9.53674316406e-07 seconds ---

real    18m22.038s
user    16m33.236s
sys 1m39.066s

The problem is that with a large number of opinions it takes a lot of time:

How can I perform POS-tagging more efficiently and in a more pythonic way with pandas and treetagger? I believe this issue is due to my limited knowledge of pandas, since I was able to tag the opinions very quickly with treetagger alone, outside of a pandas dataframe.


Solution

  • There are some obvious modifications that can be done to gain a reasonable amount of time (such as removing the imports and the instantiation of the TreeTagger class from the postag_cell function). Then the code can be parallelized. However, the majority of the work is done by treetagger itself. As I don't know anything about this software, I can't tell if it can be further optimized.
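
    To get a feeling for how much the per-cell instantiation alone costs, a rough micro-benchmark along these lines can help (a sketch added for illustration, not part of the solution; sample_texts is a made-up list of opinions):

    import time
    import treetaggerwrapper

    sample_texts = [u'Hi how are you?', u'Hello!'] * 50

    # One tagger created once and reused (what the solution code does):
    start = time.time()
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    for t in sample_texts:
        tagger.tag_text(t)
    print('shared tagger:   %.2f s' % (time.time() - start))

    # A new tagger for every call (what postag_cell in the question does):
    start = time.time()
    for t in sample_texts:
        treetaggerwrapper.TreeTagger(TAGLANG='en').tag_text(t)
    print('tagger per call: %.2f s' % (time.time() - start))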

    The minimal working code:

    import pandas as pd
    import treetaggerwrapper
    
    input_file = 'new_corpus.csv'
    output_file = 'output.csv'
    
    def postag_string(s):
        '''Returns tagged text from string s'''
        if isinstance(s, basestring):
            s = s.decode('UTF-8')
        return tagger.tag_text(s)
    
    # Reading in the file
    all_lines = []
    with open(input_file) as f:
        for line in f:
            all_lines.append(line.strip().split('|', 1))
    
    df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
    
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    
    df['POS-tagged_content'] = df['content'].apply(postag_string)
    
    # Format fix:
    def fix_format(x):
        '''x - a list or an array'''
        # With encoding:
        out = list(tuple(i.encode('utf-8').split('\t')) for i in x)
        # or without:
        # out = list(tuple(i.split('\t')) for i in x)
        return out
    df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
    
    df.to_csv(output_file, sep = '|')
    

    I'm not using pd.read_csv(filename, sep = '|') because your input file is "misformatted": it contains unescaped | characters in some of the text opinions.
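
    For illustration, this is what that split does on a row with an extra | inside the opinion (the sample line is made up):

    # A row whose opinion text contains an unescaped '|':
    line = '42|I liked the movie | the acting was great'

    # pd.read_csv(..., sep='|') would see three fields here, while
    # splitting on the first '|' only keeps the opinion in one piece:
    print(line.strip().split('|', 1))
    # ['42', 'I liked the movie | the acting was great']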

    (Update:) After format fix, the output file looks like this:

    $ cat output_example.csv 
    |id|content|POS-tagged_content
    0|cv01.txt|How are you?|[('How', 'WRB', 'How'), ('are', 'VBP', 'be'), ('you', 'PP', 'you'), ('?', 'SENT', '?')]
    1|cv02.txt|Hello!|[('Hello', 'UH', 'Hello'), ('!', 'SENT', '!')]
    2|cv03.txt|"She said ""OK""."|"[('She', 'PP', 'she'), ('said', 'VVD', 'say'), ('""', '``', '""'), ('OK', 'UH', 'OK'), ('""', ""''"", '""'), ('.', 'SENT', '.')]"
    

    If the formatting is not exactly what you want, we can work it out.
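
    One side note (my addition, not part of the original workflow): the tagged column is saved as the string representation of a list of tuples, so when re-reading the output file you can turn it back into Python objects with ast.literal_eval, for example:

    import ast
    import pandas as pd

    df_back = pd.read_csv('output.csv', sep='|', index_col=0)
    # Each cell holds the repr of a list of (word, tag, lemma) tuples;
    # literal_eval converts that string back into actual tuples.
    df_back['POS-tagged_content'] = df_back['POS-tagged_content'].apply(ast.literal_eval)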

    Parallelized code

    It may give some speedup, but don't expect miracles. The overhead coming from the multiprocessing setup may even exceed the gains. You can experiment with the number of processes nproc (here, set by default to the number of CPUs; setting it higher than that is inefficient).

    Treetaggerwrapper has its own multiprocess class. I suspect that it does more or less the same thing as the code below, so I didn't try it.
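
    If you want to experiment with it anyway, a sketch could look roughly like the following. This assumes the TaggerPoll API (TaggerPoll, tag_text_async, wait_finished, result, stop_poll) as described in the treetaggerwrapper documentation, so check the names against your installed version; df is the dataframe built as above.

    # Sketch only - verify the TaggerPoll API against your treetaggerwrapper version.
    import treetaggerwrapper

    poll = treetaggerwrapper.TaggerPoll(workerscount=4, TAGLANG='en')
    # Submit all texts (in Python 2 they should already be unicode),
    # then wait for the jobs to finish and collect the results:
    jobs = [poll.tag_text_async(text) for text in df['content']]
    for job in jobs:
        job.wait_finished()
    df['POS-tagged_content'] = [job.result for job in jobs]
    poll.stop_poll()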

    import pandas as pd
    import numpy as np
    import treetaggerwrapper
    import multiprocessing as mp
    
    input_file = 'new_corpus.csv'
    output_file = 'output2.csv'
    
    def postag_string_mp(s):
        '''
        Returns tagged text for string s.
        "pool_tagger" is a global name, defined in each subprocess.
        '''
        if isinstance(s, basestring):
            s = s.decode('UTF-8')
        return pool_tagger.tag_text(s)
    
    ''' Reading in the file '''
    all_lines = []
    with open(input_file) as f:
        for line in f:
            all_lines.append(line.strip().split('|', 1))
    
    df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
    
    ''' Multiprocessing '''
    
    # Number of processes can be adjusted for better performance:
    nproc = mp.cpu_count()
    
    # Function to be run at the start of every subprocess.
    # Each subprocess will have its own TreeTagger called pool_tagger.
    def init():
        global pool_tagger
        pool_tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    
    # The actual job done in subprocesses:
    def run(df):
        return df.apply(postag_string_mp)
    
    # Splitting the input
    lst_split = np.array_split(df['content'], nproc)
    
    pool = mp.Pool(processes = nproc, initializer = init)
    lst_out = pool.map(run, lst_split)
    pool.close()
    pool.join()
    
    # Concatenating the output from subprocesses 
    df['POS-tagged_content'] =  pd.concat(lst_out) 
    
    # Format fix:
    def fix_format(x):
        '''x - a list or an array'''
        # With encoding:
        out = list(tuple(i.encode('utf-8').split('\t')) for i in x)
        # or without:
        # out = list(tuple(i.split('\t')) for i in x)
        return out
    df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
    
    df.to_csv(output_file, sep = '|')
    

    Update

    In Python 3, all strings are unicode by default, so you can save some trouble and time with decoding/encoding. (In the code below, I also use plain numpy arrays instead of dataframes in the child processes, but the impact of this change is insignificant.)

    # Python3 code:
    import pandas as pd
    import numpy as np
    import treetaggerwrapper
    import multiprocessing as mp
    
    input_file = 'new_corpus.csv'
    output_file = 'output3.csv'
    
    ''' Reading in the file '''
    all_lines = []
    with open(input_file) as f:
        for line in f:
            all_lines.append(line.strip().split('|', 1))
    
    df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
    
    ''' Multiprocessing '''
    
    # Number of processes can be adjusted for better performance:
    nproc = mp.cpu_count()
    
    # Function to be run at the start of every subprocess.
    # Each subprocess will have its own TreeTagger called pool_tagger.
    def init():
        global pool_tagger
        pool_tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    
    # The actual job done in subprocesses:
    def run(arr):
        out = np.empty_like(arr)
        for i in range(len(arr)):
            out[i] = pool_tagger.tag_text(arr[i])
        return out
    
    # Splitting the input
    lst_split = np.array_split(df.values[:,1], nproc)
    
    with mp.Pool(processes = nproc, initializer = init) as p:
        lst_out = p.map(run, lst_split)
    
    # Concatenating the output from subprocesses 
    df['POS-tagged_content'] =  np.concatenate(lst_out) 
    
    # Format fix:
    def fix_format(x):
        '''x - a list or an array'''
        out = list(tuple(i.split('\t')) for i in x)
        return out
    df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
    
    df.to_csv(output_file, sep = '|')
    

    After single runs (so, not really statistically significant), I'm getting these timings on your file:

    $ time python2.7 treetagger_minimal.py 
    real    0m59.783s
    user    0m50.697s
    sys     0m16.657s
    
    $ time python2.7 treetagger_mp.py   
    real    0m48.798s
    user    1m15.503s
    sys     0m22.300s
    
    $ time python3 treetagger_mp3.py 
    real    0m39.746s
    user    1m25.340s
    sys     0m21.157s
    

    If the only use of the pandas dataframe is to save everything back to a file, then the next step would be to remove pandas from the code altogether. But again, the gain would be insignificant compared with treetagger's own work time.
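
    For completeness, a pandas-free ending could look roughly like the sketch below (my addition). It reuses all_lines, lst_out and output_file from the Python 3 code above and only changes the output step; note that it does not apply the CSV quoting that to_csv does.

    # Sketch (Python 3): write the tagged output without pandas.
    header = all_lines[0] + ['POS-tagged_content']
    # Flatten the per-process chunks back into one list of tag lists:
    tagged = [tags for chunk in lst_out for tags in chunk]

    with open(output_file, 'w') as f:
        f.write('|'.join(header) + '\n')
        for i, row in enumerate(all_lines[1:]):
            fixed = list(tuple(t.split('\t')) for t in tagged[i])
            f.write('|'.join(row + [repr(fixed)]) + '\n')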