python, python-2.7, pandas, nlp, treetagger

Optimizing function computation in a pandas column?


Let's assume that I have the following pandas dataframe:

id |opinion
1  |Hi how are you?
...
n-1|Hello!

I would like to create a new pandas POS-tagged column like this:

id|     opinion   |POS-tagged_opinions
1 |Hi how are you?|hi\tUH\thi
                  how\tWRB\thow
                  are\tVBP\tbe
                  you\tPP\tyou
                  ?\tSENT\t?

.....

n-1|     Hello    |Hello\tUH\tHello
                   !\tSENT\t!

From the documentation and a tutorial, I tried several approaches. In particular:

df.apply(postag_cell, axis=1)

and

df['content'].map(postag_cell)

To that end, I created this POS-tagging cell function:

import pandas as pd

df = pd.read_csv('/Users/user/Desktop/data2.csv', sep='|')
print df.head()


def postag_cell(pandas_cell):
    import pprint   # For proper print of sequences.
    import treetaggerwrapper
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    #2) tag your text.
    y = [i.decode('UTF-8') if isinstance(i, basestring) else i for i in [pandas_cell]]
    tags = tagger.tag_text(y)
    #3) use the tags list... (list of string output from TreeTagger).
    return tags



#df.apply(postag_cell, axis=1)

#df['content'].map(postag_cell)

df['POS-tagged_opinions'] = df['content'].apply(postag_cell)

print df.head()

The above function returns the following:

user:~/PycharmProjects/misc_tests$ time python tagging\ with\ pandas.py



id|     opinion   |POS-tagged_opinions
1 |Hi how are you?|[hi\tUH\thi
                  how\tWRB\thow
                  are\tVBP\tbe
                  you\tPP\tyou
                  ?\tSENT\t?]

.....

n-1|     Hello    |Hello\tUH\tHello
                   !\tSENT\t!

--- 9.53674316406e-07 seconds ---

real    18m22.038s
user    16m33.236s
sys 1m39.066s

The problem is that with a large number of opinions it takes a lot of time:

How can I perform POS-tagging more efficiently and in a more pythonic way with pandas and treetagger? I believe this issue is due to my limited knowledge of pandas, since I was able to tag the opinions very quickly with treetagger alone, outside of a pandas dataframe.


Solution

  • There are some obvious modifications that can be done to gain a reasonable amount of time (such as removing the imports and the instantiation of the TreeTagger class from the postag_cell function). Then the code can be parallelized. However, the majority of the work is done by treetagger itself. As I don't know anything about this software, I can't tell if it can be further optimized.
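
    To get a feeling for how much the per-cell instantiation alone costs, a rough micro-benchmark along these lines can help (a sketch added for illustration, not part of the solution; sample_texts is a made-up list of opinions):

    import time
    import treetaggerwrapper

    sample_texts = [u'Hi how are you?', u'Hello!'] * 50

    # One tagger created once and reused (what the solution code does):
    start = time.time()
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    for t in sample_texts:
        tagger.tag_text(t)
    print('shared tagger:   %.2f s' % (time.time() - start))

    # A new tagger for every call (what postag_cell in the question does):
    start = time.time()
    for t in sample_texts:
        treetaggerwrapper.TreeTagger(TAGLANG='en').tag_text(t)
    print('tagger per call: %.2f s' % (time.time() - start))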

    The minimal working code:

    import pandas as pd
    import treetaggerwrapper
    
    input_file = 'new_corpus.csv'
    output_file = 'output.csv'
    
    def postag_string(s):
        '''Returns tagged text from string s'''
        if isinstance(s, basestring):
            s = s.decode('UTF-8')
        return tagger.tag_text(s)
    
    # Reading in the file
    all_lines = []
    with open(input_file) as f:
        for line in f:
            all_lines.append(line.strip().split('|', 1))
    
    df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
    
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    
    df['POS-tagged_content'] = df['content'].apply(postag_string)
    
    # Format fix:
    def fix_format(x):
        '''x - a list or an array'''
        # With encoding:
        out = list(tuple(i.encode('utf-8').split('\t')) for i in x)
        # or without:
        # out = list(tuple(i.split('\t')) for i in x)
        return out
    df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
    
    df.to_csv(output_file, sep = '|')
    

    I'm not using pd.read_csv(filename, sep = '|') because your input file is "misformatted": it contains unescaped | characters in some of the text opinions.
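
    For illustration, this is what that split does on a row with an extra | inside the opinion (the sample line is made up):

    # A row whose opinion text contains an unescaped '|':
    line = '42|I liked the movie | the acting was great'

    # pd.read_csv(..., sep='|') would see three fields here, while
    # splitting on the first '|' only keeps the opinion in one piece:
    print(line.strip().split('|', 1))
    # ['42', 'I liked the movie | the acting was great']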

    (Update:) After format fix, the output file looks like this:

    $ cat output_example.csv 
    |id|content|POS-tagged_content
    0|cv01.txt|How are you?|[('How', 'WRB', 'How'), ('are', 'VBP', 'be'), ('you', 'PP', 'you'), ('?', 'SENT', '?')]
    1|cv02.txt|Hello!|[('Hello', 'UH', 'Hello'), ('!', 'SENT', '!')]
    2|cv03.txt|"She said ""OK""."|"[('She', 'PP', 'she'), ('said', 'VVD', 'say'), ('""', '``', '""'), ('OK', 'UH', 'OK'), ('""', ""''"", '""'), ('.', 'SENT', '.')]"
    

    If the formatting is not exactly what you want, we can work it out.
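
    One side note (my addition, not part of the original workflow): the tagged column is saved as the string representation of a list of tuples, so when re-reading the output file you can turn it back into Python objects with ast.literal_eval, for example:

    import ast
    import pandas as pd

    df_back = pd.read_csv('output.csv', sep='|', index_col=0)
    # Each cell holds the repr of a list of (word, tag, lemma) tuples;
    # literal_eval converts that string back into actual tuples.
    df_back['POS-tagged_content'] = df_back['POS-tagged_content'].apply(ast.literal_eval)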

    Parallelized code

    It may give some speedup, but don't expect miracles. The overhead coming from the multiprocessing setup may even exceed the gains. You can experiment with the number of processes nproc (here, set by default to the number of CPUs; setting it higher than that is inefficient).

    Treetaggerwrapper has its own multiprocess class. I suspect that it does more or less the same thing as the code below, so I didn't try it.
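
    If you want to experiment with it anyway, a sketch could look roughly like the following. This assumes the TaggerPoll API (TaggerPoll, tag_text_async, wait_finished, result, stop_poll) as described in the treetaggerwrapper documentation, so check the names against your installed version; df is the dataframe built as above.

    # Sketch only - verify the TaggerPoll API against your treetaggerwrapper version.
    import treetaggerwrapper

    poll = treetaggerwrapper.TaggerPoll(workerscount=4, TAGLANG='en')
    # Submit all texts (in Python 2 they should already be unicode),
    # then wait for the jobs to finish and collect the results:
    jobs = [poll.tag_text_async(text) for text in df['content']]
    for job in jobs:
        job.wait_finished()
    df['POS-tagged_content'] = [job.result for job in jobs]
    poll.stop_poll()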

    import pandas as pd
    import numpy as np
    import treetaggerwrapper
    import multiprocessing as mp
    
    input_file = 'new_corpus.csv'
    output_file = 'output2.csv'
    
    def postag_string_mp(s):
        '''
        Returns tagged text for string s.
        "pool_tagger" is a global name, defined in each subprocess.
        '''
        if isinstance(s, basestring):
            s = s.decode('UTF-8')
        return pool_tagger.tag_text(s)
    
    ''' Reading in the file '''
    all_lines = []
    with open(input_file) as f:
        for line in f:
            all_lines.append(line.strip().split('|', 1))
    
    df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
    
    ''' Multiprocessing '''
    
    # Number of processes can be adjusted for better performance:
    nproc = mp.cpu_count()
    
    # Function to be run at the start of every subprocess.
    # Each subprocess will have its own TreeTagger called pool_tagger.
    def init():
        global pool_tagger
        pool_tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    
    # The actual job done in subprocesses:
    def run(df):
        return df.apply(postag_string_mp)
    
    # Splitting the input
    lst_split = np.array_split(df['content'], nproc)
    
    pool = mp.Pool(processes = nproc, initializer = init)
    lst_out = pool.map(run, lst_split)
    pool.close()
    pool.join()
    
    # Concatenating the output from subprocesses 
    df['POS-tagged_content'] =  pd.concat(lst_out) 
    
    # Format fix:
    def fix_format(x):
        '''x - a list or an array'''
        # With encoding:
        out = list(tuple(i.encode('utf-8').split('\t')) for i in x)
        # or without:
        # out = list(tuple(i.split('\t')) for i in x)
        return out
    df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
    
    df.to_csv(output_file, sep = '|')
    

    Update

    In Python 3, all strings are unicode by default, so you can save some trouble and time with decoding/encoding. (In the code below, I also use plain numpy arrays instead of dataframes in the child processes, but the impact of this change is insignificant.)

    # Python3 code:
    import pandas as pd
    import numpy as np
    import treetaggerwrapper
    import multiprocessing as mp
    
    input_file = 'new_corpus.csv'
    output_file = 'output3.csv'
    
    ''' Reading in the file '''
    all_lines = []
    with open(input_file) as f:
        for line in f:
            all_lines.append(line.strip().split('|', 1))
    
    df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
    
    ''' Multiprocessing '''
    
    # Number of processes can be adjusted for better performance:
    nproc = mp.cpu_count()
    
    # Function to be run at the start of every subprocess.
    # Each subprocess will have its own TreeTagger called pool_tagger.
    def init():
        global pool_tagger
        pool_tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    
    # The actual job done in subprocesses:
    def run(arr):
        out = np.empty_like(arr)
        for i in range(len(arr)):
            out[i] = pool_tagger.tag_text(arr[i])
        return out
    
    # Splitting the input
    lst_split = np.array_split(df.values[:,1], nproc)
    
    with mp.Pool(processes = nproc, initializer = init) as p:
        lst_out = p.map(run, lst_split)
    
    # Concatenating the output from subprocesses 
    df['POS-tagged_content'] =  np.concatenate(lst_out) 
    
    # Format fix:
    def fix_format(x):
        '''x - a list or an array'''
        out = list(tuple(i.split('\t')) for i in x)
        return out
    df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
    
    df.to_csv(output_file, sep = '|')
    

    After single runs (so, not really statistically significant), I'm getting these timings on your file:

    $ time python2.7 treetagger_minimal.py 
    real    0m59.783s
    user    0m50.697s
    sys     0m16.657s
    
    $ time python2.7 treetagger_mp.py   
    real    0m48.798s
    user    1m15.503s
    sys     0m22.300s
    
    $ time python3 treetagger_mp3.py 
    real    0m39.746s
    user    1m25.340s
    sys     0m21.157s
    

    If the only use of the pandas dataframe is to save everything back to a file, then the next step would be to remove pandas from the code altogether. But again, the gain would be insignificant compared with treetagger's own work time.
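
    For completeness, a pandas-free ending could look roughly like the sketch below (my addition). It reuses all_lines, lst_out and output_file from the Python 3 code above and only changes the output step; note that it does not apply the CSV quoting that to_csv does.

    # Sketch (Python 3): write the tagged output without pandas.
    header = all_lines[0] + ['POS-tagged_content']
    # Flatten the per-process chunks back into one list of tag lists:
    tagged = [tags for chunk in lst_out for tags in chunk]

    with open(output_file, 'w') as f:
        f.write('|'.join(header) + '\n')
        for i, row in enumerate(all_lines[1:]):
            fixed = list(tuple(t.split('\t')) for t in tagged[i])
            f.write('|'.join(row + [repr(fixed)]) + '\n')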