Let's assume that I have the following pandas dataframe:
id |opinion
1 |Hi how are you?
...
n-1|Hello!
I would like to create a new pandas POS-tagged column like this:
id| opinion |POS-tagged_opinions
1 |Hi how are you?|hi\tUH\thi
how\tWRB\thow
are\tVBP\tbe
you\tPP\tyou
?\tSENT\t?
.....
n-1| Hello |Hello\tUH\tHello
!\tSENT\t!
Following the documentation and a tutorial, I tried several approaches, in particular:
df.apply(postag_cell, axis=1)
and
df['content'].map(postag_cell)
Therefore, I created this POS-tag cell function:
import pandas as pd
df = pd.read_csv('/Users/user/Desktop/data2.csv', sep='|')
print df.head()
def postag_cell(pandas_cell):
    import pprint # For proper print of sequences.
    import treetaggerwrapper
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    #2) tag your text.
    y = [i.decode('UTF-8') if isinstance(i, basestring) else i for i in [pandas_cell]]
    tags = tagger.tag_text(y)
    #3) use the tags list... (list of string output from TreeTagger).
    return tags
#df.apply(postag_cell(), axis=1)
#df['content'].map(postag_cell())
df['POS-tagged_opinions'] = (df['content'].apply(postag_cell))
print df.head()
The above function returns the following:
user:~/PycharmProjects/misc_tests$ time python tagging\ with\ pandas.py
id| opinion |POS-tagged_opinions
1 |Hi how are you?|[hi\tUH\thi
how\tWRB\thow
are\tVBP\tbe
you\tPP\tyou
?\tSENT\t?]
.....
n-1| Hello |Hello\tUH\tHello
!\tSENT\t!
--- 9.53674316406e-07 seconds ---
real 18m22.038s
user 16m33.236s
sys 1m39.066s
The problem is that with a large number of opinions it takes a lot of time.
How can I perform POS-tagging more efficiently and in a more pythonic way with pandas and treetagger? I believe this issue is due to my limited knowledge of pandas, since I tagged the opinions very quickly with treetagger alone, outside of a pandas dataframe.
There are some obvious modifications that can be made to save a reasonable amount of time (such as moving the imports and the instantiation of the TreeTagger class out of the postag_cell function). The code can then be parallelized. However, most of the work is done by treetagger itself. As I don't know anything about this software, I can't tell whether it can be further optimized.
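The first point can be illustrated in isolation. The sketch below uses a stand-in class, ExpensiveTagger, which is hypothetical and not part of treetaggerwrapper; it only counts how often a tagger object gets built when it is created inside the per-cell function versus once at module level.

```python
class ExpensiveTagger(object):
    """Hypothetical stand-in for a tagger that is costly to construct."""
    instances = 0

    def __init__(self):
        ExpensiveTagger.instances += 1

    def tag_text(self, text):
        # Placeholder "tagging": just split on whitespace.
        return text.split()


def tag_slow(text):
    # Builds a new tagger for every cell, as postag_cell in the question does.
    return ExpensiveTagger().tag_text(text)


shared_tagger = ExpensiveTagger()  # built once, reused for every cell


def tag_fast(text):
    return shared_tagger.tag_text(text)


opinions = ["Hi how are you?", "Hello!", "Fine thanks"]

before = ExpensiveTagger.instances
slow_result = [tag_slow(t) for t in opinions]
slow_builds = ExpensiveTagger.instances - before  # one build per opinion

before = ExpensiveTagger.instances
fast_result = [tag_fast(t) for t in opinions]
fast_builds = ExpensiveTagger.instances - before  # no new builds
```

Both versions return the same result; only the number of constructions differs, and that number grows with the size of the dataframe in the slow version.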
import pandas as pd
import treetaggerwrapper
input_file = 'new_corpus.csv'
output_file = 'output.csv'
def postag_string(s):
    '''Returns tagged text from string s'''
    if isinstance(s, basestring):
        s = s.decode('UTF-8')
    return tagger.tag_text(s)
# Reading in the file
all_lines = []
with open(input_file) as f:
    for line in f:
        all_lines.append(line.strip().split('|', 1))
df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
df['POS-tagged_content'] = df['content'].apply(postag_string)
# Format fix:
def fix_format(x):
    '''x - a list or an array'''
    # With encoding (explicit UTF-8, since the Python 2 default is ASCII):
    out = list(tuple(i.encode('UTF-8').split('\t')) for i in x)
    # or without:
    # out = list(tuple(i.split('\t')) for i in x)
    return out
df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
df.to_csv(output_file, sep = '|')
I'm not using pd.read_csv(filename, sep = '|') because your input file is "misformatted": it contains unescaped | characters in some of the text opinions.
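As a small illustration of why the manual parsing splits on the first | only, here is a sketch with made-up sample lines (the second record is hypothetical and contains unescaped pipes in its text):

```python
# Hypothetical sample lines; the last record has unescaped '|' in its text.
sample_lines = [
    "id|content",
    "cv01.txt|How are you?",
    "cv02.txt|Yes | no | maybe",
]


def split_record(line):
    """Split on the first '|' only, so later pipes stay in the text field."""
    return line.strip().split('|', 1)


rows = [split_record(line) for line in sample_lines]
header, records = rows[0], rows[1:]
```

Every row ends up with exactly two fields, so the list can be passed to pd.DataFrame with the header as column names, as in the code above.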
(Update:) After format fix, the output file looks like this:
$ cat output_example.csv
|id|content|POS-tagged_content
0|cv01.txt|How are you?|[('How', 'WRB', 'How'), ('are', 'VBP', 'be'), ('you', 'PP', 'you'), ('?', 'SENT', '?')]
1|cv02.txt|Hello!|[('Hello', 'UH', 'Hello'), ('!', 'SENT', '!')]
2|cv03.txt|"She said ""OK""."|"[('She', 'PP', 'she'), ('said', 'VVD', 'say'), ('""', '``', '""'), ('OK', 'UH', 'OK'), ('""', ""''"", '""'), ('.', 'SENT', '.')]"
If the formatting is not exactly what you want, we can work it out.
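For reference, the format fix can be tried on its own. This sketch applies the non-encoding variant of fix_format to a hand-written list in the tab-separated form that the tagger output takes in the question; the sample list is an assumption, not real TreeTagger output.

```python
def fix_format(tagged):
    """Turn each tab-separated 'word, tag, lemma' string into a tuple."""
    return [tuple(item.split('\t')) for item in tagged]


# Hand-written sample in the same shape as the tagger output shown above.
raw = ['How\tWRB\tHow', 'are\tVBP\tbe', 'you\tPP\tyou', '?\tSENT\t?']
fixed = fix_format(raw)
```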
Parallelizing the code with multiprocessing may give some speedup, but don't expect miracles. The overhead of setting up multiple processes may even exceed the gains. You can experiment with the number of processes nproc (here set by default to the number of CPUs; setting more than this is inefficient).
Treetaggerwrapper has its own multiprocess class. I suspect that it does more or less the same thing as the code below, so I didn't try it.
import pandas as pd
import numpy as np
import treetaggerwrapper
import multiprocessing as mp
input_file = 'new_corpus.csv'
output_file = 'output2.csv'
def postag_string_mp(s):
    '''
    Returns tagged text for string s.
    "pool_tagger" is a global name, defined in each subprocess.
    '''
    if isinstance(s, basestring):
        s = s.decode('UTF-8')
    return pool_tagger.tag_text(s)
''' Reading in the file '''
all_lines = []
with open(input_file) as f:
    for line in f:
        all_lines.append(line.strip().split('|', 1))
df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
''' Multiprocessing '''
# Number of processes can be adjusted for better performance:
nproc = mp.cpu_count()
# Function to be run at the start of every subprocess.
# Each subprocess will have its own TreeTagger called pool_tagger.
def init():
    global pool_tagger
    pool_tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
# The actual job done in subprocesses:
def run(df):
    return df.apply(postag_string_mp)
# Splitting the input
lst_split = np.array_split(df['content'], nproc)
pool = mp.Pool(processes = nproc, initializer = init)
lst_out = pool.map(run, lst_split)
pool.close()
pool.join()
# Concatenating the output from subprocesses
df['POS-tagged_content'] = pd.concat(lst_out)
# Format fix:
def fix_format(x):
    '''x - a list or an array'''
    # With encoding (explicit UTF-8, since the Python 2 default is ASCII):
    out = list(tuple(i.encode('UTF-8').split('\t')) for i in x)
    # or without:
    # out = list(tuple(i.split('\t')) for i in x)
    return out
df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
df.to_csv(output_file, sep = '|')
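The split, per-chunk work, and concatenation steps above can be checked in isolation without TreeTagger. In this sketch fake_tag is a hypothetical stand-in for pool_tagger.tag_text, and the chunks are processed sequentially with the built-in map rather than with a process pool, so it only demonstrates that the data comes back in the original order.

```python
import numpy as np


def fake_tag(text):
    """Hypothetical stand-in for pool_tagger.tag_text (placeholder tag 'X')."""
    return ["{0}\tX\t{1}".format(tok, tok.lower()) for tok in text.split()]


def run(arr):
    # Same shape as the per-process job: tag every element of one chunk.
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = fake_tag(arr[i])
    return out


texts = np.array(["Hi there", "Hello", "How are you", "Fine", "Bye now"],
                 dtype=object)
nchunks = 2

chunks = np.array_split(texts, nchunks)   # chunk sizes may differ by one
results = list(map(run, chunks))          # pool.map would go here
tagged = np.concatenate(results)          # same order as the input
```

Swapping the built-in map for pool.map leaves the rest unchanged, since pool.map also returns results in the order of its input chunks.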
Update
In Python 3, all strings are Unicode by default, so you can save some trouble and time on decoding/encoding. (In the code below, I also use plain numpy arrays instead of data frames in the child processes, but the impact of this change is insignificant.)
# Python3 code:
import pandas as pd
import numpy as np
import treetaggerwrapper
import multiprocessing as mp
input_file = 'new_corpus.csv'
output_file = 'output3.csv'
''' Reading in the file '''
all_lines = []
with open(input_file) as f:
    for line in f:
        all_lines.append(line.strip().split('|', 1))
df = pd.DataFrame(all_lines[1:], columns = all_lines[0])
''' Multiprocessing '''
# Number of processes can be adjusted for better performance:
nproc = mp.cpu_count()
# Function to be run at the start of every subprocess.
# Each subprocess will have its own TreeTagger called pool_tagger.
def init():
    global pool_tagger
    pool_tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
# The actual job done in subprocesses:
def run(arr):
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = pool_tagger.tag_text(arr[i])
    return out
# Splitting the input
lst_split = np.array_split(df.values[:,1], nproc)
with mp.Pool(processes = nproc, initializer = init) as p:
    lst_out = p.map(run, lst_split)
# Concatenating the output from subprocesses
df['POS-tagged_content'] = np.concatenate(lst_out)
# Format fix:
def fix_format(x):
    '''x - a list or an array'''
    out = list(tuple(i.split('\t')) for i in x)
    return out
df['POS-tagged_content'] = df['POS-tagged_content'].apply(fix_format)
df.to_csv(output_file, sep = '|')
After single runs (so, not really statistically significant), I'm getting these timings on your file:
$ time python2.7 treetagger_minimal.py
real 0m59.783s
user 0m50.697s
sys 0m16.657s
$ time python2.7 treetagger_mp.py
real 0m48.798s
user 1m15.503s
sys 0m22.300s
$ time python3 treetagger_mp3.py
real 0m39.746s
user 1m25.340s
sys 0m21.157s
If the only use of the pandas dataframe df is to save everything back to a file, then the next step would be to remove pandas from the code altogether. But again, the gain would be insignificant in comparison with treetagger's work time.
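For completeness, a pandas-free version could look roughly like the Python 3 sketch below. It is not something I timed: fake_tag is a hypothetical stand-in for tagger.tag_text, tag_lines is an illustrative helper name, and the input is assumed to have the same id|content layout as your file with no blank lines.

```python
import csv
import io


def fake_tag(text):
    """Hypothetical stand-in for tagger.tag_text (placeholder tag 'X')."""
    return ["{0}\tX\t{1}".format(tok, tok.lower()) for tok in text.split()]


def tag_lines(lines, tag, out):
    """Read 'id|content' lines, tag the content, write '|'-separated rows."""
    writer = csv.writer(out, delimiter='|', lineterminator='\n')
    rows = (line.strip().split('|', 1) for line in lines)
    header = next(rows)
    writer.writerow(header + ['POS-tagged_content'])
    for doc_id, content in rows:
        tagged = [tuple(item.split('\t')) for item in tag(content)]
        writer.writerow([doc_id, content, tagged])


# In-memory demo; with real data, pass open file objects instead.
sample = ["id|content", "cv01.txt|How are you", "cv02.txt|Hello"]
buf = io.StringIO()
tag_lines(sample, fake_tag, buf)
output = buf.getvalue()
```

Unlike df.to_csv, this writes no index column, and csv.writer quotes a field only when it contains the delimiter, a quote character, or a newline.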